Text Mining PubMed: Parsing Medline Files in R

NOTE: The source code for the R module can be found here

I often use the Medline module from the Biopython library to parse and extract data from PubMed Medline files. I’ve been unable to find a similar package for R (although I must admit that I did not search too hard), so I decided to take a shot at writing my own module. First, it helps to know what information these files contain and which keys identify each piece of information. A Medline file can be generated from just about any PubMed search. For example, the following search retrieves only articles from the UNM College of Pharmacy.

"University of New Mexico"[AD] AND "pharmacy"[AD]

The AD tag designates the “affiliation” field and therefore restricts the search to articles containing the phrase "university of new mexico" in the affiliation field. We further restrict the search to articles that also have "pharmacy" in the AD field. This becomes clearer when looking at the Medline structure. Below is an example of a Medline record retrieved from the search. (Note that the results of any PubMed search can be saved as a Medline file by clicking Send to >> File >> Medline.)

PMID- 21775337
OWN - NLM
STAT- MEDLINE
DA  - 20110914
DCOM- 20120224
LR  - 20131121
IS  - 1460-2091 (Electronic)
IS  - 0305-7453 (Linking)
VI  - 66
IP  - 10
DP  - 2011 Oct
TI  - Site of infection rather than vancomycin MIC predicts vancomycin treatment
      failure in methicillin-resistant Staphylococcus aureus bacteraemia.
PG  - 2386-92
LID - 10.1093/jac/dkr301 [doi]
AB  - BACKGROUND: Therapeutic use of vancomycin is characterized by decreased
      susceptibilities and increasing reports of clinical failures. Few studies have
      examined the clinical outcomes of patients with methicillin-resistant
      Staphylococcus aureus (MRSA) bacteraemia treated with vancomycin. The primary
      objective was to compare clinical outcomes of patients with MRSA bacteraemia
      treated according to standard of care practices. METHODS: Patients were included 
      if: (i) admitted to University of New Mexico Hospital between 2002 and 2009; (ii)
      >/=18 years of age; (iii) had one blood culture positive for MRSA; and (iv)
      received vancomycin. Clinical outcomes were defined as cure, failure (relapse of 
      infection 30 days after completion of therapy, death or change in therapy) or
      unevaluable. Patient demographics, source of bacteraemia, treatment regimen, and 
      microbiological characteristics were determined. RESULTS: Two hundred patients
      with MRSA bacteraemia were included. Sixty-one patients were unevaluable, leaving
      139 patients for the final analysis. Seventy-two (51.8%) patients were cured and 
      67 (48.2%) experienced vancomycin failure. Vancomycin MIC(90) was 2 mg/L for both
      groups by Etest. Patients with endocarditis (P = 0.02) or pneumonia (P = 0.02)
      were more likely to fail therapy. Panton-Valentine leucocidin, loss of agr
      functionality and strain type were not predictors of outcomes in this study.
      CONCLUSIONS: High failure rates were observed in patients with MRSA bacteraemia
      treated with vancomycin, despite high vancomycin troughs and low rates of
      nephrotoxicity. Predictors of vancomycin failure included endocarditis and
      pneumonia. In these situations, vancomycin provides suboptimal therapy.
FAU - Walraven, Carla J
AU  - Walraven CJ
AD  - College of Pharmacy, University of New Mexico Health Sciences Center,
      Albuquerque, NM 87106, USA. cwalraven@salud.unm.edu

On the left, you can see the tags that identify each field. Lines that start with blank space instead of a tag continue the field above them. To build a parser, we must use these tags to extract the relevant pieces of information. The code below does just that:

##########################################################################
# Michael L. Bernauer
# mlbernauer@gmail.com
# 12/14/2014
# Module for parsing PubMed Medline files.
# Files should be downloaded to your
# computer and loaded into R by passing the
# file path into the medline function.
# The function returns a list containing
# each Medline entry.
#
# USAGE:
# source('medline.R')
# medline_records <- medline("/home/user/Downloads/pubmed_results.txt")
##########################################################################
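medline <- function(file_name){
  lines <- readLines(file_name)
  medline_records <- list()
  key <- NULL    # tag of the field currently being filled
  record <- 0    # index of the current record
  for(line in lines){
    # Columns 1-4 hold the tag, columns 5-6 hold "- ", the value starts at column 7
    header <- trimws(substring(line, 1, 4))
    value <- trimws(substring(line, 7))
    if(header == "" && value == ""){
      # Blank line between records
      next
    }
    else if(header == "PMID"){
      # A PMID tag starts a new record
      record <- record + 1
      key <- header
      medline_records[[record]] <- list()
      medline_records[[record]][[key]] <- value
    }
    else if(record == 0){
      # Ignore anything that precedes the first PMID line
      next
    }
    else if(header == ""){
      # Continuation line: append to the field currently being filled
      medline_records[[record]][[key]] <- paste(medline_records[[record]][[key]], value)
    }
    else{
      key <- header
      if(is.null(medline_records[[record]][[key]])){
        medline_records[[record]][[key]] <- value
      }
      else{
        # Repeated tag (e.g. IS, AU): join the values with ";"
        medline_records[[record]][[key]] <- paste(medline_records[[record]][[key]], value, sep=";")
      }
    }
  }
  return(medline_records)
}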
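
The parser depends on the fixed-width layout of Medline lines: the tag occupies the first four columns, followed by "- ", and the value begins at column seven. The snippet below is a minimal sketch of that split, using lines copied from the sample record above. Note that split_line is a hypothetical helper written only for this illustration; it is not part of the module.

```r
# split_line() is a hypothetical helper for illustration; it is not part of medline.R
split_line <- function(line){
  list(tag = trimws(substring(line, 1, 4)),    # columns 1-4: tag, padded with spaces
       value = trimws(substring(line, 7)))     # column 7 onward: field value
}

# Lines copied from the sample record shown earlier
sample_lines <- c(
  "PMID- 21775337",
  "DP  - 2011 Oct",
  "TI  - Site of infection rather than vancomycin MIC predicts vancomycin treatment",
  "      failure in methicillin-resistant Staphylococcus aureus bacteraemia."
)

parts <- lapply(sample_lines, split_line)
tags <- vapply(parts, function(p) p[["tag"]], character(1))
print(tags)
```

The last line yields an empty tag, which is how a continuation line is recognized: its value belongs to the most recent tagged field (here, TI).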

To parse a Medline file, pass its path to the function:

medline_records <- medline("/home/user/Downloads/pubmed_results.txt")

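The code above returns a list, medline_records, with one element per article. Each element is itself a list whose names are the Medline tags (PMID, TI, AB, and so on). Values of a repeated tag, such as IS, are joined into a single string separated by ";".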
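
Since each record is an ordinary named list, fields can be pulled out with base R. The sketch below uses a hand-built stand-in for the parser output, with values copied from the sample record above, so that it runs without a Medline file. get_field is a hypothetical helper, and the ";" separator assumes that repeated tags are joined the way the parser above joins them.

```r
# Stand-in for the list returned by medline(); values come from the sample record above
medline_records <- list(
  list(PMID = "21775337",
       IS = "1460-2091 (Electronic);0305-7453 (Linking)",
       DP = "2011 Oct",
       AU = "Walraven CJ")
)

# get_field() is a hypothetical helper: it returns NA when a record lacks the tag
get_field <- function(rec, tag){
  if(is.null(rec[[tag]])) NA_character_ else rec[[tag]]
}

pmids <- vapply(medline_records, get_field, character(1), tag = "PMID")

# The publication year is the first four characters of the DP field
dates <- vapply(medline_records, get_field, character(1), tag = "DP")
years <- as.integer(substring(dates, 1, 4))

# A repeated tag is stored as one ";"-separated string, so split it back apart
issns <- strsplit(get_field(medline_records[[1]], "IS"), ";", fixed = TRUE)[[1]]

summary_table <- data.frame(pmid = pmids, year = years, stringsAsFactors = FALSE)
print(summary_table)
print(issns)
```

Splitting on ";" is only safe for fields whose individual values never contain a semicolon. Free-text fields such as AB can contain semicolons of their own and should be left as single strings.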
