Parsing Text Files with Regular Expressions


Overview:

This post is a small example of how one might process unstructured text files using regular expressions. In addition to regular expressions, we will use pandas to organize and store the data our expressions return. The data used in this exercise is a folder of text files containing National Science Foundation grant award information.

Part 1: Description

Dataset/Corpus Description:

“The National Science Foundation (NSF) is an independent federal agency created by Congress in 1950 ‘to promote the progress of science; to advance the national health, prosperity, and welfare; to secure the national defense…’ NSF is vital because we support basic research and people to create knowledge that transforms the future.” [1] The NSF awards grants to researchers seeking government funding for their projects. The text corpus used in this report consists of 4,016 .txt files, each containing the abstract of a single project approved between 1990 and 2003. Each abstract is organized into sections including the project title, funding amount, sponsor, and a brief description of the project. An example of a single file from the corpus is shown below:

Title       : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
               Mitochondrial DNA and Historical Demography
Type        : Award
NSF Org     : DEB 
Latest
Amendment
Date        : August 1,  1991     
File        : a9000006

Award Number: 9000006
Award Instr.: Continuing grant                             
Prgm Manager: Scott Collins                           
	      DEB  DIVISION OF ENVIRONMENTAL BIOLOGY       
	      BIO  DIRECT FOR BIOLOGICAL SCIENCES          
Start Date  : June 1,  1990       
Expires     : November 30,  1992   (Estimated)
Expected
Total Amt.  : $179720             (Estimated)
Investigator: Stephen R. Palumbi   (Principal Investigator current)
Sponsor     : U of Hawaii Manoa
	      2530 Dole Street
	      Honolulu, HI  968222225    808/956-7800

NSF Program : 1127      SYSTEMATIC & POPULATION BIOLO
Fld Applictn: 0000099   Other Applications NEC                  
              61        Life Science Biological                 
Program Ref : 9285,
Abstract    :
                                                                                             
              Commercial exploitation over the past two hundred years drove                  
              the great Mysticete whales to near extinction.  Variation in                   
              the sizes of populations prior to exploitation, minimal                        
              population size during exploitation and current population                     
              sizes permit analyses of the effects of differing levels of                    
              exploitation on species with different biogeographical                         
              distributions and life-history characteristics.  Dr. Stephen                   
              Palumbi at the University of Hawaii will study the genetic                     
              population structure of three whale species in this context,                   
              the Humpback Whale, the Gray Whale and the Bowhead Whale.  The                 
              effect of demographic history will be determined by comparing                  
              the genetic structure of the three species.  Additional studies                
              will be carried out on the Humpback Whale.  The humpback has a                 
              world-wide distribution, but the Atlantic and Pacific                          
              populations of the northern hemisphere appear to be discrete                   
              populations, as is the population of the southern hemispheric                  
              oceans.  Each of these oceanic populations may be further                      
              subdivided into smaller isolates, each with its own migratory                  
              pattern and somewhat distinct gene pool.  This study will                      
              provide information on the level of genetic isolation among                    
              populations and the levels of gene flow and genealogical                       
              relationships among populations.  This detailed genetic                        
              information will facilitate international policy decisions                     
              regarding the conservation and management of these magnificent                 
              mammals.

Part 2: Regex Processing

Process Text Files

Analysis can be tricky to conduct across thousands of separate text files. For the first part I will clean up the data and organize it in a tabular format. The overall steps for the process are:

  1. Use regular expressions to parse the text files
  2. Put the parsed data in a pandas dataframe
  3. Clean the dataframe
  4. Create functions and code to analyze the dataframe and apply some transformations to the data

By the end of the analysis we hope to have a clean and processed dataset which can provide clear insight.

import os
import re

import pandas as pd

# get a list of all the files in the directory
all_files = os.listdir("data/")

# collect one row per file, then build the dataframe in a single pass
# (DataFrame.append was removed in pandas 2.0, so a list of dicts is used instead)
rows = []

# iterate over the list of files, opening each one and extracting info
for txt_file in all_files:
    with open('data/'+txt_file, encoding='utf-8', errors='ignore') as f:
        read_data = f.read()
    # regex text selections: fixed-width lookbehinds match the text after each label
    # (raw strings throughout, and the literal dot in "Amt." is escaped)
    nsv_file = re.findall(r'(?<=File        : ).*', read_data)
    nsv_org = re.findall(r'(?<=NSF Org     : ).*', read_data)
    nsv_amt = re.findall(r'(?<=Total Amt\.  : ).*', read_data)
    nsv_abs = re.findall(r'(?<=Abstract    :)[\s\S]*', read_data)
    # store the extracted fields as one row
    rows.append({'file': nsv_file, 'org': nsv_org, 'amt': nsv_amt, 'abs': nsv_abs})

# build the dataframe and save as csv
new_df = pd.DataFrame(rows, columns=['file', 'org', 'amt', 'abs'])
new_df.to_csv('tabular_dataset.csv')

# View the dataframe shape and some data, make sure all data was loaded
print("Rows, Columns : ",new_df.shape)
new_df.head()
Rows, Columns :  (4017, 4)
file org amt abs
0 [a9000875] [DBI ] [$42000 (Estimated)] [\n This award provides funds to ...
1 [a9009851] [BES ] [$79497 (Estimated)] [\n Pyruvate and phosphoenol-pyru...
2 [a9008597] [DMI ] [$12000 (Estimated)] [\n Research involves the ex...
3 [a9002904] [DMS ] [$40286 (Estimated)] [\n Work on this project wil...
4 [a9009845] [SES ] [$70553 (Estimated)] [\n ...
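Each pattern above relies on a positive lookbehind, (?<=...), which matches text only when it is directly preceded by the field label, so the label itself never ends up in the extracted value. In Python's re module a lookbehind must be fixed-width, which works here because the NSF files pad every label out to the same column. A minimal sketch of the idea on a single line (the sample string is copied from the file shown in Part 1):

# quick demonstration of a fixed-width lookbehind on one sample line
import re

sample_line = "File        : a9000006"
# match everything after the label without capturing the label itself
print(re.findall(r'(?<=File        : ).*', sample_line))
# ['a9000006']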

Part 3: Distribution of Sentence Lengths

For the final part of the analysis we want to see how many sentences there are in each file. First we will read the csv back in and clean it a little more, then we can write code that allows us to examine the sentences of an individual document.

# read the csv back in as a pandas dataframe; the round trip flattens
# every cell to a plain string, which prevents type issues below
data_set = pd.read_csv('tabular_dataset.csv')

# Clean the dataset with regex replace: the newlines were serialized as
# literal "\n" in the csv, so the pattern matches a backslash plus n
data_set = data_set.replace(r'\\n', ' ', regex=True)
data_set = data_set.replace(r'\(Estimated\)', ' ', regex=True)
# collapse runs of whitespace into a single space
data_set = data_set.replace(r' +', ' ', regex=True)
# Check out the first few rows of data
data_set.head()
Unnamed: 0 file org amt abs
0 0 ['a9000875'] ['DBI '] ['$42000 '] [" This award provides funds to Oklahoma State...
1 1 ['a9009851'] ['BES '] ['$79497 '] [' Pyruvate and phosphoenol-pyruvate (PEP) are...
2 2 ['a9008597'] ['DMI '] ['$12000 '] [' Research involves the exploration of the us...
3 3 ['a9002904'] ['DMS '] ['$40286 '] [' Work on this project will concentrate on pr...
4 4 ['a9009845'] ['SES '] ['$70553 '] [" In this project a model of instrumental and...
# split the abstract (abs) column on periods, a rough heuristic for sentence
# boundaries, and append the result as a new column called split
data_set['split'] = data_set['abs'].str.split(".")
# take another quick peek at the data after appending the column and splitting abs
data_set.head()
Unnamed: 0 file org amt abs split
0 0 ['a9000875'] ['DBI '] ['$42000 '] [" This award provides funds to Oklahoma State... [[" This award provides funds to Oklahoma Stat...
1 1 ['a9009851'] ['BES '] ['$79497 '] [' Pyruvate and phosphoenol-pyruvate (PEP) are... [[' Pyruvate and phosphoenol-pyruvate (PEP) ar...
2 2 ['a9008597'] ['DMI '] ['$12000 '] [' Research involves the exploration of the us... [[' Research involves the exploration of the u...
3 3 ['a9002904'] ['DMS '] ['$40286 '] [' Work on this project will concentrate on pr... [[' Work on this project will concentrate on p...
4 4 ['a9009845'] ['SES '] ['$70553 '] [" In this project a model of instrumental and... [[" In this project a model of instrumental an...
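Splitting on every period is a quick heuristic, but it stumbles on abbreviations: in the final output below, "E. coli" gets cut into two pieces. A more robust sketch, assuming nltk is installed and its punkt sentence model has been downloaded, could use nltk's sentence tokenizer instead (the split_nltk column name is just illustrative):

# a sketch of a more robust sentence split using nltk's tokenizer
# (assumes nltk is installed; punkt is a one-time download)
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
data_set['split_nltk'] = data_set['abs'].apply(sent_tokenize)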

Create Final Table

Create the final table showing the sentences in a given abstract and their total count. Change the row variable to select a different abstract, or wrap the code in a loop to print them all; one row is shown here for demonstration purposes.

# set row_ch to whichever row's abstract you want to analyze
row_ch = 1

# iterate through the chosen abstract, printing each sentence with its index
for iterator, line in enumerate(data_set['split'][row_ch]):
    print(data_set['file'][row_ch], " | ", iterator, " | ", line, "\n")
# Print the final number of sentences
print("Number of sentences: ", len(data_set['split'][row_ch]))
['a9009851']  |  0  |  [' Pyruvate and phosphoenol-pyruvate (PEP) are two central intermediates in cellular metabolism 

['a9009851']  |  1  |   They are the branch points of many catabolic and biosynthetic pathways 

['a9009851']  |  2  |   Although the importance of these intermediates has long been recognized, the use of both genetic and engineering techniques for studying the physiological effect of redirected pyruvate and PEP metabolism has not been reported 

['a9009851']  |  3  |   Pyruvate is not normally recycled back to PEP under glycolytic conditions 

['a9009851']  |  4  |   Because of this irreversibility, the yields of many specialty chemicals produced from glucose via bacterial fermentations remain low 

['a9009851']  |  5  |   The Principal Investigator proposes to construct and characterize strains of the bacteria E 

['a9009851']  |  6  |   coli which can recycle pyruvate back to PEP 

['a9009851']  |  7  |   In addition to improving the yields of amino acid fermentations, the results of this project could contribute to the understanding of how the cell distributes its carbon source, and how to decouple product formation from cell growth 

['a9009851']  |  8  |   '] 

Number of sentences:  9
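
To see the distribution across the whole corpus instead of one abstract at a time, a short sketch (the n_sentences column name is just illustrative) can count the sentences in every row and summarize them:

# count the sentences in every abstract and summarize the distribution
data_set['n_sentences'] = data_set['split'].apply(len)
print(data_set['n_sentences'].describe())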

Bibliography

[1] National Science Foundation, “About NSF,” https://www.nsf.gov/about/