Update: Web Application in Development

I started developing a web application based on the simple steps outlined in this blog post. I am entirely new to developing and deploying web applications, so any feedback and criticism are greatly appreciated!

Check out the app: TLDRMed.

Author Actions in PLoS Biology Articles

I have downloaded a set of just over 1,700 PLoS Biology articles and have played a little with some Python packages to work out the ten most similar PLoS Biology articles and the most frequent bigrams in PLoS Biology.

Here I will play a tiny bit with sentence tokenization and token matching to work out author actions. Scientific articles are often written in a descriptive way, and authors commonly refer to themselves as "we".

To parse my set of PLoS Biology XML files, I will make use of a small utility library, PLoSPy, that I have started to put together.

import plospy
import os

from nltk.tokenize import sent_tokenize

from matplotlib import pyplot
%matplotlib inline

# Directory holding the downloaded PLoS Biology data files
data_dir = '../plos/plos_biology/plos_biology_data'
all_names = [name for name in os.listdir(data_dir) if '.dat' in name]

# Map each article DOI to its body text and to its title
article_bodies = {}
article_titles = {}

for name in all_names:
    docs = plospy.PlosXml(os.path.join(data_dir, name))
    for article in docs.docs:
        article_bodies[article['id']] = article['body']
        article_titles[article['id']] = article['title']

There are 1,754 articles in my data set:



Here are the first five DOIs in my data set:



Let us now use one of the standard sentence tokenizers shipped with NLTK:

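sentences = {}
for doi in article_bodies.keys():
    tokens = sent_tokenize(article_bodies[doi])
    sentences[doi] = []
    for sentence in tokens:
        # Assumption: the loop body is not shown in this post; keeping each
        # sentence as a plain string matches the sample output further down
        sentences[doi].append(sentence)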

The data structure sentences is a dictionary that maps each DOI to the list of sentences detected in that article, so the key to access a list is just the DOI itself.
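To make the shape of such a dictionary concrete, here is a minimal sketch on made-up data. The DOIs and texts are hypothetical, and naive_sent_split is a crude regex stand-in for NLTK's sent_tokenize, used only so the example runs without NLTK's tokenizer models; it is not what this post uses.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (crude stand-in for sent_tokenize)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Hypothetical article bodies keyed by made-up DOIs
toy_bodies = {
    '10.0000/example.0001': 'We measured growth. Growth was slow! Why did it stall?',
    '10.0000/example.0002': 'Selection varies between years. We tested this.',
}

# Same layout as the sentences dictionary: DOI -> list of sentence strings
toy_sentences = {doi: naive_sent_split(body) for doi, body in toy_bodies.items()}

for doi, sents in toy_sentences.items():
    print(doi, sents)
```

A real sentence tokenizer handles abbreviations and citations far better than this regex; the point here is only the DOI-to-list layout.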



For the first DOI in our data set, the first four detected sentences are the following:




[u'  Introduction  Evolution is expected to occur when selection acts on a trait that has a heritable basis of phenotypic variation.',
 u'Quantitative genetic models allow an evolutionary trajectory to be predicted from the strength of selection and the amount of genetic variance, usually expressed as the heritability, h 2 [ 1 ].',
 u'However, while simple theoretical models assume a constant environment, environmental heterogeneity has long been recognised as an important factor influencing the evolutionary dynamics of fitness-related traits in the wild [ 2 ].',
 u'Specifically, selection can vary considerably from year to year within a population [ 3 , 4 ], and it is increasingly recognised that environmental conditions also influence the heritability on which any response to selection depends [ 5 , 6 ].']

Let us now apply an oversimplified rule to detect author actions of the form “We ….”: All sentences that contain the pronoun we (or We) are assumed to describe some author action.

author_actions = {}

for doi in sentences.keys():
    author_actions[doi] = []
    for sentence in sentences[doi]:
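        if any([token == 'we' or token == 'We' for token in sentence.split()]):
            # Assumption: the body of this branch is not shown in this post; storing
            # the token list matches the ' '.join(sentence) calls used further down
            author_actions[doi].append(sentence.split())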

The data structure author_actions is similar to sentences but stores only those sentences that are assumed to describe author action.
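One limitation of matching whitespace-separated tokens is that forms like "we," or "(we" are missed. The sketch below compares the rule as described with a word-boundary regex. The regex variant and the example sentences are my own illustration under that assumption, not part of the analysis in this post.

```python
import re

def has_we_split(sentence):
    """The rule described above: an exact 'we' or 'We' token after whitespace splitting."""
    return any(token in ('we', 'We') for token in sentence.split())

# Hypothetical refinement: word boundaries also catch "we," or "(we"
WE_PATTERN = re.compile(r'\b[Ww]e\b')

def has_we_regex(sentence):
    """True if 'we' or 'We' occurs as a whole word anywhere in the sentence."""
    return WE_PATTERN.search(sentence) is not None

# Made-up sentences for illustration
examples = [
    'We measured expression levels.',
    'Here, as before (we, too, were surprised), levels dropped.',
    'The western blot was repeated.',
]
for s in examples:
    print(has_we_split(s), has_we_regex(s), s)
```

Both variants still treat every sentence with the pronoun as an author action, so the rule stays oversimplified either way.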



Let us now count the number of author action sentences detected per article and plot a histogram.

no_author_actions = [len(author_actions[doi]) for doi in author_actions.keys()]

# density=True replaces the normed argument, which newer matplotlib releases no longer accept
n, bins, patches = pyplot.hist(no_author_actions, max(no_author_actions)+1, density=True, facecolor='g', alpha=0.75)
pyplot.xlabel('Number of Author Actions per Article')


This histogram shows that, most frequently, articles will describe somewhere around 30 to 50 author actions.
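A reading like this can be cross-checked by tabulating the counts directly instead of eyeballing the plot. A minimal sketch, assuming Python 3 and using made-up numbers (toy_counts is hypothetical and not taken from my data set):

```python
from collections import Counter

# Hypothetical per-article counts of author-action sentences
toy_counts = [30, 42, 42, 35, 50, 12, 42, 35]

# How many articles share each count, and which count is most frequent
frequency = Counter(toy_counts)
most_common_count, n_articles = frequency.most_common(1)[0]

# Share of articles whose count falls in the 30 to 50 range (inclusive)
share_30_50 = sum(1 for c in toy_counts if 30 <= c <= 50) / len(toy_counts)

print(most_common_count, n_articles, share_30_50)
```

Applied to the real no_author_actions list, the same two numbers would show how well "around 30 to 50" describes the data.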

Anyone familiar with scientific literature in the life sciences will know that authors often describe what their article is intended to present by pointing out what they have shown.

So let us go through a few of the articles in our data set and print out all those author action sentences that contain the phrase have shown.

As you can see in the sample output below, the phrase we have shown is used to describe both what was shown in the current article and what has been shown in earlier work by the same authors.
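One rough way to tell the two uses apart might be to look for cues of earlier work, such as a citation marker of the form [ 2 ] (as in the sample sentences further up) or the word "previously". The sketch below is only a guess at such a heuristic: the cue list, the function name likely_earlier_work, and the example sentences are all assumptions for illustration, and I have not evaluated it on the data.

```python
import re

# Citation markers in the extracted text look like "[ 2 ]" or "[ 3 , 4 ]"
CITATION = re.compile(r'\[\s*\d+(\s*,\s*\d+)*\s*\]')

def likely_earlier_work(sentence):
    """Heuristic guess: a 'have shown' sentence with a citation marker or the word 'previously'."""
    lowered = sentence.lower()
    if 'have shown' not in lowered:
        return False
    return bool(CITATION.search(sentence)) or 'previously' in lowered

# Made-up sentences for illustration
toy = [
    'We have shown previously that Tat is acetylated [ 12 ].',
    'Here we have shown that the pathway is required for signaling.',
    'We measured binding in vitro [ 7 , 8 ].',
]
flags = [likely_earlier_work(s) for s in toy]
print(flags)
```

Sentences that cite earlier work while describing a new result would be misclassified, so this is at best a first filter.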

I wonder to what extent this sort of approach may be used to summarize and condense scientific articles.

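# Look at the first 50 articles only; list() makes the keys sliceable on Python 3 as well
for doi in list(author_actions.keys())[:50]:
    if any('have shown' in ' '.join(sentence) for sentence in author_actions[doi]):
        print('Title: ' + article_titles[doi])
        for sentence in author_actions[doi]:
            # Each stored sentence is a list of tokens, so join it before matching
            if 'have shown' in ' '.join(sentence):
                print(' '.join(sentence))
        print('\n')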

Use of a Dense Single Nucleotide Polymorphism Map for In Silico Mapping in the Mouse:

Translation Repression in Human Cells by MicroRNA-Induced Gene Silencing Requires RCK/p54:

SIRT1 Regulates HIV Transcription via Tat Deacetylation:

HIV-1 Tat Stimulates Transcription Complex Assembly through Recruitment of TBP in the Absence of TAFs:

Exdpf Is a Key Regulator of Exocrine Pancreas Development Controlled by Retinoic Acid and ptf1a in Zebrafish:

USP8 Promotes Smoothened Signaling by Preventing Its Ubiquitination and Changing Its Subcellular Localization:

Hedgehog-Regulated Ubiquitination Controls Smoothened Trafficking and Cell Surface Expression in Drosophila:

Raf Activation Is Regulated by Tyrosine 510 Phosphorylation in Drosophila:

Notch-Deficient Skin Induces a Lethal Systemic B-Lymphoproliferative Disorder by Secreting TSLP, a Sentinel for Epidermal Integrity:

Nongenetic Individuality in the Host–Phage Interaction:

A Feedback Loop between Dynamin and Actin Recruitment during Clathrin-Mediated Endocytosis:
