PLOS Biology Topics

Ever wonder what topics are discussed in PLOS Biology articles? Here I will apply an implementation of Latent Dirichlet Allocation (LDA) to a set of 1,754 PLOS Biology articles to uncover a possible collection of underlying topics.

I first read about LDA in Building Machine Learning Systems with Python, co-authored by Luis Coelho.

LDA seems to have been first described by Blei et al., and I will use the implementation provided by gensim, which was written by Radim Řehůřek.

import gensim

import plospy
import os

import nltk

import cPickle as pickle

With the following lines of code we open, parse, and tokenize all 1,754 PLOS Biology articles in our collection.

As this takes a bit of time and memory, I carried out all of these steps once and saved the resulting data structures to disk for later reuse; see further below.

all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]

article_bodies = []

for name_i, name in enumerate(all_names):
    docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/'+name)
    for article in docs.docs:
        article_bodies.append(article['body'])

We have 1,754 PLOS Biology articles in our collection:

len(article_bodies)

1754

To split the article bodies into sentences, we set up NLTK's Punkt sentence tokenizer and tell it about abbreviations that are common in these articles, so that their periods do not trigger spurious sentence breaks:
punkt_param = nltk.tokenize.punkt.PunktParameters()
punkt_param.abbrev_types = set(['et al', 'i.e', 'e.g', 'ref', 'c.f',
                                'fig', 'Fig', 'Eq', 'eq', 'eqn', 'Eqn',
                                'dr', 'Dr'])
sentence_splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(punkt_param)
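As a quick sanity check (on a made-up sentence), text containing one of these abbreviations should now come back as a single sentence rather than being split at the abbreviation's period:

sentence_splitter.tokenize('The results are shown in Fig. 3 and discussed below.')
# expected: ['The results are shown in Fig. 3 and discussed below.']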


sentences = []
for body in article_bodies:
    sentences.append(sentence_splitter.tokenize(body))


articles = []
for body in sentences:
    this_article = []
    for sentence in body:
        this_article.append(nltk.tokenize.word_tokenize(sentence))
    articles.append(this_article)
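Each sentence is in turn split into word tokens with NLTK's default word tokenizer; for example:

nltk.tokenize.word_tokenize('Transcription factors bind DNA.')
# ['Transcription', 'factors', 'bind', 'DNA', '.']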


pickle.dump(articles, open('plos_biology_articles_tokenized.list', 'wb'))


articles = pickle.load(open('plos_biology_articles_tokenized.list', 'rb'))


# build the stopword set once instead of re-reading the list for every token
stopwords = set(nltk.corpus.stopwords.words('english'))
is_stopword = lambda w: len(w) < 4 or w in stopwords
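A quick check on a few made-up tokens shows what the filter removes — note that the length cutoff also discards short domain terms such as 'dna':

[w for w in ['the', 'cell', 'neurons', 'and', 'dna'] if not is_stopword(w)]
# ['cell', 'neurons']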

Now unfurl each article into a single list of lowercased tokens, filtering out stopwords and words shorter than four characters:

articles_unfurled = []
for article in articles:
    this_article = []
    for sentence in article:
        this_article += [token.lower().encode('utf-8') for token in sentence if not is_stopword(token)]
    articles_unfurled.append(this_article)


pickle.dump(articles_unfurled, open('plos_biology_articles_unfurled.list', 'wb'))


articles_unfurled = pickle.load(open('plos_biology_articles_unfurled.list', 'rb'))

Dictionary and Corpus Creation

We now create a dictionary of all words (tokens) that appear in our collection of PLOS Biology articles, and a bag-of-words representation of each article (doc2bow).

dictionary = gensim.corpora.Dictionary(articles_unfurled)


dictionary.save('plos_biology.dict')


dictionary = gensim.corpora.Dictionary.load('plos_biology.dict')

I noticed that the word 'figure' occurs rather frequently in these articles, so let us exclude it and any other words that appear in more than half of the articles in this data set (thanks to Radim for pointing this out to me).

dictionary.filter_extremes()
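Called without arguments, filter_extremes applies gensim's defaults: it discards tokens that appear in fewer than 5 documents or in more than half of all documents, and then keeps only the 100,000 most frequent of the remainder. Written out explicitly, the call above is equivalent to:

dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)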


corpus = [dictionary.doc2bow(article) for article in articles_unfurled]
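Each corpus entry is a sparse bag-of-words vector, i.e. a list of (token_id, token_count) pairs. We can peek at the first article and map a token id back to its word (the exact output depends on the dictionary):

print(corpus[0][:3])
# e.g. [(0, 2), (1, 5), (2, 1)]
print(dictionary[corpus[0][0][0]])
# the word behind the first token id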


gensim.corpora.MmCorpus.serialize('plos_biology_corpus.mm', corpus)


model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, update_every=1, chunksize=100, passes=2, num_topics=20)
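Besides the global topics, the trained model can be queried for the topic mixture of an individual document: indexing the model with a bag-of-words vector returns a list of (topic_id, probability) pairs:

print(model[corpus[0]])
# e.g. [(3, 0.72), (11, 0.21)]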

And these are the twenty topics we find in 1,754 PLOS Biology articles:

for topic_i, topic in enumerate(model.print_topics(20)):
    print('topic # %d: %s\n' % (topic_i+1, topic))

topic # 1: 0.014*host + 0.010*bacterial + 0.009*bacteria + 0.007*infection + 0.007*strain + 0.006*strains + 0.006*plant + 0.005*tree + 0.005*phylogenetic + 0.005*genome

topic # 2: 0.015*memory + 0.013*sleep + 0.013*trials + 0.012*learning + 0.012*task + 0.008*participants + 0.008*performance + 0.008*brain + 0.008*trial + 0.007*behavioral

topic # 3: 0.021*signaling + 0.018*mutant + 0.012*flies + 0.011*pathway + 0.008*drosophila + 0.007*overexpression + 0.006*receptor + 0.006*expressing + 0.005*mutants + 0.005*staining

topic # 4: 0.012*stimulus + 0.012*neurons + 0.009*responses + 0.009*stimuli + 0.009*firing + 0.006*visual + 0.006*frequency + 0.005*cortex + 0.005*trials + 0.005*location

topic # 5: 0.027*membrane + 0.026*syntaxin + 0.022*tiles + 0.021*vesicles + 0.020*vesicle + 0.017*recruitment + 0.017*fusion + 0.017*endocytosis + 0.012*snare + 0.011*endosomes

topic # 6: 0.014*domain + 0.013*residues + 0.007*structures + 0.007*domains + 0.007*structural + 0.004*surface + 0.004*peptide + 0.004*residue + 0.004*crystal + 0.004*conformation

topic # 7: 0.010*population + 0.007*populations + 0.005*variation + 0.005*selection + 0.004*density + 0.003*rates + 0.003*fitness + 0.003*estimates + 0.003*estimated + 0.003*individuals

topic # 8: 0.015*membrane + 0.012*images + 0.011*actin + 0.011*fluorescence + 0.008*microscopy + 0.008*video + 0.008*localization + 0.007*surface + 0.007*migration + 0.007*image

topic # 9: 0.053*mice + 0.010*animals + 0.008*glucose + 0.008*insulin + 0.008*treatment + 0.007*mouse + 0.007*mitochondrial + 0.005*muscle + 0.005*release + 0.005*cholesterol

topic # 10: 0.016*infection + 0.015*mice + 0.011*virus + 0.010*infected + 0.010*viral + 0.008*immune + 0.008*t-cells + 0.006*c57bl/6 + 0.006*treg + 0.005*thymocytes

topic # 11: 0.009*phosphorylation + 0.007*buffer + 0.006*vitro + 0.006*transfected + 0.006*assay + 0.005*kinase + 0.005*antibodies + 0.005*treated + 0.005*lane + 0.005*vivo

topic # 12: 0.017*circadian + 0.016*phase + 0.015*auxin + 0.014*clock + 0.014*period + 0.012*cycle + 0.010*temperature + 0.009*plants + 0.008*dark + 0.008*oscillations

topic # 13: 0.015*genome + 0.009*recombination + 0.008*selection + 0.006*genomic + 0.006*chromosome + 0.005*mutations + 0.005*evolution + 0.005*divergence + 0.005*clusters + 0.004*conserved

topic # 14: 0.021*chromosome + 0.012*cohesin + 0.011*methylation + 0.011*chromosomes + 0.009*mirna + 0.008*mitotic + 0.007*male + 0.006*males + 0.006*mirnas + 0.006*melanogaster

topic # 15: 0.016*mice + 0.016*differentiation + 0.011*tumor + 0.009*proliferation + 0.008*tumors + 0.008*stem + 0.008*adult + 0.007*mouse + 0.006*hscs + 0.006*cancer

topic # 16: 0.027*mutants + 0.022*animals + 0.018*rnai + 0.017*elegans + 0.014*mutant + 0.013*larvae + 0.010*worms + 0.008*phenotype + 0.007*pathway + 0.006*:gfp

topic # 17: 0.024*transcription + 0.013*promoter + 0.012*transcriptional + 0.010*mrna + 0.009*targets + 0.008*chip + 0.008*promoters + 0.007*chromatin + 0.006*regulation + 0.006*regulatory

topic # 18: 0.024*neurons + 0.020*embryos + 0.009*axons + 0.008*dorsal + 0.007*synaptic + 0.007*axon + 0.006*stage + 0.006*embryo + 0.006*ventral + 0.006*muscle

topic # 19: 0.019*strains + 0.019*mrna + 0.017*strain + 0.012*motif + 0.009*splicing + 0.009*yeast + 0.009*exon + 0.007*deletion + 0.007*mutant + 0.007*motifs

topic # 20: 0.014*state + 0.012*network + 0.007*fluorescence + 0.006*networks + 0.006*dynamics + 0.006*parameters + 0.006*force + 0.005*feedback + 0.005*constant + 0.005*rates

Let us visualize these topics as color-coded bubbles. I adapted the JavaScript code for this visualization from one of Mike Bostock’s examples.

Topics with Lemmatized Tokens

As we can see, some of the tokens in the above topics are just singular and plural forms of the same word.

Let us see what topics we find after lemmatizing all of our tokens.

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
articles_lemmatized = []
for article in articles_unfurled:
    articles_lemmatized.append([wnl.lemmatize(token) for token in article])
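By default the WordNet lemmatizer treats tokens as nouns and maps plural forms to their singular lemma, which is exactly the duplication we saw above:

print(wnl.lemmatize('neurons'))
# neuron
print(wnl.lemmatize('mice'))
# mouse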

pickle.dump(articles_lemmatized, open('plos_biology_articles_lemmatized.list', 'wb'))

dictionary_lemmatized = gensim.corpora.Dictionary(articles_lemmatized)

dictionary_lemmatized.save('plos_biology_lemmatized.dict')

dictionary_lemmatized.filter_extremes()

corpus_lemmatized = [dictionary_lemmatized.doc2bow(article) for article in articles_lemmatized]

gensim.corpora.MmCorpus.serialize('plos_biology_corpus_lemmatized.mm', corpus_lemmatized)

model_lemmatized = gensim.models.ldamodel.LdaModel(corpus_lemmatized, id2word=dictionary_lemmatized, update_every=1, chunksize=100, passes=2, num_topics=20)

And here are the twenty topics we find with lemmatized tokens:

for topic_i, topic in enumerate(model_lemmatized.print_topics(20)):
    print('topic # %d: %s\n' % (topic_i+1, topic))

topic # 1: 0.052*embryo + 0.011*dorsal + 0.011*zebrafish + 0.009*neural + 0.008*anterior + 0.008*ventral + 0.007*cartilage + 0.007*signaling + 0.007*posterior + 0.007*muscle

topic # 2: 0.015*infection + 0.015*elegans + 0.014*host + 0.014*rnai + 0.012*worm + 0.009*parasite + 0.009*larva + 0.009*mosquito + 0.008*adult + 0.008*bacteria

topic # 3: 0.025*promoter + 0.023*transcription + 0.012*transcriptional + 0.011*chromatin + 0.011*chip + 0.010*cohesin + 0.010*methylation + 0.009*nuclear + 0.007*histone + 0.006*ino1

topic # 4: 0.017*microtubule + 0.013*nucleus + 0.011*mitotic + 0.011*insulin + 0.011*glucose + 0.011*spindle + 0.010*nuclear + 0.009*retina + 0.008*mitochondrial + 0.007*retinal

topic # 5: 0.024*stimulus + 0.015*neuron + 0.014*trial + 0.010*firing + 0.007*spike + 0.007*task + 0.007*current + 0.007*frequency + 0.006*recording + 0.006*memory

topic # 6: 0.020*phosphorylation + 0.016*kinase + 0.013*signaling + 0.012*antibody + 0.007*inhibitor + 0.007*receptor + 0.006*phosphorylated + 0.006*transfected + 0.006*infection + 0.006*peptide

topic # 7: 0.061*neuron + 0.031*axon + 0.022*synaptic + 0.013*dendrite + 0.012*brain + 0.011*neuronal + 0.010*synapsis + 0.010*cortical + 0.010*dendritic + 0.008*axonal

topic # 8: 0.023*fly + 0.017*drosophila + 0.012*larva + 0.012*signaling + 0.012*phenotype + 0.010*clone + 0.010*defect + 0.009*rescue + 0.008*disc + 0.008*genotype

topic # 9: 0.010*network + 0.010*dynamic + 0.008*force + 0.007*parameter + 0.007*distance + 0.007*movement + 0.006*simulation + 0.006*video + 0.005*motor + 0.005*direction

topic # 10: 0.019*residue + 0.011*peptide + 0.007*substrate + 0.007*reaction + 0.007*structural + 0.007*enzyme + 0.006*crystal + 0.006*subunit + 0.006*molecule + 0.005*conformation

topic # 11: 0.038*chromosome + 0.025*allele + 0.018*recombination + 0.015*locus + 0.008*hybrid + 0.008*female + 0.008*male + 0.008*marker + 0.007*genomic + 0.007*primer

topic # 12: 0.016*tumor + 0.012*differentiation + 0.010*tissue + 0.008*blood + 0.007*proliferation + 0.007*culture + 0.007*wound + 0.007*liver + 0.006*treatment + 0.006*stem

topic # 13: 0.038*strain + 0.010*yeast + 0.008*plasmid + 0.007*coli + 0.006*grown + 0.006*deletion + 0.006*codon + 0.006*ribosome + 0.006*phage + 0.005*culture

topic # 14: 0.033*fiber + 0.026*receptor + 0.016*aggregation + 0.015*aggregate + 0.008*trkb + 0.007*granule + 0.007*liposome + 0.007*bdnf + 0.006*body + 0.006*signaling

topic # 15: 0.015*genome + 0.009*cluster + 0.006*motif + 0.005*selection + 0.005*dataset + 0.005*2003 + 0.005*family + 0.004*2002 + 0.004*conserved + 0.004*read

topic # 16: 0.007*estimate + 0.007*female + 0.007*male + 0.006*variation + 0.006*selection + 0.006*fitness + 0.005*variable + 0.005*density + 0.005*trait + 0.005*bird

topic # 17: 0.022*membrane + 0.010*antibody + 0.008*fluorescence + 0.008*vesicle + 0.006*surface + 0.006*transfected + 0.006*expressing + 0.006*fusion + 0.005*localization + 0.005*particle

topic # 18: 0.025*plant + 0.019*cycle + 0.014*circadian + 0.013*clock + 0.013*phase + 0.012*auxin + 0.012*period + 0.010*feedback + 0.010*rhythm + 0.010*oscillation

topic # 19: 0.052*mrna + 0.025*transcript + 0.020*exon + 0.011*splicing + 0.011*motif + 0.010*transcription + 0.009*regulation + 0.008*translation + 0.008*association + 0.007*intron

topic # 20: 0.014*subject + 0.014*brain + 0.014*cortex + 0.012*participant + 0.008*task + 0.007*visual + 0.007*word + 0.007*object + 0.005*neural + 0.005*trial