The Ten Most Similar PLoS Biology Articles

... at least by some measure.

I recently downloaded 1754 PLoS Biology articles as XML files through the PLoS API and looked at the distribution of time to publication for PLoS Biology and other PLoS journals.

Here I will play a little with scikit-learn to see if I can discover those PLoS Biology articles (in my data set) that are most similar to one another.

Import Packages

I started writing a Python package (PLoSPy) for more convenient parsing of the XML files I have downloaded from PLoS.

import plospy
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import itertools

Discover Data Files on Hard Disk

all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]
print(len(all_names))

Vectorize all Articles

To reduce memory use, I wrote the following generator function, which yields the article bodies one at a time. Passing this generator to the vectorizer avoids loading all articles into memory at once. Even so, I have not been able to repeat this experiment with all 65,000-odd PLoS ONE articles without running out of memory.

ids = []
titles = []

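def get_corpus(all_names):
    # Yield article bodies one at a time; record DOIs and titles as a side effect.
    for name_i, name in enumerate(all_names):
        docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/'+name)
        # NOTE: the iterable is truncated in the original listing;
        # docs.get_articles() and the 'doi'/'title' keys are assumptions.
        for article in docs.get_articles():
            ids.append(article['doi'])
            titles.append(article['title'])
            yield article['body']

corpus = get_corpus(all_names)
tfidf = TfidfVectorizer().fit_transform(corpus)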
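
The streaming pattern can be tried on a toy corpus without any PLoS data. The following is a minimal sketch: `fake_articles`, `stream_bodies`, and the placeholder DOIs are hypothetical stand-ins and not part of PLoSPy. It shows that lists filled as a side effect of the generator stay empty until the vectorizer has actually consumed it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for parsed articles; the real data comes from PLoS XML files.
fake_articles = [
    {'doi': '10.0000/a', 'title': 'Article A', 'body': 'cells divide and grow'},
    {'doi': '10.0000/b', 'title': 'Article B', 'body': 'genes regulate how cells grow'},
    {'doi': '10.0000/c', 'title': 'Article C', 'body': 'birds migrate in winter'},
]

toy_ids = []
toy_titles = []

def stream_bodies(articles):
    # Yield one body at a time and record metadata as a side effect.
    for article in articles:
        toy_ids.append(article['doi'])
        toy_titles.append(article['title'])
        yield article['body']

bodies = stream_bodies(fake_articles)
ids_before = len(toy_ids)  # the generator body has not run yet
toy_tfidf = TfidfVectorizer().fit_transform(bodies)

# After fitting, one DOI has been recorded per row of the matrix.
print(ids_before, len(toy_ids), toy_tfidf.shape[0])
```

Because the lists are only complete after `fit_transform` returns, any check on their length has to come after the vectorizer call.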

Just as a sanity check, the number of DOIs in our data set should now equal 1754 as this is the number of articles I downloaded in the first place.


The vectorizer generated a matrix with 139,748 columns (one per token, that is, roughly one per distinct word found across all 1754 PLoS Biology articles) and 1754 rows (corresponding to individual articles).
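
To make the row and column layout concrete, here is a small sketch on a made-up three-document corpus (the sentences are invented for illustration). With its default settings, `TfidfVectorizer` lowercases the text and keeps tokens of two or more word characters, and its `vocabulary_` attribute maps each token to a column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents standing in for article bodies.
toy_corpus = [
    'cells divide and grow',
    'genes regulate how cells grow',
    'birds migrate in winter',
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_corpus)

# Rows are documents, columns are distinct tokens.
n_docs, n_tokens = matrix.shape
print(n_docs, n_tokens)

# vocabulary_ maps each token to the index of its column.
print(sorted(vectorizer.vocabulary_))
```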


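Let us now compute the pairwise cosine similarities between all 1754 vectors (articles) in the matrix tfidf. TfidfVectorizer normalizes each row to unit length by default, so the plain dot product computed by linear_kernel equals the cosine similarity. I copied and pasted most of this from a StackOverflow answer that I cannot find now; I will add a link to the answer when I come across it again.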
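
The reason `linear_kernel` can stand in for a cosine computation is that `TfidfVectorizer` uses `norm='l2'` by default, so every row has unit length and the dot product of two rows is already their cosine similarity. A short sketch on an invented toy corpus checks this numerically:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented toy documents standing in for article bodies.
toy_corpus = [
    'cells divide and grow',
    'genes regulate how cells grow',
    'birds migrate in winter',
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)

# Euclidean length of each row; expected to be 1 under the default norm='l2'.
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

print(np.allclose(row_norms, 1.0), np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two results would no longer agree and `cosine_similarity` would be the function to use.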

To get the ten most similar articles, we track the top five pairwise matches.

top_five = [[-1,-1,-1] for i in range(5)]
threshold = -1.

for index in range(len(ids)):
    # Similarity of article `index` to every article (including itself).
    cosine_similarities = linear_kernel(tfidf[index:index+1], tfidf).flatten()
    # Indices of the four highest similarities, best first.
    related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    first = related_docs_indices[0]
    second = related_docs_indices[1]
    # The best match of an article should be the article itself.
    if first != index:
        print('Error')

    if cosine_similarities[second] > threshold:
        # Skip articles that already appear in a tracked pair, e.g. (b, a) after (a, b).
        if first not in [top[0] for top in top_five] and first not in [top[1] for top in top_five]:
            scores = [top[2] for top in top_five]
            replace = scores.index(min(scores))
            top_five[replace] = [first, second, cosine_similarities[second]]
            # Recompute the threshold from the updated list; using the old
            # minimum would let weaker pairs displace stronger ones.
            threshold = min(top[2] for top in top_five)
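
For a corpus of this size there is also a simpler route, sketched here as an alternative rather than as the method used in this post: compute the full similarity matrix once, sort all pairs, and greedily keep the best pairs that do not reuse an article. The function name `top_disjoint_pairs` and the toy corpus are made up for illustration. The full matrix has one entry per pair of articles, so this only suits corpora of a few thousand documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_disjoint_pairs(matrix, k):
    """Return up to k (i, j, similarity) tuples, best first, with no article reused."""
    similarities = linear_kernel(matrix, matrix)
    n = similarities.shape[0]
    # All pairs with i < j, sorted by decreasing similarity.
    candidates = sorted(
        ((similarities[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used = set()
    pairs = []
    for score, i, j in candidates:
        # Greedy step: skip a pair if either article is already taken.
        if i in used or j in used:
            continue
        pairs.append((i, j, score))
        used.update((i, j))
        if len(pairs) == k:
            break
    return pairs

# Invented toy documents: 0/1 and 2/3 are near-duplicates, 4 is unrelated.
toy_corpus = [
    'cells divide and grow',
    'cells divide and grow fast',
    'birds migrate in winter',
    'birds migrate south in winter',
    'genes regulate proteins',
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)
toy_pairs = top_disjoint_pairs(toy_tfidf, 2)
for i, j, score in toy_pairs:
    print('%d and %d: %.2f' % (i, j, score))
```

Unlike the loop above, this considers every pair rather than only each article's nearest neighbour, so the two approaches can return different results when an article has several close matches.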

The Most Similar Articles

Let us now take a look at the results!

for tf in top_five:
    print('')
    print('Cosine Similarity: %.2f' % tf[2])
    print('Title 1: %s' % titles[tf[0]])
    print('')
    print('Title 2: %s' % titles[tf[1]])
    print('')