… at least by some measure.
I recently downloaded 1754 PLoS Biology articles as XML files through the PLoS API and have looked at the distribution of the time to publication of PLoS Biology and other PLoS journals.
Here I will play a little with scikit-learn to see if I can discover those PLoS Biology articles (in my data set) that are most similar to one another.
I started writing a Python package (PLoSPy) for more convenient parsing of the XML files I have downloaded from PLoS.
import plospy
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import itertools
all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]
all_names[0:10]
['plos_biology_0004.dat',
 'plos_biology_0002.dat',
 'plos_biology_0001.dat',
 'plos_biology_0013.dat',
 'plos_biology_0008.dat',
 'plos_biology_0016.dat',
 'plos_biology_0010.dat',
 'plos_biology_0009.dat',
 'plos_biology_0000.dat',
 'plos_biology_0005.dat']
print(len(all_names))
18
To reduce memory use, I wrote the following generator function, which yields the article bodies one at a time. Passing this iterator to the vectorizer avoids loading all articles into memory at once. Even so, I have not been able to repeat this experiment with all 65,000-odd PLoS ONE articles without running out of memory.
ids = []
titles = []

def get_corpus(all_names):
    # Yield one article body at a time; ids and titles are filled as a side effect.
    for name in all_names:
        docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/' + name)
        for article in docs.docs:
            ids.append(article['id'])
            titles.append(article['title'])
            yield article['body']

corpus = get_corpus(all_names)
tfidf = TfidfVectorizer().fit_transform(corpus)
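The generator pattern above can be tried out without any PLoS data. The following is a minimal sketch in plain Python: `fake_files`, `iter_bodies` and `count_document_frequencies` are made-up names for illustration, and the consumer merely stands in for a vectorizer's fitting step. It shows that the `ids` list is only filled while the consumer iterates, which is why `ids` and `titles` are complete only after `fit_transform` has run.

```python
from collections import Counter

# Hypothetical stand-in for the parsed .dat files: file name -> list of articles.
fake_files = {
    'a.dat': [{'id': 'doi-1', 'body': 'cells divide and cells grow'},
              {'id': 'doi-2', 'body': 'genes regulate cells'}],
    'b.dat': [{'id': 'doi-3', 'body': 'birds sing'}],
}

ids = []

def iter_bodies(names):
    """Yield one article body at a time, recording its id as a side effect."""
    for name in names:
        for article in fake_files[name]:
            ids.append(article['id'])
            yield article['body']

def count_document_frequencies(bodies):
    """Consume the iterator once, keeping only per-token document counts."""
    df = Counter()
    for body in bodies:
        df.update(set(body.split()))
    return df

corpus = iter_bodies(sorted(fake_files))
ids_before = list(ids)  # snapshot taken before anything has been consumed
df = count_document_frequencies(corpus)

print(ids_before)
print(ids)
print(df['cells'])
```

Note that a generator can be consumed only once; to run a second pass over the articles, `iter_bodies` would have to be called again (and `ids` reset first).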
Just as a sanity check, the number of DOIs in our data set should now equal 1754 as this is the number of articles I downloaded in the first place.
The vectorizer generated a matrix with 139,748 columns (one per token, which roughly means one per unique word used across all 1754 PLoS Biology articles) and 1754 rows (one per article).
tfidf.shape
(1754, 139748)
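To see where the two dimensions come from, here is a small sketch in plain Python that builds the vocabulary of a toy corpus. The documents are invented, and the tokenization is an assumption on my part: it imitates what I understand to be the default behaviour of `TfidfVectorizer` (lowercasing, then keeping tokens of two or more word characters), so treat it as an approximation rather than the library's exact code path.

```python
import re

# Invented toy documents; each one would become one row of the matrix.
docs = [
    'The cell divides.',
    'A cell grows, the cell divides.',
    'Birds sing in the morning.',
]

# Assumption: lowercase the text and keep tokens of two or more word characters.
token_pattern = re.compile(r'\b\w\w+\b')

def tokenize(text):
    return token_pattern.findall(text.lower())

# One column per distinct token seen anywhere in the corpus.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
shape = (len(docs), len(vocabulary))

print(vocabulary)
print(shape)
```

The number of rows is fixed by the number of documents, while the number of columns grows with every new token, which is how 1754 articles can end up with well over a hundred thousand columns.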
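Let us now compute the pairwise cosine similarities between all 1754 vectors (articles) in the matrix tfidf. TfidfVectorizer normalizes each row to unit length by default, so the dot product computed by linear_kernel is the cosine similarity.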
I copied and pasted most of this from a StackOverflow answer that I cannot find now; I will add a link to the answer when I come across it again.
To get the ten most similar articles, we track the top five pairwise matches.
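# Each entry is [index of article 1, index of article 2, cosine similarity].
top_five = [[-1, -1, -1] for i in range(5)]
threshold = -1.

for index in range(len(ids)):
    # Similarity of this article to every article, itself included.
    cosine_similarities = linear_kernel(tfidf[index:index+1], tfidf).flatten()
    # Indices of the four highest similarities, best first.
    related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    first = related_docs_indices[0]
    second = related_docs_indices[1]
    # The best match of an article should be the article itself.
    if first != index:
        print('Error')
        break
    if cosine_similarities[second] > threshold:
        # Skip articles that are already part of a stored pair, so that
        # (A, B) is not stored a second time as (B, A).
        if first not in [top[0] for top in top_five] and first not in [top[1] for top in top_five]:
            scores = [top[2] for top in top_five]
            replace = scores.index(min(scores))
            # print 'replace',replace
            top_five[replace] = [first, second, cosine_similarities[second]]
            # print 'old threshold',threshold
            # Note: scores was collected before the replacement.
            threshold = min(scores)
            # print 'new threshold',threshold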
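The same idea can be written more compactly by scoring each unordered pair exactly once and letting `heapq.nlargest` keep the best `k`. This is an alternative sketch, not the loop above: it uses tiny hand-made vectors instead of the tf-idf matrix, plain Python instead of `linear_kernel`, and all helper names are made up. It normalizes each vector to unit length first, which is why a plain dot product then gives the cosine similarity. Unlike the loop above, it looks at every pair rather than only each article's nearest neighbour, so it does more work on a large corpus.

```python
import heapq
import itertools
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_pairs(vectors, k):
    """Return the k most similar (score, i, j) triples with i < j."""
    unit = [normalize(v) for v in vectors]
    scored = ((dot(unit[i], unit[j]), i, j)
              for i, j in itertools.combinations(range(len(unit)), 2))
    return heapq.nlargest(k, scored)

# Made-up term-weight vectors for four documents.
vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as document 0
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]

best = top_pairs(vectors, 2)
for score, i, j in best:
    print('%.2f  %d  %d' % (score, i, j))
```

Because `itertools.combinations` only produces pairs with `i < j`, no pair can appear twice and no article is ever compared with itself, so the self-match check is not needed in this variant.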
Let us now take a look at the results!
for tf in top_five:
    print('')
    print('Cosine Similarity: %.2f' % tf[2])
    print('Title 1: %s' % titles[tf[0]])
    print('http://www.plosbiology.org/article/info%3Adoi%2F' + str(ids[tf[0]]))
    print('')
    print('Title 2: %s' % titles[tf[1]])
    print('http://www.plosbiology.org/article/info%3Adoi%2F' + str(ids[tf[1]]))
    print('')
Cosine Similarity: 0.86
Cosine Similarity: 0.86
Cosine Similarity: 0.89
Cosine Similarity: 0.91
Cosine Similarity: 0.92