• 19 Nov 2017 » Bayes in JavaScript
• 24 Oct 2014 » Extracting Text from PDF and HTML: An Online Data Pipeline
• In this article I describe a data pipeline that extracts text from online PDF and HTML sources and stores the extracted and compressed text as blobs in a SQL database for further processing. Motivation: I consume online media through different apps on different devices (plain browsers, Feedly, Twitter, etc.) and e-mail myself links to interesting reads for future reference. Gmail addresses are great for this purpose since an e-mail sent to your-gmail-address+STRING@gmail.com still ends up in the inbox of your-gmail-address@gmail.com. Combining this with Gmail filters allows…
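The storage step described in this excerpt (compress the extracted text, keep it as a blob in a SQL database) can be sketched as follows. This is only a minimal illustration under stated assumptions: the excerpt does not name the database or the compression scheme, so SQLite, zlib, the `documents` table, and the function names `open_store`, `store_text`, and `load_text` are all hypothetical choices.

```python
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical `documents` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "url TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def store_text(conn, url, text):
    """Compress the extracted text with zlib and store it as a blob keyed by URL."""
    blob = zlib.compress(text.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO documents (url, body) VALUES (?, ?)",
        (url, blob),
    )
    conn.commit()


def load_text(conn, url):
    """Return the decompressed text for a URL, or None if it was never stored."""
    row = conn.execute(
        "SELECT body FROM documents WHERE url = ?", (url,)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")
```

Keying the table on the URL means that re-processing the same link overwrites the earlier blob instead of duplicating it; whether the original pipeline does this is not stated in the excerpt.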

• 10 Oct 2014 » Online Multi-Label Classification: Classifying Wikipedia Changes
• I recently joined a Kaggle competition on multi-label text classification and have learned a ton from basic code that one of the competitors shared in the forums. The code of the generous competitor does logistic regression classification for multiple classes with stochastic gradient ascent. It is also well-suited for online learning as it uses the hashing trick to one-hot encode boolean, string, and categorical features. To better understand these methods and tricks, here I apply some of them to a multi-label problem I chose…
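The combination named in this excerpt, the hashing trick feeding an online logistic regression trained by stochastic gradient ascent, can be sketched roughly as below. This is not the competitor's code: the class name `OnlineLogisticRegression`, the helper `hash_features`, the bucket count, the learning rate, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration. For a multi-label problem one such model would be kept per label.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def hash_features(record, n_buckets=N_BUCKETS):
    """Hashing trick: map each 'name=value' string to a bucket index.

    Each active bucket acts as a one-hot feature with value 1.
    """
    return [
        zlib.crc32(("%s=%s" % (name, value)).encode("utf-8")) % n_buckets
        for name, value in sorted(record.items())
    ]


class OnlineLogisticRegression:
    """Binary logistic regression updated one example at a time."""

    def __init__(self, n_buckets=N_BUCKETS, learning_rate=0.1):
        self.n_buckets = n_buckets
        self.learning_rate = learning_rate
        self.weights = [0.0] * n_buckets

    def predict(self, record):
        z = sum(self.weights[i] for i in hash_features(record, self.n_buckets))
        z = max(min(z, 35.0), -35.0)  # clip to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, record, label):
        """One stochastic gradient ascent step on the log-likelihood."""
        p = self.predict(record)
        for i in hash_features(record, self.n_buckets):
            # Gradient of the log-likelihood for an active feature is (y - p).
            self.weights[i] += self.learning_rate * (label - p)
        return p
```

Because the weight vector has a fixed size and each update touches only the active buckets, the model never needs to see the full vocabulary of feature values in advance, which is what makes the approach suitable for a stream of examples.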

• 05 Aug 2014 » Science 2 Data Science Days 1 and 2
• S2DS Day 1: Science 2 Data Science (S2DS) is an industry-sponsored course that aims at leading graduates with a quantitative background into the industry of data analytics and data science. The S2DS program has at least two components: during the first six days of the program we hear about technologies and methodologies that are deemed essential for practicing data science, and most of the remaining time we work in small teams on specific projects with the industry partners of the program. My intention with this…

• 12 Jun 2014 » Python Shelve Thread Safety
• As the documentation of shelve says, shelve.Shelf objects are not thread-safe: "The shelve module does not support concurrent read/write access to shelved objects." – https://docs.python.org/2/library/shelve.html#restrictions To write to a Shelf object in an environment where multiple threads may end up writing to it, use a global threading.Lock object.

    from threading import Lock
    import shelve

    mutex = Lock()  # one lock shared by all writer threads

    mutex.acquire()
    try:
        db = shelve.open(db_name)  # db_name: path of the shelf file
        # write to db
        db.close()
    finally:
        mutex.release()  # release the lock even if the write fails
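The same locking pattern can be exercised with several writer threads. The following is a minimal sketch, assuming Python 3 (where both `threading.Lock` and `shelve.open` work as context managers); the names `shelf_lock`, `write_entry`, and `run_demo` are hypothetical and only serve the illustration.

```python
import os
import shelve
import tempfile
import threading

# A single module-level lock shared by every thread that writes to the shelf.
shelf_lock = threading.Lock()


def write_entry(db_name, key, value):
    """Open the shelf, write one entry, and close it while holding the lock."""
    with shelf_lock:  # released automatically, even if the write raises
        with shelve.open(db_name) as db:
            db[key] = value


def run_demo(n_threads=8):
    """Let n_threads threads write one key each, then read everything back."""
    db_name = os.path.join(tempfile.mkdtemp(), "demo_shelf")
    threads = [
        threading.Thread(target=write_entry, args=(db_name, "key%d" % i, i))
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with shelve.open(db_name) as db:
        return dict(db)
```

Holding the lock around the whole open/write/close sequence means that only one thread has the shelf open at any time, at the cost of serializing all writes.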

• 22 Mar 2014 » PLOS Biology-Inspired PLOS Biology Articles
• This past week I had my first encounter with the concept of graph databases which lend themselves perfectly to modeling and capturing linked data. I started reading the free and brilliant book Graph Databases by Robinson, Webber, and Eifrem and began playing around with Python bulbs by James Thornton. I further took the data set of 1,754 PLOS Biology articles that I have examined on this blog multiple times and created a Rexster-based graph database from them. Apart from the obvious authors, DOIs, and titles I…

• 06 Mar 2014 » PLOS Biology Outlier Detection
• In this blog post I play with the dimensionality reduction techniques SVD and Isomap to map a corpus of 1,754 PLOS Biology articles from 27,210-dimensional feature space to 2-dimensional space. This sort of approach is often used to estimate which data points are near each other. Here, I realized that the data points far away from the bulk discuss (almost) consistently neurobiological topics. As far as I am aware PLOS Biology publishes articles on all biological topics so there is probably no editorial factor…
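The SVD half of this idea, projecting a high-dimensional feature matrix to two dimensions and looking for points far from the bulk, can be sketched with NumPy. This is an illustration only: the centring step, the use of the median as the "bulk", and the function names `project_to_2d` and `distance_from_bulk` are assumptions, and the post's actual feature matrix and Isomap step are not reproduced here.

```python
import numpy as np


def project_to_2d(X):
    """Project the rows of X onto their first two singular directions."""
    X_centered = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(X_centered, full_matrices=False)
    # Scaling the left singular vectors by the singular values gives the
    # coordinates of each row along the two leading directions.
    return U[:, :2] * s[:2]


def distance_from_bulk(coords):
    """Euclidean distance of each projected point from the coordinate-wise median."""
    centre = np.median(coords, axis=0)
    return np.sqrt(((coords - centre) ** 2).sum(axis=1))
```

Ranking documents by `distance_from_bulk` gives a simple list of outlier candidates to inspect by hand; any threshold for calling a point an outlier would be a further modelling choice.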

• 26 Feb 2014 » PLOS Biology Topics Part II
• I have received some great suggestions from both Radim and Asim that I wanted to follow up on in this brief blog post. The Effect of Changing passes: Following on from the unlemmatized dictionary we built in the last blog post it is interesting to see how the quality of topics changes when altering the passes parameter of the LdaModel object.

    %matplotlib inline
    from matplotlib import pyplot
    import numpy
    import gensim
    import cPickle as pickle

    dictionary = gensim.corpora.dictionary.Dictionary().load('plos_biology.dict')
    dictionary.filter_extremes()
    corpus = gensim.corpora.MmCorpus('plos_biology_corpus.mm')

  Lowering passes

    model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, update_every=1, chunksize=100, passes=1, num_topics=20)
    for topic_i,…

• 19 Feb 2014 » PLOS Biology Topics
• Ever wonder what topics are discussed in PLOS Biology articles? Here I will apply an implementation of Latent Dirichlet Allocation (LDA) on a set of 1,754 PLOS Biology articles to work out what a possible collection of underlying topics could be. I first read about LDA in Building Machine Learning Systems with Python co-authored by Luis Coelho. LDA seems to have been first described by Blei et al. and I will use the implementation provided by gensim which was written by Radim Řehůřek.

    import gensim
    import plospy
    import os
    import nltk
    import cPickle…

• 19 Jan 2014 » Citation Counts in PLoS Biology Articles
• As in a number of earlier blog posts, I will play around with just over 1,700 PLoS Biology articles that I downloaded last year. In this blog post I will attempt to tackle a question that I have been curious about for a long time now: How often are references cited in the main text of scientific articles? That is, what is the citation count of each reference listed in the bibliography of peer-reviewed articles? This question touches upon the observation that many references…

• 17 Jan 2014 » Most Frequent Author Actions in PLoS Biology Articles
• I started developing a web application based on the simple steps outlined in this blog post. I am entirely new to developing and deploying web applications so any feedback and criticism is greatly appreciated! Check out the app: TLDRMed. Most Frequent Author Actions in PLoS Biology Articles: In a previous blog post I attempted to summarize PLoS Biology articles by extracting sentences that contained the claimed author action “we have shown”. As asked by Noam Ross, it might be interesting to see what the most frequent…

• 11 Jan 2014 » PLoS Biology Author Actions
• I started developing a web application based on the simple steps outlined in this blog post. I am entirely new to developing and deploying web applications so any feedback and criticism is greatly appreciated! Check out the app: TLDRMed. Author Actions in PLoS Biology Articles: I have downloaded a set of just over 1,700 PLoS Biology articles and have played a little with some Python packages to work out the ten most similar PLoS Biology articles and the most frequent bigrams in PLoS Biology. Here I will…

• 09 Jan 2014 » Combining Crank-Nicolson and Runge-Kutta to Solve a Reaction-Diffusion System
• We have already derived the Crank-Nicolson method to integrate the following reaction-diffusion system numerically: Please refer to the earlier blog post for details. In our previous derivation, we constructed the following stencil that we would go on to rearrange into a system of linear equations that we needed to solve every time step: where $j$ and $n$ are space and time grid points respectively. Rearranging the above set of equations, we effectively integrate the reaction part with the explicit Euler method like so: For functions $f$ that…

• 08 Jan 2014 » Reaction Diffusion System on an Accelerating Domain
• In a previous blog post we derived equations that describe the time behaviour of a reaction-diffusion system on a growing space domain. In that work we assumed that the velocity of domain growth is an increasing function of one of the two unknown variables (here, protein concentrations) described by our reaction-diffusion system. Let us add another layer to this problem and assume that it is not the growth velocity but the growth acceleration that is a function of this unknown protein concentration. We derived…

• 05 Jan 2014 » PLoS Biology Bigrams
• Here I will use the Natural Language Toolkit and a recipe from Python Text Processing with NLTK 2.0 Cookbook to work out the most frequent bigrams in the PLoS Biology articles that I downloaded last year and have described in previous posts here and here. The amusing twist in this blog post is that the most frequent bigram, after filtering out stopwords, is unpublished data. As before I will use a small helper library that I started putting together:

    import plospy
    import os
    all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in…
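Counting bigrams after filtering out stopwords can be sketched with the standard library alone. This is not the NLTK recipe the post refers to: the tiny `STOPWORDS` set, the regular-expression tokenizer, and the function name `top_bigrams` are assumptions for illustration, and "filtering" is taken here to mean dropping every bigram that contains a stopword.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; NLTK ships a fuller one.
STOPWORDS = {"a", "and", "are", "as", "in", "is", "of", "that", "the", "this", "to", "we"}


def top_bigrams(text, n=3, stopwords=STOPWORDS):
    """Return the n most frequent word pairs that contain no stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    # Adjacent pairs are formed over the whole text, so a pair may span a
    # sentence boundary; a sentence tokenizer would avoid that.
    pairs = zip(words, words[1:])
    counts = Counter(
        pair for pair in pairs
        if pair[0] not in stopwords and pair[1] not in stopwords
    )
    return counts.most_common(n)
```

For example, `top_bigrams("We cite unpublished data here. The unpublished data are shown. Unpublished data again.", 1)` returns `[(('unpublished', 'data'), 3)]`.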

• 03 Jan 2014 » Good Reads January 2014
• Stroustrup: A Tour of C++. Chapter 1. Core language features: built-in types, loops, etc. Standard-library components: containers, I/O operations. Every C++ program must have exactly one function named main(). Read << as "put to". Standard output stream: std::cout. std:: signifies that the name cout is declared in the standard-library namespace. using namespace std;: make names from std visible. Defining a function / variable: specifying how an operation is done. Declaring a function / variable: making the name…

• 16 Dec 2013 » The Ten Most Similar PLoS Biology Articles
• … at least by some measure. I recently downloaded 1,754 PLoS Biology articles as XML files through the PLoS API and have looked at the distribution of the time to publication of PLoS Biology and other PLoS journals. Here I will play a little with scikit-learn to see if I can discover those PLoS Biology articles (in my data set) that are most similar to one another. Import Packages: I started writing a Python package (PLoSPy) for more convenient parsing of the XML files I have download…
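One common way to make "most similar by some measure" concrete is cosine similarity between TF-IDF vectors. The excerpt does not say which measure the post uses, and it works with scikit-learn; the sketch below is a standard-library stand-in, with the smoothed IDF formula, the tokenizer, and the names `tf_idf_vectors`, `cosine`, and `most_similar_pairs` all being assumptions.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tf_idf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_pairs(docs, top=10):
    """Return (i, j, similarity) for the `top` most similar document pairs."""
    vectors = tf_idf_vectors(docs)
    scored = [
        (i, j, cosine(vectors[i], vectors[j]))
        for i, j in combinations(range(len(docs)), 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:top]
```

Comparing all pairs is quadratic in the number of documents, which is still manageable for a corpus of a couple of thousand articles.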

• 08 Dec 2013 » Python Built-In Functions
• In this post I will try and dig through some of the source code behind the built-in functions of Python 2.7. My hope is that by going through some of the source code of Python I will get to appreciate technical aspects of Python better. The source code of the built-in functions of Python 2.7 seems to be here. Going through a small subset of functions listed here it becomes apparent that tracking down the source code of built-in functions is not trivial. For now,…

• 05 Dec 2013 » Reaction-Diffusion System on a Growing Domain
• Here I will discuss the same reaction-diffusion system in one space dimension as I considered previously. While in my previous blog post I discussed integrating this system numerically on a one-dimensional domain of fixed length, I will here consider the case of a growing domain. Reaction-diffusion systems, specifically those that give rise to Turing-type patterns, have been discussed extensively by Edmund J Crampin in his PhD thesis and related peer-reviewed publications. In their work, Crampin et al. laid out a framework for integrating growth…

• 04 Dec 2013 » Crank-Nicolson with Variable Diffusivity
• We have implemented the Crank-Nicolson (CN) method for a two-variable reaction-diffusion system with constant grid parameters, $\Delta t$ and $\Delta x$, and system parameters (including diffusion coefficients $D_u$ and $D_v$). When we keep all of these parameters constant, we can define constants that define the non-zero entries of the two tridiagonal matrices defined by the CN stencil. To reuse our previous implementation of the CN method when diffusion coefficients $D_u$ and $D_v$ are functions of time $t$ (the properties of the material that the molecular species…

• 04 Dec 2013 » The Crank-Nicolson Method for Convection-Diffusion Systems
• Here we extend our discussion and implementation of the Crank-Nicolson (CN) method to convection-diffusion systems. To clarify nomenclature, there is a physically important difference between convection and advection. Since we are interested in the transport of a protein suspended in a (semi)liquid medium, we will use the term advection (as opposed to convection) in the following discussion. Our Advection-Diffusion Equation: We study the following advection-diffusion equation: with concentration $u$, diffusion coefficient $D$, advection velocity $a$ (velocity of the medium), reaction term $f$, and domain length $L$. Our Neumann boundary conditions…

• 03 Dec 2013 » The Crank-Nicolson Method
• The Crank-Nicolson method is a well-known finite difference method for the numerical integration of the heat equation and closely related partial differential equations. We often resort to a Crank-Nicolson (CN) scheme when we numerically integrate reaction-diffusion systems in one space dimension where $u$ is our concentration variable, $x$ is the space variable, $D$ is the diffusion coefficient of $u$, $f$ is the reaction term, and $L$ is the length of our one-dimensional space domain. Note that we use Neumann boundary conditions and specify that the solution $u$…
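A single Crank-Nicolson time step for a system of this kind can be sketched as below. The equation itself is not part of this excerpt, so the sketch assumes the standard form $u_t = D u_{xx} + f(u)$ on $[0, L]$ with zero-flux Neumann boundaries, treats the reaction term with an explicit Euler step, and uses $\sigma = D \Delta t / (2 \Delta x^2)$. The function names `thomas_solve` and `crank_nicolson_step` and the boundary discretisation (mirroring the boundary value) are illustrative assumptions, not necessarily the post's implementation.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system (lower[0] and upper[-1] are unused)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_step(u, D, dx, dt, f=lambda value: 0.0):
    """Advance the grid values u by one time step dt.

    Diffusion is treated with Crank-Nicolson, the reaction term f with
    explicit Euler; boundaries are zero-flux (Neumann).
    """
    n = len(u)
    sigma = D * dt / (2.0 * dx ** 2)
    lower = [-sigma] * n
    upper = [-sigma] * n
    diag = [1.0 + 2.0 * sigma] * n
    diag[0] = diag[-1] = 1.0 + sigma  # zero-flux boundary rows
    rhs = []
    for j in range(n):
        left = u[j - 1] if j > 0 else u[j]       # mirrored value at x = 0
        right = u[j + 1] if j < n - 1 else u[j]  # mirrored value at x = L
        rhs.append(
            sigma * left + (1.0 - 2.0 * sigma) * u[j] + sigma * right
            + dt * f(u[j])
        )
    return thomas_solve(lower, diag, upper, rhs)
```

With $f = 0$ and these boundary rows, each column of both tridiagonal matrices sums to one, so the total amount of $u$ on the grid is conserved from step to step, which is a convenient sanity check for the zero-flux boundaries.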

• 03 Dec 2013 » The Alternating Direction Implicit Method
• Just like the Crank-Nicolson (CN) method for reaction-diffusion systems with one space dimension, the Alternating Direction Implicit (ADI) method is commonly used for reaction-diffusion systems with two space dimensions. The ADI method has been described thoroughly many times, for instance by Dehghan. In the following discussion we will consider a reaction-diffusion system similar to the one we studied previously and we will use analogous Neumann boundary conditions. where $u(x,y,t)$ is our concentration variable, $D$ is the diffusion coefficient of $u$, $f$ is the reaction term, and…

• 02 Dec 2013 » Structuring PLoS API Data
• A while ago I started using data provided by the PLoS API. The PLoS API allows you to download entire articles as XML documents: you can download batches of articles complete with title, abstract, body (the actual article) and metadata such as date of publication. The only problem with XML is that you need to parse it to rediscover the structure of your document (e.g. a document contains one title, one abstract, etc.); this task essentially turns your XML document into human-readable format. I encounter continually…
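Recovering a document's structure from XML can be sketched with the standard library's ElementTree. The element names below (`article`, `front`, `title`, `abstract`, `pub-date`, `body`) and the sample document are hypothetical and much simpler than the XML the PLoS API actually returns; `parse_article` is an illustrative name, not part of any PLoS tooling.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified article; real PLoS XML uses a richer schema.
SAMPLE_XML = """<article>
  <front>
    <title>An Example Title</title>
    <abstract><p>First paragraph.</p><p>Second paragraph.</p></abstract>
    <pub-date>2013-12-02</pub-date>
  </front>
  <body><p>Body text.</p></body>
</article>"""


def parse_article(xml_string):
    """Return the article's title, abstract, body, and publication date as a dict."""
    root = ET.fromstring(xml_string)

    def text_of(path):
        node = root.find(path)
        if node is None:
            return None
        # Join the text of all nested elements and normalise whitespace.
        return " ".join(" ".join(node.itertext()).split())

    return {
        "title": text_of("front/title"),
        "abstract": text_of("front/abstract"),
        "published": text_of("front/pub-date"),
        "body": text_of("body"),
    }
```

Returning a plain dictionary keeps the parsed article easy to serialise or to load into other tools; fields that are missing from the XML simply come back as `None`.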

• How This Page Was Made: I am in the process of building my personal website that I intend to use mostly as a blog. I chose Jekyll in combination with GitHub Pages, which is a tried and tested solution for personal and scientific blogs by now. There are countless fantastic tutorials that cover setting up a blog with Jekyll and GitHub Pages, for instance Cecil Woebker’s post. Before building this website, I knew I would want my blog to have the following technical features: Render…

• 06 Oct 2013 » PLoS Time to Publication
• In an earlier IPython Notebook I took a look at the time to publication in PLoS ONE. Here I’ll take a comparative look at the time to publication in PLoS ONE, Biology, Computational Biology, and Genetics. Drop me a line for any feedback.

    import gzip
    import cPickle as pickle
    from datetime import date
    import numpy as np

    data_one = pickle.load(gzip.open('plos_one.gzip', 'rb'))
    data_biology = pickle.load(gzip.open('plos_biology.gzip', 'rb'))
    data_comp_biology = pickle.load(gzip.open('plos_comp_biology.gzip', 'rb'))
    data_genetics = pickle.load(gzip.open('plos_genetics.gzip', 'rb'))

  Overview of Data: The data we just loaded are dictionaries whose keys are…