Personal Weblog of Georg R Walther

  • 24 Oct 2014 » Extracting Text from PDF and HTML: An Online Data Pipeline
  • Extracting Text from PDF and HTML: An Online Data Pipeline In this article I describe a data pipeline that extracts text from online PDF and HTML sources and stores the extracted and compressed text as blobs in a SQL database for further processing. Motivation I consume online media through different apps on different devices (plain browsers, Feedly, Twitter, etc.) and e-mail myself links to interesting reads for future reference. Gmail addresses are great for this...
    Read more...
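
The storage step described in this excerpt (compressed text kept as blobs in a SQL database) might look roughly like the sketch below. It assumes SQLite and zlib from the Python standard library; the table name `documents`, the column names, and the helper functions are hypothetical and not taken from the article.

```python
import sqlite3
import zlib


def store_text(conn, url, text):
    """Compress the extracted text and store it as a blob keyed by its URL."""
    blob = zlib.compress(text.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO documents (url, body) VALUES (?, ?)",
        (url, blob),
    )
    conn.commit()


def load_text(conn, url):
    """Return the decompressed text for a URL, or None if it is not stored."""
    row = conn.execute(
        "SELECT body FROM documents WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0]).decode("utf-8")


# Hypothetical in-memory database and placeholder URL, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (url TEXT PRIMARY KEY, body BLOB)")
store_text(conn, "http://example.com/article", "Extracted text " * 100)
restored = load_text(conn, "http://example.com/article")
```

Keying rows by URL means that re-processing the same link overwrites the old blob instead of creating a duplicate.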

  • 10 Oct 2014 » Online Multi-Label Classification: Classifying Wikipedia Changes
  • Online Multi-Label Classification: Classifying Wikipedia Changes I recently joined a Kaggle competition on multi-label text classification and have learned a ton from basic code that one of the competitors shared in the forums. The code of the generous competitor does logistic regression classification for multiple classes with stochastic gradient ascent. It is further well-suited for online learning as it uses the hashing trick to one-hot encode boolean, string, and categorical features. To better understand these...
    Read more...
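
As a rough illustration of the idea in this excerpt, the sketch below trains a single logistic regression label online with stochastic gradient ascent on hashed features. It is not the competitor's code: the bucket count, the learning rate, the use of `zlib.crc32` as the hash function, and the toy data stream are all assumptions made for the example. For multiple labels one would keep one weight table per label.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 18  # assumed size of the hashed feature space


def hash_features(raw):
    """Hashing trick: map each 'name=value' pair to one active bucket index."""
    return [
        zlib.crc32(f"{name}={value}".encode("utf-8")) % NUM_BUCKETS
        for name, value in raw.items()
    ]


def predict(weights, indices):
    """Logistic function of the summed weights of the active buckets."""
    z = sum(weights.get(i, 0.0) for i in indices)
    z = max(min(z, 35.0), -35.0)  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, indices, label, learning_rate=0.1):
    """One gradient ascent step on the log-likelihood of a single example."""
    p = predict(weights, indices)
    for i in indices:
        # Each active feature has value 1, so the gradient is (label - p).
        weights[i] = weights.get(i, 0.0) + learning_rate * (label - p)
    return p


# Toy stream of (features, label) pairs, processed one example at a time.
weights = {}
stream = [
    ({"user": "bot", "minor": True}, 1),
    ({"user": "alice", "minor": False}, 0),
] * 50
for raw, label in stream:
    update(weights, hash_features(raw), label)

p_bot = predict(weights, hash_features({"user": "bot", "minor": True}))
p_alice = predict(weights, hash_features({"user": "alice", "minor": False}))
```

Because features are hashed straight into bucket indices, no vocabulary has to be built in advance, which is what makes the approach suitable for a stream of examples.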

  • 05 Aug 2014 » Science 2 Data Science Days 1 and 2
  • I moved this summary to a new location at http://georg.io/s2ds where I also keep a growing collection of additional summaries. S2DS Day 1 Science 2 Data Science (S2DS) is an industry-sponsored course that aims at leading graduates with a quantitative background into the industry of data analytics and data science. The S2DS program has at least two components: During the first six days of the program we hear about technologies and methodologies that are deemed...
    Read more...

  • 12 Jun 2014 » Python Shelve Thread Safety
  • Python Shelve Thread Safety As the documentation of shelve says, shelve.Shelf objects are not thread safe: "The shelve module does not support concurrent read/write access to shelved objects." (https://docs.python.org/2/library/shelve.html#restrictions) To write to a Shelf object in an environment where multiple threads may end up writing to it, use a global threading.Lock object:
        from threading import Lock
        import shelve
        mutex = Lock()
        mutex.acquire()
        db = shelve.open(db_name)
        # write to db
        db.close()
        mutex.release()
    Read more...
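
The same locking pattern can be written with context managers, so that the lock is released and the shelf is closed even if the write raises an exception. This is a minimal sketch that assumes Python 3.4 or later (where `shelve.open` can be used in a `with` statement); the helper name `write_entry` and the temporary file path are made up for the example.

```python
import os
import shelve
import tempfile
import threading

mutex = threading.Lock()  # one global lock shared by all writer threads


def write_entry(db_path, key, value):
    """Open the shelf, write one entry, and close it while holding the lock."""
    with mutex:
        with shelve.open(db_path) as db:
            db[key] = value


# Hypothetical shelf location in a fresh temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "example_shelf")

threads = [
    threading.Thread(target=write_entry, args=(db_path, f"key{i}", i))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Read everything back once all writers have finished.
with shelve.open(db_path) as db:
    stored = dict(db)
```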

  • 22 Mar 2014 » PLOS Biology-Inspired PLOS Biology Articles
  • PLOS Biology-Inspired PLOS Biology Articles This past week I had my first encounter with the concept of graph databases, which lend themselves perfectly to modeling and capturing linked data. I started reading the free and brilliant book Graph Databases by Robinson, Webber, and Eifrem and began playing around with Python bulbs by James Thornton. I further took the data set of 1,754 PLOS Biology articles that I have examined on this blog multiple times and...
    Read more...

  • 06 Mar 2014 » PLOS Biology Outlier Detection
  • PLOS Biology Outlier Detection In this blog post I play with the dimensionality reduction techniques SVD and Isomap to map a corpus of 1,754 PLOS Biology articles from 27,210-dimensional feature space to 2-dimensional space. This sort of approach is often used to estimate which data points are near each other. Here, I realized that the data points far away from the bulk discuss (almost) consistently neurobiological topics. As far as I am aware PLOS Biology publish...
    Read more...

  • 26 Feb 2014 » PLOS Biology Topics Part II
  • PLOS Biology Topics Part II I have received some great suggestions from both Radim and Asim that I wanted to follow up on in this brief blog post. The Effect of Changing passes Following on from the unlemmatized dictionary we built in the last blog post it is interesting to see how the quality of topics changes when altering the passes parameter of the LdaModel object. %matplotlib inline from matplotlib import pyplot import numpy import...
    Read more...

  • 19 Feb 2014 » PLOS Biology Topics
  • PLOS Biology Topics Ever wonder what topics are discussed in PLOS Biology articles? Here I will apply an implementation of Latent Dirichlet Allocation (LDA) on a set of 1,754 PLOS Biology articles to work out what a possible collection of underlying topics could be. I first read about LDA in Building Machine Learning Systems with Python co-authored by Luis Coelho. LDA seems to have been first described by Blei et al. and I will use...
    Read more...

  • 19 Jan 2014 » Citation Counts in PLoS Biology Articles
  • Citation Counts in PLoS Biology Articles As in a number of earlier blog posts, I will play around with just over 1,700 PLoS Biology articles that I downloaded last year. In this blog post I will attempt to tackle a question that I have been curious about for a long time now: How often are references cited in the main text of scientific articles? That is, what is the citation count of each reference listed...
    Read more...

  • 17 Jan 2014 » Most Frequent Author Actions in PLoS Biology Articles
  • Update: Web Application in Development I started developing a web application based on the simple steps outlined in this blog post. I am entirely new to developing and deploying web applications so any feedback and criticism is greatly appreciated! Check out the app: TLDRMed. Most Frequent Author Actions in PLoS Biology Articles In a previous blog post I attempted to summarize PLoS Biology articles by extracting sentences that contained the claimed author action “we...
    Read more...

  • 11 Jan 2014 » PLoS Biology Author Actions
  • Update: Web Application in Development I started developing a web application based on the simple steps outlined in this blog post. I am entirely new to developing and deploying web applications so any feedback and criticism is greatly appreciated! Check out the app: TLDRMed. Author Actions in PLoS Biology Articles I have downloaded a set of just over 1,700 PLoS Biology articles and have played a little with some Python packages to work out...
    Read more...

  • 09 Jan 2014 » Combining Crank-Nicolson and Runge-Kutta to Solve a Reaction-Diffusion System
  • Combining Crank-Nicolson and Runge-Kutta to Solve a Reaction-Diffusion System We have already derived the Crank-Nicolson method to integrate the following reaction-diffusion system numerically: $$\frac{\partial u}{\partial x}\Bigg|_{x = 0, L} = 0.$$ Please refer to the earlier blog post for details. In our previous derivation, we constructed the following stencil that we would go on to rearrange into a system of linear equations that we needed to solve every time step: where $j$ and...
    Read more...

  • 08 Jan 2014 » Reaction Diffusion System on an Accelerating Domain
  • Reaction Diffusion System on an Accelerating Domain In a previous blog post we derived equations that describe the time behaviour of a reaction-diffusion system on a growing space domain. In that work we assumed that the velocity of domain growth is an increasing function of one of the two unknown variables (here, protein concentrations) described by our reaction diffusion system. Let us add another layer to this problem and assume that not the growth velocity...
    Read more...

  • 05 Jan 2014 » PLoS Biology Bigrams
  • PLoS Biology Bigrams Here I will use the Natural Language Toolkit and a recipe from Python Text Processing with NLTK 2.0 Cookbook to work out the most frequent bigrams in the PLoS Biology articles that I downloaded last year and have described in previous posts here and here. The amusing twist in this blog post is that the most frequent bigram, after filtering out stopwords, is unpublished data. As before I will use a small...
    Read more...
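
The bigram count described in this excerpt can be approximated with the standard library alone. The sketch below is not the NLTK cookbook recipe the post uses: the stopword list is a tiny illustrative subset, the sample text is invented, and stopwords are removed before words are paired, which is a simplification (words separated only by stopwords end up adjacent).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "we", "that",
             "are", "not", "this"}


def top_bigrams(text, n=3):
    """Return the n most frequent pairs of adjacent non-stopword words."""
    words = [
        word for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOPWORDS
    ]
    return Counter(zip(words, words[1:])).most_common(n)


# Invented sample text, for illustration only.
sample = (
    "These results (unpublished data) agree with unpublished data "
    "from our lab. The unpublished data are not shown."
)
best = top_bigrams(sample, 1)
```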

  • 03 Jan 2014 » Good Reads January 2014
  • Good Reads January 2014 Stroustrup: A Tour of C++ Chapter 1 Core language features: built-in types, loops, etc. Standard-library components: containers, I/O operations Every C++ program must have exactly one function named main() read << as put to standard output stream std::cout std:: signifies that the name cout is declared in the standard-library namespace using namespace std;: make names from std visible defining a function / variable: specifying how an operation is done declaring a...
    Read more...

  • 16 Dec 2013 » The Ten Most Similar PLoS Biology Articles
  • The Ten Most Similar PLoS Biology Articles … at least by some measure. I recently downloaded 1,754 PLoS Biology articles as XML files through the PLoS API and have looked at the distribution of the time to publication of PLoS Biology and other PLoS journals. Here I will play a little with scikit-learn to see if I can discover those PLoS Biology articles (in my data set) that are most similar to one another. Import...
    Read more...

  • 08 Dec 2013 » Python Built-In Functions
  • This post is work in progress. Last updated 2013-12-08 00:00:00 +0000. Python Built-In Functions In this post I will try and dig through some of the source code behind the built-in functions of Python 2.7. My hope is that by going through some of the source code of Python I will get to appreciate technical aspects of Python better. The source code of the built-in functions of Python 2.7 seems to be here. Going through...
    Read more...

  • 05 Dec 2013 » Reaction-Diffusion System on a Growing Domain
  • Reaction-Diffusion Systems on Growing Domains Here I will discuss the same reaction-diffusion system in one space dimension as I considered previously. While in my previous blog post I discussed integrating this system numerically on a one-dimensional domain of fixed length, I will here consider the case of a growing domain. Reaction-diffusion systems, specifically those that give rise to Turing-type patterns, have been discussed extensively by Edmund J Crampin in his PhD thesis and related peer-reviewed...
    Read more...

  • 04 Dec 2013 » Crank-Nicolson with Variable Diffusivity
  • Crank-Nicolson with Variable Diffusivity We have implemented the Crank-Nicolson (CN) method for a two-variable reaction-diffusion system with constant grid parameters, $\Delta t$ and $\Delta x$, and system parameters (including diffusion coefficients $D_u$ and $D_v$). When we keep all of these parameters constant, we can define constants that define the non-zero entries of the two tridiagonal matrices defined by the CN stencil. To reuse our previous implementation of the CN method when diffusion coefficients $D_u$ and...
    Read more...

  • 04 Dec 2013 » The Crank-Nicolson Method for Convection-Diffusion Systems
  • The Crank-Nicolson Method for Convection-Diffusion Systems Here we extend our discussion and implementation of the Crank-Nicolson (CN) method to convection-diffusion systems. To clarify nomenclature, there is a physically important difference between convection and advection. Since we are interested in the transport of a protein suspended in a (semi)liquid medium, we will use the term advection (as opposed to convection) in the following discussion. Our Advection-Diffusion Equation We study the following advection-diffusion equation: $$\frac{\partial u}{\partial...
    Read more...

  • 03 Dec 2013 » The Crank-Nicolson Method
  • The Crank-Nicolson Method The Crank-Nicolson method is a well-known finite difference method for the numerical integration of the heat equation and closely related partial differential equations. We often resort to a Crank-Nicolson (CN) scheme when we numerically integrate reaction-diffusion systems in one space dimension $$\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2} + f(u),$$ $$\frac{\partial u}{\partial x}\Bigg|_{x = 0, L} = 0,$$ where $u$ is our concentration variable, $x$ is the space variable, $D$ is the diffusion coefficient of $u$, $f$ is the...
    Read more...
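
To make the scheme concrete, here is a minimal pure-Python sketch of one Crank-Nicolson time step for the diffusion part only (the reaction term is set to zero), with the no-flux boundary conditions at $x = 0$ and $x = L$. The grid values, the function names, and the choice of the Thomas algorithm for the tridiagonal solve are assumptions made for this example and are not taken from the post. The constant is $\sigma = D \Delta t / (2 \Delta x^2)$.

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] must be zero."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def cn_step(u, sigma):
    """Advance u by one time step of pure diffusion with no-flux boundaries."""
    n = len(u)
    # Implicit (left-hand side) tridiagonal matrix.
    lower = [0.0] + [-sigma] * (n - 1)
    upper = [-sigma] * (n - 1) + [0.0]
    diag = [1 + sigma] + [1 + 2 * sigma] * (n - 2) + [1 + sigma]
    # Explicit (right-hand side) vector built from the current state.
    rhs = [(1 - sigma) * u[0] + sigma * u[1]]
    for j in range(1, n - 1):
        rhs.append(sigma * u[j - 1] + (1 - 2 * sigma) * u[j] + sigma * u[j + 1])
    rhs.append(sigma * u[-2] + (1 - sigma) * u[-1])
    return thomas(lower, diag, upper, rhs)


# Assumed grid and parameters for illustration.
D, dx, dt = 1.0, 0.1, 0.01
sigma = D * dt / (2 * dx ** 2)

u = [0.0] * 20
u[10] = 1.0  # initial spike of concentration in the middle of the domain
for _ in range(200):
    u = cn_step(u, sigma)
```

With no-flux boundaries nothing leaves the domain, so the sum of `u` should stay constant while the spike flattens out, which is a convenient sanity check for an implementation like this.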

  • 03 Dec 2013 » The Alternating Direction Implicit Method
  • This post is work in progress The Alternating Direction Implicit Method Just as the Crank-Nicolson (CN) method is commonly used for reaction-diffusion systems with one space dimension, the Alternating Direction Implicit (ADI) method is commonly used for reaction-diffusion systems with two space dimensions. The ADI method has been described thoroughly many times, for instance by Dehghan. In the following discussion we will consider a reaction-diffusion system similar to the one we studied previously and we will use analogous...
    Read more...

  • 02 Dec 2013 » Structuring PLoS API Data
  • Structuring PLoS API Data A while ago I started using data provided by the PLoS API. The PLoS API allows you to download entire articles as XML documents: you can download batches of articles complete with title, abstract, body (the actual article) and metadata such as date of publication. The only problem with XML is that you need to parse it to rediscover the structure of your document (e.g. a document contains one title, one...
    Read more...
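
Recovering the structure of an article from XML, as this excerpt describes, could be sketched with the standard library's `xml.etree.ElementTree`. The XML below is an invented toy document: its tag names are hypothetical and do not reflect the actual schema of the articles returned by the PLoS API.

```python
import xml.etree.ElementTree as ET

# Invented toy document; tag names are assumptions, not the PLoS schema.
SAMPLE_XML = """<article>
  <title>An Example Title</title>
  <abstract>A short abstract.</abstract>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
  <pub-date>2013-12-02</pub-date>
</article>"""


def parse_article(xml_string):
    """Turn one article's XML into a plain dictionary of its parts."""
    root = ET.fromstring(xml_string)
    return {
        "title": root.findtext("title"),
        "abstract": root.findtext("abstract"),
        "paragraphs": [p.text for p in root.findall("body/p")],
        "published": root.findtext("pub-date"),
    }


article = parse_article(SAMPLE_XML)
```

Once each article is a dictionary like this, the parts can be stored or analysed without touching the XML again.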

  • 26 Nov 2013 » How This Page Was Made
  • This post is work in progress (WIP) How This Page Was Made I am in the process of building my personal website that I intend to use mostly as a blog. I chose Jekyll in combination with GitHub Pages, which is a tried and tested solution for personal and scientific blogs by now. There are countless fantastic tutorials that cover setting up a blog with Jekyll and GitHub Pages, for instance Cecil Woebker’s post. Before...
    Read more...

  • 06 Oct 2013 » PLoS Time to Publication
  • This blog post is based on an IPython Notebook that I sketched out on GitHub some time ago. PLoS Time to Publication In an earlier IPython Notebook I took a look at the time to publication in PLoS ONE. Here I’ll take a comparative look at the time to publication in PLoS ONE, Biology, Computational Biology, and Genetics. Drop me a line for any feedback. import gzip import cPickle as pickle from datetime import date...
    Read more...

  • 21 Sep 2013 » PLoS ONE Citation Counts Recorded in CrossRef
  • This post is work in progress This blog post is based on an IPython Notebook that I sketched out on GitHub some time ago. You can find accompanying data there. This post is work in progress since conversion from my original IPython Notebook to Markdown (for this post) produced some glitches that I need to amend. PLoS ONE Citation Counts The data set used in this notebook has been described here: http://nbviewer.ipython.org/6211587 Here, we’ll use...
    Read more...

  • 12 Aug 2013 » PLoS ONE Time to Publication
  • This blog post is based on an IPython Notebook that I sketched out on GitHub some time ago. You can find accompanying data there. PLoS ONE Time to Publication I recently picked up working with PLOS’s API and downloaded over 63,000 articles from PLOS ONE, which required about 6 hours, pagination of my requests to the API, and around 2.5 GB of hard disk space. Here is some analysis I have been doing with this...
    Read more...
