Crawling with Python, Selenium and Docker

TL;DR: Using Selenium inside a Docker container to crawl websites that need JavaScript or user interaction, plus a cluster of those using Docker Swarm.

While simple HTTP requests are good enough 90% of the time to get the data you want from a website, I am always looking for better ways to optimize my crawlers, especially for websites that require JavaScript and user interaction: a login or a click in the right place sometimes gives you the access you need. I am looking at you, government websites!

Recently I have seen more solutions to some of these problems in Python, such as Splash from ScrapingHub, which is basically a QT browser with a scriptable API. I haven't tried it and it definitely looks like a viable option, but if I am going to render a webpage I want to do it in a "real" (Chrome) browser.

An easy way to use Chrome (or Firefox, or any other popular browser) with a scriptable, multi-language API is Selenium.
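As a minimal sketch of the idea (assuming the selenium/standalone-chrome Docker image and a Selenium 4 style Python client), the container exposes a WebDriver endpoint that the Python client drives remotely:

```python
# Start the browser container first (image name is an assumption):
#   docker run -d -p 4444:4444 selenium/standalone-chrome

from selenium import webdriver

options = webdriver.ChromeOptions()

# Connect to the WebDriver endpoint exposed by the container
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options,
)

driver.get("https://example.com")
print(driver.title)

# From here you can interact like a real user: fill a login form,
# click buttons, wait for JavaScript to render, etc.

driver.quit()
```

The nice part is that the crawler code stays the same whether the container runs on your laptop or on a swarm of machines; only the URL of the remote endpoint changes.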

ReproduceIt: Reddit word count

ReproduceIt is a series of articles that reproduce the results from data articles, focusing on having open data and open code. In this short article I reproduce the results (not the visualization this time) from the reddit user /u/fhoffa, who posted a nice word cloud visualization of the most common words in some famous subreddits.

The reddit post can be found in the data is beautiful subreddit: Reddit most common words for /r/politics, /r/movies, /r/trees, /r/science; the original word cloud was this: Original Wordcloud.

For some context, he mentioned that he used Google BigQuery and Tableau. The data used was the most recent month available (May 2015) from the recent reddit dump that user /u/Stuck_In_the_Matrix made available as a nice torrent.

To reproduce the results I am using dask, a nice new project from Continuum Analytics that got a lot of attention at the most recent SciPy. A little disclaimer here: I currently work for Continuum, but this post is not sponsored in any way.
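To give a feel for the kind of word count dask.bag makes easy, here is a rough sketch; the file path and the 'body'/'subreddit' field names are assumptions about how the comment dump is laid out (one JSON object per line):

```python
import json
import dask.bag as db

# Assumption: the May 2015 comment dump, one JSON object per line,
# split across several files.
comments = db.read_text("RC_2015-05/*.json").map(json.loads)

top_words = (
    comments
    .filter(lambda c: c["subreddit"] == "politics")  # pick one subreddit
    .pluck("body")                                   # keep only the comment text
    .map(lambda text: text.lower().split())          # naive tokenization
    .flatten()
    .frequencies()                                   # (word, count) pairs
    .topk(50, key=lambda pair: pair[1])              # 50 most common words
)

print(top_words.compute())
```

Because dask works on the files in chunks and in parallel, the whole month of comments never has to fit in memory at once.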

ReproduceIt: FiveThirtyEight - How Baltimore’s Young Black Men Are Boxed In

ReproduceIt is a series of articles that reproduce the results from data analysis articles, focusing on having open data and open code. All the code and data are available on github: reproduceit-538-baltimore-black-income. This post contains a more verbose version of the content that will probably get outdated, while the github version can be updated, including fixes.

For this second article I (again) took an article from one of my favorite websites, FiveThirtyEight. In this case I took an article by Ben Casselman, "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, in this case in Baltimore, Maryland.

The article analyses the income gap between white and black people in different cities all around the US. The data source is the "American Community Survey" and it is available in the "American Fact Finder".

With this app it is not possible to crawl the data like in the previous ReproduceIt article.

ReproduceIt: FiveThirtyEight - The Three Types Of Adam Sandler Movies

ReproduceIt is a series of articles that reproduce the results from data analysis articles, focusing on having open data and open code. All the code and data are available on github: reproduceit-538-adam-sandler-movies. This post contains a more verbose version of the content that will probably get outdated, while the github version can be updated, including fixes.

I am a fan of FiveThirtyEight and how they base most of their articles on data analysis; I am also a fan of how they open source a lot of their code and data on github. The ReproduceIt series of articles is highly inspired by them.

In this first article of ReproduceIt I am going to try to reproduce the analysis Walt Hickey did for the article "The Three Types Of Adam Sandler Movies". This particular article is a simple data analysis of Adam Sandler movies, and they didn't provide any code or data for it, so I think it is a nice opportunity to start this series of posts.


I have been wanting to restart my blog for a while now. In 2013 I wrote around 20 posts, but in 2014 I only wrote 3 times, and this is my first post in 2015, and it's not even a real one. The last one was more than 6 months ago. Time is one of the reasons, but sometimes I have a couple of hours to kill that I would like to use to blog, and the ideas just don't come to me easily.

Last weekend I attended PyData Dallas 2015, and in some of the talks about data journalism and open data I got a simple but, I believe, effective idea for getting me to write new posts.

The idea is to take articles from the internet that do some kind of data analysis and reproduce the results. In most cases the data and analysis are not published but ...

From zero to storm cluster for scikit-learn classification

Apache Storm is a new technology that allows you to do real-time computation. It has been in the big data news lately, and I was curious to try it to see if it is really good or just the new map-reduce.

One of the first (and no-brainer) ideas I had was to do real-time classification with a scikit-learn model. The main issue was that Storm is Java and I didn't want to do all the integration between Java and Python, but after I saw in the PyData videos that the people at Parsely had already taken some of that pain away with their new streamparse library, I had no more excuses to try it.
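To give an idea of what that integration looks like, here is a minimal sketch of a streamparse bolt that scores incoming tuples with a pre-trained scikit-learn model; the pickle path and the tuple layout are assumptions, not part of the post's actual topology:

```python
import pickle

from streamparse import Bolt


class ClassifierBolt(Bolt):
    """Scores each incoming tuple with a pre-trained scikit-learn model."""

    def initialize(self, conf, ctx):
        # Assumption: the model was trained offline and pickled to this path.
        with open("model.pkl", "rb") as f:
            self.model = pickle.load(f)

    def process(self, tup):
        # Assumption: the tuple carries a list of numeric features.
        features = tup.values[0]
        label = self.model.predict([features])[0]
        # Emit as a plain string so it serializes cleanly back to Storm.
        self.emit([str(label)])
```

streamparse takes care of the multi-lang protocol between the Java topology and this Python process, which is exactly the plumbing I didn't want to write myself.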

Storm cluster

I decided to deploy a Storm cluster and, after failing to use their EC2 scripts, I decided to do it myself using salt. I found an amazing step-by-step tutorial from Michael Noll ...

IPython.parallel cluster using salt

This post describes a simple yet flexible implementation of how to deploy an IPython.parallel cluster on multiple EC2 instances using salt and a little bit of Vagrant. The final output will be one instance running the IPython notebook and ipcontroller and acting as the salt-master, plus 6 instances, each running ipengine and acting as salt-minions; see the IPython.parallel docs for information on those commands.

In a previous post I created a one-liner for deploying an IPython notebook to the cloud. After that I kept refactoring and advancing the concept until it became my own datasciencebox, so it was natural to include the code for creating the ipcluster in the same project.
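Once the cluster is up, using it from the notebook is only a few lines. This is a rough sketch assuming the controller's connection file is already available to the client (which is what the salt setup takes care of):

```python
from IPython.parallel import Client

# Connect to the ipcontroller; with the notebook and controller on the
# same instance the default connection file is enough.
rc = Client()
print(len(rc.ids))        # should report the 6 engines

view = rc[:]              # a direct view over all engines

# Run something embarrassingly parallel across the engines
results = view.map_sync(lambda x: x ** 2, range(100))
print(results[:10])
```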

Impala to python: ODBC and thrift

For good or bad, HDFS is where the data is placed today, and we all know this data is hard and slow to query and analyze. The good news is that the people at Cloudera created Impala, whose basic idea is: fast SQL on Hadoop (HDFS or HBase) using the Hive metastore.

For good or bad, Hadoop is not where Data Science happens; it usually happens in R or Python. The good news is that the folks at Cloudera made sure it is very easy to extract data out of Impala using standard technologies such as ODBC or thrift.

In this post I try some tools to extract data from Impala into Python (pandas) to do in-memory analysis.

The setup is quite simple since I am running Cloudera (CDH 4.6) on my own computer using their virtual machine; just be sure to forward port 21050, or, if you are in the cloud (EC2), be sure to open that port. The data is just a table of 100,000 rows and 3 columns in AVRO format, based on the getting started with AVRO example.
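As one example of the thrift route, a DB-API client like impyla (the table name here is an assumption; the post compares a few tools) lets you pull a query result straight into a pandas DataFrame:

```python
import pandas as pd
from impala.dbapi import connect

# Assumption: the Cloudera VM is reachable on localhost with port 21050 forwarded.
conn = connect(host="localhost", port=21050)
cursor = conn.cursor()

cursor.execute("SELECT * FROM sample_table LIMIT 100000")

# Build a DataFrame from the result set for in-memory analysis
columns = [desc[0] for desc in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)
print(df.head())
```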

One-liner: Deploy python scipy stack with IPython notebook on AWS

Problem: How many times have you needed to create a powerful EC2 instance with the Python scientific stack installed and the IPython notebook running? I have to do this at least 2 or 3 times every week.

A simple solution is to have an AMI with all the libraries ready and create a new instance every time; you can even have the IPython notebook as an upstart service in Ubuntu so it runs when the instance is ready. That was my previous solution, but that was before I learned salt.

The problem with the AMI solution is that it gets dated really quickly, and while updating it is not hard, it is annoying. Also, having to log into AWS, look for the AMI, and spin up an instance is annoying.

Solution: Salt + Anaconda + Vagrant

Salt will do the provisioning of the instance using states, which makes the updates of ...

Review: Harvard Data Science - Fall 2013

A few weeks/months ago I found on HackerNews that Harvard was publishing all their Data Science course content (lectures, videos, labs, homeworks) online. I couldn't miss the opportunity to find out what they are teaching; it was a nice experience to see how much I have learned and what else I am missing. Also, when I saw that everything was in Python, I knew they were doing it right!

The homeworks are IPython notebooks, and the books are ones I consider every Data Scientist should read: Python for Data Analysis, Machine Learning for Hackers, and Probabilistic Programming and Bayesian Methods for Hackers.

The homeworks were amazing and useful; I haven't seen all the videos (yet), but the ones I did watch were great. Below I do a little review of each homework, the stuff I learned, and the stuff I think should be different.

Initially I wanted to use this opportunity ...