ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code.
In this short article I reproduce the results (not the visualization this time) from reddit user /u/fhoffa, who posted a nice word cloud visualization of the most common words in some famous subreddits.
The reddit post can be found in the data is beautiful subreddit: Reddit most common words for /r/politics, /r/movies, /r/trees, /r/science
and the original word cloud was this:
For some context, he mentioned that he used Google BigQuery and Tableau. The data was the most recent month available (May 2015) from a recent reddit dump that user /u/Stuck_In_the_Matrix made available as a nice torrent.
To reproduce the results I am using dask, a nice new project from Continuum Analytics that got a lot of attention at the most recent SciPy conference. A little disclaimer here: I currently work for Continuum, but this post is not sponsored in any way.