Jupyter Notebook Cloudera CSD

Running a Jupyter Notebook is as simple as executing jupyter notebook, assuming you have the libraries installed: pip install jupyter.

Doing this in a different environment (a Hadoop cluster) is basically the same. It requires a bit more sysadmin experience (not much if you use the right tools), but the problems really start when you don't have admin access to the cluster nodes.

Some of the new ways of running a Jupyter Notebook use the Anaconda Parcel, which is basically the Anaconda Distribution packaged in a different format so Cloudera Manager can install it.

The "problem" with the parcel is that it only includes the libraries; it doesn't manage services (start, stop, restart). So the parcel is great if you already have a notebook server (maybe a multi-user Jupyter Hub server) that has access to the cluster and all you are missing are the libraries for your Spark job.

Note …

Jupyter Hub on Kubernetes Part II: NFS

This is the second part of my JupyterHub deployment in Kubernetes experiment; be sure to read Part I.

Last time we got JupyterHub authenticating against LDAP and creating the single-user notebooks in Kubernetes containers. As I mentioned in that post, one big problem with that deployment was that the notebook files are gone when the pod is deleted, so this time I add an NFS volume to the JupyterHub single-user containers to persist the notebook data.

I also improved the deployment and code a little bit so it is no longer needed to build a custom image: you can just pull two images from my Docker Hub registry and configure them using a Kubernetes ConfigMap.

All the code in this post is at danielfrg/jupyterhub-kubernetes_spawner. Specifically in the examples directory.


There are multiple options to have persistent data in Kubernetes containers. I chose NFS because it's one of the few types …
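For context, an NFS-backed volume in Kubernetes is usually declared as a PersistentVolume plus a claim; the sketch below shows the general shape only, with a hypothetical server address and export path (the real manifests are in the repo's examples directory):

```yaml
# Sketch: PersistentVolume backed by an existing NFS export (placeholder values)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jupyterhub-nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany          # NFS allows many pods to mount read-write
  nfs:
    server: 10.0.0.10        # hypothetical NFS server address
    path: /exports/notebooks # hypothetical export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-nfs-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```

Single-user pods would then reference the claim in their volume spec so notebook files survive pod deletion.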

Jupyter Hub on Kubernetes with LDAP

In this post I am going to show some initial work I did over the last day to deploy Jupyter Hub in Kubernetes with user auth based on LDAP.

It wasn't that much work considering that Jupyter Hub already had support for LDAP user auth, and the modularity it has is amazing. It was quite straightforward to write a new spawner based on the existing ones in the Jupyter Hub GitHub org.

All the code in this post is at danielfrg/jupyterhub-kubernetes_spawner. Specifically in the examples directory.
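As a rough sketch, the wiring between the two pieces lives in jupyterhub_config.py. The LDAP authenticator class is the standard one from the ldapauthenticator package; the spawner class path is an assumption based on the repo name, so check the examples directory for the real values:

```python
# jupyterhub_config.py (fragment) -- spawner class path and host are assumptions
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.LDAPAuthenticator.server_address = 'ldap.example.org'              # hypothetical LDAP host
c.JupyterHub.spawner_class = 'kubernetes_spawner.KubernetesSpawner'  # assumed class path
```

This is a config fragment, not a runnable script; JupyterHub provides the `c` object when it loads the file.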

I consider this a nice example of a more production-ready Jupyter Hub deployment: it is based on LDAP, which for good or bad is everywhere, and it uses Kubernetes to deploy the single-user notebook servers.

Something I don't talk about is SSL certificates, because they are very well supported in Jupyter Hub and I think there is enough information about them in the Jupyter …

Talk @ Spark Summit 2016: Connecting python to the Spark Ecosystem

I was lucky enough to go to San Francisco in June to give a talk at Spark Summit 2016. It was my first time in San Francisco and Silicon Valley; I have to say it is a unique place.

My talk at Spark Summit covered a variety of topics around Spark, Python, Python libraries, deployment, use cases, and a little bit about alternatives and the future.

Below you can find the video of the presentation and slides.

PS: There is one good joke/story at 9:24

ReproduceIt: Name Trends

ReproduceIt is a series of articles that reproduce the results from data analysis articles focusing on having open data and open code.

Today, as a small return for the ReproduceIt series, I try to reproduce a simple but nice data analysis and webapp that braid.io did, called "Most Beyonces are 14 years old and most Kanyes are about 11".

The article analyses the trend of the names of some music artists (Beyonce, Kanye and Madonna) in the US; it also has some nice possible explanations for the ups and downs over time, and it's a quick read. The data comes from the Social Security Administration and can be downloaded from the SSA website: Beyond the Top 1000 Names

The data is very small, so loading it into pandas and plotting it with bokeh was very easy.
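The core of that analysis is simple enough to sketch without the plotting: the SSA yearly files (yobYYYY.txt) are plain CSV lines of name,sex,count, so tracking one name over time is a small loop. The function below is a hypothetical sketch of that step, shown with made-up numbers rather than real SSA data:

```python
import csv
import io

def name_trend(files, name):
    """Sum yearly birth counts for a given name across SSA-style files.

    `files` maps a year to a file-like object with name,sex,count lines
    (the format of the SSA's yobYYYY.txt files).
    """
    trend = {}
    for year, fobj in files.items():
        total = 0
        for row in csv.reader(fobj):
            if row and row[0] == name:
                total += int(row[2])
        trend[year] = total
    return trend

# Tiny synthetic example (made-up numbers, not real SSA data)
files = {
    1998: io.StringIO("Emily,F,10\nBeyonce,F,6\n"),
    1999: io.StringIO("Beyonce,F,21\nEmily,F,12\n"),
}
print(name_trend(files, "Beyonce"))  # {1998: 6, 1999: 21}
```

With pandas the same thing is a groupby over the concatenated files; the shape of the result is what gets plotted.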

Talk @ PyData NYC 2015: Querying 1.6 billion reddit comments with python

I had the luck to go to beautiful NYC in the fall to give a talk at PyData NYC 2015.

The talk was about how to query around 1.6 billion reddit comments with python tools while leveraging some big data tools like Impala and Hive.

Some of the content can be found in the Continuum developer blog.

Below you can find the video of the presentation and slides.

PS: There are a couple of good jokes at 35:05 - if you like bad jokes.

Multicorn in Docker + conda for Postgres Foreign Data Wrappers in Python

Multicorn is (in my opinion) one of those hidden gems in the Python community. It is basically a wrapper for Postgres Foreign Data Wrappers, and it makes it really easy to develop one in Python. That means it allows you to use what is probably the most common and widely used database right now, Postgres, as a frontend for SQL queries while using different backends for data storage and even computation.
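To make that concrete: a Multicorn foreign data wrapper is just a Python class with an execute method that yields rows. The sketch below follows Multicorn's documented API but uses a stub base class so it can run outside Postgres, and the data it serves is made up:

```python
# Minimal sketch of a Multicorn-style foreign data wrapper.
# Inside Postgres the real base class comes from the multicorn package;
# the stub here only exists so the sketch runs standalone.
try:
    from multicorn import ForeignDataWrapper
except ImportError:
    class ForeignDataWrapper:
        def __init__(self, options, columns):
            self.options = options
            self.columns = columns

class InMemoryWrapper(ForeignDataWrapper):
    """Serves a hardcoded list of rows as a Postgres foreign table."""

    DATA = [
        {"id": 1, "name": "ada"},
        {"id": 2, "name": "grace"},
    ]

    def execute(self, quals, columns):
        # Multicorn calls execute() for every query on the foreign table;
        # yielding dicts of column -> value is enough for a SELECT.
        for row in self.DATA:
            yield row

fdw = InMemoryWrapper({}, ["id", "name"])
rows = list(fdw.execute([], ["id", "name"]))
print(rows)
```

In a real deployment the class would be registered via CREATE EXTENSION multicorn and CREATE FOREIGN TABLE, and execute would talk to whatever backend you want Postgres to front.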

Unfortunately it's not really well known, and therefore not widely used. The only real example I have been impressed by is a talk by Ville Tuulos: How to Build a SQL-based Data Warehouse for 100+ Billion Rows in Python, where he talks about how AdRoll "built a custom, high-performance data warehouse in Python which can handle hundreds of billions of data points with sub-minute latency on a small cluster of servers".

That talk is only a year old but to …

Crawling with Python, Selenium and Docker

TL;DR: Using Selenium inside a Docker container to crawl websites that need JavaScript or user interaction, plus a cluster of those using Docker Swarm.

While simple HTTP requests are good enough 90% of the time to get the data you want from a website, I am always looking for better ways to optimize my crawlers, especially for websites that require JavaScript and user interaction: a login or a click in the right place sometimes gives you the access you need. I am looking at you, government websites!

Recently I have seen more solutions to some of these problems in Python, such as Splash from ScrapingHub, which is basically a QT browser with a scriptable API. I haven't tried it and it definitely looks like a viable option, but if I am going to render a webpage I want to do it in a "real" (Chrome) browser.

An easy way to use Chrome (or Firefox or any other popular browser) with a scriptable and multi-language API is Selenium.

ReproduceIt: Reddit word count

ReproduceIt is a series of articles that reproduce the results from data analysis articles focusing on having open data and open code.

In this small article I reproduce the results (not the visualization this time) from the reddit user /u/fhoffa, who posted a nice word cloud visualization of the most common words in some famous subreddits.

The reddit post can be found in the data is beautiful subreddit: Reddit most common words for /r/politics, /r/movies, /r/trees, /r/science and the original word cloud was this: Original Wordcloud

For some context, he mentioned that he used Google BigQuery and Tableau. The data used was the most recent month available (May 2015) from a recent reddit dump that user /u/Stuck_In_the_Matrix made available in a nice torrent.

To reproduce the results I am using dask, which is a nice new project from Continuum Analytics that got a lot of attention at the most recent SciPy. A little disclaimer here: I currently work for Continuum, but this post is not sponsored in any way.
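The core of such a word count, which dask just parallelizes across chunks of the dump, can be sketched in plain Python. The stopword list below is a made-up minimal one, and the comments are synthetic examples rather than real reddit data:

```python
from collections import Counter

# Hypothetical minimal stopword list; a real analysis uses a much longer one.
STOPWORDS = {"the", "a", "and", "to", "of", "is", "in", "it", "that", "was"}

def count_words(comments):
    """Count non-stopword tokens across an iterable of comment strings."""
    counts = Counter()
    for comment in comments:
        for token in comment.lower().split():
            token = token.strip(".,!?\"'()")  # crude punctuation stripping
            if token and token not in STOPWORDS:
                counts[token] += 1
    return counts

# Synthetic comments standing in for a subreddit's dump
comments = [
    "The movie was great, great soundtrack too",
    "Great analysis of the movie",
]
print(count_words(comments).most_common(2))  # [('great', 3), ('movie', 2)]
```

With dask the same map/reduce shape runs per partition of the dump and the per-partition Counters get merged at the end.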

ReproduceIt: FiveThirtyEight - How Baltimore’s Young Black Men Are Boxed In

ReproduceIt is a series of articles that reproduce the results from data analysis articles focusing on having open data and open code. All the code and data is available on github: reproduceit-538-baltimore-black-income. This post contains a more verbose version of the content that will probably get outdated while the github version could be updated including fixes.

For this second article I (again) took an article from one of my favorite websites, FiveThirtyEight. In this case I took an article by Ben Casselman, "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, especially in this case in Baltimore, Maryland.

The article analyses the income gap between white and black people in different cities all around the US. The data source is the "American Community Survey" and it is available in the "American Fact Finder".

With this app it is not possible to crawl the data like in the previous ReproduceIt article …