This post describes a simple yet flexible way to deploy an IPython.parallel cluster across multiple EC2 instances using Salt and a little bit of Vagrant. The final output is one instance running the IPython notebook
and ipcontroller while acting as the salt-master, plus six instances each running
ipengine as salt-minions; see the IPython.parallel docs for information on those commands.
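As a sketch of how the roles could be wired together, a Salt top file might assign states by minion id. The ids and state names below are assumptions for illustration, not necessarily the ones used in the actual project:

```yaml
# top.sls (sketch) -- map minion ids to the states they should run
base:
  'master*':
    - ipython.notebook    # notebook + ipcontroller on the salt-master
  'minion*':
    - ipython.engine      # ipengine on each of the six salt-minions
```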
In a previous post I created a one-liner for deploying an IPython notebook to the cloud. Since then I have been refactoring and advancing the concept, and it became my own datasciencebox, so it was natural to include the code for creating the ipcluster in the same project.
For good or bad, HDFS is where the data lives today, and we all know this data is hard and slow to query and analyze. The good news is that the people at Cloudera created Impala; the basic idea is fast SQL on Hadoop (HDFS or HBase) using the Hive metastore.
For good or bad, Hadoop is not where data science happens: it usually happens in R or Python. The good news is that the folks at Cloudera made sure it is very easy to extract data out of Impala using standard technologies such as ODBC or Thrift.
In this post I try some tools to extract data from Impala into Python (pandas) to do in-memory analysis.
The setup is quite simple since I am running Cloudera (CDH 4.6) on my own computer using their virtual machine; just be sure to forward port 21050. If you are in the cloud (EC2), be sure to open that port. The data is a table of 100,000 rows and 3 columns in Avro format, based on the getting started with Avro example.
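To give a flavor of the in-memory analysis I mean, here is a minimal pure-Python sketch. The sample rows below are a made-up stand-in for what a Thrift/ODBC client would return from an Impala query, and the pivot into columns is a poor man's version of what pandas does with a DataFrame:

```python
# Stand-in rows, as they might come back from an Impala query
# over Thrift/ODBC (the real table has 100,000 rows and 3 columns).
rows = [
    (1, "alice", 10.5),
    (2, "bob", 20.0),
    (3, "carol", 30.5),
]

def rows_to_columns(rows, names):
    """Pivot row tuples into a dict of columns (a poor man's DataFrame)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

columns = rows_to_columns(rows, ["id", "name", "value"])

# Once the data is columnar and in memory, analysis is just Python.
mean_value = sum(columns["value"]) / len(columns["value"])
```

In practice the `rows` list would come from a client library's `fetchall()`, but the point is the same: once the result set is local, the cluster is out of the loop.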
Problem: How many times have you needed to create a powerful EC2 instance with the Python scientific stack installed and the IPython notebook running? I have to do this at least two or three times every week.
A simple solution is to have an AMI with all the libraries ready and create a new instance from it every time; you can even have the IPython notebook as an upstart service in Ubuntu so it runs when the instance is ready. That was my previous solution, but that was before I learned Salt.
The problem with the AMI solution is that it gets dated really quickly, and while updating it is not hard, it is annoying. Also, having to log in to AWS, look for the AMI, and spin up an instance is annoying.
Salt will do the provisioning of the instance using states, which makes the updates of ...
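To illustrate what a state looks like, a sketch for installing and running the notebook might be something like the following (the package names and command are assumptions for the example, not the project's actual states):

```yaml
# notebook.sls (sketch) -- install IPython via pip and launch the notebook
python-pip:
  pkg.installed

ipython:
  pip.installed:
    - require:
      - pkg: python-pip

notebook:
  cmd.run:
    - name: ipython notebook --no-browser --ip=0.0.0.0
```

Because the state files describe the desired result rather than a frozen image, updating a library is editing one line and re-running `state.highstate`, instead of rebuilding an AMI.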
A few weeks/months ago I found on Hacker News that Harvard was publishing all of its Data Science course content (lectures, videos, labs, homeworks) online. I couldn't miss the opportunity to find out what they are teaching; it was a nice experience to see how much I have learned and what I am still missing. Also, when I saw that everything was in Python, I knew they were doing it right!
The homeworks are IPython notebooks, and the books are the ones I consider every data scientist should read: Python for Data Analysis, Machine Learning for Hackers, and Probabilistic Programming and Bayesian Methods for Hackers.
The homeworks were amazing and useful. I haven't seen all the videos (yet), but the ones I did watch were great. Below is a little review of each homework, the things I learned, and the things I think should be different.
Initially I wanted to use this opportunity ...
I've been too busy with real work to write any specific posts for the blog, but I realize it was a productive month. I worked a lot, learned quite a few new things and technologies, and have some thoughts about them. I usually use Twitter to share quick thoughts, but they tend to get lost in the noise.
So I am starting a monthly series in which I discuss a little of what I did that month. Mixed in with regular posts, I am hoping this pushes me to learn more and more every month so I always have something to write about. In this case it covers October and a few weeks of September.
Books I read
I had to do some Pig for my job, so I used this opportunity to consolidate my knowledge a little. I had worked with Pig a little ...
A few weeks ago Google released word2vec, some code to convert words to vectors. The company I currently work at does something similar, and I was quite amazed by the performance and accuracy of Google's algorithm, so I created a simple Python wrapper that calls the C code for training and reads the trained vectors into numpy arrays; you can check it out on PyPI (word2vec).
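To show the idea behind word vectors, here is a toy sketch with made-up 3-dimensional vectors (real word2vec vectors are learned from text and have hundreds of dimensions): words used in similar contexts should end up with a high cosine similarity.

```python
import math

# Toy "word vectors" -- invented for illustration, not trained.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With trained vectors, the same one-line similarity is what makes queries like "most similar words to X" work.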
At the same time I found out about yhat via Twitter, and after reading their blog I had to try their product. What they do is very simple but very useful: take a Python (scikit-learn) or R classifier and create a REST endpoint to make predictions on new data. The product is still very much in beta, but the guys were very responsive and helped me solve some of my issues.
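As a sketch of the general pattern (this is not yhat's actual API), here is a minimal stdlib-only version of the idea: a predict function wrapped behind an HTTP endpoint that takes JSON in and returns JSON out. The threshold rule stands in for a real trained classifier.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a hand-written threshold rule."""
    return 1 if sum(features) > 1.0 else 0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, e.g. {"features": [0.7, 0.6]}
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8000), PredictHandler).serve_forever()
```

The value of a hosted product is that the model serialization, versioning, and scaling around this loop are handled for you.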
The only restriction I had ...
I have written a few posts about crawling content from the web, mainly because I know the power of data, and the biggest data source in the world is the web; Google knew it, and we all know how that is working out for them. In my last post I wrote about crawling relevant content from blogs. It worked, but what if I want an admin UI to control the crawling? What if I could just put the crawler on an EC2 instance and call it when I need to crawl?
I was amazed at how easy it was to integrate Celery with Django. I just created a few tasks to actually crawl blogs (I already had ...
We all know that the most important aspect of data science or machine learning is data; with enough quality data you can do anything. It is also no mystery that the problem of big data is getting that amount of data into a queryable, reportable, or understandable format. We now have a lot of amazing new tools to store that much data (Cassandra, HBase, and more), but I still believe that almost nothing beats collecting a good amount (not necessarily huge, though the more you have the better) of structured data, and there is nothing more structured than SQL.
There is a lot of information power on the web, and crawling it gives you that power (or is at least the first step). Google does it, and I am pretty sure I don't have to say more; I cannot even begin to imagine the amount of work they do to understand that data. So I created my own mini crawler to crawl what I call the relevant content of websites, more specifically blogs. Yes, I believe blogs, not Twitter, have a lot of information power; that is why I am writing this in a blog.
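As a rough sketch of what extracting "relevant content" can look like (an illustration, not the actual crawler code), the standard library's HTMLParser can pull the text out of paragraph tags while ignoring navigation chrome:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags -- a crude notion of relevant content."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

html = "<html><body><nav>menu</nav><p>First post.</p><p>Second post.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(html)
```

A real crawler needs smarter heuristics (text density, boilerplate removal), but the core loop is the same: fetch, parse, keep the prose, drop the chrome.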
My previous post on implementing a basic neural network in Python got a lot of attention, staying a whole day on the HN front page. I was very happy about that, but even more about the feedback I got. The community gave me a lot of tips and tricks on how to improve, so now I am presenting an improved version that supports multiple hidden layers, more optimization options using minibatches, and more maintainable/understandable code (or so I believe).
I kept reading a lot about neural networks, mainly by watching videos from Geoffrey Hinton's Neural Networks course on Coursera, reading more on deeplearning.net, and trying to read some papers. That last task was definitely the hardest because of the complexity of the papers; if I cannot even read them, I can only imagine how hard it is to write them.
About the Neural Networks course: I didn't like it as much as the Machine Learning course. The main reason is that I had to watch most videos three or four times before understanding (something). This is definitely my fault, because the material is great, but it focuses more on the theory (math), which I am not that good at, and less on the implementation, which is what I am more interested in. Still, I take it as a learning experience and will try to finish those videos.
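To make the multiple-hidden-layers idea concrete, here is a minimal pure-Python sketch of the forward pass only (the actual implementation also does backpropagation and minibatch training, and uses numpy): a network is just a list of layers, and adding a hidden layer is adding one entry to the sizes list.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out):
    """One layer: a weight matrix (n_out rows of n_in) plus a bias per unit."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Push one input vector through every layer in turn."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

# Two hidden layers (4 and 3 units) between 2 inputs and 1 output: 2 -> 4 -> 3 -> 1
sizes = [2, 4, 3, 1]
layers = [init_layer(n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]
output = forward(layers, [0.5, -0.2])
```

Training then consists of repeating this forward pass over minibatches of examples and pushing the prediction error backwards through the same list of layers.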
It is really difficult to be part of the data science / machine learning community and not have heard about deep neural networks; everybody is talking about them. It is even harder for a person like me, without a PhD or a deep computer science or mathematics education, to learn about them, because 1. machine learning uses quite heavy math, and 2. there are no neural networks in sklearn.
Libraries like sklearn hide all the details and let you use machine learning and get amazing results. But sometimes you need more, and understanding how an algorithm is implemented can help you understand how to improve its results. The learning curve is steep; you can easily spend hours on DeepLearning.net and not understand anything. But there is hope.
I took Andrew Ng's Machine Learning course on Coursera wanting to get a better understanding of how the algorithms I use almost every day work. I learned a lot of useful tricks and a lot more about simple machine learning implementations. It is important to start with the easy stuff, or you get overwhelmed easily and eventually give up. Hopefully in a few more weeks I will be able to understand two or three more words on DeepLearning.net.