Django + Celery + Readability = Python relevant content crawler

I have written a few posts about crawling content from the web, mainly because I know the power of data, and the biggest data source in the world is the web; Google knew it, and we all know how they are doing. In my last post I wrote about crawling relevant content from blogs. It worked, but what if I want an admin UI to control the crawling? What if I just put the crawler on an EC2 instance and call it whenever I need to crawl?

The solution was pretty simple thanks to a few Python projects. I just needed to move from SQLAlchemy to the Django ORM, create a few Celery tasks, and use django-celery to get a pretty UI for the tasks.
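As a rough sketch of what one of those Celery tasks could look like (the model and helper names here are my assumptions, not code from the post):

    # tasks.py -- hypothetical sketch; Blog and crawl_blog are assumed names
    from celery import shared_task

    from crawler.models import Blog          # assumed Django model holding blog URLs
    from crawler.extract import crawl_blog   # assumed helper that fetches and parses posts

    @shared_task
    def crawl(blog_id):
        """Crawl a single blog; each run shows up in the django-celery admin."""
        blog = Blog.objects.get(pk=blog_id)
        posts = crawl_blog(blog.url)
        for post in posts:
            blog.posts.create(title=post['title'], content=post['content'])
        return len(posts)

    @shared_task
    def crawl_all():
        """Queue one crawl task per registered blog, so the UI can trigger everything."""
        for blog_id in Blog.objects.values_list('pk', flat=True):
            crawl.delay(blog_id)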

I was amazed at how easy it was to integrate Celery with Django. I just created a few tasks to actually crawl blogs (I already had ...

Relevant content blog crawler

We all know that the most important aspect of data science or machine learning is data; with enough quality data you can do anything. It is also no mystery that the problem of big data is getting all that data into a queryable, reportable, or understandable format. We now have a lot of amazing new tools to store that much data (Cassandra, HBase, and more), but I still believe that almost nothing beats collecting a good amount (not necessarily huge, but the more you have the better) of structured data, and there is nothing more structured than SQL.

There is a lot of information power in the web, and crawling it gives you that power (or is at least the first step). Google does it, and I am pretty sure I don't have to say more; I cannot even begin to imagine the amount of work they do to understand that data. So I created my own mini crawler to crawl what I call the relevant content of websites, more specifically blogs. Yes, I believe blogs, and not Twitter, have a lot of information power; that is why I am writing this in a blog.
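For context, the extraction step itself needs very little code. Here is a minimal sketch assuming requests plus the readability-lxml package (the actual crawler's code is not shown here):

    # Minimal relevant-content extraction sketch (assumes the readability-lxml package)
    import requests
    from readability import Document

    def extract_relevant(url):
        """Fetch a page and keep only its main content, dropping the boilerplate."""
        html = requests.get(url).text
        doc = Document(html)
        return {
            'url': url,
            'title': doc.short_title(),
            'content': doc.summary(),   # cleaned-up HTML of the main content block
        }

The interesting part of a crawler is everything around this: finding the post URLs, storing the results, and scheduling the runs.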

A not so basic neural network on python

My previous post on implementing a basic neural network in Python got a lot of attention, staying a whole day on the HN front page. I was very happy about that, but even more about the feedback I got: the community gave me a lot of tips and tricks on how to improve. So now I am presenting an improved version that supports multiple hidden layers, more optimization options using minibatches, and more maintainable/understandable code (or so I believe).
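To give an idea of what "multiple hidden layers" and "minibatches" mean in code, here is a small numpy sketch of the forward pass and one minibatch update; this is my simplified illustration, not the post's actual implementation:

    # Sketch: arbitrary hidden layers, sigmoid units, squared-error loss
    import numpy as np

    def init_weights(layers):
        """layers is e.g. [n_inputs, 25, 25, n_outputs]; one (W, b) pair per connection."""
        rng = np.random.RandomState(0)
        return [(rng.randn(n_out, n_in) * 0.1, np.zeros(n_out))
                for n_in, n_out in zip(layers[:-1], layers[1:])]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(weights, x):
        """Return the activation of every layer, input included."""
        activations = [x]
        for W, b in weights:
            activations.append(sigmoid(W.dot(activations[-1]) + b))
        return activations

    def minibatch_step(weights, X, y, lr=0.5):
        """One gradient step on a minibatch via backpropagation."""
        grads = [(np.zeros_like(W), np.zeros_like(b)) for W, b in weights]
        for x, t in zip(X, y):
            acts = forward(weights, x)
            delta = (acts[-1] - t) * acts[-1] * (1 - acts[-1])   # output layer error
            for i in reversed(range(len(weights))):
                gW, gb = grads[i]
                gW += np.outer(delta, acts[i])
                gb += delta
                if i > 0:   # propagate the error to the previous layer
                    delta = weights[i][0].T.dot(delta) * acts[i] * (1 - acts[i])
        # average the gradients over the minibatch and take a step
        return [(W - lr * gW / len(X), b - lr * gb / len(X))
                for (W, b), (gW, gb) in zip(weights, grads)]

    # usage sketch: weights = init_weights([2, 5, 5, 1]); weights = minibatch_step(weights, X_batch, y_batch)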

I kept reading a lot about neural networks, mainly by watching videos from Geoffrey Hinton's Neural Networks course on Coursera, reading more on deeplearning.net, and trying to read some papers. That last task was definitely the hardest one because of the complexity of the papers; if I cannot even read them, I can only imagine how hard it is to write them.

As for the Neural Networks course, I didn't like it as much as the Machine Learning course. The main reason is that I had to watch most videos 3 or 4 times before understanding (something). This is definitely my fault, because the material is great, but it focuses more on the theory (math), which I am not that good at, than on the implementation, which I am more interested in. But I take it as a learning experience and will try to finish those videos.

Basic [1 hidden layer] neural network on Python

It is really difficult to be part of the data science / machine learning community and not have heard about deep neural networks; everybody is talking about them. It is even harder for a person like me, without a PhD and without a deep computer science or mathematics education, to learn about them, because 1. machine learning uses quite heavy math and 2. there are no neural networks in sklearn.

Libraries like sklearn hide all the details and let you use machine learning and get amazing results. But sometimes you need more, and understanding how an algorithm is implemented can also help you understand how to improve results. The learning curve is steep; you can easily spend hours on DeepLearning.net and not understand anything. But there is hope.

I took Andrew Ng's Machine Learning course on Coursera wanting to get a better understanding of how the algorithms I use almost every day work. I learned a lot of useful tricks and a lot more about simple machine learning implementations. It is important to start with the easy stuff, or you get overwhelmed easily and eventually give up. Hopefully in a few more weeks I will be able to understand two or three more words on DeepLearning.net.

Extracting NBA data from ESPN

I've been wanting to play with some sports data for a while. Today I decided to stop procrastinating and do it. The problem was that after searching for a while (15 minutes) I was unable to find the data I wanted, even in the Basketball Database (not really sure I understand the site).

A friend showed me the ESPN stats and asked me if I knew how to scrape data from a website. I lied and told him yes. But I know Python and its magic powers, so after reading for 15 minutes I knew how to do it.

I used requests and BeautifulSoup to download and scrape the data from the ESPN site, then used pandas to order, slice, and save the data into simple csv files. I also used IPython notebooks to develop the code faster, and a little bit of my copper project to use ...
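The general recipe looks roughly like this (the URL and table layout below are placeholders, not the ones actually used in the post):

    # Sketch: download a stats table and save it as csv with requests + BeautifulSoup + pandas
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'http://espn.go.com/nba/statistics/...'   # placeholder stats page
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    rows = []
    for tr in soup.find('table').find_all('tr'):
        cells = [cell.get_text() for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)

    # Assume the first row of the table is the header
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df.to_csv('nba_stats.csv', index=False)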

Plugin for blogging with IPython notebooks in Pelican

This is just a little update on my previous post about blogging with IPython notebooks in pelican.

One of the people behind pelican helped me convert the previous code into a pelican plugin, and I just made it available on GitHub: pelican-ipythonnb.

The only thing that changed is the installation; the docs on how to use it are below (or in the readme of the repo).

An example is my last post about cleaning data for a Kaggle competition.

Happy blogging!

Installation

Download plugin files: plugin/ipythonnb.py and the plugin/nbconverter directory.

The easiest way is to locate the pelican directory (for example: ~/.virtualenvs/blog/lib/python2.7/site-packages/pelican/) and paste the plugin files into the pelican/plugins folder. Then in pelicanconf.py put: PLUGINS = ['pelican.plugins.ipythonnb'].
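In other words, the relevant bit of pelicanconf.py ends up looking something like this (the MARKUP line is my assumption about how notebook files get picked up; check the repo's readme):

    # pelicanconf.py (after copying the plugin files into pelican/plugins/)
    MARKUP = ('md', 'ipynb')                  # assumption: also treat notebooks as content
    PLUGINS = ['pelican.plugins.ipythonnb']   # setting quoted above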

But it is also possible to add plugins in the same directory as the pelican project: Create ...

Kaggle bulldozers: Basic cleaning

In the last (2) weeks I have been doing a project for my Business Intelligence class, and I learned why people always say they spend 80% of their time cleaning the data; it couldn't be more true.

To practice a little bit more I decided to try the Kaggle bulldozers competition. The first thing I noticed was that the data is huge (+450 MB); it is by far the biggest dataset I have dealt with. For most big data experts it is probably tiny, but for me it was huge xD.

I was curious to see if Python was capable of dealing with that and of doing some basic cleaning. I also ended up creating some functionality for copper: date cleaning and joining datasets.

I started by looking at (and getting scared by) the data in Excel, then imported it into Python (pandas); +401k rows and 53 columns.
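Loading it is basically a pandas one-liner, something like the following (assuming the competition's Train.csv and its saledate column):

    # Rough sketch of loading the Kaggle bulldozers training data with pandas
    import pandas as pd

    train = pd.read_csv('Train.csv', parse_dates=['saledate'])
    print(train.shape)          # on the order of 401k rows x 53 columns
    print(train.dtypes.head())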

Blogging with IPython notebooks in pelican

Update: Check out the updated post on blogging with iPython notebook and pelican with a plugin.

It seems that I spend more time redesigning/developing this blog than actually blogging xD. Just a few weeks ago I wrote about how I was blogging using Jekyll with IPython notebooks, and now I am talking about doing the same with a different static blog engine.

The fact is that a few days ago I found pelican, the first serious Python alternative to Jekyll I have come across, and after reading about it for 15 minutes I was downloading pelican and creating a theme based on my old Jekyll site.

The switch was easy: I just needed to move some files, change some of the tags from Liquid to Jinja2, and add some metadata. Having fewer than 10 posts also helped, because I did that part manually.

The fact that pelican is written in Python gave me ...

Copper - Quick and automatic data transformation

Today I started working on the first assignment for Coursera's Data Analysis course, and I needed to make some basic transformations on the data, so I decided to include them in copper.

I realized I was making some wrong decisions with the Type metadata on the Dataset. Since the money type is just a number (similar to a percent number in the case of this example), it is not necessary to have it as a different type, just to apply the transformations I was already doing. So I decided to remove the Money type; now the only types are Number and Category.
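As a plain-pandas illustration of what that transformation amounts to (this is not copper's API, and the column name below is made up):

    # Treat money/percent strings as plain numbers, as described above
    import pandas as pd

    def to_number(series):
        """Strip '$', ',' and '%' and convert the column to float."""
        return (series.astype(str)
                      .str.replace(r'[$,%]', '', regex=True)
                      .astype(float))

    df = pd.read_csv('loansData.csv')
    df['interest_rate'] = to_number(df['interest_rate'])   # e.g. '8.90%' -> 8.9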

Let's dive into the example; dataset: loansData.csv

Import copper and set the project path

Using D3, backbone and tornado to visualize histograms of a csv file

After procrastinating for weeks on learning D3.js and backbone.js, I have finally made my first example using both libraries to explore (via histograms) a pandas DataFrame. The reason for the procrastination is very simple: I love Python too much, because it is probably the only language that is great in all the areas I am interested in:

  • Great web frameworks such as Django and Tornado - "fighting" with Ruby (Rails)
  • Great data analysis packages such as pandas - "fighting" with R
  • Great machine-learning libraries such as scikit-learn
  • Probably not the most successful, but it has a good gaming library, pygame
  • It is a great general-purpose language - I used it to program a robot for a NASA competition using a PS3 controller, serial ports, a web server, and cameras, all in one language
  • And the list could go on for hours

For me that is Python's killer feature: do anything in one language, sometimes ...
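The server side of that example can be surprisingly small. Here is a rough sketch of a Tornado handler that returns a column's histogram as JSON for D3 to draw (file name, URL pattern, and bin count are placeholders):

    # Sketch: serve histogram data for one DataFrame column as JSON
    import json
    import numpy as np
    import pandas as pd
    import tornado.ioloop
    import tornado.web

    df = pd.read_csv('data.csv')   # placeholder csv file

    class HistogramHandler(tornado.web.RequestHandler):
        def get(self, column):
            counts, edges = np.histogram(df[column].dropna(), bins=10)
            self.write(json.dumps({'counts': counts.tolist(), 'edges': edges.tolist()}))

    application = tornado.web.Application([
        (r'/histogram/(.*)', HistogramHandler),
    ])

    if __name__ == '__main__':
        application.listen(8888)
        tornado.ioloop.IOLoop.current().start()

The Backbone model then just fetches /histogram/<column> and hands the counts and edges to D3 to draw the bars.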