Copper - Easy data analysis and machine learning on python

As part of my Master in IT & Management this semester I am taking a class called "Advanced Business Intelligence" which is the same as machine learning or data analysis to the Business people; in general they teach how to use SAS (Enteprise Miner) to do data mining.

SAS and Enterprise Miner is a good software - with some problems:

  • Costs a lot of money
  • Is slow as hell, seriously
  • Good for business people, not so much for engineers
  • Costs a lot of money

I was able to get it for $100 dollars because the university has an arrangement with SAS but without that is impossible to buy for personal use. For that reason I start learning R (from coursera) and Python Data Analysis packages (pandas and scikit-learn), while I was taking my "(basic) Business Intelligence".

I learned that is possible to replace SAS with R or Python but some easy tasks can take a long time. I want to make a contribution by making some tasks easier and keep learning python. For that reason I am going to try to do everything we do on my BI class with Python and try to make it into a package for making some tasks more easy.

The first week was review of how to import Data into SAS Enterprise Miner and explore a little bit the data.

Copper

After thinking a lot how to call the package I decide to call it copper (inspired by the dog of The Fox and the Hound a.k.a. the saddest movie ever).

Importing

Note: I am going to use the same data that as on my class, a dataset from donations, available here: donors.csv

One thing that SAS does really good and pandas does not have is meta-data:

  • Understands the type of each column: for example, columns with money symbols are converted to numbers that later can be used for doing machine learning.
  • But also can change the type (or level how SAS call it) to be categorical or number variables, also can define roles: for example Rejected columns are not used for Machine Learning and that way is very easy to try different combinations.

So I create a class called DataSet which is a wrapper around a few pandas DataFrames to introduce meta-data.

To load data have to import copper and then configure the directory path for the project. Inside the project directory needs to be another folder called 'data' with the data (csv files for example)

In [1]:
import copper
copper.project.path = '../'

Then create a new Dataset and load the data.csv file from the data folder.

In [2]:
ds = copper.Dataset()
In [3]:
ds.load('data.csv')

Metadata

By default copper tries to find the best match for each column, similar at what SAS does.

  • Depending on the name of the column decides what the target and ID columns are; the rest are inputs.
  • Also tries to figure out the type of each column depending on the dtype (from pandas/numpy) and the content: for example if the dtype is object but most of the values starts with a $ symbols defines the column as a money column.
In [4]:
ds.metadata
Out[4]:
Role Type
TARGET_B Input Number
ID ID Number
TARGET_D Input Money
GiftCnt36 Input Number
GiftCntAll Input Number
GiftCntCard36 Input Number
GiftCntCardAll Input Number
GiftAvgLast Input Money
GiftAvg36 Input Money
GiftAvgAll Input Money
GiftAvgCard36 Input Money
GiftTimeLast Input Number
GiftTimeFirst Input Number
PromCnt12 Input Number
PromCnt36 Input Number
PromCntAll Input Number
PromCntCard12 Input Number
PromCntCard36 Input Number
PromCntCardAll Input Number
StatusCat96NK Input Category
StatusCatStarAll Input Number
DemCluster Input Number
DemAge Input Number
DemGender Input Category
DemHomeOwner Input Category
DemMedHomeValue Input Money
DemPctVeterans Input Number
DemMedIncome Input Money

Of course is possible to change the defaults role and type of each column, lets fix some of the metadata

In [5]:
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY
In [6]:
ds.metadata.head(3)
Out[6]:
Role Type
TARGET_B Target Number
ID ID Category
TARGET_D Rejected Money

Depending of the metadata copper transforms the data. Mainly it transforms non-numbers into numbers to make machine learning possible; in scikit-learn is necessary to enter only numbers. But more on that on a later post.

Histograms

Before going into Machine Learning is a good idea to explore the data, the usual way is with a histogram. Is easy to explore money (numerical) columns. I remove the legend, because is to big but the method also returns a list with the information of each bin.

In [8]:
ds.histogram('DemMedIncome', legend=False, retList=True)
Out[8]:
0          0.0 - 10000.05: 2358
1         10000.0 - 20000.10: 9
2       20000.1 - 30000.15: 304
3      30000.1 - 40000.20: 1397
4      40000.2 - 50000.25: 2187
5      50000.2 - 60000.30: 1303
6       60000.3 - 70000.35: 921
7       70000.3 - 80000.40: 550
8       80000.4 - 90000.45: 290
9      90000.4 - 100000.50: 130
10    100000.5 - 110000.55: 110
11     110000.5 - 120000.60: 39
12     120000.6 - 130000.65: 34
13     130000.6 - 140000.70: 18
14      140000.7 - 150000.75: 7
15     150000.8 - 160000.80: 13
16      160000.8 - 170000.85: 7
17      170000.8 - 180000.90: 7
18      180000.9 - 190000.95: 0
19      190000.9 - 200001.00: 2

Is also possible to explore categorical variables.

In [9]:
ds.histogram('DemGender')

Inputs

We can take a look at how the data is transformed.

In [10]:
ds.inputs
Out[10]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9686 entries, 0 to 9685
Data columns:
GiftCnt36            9686  non-null values
GiftCntAll           9686  non-null values
GiftCntCard36        9686  non-null values
GiftCntCardAll       9686  non-null values
GiftAvgLast          9686  non-null values
GiftAvg36            9686  non-null values
GiftAvgAll           9686  non-null values
GiftAvgCard36        7906  non-null values
GiftTimeLast         9686  non-null values
GiftTimeFirst        9686  non-null values
PromCnt12            9686  non-null values
PromCnt36            9686  non-null values
PromCntAll           9686  non-null values
PromCntCard12        9686  non-null values
PromCntCard36        9686  non-null values
PromCntCardAll       9686  non-null values
StatusCat96NK [A]    9686  non-null values
StatusCat96NK [E]    9686  non-null values
StatusCat96NK [F]    9686  non-null values
StatusCat96NK [L]    9686  non-null values
StatusCat96NK [N]    9686  non-null values
StatusCat96NK [S]    9686  non-null values
StatusCatStarAll     9686  non-null values
DemCluster           9686  non-null values
DemAge               7279  non-null values
DemGender [F]        9686  non-null values
DemGender [M]        9686  non-null values
DemGender [U]        9686  non-null values
DemHomeOwner [H]     9686  non-null values
DemHomeOwner [U]     9686  non-null values
DemMedHomeValue      9686  non-null values
DemPctVeterans       9686  non-null values
DemMedIncome         9686  non-null values
dtypes: float64(7), int64(26)

inputs is a pandas DataFrame. We can see that each categorical variables are divided into more columns that are filled with one's and zero's for doing machine learning possible also money columns are converted to be numbers only.

See that the dtypes are float and int so is possible to enter that on scikit-learn by calling inputs.values to get a numpy array.

Conclusion

Thats it for now, the next week I hope to get the integration with scikit-learn to make comparison of models as easy (and why not easier) than with SAS.

The code is on github: copper