Copper - Easy data analysis and machine learning in Python
As part of my Master in IT & Management, this semester I am taking a class called "Advanced Business Intelligence", which is what business people call machine learning or data analysis; in general they teach how to use SAS (Enterprise Miner) to do data mining.
SAS Enterprise Miner is good software, but it has some problems:
- It costs a lot of money
- It is slow as hell, seriously
- It is good for business people, not so much for engineers
- It costs a lot of money
I was able to get it for $100 because the university has an arrangement with SAS, but without that it is practically impossible to buy for personal use. For that reason I started learning R (on Coursera) and Python data analysis packages (pandas and scikit-learn) while I was taking my (basic) Business Intelligence class.
I learned that it is possible to replace SAS with R or Python, but some easy tasks can take a long time. I want to contribute by making some of those tasks easier while I keep learning Python. So I am going to try to do everything we do in my BI class with Python, and turn it into a package that makes some tasks easier.
The first week was a review of how to import data into SAS Enterprise Miner and explore it a little bit.
Copper
After thinking a lot about what to call the package I decided to call it copper (inspired by the dog from The Fox and the Hound, a.k.a. the saddest movie ever).
Importing
Note: I am going to use the same data as in my class, a dataset about donations, available here: donors.csv
One thing that SAS does really well and pandas does not have is metadata:
- It understands the type of each column: for example, columns with money symbols are converted to numbers that can later be used for machine learning.
- It can also change the type (or level, as SAS calls it) of a column to categorical or numeric, and it can define roles: for example, rejected columns are not used for machine learning, which makes it very easy to try different combinations.
So I created a class called Dataset, which is a wrapper around a few pandas DataFrames, to introduce metadata.
To load data, first import copper and then configure the project path. Inside the project directory there needs to be another folder called 'data' that holds the data (csv files, for example).
import copper
copper.project.path = '../'
Then create a new Dataset and load the data.csv file from the data folder.
ds = copper.Dataset()
ds.load('data.csv')
Metadata
By default copper tries to find the best match for each column, similar to what SAS does.
- Depending on the name of each column it decides which are the target and ID columns; the rest are inputs.
- It also tries to figure out the type of each column depending on its dtype (from pandas/numpy) and its content: for example, if the dtype is object but most of the values start with a $ symbol, it defines the column as a money column.
ds.metadata
 | Role | Type |
---|---|---|
TARGET_B | Input | Number |
ID | ID | Number |
TARGET_D | Input | Money |
GiftCnt36 | Input | Number |
GiftCntAll | Input | Number |
GiftCntCard36 | Input | Number |
GiftCntCardAll | Input | Number |
GiftAvgLast | Input | Money |
GiftAvg36 | Input | Money |
GiftAvgAll | Input | Money |
GiftAvgCard36 | Input | Money |
GiftTimeLast | Input | Number |
GiftTimeFirst | Input | Number |
PromCnt12 | Input | Number |
PromCnt36 | Input | Number |
PromCntAll | Input | Number |
PromCntCard12 | Input | Number |
PromCntCard36 | Input | Number |
PromCntCardAll | Input | Number |
StatusCat96NK | Input | Category |
StatusCatStarAll | Input | Number |
DemCluster | Input | Number |
DemAge | Input | Number |
DemGender | Input | Category |
DemHomeOwner | Input | Category |
DemMedHomeValue | Input | Money |
DemPctVeterans | Input | Number |
DemMedIncome | Input | Money |
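The "$ symbol" heuristic mentioned above can be sketched in plain pandas. This is only an illustration of the idea, not copper's actual implementation:

```python
import pandas as pd

# A toy object-dtype column, like DemMedIncome looks in the raw csv
col = pd.Series(['$10.00', '$25.50', '$5.00', '$12.25'], dtype=object)

# Heuristic: object dtype, but most values start with '$' -> treat as money
is_money = col.dtype == object and col.str.startswith('$').mean() > 0.5
print(is_money)
```

The same pattern (dtype check plus a content check) works for detecting other column types as well.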
Of course it is possible to change the default role and type of each column; let's fix some of the metadata:
ds.role['TARGET_D'] = ds.REJECTED
ds.role['TARGET_B'] = ds.TARGET
ds.type['ID'] = ds.CATEGORY
ds.metadata.head(3)
 | Role | Type |
---|---|---|
TARGET_B | Target | Number |
ID | ID | Category |
TARGET_D | Rejected | Money |
Depending on the metadata, copper transforms the data. Mainly it transforms non-numbers into numbers to make machine learning possible; scikit-learn only accepts numbers. But more on that in a later post.
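The transformation is roughly what pandas can do with get_dummies plus a bit of string cleaning. A sketch of the idea (not copper's actual code), using made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'DemGender': ['F', 'M', 'U', 'F'],
    'GiftAvgLast': ['$10.00', '$25.50', '$5.00', '$12.25'],
})

# Money column: strip the '$' and parse as float
money = df['GiftAvgLast'].str.replace('$', '', regex=False).astype(float)

# Category column: one 0/1 column per category
dummies = pd.get_dummies(df['DemGender'], prefix='DemGender').astype(int)

# All-numeric inputs, ready for machine learning
inputs = pd.concat([money, dummies], axis=1)
print(inputs)
```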
Histograms
Before going into machine learning it is a good idea to explore the data; the usual way is with a histogram. It is easy to explore money (numerical) columns. I remove the legend because it is too big, but the method can also return a list with the information of each bin.
ds.histogram('DemMedIncome', legend=False, retList=True)
0     0.0 - 10000.05: 2358
1     10000.0 - 20000.10: 9
2     20000.1 - 30000.15: 304
3     30000.1 - 40000.20: 1397
4     40000.2 - 50000.25: 2187
5     50000.2 - 60000.30: 1303
6     60000.3 - 70000.35: 921
7     70000.3 - 80000.40: 550
8     80000.4 - 90000.45: 290
9     90000.4 - 100000.50: 130
10    100000.5 - 110000.55: 110
11    110000.5 - 120000.60: 39
12    120000.6 - 130000.65: 34
13    130000.6 - 140000.70: 18
14    140000.7 - 150000.75: 7
15    150000.8 - 160000.80: 13
16    160000.8 - 170000.85: 7
17    170000.8 - 180000.90: 7
18    180000.9 - 190000.95: 0
19    190000.9 - 200001.00: 2
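Under the hood a bin list like this can be computed with plain numpy. A sketch with a few made-up income values (not copper's code):

```python
import numpy as np

income = np.array([12000, 45000, 38000, 61000, 52000, 48000, 30000, 99000])

# numpy returns the count per bin and the bin edges
counts, edges = np.histogram(income, bins=4)
for i, (lo, hi, n) in enumerate(zip(edges[:-1], edges[1:], counts)):
    print(f'{i}  {lo:.1f} - {hi:.1f}: {n}')
```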
It is also possible to explore categorical variables.
ds.histogram('DemGender')
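A categorical histogram is essentially a value count per category; in plain pandas (again just a sketch, with made-up values):

```python
import pandas as pd

# Toy version of the DemGender column
gender = pd.Series(['F', 'M', 'F', 'U', 'F', 'M'], name='DemGender')

# One bar per category = the count of each distinct value
print(gender.value_counts())
```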
Inputs
We can take a look at how the data is transformed.
ds.inputs
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9686 entries, 0 to 9685
Data columns:
GiftCnt36            9686  non-null values
GiftCntAll           9686  non-null values
GiftCntCard36        9686  non-null values
GiftCntCardAll       9686  non-null values
GiftAvgLast          9686  non-null values
GiftAvg36            9686  non-null values
GiftAvgAll           9686  non-null values
GiftAvgCard36        7906  non-null values
GiftTimeLast         9686  non-null values
GiftTimeFirst        9686  non-null values
PromCnt12            9686  non-null values
PromCnt36            9686  non-null values
PromCntAll           9686  non-null values
PromCntCard12        9686  non-null values
PromCntCard36        9686  non-null values
PromCntCardAll       9686  non-null values
StatusCat96NK [A]    9686  non-null values
StatusCat96NK [E]    9686  non-null values
StatusCat96NK [F]    9686  non-null values
StatusCat96NK [L]    9686  non-null values
StatusCat96NK [N]    9686  non-null values
StatusCat96NK [S]    9686  non-null values
StatusCatStarAll     9686  non-null values
DemCluster           9686  non-null values
DemAge               7279  non-null values
DemGender [F]        9686  non-null values
DemGender [M]        9686  non-null values
DemGender [U]        9686  non-null values
DemHomeOwner [H]     9686  non-null values
DemHomeOwner [U]     9686  non-null values
DemMedHomeValue      9686  non-null values
DemPctVeterans       9686  non-null values
DemMedIncome         9686  non-null values
dtypes: float64(7), int64(26)
inputs is a pandas DataFrame. We can see that each categorical variable is divided into several columns filled with ones and zeros to make machine learning possible, and that money columns are converted to plain numbers. Since the dtypes are float and int, the data can go straight into scikit-learn by calling inputs.values to get a numpy array.
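Once everything is numeric, feeding those values into scikit-learn is direct. A minimal sketch with a tiny made-up frame (the column names just mimic the real ones):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for ds.inputs and the TARGET_B column
inputs = pd.DataFrame({
    'GiftCnt36':     [1, 4, 2, 8, 3, 7],
    'DemGender [F]': [1, 0, 1, 0, 1, 0],
})
target = pd.Series([0, 1, 0, 1, 0, 1], name='TARGET_B')

X = inputs.values   # plain numpy array, which scikit-learn expects
y = target.values
model = LogisticRegression().fit(X, y)
print(model.predict(X))
```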
Conclusion
That's it for now. Next week I hope to get the scikit-learn integration working, to make comparing models as easy as (and why not easier than) with SAS.
The code is on GitHub: copper