Copper - Bootstrap and Bagging
This week in my Advanced Business Intelligence class we took a look at Boosting and Bagging, two concepts well known to everybody in the machine learning world. The example in class was to take a simple Decision Tree and compare it to a bagged Decision Tree; FYI, the example in class failed: SAS said that the simple DT was "better" than the bagged one, due to some error in SAS Enterprise Miner that we could not find. Time to see if Python is capable of doing it.
This first part is just a recap of Post #1. I am using the same donors.csv that I use for my class. We import the data and set some roles for the variables:
import copper
copper.project.path = '../'
ds = copper.Dataset()
ds.load('data.csv')
ds.role['TARGET_D'] = ds.REJECTED  # drop this target from the inputs
ds.role['TARGET_B'] = ds.TARGET    # the variable we want to predict
ds.type['ID'] = ds.CATEGORY        # treat the ID as a category, not a number
Since scikit-learn can't handle NaNs we need to fill the missing values, so I created a simple method on the Dataset class that fills the values of a numerical column with its mean.
ds.fillna('DemAge', 'mean')
ds.fillna('GiftAvgCard36', 'mean')
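For reference, a minimal sketch of what that mean fill amounts to in plain pandas (the DataFrame df and the read_csv call are illustrative, not copper internals):
import pandas as pd
df = pd.read_csv('data.csv')  # illustrative path; copper resolves it through project.path
# Fill missing values of a numeric column with that column's mean,
# which is what ds.fillna(column, 'mean') is meant to do
df['DemAge'] = df['DemAge'].fillna(df['DemAge'].mean())
df['GiftAvgCard36'] = df['GiftAvgCard36'].fillna(df['GiftAvgCard36'].mean())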
Let's see if we are good
ds.inputs
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9686 entries, 0 to 9685
Data columns:
GiftCnt36            9686  non-null values
GiftCntAll           9686  non-null values
GiftCntCard36        9686  non-null values
GiftCntCardAll       9686  non-null values
GiftAvgLast          9686  non-null values
GiftAvg36            9686  non-null values
GiftAvgAll           9686  non-null values
GiftAvgCard36        9686  non-null values
GiftTimeLast         9686  non-null values
GiftTimeFirst        9686  non-null values
PromCnt12            9686  non-null values
PromCnt36            9686  non-null values
PromCntAll           9686  non-null values
PromCntCard12        9686  non-null values
PromCntCard36        9686  non-null values
PromCntCardAll       9686  non-null values
StatusCat96NK [A]    9686  non-null values
StatusCat96NK [E]    9686  non-null values
StatusCat96NK [F]    9686  non-null values
StatusCat96NK [L]    9686  non-null values
StatusCat96NK [N]    9686  non-null values
StatusCat96NK [S]    9686  non-null values
StatusCatStarAll     9686  non-null values
DemCluster           9686  non-null values
DemAge               9686  non-null values
DemGender [F]        9686  non-null values
DemGender [M]        9686  non-null values
DemGender [U]        9686  non-null values
DemHomeOwner [H]     9686  non-null values
DemHomeOwner [U]     9686  non-null values
DemMedHomeValue      9686  non-null values
DemPctVeterans       9686  non-null values
DemMedIncome         9686  non-null values
dtypes: float64(7), int64(26)
OK, no missing values so we are good to go.
Machine Learning
Time to see if bootstrapping and bagging are as good as they promise to be. We create a new machine learning instance, set the dataset, and tell it to sample half of the data for training and half for testing.
ml = copper.MachineLearning()
ml.dataset = ds
ml.sample(trainSize=0.5)
Create a new Decision Tree, add it to the models to compare, and fit the models:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth=10)
ml.add_clf(tree_clf, 'Decision Tree')
ml.fit()
Since I started coding this library I have wanted to maintain full compatibility with pandas and scikit-learn and their respective APIs. For that reason it is possible to add already fitted classifiers to the ML class and then compare them with its utilities; this is useful in the case of bootstrapping.
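For example, any estimator that follows the scikit-learn fit/predict API can be trained outside of copper and then registered with add_clf. The snippet below is purely illustrative (GaussianNB is an arbitrary choice and is not part of the comparison that follows):
from sklearn.naive_bayes import GaussianNB
# Fit a classifier on the same training split, outside of copper...
nb_clf = GaussianNB()
nb_clf.fit(ml.X_train, ml.y_train)
# ...and add the already fitted estimator so it can be compared with the rest
ml.add_clf(nb_clf, 'Naive Bayes')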
In the next lines I create 20 different Decision Tree classifiers using 20 different samples (via bootstrapping), fit each classifier, and add them to the ML class.
I have to be careful here: the bootstrapping should use only the training part, so I re-sample from ml.X_train and ml.y_train instead of from all of ds.inputs (above I did ml.sample(trainSize=0.5) to split the inputs into half training and half testing). From that training half I draw 20 bootstrap samples and fit 20 classifiers. The first time I used all the inputs and the results were amazing, but obviously I was then using some records for both training and testing, which is wrong. Since each bootstrap sample in turn takes half of the training half, each new Decision Tree is trained on roughly a quarter of the inputs.
from sklearn import cross_validation
# 20 bootstrap samples drawn (with replacement) from the training split only
bs = cross_validation.Bootstrap(len(ml.X_train), n_iter=20)
i = 0
for train_index, test_index in bs:
    X_train = ml.X_train[train_index]
    y_train = ml.y_train[train_index]
    clf = tree.DecisionTreeClassifier(max_depth=10)
    clf.fit(X_train, y_train)
    ml.add_clf(clf, "DT" + str(i + 1))
    i += 1
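Note that cross_validation.Bootstrap was removed in later scikit-learn releases; on a newer version an approximately equivalent loop can be written with sklearn.utils.resample. It is not an exact replacement (Bootstrap also splits each sample into train and test portions) and it is an alternative to the loop above, not something to run in addition to it:
from sklearn.utils import resample
# Draw 20 bootstrap samples (with replacement) from the training split only
for i in range(20):
    X_bs, y_bs = resample(ml.X_train, ml.y_train, random_state=i)
    clf = tree.DecisionTreeClassifier(max_depth=10)
    clf.fit(X_bs, y_bs)
    ml.add_clf(clf, "DT" + str(i + 1))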
Let's see some results:
ml.accuracy().head()
Decision Tree    0.550279
DT17             0.539955
DT14             0.536031
DT11             0.535205
DT19             0.534999
Name: Accuracy
ml.roc(legend=False, retList=True)
Decision Tree    0.541008
DT15             0.521935
DT19             0.520583
DT11             0.516007
DT16             0.515312
DT14             0.514619
DT8              0.513731
DT17             0.509060
DT9              0.506149
DT3              0.504594
DT13             0.504411
DT5              0.501457
DT2              0.499923
DT6              0.497339
DT10             0.494504
DT12             0.494483
DT7              0.493177
DT18             0.492243
DT4              0.491786
DT1              0.488879
DT20             0.483900
As expected that didn't do much; most of the new models are even worse than the original, but now we have a few models ready to be bagged.
Bagging
Since scikit-learn does not have an implementation of bagging, I made a really simple (and inefficient) one: the bag predicts the class using the mode of the individual predictions and predicts the probabilities using their mean. I also created some methods on the ML class to make that possible and easy.
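A minimal sketch of that idea, assuming a list of already fitted scikit-learn classifiers (this is not the actual copper.core.ensemble.Bagging code, just the mode-of-classes / mean-of-probabilities logic described above):
import numpy as np
from scipy import stats

class SimpleBagging(object):
    def __init__(self, clfs):
        self.clfs = clfs  # already fitted scikit-learn classifiers

    def predict(self, X):
        # Stack each classifier's class predictions and take the column-wise mode
        preds = np.array([clf.predict(X) for clf in self.clfs])
        mode, _ = stats.mode(preds, axis=0)
        return np.ravel(mode)

    def predict_proba(self, X):
        # Average the predicted probabilities across the classifiers
        probas = np.array([clf.predict_proba(X) for clf in self.clfs])
        return probas.mean(axis=0)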
Creating a bag of all the models is as simple as calling the method and passing a name as a parameter. In future releases it will be possible to pass a list of target models to include in the bag, which will make it easy to create more bags.
ml.bagging("Bag 1")
ml.clfs # Checking the classifiers
DT14             DecisionTreeClassifier(compute_importances=Fal...
Decision Tree    DecisionTreeClassifier(compute_importances=Fal...
DT13             DecisionTreeClassifier(compute_importances=Fal...
DT15             DecisionTreeClassifier(compute_importances=Fal...
DT18             DecisionTreeClassifier(compute_importances=Fal...
DT12             DecisionTreeClassifier(compute_importances=Fal...
DT17             DecisionTreeClassifier(compute_importances=Fal...
DT9              DecisionTreeClassifier(compute_importances=Fal...
DT8              DecisionTreeClassifier(compute_importances=Fal...
DT20             DecisionTreeClassifier(compute_importances=Fal...
DT11             DecisionTreeClassifier(compute_importances=Fal...
DT19             DecisionTreeClassifier(compute_importances=Fal...
DT16             DecisionTreeClassifier(compute_importances=Fal...
DT3              DecisionTreeClassifier(compute_importances=Fal...
DT2              DecisionTreeClassifier(compute_importances=Fal...
DT1              DecisionTreeClassifier(compute_importances=Fal...
DT10             DecisionTreeClassifier(compute_importances=Fal...
DT7              DecisionTreeClassifier(compute_importances=Fal...
DT6              DecisionTreeClassifier(compute_importances=Fal...
DT5              DecisionTreeClassifier(compute_importances=Fal...
DT4              DecisionTreeClassifier(compute_importances=Fal...
Bag 1            <copper.core.ensemble.Bagging object at 0x7937...
Let's see the results:
ml.accuracy().head()
Bag 1            0.558951
Decision Tree    0.550279
DT17             0.539955
DT14             0.536031
DT11             0.535205
Name: Accuracy
ml.roc(legend=False, retList=True)
Bag 1            0.578210
Decision Tree    0.541008
DT15             0.521935
DT19             0.520583
DT11             0.516007
DT16             0.515312
DT14             0.514619
DT8              0.513731
DT17             0.509060
DT9              0.506149
DT3              0.504594
DT13             0.504411
DT5              0.501457
DT2              0.499923
DT6              0.497339
DT10             0.494504
DT12             0.494483
DT7              0.493177
DT18             0.492243
DT4              0.491786
DT1              0.488879
DT20             0.483900
Well, it is an improvement; not a huge one, but for something that just combines the individual models with a mode and a mean it is not bad at all. At least the bag scores a better accuracy than the other 20 classifiers and also a better Area Under the Curve.
Conclusion
Bagging is good: even a very simple implementation gave better results.
I did not tune any parameters for the Decision Trees, only max_depth=10; playing with more models and more parameters I am sure bagging is going to do even better. To address that, next I want to take a look at Grid Search and see how it can help me improve these results.
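As a preview, a grid search over the tree parameters could look something like this (GridSearchCV lives in sklearn.grid_search in the scikit-learn version used here, sklearn.model_selection in newer releases; the parameter grid is just an example):
from sklearn import tree
from sklearn.grid_search import GridSearchCV
# Try a few combinations of Decision Tree parameters instead of fixing max_depth=10
param_grid = {'max_depth': [5, 10, 15, 20], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(ml.X_train, ml.y_train)
print(grid.best_params_)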
I really believe that the potential of bagging is in doing some conditional scoring, for example only asking the models that are good with high incomes when the entry's income is higher than $10,000. It is also probably a good idea to use different classifiers: instead of 20 decision trees, 5 DTs, 5 SVMs, and so on. But that is just something that crossed my mind.
As usual the code is on github: copper