Kaggle bulldozers: Basic cleaning
In the last two weeks I have been doing a project for my Business Intelligence class, and I learned why people always say they spend 80% of their time cleaning the data; it couldn't be more true.
To practice a little more I decided to try the Kaggle bulldozers competition. The first thing I noticed was that the data is huge (450+ MB), by far the biggest dataset I have dealt with. For most Big Data experts it is probably tiny, but for me it was huge xD.
I was curious to see whether Python was capable of dealing with that and doing some basic cleaning. Along the way I also ended up adding some functionality to copper: date cleaning and joining datasets.
I started by looking at (and getting scared by) the data in Excel, then imported it into Python (pandas): 401k+ rows and 53 columns.
import copper
copper.project.path = '../'
train = copper.read_csv('train.csv')
len(train), len(train.columns)
(401125, 53)
I decided to divide the cleaning into 4 sections of approximately 12 columns each, and to create 4 different copper.Datasets to modify the metadata of each section. I mainly needed to decide which columns are useful and which columns have too many missing values to be useful, while at the same time making sure that the useful columns do not duplicate each other (categorical variables) and are good for machine learning (dates).
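The slicing itself is plain pandas column indexing; here is a quick sketch of the four section boundaries used below, with a dummy 53-column frame standing in for the real data:

```python
import numpy as np
import pandas as pd

# Dummy frame standing in for the 53-column training data
df = pd.DataFrame(np.zeros((3, 53)))

# Section boundaries used throughout this post
bounds = [(0, 12), (12, 20), (20, 31), (31, 53)]
sections = [df[df.columns[a:b]] for a, b in bounds]

[len(s.columns) for s in sections]  # -> [12, 8, 11, 22]
```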
Columns 1
Select the first 12 columns and print their first values.
cols1 = train[train.columns[0:12]]
cols1.head()
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | fiModelDesc | fiBaseModel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3 | 2004 | 68 | Low | 11/16/2006 0:00 | 521D | 521 |
1 | 1139248 | 57000 | 117657 | 77 | 121 | 3 | 1996 | 4640 | Low | 3/26/2004 0:00 | 950FII | 950 |
2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3 | 2001 | 2838 | High | 2/26/2004 0:00 | 226 | 226 |
3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3 | 2001 | 3486 | High | 5/19/2011 0:00 | PC120-6E | PC120 |
4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3 | 2007 | 722 | Medium | 7/23/2009 0:00 | S175 | S175 |
Most of the columns in this section are only IDs, which are not useful for prediction; also, the target is the SalePrice variable, so I created a Dataset and set the correct metadata.
ds1 = copper.Dataset(cols1)
ds1.role['SalesID'] = ds1.ID
ds1.role['MachineID'] = ds1.ID
ds1.role['ModelID'] = ds1.ID
ds1.role['datasource'] = ds1.ID
ds1.role['auctioneerID'] = ds1.ID
ds1.role['SalePrice'] = ds1.TARGET
ds1.percent_missing()
UsageBand                   0.826391
MachineHoursCurrentMeter    0.644089
auctioneerID                0.050199
fiBaseModel                 0.000000
fiModelDesc                 0.000000
saledate                    0.000000
YearMade                    0.000000
datasource                  0.000000
ModelID                     0.000000
MachineID                   0.000000
SalePrice                   0.000000
SalesID                     0.000000
Columns with more than 50% missing values are automatically rejected, but the ones that are not rejected (and are not IDs) look good.
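For reference, a percent_missing-style summary is simple to reproduce in plain pandas (a sketch of the idea, not copper's actual implementation):

```python
import pandas as pd

# Fraction of nulls per column, sorted descending
df = pd.DataFrame({'UsageBand': ['Low', None, None],
                   'YearMade': [2004, 1996, 2001]})
missing = df.isnull().mean().sort_values(ascending=False)
```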
Transforming a date
The saledate field could be useful, but it is text, so it is necessary to convert it into a Python datetime. For machine learning it is also useful to convert dates into numbers; the most common approach is the Julian Date method.
Original data:
ds1['saledate'].head(2)
0    11/16/2006 0:00
1     3/26/2004 0:00
Name: saledate
Convert it into a datetime giving the correct format.
ds1['saledate'] = ds1['saledate'].apply(copper.transform.strptime, args='%m/%d/%Y %H:%M')
ds1['saledate'].head(2)
0    2006-11-16 00:00:00
1    2004-03-26 00:00:00
Name: saledate
Convert the dates into numbers
ds1['saledate'] = ds1['saledate'].apply(copper.transform.date_to_number)
ds1['saledate'].head(2)
0    13468
1    12503
Name: saledate
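Judging from the output, date_to_number seems to count days since the Unix epoch (1970-01-01); that is an assumption on my part, but an equivalent transformation in plain pandas would be:

```python
import pandas as pd

raw = pd.Series(['11/16/2006 0:00', '3/26/2004 0:00'])
dates = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

# Days elapsed since the Unix epoch
days = (dates - pd.Timestamp('1970-01-01')).dt.days
# -> 0    13468
#    1    12503
```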
Change the type metadata of saledate because it is now a Number, not a Category.
ds1.type['saledate'] = ds1.NUMBER
ds1.metadata
Role | Type | dtype | |
---|---|---|---|
SalesID | ID | Number | int64 |
SalePrice | Target | Number | int64 |
MachineID | ID | Number | int64 |
ModelID | ID | Number | int64 |
datasource | ID | Number | int64 |
auctioneerID | ID | Number | float64 |
YearMade | Input | Number | int64 |
MachineHoursCurrentMeter | Reject | Number | float64 |
UsageBand | Reject | Category | object |
saledate | Input | Number | int64 |
fiModelDesc | Input | Category | object |
fiBaseModel | Input | Category | object |
Columns 2
Select the next set of columns, print the first values, and look at the percent of missing values.
cols2 = train[train.columns[12:20]]
cols2.head(7)
fiSecondaryDesc | fiModelSeries | fiModelDescriptor | ProductSize | fiProductClassDesc | state | ProductGroup | ProductGroupDesc | |
---|---|---|---|---|---|---|---|---|
0 | D | NaN | NaN | NaN | Wheel Loader - 110.0 to 120.0 Horsepower | Alabama | WL | Wheel Loader |
1 | F | II | NaN | Medium | Wheel Loader - 150.0 to 175.0 Horsepower | North Carolina | WL | Wheel Loader |
2 | NaN | NaN | NaN | NaN | Skid Steer Loader - 1351.0 to 1601.0 Lb Operat... | New York | SSL | Skid Steer Loaders |
3 | NaN | -6E | NaN | Small | Hydraulic Excavator, Track - 12.0 to 14.0 Metr... | Texas | TEX | Track Excavators |
4 | NaN | NaN | NaN | NaN | Skid Steer Loader - 1601.0 to 1751.0 Lb Operat... | New York | SSL | Skid Steer Loaders |
5 | G | NaN | NaN | NaN | Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... | Arizona | BL | Backhoe Loaders |
6 | E | NaN | LC | Large / Medium | Hydraulic Excavator, Track - 21.0 to 24.0 Metr... | Florida | TEX | Track Excavators |
ds2 = copper.Dataset(cols2)
ds2.percent_missing()
fiModelSeries         0.858129
fiModelDescriptor     0.820707
ProductSize           0.525460
fiSecondaryDesc       0.342016
ProductGroupDesc      0.000000
ProductGroup          0.000000
state                 0.000000
fiProductClassDesc    0.000000
In the first dataset we have one column (fiModelDesc) that is divided into two variables (fiBaseModel and fiSecondaryDesc), so we do not need the original variable. We also have ProductGroup and ProductGroupDesc, which have the same information, so I reject one of them.
Finally we have fiProductClassDesc, which could be useful but needs more transformation. Initially I was thinking of extracting the numbers from the field, but some values are Lbs and some are Horsepower, so it is not that simple; for now I reject it.
ds1.role['fiModelDesc'] = ds1.REJECTED
ds2.role['ProductGroupDesc'] = ds2.REJECTED
ds2.role['fiProductClassDesc'] = ds2.REJECTED
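If I come back to fiProductClassDesc later, one option would be to parse the range and keep the unit as a separate categorical. A hypothetical helper (not part of copper) could look like this:

```python
import re

# Parses strings like 'Wheel Loader - 110.0 to 120.0 Horsepower'
# into (low, high, unit); returns None when no range is present.
_range = re.compile(r'(?P<lo>\d+\.?\d*) to (?P<hi>\d+\.?\d*) (?P<unit>.+)')

def parse_class_desc(desc):
    m = _range.search(desc)
    if m is None:
        return None
    return float(m.group('lo')), float(m.group('hi')), m.group('unit')

parse_class_desc('Wheel Loader - 110.0 to 120.0 Horsepower')
# -> (110.0, 120.0, 'Horsepower')
```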
Since I have rejected so many variables, I gave a chance to ProductSize, since it is just above the missing-values threshold.
ds2.role['ProductSize'] = ds2.INPUT
set(ds2['ProductSize'])
set([nan, 'Mini', 'Medium', 'Large / Medium', 'Compact', 'Large', 'Small'])
Looks quite clean so no problems there. The final metadata for this section looks like this.
ds2.metadata
Role | Type | dtype | |
---|---|---|---|
fiSecondaryDesc | Input | Category | object |
fiModelSeries | Reject | Category | object |
fiModelDescriptor | Reject | Category | object |
ProductSize | Input | Category | object |
fiProductClassDesc | Reject | Category | object |
state | Input | Category | object |
ProductGroup | Input | Category | object |
ProductGroupDesc | Reject | Category | object |
Columns 3
Print the first rows of the next section and see the missing values.
cols3 = train[train.columns[20:31]]
cols3.head(5)
Drive_System | Enclosure | Forks | Pad_Type | Ride_Control | Stick | Transmission | Turbocharged | Blade_Extension | Blade_Width | Enclosure_Type | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | EROPS w AC | None or Unspecified | NaN | None or Unspecified | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | EROPS w AC | None or Unspecified | NaN | None or Unspecified | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | OROPS | None or Unspecified | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | EROPS w AC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | EROPS | None or Unspecified | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ds3 = copper.Dataset(cols3)
ds3.percent_missing()
Enclosure_Type     0.937129
Blade_Width        0.937129
Blade_Extension    0.937129
Turbocharged       0.802720
Stick              0.802720
Pad_Type           0.802720
Drive_System       0.739829
Ride_Control       0.629527
Transmission       0.543210
Forks              0.521154
Enclosure          0.000810
By default only Enclosure will be an input, so let's see if it needs to be cleaned.
set(ds3['Enclosure'])
set([nan, 'None or Unspecified', 'OROPS', 'EROPS w AC', 'NO ROPS', 'EROPS AC', 'EROPS'])
I believe 'EROPS w AC' and 'EROPS AC' are the same; I should probably check the competition forum, but for now let's say I am right. I also change 'None or Unspecified' to nan.
import numpy as np

ds3['Enclosure'][ds3['Enclosure'] == 'EROPS w AC'] = 'EROPS AC'
ds3['Enclosure'][ds3['Enclosure'] == 'None or Unspecified'] = np.nan
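The same recoding can be done in plain pandas with Series.replace, which also avoids chained-indexing assignment (a sketch on toy data):

```python
import numpy as np
import pandas as pd

s = pd.Series(['EROPS w AC', 'None or Unspecified', 'OROPS'])
s = s.replace({'EROPS w AC': 'EROPS AC',
               'None or Unspecified': np.nan})
```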
Since I created some new nans, let's see if that changes the missing values.
ds3.percent_missing()['Enclosure']
0.00081520722966654802
Let's take a quick look at Forks, which is just above the rejection limit.
set(ds3['Forks'])
set([nan, 'Yes', 'None or Unspecified'])
ds3['Forks'][ds3['Forks'] == 'None or Unspecified'] = np.nan
ds3.percent_missing()['Forks']
0.96565409784979739
Changing 'None or Unspecified' to nan makes the variable almost all missing values, so it is not useful at all. The final metadata for this section is:
ds3.metadata
Role | Type | dtype | |
---|---|---|---|
Drive_System | Reject | Category | object |
Enclosure | Input | Category | object |
Forks | Reject | Category | object |
Pad_Type | Reject | Category | object |
Ride_Control | Reject | Category | object |
Stick | Reject | Category | object |
Transmission | Reject | Category | object |
Turbocharged | Reject | Category | object |
Blade_Extension | Reject | Category | object |
Blade_Width | Reject | Category | object |
Enclosure_Type | Reject | Category | object |
Columns 4
Let's see the remaining columns and their missing values.
ds4 = copper.Dataset(train[train.columns[31:]])
ds4.percent_missing()
Tip_Control                0.937129
Pushblock                  0.937129
Engine_Horsepower          0.937129
Scarifier                  0.937102
Hydraulics_Flow            0.891899
Grouser_Tracks             0.891899
Coupler_System             0.891660
Steering_Controls          0.827064
Differential_Type          0.826959
Backhoe_Mounting           0.803872
Blade_Type                 0.800977
Travel_Controls            0.800975
Tire_Size                  0.763869
Grouser_Type               0.752813
Track_Type                 0.752813
Pattern_Changer            0.752651
Stick_Length               0.752651
Thumb                      0.752476
Undercarriage_Pad_Width    0.751020
Ripper                     0.740388
Coupler                    0.466620
Hydraulics                 0.200823
Almost everything is rejected. Let's take a look at what is not rejected.
set(ds4['Coupler'])
set([nan, 'None or Unspecified', 'Manual', 'Hydraulic'])
set(ds4['Hydraulics'])
set([nan, 'None or Unspecified', 'Base + 3 Function', 'Auxiliary', 'Base + 4 Function', 'Base + 1 Function', 'Standard', 'Base + 6 Function', '4 Valve', '3 Valve', '2 Valve', 'Base + 5 Function', 'Base + 2 Function'])
We need to change 'None or Unspecified' to nan and see what happens.
ds4['Coupler'][ds4['Coupler'] == 'None or Unspecified'] = np.nan
ds4['Hydraulics'][ds4['Hydraulics'] == 'None or Unspecified'] = np.nan
ds4.percent_missing()[['Coupler', 'Hydraulics']]
Coupler       0.926781
Hydraulics    0.200848
Hydraulics keeps a low percentage of missing values, but Coupler now has a huge number of missing values, so I reject it.
ds4.role['Coupler'] = ds4.REJECTED
Join
With all the datasets ready, we can join them into one big dataset.
ds = copper.join(ds1, ds2, others=[ds3, ds4])
Let's just check the number of rows and columns.
len(ds), len(ds.columns)
(401125, 53)
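Conceptually, this kind of column-wise join is a pd.concat along axis=1, assuming the datasets share the same row index (a sketch of the idea, not copper's actual implementation):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})

# Stack the column blocks side by side on the shared index
joined = pd.concat([df1, df2], axis=1)
```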
We have everything, but I am not going to use the rejected columns for machine learning, so let's filter the data, taking only the Inputs and Targets.
Note: by default the filter method returns a pandas.DataFrame, but in this case I want a Dataset, so I use the ret_ds parameter.
ds = ds.filter(role=['Input', 'Target'], ret_ds=True)
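Role-based filtering amounts to selecting columns by a lookup; a minimal sketch in plain pandas (the roles dict here is hypothetical, standing in for the Dataset metadata):

```python
import pandas as pd

df = pd.DataFrame({'SalesID': [1139246],
                   'SalePrice': [66000],
                   'YearMade': [2004]})
roles = {'SalesID': 'ID', 'SalePrice': 'Target', 'YearMade': 'Input'}

# Keep only the Input and Target columns
keep = [c for c in df.columns if roles[c] in ('Input', 'Target')]
filtered = df[keep]
```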
Just a final check on the first values and missing values.
ds.head()
SalePrice | YearMade | saledate | fiBaseModel | fiSecondaryDesc | ProductSize | state | ProductGroup | Enclosure | Hydraulics | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 66000 | 2004 | 13468 | 521 | D | NaN | Alabama | WL | EROPS AC | 2 Valve |
1 | 57000 | 1996 | 12503 | 950 | F | Medium | North Carolina | WL | EROPS AC | 2 Valve |
2 | 10000 | 2001 | 12474 | 226 | NaN | NaN | New York | SSL | OROPS | Auxiliary |
3 | 38500 | 2001 | 15113 | PC120 | NaN | Small | Texas | TEX | EROPS AC | 2 Valve |
4 | 11000 | 2007 | 14448 | S175 | NaN | NaN | New York | SSL | EROPS | Auxiliary |
ds.percent_missing()
ProductSize        0.525460
fiSecondaryDesc    0.342016
Hydraulics         0.200848
Enclosure          0.000815
ProductGroup       0.000000
state              0.000000
fiBaseModel        0.000000
saledate           0.000000
YearMade           0.000000
SalePrice          0.000000
Finally save the (pickled) Dataset for future use.
copper.save(ds, 'cleaned')
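copper.save presumably pickles the Dataset under the project path; in plain Python the round trip looks like this (a sketch with a throwaway object and a temp file):

```python
import os
import pickle
import tempfile

obj = {'cleaned': True}
path = os.path.join(tempfile.gettempdir(), 'cleaned.pkl')

# Save, then load it back to verify the round trip
with open(path, 'wb') as f:
    pickle.dump(obj, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)
```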
Conclusion
Removing and adding columns using copper Datasets is really easy, and the new join feature makes this kind of task even easier.
What I don't like is that I removed 43 columns, and I don't have very high expectations using only the remaining 10 for machine learning. Usually dates are not that useful, but we will see; I prefer to have fewer columns with good information than 43 columns with 95% missing values. That is the data available, so I guess there is not much more to do.
I created a repo for this Kaggle competition; there are only 30 days left, so I want to make a few submissions. Stay tuned for imputation and some machine learning. The code for copper is on GitHub too.