Copper - Quick and automatic data transformation
Today I start working on the first assignment for the Coursera's Data Analysis course and I needed to make some basic transformation on the data so decide to include them on copper.
I realize I was making some wrong decisions on the Type metadata on the Dataset. Since the money type is just a number (similar to a percent number on the case of this example) it is not necessary to have it as a different type, just make the transformations I was doing so I decide to remove the Money type, now the only types are Number and Category.
Lets dive into the example: dataset: loansData.csv
Import copper and set the project path
import copper
copper.project.path = '../'
Load the data into a pandas DataFrame and see the first rows
loans = copper.read_csv('loansData.csv')
loans.head()
Amount.Requested | Amount.Funded.By.Investors | Interest.Rate | Loan.Length | Loan.Purpose | Debt.To.Income.Ratio | State | Home.Ownership | Monthly.Income | FICO.Range | Open.CREDIT.Lines | Revolving.CREDIT.Balance | Inquiries.in.the.Last.6.Months | Employment.Length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81174 | 20000 | 20000 | 8.90% | 36 months | debt_consolidation | 14.90% | SC | MORTGAGE | 6541.67 | 735-739 | 14 | 14272 | 2 | < 1 year |
99592 | 19200 | 19200 | 12.12% | 36 months | debt_consolidation | 28.36% | TX | MORTGAGE | 4583.33 | 715-719 | 12 | 11140 | 1 | 2 years |
80059 | 35000 | 35000 | 21.98% | 60 months | debt_consolidation | 23.81% | CA | MORTGAGE | 11500.00 | 690-694 | 14 | 21977 | 1 | 2 years |
15825 | 10000 | 9975 | 9.99% | 36 months | debt_consolidation | 14.30% | KS | MORTGAGE | 3833.33 | 695-699 | 10 | 9346 | 0 | 5 years |
33182 | 12000 | 12000 | 11.71% | 36 months | credit_card | 18.78% | NJ | RENT | 3195.00 | 695-699 | 11 | 14469 | 0 | 9 years |
There are some transformations to be done:
- Interest.Rate to Number
- Loan.Lenght to Number
- Debt.To.Income.Ratio to Number
- FICO.Rate to Number
- Employment.Length to Number
Notice a pattern? xD. Thats where I realize I was doing the wrong about a type only for money columns on the previous examples, is just a number so just define it like number.
Lets load the data into a Dataset and see the metadata
dataset = copper.Dataset(loans)
dataset.metadata
Role | Type | dtype | |
---|---|---|---|
Amount.Requested | Input | Number | int64 |
Amount.Funded.By.Investors | Input | Number | float64 |
Interest.Rate | Input | Category | object |
Loan.Length | Input | Category | object |
Loan.Purpose | Input | Category | object |
Debt.To.Income.Ratio | Input | Category | object |
State | Input | Category | object |
Home.Ownership | Input | Category | object |
Monthly.Income | Input | Number | float64 |
FICO.Range | Input | Category | object |
Open.CREDIT.Lines | Input | Number | float64 |
Revolving.CREDIT.Balance | Input | Number | float64 |
Inquiries.in.the.Last.6.Months | Input | Number | float64 |
Employment.Length | Input | Category | object |
Now all columns with dtype of float or int are Number and the rest are Category no other unnecessay types such as Money
Let's convert Interest.Rate, Loan.Lenght, Debt.To.Income.Ratio to Numbers
dataset.type['Interest.Rate'] = dataset.NUMBER
dataset.type['Loan.Length'] = dataset.NUMBER
dataset.type['Debt.To.Income.Ratio'] = dataset.NUMBER
dataset.update()
dataset[['Interest.Rate', 'Loan.Length', 'Debt.To.Income.Ratio']].head()
Interest.Rate | Loan.Length | Debt.To.Income.Ratio | |
---|---|---|---|
81174 | 8.90 | 36 | 14.90 |
99592 | 12.12 | 36 | 28.36 |
80059 | 21.98 | 60 | 23.81 |
15825 | 9.99 | 36 | 14.30 |
33182 | 11.71 | 36 | 18.78 |
We can see that the values are transformed. Inside copper uses a very simple regular expression to find the first number on each entry so it works for many cases, such as money values, percent values or even custom values such as '36 months' to '36'.
Let's check the metadata,
dataset.metadata
Role | Type | dtype | |
---|---|---|---|
Amount.Requested | Input | Number | int64 |
Amount.Funded.By.Investors | Input | Number | float64 |
Interest.Rate | Input | Number | float64 |
Loan.Length | Input | Number | float64 |
Loan.Purpose | Input | Category | object |
Debt.To.Income.Ratio | Input | Number | float64 |
State | Input | Category | object |
Home.Ownership | Input | Category | object |
Monthly.Income | Input | Number | float64 |
FICO.Range | Input | Category | object |
Open.CREDIT.Lines | Input | Number | float64 |
Revolving.CREDIT.Balance | Input | Number | float64 |
Inquiries.in.the.Last.6.Months | Input | Number | float64 |
Employment.Length | Input | Category | object |
So far so good. Now is necessary to convert Employment.Length
to number to but this column is special. It has two values that the regular expression is going to match as the same: "< 1 years" and "1 year" are both going to be converted to just '1'. We need to fix that.
So I convert the '< 1 year' values to '0 years' and then use the copper transformation.
dataset['Employment.Length'][dataset['Employment.Length'] == '< 1 year'] = '0 years'
dataset[['Employment.Length']].head(3) # Checking
Employment.Length | |
---|---|
81174 | 0 years |
99592 | 2 years |
80059 | 2 years |
dataset.type['Employment.Length'] = dataset.NUMBER
dataset.update()
dataset[['Employment.Length']].head(3)
Employment.Length | |
---|---|
81174 | 0 |
99592 | 2 |
80059 | 2 |
The final transformation is the 'Fico.Range' column, on this example is possible to see that the second number is just the first number plus four, so we can just use the same regular expression and get the first number.
But lets say we want to see what would happen on case of using it.
copper.transform.to_number(dataset['FICO.Range']).head()
81174 735 99592 715 80059 690 15825 695 33182 695 Name: FICO.Range
That is exactly what we want so lets use the transformation.
dataset.type['FICO.Range'] = dataset.NUMBER
dataset.update()
Finally let's see the final table and metadata
dataset.head()
Amount.Requested | Amount.Funded.By.Investors | Interest.Rate | Loan.Length | Loan.Purpose | Debt.To.Income.Ratio | State | Home.Ownership | Monthly.Income | FICO.Range | Open.CREDIT.Lines | Revolving.CREDIT.Balance | Inquiries.in.the.Last.6.Months | Employment.Length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81174 | 20000 | 20000 | 8.90 | 36 | debt_consolidation | 14.90 | SC | MORTGAGE | 6541.67 | 735 | 14 | 14272 | 2 | 0 |
99592 | 19200 | 19200 | 12.12 | 36 | debt_consolidation | 28.36 | TX | MORTGAGE | 4583.33 | 715 | 12 | 11140 | 1 | 2 |
80059 | 35000 | 35000 | 21.98 | 60 | debt_consolidation | 23.81 | CA | MORTGAGE | 11500.00 | 690 | 14 | 21977 | 1 | 2 |
15825 | 10000 | 9975 | 9.99 | 36 | debt_consolidation | 14.30 | KS | MORTGAGE | 3833.33 | 695 | 10 | 9346 | 0 | 5 |
33182 | 12000 | 12000 | 11.71 | 36 | credit_card | 18.78 | NJ | RENT | 3195.00 | 695 | 11 | 14469 | 0 | 9 |
dataset.metadata
Role | Type | dtype | |
---|---|---|---|
Amount.Requested | Input | Number | int64 |
Amount.Funded.By.Investors | Input | Number | float64 |
Interest.Rate | Input | Number | float64 |
Loan.Length | Input | Number | float64 |
Loan.Purpose | Input | Category | object |
Debt.To.Income.Ratio | Input | Number | float64 |
State | Input | Category | object |
Home.Ownership | Input | Category | object |
Monthly.Income | Input | Number | float64 |
FICO.Range | Input | Number | float64 |
Open.CREDIT.Lines | Input | Number | float64 |
Revolving.CREDIT.Balance | Input | Number | float64 |
Inquiries.in.the.Last.6.Months | Input | Number | float64 |
Employment.Length | Input | Number | float64 |
The values are correct and the metadata matches the information on the table, we are done. Now is possible to do some machine learning on the data.
Finally let's save the dataset for future use.
copper.save(dataset, 'loans')
As usual the code is on github: copper.