Copper - Quick and automatic data transformation

Today I start working on the first assignment for the Coursera's Data Analysis course and I needed to make some basic transformation on the data so decide to include them on copper.

I realize I was making some wrong decisions on the Type metadata on the Dataset. Since the money type is just a number (similar to a percent number on the case of this example) it is not necessary to have it as a different type, just make the transformations I was doing so I decide to remove the Money type, now the only types are Number and Category.

Lets dive into the example: dataset: loansData.csv

Import copper and set the project path

In [1]:
import copper
copper.project.path = '../'

Load the data into a pandas DataFrame and see the first rows

In [2]:
loans = copper.read_csv('loansData.csv')
In [3]:
loans.head()
Out[3]:
Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months Employment.Length
81174 20000 20000 8.90% 36 months debt_consolidation 14.90% SC MORTGAGE 6541.67 735-739 14 14272 2 < 1 year
99592 19200 19200 12.12% 36 months debt_consolidation 28.36% TX MORTGAGE 4583.33 715-719 12 11140 1 2 years
80059 35000 35000 21.98% 60 months debt_consolidation 23.81% CA MORTGAGE 11500.00 690-694 14 21977 1 2 years
15825 10000 9975 9.99% 36 months debt_consolidation 14.30% KS MORTGAGE 3833.33 695-699 10 9346 0 5 years
33182 12000 12000 11.71% 36 months credit_card 18.78% NJ RENT 3195.00 695-699 11 14469 0 9 years

There are some transformations to be done:

  1. Interest.Rate to Number
  2. Loan.Lenght to Number
  3. Debt.To.Income.Ratio to Number
  4. FICO.Rate to Number
  5. Employment.Length to Number

Notice a pattern? xD. Thats where I realize I was doing the wrong about a type only for money columns on the previous examples, is just a number so just define it like number.

Lets load the data into a Dataset and see the metadata

In [4]:
dataset = copper.Dataset(loans)
In [5]:
dataset.metadata
Out[5]:
Role Type dtype
Amount.Requested Input Number int64
Amount.Funded.By.Investors Input Number float64
Interest.Rate Input Category object
Loan.Length Input Category object
Loan.Purpose Input Category object
Debt.To.Income.Ratio Input Category object
State Input Category object
Home.Ownership Input Category object
Monthly.Income Input Number float64
FICO.Range Input Category object
Open.CREDIT.Lines Input Number float64
Revolving.CREDIT.Balance Input Number float64
Inquiries.in.the.Last.6.Months Input Number float64
Employment.Length Input Category object

Now all columns with dtype of float or int are Number and the rest are Category no other unnecessay types such as Money

Let's convert Interest.Rate, Loan.Lenght, Debt.To.Income.Ratio to Numbers

In [6]:
dataset.type['Interest.Rate'] = dataset.NUMBER
dataset.type['Loan.Length'] = dataset.NUMBER
dataset.type['Debt.To.Income.Ratio'] = dataset.NUMBER
In [7]:
dataset.update()
In [8]:
dataset[['Interest.Rate', 'Loan.Length', 'Debt.To.Income.Ratio']].head()
Out[8]:
Interest.Rate Loan.Length Debt.To.Income.Ratio
81174 8.90 36 14.90
99592 12.12 36 28.36
80059 21.98 60 23.81
15825 9.99 36 14.30
33182 11.71 36 18.78

We can see that the values are transformed. Inside copper uses a very simple regular expression to find the first number on each entry so it works for many cases, such as money values, percent values or even custom values such as '36 months' to '36'.

Let's check the metadata,

In [9]:
dataset.metadata
Out[9]:
Role Type dtype
Amount.Requested Input Number int64
Amount.Funded.By.Investors Input Number float64
Interest.Rate Input Number float64
Loan.Length Input Number float64
Loan.Purpose Input Category object
Debt.To.Income.Ratio Input Number float64
State Input Category object
Home.Ownership Input Category object
Monthly.Income Input Number float64
FICO.Range Input Category object
Open.CREDIT.Lines Input Number float64
Revolving.CREDIT.Balance Input Number float64
Inquiries.in.the.Last.6.Months Input Number float64
Employment.Length Input Category object

So far so good. Now is necessary to convert Employment.Length to number to but this column is special. It has two values that the regular expression is going to match as the same: "< 1 years" and "1 year" are both going to be converted to just '1'. We need to fix that. So I convert the '< 1 year' values to '0 years' and then use the copper transformation.

In [10]:
dataset['Employment.Length'][dataset['Employment.Length'] == '< 1 year'] = '0 years'
In [11]:
dataset[['Employment.Length']].head(3) # Checking
Out[11]:
Employment.Length
81174 0 years
99592 2 years
80059 2 years
In [12]:
dataset.type['Employment.Length'] = dataset.NUMBER
In [13]:
dataset.update()
In [14]:
dataset[['Employment.Length']].head(3)
Out[14]:
Employment.Length
81174 0
99592 2
80059 2

The final transformation is the 'Fico.Range' column, on this example is possible to see that the second number is just the first number plus four, so we can just use the same regular expression and get the first number.

But lets say we want to see what would happen on case of using it.

In [15]:
copper.transform.to_number(dataset['FICO.Range']).head()
Out[15]:
81174    735
99592    715
80059    690
15825    695
33182    695
Name: FICO.Range

That is exactly what we want so lets use the transformation.

In [16]:
dataset.type['FICO.Range'] = dataset.NUMBER
In [17]:
dataset.update()

Finally let's see the final table and metadata

In [18]:
dataset.head()
Out[18]:
Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months Employment.Length
81174 20000 20000 8.90 36 debt_consolidation 14.90 SC MORTGAGE 6541.67 735 14 14272 2 0
99592 19200 19200 12.12 36 debt_consolidation 28.36 TX MORTGAGE 4583.33 715 12 11140 1 2
80059 35000 35000 21.98 60 debt_consolidation 23.81 CA MORTGAGE 11500.00 690 14 21977 1 2
15825 10000 9975 9.99 36 debt_consolidation 14.30 KS MORTGAGE 3833.33 695 10 9346 0 5
33182 12000 12000 11.71 36 credit_card 18.78 NJ RENT 3195.00 695 11 14469 0 9
In [19]:
dataset.metadata
Out[19]:
Role Type dtype
Amount.Requested Input Number int64
Amount.Funded.By.Investors Input Number float64
Interest.Rate Input Number float64
Loan.Length Input Number float64
Loan.Purpose Input Category object
Debt.To.Income.Ratio Input Number float64
State Input Category object
Home.Ownership Input Category object
Monthly.Income Input Number float64
FICO.Range Input Number float64
Open.CREDIT.Lines Input Number float64
Revolving.CREDIT.Balance Input Number float64
Inquiries.in.the.Last.6.Months Input Number float64
Employment.Length Input Number float64

The values are correct and the metadata matches the information on the table, we are done. Now is possible to do some machine learning on the data.

Finally let's save the dataset for future use.

In [22]:
copper.save(dataset, 'loans')

As usual the code is on github: copper.