# Coursera Data Analysis - Quiz 2 in Python

I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but I have since switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and NumPy. I am not sure I will do this for every week/quiz of the class; we will see.

You can find the questions and solutions for the quiz (quiz2.pdf), the data (ss06hid.csv and ss06pid.csv) and the metadata on the GitHub page.

In [26]:
import numpy as np
import pandas as pd


## Question 2

In [141]:
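# In Python 3, urlopen lives in the urllib.request submodule, which must be imported explicitly
import urllib.request
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')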

In [142]:
lines = []
for i in range(150):
    # read the first 150 lines of the page, one at a time
    lines.append(f.readline())

In [143]:
len(lines[1]), len(lines[44]), len(lines[121])

Out[143]:
(920, 7, 26)

Python keeps the trailing '\n' on each line, so subtract 2 from each length.
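A minimal sketch of how the line endings could be stripped before counting, instead of subtracting by hand. It assumes the lines end in '\r\n', which would explain the 2 extra characters; I did not check that against the real page. The `fake_response` stream and its content are made up so the example runs without a network connection.

```python
import io

# Stand-in for the HTTP response: a small in-memory byte stream (made-up content).
fake_response = io.BytesIO(b"<html>\r\n<head>\r\n<title>demo</title>\r\n")

# readline() keeps the line terminator, exactly like the response object does.
lines = [fake_response.readline() for _ in range(3)]

# Lengths with the terminator included.
raw_lengths = [len(line) for line in lines]

# Lengths after removing any trailing '\r' and '\n' bytes.
clean_lengths = [len(line.rstrip(b"\r\n")) for line in lines]

print(raw_lengths)
print(clean_lengths)
```

Stripping first avoids having to remember how many terminator characters each line carries.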

## Question 3

In [2]:
housing = pd.read_csv('ss06hid.csv')

In [6]:
housing

Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6496 entries, 0 to 6495
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)
In [10]:
len(housing[housing['VAL'] >= 24])

Out[10]:
53
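The count above can also be taken straight from the boolean mask, without building the filtered frame. This is only a sketch on a toy frame with made-up values, not the real ss06hid.csv data.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({"VAL": [24, 10, None, 24, 3]})

# Comparing against a missing value (NaN) yields False, so those rows are not counted.
mask = toy["VAL"] >= 24

count_len = len(toy[mask])   # same approach as above: filter, then count rows
count_sum = int(mask.sum())  # True counts as 1, so the sum is the number of matches

print(count_len, count_sum)
```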

## Question 4

The column holds too much information: family type and employment status.

## Question 5

In [19]:
len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])

Out[19]:
148
In [20]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])

Out[20]:
386
In [21]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])

Out[21]:
49
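Instead of one filter per combination, all the bedroom/room pairs can be counted at once with `pd.crosstab`. The sketch below uses a toy frame with made-up values; only the column names BDS and RMS come from the data above.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({
    "BDS": [3, 3, 2, 2, 2, 4],
    "RMS": [4, 4, 5, 7, 5, 6],
})

# Rows are BDS values, columns are RMS values, cells are how often each pair occurs.
table = pd.crosstab(toy["BDS"], toy["RMS"])

print(table)
print(table.loc[3, 4], table.loc[2, 5], table.loc[2, 7])
```

Each answer is then a single lookup with `table.loc[bedrooms, rooms]`.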

## Question 6

In [60]:
agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
type(agricultureLogical)

Out[60]:
pandas.core.series.Series
In [66]:
np.where(agricultureLogical)

Out[66]:
(array([ 124,  237,  261,  469,  554,  567,  607,  642,  786,  807,  823,
848,  951,  954, 1032, 1264, 1274, 1314, 1387, 1606, 1628, 1650,
1855, 1918, 2100, 2193, 2402, 2442, 2538, 2579, 2654, 2679, 2739,
2837, 2964, 3130, 3132, 3162, 3290, 3369, 3401, 3584, 3651, 3851,
3861, 3911, 4022, 4044, 4106, 4112, 4116, 4184, 4197, 4309, 4342,
4353, 4447, 4452, 4460, 4717, 4816, 4834, 4909, 5139, 5198, 5235,
5325, 5416, 5530, 5573, 5893, 6032, 6043, 6088, 6274, 6375, 6419]),)
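One thing to keep in mind: `np.where` returns row positions, not index labels. With the default index they are the same numbers, which is the case for the housing frame here. The sketch below uses a toy frame with made-up values and a deliberately different index to show the difference.

```python
import numpy as np
import pandas as pd

# Toy stand-in with a non-default index; the values are made up for illustration.
toy = pd.DataFrame(
    {"ACR": [3, 1, 3, np.nan, 3], "AGS": [6, 6, 2, 6, 6]},
    index=[10, 11, 12, 13, 14],
)

# Rows with a missing ACR compare as False and drop out of the mask.
mask = (toy["ACR"] >= 3) & (toy["AGS"] >= 6)

positions = np.flatnonzero(mask.to_numpy())  # zero-based row positions
labels = toy.index[mask]                     # index labels of the same rows

print(positions.tolist())
print(list(labels))
```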

## Question 7

In [67]:
q7subsetDataFrame = housing[agricultureLogical]

In [68]:
q7subsetDataFrame

Out[68]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 124 to 6419
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)
In [71]:
l1 = len(q7subsetDataFrame)
l1

Out[71]:
77
In [72]:
l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2

Out[72]:
69
In [73]:
l1 - l2

Out[73]:
8
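The number of missing values can also be counted directly with `isna().sum()` instead of subtracting two lengths. A small sketch on a toy column with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the subset; the values are made up for illustration.
toy = pd.DataFrame({"MRGX": [1.0, np.nan, 2.0, np.nan, 3.0]})

# Approach used above: total rows minus rows that survive dropna().
missing_by_difference = len(toy) - len(toy["MRGX"].dropna())

# Direct approach: isna() marks missing entries as True, and True sums as 1.
missing_direct = int(toy["MRGX"].isna().sum())

print(missing_by_difference, missing_direct)
```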

## Question 8

In [180]:
splits = []
for col in housing.columns:
    # split each column name on the substring "wgtp"
    splits.append(col.split("wgtp"))

In [182]:
splits[122]

Out[182]:
['', '15']
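The empty string in the result comes from how `str.split` works: it returns the pieces on both sides of the separator, so a name that starts with "wgtp" has an empty piece in front. A short sketch with a shortened, hypothetical list of column names:

```python
# Shortened, hypothetical list of column names (the real frame has 188 columns).
columns = ["RT", "SERIALNO", "wgtp15", "wgtp80"]

# Same split as above, written as a list comprehension.
splits = [col.split("wgtp") for col in columns]

# Look the column up by name rather than by a hard-coded position.
position = columns.index("wgtp15")

print(splits[position])  # pieces before and after "wgtp"
print(splits[0])         # a name without "wgtp" comes back as a single piece
```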

## Question 9

In [82]:
housing['YBL'].quantile(0)

Out[82]:
-1.0
In [83]:
housing['YBL'].quantile(1)

Out[83]:
25.0

Something is wrong here: the minimum is -1 and the maximum is 25, but YBL is 'When structure first built' and the metadata lists only these codes:

• b: N/A (GQ)
• 1: 2005 or later
• 2: 2000 to 2004
• 3: 1990 to 1999
• 4: 1980 to 1989
• 5: 1970 to 1979
• 6: 1960 to 1969
• 7: 1950 to 1959
• 8: 1940 to 1949
• 9: 1939 or earlier
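To see which values fall outside the documented codes, the column can be checked against the list of valid codes with `isin`. This is a sketch on a toy series with made-up values; the valid range 1 to 9 is taken from the code list above.

```python
import pandas as pd

# Toy stand-in for the YBL column; the values are made up for illustration.
toy = pd.Series([1, 5, 9, -1, 25, 3], name="YBL")

# Codes documented in the metadata (missing values are not covered here).
valid_codes = list(range(1, 10))

# Keep only the entries that are not among the documented codes.
unexpected = toy[~toy.isin(valid_codes)]

# quantile(0) and quantile(1) are simply the minimum and the maximum.
lowest, highest = toy.quantile(0), toy.quantile(1)

print(sorted(unexpected.tolist()))
print(lowest, highest)
```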

## Question 10

In [84]:
populations = pd.read_csv('ss06pid.csv')

In [88]:
pd.merge(populations, housing, on='SERIALNO', how='outer')

Out[88]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15451 entries, 0 to 15450
Columns: 426 entries, RT_x to wgtp80
dtypes: float64(333), int64(89), object(4)
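The `RT_x` in the output comes from pandas adding suffixes to columns that exist in both frames but are not part of the merge key. The sketch below shows this on two toy frames with made-up values; the `indicator=True` option is my addition to show where each row came from.

```python
import pandas as pd

# Toy stand-ins for the person and housing files; the values are made up.
people = pd.DataFrame({
    "SERIALNO": [1, 1, 2],
    "RT": ["P", "P", "P"],
    "AGEP": [30, 5, 41],
})
homes = pd.DataFrame({
    "SERIALNO": [1, 2, 3],
    "RT": ["H", "H", "H"],
    "BDS": [3, 2, 4],
})

# An outer merge keeps keys found in either frame; indicator=True adds a
# "_merge" column saying whether a row matched in both frames or only one.
merged = pd.merge(people, homes, on="SERIALNO", how="outer", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

Household 3 has no person records in the toy data, so it shows up as `right_only` with missing values in the person columns.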

## Conclusion

I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has had years to consolidate itself as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language that is fighting on all fronts, which can be good or bad; let's hope it is good.

FYI, I almost didn't get question 8 xD.