Coursera Data Analysis - Quiz 2 in Python
I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but I have since switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and NumPy. I am not sure whether I will do this for every quiz in the class; we will see.
You can find the questions and solutions (quiz2.pdf), the data (ss06hid.csv and ss06pid.csv) and the metadata on the GitHub page.
import numpy as np
import pandas as pd
Question 2
import urllib.request
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')
lines = []
for i in range(150):
    lines.append(f.readline())
len(lines[1]), len(lines[44]), len(lines[121])
(920, 7, 26)
Python keeps the trailing '\n' on each line, so subtract 2 from each length.
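A minimal sketch of the same length check without the manual correction, assuming the only extra characters are the line terminators. The sample bytes are made up so the snippet runs offline; with the real page, `f` would be the object returned by `urlopen`.

```python
import io

# Made-up stand-in for the HTTP response; urlopen() also yields bytes lines.
f = io.BytesIO(b"<html>\r\n<head>\n<title>demo</title>")

# readline() keeps the line terminator, so strip it before measuring.
raw_lines = [f.readline() for _ in range(3)]
lengths = [len(line.rstrip(b"\r\n")) for line in raw_lines]
print(lengths)
```

Stripping both `\r` and `\n` covers Windows-style and Unix-style endings, so the counts do not depend on how the server terminates lines.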
Question 3
housing = pd.read_csv('ss06hid.csv')
housing
<class 'pandas.core.frame.DataFrame'> Int64Index: 6496 entries, 0 to 6495 Columns: 188 entries, RT to wgtp80 dtypes: float64(97), int64(90), object(1)
len(housing[housing['VAL'] >= 24])
53
Question 4
The column holds too much information: family type and employment status.
Question 5
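One way to picture the fix is to split the combined code into one column per variable. The codes and labels in this sketch are invented for illustration and are not the real codebook values.

```python
import pandas as pd

# Hypothetical combined codes: each one mixes family type and employment.
combined = pd.DataFrame({"code": [1, 2, 3, 1]})

# Hypothetical lookup table: one row per code, one column per variable.
lookup = pd.DataFrame(
    {
        "code": [1, 2, 3],
        "family_type": ["married couple", "married couple", "single parent"],
        "employment": ["both working", "one working", "working"],
    }
)

# Joining on the code gives a frame with one variable per column.
tidy = combined.merge(lookup, on="code", how="left")
print(tidy)
```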
len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])
148
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])
386
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])
49
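The three filters above could probably be replaced by a single cross-tabulation. This is a sketch on a tiny made-up sample that only reuses the BDS and RMS column names; the counts are not the quiz answers.

```python
import pandas as pd

# Tiny made-up sample using the same column names as the housing file.
sample = pd.DataFrame(
    {
        "BDS": [3, 3, 2, 2, 2, 2],
        "RMS": [4, 4, 5, 5, 5, 7],
    }
)

# Rows are bedroom counts, columns are room counts, cells are row counts.
table = pd.crosstab(sample["BDS"], sample["RMS"])
print(table)

# Look up a single combination, here 2 bedrooms and 5 rooms.
two_bed_five_rooms = table.loc[2, 5]
print(two_bed_five_rooms)
```

With the real data, `pd.crosstab(housing['BDS'], housing['RMS'])` should give every combination at once.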
Question 6
agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
type(agricultureLogical)
pandas.core.series.Series
np.where(agricultureLogical)
(array([ 124, 237, 261, 469, 554, 567, 607, 642, 786, 807, 823, 848, 951, 954, 1032, 1264, 1274, 1314, 1387, 1606, 1628, 1650, 1855, 1918, 2100, 2193, 2402, 2442, 2538, 2579, 2654, 2679, 2739, 2837, 2964, 3130, 3132, 3162, 3290, 3369, 3401, 3584, 3651, 3851, 3861, 3911, 4022, 4044, 4106, 4112, 4116, 4184, 4197, 4309, 4342, 4353, 4447, 4452, 4460, 4717, 4816, 4834, 4909, 5139, 5198, 5235, 5325, 5416, 5530, 5573, 5893, 6032, 6043, 6088, 6274, 6375, 6419]),)
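One detail worth checking is how missing values behave in the mask. The sketch below uses made-up rows with the same column names; comparisons against NaN come out False, so rows with a missing answer are left out of the result.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN stands for a missing survey answer.
sample = pd.DataFrame(
    {
        "ACR": [3.0, np.nan, 3.0, 1.0],
        "AGS": [6.0, 6.0, np.nan, 6.0],
    }
)

# Comparisons against NaN evaluate to False, so incomplete rows drop out.
mask = (sample["ACR"] >= 3) & (sample["AGS"] >= 6)

# flatnonzero returns the positions of the True entries as a flat array.
positions = np.flatnonzero(mask.to_numpy())
print(positions)
```

Note that these positions are 0-based, whereas R's `which()` counts from 1.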
Question 7
q7subsetDataFrame = housing[agricultureLogical]
q7subsetDataFrame
<class 'pandas.core.frame.DataFrame'> Int64Index: 77 entries, 124 to 6419 Columns: 188 entries, RT to wgtp80 dtypes: float64(97), int64(90), object(1)
l1 = len(q7subsetDataFrame)
l1
77
l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2
69
l1 - l2
8
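The missing values can also be counted directly instead of subtracting two lengths. A minimal sketch on a made-up column that only borrows the MRGX name:

```python
import numpy as np
import pandas as pd

# Made-up column; NaN marks a missing value.
subset = pd.DataFrame({"MRGX": [1.0, np.nan, 3.0, np.nan, 2.0]})

# isna() flags missing entries; summing the booleans counts them.
missing = int(subset["MRGX"].isna().sum())

# The subtraction used above should give the same number.
by_difference = len(subset) - len(subset["MRGX"].dropna())
print(missing, by_difference)
```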
Question 8
splits = []
for col in housing.columns:
    splits.append(col.split("wgtp"))
splits[122]
['', '15']
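The loop can be written as a list comprehension. This sketch uses a made-up frame with a handful of column names in the style of the housing file, so the index 3 here is not the index 122 from the quiz.

```python
import pandas as pd

# Made-up frame with a few column names in the style of the housing file.
sample = pd.DataFrame(columns=["RT", "SERIALNO", "wgtp1", "wgtp15"])

# Split every column name on the literal string "wgtp".
splits = [col.split("wgtp") for col in sample.columns]

# A name starting with "wgtp" leaves an empty string before the separator.
print(splits[3])

# Names without "wgtp" come back as a one-element list.
print(splits[0])
```

The empty first element explains the `['', '15']` result: everything before the separator is kept, even when there is nothing there.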
Question 9
housing['YBL'].quantile(0)
-1.0
housing['YBL'].quantile(1)
25.0
Something is wrong here, because YBL is 'When structure first built' and the codebook only lists these values:
- b .N/A (GQ)
- 1 .2005 or later
- 2 .2000 to 2004
- 3 .1990 to 1999
- 4 .1980 to 1989
- 5 .1970 to 1979
- 6 .1960 to 1969
- 7 .1950 to 1959
- 8 .1940 to 1949
- 9 .1939 or earlier
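To see which values fall outside the codebook list above, something like the following could help. The series is made up; with the real data it would be `housing['YBL']`.

```python
import pandas as pd

# Made-up values standing in for the real YBL column.
ybl = pd.Series([1, 5, 9, -1, 25, 3], name="YBL")

# quantile(0) and quantile(1) are simply the minimum and maximum.
lowest, highest = ybl.quantile(0), ybl.quantile(1)
print(lowest, highest)

# Codes 1 to 9 are the ones listed in the codebook.
valid_codes = list(range(1, 10))
unexpected = ybl[~ybl.isin(valid_codes)]
print(unexpected.value_counts())
```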
Question 10
populations = pd.read_csv('ss06pid.csv')
pd.merge(populations, housing, on='SERIALNO', how='outer')
<class 'pandas.core.frame.DataFrame'> Int64Index: 15451 entries, 0 to 15450 Columns: 426 entries, RT_x to wgtp80 dtypes: float64(333), int64(89), object(4)
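A small sketch of what the outer merge does, using made-up frames that share only the SERIALNO and RT column names with the real files. The other columns are hypothetical. It also shows where names like RT_x come from: columns present on both sides get the `_x` and `_y` suffixes.

```python
import pandas as pd

# Made-up stand-ins for the person and housing files.
people = pd.DataFrame(
    {"SERIALNO": [1, 1, 2], "RT": ["P", "P", "P"], "age": [30, 5, 41]}
)
homes = pd.DataFrame({"SERIALNO": [1, 3], "RT": ["H", "H"], "BDS": [2, 4]})

# An outer join keeps keys from both sides; indicator=True adds a
# "_merge" column saying which side each row came from.
merged = pd.merge(people, homes, on="SERIALNO", how="outer", indicator=True)
print(merged[["SERIALNO", "RT_x", "RT_y", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner join would have dropped.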
Conclusion
I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has had years to consolidate itself as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language fighting on all fronts, which can be good or bad; let's hope it is good.
FYI, I almost didn't get question 8 xD.