ReproduceIt: FiveThirtyEight - How Baltimore’s Young Black Men Are Boxed In
ReproduceIt is a series of articles that reproduce the results of data analysis articles, with a focus on open data and open code. All the code and data are available on GitHub: reproduceit-538-baltimore-black-income. This post contains a more verbose version of the content and will probably become outdated, while the GitHub version may be updated with fixes.
For this second article I (again) took an article from one of my favorite websites, FiveThirtyEight. In this case I chose Ben Casselman's "How Baltimore’s Young Black Men Are Boxed In", which I found interesting given the recent events in the US, especially in Baltimore, Maryland.
The article analyzes the income gap between white and black people in cities across the US. The data source is the "American Community Survey", which is available through the "American Fact Finder".
Unlike in the previous ReproduceIt article, it is not possible to crawl this app, since it requires user input to select the desired data. The data used for this analysis is available with the code in the GitHub repo.
As usual we print some versions of the software and libraries used for future reproducibility.
import sys
sys.version_info
sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
import numpy as np
np.__version__
'1.9.2'
import pandas as pd
pd.__version__
'0.16.0'
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
matplotlib.__version__
'1.4.3'
Cleaning data
Most of the work for this article went into cleaning the data, which was not really hard: it only required removing some columns and computing a weighted average. For all those operations I am using pandas.
Black
black = pd.read_csv('BlackIncome/ACS_13_5YR_B19001B_with_ann.csv', encoding='cp1252', skiprows=[0])
black.head()
Id | Id2 | Geography | Estimate; Total: | Margin of Error; Total: | Estimate; Total: - Less than $10,000 | Margin of Error; Total: - Less than $10,000 | Estimate; Total: - $10,000 to $14,999 | Margin of Error; Total: - $10,000 to $14,999 | Estimate; Total: - $15,000 to $19,999 | ... | Estimate; Total: - $75,000 to $99,999 | Margin of Error; Total: - $75,000 to $99,999 | Estimate; Total: - $100,000 to $124,999 | Margin of Error; Total: - $100,000 to $124,999 | Estimate; Total: - $125,000 to $149,999 | Margin of Error; Total: - $125,000 to $149,999 | Estimate; Total: - $150,000 to $199,999 | Margin of Error; Total: - $150,000 to $199,999 | Estimate; Total: - $200,000 or more | Margin of Error; Total: - $200,000 or more | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1600000US0100100 | 100100 | Abanda CDP, Alabama | 0 | 11 | 0 | 11 | 0 | 11 | 0 | ... | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 |
1 | 1600000US0100124 | 100124 | Abbeville city, Alabama | 410 | 90 | 81 | 42 | 47 | 35 | 120 | ... | 4 | 8 | 0 | 11 | 12 | 15 | 0 | 11 | 0 | 11 |
2 | 1600000US0100460 | 100460 | Adamsville city, Alabama | 585 | 116 | 40 | 31 | 11 | 18 | 9 | ... | 112 | 62 | 38 | 41 | 33 | 47 | 5 | 8 | 0 | 11 |
3 | 1600000US0100484 | 100484 | Addison town, Alabama | 0 | 11 | 0 | 11 | 0 | 11 | 0 | ... | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 |
4 | 1600000US0100676 | 100676 | Akron town, Alabama | 118 | 37 | 26 | 17 | 18 | 17 | 6 | ... | 0 | 11 | 0 | 11 | 8 | 12 | 0 | 11 | 0 | 11 |
5 rows × 37 columns
black.set_index('Geography', inplace=True)
black.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
margin_cols = [col for col in black.columns if col.startswith('Margin of Error')]
black.drop(margin_cols, axis=1, inplace=True)
black.head()
Geography | Estimate; Total: - Less than $10,000 | Estimate; Total: - $10,000 to $14,999 | Estimate; Total: - $15,000 to $19,999 | Estimate; Total: - $20,000 to $24,999 | Estimate; Total: - $25,000 to $29,999 | Estimate; Total: - $30,000 to $34,999 | Estimate; Total: - $35,000 to $39,999 | Estimate; Total: - $40,000 to $44,999 | Estimate; Total: - $45,000 to $49,999 | Estimate; Total: - $50,000 to $59,999 | Estimate; Total: - $60,000 to $74,999 | Estimate; Total: - $75,000 to $99,999 | Estimate; Total: - $100,000 to $124,999 | Estimate; Total: - $125,000 to $149,999 | Estimate; Total: - $150,000 to $199,999 | Estimate; Total: - $200,000 or more |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abanda CDP, Alabama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Abbeville city, Alabama | 81 | 47 | 120 | 7 | 30 | 5 | 47 | 16 | 15 | 6 | 20 | 4 | 0 | 12 | 0 | 0 |
Adamsville city, Alabama | 40 | 11 | 9 | 11 | 28 | 109 | 17 | 61 | 21 | 52 | 38 | 112 | 38 | 33 | 5 | 0 |
Addison town, Alabama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Akron town, Alabama | 26 | 18 | 6 | 16 | 14 | 21 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 |
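Since the data is given in income intervals, I used a simple weighted average to get a single number per city: each interval is represented by one value (roughly its midpoint, with the boundary value used for the lowest and highest intervals) and weighted by the estimated count in that interval.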
weights = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]
weights = pd.Series(weights, index=black.columns)
def weight_average(x):
    return (x * weights).sum() / x.sum()
black.head().apply(weight_average, axis=1)
Geography
Abanda CDP, Alabama                  NaN
Abbeville city, Alabama     27993.902439
Adamsville city, Alabama    58901.709402
Addison town, Alabama                NaN
Akron town, Alabama         29576.271186
dtype: float64
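As a quick sanity check, here is a minimal pure-Python sketch of the same weighted average, applied to the Abbeville row shown above. The names `BRACKET_VALUES` and `weighted_average` are introduced only for this illustration; the values are copied from the `weights` list and the counts from the table. A place with no counts at all yields NaN, which matches the pandas output.

```python
# Representative income per bracket, copied from the `weights` list above.
BRACKET_VALUES = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
                  47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]


def weighted_average(counts, values=BRACKET_VALUES):
    """Return sum(count * value) / sum(count), or NaN when all counts are zero."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    return sum(c * v for c, v in zip(counts, values)) / total


# Estimated counts per bracket for Abbeville city, Alabama (from the table above).
abbeville = [81, 47, 120, 7, 30, 5, 47, 16, 15, 6, 20, 4, 0, 12, 0, 0]
abbeville_average = weighted_average(abbeville)
print(round(abbeville_average, 6))  # 27993.902439

# Abanda CDP has zero counts in every bracket, so there is nothing to average.
abanda_average = weighted_average([0] * 16)
print(abanda_average)  # nan
```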
estimate_cols = [col for col in black.columns if col.startswith('Estimate; Total:')]
black['average'] = black[estimate_cols].apply(weight_average, axis=1)
black.head(2)
Geography | Estimate; Total: - Less than $10,000 | Estimate; Total: - $10,000 to $14,999 | Estimate; Total: - $15,000 to $19,999 | Estimate; Total: - $20,000 to $24,999 | Estimate; Total: - $25,000 to $29,999 | Estimate; Total: - $30,000 to $34,999 | Estimate; Total: - $35,000 to $39,999 | Estimate; Total: - $40,000 to $44,999 | Estimate; Total: - $45,000 to $49,999 | Estimate; Total: - $50,000 to $59,999 | Estimate; Total: - $60,000 to $74,999 | Estimate; Total: - $75,000 to $99,999 | Estimate; Total: - $100,000 to $124,999 | Estimate; Total: - $125,000 to $149,999 | Estimate; Total: - $150,000 to $199,999 | Estimate; Total: - $200,000 or more | average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Abanda CDP, Alabama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
Abbeville city, Alabama | 81 | 47 | 120 | 7 | 30 | 5 | 47 | 16 | 15 | 6 | 20 | 4 | 0 | 12 | 0 | 0 | 27993.902439 |
black.average.hist(bins=25, color='black')
<matplotlib.axes._subplots.AxesSubplot at 0x10cf75080>
White
white = pd.read_csv('WhiteIncome/ACS_13_5YR_B19001A_with_ann.csv', encoding='cp1252', skiprows=[0])
white.head(2)
Id | Id2 | Geography | Estimate; Total: | Margin of Error; Total: | Estimate; Total: - Less than $10,000 | Margin of Error; Total: - Less than $10,000 | Estimate; Total: - $10,000 to $14,999 | Margin of Error; Total: - $10,000 to $14,999 | Estimate; Total: - $15,000 to $19,999 | ... | Estimate; Total: - $75,000 to $99,999 | Margin of Error; Total: - $75,000 to $99,999 | Estimate; Total: - $100,000 to $124,999 | Margin of Error; Total: - $100,000 to $124,999 | Estimate; Total: - $125,000 to $149,999 | Margin of Error; Total: - $125,000 to $149,999 | Estimate; Total: - $150,000 to $199,999 | Margin of Error; Total: - $150,000 to $199,999 | Estimate; Total: - $200,000 or more | Margin of Error; Total: - $200,000 or more | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1600000US0100100 | 100100 | Abanda CDP, Alabama | 23 | 25 | 0 | 11 | 11 | 17 | 0 | ... | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 | 0 | 11 |
1 | 1600000US0100124 | 100124 | Abbeville city, Alabama | 580 | 107 | 48 | 37 | 37 | 21 | 66 | ... | 59 | 37 | 50 | 28 | 13 | 13 | 8 | 12 | 3 | 6 |
2 rows × 37 columns
white.set_index('Geography', inplace=True)
white.drop(['Id', 'Id2', 'Estimate; Total:'], axis=1, inplace=True)
margin_cols = [col for col in white.columns if col.startswith('Margin of Error')]
white.drop(margin_cols, axis=1, inplace=True)
estimate_cols = [col for col in white.columns if col.startswith('Estimate; Total:')]
white['average'] = white[estimate_cols].apply(weight_average, axis=1)
white.average.hist(bins=25, color='black')
<matplotlib.axes._subplots.AxesSubplot at 0x109eeb940>
These two histograms were not in the original article, but they are a nice extra for seeing how white and black incomes compare in general.
Combined
black_and_white = black[['average']].join(white[['average']], lsuffix='_black', rsuffix='_white')
black_and_white.head()
Geography | average_black | average_white |
---|---|---|
Abanda CDP, Alabama | NaN | 22934.782609 |
Abbeville city, Alabama | 27993.902439 | 49422.413793 |
Adamsville city, Alabama | 58901.709402 | 53250.750751 |
Addison town, Alabama | NaN | 44384.328358 |
Akron town, Alabama | 29576.271186 | 29398.148148 |
black_and_white['gap'] = black_and_white.average_white - black_and_white.average_black
ax = black_and_white.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
black_and_white.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x10a518dd8>
This scatter plot shows the distribution of average black income vs. the white-black income gap, as in the original article.
The only difference is that this plot shows all cities in the US. The vertical bars that appear are a consequence of the simple weighted average explained before.
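One way to see where those vertical bars come from: when a place has only a few counts, they often fall in a single bracket, so the average lands exactly on that bracket's fixed value no matter how many there are. The sketch below uses made-up counts for two hypothetical towns; `BRACKET_VALUES` and `weighted_average` are illustrative names, not part of the original notebook.

```python
# Representative income per bracket, copied from the `weights` list used earlier.
BRACKET_VALUES = [10000, 12500, 17500, 22500, 27500, 32500, 37500, 42500,
                  47500, 55000, 67500, 87500, 112500, 137500, 187500, 200000]


def weighted_average(counts, values=BRACKET_VALUES):
    """Return sum(count * value) / sum(count), or NaN when all counts are zero."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    return sum(c * v for c, v in zip(counts, values)) / total


# Hypothetical towns: all counts fall in the $20,000 to $24,999 bracket (index 3).
town_a = [0, 0, 0, 5] + [0] * 12
town_b = [0, 0, 0, 12] + [0] * 12

averages = [weighted_average(town_a), weighted_average(town_b)]
print(averages)  # [22500.0, 22500.0]
```

Both towns end up at the same x coordinate in the scatter plot, while their gap values can differ, which stacks them into a vertical line.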
Subset: 10% of black population
The original article plotted only the cities where more than 10% of the population is black, which removes some noise and results in a cleaner scatter plot.
races = pd.read_csv('races/ACS_13_5YR_B02001_with_ann.csv', encoding='cp1252', skiprows=[0])
races = races[['Geography', 'Estimate; Total:', 'Estimate; Total: - Black or African American alone']]
races = races.set_index('Geography')
black_percentage = races['Estimate; Total: - Black or African American alone'] / races['Estimate; Total:']
subset = black_and_white[black_percentage > 0.1]
ax = subset.dropna().plot(kind='scatter', x='average_black', y='gap', color='black', alpha=0.1)
subset.loc[["Baltimore city, Maryland"]].plot(kind='scatter', x='average_black', y='gap', color='red', ax=ax, figsize=(8, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x10f551eb8>
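The filtering step boils down to computing a share per place and keeping the places above the threshold. Here is a minimal sketch with made-up population figures for hypothetical towns; the names `population`, `black_share` and `selected` exist only in this example.

```python
# Hypothetical figures: place name -> (total population, black population).
population = {
    "Town A": (1000, 250),  # 25% black population
    "Town B": (5000, 100),  # 2% black population
    "Town C": (0, 0),       # no population: share is undefined
}


def black_share(total, black):
    """Share of the population that is black, or NaN when the total is zero."""
    return black / total if total else float("nan")


# Keep only places above the 10% threshold; NaN > 0.1 is False, so places
# with an undefined share are dropped as well.
selected = [name for name, (total, black) in population.items()
            if black_share(total, black) > 0.1]
print(selected)  # ['Town A']
```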
Interactive
Using Bokeh it is possible to convert the static matplotlib image into an interactive JavaScript visualization.
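# Assumed imports (not shown in the original post). Note that this rebinds
# `plt` from matplotlib.pyplot to bokeh.plotting.
import bokeh.plotting as plt
from bokeh.models import HoverTool

# Data source that backs the hover tooltips
source = plt.ColumnDataSource(
    data=dict(
        black_income=subset.average_black,
        gap=subset.gap,
        city=subset.index,
    )
)
p = plt.figure(tools='hover,reset,save',
               title='', width=530, height=530,
               x_axis_label="Average Black Income",
               y_axis_label="Black-white income gap")
p.scatter(subset.average_black, subset.gap, size=5, color="black", alpha=0.05, source=source)
# Configure what the hover tool shows for each point
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("City", "@city"),
    ("Average Black Income ", "@black_income"),
    ("B-W income gap", "@gap"),
]
plt.show(p)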
The big cloud in the middle can be a little messy to explore, since there are many points close together and the tooltip will try to show all of them. It is more interesting to look at the outliers, such as Memphis, Alabama in the top left corner or Hensley, Arkansas as the lowest point.
Conclusion
This was a very simple article reproducing the results from the article "How Baltimore’s Young Black Men Are Boxed In". I highly recommend reading the original article for the conclusions the author presents there, since the objective of this article is merely to reproduce the results.
Beyond the simple code, I learned how much data is made available by the US government, in this case through the "American Fact Finder". The website might not be as friendly as other data sources, and you might have to spend a little time finding the data you want, but there is a lot of data, which can lead to simple but interesting analyses like the one in the original article.