This exercise has you apply two things:
The guiding question for this notebook is “is a higher level of college education in the population correlated with income?”
Let's go!
We first need to make sure we have the census and us packages:
%pip install census us
And then we can import:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from census import Census
from us import states
And set up a Census object (replace <your api key> with your own API key):
c = Census('<your api key>')
The census data comes in a variety of files. These files include:
sf1: Summary File 1, containing complete count information from the decennial census.
acs1: American Community Survey, a supplementary survey of a sample of the population, carried out by the Census Bureau every year.
Both of these files are accessed in the same way, via the Census object, but they contain different variables. Each contains thousands of variables.
This notebook focuses on the ACS. The variable list describes these variables, and most of the ones of interest are reported as estimated population counts. That means variable B06009_003E is an estimate, based on the sample, of the number of people in a geographic region whose highest educational attainment is a high school degree.
To fetch data, we need to know three things:
Variables in turn are nested. Many variables are estimated population counts; for these, one variable is the total population, and others are counts within subgroups. Look at the variable list to see how these are described:
B01001_001E is the estimated total population
B01001_002E is the estimated male population
B01001_026E is the estimated female population
There are variables for a lot of different breakdowns.
The API returns a list of dictionaries containing the variables. Let's get the gender population estimate for Idaho:
c.acs1.state(('NAME', 'B01001_001E', 'B01001_002E', 'B01001_026E'), states.ID.fips, year=2014)
Regions are identified by FIPS codes: numeric codes that identify states and counties. Each state has a 2-digit FIPS code, and the us.states module lets us look up a state's FIPS code. (We can also get a table of them.)
Each county's code is a 5-digit number: its state code, followed by 3 digits to identify the county.
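As a small illustration of the 5-digit structure described above, here is a sketch that splits a county code into its state and county parts and joins them back together. The helper names split_county_fips and join_county_fips are made up for this example and are not part of the census or us packages; the code 16001 (Ada County, Idaho) is used only as sample input.

```python
def split_county_fips(county_fips: str) -> tuple[str, str]:
    """Split a 5-digit county FIPS code into (state part, county part)."""
    if len(county_fips) != 5 or not county_fips.isdigit():
        raise ValueError(f"expected a 5-digit code, got {county_fips!r}")
    # First 2 digits identify the state, last 3 identify the county.
    return county_fips[:2], county_fips[2:]


def join_county_fips(state_fips: str, county_part: str) -> str:
    """Build a 5-digit county code, zero-padding each part."""
    return state_fips.zfill(2) + county_part.zfill(3)


state_part, county_part = split_county_fips("16001")
print(state_part, county_part)
```

Keeping the codes as strings matters here: converting them to integers would drop leading zeros for states whose codes start with 0.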
When calling state, we can provide '*' instead of a FIPS code to request all states, and use Pandas from_records to make a data frame:
gender_pop = pd.DataFrame.from_records(
c.acs1.state(('NAME', 'B01001_001E', 'B01001_002E', 'B01001_026E'), '*', year=2014)
)
gender_pop.head()
gender_pop.info()
Why is total a string? Let's make it an int:
gender_pop['B01001_001E'] = gender_pop['B01001_001E'].astype('i4')
gender_pop.info()
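If several count columns need converting at once, one option is pd.to_numeric applied across the columns. The sketch below runs on invented records shaped like the list of dictionaries the API returns, so it needs no API key; the names sample_records, sample_pop, and frac_female and all the numbers are placeholders, not real census figures.

```python
import pandas as pd

# Invented records with string-valued counts (placeholder values only).
sample_records = [
    {"NAME": "State A", "B01001_001E": "1000", "B01001_002E": "490", "B01001_026E": "510"},
    {"NAME": "State B", "B01001_001E": "2000", "B01001_002E": "1010", "B01001_026E": "990"},
]
sample_pop = pd.DataFrame.from_records(sample_records)

# Convert every count column from string to a numeric dtype in one step.
count_cols = ["B01001_001E", "B01001_002E", "B01001_026E"]
sample_pop[count_cols] = sample_pop[count_cols].apply(pd.to_numeric)

# With numeric columns, subgroup counts can be turned into fractions of the total.
sample_pop["frac_female"] = sample_pop["B01001_026E"] / sample_pop["B01001_001E"]
print(sample_pop.dtypes)
```

The same pattern should carry over to gender_pop, assuming its count columns arrive as strings as seen above.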
✅ Todo: do the following:
The 06009 variables (B06009_001E and subcounts) report the number of people whose highest education is at different levels. B06009_002E is people who have not completed high school, etc. The 07011 variable reports the median income (B07011_001E is an estimate of median income; it is one of the variables that is not reported as a population count).
Not all variables are counts! B07011_001E is the median income for a region.
Fetch it too, and show its distribution!
Now, look at the question: do states with a higher fraction of their population college-educated have higher median incomes? Show with a scatterplot and compute a correlation coefficient (Pandas .corr method).
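To show the mechanics of the correlation step without giving away the real answer, here is a sketch on a toy table with invented numbers. The college column is a hypothetical stand-in for whatever combination of B06009 subcounts you decide counts as college-educated; check the variable list to pick those yourself.

```python
import pandas as pd

# Toy data: every number here is invented for illustration.
toy = pd.DataFrame({
    "B06009_001E": [1000, 2000, 1500, 800],        # total population counted
    "college": [200, 700, 400, 120],               # hypothetical college-educated count
    "B07011_001E": [25000, 36000, 30000, 22000],   # median income
})

# Use a fraction rather than a raw count so large and small regions are comparable.
toy["frac_college"] = toy["college"] / toy["B06009_001E"]

# Series.corr computes the Pearson correlation coefficient by default.
r = toy["frac_college"].corr(toy["B07011_001E"])
print(r)
```

On the real data, the same two columns can be passed to a scatterplot (for example seaborn's scatterplot with data, x, and y arguments) before computing the coefficient.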
(if you have time)
The c.acs1.state_county method fetches county-level data. For example, to fetch data from counties in Idaho:
A lot of counties are missing, because there is not enough sample data from them.
You can provide '*' for both state and county, to get all counties in the US (for which data is available).
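A county-level fetch can be wrapped in a small helper, sketched below. The function name fetch_counties is made up for this example, and the argument order passed to state_county (fields, state FIPS, county FIPS, then year as a keyword) is an assumption based on how the state method is called earlier in this notebook; check the census package documentation if the call fails.

```python
import pandas as pd


def fetch_counties(client, fields, state_fips="*", year=2014):
    """Fetch ACS records for all counties in a state as a data frame.

    `client` is expected to be a Census object like `c` above.
    Passing state_fips="*" requests counties in every state.
    """
    # Assumed argument order: fields, state FIPS, county FIPS ('*' = all counties).
    records = client.acs1.state_county(fields, state_fips, "*", year=year)
    return pd.DataFrame.from_records(records)
```

With the objects defined earlier, fetch_counties(c, ('NAME', 'B07011_001E'), states.ID.fips) would request median income for the Idaho counties that have data, and leaving out the state argument would request every available county.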
Look at education and income at the county level!