Week 6 Exercise - Census Data¶

This exercise has you apply two things:

  • Obtaining data from the census (see the tutorial notebook)
  • Plotting two related numeric variables

The guiding question for this notebook is “is a higher level of college education in the population correlated with income?”

Let's go!

Setup¶

We first need to make sure we have the census and us packages:

In [ ]:
%pip install census us

And then we can import:

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from census import Census
from us import states

And set up a census object (replace '<your api key>' with your own API key):

In [ ]:
c = Census('<your api key>')

Census Data¶

The census data comes in a variety of files. These files include:

  • sf1: Summary File 1, containing complete count information from the decennial census.
  • acs1: American Community Survey, a supplementary survey of a sample of the population carried out by the Census Bureau every year.

Both of these files are accessed in the same way, via the Census object, but they contain different variables. Each contains thousands of variables.

This notebook focuses on the ACS. The variable list describes these variables, and the ones of interest are all reported as estimated population counts. That means variable B06009_003E is an estimate, based on the sample, of the number of people in a geographic region whose highest educational attainment is a high school degree.

To fetch data, we need to know four things:

  • The geographic level we want: county or state?
  • Which geographic area(s)?
  • The variables to retrieve.
  • The year. We're going to use 2014.

Variables in turn are nested. Many variables are estimated population counts; for these, one variable is the total population, and others are counts within subgroups. Look at the variable list to see how these are described:

  • B01001_001E is the estimated total population
  • B01001_002E is the estimated male population
  • B01001_026E is the estimated female population

There are variables for a lot of different breakdowns.
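As a small illustration of this nesting, the subgroup counts in a table should add up to the table's total. The sketch below uses placeholder numbers, not real census figures, and only the variable names come from the list above:

```python
# Placeholder values for illustration only; these are not real census estimates.
row = {
    'B01001_001E': 1000,  # total population
    'B01001_002E': 490,   # male population
    'B01001_026E': 510,   # female population
}

total = row['B01001_001E']
subgroup_sum = row['B01001_002E'] + row['B01001_026E']

# The two subgroups partition the total, so their sum should match it.
subgroups_match_total = (subgroup_sum == total)
print(subgroups_match_total)
```

A check like this is a quick way to confirm you have picked the right total for a set of subcounts.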

The API returns a list of dictionaries containing the variables. Let's get the gender population estimate for Idaho:

In [ ]:
c.acs1.state(('NAME', 'B01001_001E', 'B01001_002E', 'B01001_026E'), states.ID.fips, year=2014)

Regions are identified by FIPS codes: numeric codes that identify states and counties. Each state has a 2-digit FIPS code, and the us.states module lets us look up a state's FIPS code. (We can also get a table of them.)

Each county's code is a 5-digit number: its state code, followed by 3 digits to identify the county.
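To make the structure of a county code concrete, here is a minimal sketch that splits a 5-digit code into its state and county parts. The helper name `split_county_fips` is hypothetical and not part of the census or us packages:

```python
def split_county_fips(code):
    """Split a 5-digit county FIPS code into (state, county) parts.

    Hypothetical helper for illustration; accepts a string or an int.
    """
    # Restore leading zeros that are lost if the code was read as an integer.
    code = str(code).zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit FIPS code: {code!r}")
    return code[:2], code[2:]


# Ada County, Idaho has FIPS code 16001: state 16, county 001.
state_part, county_part = split_county_fips("16001")
print(state_part, county_part)
```

Because state codes can start with a zero, it is usually safer to keep FIPS codes as strings rather than converting them to integers.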

When calling state, we can provide '*' instead of a FIPS code to request all states, and use Pandas from_records to make a data frame:

In [ ]:
gender_pop = pd.DataFrame.from_records(
    c.acs1.state(('NAME', 'B01001_001E', 'B01001_002E', 'B01001_026E'), '*', year=2014)
)
gender_pop.head()
In [ ]:
gender_pop.info()

Why is the total (B01001_001E) a string? Let's make it an int:

In [ ]:
gender_pop['B01001_001E'] = gender_pop['B01001_001E'].astype('i4')
gender_pop.info()
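If several count columns come back as strings, they can be converted in one step. The sketch below is one possible approach using `pd.to_numeric`; the records are made up to mimic the shape of the API response, and the `demo` and `count_cols` names are only for illustration:

```python
import pandas as pd

# Made-up records shaped like the API response; all values are placeholders.
records = [
    {'NAME': 'State A', 'B01001_001E': '1000', 'B01001_002E': '490',
     'B01001_026E': '510', 'state': '01'},
    {'NAME': 'State B', 'B01001_001E': '2000', 'B01001_002E': '1010',
     'B01001_026E': '990', 'state': '02'},
]
demo = pd.DataFrame.from_records(records)

# Convert only the count columns; the FIPS code in 'state' stays a string
# so that its leading zero is preserved.
count_cols = ['B01001_001E', 'B01001_002E', 'B01001_026E']
demo[count_cols] = demo[count_cols].apply(pd.to_numeric)

print(demo.dtypes)
```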

✅ Todo: do the following:

  • Rename the columns to have meaningful names
  • Compute the fraction of the population that is female in each state
  • Plot the distribution of the female fraction
In [ ]:

In [ ]:

Educational Attainment and Income¶

The 06009 variables (B06009_001E and its subcounts) report the number of people whose highest educational attainment is at each level; B06009_002E, for example, counts people who have not completed high school. B07011_001E is an estimate of median income, and is one of the variables that is not reported as a population count.

  1. Fetch these variables for all states.
  2. Compute the fraction of the population that has at least completed college. Look at the variable list to see which variables you will need.
  3. Show the distribution of this variable.
In [ ]:

Not all variables are counts! B07011_001E is the median income for a region.

Fetch it too, and show its distribution!

In [ ]:

Now, look at the question: do states with a higher fraction of their population college-educated have higher median incomes? Show this with a scatterplot and compute a correlation coefficient (Pandas .corr method).
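As a reminder of how `.corr` behaves, here is a minimal sketch on made-up data. The column names `x` and `y` are hypothetical stand-ins for whatever columns you build, and the values are not census data:

```python
import pandas as pd

# Made-up values for illustration only; `x` and `y` are hypothetical columns.
toy = pd.DataFrame({
    'x': [0.1, 0.2, 0.3, 0.4],
    'y': [10.0, 20.0, 30.0, 40.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = toy['x'].corr(toy['y'])

# DataFrame.corr returns the pairwise correlation matrix for all numeric columns.
corr_matrix = toy.corr()

print(r)
print(corr_matrix)
```

A scatterplot of the same two columns can be drawn with `sns.scatterplot(data=toy, x='x', y='y')`.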

In [ ]:

In [ ]:

In [ ]:

In [ ]:

County Level¶

(if you have time)

The c.acs1.state_county method fetches county-level data. For example, to fetch data from counties in Idaho:
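A minimal sketch of such a call, wrapped in a hypothetical helper named `fetch_counties`. It assumes `client` is a `census.Census` object like `c` above, and passes '*' as the county code to request every available county in the state:

```python
import pandas as pd


def fetch_counties(client, variables, state_fips, year=2014):
    """Fetch county-level ACS estimates for one state as a data frame.

    Hypothetical helper; `client` is assumed to be a census.Census object.
    """
    # '*' in the county position requests all counties in the given state.
    records = client.acs1.state_county(variables, state_fips, '*', year=year)
    return pd.DataFrame.from_records(records)


# Usage (requires a valid API key and network access):
# idaho_counties = fetch_counties(c, ('NAME', 'B01001_001E'), states.ID.fips)
# idaho_counties.head()
```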

A lot of counties are missing, because there is not enough sample data from them (the 1-year ACS publishes estimates only for areas with a population of 65,000 or more).

You can provide '*' for both state and county, to get all counties in the US (for which data is available).

Look at education and income at the county level!

In [ ]: