# Assignment 1 - US Census data

This assignment is due on **September 6, 2017** at **11:59 pm**.

## Updates

* Added probability questions
* Added links to dplyr resources

## Preliminaries

To work on this assignment, you will need three things:

1. Your basic [R installation with Jupyter and the Tidyverse](https://boisestate.github.io/CS533/resources.html).
2. The `censusapi` package, which can be installed with `conda install -c mdekstrand r-censusapi`.
3. A US Census API key, which you can obtain [here](https://api.census.gov/data/key_signup.html). Some data sets require you to register before you can access them, and this is a good example.

You can find documentation on the `censusapi` package in the [vignette](https://cran.r-project.org/web/packages/censusapi/vignettes/getting-started.html).

For reference, see [R for Data Science Chapter 5](https://r4ds.had.co.nz/transform.html) and the [dplyr introduction](https://dplyr.tidyverse.org/articles/dplyr.html).

In [None]:
library(tidyverse)
library(censusapi)

Set up your census key:

In [None]:
Sys.setenv(CENSUS_KEY="")
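Before calling the API, it can help to confirm that the key is actually visible to R. The sketch below assumes, as the `censusapi` vignette describes, that `getCensus` reads the `CENSUS_KEY` environment variable by default; `has_census_key` is a hypothetical helper name, not part of the package.

```r
# Hypothetical helper: TRUE when CENSUS_KEY is set to a non-empty string.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_KEY"))
}

if (!has_census_key()) {
  message("CENSUS_KEY is empty; paste your key into the Sys.setenv() call.")
}
```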

## Loading Census Data

Census data is spread across several data sets, and accessing it is somewhat arcane. The `getCensus` call below uses the following arguments:

- The `sf1` data set is the _summary file_, containing summary statistics about each region in the _decennial census_.
- `vintage` selects the 2010 census (the most recent one).
- `vars` selects some variables to download; `NAME` is the name of the region, `P0010001` is the total population, and `P0420002` is the total institutionalized (imprisoned) population.  For more fun, see the [full list of variables](http://api.census.gov/data/2010/sf1/variables.html).

In [None]:
last_census = getCensus(name="sf1", vintage=2010, vars=c("NAME", "P0010001", "P0420002"), region="state:*")
head(last_census)

I recommend that you rename the fields to be something more meaningful before proceeding!
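One way to do the renaming is dplyr's `rename` verb, sketched here on a small stand-in table. The values are made up, and the new column names (`state_name`, `population`, `institution_pop`) are only suggestions.

```r
library(dplyr)

# Stand-in for the getCensus() result: same column names, made-up values.
census_raw <- data.frame(
  state    = c("01", "02"),
  NAME     = c("Region A", "Region B"),
  P0010001 = c(1000, 2500),
  P0420002 = c(10, 40),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs and leaves other columns alone.
census_named <- census_raw %>%
  rename(
    state_name      = NAME,
    population      = P0010001,
    institution_pop = P0420002
  )
```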

## Initial Questions

Write R code and text to answer the following questions:

### Most Populous States

What 5 states have the highest population?

**Hint:** the `arrange` dplyr verb will sort data.
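As a small illustration of the hint, the sketch below sorts R's built-in `mtcars` data (not the census data) in descending order and keeps the first five rows. The same pattern should carry over to any numeric column.

```r
library(dplyr)

# mtcars ships with R; it stands in for the census table here.
top_mpg <- mtcars %>%
  arrange(desc(mpg)) %>%   # sort from highest to lowest mpg
  head(5)                  # keep the first five rows
```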

### Highest Prison Populations

What 5 states have the highest prison populations?

### Prisoners per Capita

What 5 states have the most prisoners per capita?
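A per-capita figure is a count divided by the population it was drawn from. The sketch below computes one with `mutate` on a toy table; the regions and numbers are invented purely for illustration, and the column names are assumptions.

```r
library(dplyr)

# Toy data with made-up values, not census figures.
toy <- data.frame(
  region     = c("A", "B", "C"),
  population = c(1000, 5000, 200),
  prisoners  = c(10, 20, 4),
  stringsAsFactors = FALSE
)

toy_rates <- toy %>%
  mutate(prisoners_per_capita = prisoners / population) %>%
  arrange(desc(prisoners_per_capita))
```

In this toy table, region B has the largest count but the lowest rate, so ranking by count and ranking by rate can give different answers.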

### Probability

What is the probability that a person selected at random is born in Rhode Island?

What is the probability that a person is born in one of the Dakotas (North Dakota or South Dakota)?
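One possible way to set this up is to assume every counted person is equally likely to be selected, so a region's probability is its share of the total. The population counts downloaded above record where people live, so using them for "born in" is itself an assumption. The sketch uses a toy table with made-up values.

```r
library(dplyr)

# Toy data with made-up values, not census figures.
toy <- data.frame(
  region     = c("A", "B", "C", "D"),
  population = c(100, 300, 400, 200),
  stringsAsFactors = FALSE
)

# Each region's probability is its share of the total population.
toy_prob <- toy %>%
  mutate(prob = population / sum(population))

# "A or B": a person is in only one region, so the events are disjoint
# and their probabilities add.
p_a_or_b <- toy_prob %>%
  filter(region %in% c("A", "B")) %>%
  summarise(p = sum(prob)) %>%
  pull(p)
```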

## Visualization

Plot a histogram of state populations.

Plot a histogram of _county_ populations.  You can get counties by using `county:*` as your selector instead of `state:*` in a `getCensus` call.
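A minimal histogram sketch with ggplot2 follows. It uses the `midwest` data set bundled with ggplot2 (one row per county, population in `poptotal`) rather than the census download, and the log-scaled x axis is an optional choice, not a requirement of the assignment.

```r
library(ggplot2)

# ggplot2's built-in `midwest` data has one row per county.
county_hist <- ggplot(midwest, aes(x = poptotal)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +   # optional: populations are heavily right-skewed
  labs(x = "County population (log scale)", y = "Number of counties")

county_hist
```

Without the log scale, a handful of very large counties tend to squeeze everything else into the first bin or two, so it is worth trying both versions.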

What are the 5 largest counties in Idaho by population?

## Joining Data

For the next section, we want to work with _two_ data sets.

Fetch the 2000 data set as well. Unfortunately, the variable names differ between the two censuses! You can find the 2000 list [here](http://api.census.gov/data/2000/sf1/variables.html); for example, the total population variable in 2000 is `P001001`.

You will need to connect the two data sets; the `inner_join` dplyr verb is used for this.
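A sketch of such a join on two toy tables follows. The values are made up, and the `state` key and `population` column names are assumptions about how you named your own columns. `inner_join` keeps only the keys present in both tables, and `suffix` disambiguates columns that share a name.

```r
library(dplyr)

# Toy tables with made-up values; "state" plays the role of the join key.
pop_2000 <- data.frame(
  state = c("01", "02", "03"),
  population = c(100, 200, 300),
  stringsAsFactors = FALSE
)
pop_2010 <- data.frame(
  state = c("02", "03", "04"),
  population = c(250, 280, 50),
  stringsAsFactors = FALSE
)

# Only states "02" and "03" appear in both tables, so only they survive.
both <- inner_join(pop_2000, pop_2010, by = "state",
                   suffix = c("_2000", "_2010")) %>%
  mutate(change = population_2010 - population_2000)
```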

### Population Growth

What 5 states saw the most population growth from 2000 to 2010?

### Population Loss

What 5 states saw the most population loss from 2000 to 2010?

### Fancy Graphics

See [Mapping US State, County, and Zipcode Data with R](http://www.poppy-zhang.com/r-coding/mapping-us-state-county-and-zipcode-data-with-r/) and plot a map of the 48 contiguous US states, shaded by their population growth from 2000 to 2010.
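One possible approach, which is not necessarily the one the linked article takes, is to join your growth figures onto the state outlines from ggplot2's `map_data("state")`. That function needs the `maps` package installed and covers the contiguous US. The sketch assumes a hypothetical helper `plot_state_growth` and an input table with a lowercase `region` column (for example `tolower(NAME)`) and a numeric `growth` column.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: `values` needs a lowercase `region` column (state name)
# and a numeric `growth` column.
plot_state_growth <- function(values) {
  # One row per polygon vertex; `region` holds lowercase state names.
  outlines <- map_data("state")
  outlines %>%
    left_join(values, by = "region") %>%   # keeps the vertex order intact
    ggplot(aes(x = long, y = lat, group = group, fill = growth)) +
    geom_polygon(color = "white") +
    coord_quickmap() +
    labs(fill = "Growth")
}
```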

## Go It Alone

Identify 5 more questions and answer them using the US Census data.

- At least one must use one or more variables not described above.
- At least 2 must involve joining more than one table.
- You aren't restricted to the state or even county level; individual census tracts can be interesting.
- You do not need to keep looking nationwide; you can grab the census tracts in Idaho, for example.
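For the tract-level idea, `getCensus` has a `regionin` argument that restricts a wildcard to a parent geography. The sketch below assumes the 2010 `sf1` endpoint accepts a tract wildcard within a state, as the `censusapi` vignette's examples suggest. `fetch_state_tracts` is a hypothetical helper, and calling it needs network access and a valid `CENSUS_KEY`.

```r
# Hypothetical helper; nothing is downloaded until it is called.
fetch_state_tracts <- function(state_fips, vars = c("NAME", "P0010001")) {
  censusapi::getCensus(
    name = "sf1", vintage = 2010,
    vars = vars,
    region = "tract:*",                        # every tract ...
    regionin = paste0("state:", state_fips)    # ... within one state
  )
}

# Idaho's state FIPS code is 16:
# idaho_tracts <- fetch_state_tracts("16")
```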

## Submitting

Export your notebook to HTML and e-mail it to the professor.