urbnindicators aims to provide users with analysis-ready data from the American Community Survey (ACS).
What you can access:

- Hundreds of pre-computed variables, including percentages and the raw count variables used to produce them–or flexibly query any table your heart desires.
- Your own derived variables, specified with a series of helper functions.
- Margins of error for all variables–those direct from the API as well as derived variables–with correctly calculated pooled margins of error, per Census Bureau guidance.
- Meaningful, consistent variable names–no more “B01003_001”; try “total_population_universe” instead. (If you’re fond of the API’s variable names, those are stored in the codebook as well for cross-referencing.)
- A codebook that describes how each variable is calculated.
- Data for multiple years and multiple states out of the box.
- Supplemental measures, such as population density, that aren’t available from the ACS.
- Tools to aggregate or interpolate your data to different geographies–along with correctly adjusted margins of error.
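The pooled margins of error mentioned above follow the Census Bureau's published approximation: the MOE of a sum (or difference) of estimates is the square root of the sum of the squared component MOEs. A minimal base-R sketch of that arithmetic, using invented estimates and MOEs (this is not the package's internal code):

``` r
# Census Bureau approximation: the MOE of a sum (or difference) of
# estimates is the square root of the sum of the squared component MOEs
moe_sum = function(moes) sqrt(sum(moes^2))

# Invented tract-level counts and their published MOEs
estimates = c(1200, 850, 430)
moes = c(110, 95, 60)

sum(estimates)  # pooled estimate: 2480
moe_sum(moes)   # pooled MOE: sqrt(110^2 + 95^2 + 60^2), about 157.2
```

Related formulas from the same guidance cover proportions and ratios; the package applies the appropriate one for each derived variable.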
Install the development version of urbnindicators from GitHub with:

``` r
# install.packages("renv")
renv::install("UI-Research/urbnindicators")
```

You’ll want a Census API key (request one here). Set it once with:

``` r
tidycensus::census_api_key("YOUR_KEY", install = TRUE)
```

Note that this package is under active development with frequent updates–check to ensure you have the most recent version installed!
To see which tables are available out of the box:

``` r
list_tables() |> head(10)
#>  [1] "age"                    "computing_devices"      "cost_burden"
#>  [4] "disability"             "educational_attainment" "employment"
#>  [7] "gini"                   "health_insurance"       "household_size"
#> [10] "income_quintiles"
```

A single call to compile_acs_data() returns analysis-ready data with pre-computed percentages, meaningful variable names, and margins of error:
``` r
df = compile_acs_data(
  tables = "race",
  years = c(2019, 2024),
  geography = "county",
  states = "NJ")

df %>%
  select(1:10) %>%
  glimpse()
#> Rows: 42
#> Columns: 10
#> $ data_source_year             <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
#> $ GEOID                        <chr> "34025", "34037", "34013", "34015", "3403…
#> $ NAME                         <chr> "Monmouth County, New Jersey", "Sussex Co…
#> $ total_population_universe    <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_universe                <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_nonhispanic_allraces    <dbl> 554491, 129866, 612222, 273106, 294434, 7…
#> $ race_nonhispanic_white_alone <dbl> 467752, 122081, 242965, 228576, 208005, 5…
#> $ race_nonhispanic_black_alone <dbl> 41697, 2991, 305796, 28452, 52523, 49249,…
#> $ race_nonhispanic_aian_alone  <dbl> 440, 16, 1107, 204, 651, 1000, 123, 191, …
#> $ race_nonhispanic_asian_alone <dbl> 33451, 2887, 41976, 9002, 25732, 151090, …
```

compile_acs_data() makes it easy to pull multiple years and produce publication-ready visualizations:
``` r
plot_data = df %>%
  transmute(
    county_name = NAME %>% str_remove(" County, New Jersey"),
    race_personofcolor_percent,
    race_personofcolor_percent_M,
    data_source_year = factor(data_source_year))

state_averages = plot_data %>%
  summarize(
    .by = data_source_year,
    mean_pct = mean(race_personofcolor_percent)) %>%
  arrange(data_source_year) %>%
  pull(mean_pct)

## order counties by 2019 value for the dumbbell plot
county_order = plot_data %>%
  filter(data_source_year == "2019") %>%
  arrange(race_personofcolor_percent) %>%
  pull(county_name)

plot_data = plot_data %>%
  mutate(county_name = factor(county_name, levels = county_order))

dumbbell_data = plot_data %>%
  pivot_wider(
    id_cols = county_name,
    names_from = data_source_year,
    values_from = race_personofcolor_percent,
    names_prefix = "year_")

ggplot() +
  geom_segment(
    data = dumbbell_data,
    aes(
      x = county_name,
      y = year_2019,
      yend = year_2024),
    color = palette_urbn_main[7],
    linewidth = 1) +
  ggdist::stat_gradientinterval(
    data = plot_data,
    aes(
      x = county_name,
      ## dividing a published 90% MOE by 1.645 yields the standard error
      ydist = distributional::dist_normal(
        race_personofcolor_percent,
        race_personofcolor_percent_M / 1.645),
      color = data_source_year),
    point_size = 2,
    .width = .95) +
  geom_hline(
    yintercept = state_averages[1],
    linetype = "dashed",
    color = palette_urbn_main[1]) +
  geom_hline(
    yintercept = state_averages[2],
    linetype = "dashed",
    color = palette_urbn_main[2]) +
  ## annotate() ignores nudge_y, so label offsets are baked into y directly
  annotate(
    "text",
    y = state_averages[1] - .15,
    x = 21.5,
    label = "State mean (2019)",
    fontface = "bold.italic",
    color = palette_urbn_main[1],
    size = 9 / .pt,
    hjust = 0) +
  annotate(
    "text",
    y = state_averages[2] + .01,
    x = 21.5,
    label = "State mean (2024)",
    fontface = "bold.italic",
    color = palette_urbn_main[2],
    size = 9 / .pt,
    hjust = 0) +
  labs(
    title = "All NJ Counties Experienced Racial Diversification from 2019 to 2024",
    subtitle = "Share of population who are people of color, by county, 2019-2024
Confidence intervals are presented around each point but are extremely small",
    x = "",
    y = "Share of population who are people of color") +
  scale_x_discrete(expand = expansion(mult = c(.03, .04))) +
  scale_y_continuous(
    breaks = c(0, .25, .50, .75, 1.0),
    limits = c(0, .75),
    labels = scales::percent) +
  coord_flip() +
  theme_urbn_print()
```

ACS data are available for standard geographies (tracts, counties, states, etc.), but many analyses require non-standard areas like neighborhoods, school zones, or planning districts. interpolate_acs() aggregates source data to any user-defined geography, properly re-deriving percentages and propagating margins of error:
``` r
dc_tracts = compile_acs_data(
  tables = "snap",
  years = 2024,
  geography = "tract",
  states = "DC",
  spatial = TRUE)

## assign each tract to a quadrant based on its centroid
dc_tracts = dc_tracts %>%
  mutate(
    centroid = sf::st_centroid(geometry),
    lon = sf::st_coordinates(centroid)[, 1],
    lat = sf::st_coordinates(centroid)[, 2],
    quadrant = case_when(
      lon < median(lon) & lat >= median(lat) ~ "NW",
      lon >= median(lon) & lat >= median(lat) ~ "NE",
      lon < median(lon) & lat < median(lat) ~ "SW",
      lon >= median(lon) & lat < median(lat) ~ "SE")) %>%
  select(-centroid, -lon, -lat)

## aggregate tracts to quadrants
dc_quadrants = interpolate_acs(
  .data = dc_tracts,
  target_geoid = "quadrant")

dc_quadrants %>%
  sf::st_drop_geometry() %>%
  select(GEOID, snap_received_percent, snap_received_percent_M)
#>   GEOID snap_received_percent snap_received_percent_M
#> 1    NE            0.15951925             0.019448994
#> 2    NW            0.07036185             0.006889427
#> 3    SE            0.24445974             0.012073306
#> 4    SW            0.06525691             0.012003668
```

See vignette("custom-geographies") for more.
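Conceptually, aggregating ACS data to a custom geography means summing the counts, pooling their MOEs, and then re-deriving each percentage with the Census Bureau's proportion formula. A sketch of that arithmetic with invented numbers (not the package's internal implementation):

``` r
# Two hypothetical tracts being combined into one custom area
# (invented counts and MOEs, for illustration only)
snap_received = c(300, 450);    snap_received_M = c(60, 80)
snap_universe = c(2000, 2500);  snap_universe_M = c(150, 180)

# 1. Sum the counts; pool their MOEs as the sqrt of summed squares
num   = sum(snap_received);  num_M   = sqrt(sum(snap_received_M^2))
denom = sum(snap_universe);  denom_M = sqrt(sum(snap_universe_M^2))

# 2. Re-derive the percentage
p = num / denom

# 3. MOE of a proportion, per Census Bureau guidance; when the term
#    under the square root is negative, the ratio formula (+) applies
radicand = num_M^2 - p^2 * denom_M^2
p_M = if (radicand >= 0) {
  sqrt(radicand) / denom
} else {
  sqrt(num_M^2 + p^2 * denom_M^2) / denom  # ratio-formula fallback
}

round(c(estimate = p, moe = p_M), 4)  # estimate 0.1667, moe 0.0205
```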
Beyond the package’s built-in tables, you can define your own derived
variables using the define_*() helpers and pass them directly to
compile_acs_data(). Your custom variables automatically get codebook
entries and margins of error:
``` r
df = compile_acs_data(
  tables = list(
    "snap",
    define_percent(
      "snap_not_received_percent",
      numerator_variables = c("snap_universe"),
      numerator_subtract_variables = c("snap_received"),
      denominator_variables = c("snap_universe"))),
  years = 2024,
  geography = "county",
  states = "DC")

df %>%
  select(matches("snap.*percent")) %>%
  glimpse()
#> Rows: 1
#> Columns: 4
#> $ snap_received_percent       <dbl> 0.143
#> $ snap_not_received_percent   <dbl> 0.857
#> $ snap_received_percent_M     <dbl> 0.0064
#> $ snap_not_received_percent_M <dbl> 0.0071
```

See vignette("custom-derived-variables") for detailed examples of each of the define_*() helpers.
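For a subtracted numerator like the one above (snap_universe minus snap_received), Census Bureau guidance pools the MOE of a difference the same way as a sum: the square root of the summed squared MOEs. A sketch of that step with invented numbers (not the package's code):

``` r
# Invented county-level count and MOE values, for illustration only
snap_universe = 10000; snap_universe_M = 220
snap_received = 1430;  snap_received_M = 130

# Per Census guidance, the MOE of a difference pools like a sum
snap_not_received   = snap_universe - snap_received  # 8570
snap_not_received_M = sqrt(snap_universe_M^2 + snap_received_M^2)
round(snap_not_received_M, 1)  # sqrt(220^2 + 130^2), about 255.5
```

The derived percentage's MOE is then computed from this pooled numerator MOE via the proportion formula, which is why the two percentage MOEs in the output above differ even though the estimates are complements.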
Check out the vignettes for additional details:

- A package overview to help users Get Started.
- An interactive version of the package’s Codebook so that prospective users can know what to expect.
- A brief description of the package’s Design Philosophy to clarify the use-cases that urbnindicators is built to support.
- An illustration of how Quantifying Survey Error can improve inference-making.
- A guide to re-creating your indicators and their measures of error for Custom Geographies. Neighborhoods? Unincorporated counties? Start here.
- A guide to defining Custom Derived Variables using the define_*() helpers.
This package is built on top of and enormously indebted to
library(tidycensus), which provides the core functionality for
accessing the Census Bureau API. Learn more here:
https://walker-data.com/tidycensus/index.html.
