diff --git a/README.Rmd b/README.Rmd index dae7fa2..cb07116 100644 --- a/README.Rmd +++ b/README.Rmd @@ -275,4 +275,8 @@ Cite the organizations that produce the crosswalks returned by this package: *CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk.* -- **For this package:** https://ui-research.github.io/crosswalk/authors.html#citation \ No newline at end of file +- **For this package:** https://ui-research.github.io/crosswalk/authors.html#citation + +## AI Use + +This package was written in part with the use of agentic AI tools under the supervision of the author. \ No newline at end of file diff --git a/README.md b/README.md index 6505ad4..c274746 100644 --- a/README.md +++ b/README.md @@ -1,328 +1,312 @@ - -# crosswalk - -An R package for translating data across space and time. - -## Overview - -This package provides a consistent API and standardized versions of -crosswalks to enable consistent approaches that work across different -geography and year combinations. The package also facilitates -interpolation–that is, adjusting source geography/year values by their -crosswalk weights and translating these values to the desired target -geography/year–including diagnostics of the joins between source data -and crosswalks. - -The package sources crosswalks from: - -- **Geocorr** (Missouri Census Data Center) - for inter-geography - crosswalks (same-decade) -- **IPUMS NHGIS** - for inter-temporal crosswalks (across decades) -- **CT Data Collaborative** - for Connecticut 2020→2022 crosswalks - (planning region changes) - -## Why Use `crosswalk`? 
- -- **Programmatic access**: No more manual downloads from web interfaces; - data is cached for speed -- **Standardized output**: Consistent column names across all crosswalk - sources -- **Metadata tracking**: Full provenance of crosswalks stored as - attributes -- **Crosswalk chaining**: Automatic chaining when multiple crosswalks - are required - -## Installation - -``` r -# Install from GitHub -renv::install("UI-Research/crosswalk") -``` - -## Quick Start - -First we obtain a crosswalk and apply it to our data: - -``` r -library(crosswalk) -library(dplyr) -library(ggplot2) -library(stringr) -library(sf) -library(tidycensus) -library(tigris) -library(scales) - -source_data = get_acs( - year = 2023, - geography = "zcta", - output = "wide", - variables = c(below_poverty_level = "B17001_002")) %>% - select( - source_geoid = GEOID, - count_below_poverty_level = below_poverty_levelE) - -# Get a crosswalk from ZCTAs to PUMAs (same year, uses Geocorr (2022)) -zcta_puma_crosswalk <- get_crosswalk( - source_geography = "zcta", - target_geography = "puma22", - weight = "population") - -# Apply the crosswalk to your data -crosswalked_data <- crosswalk_data( - data = source_data, - crosswalk = zcta_puma_crosswalk) - -## Or in a single step -crosswalked_data = crosswalk_data( - data = source_data, - source_geography = "zcta", - target_geography = "puma22", - weight = "population") -``` - -What does the crosswalk(s) reflect and how was it sourced? - -``` r -## and there's more (not shown) -names(attr(crosswalked_data, "crosswalk_metadata")) %>% head() -#> [1] "call_parameters" "data_source" "data_source_full_name" -#> [4] "download_url" "api_endpoint" "documentation_url" -``` - -How well did the crosswalk join to our source data? 
- -``` r -## look at all the characteristics of the join(s) between the source data -## and the crosswalks -join_quality = attr(crosswalked_data, "join_quality") - -## what share of records in the source data do not join to a crosswalk and -## thus are dropped during the crosswalking process? -join_quality$pct_data_unmatched -#> [1] 0.4234277 - -## zctas aren't nested within states, otherwise join_quality$state_analysis_data -## would help us to ID whether non-joining source data were clustered within one -## or a few states. instead we can join to spatial data to diagnose further: -zctas_sf = zctas(year = 2023, progress_bar = FALSE) -states_sf = states(year = 2023, cb = TRUE, progress_bar = FALSE) - -## apart from DC, which has a disproportionate number of non-joining ZCTAs-- -## seemingly corresponding to federal areas and buildings--the distribution of -## non-joining ZCTAs appears proportionate to state-level populations and is -## distributed across many states: -zctas_sf %>% - filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>% - st_intersection(states_sf %>% select(NAME)) %>% - st_drop_geometry() %>% - count(NAME, sort = TRUE) %>% - head() -#> NAME n -#> 1 District of Columbia 19 -#> 2 New York 15 -#> 3 Texas 9 -#> 4 California 8 -#> 5 Colorado 6 -#> 6 Utah 6 -``` - -And how accurate was the crosswalking process? 
- -``` r -comparison_data = get_acs( - year = 2023, - geography = "puma", - output = "wide", - variables = c( - below_poverty_level = "B17001_002")) %>% - select( - source_geoid = GEOID, - count_below_poverty_level_acs = below_poverty_levelE) - -combined_data = left_join( - comparison_data, - crosswalked_data, - by = c("source_geoid" = "geoid")) - -combined_data %>% - select(source_geoid, matches("count")) %>% - mutate(difference_percent = (count_below_poverty_level_acs - count_below_poverty_level) / count_below_poverty_level_acs) %>% - ggplot() + - geom_histogram(aes(x = difference_percent)) + - theme_minimal() + - theme(panel.grid = element_blank()) + - scale_x_continuous(labels = percent) + - labs( - title = "Crosswalked data approximates observed values", - subtitle = "Block group-level source data would produce more accurate crosswalked values", - y = "", - x = "Percent difference between observed and crosswalked values") -``` - - - -## Core Functions - -The package has two main functions, though you can also specify the -needed crosswalk(s) directly from `crosswalk_data()` and omit the -intermediate `get_crosswalk()` call. - -| Function | Purpose | -|----|----| -| `get_crosswalk()` | Fetch crosswalk(s) | -| `crosswalk_data()` | Apply crosswalk(s) to interpolate data to the target geography-year | - -## Output Structure - -`get_crosswalk()` **always returns a list** structured as follows: - -The list contains three elements: - -| Element | Description | -|--------------|-------------------------------------------------------| -| `crosswalks` | A named list of crosswalks (`step_1`, `step_2`, etc.) | -| `plan` | Details about what crosswalks are being fetched | -| `message` | A description of the crosswalk chain | - -### Multi-Step Crosswalks - -For some source year/geography -\> target year/geography combinations, -there is not a single direct crosswalk. 
The package automatically plans -and fetches the required chain of crosswalks, using a year-first -strategy: - -1. **NHGIS step(s)**: Change year while keeping geography constant - (multiple hops if the temporal span requires it, e.g. 1990→2010→2020) -2. **Geocorr step**: Change geography at the target year - -``` r -result <- get_crosswalk( - source_geography = "tract", - target_geography = "zcta", - source_year = 2010, - target_year = 2020, - weight = "population", - silent = TRUE) - -# Two crosswalks are returned -# Step 1: 2010 tracts -> 2020 tracts (NHGIS) -# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr) - -# Longer chains are produced when needed, e.g. -# 2000 tracts -> 2020 ZCTAs produces three steps: -# Step 1: 2000 tracts -> 2010 tracts (NHGIS) -# Step 2: 2010 tracts -> 2020 tracts (NHGIS) -# Step 3: 2020 tracts -> 2020 ZCTAs (Geocorr) -``` - -### Crosswalk Structure - -Each crosswalk contains standardized columns: - -| Column | Description | -|----|----| -| `source_geoid` | Identifier for source geography | -| `target_geoid` | Identifier for target geography | -| `allocation_factor_source_to_target` | Weight for interpolating values | -| `weighting_factor` | What attribute was used (population, housing, land) | - -Additional columns may include `source_year`, `target_year`, -`population_2020`, `housing_2020`, and `land_area_sqmi` depending on the -source of the crosswalk. - -### Accessing Metadata - -Each crosswalk tibble has a `crosswalk_metadata` attribute that -documents what the crosswalk represents and how it was created: - -``` r -metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata") -names(metadata) -``` - -## Interpolation - -`crosswalk_data()` applies crosswalk weights to transform your data. If -you’re in a hurry, you can omit a call to `get_crosswalk()` and specify -the needed crosswalk parameters to `crosswalk_data()`, which will pass -these to `get_crosswalk()` behind the scenes. 
Or you can call -`get_crosswalk()` explicitly and then pass the result to -`crosswalk_data()`. - -### Column Naming Convention - -The function auto-detects columns based on prefixes: - -| Prefix | Treatment | -|----|----| -| `count_` | Summed after weighting (for counts like population, housing units) | -| `mean_`, `median_`, `percent_`, `ratio_` | Weighted mean (for rates, percentages, averages) | - -You can also specify columns explicitly via `count_columns` and -`non_count_columns`. All non-count variables are interpolated using -weighted means, weighting by the allocation factor from the crosswalk. - -## Supported Geography and Year Combinations - -`get_available_crosswalks()` returns a listing of all supported -year-geography combinations. - -``` r -get_available_crosswalks() %>% - head() -#> # A tibble: 6 × 4 -#> source_geography target_geography source_year target_year -#> -#> 1 block block 1990 2010 -#> 2 block block 2000 2010 -#> 3 block block 2010 2020 -#> 4 block block 2020 2010 -#> 5 block block 2020 2022 -#> 6 block block 2022 2020 -``` - -## API Keys - -NHGIS crosswalks require an IPUMS API key. Get one at - and add to your `.Renviron`: - -``` r -usethis::edit_r_environ() -# Add: IPUMS_API_KEY=your_key_here -``` - -## Caching - -Use the `cache` parameter to save crosswalks locally for ease: - -``` r -result <- get_crosswalk( - source_geography = "tract", - target_geography = "zcta", - weight = "population", - cache = here::here("crosswalks-cache")) -``` - -## Citations - -Cite the organizations that produce the crosswalks returned by this -package: - -**For NHGIS**, see requirements at: - - -**For Geocorr**, a suggested citation (update the year): - -> Missouri Census Data Center, University of Missouri. (2022/2018). -> Geocorr 2022/2018: Geographic Correspondence Engine. Retrieved from: -> - -**For CTData**, a suggested citation (adjust for alternate source -geography): - -> CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. 
Retrieved -> from: . - -**For this package**, refer here: - + +# crosswalk + +An R package for translating data across space and time. + +## Overview + +This package provides a simple API and standardized versions of +crosswalks to enable consistent, programmatic approaches that work +across different geography and year combinations. + +The package also facilitates interpolation–that is, adjusting source +geography/year values by their crosswalk weights and translating these +values to the desired target geography/year–including diagnostics of the +joins between source data and crosswalks. + +The package sources crosswalks from: + +- **Geocorr** (Missouri Census Data Center) - for inter-geography + crosswalks (same-decade) +- **IPUMS NHGIS** - for inter-temporal crosswalks (across decades) +- **CT Data Collaborative** - for Connecticut 2020→2022 crosswalks + (planning region changes) + +## Why Use `crosswalk`? + +- **Programmatic access**: No more manual downloads from web interfaces; + data is cached for speed +- **Standardized output**: Consistent column names across all crosswalk + sources +- **Metadata tracking**: Full provenance of crosswalks stored as + attributes +- **Crosswalk chaining**: Automatic chaining when multiple crosswalks + are required + +## Installation + + # Install from GitHub + renv::install("UI-Research/crosswalk") + +## Quick Start + +We obtain a crosswalk and apply it to our data: + +``` r +library(crosswalk) +library(dplyr) +library(ggplot2) +library(stringr) +library(sf) +library(tidycensus) +library(tigris) +library(scales) + +source_data = get_acs( + year = 2023, + geography = "zcta", + output = "wide", + variables = c(below_poverty_level = "B17001_002")) %>% + select( + source_geoid = GEOID, + count_below_poverty_level = below_poverty_levelE) + +# Get a crosswalk from ZCTAs to PUMAs (same year, uses Geocorr (2022)) +zcta_puma_crosswalk <- get_crosswalk( + source_geography = "zcta", + target_geography = "puma22", + weight = 
"population") + +# Apply the crosswalk to your data +crosswalked_data <- crosswalk_data( + data = source_data, + crosswalk = zcta_puma_crosswalk) + +## Or in a single step +crosswalked_data = crosswalk_data( + data = source_data, + source_geography = "zcta", + target_geography = "puma22", + weight = "population") +``` + +What does the crosswalk(s) reflect and how was it sourced? + +``` r +## and there's more (not shown) +names(attr(crosswalked_data, "crosswalk_metadata")) %>% head() +#> [1] "call_parameters" "data_source" "data_source_full_name" +#> [4] "download_url" "api_endpoint" "documentation_url" +``` + +How well did the crosswalk join to our source data? + +``` r +## look at all the characteristics of the join(s) between the source data +## and the crosswalks +join_quality = attr(crosswalked_data, "join_quality") + +## what share of records in the source data do not join to a crosswalk and +## thus are dropped during the crosswalking process? +join_quality$pct_data_unmatched +#> [1] 0.4234277 + +## zctas aren't nested within states, otherwise join_quality$state_analysis_data +## would help us to ID whether non-joining source data were clustered within one +## or a few states. 
instead we can join to spatial data to diagnose further: +zctas_sf = zctas(year = 2023, progress_bar = FALSE) +states_sf = states(year = 2023, cb = TRUE, progress_bar = FALSE) + +## apart from DC, which has a disproportionate number of non-joining ZCTAs-- +## seemingly corresponding to federal areas and buildings--the distribution of +## non-joining ZCTAs appears proportionate to state-level populations and is +## distributed across many states: +zctas_sf %>% + filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>% + st_intersection(states_sf %>% select(NAME)) %>% + st_drop_geometry() %>% + count(NAME, sort = TRUE) %>% + head() +#> NAME n +#> 1 District of Columbia 19 +#> 2 New York 15 +#> 3 Texas 9 +#> 4 California 8 +#> 5 Colorado 6 +#> 6 Utah 6 +``` + +And how accurate was the crosswalking process? + +``` r +comparison_data = get_acs( + year = 2023, + geography = "puma", + output = "wide", + variables = c( + below_poverty_level = "B17001_002")) %>% + select( + source_geoid = GEOID, + count_below_poverty_level_acs = below_poverty_levelE) + +combined_data = left_join( + comparison_data, + crosswalked_data, + by = c("source_geoid" = "geoid")) + +combined_data %>% + select(source_geoid, matches("count")) %>% + mutate(difference_percent = (count_below_poverty_level_acs - count_below_poverty_level) / count_below_poverty_level_acs) %>% + ggplot() + + geom_histogram(aes(x = difference_percent)) + + theme_minimal() + + theme(panel.grid = element_blank()) + + scale_x_continuous(labels = percent) + + labs( + title = "Crosswalked data approximates observed values", + subtitle = "Block group-level source data would produce more accurate crosswalked values", + y = "", + x = "Percent difference between observed and crosswalked values") +``` + + + +## Core Functions + +The package has two main functions, though you can also specify the +needed crosswalk(s) directly from `crosswalk_data()` and omit the +intermediate `get_crosswalk()` call. 
+ +| Function | Purpose | +|----|----| +| `get_crosswalk()` | Fetch crosswalk(s) | +| `crosswalk_data()` | Apply crosswalk(s) to interpolate data to the target geography-year | + +## Output Structure + +`get_crosswalk()` **always returns a list** structured as follows: + +The list contains three elements: + +| Element | Description | +|--------------|-------------------------------------------------------| +| `crosswalks` | A named list of crosswalks (`step_1`, `step_2`, etc.) | +| `plan` | Details about what crosswalks are being fetched | +| `message` | A description of the crosswalk chain | + +### Multi-Step Crosswalks + +For some source year/geography -\> target year/geography combinations, +there is not a single direct crosswalk. In such cases, we need two +crosswalks. The package automatically plans and fetches the required +crosswalks: + +1. **Step 1 (NHGIS)**: Change year, keep geography constant +2. **Step 2 (Geocorr)**: Change geography at target year + +``` r +result <- get_crosswalk( + source_geography = "tract", + target_geography = "zcta", + source_year = 2010, + target_year = 2020, + weight = "population", + silent = TRUE) + +# Two crosswalks are returned +# Step 1: 2010 tracts -> 2020 tracts (NHGIS) +# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr) +``` + +### Crosswalk Structure + +Each crosswalk contains standardized columns: + +| Column | Description | +|----|----| +| `source_geoid` | Identifier for source geography | +| `target_geoid` | Identifier for target geography | +| `allocation_factor_source_to_target` | Weight for interpolating values | +| `weighting_factor` | What attribute was used (population, housing, land) | + +Additional columns may include `source_year`, `target_year`, +`population_2020`, `housing_2020`, and `land_area_sqmi` depending on the +source of the crosswalk. 
+ +### Accessing Metadata + +Each crosswalk tibble has a `crosswalk_metadata` attribute that +documents what the crosswalk represents and how it was created: + +``` r +metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata") +names(metadata) +``` + +## Interpolation + +`crosswalk_data()` applies crosswalk weights to transform your data. If +you’re in a hurry, you can omit a call to `get_crosswalk()` and specify +the needed crosswalk parameters to `crosswalk_data()`, which will pass +these to `get_crosswalk()` behind the scenes. Or you can call +`get_crosswalk()` explicitly and then pass the result to +`crosswalk_data()`. + +## Supported Geography and Year Combinations + +`get_available_crosswalks()` returns a listing of all supported +year-geography combinations. + +``` r +get_available_crosswalks() %>% + head() +#> # A tibble: 6 × 4 +#> source_geography target_geography source_year target_year +#> +#> 1 block aiannh 2022 2022 +#> 2 block block 1990 2010 +#> 3 block block 2000 2010 +#> 4 block block 2010 2020 +#> 5 block block 2020 2010 +#> 6 block block 2020 2022 +``` + +## API Keys + +NHGIS crosswalks require an IPUMS API key. Get one at + and add to your `.Renviron`: + +``` r +usethis::edit_r_environ() +# Add: IPUMS_API_KEY=your_key_here +``` + +## Caching + +Use the `cache` parameter to save crosswalks locally for ease: + +``` r +result <- get_crosswalk( + source_geography = "tract", + target_geography = "zcta", + weight = "population", + cache = here::here("crosswalks-cache")) +``` + +## Citations + +Cite the organizations that produce the crosswalks returned by this +package: + +**For NHGIS**, see requirements at: + + +**For Geocorr**, a suggested citation (update the year): + +> Missouri Census Data Center, University of Missouri. (2022/2018). +> Geocorr 2022/2018: Geographic Correspondence Engine. Retrieved from: +> + +- **For CT Data Collaborative**, a suggested citation (adjust for + alternate source geography): + +*CT Data Collaborative. (2023). 
2022 Census Tract Crosswalk. Retrieved +from: <https://github.com/CT-Data-Collaborative/2022-tract-crosswalk>.* + +- **For this package:** + <https://ui-research.github.io/crosswalk/authors.html#citation> + +## AI Use + +This package was written in part with the use of agentic AI tools under +the supervision of the author. diff --git a/man/figures/README-unnamed-chunk-5-1.png b/man/figures/README-unnamed-chunk-5-1.png new file mode 100644 index 0000000..c9ec9ef Binary files /dev/null and b/man/figures/README-unnamed-chunk-5-1.png differ