diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml index 21b8a93..0ab748d 100644 --- a/.github/workflows/test-coverage.yaml +++ b/.github/workflows/test-coverage.yaml @@ -4,9 +4,10 @@ on: push: branches: [main, master] pull_request: - branches: [main, master] -name: test-coverage +name: test-coverage.yaml + +permissions: read-all jobs: test-coverage: @@ -23,18 +24,29 @@ jobs: - uses: r-lib/actions/setup-r-dependencies@v2 with: - extra-packages: any::covr + extra-packages: any::covr, any::xml2 needs: coverage - name: Test coverage run: | - covr::codecov( + cov <- covr::package_coverage( quiet = FALSE, clean = FALSE, install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package") ) + print(cov) + covr::to_cobertura(cov) shell: Rscript {0} + - uses: codecov/codecov-action@v5 + with: + # Fail if error if not on PR, or if on PR and token is given + fail_ci_if_error: ${{ github.event_name != 'pull_request' || secrets.CODECOV_TOKEN }} + files: ./cobertura.xml + plugins: noop + disable_search: true + token: ${{ secrets.CODECOV_TOKEN }} + - name: Show testthat output if: always() run: | diff --git a/R/calculate_cvs.R b/R/calculate_cvs.R index 6d84772..6296ba9 100644 --- a/R/calculate_cvs.R +++ b/R/calculate_cvs.R @@ -232,10 +232,10 @@ se_weighted_mean = function( } #' @title Calculate a coefficient of variation -#' @details Return a coefficient of variation at the 90% level +#' @details Return a coefficient of variation reflecting the ratio of the SE to the estimate #' @param estimate The estimate -#' @param se The standard error -#' @returns A coefficient of variation at the 90% level +#' @param se The standard error (SE) +#' @returns A coefficient of variation cv = function(estimate, se) { cv = se / estimate * 100 diff --git a/README.Rmd b/README.Rmd index 6bb9b59..f563a6a 100644 --- a/README.Rmd +++ b/README.Rmd @@ -27,7 +27,7 @@ 
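The `R/calculate_cvs.R` hunk above re-documents `cv()` as the ratio of the SE to the estimate, expressed as a percentage. For reference, a minimal standalone sketch mirroring the function body shown in the diff (`cv = se / estimate * 100`):

```r
# Coefficient of variation: the SE as a percentage of the estimate,
# mirroring the cv() body visible in the hunk above.
cv <- function(estimate, se) {
  se / estimate * 100
}

cv(estimate = 500, se = 25)  # returns 5: the SE is 5% of the estimate
```

Note that this returns the CV on a 0-100 scale rather than as a proportion, consistent with the function body shown in the diff.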
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![Codecov test coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](https://app.codecov.io/gh/UI-Research/urbnindicators) - +[![Codecov test coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](https://app.codecov.io/gh/UI-Research/urbnindicators) # Overview @@ -35,28 +35,32 @@ coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](htt **urbnindicators** aims to provide users with analysis-ready data from the American Community Survey (ACS). -With a single function call, you get: +What you can access: + +- Hundreds of pre-computed variables, including percentages and + the raw count variables used to produce them. Or flexibly query + any table your heart desires. -- Access to hundreds of standardized variables, such as percentages and - the raw count variables used to produce them. +- Flexibly specify your own derived variables with a series of + helper functions. - Margins of error for all variables--those direct from the API as - well as derived variables. + well as derived variables--with correctly calculated pooled margins + of error, per Census Bureau guidance. -- Meaningful, consistent variable names. +- Meaningful, consistent variable names--no more "B01003_001"; try + "total_population_universe" instead. (But if you're fond of the API's + variable names, those are stored in the codebook as well for cross-referencing.)
Plus some good, old-fashioned manual QC. - That said--use at your own risk. We cannot and do not guarantee there aren't bugs. - +- Tools to aggregate or interpolate your data to different + geographies--along with correctly adjusted margins of error. # Installation @@ -222,7 +226,7 @@ Confidence intervals are presented around each point but are extremely small"), ACS data are available for standard geographies (tracts, counties, states, etc.), but many analyses require non-standard areas like neighborhoods, school zones, or planning districts. -`interpolate_acs()` aggregates tract-level data to +`interpolate_acs()` aggregates source data to any user-defined geography, properly re-deriving percentages and propagating margins of error: @@ -274,10 +278,7 @@ df = compile_acs_data( "snap_not_received_percent", numerator_variables = c("snap_universe"), numerator_subtract_variables = c("snap_received"), - denominator_variables = c("snap_universe")), - define_one_minus( - "snap_received_complement", - source_variable = "snap_received_percent")), + denominator_variables = c("snap_universe"))), years = 2024, geography = "county", states = "DC") @@ -287,18 +288,8 @@ df %>% glimpse() ``` -The available helpers are: - -| Helper | Use case | -|---|---| -| `define_percent()` | Ratio of a numerator to a denominator | -| `define_across_percent()` | Percentages for every column matching a regex | -| `define_across_sum()` | Sum paired columns (e.g., male + female counts) | -| `define_one_minus()` | Complement of an existing percentage (1 - x) | -| `define_metadata()` | Codebook-only entry for a non-computed variable | - See `vignette("custom-derived-variables")` for detailed examples of -each helper. +each of the `define_*()` helpers. # Learn More @@ -331,9 +322,5 @@ Check out the vignettes for additional details: This package is built on top of and enormously indebted to `library(tidycensus)`, which provides the core functionality for -accessing the Census Bureau API. 
For users who want additional -variables, `library(tidycensus)` exposes the entire range of -pre-tabulated variables available from the ACS and provides access to -ACS microdata and other Census Bureau datasets. - -Learn more here: . +accessing the Census Bureau API. Learn more here: +. diff --git a/README.md b/README.md index 5cc04bb..ca99bbd 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,8 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![Codecov test coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](https://app.codecov.io/gh/UI-Research/urbnindicators) - +[![Codecov test +coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](https://app.codecov.io/gh/UI-Research/urbnindicators) # Overview @@ -19,28 +20,33 @@ coverage](https://codecov.io/gh/UI-Research/urbnindicators/graph/badge.svg)](htt **urbnindicators** aims to provide users with analysis-ready data from the American Community Survey (ACS). -With a single function call, you get: +What you can access: + +- Hundreds of pre-computed variables, including percentages and the raw + count variables used to produce them. Or flexibly query any table your + heart desires. -- Access to hundreds of standardized variables, such as percentages and - the raw count variables used to produce them. +- Flexibly specify your own derived variables with a series of helper + functions. - Margins of error for all variables–those direct from the API as well - as derived variables. + as derived variables–with correctly calculated pooled margins of + error, per Census Bureau guidance. -- Meaningful, consistent variable names. +- Meaningful, consistent variable names–no more “B01003_001”; try + “total_population_universe” instead. (But if you’re fond of the API’s + variable names, those are stored in the codebook as well for + cross-referencing.)
- A codebook that describes how each variable is calculated. -- The built-in capacity to pull data for multiple years and multiple - states. +- Data for multiple years and multiple states out of the box. - Supplemental measures, such as population density, that aren’t available from the ACS. -- Built-in quality checks to help ensure that calculated variables and - measures of error are accurate. Plus some good, old-fashioned manual - QC. That said–use at your own risk. We cannot and do not guarantee - there aren’t bugs. +- Tools to aggregate or interpolate your data to different + geographies–along with correctly adjusted margins of error. # Installation @@ -209,7 +215,7 @@ Confidence intervals are presented around each point but are extremely small"), ACS data are available for standard geographies (tracts, counties, states, etc.), but many analyses require non-standard areas like neighborhoods, school zones, or planning districts. `interpolate_acs()` -aggregates tract-level data to any user-defined geography, properly +aggregates source data to any user-defined geography, properly re-deriving percentages and propagating margins of error: ``` r @@ -265,10 +271,7 @@ df = compile_acs_data( "snap_not_received_percent", numerator_variables = c("snap_universe"), numerator_subtract_variables = c("snap_received"), - denominator_variables = c("snap_universe")), - define_one_minus( - "snap_received_complement", - source_variable = "snap_received_percent")), + denominator_variables = c("snap_universe"))), years = 2024, geography = "county", states = "DC") @@ -284,18 +287,8 @@ df %>% #> $ snap_not_received_percent_M 0.0071 ``` -The available helpers are: - -| Helper | Use case | -|---------------------------|-------------------------------------------------| -| `define_percent()` | Ratio of a numerator to a denominator | -| `define_across_percent()` | Percentages for every column matching a regex | -| `define_across_sum()` | Sum paired columns (e.g., male + female counts) | 
-| `define_one_minus()` | Complement of an existing percentage (1 - x) | -| `define_metadata()` | Codebook-only entry for a non-computed variable | - See `vignette("custom-derived-variables")` for detailed examples of each -helper. +of the `define_*()` helpers. # Learn More @@ -328,9 +321,5 @@ Check out the vignettes for additional details: This package is built on top of and enormously indebted to `library(tidycensus)`, which provides the core functionality for -accessing the Census Bureau API. For users who want additional -variables, `library(tidycensus)` exposes the entire range of -pre-tabulated variables available from the ACS and provides access to -ACS microdata and other Census Bureau datasets. - -Learn more here: . +accessing the Census Bureau API. Learn more here: +. diff --git a/man/cv.Rd b/man/cv.Rd index b7f4704..596bfac 100644 --- a/man/cv.Rd +++ b/man/cv.Rd @@ -9,14 +9,14 @@ cv(estimate, se) \arguments{ \item{estimate}{The estimate} -\item{se}{The standard error} +\item{se}{The standard error (SE)} } \value{ -A coefficient of variation at the 90\% level +A coefficient of variation } \description{ Calculate a coefficient of variation } \details{ -Return a coefficient of variation at the 90\% level +Return a coefficient of variation reflecting the ratio of the SE to the estimate } diff --git a/vignettes/codebook.Rmd b/vignettes/codebook.Rmd index 7b949b1..a369457 100644 --- a/vignettes/codebook.Rmd +++ b/vignettes/codebook.Rmd @@ -18,7 +18,7 @@ knitr::opts_chunk$set( comment = "#>") ``` -```{r setup, echo = FALSE} +```{r setup} library(urbnindicators) library(dplyr) library(reactable) @@ -68,7 +68,10 @@ critical. ## Browse the codebook Use the search box below to filter by variable name, type, or -definition text. +definition text.
Note that this codebook reflects all variables from +the tables returned by `list_tables()`, but if you were to specify +different tables in your `compile_acs_data()` call, your codebook +would list a different set of variables. ```{r, echo = FALSE} reactable( diff --git a/vignettes/custom-geographies.Rmd b/vignettes/custom-geographies.Rmd index e5fb093..710bd7f 100644 --- a/vignettes/custom-geographies.Rmd +++ b/vignettes/custom-geographies.Rmd @@ -20,7 +20,7 @@ knitr::opts_chunk$set( comment = "#>") ``` -```{r setup, echo = FALSE} +```{r setup} library(dplyr) library(ggplot2) library(scales) diff --git a/vignettes/design-philosophy.Rmd b/vignettes/design-philosophy.Rmd index 84de5b1..1b9d98a 100644 --- a/vignettes/design-philosophy.Rmd +++ b/vignettes/design-philosophy.Rmd @@ -16,15 +16,10 @@ knitr::opts_chunk$set( comment = "#>") ``` -**urbnindicators** makes a number of opinionated design choices about -what data to select from the Census Bureau API, how to process it, what -relevant derived variables to calculate, and even which types of -geographies to support. - +**urbnindicators** makes a number of opinionated design choices. "Opinionated" doesn't mean that these decisions are the best ones for every user or use-case, but these decisions are designed to either speed -or improve the accuracy of a common use-case involving a large set of variables -(optionally over multiple years). +or improve the accuracy of common workflows. ## Design choices @@ -39,14 +34,6 @@ or improve the accuracy of a common use-case involving a large set of variables larger-population geographies, such as tracts, zip codes, and some places and counties. -- **Support only a subset of ACS variables.** Pre-calculated ACS - estimates cover tens of thousands of different variables. But, in - our work, only a small fraction of these is used frequently.
We've - tried to select those common variables to return by default, - cognizant that at present, every additional variable returned - results in a slower query. Open an issue in GitHub if you'd like to - see additional variables added to the default set. - - **Rename all variables.** The default variable names returned by the API are not human-friendly. Not only is it challenging to determine what a given variable represents when you're looking at a @@ -61,7 +48,8 @@ or improve the accuracy of a common use-case involving a large set of variables documentation anywhere (apart from the codebook returned by this package!) of a variable named, for example, `race_personofcolor_percent`. Variables in the codebook have - their original API names included in their definitions. + their original API names included in their definitions so that you + can cross-reference these as needed. - **Use a consistent variable naming convention.** Variable names follow the pattern @@ -77,7 +65,9 @@ or improve the accuracy of a common use-case involving a large set of variables are expressed as proportions (e.g., 0.25 rather than 25). This avoids ambiguity and simplifies downstream calculations (e.g., multiplying a proportion by a population count). Use - `scales::percent()` for display formatting. + `scales::percent()` for display formatting. You can always just multiply + these values (and the MOEs) by 100 if you prefer; this multiplication + requires no other adjustments to the MOEs. - **Always propagate margins of error.** When `urbnindicators` derives a new variable from two or more raw ACS estimates, it @@ -88,11 +78,3 @@ or improve the accuracy of a common use-case involving a large set of variables `vignette("quantified-survey-error")`) but are far preferable to dropping error information entirely. -- **Design for extensibility.** New ACS tables can be added to the - package via a single `register_table()` call in - `R/table_registry.R`. 
The registration declaratively specifies - raw variables, derived calculations, and codebook metadata; the - codebook and margin of error calculations are generated - automatically. See `vignette("custom-derived-variables")` for a - walkthrough. - diff --git a/vignettes/quantified-survey-error.Rmd b/vignettes/quantified-survey-error.Rmd index ce235d6..295f7b5 100644 --- a/vignettes/quantified-survey-error.Rmd +++ b/vignettes/quantified-survey-error.Rmd @@ -20,7 +20,7 @@ knitr::opts_chunk$set( comment = "#>") ``` -```{r setup, echo = FALSE} +```{r setup} library(dplyr) library(ggplot2) library(tidyr) @@ -86,8 +86,8 @@ measures of error. have no access to a car." (And then we should include, either as a footnote or in the body of the document, that this and other MOEs are calculated at the 90% confidence level). What this means in - practice is that if we were to repeat 100 times–using exactly the - same methods–our approach to calculating this estimate, 90 of those + practice is that if we were to repeat 100 times--using exactly the + same methods--our approach to calculating this estimate, 90 of those times we would produce a parallel estimate between 15% and 25%, while 10 of those times, our estimate would fall outside this range. diff --git a/vignettes/urbnindicators.Rmd b/vignettes/urbnindicators.Rmd index caf83ed..4bcefa3 100644 --- a/vignettes/urbnindicators.Rmd +++ b/vignettes/urbnindicators.Rmd @@ -20,7 +20,7 @@ knitr::opts_chunk$set( out.width = "100%") ``` -```{r setup, echo = FALSE} +```{r setup} library(urbnindicators) library(tidycensus) library(dplyr) @@ -124,9 +124,7 @@ how `library(urbnindicators)` helps simplify this task.) of a call to `tidycensus::get_acs()`, a call to `urbnindicators::compile_acs_data()` returns a dataset of both raw ACS measures and derived estimates (such as the share of all individuals who -are disabled). 
And that dataset can include a range of measures–-spanning -things such as health insurance, employment, housing costs, and race and -ethnicity–-not just one variable or table from the ACS. +are disabled). ### Acquire data @@ -136,8 +134,7 @@ geographies. Note that selecting more tables or more geographic units--either by selecting a `geography` option comprising more units, by selecting more states, or selecting -more years--can significantly increase the query time. A tract-level query of the -entire US for all supported variables can take 30+ minutes. +more years--can significantly increase the query time. Use `list_tables()` to see some of the most commonly-used tables: @@ -251,12 +248,19 @@ codebook %>% pull(definition) ``` -### Aggregate to custom geographies +### Create your own derived variables -ACS data are available for standard geographies, but many analyses -require non-standard areas like neighborhoods or planning districts. -`interpolate_acs()` aggregates tract-level data to any +For tables from `list_tables()`, raw ACS variables and derived variables are automatically returned. But for other tables, `urbnindicators` does not pre-compute any derived variables. And even for tables reflected in `list_tables()`, +you may want alternate or additional derived variables. `urbnindicators` provides a +suite of helper functions (`define_*()`) that let you specify how to create these derived variables; the helpers abstract away the actual calculations and ensure that you get an updated codebook and correctly pooled margins +of error for each of your newly derived variables. See [Custom Derived Variables](custom-derived-variables.html) for more. + +### Interpolate data to custom geographies + +ACS data are available for many statistical and political geographies, but many analyses +focus on other geographies like neighborhoods or planning districts.
+`interpolate_acs()` translates data from ACS-supported geographies to any user-defined geography, properly re-deriving percentages and propagating margins of error. See -[Aggregating to Custom Geographies](custom-geographies.html) for a +[Translating ACS Data to Custom Geographies](custom-geographies.html) for a worked example.
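Several hunks above reference "correctly calculated pooled margins of error, per Census Bureau guidance." For reviewers unfamiliar with that guidance, here is a minimal sketch of the standard approximation formulas for derived ACS estimates. The helper names are hypothetical; the package's internal implementation is not shown in this diff:

```r
# Illustrative sketch of the Census Bureau's approximations for MOEs of
# derived ACS estimates; hypothetical helpers, not urbnindicators code.

# MOE of a sum (or difference) of estimates: the square root of the sum
# of the squared component MOEs.
moe_sum <- function(moes) sqrt(sum(moes^2))

# MOE of a proportion p = num / den. When the value under the square
# root is negative, Census guidance recommends falling back to the
# ratio formula (plus instead of minus under the root).
moe_proportion <- function(num, den, moe_num, moe_den) {
  p <- num / den
  under_root <- moe_num^2 - p^2 * moe_den^2
  if (under_root < 0) {
    sqrt(moe_num^2 + p^2 * moe_den^2) / den  # ratio-style fallback
  } else {
    sqrt(under_root) / den
  }
}

moe_sum(c(3, 4))                   # returns 5
moe_proportion(250, 1000, 20, 40)  # ~0.0173, on a proportion (0-1) scale
```

The proportion MOE comes out on a 0-1 scale, which matches the package's stated convention of expressing percentages as proportions; multiplying both the estimate and its MOE by 100 requires no other adjustment.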