Vectorize se_simple() and se_sum() for performance #82

@wcurrangroome

Description

Problem

se_simple() and se_sum() in R/calculate_cvs.R are the most-called functions in the MOE pipeline (invoked hundreds of times inside dplyr::across() in calculate_moes()), but both have unnecessary overhead:

se_simple() (line 8)

Uses purrr::map_dbl(moe, ~ .x / 1.645) for an operation R already vectorizes natively: moe / 1.645. The map_dbl wrapper adds per-element function-call overhead for no benefit.
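The two forms can be compared directly. A minimal sketch (base R's vapply() stands in here for purrr::map_dbl() so the example runs without purrr installed; 1.645 is the 90%-confidence Z divisor quoted above):

```r
# Per-element pattern, equivalent to the current purrr::map_dbl() body:
# one R function call per element of moe.
se_simple_mapped <- function(moe) {
  vapply(moe, function(x) x / 1.645, numeric(1))
}

# Proposed body: a single vectorized division over the whole vector.
se_simple_vectorized <- function(moe) moe / 1.645

moe <- c(16.45, 32.9, 164.5)
isTRUE(all.equal(se_simple_mapped(moe), se_simple_vectorized(moe)))  # TRUE
```

Both perform the same per-element division, so results are identical; only the dispatch overhead differs.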

se_sum() (lines 16-61)

For each call, this function:

  1. Converts list inputs to data frames
  2. Pivots twice (pivot_longer + pivot_wider)
  3. Groups and splits by observation
  4. Maps 4 times (pull, se_simple, square, sum)
  5. Collects with map_dbl(sqrt)

The core operation is sqrt(sum(se^2)) row-wise, with a Census Bureau special case: when multiple zero-estimate observations are summed, keep only the largest MOE among them.

A vectorized matrix-based approach should be significantly faster — but needs careful testing against the current implementation for the zero-estimate edge case.
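One way to sketch that matrix-based rewrite. This is a sketch under assumptions, not the implementation: se_sum_vec is a hypothetical name, and inputs are assumed to be pre-assembled matrices with one row per observation and one column per term being summed.

```r
# Hypothetical vectorized replacement for se_sum().
# moe_mat, est_mat: one row per observation, one column per summed term.
se_sum_vec <- function(moe_mat, est_mat) {
  se_mat <- moe_mat / 1.645                # vectorized se_simple()
  zero <- est_mat == 0
  # Census rule: when a row sums multiple zero-estimate terms, keep only
  # the largest MOE (hence largest SE) among them; zero out the rest.
  for (i in which(rowSums(zero) > 1)) {
    zcols <- which(zero[i, ])
    drop <- zcols[-which.max(se_mat[i, zcols])]
    se_mat[i, drop] <- 0
  }
  sqrt(rowSums(se_mat^2))                  # row-wise sqrt(sum(se^2))
}

est <- rbind(c(5, 0, 0), c(1, 2, 3))
moe <- rbind(c(3, 4, 2), c(1, 1, 1))
se_sum_vec(moe, est)
```

The only non-vectorized piece is the loop over rows that actually hit the zero-estimate rule, which should be a small fraction of observations in practice.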

Suggested approach

  1. Replace se_simple() body with moe / 1.645
  2. Rewrite se_sum() using matrix operations
  3. Benchmark before/after on tract-level data for a large state
  4. Verify the zero-estimate edge case produces identical results
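For step 4, a per-observation reference implementation of the zero-estimate rule gives a ground truth to test any vectorized rewrite against. A sketch: se_sum_one is a hypothetical helper, and the rule follows the description above.

```r
# Scalar reference for one observation: moe and estimate are the vectors
# of terms being summed for that observation.
se_sum_one <- function(moe, estimate) {
  se <- moe / 1.645
  zero <- estimate == 0
  if (sum(zero) > 1) {
    # Keep only the largest SE among zero-estimate terms; drop the rest.
    keep <- which(zero)[which.max(se[zero])]
    se[zero & seq_along(se) != keep] <- 0
  }
  sqrt(sum(se^2))
}

# Two zero-estimate terms: the MOE of 4 is kept, the MOE of 2 is dropped.
se_sum_one(moe = c(3, 4, 2), estimate = c(5, 0, 0))  # ≈ 5 / 1.645
```

Comparing this row-by-row reference against the vectorized output over real tract-level inputs (and constructed edge cases with two or more zero estimates) would satisfy the "identical results" requirement.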

Context

Identified during code review. For tract-level data across multiple states, calculate_moes() is a significant portion of total compile_acs_data() runtime.
