Feature request: Add bin_prop computed variable to stat_bin for proportion-based histograms #6478

Open
@kieran-mace

Summary

stat_bin currently lacks the after_stat(prop) functionality that stat_count provides, making it difficult to create proportion-based visualizations for continuous data. This feature request proposes adding a bin_prop computed variable to stat_bin to achieve feature parity.

Problem Description

Currently, users can create proportion-based bar charts with discrete data using stat_count:

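# This works with discrete data. The explicit group aesthetic is needed:
# without it, each x/fill combination is its own group and every prop is 1.
ggplot(data, aes(x = discrete_var, y = after_stat(prop), fill = group, group = group)) +
  geom_bar(position = "dodge")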

However, there's no equivalent for continuous data with stat_bin:

# This doesn't work - no prop variable available
ggplot(data, aes(x = continuous_var, y = after_stat(prop), fill = group)) +
  geom_histogram(position = "dodge", bins = 10)

Use Case Example

Consider analyzing weight distribution by sex. Users want to see the proportion of each sex within weight bins:

# Desired functionality (currently not possible)
ggplot(people_data, aes(x = weight, y = after_stat(bin_prop), fill = sex)) +
  stat_bin(geom = "col", bins = 8, position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Proportion within bin")

This would show insights such as:

  • Lower weight bins: ~100% female
  • Middle weight bins: Mixed proportions
  • Higher weight bins: ~100% male

Something like this:

Proposed Solution

Add a bin_prop computed variable to stat_bin that calculates the proportion of each group within each bin:

  • bin_prop = count_in_group / total_count_in_bin
  • Handles multiple groups and respects weights
  • For single groups: bin_prop = 1 in every non-empty bin (backwards compatible)
  • For empty bins: bin_prop = 0
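
The rules above can be checked by hand on a small input. The following is a minimal base R sketch, not ggplot2 code: bin_prop_sketch() and its arguments are hypothetical names used only for illustration, and the breaks are passed in explicitly, whereas stat_bin would choose them itself.

```r
# Hypothetical helper (not part of ggplot2): computes the proposed bin_prop
# for each bin/group combination using base R only.
bin_prop_sketch <- function(x, group, breaks, weight = rep(1, length(x))) {
  # Assign every observation to a bin; include.lowest keeps the minimum value.
  bin <- cut(x, breaks = breaks, include.lowest = TRUE)

  # Weighted count per bin (rows) and group (columns); empty bins stay as 0.
  counts <- xtabs(weight ~ bin + group)

  # Divide each row by its total: count_in_group / total_count_in_bin.
  prop <- prop.table(counts, margin = 1)

  # Empty bins produce 0 / 0 = NaN; the proposal defines bin_prop = 0 there.
  prop[is.nan(prop)] <- 0

  # Long format: one row per bin/group combination.
  as.data.frame(as.table(prop), responseName = "bin_prop")
}

# Toy data: two "f" values in the first bin, one "f" and two "m" in the
# second, and nothing in the third.
res <- bin_prop_sketch(
  x      = c(1, 2, 6, 7, 8),
  group  = c("f", "f", "f", "m", "m"),
  breaks = c(0, 5, 10, 15)
)
print(res)
```

In this toy input the first bin is entirely "f", the second is one third "f" and two thirds "m", and the empty third bin gets 0 for both groups. Passing a weight vector changes the counts that are summed but not the logic.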

Benefits

  1. Feature parity with stat_count
  2. Enables proportion-based histograms for continuous data
  3. Useful for demographic analysis and group comparisons
  4. Backwards compatible - doesn't break existing code

Alternatives Considered

  1. Manual calculation: Users could manually calculate proportions, but this is cumbersome and error-prone
  2. Using stat_count with discretized data: Loses the benefits of proper binning algorithms
  3. Custom stat function: Would require users to write their own implementation
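
For reference, the manual route from alternative 1 might look like the sketch below with current ggplot2. It lets stat_bin do the binning, extracts the computed counts with layer_data(), and derives the proportions by hand. The data values are made up, and recovering the sex labels from the group index assumes the default grouping, where group numbers follow the factor level order.

```r
library(ggplot2)

# Illustrative data only: the names mirror the use case above, the values
# are simulated.
set.seed(42)
people_data <- data.frame(
  sex    = rep(c("female", "male"), each = 200),
  weight = c(rnorm(200, mean = 62, sd = 8), rnorm(200, mean = 82, sd = 10))
)

# Step 1: build an ordinary histogram so stat_bin chooses the bins.
p <- ggplot(people_data, aes(x = weight, fill = sex)) +
  geom_histogram(bins = 8)

# Step 2: extract the computed layer data (one row per bin and group).
binned <- layer_data(p)

# Step 3: share of each group within its bin; empty bins get 0, not NaN.
bin_total <- ave(binned$count, binned$xmin, FUN = sum)
binned$bin_prop <- ifelse(bin_total > 0, binned$count / bin_total, 0)

# Step 4: map the integer group index back to a label (assumes default
# grouping, so group 1 is the first factor level, and so on).
binned$sex <- levels(factor(people_data$sex))[binned$group]

# Step 5: plot the proportions at the bin centres.
ggplot(binned, aes(x = x, y = bin_prop, fill = sex)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "weight", y = "Proportion within bin")
```

The number of steps, and the need to rebuild the group labels, is the overhead that a built-in bin_prop variable would remove.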

Expected API

# Documentation would include:
#' @eval rd_computed_vars(
#'   count    = "number of points in bin.",
#'   density  = "density of points in bin, scaled to integrate to 1.",
#'   ncount   = "count, scaled to a maximum of 1.",
#'   ndensity = "density, scaled to a maximum of 1.",
#'   width    = "widths of bins.",
#'   bin_prop = "proportion of points in bin that belong to each group."
#' )

This would enable the intuitive usage:

aes(y = after_stat(bin_prop))

Additional Context

This feature would be particularly valuable for:

  • Demographic analysis (age/income by group)
  • Scientific data (measurements by treatment group)
  • Market research (customer segments by behavior)
  • Any scenario where you want to show group composition within continuous ranges

The implementation should handle edge cases like empty bins, single groups, and weighted data appropriately.
