Skip to content

Promote bucket-overlap contingency/projection math: dense (pairwise) + indexed (MultiBucketBitmap) #219

@Fieldnote-Echo

Description

@Fieldnote-Echo

Context

ordgraph (ordgraph-proto/src/edge.rs) builds full bucket-overlap contingency tables C[a][b] = |{coords: query∈bucket a ∧ doc∈bucket b}| over fixed-composition BucketCodes, plus projections: top overlap, diagonal agreement, band agreement, top-group overlap, bucket L1 distance, coarsened counts, rankquant symmetric score. ordvec's experimental MultiBucketBitmap is the same bilinear bucket-overlap math (Σ_{a,b} W[a,b]·|Q_a ∩ D_b|) in bitmap form.

Profiling (branch experiment/multibucket-profile) established the durable conclusions:

  • Symmetric scoring's recall loss (~0.10 @10 vs asymmetric) is real and kernel-independent.
  • Full-table materialization is strictly more work than one scalar projection.
  • Dense contingency dominates for one pair; dedicated scalar kernels dominate for one score.
  • MultiBucketBitmap earns its keep only when the workload is many docs AND multiple projections / full-table evidence per doc.
  • MultiBucketBitmap is NOT the default single-score retrieval path (two-stage / FastScan / RankQuant-asym dominate it on the full recall/latency/bytes frontier at every dim).

Reframe: two APIs, not one

1. Stateless dense-code contingency API (ordgraph migrates to this first)

  • Inputs: two bucket-code / rank-code vectors.
  • Output: full buckets × buckets contingency table.
  • Projection helpers: diagonal agreement, band agreement, top overlap, top-group overlap, bucket L1, rankquant-like score, learned/custom weight matrices.
  • One O(dim) histogram pass builds C; every projection is then cheap. This is the right container for one (query, doc) pair.

2. Indexed MultiBucketBitmap API (future indexed companion)

  • Inputs: query bucket code + many document bitmaps.
  • Output: full tables per doc or batched projections per doc.
  • Must avoid rescanning per projection (compute table/accumulators once).
  • Needs a VPOPCNTDQ / AVX-512-style popcount kernel with a portable fallback.
  • Niche: one query vs many docs, full joint table or many projections per doc. Not positioned as a single-score retriever.

Boundary

  • ordvec owns: contingency math, projection scoring, packed/vectorized kernels.
  • ordgraph owns: "this projection result is evidence for X" (EdgeEvidence graph/event payload).

Acceptance criteria

  • First expose the dense-code contingency + projection API.
  • Parity against ordgraph's current pairwise evidence outputs (ordgraph::edge::Projection / EdgeEvidence) before the local copy is deleted.
  • Treat bitmap MultiBucketBitmap as a second indexed API, gated on table / batched-projection output (no per-projection rescan).
  • Vectorized popcount kernel (AVX-512 VPOPCNTDQ) + portable scalar fallback for the indexed path; scalar-parity test.
  • Benchmarks in three regimes: (a) one pair / many projections, (b) one query / many docs / one projection, (c) one query / many docs / many projections.
  • Explicitly document that MultiBucketBitmap is not the default single-score retrieval path.

Notes

Not for 0.5.0. See docs/multibucket_profile_findings.md on the branch for the full profiling + adversarial verdict.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions