Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 19 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,11 +78,25 @@ independent uniform top-bucket sets — overlap **hypergeometrically**,
unrelated document clears a given overlap threshold — is closed-form and
data-independent, not a tuned cutoff. (Whether *true* neighbours clear the bar
is empirical; this is an exact candidate-generation null, not a
retrieval-optimality theorem.) The invariance underneath it — that the rank
transform is unchanged by any strictly monotone reparametrisation of the
coordinates — is separately machine-checked in Lean, with the formalisation
accompanying the paper. Details in
[`docs/RANK_MODES.md`](docs/RANK_MODES.md).
retrieval-optimality theorem.) Two pieces of this are separately machine-checked in Lean 4, both `sorry`-free
on Lean's standard axiom base (`propext`, `Classical.choice`, `Quot.sound`):

- the **ordinal invariance** on which the rank transform rests — that a vector's
sorting permutation is unchanged by any strictly monotone reparametrisation
of its coordinates — in
[`takens-formalization`](https://github.com/Project-Navi/takens-formalization)
(theorem `isOrdinalPatternOf_comp_strictMono`); and
- the **shape of the bitmap candidate filter** — that under a finite
monotone-likelihood-ratio overlap-tilt model, an overlap-count *threshold*
(the popcount cutoff) is the Bayes-optimal deterministic admission rule, and
the uniform constant-weight null gives that threshold exactly the
hypergeometric upper tail — in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
(theorem `exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
Comment thread
Fieldnote-Echo marked this conversation as resolved.
This is an *in-model* result: it fixes the optimal rule's shape under the
stated assumptions, not a claim that any given corpus obeys the model.

Details in [`docs/RANK_MODES.md`](docs/RANK_MODES.md).

## Quickstart

Expand Down
26 changes: 17 additions & 9 deletions docs/RANK_MODES.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,15 +128,23 @@ don't have a hypergeometric null because they don't have fixed
bucket cardinalities — their score distribution depends on the
unknown embedding distribution.

**A research program this suggests.** The chain — representation →
statistic → retrieval theorem → systems implementation — has a
plausible formal target. Under a shared-latent-support model where
relevant documents have elevated coordinates on a query-specific
support set `S_q`, the top-bucket overlap statistic is monotone in
the likelihood ratio for relevance, suggesting that bitmap probing
may approach Bayes-optimality under that model. We do not claim
that theorem here; this section flags it as the natural
mathematical direction for the empirical results below.
**A research program this suggests — now partly machine-checked.** The
chain — representation → statistic → retrieval theorem → systems
implementation — now has a checked link in the middle. Under a finite
monotone-likelihood-ratio overlap-tilt model (the formal cousin of a
shared-latent-support model, where relevant documents have elevated
coordinates on a query-specific support set `S_q`), the top-bucket
overlap statistic is monotone in the likelihood ratio for relevance,
and an overlap-count threshold — the popcount cutoff — is the
Bayes-optimal deterministic admission rule, with the uniform
constant-weight null assigning that threshold exactly the
hypergeometric upper tail. That is proved `sorry`-free in
[`ordvec-formalization`](https://github.com/Fieldnote-Echo/ordvec-formalization)
(`exists_uniformBitmapOverlapTail_finiteBayesRisk_le_and_hypergeomTail`).
It is an *in-model* result: it fixes the shape of the optimal rule
under the stated assumptions, not a claim that any given corpus obeys
the model — whether real neighbours clear the bar stays empirical,
which is what the bench and the paper measure.
Comment thread
Fieldnote-Echo marked this conversation as resolved.

The systems consequence is what the bench measures: at a moderate M
the bitmap probe captures most of exact RankQuant's top-10 neighbours,
Expand Down
3 changes: 3 additions & 0 deletions src/bitmap.rs
Original file line number Diff line number Diff line change
Expand Up @@ -180,6 +180,7 @@ impl Bitmap {
/// `Vec<u32>` of all N scores and `select_nth_unstable` the
/// top-`m`: O(N + m log m). The ~808 KiB temp at N=207k is
/// cheap relative to the cost it saves at M ≥ 1000.
#[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"]
pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec<u32> {
assert_all_finite(q);
let m_eff = m.min(self.n_vectors);
Expand Down Expand Up @@ -224,6 +225,7 @@ impl Bitmap {
/// `queries` is a flat `batch * dim` f32 slice. Returns
/// `Vec<Vec<u32>>` with one top-`m` set per query, sorted by
/// overlap descending.
#[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"]
pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec<Vec<u32>> {
let dim = self.dim;
let batch = queries.len() / dim;
Expand Down Expand Up @@ -294,6 +296,7 @@ impl Bitmap {
/// [`Self::top_m_candidates_batched`] in parallel via rayon. Use
/// when the full query workload is larger than one batch fits
/// efficiently in L2/L3.
#[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"]
pub fn top_m_candidates_batched_chunked(
&self,
queries: &[f32],
Expand Down
1 change: 1 addition & 0 deletions src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ pub type RankQuantFastscanIndex = RankQuantFastscan;
/// type. Prefer the slice accessors above for read-only per-query access —
/// exposing the flat buffers as the stable representation is the trade-off for
/// that zero-copy interop.
#[must_use = "search runs the full scan to produce these results; dropping them discards that work"]
pub struct SearchResults {
pub scores: Vec<f32>,
pub indices: Vec<i64>,
Expand Down
2 changes: 2 additions & 0 deletions src/sign_bitmap.rs
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ impl SignBitmap {
/// `m_eff` produce a deterministic survivor set across runs and
/// SIMD dispatch paths — same audit discipline as
/// [`crate::Bitmap::top_m_candidates`].
#[must_use = "this scans the corpus to generate candidates; dropping the result discards that work"]
pub fn top_m_candidates(&self, q: &[f32], m: usize) -> Vec<u32> {
let m_eff = m.min(self.n_vectors);
if m_eff == 0 {
Expand Down Expand Up @@ -157,6 +158,7 @@ impl SignBitmap {
/// top-`m` candidate sets for `batch` queries in parallel. Mirrors
/// [`crate::Bitmap::top_m_candidates_batched`] in kernel
/// shape (CHUNK=8 hot + tail) and tie-break semantics.
#[must_use = "this scans the corpus per query to generate candidates; dropping the result discards that work"]
pub fn top_m_candidates_batched(&self, queries: &[f32], m: usize) -> Vec<Vec<u32>> {
let dim = self.dim;
let batch = queries.len() / dim;
Expand Down
Loading