Context
RankQuant::search_asymmetric_subset intentionally accepts duplicate candidate row IDs and scores each entry independently. That is a valid low-level contract, but DB-style integrations frequently need unique final rows. Today every downstream caller must remember to deduplicate candidate lists before rerank.
This issue asks for a small, explicit helper or option so callers can choose unique-candidate semantics without reimplementing ordvec-specific edge cases.
Related:
Evidence
- RankQuant::search_asymmetric_subset docs state duplicate candidates can produce duplicate returned global IDs and callers must deduplicate first: src/quant.rs:527-530.
- Determinism docs preserve duplicate-candidate behavior as part of the public contract: docs/determinism.md:15-22, docs/determinism.md:82-86.
- Bitmap::search_subset has the same duplicate-entry behavior: src/bitmap.rs:416-422.
- docs/RANK_MODES.md repeats that callers must deduplicate candidate lists before reranking: docs/RANK_MODES.md:430-432.
Proposed Shape
Any of these would satisfy the need:
pub fn dedup_candidates_stable(candidates: &mut Vec<u32>);

pub enum CandidateDupPolicy {
    PreserveEntries,
    UniqueRows,
}

pub fn search_asymmetric_subset_with_options(
    &self,
    query: &[f32],
    candidates: &[u32],
    k: usize,
    options: SubsetSearchOptions,
) -> (Vec<f32>, Vec<i64>);
A standalone helper is probably enough for the MVP, as long as it is documented next to subset rerank and two-stage APIs.
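For illustration, a minimal sketch of what the standalone helper could look like, assuming first-seen order is the chosen semantics (the issue leaves that choice open). Only the signature comes from the proposal above; the body is an assumption, not existing ordvec code.

```rust
use std::collections::HashSet;

/// Sketch only: removes duplicate row IDs in place, keeping the first
/// occurrence of each ID and preserving the relative order of the survivors.
/// First-seen order is an assumption; the issue has not fixed the ordering.
pub fn dedup_candidates_stable(candidates: &mut Vec<u32>) {
    let mut seen = HashSet::with_capacity(candidates.len());
    // `HashSet::insert` returns false when the value was already present,
    // so `retain` drops every repeat after the first occurrence.
    candidates.retain(|id| seen.insert(*id));
}
```

A caller would run this on the candidate list before passing it to search_asymmetric_subset, which keeps the low-level duplicate-preserving contract untouched. The cost is one hash insert per candidate and no extra allocation beyond the set.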
Acceptance Criteria
- Provides a public way to convert an arbitrary candidate list into unique row IDs with deterministic order.
- Defines whether unique order is first-seen order, row-id ascending, or score-order preserving after rerank.
- Tests cover duplicates, unsorted inputs, empty inputs, and out-of-range values if validation is included.
- Docs make the low-level duplicate-preserving behavior and the unique-helper behavior both explicit.
- Batched subset / docset / two-stage APIs either use the helper when requested or document their duplicate policy.
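To make the ordering and validation criteria concrete, here is a hedged sketch of two of the candidate orderings plus an optional range check. UniqueOrder, unique_candidates, and the n_rows parameter are hypothetical names introduced for this example and are not part of ordvec. Score-order preserving after rerank is not shown because it depends on the search result.

```rust
use std::collections::HashSet;

/// Hypothetical: which deterministic order the unique list should have.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum UniqueOrder {
    FirstSeen,
    RowIdAscending,
}

/// Hypothetical helper: returns the unique row IDs in the requested order,
/// or Err with the first ID that is >= n_rows (assumed validation rule).
pub fn unique_candidates(
    candidates: &[u32],
    n_rows: u32,
    order: UniqueOrder,
) -> Result<Vec<u32>, u32> {
    // Reject out-of-range IDs before doing any work.
    if let Some(&bad) = candidates.iter().find(|&&id| id >= n_rows) {
        return Err(bad);
    }
    let mut out = candidates.to_vec();
    match order {
        UniqueOrder::FirstSeen => {
            // Keep the first occurrence of each ID, in input order.
            let mut seen = HashSet::with_capacity(out.len());
            out.retain(|id| seen.insert(*id));
        }
        UniqueOrder::RowIdAscending => {
            // Sorting makes duplicates adjacent, so `dedup` removes them all.
            out.sort_unstable();
            out.dedup();
        }
    }
    Ok(out)
}
```

For the input [7, 3, 7, 1, 3, 9] with n_rows = 10, FirstSeen yields [7, 3, 1, 9] and RowIdAscending yields [1, 3, 7, 9]. Both are deterministic for unsorted inputs, so the tests listed above can assert exact output for either choice.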
Non-goals
- Do not silently change existing search_asymmetric_subset duplicate-preserving behavior.
- No external ID map or database-level duplicate policy.