api: add unique-candidate helpers for subset rerank #190

@Fieldnote-Echo

Context

RankQuant::search_asymmetric_subset intentionally accepts duplicate candidate row IDs and scores each entry independently. That is a valid low-level contract, but DB-style integrations frequently need unique final rows. Today every downstream caller must remember to deduplicate candidate lists before rerank.

This issue asks for a small, explicit helper or option so callers can choose unique-candidate semantics without reimplementing ordvec-specific edge cases.

Related:

Evidence

  • RankQuant::search_asymmetric_subset docs state duplicate candidates can produce duplicate returned global IDs and callers must deduplicate first: src/quant.rs:527-530.
  • Determinism docs preserve duplicate-candidate behavior as part of the public contract: docs/determinism.md:15-22, docs/determinism.md:82-86.
  • Bitmap::search_subset has the same duplicate-entry behavior: src/bitmap.rs:416-422.
  • docs/RANK_MODES.md repeats that callers must deduplicate candidate lists before reranking: docs/RANK_MODES.md:430-432.

Proposed Shape

Any of these would satisfy the need:

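```rust
// Standalone in-place helper.
pub fn dedup_candidates_stable(candidates: &mut Vec<u32>);

// Duplicate policy a caller could select.
pub enum CandidateDupPolicy {
    PreserveEntries,
    UniqueRows,
}

// Method variant taking options (signature only; SubsetSearchOptions is not defined here).
pub fn search_asymmetric_subset_with_options(
    &self,
    query: &[f32],
    candidates: &[u32],
    k: usize,
    options: SubsetSearchOptions,
) -> (Vec<f32>, Vec<i64>);
```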

A standalone helper is probably enough for the MVP, as long as it is documented next to subset rerank and two-stage APIs.
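
As a rough illustration of how small the standalone helper could be, here is a minimal sketch of `dedup_candidates_stable`. It assumes first-seen order, which the issue deliberately leaves open, and it uses a `HashSet` for membership tracking; neither choice is taken from the ordvec codebase.

```rust
use std::collections::HashSet;

/// Removes repeated row IDs in place, keeping the first occurrence of each.
/// Assumption: "stable" means first-seen order; the issue does not fix this.
pub fn dedup_candidates_stable(candidates: &mut Vec<u32>) {
    let mut seen = HashSet::with_capacity(candidates.len());
    // `HashSet::insert` returns false when the value is already present,
    // and `Vec::retain` visits elements in order, so only the first
    // occurrence of each ID survives.
    candidates.retain(|id| seen.insert(*id));
}
```

With this sketch, a candidate list `[7, 3, 7, 1, 3, 9]` becomes `[7, 3, 1, 9]`. The cost is one pass plus a hash set sized to the input, which is likely negligible next to the rerank itself.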

Acceptance Criteria

  • Provides a public way to convert an arbitrary candidate list into unique row IDs with deterministic order.
  • Defines whether the unique order is first-seen order, ascending row ID, or score order preserved after rerank.
  • Tests cover duplicates, unsorted inputs, empty inputs, and out-of-range values if validation is included.
  • Docs make the low-level duplicate-preserving behavior and the unique-helper behavior both explicit.
  • Batched subset / docset / two-stage APIs either use the helper when requested or document their duplicate policy.
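
To make the ordering and validation choices in the criteria above concrete, the sketch below contrasts two possible policies. Both function names, the `n_rows` parameter, and the error type (the first offending ID) are hypothetical and exist only for illustration; the issue does not commit to any of them.

```rust
use std::collections::HashSet;

/// Hypothetical: unique row IDs in ascending row-ID order.
pub fn unique_ascending(candidates: &[u32]) -> Vec<u32> {
    let mut out = candidates.to_vec();
    out.sort_unstable();
    // `dedup` removes consecutive repeats, which covers every repeat once sorted.
    out.dedup();
    out
}

/// Hypothetical: unique row IDs in first-seen order, with optional-style
/// validation that rejects any ID >= `n_rows`. The error carries the first
/// out-of-range ID encountered.
pub fn unique_first_seen_checked(candidates: &[u32], n_rows: u32) -> Result<Vec<u32>, u32> {
    let mut seen = HashSet::with_capacity(candidates.len());
    let mut out = Vec::with_capacity(candidates.len());
    for &id in candidates {
        if id >= n_rows {
            return Err(id);
        }
        if seen.insert(id) {
            out.push(id);
        }
    }
    Ok(out)
}
```

The cases listed in the criteria (duplicates, unsorted input, empty input, out-of-range values) map directly onto assertions against functions of this shape. Score-order uniqueness after rerank is not shown, since it depends on the search result layout rather than on the candidate list alone.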

Non-goals

  • Do not silently change existing search_asymmetric_subset duplicate-preserving behavior.
  • No external ID map or database-level duplicate policy.

Labels: enhancement (New feature or request)