Skip to content

Add full-text search benchmark support#794

Open
jamesgao-jpg wants to merge 127 commits into
zilliztech:mainfrom
jamesgao-jpg:fts_impl_only
Open

Add full-text search benchmark support#794
jamesgao-jpg wants to merge 127 commits into
zilliztech:mainfrom
jamesgao-jpg:fts_impl_only

Conversation

@jamesgao-jpg

@jamesgao-jpg jamesgao-jpg commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Context

VDBBench did not have a dedicated native full-text search benchmark path. This PR adds FTS as a first-class benchmark workload so BM25-based text search can be evaluated through the same task, runner, dataset, frontend, and result pipeline used by the rest of VDBBench.

Summary

  • Add full-text search benchmark support centered on BM25 text retrieval.
  • Introduce FTS performance cases that load text documents, run text queries, and report comparable performance results.
  • Wire FTS through backend execution, dataset preparation, runner orchestration, Streamlit task generation, and result formatting.
  • Use manifest-driven FTS ground truth so recall is measured against generated mathematical BM25 neighbors rather than semantic relevance labels.

Backends Covered

  • Milvus: native BM25 full-text indexing/search configuration and execution path.
  • Zilliz Cloud: FTS routing through the Milvus-compatible API with Cloud sparse auto-index handling, sharing the Milvus optimize/compaction path.
  • ElasticCloud / Elasticsearch: BM25 text indexing/search path with configurable BM25 k1/b support.
  • Vespa: BM25 schema/query path plus Vespa feed-client loading for large FTS document ingestion.
  • Turbopuffer: namespace-based full-text benchmark path.

Testing Infra Touched

  • Dataset layer: add MS MARCO and HotpotQA FTS dataset definitions, document/query loading, and S3-hosted mathematical BM25 ground-truth loading.
  • Case layer: add FTS performance case definitions, payload profiles, and task assembly support.
  • Runner layer: support FTS document loading plus serial recall and concurrent text-query search execution while preserving the existing backend insert contract.
  • Backend layer: route Vespa FTS loading through its backend insert path, where the Vespa feed client is managed for high-throughput ingestion.
  • Frontend layer: expose FTS cases and generate backend-specific FTS task configs from Streamlit.
  • Result layer: format FTS benchmark outputs alongside existing VDBBench results.

Datasets Supported

  • MS MARCO: small 100K, medium 1M, large 8.8M documents.
  • HotpotQA: small 100K, medium 1M, large 5.2M documents.

Metrics

  • Search metric type: BM25.
  • Accuracy metric: recall@k against generated mathematical BM25 ground truth.
  • Performance metrics: serial latency p95/p99, concurrent QPS, load duration, optimize duration, inserted count, payload profile, batch size, and load concurrency.

Denise2004 and others added 30 commits June 1, 2026 04:00
Co-authored-by: zilliz <zilliz@zillizdeMacBook-Pro.local>
James Gao and others added 10 commits June 24, 2026 06:29
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>

@XuanYang-cn XuanYang-cn left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes on the FTS benchmark correctness contract. Inline comments preserve the existing architecture review draft structure for the blocking and medium findings.

Comment on lines +270 to +292
@classmethod
def supports_full_text_search(cls) -> bool:
return False

def insert_documents(
self,
texts: list[str],
doc_ids: list[str],
**kwargs,
) -> tuple[int, Exception | None]:
msg = f"{self.name or self.__class__.__name__} does not support full-text document insert"
raise NotImplementedError(msg)

def search_documents(
self,
query: str,
k: int = 100,
payload_profile: PayloadProfile = PayloadProfile.IDS_ONLY,
**kwargs,
) -> list[str]:
msg = f"{self.name or self.__class__.__name__} does not support full-text document search"
raise NotImplementedError(msg)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add detailed doc string to guide others on how to support fts-related API.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ea08494. I added docstrings to the FTS extension points in VectorDB: supports_full_text_search, insert_documents, and search_documents, covering the capability gate, insert return contract, payload-profile behavior, and search return shape.

Comment thread vectordb_bench/backend/task_runner.py Outdated
analyzer_params = getattr(self.ca.dataset, "analyzer_params", {}) or {}
updates = {}

if self.config.db == DB.Milvus:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. 🔴 Blocking — Runner owns backend-specific FTS manifest policy

Where: vectordb_bench/backend/task_runner.py:250

What: _apply_fts_manifest_params() branches on self.config.db == DB.Milvus / DB.ElasticCloud / DB.Vespa to translate BM25/analyzer manifest params into backend config fields.

Why it matters:

  • Puts backend-specific policy inside CaseRunner, which should only orchestrate the benchmark lifecycle.
  • Every new FTS-capable backend needs another runner edit, and one missed branch silently changes correctness.
  • DB.ZillizCloud falls through with no manifest updates. ZillizCloudFtsConfig extends the Milvus FTS config and this PR claims Zilliz Cloud is covered, so Zilliz Cloud can run with default BM25/analyzer settings while recall is scored against ground truth built from the manifest settings.

Fix: Replace the runner branching with a config-owned hook:

  • db_case_config.apply_fts_manifest(manifest) -> FtsManifestApplyResult returns the updated config plus which BM25/analyzer params were applied vs. unapplied.
  • CaseRunner calls this one hook before init_db(), swaps in the returned config, and records the report without importing or naming any client.
  • MilvusFtsConfig owns BM25/analyzer translation; ZillizCloudFtsConfig inherits it. ElasticCloudFtsConfig, VespaFtsConfig, and TurboPufferFtsConfig each report supported vs. unsupported params explicitly.
  • Add a contract test: every FTS-capable config either applies k1, b, analyzer params, and avgdl where supported, or explicitly reports them unsupported.

Why a config hook: manifest params must be applied before the client exists, so a client hook is too late; a central registry would duplicate backend config semantics and drift.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2eca41d. I moved FTS manifest application out of CaseRunner and into DBCaseConfig.apply_fts_manifest(...) implementations. The runner now only prepares the FTS dataset, calls the config hook, records the returned applied/unapplied manifest report, and then initializes the backend. Milvus owns the BM25/analyzer mapping, Zilliz Cloud inherits that behavior through ZillizCloudFtsConfig(MilvusFtsConfig), Elasticsearch applies supported BM25 k1/b, and Vespa applies k1/b/avgdl while unsupported analyzer params are reported as unapplied. Verified with git diff --check, compile checks for the touched modules, a focused local manifest-hook test, and tests/test_db_client_resolution.py.

MAX_STREAMLIT_INT = (1 << 53) - 1

DB_LIST = [d for d in DB if d != DB.Test]
FTS_SUPPORTED_DBS = {DB.Milvus, DB.ElasticCloud, DB.Vespa, DB.TurboPuffer}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. 🔴 Blocking — Run-test UI excludes Zilliz Cloud FTS

Where: vectordb_bench/frontend/config/dbCaseConfigs.py:15

What: FTS_SUPPORTED_DBS lists Milvus, ElasticCloud, Vespa, TurboPuffer, but not DB.ZillizCloud. Yet the same file later defines a Zilliz Cloud FTS config mapping, and the PR commits Zilliz Cloud FTS results.

Why it matters: Backend support and committed results say Zilliz Cloud FTS exists, but the run-test UI filters the FTS cases out when Zilliz Cloud is selected. The matrix disagrees with itself.

Fix:

  • Add DB.ZillizCloud to FTS_SUPPORTED_DBS.
  • Add coverage for get_selectable_case_items(): every backend that has supports_full_text_search() and an FTS config must be selectable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 92ad3c4. I added DB.ZillizCloud to FTS_SUPPORTED_DBS, so the run-test UI exposes FullTextSearchPerformance cases for Zilliz Cloud consistently with CASE_CONFIG_MAP. I also added tests/test_fts_frontend_config.py to verify the FTS-supported DB set matches backends with FTS case configs and that Zilliz Cloud FTS cases are selectable. Verified with git diff --check and .venv/bin/python -m pytest -q tests/test_fts_frontend_config.py.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: I amended the previous commit to keep this patch minimal. The current pushed commit is 14b86fc and only adds DB.ZillizCloud to FTS_SUPPORTED_DBS. I removed the test file from the commit as requested. Verification run locally: git diff --check, compile check for dbCaseConfigs.py, and a direct assertion that Zilliz Cloud FTS cases are selectable.

raise TypeError(msg)
return manifest

def _load_manifest_params(self) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. 🟡 Medium — Row-order ground-truth contract is implicit

Where: vectordb_bench/backend/dataset.py:779, :822, :920; pyproject.toml:44

What: The FTS dataset layer loads ground truth by row order, then overwrites every source document id with str(self._doc_count). Concurrent insert does not shift those IDs because ConcurrentInsertRunner pulls batches under _iter_lock and clients insert the assigned doc_ids directly.

Status: The public msmarco_small_100k/neighbors.parquet artifact confirms this is intended: query id is 0..6979, and neighbors_id holds dense doc row IDs in 0..99999. Assuming ir_datasets iteration is stable, this is not a blocker.

Why it still matters: The contract lives only in code plus artifact convention. _load_manifest_params() reads only BM25/analyzer params and ignores source_ir_dataset, doc_limit, indexed_doc_count, and query_count that build_manifest.json records.

Fix: Validate those manifest fields before accepting the ground truth, and document the row-id policy explicitly so future datasets do not mix original doc IDs with generated row IDs.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in aa21119. I documented the FTS math-GT row-ID contract at the GT load point: neighbors.parquet stores dense document row IDs, and FtsDocumentIterator assigns the same row IDs during insertion. I also added manifest validation before accepting GT params: source_ir_dataset must match the translator dataset, doc_limit/indexed_doc_count must match the configured dataset size, and query_count must match the loaded query count when present. Verification: git diff --check, compile check for dataset.py, and a direct validation script covering valid manifests plus source/doc/query-count mismatches. I also tried the broader tests/test_dataset.py, but it attempted multi-GB dataset downloads and failed due to local disk exhaustion, so I cleaned the partial /tmp/vectordb_bench artifacts.

Comment thread vectordb_bench/metric.py
def calc_recall_fts(k: int, ground_truth: list[int], got: list[int]) -> float:
if not ground_truth or k <= 0:
return 0.0
gt_set = set(ground_truth)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4. 🟡 Medium — -1 ground-truth padding counted as a relevant doc

Where: vectordb_bench/backend/dataset.py:760; vectordb_bench/metric.py:124

What: msmarco_small_100k/neighbors.parquet pads neighbors_id with -1. _load_math_gt_data() stringifies every neighbor and keeps -1; calc_recall_fts() builds gt_set = set(ground_truth) with no sentinel filtering.

Why it matters: Padded rows put an impossible doc ID (-1) in the denominator. A backend can never return -1, so recall on those rows is depressed by padding rather than reflecting only real ground-truth docs.

Fix: Filter -1 / "-1" when loading FTS ground truth or inside calc_recall_fts(). Add a regression test with padded GT rows.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 87dcfa3. I filtered the FTS math-GT sentinel padding at load time in FtsDatasetManager._load_math_gt_data(), so downstream recall receives clean dense document IDs and does not count impossible -1 entries in the denominator. This keeps the metric code unchanged and affects only FTS GT loading. Verification: git diff --check, compile check for dataset.py, and a cached GT check showing MS MARCO small raw padding count 1968 and loaded padding count 0; HotpotQA small remained 0/0.

if data.empty:
return data

data = data[data["backend"] != "Milvus"].copy()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5. 🟡 Medium — FTS result dashboard hard-codes a different backend matrix

Where: vectordb_bench/frontend/pages/full_text_search.py:25, :127

What: The result page keeps its own hard-coded BACKEND_ORDER and explicitly filters out Milvus at line 127, while the backend and run-test UI include Milvus as an FTS-supported backend.

Why it matters: The result page disagrees with the benchmark support matrix and can silently drop valid result files.

Fix: Pick one intent and make it consistent:

  • If this is intentionally a cloud-only published-results page, say so in the title/caption.
  • If it is the FTS dashboard, derive the backend list from the supported FTS DB set and stop dropping valid result files.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ee44c28 by making the dashboard scope explicit instead of changing the published-result backend set. The page now uses the title Full Text Search Cloud Results, adds a short caption naming the published backend subset, and documents the hard-coded backend order as the cloud/service result subset. The Milvus filter remains intentional for this published-results page. Verification: git diff --check, compile check for full_text_search.py, and a direct assertion that the backend subset remains Zilliz Cloud / ElasticSearch / Vespa / TurboPuffer.

Comment thread vectordb_bench/backend/cases.py Outdated

NewIntFilterPerformanceCase = 400
CloudPayloadSearchCase = 500
FTSmsmarcoPerformance = 503

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6. 🟡 Medium — Public FTS case id is still MS MARCO-specific

Where: vectordb_bench/backend/cases.py:68

What: CaseType.FTSmsmarcoPerformance now represents the generic BM25 FTS workload. The dataset is already carried separately via CaseConfig.custom_case["dataset_with_size_type"], so run identity is not lost.

Why it matters: The public workload id still says MS MARCO even when the dataset is HotpotQA. That name leaks into CLI options, serialized results, and TestResult.read_file() compatibility handling, so new FTS datasets look like special cases under an MS MARCO id.

Fix:

  • Rename the public case id/name to a generic workload name, e.g. FTSBm25Performance, and use it in the new UI/CLI/result paths. The case is new, so no backward-compatible FTSmsmarcoPerformance alias is needed.
  • Move the FTS-only IR dataset types, translators, and manager into one new file, e.g. vectordb_bench/backend/fts_dataset.py (no package/directory split needed yet).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the public case-id naming part in 623977b. I renamed the case enum/class and tracked call sites from FTSmsmarcoPerformance to FTSBm25Performance while keeping enum value 503, so the workload name is no longer MS MARCO-specific and can represent HotpotQA or future BM25 FTS datasets via dataset_with_size_type.

I intentionally did not move the FTS dataset code into a separate fts_dataset.py in this patch. I think FTS should be handled more appropriately in the future as part of a cleaner fusion with the existing vector dataset/API path, not by adding a separate file split now. For example, the runner path already shares FTS and vector flow in the same runner functions where practical, so a file move alone does not create a meaningful architectural improvement and would add import churn to this PR. Verification: git diff --check, compile checks for touched modules, and a direct check that CaseType.FTSBm25Performance.value == 503, HotpotQA instantiation works, and CLI custom-case parsing accepts FTSBm25Performance.

@sre-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesgao-jpg
To complete the pull request process, please ask for approval from xuanyang-cn after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
James Gao added 5 commits June 25, 2026 02:51
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
James Gao added 6 commits June 25, 2026 04:25
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Signed-off-by: jamesgao-jpg <james.gao@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants