diff --git a/CHANGELOG.md b/CHANGELOG.md index a919bee..9f0fc00 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,10 @@ The format is based on Keep a Changelog, and this project adheres to Semantic Ve ## [Unreleased] +### Added + +- `lance_vector_search` now supports the `nprobs` and `refine_factor` parameters. + ## [0.4.0] - 2025-12-29 ### Added diff --git a/README.md b/README.md index 514a6e2..9d99fe2 100644 --- a/README.md +++ b/README.md @@ -142,18 +142,7 @@ FROM lance_vector_search('path/to/dataset.lance', 'vec', [0.1, 0.2, 0.3, 0.4]::F ORDER BY _distance ASC; ``` -- Signature: `lance_vector_search(uri, vector_column, query_vector, ...)` -- Positional arguments: - - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). - - `vector_column` (VARCHAR): Vector column name. - - `query_vector` (FLOAT[dim] or DOUBLE[dim], preferred): Query vector (must be non-empty; values are cast to float32). `FLOAT[]` / `DOUBLE[]` are also accepted. -- Named parameters: - - `k` (BIGINT, default `10`): Number of results to return. - - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. - - `use_index` (BOOLEAN, default `true`): If `true`, allow ANN index usage when available. - - `explain_verbose` (BOOLEAN, default `false`): Emit a more verbose Lance plan in `EXPLAIN` output. -- Output: - - Dataset columns plus `_distance` (smaller is closer). +See the SQL reference for full parameter documentation: [docs/sql.md#search](docs/sql.md#search). ### Full-text search (FTS) @@ -164,16 +153,7 @@ FROM lance_fts('path/to/dataset.lance', 'text', 'puppy', k = 10, prefilter = tru ORDER BY _score DESC; ``` -- Signature: `lance_fts(uri, text_column, query, ...)` -- Positional arguments: - - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). - - `text_column` (VARCHAR): Text column name. - - `query` (VARCHAR): Query string. -- Named parameters: - - `k` (BIGINT, default `10`): Number of results to return. 
- - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. -- Output: - - Dataset columns plus `_score` (larger is better). +See the SQL reference for full parameter documentation: [docs/sql.md#search](docs/sql.md#search). ### Hybrid search (vector + FTS) @@ -188,20 +168,7 @@ FROM lance_hybrid_search('path/to/dataset.lance', ORDER BY _hybrid_score DESC; ``` -- Signature: `lance_hybrid_search(uri, vector_column, query_vector, text_column, query, ...)` -- Positional arguments: - - `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). - - `vector_column` (VARCHAR): Vector column name. - - `query_vector` (FLOAT[dim] or DOUBLE[dim], preferred): Query vector (must be non-empty; values are cast to float32). `FLOAT[]` / `DOUBLE[]` are also accepted. - - `text_column` (VARCHAR): Text column name. - - `query` (VARCHAR): Query string. -- Named parameters: - - `k` (BIGINT, default `10`): Number of results to return. - - `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. - - `alpha` (FLOAT, default `0.5`): Vector/text mixing weight. - - `oversample_factor` (INTEGER, default `4`): Oversample factor for candidate generation (larger can improve recall at higher cost). -- Output: - - Dataset columns plus `_hybrid_score` (larger is better), `_distance`, and `_score`. +See the SQL reference for full parameter documentation: [docs/sql.md#search](docs/sql.md#search). 
## Contributing diff --git a/docs/sql.md b/docs/sql.md index 32710cf..6bb4a23 100644 --- a/docs/sql.md +++ b/docs/sql.md @@ -25,6 +25,116 @@ FROM 'path/to/dataset.lance' LIMIT 10; ``` +## Search + +### Vector search: `lance_vector_search` + +```sql +-- Search a vector column, returning distances in `_distance` (smaller is closer) +SELECT id, label, _distance +FROM lance_vector_search( + 'path/to/dataset.lance', + 'vec', + [0.1, 0.2, 0.3, 0.4]::FLOAT[4], + k = 5, + use_index = true, + nprobs = 4, + refine_factor = 2, + prefilter = true +) +ORDER BY _distance ASC; +``` + +Signature: `lance_vector_search(uri, vector_column, query_vector, ...)` + +Positional arguments: +- `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). +- `vector_column` (VARCHAR): Vector column name. +- `query_vector` (FLOAT[dim] or DOUBLE[dim], preferred): Query vector (must be non-empty; values are cast to float32). `FLOAT[]` / `DOUBLE[]` are also accepted. + +Named parameters: +- `k` (BIGINT, default `10`): Number of results to return. Must be > 0. +- `use_index` (BOOLEAN, default `true`): If `true`, allow ANN index usage when available. +- `nprobs` (BIGINT, optional): Number of IVF partitions to probe when using a vector index. Must be > 0. Only affects IVF-based vector indices. +- `refine_factor` (BIGINT, optional): Over-fetch factor for re-ranking using original vectors. Must be > 0. A value of `1` still enables re-ranking. +- `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. +- `explain_verbose` (BOOLEAN, default `false`): Emit a more verbose Lance plan in `EXPLAIN` output. + +Output: +- Dataset columns plus `_distance` (smaller is closer). + +Filter semantics: +- If `prefilter=false`, filter pushdown is best-effort. If pushdown fails, the query is retried without pushed filters and DuckDB applies filters for correctness. 
+- If `prefilter=true`, prefilterable filters must be pushed down, otherwise the query fails with an error. + +### Full-text search: `lance_fts` + +```sql +-- Search a text column, returning BM25-like scores in `_score` (larger is better) +SELECT id, text, _score +FROM lance_fts('path/to/dataset.lance', 'text', 'puppy', k = 10, prefilter = true) +ORDER BY _score DESC; +``` + +Signature: `lance_fts(uri, text_column, query, ...)` + +Positional arguments: +- `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). +- `text_column` (VARCHAR): Text column name. +- `query` (VARCHAR): Query string. + +Named parameters: +- `k` (BIGINT, default `10`): Number of results to return. Must be > 0. +- `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. + +Output: +- Dataset columns plus `_score` (larger is better). + +Filter semantics: +- If `prefilter=false`, filter pushdown is best-effort. If pushdown fails, the query is retried without pushed filters and DuckDB applies filters for correctness. +- If `prefilter=true`, prefilterable filters must be pushed down, otherwise the query fails with an error. + +### Hybrid search: `lance_hybrid_search` + +```sql +-- Combine vector and text scores, returning `_hybrid_score` (larger is better) +SELECT id, _hybrid_score, _distance, _score +FROM lance_hybrid_search( + 'path/to/dataset.lance', + 'vec', + [0.1, 0.2, 0.3, 0.4]::FLOAT[4], + 'text', + 'puppy', + k = 10, + prefilter = false, + alpha = 0.5, + oversample_factor = 4 +) +ORDER BY _hybrid_score DESC; +``` + +Signature: `lance_hybrid_search(uri, vector_column, query_vector, text_column, query, ...)` + +Positional arguments: +- `uri` (VARCHAR): Dataset root path or object store URI (e.g. `s3://...`). +- `vector_column` (VARCHAR): Vector column name. +- `query_vector` (FLOAT[dim] or DOUBLE[dim], preferred): Query vector (must be non-empty; values are cast to float32). `FLOAT[]` / `DOUBLE[]` are also accepted. 
+- `text_column` (VARCHAR): Text column name. +- `query` (VARCHAR): Query string. + +Named parameters: +- `k` (BIGINT, default `10`): Number of results to return. Must be > 0. +- `prefilter` (BOOLEAN, default `false`): If `true`, filters are applied before top-k selection. +- `alpha` (FLOAT, default `0.5`): Vector/text mixing weight. Larger values weigh vector similarity more heavily. +- `oversample_factor` (INTEGER, default `4`): Oversample factor for candidate generation. If provided, must be > 0. + +Output: +- Dataset columns plus `_hybrid_score` (larger is better), `_distance`, and `_score`. + +Filter semantics: +- If `prefilter=false`, filter pushdown is best-effort. If pushdown fails, the query is retried without pushed filters and DuckDB applies filters for correctness. +- If `prefilter=true`, prefilterable filters must be pushed down, otherwise the query fails with an error. + ## Namespaces Namespaces let you treat a directory (or a remote namespace service) as a database catalog and access datasets as tables. 
diff --git a/rust/ffi/knn.rs b/rust/ffi/knn.rs index 7cf593f..2b0bfa8 100644 --- a/rust/ffi/knn.rs +++ b/rust/ffi/knn.rs @@ -21,6 +21,8 @@ pub unsafe extern "C" fn lance_get_knn_schema( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, prefilter: u8, use_index: u8, ) -> *mut c_void { @@ -30,6 +32,8 @@ pub unsafe extern "C" fn lance_get_knn_schema( query_values, query_len, k, + nprobes, + refine_factor, prefilter, use_index, ) { @@ -50,6 +54,8 @@ fn get_knn_schema_inner( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, prefilter: u8, use_index: u8, ) -> FfiResult { @@ -71,6 +77,19 @@ fn get_knn_schema_inner( let query = Float32Array::from_iter_values(query_values.iter().copied()); scan.nearest(vector_column, &query, k_usize) .map_err(|err| FfiError::new(ErrorCode::KnnSchema, format!("knn schema nearest: {err}")))?; + if nprobes != 0 { + let nprobes_usize = nonzero_u64_to_usize(nprobes, "nprobes")?; + scan.nprobes(nprobes_usize); + } + if refine_factor != 0 { + let refine_factor_u32: u32 = refine_factor.try_into().map_err(|_| { + FfiError::new( + ErrorCode::InvalidArgument, + "refine_factor must fit in u32", + ) + })?; + scan.refine(refine_factor_u32); + } scan.use_index(use_index != 0); scan.disable_scoring_autoprojection(); scan.project(projection.as_ref()) @@ -90,6 +109,8 @@ pub unsafe extern "C" fn lance_create_knn_stream_ir( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, filter_ir: *const u8, filter_ir_len: usize, prefilter: u8, @@ -101,6 +122,8 @@ pub unsafe extern "C" fn lance_create_knn_stream_ir( query_values, query_len, k, + nprobes, + refine_factor, filter_ir, filter_ir_len, prefilter, @@ -124,6 +147,8 @@ fn create_knn_stream_ir_inner( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, filter_ir: *const u8, filter_ir_len: usize, prefilter: u8, @@ -163,6 +188,19 @@ fn create_knn_stream_ir_inner( 
format!("knn scan nearest: {err}"), ) })?; + if nprobes != 0 { + let nprobes_usize = nonzero_u64_to_usize(nprobes, "nprobes")?; + scan.nprobes(nprobes_usize); + } + if refine_factor != 0 { + let refine_factor_u32: u32 = refine_factor.try_into().map_err(|_| { + FfiError::new( + ErrorCode::InvalidArgument, + "refine_factor must fit in u32", + ) + })?; + scan.refine(refine_factor_u32); + } scan.use_index(use_index != 0); scan.disable_scoring_autoprojection(); scan.project(projection.as_ref()).map_err(|err| { @@ -189,6 +227,8 @@ pub unsafe extern "C" fn lance_explain_knn_scan_ir( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, filter_ir: *const u8, filter_ir_len: usize, prefilter: u8, @@ -201,6 +241,8 @@ pub unsafe extern "C" fn lance_explain_knn_scan_ir( query_values, query_len, k, + nprobes, + refine_factor, filter_ir, filter_ir_len, prefilter, @@ -225,6 +267,8 @@ fn explain_knn_scan_ir_inner( query_values: *const f32, query_len: usize, k: u64, + nprobes: u64, + refine_factor: u64, filter_ir: *const u8, filter_ir_len: usize, prefilter: u8, @@ -260,6 +304,19 @@ fn explain_knn_scan_ir_inner( let query = Float32Array::from_iter_values(query_values.iter().copied()); scan.nearest(vector_column, &query, k_usize) .map_err(|err| FfiError::new(ErrorCode::ExplainPlan, format!("knn scan nearest: {err}")))?; + if nprobes != 0 { + let nprobes_usize = nonzero_u64_to_usize(nprobes, "nprobes")?; + scan.nprobes(nprobes_usize); + } + if refine_factor != 0 { + let refine_factor_u32: u32 = refine_factor.try_into().map_err(|_| { + FfiError::new( + ErrorCode::InvalidArgument, + "refine_factor must fit in u32", + ) + })?; + scan.refine(refine_factor_u32); + } scan.use_index(use_index != 0); scan.disable_scoring_autoprojection(); scan.project(projection.as_ref()) diff --git a/src/include/lance_ffi.hpp b/src/include/lance_ffi.hpp index d604000..5fe92ad 100644 --- a/src/include/lance_ffi.hpp +++ b/src/include/lance_ffi.hpp @@ -174,16 +174,19 @@ const 
char *lance_explain_dataset_scan_ir(void *dataset, const char **columns, void *lance_get_knn_schema(void *dataset, const char *vector_column, const float *query_values, size_t query_len, - uint64_t k, uint8_t prefilter, uint8_t use_index); + uint64_t k, uint64_t nprobes, uint64_t refine_factor, + uint8_t prefilter, uint8_t use_index); void *lance_create_knn_stream_ir(void *dataset, const char *vector_column, const float *query_values, size_t query_len, - uint64_t k, const uint8_t *filter_ir, - size_t filter_ir_len, uint8_t prefilter, - uint8_t use_index); + uint64_t k, uint64_t nprobes, + uint64_t refine_factor, + const uint8_t *filter_ir, size_t filter_ir_len, + uint8_t prefilter, uint8_t use_index); const char *lance_explain_knn_scan_ir(void *dataset, const char *vector_column, const float *query_values, size_t query_len, uint64_t k, + uint64_t nprobes, uint64_t refine_factor, const uint8_t *filter_ir, size_t filter_ir_len, uint8_t prefilter, uint8_t use_index, uint8_t verbose); diff --git a/src/lance_search.cpp b/src/lance_search.cpp index 21e9b1e..92ead0a 100644 --- a/src/lance_search.cpp +++ b/src/lance_search.cpp @@ -35,6 +35,7 @@ namespace duckdb { static bool TryLanceExplainKnn(void *dataset, const string &vector_column, const vector<float> &query, uint64_t k, + uint64_t nprobes, uint64_t refine_factor, const string *filter_ir, bool prefilter, bool use_index, bool verbose, string &out_plan, string &out_error) { @@ -58,8 +59,9 @@ static bool TryLanceExplainKnn(void *dataset, const string &vector_column, } auto *plan_ptr = lance_explain_knn_scan_ir( - dataset, vector_column.c_str(), query.data(), query.size(), k, filter_ptr, - filter_len, prefilter ? 1 : 0, use_index ? 1 : 0, verbose ? 1 : 0); + dataset, vector_column.c_str(), query.data(), query.size(), k, nprobes, + refine_factor, filter_ptr, filter_len, prefilter ? 1 : 0, + use_index ? 1 : 0, verbose ? 
1 : 0); if (!plan_ptr) { out_error = LanceConsumeLastError(); if (out_error.empty()) { @@ -152,6 +154,8 @@ struct LanceKnnBindData : public TableFunctionData { string vector_column; vector<float> query; uint64_t k = 0; + uint64_t nprobes = 0; + uint64_t refine_factor = 0; bool prefilter = true; bool use_index = true; bool explain_verbose = false; @@ -320,6 +324,37 @@ LanceSearchVectorBind(ClientContext &context, TableFunctionBindInput &input, } result->k = NumericCast<uint64_t>(k_val); + bool has_nprobes = false; + int64_t nprobes_val = 0; + auto nprobes_named = input.named_parameters.find("nprobs"); + if (nprobes_named != input.named_parameters.end() && + !nprobes_named->second.IsNull()) { + has_nprobes = true; + nprobes_val = nprobes_named->second.DefaultCastAs(LogicalType::BIGINT) + .GetValue<int64_t>(); + } + if (has_nprobes && nprobes_val <= 0) { + throw InvalidInputException("lance_vector_search requires nprobs > 0"); + } + result->nprobes = has_nprobes ? NumericCast<uint64_t>(nprobes_val) : 0; + + bool has_refine_factor = false; + int64_t refine_factor_val = 0; + auto refine_factor_named = input.named_parameters.find("refine_factor"); + if (refine_factor_named != input.named_parameters.end() && + !refine_factor_named->second.IsNull()) { + has_refine_factor = true; + refine_factor_val = + refine_factor_named->second.DefaultCastAs(LogicalType::BIGINT) + .GetValue<int64_t>(); + } + if (has_refine_factor && refine_factor_val <= 0) { + throw InvalidInputException( + "lance_vector_search requires refine_factor > 0"); + } + result->refine_factor = + has_refine_factor ? 
NumericCast<uint64_t>(refine_factor_val) : 0; + auto prefilter_named = input.named_parameters.find("prefilter"); if (prefilter_named != input.named_parameters.end() && !prefilter_named->second.IsNull()) { @@ -343,8 +378,8 @@ LanceSearchVectorBind(ClientContext &context, TableFunctionBindInput &input, auto *schema_handle = lance_get_knn_schema( result->dataset, result->vector_column.c_str(), result->query.data(), - result->query.size(), result->k, result->prefilter ? 1 : 0, - result->use_index ? 1 : 0); + result->query.size(), result->k, result->nprobes, result->refine_factor, + result->prefilter ? 1 : 0, result->use_index ? 1 : 0); if (!schema_handle) { throw IOException("Failed to get Lance KNN schema: " + result->file_path + LanceFormatErrorSuffix()); @@ -445,8 +480,9 @@ LanceKnnLocalInit(ExecutionContext &context, TableFunctionInitInput &input, auto filter_ir_len = global.lance_filter_ir.size(); result->stream = lance_create_knn_stream_ir( bind_data.dataset, bind_data.vector_column.c_str(), - bind_data.query.data(), bind_data.query.size(), bind_data.k, filter_ir, - filter_ir_len, bind_data.prefilter ? 1 : 0, bind_data.use_index ? 1 : 0); + bind_data.query.data(), bind_data.query.size(), bind_data.k, + bind_data.nprobes, bind_data.refine_factor, filter_ir, filter_ir_len, + bind_data.prefilter ? 1 : 0, bind_data.use_index ? 1 : 0); if (!result->stream && filter_ir && !bind_data.prefilter) { // Best-effort: if filter pushdown failed, retry without it and rely on // DuckDB-side filter execution for correctness. @@ -455,7 +491,8 @@ LanceKnnLocalInit(ExecutionContext &context, TableFunctionInitInput &input, result->filter_pushed_down = false; result->stream = lance_create_knn_stream_ir( bind_data.dataset, bind_data.vector_column.c_str(), - bind_data.query.data(), bind_data.query.size(), bind_data.k, nullptr, 0, + bind_data.query.data(), bind_data.query.size(), bind_data.k, + bind_data.nprobes, bind_data.refine_factor, nullptr, 0, bind_data.prefilter ? 
1 : 0, bind_data.use_index ? 1 : 0); } if (!result->stream) { @@ -577,6 +614,8 @@ LanceKnnToString(TableFunctionToStringInput &input) { result["Lance Path"] = bind_data.file_path; result["Lance Vector Column"] = bind_data.vector_column; result["Lance K"] = to_string(bind_data.k); + result["Lance Nprobes"] = to_string(bind_data.nprobes); + result["Lance Refine Factor"] = to_string(bind_data.refine_factor); result["Lance Query Dim"] = to_string(bind_data.query.size()); result["Lance Prefilter"] = bind_data.prefilter ? "true" : "false"; result["Lance Use Index"] = bind_data.use_index ? "true" : "false"; @@ -594,11 +633,11 @@ LanceKnnToString(TableFunctionToStringInput &input) { string plan; string error; - if (TryLanceExplainKnn(bind_data.dataset, bind_data.vector_column, - bind_data.query, bind_data.k, - filter_ir_msg.empty() ? nullptr : &filter_ir_msg, - bind_data.prefilter, bind_data.use_index, - bind_data.explain_verbose, plan, error)) { + if (TryLanceExplainKnn( + bind_data.dataset, bind_data.vector_column, bind_data.query, + bind_data.k, bind_data.nprobes, bind_data.refine_factor, + filter_ir_msg.empty() ? nullptr : &filter_ir_msg, bind_data.prefilter, + bind_data.use_index, bind_data.explain_verbose, plan, error)) { result["Lance Plan (Bind)"] = plan; } else if (!error.empty()) { result["Lance Plan Error (Bind)"] = error; @@ -616,6 +655,8 @@ LanceKnnDynamicToString(TableFunctionDynamicToStringInput &input) { result["Lance Path"] = bind_data.file_path; result["Lance Vector Column"] = bind_data.vector_column; result["Lance K"] = to_string(bind_data.k); + result["Lance Nprobes"] = to_string(bind_data.nprobes); + result["Lance Refine Factor"] = to_string(bind_data.refine_factor); result["Lance Query Dim"] = to_string(bind_data.query.size()); result["Lance Prefilter"] = bind_data.prefilter ? "true" : "false"; result["Lance Use Index"] = bind_data.use_index ? 
"true" : "false"; @@ -640,13 +681,13 @@ LanceKnnDynamicToString(TableFunctionDynamicToStringInput &input) { if (!global_state.explain_computed.load()) { string plan; string error; - auto ok = TryLanceExplainKnn(bind_data.dataset, bind_data.vector_column, - bind_data.query, bind_data.k, - global_state.lance_filter_ir.empty() - ? nullptr - : &global_state.lance_filter_ir, - bind_data.prefilter, bind_data.use_index, - bind_data.explain_verbose, plan, error); + auto ok = TryLanceExplainKnn( + bind_data.dataset, bind_data.vector_column, bind_data.query, + bind_data.k, bind_data.nprobes, bind_data.refine_factor, + global_state.lance_filter_ir.empty() ? nullptr + : &global_state.lance_filter_ir, + bind_data.prefilter, bind_data.use_index, bind_data.explain_verbose, + plan, error); if (ok) { global_state.explain_plan = std::move(plan); } else { @@ -668,6 +709,8 @@ LanceKnnDynamicToString(TableFunctionDynamicToStringInput &input) { static void RegisterLanceVectorSearch(ExtensionLoader &loader) { auto configure = [](TableFunction &fun) { fun.named_parameters["k"] = LogicalType::BIGINT; + fun.named_parameters["nprobs"] = LogicalType::BIGINT; + fun.named_parameters["refine_factor"] = LogicalType::BIGINT; fun.named_parameters["prefilter"] = LogicalType::BOOLEAN; fun.named_parameters["use_index"] = LogicalType::BOOLEAN; fun.named_parameters["explain_verbose"] = LogicalType::BOOLEAN; diff --git a/test/sql/index_ddl.test b/test/sql/index_ddl.test index 3d5d226..9cc9728 100644 --- a/test/sql/index_ddl.test +++ b/test/sql/index_ddl.test @@ -61,11 +61,13 @@ FROM lance_vector_search( 0.41665298 ]::FLOAT[16], k = 1, + nprobs = 2, + refine_factor = 3, use_index = true, explain_verbose = true ); ---- -physical_plan :[\s\S]*ANNSubIndex: name=vec_idx[\s\S]* +physical_plan :[\s\S]*ANNSubIndex: name=vec_idx, k=3[\s\S]*minimum_nprobes=2[\s\S]*maximum_nprobes=Some\(2\)[\s\S]* query II EXPLAIN (FORMAT JSON) diff --git a/test/sql/search_functions.test b/test/sql/search_functions.test index 
f9af15b..9e3b11f 100644 --- a/test/sql/search_functions.test +++ b/test/sql/search_functions.test @@ -22,6 +22,18 @@ SELECT * FROM lance_vector_search('test/data/test_data.lance', 'vec', [1.0]::FLO ---- Invalid Input Error: lance_vector_search requires k > 0 +# Non-positive nprobs is rejected +statement error +SELECT * FROM lance_vector_search('test/data/test_data.lance', 'vec', [1.0]::FLOAT[1], nprobs = 0) +---- +Invalid Input Error: lance_vector_search requires nprobs > 0 + +# Non-positive refine_factor is rejected +statement error +SELECT * FROM lance_vector_search('test/data/test_data.lance', 'vec', [1.0]::FLOAT[1], refine_factor = 0) +---- +Invalid Input Error: lance_vector_search requires refine_factor > 0 + # Sanity: dataset is readable query I SELECT count(*) FROM 'test/data/search_test_data.lance' @@ -76,6 +88,113 @@ ORDER BY _distance 4 5 +# Vector search + refine_factor (flat): accepted and deterministic +query I +SELECT id +FROM lance_vector_search( + 'test/data/search_test_data.lance', + 'vec', + [0.0, 0.0, 0.0, 0.0]::FLOAT[4], + k = 3, + refine_factor = 2, + use_index = false +) +ORDER BY _distance +---- +1 +2 +3 + +# Vector search + nprobs (flat): accepted (even when not using an index) +query I +SELECT id +FROM lance_vector_search( + 'test/data/search_test_data.lance', + 'vec', + [0.0, 0.0, 0.0, 0.0]::FLOAT[4], + k = 3, + nprobs = 2, + use_index = false +) +ORDER BY _distance +---- +1 +2 +3 + +# Vector search (indexed): nprobs and refine_factor are reflected in EXPLAIN +statement ok +COPY (SELECT * FROM 'test/data/bigann_tiny/base.lance') +TO 'test/.tmp/search_knn_indexed.lance' (FORMAT lance, mode 'overwrite'); + +statement ok +CREATE INDEX vec_idx ON 'test/.tmp/search_knn_indexed.lance' (vec) +USING IVF_FLAT WITH (num_partitions=1, metric_type='l2'); + +query II +EXPLAIN (FORMAT JSON) +SELECT id +FROM lance_vector_search( + 'test/.tmp/search_knn_indexed.lance', + 'vec', + [ + 0.32245764, + 0.2814153, + 0.08346437, + 0.28065863, + 0.2566661, + 
-0.47035545, + -0.9935358, + 0.09578487, + 0.000604047, + -0.37332273, + 0.12592472, + -0.3458825, + -0.2536356, + -0.3673502, + -0.68120277, + 0.41665298 + ]::FLOAT[16], + k = 1, + nprobs = 1, + refine_factor = 2, + use_index = true, + explain_verbose = true +); +---- +physical_plan :[\s\S]*ANNSubIndex: name=vec_idx, k=2[\s\S]*minimum_nprobes=1[\s\S]*maximum_nprobes=Some\(1\)[\s\S]* + +query I +SELECT count(*) +FROM lance_vector_search( + 'test/.tmp/search_knn_indexed.lance', + 'vec', + [ + 0.32245764, + 0.2814153, + 0.08346437, + 0.28065863, + 0.2566661, + -0.47035545, + -0.9935358, + 0.09578487, + 0.000604047, + -0.37332273, + 0.12592472, + -0.3458825, + -0.2536356, + -0.3673502, + -0.68120277, + 0.41665298 + ]::FLOAT[16], + k = 1, + nprobs = 1, + refine_factor = 2, + use_index = true +); +---- +1 + # Vector search over ARRAY vectors written by DuckDB (preferred usage) statement ok COPY (