28 changes: 28 additions & 0 deletions python/python/lance/dataset.py
@@ -5189,6 +5189,34 @@ def analyze_plan(self) -> str:

return self._scanner.analyze_plan()

Author comment: Will need to update this to include both `max_split_size_bytes` and `max_row_count` options, with one taking precedence over the other if both are provided. I'm interested in whether people think this paradigm is useful. My intuition is that, since we estimate row sizes from the schema, we could be VERY wrong (we just use 64 B for anything without a known size; a string or blob could be anywhere from 1 B to 1 MB+). In these scenarios a user will know their data better and can use `max_row_count` to target a partition size. So hopefully the estimation works well for most use cases, and there are knobs to fine-tune in the others.

    def plan_splits(
        self, max_split_size_bytes: Optional[int] = None
    ) -> List[List["FragmentMetadata"]]:
"""Plan splits for distributed scanning.

This method analyzes the scanner's filter and uses indices to determine
which fragments need to be scanned and approximately how many rows each
fragment will return. It then groups fragments into splits that can be
processed independently.

The scanner estimates the size of each row based on the output schema
projection and uses that to determine how many rows fit within the
target split size.

Parameters
----------
max_split_size_bytes : int, optional
The target maximum size in bytes for each split. Defaults to 128MB.

Returns
-------
List[List[FragmentMetadata]]
A list of splits, where each split is a list of FragmentMetadata objects.
Each split can be processed independently for distributed scanning.
"""

return self._scanner.plan_splits(max_split_size_bytes=max_split_size_bytes)
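To make the estimation-plus-packing behavior described in the docstring concrete, here is a minimal self-contained sketch in plain Python. The field widths, the 64-byte fallback for variable-size types, and every name below are illustrative assumptions, not the actual Lance implementation:

```python
# Hypothetical sketch: estimate bytes-per-row from a schema, then greedily
# pack fragments into splits no larger than max_split_size_bytes.
FIXED_WIDTHS = {"int32": 4, "int64": 8, "float32": 4, "float64": 8, "bool": 1}
FALLBACK_WIDTH = 64  # guess for variable-size types (string, blob, ...)


def estimate_row_bytes(schema: dict) -> int:
    """Sum per-field widths, falling back to 64 B for unknown-size types."""
    return sum(FIXED_WIDTHS.get(t, FALLBACK_WIDTH) for t in schema.values())


def plan_splits(fragments, schema, max_split_size_bytes=128 * 1024 * 1024):
    """Greedy first-fit packing: start a new split when the next fragment
    would push the current split past the byte budget.

    `fragments` is a list of (fragment_id, estimated_row_count) pairs.
    """
    row_bytes = estimate_row_bytes(schema)
    splits, current, current_bytes = [], [], 0
    for frag_id, row_count in fragments:
        frag_bytes = row_count * row_bytes
        # Close the current split if adding this fragment would overflow it.
        if current and current_bytes + frag_bytes > max_split_size_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(frag_id)
        current_bytes += frag_bytes
    if current:
        splits.append(current)
    return splits
```

With `{"id": "int64", "title": "string"}` the estimate is 8 + 64 = 72 B/row, so two fragments of one million rows each (~72 MB apiece) land in separate splits under the default 128 MB budget. This also illustrates the author comment above: if `title` values are actually kilobytes each, the estimate is far too low, which is where a row-count cap would help.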


class DatasetOptimizer:
def __init__(self, dataset: LanceDataset):
28 changes: 27 additions & 1 deletion python/src/scanner.rs
@@ -20,7 +20,7 @@ use std::sync::Arc;

use arrow::pyarrow::*;
use arrow_array::RecordBatchReader;
use lance::dataset::scanner::ExecutionSummaryCounts;
use lance::dataset::scanner::{ExecutionSummaryCounts, SplitPackStrategy};
use pyo3::prelude::*;
use pyo3::pyclass;

@@ -30,6 +30,7 @@ use pyo3::exceptions::PyValueError;
use crate::reader::LanceReader;
use crate::rt;
use crate::schema::logical_arrow_schema;
use crate::utils::PyLance;

/// This will be wrapped by a python class to provide
/// additional functionality
@@ -150,4 +151,29 @@ impl Scanner {

Ok(PyArrowType(Box::new(reader)))
}

#[pyo3(signature = (max_split_size_bytes=None))]
fn plan_splits<'py>(
    self_: PyRef<'py, Self>,
    max_split_size_bytes: Option<usize>,
) -> PyResult<Vec<Vec<Bound<'py, PyAny>>>> {
    let scanner = self_.scanner.clone();
    let strategy = max_split_size_bytes.map(SplitPackStrategy::MaxSizeBytes);
    // Run the async planning on the background runtime, then surface any
    // Lance error as a Python ValueError.
    let splits = rt()
        .spawn(Some(self_.py()), async move {
            scanner.plan_splits(strategy).await
        })?
        .map_err(|err| PyValueError::new_err(err.to_string()))?;

    // Convert each split's Rust fragments into Python objects.
    splits
        .into_iter()
        .map(|split| {
            split
                .fragments
                .into_iter()
                .map(|sf| PyLance(sf.fragment).into_pyobject(self_.py()))
                .collect::<Result<Vec<_>, _>>()
        })
        .collect::<Result<Vec<_>, _>>()
}
}
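On the consuming side, each split returned by `plan_splits` can be scanned by an independent worker. A minimal sketch of how a caller might fan splits out across a pool (plain Python; the round-robin policy and all names are illustrative, not part of the Lance API):

```python
# Hypothetical fan-out: split i goes to worker i % num_workers. Splits are
# represented here as plain lists of fragment ids rather than Lance objects.
def assign_splits(splits, num_workers):
    """Round-robin assignment of splits to a fixed worker pool."""
    assignment = {w: [] for w in range(num_workers)}
    for i, split in enumerate(splits):
        assignment[i % num_workers].append(split)
    return assignment
```

Because each split is self-contained, any scheduling policy works here; round-robin is just the simplest choice when splits are roughly equal in size, which is what the byte-budget packing aims for.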
4 changes: 4 additions & 0 deletions rust/lance-core/src/utils/mask.rs
@@ -657,6 +657,10 @@ impl RowAddrTreeMap {
}),
})
}

/// Returns the ids of all fragments covered by this map (the tree map's keys).
pub fn fragments(&self) -> Vec<u32> {
    self.inner.keys().cloned().collect()
}
}

impl std::ops::BitOr<Self> for RowAddrTreeMap {
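The `fragments()` accessor added above exposes the distinct fragment ids covered by a set of row addresses, which is what split planning needs from the index results. Assuming Lance's u64 row-address layout (fragment id in the upper 32 bits, row offset in the lower 32 — an assumption stated here, not shown in this diff), the equivalent in plain Python is:

```python
# Hypothetical stand-in for RowAddrTreeMap::fragments(): given packed u64
# row addresses, recover the distinct fragment ids from the upper 32 bits.
def fragment_ids(row_addrs):
    """Distinct fragment ids in sorted order (tree map keys are sorted)."""
    return sorted({addr >> 32 for addr in row_addrs})
```

Sorted, deduplicated output mirrors a BTreeMap's key iteration order.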