Introduce Partition Table Spec in Lance Namespace #272

wojiaodoubao · 2025-12-04T05:51:27Z

wojiaodoubao
Dec 4, 2025
Collaborator

There was a partitioning discussion in the Lance community at: lance-format/lance#4125. The community is not yet ready to introduce partitioning at the Lance Dataset level.
In big data scenarios, physical partitioning is a widely adopted approach to organize data conveniently and efficiently. Below are two common use cases:

Rapid Dataset Cleanup
Lance uses a mark-and-sweep garbage collection mechanism, which does not immediately reclaim storage space when data is deleted. For users leveraging Lance (e.g., for model training), when they explicitly confirm that certain data is no longer needed and require immediate storage release, a physical partitioning-based solution would be far more effective.
Large-Scale Data Processing
A user needs to maintain a massive Lance table with 10 trillion records, performing hourly updates that affect approximately 1 billion records each time. Without physical partitioning, updates would have to operate on the entire 10-trillion-record table. While data scans can be accelerated via indexes, the overhead of scanning data and maintaining indexes remains prohibitively high. Physical partitioning offers a simpler, more efficient alternative.

Here is my proposal: introduce a partitioned table specification under the Lance namespace. The specification is defined as follows:

A partitioned table is represented by a Namespace object.
A partitioned table supports multi-level partitioning:
- Each intermediate partition corresponds to a Namespace object, whose name is the partition name.
- Each leaf partition corresponds to a Table object, whose name is the partition name.
All Tables under a partitioned table share the identical schema, which also serves as the schema of the entire partitioned table. The schema must include all partition columns.
The following metadata properties are added to the partitioned table’s Namespace:
- Partition flag: lance.partition-table=true
- Partition column information: lance.partition-columns=[{col_name, partition_type}, {col_name, partition_type},..., {col_name, partition_type}]
- Schema: lance.partition-schema, the Arrow schema serialized as JSON
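
As a rough illustration, the properties above might look like the following for a table partitioned by hour and bucket. The property keys come from the proposal. The column names, the partition-type strings, and the JSON layout of the values are assumptions for this sketch, and the schema value is only a placeholder for a real serialized Arrow schema.

```python
import json

# Hypothetical properties for the root Namespace of a partitioned table.
# Keys follow the proposal; values and their JSON layout are assumptions.
partition_columns = [
    {"col_name": "hour", "partition_type": "identity"},
    {"col_name": "bucket", "partition_type": "bucket"},
]

namespace_properties = {
    "lance.partition-table": "true",
    "lance.partition-columns": json.dumps(partition_columns),
    # Placeholder only: a real value would be the JSON-serialized Arrow schema.
    "lance.partition-schema": json.dumps({"fields": ["hour", "bucket", "url", "value"]}),
}


def is_partitioned_table(properties: dict) -> bool:
    """Return True if the namespace properties carry the partition flag."""
    return properties.get("lance.partition-table", "false").lower() == "true"


def partition_column_names(properties: dict) -> list:
    """Return partition column names in partition-level order."""
    return [c["col_name"] for c in json.loads(properties["lance.partition-columns"])]
```

An engine that does not recognize these keys would simply see an ordinary namespace with extra properties.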

A partitioned table is a specification rather than an enforcement constraint:

Compute engines are responsible for parsing partitioned tables according to this specification. For example, we can modify the implementation of the lance-spark connector to add support for partitioned tables.
Upper-layer engines or SDKs may also choose to ignore the partitioned table specification and access the underlying Tables by treating it as an ordinary namespace hierarchy.

wojiaodoubao · 2025-12-04T06:51:34Z

wojiaodoubao
Dec 4, 2025
Collaborator Author

Hi @jackye1995 @yanghua @majin1102 @fangbo @zhangyue19921010, please share your thoughts, thanks very much!

0 replies

majin1102 · 2025-12-04T17:37:13Z

majin1102
Dec 4, 2025
Collaborator

Thanks for starting this discussion in the lance-namespace module! I think this is a valuable initiative.

When we talk about "partition," it’s necessary to distinguish two conceptual layers:

Lance-level partitioning: Also known as the Lance partition spec.
Engine-level partitioning: Like Spark’s native partitioning.

Our prior discussions have mostly centered on Lance-level partitioning, but to fully address the problem, we need to implement engine-level partitioning.

Here are two approaches we can explore:

1. Leverage namespace multi-table transactions (e.g., Directory Namespace v2)

At the engine layer, we could encapsulate multiple datasets under a namespace into a PartitionedTable. This approach wouldn’t require the namespace to add new functionality beyond its current capabilities: we’d just use the existing namespace spec and define specific attributes within namespaces.

This seems like a clear path because:

No spec modifications are needed (only engine-layer adaptation).
Different engines can independently handle their own adaptations.
We could focus on the scenarios that need partitions instead of adding partition support to every API, including take and blob.

If Lance later introduces a partition spec, migrating away from this approach would be straightforward.

2. Add a partition spec at the namespace layer

As you noted, I personally don’t recommend this. Equating a namespace directly to a Lance table would mean the namespace must dynamically generate a manifest for a large dataset by aggregating smaller ones during metadata loading. This introduces inherent conflicts: for example, row_ids and fragment_ids might be duplicated. Committing changes could also demand specialized interfaces.

Structurally, the namespace layer sits above tables, while a partition spec belongs inside a table. Merging these layers doesn’t align with their design intent in my view.

Perhaps we could outline a specific scenario and brainstorm with it? That might help ground the discussion.

0 replies

wojiaodoubao · 2025-12-05T15:39:19Z

wojiaodoubao
Dec 5, 2025
Collaborator Author

Hi @majin1102, thanks for your nice suggestion! My use case is as follows:

I have two Lance Tables:

A detailed table with two-level partitioning (hourly partition + bucket partition);
An aggregated table with two-level partitioning (bucket partition + hash partition).

My requirements are:

Enable support for Spark CREATE TABLE and DESCRIBE TABLE statements for both tables;
Insert data hourly into the detailed table using Spark SQL’s INSERT INTO statement;
Hourly execute a GROUP BY operation on the incremental data newly inserted into the detailed table, then merge/upsert the aggregated results into the aggregated table using Spark SQL’s MERGE INTO statement.

I think we can validate the case by integrating it with Spark.

1. DDL
All metadata operations for the table are fully built on LanceNamespace, taking three scenarios for example:

Create Table: Create a LanceNamespace and configure its metadata properties.
Describe Table: First attempt to resolve it as an unpartitioned table; if that returns a 404, retry resolving it as a partitioned table.
Delete Table: First attempt to drop it via the dropTable API; if that returns a 404, verify whether it is a partitioned table. For partitioned tables, perform post-order deletion of all child tables and child namespaces, then clean up the root of the partitioned table.
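
The describe fallback and the post-order deletion could be sketched as follows. `InMemoryNamespace`, `NotFound`, and the tuple-based identifiers are hypothetical stand-ins invented for this example and are not the LanceNamespace API. For brevity, a single recursive `drop` handles both plain tables and partitioned tables.

```python
class NotFound(Exception):
    """Stands in for a 404 response from the namespace service."""


class InMemoryNamespace:
    """Toy stand-in for a namespace service (hypothetical, not LanceNamespace)."""

    def __init__(self):
        self.tables = set()     # ids of Table objects, e.g. ("t", "h=1", "b=0")
        self.namespaces = {}    # namespace id -> properties

    def describe_table(self, ident):
        if ident not in self.tables:
            raise NotFound(ident)
        return {"kind": "table"}

    def describe_namespace(self, ident):
        if ident not in self.namespaces:
            raise NotFound(ident)
        return self.namespaces[ident]

    def children(self, ident):
        """Return the direct children (tables and namespaces) of `ident`."""
        n = len(ident) + 1
        everything = self.tables | set(self.namespaces)
        return sorted(c for c in everything if len(c) == n and c[:-1] == ident)


def describe(ns, ident):
    """Try an unpartitioned table first, then fall back to a partitioned table."""
    try:
        return ns.describe_table(ident)
    except NotFound:
        props = ns.describe_namespace(ident)  # raises NotFound if neither exists
        if props.get("lance.partition-table") != "true":
            raise NotFound(ident)
        return {"kind": "partitioned_table", "properties": props}


def drop(ns, ident, deleted):
    """Post-order deletion: remove all children first, then the node itself."""
    for child in ns.children(ident):
        drop(ns, child, deleted)
    if ident in ns.tables:
        ns.tables.remove(ident)
    else:
        del ns.namespaces[ident]
    deleted.append(ident)
```

Deleting children before their parent means a failure partway through leaves a smaller but still well-formed hierarchy, so the drop can be retried.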

2. Query/Scan(Partition Pruning)
In Spark’s ScanBuilder, perform partition pruning based on the pushed-down filters to filter child tables. Finally, construct InputPartitions only from the pruned remaining tables to avoid full-table scanning.
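
A minimal sketch of the pruning step, assuming only equality filters are pushed down and that each leaf table can be identified by its tuple of partition values. The function name and data shapes are hypothetical; a real connector would work with Spark's filter objects.

```python
def prune_partitions(leaf_tables, partition_columns, equality_filters):
    """Keep only leaf tables whose partition values satisfy every pushed-down filter.

    leaf_tables: list of tuples, one partition value per partition level.
    partition_columns: column names in partition-level order.
    equality_filters: dict of column -> required value (`col = value` filters only).
    """
    kept = []
    for values in leaf_tables:
        row = dict(zip(partition_columns, values))
        # Filters on non-partition columns cannot prune, so they are ignored here.
        if all(row[c] == v for c, v in equality_filters.items() if c in row):
            kept.append(values)
    return kept


leaves = [("2025-12-04T05", 0), ("2025-12-04T05", 1), ("2025-12-04T06", 0)]
kept = prune_partitions(
    leaves, ["hour", "bucket"], {"hour": "2025-12-04T05", "url": "x"}
)
```

Only the tables in `kept` would then contribute fragments to the InputPartitions.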

3. Insert Data to New Partitions
For each row of data to be written:

Extract the values of the partition columns.
Dynamically construct the path of the target leaf partition based on these values.
Check whether the leaf partition (a Lance Table) corresponding to this path exists.
If it does not exist, the write logic will automatically create it (and any missing intermediate namespaces) via LanceNamespace.
Append the row of data to the correct leaf Lance Table.
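
The routing steps above can be sketched as follows. The Hive-style `col=value` path segments and the dict standing in for leaf Lance Tables are assumptions for this example. Grouping rows so that each leaf receives a single append is a choice made in this sketch; the steps above do not specify it.

```python
from collections import defaultdict


def partition_path(row, partition_columns):
    """Build the leaf path from the row's partition values, e.g. ('hour=5', 'bucket=1')."""
    return tuple(f"{col}={row[col]}" for col in partition_columns)


def route_rows(rows, partition_columns):
    """Group rows by their target leaf partition."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_path(row, partition_columns)].append(row)
    return dict(groups)


def write(rows, partition_columns, tables):
    """Append each group to its leaf table, creating the leaf when it is missing.

    `tables` is a plain dict standing in for leaf Lance Tables.
    Returns the paths of the leaves that had to be created.
    """
    created = []
    for path, group in route_rows(rows, partition_columns).items():
        if path not in tables:
            tables[path] = []   # stands in for creating the table and any parent namespaces
            created.append(path)
        tables[path].extend(group)  # one append per leaf, not one per row
    return created
```

Each leaf append is still an independent commit, so a batch that spans several leaves is not atomic.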

0 replies

wjones127 · 2025-12-05T20:01:38Z

wjones127
Dec 5, 2025
Maintainer

Example / benchmarks?

Rapid Dataset Cleanup
Lance uses a mark-and-sweep garbage collection mechanism, which does not immediately reclaim storage space when data is deleted. For users leveraging Lance (e.g., for model training), when they explicitly confirm that certain data is no longer needed and require immediate storage release, a physical partitioning-based solution would be far more effective.

Large-Scale Data Processing
A user needs to maintain a massive Lance table with 10 trillion records, performing hourly updates that affect approximately 1 billion records each time. Without physical partitioning, updates would have to operate on the entire 10-trillion-record table. While data scans can be accelerated via indexes, the overhead of scanning data and maintaining indexes remains prohibitively high. Physical partitioning offers a simpler, more efficient alternative.

For your two use cases, could you describe in detail what the improvement would be? What kinds of bottlenecks do you see now? And how would this proposal improve them?

It might even be nice to develop a benchmark or two. That way we can show how this proposal once completed actually improves things. The problem I've had with the partition discussion thus far is the benefits are often abstract.

1 reply

wojiaodoubao Dec 7, 2025
Collaborator Author

Rapid Dataset Cleanup
This is detailed at Introduce Partition Table Spec in Lance Namespace #272 (reply in thread).
Large-Scale Data Processing

Hourly execute a GROUP BY operation on the incremental data newly inserted into the detailed table, then merge/upsert the aggregated results into the aggregated table using Spark SQL’s MERGE INTO statement.

I haven’t yet run benchmarks to compare the "single large table + index" approach and the "physical partitioning-based approach". However, based on a high-level walkthrough of their execution flows, I believe there will be performance differences between the two.
First, during the MERGE INTO operation from the detailed table to the aggregated table, I will split the job into multiple independent subtasks by bucket and submit them separately, because a single large job would have too large a blast radius if it failed. To simplify the table schema, let’s assume the table has only 3 columns: url, bucket, and value, with indexes built on both url and bucket. I will submit one task per bucket to update the value column via MERGE INTO.

In the physical partitioning scenario:
- For each run, I only need to process the single Lance Dataset corresponding to the bucket, so the input data is limited to the fragments of that specific dataset.
- For each fragment, I will read the index on the url column—and this index is scoped only to the current dataset.
In contrast, in the single large table scenario:
- For each run, I need to read all fragments of the large table.
- In each task, the indexes on both url and bucket columns are read, and both indexes are global to the entire table.

From this perspective, the single large table approach would incur more overhead in terms of the number of fragments (and thus the number of tasks) and the size of the indexes being read. However, the exact magnitude of this overhead can only be determined via formal benchmarks.

Another reason is that I have no experience maintaining 10-trillion-row tables, nor do I have the resources to validate a 10-trillion-row table implementation—so I can’t confirm its feasibility. However, I have extensive experience working with 1-billion-row tables, and the approach of splitting into smaller tables is far more controllable for me.

wjones127 · 2025-12-05T20:03:14Z

wjones127
Dec 5, 2025
Maintainer

Atomicity

Insert Data to New Partitions
For each row of data to be written:

Extract the values of the partition columns.
Dynamically construct the path of the target leaf partition based on these values.
Check whether the leaf partition (a Lance Table) corresponding to this path exists.
If it does not exist, the write logic will automatically create it (and any missing intermediate namespaces) via LanceNamespace.
Append the row of data to the correct leaf Lance Table.

In this description, is each row added in a separate transaction?

Overall, if you have multiple Lance tables, it seems like there wouldn't be a way to atomically write to multiple partitions? Is that correct? Is that a limitation we think is acceptable?

1 reply

wojiaodoubao Dec 7, 2025
Collaborator Author

No, there won't be any atomic writes or transactional semantics. My initial idea was to implement simple, directory/dataset-based non-ACID partitioning, similar to the mechanism used in legacy Hive, which is sufficient to cover simple use cases.

wjones127 · 2025-12-05T20:06:45Z

wjones127
Dec 5, 2025
Maintainer

Use case

I'm open to having some sort of "partitioning" at the namespace level. The use case I had in mind isn't the kind of time-based partitioning I think you are describing, that's common with Iceberg and Spark Tables. IMO that seems more like clustering to me.

But we have other use cases where for business reasons we don't want to co-mingle data, which is multi-tenancy. Often users have a table where they store vectors and data for multiple of their users. When they query, they always query data for just one user. But for managing the table (creating indices, schema evolution) they'd like to manage the overall table as a single object. Right now they'd be forced to manage these "partitions" as multiple tables.

This sort of "partitioning" corresponds to the "namespace" concept in vector databases like Pinecone or Turbopuffer.

10 replies

majin1102 Dec 10, 2025
Collaborator

I think these are different in their query patterns: "business partitioning" is for cases where most of your queries hit a single partition, while "traditional" or maybe "within-table" partitioning or clustering is appropriate when queries aren't typically trying to hit just one partition.

The handling of secondary indices would also be different. Secondary indices are created per physical table. So in "business partitioning" each index just covers a single partition. While in "traditional" they cover the whole table.

I wonder if we could see this "business partitioning" as a special case of within-table partitions. As I see it, the key difference is that "an index covers all traditional partitions but only one business partition". Could you elaborate more on this? In my view, one index per traditional partition is an approach worth considering.

For the scanning part, the business partition pattern is a subset of the traditional partition pattern, right?

wojiaodoubao Dec 10, 2025
Collaborator Author

To scan a whole table, the "business partitioning" scheme would involve unioning all of the partitions, which might overcomplicate the plan in many engines. It's probably an anti-pattern.

Hi @wjones127 , I think for most compute engines, the Fragment is the fundamental unit of data processing—Spark is a typical example. We do not need to union all partitions; instead, we only need to support Fragments from different datasets as InputPartitions—a feature already supported in the current spark-lance connector. Leave the rest to the compute engine.

Spark also supports filter pushdown, which enables partition pruning. There is no need to explicitly specify the partition in Spark SQL queries; including the partition column in the filter conditions should be enough.

wjones127 Dec 10, 2025
Maintainer

I wonder if we could see this "business partitioning" as a special case of within-table partitions.

The difference is basically, is it one dataset per partition or multiple partitions all in one dataset. So table-per-partition ("business partitions") vs within-table partitions ("traditional partitioning") are better ways to name it.

One difference I've mentioned is the indexes. I don't think it's unreasonable to have an index-per-partition, it will just be slower in many cases than being able to combine them if you are trying to search across the entire logical table.

The other difference is the manifests: each dataset has its own sequence of manifests. This has benefits for each side:

On table-per-partition: writes can go independently to each manifest. This gets around the transaction commit bottlenecks when there are many small concurrent writes.
Within-table partitions: making writes atomic is easier and the implementation is faster.

I think for most compute engines, the Fragment is the fundamental unit of data processing—Spark is a typical example. We do not need to union all partitions; instead, we only need to support Fragments from different datasets as InputPartitions—a feature already supported in the current spark-lance connector. Leave the rest to the compute engine.

Sure, that works.

westonpace Dec 10, 2025
Maintainer

Repeating the other differences (and giving them numbers and adding a few more):

With table-per-partition each individual table could have its own config (in addition to potentially sharing a partition-wide config). For example, users might want different storage config (bucket, credentials) for different tables. This would be difficult to do in the within-table partitions.
As mentioned earlier, scanning across partitions is a "union all tables" in table-per-partition. However, it isn't clear to me that this would be inherently any worse than "union all fragments" which we'd have for the within-table partitions.
Indexing, as mentioned above (just wanted to give it a number). In table-per-partition each table has its own index. Scans across the entire partition would require multiple index searches. However, index updates are very isolated: only the partition that is changed needs to update its index. With within-table partitions there is one index. Changing a single partition might mean the entire index needs to be updated (some indexes like zonemap could avoid this, but for indexes like IVF/PQ, which have a component that is not aligned with the data [the IVF centroids], this becomes a problem).

majin1102 Dec 11, 2025
Collaborator

Thanks for the details.

I think table-per-partition and within-table partitions are much clearer to understand. I agree table-per-partition is a better way for Lance.

I’d also like to hear your thoughts on two remaining issues @wjones127 @westonpace .

Atomicity across partitions, as mentioned above. We could work around it by scheduling one job per partition, but this is unfriendly.
Would it be worthwhile to abstract the “namespace-as-one-table” model for engines? Its functionality may be limited, but it could be reusable. Also, from the perspective of engines like Spark, is there any fundamental difference between “table-per-partition” and “within-table partitioning” that we should be aware of (I mean a difference we intend to introduce)?

majin1102 · 2025-12-09T08:52:13Z

majin1102
Dec 9, 2025
Collaborator

If our focus is mainly on the use case of "business partitioning", then having writes to multiple partitions not transactional seems like an okay limitation.

As I understand it, whether the partitioning is multi-tenant or time-based, we need to expose the partition fields to the business layer and let it use them. IMO all partitions are business partitions. Do we have a specific scenario that uses partitions as non-business partitions? @jackye1995

I am thinking if we can also use the "business partitioning" design on the mentioned "traditional partitioning"

Could you elaborate more on the difference between "business partitioning" and "traditional partitioning" in your mind? I think the boundary is unclear. @yanghua

But if we seek for more traditional partitioning use cases (e.g. time based like the example above) then this is probably a bad experience, but again I think the original thread's general consensus was that we should do clustering for those use cases.

+1
IMO atomicity across partitions is a basic capability. If we have a Lance partitioned table and a write can land only partially (for example, writing 10 partitions where 5 succeed and 5 fail), we would have to demand that the computation or transformation provide idempotence, for example by joining the source and target to form the source, or by just using overwrite. I think that is quite a large intrusion into the business layer.

No, there won't be any atomic writes or transactional semantics. My initial idea was to implement simple, directory/dataset-based non-ACID partitioning, similar to the mechanism used in legacy Hive, which is sufficient to cover simple use cases.

I'm not sure non-transactional partitions could meet your case. You mentioned INSERT INTO, which doesn't have idempotent semantics. We may introduce inconsistency if we encounter failures. We usually use INSERT OVERWRITE in Hive. @wojiaodoubao

Overall I think we had better provide atomic transactions. I don't think it's good to offer partitioning that is comparable to old Hive rather than to other table formats.

I have been going back and forth myself on this topic. Technically we could also record the latest version table manifest in the namespace manifest, and always resolve the latest version in that way instead of doing directory listing on the _versions directory.

This is exactly the solution I had in mind.

Provide multi-table transaction capability by namespace manifest as the basis
Provide a cleanup function to remove data that was committed to a partition but not committed to the namespace. This could be used as a commit hook.
Provide some kind of encapsulation for namespace-as-dataset and dataset-as-partition (where to put it is an open issue).
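
One possible shape for the first two points, as a toy in-memory sketch. `NamespaceManifest`, its fields, and `orphaned_versions` are hypothetical names invented here, and the optimistic version check is an assumed concurrency scheme rather than anything the namespace spec defines.

```python
class NamespaceManifest:
    """Toy namespace manifest mapping each table to its latest published version."""

    def __init__(self):
        self.version = 0
        self.table_versions = {}   # table name -> version visible to readers

    def commit(self, expected_version, updates):
        """Publish several table versions in one step (optimistic concurrency)."""
        if expected_version != self.version:
            raise RuntimeError("concurrent commit detected; retry")
        # Build the new mapping first, then swap it in as a single assignment.
        self.table_versions = {**self.table_versions, **updates}
        self.version += 1


def orphaned_versions(manifest, written):
    """Versions written to a partition but never published in the namespace manifest."""
    return {
        table: v
        for table, v in written.items()
        if manifest.table_versions.get(table, 0) < v
    }
```

Readers that resolve table versions only through the manifest would never observe a partially published multi-partition write, and the orphan check identifies what a cleanup hook could remove.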

What do you think? @wojiaodoubao @jackye1995

6 replies

wojiaodoubao Dec 9, 2025
Collaborator Author

This is exactly the solution I had in mind.

Provide multi-table transaction capability by namespace manifest as the basis
Provide a cleanup function to remove data that was committed to a partition but not committed to the namespace. This could be used as a commit hook.
Provide some kind of encapsulation for namespace-as-dataset and dataset-as-partition (where to put it is an open issue).

This is a very good and smart idea! I might have been over-concerned about the problem. My primary concern is complexity. IMO, the current solution’s best point is its simplicity: it is nothing more than a specification, with implementations restricted to connectors such as lance-spark and lance-ray. The underlying logic is this: address 60% to 80% of use cases with a straightforward partitioning scheme. Users with more complex partitioning requirements should turn to a liquid clustering-based solution.

If we introduce transactions at the lance-namespace layer, will this steer us toward a complex partitioning spec? Shall we implement the following for partitioned tables: branches, tags, add-column, table-level statistics, indexes etc — this level of complexity would be tantamount to building an entirely new table format on top of the existing lance table format.

Perhaps we can draw clear boundaries: what can be done within the lance-namespace and what will never be done.

I'm not sure non-transactional partitions could meet your case. You mentioned INSERT INTO, which doesn't have idempotent semantics. We may introduce inconsistency if we encounter failures. We usually use INSERT OVERWRITE in Hive.

When writing data, users write it by partition, submitting one insert task per partition. Compared to manually sharding tables, partitioned tables have two key advantages: 1) No need to be concerned with partitions during querying; 2) No need to manually implement table sharding logic or manage path mappings for partitioned tables.

majin1102 Dec 9, 2025
Collaborator

Could you elaborate more on the difference between "business partitioning" and "traditional partitioning" in your mind?

@majin1102 The difference between them is whether writes to multiple partitions must follow transactional semantics. The former doesn't need to; the latter must.

Hmm... it feels strange to say that Hive partitions are business partitions and Iceberg partitions are traditional.
Anyway, I get your point.

majin1102 Dec 9, 2025
Collaborator

When writing data, users write it by partition, submitting one insert task per partition. Compared to manually sharding tables, partitioned tables have two key advantages: 1) No need to be concerned with partitions during querying; 2) No need to manually implement table sharding logic or manage path mappings for partitioned tables.

I think the choice between INSERT OVERWRITE and INSERT INTO is important to your case: the former provides idempotence, so a retry can still reach a consistent state. The latter can't.
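
A small simulation of the difference, with Python lists standing in for a single partition. The function names are hypothetical, and the failure model (the first attempt writes its data but is reported as failed, so the job is retried) is an assumption chosen to show the worst case for append semantics.

```python
def insert_into(table, rows):
    """Append semantics: every call adds the rows again."""
    table.extend(rows)


def insert_overwrite(table, rows):
    """Overwrite semantics: the partition ends up containing exactly `rows`."""
    table[:] = rows


def run_with_retry(write, table, rows):
    """Simulate a job whose first attempt lands its data but is retried anyway."""
    write(table, rows)   # attempt 1: data is written, but the job is reported as failed
    write(table, rows)   # attempt 2: the retry


appended = []
run_with_retry(insert_into, appended, [1, 2])      # appended is now [1, 2, 1, 2]

overwritten = []
run_with_retry(insert_overwrite, overwritten, [1, 2])  # overwritten is now [1, 2]
```

With append semantics the retry duplicates rows, while the overwrite converges to the same contents however many times it runs.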

yanghua Dec 9, 2025
Maintainer

hive partitions are business partitions

IMO, Hive partitions are not the "business partitions" that I meant. In my mind, "business partitions" are built on the Lance namespace.

wojiaodoubao Dec 9, 2025
Collaborator Author

I think the choice between INSERT OVERWRITE and INSERT INTO is important to your case: the former provides idempotence, so a retry can still reach a consistent state. The latter can't.

For example, suppose we have 1000 physical partitions (Lance tables). The user will submit 1000 INSERT INTO jobs, with each job handling one partition.

wojiaodoubao · 2025-12-11T07:26:35Z

wojiaodoubao
Dec 11, 2025
Collaborator Author

~~Based on our prior discussions, I would like to initiate a vote to introduce the experimental partition spec.~~

~~Please vote by commenting:~~
* +1 to approve
* 0 to abstain or neutral
* -1 if issues found (please include details)

~~### Summary~~
~~1. Lance introduces a partition specification at the Namespace level, implemented as a business partition.~~
~~2. Adopts the Table-per-Partition physical model, where each leaf partition is an independent Lance Table object.~~
3. The partition functions as a specification (not a constraint) : Compute engines (e.g., Spark Connector) will be the primary consumers of this spec, responsible for parsing partition metadata and providing a unified logical table view. Clients that do not recognize this spec can still access the underlying data via directory structure.
~~4. The first version will not support transactions. Atomicity is defined as a direction for future evolution and will be discussed in subsequent releases.~~

~~### Details~~
~~#279~~

PR: #279

cc @westonpace @wjones127 @jackye1995 @yanghua @majin1102

1 reply

jackye1995 Dec 12, 2025
Maintainer

Thanks for putting this up! Let's review the PR first, and have a separate voting thread on this once the PR is about right.
