Introduce Partition Table Spec in Lance Namespace #272
Hi @jackye1995 @yanghua @majin1102 @fangbo @zhangyue19921010, please share your thoughts, thanks very much!
Thanks for starting this discussion in the lance-namespace module! I think this is a valuable initiative. When we talk about "partition," it’s necessary to distinguish two conceptual layers:
Our prior discussions have mostly centered on Lance-level partitioning, but to fully address the problem, we need to implement engine-level partitioning. Here are two approaches we can explore:
1. Leverage namespace multi-table transactions (e.g., Directory Namespace v2)
In the engine layer, we could encapsulate multiple datasets under a namespace into a PartitionedTable. This approach wouldn't require the namespace to add new functionality beyond its current capabilities; we'd just use the existing namespace spec and define specific attributes within namespaces. This seems like a clear path because:
If Lance later introduces a partition spec, migrating away from this approach would be straightforward.
2. Add a partition spec at the namespace layer
As you noted, I personally don't recommend this. Equating a namespace directly to a Lance table would mean the namespace must dynamically generate a manifest for a large dataset by aggregating smaller ones during metadata loading. This introduces inherent conflicts: for example, row_ids and fragment_ids might be duplicated, and committing changes could demand specialized interfaces. Structurally, the namespace layer sits above tables, while a partition spec belongs inside a table. Merging these layers doesn't align with their design intent, in my view.
Perhaps we could outline a specific scenario and brainstorm around it? That might help ground the discussion.
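As a rough illustration of approach 1 above, here is a minimal sketch of how an engine-side wrapper might present several independent datasets under one namespace as a single logical table. Everything in it is hypothetical: `PartitionedTable`, `table_id`, `add_partition`, `prune`, and `scan` are names made up for this example, and plain Python lists stand in for real Lance datasets. It is not part of the Lance or lance-namespace API.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionedTable:
    """Hypothetical engine-level view over per-partition datasets in a namespace."""

    namespace: str
    partition_column: str
    # Maps a partition value to its rows (a stand-in for a real Lance dataset).
    partitions: dict = field(default_factory=dict)

    def table_id(self, value):
        # Each partition is just an ordinary table registered under the namespace.
        return f"{self.namespace}.{self.partition_column}={value}"

    def add_partition(self, value, rows):
        # Create the partition on first write, then append to it.
        self.partitions.setdefault(value, []).extend(rows)

    def prune(self, predicate):
        # Partition pruning: keep only partitions whose value satisfies the predicate.
        return [v for v in self.partitions if predicate(v)]

    def scan(self, predicate=lambda v: True):
        # Read only the datasets that survive pruning.
        for value in self.prune(predicate):
            yield from self.partitions[value]


table = PartitionedTable(namespace="sales", partition_column="dt")
table.add_partition("2024-01-01", [{"id": 1}, {"id": 2}])
table.add_partition("2024-01-02", [{"id": 3}])

pruned = table.prune(lambda dt: dt >= "2024-01-02")
rows = list(table.scan(lambda dt: dt >= "2024-01-02"))
```

In this sketch the namespace only needs to list tables and carry a few attributes (here, the partition column); pruning and routing live entirely in the engine layer. Atomic writes across partitions are not modeled.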
Hi @majin1102, thanks for your nice suggestion! My use case is as follows: I have two Lance Tables:
My requirements are:
I think we can validate the case by integrating it with Spark.
1. DDL
2. Query/Scan (Partition Pruning)
3. Insert Data to New Partitions
Example / benchmarks?
For your two use cases, could you describe in detail what the improvement would be? What kinds of bottlenecks do you see now, and how would this improve them? It might even be nice to develop a benchmark or two. That way we can show how this proposal, once completed, actually improves things. The problem I've had with the partition discussion thus far is that the benefits are often abstract.
Atomicity
In this description, is each row added in a separate transaction? Overall, if you have multiple Lance tables, it seems like there wouldn't be a way to atomically write to multiple partitions? Is that correct? Is that a limitation we think is acceptable?
Use case
I'm open to having some sort of "partitioning" at the namespace level. The use case I had in mind isn't the kind of time-based partitioning I think you are describing, that's common with Iceberg and Spark Tables. IMO that seems more like clustering to me. But we have other use cases where for business reasons we don't want to co-mingle data, which is multi-tenancy. Often users have a table where they store vectors and data for multiple of their users. When they query, they always query data for just one user. But for managing the table (creating indices, schema evolution) they'd like to manage the overall table as a single object. Right now they'd be forced to manage these "partitions" as multiple tables. This sort of "partitioning" corresponds to the "namespace" concept in vector databases like Pinecone or Turbopuffer.
As I understand it, whether the partitioning is multi-tenant or time-based, we need to expose the partition fields to the business layer and let users work through them. IMO all partitions are business partitions. Do we have a specific scenario that uses partitions as non-business partitions? @jackye1995
Could you elaborate more on the difference between "business partitioning" and "traditional partitioning" in your mind? I think the boundary is unclear. @yanghua
+1
I'm not sure non-transactional partitions could meet your use case. You mentioned
Overall, I think we'd better provide transactional atomicity. I don't think it's a good idea to provide partitioning on par with old Hive rather than with other table formats.
This is exactly the solution I had in mind.
What do you think? @wojiaodoubao @jackye1995
PR: #279
There was a partitioning discussion in the Lance community at: lance-format/lance#4125. The community is not yet ready to introduce partitioning at the Lance Dataset level.
In big data scenarios, physical partitioning is a widely adopted approach to organize data conveniently and efficiently. Below are two common use cases:
Rapid Dataset Cleanup
Lance uses a mark-and-sweep garbage collection mechanism, which does not immediately reclaim storage space when data is deleted. For users leveraging Lance (e.g., for model training), when they explicitly confirm that certain data is no longer needed and require immediate storage release, a physical partitioning-based solution would be far more effective.
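To make the cleanup argument concrete, here is a small sketch assuming a hypothetical layout with one directory per partition under a table root. It does not use Lance at all: the `dt=...` directory names and the `data.bin` placeholder files are invented for illustration. The point is only that dropping a physical partition amounts to deleting a directory, which frees its space right away.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical layout: one directory per partition under a table root.
root = Path(tempfile.mkdtemp())
for day in ("dt=2024-01-01", "dt=2024-01-02"):
    part = root / day
    part.mkdir()
    # Placeholder data file of 1 KiB; a real partition would hold dataset files.
    (part / "data.bin").write_bytes(b"\0" * 1024)


def used_bytes(path: Path) -> int:
    # Total size of all regular files below the given path.
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


before = used_bytes(root)

# Dropping a partition is a directory removal; no garbage-collection pass is needed.
shutil.rmtree(root / "dt=2024-01-01")

after = used_bytes(root)
remaining = sorted(p.name for p in root.iterdir())

# Clean up the temporary table root created for this example.
shutil.rmtree(root)
```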
Large-Scale Data Processing
A user needs to maintain a massive Lance table with 10 trillion records, performing hourly updates that affect approximately 1 billion records each time. Without physical partitioning, updates would have to operate on the entire 10-trillion-record table. While data scans can be accelerated via indexes, the overhead of scanning data and maintaining indexes remains prohibitively high. Physical partitioning offers a simpler, more efficient alternative.
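A quick back-of-the-envelope calculation shows the scale involved. The row counts come from the use case above; the partition count is purely an assumption chosen for illustration, as is the idea that each hourly update is confined to a single partition.

```python
TOTAL_ROWS = 10 * 10**12     # 10 trillion records in the table
ROWS_PER_UPDATE = 1 * 10**9  # about 1 billion records touched per hourly update

# Share of the table that one hourly update actually changes.
update_fraction = ROWS_PER_UPDATE / TOTAL_ROWS

# Assumption for illustration only: 10,000 equally sized partitions,
# with each update confined to one of them.
NUM_PARTITIONS = 10_000
rows_per_partition = TOTAL_ROWS // NUM_PARTITIONS

print(f"fraction of table touched per update: {update_fraction:.4%}")
print(f"rows per partition under the assumed layout: {rows_per_partition:,}")
```

Under these assumptions an update would operate on a dataset of about 1 billion rows instead of one of 10 trillion, so scans and index maintenance are scoped to roughly one ten-thousandth of the data.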
Here is my proposal: introduce a partitioned table specification under the Lance namespace. The specification is defined as follows:
A partitioned table is a specification rather than an enforcement constraint: