Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/src/format/table/.pages
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,5 @@ nav:
- Layout: layout.md
- Branch & Tag: branch_tag.md
- Row ID & Lineage: row_id_lineage.md
- MemTable & WAL: mem_wal.md
- index
2 changes: 1 addition & 1 deletion docs/src/format/table/index/system/.pages
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
title: System Indices
nav:
- Fragment Reuse: frag_reuse.md
- MemWAL: memwal.md
- MemWAL: mem_wal.md
12 changes: 12 additions & 0 deletions docs/src/format/table/index/system/mem_wal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# MemWAL Index

The MemWAL Index is a system index that serves as the centralized structure for all MemWAL metadata.
It stores configuration (region specs, indexes to maintain), merge progress, and region state snapshots.

A table has at most one MemWAL index.

For the complete specification, see:

- [MemWAL Index Overview](../../mem_wal.md#memwal-index) - Purpose and high-level description
- [MemWAL Index Details](../../mem_wal.md#memwal-index-details) - Storage format, schemas, and staleness handling
- [MemWAL Index Builder](../../mem_wal.md#memwal-index-builder) - Background process and configuration updates
27 changes: 0 additions & 27 deletions docs/src/format/table/index/system/memwal.md

This file was deleted.

1,006 changes: 1,006 additions & 0 deletions docs/src/format/table/mem_wal.md

Large diffs are not rendered by default.

Binary file added docs/src/images/mem_wal_overview.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/src/images/mem_wal_regional.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
4 changes: 2 additions & 2 deletions java/lance-jni/src/transaction.rs
Original file line number Diff line number Diff line change
Expand Up @@ -559,7 +559,7 @@ fn convert_to_java_operation_inner<'local>(
updated_fragments,
new_fragments,
fields_modified,
mem_wal_to_merge: _,
merged_generations: _,
fields_for_preserving_frag_bitmap,
update_mode,
inserted_rows_filter: _,
Expand Down Expand Up @@ -1048,7 +1048,7 @@ fn convert_to_rust_operation(
updated_fragments,
new_fragments,
fields_modified,
mem_wal_to_merge: None,
merged_generations: vec![],
fields_for_preserving_frag_bitmap,
update_mode,
inserted_rows_filter: None,
Expand Down
238 changes: 167 additions & 71 deletions protos/table.proto
Original file line number Diff line number Diff line change
Expand Up @@ -504,80 +504,176 @@ message FragmentReuseIndexDetails {
}
}

// ============================================================================
// MemWAL Index Types
// ============================================================================

// Region manifest containing epoch-based fencing and WAL state.
// Each region has exactly one active writer at any time.
message RegionManifest {
// Region identifier (UUID v4).
UUID region_id = 11;

// Manifest version number.
// Matches the version encoded in the filename.
uint64 version = 1;

// Region spec ID this region was created with.
// Set at region creation and immutable thereafter.
// A value of 0 indicates a manually-created region not governed by any spec.
uint32 region_spec_id = 10;

// Writer fencing token - monotonically increasing.
// A writer must increment this when claiming the region.
uint64 writer_epoch = 2;

// The most recent WAL entry ID that has been flushed to a MemTable.
// During recovery, replay starts from replay_after_wal_id + 1.
uint64 replay_after_wal_id = 3;

// The most recent WAL entry ID at the time manifest was updated.
// This is a hint, not authoritative - recovery must list files to find actual state.
uint64 wal_id_last_seen = 4;

// Next generation ID to create (incremented after each MemTable flush).
uint64 current_generation = 6;

// Field 7 removed: merged_generation moved to MemWalIndexDetails.merged_generations
// which is the authoritative source for merge progress.

// List of flushed MemTable generations and their directory paths.
repeated FlushedGeneration flushed_generations = 8;
}

// A flushed MemTable generation and its storage location.
message FlushedGeneration {
// Generation number.
uint64 generation = 1;

// Directory name relative to the region directory.
string path = 2;
}

// A region's merged generation, used in MemWalIndexDetails.
message MergedGeneration {
// Region identifier (UUID v4).
UUID region_id = 1;

// Last generation merged to base table for this region.
uint64 generation = 2;
}

// Tracks which merged generation a base table index has been rebuilt to cover.
// Used to determine whether to read from flushed MemTable indexes or base table.
message IndexCatchupProgress {
// Name of the base table index (must match an entry in maintained_indexes).
string index_name = 1;

// Per-region progress: the generation up to which this index covers.
// If a region is not present, the index is assumed to be fully caught up
// (i.e., caught_up_generation >= merged_generation for that region).
repeated MergedGeneration caught_up_generations = 2;
}

// Index details for MemWAL Index, stored in IndexMetadata.index_details.
// This is the centralized structure for all MemWAL metadata:
// - Configuration (region specs, indexes to maintain)
// - Merge progress (merged generations per region)
// - Region state snapshots
//
// Writers read this index to get configuration before writing.
// Readers read this index to discover regions and their state.
// A background process updates the index periodically to keep region snapshots current.
//
// Region snapshots are stored as a Lance file with one row per region.
// The schema has one column per RegionManifest field, with region fields as columns:
// region_id: fixed_size_binary(16) -- UUID bytes
// version: uint64
// region_spec_id: uint32
// writer_epoch: uint64
// replay_after_wal_id: uint64
// wal_id_last_seen: uint64
// current_generation: uint64
// merged_generation: uint64
// flushed_generations: list<struct<generation: uint64, path: string>>
message MemWalIndexDetails {
// Snapshot timestamp (Unix timestamp in milliseconds).
int64 snapshot_ts_millis = 1;

// Number of regions in the snapshot.
// Used to determine storage format without reading the snapshot data.
uint32 num_regions = 2;

// Inline region snapshots for small region counts.
// When num_regions <= threshold (implementation-defined, e.g., 100),
// snapshots are stored inline as serialized bytes.
// Format: Lance file bytes with the region snapshot schema.
optional bytes inline_snapshots = 3;

// Region specs defining how to derive region identifiers.
// This configuration determines how rows are partitioned into regions.
repeated RegionSpec region_specs = 7;

// Indexes from the base table to maintain in MemTables.
// These are index names referencing indexes defined on the base table.
// The primary key btree index is always maintained implicitly and
// should not be listed here.
//
// For vector indexes, MemTables inherit quantization parameters (PQ codebook,
// SQ params) from the base table index to ensure distance comparability.
repeated string maintained_indexes = 8;

// Last generation merged to base table for each region.
// This is updated atomically with merge-insert data commits, enabling
// conflict resolution when multiple mergers operate concurrently.
//
// Note: This is separate from region snapshots because:
// 1. merged_generations is updated by mergers (atomic with data commit)
// 2. region snapshots are updated by background index builder
repeated MergedGeneration merged_generations = 9;

// Per-index catchup progress tracking.
// When data is merged to the base table, base table indexes are rebuilt
// asynchronously. This field tracks which generation each index covers.
//
// For indexed queries, if an index's caught_up_generation < merged_generation,
// readers should use flushed MemTable indexes for the gap instead of
// scanning unindexed data in the base table.
//
// If an index is not present in this list, it is assumed to be fully caught up.
repeated IndexCatchupProgress index_catchup = 10;
}

repeated MemWal mem_wal_list = 1;
// Region spec definition.
message RegionSpec {
// Unique identifier for this spec within the index.
// IDs are never reused.
uint32 spec_id = 1;

message MemWalId {
// The name of the region that this specific MemWAL is responsible for.
string region = 1;
// Region field definitions that determine how to compute region identifiers.
repeated RegionField fields = 2;
}

// The generation of the MemWAL.
// Every time a new MemWAL is created and an old one is sealed,
// the generation number of the next MemWAL is incremented.
// At any given point of time for all MemWALs of the same name,
// there must be only 1 generation that is not sealed.
uint64 generation = 2;
}
// Region field definition.
message RegionField {
// Unique string identifier for this region field.
string field_id = 1;

// A combination of MemTable and WAL for fast upsert.
message MemWal {

enum State {
// MemWAL is open and accepting new entries
OPEN = 0;
// When a MemTable is considered full, the writer should update this MemWAL as sealed
// and create a new MemWAL to write to atomically.
SEALED = 1;
// When a MemTable is sealed, it can be flushed asynchronously to disk.
// This state indicates the data has been persisted to disk but not yet merged
// into the source table.
FLUSHED = 2;
// When the flushed data has been merged into the source table.
// After a MemWAL is merged, the cleanup process can delete the WAL.
MERGED = 3;
}

MemWalId id = 1;

// The MemTable location, which is likely an in-memory address starting with memory://.
// The actual details of how the MemTable is stored is outside the concern of Lance.
string mem_table_location = 2;

// the root location of the WAL.
// THe WAL storage durability determines the data durability.
// This location is immutable once set at MemWAL creation time.
string wal_location = 3;

// All entries in the WAL, serialized as U64Segment.
// Each entry in the WAL has a uint64 sequence ID starting from 0.
// The actual details of how the WAL entry is stored is outside the concern of Lance.
// In most cases this U64Segment should be a simple range.
// Every time the writer starts writing, it must always try to atomically write to the last entry ID + 1.
// If fails due to concurrent writer, it then tries to write to the +2, +3, +4, etc. entry ID until succeed.
// but if there are 2 writers accidentally writing to the same WAL concurrently,
// although one writer will fail to update this index at commit time,
// the WAL entry is already written,
// causing some holes within the U64Segment range.
bytes wal_entries = 4;

// The current state of the MemWAL, indicating its lifecycle phase.
// States progress: OPEN -> SEALED -> FLUSHED
// OPEN: MemWAL is accepting new WAL entries
// SEALED: MemWAL has been sealed and no longer accepts new WAL entries
// FLUSHED: MemWAL has been flushed to the source Lance table and can be cleaned up
State state = 5;

// The owner identifier for this MemWAL, used for compare-and-swap operations.
// When a writer wants to perform any operation on this MemWAL, it must provide
// the expected owner_id. This serves as an optimistic lock to prevent concurrent
// writers from interfering with each other. When a new writer starts replay,
// it must first atomically update this owner_id to claim ownership.
// All subsequent operations will fail if the owner_id has changed.
string owner_id = 6;

// The dataset version that last updated this MemWAL.
// This is set to the new dataset version whenever the MemWAL is created or modified.
uint64 last_updated_dataset_version = 7;
}
// Field IDs referencing source columns in the schema.
repeated int32 source_ids = 2;

// Well-known region transform name (e.g., "identity", "year", "bucket").
// Mutually exclusive with expression.
optional string transform = 3;

// DataFusion SQL expression for custom logic.
// Mutually exclusive with transform.
optional string expression = 4;

// Output type of the region value (Arrow type name).
string result_type = 5;

// Transform parameters (e.g., num_buckets for bucket transform).
map<string, string> parameters = 6;
}

17 changes: 7 additions & 10 deletions protos/transaction.proto
Original file line number Diff line number Diff line change
Expand Up @@ -235,8 +235,8 @@ message Transaction {
repeated DataFragment new_fragments = 3;
// The ids of the fields that have been modified.
repeated uint32 fields_modified = 4;
/// The MemWAL (pre-image) that should be marked as merged after this transaction
MemWalIndexDetails.MemWal mem_wal_to_merge = 5;
/// List of MemWAL region generations to mark as merged after this transaction
repeated MergedGeneration merged_generations = 5;
/// The fields that used to judge whether to preserve the new frag's id into
/// the frag bitmap of the specified indices.
repeated uint32 fields_for_preserving_frag_bitmap = 6;
Expand Down Expand Up @@ -305,15 +305,12 @@ message Transaction {
repeated DataReplacementGroup replacements = 1;
}

// Update the state of the MemWal index
// Update the merged generations in MemWAL index.
// This operation is used during merge-insert to atomically record which
// generations have been merged to the base table.
message UpdateMemWalState {

repeated MemWalIndexDetails.MemWal added = 1;

repeated MemWalIndexDetails.MemWal updated = 2;

// If a MemWAL is updated, its pre-image should be in the removed list.
repeated MemWalIndexDetails.MemWal removed = 3;
// Regions and generations being marked as merged.
repeated MergedGeneration merged_generations = 1;
}

// An operation that updates base paths in the dataset.
Expand Down
2 changes: 1 addition & 1 deletion python/src/transaction.rs
Original file line number Diff line number Diff line change
Expand Up @@ -229,7 +229,7 @@ impl FromPyObject<'_> for PyLance<Operation> {
updated_fragments,
new_fragments,
fields_modified,
mem_wal_to_merge: None,
merged_generations: vec![],
fields_for_preserving_frag_bitmap,
update_mode,
inserted_rows_filter: None,
Expand Down
Loading