
Conversation

@zhuxinjie-nz
Contributor

Summary

Implement a SequentialStorageBackend to store key-value data in buckets for efficient data management and maintenance.

Motivation

  • Group KV data into buckets to avoid creating too many small files.
  • Add metadata support in file storage for KV data to enable LRU/LFU/creation-time-based eviction later.

Behavior & Compatibility

  • The SequentialStorageBackend currently supports only low-level bucket I/O and is not fully integrated into the pipeline. No impact on Mooncake in its current state.
  • The follow-up implementation approach is inconsistent with the current StorageBackend's design pattern. Key inconsistencies include:
    1. On write: create buckets by synchronizing the offloading queue with the master via a heartbeat mechanism, enabling better scalability for independently deployed clients and supporting high-availability client scenarios.
    2. On read: enable distributed read capability for bucket data via global metadata registration, coordinated through the Transfer Engine.

@gemini-code-assist
Contributor

Summary of Changes

Hello @zhuxinjie-nz, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational SequentialStorageBackend to the Mooncake store, enabling key-value data to be organized into buckets. This change is motivated by the need to optimize data management by grouping related data, thereby reducing file overhead and preparing the system for future advanced eviction strategies. While the new backend provides low-level I/O capabilities, it is not yet integrated into the main Mooncake pipeline, ensuring no immediate impact on existing operations.

Highlights

  • New SequentialStorageBackend: Introduced a new SequentialStorageBackend class designed to store key-value data in logical 'buckets' for more efficient management and to reduce the number of small files.
  • Metadata Structures: Added SequentialObjectMetadata, SequentialBucketMetadata, and SequentialOffloadMetadata to support detailed tracking of object and bucket properties, paving the way for advanced eviction policies like LRU/LFU.
  • Bucket Operations: Implemented core functionalities for the new backend, including BatchOffload, BatchQuery, BatchLoad, GetBucketKeys, BucketScan, and GetStoreMetadata.
  • Error Handling: Expanded the ErrorCode enum with new values specific to bucket operations, such as BUCKET_NOT_FOUND, BUCKET_ALREADY_EXISTS, and KEYS_ULTRA_BUCKET_LIMIT.
  • Unit Tests: Added comprehensive unit tests in storage_backend_test.cpp to validate the functionality of the SequentialStorageBackend, covering data offloading, querying, loading, and bucket scanning.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a SequentialStorageBackend for managing key-value data in buckets, which is a solid approach to handle numerous small files efficiently. The implementation includes functionalities for batch operations like offloading, querying, and loading, and is accompanied by new tests, which is commendable.

However, the review has uncovered several critical thread-safety issues. Specifically, the bucket ID generation mechanism is not thread-safe, and several methods that modify shared data structures do so without adequate locking, creating potential race conditions and data corruption risks. Additionally, there are opportunities to improve error handling by properly checking the results of functions returning tl::expected instead of unsafely calling .value().

I have provided detailed comments on these points with suggestions for remediation. Addressing these concerns will significantly improve the robustness and reliability of the new storage backend.

LOG(ERROR) << "Storage backend already initialized";
return tl::unexpected(ErrorCode::INTERNAL_ERROR);
}
std::shared_lock lock(mutex_);
Contributor

critical

This function modifies shared class members (object_bucket_map_, buckets_, total_size_) under a std::shared_lock. A shared_lock is intended for concurrent read access only. Using it during write operations leads to a race condition and undefined behavior. You must use a std::unique_lock to ensure exclusive access and maintain thread safety.

        std::unique_lock lock(mutex_);

if (!write_result) {
LOG(ERROR) << "vector_write failed for: " << bucket_id
<< ", error: " << write_result.error();
buckets_.erase(bucket_id);
Contributor

critical

The buckets_ map is being modified here with erase() without holding a lock on mutex_. Since buckets_ is a shared resource accessed by multiple threads, this operation is not thread-safe and can lead to race conditions or data corruption. You should acquire a std::unique_lock before modifying the map.

        {
            std::unique_lock lock(mutex_);
            buckets_.erase(bucket_id);
        }

Comment on lines 692 to 702
tl::expected<int64_t, ErrorCode> SequentialStorageBackend::CreateBucketId() {
auto cur_time_stamp = time_gen();
if(cur_time_stamp <= m_i64LastTimeStamp){
m_i64SequenceID = (m_i64SequenceID + 1) & SEQUENCE_MASK;
} else{
m_i64SequenceID = 0;
}
m_i64LastTimeStamp = cur_time_stamp;
return (cur_time_stamp << TIMESTAMP_SHIFT)
| (m_i64SequenceID << SEQUENCE_ID_SHIFT);
}
Contributor

critical

This function for creating bucket IDs is not thread-safe. It reads and modifies the shared member variables m_i64LastTimeStamp and m_i64SequenceID without any synchronization. If CreateBucketId is called concurrently from multiple threads (e.g., via BatchOffload), it can lead to race conditions and result in duplicate bucket IDs. You must protect this critical section with a mutex.

tl::expected<int64_t, ErrorCode> SequentialStorageBackend::CreateBucketId() {
    std::unique_lock lock(mutex_);
    auto cur_time_stamp = time_gen();
    if(cur_time_stamp <= m_i64LastTimeStamp){
        m_i64SequenceID = (m_i64SequenceID + 1) & SEQUENCE_MASK;
    } else{
        m_i64SequenceID = 0;
    }
    m_i64LastTimeStamp = cur_time_stamp;
    return (cur_time_stamp  << TIMESTAMP_SHIFT)
        | (m_i64SequenceID << SEQUENCE_ID_SHIFT);
}

Comment on lines 115 to 118
char *buf = new char[object_it->second.size+1];
buf[object_it->second.size] = '\0';
memcpy(buf, object_it->second.ptr, object_it->second.size);
auto data = std::string(buf);
Contributor

critical

This code allocates memory for buf using new char[] but never deallocates it, causing a memory leak in the test. Additionally, it's inefficient as it involves multiple data copies. A safer and more efficient approach is to construct the std::string directly from the buffer pointer and size. This avoids both the memory leak and the unnecessary memcpy.

Suggested change
char *buf = new char[object_it->second.size+1];
buf[object_it->second.size] = '\0';
memcpy(buf, object_it->second.ptr, object_it->second.size);
auto data = std::string(buf);
std::string data(static_cast<char*>(object_it->second.ptr), object_it->second.size);

LOG(ERROR) << "batch object is empty";
return tl::make_unexpected(ErrorCode::INVALID_KEY);
}
auto bucket_id = CreateBucketId().value();
Contributor

high

Calling .value() on a tl::expected without first checking if it contains a value is unsafe. If CreateBucketId() were to return an error, this would throw a tl::bad_expected_access exception and crash the program. It's crucial to handle the error case properly. This pattern of unchecked .value() calls appears multiple times in this file and should be addressed throughout.

    auto bucket_id_result = CreateBucketId();
    if (!bucket_id_result) {
        return tl::make_unexpected(bucket_id_result.error());
    }
    auto bucket_id = bucket_id_result.value();

Comment on lines 223 to 224
tl::expected<void, ErrorCode> BatchQuery(const std::vector<std::string>& keys,
std::unordered_map<std::string, SequentialObjectMetadata> &batche_object_metadata);
Contributor

medium

There's a typo in the parameter name batche_object_metadata. It should likely be batch_object_metadata for consistency with standard terminology.

                                             std::unordered_map<std::string, SequentialObjectMetadata> &batch_object_metadata);

Comment on lines 308 to 309
int64_t m_i64SequenceID = 0;
int64_t m_i64LastTimeStamp = 0;
Contributor

medium

These member variables use Hungarian notation (m_i64...), which is generally discouraged in modern C++. It would be more consistent with the rest of the class to use a trailing underscore for private members (e.g., sequence_id_, last_timestamp_).

    int64_t sequence_id_ = 0;
    int64_t last_timestamp_ = 0;

}
std::unique_lock lock(mutex_);
total_size_ += bucket->data_size + bucket->meta_size;
for (auto key:bucket->keys) {
Contributor

medium

This loop copies each key string from bucket->keys. To improve performance by avoiding unnecessary allocations and copies, you should iterate using a const reference.

    for (const auto& key:bucket->keys) {

auto file = std::move(open_file_result.value());
for (const auto& key : keys) {
size_t offset;
auto slice = batched_slices[key];
Contributor

medium

Using the [] operator on batched_slices is risky because it will insert a default-constructed Slice if the key doesn't exist, which could hide bugs. It's safer to use find() and check for the key's existence, or use at() which would throw an exception if the key is missing, making debugging easier.

        auto slice_it = batched_slices.find(key);
        if (slice_it == batched_slices.end()) {
            LOG(ERROR) << "Slice for key " << key << " not found in batched_slices";
            return tl::make_unexpected(ErrorCode::INVALID_KEY);
        }
        auto& slice = slice_it->second;

Comment on lines 682 to 685
tl::expected<std::string, ErrorCode> SequentialStorageBackend::GetBucketDataPath(int64_t bucket_id) {
std::string sep = storage_path_.empty() || storage_path_.back() == '/' ? "" : "/";
return storage_path_ + sep + std::to_string(bucket_id);
}
Contributor

medium

This function is declared to return tl::expected<std::string, ErrorCode>, but it never actually returns an error. The implementation can be simplified by changing the return type to std::string. This also applies to GetBucketMetadataPath.

std::string SequentialStorageBackend::GetBucketDataPath(int64_t bucket_id) {
    std::string sep = storage_path_.empty() || storage_path_.back() == '/' ? "" : "/";
    return storage_path_ + sep + std::to_string(bucket_id);
}

@ykwd
Collaborator

ykwd commented Oct 27, 2025

Thanks for the great work!

While reviewing the PR, I found several parts of the implementation that I couldn’t fully understand in terms of the underlying motivation and design reasoning.

Would it be possible to provide a high-level design description, along with an explanation of the bucket data structure and how it fits into the overall storage architecture? That would really help make the review more effective and ensure we’re aligned on the intended design direction.

@ykwd ykwd self-requested a review October 28, 2025 02:08
@xiaguan
Collaborator

xiaguan commented Oct 28, 2025

Maybe you could resolve Gemini's Critical and High Priority comments. For the lock, you could try GUARDED_BY (you can search for it—we already use it), and use clang++ to compile; you'll get proper thread-safety analysis.

@zhuxinjie-nz
Contributor Author

Maybe you could resolve Gemini's Critical and High Priority comments. For the lock, you could try GUARDED_BY (you can search for it—we already use it), and use clang++ to compile; you'll get proper thread-safety analysis.

Thank you, I'll fix this issue.

@ykwd ykwd left a comment

Thanks for this work! I have left some comments.

* - total_size_: cumulative data size of all stored objects
*/
mutable std::shared_mutex mutex_;
std::string storage_path_;
Collaborator

We can use the "GUARDED_BY" to ensure these objects are accessed correctly.

FileMode mode) const;
};

class SequentialStorageBackend {
Collaborator

"Sequential" in this context is a little bit misleading. Shall we consider using another name? e.g., BucketStorageBackend

auto meta_path = GetBucketMetadataPath(id).value();
auto open_file_result = OpenFile(meta_path, FileMode::Read);
if (!open_file_result) {
LOG(INFO) << "Failed to open file for reading: " << meta_path;
Collaborator

This should be a LOG(ERROR) message.

return tl::make_unexpected(ErrorCode::FILE_OPEN_FAIL);
}
auto file = std::move(open_file_result.value());
LOG(INFO) << "Writing bucket with path: " << bucket_data_path;
Collaborator

Logging every successful operation would produce far too much output and disturb users. Consider using VLOG(1) for debugging. The same applies to other places.

auto write_bucket_result = WriteBucket(bucket_id, bucket, iovs);
if (!write_bucket_result) {
LOG(ERROR) << "Failed to write bucket with id: " << bucket_id;
buckets_.erase(bucket_id);
Collaborator

Shouldn’t we add the bucket information to buckets_ after all files have been written?

  1. Because once we add it to buckets_, read requests can already access it even though the data hasn’t been fully written yet.
  2. If the write operation fails, then this bucket shouldn’t be added to buckets_ at all.

Contributor Author

  • Since the bucket ID is used as the file name in the storage backend, reserving a slot in buckets_ upfront prevents multiple threads from concurrently writing to the same bucket file under concurrent access.
  • A read operation can only proceed after the key has been inserted into object_bucket_map_.

Contributor Author

Holding the lock from before writing until write completion would result in excessively long lock duration.

LOG(INFO) << "Writing bucket with path: " << bucket_data_path;

auto write_result = file->vector_write(iovs.data(), iovs.size(), 0);
if (!write_result) {
Collaborator

If write fails, shall we try to delete this file? It seems like this file will never be used. This also applies to the metadata file.

Contributor Author

If write fails, shall we try to delete this file? It seems like this file will never be used. This also applies to the metadata file.

If FILE_WRITE_FAIL occurs, the PosixFile destructor will delete the file.

Collaborator

I see. That makes sense.

@stmatengss
Collaborator

Use clang-format to pass format checking. @zhuxinjie-nz

@zhuxinjie-nz
Contributor Author

Use clang-format to pass format checking. @zhuxinjie-nz

A file was missing, now fixed

@xiaguan
Collaborator

xiaguan commented Oct 30, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a BucketStorageBackend for more efficient key-value data management by grouping data into buckets. The implementation is well-structured, including a new SharedMutex wrapper, a BucketIdGenerator, and comprehensive unit tests. My review focuses on improving robustness and maintainability by addressing a critical issue with serialization, improving error handling consistency, removing unsafe coding patterns, and fixing a test that invokes undefined behavior.

YLT_REFL(BucketObjectMetadata, offset, key_size, data_size);

struct BucketMetadata {
mutable std::shared_mutex statistics_mutex;
Contributor

critical

The BucketMetadata struct contains a std::shared_mutex statistics_mutex. std::shared_mutex is not serializable, but this struct is marked for serialization with YLT_REFL. This will likely cause compilation errors or undefined behavior at runtime when struct_pb::to_pb is called on a BucketMetadata instance. Since statistics_mutex does not appear to be used, it should be removed to avoid this issue. If a mutex is needed for BucketMetadata instances, it should be managed separately from the serializable struct.

Contributor Author

fix @xiaguan

Comment on lines 387 to 389
} else {
LOG(ERROR) << "Key " << key << " does not exist";
}
Contributor

high

In BatchQuery, when a key is not found, an error is logged, but the function continues and returns a success status. This is inconsistent with BatchLoad, which returns ErrorCode::INVALID_KEY in a similar scenario. This behavior can be misleading for callers. Consider returning an error, such as ErrorCode::OBJECT_NOT_FOUND, if any key is not found.

        } else {
            LOG(ERROR) << "Key " << key << " does not exist";
            return tl::make_unexpected(ErrorCode::OBJECT_NOT_FOUND);
        }

Contributor Author

fix @xiaguan

Comment on lines 466 to 468
GetBucketDataPath(bucket_id).value();
auto bucket_meta_path =
GetBucketMetadataPath(bucket_id).value();
Contributor

high

The code frequently uses .value() on tl::expected objects (e.g., GetBucketDataPath(bucket_id).value()) without first checking if they contain a value. This is a fragile pattern that can lead to crashes if the function is ever modified to return an error. It's safer to check for an error before accessing the value. This pattern appears in multiple places within this file (Init, WriteBucket, StoreBucketMetadata, etc.).

Contributor Author

fix @xiaguan

Comment on lines 757 to 758
auto bucket_data_path = GetBucketDataPath(bucket_id);
return *bucket_data_path + ".meta";
Contributor

high

Unsafe dereferencing of tl::expected. The GetBucketDataPath function returns a tl::expected, which could contain an error. Dereferencing it with * without checking for an error first is unsafe and will lead to a crash if an error is present. You should check for the error before dereferencing.

    auto bucket_data_path = GetBucketDataPath(bucket_id);
    if (!bucket_data_path) {
        return tl::make_unexpected(bucket_data_path.error());
    }
    return bucket_data_path.value() + ".meta";

Contributor Author

fix @xiaguan

BucketStorageBackend::BuildBucket(
const std::unordered_map<std::string, std::vector<Slice>>& batch_object,
std::vector<iovec>& iovs) {
SharedMutexLocker lock(&mutex_);
Contributor

medium

The BuildBucket method acquires an exclusive lock on mutex_, but it doesn't access any shared member variables of the BucketStorageBackend class. This lock is unnecessary and could create a performance bottleneck by serializing bucket-building operations. This method could be a static helper function or a free function that doesn't take a lock.

Contributor Author

fix @xiaguan

Comment on lines 710 to 721
auto bucket = buckets_.find(bucket_id);
if (bucket == buckets_.end()) {
LOG(ERROR) << "Failed to open file for reading: "
<< storage_filepath;
return tl::make_unexpected(ErrorCode::FILE_OPEN_FAIL);
}
auto object_metadata = buckets_[bucket_id]->object_metadata.find(key);
if (object_metadata == buckets_[bucket_id]->object_metadata.end()) {
LOG(ERROR) << "Failed to open file for reading: "
<< storage_filepath;
return tl::make_unexpected(ErrorCode::FILE_OPEN_FAIL);
}
Contributor

medium

There are a couple of issues in this section:

  1. Misleading Errors: When a bucket or object metadata is not found, the code logs a generic "Failed to open file for reading" message and returns ErrorCode::FILE_OPEN_FAIL. The message and error code should be more specific, like "Bucket not found" (BUCKET_NOT_FOUND).
  2. Redundant Lookup: buckets_.find(bucket_id) is called, and then buckets_[bucket_id] is used, which performs a second lookup. It's more efficient to use the iterator from the find call.

Here is a suggested change that addresses both points.

        auto bucket_it = buckets_.find(bucket_id);
        if (bucket_it == buckets_.end()) {
            LOG(ERROR) << "Bucket not found with id: " << bucket_id;
            return tl::make_unexpected(ErrorCode::BUCKET_NOT_FOUND);
        }
        auto object_metadata = bucket_it->second->object_metadata.find(key);
        if (object_metadata == bucket_it->second->object_metadata.end()) {
            LOG(ERROR) << "Object metadata not found for key '" << key
                       << "' in bucket " << bucket_id;
            return tl::make_unexpected(ErrorCode::OBJECT_NOT_FOUND);
        }

Comment on lines +127 to +130
EXPECT_TRUE(
mtx.try_lock_shared()); // Multiple shared locks should be allowed
// Note: This test does not attempt recursive locking (UB), just checks
// concurrent shared access.
Contributor

medium

This test attempts to acquire a shared lock on mtx while it's already held by a SharedMutexLocker in the same thread. Calling try_lock_shared on a std::shared_mutex that is already owned by the calling thread (in any mode) results in undefined behavior. This test should be removed or rewritten to test concurrency from a different thread, similar to how SharedAccessIsConcurrent is implemented.


struct StorageObjectMetadata {
int64_t bucket_id;
size_t offset;
Collaborator

Use uint64_t or int64_t since size_t varies in size across different machines.

I think int64_t is better?

Contributor Author

ok

@xiaguan
Collaborator

xiaguan commented Oct 30, 2025

So here are my several points of confusion:

  1. We need a doc (or something similar) that shows the on-disk storage format for each "Bucket" (as you call it in your design) when we persist key/value pairs to disk. For example: the first 8 bytes are the magic number, bytes 8–16 are the checksum, etc. We use 4 bytes for key length and 8 bytes for value length, and so on.

  2. We need a benchmark for this—similar to LevelDB's db_bench —showing this SSD KV engine's write throughput and read throughput under Mooncake Store's use case. For instance, with value = 1 MB, how fast are batch_put and batch_load (at batch sizes of 8, 16, 32)? Can it hit the SSD’s max throughput?

I think we could also refer to Cachelib's Navy Block Cache documentation for more best practices on efficiently storing large key/value pairs on SSD: https://cachelib.org/docs/Cache_Library_Architecture_Guide/navy_overview#block-cache

@ykwd
Collaborator

ykwd commented Oct 30, 2025

The code looks good to me. We can merge this PR for now and leave the further optimization to subsequent work.

@ykwd ykwd merged commit 2cf86bf into kvcache-ai:main Oct 31, 2025
11 checks passed