Skip to content

enhance: add cache shard disk usage metric#98

Open
sunby wants to merge 3 commits into
zilliztech:masterfrom
sunby:codex/cache-shard-disk-usage
Open

enhance: add cache shard disk usage metric#98
sunby wants to merge 3 commits into
zilliztech:masterfrom
sunby:codex/cache-shard-disk-usage

Conversation

@sunby

@sunby sunby commented Jun 24, 2026

Copy link
Copy Markdown

Summary

  • add optional shard attribution to cachinglayer translator metadata
  • add internal_cache_shard_disk_usage_bytes{data_type,shard} gauge
  • update CacheSlot loaded/unloaded transitions to increment/decrement loaded disk bytes by shard
  • cover load and eviction accounting in cachinglayer cache slot test

Companion Milvus PR: milvus-io/milvus#50733

Signed-off-by: sunby <sunbingyi1992@gmail.com>
@sunby sunby marked this pull request as ready for review June 24, 2026 06:19

@sparknack sparknack left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these review notes to inline comments on the corresponding diff lines.

Comment thread src/cachinglayer/Metrics.cpp Outdated

prometheus::Gauge&
cache_shard_disk_usage_bytes(CellDataType type, const std::string& shard) {
return internal_cache_shard_disk_usage_bytes_family.Add({{"data_type", CellDataTypeLabel(type)}, {"shard", shard}});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dynamic label needs an explicit lifecycle. Family::Add keeps a time series until Remove, so a long-running QueryNode will retain one {data_type, shard} series for every insert channel it has ever loaded, even after the gauge returns to 0. Please add a ref-counted handle that removes the gauge when the last slot for that shard is gone, or avoid using a dynamic shard label here.

Comment thread include/cachinglayer/Translator.h Outdated
std::optional<LoadingOverheadConfig> loading_overhead;
// Optional shard identifier for per-shard cache disk usage metrics.
// Empty means this translator is not attributed to any shard.
std::string shard;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This puts Milvus query shard / insert-channel semantics into the generic cachinglayer Meta, and the field is public mutable state. I would prefer an explicit metric-attribution type or handle with documented value domain, lifecycle, and cardinality limit, or keeping this attribution in the Milvus layer instead of common caching metadata.

Comment thread src/cachinglayer/Metrics.cpp Outdated
case CellDataType::OTHER:
return "other";
}
return "unknown";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the existing fail-fast behavior for unknown CellDataType. The other metric helpers throw on unknown enum values; returning "unknown" here would silently hide bad metadata or a future enum addition under a normal-looking label.

Comment thread include/cachinglayer/CacheSlot.h Outdated
storage_usage_tracking_enabled_(storage_usage_tracking_enabled),
loading_timeout_(loading_timeout),
warmup_loading_timeout_(warmup_loading_timeout) {
if (!shard_.empty()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because empty shard skips this metric entirely, internal_cache_shard_disk_usage_bytes cannot be summed and reconciled with internal_cache_loaded_bytes{location="disk"}. Direct Manager::ChargeLoadedResource paths are also unattributed. Please either make the name/help text clear that this is only attributed cache-slot disk bytes, or define an explicit representation for unattributed usage.

EXPECT_EQ(monitor::cache_shard_disk_usage_bytes(CellDataType::SCALAR_FIELD, kShard).Value(), before + 384);
}

ASSERT_TRUE(cache_slot->ManualEvictAll());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This covers the load + manual eviction happy path, but the risky accounting paths are still untested. Please also cover two slots sharing the same shard where one slot is released while the other remains loaded, and destroying a CacheSlot without manual eviction to verify the destructor / CacheCell::mark_unload path refunds the shard gauge.

sunby added 2 commits June 24, 2026 17:31
Signed-off-by: sunby <sunbingyi1992@gmail.com>
Signed-off-by: sunby <sunbingyi1992@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants