enhance: add cache shard disk usage metric#98
Conversation
Signed-off-by: sunby <sunbingyi1992@gmail.com>
|
|
||
| prometheus::Gauge& | ||
| cache_shard_disk_usage_bytes(CellDataType type, const std::string& shard) { | ||
| return internal_cache_shard_disk_usage_bytes_family.Add({{"data_type", CellDataTypeLabel(type)}, {"shard", shard}}); |
There was a problem hiding this comment.
This dynamic label needs an explicit lifecycle. Family::Add keeps a time series until Remove, so a long-running QueryNode will retain one {data_type, shard} series for every insert channel it has ever loaded, even after the gauge returns to 0. Please add a ref-counted handle that removes the gauge when the last slot for that shard is gone, or avoid using a dynamic shard label here.
| std::optional<LoadingOverheadConfig> loading_overhead; | ||
| // Optional shard identifier for per-shard cache disk usage metrics. | ||
| // Empty means this translator is not attributed to any shard. | ||
| std::string shard; |
There was a problem hiding this comment.
This puts Milvus query shard / insert-channel semantics into the generic cachinglayer Meta, and the field is public mutable state. I would prefer an explicit metric-attribution type or handle with documented value domain, lifecycle, and cardinality limit, or keeping this attribution in the Milvus layer instead of common caching metadata.
| case CellDataType::OTHER: | ||
| return "other"; | ||
| } | ||
| return "unknown"; |
There was a problem hiding this comment.
Please keep the existing fail-fast behavior for unknown CellDataType. The other metric helpers throw on unknown enum values; returning "unknown" here would silently hide bad metadata or a future enum addition under a normal-looking label.
| storage_usage_tracking_enabled_(storage_usage_tracking_enabled), | ||
| loading_timeout_(loading_timeout), | ||
| warmup_loading_timeout_(warmup_loading_timeout) { | ||
| if (!shard_.empty()) { |
There was a problem hiding this comment.
Because empty shard skips this metric entirely, internal_cache_shard_disk_usage_bytes cannot be summed and reconciled with internal_cache_loaded_bytes{location="disk"}. Direct Manager::ChargeLoadedResource paths are also unattributed. Please either make the name/help text clear that this is only attributed cache-slot disk bytes, or define an explicit representation for unattributed usage.
| EXPECT_EQ(monitor::cache_shard_disk_usage_bytes(CellDataType::SCALAR_FIELD, kShard).Value(), before + 384); | ||
| } | ||
|
|
||
| ASSERT_TRUE(cache_slot->ManualEvictAll()); |
There was a problem hiding this comment.
This covers the load + manual eviction happy path, but the risky accounting paths are still untested. Please also cover two slots sharing the same shard where one slot is released while the other remains loaded, and destroying a CacheSlot without manual eviction to verify the destructor / CacheCell::mark_unload path refunds the shard gauge.
Signed-off-by: sunby <sunbingyi1992@gmail.com>
Signed-off-by: sunby <sunbingyi1992@gmail.com>
Summary
internal_cache_shard_disk_usage_bytes{data_type,shard}gaugeCacheSlotloaded/unloaded transitions to increment/decrement loaded disk bytes by shardCompanion Milvus PR: milvus-io/milvus#50733