Skip to content

fix(CubeMaster): expire node health after stale nodemeta heartbeat#455

Open
zyl1121 wants to merge 4 commits into
TencentCloud:masterfrom
zyl1121:fix/nodemeta-heartbeat-health
Open

fix(CubeMaster): expire node health after stale nodemeta heartbeat#455
zyl1121 wants to merge 4 commits into
TencentCloud:masterfrom
zyl1121:fix/nodemeta-heartbeat-health

Conversation

@zyl1121
Copy link
Copy Markdown

@zyl1121 zyl1121 commented Jun 4, 2026

Motivation

On the nodemeta path, CubeMaster can keep a node healthy after cubelet stops reporting heartbeats because the last reported Ready=True fact is reused as the current health result.

This makes stale nodes look healthy in control-plane reads and can leave them eligible for scheduling decisions longer than expected.

What this PR changes

  • derive current node health from the last reported readiness fact plus heartbeat freshness
  • make nodemeta reads return the current effective healthy result instead of a stale cached value
  • make healthy=false explicit in API output
  • expose unhealthy_reason for HeartbeatExpired and ReportedNotReady
  • make localcache read paths refresh health from cached facts without mutating shared cached nodes
  • apply the same health refresh logic to GetNodesByIp
  • make prefilter skip stale metadata / metric nodes instead of only logging timeout

Testing

go test -vet=off ./pkg/localcache ./pkg/nodemeta ./pkg/selector/prefilter

Manual validation:

  • stale heartbeat is returned as healthy=false
  • unhealthy_reason distinguishes HeartbeatExpired and ReportedNotReady
  • /internal/meta/nodes/{id} and /internal/node reflect the same health semantics

Documentation

No standalone documentation update is included in this PR. However, this change does affect the JSON shape of node responses: healthy=false is now emitted explicitly instead of being omitted.

Related issue

Fixes #454

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 4, 2026

Claude finished @zyl1121's task in 1m 48s —— View job


✅ Code Review — PR #455

Progress

  • Gather context and understand the PR
  • Read all changed files
  • Analyze code for correctness, bugs, and edge cases
  • Check test coverage and completeness
  • Post review feedback

Overall Assessment

This is a clean, well-structured fix that correctly addresses the stale-node-health problem. The architecture is sound, the tests are thorough, and the prefilter hardening (Fatalf → Warnf + continue) is a significant reliability improvement.

Decision: APPROVE with minor observations — no blocking issues found.


Design Review

1. Clean separation of concerns (nodehealth package) ✅

The new pkg/base/nodehealth/health.go correctly encapsulates:

  • Status struct combining Healthy + UnhealthyReason
  • EvaluateFromFacts() — derives health from the four facts (reportedReady, heartbeatTime, now, timeout)
  • ReadyConditionTrue() — extracts Kubernetes Ready condition

This makes the evaluation logic independently testable and reusable across localcache, nodemeta, and any future consumer.

2. Immutability on read paths ✅

All read paths now clone before mutating health state:

  • localcache/export.go:cloneNodeWithCurrentHealth() — clones the node.Node, then applies health
  • nodemeta/service.go:cloneSnapshotWithCurrentHealth() — same pattern for NodeSnapshot

Both TestGetNodeRefreshesCurrentHealthFromCachedFacts (export_test.go:270) and TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat (export_test.go:309) explicitly verify that cached source nodes are not mutated during reads. This prevents a class of subtle race-condition bugs.

3. Prefilter hardening ✅

The change from Fatalf to Warnf + continue in prefilter.go:77-94 is the most impactful reliability improvement. Previously, a stale metric or metadata update would crash the entire scheduler via Fatalf. Now it gracefully skips the stale node and continues. This alone prevents cascading failures when node heartbeats lag.

4. Proper JSON serialization ✅

  • Healthy removed omitempty → now serializes "healthy":false explicitly (consumers can distinguish "unknown" from "unhealthy")
  • UnhealthyReason with omitempty → only present when there's a reason
  • ReportedReady with json:"-" → internal field, never serialized

This is validated by TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady in service_resource_test.go:145.


Potential Issues & Observations

1. Duplicated timeout constant (minor)

metadataHealthTimeout() in export.go:166 and healthTimeout() in service.go:388 compute the same value: SyncMetaDataInterval + 10*time.Second. If one is updated independently, they'll drift. Consider extracting to a shared helper.

2. Lock scope in downNodeCache (very minor)

In node_cache.go:205-217, the lockMetaData mutex is released before delSortedNodes(n) is called. This was the case before the PR, so it's not a regression, but there's a tiny window between the unlock and the sorted-nodes deletion where a reader iterating sortedNodesByClusters could see a node that was just marked ReportedReady=false. In practice this is harmless since the next cycle will correct it.

3. Fast path removed in GetHealthyNodesByInstanceType

The PR removes the old fast path:

// old
if n == -1 {
    return clusterNodes
}

The old code returned raw references from the sorted cluster cache, which was efficient but allowed stale Healthy values to be used. The new code always clones and re-evaluates health, which is correct but has a performance cost on every call. This is the right trade-off — health correctness > micro-optimization.

4. appendSortedNodes reads n.Healthy without health recomputation

In node_cache.go:231, appendSortedNodes checks n.Healthy directly rather than recomputing it. This is fine for current callers because:

  • From addNodeCache/constructNode: Healthy was just set from the DB live-status
  • From updateNodeFromMetaData: Healthy was just set from n.Healthy on line 299
    But it means if someone calls addNodeCache with a node whose Healthy is out of date, the health-based filtering in appendSortedNodes will use that stale value. The node will still be in the cache, so subsequent reads through GetNodes, GetNode, etc. will re-evaluate correctly — it's only the sortedNodesByClusters membership that could lag.

Test Coverage

Test File What it covers
TestGetNodeRefreshesCurrentHealthFromCachedFacts export_test.go:270 Stale heartbeat → Healthy=false, source not mutated
TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat export_test.go:309 Fresh vs stale node in cluster list
TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts export_test.go:343 Health recomputation in IP-based lookup
TestNodeConcurrentCountersUpdateCachedNodeFromReadClone export_test.go:371 Incr/Decr routing to cached node
TestCurrentHealthStatusExpiresStaleHeartbeat service_resource_test.go:55 Fresh + stale heartbeat scenarios
TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth service_resource_test.go:89 ReportedReady preserved, Healthy recomputed
TestCloneSnapshotWithCurrentHealthRefreshesStaleNode service_resource_test.go:121 Cloned snapshot has correct health, source unchanged
TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady service_resource_test.go:145 JSON serialization correctness
TestPreFilterExcludesExpiredHeartbeatNode prefilter_test.go:34 Prefilter skips stale metadata nodes
TestPreFilterExcludesMetricTimeoutNode prefilter_test.go:74 Prefilter skips stale metric nodes

Missing (non-blocking): No test for GetNodes() or GetHealthyNodes() health refresh, though they use the same cloneNodeWithCurrentHealth path tested elsewhere.


Summary

The PR correctly fixes the stale heartbeat problem, is well-tested, and introduces no regressions. The design cleanly separates concerns, read paths are immutable, and the prefilter changes significantly improve scheduler robustness. No blocking issues found.

@zyl1121 zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from c8b8153 to 5eb0cda Compare June 4, 2026 02:55
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 4, 2026

Claude finished @zyl1121's task in 2m 45s —— View job


Code Review — PR #455

Progress

  • Analyze context and understand the request
  • Read all changed files
  • Analyze code for correctness, bugs, and edge cases
  • Check test coverage and completeness
  • Post review feedback

Overall Assessment

This PR correctly addresses the stale-node-health problem described in #454. The architecture is clean, the design is well-isolated, and tests are thorough. The Fatalf → Warnf + continue change in prefilter is a significant reliability improvement.

Decision: APPROVE — no blocking issues found. Observations below are minor and can be tracked separately.


Design Review

1. nodehealth package — clean abstraction ✅

pkg/base/nodehealth/health.go encapsulates the health evaluation logic:

  • Status struct with Healthy + UnhealthyReason
  • EvaluateFromFacts(reportedReady, heartbeatTime, now, timeout) — pure function from four inputs to one output
  • Evaluate(conditions, heartbeatTime, now, timeout) — convenience wrapper that extracts ReadyConditionTrue

This is independently testable and reusable across localcache, nodemeta, and any future consumer. The two exported reason constants (ReasonReportedNotReady, ReasonHeartbeatExpired) make the API self-documenting.

2. Read-path immutability ✅

Both localcache and nodemeta now clone nodes before mutating health on read:

  • localcache/export.go:170-179cloneNodeWithCurrentHealth() clones via n.Clone(), then applies health evaluation from cached facts
  • nodemeta/service.go:488-492cloneSnapshotWithCurrentHealth() clones via cloneSnapshot(), then applies applyCurrentHealth()

All read paths (GetNodes, GetNode, GetNodesByIp, GetHealthyNodesByInstanceType, GetNode in nodemeta, ListNodes) use these helpers. Tests at export_test.go:296-306 and service_resource_test.go:121-143 explicitly verify source nodes are not mutated during reads.

3. Prefilter hardening ✅

In prefilter.go:78-94, the change from Fatalf to Warnf + continue for metadata and metric timeouts means stale nodes no longer crash the scheduler. This was the most impactful reliability win.

The Fatalf at line 73-77 for CpuLoadUsage > CpuTotal is correctly left unchanged — that condition represents a data integrity issue that warrants a crash.

4. JSON serialization ✅

node.go changes:

  • Healthy removed omitempty"healthy":false is now explicit in JSON
  • UnhealthyReason keeps omitempty → only serialized when non-empty
  • ReportedReady uses json:"-" → never serialized

The NodeSnapshot in nodemeta/service.go:98-118 mirrors these correctly.


Observations

1. metadataHealthTimeout() / healthTimeout() duplication

export.go:166-168 and service.go:388-390 both compute the same value: SyncMetaDataInterval + 10s. Both delegate to nodehealth.MetadataTimeout(), so the formula itself is shared — but they are separate wrapper functions. If the timeout derivation logic changes beyond the function call, only one wrapper might be updated. Minor, but worth tracking.

2. appendSortedNodes uses cached Healthy directly

At node_cache.go:230-233, appendSortedNodes gates on n.Healthy rather than recomputing health from facts:

func (l *local) appendSortedNodes(n *node.Node) {
    if n == nil || !n.Healthy {
        return
    }

This means:

  • For nodes coming from toSchedulerNode (nodemeta path): Healthy was just computed from currentHealthStatus(), so it's correct.
  • For nodes from constructNode (legacy DB path): Healthy equals constants.HeartbeatHealth == elem.LiveStatus, which is the persisted state rather than a freshness-based computation. A stale node loaded from DB with LiveStatus == HeartbeatHealth will be added to sortedNodesByClusters.

This is not a correctness bug because GetHealthyNodesByInstanceType recomputes health on every read via cloneNodeWithCurrentHealth() — so stale nodes in sortedNodesByClusters are filtered out during reads. The node appears in the sorted list but is excluded by the health filter. The impact is limited to a slightly larger sortedNodesByClusters list than necessary.

3. Redundant health checks in prefilter

prefilter.go:48-53 already skips !n.Healthy nodes. Then lines 90-95 check MetaDataUpdateAt timeout separately. Since Healthy is recomputed from MetaDataUpdateAt + timeout, the !n.Healthy check should catch stale-heartbeat nodes. The timeout check at line 90 acts as defense-in-depth in case a stale node somehow has Healthy=true (e.g., race with an update). The redundancy is acceptable.

4. downNodeCache mutates cached node in-place

node_cache.go:205-212 mutates the cached node directly with lockMetaData held, then releases the lock before calling delSortedNodes(n):

l.lockMetaData.Lock()
old := v.(*node.Node)
old.ReportedReady = false
old.Healthy = false
...
l.lockMetaData.Unlock()
l.delSortedNodes(n)

The mutation is correct (protected by lockMetaData), and delSortedNodes has its own lockSortedNodes lock. The brief window between Unlock and delSortedNodes is harmless — sortedNodesByClusters may include this node for an instant longer, but the next DB sync cleans it up. Not a regression.

5. incrementNodeCacheCounts in node_cache.go — consistent update

At node_cache.go:372, IncrNodeConcurrent calls getMutableNode(n), which fetches the cached node by n.ID(). Since read paths return a clone, IncrNodeConcurrent correctly routes to the cached reference. This means the clone/read pattern doesn't break concurrent counters.


Test Coverage

Test File What it covers
TestGetNodeRefreshesCurrentHealthFromCachedFacts export_test.go:270 Stale heartbeat → Healthy=false, source not mutated
TestGetHealthyNodesByInstanceTypeFiltersExpiredHeartbeat export_test.go:309 Fresh vs stale in cluster list, source not mutated
TestGetNodesByIpRefreshesCurrentHealthFromCachedFacts export_test.go:343 Health recomputation on IP-based lookup
TestNodeConcurrentCountersUpdateCachedNodeFromReadClone export_test.go:371 IncrNodeConcurrent/DecrNodeConcurrent route to cached node
TestCurrentHealthStatusExpiresStaleHeartbeat service_resource_test.go:55 Fresh + stale heartbeat scenarios
TestUpdateNodeStatusStoresLastReportedReadyButExportsCurrentHealth service_resource_test.go:89 ReportedReady preserved, Healthy recomputed
TestCloneSnapshotWithCurrentHealthRefreshesStaleNode service_resource_test.go:121 Cloned snapshot health correct, source unchanged
TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady service_resource_test.go:145 JSON serialization
TestPreFilterExcludesExpiredHeartbeatNode prefilter_test.go:34 Prefilter skips stale metadata node
TestPreFilterExcludesMetricTimeoutNode prefilter_test.go:74 Prefilter skips stale metric node
TestToSchedulerNodeDoesNotForgeMetricUpdate service_resource_test.go:33 Metric fields left zero in nodemeta→scheduler conversion
TestSyncNodeTemplatesReconcilesHeartbeatState export_test.go:160 Template reconciliation during heartbeat

Coverage is thorough. Missing (non-blocking): no direct test for GetNodes() or GetHealthyNodes() health refresh, though both use the same cloneNodeWithCurrentHealth path.


Summary

The PR correctly fixes the stale-heartbeat problem with a clean architecture, immutable read paths, and comprehensive tests. The preloader reliability improvements (Fatalf → continue) are significant. No blocking issues.


@zyl1121 zyl1121 changed the title cubemaster: expire stale nodemeta node health fix(CubeMaster): expire node health after stale nodemeta heartbeat Jun 4, 2026
Comment on lines +1 to +3
// Copyright (c) 2026 Tencent Inc.
// SPDX-License-Identifier: Apache-2.0
//
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not add this line

// Copyright (c) 2026 Tencent Inc.
// SPDX-License-Identifier: Apache-2.0
//

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not add this line

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not add this line

@fslongjin Removed the extra header comments in a follow-up commit. Thanks for catching that.

@zyl1121 zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from 5eb0cda to 10fcd30 Compare June 4, 2026 08:08
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 4, 2026

Claude finished @zyl1121's task in 1m 1s —— View job


Re-review after commit 10fcd30 — Extra header comments removed

Progress

  • Verify that extra header comments have been removed
  • Confirm files are correct after the fix

✅ Fix verified

The follow-up commit 10fcd30 ("fix(CubeMaster): drop extra header comments") correctly addresses both comments from @fslongjin:

  1. CubeMaster/pkg/base/nodehealth/health.go — now starts directly with package nodehealth (no blank copyright/SPDX header above the package declaration).

  2. CubeMaster/pkg/selector/prefilter/prefilter_test.go — same fix applied, starts directly with package prefilter.

Both files now match the project's convention (as seen in prefilter.go:1-6 and export.go:1-6), where the copyright header comment sits above a package-level doc comment, not as a floating block on its own.

No other changes were detected in this push. The rest of the PR's code, tests, and design remain as reviewed in the previous round.

Decision: APPROVE — no issues remain.


Copy link
Copy Markdown
Member

@fslongjin fslongjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR — the facts-vs-evaluation separation (ReportedReady vs Healthy) and the centralized nodehealth package are clean architectural improvements. I left a few inline comments on areas that could be tightened up.


Summary of inline comments

  1. prefilter.go — Redundant MetaDataUpdateAt timeout check; GetHealthyNodesByInstanceType already expires stale heartbeats via cloneNodeWithCurrentHealth.

  2. node.go:Clone() — Struct copy reads LocalCreateNum non-atomically (data race per Go memory model).

  3. service.go — Duplicate health evaluation in UpdateNodeStatus (line 239) and toSchedulerNode (line 412); two time.Now() calls on the same snapshot.

  4. node.go:Healthy JSON tagomitempty removed; API contract change worth calling out explicitly.

  5. service.go:RegisterNode — Returns cloneSnapshot without applying current health, inconsistent with GetNode/ListNodes/UpdateNodeStatus.

"CalleeCluster": n.ClusterLabel,
}).Fatalf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt)
}).Warnf("%s MetaDataUpdate timeout,lastupdate:%v", n.IP, n.MetaDataUpdateAt)
continue
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔁 Redundant heartbeat-timeout check

GetHealthyNodesByInstanceType (line 40) already clones every node with cloneNodeWithCurrentHealth, which calls nodehealth.EvaluateFromFacts using the same MetadataTimeout. A node whose heartbeat has expired will already have Healthy=false and will have been excluded from the candidate set before this loop runs.

This second MetaDataUpdateAt guard therefore duplicates what the health evaluation already does. Over time, keeping two separate timeout calculations risks the two checks drifting (e.g. someone changes how cloneNodeWithCurrentHealth computes the timeout but forgets to update this prefilter copy, or vice versa).

Suggestion: Rely solely on the n.Healthy check above (line 49). That gives a single source of truth for heartbeat-liveness decisions.

cloned.VirtualNodeQuotaArray = append([]int64(nil), n.VirtualNodeQuotaArray...)
}
return &cloned
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Non-atomic read of LocalCreateNum

The struct copy cloned := *n reads every field including LocalCreateNum (an int64), which is accessed concurrently via atomic.AddInt64 (in LocalCreateNumIncrBy) and atomic.LoadInt64 (in LocalCreateConcurrentLimit).

Per the Go memory model this is a data race and would be flagged by go test -race. In practice on amd64 it is safe (aligned 8-byte reads are atomic at the hardware level), but on 32-bit platforms (ARM) this could tear.

If this project only targets amd64, a brief comment acknowledging the deliberate non-atomic copy would help future readers (and suppress race-detector noise). Otherwise, consider using atomic.LoadInt64(&n.LocalCreateNum) for the copy.

@@ -419,7 +438,9 @@ func toSchedulerNode(snap *NodeSnapshot) *node.Node {
OssClusterLabel: snap.ClusterLabel,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔄 Duplicate health evaluation with UpdateNodeStatus line 239

UpdateNodeStatus already calls applyCurrentHealth(snap, time.Now()) (line 239, position 46 in this diff), which runs currentHealthStatus internally. syncLocalcache on line 241 then calls toSchedulerNode, which calls currentHealthStatus(snap, time.Now()) again here.

The two time.Now() calls are microseconds apart, so the result is effectively identical — but it is redundant work and creates a maintenance risk: if someone later extends the health-decision logic in one call site but not the other, the two representations of the same snapshot's health will diverge.

Suggestion: Have applyCurrentHealth return the nodehealth.Status it computed (e.g. func applyCurrentHealth(...) nodehealth.Status), and let toSchedulerNode consume it instead of re-deriving from scratch. This keeps a single evaluation per heartbeat update.

UnhealthyReason string `json:"UnhealthyReason,omitempty"`

Score float64 `json:"Score,omitempty"`

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📢 API contract change: omitempty removed

The Healthy field changed from json:"Healthy,omitempty" to json:"Healthy", meaning "healthy":false is now always present in JSON output instead of being omitted. This is deliberate (the nodemeta test confirms it), but it is a wire-format change that existing API consumers may depend on.

Consider calling this out in the PR description and the next changelog entry so integrators know to expect the new field shape.

applyCurrentHealth(out, now)
return out
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency: RegisterNode doesn't use this helper

cloneSnapshotWithCurrentHealth is now used by GetNode (line 300) and ListNodes (line 310), and UpdateNodeStatus calls applyCurrentHealth inline (line 239). But RegisterNode (line 199, unchanged) still returns a plain cloneSnapshot(snap) without calling applyCurrentHealth.

The caller gets Healthy=false (zero value) with an empty UnhealthyReason — rather than UnhealthyReason="HeartbeatExpired" which would be the consistent answer for a node that has never sent a heartbeat yet.

Should RegisterNode also use cloneSnapshotWithCurrentHealth for consistency?

@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 4, 2026

Claude finished @zyl1121's task in 2m 20s —— View job


Claude is analyzing the latest PR state...

Progress

  • Analyze all 5 inline review comments from @fslongjin
  • Verify current code state against each concern
  • Check if remaining fixes are needed
  • Post analysis

Analysis of @fslongjin's Review Against Current PR State

Commit faba88d ("tighten stale-node health follow-ups") was pushed after @fslongjin's review and addresses all technical concerns. Here's the per-comment breakdown:

1. ✅ prefilter.go — Redundant MetaDataUpdateAt timeout check

Status: Fixed in faba88d

The redundant metaDataUpdateAtTimeout variable and the time.Since(n.MetaDataUpdateAt) > metaDataUpdateAtTimeout guard were both removed. The prefilter now relies solely on n.Healthy (line 47), which is already recomputed from heartbeat freshness by GetHealthyNodesByInstanceTypecloneNodeWithCurrentHealthnodehealth.EvaluateFromFacts.

2. ✅ node.go:Clone() — Non-atomic read of LocalCreateNum

Status: Fixed in faba88d

Clone() now performs an atomic.LoadInt64(&n.LocalCreateNum) before the struct copy and overwrites the copied field, with an explanatory comment:

// Clone provides a best-effort read-side snapshot. Mutable counters such
// as LocalCreateNum are refreshed via atomic loads after the structural
// copy so cloned read models stay aligned with the write path.
localCreateNum := atomic.LoadInt64(&n.LocalCreateNum)
cloned := *n
cloned.LocalCreateNum = localCreateNum

3. ✅ service.go — Duplicate health evaluation

Status: Fixed in faba88d

toSchedulerNode no longer calls currentHealthStatus(snap, time.Now()). Instead it reads snap.Healthy and snap.UnhealthyReason directly — fields that were already set by applyCurrentHealth earlier in the call chain (UpdateNodeStatus line 240). There is now exactly one health evaluation per heartbeat update.

4. 📢 node.go:Healthy JSON tag — omitempty removed

Status: Noted, no code change needed

The omitempty removal is a deliberate wire-format change (confirmed by TestNodeSnapshotJSONIncludesHealthyFalseAndHidesReportedReady). As @fslongjin suggested, this should be called out in the PR description and next changelog entry — that's a documentation/communication task, not a code fix.

5. ✅ RegisterNode — Inconsistency with cloneSnapshotWithCurrentHealth

Status: Fixed in faba88d

Commit faba88d added applyCurrentHealth(snap, time.Now()) inside RegisterNode (line 197) before returning cloneSnapshot(snap) (line 200). This means the returned snapshot now has correct health (e.g., Healthy=false, UnhealthyReason="HeartbeatExpired" for a fresh node) rather than zero-value fields.

The function still returns cloneSnapshot(snap) rather than cloneSnapshotWithCurrentHealth(snap, time.Now()), but since applyCurrentHealth was just called, both produce identical results. This is consistent in practice.


Summary

All 5 inline review items from @fslongjin have been addressed:

# Concern Status
1 Redundant MetaDataUpdateAt check in prefilter ✅ Fixed
2 Non-atomic Clone() LocalCreateNum read ✅ Fixed
3 Duplicate health evaluation in toSchedulerNode ✅ Fixed
4 JSON omitempty API contract change 📢 Noted for docs
5 RegisterNode missing health application ✅ Fixed

No further code changes are needed. The PR is in good shape.

@zyl1121
Copy link
Copy Markdown
Author

zyl1121 commented Jun 4, 2026

Thanks for this PR — the facts-vs-evaluation separation (ReportedReady vs Healthy) and the centralized nodehealth package are clean architectural improvements. I left a few inline comments on areas that could be tightened up.

Summary of inline comments

  1. prefilter.go — Redundant MetaDataUpdateAt timeout check; GetHealthyNodesByInstanceType already expires stale heartbeats via cloneNodeWithCurrentHealth.
  2. node.go:Clone() — Struct copy reads LocalCreateNum non-atomically (data race per Go memory model).
  3. service.go — Duplicate health evaluation in UpdateNodeStatus (line 239) and toSchedulerNode (line 412); two time.Now() calls on the same snapshot.
  4. node.go:Healthy JSON tagomitempty removed; API contract change worth calling out explicitly.
  5. service.go:RegisterNode — Returns cloneSnapshot without applying current health, inconsistent with GetNode/ListNodes/UpdateNodeStatus.

Thanks for the detailed review. I addressed the issues you found in follow-up commit, and I also updated the PR description to explicitly mention that healthy=false is now emitted in JSON output.

@fslongjin
Copy link
Copy Markdown
Member

Hi @zyl1121, thanks for this PR!

The DCO check is currently failing because all 4 commits are missing the Signed-off-by trailer. The CubeSandbox project requires every commit to carry a DCO sign-off.

Could you please:

  1. Amend each commit to add your Signed-off-by line (e.g., Signed-off-by: zhengyilei <zhengyilei@myhexin.com>)
  2. Force push the updated branch

You can do this with an interactive rebase:

git rebase -i HEAD~4
# mark each commit as 'reword', and append:
# Signed-off-by: zhengyilei <zhengyilei@myhexin.com>

Or more conveniently use git rebase --signoff:

git rebase --signoff HEAD~4
git push --force-with-lease

Once all commits have the sign-off, the DCO check should pass. Thanks!

@zyl1121 zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from faba88d to 7ebdb38 Compare June 5, 2026 08:48
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 5, 2026

Claude finished @zyl1121's task in 1m 22s —— View job


PR #455 Re-trigger Analysis

Progress

  • Gather context — check git state and new commits
  • Analyze the request and determine what action to take
  • Update GitHub comment with results

Current State

This is a re-trigger on a PR synchronize event. After examining the branch state:

✅ No new commits since last review

The same 4 commits reviewed previously are still present (latest: 7ebdb38, authored 2026-06-04 17:31:37 +0800). There are no new changes to review.

The PR is in good shape:

  • All previous review feedback from @fslongjin was addressed in commit faba88d / 7ebdb38
  • @fslongjin approved the PR on 2026-06-05
  • Test coverage is thorough (unit tests for localcache, nodemeta, prefilter)
  • The architecture (centralized nodehealth package, immutable read paths, prefilter hardening) is solid

⚠️ Remaining item: DCO sign-off

@fslongjin noted on 2026-06-05 that the DCO check is failing because all 4 commits are missing the Signed-off-by trailer. @zyl1121 (the author), you'll need to add the sign-off trailers. The easiest approach:

git rebase --signoff HEAD~4
git push --force-with-lease

This adds Signed-off-by: Your Name <your.email@example.com> to each commit. I cannot perform this step on your behalf, as the sign-off must come from the actual author.

Summary

Item Status
Code review feedback addressed ✅ All 5 items fixed in follow-up commits
PR approved by @fslongjin
DCO check passing ❌ Needs author action
New commits to review None — same 4 commits as before

Once the DCO sign-off is applied, this PR should be ready to merge.


@zyl1121
Copy link
Copy Markdown
Author

zyl1121 commented Jun 5, 2026

Hi @zyl1121, thanks for this PR!

The DCO check is currently failing because all 4 commits are missing the Signed-off-by trailer. The CubeSandbox project requires every commit to carry a DCO sign-off.

Could you please:

  1. Amend each commit to add your Signed-off-by line (e.g., Signed-off-by: zhengyilei <zhengyilei@myhexin.com>)
  2. Force push the updated branch

You can do this with an interactive rebase:

git rebase -i HEAD~4
# mark each commit as 'reword', and append:
# Signed-off-by: zhengyilei <zhengyilei@myhexin.com>

Or more conveniently use git rebase --signoff:

git rebase --signoff HEAD~4
git push --force-with-lease

Once all commits have the sign-off, the DCO check should pass. Thanks!

Fixed and force-pushed the branch with signed-off commits. Thanks!

zhengyilei added 4 commits June 5, 2026 17:02
Derive current node health from the last reported readiness fact
and heartbeat freshness so stale heartbeats no longer remain healthy
in nodemeta reads, localcache-backed reads, or scheduler prefilter.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Reuse a shared helper for nodemeta heartbeat expiry so localcache,
nodemeta, and prefilter keep the same timeout rule and avoid drift.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Remove the added file header comments from the new Go files to match
maintainer review feedback and keep this PR aligned with current repo
expectations.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
Remove the redundant prefilter heartbeat timeout check, carry the
current health state through nodemeta registration and scheduler
projection, and harden read-side cloning around mutable node state.

Signed-off-by: zhengyilei <zheng_yilei@qq.com>
@zyl1121 zyl1121 force-pushed the fix/nodemeta-heartbeat-health branch from 7ebdb38 to 0aac550 Compare June 5, 2026 09:03
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 5, 2026

Claude finished @zyl1121's task in 34s —— View job


Progress

  • Analyze PR synchronize event context
  • Check for new commits on the branch
  • Verify DCO sign-off status
  • Update comment with findings

Analysis — PR #455 Synchronize Event

Current State

This is a re-trigger on a PR synchronize event. After examining the branch:

✅ No new commits since last review

The same 4 commits reviewed previously are still present (latest: 0aac550, all with DCO sign-off added). There are no new code changes to review.

✅ DCO sign-off confirmed

All 4 commits now carry the Signed-off-by: zhengyilei <zheng_yilei@qq.com> trailer:

3175c3e cubemaster: expire node health after stale nodemeta heartbeat
d1b27d4 cubemaster: centralize nodemeta heartbeat timeout
6c09b10 fix(CubeMaster): drop extra header comments
0aac550 fix(CubeMaster): tighten stale-node health follow-ups

The DCO check should now pass.

Summary

Item Status
Code review feedback addressed ✅ All 5 items fixed
PR approved by @fslongjin
DCO sign-off applied
New commits to review None — same 4 commits

The PR is fully ready — all review feedback has been addressed, the DCO issue has been resolved, and the PR has been approved. No further action needed from my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Report] CubeMaster keeps nodemeta nodes healthy after heartbeats stop reporting

2 participants