feat: refactor cni telemetry #3149
base: master
23e8b82 to 0613803
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
b956ec4 to dd9ca83
LGTM on @ramiro-gamarra's approval
I may still be missing some details about the purpose of this refactor, but it seems to me that logs are getting duplicated and the abstractions introduced are not cleaning up the code much yet.
lgtm
8c83456 to 94a7991
/azp run Azure Container Networking PR
This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days.
we will split this part of the pr into its own pr; a telemetry event was added back which was previously removed; undo this pr to add those telemetry statements back
- remove reflect
- remove duplicated telemetry and telemetry buffer
- remove unused fields in report manager
- force access to telemetry client fields through methods
- move telemetry start/connect code closer to start of plugin execution
we use SendError where we would have previously called reportPluginError (no log emitted); we don't set error message in cni report because the error message and event message fields both end up in the Message field in the cni telemetry service
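The commit notes above (field access through methods, and SendError filling only the event message) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Client` and `Report` types, the `SetReportDetails` and `Sent` methods, the `SendError` signature, and the in-memory `sent` slice are hypothetical and are not taken from the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errEndpoint is a hypothetical error used only for this illustration.
var errEndpoint = errors.New("failed to create endpoint")

// Report is a hypothetical, trimmed-down stand-in for a CNI report.
// It has a single message field, mirroring the note that the error message
// and event message both end up in one Message field downstream.
type Report struct {
	ContainerID  string
	Operation    string
	EventMessage string
}

// Client is a hypothetical telemetry client. Its fields are unexported,
// so callers can only read or change them through methods.
type Client struct {
	mu     sync.Mutex
	report Report
	sent   []Report // in-memory stand-in for the telemetry service
}

// SetReportDetails updates the report fields under the lock.
func (c *Client) SetReportDetails(containerID, operation string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.report.ContainerID = containerID
	c.report.Operation = operation
}

// SendError queues a copy of the current report with the error text as the
// event message. A nil error is ignored.
func (c *Client) SendError(err error) {
	if err == nil {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	r := c.report // copy, so the stored report is not mutated
	r.EventMessage = err.Error()
	c.sent = append(c.sent, r)
}

// Sent returns a copy of the queued reports.
func (c *Client) Sent() []Report {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]Report(nil), c.sent...)
}

func main() {
	c := &Client{}
	c.SetReportDetails("abc123", "ADD")
	c.SendError(errEndpoint)
	for _, r := range c.Sent() {
		fmt.Printf("%s %s: %s\n", r.ContainerID, r.Operation, r.EventMessage)
	}
}
```

Keeping the fields unexported means every read and write goes through the mutex, which is one plausible way to get the method-only access described in the commit notes.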
63f38ef
to
19b4227
Compare
PR Overview
This PR refactors the telemetry handling within Azure Container Networking to improve log quality and consistency. Key changes include:
- Introducing a package‑level telemetry client (AIClient) and replacing ad‑hoc TelemetryBuffer instances.
- Updating several components (plugin, network, and tests) to use the new telemetry client.
- Minor logging tweaks such as adjusting log levels and refining log messages.
Reviewed Changes
File | Description |
---|---|
telemetry/telemetry_client_test.go | Tests that validate the new telemetry client behaviors. |
telemetry/telemetry_client.go | Introduces package‑level AIClient and thread‑safe telemetry calls. |
telemetry/telemetrybuffer.go | Refactors telemetry buffer connection handling and log levels. |
network/endpoint_test.go | Updates unit tests for pointer‑to‑struct formatting functions. |
network/endpoint.go | Expands PrettyString to include additional endpoint fields. |
cni/network/plugin/main.go | Updates telemetry client usage in plugin startup and error reporting. |
cni/network/stateless/main.go | Refactors telemetry handling to use AIClient for stateless mode. |
telemetry/telemetry.go | Removes unused fields from telemetry reports. |
network/manager.go | Adjusts logging for endpoint state updates with clearer keys. |
cni/network/common.go & network.go | Removes obsolete telemetry fields and consolidates telemetry setup. |
Test files in network and cni/network | Remove redundant telemetry instantiations in favor of AIClient. |
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
telemetry/telemetrybuffer.go:311
- Changing the log level from Error to Warn for failing to kill the telemetry service process may mask critical failures. Please verify that downgrading the severity is intended.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))
cni/network/network.go:43
- Ensure that telemetryClient is initialized and used consistently after the refactor, as mixing the old and new telemetry setups could lead to unexpected behaviors.
telemetryClient = telemetry.AIClient
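The concern in the first suppressed comment, that a Warn record may go unseen where an Error record would not, can be illustrated with a small sketch. It uses the standard library's `log/slog` (Go 1.21+) rather than zap, which the quoted code uses, so it is a stand-in and not the repository's logger setup. The `emitted` helper, the level constants, and the process name are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// Aliases so callers do not need to import log/slog themselves.
const (
	levelWarn  = slog.LevelWarn
	levelError = slog.LevelError
)

// emitted logs one "failed to kill process" record at the given level through
// a handler with the given minimum level, and returns whatever was written.
// An empty result means the record was filtered out.
func emitted(threshold, level slog.Level) string {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: threshold})
	logger := slog.New(handler)
	logger.Log(context.Background(), level, "Failed to kill process",
		slog.String("process", "telemetry-service"), // hypothetical name
		slog.Any("error", errors.New("access denied")),
	)
	return buf.String()
}

func main() {
	// A sink that only accepts Error and above drops the Warn record.
	fmt.Printf("warn reaches error-only sink: %t\n", emitted(levelError, levelWarn) != "")
	fmt.Printf("error reaches error-only sink: %t\n", emitted(levelError, levelError) != "")
}
```

Whether the downgrade matters in practice depends on the minimum level configured for the real logger and on any alerting keyed to Error records, neither of which is shown in this conversation.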
PR Overview
This PR refactors the telemetry implementation in the Azure Container Networking codebase to improve logging quality, simplify telemetry usage, and remove unused metrics. Key changes include replacing ad hoc telemetry buffer instances with a package‐level AIClient, updating log severity in telemetry buffer routines, and removing legacy telemetry report fields across multiple modules.
Reviewed Changes
File | Description |
---|---|
telemetry/telemetry_client.go | Refactored telemetry client functions to leverage global AIClient instance. |
telemetry/telemetrybuffer.go | Changed log severity (Error → Warn) during telemetry service process shutdown. |
network/endpoint.go | Updated the PrettyString method and added documentation for FormatSliceOfPointersToString. |
cni/network/plugin/main.go | Replaced direct telemetry buffer usage with telemetry.AIClient calls. |
cni/network/stateless/main.go | Refactored telemetry connectivity to use AIClient instead of local TelemetryBuffer. |
telemetry/telemetry.go | Removed unused fields from CNIReport and ReportManager. |
network/manager.go | Modified logging in update endpoint state to use descriptive interface name keys. |
Various test files (network_test.go, network_windows_test.go, network_linux_test.go) | Removed obsolete telemetry objects from test configurations. |
cni/network/common.go & cni/network/network.go | Removed legacy telemetry helper functions; centralized telemetry through AIClient. |
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (3)
telemetry/telemetrybuffer.go:311
- [nitpick] Changing the log level from Error to Warn may reduce noise, but ensure that this lower severity does not hide critical issues during process termination. If intentional, consider adding a comment to clarify the rationale.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))
network/endpoint.go:158
- It appears that FormatSliceOfPointersToString is defined more than once in this file. Consolidate the duplicate definitions into a single implementation to avoid inconsistency.
func FormatSliceOfPointersToString[T any](slice []*T) string {
cni/network/network.go:297
- [nitpick] Consider renaming setCNIReportDetails to reflect its updated responsibility of setting telemetryClient values (e.g. updateTelemetryReportDetails) to improve clarity.
func (plugin *NetPlugin) setCNIReportDetails(containerID, opType, msg string) {
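For readers unfamiliar with the generic helper quoted above, here is a minimal sketch of what a function with that signature could do. Only the signature comes from the review comment. The body, including the `", "` separator and the `<nil>` rendering, is an assumption for illustration and is not the implementation in network/endpoint.go. The `iface` type is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// FormatSliceOfPointersToString renders the values behind a slice of pointers
// so that logs show field contents instead of memory addresses.
// Assumed behavior: nil entries are rendered as "<nil>" and entries are
// joined with ", ".
func FormatSliceOfPointersToString[T any](slice []*T) string {
	parts := make([]string, 0, len(slice))
	for _, p := range slice {
		if p == nil {
			parts = append(parts, "<nil>")
			continue
		}
		parts = append(parts, fmt.Sprintf("%+v", *p))
	}
	return strings.Join(parts, ", ")
}

// iface is a hypothetical struct used only to show the output shape.
type iface struct {
	Name  string
	Index int
}

func main() {
	items := []*iface{{Name: "eth0", Index: 2}, nil}
	fmt.Println(FormatSliceOfPointersToString(items))
}
```

Having one definition of such a helper, as the comment suggests, keeps the rendering of pointer slices consistent wherever it is logged.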
Reason for Change:
Currently, the telemetry that CNI sends is insufficient to debug CNI issues. This PR refactors the CNI telemetry to send more and better-quality logs.
Examples of logged information (will be added in a separate PR; this PR is focused on refactoring)
Potential additions:
Issue Fixed:
Requirements:
Notes:
Pipeline run to prove logs sent to kusto: https://msazure.visualstudio.com/One/_build/results?buildId=108208651&view=results
Passing run: https://msazure.visualstudio.com/One/_build/results?buildId=108563465&view=results