feat: refactor cni telemetry #3149

Open: QxBytes wants to merge 19 commits into master from alew/refactor-telemetry

Contributor

@QxBytes QxBytes commented Nov 14, 2024

Reason for Change:

Currently, the telemetry that CNI sends is insufficient to debug CNI issues. This PR refactors the CNI telemetry to send more, better-quality logs.

  • Moves the telemetry client into a package-level variable so it is accessible everywhere
  • Stops sending certain metrics that are not used
  • Sets the subcontext to the container id. The container id is kept consistent throughout CNI calls for the same pod, meaning an ADD and DEL call (and all related logs) for the same pod will have the same subcontext/container id. The container id is also what is stored in stateless mode as one of the keys.
  • Sets the operation id before any telemetry events are sent. The operation id is used for sampling should we end up enabling it.
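
A minimal sketch of the pattern these bullets describe, under stated assumptions. `AIClient` is the name this PR uses for the package-level client; the `Client` type, its methods (`SetSubcontext`, `SetOperationID`, `SendEvent`, `Events`), and the event format are hypothetical stand-ins and not the repository's actual API. The slice of strings replaces a real telemetry sink so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical, simplified telemetry client.
// Fields are unexported so callers must go through methods.
type Client struct {
	mu          sync.Mutex
	subcontext  string   // container ID, shared by ADD and DEL for the same pod
	operationID string   // set before any event is sent
	sent        []string // stand-in for a real telemetry sink
}

// AIClient is the package-level instance reachable from anywhere in the package.
var AIClient = &Client{}

// SetSubcontext records the container ID used to correlate related events.
func (c *Client) SetSubcontext(containerID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.subcontext = containerID
}

// SetOperationID records the operation ID attached to subsequent events.
func (c *Client) SetOperationID(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.operationID = id
}

// SendEvent tags the message with the current operation ID and subcontext.
func (c *Client) SendEvent(msg string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.sent = append(c.sent, fmt.Sprintf("[%s][%s] %s", c.operationID, c.subcontext, msg))
}

// Events returns a copy of everything sent so far.
func (c *Client) Events() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.sent...)
}

func main() {
	// Operation ID and subcontext are set first, then events are emitted.
	AIClient.SetOperationID("op-1")
	AIClient.SetSubcontext("container-abc")
	AIClient.SendEvent("ADD started")
	AIClient.SendEvent("ADD completed")
	for _, e := range AIClient.Events() {
		fmt.Println(e)
	}
}
```

Because every event carries the same subcontext, logs from an ADD and a later DEL for the same pod could be grouped by filtering on the container ID.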

Examples of logged information (to be added in a separate PR; this PR focuses on refactoring):

  • CNI add network configuration, arguments
  • CNI add completion with endpoint info struct information (contains hns endpoint id and hns network id), interface results from the ipam invoker, and any error that occurred
  • CNI del network configuration, arguments
  • CNI del completion with error that occurred
  • HNS Endpoint struct before creation / HNS Endpoint Id during deletion
  • HNS Network struct before creation / HNS Network Id during deletion
  • Deletion/release of each IP (even if it does not exist)
  • Mapping sent to CNS in stateless CNI mode during Update Endpoint State
  • Exact CNS response from CNS ipam invoker
  • Exact CNS response from multitenancy ipam invoker
  • Transparent vlan mode: creation/deletion of the vlan veth interface

Potential additions:

  • endpoint and network structs saved to azure-vnet.json statefile

Issue Fixed:

Requirements:

Notes:
Pipeline run to prove logs sent to kusto: https://msazure.visualstudio.com/One/_build/results?buildId=108208651&view=results
Passing run: https://msazure.visualstudio.com/One/_build/results?buildId=108563465&view=results

@QxBytes QxBytes changed the title from "ci: refactor cni telemetry" to "feat: refactor cni telemetry" Nov 14, 2024
@QxBytes QxBytes self-assigned this Nov 14, 2024
@QxBytes QxBytes added the cni (Related to CNI.), ci (Infra or tooling.), telemetry, and logging labels Nov 14, 2024
@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch from 23e8b82 to 0613803 Compare November 14, 2024 20:18
@QxBytes QxBytes marked this pull request as ready for review November 14, 2024 23:56
@QxBytes QxBytes requested review from a team as code owners November 14, 2024 23:56
@QxBytes QxBytes requested a review from jpayne3506 November 14, 2024 23:56
@QxBytes
Contributor Author

QxBytes commented Nov 15, 2024

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch 2 times, most recently from b956ec4 to dd9ca83 Compare November 15, 2024 21:31
@timraymond
Member

LGTM on @ramiro-gamarra's approval

Contributor

@ramiro-gamarra ramiro-gamarra left a comment

I may still be missing some details about the purpose of this refactor, but it seems to me that logs are getting duplicated and the abstractions introduced are not cleaning up the code much yet.

behzad-mir previously approved these changes Dec 3, 2024
Contributor

@behzad-mir behzad-mir left a comment

lgtm

@QxBytes
Contributor Author

QxBytes commented Dec 5, 2024

/azp run Azure Container Networking PR

@github-actions github-actions bot added the stale (Stale due to inactivity.) label Feb 5, 2025
@QxBytes QxBytes removed the stale (Stale due to inactivity.) label Feb 7, 2025

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale (Stale due to inactivity.) label Feb 22, 2025
@QxBytes QxBytes removed the stale (Stale due to inactivity.) label Feb 26, 2025
QxBytes added 18 commits March 5, 2025 15:55
we will split this part of the pr into its own pr
a telemetry event was added back which was previously removed
undo this pr to add those telemetry statements back
remove reflect
remove duplicated telemetry and telemetry buffer
remove unused fields in report manager
force access to telemetry client fields through methods
move telemetry start/connect code closer to start of plugin execution
we use SendError where we would have previously called reportPluginError (no log emitted)
we don't set error message in cni report because the error message and event message fields both end up in the Message field in the cni telemetry service
@QxBytes QxBytes force-pushed the alew/refactor-telemetry branch from 63f38ef to 19b4227 Compare March 5, 2025 23:55
@rbtr rbtr requested review from rbtr and Copilot March 7, 2025 00:40
Contributor

@Copilot Copilot AI left a comment

PR Overview

This PR refactors the telemetry handling within Azure Container Networking to improve log quality and consistency. Key changes include:

  • Introducing a package‑level telemetry client (AIClient) and replacing ad‑hoc TelemetryBuffer instances.
  • Updating several components (plugin, network, and tests) to use the new telemetry client.
  • Minor logging tweaks such as adjusting log levels and refining log messages.

Reviewed Changes

| File | Description |
| --- | --- |
| telemetry/telemetry_client_test.go | Tests that validate the new telemetry client behaviors. |
| telemetry/telemetry_client.go | Introduces package‑level AIClient and thread‑safe telemetry calls. |
| telemetry/telemetrybuffer.go | Refactors telemetry buffer connection handling and log levels. |
| network/endpoint_test.go | Updates unit tests for pointer‑to‑struct formatting functions. |
| network/endpoint.go | Expands PrettyString to include additional endpoint fields. |
| cni/network/plugin/main.go | Updates telemetry client usage in plugin startup and error reporting. |
| cni/network/stateless/main.go | Refactors telemetry handling to use AIClient for stateless mode. |
| telemetry/telemetry.go | Removes unused fields from telemetry reports. |
| network/manager.go | Adjusts logging for endpoint state updates with clearer keys. |
| cni/network/common.go & network.go | Removes obsolete telemetry fields and consolidates telemetry setup. |
| Test files in network and cni/network | Remove redundant telemetry instantiations in favor of AIClient. |

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

telemetry/telemetrybuffer.go:311

  • Changing the log level from Error to Warn for failing to kill the telemetry service process may mask critical failures. Please verify that downgrading the severity is intended.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))

cni/network/network.go:43

  • Ensure that telemetryClient is initialized and used consistently after the refactor, as mixing the old and new telemetry setups could lead to unexpected behaviors.
telemetryClient = telemetry.AIClient

@QxBytes QxBytes requested a review from Copilot March 7, 2025 01:44

PR Overview

This PR refactors the telemetry implementation in the Azure Container Networking codebase to improve logging quality, simplify telemetry usage, and remove unused metrics. Key changes include replacing ad hoc telemetry buffer instances with a package‐level AIClient, updating log severity in telemetry buffer routines, and removing legacy telemetry report fields across multiple modules.

Reviewed Changes

| File | Description |
| --- | --- |
| telemetry/telemetry_client.go | Refactored telemetry client functions to leverage global AIClient instance. |
| telemetry/telemetrybuffer.go | Changed log severity (Error → Warn) during telemetry service process shutdown. |
| network/endpoint.go | Updated the PrettyString method and added documentation for FormatSliceOfPointersToString. |
| cni/network/plugin/main.go | Replaced direct telemetry buffer usage with telemetry.AIClient calls. |
| cni/network/stateless/main.go | Refactored telemetry connectivity to use AIClient instead of local TelemetryBuffer. |
| telemetry/telemetry.go | Removed unused fields from CNIReport and ReportManager. |
| network/manager.go | Modified logging in update endpoint state to use descriptive interface name keys. |
| Various test files (network_test.go, network_windows_test.go, network_linux_test.go) | Removed obsolete telemetry objects from test configurations. |
| cni/network/common.go & cni/network/network.go | Removed legacy telemetry helper functions; centralized telemetry through AIClient. |

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (3)

telemetry/telemetrybuffer.go:311

  • [nitpick] Changing the log level from Error to Warn may reduce noise, but ensure that this lower severity does not hide critical issues during process termination. If intentional, consider adding a comment to clarify the rationale.
tb.logger.Warn("Failed to kill process by", zap.String("TelemetryServiceProcessName", TelemetryServiceProcessName), zap.Error(err))

network/endpoint.go:158

  • It appears that FormatSliceOfPointersToString is defined more than once in this file. Consolidate the duplicate definitions into a single implementation to avoid inconsistency.
func FormatSliceOfPointersToString[T any](slice []*T) string {
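
For context, a generic helper with this signature might look roughly like the sketch below. Only the signature comes from the PR; the body, the `<nil>` placeholder, the comma separator, and the `InterfaceInfo` type are assumptions made for illustration and may differ from what network/endpoint.go actually does.

```go
package main

import (
	"fmt"
	"strings"
)

// InterfaceInfo is a hypothetical struct used only to exercise the helper.
type InterfaceInfo struct {
	Name string
	MTU  int
}

// FormatSliceOfPointersToString dereferences each pointer and formats the
// pointed-to value with field names, so logs show struct contents instead
// of memory addresses. Nil entries are rendered as "<nil>" (an assumption).
func FormatSliceOfPointersToString[T any](slice []*T) string {
	parts := make([]string, 0, len(slice))
	for _, p := range slice {
		if p == nil {
			parts = append(parts, "<nil>")
			continue
		}
		parts = append(parts, fmt.Sprintf("%+v", *p))
	}
	return strings.Join(parts, ", ")
}

func main() {
	infos := []*InterfaceInfo{{Name: "eth0", MTU: 1500}, nil}
	fmt.Println(FormatSliceOfPointersToString(infos))
	// prints: {Name:eth0 MTU:1500}, <nil>
}
```

Keeping a single definition like this, shared by all callers, is one way to address the duplication concern raised in the comment.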

cni/network/network.go:297

  • [nitpick] Consider renaming setCNIReportDetails to reflect its updated responsibility of setting telemetryClient values (e.g. updateTelemetryReportDetails) to improve clarity.
func (plugin *NetPlugin) setCNIReportDetails(containerID, opType, msg string) {