
Conversation

@googs1025
Collaborator

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #1805

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@googs1025 googs1025 marked this pull request as draft November 27, 2025 11:31
@gemini-code-assist
Contributor

Summary of Changes

Hello @googs1025, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the gateway plugin's resilience and scalability by introducing Kubernetes leader election. This mechanism ensures that in a clustered environment, only one instance of the gateway plugin is active at any given time, preventing split-brain scenarios and ensuring consistent request processing. The integration with gRPC health checks allows Kubernetes to intelligently manage pod readiness and traffic routing, directing requests exclusively to the elected leader.

Highlights

  • Leader Election Implementation: Integrated Kubernetes leader election into the gateway plugin to enable High Availability (HA) for multi-instance deployments.
  • Custom Health Server: Introduced a new SimpleHealthServer for gRPC health checks, which dynamically reports liveness and readiness based on the instance's leader election status.
  • Configurable Leader Election: Added new command-line flags (--enable-leader-election, --leader-election-id, --leader-election-namespace) to allow flexible configuration and control over the leader election process.
  • Kubernetes Probe Updates: Modified Kubernetes deployment configurations (YAMLs and Helm charts) to specify distinct gRPC service names (liveness, readiness) for health probes, ensuring only the elected leader is marked as ready by Kubernetes.
  • Profiling Address Configuration: Made the profiling endpoint address configurable via a new command-line flag (--profiling-bind-address).

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@googs1025
Collaborator Author

TODO: will test locally

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces leader election support for high availability in the gateway plugin. The changes include adding new command-line flags for configuration, implementing the leader election logic using Kubernetes leases, and creating a new leader-aware gRPC health check server. My review focuses on the correctness and robustness of the leader election implementation, particularly around namespace discovery and ensuring graceful shutdown procedures are followed when leadership is lost.


// start leader election
go func() {
	leaderElector.Run(context.Background())
}()
Contributor


Severity: medium

Using context.Background() here means the leader election loop will run indefinitely and cannot be gracefully stopped when the application receives a shutdown signal. This could prevent the lease from being released promptly on shutdown (as ReleaseOnCancel: true would not be triggered).

It's better to use a context that can be cancelled. For example, you could create a context that is cancelled when the gracefulStop channel receives a signal. This would allow the leader elector to clean up and release the lease as part of a graceful shutdown.

@googs1025 googs1025 force-pushed the gateway_plugin_leaderelection branch 2 times, most recently from 8589c59 to 9ad7423 on November 27, 2025 15:19
@googs1025 googs1025 requested a review from Copilot November 28, 2025 00:09
Copilot finished reviewing on behalf of googs1025 November 28, 2025 00:11
Contributor

Copilot AI left a comment


Pull request overview

This PR adds leader election support for high availability (HA) in the gateway plugin. When enabled, only the elected leader instance will serve traffic while follower instances remain available but don't handle requests, ensuring a single active gateway at any time.

Key changes:

  • Implements a custom health check server with leader election awareness to differentiate liveness and readiness probes
  • Adds command-line flags for configuring leader election parameters (enabled/disabled, lease ID, namespace)
  • Updates Kubernetes health probe configurations to use service-specific checks (liveness vs readiness)

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 9 comments.

Summary per file:

  • pkg/plugins/gateway/health/health.go: New health server implementation that returns different statuses based on leader election state and probe type
  • cmd/plugins/main.go: Integrates leader election setup with lease-based coordination and replaces the default health server with the custom implementation
  • dist/chart/values.yaml: Adds service names to liveness and readiness probe configurations
  • dist/chart/templates/gateway-plugin/deployment.yaml: Adds commented example args for leader election configuration
  • config/gateway/gateway-plugin/gateway-plugin.yaml: Configures leader election as disabled by default with service-specific health probes


@googs1025 googs1025 force-pushed the gateway_plugin_leaderelection branch 3 times, most recently from 5b16090 to 36ec241 on November 28, 2025 01:55
@googs1025
Collaborator Author

Local Test

~ kubectl get pods -naibrix-system
NAME                                         READY   STATUS    RESTARTS      AGE
aibrix-controller-manager-55749fcbcc-vjlcm   1/1     Running   4 (12m ago)   16m
aibrix-gateway-plugins-774bc6b966-fw4mc      1/1     Running   2 (12m ago)   16m
aibrix-gateway-plugins-774bc6b966-jnjg8      0/1     Running   0             11m
aibrix-gateway-plugins-774bc6b966-kc5kf      0/1     Running   0             11m
aibrix-redis-master-574fc59fb6-v77qz         1/1     Running   1 (12m ago)   16m
# Pod Status Analysis
aibrix-gateway-plugins-774bc6b966-fw4mc   1/1     Running   #  Leader - 1/1 Ready
aibrix-gateway-plugins-774bc6b966-jnjg8   0/1     Running   #  Follower - 0/1 Ready  
aibrix-gateway-plugins-774bc6b966-kc5kf   0/1     Running   #  Follower - 0/1 Ready
  • 1 Leader (1/1 Ready)
  • 2 Followers (0/1 Ready)
  • All pods running without crashes

Log

From the logs:

Follower Pod 1 (jnjg8):

➜  ~ kubectl logs -f aibrix-gateway-plugins-774bc6b966-jnjg8 -naibrix-system | grep "check"
Defaulted container "gateway-plugin" out of: gateway-plugin, init-c (init)
I1128 03:43:02.183692       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:09.679857       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:12.184311       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:19.680017       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:22.183273       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:29.680793       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:32.183131       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:39.680291       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:42.183078       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:49.679807       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:52.183023       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:57.181368       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false

Follower Pod 2 (kc5kf):

➜  ~ kubectl logs -f aibrix-gateway-plugins-774bc6b966-kc5kf -naibrix-system | grep "check"
Defaulted container "gateway-plugin" out of: gateway-plugin, init-c (init)
I1128 03:43:08.603850       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:09.685151       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:18.604618       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:19.685325       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:27.182344       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:29.685131       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:37.183827       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:39.685451       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:47.182836       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:49.685332       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
I1128 03:43:57.183717       1 health.go:50] Health check request for service: readiness, leader election enabled: true, current leader: false
I1128 03:43:59.683607       1 health.go:50] Health check request for service: liveness, leader election enabled: true, current leader: false
  • Readiness: current leader: false → returns NOT_SERVING
  • Liveness: current leader: false → returns SERVING
  • Behavior correct: Non-leader pods don't handle traffic but remain alive

3. Service Endpoints Configured

~ kubectl get svc -naibrix-system
NAME                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
aibrix-controller-manager-metrics-service   ClusterIP   10.106.172.133   <none>        8080/TCP                      17m
aibrix-gateway-plugins                      ClusterIP   10.97.67.205     <none>        50052/TCP,6060/TCP,8080/TCP   17m
aibrix-gpu-optimizer                        ClusterIP   10.111.70.200    <none>        8080/TCP                      17m
aibrix-metadata-service                     ClusterIP   10.100.214.241   <none>        8090/TCP                      17m
aibrix-redis-master                         ClusterIP   10.110.224.198   <none>        6379/TCP                      17m
aibrix-webhook-service                      ClusterIP   10.109.35.232    <none>        443/TCP                       17m
debug-gateway-plugin                        ClusterIP   10.99.118.85     <none>        50052/TCP                     76d
➜  ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins  -oyaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12419605"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:  # Only Leader Pod in endpoints
  - ip: 10.244.86.209
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-fw4mc
      namespace: aibrix-system
      uid: 330d4942-24e6-48fe-8bfa-4d4ea3b4f8a4
  notReadyAddresses:  # Follower Pods excluded from traffic
  - ip: 10.244.86.219
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-kc5kf
      namespace: aibrix-system
      uid: 5b4fbde1-f6bf-4b7c-bd78-cd4e97ebd129
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP
  • The Service routes traffic only to the Leader
  • Followers receive no traffic
  • Exactly one active instance serves requests, as intended

4. High Availability Verification

Architecture:

Client → Service:50052 → [Leader Pod] 
                     ↘ [Follower Pod]  (readiness=NOT_SERVING)
                     ↘ [Follower Pod]  (readiness=NOT_SERVING)

Failover mechanism:

  1. Leader fails → a new leader is elected automatically
  2. Service endpoints update automatically → traffic points to the new leader
  3. Traffic switches seamlessly → no service interruption

@googs1025
Collaborator Author

Leader Election Failover Test

Pre-Failure State:

~ kubectl get pods -n aibrix-system -l app.kubernetes.io/component=aibrix-gateway-plugin -owide

NAME                                      READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
aibrix-gateway-plugins-774bc6b966-8cjhl   0/1     Running   0          72m    10.244.86.222   minikube   <none>           <none>
aibrix-gateway-plugins-774bc6b966-jnjg8   0/1     Running   0          100m   10.244.86.221   minikube   <none>           <none>
aibrix-gateway-plugins-774bc6b966-kc5kf   1/1     Running   0          100m   10.244.86.219   minikube   <none>           <none>
➜  ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins -o yaml

apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2025-11-28T04:00:33Z"
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12423743"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:
  - ip: 10.244.86.219
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-kc5kf
      namespace: aibrix-system
      uid: 5b4fbde1-f6bf-4b7c-bd78-cd4e97ebd129
  notReadyAddresses:
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  - ip: 10.244.86.222
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-8cjhl
      namespace: aibrix-system
      uid: 0a667312-abe9-4084-871c-d32f2c8d4710
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP

Failure:

~ kubectl delete pods -naibrix-system aibrix-gateway-plugins-774bc6b966-kc5kf
pod "aibrix-gateway-plugins-774bc6b966-kc5kf" deleted

Failover Monitoring:

~ kubectl get pods -n aibrix-system -l app.kubernetes.io/component=aibrix-gateway-plugin -w

NAME                                      READY   STATUS    RESTARTS   AGE
aibrix-gateway-plugins-774bc6b966-8cjhl   0/1     Running   0          72m
aibrix-gateway-plugins-774bc6b966-ffd5p   0/1     Running   0          6s
aibrix-gateway-plugins-774bc6b966-jnjg8   0/1     Running   0          100m

aibrix-gateway-plugins-774bc6b966-jnjg8   1/1     Running   0          101m
^C

Post-Failure State:

➜  ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins -o yaml

apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2025-11-28T05:13:12Z"
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12430565"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  notReadyAddresses:
  - ip: 10.244.86.222
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-8cjhl
      namespace: aibrix-system
      uid: 0a667312-abe9-4084-871c-d32f2c8d4710
  - ip: 10.244.86.223
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-ffd5p
      namespace: aibrix-system
      uid: 227f893a-d7b1-4194-b8f3-48d13a7d7976
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP

2. Key Verification Points

1. Automatic Election Successful:

  • Before Failure: aibrix-gateway-plugins-774bc6b966-kc5kf (1/1 Ready) - Original Leader
  • After Failure: aibrix-gateway-plugins-774bc6b966-jnjg8 (1/1 Ready) - New Leader
  • Pod Changes: aibrix-gateway-plugins-774bc6b966-ffd5p (New Pod)

2. Service Auto-Update:

  • Before: 10.244.86.219 (Original Leader kc5kf) in addresses
  • After: 10.244.86.221 (New Leader jnjg8) in addresses
  • Followers: 10.244.86.222 and 10.244.86.223 in notReadyAddresses

3. No Service Interruption:

  • Service remains available during pod status changes
  • Endpoints automatically switch to new Leader

4. HA Mechanism Verification

Leader Election Mechanism:

  1. Original Leader Failure → Lease lock released
  2. Follower Competition → jnjg8 acquires the new Lease
  3. Health Check Update → jnjg8 readiness becomes SERVING
  4. Service Update → Automatically routes to new Leader

Health Check Mechanism:

  • New Leader (jnjg8) now returns SERVING
  • Other Followers continue returning NOT_SERVING

5. Final State Verification

# Final stable state
aibrix-gateway-plugins-774bc6b966-jnjg8   1/1     Running   #  New Leader
aibrix-gateway-plugins-774bc6b966-8cjhl   0/1     Running   #  Follower
aibrix-gateway-plugins-774bc6b966-ffd5p   0/1     Running   #  New Pod (Follower)

# Service Endpoints
addresses: [10.244.86.221]  # New Leader 
notReadyAddresses: [10.244.86.222, 10.244.86.223]  # Followers 

@googs1025 googs1025 force-pushed the gateway_plugin_leaderelection branch 3 times, most recently from a306a06 to b4d1caf on November 28, 2025 06:06
@googs1025
Collaborator Author

Note: the CI failure was caused by using the nightly image, which doesn't contain this PR's changes yet.

Root Cause:

  • The nightly image was built from the main branch (without PR changes)
  • The PR adds missing constants LivenessCheckService = "liveness" and ReadinessCheckService = "readiness"
  • Without these constants, gRPC health checks fail with "unknown service" error
  • This causes Pod readiness/liveness probes to fail, leading to container restart loops

Solution:

  • The PR contains the correct health check implementation with proper service name handling
  • Once the PR is merged and a new nightly image is built, the gRPC health checks will work properly
  • The health check logic correctly handles leader election: only leader returns SERVING for readiness, all instances return SERVING for liveness

@googs1025 googs1025 assigned googs1025 and unassigned googs1025 Nov 28, 2025
@googs1025 googs1025 marked this pull request as ready for review November 28, 2025 06:45
@googs1025 googs1025 force-pushed the gateway_plugin_leaderelection branch from b4d1caf to 1e26d8d on November 28, 2025 06:45
Development

Successfully merging this pull request may close these issues.

RFC: Add Leader Election Support for Active-Passive Architecture in Gateway plugin
