feat: add leader election support for HA in gateway plugin #1809
base: main
Conversation
**Summary of Changes**

Hello @googs1025, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the gateway plugin's resilience and scalability by introducing Kubernetes leader election. This mechanism ensures that in a clustered environment, only one instance of the gateway plugin is active at any given time, preventing split-brain scenarios and ensuring consistent request processing. The integration with gRPC health checks allows Kubernetes to intelligently manage pod readiness and traffic routing, directing requests exclusively to the elected leader.
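For readers unfamiliar with the mechanism, here is a minimal sketch of how lease-based leader election is typically wired with client-go's `leaderelection` package. The lease name, namespace, timings, and log messages below are illustrative assumptions, not the PR's exact code.

```go
// Sketch: lease-based leader election with client-go (illustrative values).
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Every replica competes for the same Lease under its own identity
	// (commonly the pod name, available as the hostname).
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "aibrix-gateway-plugin-lock", Namespace: "aibrix-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	elector, err := leaderelection.NewLeaderElector(leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			// These callbacks are the natural place to flip whatever
			// readiness state the gRPC health server exposes.
			OnStartedLeading: func(ctx context.Context) { klog.Info("became leader") },
			OnStoppedLeading: func() { klog.Info("lost leadership") },
		},
	})
	if err != nil {
		klog.Fatal(err)
	}

	// A real main would pass a cancellable, signal-aware context here so the
	// lease can be released on shutdown (see the review discussion below).
	elector.Run(context.Background())
}
```

Only the replica whose callbacks report leadership would mark itself ready, which is what lets the Service route traffic to a single active gateway.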
TODO: will test locally.
Code Review
This pull request introduces leader election support for high availability in the gateway plugin. The changes include adding new command-line flags for configuration, implementing the leader election logic using Kubernetes leases, and creating a new leader-aware gRPC health check server. My review focuses on the correctness and robustness of the leader election implementation, particularly around namespace discovery and ensuring graceful shutdown procedures are followed when leadership is lost.
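As background for the namespace-discovery concern, the sketch below shows a common in-cluster fallback chain (explicit flag, then the downward-API environment variable, then the mounted service-account namespace file). The function name and fallback order are hypothetical and only illustrate the pattern the review refers to.

```go
// Sketch of a typical namespace-discovery fallback chain; names are hypothetical.
package config

import (
	"os"
	"strings"
)

// Path mounted into every pod by the service account admission controller.
const inClusterNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveNamespace prefers an explicit flag value, then the POD_NAMESPACE
// downward-API environment variable, then the service-account namespace file,
// and finally falls back to "default".
func resolveNamespace(flagValue string) string {
	if flagValue != "" {
		return flagValue
	}
	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
		return ns
	}
	if data, err := os.ReadFile(inClusterNamespaceFile); err == nil {
		if ns := strings.TrimSpace(string(data)); ns != "" {
			return ns
		}
	}
	return "default"
}
```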
cmd/plugins/main.go
Outdated
```go
// start leader election
go func() {
	leaderElector.Run(context.Background())
```
Using context.Background() here means the leader election loop will run indefinitely and cannot be gracefully stopped when the application receives a shutdown signal. This could prevent the lease from being released promptly on shutdown (as ReleaseOnCancel: true would not be triggered).
It's better to use a context that can be cancelled. For example, you could create a context that is cancelled when the gracefulStop channel receives a signal. This would allow the leader elector to clean up and release the lease as part of a graceful shutdown.
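One way to implement the suggestion is a signal-aware context via `signal.NotifyContext`; a minimal sketch follows (the function name is illustrative, and the PR's main.go could equally cancel the context from its existing gracefulStop channel).

```go
// Sketch only: wiring the election loop to process signals so cancellation
// (and hence ReleaseOnCancel) happens during graceful shutdown.
package election

import (
	"context"
	"os/signal"
	"syscall"

	"k8s.io/client-go/tools/leaderelection"
)

// runElection blocks until the context is cancelled (SIGTERM/SIGINT) or
// leadership is lost; with ReleaseOnCancel: true the lease is released on exit.
func runElection(elector *leaderelection.LeaderElector) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	elector.Run(ctx)
}
```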
8589c59 to 9ad7423 (force-push)
Pull request overview
This PR adds leader election support for high availability (HA) in the gateway plugin. When enabled, only the elected leader instance will serve traffic while follower instances remain available but don't handle requests, ensuring a single active gateway at any time.
Key changes:
- Implements a custom health check server with leader election awareness to differentiate liveness and readiness probes
- Adds command-line flags for configuring leader election parameters (enabled/disabled, lease ID, namespace)
- Updates Kubernetes health probe configurations to use service-specific checks (liveness vs readiness)
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pkg/plugins/gateway/health/health.go | New health server implementation that returns different statuses based on leader election state and probe type |
| cmd/plugins/main.go | Integrates leader election setup with lease-based coordination and replaces default health server with custom implementation |
| dist/chart/values.yaml | Adds service names to liveness and readiness probe configurations |
| dist/chart/templates/gateway-plugin/deployment.yaml | Adds commented example args for leader election configuration |
| config/gateway/gateway-plugin/gateway-plugin.yaml | Configures leader election as disabled by default with service-specific health probes |
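For context, here is a hedged sketch of what a leader-aware gRPC health server along these lines can look like. The service names "liveness" and "readiness" and the `SetLeader` hook are assumptions for illustration, not necessarily the identifiers used in pkg/plugins/gateway/health/health.go.

```go
// Sketch of a leader-aware gRPC health server (illustrative identifiers).
package health

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/codes"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

// Server reports SERVING for liveness as long as the process is up, but only
// reports SERVING for readiness while this replica holds the leader lease.
type Server struct {
	healthpb.UnimplementedHealthServer
	leader atomic.Bool
}

// SetLeader is intended to be called from the leader-election callbacks.
func (s *Server) SetLeader(isLeader bool) { s.leader.Store(isLeader) }

func (s *Server) Check(ctx context.Context, req *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	switch req.GetService() {
	case "", "liveness":
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
	case "readiness":
		if s.leader.Load() {
			return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
		}
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_NOT_SERVING}, nil
	default:
		return nil, status.Error(codes.NotFound, "unknown health check service")
	}
}
```

With a server like this, the liveness probe targets the "liveness" service and the readiness probe the "readiness" service, so follower pods stay alive but are withheld from the Service's endpoints until they win the election.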
5b16090 to 36ec241 (force-push)
**Local Test**

```
➜ ~ kubectl get pods -naibrix-system
NAME                                         READY   STATUS    RESTARTS      AGE
aibrix-controller-manager-55749fcbcc-vjlcm   1/1     Running   4 (12m ago)   16m
aibrix-gateway-plugins-774bc6b966-fw4mc      1/1     Running   2 (12m ago)   16m
aibrix-gateway-plugins-774bc6b966-jnjg8      0/1     Running   0             11m
aibrix-gateway-plugins-774bc6b966-kc5kf      0/1     Running   0             11m
aibrix-redis-master-574fc59fb6-v77qz         1/1     Running   1 (12m ago)   16m
```

**Pod Status Analysis**

```
aibrix-gateway-plugins-774bc6b966-fw4mc   1/1   Running   # Leader - 1/1 Ready
aibrix-gateway-plugins-774bc6b966-jnjg8   0/1   Running   # Follower - 0/1 Ready
aibrix-gateway-plugins-774bc6b966-kc5kf   0/1   Running   # Follower - 0/1 Ready
```
**Log**

From the logs: Follower Pod 1 (…), Follower Pod 2 (…)
**Service Endpoints Configured**

```
➜ ~ kubectl get svc -naibrix-system
NAME                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
aibrix-controller-manager-metrics-service   ClusterIP   10.106.172.133   <none>        8080/TCP                      17m
aibrix-gateway-plugins                      ClusterIP   10.97.67.205     <none>        50052/TCP,6060/TCP,8080/TCP   17m
aibrix-gpu-optimizer                        ClusterIP   10.111.70.200    <none>        8080/TCP                      17m
aibrix-metadata-service                     ClusterIP   10.100.214.241   <none>        8090/TCP                      17m
aibrix-redis-master                         ClusterIP   10.110.224.198   <none>        6379/TCP                      17m
aibrix-webhook-service                      ClusterIP   10.109.35.232    <none>        443/TCP                       17m
debug-gateway-plugin                        ClusterIP   10.99.118.85     <none>        50052/TCP                     76d
```

```
➜ ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins -oyaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12419605"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:              # Only Leader Pod in endpoints
  - ip: 10.244.86.209
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-fw4mc
      namespace: aibrix-system
      uid: 330d4942-24e6-48fe-8bfa-4d4ea3b4f8a4
  notReadyAddresses:      # Follower Pods excluded from traffic
  - ip: 10.244.86.219
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-kc5kf
      namespace: aibrix-system
      uid: 5b4fbde1-f6bf-4b7c-bd78-cd4e97ebd129
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP
```
**4. High Availability Verification**

Architecture:

Failover mechanism:
**Leader Election Failover Test**

Pre-Failure State:

```
➜ ~ kubectl get pods -n aibrix-system -l app.kubernetes.io/component=aibrix-gateway-plugin -owide
NAME                                      READY   STATUS    RESTARTS   AGE    IP              NODE       NOMINATED NODE   READINESS GATES
aibrix-gateway-plugins-774bc6b966-8cjhl   0/1     Running   0          72m    10.244.86.222   minikube   <none>           <none>
aibrix-gateway-plugins-774bc6b966-jnjg8   0/1     Running   0          100m   10.244.86.221   minikube   <none>           <none>
aibrix-gateway-plugins-774bc6b966-kc5kf   1/1     Running   0          100m   10.244.86.219   minikube   <none>           <none>
```

```
➜ ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2025-11-28T04:00:33Z"
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12423743"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:
  - ip: 10.244.86.219
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-kc5kf
      namespace: aibrix-system
      uid: 5b4fbde1-f6bf-4b7c-bd78-cd4e97ebd129
  notReadyAddresses:
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  - ip: 10.244.86.222
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-8cjhl
      namespace: aibrix-system
      uid: 0a667312-abe9-4084-871c-d32f2c8d4710
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP
```

Failure:

```
➜ ~ kubectl delete pods -naibrix-system aibrix-gateway-plugins-774bc6b966-kc5kf
pod "aibrix-gateway-plugins-774bc6b966-kc5kf" deletedFailover Monitoring:➜ ~ kubectl get pods -n aibrix-system -l app.kubernetes.io/component=aibrix-gateway-plugin -w
NAME READY STATUS RESTARTS AGE
aibrix-gateway-plugins-774bc6b966-8cjhl 0/1 Running 0 72m
aibrix-gateway-plugins-774bc6b966-ffd5p 0/1 Running 0 6s
aibrix-gateway-plugins-774bc6b966-jnjg8 0/1 Running 0 100m
aibrix-gateway-plugins-774bc6b966-jnjg8 1/1 Running 0 101m
^C%Post-Failure State:➜ ~ kubectl get endpoints -n aibrix-system aibrix-gateway-plugins -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2025-11-28T05:13:12Z"
  creationTimestamp: "2025-11-28T03:26:57Z"
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/instance: aibrix
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: 0.5.0
    helm.sh/chart: 0.5.0
  name: aibrix-gateway-plugins
  namespace: aibrix-system
  resourceVersion: "12430565"
  uid: c87d0376-c554-48c4-ad25-a87565758d8b
subsets:
- addresses:
  - ip: 10.244.86.221
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-jnjg8
      namespace: aibrix-system
      uid: 3bfbfb69-f3df-47cd-9acf-edd91bfe49fa
  notReadyAddresses:
  - ip: 10.244.86.222
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-8cjhl
      namespace: aibrix-system
      uid: 0a667312-abe9-4084-871c-d32f2c8d4710
  - ip: 10.244.86.223
    nodeName: minikube
    targetRef:
      kind: Pod
      name: aibrix-gateway-plugins-774bc6b966-ffd5p
      namespace: aibrix-system
      uid: 227f893a-d7b1-4194-b8f3-48d13a7d7976
  ports:
  - name: gateway
    port: 50052
    protocol: TCP
  - name: metrics
    port: 8080
    protocol: TCP
  - name: profiling
    port: 6060
    protocol: TCP
```

**2. Key Verification Points**

1. Automatic Election Successful:
2. Service Auto-Update:
3. No Service Interruption:
**4. HA Mechanism Verification**

Leader Election Mechanism:
Health Check Mechanism:
**5. Final State Verification**

```
# Final stable state
aibrix-gateway-plugins-774bc6b966-jnjg8   1/1   Running   # New Leader
aibrix-gateway-plugins-774bc6b966-8cjhl   0/1   Running   # Follower
aibrix-gateway-plugins-774bc6b966-ffd5p   0/1   Running   # New Pod (Follower)

# Service Endpoints
addresses: [10.244.86.221]                          # New Leader
notReadyAddresses: [10.244.86.222, 10.244.86.223]   # Followers
```
a306a06 to b4d1caf (force-push)
Note: the CI failure was caused by using the nightly image, which doesn't contain the PR changes yet.

Root Cause:

Solution:
Signed-off-by: CYJiang <[email protected]>
b4d1caf to 1e26d8d (force-push)
Pull Request Description
[Please provide a clear and concise description of your changes here]
Related Issues
Resolves: #1805
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
- [Bug]: Corrections to existing functionality
- [CI]: Changes to build process or CI pipeline
- [Docs]: Updates or additions to documentation
- [API]: Modifications to aibrix's API or interface
- [CLI]: Changes or additions to the Command Line Interface
- [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.