feat (infra): Added probes for UI and Proxy containers.#112
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds HTTP readiness and liveness probes to manager, UI, and proxy containers in Kubernetes manifests, adds a /health nginx route, and adds a Next.js /api/health GET route that returns JSON { status: "ok" }. (≤50 words) Changes
Sequence Diagram(s)sequenceDiagram
participant Kubelet as Kubelet (probe)
participant Nginx as Nginx (proxy)
participant App as Next.js app (/api/health)
Kubelet->>Nginx: HTTP GET /health
alt Nginx responds directly
Nginx-->>Kubelet: 200 "ok"
else Nginx proxies to app
Nginx->>App: HTTP GET /api/health
App-->>Nginx: 200 {"status":"ok"}
Nginx-->>Kubelet: 200 "ok"
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ui/src/app/api/health/route.ts`:
- Around line 1-4: Add an explicit dynamic export and remove the unnecessary
async: in the route handler file update the GET handler by adding export const
dynamic = "force-dynamic" at top-level to prevent static build-time caching, and
change the function signature from export async function GET() to export
function GET() (remove async since there is no await) so the handler is forced
to run dynamically and the signature matches usage.
---
Duplicate comments:
In `@deployment/local.install.yaml`:
- Around line 692-739: Update the review summary to correctly reference the
modified Deployment and avoid duplicate notes: the readiness/liveness probes
shown belong to the skyflo-ai-ui Deployment (the UI container using
/api/health:3000 and the proxy container using /health:80 with imagePullPolicy:
Never), not the skyflo-ai-controller; remove the duplicate comment flag and
ensure the summary explicitly mentions "skyflo-ai-ui" and the proxy container
when describing the probe additions so the review matches the actual changes in
the manifest.
5467622 to
d05109a
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deployment/install.yaml (1)
646-693: 🧹 Nitpick | 🔵 TrivialPrefer named ports over numeric literals in probe specs for consistency.
Same pattern as
local.install.yaml: useport: http(UI, port 3000) andport: http-proxy(proxy, port 80) to match the convention used by all other HTTP probes in this file (engine → http,mcp → http,manager → health).♻️ Suggested refactor
readinessProbe: httpGet: path: /api/health - port: 3000 + port: http ... livenessProbe: httpGet: path: /api/health - port: 3000 + port: http ... readinessProbe: httpGet: path: /health - port: 80 + port: http-proxy ... livenessProbe: httpGet: path: /health - port: 80 + port: http-proxy🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deployment/install.yaml` around lines 646 - 693, Replace numeric port literals in the readinessProbe and livenessProbe specs with the corresponding named ports for consistency: for the UI container probes use port: http (the container exposing port 3000), and for the proxy container (name: proxy) use port: http-proxy (the container exposing port 80); update the probe blocks that currently use port: 3000 and port: 80 to reference these names so they match the convention used by other containers (engine → http, mcp → http, manager → health).deployment/local.install.yaml (1)
692-739: 🧹 Nitpick | 🔵 TrivialPrefer named ports over numeric literals in probe specs for consistency.
The UI probe uses
port: 3000and the proxy probe usesport: 80. Every other HTTP probe in this file references the named port (engine→port: http,mcp→port: http,manager→port: health). Using named ports (httpandhttp-proxy) is more resilient to port-number changes.♻️ Suggested refactor
readinessProbe: httpGet: path: /api/health - port: 3000 + port: http initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 5 failureThreshold: 3 livenessProbe: httpGet: path: /api/health - port: 3000 + port: http initialDelaySeconds: 15 ... readinessProbe: httpGet: path: /health - port: 80 + port: http-proxy initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 5 failureThreshold: 3 livenessProbe: httpGet: path: /health - port: 80 + port: http-proxy initialDelaySeconds: 15🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deployment/local.install.yaml` around lines 692 - 739, Replace numeric port literals in the readinessProbe and livenessProbe specs with the named ports used elsewhere: change the probes that currently use port: 3000 to use the named port "http" and change the proxy probes that use port: 80 to use the named port "http-proxy"; update both readinessProbe and livenessProbe entries for the UI probe (the block with path /api/health) and the proxy container (image skyfloaiagent/proxy:${VERSION}) so they reference port: http and port: http-proxy respectively.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deployment/install.yaml`:
- Around line 646-693: Replace numeric port literals in the readinessProbe and
livenessProbe specs with the corresponding named ports for consistency: for the
UI container probes use port: http (the container exposing port 3000), and for
the proxy container (name: proxy) use port: http-proxy (the container exposing
port 80); update the probe blocks that currently use port: 3000 and port: 80 to
reference these names so they match the convention used by other containers
(engine → http, mcp → http, manager → health).
In `@deployment/local.install.yaml`:
- Around line 692-739: Replace numeric port literals in the readinessProbe and
livenessProbe specs with the named ports used elsewhere: change the probes that
currently use port: 3000 to use the named port "http" and change the proxy
probes that use port: 80 to use the named port "http-proxy"; update both
readinessProbe and livenessProbe entries for the UI probe (the block with path
/api/health) and the proxy container (image skyfloaiagent/proxy:${VERSION}) so
they reference port: http and port: http-proxy respectively.
d05109a to
4d909a1
Compare
02743b5 to
d818f4d
Compare
KaranJagtiani
left a comment
There was a problem hiding this comment.
Fix the commit message to follow the repo's conventional commits format (imperative mood, lowercase): feat (infra): add probes for UI and proxy containers
d818f4d to
eac8b2a
Compare
eac8b2a to
63459c0
Compare
Done. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deployment/install.yaml`:
- Around line 646-661: Add a startupProbe entry alongside the existing
readinessProbe and livenessProbe to prevent liveness checks from killing the pod
during slow initialization: under the same container spec where readinessProbe
and livenessProbe are defined, add startupProbe with an httpGet to path
/api/health and port http, and configure conservative timings (higher
initialDelaySeconds than readiness/liveness, e.g., significantly larger
initialDelaySeconds, with periodSeconds and timeoutSeconds similar to the
existing probes and an appropriate failureThreshold) so the app/proxy can fully
start before readiness/liveness take effect.
In `@deployment/local.install.yaml`:
- Around line 692-707: Add a startupProbe to the deployments that currently
define readinessProbe and livenessProbe (e.g., the UI and proxy pods which
contain the readinessProbe:/api/health and livenessProbe:/api/health blocks) so
the pod has a gate for cold starts before liveness checks begin; implement a
startupProbe.httpGet to /api/health on the same port with a longer
initialDelaySeconds and periodSeconds (for example significantly larger than
readiness initialDelaySeconds) and an appropriate failureThreshold to allow slow
image startup, ensuring it sits alongside the existing readinessProbe and
livenessProbe entries for the same container definitions.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
deployment/install.yamldeployment/local.install.yamldeployment/ui/nginx.confui/src/app/api/health/route.ts
| readinessProbe: | ||
| httpGet: | ||
| path: /api/health | ||
| port: http | ||
| initialDelaySeconds: 10 | ||
| periodSeconds: 10 | ||
| timeoutSeconds: 5 | ||
| failureThreshold: 3 | ||
| livenessProbe: | ||
| httpGet: | ||
| path: /api/health | ||
| port: http | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 10 | ||
| timeoutSeconds: 5 | ||
| failureThreshold: 3 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add startupProbe here as well to mirror robust startup behavior in production installs.
This keeps liveness from acting before app/proxy initialization completes under slower startup conditions.
Suggested patch
- name: ui
image: skyfloaiagent/ui:${VERSION}
@@
securityContext:
allowPrivilegeEscalation: false
runAsUser: 1002
runAsGroup: 1002
capabilities:
drop:
- ALL
+ startupProbe:
+ httpGet:
+ path: /api/health
+ port: http
+ periodSeconds: 5
+ timeoutSeconds: 5
+ failureThreshold: 12
readinessProbe:
httpGet:
path: /api/health
port: http
@@
- name: proxy
image: skyfloaiagent/proxy:${VERSION}
@@
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
+ startupProbe:
+ httpGet:
+ path: /health
+ port: http-proxy
+ periodSeconds: 5
+ timeoutSeconds: 5
+ failureThreshold: 12
readinessProbe:
httpGet:
path: /health
port: http-proxyAlso applies to: 678-693
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deployment/install.yaml` around lines 646 - 661, Add a startupProbe entry
alongside the existing readinessProbe and livenessProbe to prevent liveness
checks from killing the pod during slow initialization: under the same container
spec where readinessProbe and livenessProbe are defined, add startupProbe with
an httpGet to path /api/health and port http, and configure conservative timings
(higher initialDelaySeconds than readiness/liveness, e.g., significantly larger
initialDelaySeconds, with periodSeconds and timeoutSeconds similar to the
existing probes and an appropriate failureThreshold) so the app/proxy can fully
start before readiness/liveness take effect.
| readinessProbe: | ||
| httpGet: | ||
| path: /api/health | ||
| port: http | ||
| initialDelaySeconds: 10 | ||
| periodSeconds: 10 | ||
| timeoutSeconds: 5 | ||
| failureThreshold: 3 | ||
| livenessProbe: | ||
| httpGet: | ||
| path: /api/health | ||
| port: http | ||
| initialDelaySeconds: 15 | ||
| periodSeconds: 10 | ||
| timeoutSeconds: 5 | ||
| failureThreshold: 3 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider adding startupProbe for UI and proxy to improve cold-start resilience.
Current probes are valid, but startupProbe prevents premature liveness failures on slower nodes/images and cleanly gates when liveness/readiness begin.
Suggested patch
- name: ui
image: skyfloaiagent/ui:${VERSION}
@@
securityContext:
allowPrivilegeEscalation: false
runAsUser: 1002
runAsGroup: 1002
capabilities:
drop:
- ALL
+ startupProbe:
+ httpGet:
+ path: /api/health
+ port: http
+ periodSeconds: 5
+ timeoutSeconds: 5
+ failureThreshold: 12
readinessProbe:
httpGet:
path: /api/health
port: http
@@
- name: proxy
image: skyfloaiagent/proxy:${VERSION}
@@
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 128Mi
+ startupProbe:
+ httpGet:
+ path: /health
+ port: http-proxy
+ periodSeconds: 5
+ timeoutSeconds: 5
+ failureThreshold: 12
readinessProbe:
httpGet:
path: /health
port: http-proxyAlso applies to: 724-739
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deployment/local.install.yaml` around lines 692 - 707, Add a startupProbe to
the deployments that currently define readinessProbe and livenessProbe (e.g.,
the UI and proxy pods which contain the readinessProbe:/api/health and
livenessProbe:/api/health blocks) so the pod has a gate for cold starts before
liveness checks begin; implement a startupProbe.httpGet to /api/health on the
same port with a longer initialDelaySeconds and periodSeconds (for example
significantly larger than readiness initialDelaySeconds) and an appropriate
failureThreshold to allow slow image startup, ensuring it sits alongside the
existing readinessProbe and livenessProbe entries for the same container
definitions.
Description
Please include a summary of the changes and the motivation behind them.
Added readiness/liveness probes for UI and proxy, add /health for proxy
Output after the change:
kubectl exec skyflo-ai-ui-7cffc646cb-4mpq9 -c proxy -- kill 1Related Issue(s)
Fixes #93
Type of Change
Testing
Please describe the tests you've added/performed to verify your changes.
Checklist
Before Requesting Review
Code Quality
printstatements orconsole.logcallspackage-lock.json(we useyarnonly for the UI)Screenshots (if applicable)
Additional Notes
Why we changed deployment/ui/nginx.conf
The proxy’s readiness and liveness probes were using path / on port 80. Nginx serves / by proxying to the UI at http://skyflo-ai-ui:3000. The pod is only added to the Service’s endpoints when it is Ready, and the proxy only becomes Ready when its probe gets a 2xx. So the probe hit the proxy → proxy called the Service → the Service had no Ready pods (including this one) → no backend → 502. That kept the proxy (and the pod) from ever becoming Ready.
We added a location = /health that nginx handles itself (no proxy_pass), returning 200 "OK". Probes now use path /health for the proxy, so they no longer depend on the UI or the Service. That removes the circular dependency and allows the proxy to become Ready and stay healthy.
Got this error when running nginx container without any change in nginx.conf.
Why we added UI /api/health
The UI probes originally used path / on port 3000. That triggers full page rendering and is heavier and less reliable for frequent probes. We added a minimal Next.js API route at /api/health that returns { "ok": true } with no rendering or external calls, so readiness/liveness checks are cheap and stable. Probes were updated to use /api/health instead of /.
Why we need to rebuild and push new images
Proxy: The proxy container is built from deployment/ui/proxy.Dockerfile, which copies deployment/ui/nginx.conf into the image. The new /health behaviour only exists in the cluster after that updated nginx config is inside the image. You must rebuild the proxy image (so it includes the new nginx.conf), push it to your registry with the tag your manifests use (e.g. skyfloaiagent/proxy:v0.5.0), and redeploy (e.g. kubectl rollout restart deployment/skyflo-ai-ui) so pods use the new image. Until then, the running proxy will keep proxying /health to the UI and probes can keep getting 502.
UI: The UI image must include the new /api/health route (e.g. ui/src/app/api/health/route.ts). Rebuild the UI image, push it with the tag used in the manifests (e.g. skyfloaiagent/ui:v0.5.0), and redeploy so probes to /api/health succeed.