fix: wait for DataScienceCluster CRD before applying DSC #1187
Gkrumbach07 merged 2 commits into main
The RHOAI operator CSV succeeding doesn't guarantee CRDs are registered. Add an explicit wait for the DataScienceCluster CRD between the operator readiness check and the DSC apply step in both deploy workflows.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
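The explicit wait described in this commit is a bounded polling loop. A minimal sketch of that pattern as a reusable helper follows; the function name `retry_until` and its argument order are assumptions for illustration, not code from this PR, and the 60 attempts at 10 s intervals mirror the values used in the workflow step.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "Succeeded on attempt ${i}/${attempts}"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      echo "Attempt ${i}/${attempts} failed, retrying in ${delay}s..."
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  # ::error:: is the GitHub Actions workflow command for an error annotation.
  echo "::error::Gave up after ${attempts} attempts: $*"
  return 1
}

# Usage in a workflow step (values as in this PR):
# retry_until 60 10 oc get crd datascienceclusters.datasciencecluster.opendatahub.io
```

Keeping the timeout failure explicit (a non-zero return plus an error annotation) means the job stops at the wait step rather than at a later, less obvious apply failure.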
📝 Walkthrough
Workflows now wait for the …
Sequence Diagram(s)
sequenceDiagram
participant GH as GitHub Actions
participant OC as OpenShift API / oc
participant Pod as ambient-code deploy/postgresql Pod
participant DB as PostgreSQL (mlflow)
participant ML as MLflow Operator
GH->>OC: loop: oc get crd datascienceclusters.datasciencecluster.opendatahub.io
OC-->>GH: CRD present? (yes/no)
alt CRD absent (after retries)
OC-->>GH: no -> GH: fail job
else CRD present
GH->>OC: oc apply dsci.yaml / datasciencecluster.yaml
GH->>OC: wait for MLflow CRD (existing wait)
GH->>Pod: kubectl exec psql -c "SELECT 1 FROM pg_database WHERE datname='mlflow'"
Pod-->>DB: query pg_database
DB-->>Pod: result (exists / not exists)
alt does not exist
Pod->>DB: CREATE DATABASE mlflow
end
GH->>ML: proceed to verify secrets and deploy MLflow
end
🚥 Pre-merge checks: ✅ 6 passed
- Set replicas to 1 (RWO PVC prevents multi-attach with >1 replica)
- Add "Ensure mlflow database exists" step to both deploy workflows so existing PostgreSQL instances get the database created without requiring a pod restart to re-run init scripts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 3
🧹 Nitpick comments (1)
.github/workflows/components-build-deploy.yml (1)
238-253: Wait for both CRDs before applying both custom resources.
Line 256 applies dsci.yaml and Line 257 applies datasciencecluster.yaml, but this loop only waits for datascienceclusters.datasciencecluster.opendatahub.io. If dscinitializations.dscinitialization.opendatahub.io is delayed, this still flakes.
Suggested patch
-  - name: Wait for DataScienceCluster CRD to be available
+  - name: Wait for required OpenDataHub CRDs to be available
     run: |
-      echo "Waiting for DataScienceCluster CRD to be registered..."
-      for i in $(seq 1 60); do
-        if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
-          echo "DataScienceCluster CRD is available"
-          break
-        fi
-        if [ "$i" -eq 60 ]; then
-          echo "::error::DataScienceCluster CRD did not become available within timeout"
-          exit 1
-        fi
-        echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
-        sleep 10
-      done
+      for crd in \
+        dscinitializations.dscinitialization.opendatahub.io \
+        datascienceclusters.datasciencecluster.opendatahub.io; do
+        echo "Waiting for ${crd} CRD to be registered..."
+        for i in $(seq 1 60); do
+          if oc get crd "$crd" &>/dev/null; then
+            echo "${crd} CRD is available"
+            break
+          fi
+          if [ "$i" -eq 60 ]; then
+            echo "::error::${crd} CRD did not become available within timeout"
+            exit 1
+          fi
+          echo "Attempt $i/60 - ${crd} CRD not yet available, waiting 10s..."
+          sleep 10
+        done
+      done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/components-build-deploy.yml around lines 238 - 253, The workflow currently waits only for datascienceclusters.datasciencecluster.opendatahub.io before applying dsci.yaml and datasciencecluster.yaml; change the wait logic to ensure both CRDs (datascienceclusters.datasciencecluster.opendatahub.io and dscinitializations.dscinitialization.opendatahub.io) are registered before proceeding — either by extending the existing loop to check both CRDs in the if condition or by adding a second similar wait loop for dscinitializations.dscinitialization.opendatahub.io so that the apply steps for dsci.yaml and datasciencecluster.yaml will not run until both CRDs are available.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 519988f8-055d-4a8e-899a-ca6bcbc422f2
📒 Files selected for processing (3)
- .github/workflows/components-build-deploy.yml
- .github/workflows/prod-release-deploy.yaml
- components/manifests/components/openshift-ai/mlflow.yaml
✅ Files skipped from review due to trivial changes (1)
- components/manifests/components/openshift-ai/mlflow.yaml
- name: Ensure mlflow database exists in PostgreSQL
  run: |
    oc exec -n ambient-code deploy/postgresql -- \
      psql -U postgres -tAc \
      "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
      || oc exec -n ambient-code deploy/postgresql -- \
      psql -U postgres -c "CREATE DATABASE mlflow"
🧩 Analysis chain
🏁 Script executed:
# First, let's find and read the workflow file
find . -path "*/.github/workflows/components-build-deploy.yml" -type f
Repository: ambient-code/platform
Length of output: 112
🏁 Script executed:
# Read the workflow file, particularly around lines 275-282
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  sed -n '270,290p' ".github/workflows/components-build-deploy.yml" | cat -n
else
  echo "File not found"
fi
Repository: ambient-code/platform
Length of output: 1176
🏁 Script executed:
# Get broader context - check how many steps/jobs exist and what the workflow does
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  wc -l ".github/workflows/components-build-deploy.yml"
  # Check for action version pinning in the file
  grep -n "uses:" ".github/workflows/components-build-deploy.yml" | head -20
fi
Repository: ambient-code/platform
Length of output: 810
🏁 Script executed:
# Check if there are any secrets referenced in the step or nearby context
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  sed -n '260,295p' ".github/workflows/components-build-deploy.yml" | grep -i "secret\|credential\|password\|token" || echo "No secrets found in context"
fi
Repository: ambient-code/platform
Length of output: 412
Make DB creation idempotent to avoid race-condition failures.
The check-then-create pattern fails when another concurrent run creates the database between the check and create steps. The CREATE DATABASE command will error with "already exists", causing the step to fail even though the desired end state (database exists) is correct. The suggested patch adds explicit error handling and a verification fallback to ensure idempotence.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/components-build-deploy.yml around lines 275 - 282,
Replace the current check-then-create sequence with an idempotent create that
tolerates a concurrent creator: run oc exec ... psql -U postgres -c "CREATE
DATABASE mlflow" and if that command fails, run the SELECT "SELECT 1 FROM
pg_database WHERE datname = 'mlflow'" (the same query currently used) and treat
the step as successful if the SELECT returns 1, otherwise fail. Update the
workflow step named "Ensure mlflow database exists in PostgreSQL" to implement
this retry/fallback logic so the CREATE failure due to "already exists" is
handled gracefully.
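The create-then-verify fallback described in this prompt can be sketched as below. This is a sketch under stated assumptions, not the merged implementation: `run_psql` and `ensure_database` are hypothetical helper names, and the `oc exec` target and `psql` flags are taken from the workflow step and the reviewer's patch for the sibling workflow.

```shell
# Hypothetical wrapper around the psql invocation used in the workflow step.
run_psql() {
  oc exec -n ambient-code deploy/postgresql -- \
    psql -U postgres -d postgres -v ON_ERROR_STOP=1 -tAc "$1"
}

# ensure_database NAME
# Tries CREATE DATABASE first. If that fails, the failure is accepted only
# when the database is confirmed to exist (for example, because a concurrent
# run created it first). NAME is interpolated into SQL, so pass trusted
# constants only.
ensure_database() {
  local db="$1"
  if run_psql "CREATE DATABASE ${db}" >/dev/null 2>&1; then
    echo "Created database ${db}"
    return 0
  fi
  if run_psql "SELECT 1 FROM pg_database WHERE datname = '${db}'" | grep -qx 1; then
    echo "Database ${db} already exists"
    return 0
  fi
  echo "::error::Could not create or find database ${db}"
  return 1
}

# Usage:
# ensure_database mlflow
```

One trade-off of attempting the CREATE first: on every run where the database already exists, PostgreSQL logs a duplicate-database error, which is harmless but may be noisy in the server log.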
- name: Wait for DataScienceCluster CRD to be available
  run: |
    echo "Waiting for DataScienceCluster CRD to be registered..."
    for i in $(seq 1 60); do
      if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
        echo "DataScienceCluster CRD is available"
        break
      fi
      if [ "$i" -eq 60 ]; then
        echo "::error::DataScienceCluster CRD did not become available within timeout"
        exit 1
      fi
      echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
      sleep 10
    done
🧩 Analysis chain
🏁 Script executed:
# First, locate and examine the workflow file
find . -name "prod-release-deploy.yaml" -type f
Repository: ambient-code/platform
Length of output: 109
🏁 Script executed:
# Once found, let's check the relevant lines
cat -n .github/workflows/prod-release-deploy.yaml | sed -n '370,410p'
Repository: ambient-code/platform
Length of output: 2089
🏁 Script executed:
# Also search for references to dsci.yaml and datasciencecluster.yaml to understand what they do
rg "dsci\.yaml|datasciencecluster\.yaml" --type yaml -B 2 -A 2
Repository: ambient-code/platform
Length of output: 477
🏁 Script executed:
# Check if there are other workflow files with similar patterns
find .github/workflows -name "*.yaml" -o -name "*.yml" | head -20
Repository: ambient-code/platform
Length of output: 818
🏁 Script executed:
cat -n components/manifests/components/openshift-ai/dsci.yaml
Repository: ambient-code/platform
Length of output: 414
🏁 Script executed:
# Also check if dsci.yaml is a template or has any conditionals
file components/manifests/components/openshift-ai/dsci.yaml
wc -l components/manifests/components/openshift-ai/dsci.yaml
Repository: ambient-code/platform
Length of output: 182
🏁 Script executed:
# Search for any DSCInitialization references in the codebase
rg "DSCInitialization|dscinitialization" --type yaml -i
Repository: ambient-code/platform
Length of output: 246
Wait for both DSCInitialization and DataScienceCluster CRDs before applying manifests.
Line 395 applies dsci.yaml, which requires the dscinitializations.dscinitialization.opendatahub.io CRD. The current wait (lines 377-392) only checks datascienceclusters.datasciencecluster.opendatahub.io, leaving a race condition where dsci.yaml could fail if its CRD hasn't registered yet.
Suggested patch
-  - name: Wait for DataScienceCluster CRD to be available
+  - name: Wait for required OpenDataHub CRDs to be available
     run: |
-      echo "Waiting for DataScienceCluster CRD to be registered..."
-      for i in $(seq 1 60); do
-        if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
-          echo "DataScienceCluster CRD is available"
-          break
-        fi
-        if [ "$i" -eq 60 ]; then
-          echo "::error::DataScienceCluster CRD did not become available within timeout"
-          exit 1
-        fi
-        echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
-        sleep 10
-      done
+      for crd in \
+        dscinitializations.dscinitialization.opendatahub.io \
+        datascienceclusters.datasciencecluster.opendatahub.io; do
+        echo "Waiting for ${crd} CRD to be registered..."
+        for i in $(seq 1 60); do
+          if oc get crd "$crd" &>/dev/null; then
+            echo "${crd} CRD is available"
+            break
+          fi
+          if [ "$i" -eq 60 ]; then
+            echo "::error::${crd} CRD did not become available within timeout"
+            exit 1
+          fi
+          echo "Attempt $i/60 - ${crd} CRD not yet available, waiting 10s..."
+          sleep 10
+        done
+      done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/prod-release-deploy.yaml around lines 377 - 392, Update
the CRD wait step so it waits for both CRDs before proceeding: instead of only
checking datascienceclusters.datasciencecluster.opendatahub.io in the step
currently named "Wait for DataScienceCluster CRD to be available", extend the
loop to verify both datascienceclusters.datasciencecluster.opendatahub.io and
dscinitializations.dscinitialization.opendatahub.io are registered (break only
when both succeed), update the status messages to reflect which CRD(s) are still
pending, and keep the same timeout/exit-on-failure behavior to avoid applying
dsci.yaml before its CRD exists.
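The variant this prompt asks for (one loop, finish only when both CRDs are registered, report which are still pending) could look like the sketch below. The helper names `wait_for_all` and `crd_exists` are hypothetical; only the `oc get crd` check and the CRD names come from the review.

```shell
# wait_for_all ATTEMPTS DELAY_SECONDS CHECK_CMD NAME...
# Hypothetical helper: polls until CHECK_CMD NAME succeeds for every NAME,
# printing the names that are still pending after each round.
# Returns 1 if any name is still pending after ATTEMPTS rounds.
wait_for_all() {
  local attempts="$1" delay="$2" check="$3" i=1 name pending
  shift 3
  while [ "$i" -le "$attempts" ]; do
    pending=""
    for name in "$@"; do
      "$check" "$name" >/dev/null 2>&1 || pending="${pending} ${name}"
    done
    if [ -z "$pending" ]; then
      echo "All resources available: $*"
      return 0
    fi
    echo "Attempt ${i}/${attempts} - still pending:${pending}"
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "::error::Timed out waiting for:${pending}"
  return 1
}

# Hypothetical checker wrapping the command the workflow already uses.
crd_exists() { oc get crd "$1"; }

# Usage:
# wait_for_all 60 10 crd_exists \
#   dscinitializations.dscinitialization.opendatahub.io \
#   datascienceclusters.datasciencecluster.opendatahub.io
```

Compared with the nested loops in the suggested patch, a single shared loop bounds the total wait at one timeout rather than one timeout per CRD.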
- name: Ensure mlflow database exists in PostgreSQL
  run: |
    oc exec -n ambient-code deploy/postgresql -- \
      psql -U postgres -tAc \
      "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
      || oc exec -n ambient-code deploy/postgresql -- \
      psql -U postgres -c "CREATE DATABASE mlflow"
🧩 Analysis chain
🏁 Script executed:
# First, locate and read the specific file and lines
head -n 425 .github/workflows/prod-release-deploy.yaml | tail -n 20
Repository: ambient-code/platform
Length of output: 961
🏁 Script executed:
# Also verify the file exists and get context around those lines
wc -l .github/workflows/prod-release-deploy.yaml
Repository: ambient-code/platform
Length of output: 111
🏁 Script executed:
# Let's check the exact content at lines 414-421
sed -n '410,425p' .github/workflows/prod-release-deploy.yaml
Repository: ambient-code/platform
Length of output: 807
🏁 Script executed:
# Check if there are other similar database creation patterns in workflows
rg "SELECT.*pg_database" .github/workflows/ -A 3 -B 1
Repository: ambient-code/platform
Length of output: 978
🏁 Script executed:
# Check if there are any ACTION versions that need pinning in this file
rg "uses:" .github/workflows/prod-release-deploy.yaml
Repository: ambient-code/platform
Length of output: 665
Harden mlflow DB creation against check-then-create races.
The current logic has a TOCTOU race: if another process creates the database between the SELECT check and CREATE statement, the latter fails, causing unnecessary release failures despite correct final state.
Suggested patch
   - name: Ensure mlflow database exists in PostgreSQL
     run: |
-      oc exec -n ambient-code deploy/postgresql -- \
-        psql -U postgres -tAc \
-        "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
-        || oc exec -n ambient-code deploy/postgresql -- \
-        psql -U postgres -c "CREATE DATABASE mlflow"
+      set -euo pipefail
+      if ! oc exec -n ambient-code deploy/postgresql -- \
+        psql -U postgres -d postgres -tAc \
+        "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1; then
+        oc exec -n ambient-code deploy/postgresql -- \
+          psql -U postgres -d postgres -v ON_ERROR_STOP=1 -c "CREATE DATABASE mlflow" \
+          || oc exec -n ambient-code deploy/postgresql -- \
+          psql -U postgres -d postgres -tAc \
+          "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1
+      fi
Same pattern exists in components-build-deploy.yml.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/prod-release-deploy.yaml around lines 414 - 421, Replace
the TOCTOU check-then-create pattern in the "Ensure mlflow database exists in
PostgreSQL" step (the oc exec ... psql invocation) with a create that tolerates
a concurrent creator: keep the SELECT check, run CREATE DATABASE mlflow only
when the check finds nothing, and if the CREATE fails, re-run the SELECT and
treat the step as successful when it returns 1. Do not wrap CREATE DATABASE in a
PL/pgSQL DO block: PostgreSQL rejects CREATE DATABASE inside a function or
transaction block, so the duplicate_database error cannot be caught that way.
Apply the same replacement to the corresponding step in
components-build-deploy.yml so both workflows use the same race-tolerant
approach instead of separate unguarded SELECT and CREATE commands.
## Summary

- Fixes the `deploy-rhoai-mlflow` GHA job that's still failing after #1187
- The CRD wait was checking for CRD *existence* (v1 was already there), but the DSC manifest uses **v2** which gets registered later
- Now waits for `v2` to appear in `oc api-resources` before applying
- Also includes: MLflow replicas set to 1, DB migration step for existing PostgreSQL instances

## Root cause

The RHOAI operator registers the `DataScienceCluster` CRD with v1 first, then updates it to include v2. The previous wait found v1 and proceeded, but the `datasciencecluster.opendatahub.io/v2` DSC manifest failed because v2 wasn't served yet.

## Test plan

- [ ] Re-run the `deploy-rhoai-mlflow` job and verify it passes
- [ ] Verify the v2 API wait step logs show it waiting then succeeding

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Chores**
  * Updated deployment readiness validation to check for API availability in release and component build workflows.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
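The root cause above is that a CRD can exist while the version a manifest needs is not yet served. A minimal sketch of a version-aware wait follows. It is an illustration under assumptions, not the merged step: the message above says the fix checks `oc api-resources`, whereas this sketch uses `oc api-versions` (which prints one `group/version` per line), and the helper names are hypothetical.

```shell
# Hypothetical wrapper; `oc api-versions` prints one group/version per line.
list_api_versions() { oc api-versions; }

# api_version_served GROUP/VERSION
# Succeeds only if that exact group/version is currently served.
# grep reads its whole input (no -q) so the producer is never cut off early,
# which matters when the step runs with `set -o pipefail`.
api_version_served() {
  list_api_versions 2>/dev/null | grep -xF -- "$1" >/dev/null
}

# wait_for_api_version GROUP/VERSION ATTEMPTS DELAY_SECONDS
wait_for_api_version() {
  local gv="$1" attempts="$2" delay="$3" i=1
  while [ "$i" -le "$attempts" ]; do
    if api_version_served "$gv"; then
      echo "${gv} is served"
      return 0
    fi
    echo "Attempt ${i}/${attempts} - ${gv} not yet served"
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "::error::${gv} was not served within timeout"
  return 1
}

# Usage:
# wait_for_api_version datasciencecluster.opendatahub.io/v2 60 10
```

Matching the whole line with `-x` is what separates this from an existence check: a served `datasciencecluster.opendatahub.io/v1` does not satisfy a wait for `v2`.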
Summary
- Fixes the deploy-rhoai-mlflow GHA job failure from "Adding RHOAI MLflow component into the cluster" #1166
- Adds an explicit wait for the datascienceclusters.datasciencecluster.opendatahub.io CRD between the operator readiness check and the DSC apply step
- Applies to both components-build-deploy.yml and prod-release-deploy.yaml
Test plan
- Re-run the deploy-rhoai-mlflow job and verify it passes
🤖 Generated with Claude Code
Summary by CodeRabbit