fix: wait for DataScienceCluster CRD before applying DSC #1187

Merged
Gkrumbach07 merged 2 commits into main from fix/rhoai-mlflow-crd-wait
Apr 3, 2026
Conversation

Contributor

@Gkrumbach07 Gkrumbach07 commented Apr 3, 2026

Summary

  • Fixes the deploy-rhoai-mlflow GHA job failure introduced by #1166 (Adding RHOAI MLflow component into the cluster)
  • The RHOAI operator CSV succeeding doesn't guarantee CRDs are registered yet
  • Adds a wait loop for datascienceclusters.datasciencecluster.opendatahub.io CRD between the operator readiness check and the DSC apply step
  • Applied to both components-build-deploy.yml and prod-release-deploy.yaml
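
The wait described in the summary can be sketched as a reusable shell function. This is a minimal sketch, not the code merged in this PR: the function name `wait_for_crd` and its attempts/interval parameters are hypothetical, and the defaults assume the 60 attempts at a 10s interval reported in the review below.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until a CRD is registered or the attempt budget runs out.
# Usage: wait_for_crd <crd-name> [attempts] [interval-seconds]
wait_for_crd() {
  local crd="$1"
  local attempts="${2:-60}"   # assumed default, mirrors the 60 attempts in the PR
  local interval="${3:-10}"   # assumed default, mirrors the 10s interval in the PR
  local i

  for i in $(seq 1 "$attempts"); do
    # "oc get crd <name>" exits non-zero while the CRD is not registered.
    if oc get crd "$crd" >/dev/null 2>&1; then
      echo "${crd} CRD is available (attempt ${i}/${attempts})"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      echo "Attempt ${i}/${attempts} - ${crd} not yet available, waiting ${interval}s..."
      sleep "$interval"
    fi
  done

  # "::error::" is the GitHub Actions workflow command for an error annotation.
  echo "::error::${crd} CRD did not become available within timeout"
  return 1
}
```

In a workflow step this would be called as `wait_for_crd datascienceclusters.datasciencecluster.opendatahub.io`, so that a non-zero return fails the job.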

Test plan

  • Re-run the deploy-rhoai-mlflow job and verify it passes
  • Verify the CRD wait step logs show the CRD becoming available

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Deployment workflows now wait (up to ~10 minutes) for a cluster CRD to register before proceeding; the job fails with an error if the CRD never appears.
    • Deployment steps now verify the PostgreSQL database named "mlflow" exists and create it if absent.
  • Configuration
    • MLflow server replica count changed from 2 to 1.

The RHOAI operator CSV succeeding doesn't guarantee CRDs are registered.
Add an explicit wait for the DataScienceCluster CRD between the operator
readiness check and the DSC apply step in both deploy workflows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai bot commented Apr 3, 2026

Walkthrough

Workflows now wait for the datascienceclusters.datasciencecluster.opendatahub.io CRD before applying DS cluster manifests and ensure a PostgreSQL database named mlflow exists by running a check/create inside the ambient-code deploy/postgresql pod. Separately, the MLflow manifest reduces spec.replicas from 2 to 1.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GitHub Actions workflows: .github/workflows/components-build-deploy.yml, .github/workflows/prod-release-deploy.yaml | Added a polling loop in the deploy-rhoai-mlflow job to wait (up to 60 attempts, 10s interval) for the datascienceclusters.datasciencecluster.opendatahub.io CRD before applying DS cluster manifests; after applying, added a step that execs into the ambient-code deploy/postgresql pod to run SELECT on pg_database and conditionally CREATE DATABASE mlflow if absent; workflow fails on CRD timeout. |
| MLflow manifest: components/manifests/components/openshift-ai/mlflow.yaml | Changed MLflow custom resource spec.replicas from 2 to 1. |

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant OC as OpenShift API / oc
    participant Pod as ambient-code deploy/postgresql Pod
    participant DB as PostgreSQL (mlflow)
    participant ML as MLflow Operator

    GH->>OC: loop: oc get crd datascienceclusters.datasciencecluster.opendatahub.io
    OC-->>GH: CRD present? (yes/no)
    alt CRD absent (after retries)
        OC-->>GH: no -> GH: fail job
    else CRD present
        GH->>OC: oc apply dsci.yaml / datasciencecluster.yaml
        GH->>OC: wait for MLflow CRD (existing wait)
        GH->>Pod: oc exec psql -c "SELECT 1 FROM pg_database WHERE datname='mlflow'"
        Pod-->>DB: query pg_database
        DB-->>Pod: result (exists / not exists)
        alt does not exist
            Pod->>DB: CREATE DATABASE mlflow
        end
        GH->>ML: proceed to verify secrets and deploy MLflow
    end
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title uses Conventional Commits format (fix: ...) and accurately summarizes the main change: adding a CRD wait mechanism before applying DataScienceCluster manifest. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Performance And Algorithmic Complexity | ✅ Passed | Added workflow steps introduce only bounded, sequential operations with appropriate timeout controls and no performance regressions. |
| Security And Secret Handling | ✅ Passed | PR introduces no security or secret handling violations with hardcoded values and proper GitHub Actions secrets referencing. |
| Kubernetes Resource Safety | ✅ Passed | PR introduces configuration changes to custom resources and workflow steps. All Kubernetes manifests maintain proper namespace scoping, define resource limits/requests at CR level, contain no RBAC wildcards, and don't create unmanaged child resources lacking OwnerReferences. |


- Set replicas to 1 (RWO PVC prevents multi-attach with >1 replica)
- Add "Ensure mlflow database exists" step to both deploy workflows
  so existing PostgreSQL instances get the database created without
  requiring a pod restart to re-run init scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
.github/workflows/components-build-deploy.yml (1)

238-253: Wait for both CRDs before applying both custom resources.

Line 256 applies dsci.yaml and Line 257 applies datasciencecluster.yaml, but this loop only waits for datascienceclusters.datasciencecluster.opendatahub.io. If dscinitializations.dscinitialization.opendatahub.io is delayed, this still flakes.

Suggested patch
-      - name: Wait for DataScienceCluster CRD to be available
+      - name: Wait for required OpenDataHub CRDs to be available
         run: |
-          echo "Waiting for DataScienceCluster CRD to be registered..."
-          for i in $(seq 1 60); do
-            if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
-              echo "DataScienceCluster CRD is available"
-              break
-            fi
-            if [ "$i" -eq 60 ]; then
-              echo "::error::DataScienceCluster CRD did not become available within timeout"
-              exit 1
-            fi
-            echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
-            sleep 10
-          done
+          for crd in \
+            dscinitializations.dscinitialization.opendatahub.io \
+            datascienceclusters.datasciencecluster.opendatahub.io; do
+            echo "Waiting for ${crd} CRD to be registered..."
+            for i in $(seq 1 60); do
+              if oc get crd "$crd" &>/dev/null; then
+                echo "${crd} CRD is available"
+                break
+              fi
+              if [ "$i" -eq 60 ]; then
+                echo "::error::${crd} CRD did not become available within timeout"
+                exit 1
+              fi
+              echo "Attempt $i/60 - ${crd} CRD not yet available, waiting 10s..."
+              sleep 10
+            done
+          done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/components-build-deploy.yml around lines 238 - 253, The
workflow currently waits only for
datascienceclusters.datasciencecluster.opendatahub.io before applying dsci.yaml
and datasciencecluster.yaml; change the wait logic to ensure both CRDs
(datascienceclusters.datasciencecluster.opendatahub.io and
dscinitializations.dscinitialization.opendatahub.io) are registered before
proceeding — either by extending the existing loop to check both CRDs in the if
condition or by adding a second similar wait loop for
dscinitializations.dscinitialization.opendatahub.io so that the apply steps for
dsci.yaml and datasciencecluster.yaml will not run until both CRDs are
available.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 275-282: Replace the current check-then-create sequence with an
idempotent create that tolerates a concurrent creator: run oc exec ... psql -U
postgres -c "CREATE DATABASE mlflow" and if that command fails, run the SELECT
"SELECT 1 FROM pg_database WHERE datname = 'mlflow'" (the same query currently
used) and treat the step as successful if the SELECT returns 1, otherwise fail.
Update the workflow step named "Ensure mlflow database exists in PostgreSQL" to
implement this retry/fallback logic so the CREATE failure due to "already
exists" is handled gracefully.

In @.github/workflows/prod-release-deploy.yaml:
- Around line 377-392: Update the CRD wait step so it waits for both CRDs before
proceeding: instead of only checking
datascienceclusters.datasciencecluster.opendatahub.io in the step currently
named "Wait for DataScienceCluster CRD to be available", extend the loop to
verify both datascienceclusters.datasciencecluster.opendatahub.io and
dscinitializations.dscinitialization.opendatahub.io are registered (break only
when both succeed), update the status messages to reflect which CRD(s) are still
pending, and keep the same timeout/exit-on-failure behavior to avoid applying
dsci.yaml before its CRD exists.
- Around line 414-421: Replace the TOCTOU check-then-create pattern in the
"Ensure mlflow database exists in PostgreSQL" step (the oc exec ... psql
invocation) with a create attempt that tolerates a concurrent creator: run
CREATE DATABASE mlflow and, if it fails, re-run the SELECT on pg_database and
treat the step as successful only if the database exists. A PL/pgSQL DO block
that catches duplicate_database is not an option, because PostgreSQL does not
allow CREATE DATABASE inside a transaction block or function. Apply the same
replacement to the corresponding step in components-build-deploy.yml so both
workflows handle the "already exists" failure instead of running SELECT and
CREATE with no fallback.

---

Nitpick comments:
In @.github/workflows/components-build-deploy.yml:
- Around line 238-253: The workflow currently waits only for
datascienceclusters.datasciencecluster.opendatahub.io before applying dsci.yaml
and datasciencecluster.yaml; change the wait logic to ensure both CRDs
(datascienceclusters.datasciencecluster.opendatahub.io and
dscinitializations.dscinitialization.opendatahub.io) are registered before
proceeding — either by extending the existing loop to check both CRDs in the if
condition or by adding a second similar wait loop for
dscinitializations.dscinitialization.opendatahub.io so that the apply steps for
dsci.yaml and datasciencecluster.yaml will not run until both CRDs are
available.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 519988f8-055d-4a8e-899a-ca6bcbc422f2

📥 Commits

Reviewing files that changed from the base of the PR and between b926403 and 55e9c0c.

📒 Files selected for processing (3)
  • .github/workflows/components-build-deploy.yml
  • .github/workflows/prod-release-deploy.yaml
  • components/manifests/components/openshift-ai/mlflow.yaml
✅ Files skipped from review due to trivial changes (1)
  • components/manifests/components/openshift-ai/mlflow.yaml

Comment on lines +275 to +282
      - name: Ensure mlflow database exists in PostgreSQL
        run: |
          oc exec -n ambient-code deploy/postgresql -- \
            psql -U postgres -tAc \
            "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
          || oc exec -n ambient-code deploy/postgresql -- \
            psql -U postgres -c "CREATE DATABASE mlflow"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and read the workflow file
find . -path "*/.github/workflows/components-build-deploy.yml" -type f

Repository: ambient-code/platform

Length of output: 112


🏁 Script executed:

# Read the workflow file, particularly around lines 275-282
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  sed -n '270,290p' ".github/workflows/components-build-deploy.yml" | cat -n
else
  echo "File not found"
fi

Repository: ambient-code/platform

Length of output: 1176


🏁 Script executed:

# Get broader context - check how many steps/jobs exist and what the workflow does
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  wc -l ".github/workflows/components-build-deploy.yml"
  # Check for action version pinning in the file
  grep -n "uses:" ".github/workflows/components-build-deploy.yml" | head -20
fi

Repository: ambient-code/platform

Length of output: 810


🏁 Script executed:

# Check if there are any secrets referenced in the step or nearby context
if [ -f ".github/workflows/components-build-deploy.yml" ]; then
  sed -n '260,295p' ".github/workflows/components-build-deploy.yml" | grep -i "secret\|credential\|password\|token" || echo "No secrets found in context"
fi

Repository: ambient-code/platform

Length of output: 412


Make DB creation idempotent to avoid race-condition failures.

The check-then-create pattern fails when another concurrent run creates the database between the check and create steps. The CREATE DATABASE command will error with "already exists", causing the step to fail even though the desired end state (database exists) is correct. The suggested patch adds explicit error handling and a verification fallback to ensure idempotence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/components-build-deploy.yml around lines 275 - 282,
Replace the current check-then-create sequence with an idempotent create that
tolerates a concurrent creator: run oc exec ... psql -U postgres -c "CREATE
DATABASE mlflow" and if that command fails, run the SELECT "SELECT 1 FROM
pg_database WHERE datname = 'mlflow'" (the same query currently used) and treat
the step as successful if the SELECT returns 1, otherwise fail. Update the
workflow step named "Ensure mlflow database exists in PostgreSQL" to implement
this retry/fallback logic so the CREATE failure due to "already exists" is
handled gracefully.
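
The create-then-verify fallback described above could look roughly like the following. This is a sketch under stated assumptions rather than the merged step: `psql_in_pod`, `database_exists`, and `ensure_database` are hypothetical helper names, and the database name is interpolated directly into SQL, which is only acceptable for a trusted, hard-coded identifier such as `mlflow`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the oc exec ... psql invocation used in the workflow.
psql_in_pod() {
  oc exec -n ambient-code deploy/postgresql -- psql -U postgres -d postgres "$@"
}

# True when pg_database has a row for the given name.
database_exists() {
  local db="$1"
  # -tA prints bare rows, so an existing database yields exactly "1".
  [ "$(psql_in_pod -tAc "SELECT 1 FROM pg_database WHERE datname = '${db}'")" = "1" ]
}

# Create first; if CREATE fails, accept the failure only when the database exists.
ensure_database() {
  local db="$1"
  if psql_in_pod -v ON_ERROR_STOP=1 -c "CREATE DATABASE ${db}" >/dev/null 2>&1; then
    echo "created database ${db}"
    return 0
  fi
  # CREATE failed: this is fine if a concurrent run (or an earlier one) created it.
  if database_exists "$db"; then
    echo "database ${db} already exists"
    return 0
  fi
  echo "::error::could not create database ${db}"
  return 1
}
```

Attempting the CREATE first removes the window between check and create; the SELECT is only consulted to decide whether a failed CREATE is harmless.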

Comment on lines +377 to +392
      - name: Wait for DataScienceCluster CRD to be available
        run: |
          echo "Waiting for DataScienceCluster CRD to be registered..."
          for i in $(seq 1 60); do
            if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
              echo "DataScienceCluster CRD is available"
              break
            fi
            if [ "$i" -eq 60 ]; then
              echo "::error::DataScienceCluster CRD did not become available within timeout"
              exit 1
            fi
            echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
            sleep 10
          done


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the workflow file
find . -name "prod-release-deploy.yaml" -type f

Repository: ambient-code/platform

Length of output: 109


🏁 Script executed:

# Once found, let's check the relevant lines
cat -n .github/workflows/prod-release-deploy.yaml | sed -n '370,410p'

Repository: ambient-code/platform

Length of output: 2089


🏁 Script executed:

# Also search for references to dsci.yaml and datasciencecluster.yaml to understand what they do
rg "dsci\.yaml|datasciencecluster\.yaml" --type yaml -B 2 -A 2

Repository: ambient-code/platform

Length of output: 477


🏁 Script executed:

# Check if there are other workflow files with similar patterns
find .github/workflows -name "*.yaml" -o -name "*.yml" | head -20

Repository: ambient-code/platform

Length of output: 818


🏁 Script executed:

cat -n components/manifests/components/openshift-ai/dsci.yaml

Repository: ambient-code/platform

Length of output: 414


🏁 Script executed:

# Also check if dsci.yaml is a template or has any conditionals
file components/manifests/components/openshift-ai/dsci.yaml
wc -l components/manifests/components/openshift-ai/dsci.yaml

Repository: ambient-code/platform

Length of output: 182


🏁 Script executed:

# Search for any DSCInitialization references in the codebase
rg "DSCInitialization|dscinitialization" --type yaml -i

Repository: ambient-code/platform

Length of output: 246


Wait for both DSCInitialization and DataScienceCluster CRDs before applying manifests.

Line 395 applies dsci.yaml, which requires the dscinitializations.dscinitialization.opendatahub.io CRD. The current wait (lines 377-392) only checks datascienceclusters.datasciencecluster.opendatahub.io, leaving a race condition where dsci.yaml could fail if its CRD hasn't registered yet.

Suggested patch
-      - name: Wait for DataScienceCluster CRD to be available
+      - name: Wait for required OpenDataHub CRDs to be available
         run: |
-          echo "Waiting for DataScienceCluster CRD to be registered..."
-          for i in $(seq 1 60); do
-            if oc get crd datascienceclusters.datasciencecluster.opendatahub.io &>/dev/null; then
-              echo "DataScienceCluster CRD is available"
-              break
-            fi
-            if [ "$i" -eq 60 ]; then
-              echo "::error::DataScienceCluster CRD did not become available within timeout"
-              exit 1
-            fi
-            echo "Attempt $i/60 - CRD not yet available, waiting 10s..."
-            sleep 10
-          done
+          for crd in \
+            dscinitializations.dscinitialization.opendatahub.io \
+            datascienceclusters.datasciencecluster.opendatahub.io; do
+            echo "Waiting for ${crd} CRD to be registered..."
+            for i in $(seq 1 60); do
+              if oc get crd "$crd" &>/dev/null; then
+                echo "${crd} CRD is available"
+                break
+              fi
+              if [ "$i" -eq 60 ]; then
+                echo "::error::${crd} CRD did not become available within timeout"
+                exit 1
+              fi
+              echo "Attempt $i/60 - ${crd} CRD not yet available, waiting 10s..."
+              sleep 10
+            done
+          done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/prod-release-deploy.yaml around lines 377 - 392, Update
the CRD wait step so it waits for both CRDs before proceeding: instead of only
checking datascienceclusters.datasciencecluster.opendatahub.io in the step
currently named "Wait for DataScienceCluster CRD to be available", extend the
loop to verify both datascienceclusters.datasciencecluster.opendatahub.io and
dscinitializations.dscinitialization.opendatahub.io are registered (break only
when both succeed), update the status messages to reflect which CRD(s) are still
pending, and keep the same timeout/exit-on-failure behavior to avoid applying
dsci.yaml before its CRD exists.
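
One way to realize the "break only when both succeed" variant described in this prompt is a single loop that tracks which CRDs are still pending. The sketch below is an illustration, not the suggested patch above: `wait_for_crds` is a hypothetical name, and unlike the nested-loop patch it shares one attempt budget across all CRDs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until every listed CRD is registered, sharing one budget.
# Usage: wait_for_crds <attempts> <interval-seconds> <crd-name>...
wait_for_crds() {
  local attempts="$1" interval="$2"
  shift 2
  local pending=("$@")
  local still_pending crd i

  for i in $(seq 1 "$attempts"); do
    still_pending=()
    # Re-check only the CRDs that have not been seen yet.
    for crd in "${pending[@]}"; do
      oc get crd "$crd" >/dev/null 2>&1 || still_pending+=("$crd")
    done
    if [ "${#still_pending[@]}" -eq 0 ]; then
      echo "All required CRDs are available"
      return 0
    fi
    pending=("${still_pending[@]}")
    echo "Attempt ${i}/${attempts} - still waiting for: ${pending[*]}"
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done

  echo "::error::CRDs not available within timeout: ${pending[*]}"
  return 1
}
```

A call such as `wait_for_crds 60 10 dscinitializations.dscinitialization.opendatahub.io datascienceclusters.datasciencecluster.opendatahub.io` would keep the overall timeout at roughly ten minutes instead of ten minutes per CRD.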

Comment on lines +414 to +421
      - name: Ensure mlflow database exists in PostgreSQL
        run: |
          oc exec -n ambient-code deploy/postgresql -- \
            psql -U postgres -tAc \
            "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
          || oc exec -n ambient-code deploy/postgresql -- \
            psql -U postgres -c "CREATE DATABASE mlflow"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the specific file and lines
head -n 425 .github/workflows/prod-release-deploy.yaml | tail -n 20

Repository: ambient-code/platform

Length of output: 961


🏁 Script executed:

# Also verify the file exists and get context around those lines
wc -l .github/workflows/prod-release-deploy.yaml

Repository: ambient-code/platform

Length of output: 111


🏁 Script executed:

# Let's check the exact content at lines 414-421
sed -n '410,425p' .github/workflows/prod-release-deploy.yaml

Repository: ambient-code/platform

Length of output: 807


🏁 Script executed:

# Check if there are other similar database creation patterns in workflows
rg "SELECT.*pg_database" .github/workflows/ -A 3 -B 1

Repository: ambient-code/platform

Length of output: 978


🏁 Script executed:

# Check if there are any ACTION versions that need pinning in this file
rg "uses:" .github/workflows/prod-release-deploy.yaml

Repository: ambient-code/platform

Length of output: 665


Harden mlflow DB creation against check-then-create races.

The current logic has a TOCTOU race: if another process creates the database between the SELECT check and CREATE statement, the latter fails, causing unnecessary release failures despite correct final state.

Suggested patch
       - name: Ensure mlflow database exists in PostgreSQL
         run: |
-          oc exec -n ambient-code deploy/postgresql -- \
-            psql -U postgres -tAc \
-            "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1 \
-          || oc exec -n ambient-code deploy/postgresql -- \
-            psql -U postgres -c "CREATE DATABASE mlflow"
+          set -euo pipefail
+          if ! oc exec -n ambient-code deploy/postgresql -- \
+            psql -U postgres -d postgres -tAc \
+            "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1; then
+            oc exec -n ambient-code deploy/postgresql -- \
+              psql -U postgres -d postgres -v ON_ERROR_STOP=1 -c "CREATE DATABASE mlflow" \
+            || oc exec -n ambient-code deploy/postgresql -- \
+              psql -U postgres -d postgres -tAc \
+              "SELECT 1 FROM pg_database WHERE datname = 'mlflow'" | grep -q 1
+          fi

Same pattern exists in components-build-deploy.yml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/prod-release-deploy.yaml around lines 414 - 421, Replace
the TOCTOU check-then-create pattern in the "Ensure mlflow database exists in
PostgreSQL" step (the oc exec ... psql invocation) with a create attempt that
tolerates a concurrent creator: run CREATE DATABASE mlflow and, if it fails,
re-run the SELECT on pg_database and treat the step as successful only if the
database exists. A PL/pgSQL DO block that catches duplicate_database is not an
option, because PostgreSQL does not allow CREATE DATABASE inside a transaction
block or function. Apply the same replacement to the corresponding step in
components-build-deploy.yml so both workflows handle the "already exists"
failure instead of running SELECT and CREATE with no fallback.

@Gkrumbach07 Gkrumbach07 merged commit 2118509 into main Apr 3, 2026
40 checks passed
@Gkrumbach07 Gkrumbach07 deleted the fix/rhoai-mlflow-crd-wait branch April 3, 2026 13:33
@ambient-code ambient-code bot added this to the Review Queue milestone Apr 3, 2026
Gkrumbach07 added a commit that referenced this pull request Apr 3, 2026
## Summary
- Fixes the `deploy-rhoai-mlflow` GHA job that's still failing after
#1187
- The CRD wait was checking for CRD *existence* (v1 was already there),
but the DSC manifest uses **v2** which gets registered later
- Now waits for `v2` to appear in `oc api-resources` before applying
- Also includes: MLflow replicas set to 1, DB migration step for
existing PostgreSQL instances

## Root cause
The RHOAI operator registers the `DataScienceCluster` CRD with v1 first,
then updates it to include v2. The previous wait found v1 and proceeded,
but the `datasciencecluster.opendatahub.io/v2` DSC manifest failed
because v2 wasn't served yet.
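
A version-aware wait could be sketched as below. Note the assumption: this commit describes checking `oc api-resources`, whereas the sketch reads the CRD's served versions through a jsonpath query, and `crd_serves_version` and `wait_for_crd_version` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true only once the CRD lists the wanted version as served.
# Usage: crd_serves_version <crd-name> <version>
crd_serves_version() {
  local crd="$1" version="$2" served
  # spec.versions[].served and .name are fields of the apiextensions.k8s.io/v1 CRD schema.
  served="$(oc get crd "$crd" \
    -o jsonpath='{.spec.versions[?(@.served==true)].name}' 2>/dev/null)" || return 1
  # The query prints matching names separated by spaces, e.g. "v1 v2".
  case " ${served} " in
    *" ${version} "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: wait_for_crd_version <crd-name> <version> [attempts] [interval-seconds]
wait_for_crd_version() {
  local crd="$1" version="$2" attempts="${3:-60}" interval="${4:-10}" i
  for i in $(seq 1 "$attempts"); do
    if crd_serves_version "$crd" "$version"; then
      echo "${crd} serves ${version}"
      return 0
    fi
    echo "Attempt ${i}/${attempts} - ${version} not served yet"
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  echo "::error::${crd} never served ${version}"
  return 1
}
```

Checking for the specific version rather than mere CRD existence addresses the case described here, where v1 is present and the wait would otherwise pass too early.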

## Test plan
- [ ] Re-run the `deploy-rhoai-mlflow` job and verify it passes
- [ ] Verify the v2 API wait step logs show it waiting then succeeding

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated deployment readiness validation to check for API availability
in release and component build workflows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Ambient Code Bot <bot@ambient-code.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>