grokspawn
Contributor

@grokspawn grokspawn commented Aug 19, 2025

Description of the change:
opm validate fails when an edge is stranded because the replaces chain from the head is broken by skipped edges.

Motivation for the change:
Due to OLMv0 graph mechanics, any skips edge will cause OLMv0 to ignore the bundle version when considering upgrades (since v0 discards graph contribution from skipped bundle versions).
Since the purpose of a replaces edge is to enable upgrade mobility across a graph, allowing the bundle version to be ignored (due to the skips entry) is an error, and potentially results in stranding.

For example, take input olm.channel:

schema: olm.channel
name: stable-v1
package: test-operator
entries:
  - name: test-operator-v1.0.0 # stranded due to skip of test-operator-v1.1.2
  - name: test-operator-v1.1.0 # stranded due to skip of test-operator-v1.1.2
    replaces: test-operator-v1.0.0
  - name: test-operator-v1.1.2
    replaces: test-operator-v1.1.0
  - name: test-operator-v1.1.4
    replaces: test-operator-v1.1.2
    skips:
      - test-operator-v1.1.2
  - name: test-operator-v1.2.0
  - name: test-operator-v1.2.1
    replaces: test-operator-v1.1.4
    skips:
      - test-operator-v1.1.4
      - test-operator-v1.2.0
  - name: test-operator-v1.3.0
    replaces: test-operator-v1.2.1
    skips:
      - test-operator-v1.2.1
  - name: test-operator-v1.4.0
    replaces: test-operator-v1.3.0
    skips:
      - test-operator-v1.3.0

Using a new version of opm which can optionally display OLMv0 graph semantics, skipped objects are limned in red and ignored edges are red dashed arrows to help visualize the stranded edges.

image

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

codecov bot commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.27%. Comparing base (bf8476b) to head (5bd337c).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1750   +/-   ##
=======================================
  Coverage   55.26%   55.27%           
=======================================
  Files         136      136           
  Lines       15974    15976    +2     
=======================================
+ Hits         8828     8830    +2     
  Misses       5991     5991           
  Partials     1155     1155           

@grokspawn
Contributor Author

/approve

Contributor

openshift-ci bot commented Aug 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: grokspawn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2025
		if slices.Contains(entry.Skips, entry.Replaces) {
			return nil, fmt.Errorf("invalid package %q, channel %q: entry %q has identical replaces and skips: %q", c.Package, c.Name, entry.Name, entry.Replaces)
		}
	}
Contributor

@camilamacedo86 camilamacedo86 Aug 20, 2025

👍 Makes sense to me. My only concern is:
Did we check how many existing cases fail in this scenario?
We might need to create a script to validate. What do we do if we have FBC catalogs with?

But maybe that will need to be looked at outside of this PR.

/lgtm

Contributor Author

We do see one instance in the operatorhubio catalog:

./operatorhubio/latest
FATA[0002] invalid package "grafana-operator", channel "v5": entry "grafana-operator.v5.10.0" has identical replaces and skips: "grafana-operator.v5.9.2"

let's
/hold
this until we can talk to some impacted folks and determine if this is a big enough problem to have to solve NOW.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 20, 2025
@joelanford
Member

This validation check seems to be very narrowly tailored to "can't both skip and replace the same thing in one entry", which is good!

However, I think it very slightly misses the point and the broader problem.

  1. It is actually okay to both skip and replace a bundle that is already a leaf node in the graph.
  2. When a node is skip-ed and causes other entries to no longer have a path to the channel head, that is the real problem that we need to check for.

@grokspawn
Contributor Author

> This validation check seems to be very narrowly tailored to "can't both skip and replace the same thing in one entry", which is good!
>
> However, I think it very slightly misses the point and the broader problem.
>
> 1. It is actually okay to both `skip` and `replace` a bundle that is already a leaf node in the graph.

This is totally fine in any OLMv1 context, but I'd argue that because it comes with migration side-effects for OLMv0, it's never OK. In general, we should not have these kinds of surprises, and I think it's reasonable to enforce the most restrictive case here (because it's easier to grow more permissive later than more restrictive).

> 2. When a node is `skip`-ed and causes other entries to no longer have a path to the channel head, _that_ is the real problem that we need to check for.

That's a specific flavor of this more general issue. But I'd argue that it is also resolved by preventing the more general issue.

"name": "clusterwide-alpha",
"entries": [
{"name": "etcdoperator.v0.9.0"},
{"name": "etcdoperator.v0.9.2-clusterwide", "replaces": "etcdoperator.v0.9.0", "skips": ["etcdoperator.v0.6.1","etcdoperator.v0.9.0"], "skipRange": ">=0.9.0 <=0.9.1"},
Member

Is this change related to the model validation change somehow? It seems unrelated to me at first glance.

Contributor Author

This removes a skips entry from the slice where it duplicates the replaces edge.
It was needed for the previous commit, and I haven't yet checked whether the impact on existing catalogs is different with the new commit.

Contributor Author

Checked the catalogs impact and it looks the same as before. HOWEVER, we no longer need this test change, because somehow it's OK for 0.9.2-clusterwide to skip AND replace v0.9.0 ...?

@grokspawn grokspawn force-pushed the channel-edge-no-dupe-skip-replace branch from 9acb503 to 7607718 Compare August 28, 2025 20:41
@bandrade
Contributor

bandrade commented Sep 3, 2025

I've tested this PR using a synthetic catalog with the following structure to explicitly trigger the replaces + skips conflict detection introduced by this PR.


📦 Catalog Structure

All files were placed under a directory named minimal-test and used the .yaml extension so that opm validate would parse them.

test.package.yaml

schema: olm.package
name: test-operator
defaultChannel: stable-v1

stable-v1.channel.yaml

schema: olm.channel
name: stable-v1
package: test-operator
entries:
  - name: test-operator-v1.0.0
  - name: test-operator-v1.1.0
  - name: test-operator-v1.1.2
  - name: test-operator-v1.1.4
    replaces: test-operator-v1.0.0
    skips:
      - test-operator-v1.0.0
      - test-operator-v1.1.0
      - test-operator-v1.1.2
  - name: test-operator-v1.2.0
  - name: test-operator-v1.2.1
    replaces: test-operator-v1.1.4
    skips:
      - test-operator-v1.1.4     # <- conflict with replaces
      - test-operator-v1.2.0
  - name: test-operator-v1.3.0
    replaces: test-operator-v1.2.1
    skips:
      - test-operator-v1.2.1     # <- conflict with replaces
  - name: test-operator-v1.4.0
    replaces: test-operator-v1.3.0
    skips:
      - test-operator-v1.3.0     # <- conflict with replaces

test-operator-.bundle.yaml (for each version)

Each file includes the required olm.package property, e.g.:

schema: olm.bundle
name: test-operator-v1.2.1
package: test-operator
image: quay.io/example/test-operator:v1.2.1
properties:
  - type: olm.package
    value:
      packageName: test-operator
      version: 1.2.1

opm validate completed without any errors.

I was expecting results like this:

invalid package "test-operator", channel "stable-v1": entry "test-operator-v1.2.1" has identical replaces and skips: "test-operator-v1.1.4"
invalid package "test-operator", channel "stable-v1": entry "test-operator-v1.3.0" has identical replaces and skips: "test-operator-v1.2.1"
invalid package "test-operator", channel "stable-v1": entry "test-operator-v1.4.0" has identical replaces and skips: "test-operator-v1.3.0"

Could you check whether there is something wrong with my test? Thanks

Signed-off-by: grokspawn <[email protected]>
@grokspawn grokspawn force-pushed the channel-edge-no-dupe-skip-replace branch from 7607718 to 5bd337c Compare September 5, 2025 14:04
@grokspawn
Contributor Author

grokspawn commented Sep 5, 2025

> I've tested this PR using a synthetic catalog with the following structure to explicitly trigger the replaces + skips conflict detection introduced by this PR.
>
> Could you check whether there is something wrong with my test? Thanks

Hey @bandrade, I'll need to update the PR description, because the new commit changed the functionality: it no longer merely refuses skipped-replaces, but checks whether a skipped-replace actually strands bundles along the replaces chain.

The original example essentially ignores ALL lower bundle versions, so the new check does not identify it as a failure.
For it to be identified, there have to be non-skipped bundles earlier in the replaces chain which are stranded because intermediary links are ignored (they belong to a skipped edge).

For example, this modification to your channel.yaml results in a failure:

schema: olm.channel
name: stable-v1
package: test-operator
entries:
  - name: test-operator-v1.0.0  # stranded because of skip on v1.1.4
  - name: test-operator-v1.1.0  # stranded because of skip on v1.1.4
    replaces: test-operator-v1.0.0
  - name: test-operator-v1.1.2
    replaces: test-operator-v1.1.0
  - name: test-operator-v1.1.4
    replaces: test-operator-v1.1.2
    skips:
      - test-operator-v1.1.2
  - name: test-operator-v1.2.0
  - name: test-operator-v1.2.1
    replaces: test-operator-v1.1.4
    skips:
      - test-operator-v1.1.4
      - test-operator-v1.2.0
  - name: test-operator-v1.3.0
    replaces: test-operator-v1.2.1
    skips:
      - test-operator-v1.2.1 
  - name: test-operator-v1.4.0
    replaces: test-operator-v1.3.0
    skips:
      - test-operator-v1.3.0

results in the message

FATA[0000] invalid index:
└── invalid package "test-operator":
    └── invalid channel "stable-v1":
        └── channel contains one or more stranded bundles: test-operator-v1.0.0, test-operator-v1.1.0

@grokspawn grokspawn changed the title validate fail for dupe skips+replaces channel entries validate fail for stranded channel entries Sep 8, 2025
@grokspawn
Contributor Author

/hold cancel
Removing the hold after consensus with stakeholders of existing, known catalogs.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2025
@grokspawn
Contributor Author

grokspawn commented Sep 8, 2025

There is a coverage gap in this PR: if there are no non-skipped edges below a skipped edge, then it cannot identify stranded edges.
This is because the availability of edges in the replaces chain (that is, edges that are not skipped) is the key to identifying this problem.
This edge case already exists if folks use startingCSV to pin an initial-installation version, and we can add new validation to catch it later, like #1762, which attempts to adjust the underlying heuristics to cover all stranding cases.

@bandrade
Contributor

bandrade commented Sep 9, 2025

Thanks for the clarification and updated logic — I just reproduced the new stranded bundle detection using the modified channel.yaml you suggested.

I created a synthetic catalog with the following replaces chain:

 ./bin/opm validate /Users/bandrade/Downloads/stranded-replaces-chain
FATA[0000] invalid index:
└── invalid package "test-operator":
    └── invalid channel "stable-v1":
        └── channel contains one or more stranded bundles: test-operator-v1.0.0, test-operator-v1.1.0 

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2025
@bandrade
Contributor

bandrade commented Sep 9, 2025

/label qe-verified
/verified by @bandrade

Contributor

openshift-ci bot commented Sep 9, 2025

@bandrade: The label(s) /label qe-verified cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, ux-approved, no-qe, downstream-change-needed, rebase/manual, cluster-config-api-changed, run-integration-tests, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, ok-to-test, stability-fix-approved, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

@openshift-ci-robot

@bandrade: This PR has been marked as verified by @bandrade.

@bandrade
Contributor

bandrade commented Sep 9, 2025

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Sep 9, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 3f87bea into operator-framework:master Sep 9, 2025
13 checks passed
@grokspawn grokspawn deleted the channel-edge-no-dupe-skip-replace branch September 11, 2025 20:50
@mantomas

It looks like, thanks to this change, k8s-operatorhub/community-operators is unable to release anything new, because grafana-operator is present and fails the new validation (e.g. here). We will pin the previous opm version for now.

@grokspawn
Contributor Author

grokspawn commented Oct 1, 2025

Assessment of existing, known catalogs was tabulated here: https://docs.google.com/spreadsheets/d/1ngHlFDOflLkpzf7Fd3_AAet_wqx65vjV8PkpHPonO2w/ ("[old] Summary" tab).

For community-operators, @mantomas it appears that the infinispan-operator has unhealthy graph edges which are identified with this change.

operatorhubio catalog has failures with grafana-operator contributions. I'd suggest touching base with the maintainers there to get them to make updates.
