Custom taints and toleration node operation #9920

Open
wants to merge 16 commits into
base: master

Conversation

vkathole
Contributor

@vkathole vkathole commented Jun 7, 2024

No description provided.

@vkathole vkathole requested review from a team as code owners June 7, 2024 10:27
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines label Jun 7, 2024
@vkathole vkathole added team/e2e E2E team related issues/PRs and removed size/L PR that changes 100-499 lines labels Jun 7, 2024
@vkathole vkathole self-assigned this Jun 7, 2024
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
ocs_ci/ocs/resources/pod.py Outdated Show resolved Hide resolved
@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines label Sep 19, 2024
@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines and removed size/M PR that changes 30-99 lines labels Sep 26, 2024
@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-t26
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-o1
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

if "config" in subscription_data.get("spec", {}):
    params = '[{"op": "remove", "path": "/spec/config"}]'
    sub_obj.patch(resource_name=sub, params=params, format_type="json")
    time.sleep(180)
Contributor

Aren't we also supposed to remove the tolerations from the rook-ceph operator configmap and ocsinitializations.ocs.openshift.io?

Contributor Author

They are no longer added there; we now add the tolerations to the storagecluster YAML only.

Contributor

I see you have added tolerations to the configmap and ocsinitialization in apply_custom_taint_and_toleration(). Since this test can run on all ODF versions, both below and above 4.16, the cleanup should follow the same logic. Removing the toleration from the storagecluster alone might not be right for versions < 4.16.

Contributor

Please check and add cleanup accordingly.
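
A minimal sketch of what version-aware cleanup could look like, assuming (as this thread suggests, not as the ocs-ci code confirms) that ODF versions below 4.16 also carry tolerations on the rook-ceph operator configmap and the ocsinitialization resource. The names `parse_version`, `cleanup_targets`, and `remove_op`, and the resource labels in the list, are hypothetical and only illustrate the idea.

```python
import json


def parse_version(version):
    """Turn a 'major.minor' string such as '4.16' into a comparable tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def cleanup_targets(odf_version):
    """Return the resources whose tolerations should be removed (hypothetical)."""
    targets = ["storagecluster", "subscription"]
    if parse_version(odf_version) < (4, 16):
        # Assumption taken from the review thread: older versions also had
        # tolerations applied to these two resources.
        targets += ["rook-ceph-operator-config", "ocsinitialization"]
    return targets


def remove_op(path):
    """Build a JSON patch string that removes a single path."""
    return json.dumps([{"op": "remove", "path": path}])


print(cleanup_targets("4.15"))
print(remove_op("/spec/config"))
```

Comparing integer tuples rather than strings avoids ordering "4.9" after "4.16", and building the patch with `json.dumps` keeps the quoting correct if the path ever changes.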

def test_negative_custom_taint(self, nodes):
    """
    Test runs the following steps
    1. Taint odf nodes with non-ocs taint
Contributor

Suggested change
- 1. Taint odf nodes with non-ocs taint
+ 1. Taint odf worker nodes with non-ocs taint

Contributor Author

Done


assert not wait_for_pods_to_be_running(
    timeout=120, sleep=15
), "Pods are running when they should not be."
Contributor

Are we expecting all pods to go into a bad state?

Contributor

I see we apply tolerations on the storagecluster and on subscriptions other than ODF. Are we sure that no pods will be running if the toleration is simply not applied properly on the subscription while it is set properly on the storagecluster? Please check the scenario again. If the toleration is set properly on the storagecluster, a few pods should be up and running.

Contributor Author

No, we are not expecting all pods to be in a bad state; some will be Pending and some Running. wait_for_pods_to_be_running fails if even one pod is in the Pending state, so assert not wait_for_pods_to_be_running works fine for this case.

Contributor

Instead of checking all the pods, we can check only the pods affected by the subscription changes. This is not a blocker for the merge, though.
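
The narrower check suggested here could be sketched roughly as follows. This is an illustration only: pods are modelled as plain dicts rather than ocs-ci pod objects, and the function name `pending_affected_pods`, the pod names, and the prefixes are all hypothetical.

```python
def pending_affected_pods(pods, affected_prefixes):
    """Return names of in-scope pods that are not in the Running phase."""
    prefixes = tuple(affected_prefixes)
    return [
        pod["name"]
        for pod in pods
        if pod["name"].startswith(prefixes) and pod["phase"] != "Running"
    ]


# Illustrative data, not taken from a real cluster.
pods = [
    {"name": "odf-operator-abc", "phase": "Pending"},
    {"name": "rook-ceph-osd-0", "phase": "Running"},
    {"name": "noobaa-core-0", "phase": "Pending"},
]

# Only pods owned by the (hypothetical) affected operator are considered.
print(pending_affected_pods(pods, ["odf-operator"]))
```

With a helper like this, the negative test could assert that the returned list is non-empty for the affected pods instead of relying on any pod in the namespace being Pending.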

@Shrivaibavi Shrivaibavi requested review from a team and removed request for a team October 10, 2024 17:51
@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-m4
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

openshift-ci bot commented Nov 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vkathole

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
Signed-off-by: vkathole <[email protected]>
@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-m11
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vkathole-m11
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job UNSTABLE (some or all tests failed).

Signed-off-by: vkathole <[email protected]>
@@ -5392,3 +5401,130 @@ def verify_performance_profile_change(perf_profile):
    ), f"Performance profile is not updated successfully to {perf_profile}"
    logger.info(f"Performance profile successfully got updated to {perf_profile} mode")
    return True


def apply_custom_taint_and_toleration(taint_label):
Contributor

You can define the custom taint label here, in the argument.

Contributor Author

Done
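
For illustration, defining the label in the argument might look like the sketch below. The default value `xyz=true:NoSchedule` and the helper `build_taint_command` are hypothetical and are not taken from this PR; the only thing assumed about the CLI is the standard `oc adm taint nodes <node> <key>=<value>:<effect>` form.

```python
# Hypothetical default; the real label used by the test is not shown in this thread.
DEFAULT_TAINT_LABEL = "xyz=true:NoSchedule"


def build_taint_command(node_name, taint_label=DEFAULT_TAINT_LABEL):
    """Return the argument list for tainting one node with the given label."""
    return ["oc", "adm", "taint", "nodes", node_name, taint_label]


print(build_taint_command("worker-0"))
print(build_taint_command("worker-0", taint_label="team=qe:NoSchedule"))
```

Callers that are happy with the default can omit the argument, while a negative test can pass a different label without touching the helper.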

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job PASSED.

Signed-off-by: vkathole <[email protected]>
@ocs-ci ocs-ci left a comment

Unknown PR validation on existing cluster

Cluster Name: vkathole-m11
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job state: ABORTED.

@ocs-ci ocs-ci left a comment

PR validation

Cluster Name:
Cluster Configuration:
PR Test Suite: tier4b
PR Test Path: tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_non_ocs_taint_and_tolerations tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_reboot_on_tainted_node tests/functional/z_cluster/nodes/test_non_ocs_taint_and_toleration.py::TestNonOCSTaintAndTolerations::test_negative_custom_taint
Additional Test Params:
OCP VERSION: 4.17
OCS VERSION: 4.17
tested against branch: master

Job FAILED (installation failed, tests not executed).

logger.info(
    "After adding toleration wait for some time for pods to respin as expected"
)
time.sleep(300)
Contributor

Reinitializing the pod variable would be a better approach than sleeping. Otherwise, reduce the sleep time and the timeout on line 214.
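
As a rough sketch of replacing a fixed sleep with a condition-based wait, a small polling helper could look like this. The name `wait_until` is hypothetical and does not use any ocs-ci utility; in the test, the predicate would re-fetch the pod list on every call so the check always sees fresh pods.

```python
import time


def wait_until(predicate, timeout, interval):
    """Poll predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "re-fetch the pods and check that they respun":
# this predicate becomes true on its third call.
calls = []


def pods_respun():
    calls.append(1)
    return len(calls) >= 3


print(wait_until(pods_respun, timeout=5, interval=0.01))
```

The wait ends as soon as the condition holds, so a generous timeout does not slow down the passing case the way a fixed 300-second sleep does.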

from tests.functional.z_cluster.nodes.test_node_replacement_proactive import (
    delete_and_create_osd_node,
    select_osd_node_name,
)
Contributor

We might have to modify the @pytest.mark.polarion_id("OCS-2705") markers. I see many tests have been added to this class, so the markers should be updated accordingly.


Labels
size/XL team/e2e E2E team related issues/PRs
6 participants