CCIP-3461: Optimize mainnet soak test #1458

Merged
merged 11 commits into from
Sep 26, 2024

Collaborator

b-gopalswami commented Sep 20, 2024

Motivation

Optimize the resource and fund consumption of the mainnet soak test. Currently, the mainnet soak test runs every six hours in Kubernetes, executing one CCIP transaction per hour for a duration of five hours across 28 bidirectional lanes. This setup is designed to provide consistent observability data on the CCIP mainnet, allowing us to differentiate between service outages and quiet periods. However, these tests are inefficient, consuming Kubernetes resources and significant mainnet funds.

https://smartcontract-it.atlassian.net/browse/CCIP-3461

Ideas:

After a discussion with the o11y team, we decided that creating a tx every hour is not required; instead, create one tx only if there has been no activity in the last 24h.
Convert the soak test to a smoke test, since we plan to fire only one request.
Converting to a smoke test will alleviate the K8s resource consumption, as the test will run on a GitHub runner.

Solution

  • 1. Modify the pipeline to run it as a smoke test
  • 2. Add a traffic check to the smoke test
  • 3. Add a new set of 21 lanes, for a total of 49 bidirectional lanes (i.e. 98 unique lanes)
  • 4. Add RPCs and wallet keys for the new lanes to the test secrets
  • 5. Load funds for the new lanes
  • 6. Schedule the pipeline to run every 6 hrs

Key outcomes:

Present transaction count: 1 tx * (28 * 2) lanes * 24 hrs = 1344
After this change: 1 tx * (49 * 2) lanes = 98, which is about a 92.7% reduction, with coverage of 42 additional lanes.

As per the last cost analysis, we spend around 84k per quarter.
I expect a fund reduction close to 90%, which would save close to 300k annually.
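
For reference, the numbers above can be reproduced with a few lines of Go. This is just a worked check of the arithmetic in this description; the helper names are made up for the example, and the 90% figure is the expectation stated above, not a measured result.

```go
package main

import "fmt"

// txCount returns the number of transactions for a set of bidirectional
// lanes, counting each direction as its own lane.
func txCount(bidirectionalLanes, txPerLane int) int {
	return bidirectionalLanes * 2 * txPerLane
}

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after int) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	before := txCount(28, 24) // 1 tx per hour for 24 hrs on 28 bidirectional lanes
	after := txCount(49, 1)   // 1 tx on 49 bidirectional lanes
	fmt.Printf("before=%d after=%d reduction=%.1f%%\n",
		before, after, reductionPercent(before, after))

	quarterlySpend := 84000.0 // from the last cost analysis
	annualSpend := 4 * quarterlySpend
	expectedCut := 0.90 // expected fund reduction (an estimate)
	fmt.Printf("annual spend=%.0f expected saving=%.0f\n",
		annualSpend, expectedCut*annualSpend)
}
```

The annual spend works out to 336k, so a 90% cut is where the "close to 300k" figure comes from.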

@b-gopalswami b-gopalswami marked this pull request as ready for review September 21, 2024 02:09
@b-gopalswami b-gopalswami requested review from a team as code owners September 21, 2024 02:09
@b-gopalswami
Collaborator Author

Hey @matYang @kalverra @AnieeG @emate @andrevmatos @mateusz-sekara, could you all please take a look at this PR and share your feedback?

integration-tests/ccip-tests/actions/ccip_helpers.go (outdated, resolved)
@@ -133,8 +135,8 @@ jobs:
matrix:
config: [mainnet.toml]
needs: [ build-chainlink, build-test-image ]
Collaborator

Do we need to keep any of the load test options around if this is fully converting to just smoke? Is this used for other purposes?

Contributor

Also, if it's only a smoke test now, it might make sense to run it directly in a GitHub Action instead of on the remote runner. That would save a lot of time by dropping the step that builds the test image, and there would be no need for a K8s env either.

Collaborator Author

As discussed, I left that as is to have capability available and we can run it whenever we need to.

Collaborator Author

@kalverra @AnieeG Fixed: the workflow now runs on a GitHub runner, with a matrix option to handle the load in parallel and make debugging easier. I have also added an option to override the phase timeout, so that we can adjust it for slower chains.

]

BiDirectionalLane = true
-PhaseTimeout = '20m'
+PhaseTimeout = '40m'
Contributor

Why?

Collaborator Author

@b-gopalswami b-gopalswami Sep 23, 2024

The phase timeout varies widely between lanes, and I think we need a better solution that defines it per lane. The 20m timeout works for the fastest lanes but is too short for most of them. Even with this update to 40m, there may still be lanes that take longer than that!

Contributor

Need to report those lanes.

integration-tests/ccip-tests/smoke/ccip_test.go (outdated, resolved)
@AnieeG
Contributor

AnieeG commented Sep 23, 2024

Why not update the cron schedule as well?

@b-gopalswami
Collaborator Author

b-gopalswami commented Sep 24, 2024

> Why not update the cron schedule as well?

I think we can keep this as is and can reduce it if needed.

@b-gopalswami b-gopalswami merged commit 374482e into ccip-develop Sep 26, 2024
117 checks passed
@b-gopalswami b-gopalswami deleted the ccip-3461 branch September 26, 2024 17:45
5 participants