[FLINK-37232][runtime] Fix for broken synchronization assumption on the AdaptiveScheduler's side introduced by FLIP-272 #26088

Merged: 2 commits, Feb 4, 2025

@ztison ztison commented Jan 28, 2025

What is the purpose of the change

This PR fixes an issue with the rescale state-transition shortcut introduced in FLIP-472, where the WaitingForResources state is skipped and the scheduler transitions directly from the Restarting state to the CreatingExecutionGraph state.

Brief change log

  • Pass the available VertexParallelism that led to the rescale decision to the Restarting state and, once the job is cancelled, check whether that parallelism has changed.
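
The idea in the change log can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class, enum, and method names are hypothetical, a plain Map stands in for VertexParallelism, and falling back to the waiting state when the parallelism has changed is an assumption about the intended behavior.

```java
import java.util.Map;

// Illustrative sketch only: every name below is hypothetical, not Flink's actual API.
final class RestartShortcutSketch {

    /** Where the sketch goes once the cancelled job has terminated. */
    enum NextState { CREATING_EXECUTION_GRAPH, WAITING_FOR_RESOURCES }

    /** Hypothetical Restarting state that remembers the parallelism behind the rescale decision. */
    static final class Restarting {
        private final Map<String, Integer> parallelismAtDecision;

        Restarting(Map<String, Integer> parallelismAtDecision) {
            // Defensive copy so later changes by the caller cannot alter the snapshot.
            this.parallelismAtDecision = Map.copyOf(parallelismAtDecision);
        }

        /** Takes the shortcut only if the available parallelism is unchanged (assumed fallback otherwise). */
        NextState onJobCancelled(Map<String, Integer> availableNow) {
            return parallelismAtDecision.equals(availableNow)
                    ? NextState.CREATING_EXECUTION_GRAPH
                    : NextState.WAITING_FOR_RESOURCES;
        }
    }

    public static void main(String[] args) {
        Restarting restarting = new Restarting(Map.of("source", 2, "sink", 2));
        // Same parallelism as at decision time: the shortcut is safe.
        System.out.println(restarting.onJobCancelled(Map.of("source", 2, "sink", 2)));
        // Resources changed while the job was being cancelled: do not take the shortcut.
        System.out.println(restarting.onJobCancelled(Map.of("source", 1, "sink", 1)));
    }
}
```

The point of the comparison is that the shortcut is only valid while the assumption made at decision time still holds; if slots were lost during cancellation, reusing the stale parallelism could leave the job without enough resources.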

Verifying this change

This change is already covered by existing tests, such as (please describe tests).

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot commented Jan 28, 2025

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@XComp XComp left a comment

Looks good. CI is failing due to spotless.

./mvnw -pl flink-runtime spotless:apply should do the trick.

@ztison ztison changed the title [FLINK-37232][runtime] Fix for broken synchronization assumption on t… [FLINK-37232][runtime] Fix for broken synchronization assumption on the AdaptiveScheduler's side introduced by FLIP-272 Jan 29, 2025

@XComp XComp left a comment

Thanks for going through my comments. Looks good from my end. 👍

ztison commented Jan 29, 2025

@XComp Thanks for the review. I addressed the comments and updated the description. PTAL.

ztison commented Jan 29, 2025

@flinkbot run azure

@davidradl

@ztison looks like your bot command did not take; we still see yesterday's CI FAILURE

XComp commented Jan 29, 2025

@flinkbot run azure

…he AdaptiveScheduler's side introduced by FLIP-272
@airlock-confluentinc airlock-confluentinc bot force-pushed the AUTO-1788 branch 2 times, most recently from 0d2c955 to a60f95b Compare February 3, 2025 15:31

@XComp XComp left a comment

Good job and thanks for adding the ITCase. 👍

I verified that the ITCase reproduces the error when the changes are not applied:

StandaloneDispatcher [flink-pekko.actor.default-dispatcher-4] INFO  Job 98f97effc400a2ee5004f62ca5c4cea4 reached terminal state FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough resources available for scheduling.
	at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.lambda$determineParallelism$26(AdaptiveScheduler.java:1136)

I have one question, though. PTAL

@XComp XComp left a comment

Nothing else to add. Thanks for fixing the issue. 👍 Please provide backports for release-2.0 and release-2.0-preview1-rc1
