Skip to content

Commit 811f8e4

Browse files
committed
[FLINK-33392][docs] Add the documentation pages for balanced tasks scheduling.
1 parent 5b61b1c commit 811f8e4

File tree

7 files changed

+1510
-0
lines changed

7 files changed

+1510
-0
lines changed
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
---
2+
title: Tasks Scheduling
3+
bookCollapseSection: true
4+
weight: 9
5+
---
6+
<!--
7+
Licensed to the Apache Software Foundation (ASF) under one
8+
or more contributor license agreements. See the NOTICE file
9+
distributed with this work for additional information
10+
regarding copyright ownership. The ASF licenses this file
11+
to you under the Apache License, Version 2.0 (the
12+
"License"); you may not use this file except in compliance
13+
with the License. You may obtain a copy of the License at
14+
15+
http://www.apache.org/licenses/LICENSE-2.0
16+
17+
Unless required by applicable law or agreed to in writing,
18+
software distributed under the License is distributed on an
19+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
20+
KIND, either express or implied. See the License for the
21+
specific language governing permissions and limitations
22+
under the License.
23+
-->
Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
---
2+
title: Balanced Tasks Scheduling
3+
weight: 5
4+
type: docs
5+
6+
---
7+
<!--
8+
Licensed to the Apache Software Foundation (ASF) under one
9+
or more contributor license agreements. See the NOTICE file
10+
distributed with this work for additional information
11+
regarding copyright ownership. The ASF licenses this file
12+
to you under the Apache License, Version 2.0 (the
13+
"License"); you may not use this file except in compliance
14+
with the License. You may obtain a copy of the License at
15+
16+
http://www.apache.org/licenses/LICENSE-2.0
17+
18+
Unless required by applicable law or agreed to in writing,
19+
software distributed under the License is distributed on an
20+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
21+
KIND, either express or implied. See the License for the
22+
specific language governing permissions and limitations
23+
under the License.
24+
-->
25+
26+
# Balanced Tasks Scheduling
27+
28+
This page describes the background and principle of balanced tasks scheduling,
29+
how to use it when running streaming jobs.
30+
31+
## Background
32+
33+
When the parallelism of all vertices within a Flink streaming job is inconsistent,
34+
the [default strategy]({{< ref "docs/deployment/config" >}}#taskmanager-load-balance-mode)
35+
of Flink to deploy tasks sometimes leads some `TaskManagers` have more tasks while others have fewer tasks,
36+
resulting in excessive resource utilization at some `TaskManagers`
37+
that contain more tasks and becoming a bottleneck for the entire job processing.
38+
39+
{{< img src="/fig/deployments/tasks-scheduling/tasks_scheduling_skew_case.svg" alt="The Skew Case of Tasks Scheduling" class="offset" width="50%" >}}
40+
41+
As shown in figure (a), given a Flink job comprising two vertices, `JobVertex-A (JV-A)` and `JobVertex-B (JV-B)`,
42+
with parallelism degrees of `6` and `3` respectively,
43+
and both vertices sharing the same slot sharing group.
44+
Under the default tasks scheduling strategy, as illustrated in figure (b),
45+
the distribution of tasks across `TaskManagers` may result in significant disparities in task load.
46+
Specifically, the `TaskManager`s with the highest number of tasks may host `4` tasks,
47+
while the one with the lowest load may have only `2` tasks.
48+
Consequently, the `TaskManager`s bearing 4 tasks is prone to become a performance bottleneck for the entire job.
49+
50+
Therefore, Flink provides a task-quantity-based balanced tasks scheduling capability.
51+
Within the job's resource view, it aims to ensure that the number of tasks
52+
scheduled to each `TaskManager` as close as possible to, thereby improving the resource usage skew among `TaskManagers`.
53+
54+
## Principle
55+
56+
The task-quantity-based load balancing tasks scheduling strategy completes the assignment of tasks to `TaskManagers` in two phases:
57+
- The tasks-to-slots assignment phase
58+
- The slots-to-TaskManagers assignment phase
59+
60+
This section will use two examples to illustrate the simplified process and principle of
61+
how the task-quantity-based tasks scheduling strategy handles the assignments in these two phases.
62+
63+
### The tasks-to-slots assignment phase
64+
65+
Taking the job shown in figure (c) as an example, it contains five job vertices with parallelism degrees of `1`, `4`, `4`, `2`, and `3`, respectively.
66+
All five job vertices belong to the default slot sharing group.
67+
68+
{{< img src="/fig/deployments/tasks-scheduling/tasks_to_slots_allocation_principle.svg" alt="The Tasks To Slots Allocation Principle Demo" class="offset" width="65%" >}}
69+
70+
During the tasks-to-slots assignment phase, this tasks scheduling strategy:
71+
- First directly assigns the tasks of the vertices with the highest parallelism to the `i-th` slot.
72+
73+
That is, task `JV-Bi` is assigned directly to `sloti`, and task `JV-Ci` is assigned directly to `sloti`.
74+
75+
- Next, for tasks belonging to job vertices with sub-maximal parallelism, they are assigned in a round-robin fashion across the slots within the current
76+
slot sharing group until all tasks are allocated.
77+
78+
As shown in figure (e), under the task-quantity-based assignment strategy, the range (max-min difference) of the number of tasks per slot is `1`,
79+
which is better than the range of `3` under the default strategy shown in figure (d).
80+
81+
Thus, this ensures a more balanced distribution of the number of tasks across slots.
82+
83+
### The slots-to-TaskManagers assignment phase
84+
85+
As shown in figure (f), given a Flink job comprising two vertices, `JV-A` and `JV-B`, with parallelism of `6` and `3` respectively,
86+
and both vertices sharing the same slot sharing group.
87+
88+
{{< img src="/fig/deployments/tasks-scheduling/slots_to_taskmanagers_allocation_principle.svg" alt="The Slots to TaskManagers Allocation Principle Demo" class="offset" width="75%" >}}
89+
90+
The assignment result after the first phase is shown in figure (g),
91+
where `Slot0`, `Slot1`, and `Slot2` each contain `2` tasks, while the remaining slots contain `1` task each.
92+
93+
Subsequently:
94+
- The strategy submits all slot requests and waits until all slot resources required for the current job are ready.
95+
96+
Once the slot resources are ready:
97+
- The strategy then sorts all slot requests in descending order based on the number of tasks contained in each request.
98+
Afterwards, it sequentially assigns each slot request to the `TaskManager` with the smallest current tasks loading.
99+
This process continues until all slot requests have been allocated.
100+
101+
The final assignment result is shown in figure (i), where each `TaskManager` ends up with exactly `3` tasks,
102+
resulting in a task count difference of `0` between `TaskManagers`. In contrast, the scheduling result under the default strategy,
103+
shown in figure (h), has a task count difference of `2` between `TaskManagers`.
104+
105+
Therefore, theoretically, using this load balancing tasks scheduling strategy could effectively mitigate the issue of
106+
resource usage skew caused by significant disparities in the number of tasks across `TaskManagers` .
107+
108+
## Usage
109+
110+
You can enable balanced tasks scheduling through the following configuration item:
111+
112+
- `taskmanager.load-balance.mode`: `tasks`
113+
114+
## More details
115+
116+
See the <a href="https://cwiki.apache.org/confluence/x/U56zDw">FLIP-370</a> for more details.
117+
118+
{{< top >}}
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
---
2+
title: Tasks Scheduling
3+
bookCollapseSection: true
4+
weight: 9
5+
---
6+
<!--
7+
Licensed to the Apache Software Foundation (ASF) under one
8+
or more contributor license agreements. See the NOTICE file
9+
distributed with this work for additional information
10+
regarding copyright ownership. The ASF licenses this file
11+
to you under the Apache License, Version 2.0 (the
12+
"License"); you may not use this file except in compliance
13+
with the License. You may obtain a copy of the License at
14+
15+
http://www.apache.org/licenses/LICENSE-2.0
16+
17+
Unless required by applicable law or agreed to in writing,
18+
software distributed under the License is distributed on an
19+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
20+
KIND, either express or implied. See the License for the
21+
specific language governing permissions and limitations
22+
under the License.
23+
-->
Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
---
2+
title: Balanced Tasks Scheduling
3+
weight: 5
4+
type: docs
5+
6+
---
7+
<!--
8+
Licensed to the Apache Software Foundation (ASF) under one
9+
or more contributor license agreements. See the NOTICE file
10+
distributed with this work for additional information
11+
regarding copyright ownership. The ASF licenses this file
12+
to you under the Apache License, Version 2.0 (the
13+
"License"); you may not use this file except in compliance
14+
with the License. You may obtain a copy of the License at
15+
16+
http://www.apache.org/licenses/LICENSE-2.0
17+
18+
Unless required by applicable law or agreed to in writing,
19+
software distributed under the License is distributed on an
20+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
21+
KIND, either express or implied. See the License for the
22+
specific language governing permissions and limitations
23+
under the License.
24+
-->
25+
26+
# Balanced Tasks Scheduling
27+
28+
This page describes the background and principle of balanced tasks scheduling,
29+
how to use it when running streaming jobs.
30+
31+
## Background
32+
33+
When the parallelism of all vertices within a Flink streaming job is inconsistent,
34+
the [default strategy]({{< ref "docs/deployment/config" >}}#taskmanager-load-balance-mode)
35+
of Flink to deploy tasks sometimes leads some `TaskManagers` have more tasks while others have fewer tasks,
36+
resulting in excessive resource utilization at some `TaskManagers`
37+
that contain more tasks and becoming a bottleneck for the entire job processing.
38+
39+
{{< img src="/fig/deployments/tasks-scheduling/tasks_scheduling_skew_case.svg" alt="The Skew Case of Tasks Scheduling" class="offset" width="50%" >}}
40+
41+
As shown in figure (a), given a Flink job comprising two vertices, `JobVertex-A (JV-A)` and `JobVertex-B (JV-B)`,
42+
with parallelism degrees of `6` and `3` respectively,
43+
and both vertices sharing the same slot sharing group.
44+
Under the default tasks scheduling strategy, as illustrated in figure (b),
45+
the distribution of tasks across `TaskManagers` may result in significant disparities in task load.
46+
Specifically, the `TaskManager`s with the highest number of tasks may host `4` tasks,
47+
while the one with the lowest load may have only `2` tasks.
48+
Consequently, the `TaskManager`s bearing 4 tasks is prone to become a performance bottleneck for the entire job.
49+
50+
Therefore, Flink provides a task-quantity-based balanced tasks scheduling capability.
51+
Within the job's resource view, it aims to ensure that the number of tasks
52+
scheduled to each `TaskManager` as close as possible to, thereby improving the resource usage skew among `TaskManagers`.
53+
54+
## Principle
55+
56+
The task-quantity-based load balancing tasks scheduling strategy completes the assignment of tasks to `TaskManagers` in two phases:
57+
- The tasks-to-slots assignment phase
58+
- The slots-to-TaskManagers assignment phase
59+
60+
This section will use two examples to illustrate the simplified process and principle of
61+
how the task-quantity-based tasks scheduling strategy handles the assignments in these two phases.
62+
63+
### The tasks-to-slots assignment phase
64+
65+
Taking the job shown in figure (c) as an example, it contains five job vertices with parallelism degrees of `1`, `4`, `4`, `2`, and `3`, respectively.
66+
All five job vertices belong to the default slot sharing group.
67+
68+
{{< img src="/fig/deployments/tasks-scheduling/tasks_to_slots_allocation_principle.svg" alt="The Tasks To Slots Allocation Principle Demo" class="offset" width="65%" >}}
69+
70+
During the tasks-to-slots assignment phase, this tasks scheduling strategy:
71+
- First directly assigns the tasks of the vertices with the highest parallelism to the `i-th` slot.
72+
73+
That is, task `JV-Bi` is assigned directly to `sloti`, and task `JV-Ci` is assigned directly to `sloti`.
74+
75+
- Next, for tasks belonging to job vertices with sub-maximal parallelism, they are assigned in a round-robin fashion across the slots within the current
76+
slot sharing group until all tasks are allocated.
77+
78+
As shown in figure (e), under the task-quantity-based assignment strategy, the range (max-min difference) of the number of tasks per slot is `1`,
79+
which is better than the range of `3` under the default strategy shown in figure (d).
80+
81+
Thus, this ensures a more balanced distribution of the number of tasks across slots.
82+
83+
### The slots-to-TaskManagers assignment phase
84+
85+
As shown in figure (f), given a Flink job comprising two vertices, `JV-A` and `JV-B`, with parallelism of `6` and `3` respectively,
86+
and both vertices sharing the same slot sharing group.
87+
88+
{{< img src="/fig/deployments/tasks-scheduling/slots_to_taskmanagers_allocation_principle.svg" alt="The Slots to TaskManagers Allocation Principle Demo" class="offset" width="75%" >}}
89+
90+
The assignment result after the first phase is shown in figure (g),
91+
where `Slot0`, `Slot1`, and `Slot2` each contain `2` tasks, while the remaining slots contain `1` task each.
92+
93+
Subsequently:
94+
- The strategy submits all slot requests and waits until all slot resources required for the current job are ready.
95+
96+
Once the slot resources are ready:
97+
- The strategy then sorts all slot requests in descending order based on the number of tasks contained in each request.
98+
Afterwards, it sequentially assigns each slot request to the `TaskManager` with the smallest current tasks loading.
99+
This process continues until all slot requests have been allocated.
100+
101+
The final assignment result is shown in figure (i), where each `TaskManager` ends up with exactly `3` tasks,
102+
resulting in a task count difference of `0` between `TaskManagers`. In contrast, the scheduling result under the default strategy,
103+
shown in figure (h), has a task count difference of `2` between `TaskManagers`.
104+
105+
Therefore, theoretically, using this load balancing tasks scheduling strategy could effectively mitigate the issue of
106+
resource usage skew caused by significant disparities in the number of tasks across `TaskManagers` .
107+
108+
## Usage
109+
110+
You can enable balanced tasks scheduling through the following configuration item:
111+
112+
- `taskmanager.load-balance.mode`: `tasks`
113+
114+
## More details
115+
116+
See the <a href="https://cwiki.apache.org/confluence/x/U56zDw">FLIP-370</a> for more details.
117+
118+
{{< top >}}

0 commit comments

Comments
 (0)