forked from SchedMD/slurm
-
Notifications
You must be signed in to change notification settings - Fork 0
/
RELEASE_NOTES
205 lines (168 loc) · 10 KB
/
RELEASE_NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
RELEASE NOTES FOR SLURM VERSION 18.08
26 October 2017
IMPORTANT NOTES:
THE MAXJOBID IS NOW 67,108,863. ANY PRE-EXISTING JOBS WILL CONTINUE TO RUN BUT
NEW JOB IDS WILL BE WITHIN THE NEW MAXJOBID RANGE. Adjust your configured
MaxJobID value as needed to eliminate any confusion.
If using the slurmdbd (Slurm DataBase Daemon) you must update this first.
NOTE: If using a backup DBD you must start the primary first to do any
database conversion, the backup will not start until this has happened.
The 18.08 slurmdbd will work with Slurm daemons of version 17.02 and above.
You will not need to update all clusters at the same time, but it is very
important to update slurmdbd first and having it running before updating
any other clusters making use of it. No real harm will come from updating
your systems before the slurmdbd, but they will not talk to each other
until you do. Also at least the first time running the slurmdbd you need to
make sure your my.cnf file has innodb_buffer_pool_size equal to at least 64M.
You can accomplish this by adding the line
innodb_buffer_pool_size=64M
under the [mysqld] reference in the my.cnf file and restarting the mysqld. The
buffer pool size must be smaller than the size of the MySQL tmpdir. This is
needed when converting large tables over to the new database schema.
Slurm can be upgraded from version 17.02 or 17.11 to version 18.08 without loss
of jobs or other state information. Upgrading directly from an earlier version
of Slurm will result in loss of state information.
If using SPANK plugins that use the Slurm APIs, they should be recompiled when
upgrading Slurm to a new major release.
NOTE FOR THOSE UPGRADING SLURMDBD: The database conversion process from
SlurmDBD 16.05 or 17.02 may not work properly with MySQL 5.1 (as was the
default version for RHEL 6). Upgrading to a newer version of MariaDB or
MySQL is strongly encouraged to prevent this problem.
NOTE FOR THOSE RUNNING 17.11.[0|1]: It was found a seeded MySQL auto_increment
value would be lost eventually if used. This was found in the tres_table
which tracks static TRES under 1001. This was fixed in MariaDB >=10.2.4,
but at the time of writing this was still around in MySQL. Regardless if
you are tracking licenses or GRES in the database
(i.e. AccountingStorageTRES=gres/gpu) you might have this this issue.
This would mean the id for gres/gpu could have been issued 5 instead of
1001. This is fine uptil 17.11 where a new static TRES was added taking
up the id of 5. If you are already running 17.11 you can easily check to
see if you hit this problem by running 'sacctmgr list tres'. If you see
any entry in the Name column for the Type 'billing' TRES (id=5) you are
unfortunately hit with the bug. The fix for this issue requires manual
intervention with the database. Most likely if you started a slurmctld
up against the slurmdbd the overwritten TRES is now at a different id.
You can fix the double issue by altering all the tables with the new TRES
id back to 5, remove that entry in the tres_table, and then change the
Type of billing back to the original Type and restart the slurmdbd which
should finish the conversion. SchedMD can assist with this. Supported
sites please open a ticket at https://bugs.schedmd.com/. Non-supported
sites please contact SchedMD at [email protected] if you would like to
discuss commercial support options.
NOTE: The slurm.spec file used to build RPM packages has been aggressively
refactored, and some package names may now be different. Notably,
the three daemons (slurmctld, slurmd/slurmstepd, slurmdbd) each
have their own separate package with the binary and the appropriate
systemd service file, which will be installed automatically (but
not enabled).
The slurm-plugins, slurm-munge, and slurm-lua package has been removed,
and the contents moved in to the main slurm package.
The slurm-sql package has been removed, and merged in with the slurm
(job_comp_mysql.so) and slurm-slurmdbd (accounting_storage_mysql)
packages.
The example configuration files have been moved to slurm-example-configs.
NOTE: The refactored slurm.spec file now requires systemd to build. When
building on older distributions, you must use the older variant which
has been preserved as contribs/slurm.spec-legacy.
NOTE: The slurmctld is now set to fatal if there are any problems with
any state files. To avoid this use the new '-i' flag.
NOTE: systemd services files are installed automatically, but not enabled.
You will need to manually enable them on the appropriate systems:
- Controller: systemctl enable slurmctld
- Database: systemctl enable slurmdbd
- Compute Nodes: systemctl enable slurmd
NOTE: If you are not using Munge, but are using the "service" scripts to
start Slurm daemons, then you will need to remove this check from the
etc/slurm*service scripts.
NOTE: If you are upgrading with any jobs from 14.03 or earlier
(i.e. quick upgrade from 14.03 -> 15.08 -> 17.02) you will need
to wait until after those jobs are gone before you upgrade to 17.02
or 17.11 or 18.08.
NOTE: If you interact with any memory values in a job_submit plugin, you will
need to test against NO_VAL64 instead of NO_VAL, and change your printf
format as well.
NOTE: The SLURM_ID_HASH used for Cray systems has changed to fully use the
entire 64 bits of the hash. Previously the stepid was multiplied by
10,000,000,000 to make it easy to read both the jobid as well as the
stepid in the hash separated by at least a couple of zeros, but this
lead to overflow on the hash with steps like the batch and extern step
where they used all 32 bits to represent the step. While the new method
ruins the easy readability it fixes the more important overflow issue.
This most likely will go unnoticed by most, just a note of the change.
NOTE: Starting in 17.11 the slurm commands and daemons dynamically link to
libslurmfull.so instead of statically linking. This dramatically reduces
the footprint of Slurm. If for some reason this creates issues with
your build you can configure slurm with --without-shared-libslurm.
NOTE: Spank options handled in local and allocator contexts should be able to
handle being called multiple times. An option could be set multiple times
through environment variables and command line options. Environment
variables are processed first.
NOTE: IBM BlueGene/Q and Cray/ALPS modes are deprecated and will be removed
in an upcoming release. You must add the --enable-deprecated option to
configure to build these targets.
NOTE: Built-in BLCR support is deprecated, no longer built automatically, and
will be removed in an upcoming release. You must add --with-blcr and
--enable-deprecated options to configure to build this plugin.
NOTE: srun will now only read in the environment variables SLURM_JOB_NODES and
SLURM_JOB_NODELIST instead of SLURM_NNODES and SLURM_NODELIST. These
latter variables have been obsolete for some time please update any
scripts still using them.
HIGHLIGHTS
==========
-- Add support for parenthesis in a job's constraint specification to group
like options together. For example
--constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that
four nodes with the features "knl", "snc4" and "flat" plus one node with
the feature "haswell" are required.
-- Enable jobs with zero node count for creation and/or deletion of persistent
burst buffers.
-- Add "scontrol show dwstat" command to display Cray burst buffer status.
-- Heterogeneous job steps allocations supported with
* Open MPI (with Slurm's PMI2 and PMIx plugins) and
* Intel MPI (with Slurm's PMI2 plugin)
* No support for Cray systems at this time.
-- Add Slurm configuration file check logic using "slurmctld -t" command.
RPMBUILD CHANGES
================
CONFIGURATION FILE CHANGES (see man appropriate man page for details)
=====================================================================
-- Configuration parameters "ControlMachine", "ControlAddr", "BackupController"
and "BackupAddr" replaced by an ordered list of "SlurmctldHost" records
with the optional address appended to the name enclosed in parenthesis.
For example: "SlurmctldHost=head(12.34.56.78)". An arbitrary number of
backup servers can be configured.
-- Add "GetSysStatus" option to burst_buffer.conf file. For burst_buffer/cray
this would indicate the location of the "dwstat" command.
-- Add node and partition configuration options of "CpuBind" to control default
task binding. Modify the scontrol to report and modify these parameters.
-- Add "NumaCpuBind" option to knl.conf file to automatically change a node's
CpuBind parameter based upon changes to a node's NUMA mode.
-- Remove support for "ChosLoc" configuration parameter.
COMMAND CHANGES (see man pages for details)
===========================================
-- Add sbatch "--batch" option to identify features required on batch node.
For example "sbatch --batch=haswell ...".
-- Report cgroup and NodeFeatures plugin configuration with scontrol and
sview commands.
OTHER CHANGES
=============
API CHANGES
===========
Changed members of the following structs
========================================
Added members to the following struct definitions
=================================================
Added the following struct definitions
======================================
Removed members from the following struct definitions
=====================================================
Changed the following enums and #defines
========================================
Added the following API's
=========================
Changed the following API's
============================
Removed the following API's
===========================
-- Removed slurm_xslurm_strerrorcat() function.
-- Removed slurm_xstrstrip() function.