-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CPU and Max RSS Analysis tools #6663
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
cylc/flow/etc/job.sh
Outdated
@@ -138,6 +138,10 @@ cylc__job__main() { | |||
mkdir -p "$(dirname "${CYLC_TASK_WORK_DIR}")" || true | |||
mkdir -p "${CYLC_TASK_WORK_DIR}" | |||
cd "${CYLC_TASK_WORK_DIR}" | |||
|
|||
cylc profile & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Pass through a value for the -m
option.
cylc/flow/scripts/profile.py
Outdated
# Get the PID of the current process | ||
pid = os.getpid() | ||
# Get the cgroup information for the current process | ||
result = subprocess.run(['cat', '/proc/' + str(pid) + '/cgroup'], capture_output=True, text=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at using pathlib.Path
for filesystem path building / manipulation.
I think capture_output
captures both stdout and stderr, so if there is any error here, it will be swallowed. You might want to only capture stdout (stdout=subprocess.PIPE
) or report stderr if the process fails (print(result.stderr, file=sys.stderr)
).
cylc/flow/scripts/profile.py
Outdated
print(e) | ||
raise FileNotFoundError("cgroups not found:" + args.cgroup_location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probs best to write a user-friendly message to stderr, then raise the original exception, e.g:
print(e) | |
raise FileNotFoundError("cgroups not found:" + args.cgroup_location) | |
print(f'cgroups not found: {args.cgroup_location}', file=sys.stderr) | |
raise from None |
cylc/flow/scripts/profile.py
Outdated
except FileNotFoundError as e: | ||
print(e) | ||
raise FileNotFoundError("cgroups not found:" + args.cgroup_location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
cylc/flow/scripts/profile.py
Outdated
except (OSError, IOError, ValueError) as error: | ||
print(error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(should write the error to stderr not stdout)
I think this is here to tolerate failures in cgroup polling? I guess this might happen from time to time?
Perhaps the profiler should bail if, say, three polls fail in a row (indicating a systematic failure) rather than continuing with fruitless polling?
fb1b12b
to
c5d30b3
Compare
Changed the name of the profiler module. Linting Profiler sends KB instead of bytes Time Series now working CPU/Memory Logging working
4a6dbe8
to
30a7bb0
Compare
Initial profiler implementation (non working) Changed the name of the profiler module. Linting Profiler sends KB instead of bytes Time Series now working CPU/Memory Logging working Adding profiler unit tests updating tests Fail gracefully if cgroups cannot be found Revert "Fail gracefully if cgroups cannot be found" This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e. Linting Modifying unit tests Linting Changed the name of the profiler module. Profiler sends KB instead of bytes Time Series now working
Co-authored-by: Hilary James Oliver <[email protected]>
* The COPYING file appears to have moved from `dist-info/COPYING` into `dist-info/licenses/COPYING`.
Co-authored-by: Hilary James Oliver <[email protected]>
* Attempt to fix flaky test. * Cut out shell profile files to omit some spurious stderr.
Co-authored-by: Hilary James Oliver <[email protected]>
* Attempt to fix flaky test. * Cut out shell profile files to omit some spurious stderr.
Revert "Fail gracefully if cgroups cannot be found" This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e. Adding profiler unit tests updating tests
* Attempt to fix flaky test. * Cut out shell profile files to omit some spurious stderr.
30a7bb0
to
7091711
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
.. versionadded:: 8.0.0 | ||
''') | ||
Conf('cgroups path', VDR.V_STRING, '/sys/fs/cgroup') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a description for this configuration.
@@ -157,11 +163,23 @@ cylc__job__main() { | |||
cylc__set_return "$ret_code" | |||
fi | |||
} | |||
# Grab the max rss and cpu_time value before moving directory | |||
if [[ -f "max_rss" ]]; then | |||
max_rss=$(sed -n '1p' max_rss) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max_rss
file should only have a single value in it, so we shouldn't need to extract only the first line.
rm max_rss | ||
fi | ||
if [[ -f "cpu_time" ]]; then | ||
cpu_time=$(sed -n '1p' cpu_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
default=10, dest="delay") | ||
parser.add_option( | ||
"-m", type=str, help="Location of cgroups directory", | ||
default="/sys/fs/cgroup", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this default is hardcoded in the Cylc config, we could make this a required argument to avoid the need to hardcode this default twice.
def write_data(data, filename): | ||
with open(filename, 'w') as f: | ||
f.write(data + "\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adding one line to the file for each poll. The Cylc job script picks the first line of the file and sends it back to the scheduler. So I think we're sending back the first poll of the memory usage, not the max poll?
We only need to store the maximum values in these files, so we could do the maths in Python and keep this file to a single line:
with open('max_rss', 'w+') as max_rss:
prev_memory = 0
while keep_looping():
memory = poll_memory()
if memory > prev_memory:
max_rss.seek(0, 0) # wind back to the start of the file
max_rss.write(f'{memory}\n') # (over)write the first line
max_rss.flush()
prev_memory = memory
time.sleep(delay)
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
#------------------------------------------------------------------------------- | ||
# cylc profile test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.
I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.
We should test the jobs stderr for the line(s) written by the profiler script complaining of the fault.
[meta] | ||
title = "job script torture test" | ||
|
||
description = """Any task job script may fail regardless of user runtime | ||
settings if changes to cylc re-order the job script sections badly: e.g. | ||
"cylc task started" must be called after the CYLC_ environment variables | ||
are exported. Additionally, users may rely on the order of variable | ||
definition in each environment and script section: e.g. workflow | ||
bin directory must go in PATH before the task runtime environment is | ||
defined because workflow bin commands could be used in variable assignment | ||
expressions.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is relevant here?
echo "HELLO FROM INIT-SCRIPT" | ||
# define a variable | ||
export VAR_IS=is""" | ||
pre-script = """ | ||
echo "HELLO FROM PRE-SCRIPT" | ||
# init-script must be done: | ||
echo VAR_IS is $VAR_IS | ||
# user environment must be done: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for this test?
def get_config(args): | ||
# Find the cgroup that this process is running in. | ||
# Cylc will put this profiler in the same cgroup | ||
# as the job it is profiling | ||
cgroup_name = get_cgroup_name() | ||
cgroup_version = get_cgroup_version(args.cgroup_location, cgroup_name) | ||
process = get_cgroup_paths(cgroup_version, | ||
args.cgroups_location, | ||
cgroup_name) | ||
|
||
profile(process, cgroup_version, args.delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov has spotted a couple of test coverage holes here that it would be good to fill.
(please ignore the manylinux test failures, we'll be removing this test on master shortly) |
This apart of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.
This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.
This adds a python profiler script. This profiler will will be ran by cylc in the same crgroup as the cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the sql db file.
Linked to;
cylc/cylc-ui#2100
cylc/cylc-uiserver#675
Check List
CONTRIBUTING.md
and added my name as a Code Contributor.setup.cfg
(andconda-environment.yml
if present).?.?.x
branch.