Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU and Max RSS Analysis tools #6663

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

ChrisPaulBennett
Copy link

@ChrisPaulBennett ChrisPaulBennett commented Mar 12, 2025

This apart of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a python profiler script. This profiler will will be ran by cylc in the same crgroup as the cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the sql db file.

Linked to;
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@ChrisPaulBennett ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -138,6 +138,10 @@ cylc__job__main() {
mkdir -p "$(dirname "${CYLC_TASK_WORK_DIR}")" || true
mkdir -p "${CYLC_TASK_WORK_DIR}"
cd "${CYLC_TASK_WORK_DIR}"

cylc profile &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Pass through a value for the -m option.

# Get the PID of the current process
pid = os.getpid()
# Get the cgroup information for the current process
result = subprocess.run(['cat', '/proc/' + str(pid) + '/cgroup'], capture_output=True, text=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at using pathlib.Path for filesystem path building / manipulation.

I think capture_output captures both stdout and stderr, so if there is any error here, it will be swallowed. You might want to only capture stdout (stdout=subprocess.PIPE) or report stderr if the process fails (print(result.stderr, file=sys.stderr)).

Comment on lines 144 to 145
print(e)
raise FileNotFoundError("cgroups not found:" + args.cgroup_location)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probs best to write a user-friendly message to stderr, then raise the original exception, e.g:

Suggested change
print(e)
raise FileNotFoundError("cgroups not found:" + args.cgroup_location)
print(f'cgroups not found: {args.cgroup_location}', file=sys.stderr)
raise from None

Comment on lines 152 to 154
except FileNotFoundError as e:
print(e)
raise FileNotFoundError("cgroups not found:" + args.cgroup_location)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Comment on lines 168 to 169
except (OSError, IOError, ValueError) as error:
print(error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(should write the error to stderr not stdout)

I think this is here to tolerate failures in cgroup polling? I guess this might happen from time to time?

Perhaps the profiler should bail if, say, three polls fail in a row (indicating a systematic failure) rather than continuing with fruitless polling?

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working
@ChrisPaulBennett ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from 4a6dbe8 to 30a7bb0 Compare April 1, 2025 15:31
ChrisPaulBennett and others added 19 commits April 2, 2025 09:34
Initial profiler implementation (non working)

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working

Adding profiler unit tests

updating tests

Fail gracefully if cgroups cannot be found

Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Linting

Modifying unit tests
Linting
Changed the name of the profiler module.

Profiler sends KB instead of bytes

Time Series now working
* The COPYING file appears to have moved from `dist-info/COPYING` into
  `dist-info/licenses/COPYING`.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Adding profiler unit tests

updating tests
oliver-sanders and others added 3 commits April 2, 2025 09:34
@ChrisPaulBennett ChrisPaulBennett marked this pull request as ready for review April 2, 2025 14:20
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

.. versionadded:: 8.0.0
''')
Conf('cgroups path', VDR.V_STRING, '/sys/fs/cgroup')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a description for this configuration.

@@ -157,11 +163,23 @@ cylc__job__main() {
cylc__set_return "$ret_code"
fi
}
# Grab the max rss and cpu_time value before moving directory
if [[ -f "max_rss" ]]; then
max_rss=$(sed -n '1p' max_rss)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max_rss file should only have a single value in it, so we shouldn't need to extract only the first line.

rm max_rss
fi
if [[ -f "cpu_time" ]]; then
cpu_time=$(sed -n '1p' cpu_time)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

default=10, dest="delay")
parser.add_option(
"-m", type=str, help="Location of cgroups directory",
default="/sys/fs/cgroup",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this default is hardcoded in the Cylc config, we could make this a required argument to avoid the need to hardcode this default twice.

Comment on lines +103 to +105
def write_data(data, filename):
with open(filename, 'w') as f:
f.write(data + "\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding one line to the file for each poll. The Cylc job script picks the first line of the file and sends it back to the scheduler. So I think we're sending back the first poll of the memory usage, not the max poll?

We only need to store the maximum values in these files, so we could do the maths in Python and keep this file to a single line:

with open('max_rss', 'w+') as max_rss:
    prev_memory = 0
    while keep_looping():
        memory = poll_memory()
        if memory > prev_memory:
            max_rss.seek(0, 0)  # wind back to the start of the file
            max_rss.write(f'{memory}\n')  # (over)write the first line
            max_rss.flush()
            prev_memory = memory
        time.sleep(delay)

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#-------------------------------------------------------------------------------
# cylc profile test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will run regular background jobs, no slurm / pbs / whatever, so no cgroups.

I think this is testing that the profiler will not cause the job to fail, even if it cannot poll cgroups? Which is worthwhile testing.

We should test the jobs stderr for the line(s) written by the profiler script complaining of the fault.

Comment on lines +1 to +11
[meta]
title = "job script torture test"

description = """Any task job script may fail regardless of user runtime
settings if changes to cylc re-order the job script sections badly: e.g.
"cylc task started" must be called after the CYLC_ environment variables
are exported. Additionally, users may rely on the order of variable
definition in each environment and script section: e.g. workflow
bin directory must go in PATH before the task runtime environment is
defined because workflow bin commands could be used in variable assignment
expressions."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is relevant here?

Comment on lines +20 to +27
echo "HELLO FROM INIT-SCRIPT"
# define a variable
export VAR_IS=is"""
pre-script = """
echo "HELLO FROM PRE-SCRIPT"
# init-script must be done:
echo VAR_IS is $VAR_IS
# user environment must be done:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for this test?

Comment on lines +171 to +181
def get_config(args):
# Find the cgroup that this process is running in.
# Cylc will put this profiler in the same cgroup
# as the job it is profiling
cgroup_name = get_cgroup_name()
cgroup_version = get_cgroup_version(args.cgroup_location, cgroup_name)
process = get_cgroup_paths(cgroup_version,
args.cgroups_location,
cgroup_name)

profile(process, cgroup_version, args.delay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov has spotted a couple of test coverage holes here that it would be good to fill.

@oliver-sanders
Copy link
Member

(please ignore the manylinux test failures, we'll be removing this test on master shortly)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants