Commit 223b74d

Merge pull request #27 from SIMEXP/enh/alliance-resources
[ENH] Wiki content for Alliance resource tracking
2 parents a1e96f5 + 9603f8e commit 223b74d

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Resource tracking

Tracking resource usage helps you understand how your jobs are currently consuming cluster resources and lets you monitor for unexpected changes (e.g., a large, sudden change in priority can reveal that jobs are less efficient than expected).
In this section, we provide general guidelines for checking your account's (and its associated allocation's) resource usage, both through a [GUI-based dashboard](#using-the-metrix-portal) and directly [at the command line](#tracking-resource-usage-at-an-account-level).
We also give an overview of how to [check current usage patterns on a given cluster](#requesting-resources), so that you can make an informed choice about which resources to request.

## Using the Metrix portal

Alliance Canada maintains a [Metrix portal](https://docs.alliancecan.ca/wiki/Metrix) for each of its clusters.
Please refer to the Alliance Canada documentation for each cluster (e.g., [Rorqual](https://docs.alliancecan.ca/wiki/Rorqual/en)) for the most up-to-date link to the portal.

This is the most user-friendly way to access all of the information we discuss below.
However, it is still a good idea to understand how to access this information outside of the Metrix portal: because the data is pulled in real time, it may fail to populate if the system is oversubscribed.
If there is a known problem with the Metrix portal, it should be documented on <https://status.alliancecan.ca>.

## Tracking resource usage at an account level

To view all users in a given allocation (e.g., `rrg-pbellec_gpu`) or in multiple allocations, we can run:

```{bash}
sshare -l --accounts=rrg-pbellec_gpu -a
```

We can also pass multiple allocations in a comma-separated list.
To view only a subset of users within allocations, pass the `-u` flag with the relevant user name(s):

```{bash}
sshare -l --accounts=rrg-pbellec_cpu,def-pbellec_cpu -u emdupre
```
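
When the full `sshare -l` table is wider than needed, it can help to keep only a few columns. The sketch below assumes `sshare` is run with `-P` (pipe-delimited output according to the `sshare` man page; worth confirming with `man sshare` on your cluster). `pick_columns` is a hypothetical helper, not part of Slurm, and the sample rows are invented for illustration so the snippet runs without a scheduler.

```bash
# Hypothetical helper (not part of Slurm): keep only the named columns
# from pipe-delimited text whose first line is a header row.
pick_columns() {
  awk -F'|' -v want="$*" '
    NR == 1 {
      n = split(want, names, " ")             # requested column names
      for (i = 1; i <= NF; i++) col[$i] = i   # header name -> column position
      next
    }
    {
      line = ""; sep = ""
      for (k = 1; k <= n; k++) {
        if (!(names[k] in col)) continue      # skip names absent from the header
        line = line sep $(col[names[k]])
        sep = "\t"
      }
      print line
    }'
}

# Invented sample standing in for the output of `sshare -l -P ...`.
sample='Account|User|RawShares|NormShares|EffectvUsage|LevelFS
rrg-example_cpu|alice|1|0.500000|0.250000|2.000000
rrg-example_cpu|bob|1|0.500000|0.750000|0.666667'

# Prints one tab-separated line per user: the user name, then LevelFS.
printf '%s\n' "$sample" | pick_columns User LevelFS
```

On a cluster, the same helper could be fed real output, for example `sshare -l -P --accounts=rrg-pbellec_gpu -a | pick_columns User LevelFS`.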

We can interpret each of the returned fields following [Alliance Canada's documentation](https://docs.alliancecan.ca/wiki/Job_scheduling_policies#Priority_and_fair-share):

- `RawShares` is proportional to the number of CPU-years that was granted to the project for use on this cluster in the Resource Allocation Competition.
- `NormShares` is the number of shares assigned to the user (or account) divided by the total number of assigned shares within the level.
- `RawUsage` is calculated from the total number of resource-seconds (that is, CPU time, GPU time, and memory) that have been charged to this account. **Past usage is discounted with a half-life of one week, so usage more than a few weeks in the past will have only a small effect on priority.**
- `EffectvUsage` is the account's usage normalized with its parent; that is, the project's usage relative to other projects, and the user's usage relative to other users in that project.
- `LevelFS` is the account's fairshare value compared to its siblings, calculated as `NormShares` / `EffectvUsage`.
  * If an account is over-served, the value is between 0 and 1.
  * If an account is under-served, the value is greater than 1.
  * Accounts with no usage receive the highest possible value, `inf` or "infinity".
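
The arithmetic behind two of these fields can be sketched as follows. Both functions are hypothetical helpers written for illustration, not Slurm commands. `level_fs` applies the `NormShares` / `EffectvUsage` ratio given above, and `usage_weight` assumes plain exponential decay with a 7-day half-life, which only approximates how the scheduler actually applies the discount.

```bash
# Hypothetical helper: LevelFS = NormShares / EffectvUsage.
# An account with no usage gets "inf", matching the description above.
level_fs() {
  awk -v n="$1" -v e="$2" 'BEGIN {
    if (e == 0) print "inf"
    else printf "%.2f\n", n / e
  }'
}

# Hypothetical helper: fraction of past usage that still counts after a
# given number of days, ASSUMING simple exponential decay with a 7-day half-life.
usage_weight() {
  awk -v d="$1" 'BEGIN { printf "%.4f\n", 0.5 ^ (d / 7) }'
}

level_fs 0.10 0.20   # 0.50 (over-served: usage share exceeds allocated share)
level_fs 0.10 0.05   # 2.00 (under-served)
level_fs 0.10 0      # inf  (no usage)

usage_weight 7       # 0.5000 (one week ago counts half)
usage_weight 28      # 0.0625 (four weeks ago counts about 6%)
```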

### Tracking resource usage at a job level

The Alliance Canada documentation has many resources for [monitoring jobs, including tracking their resource usage](https://docs.alliancecan.ca/wiki/Monitoring_jobs).

One useful command for a detailed report on a pending, running, or recently completed job is `scontrol` (jobs that finished longer ago are no longer held by the scheduler and need `sacct` or `seff` instead):

```{bash}
scontrol show job -dd <JOBID>
```
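
The `scontrol` report is a long list of `Key=Value` pairs, so it can be convenient to pull out a single field. The sketch below uses a hypothetical helper, `job_field`, which is not part of Slurm, and an invented, shortened sample report. It assumes fields are separated by whitespace, so values that themselves contain spaces will be truncated.

```bash
# Hypothetical helper: print the value of one Key=Value field read from stdin.
job_field() {
  tr -s ' ' '\n' |                        # one Key=Value pair per line
    awk -F'=' -v key="$1" '$1 == key {
      sub(/^[^=]*=/, "")                  # drop "Key=", keep any later "=" signs
      print
      exit
    }'
}

# Invented, shortened sample standing in for `scontrol show job -dd` output.
sample_job='JobId=1234567 JobName=example
   UserId=alice(1001) GroupId=alice(1001)
   JobState=RUNNING Reason=None
   RunTime=00:12:34 TimeLimit=01:00:00
   TRES=cpu=4,mem=16G,node=1'

printf '%s\n' "$sample_job" | job_field JobState   # RUNNING
printf '%s\n' "$sample_job" | job_field TRES       # cpu=4,mem=16G,node=1
```

Against a real job, the report would be piped in directly, with the job ID in place of the placeholder used above.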

## Requesting resources

To _request_ resources more efficiently, we can use the `partition-stats` command:

```{bash}
partition-stats
```

We can interpret its output again following the [Alliance Canada documentation](https://docs.alliancecan.ca/wiki/Job_scheduling_policies#Percentage_of_the_nodes_you_have_access_to).
Specifically, the command will return:

- how many jobs are waiting to run ("queued") in each partition,
- how many jobs are currently running,
- how many nodes are currently idle, and
- how many nodes are assigned to each partition.

### Estimating start time for a given job

Because Alliance clusters [implement backfilling](https://docs.alliancecan.ca/wiki/Job_scheduling_policies#Backfilling), jobs are not started strictly in priority order; the scheduler also considers which resources are currently available.
The start time for a given job can be estimated using:

```{bash}
squeue --start -j <JOBID>
```

Note, though, that this is only an estimate: it depends on several factors, including whether other, currently running jobs have requested accurate time limits.
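
To turn an estimated start time into a waiting time, a small helper can subtract the current time from it. This is a sketch under stated assumptions: `time_until_start` is a hypothetical function (not a Slurm command), it relies on GNU `date` (the `-d` option, as found on Linux), and it assumes timestamps in the `YYYY-MM-DDTHH:MM:SS` form, with `N/A` meaning that no estimate is available yet.

```bash
# Hypothetical helper: time remaining between "now" and an estimated start.
# $1: estimated start time, or N/A
# $2: optional reference time (defaults to the current local time)
time_until_start() {
  local start="$1"
  local now="${2:-$(date +%Y-%m-%dT%H:%M:%S)}"
  if [ "$start" = "N/A" ]; then
    echo "no estimate yet"
    return 0
  fi
  # Convert both timestamps to seconds since the epoch (GNU date).
  local delta=$(( $(date -d "$start" +%s) - $(date -d "$now" +%s) ))
  if [ "$delta" -lt 0 ]; then delta=0; fi   # estimate already in the past
  printf '%dh%02dm\n' $(( delta / 3600 )) $(( (delta % 3600) / 60 ))
}

time_until_start 2024-01-15T12:00:00 2024-01-15T10:30:00   # 1h30m
time_until_start N/A                                        # no estimate yet
```

On a cluster, the first argument could come from `squeue --start -h -o %S -j` followed by the job ID, where `-h` drops the header and `%S` selects the start-time column.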
