# Checking on running jobs
### Checking on the status of your Job:
If you would like to check the status of your job, you can use the `squeue` command. Typing `squeue` without any options will output all currently running or queued jobs to your terminal window, but there are many options to help display relevant information; to find more of these options, type `man squeue` when logged in to a CARC machine. To see which jobs are running and queued, type the following in a terminal window:

```bash
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
155161 bigmem job1 usr1 PD 0:00 1 (Resources)
155071 bigmem job2 usr2 R 17:17:00 1 easley050
155068 bigmem job3 usr3 R 17:29:37 1 easley050
152827 debug job4 usr4 PD 0:00 1 (PartitionTimeLimit)
```

The output of `squeue` shows the job ID, partition, job name, job owner, job status (such as pending (PD) or running (R)), the number of nodes allocated, and either the reason a job is pending or the names of the nodes on which it is running. To view a specific job without listing every job in the queue, use the job ID with `squeue -j <jobID>`, or filter by user with `squeue -u <username>`. Additionally, you can use `squeue --me` to view only your jobs.
For example:

```bash
squeue -j 155161
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
155161 bigmem job1 usr1 PD 0:00 1 (Resources)
```
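
These filters can also be combined with a state filter via `-t`/`--states`. The sketch below is illustrative, reusing the hypothetical user and pending job from the example above:

```bash
# Only your own jobs:
squeue --me

# Only usr1's pending jobs:
squeue -u usr1 -t PENDING
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
155161 bigmem job1 usr1 PD 0:00 1 (Resources)
```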

A useful Slurm option is `squeue -l` (long format), which displays more detailed job information than `squeue` alone; in addition to the standard fields, it prints each job's state unabbreviated and adds its wall-time limit.
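
For example, illustrative long-format output for the jobs shown earlier (the leading timestamp line is printed by `squeue -l` itself, and the time limits here are hypothetical):

```bash
squeue -l
Tue Feb 19 13:45:50 2019
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
155161 bigmem job1 usr1 PENDING 0:00 4-00:00:00 1 (Resources)
155071 bigmem job2 usr2 RUNNING 17:17:00 4-00:00:00 1 easley050
```
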
The `scontrol show job <jobID>` command in Slurm provides a “full” display of information about a job. It shows details such as the job name, owner, run time, memory, time limit, job state, paths to the output and error files, the executing nodes, core allocation, and other relevant information.
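
An abbreviated sketch of the output, with hypothetical values for the running job above (the real display contains many more fields):

```bash
scontrol show job 155071
JobId=155071 JobName=job2
   UserId=usr2(10002) GroupId=users(100)
   JobState=RUNNING Reason=None
   RunTime=17:17:00 TimeLimit=4-00:00:00
   Partition=bigmem NodeList=easley050
   NumNodes=1 NumCPUs=8 TRES=cpu=8,mem=64G,node=1
   StdErr=/home/usr2/slurm-155071.out
   StdOut=/home/usr2/slurm-155071.out
   WorkDir=/home/usr2
```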

`watch squeue -u <username>` provides a continuously updating view of that user's jobs, refreshing every 2 seconds by default.
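
For example, to watch only your own jobs and refresh every 10 seconds instead (`-n` sets the interval; press Ctrl+C to exit):

```bash
watch -n 10 squeue --me
```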

### Determining which nodes your Job is using:
If you would like to check which nodes your job is using, you can pass the `-j` option to `squeue` and look at the NODELIST(REASON) column. When your job is finished, your processes on each node will be killed by the system, and the node will be released back into the available resource pool.

```bash
squeue -j 156510
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
156510 l40s interact usr R 2:02 1 easley056
```
Here, the node that this job is running on is easley056.
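
If you only need the node list itself (to pass to another command, for example), `squeue`'s `-o`/`--format` option can print just that column; `%N` is the format specifier for allocated nodes, and `-h` suppresses the header. A minimal sketch using the job above:

```bash
squeue -j 156510 -h -o "%N"
easley056
```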

### Viewing Output and Error Files:
Once your job has completed, you should see its output in the directory from which you submitted the job. By default, Slurm writes both standard output and standard error to a single file named slurm-JobID.out (where JobID refers to the ID of the job returned by `sbatch`); your batch script can request a separate error file, such as slurm-JobID.err, with the `--error` option.
For the example job above, the default output file would be named `slurm-155161.out`.
Any output from the job sent to “standard output” will be written to the output file, and any output sent to “standard error” will be written to the error file if one was requested (otherwise it is merged into the output file). The amount of information in these files varies depending on the program being run and how the sbatch batch script was set up.
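
If you prefer separate output and error files, the batch script can request them explicitly. A minimal sketch (`%j` expands to the job ID; the job name and command are hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=job1
#SBATCH --output=slurm-%j.out   # standard output
#SBATCH --error=slurm-%j.err    # standard error, kept separate

echo "hello from job $SLURM_JOB_ID"
```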


