Advanced Research Computing
Efficient jobs
Submitting efficient jobs to Slurm is your ticket to shorter queue times and maximum resources. Inefficiently submitted jobs hurt both your own and your group's ability to make full use of the cluster. Here are some of the benefits of submitting jobs efficiently:
- Faster job start times
- More resources for you
- More resources for your group
- Higher utilization of Monsoon
- More work being done in general
How can you make your jobs more efficient? It's pretty easy! Run our jobstats wrapper script and it will give you a clear comparison of the parameters that matter most for a job's efficiency:
- Job time
- CPU usage
- Memory usage
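Tightening those three requests happens in your batch script. As a sketch (the job name, script name, and the specific numbers here are illustrative, not recommendations), a script with right-sized requests might look like:

```shell
#!/bin/bash
#SBATCH --job-name=metaseq       # hypothetical job name
#SBATCH --time=00:30:00          # ask for only slightly more time than the job needs
#SBATCH --mem=80G                # size memory close to the MaxRSS jobstats reports
#SBATCH --cpus-per-task=8        # request only the cores the application will use

srun ./my_analysis.sh            # hypothetical analysis script
```

After the job completes, compare these requests against what jobstats reports and adjust for the next run.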
Let’s look at a couple examples of inefficient jobs:
JobID         JobName   ReqMem    MaxRSS     ReqCPUS  UserCPU    Timelimit   Elapsed   State
6124735       MetaSeQ   126000Mn             8        44:12.097  2-00:00:00  00:14:33  COMPLETED
6124735.bat+  batch     126000Mn  3360K      8        00:00.008              00:14:33  COMPLETED
6124735.0     batch     126000Mn  70060316K  8        44:12.088              00:14:18  COMPLETED
The first line is the overall parent record for the job, the second is the batch portion, and the third is where the work actually took place. Steps with a job ID of the form jobid.0 are where you should look closely when trimming down memory. The parent line shows the overall stats for the job (excluding memory) and the sum of time across all steps of the job: submission, prep, and the run of the analysis code.
- The memory request of 126GB (ReqMem) is overestimated by roughly 45% versus the 70GB (MaxRSS) that was actually used
- The time estimate of 2 days (Timelimit) is overestimated by a huge amount, over 99%, versus the elapsed 14 minutes (Elapsed)
- The UserCPU number has something amiss as well. Ideally, this number should be close to n * elapsed, where n is the number of cores requested; here that would be 8 * ~14 minutes, or about 114 minutes, yet only about 44 minutes of CPU time were used, so the job is not keeping all of its requested cores busy. (A UserCPU well above n * elapsed can instead mean the application is starting more threads than the cores it asked for.) If either is happening to you, ask us to help you look into it.
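As a quick sanity check, you can compare UserCPU against cores * elapsed with simple shell arithmetic (the rounded figures below come from the job record above):

```shell
#!/bin/sh
# A fully utilized job should show roughly cores * elapsed of UserCPU.
cores=8
elapsed_min=14       # Elapsed was about 14 minutes
usercpu_min=44       # UserCPU was about 44 minutes

expected=$((cores * elapsed_min))
echo "expected: ${expected} CPU-minutes, actual: ${usercpu_min}"
# Actual (44) is well below expected (112), so this job is not
# keeping all 8 of its requested cores busy.
```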
JobID         JobName     ReqMem  MaxRSS     ReqCPUS  UserCPU    Timelimit   Elapsed   State
6125571       metseq_re+  126Gn              4        22:44:10   2-00:00:00  05:59:14  COMPLETED
6125571.bat+  batch       126Gn   3364K      4        00:00.014              05:59:14  COMPLETED
6125571.0     batch       126Gn   19047016K  4        22:44:09               05:59:10  COMPLETED
The job above is not efficient with its memory or time limit requests, but it is efficient with CPU usage (UserCPU).
- Look at ReqMem vs MaxRSS on the third line. 126GB requested vs. 18GB utilized
- The time is 2 days requested vs. 6 hours elapsed
- The CPU time is right on, however. Expected would be about 24 hours of user CPU (6 hours * 4 CPUs), and we see ~22 hours.
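The same back-of-the-envelope math covers the memory and time overestimates above (figures rounded from the job record):

```shell
#!/bin/sh
# Memory: requested vs. actually used (in GB).
req_mem=126
used_mem=18
mem_over=$(( (req_mem - used_mem) * 100 / req_mem ))
echo "memory overestimated by ${mem_over}%"        # about 85%

# Time: requested vs. elapsed (in minutes).
req_min=2880     # Timelimit of 2 days
used_min=359     # Elapsed of 5:59:14
time_over=$(( (req_min - used_min) * 100 / req_min ))
echo "time limit overestimated by ${time_over}%"   # about 87%
```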
Note: When the scheduler is looking for resources for your job, it has to find a window of time that matches all of your requests for resources. If time limit, memory, or CPU is overestimated, your job’s start time will suffer.
Another thing to keep in mind regarding efficient jobs: each group on Monsoon is given a fixed amount of CPU-minutes that can be in use at any one time. Currently that number is set at 700,000. With this many minutes a group can potentially run:
- 20 jobs, 4 days (5760 minutes) in length, with 6 cores each: 20 * 5760 * 6 = 691200
- 100 jobs, 1 day in length, with 4 cores each: 100 * 1440 * 4 = 576000
- 500 jobs, 2 hours in length, with 11 cores each: 500 * 120 * 11 = 660000
- 1 job, 1.5 days in length, with 256 cores: 1 * 2160 * 256 = 552960
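Each scenario above is simply jobs * minutes * cores, checked against the 700,000-minute limit:

```shell
#!/bin/sh
limit=700000
# jobs * minutes * cores for each scenario in the list above
s1=$((20 * 5760 * 6))      # 20 jobs, 4 days, 6 cores
s2=$((100 * 1440 * 4))     # 100 jobs, 1 day, 4 cores
s3=$((500 * 120 * 11))     # 500 jobs, 2 hours, 11 cores
s4=$((1 * 2160 * 256))     # 1 job, 1.5 days, 256 cores
for s in $s1 $s2 $s3 $s4; do
    [ "$s" -le "$limit" ] && echo "$s fits under $limit"
done
```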
Any jobs submitted that would put the group over 700,000 CPU-minutes will sit pending until more minutes become available. Note: as running jobs progress, the minutes they contribute to the sum shrink as their remaining time limit decreases.
The takeaway from all of this: the efficiency of your jobs will dictate how many Monsoon resources you and your group get. It's that simple.
Please consider running jobstats now and then to see how efficient your jobs are. The following would grab all of your jobs from January 1st to now:
jobstats -u userid -S 1/1/17
Note: you can also run this for any of your group members to take a look at how their jobs are faring.