Advanced Research Computing
Efficient jobs
Submitting efficient jobs to Slurm is your ticket to shorter queue times and maximum resources. Inefficiently submitted jobs hurt both your own and your group's ability to make full use of the cluster. Here are some of the benefits of submitting jobs efficiently:
- Faster job start times
- More resources for you
- More resources for your group
- Higher utilization of Monsoon
- More work being done in general
How can you make your jobs more efficient? It's pretty easy! Run our jobstats wrapper script and it will give you a clear comparison of the parameters that matter most for a job's efficiency:
- Job time
- CPU usage
- Memory usage
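Tightening those three requests happens in your batch script. As a sketch (the job name, script name, and the specific numbers here are illustrative, not recommendations), a script with right-sized requests might look like:

```shell
#!/bin/bash
#SBATCH --job-name=metaseq       # hypothetical job name
#SBATCH --time=00:30:00          # ask for only slightly more time than the job needs
#SBATCH --mem=80G                # size memory close to the MaxRSS jobstats reports
#SBATCH --cpus-per-task=8        # request only the cores the application will use

srun ./my_analysis.sh            # hypothetical analysis script
```

After the job completes, compare these requests against what jobstats reports and adjust for the next run.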
Let’s look at a couple examples of inefficient jobs:
JobID         JobName   ReqMem    MaxRSS     ReqCPUS  UserCPU    Timelimit   Elapsed   State
6124735       MetaSeQ   126000Mn             8        44:12.097  2-00:00:00  00:14:33  COMPLETED
6124735.bat+  batch     126000Mn  3360K      8        00:00.008              00:14:33  COMPLETED
6124735.0     batch     126000Mn  70060316K  8        44:12.088              00:14:18  COMPLETED
The first line is the overall parent record for the job, the second is the batch portion, and the third is where the work actually took place. Steps with a job ID of the form jobid.0 are where you should look closely when trimming down memory. The parent line shows the overall stats for the job (excluding memory) and the sum of time across all steps of the job: submission, prep, and the run of the analysis code.
- The memory request of 126GB (ReqMem) is overestimated by roughly 45% versus the 70GB (MaxRSS) that was actually used
- The time estimate of 2 days (Timelimit) is overestimated by a huge amount, over 99%, versus the elapsed 14 minutes (Elapsed)
- The UserCPU number has something amiss as well. Ideally, this number should be close to n * elapsed, where n is the number of cores requested; here that would be 8 * ~14 minutes, or about 114 minutes, yet only about 44 minutes of CPU time were used, so the job is not keeping all of its requested cores busy. (A UserCPU well above n * elapsed can instead mean the application is starting more threads than the cores it asked for.) If either is happening to you, ask us to help you look into it.
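As a quick sanity check, you can compare UserCPU against cores * elapsed with simple shell arithmetic (the rounded figures below come from the job record above):

```shell
#!/bin/sh
# A fully utilized job should show roughly cores * elapsed of UserCPU.
cores=8
elapsed_min=14       # Elapsed was about 14 minutes
usercpu_min=44       # UserCPU was about 44 minutes

expected=$((cores * elapsed_min))
echo "expected: ${expected} CPU-minutes, actual: ${usercpu_min}"
# Actual (44) is well below expected (112), so this job is not
# keeping all 8 of its requested cores busy.
```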
JobID         JobName     ReqMem  MaxRSS     ReqCPUS  UserCPU    Timelimit   Elapsed   State
6125571       metseq_re+  126Gn              4        22:44:10   2-00:00:00  05:59:14  COMPLETED
6125571.bat+  batch       126Gn   3364K      4        00:00.014              05:59:14  COMPLETED
6125571.0     batch       126Gn   19047016K  4        22:44:09               05:59:10  COMPLETED
The job above is not efficient with its memory or time limit requests, but it is efficient with CPU usage (UserCPU).
- Look at ReqMem vs MaxRSS on the third line. 126GB requested vs. 18GB utilized
- The time is 2 days requested vs. 6 hours elapsed
- The CPU time is right on, however. Expected would be about 24 hours of user CPU (6 hours * 4 CPUs), and we see ~22 hours.
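The same back-of-the-envelope math covers the memory and time overestimates above (figures rounded from the job record):

```shell
#!/bin/sh
# Memory: requested vs. actually used (in GB).
req_mem=126
used_mem=18
mem_over=$(( (req_mem - used_mem) * 100 / req_mem ))
echo "memory overestimated by ${mem_over}%"        # about 85%

# Time: requested vs. elapsed (in minutes).
req_min=2880     # Timelimit of 2 days
used_min=359     # Elapsed of 5:59:14
time_over=$(( (req_min - used_min) * 100 / req_min ))
echo "time limit overestimated by ${time_over}%"   # about 87%
```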
Note: When the scheduler is looking for resources for your job, it has to find a window of time that matches all of your requests for resources. If time limit, memory, or CPU is overestimated, your job’s start time will suffer.
Another thing to keep in mind regarding efficient jobs: each group on Monsoon is given a fixed amount of CPU-minutes that can be in use at any one time. Currently that number is set at 700,000. With this many minutes a group can potentially run:
- 20 jobs, 4 days (5760 minutes) in length, with 6 cores each: 20 * 5760 * 6 = 691200
- 100 jobs, 1 day in length, with 4 cores each: 100 * 1440 * 4 = 576000
- 500 jobs, 2 hours in length, with 11 cores each: 500 * 120 * 11 = 660000
- 1 job, 1.5 days in length, with 256 cores: 1 * 2160 * 256 = 552960
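Each scenario above is simply jobs * minutes * cores, checked against the 700,000-minute limit:

```shell
#!/bin/sh
limit=700000
# jobs * minutes * cores for each scenario in the list above
s1=$((20 * 5760 * 6))      # 20 jobs, 4 days, 6 cores
s2=$((100 * 1440 * 4))     # 100 jobs, 1 day, 4 cores
s3=$((500 * 120 * 11))     # 500 jobs, 2 hours, 11 cores
s4=$((1 * 2160 * 256))     # 1 job, 1.5 days, 256 cores
for s in $s1 $s2 $s3 $s4; do
    [ "$s" -le "$limit" ] && echo "$s fits under $limit"
done
```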
Any jobs submitted that would put the group over 700,000 CPU-minutes will sit pending until more minutes become available. Note: as running jobs progress, the minutes they contribute to the sum shrink as their remaining time limit decreases.
The takeaway from all of this: the efficiency of your jobs will dictate how many Monsoon resources you and your group get. It's that simple.
Please consider running jobstats now and then to see how efficient your jobs are. The following would grab all of your jobs from January 1st to now:
jobstats -u userid -S 1/1/17
Note: you can also run this for any of your group members to take a look at how their jobs are faring.