Hello!
TL;DR: Use our handy “jobstats” command-line tool regularly to ensure you are submitting efficient jobs! Then, keep track of your efficiency on Grafana.
Did you know that, on average, jobs submitted to Monsoon’s scheduler leave roughly 80% of their requested memory and 90% of their requested time unused? Yes, it’s true, and you can see for yourself with our handy dandy job efficiency web app called Doppler.
The above link shows that since the beginning of the year, on average, 88% of requested memory and 93% of requested time have been wasted. Of course, it’s not possible to make a perfectly accurate resource request for your job, since job runs vary even for similar data sets. Still, with some work, it is very doable to make these requests much more accurate.
Why does being efficient matter?
If your job asked for 50GB of memory but used at most 1GB, your Slurm account is billed for 50GB for the elapsed time of your job! If you requested 10 cores but used only 1, your Slurm account is also billed for the 9 cores that you did not use! Why? Because no other researcher could use those resources while your job ran. Thus, your group’s fairshare is billed for them.
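To make the arithmetic concrete, here is a minimal sketch of the example above. It assumes a simplified billing model (requested amount times elapsed time); the function name and the 2-hour elapsed time are made up for illustration, and the real Slurm accounting may weight resources differently.

```python
def billed_vs_used(requested, used, elapsed_hours):
    """Return (billed, consumed, wasted) resource-hours for one job.

    Simplified model: the job is billed for what it requested,
    for as long as it ran, regardless of what it actually used.
    """
    billed = requested * elapsed_hours
    consumed = used * elapsed_hours
    return billed, consumed, billed - consumed


# 50 GB requested, 1 GB peak usage; the 2-hour runtime is hypothetical.
mem_billed, mem_used, mem_wasted = billed_vs_used(50, 1, 2.0)
print(f"memory: billed {mem_billed} GB-hours, used {mem_used}, wasted {mem_wasted}")

# 10 cores requested, 1 core actually used, same hypothetical runtime.
cpu_billed, cpu_used, cpu_wasted = billed_vs_used(10, 1, 2.0)
print(f"cores:  billed {cpu_billed} core-hours, used {cpu_used}, wasted {cpu_wasted}")
```

In this sketch, 98 of the 100 billed GB-hours and 18 of the 20 billed core-hours did no useful work.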
While use of Monsoon is free (isn’t that fantastic?), you are still billed (towards your group’s fairshare) for what you requested for the elapsed time of the job. Fairshare is essentially the real-time priority that Slurm computes for your group (and for you as well) by comparing resource usage across all active groups. This recorded usage decays over time. We have our fairshare decay half-life set at 1 day, so a group that stops using Monsoon will have 75% of its usage wiped away after two days.
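The 75% figure follows from halving the recorded usage once per day. A small sketch, assuming the 1-day decay setting acts as a half-life (which matches the two-day figure above); the function name is hypothetical:

```python
def remaining_usage_fraction(days_idle, half_life_days=1.0):
    """Fraction of past usage still counted after `days_idle` days of no new usage."""
    return 0.5 ** (days_idle / half_life_days)


for days in (1, 2, 3):
    left = remaining_usage_fraction(days)
    print(f"after {days} day(s): {left:.1%} of usage remains, {1 - left:.1%} wiped")
```

After one idle day half of the usage is gone, and after two days only a quarter remains.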
Efficient resource requests by you, your group, and all research groups combined affect:
- The number of concurrent jobs that you and your group can have running at a time
- The number of resources you and your group have access to
- The time jobs spend in the pending state
- Total compute throughput for Monsoon
One big point that we try to drive home in our trainings is to be efficient with your Slurm resource requests. This is because each group is given a maximum amount of CPUs, GPUs, and memory that it can use at any point in time. When one of these group limits is hit (usually the memory limit), your jobs cannot start until some of your group’s running jobs have advanced further through their requested time or finished. When this happens, you will see the pending reason “AssocGrpMemRunMinutes”, meaning that your group has hit its limit on total memory. A similar reason is shown for CPUs and GPUs.
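Here is a rough sketch of why oversized requests hit this limit sooner. It assumes the limit counts, for every running job in the group, allocated memory multiplied by the minutes left on the job’s time limit; the limit value, job sizes, and function names are all hypothetical and not Monsoon’s actual settings.

```python
def mem_run_minutes(running_jobs):
    """Sum of (allocated memory in GB) x (minutes left on the time limit)."""
    return sum(job["mem_gb"] * job["minutes_left"] for job in running_jobs)


def can_start(running_jobs, new_job, group_limit):
    """True if the new job fits under the group's memory run-minutes limit."""
    committed = mem_run_minutes(running_jobs)
    needed = new_job["mem_gb"] * new_job["time_limit_min"]
    return committed + needed <= group_limit


GROUP_LIMIT = 100_000  # hypothetical limit, in GB-minutes

running = [{"mem_gb": 50, "minutes_left": 1440}]   # 72,000 GB-minutes committed
lean = {"mem_gb": 4, "time_limit_min": 600}        # needs 2,400 GB-minutes
bloated = {"mem_gb": 50, "time_limit_min": 1440}   # needs 72,000 GB-minutes

print("lean job can start:   ", can_start(running, lean, GROUP_LIMIT))
print("bloated job can start:", can_start(running, bloated, GROUP_LIMIT))

# Later, the running job has only 500 minutes left on its time limit,
# so less is committed and the bloated job now fits.
later = [{"mem_gb": 50, "minutes_left": 500}]
print("bloated job, later:   ", can_start(later, bloated, GROUP_LIMIT))
```

Under these assumptions the lean job starts right away, while the bloated one has to wait for the running job to work through its time.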
What can be done about this?
For starters, aim lean in your jobs by default. If they fail due to insufficient resources, try doubling the request, and repeat. After a job runs to completion, run our “jobstats” utility to find areas where you can improve; we highlight the most glaring ones in red. You can also run jobstats on running jobs with “jobstats -r”.
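The “start lean, then double” approach can be sketched as a tiny simulation. The job’s true peak memory need is of course unknown in practice; here it is a made-up number used only to show how few attempts doubling takes.

```python
def doubling_attempts(start_gb, peak_need_gb):
    """List the memory requests tried when doubling after each failure.

    `peak_need_gb` stands in for the job's real (normally unknown) peak usage:
    a request smaller than it "fails", and the next attempt asks for twice as much.
    """
    if start_gb <= 0:
        raise ValueError("start_gb must be positive")
    request = start_gb
    attempts = [request]
    while request < peak_need_gb:
        request *= 2
        attempts.append(request)
    return attempts


# Hypothetical job that really needs 6 GB, first submitted with 1 GB.
print(doubling_attempts(1, 6))
```

In this example the fourth attempt succeeds with 8 GB, which is far closer to the real need than a 50 GB guess would have been.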
But how do you know where to start?
Usually, the resources a program/data set needs on your workstation will be roughly what it needs on Monsoon, so try starting there.
But I have datasets that require varying amounts of memory or time!
Try to bundle similarly sized data together in job arrays or sets of batch jobs. Keep in mind that with Slurm job arrays, each child job gets the same resources requested by the parent. So, for arrays, ensure the data sets referenced in the array are of similar size/complexity.
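One possible way to do the bundling is to sort data sets into size buckets and submit one array per bucket. The sketch below groups by powers of two; the file names and sizes are invented, and it assumes that data set size is a reasonable stand-in for the memory or time a job needs, which may not hold for your workload.

```python
import math
from collections import defaultdict


def bucket_by_size(sizes_gb):
    """Group data set names into power-of-two size buckets (upper bound in GB)."""
    buckets = defaultdict(list)
    for name, size in sizes_gb.items():
        # Round each size up to the next power of two; anything <= 1 GB shares one bucket.
        upper = 2 ** math.ceil(math.log2(size)) if size > 1 else 1
        buckets[upper].append(name)
    return dict(sorted(buckets.items()))


# Hypothetical data sets and their sizes in GB.
datasets = {"a.dat": 0.8, "b.dat": 1.5, "c.dat": 1.9, "d.dat": 7.0, "e.dat": 6.2}

for upper_gb, names in bucket_by_size(datasets).items():
    print(f"array for data sets up to {upper_gb} GB: {names}")
```

Each bucket can then become its own job array, with a resource request sized for that bucket rather than for the largest data set overall.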
Why do we have limits?
Limits help in many ways such as:
- allowing quick job start times for everyone
- preventing big groups/individuals from squeezing out the little groups/individuals
- keeping Monsoon running smoothly and its researchers as happy as possible!
Curious how you’re doing efficiency-wise on Monsoon? Check it out here on our Doppler metrics app! We have our groups and individual researchers ranked here.
You’ll see groups listed under “Account Ranking”, and individual researchers listed under “User Ranking”. Each column can be sorted by clicking on its header. You can change the effective time period in the upper-right corner of the page.
We will soon offer incentives for being efficient in the form of extra priority and/or access to more resources. Stay tuned!
More info on our limiting mechanism (TRESRunMins) can be found here.
Any questions? Send us an email at ask-arc@nau.edu. We’ll be glad to help you become more efficient!