Monitoring Memory Usage
A job’s maximum resident set size (MaxRSS) is the largest amount of memory the job occupies at any point during its execution. Understanding the memory behavior of your jobs allows you and others on the cluster to manage and use the available resources more efficiently.
Inadequate Memory Usage
A job may fail because an insufficient amount of memory was requested. Once Slurm detects that a job has exceeded the maximum amount of memory it requested, the job is killed. Conversely, if a user overestimates a job’s memory usage, resources are likely to be wasted. An Out-Of-Memory error looks like this:
slurmstepd: error: Detected 1 oom_kill event in StepID=2003044.0. Some of the step tasks have been OOM Killed.
srun: error: cn2: task 0: Out Of Memory
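If a job is killed this way, the usual remedy is to raise the memory request in the batch script. The sketch below is illustrative only; the job name, program, and the 8G figure are placeholder assumptions, not recommended values for any particular cluster:

#!/bin/bash
#SBATCH --job-name=mem_test        # placeholder job name
#SBATCH --ntasks=1
#SBATCH --time=00:10:00            # placeholder time limit
#SBATCH --mem=8G                   # total memory for the job; increase this if the job is OOM-killed

srun ./my_analysis                 # hypothetical program whose memory use you want to measure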
Time Command
Before submitting your job, the time command is a useful tool for measuring the maximum resident set size and checking whether the amount of memory you plan to request is adequate, so that the job neither runs out of memory nor wastes resources. The time command gives a straightforward estimate of the MaxRSS while you are testing your work, before comparing it with what Slurm reports after a job submission.
Bash has a builtin time command that is used by default, but it does not offer the same functionality, so you need to provide the full path to the external time binary or prefix it with command. Here is a sample command using time:
command time -v echo "hi"
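The verbose output includes a "Maximum resident set size" line, which is the MaxRSS figure to compare against your memory request. The exact layout and numbers below are illustrative and will differ on your system:

	Command being timed: "echo hi"
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
	Maximum resident set size (kbytes): 1984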
Note: The time command and Slurm use different methods to measure memory usage, so there may be slight variations in the elapsed time and MaxRSS between the two measurements.
The MaxRSS reported by the time command represents the peak memory usage during a job’s execution, while the MaxRSS reported by Slurm is the memory usage recorded for a job that has completed.
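After a job finishes, you can check the value Slurm recorded with sacct; the job ID below is a placeholder taken from the error message above, and the chosen format fields are just one reasonable selection:

sacct -j 2003044 --format=JobID,JobName,Elapsed,MaxRSS,State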