Overview
The Advanced Research Computing (ARC) group facilitates access to and support of High-Performance Computing (HPC) resources in general, and Monsoon specifically. Cluster resources are available to NAU faculty and staff for use in their research projects, and to students who are sponsored by a research faculty member. Sponsorship of a student implies that the sponsor is responsible for ensuring acceptable use of cluster resources by the sponsored individual.
Access to Monsoon can be obtained by submitting a New Monsoon User Request form. Once you have an account, you can log in over ssh with your NAU credentials.
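For example, from a terminal (the hostname below is an assumption; use the login address provided when your account is created):
ssh your_nau_id@monsoon.hpc.nau.edu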
Scheduler and resource manager
Scheduling and resource management on Monsoon are handled by Slurm. Job priority is determined by a fairshare policy in addition to other factors such as job age and job size. Here are the essential Slurm commands you’ll want to know:
- squeue – inspect the queue, including running and queued jobs
ex: squeue -t PD
ex: squeue -t R
- sinfo – inspect the cluster state, including queues and nodes
ex: sinfo -l -N
- sbatch – submit batch jobs
ex: sbatch <job-script-file>
- srun – submit parallel jobs
ex: srun <job-command>
- salloc – allocate resources for an interactive session
ex: salloc -N 2
- scontrol – control and inspect the state of your own jobs
ex: scontrol show job <jobid#>
- scancel – cancel your jobs
ex: scancel <jobid#>
The man pages are very well done; check them out for more info. For example, “man sbatch”.
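As a sketch of what a batch job looks like, here is a minimal script using common sbatch options; the job name, resources, and command below are placeholders to adapt to your own work:
#!/bin/bash
#SBATCH --job-name=example        # name shown by squeue
#SBATCH --output=example-%j.out   # %j expands to the job ID
#SBATCH --time=00:10:00           # walltime limit (hh:mm:ss)
#SBATCH --cpus-per-task=1         # CPUs for this job
#SBATCH --mem=1G                  # memory for this job
srun hostname                     # replace with your own command
Submit it with “sbatch <job-script-file>” and check on it with “squeue -u $USER”.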
Partitions
core – includes all of the nodes; 14-day run limit
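To target a partition explicitly, request it in your job script or on the command line, for example:
#SBATCH --partition=core
or
sbatch -p core <job-script-file>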
TRESRunMins
TRESRunMins is a Slurm limit assigned to an account (or QOS) that caps the total number of remaining CPU-minutes that your running jobs can occupy. Having this feature enabled on Monsoon helps with:
- Flexible resource limiting
- Staggering jobs
- Increasing cluster utilization
- More accurate resource requests
The current value for the limit is 3,000,000 CPU-minutes. This value is sometimes increased as cluster utilization drops to allow folks to use the idle cores. To calculate the TRESRunMins for your jobs, multiply the number of CPUs each job uses by its remaining time limit in minutes, then sum that product across all of your running jobs:
TRESRunMins = sum over running jobs of ( CPUs × minutes remaining )
Examples (each combination totals roughly 500,000 CPU-minutes):
- 24 jobs × 1 CPU × 2-week time limit
- 49 jobs × 1 CPU × 1-week time limit
- 347 jobs × 1 CPU × 1-day time limit
- 21 jobs × 16 CPUs × 1-day time limit
- 130 jobs × 16 CPUs × 4-hour time limit
- 520 jobs × 4 CPUs × 4-hour time limit
To see the current TRESRunMins for a single account or all accounts, use
sshare -l -A <account name> # single account
sshare -l # all accounts
In the output, the pertinent column will be labeled CPURunMins and will be the farthest to the right. This number changes dynamically as jobs change state.
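As a rough, unofficial sketch, you can also estimate the CPU run-minutes your own running jobs occupy by summing CPUs × time remaining from squeue. This assumes squeue’s “%L” (time left) field, which prints [days-]hours:minutes:seconds:
squeue -u $USER -t R -h -o "%C %L" | awk '
{
  cpus = $1; t = $2; days = 0
  if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }  # strip leading days
  n = split(t, p, ":")                                                # hh:mm:ss, mm:ss, or ss
  if (n == 3)      mins = p[1]*60 + p[2] + p[3]/60
  else if (n == 2) mins = p[1] + p[2]/60
  else             mins = p[1]/60
  total += cpus * (days*1440 + mins)
}
END { printf "Approximate CPU run-minutes in use: %d\n", total }'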
Storage
/common
- This is the cluster “common” area.
- Cluster dependencies reside here, as well as areas for users to share contributions (contrib)
/scratch
- 450TB
- This is the primary shared working storage
- Write/read your temporary files, logs, and final products here
- 30-day retention period on files; warning emails are sent at 28 days – no quotas
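To see which of your files are approaching that window, something like the following works (the /scratch/$USER path is an assumption about where your files live):
find /scratch/$USER -type f -mtime +25   # files not modified in the last 25 days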
/projects
- 500TB
- 5TB of free storage per faculty member (more can be purchased)
- This is a long-term storage solution
- Built on ZFS for enterprise-grade data integrity, scale, and performance
- 30 Gbps throughput
/packages
- Packages and modules
/home
- Keep scripts and other small files here
- This area is not meant for heavy writes like temp files, logs, and checkpoint files
- This area is snapshotted twice a day, and a total of 4 snapshots are kept in /home/.snapshot (see the example after this list)
- 10 GB quota
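To recover an older copy of a file, list the available snapshots and copy the version you need back out; the directory layout under each snapshot is an assumption here, so browse it first:
ls /home/.snapshot                                  # list available snapshots
cp /home/.snapshot/<snapshot-name>/<your-file> ~/   # hypothetical path; adjust to the actual layout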
/tmp
- 120GB
- Local node storage
All storage areas are also available around campus, off of Monsoon, via SMB at \\shares.hpc.nau.edu\cirrus.
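On Windows, that path can be entered directly in File Explorer; on macOS, Finder’s “Connect to Server” accepts smb://shares.hpc.nau.edu/cirrus. From a Linux machine, a connection sketch with smbclient might look like this (the NAU domain prefix is an assumption; authenticate with your NAU credentials):
smbclient //shares.hpc.nau.edu/cirrus -U NAU\\your_nau_id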
Configured modules
- List the available modules on Monsoon: “module avail”
- List the currently loaded modules in your login session: “module list”
- Load a module: “module load <module>”
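Modules are typically loaded the same way inside a batch script, right before the command that needs them; the module name below is a hypothetical placeholder, so pick one that “module avail” actually lists:
#!/bin/bash
#SBATCH --job-name=with-modules
#SBATCH --time=00:05:00
module load <module>/<version>   # hypothetical; choose from "module avail"
srun <your-command>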