Overview
The Advanced Research Computing (ARC) group facilitates access to and support of High-Performance Computing (HPC) resources in general, and Monsoon specifically. Cluster resources are available to NAU faculty and staff for use in their research projects, and to students who are sponsored by a research faculty member. Sponsorship of a student implies that the sponsor is responsible for ensuring acceptable use of cluster resources by the sponsored individual.
Access to Monsoon can be obtained by submitting a New Monsoon User Request form. Once you have an account, you can log in over ssh with your NAU credentials.
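For example, from a terminal (the hostname below is an assumption; use the login address provided when your account is created):
ssh your_nau_id@monsoon.hpc.nau.edu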
Scheduler and resource manager
Scheduling and resource management on Monsoon are handled by Slurm. Job priority is determined by a fairshare policy in addition to other factors such as job age and job size. Here are the essential Slurm commands you’ll want to know:
- squeue – inspect the queue, including running and queued jobs
ex: squeue -t PD
ex: squeue -t R
- sinfo – inspect the cluster state, including queues and nodes
ex: sinfo -l -N
- sbatch – submit batch jobs
ex: sbatch <job-script-file>
- srun – submit parallel jobs
ex: srun <job-command>
- salloc – allocate resources for an interactive session
ex: salloc -N 2
- scontrol – control and inspect the state of your own jobs
ex: scontrol show job <jobid#>
- scancel – cancel your jobs
ex: scancel <jobid#>
The man pages are very well done; check them out for more info. For example, “man sbatch”.
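As a sketch of what a batch job looks like, here is a minimal script using common sbatch options; the job name, resources, and command below are placeholders to adapt to your own work:
#!/bin/bash
#SBATCH --job-name=example        # name shown by squeue
#SBATCH --output=example-%j.out   # %j expands to the job ID
#SBATCH --time=00:10:00           # walltime limit (hh:mm:ss)
#SBATCH --cpus-per-task=1         # CPUs for this job
#SBATCH --mem=1G                  # memory for this job
srun hostname                     # replace with your own command
Submit it with “sbatch <job-script-file>” and check on it with “squeue -u $USER”.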
Partitions
core – includes all of the nodes; 14-day run limit
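To target a partition explicitly, request it in your job script or on the command line, for example:
#SBATCH --partition=core
or
sbatch -p core <job-script-file>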
TRESRunMins
TRESRunMins is a Slurm limit assigned to an account (or QOS) that caps the total number of remaining CPU-minutes that your running jobs can occupy. Having this feature enabled on Monsoon helps with:
- Flexible resource limiting
- Staggering jobs
- Increasing cluster utilization
- More accurate resource requests
The current value for the limit is 3,000,000 CPU-minutes. This value is sometimes increased as cluster utilization drops to allow folks to use the idle cores. To calculate the TRESRunMins for your jobs, multiply the number of CPUs each job uses by its remaining time limit in minutes, then sum that product across all of your running jobs:
TRESRunMins = sum over running jobs of ( CPUs × minutes remaining )
Examples (each combination totals roughly 500,000 CPU-minutes):
- 24 jobs × 1 CPU × 2-week time limit
- 49 jobs × 1 CPU × 1-week time limit
- 347 jobs × 1 CPU × 1-day time limit
- 21 jobs × 16 CPUs × 1-day time limit
- 130 jobs × 16 CPUs × 4-hour time limit
- 520 jobs × 4 CPUs × 4-hour time limit
To see the current TRESRunMins for a single account or all accounts, use
sshare -l -A <account name> # single account
sshare -l # all accounts
In the output, the pertinent column will be labeled CPURunMins and will be the farthest to the right. This number changes dynamically as jobs change state.
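As a rough, unofficial sketch, you can also estimate the CPU run-minutes your own running jobs occupy by summing CPUs × time remaining from squeue. This assumes squeue’s “%L” (time left) field, which prints [days-]hours:minutes:seconds:
squeue -u $USER -t R -h -o "%C %L" | awk '
{
  cpus = $1; t = $2; days = 0
  if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }  # strip leading days
  n = split(t, p, ":")                                                # hh:mm:ss, mm:ss, or ss
  if (n == 3)      mins = p[1]*60 + p[2] + p[3]/60
  else if (n == 2) mins = p[1] + p[2]/60
  else             mins = p[1]/60
  total += cpus * (days*1440 + mins)
}
END { printf "Approximate CPU run-minutes in use: %d\n", total }'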
Storage
/common
- This is the cluster “common” area.
- Cluster dependencies reside here, as well as areas for users to share contributions (contrib)
/scratch
- 450TB
- This is the primary shared working storage
- Write/read your temporary files, logs, and final products here
- 30-day retention period on files; warning emails are sent at 28 days – no quotas
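To see which of your files are approaching that window, something like the following works (the /scratch/$USER path is an assumption about where your files live):
find /scratch/$USER -type f -mtime +25   # files not modified in the last 25 days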
/projects
- 500TB
- 5TB of free storage per faculty member (more can be purchased)
- This is a long-term storage solution
- Built on ZFS for enterprise-grade data integrity, scale, and performance
- 30 Gbps throughput
/packages
- Packages and modules
/home
- Keep scripts and other small files here
- This area is not meant for heavy writes like temp files, logs, and checkpoint files
- This area is snapshotted twice a day, and a total of 4 snapshots are kept in /home/.snapshot (see the example after this list)
- 10 GB quota
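To recover an older copy of a file, list the available snapshots and copy the version you need back out; the directory layout under each snapshot is an assumption here, so browse it first:
ls /home/.snapshot                                  # list available snapshots
cp /home/.snapshot/<snapshot-name>/<your-file> ~/   # hypothetical path; adjust to the actual layout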
/tmp
- 120GB
- Local node storage
All storage areas are also available around campus, off of Monsoon, via SMB at \\shares.hpc.nau.edu\cirrus.
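On Windows, that path can be entered directly in File Explorer; on macOS, Finder’s “Connect to Server” accepts smb://shares.hpc.nau.edu/cirrus. From a Linux machine, a connection sketch with smbclient might look like this (the NAU domain prefix is an assumption; authenticate with your NAU credentials):
smbclient //shares.hpc.nau.edu/cirrus -U NAU\\your_nau_id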
Configured modules
- List the available modules on Monsoon: “module avail”
- List the currently loaded modules in your login session: “module list”
- Load a module: “module load <module>”
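Modules are typically loaded the same way inside a batch script, right before the command that needs them; the module name below is a hypothetical placeholder, so pick one that “module avail” actually lists:
#!/bin/bash
#SBATCH --job-name=with-modules
#SBATCH --time=00:05:00
module load <module>/<version>   # hypothetical; choose from "module avail"
srun <your-command>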