Advanced Research Computing

Contact Advanced Research Computing

Email:
ask-arc@nau.edu

Overview

The Advanced Research Computing (ARC) group facilitates access to, and support of, High-Performance Computing (HPC) resources in general and the Monsoon cluster in particular. Cluster resources are available to NAU faculty and staff for use in their research projects, and to students who are sponsored by a research faculty member. Sponsorship implies that the sponsor is responsible for ensuring acceptable use of cluster resources by the sponsored individual.

Access to Monsoon can be obtained by submitting a New Monsoon User Request form. Once you have an account, you may log in using SSH with your NAU credentials.
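
For example, from a terminal (the hostname below is an assumption; use the login address given in the Monsoon documentation):

ssh <your-nau-id>@monsoon.hpc.nau.edu # log in with your NAU credentials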

Scheduler and resource manager

Scheduling and resource management on Monsoon are handled by Slurm. Job priority is determined by a fair-share policy in addition to other factors such as job age and size. Here are the essential Slurm commands you’ll want to know:

  • squeue – inspect the queue, includes running and queued jobs
    ex: squeue -t PD # pending jobs
    ex: squeue -t R # running jobs
  • sinfo – inspect the cluster state including queues, nodes
    ex: sinfo -l -N # long, node-oriented listing
  • sbatch – submit batch jobs
    ex: sbatch <job-script-file>
  • srun – submit parallel jobs
    ex: srun <job-command>
  • salloc – allocate resources for an interactive session
    ex: salloc -N 2
  • scontrol – inspect and, where permitted, modify the state of your jobs
    ex: scontrol show job <jobid#>
  • scancel – cancel your jobs
    ex: scancel <jobid#>

The man pages are very well done; check them out for more info, for example “man sbatch”.
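
As a quick reference, a minimal batch script might look like the sketch below. The partition, module, and program names are placeholders, and the resource requests are examples only; size them to what your job actually needs (see the Partitions, Storage, and Configured modules sections below).

#!/bin/bash
# Minimal sketch of a Monsoon batch script; submit it from a directory under
# /scratch so the log file lands there (see the Storage section below).
#SBATCH --job-name=example
#SBATCH --partition=core
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=example-%j.out

# Load whatever software the job needs (placeholder names), then run it.
module load <module>
srun <job-command>

Submit the script with sbatch <job-script-file> and monitor it with squeue, as listed above.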

Partitions

core – includes all of the nodes; 14-day run limit
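
To check the current state of this partition (node availability and its time limit), you can point sinfo at it:

sinfo -p core # summary of the core partition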

TRESRunMins

This is a limit in Slurm, assigned to an account (or QOS), that caps the total number of remaining CPU-minutes that your running jobs can occupy at any one time. Having this feature enabled on Monsoon helps with:

  • Flexible resource limiting
  • Staggering jobs
  • Increasing cluster utilization
  • More accurate resource requests

The current value for the limit is 3,000,000 CPU-minutes. This value is sometimes increased as cluster utilization drops, to allow folks to use the idle cores. To calculate the TRESRunMins for your jobs, multiply the number of CPUs each running job is using by its remaining time limit in minutes, then sum that product over all of your running jobs:

TRESRunMins = sum over running jobs of ( CPUs × time remaining )

Examples (job mixes that each total roughly 500,000 CPU-minutes):

  • 24 jobs × 1 CPU × 2 weeks ≈ 500,000
  • 49 jobs × 1 CPU × 1 week ≈ 500,000
  • 347 jobs × 1 CPU × 1 day ≈ 500,000
  • 21 jobs × 16 CPUs × 1 day ≈ 500,000
  • 130 jobs × 16 CPUs × 4 hours ≈ 500,000
  • 520 jobs × 4 CPUs × 4 hours ≈ 500,000

To see the current TRESRunMins for a single account or all accounts, use

sshare -l -A <account name> # single account
sshare -l # all accounts

In the output, the pertinent column will be labeled CPURunMins and will be the farthest to the right. This number changes dynamically as jobs change state.
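
If you want a rough, do-it-yourself estimate of the CPU-minutes your own running jobs currently occupy, the sketch below sums CPUs × time remaining from squeue. It is only an approximation (the %L time-left field changes format for short or unlimited jobs, and pending jobs are ignored):

squeue -u $USER -t RUNNING -h -o "%C %L" | awk '
{
  cpus = $1; t = $2; days = 0
  if (t ~ /-/) { split(t, dh, "-"); days = dh[1]; t = dh[2] }   # [days-]HH:MM:SS
  n = split(t, p, ":")
  if (n == 3)      mins = p[1] * 60 + p[2]                      # HH:MM:SS
  else if (n == 2) mins = p[1]                                  # MM:SS
  else             mins = 0
  total += cpus * (days * 1440 + mins)
}
END { print "Approximate CPU-minutes occupied by running jobs:", total }'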

Storage

/common

  • This is the cluster “common” area.
  • Cluster dependencies reside here as well as areas for users to share contributions (contrib)

/scratch

  • 450TB
  • This is the primary shared working storage
  • Write/read your temporary files, logs, and final products here
  • 30-day retention period on files; warning emails are sent at 28 days – no quotas

/projects

  • 500TB
  • 5TB of free storage per faculty member (more can be purchased)
  • This is a long-term storage solution
  • Built on ZFS for enterprise-grade data integrity, scale, and performance
  • 30 Gbps throughput

/packages

  • Packages and modules

/home

  • Keep scripts and other small files here
  • This area is not meant for heavy writes like temp files, logs, and checkpoint files
  • This area is snapshotted twice a day, and a total of 4 snapshots are kept in /home/.snapshot
  • 10 GB quota

/tmp

  • 120 GB of local node storage

All storage areas are available around campus, off of Monsoon, via SMB at \\shares.hpc.nau.edu\cirrus.
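
Putting the storage guidance together, a typical job writes its working files to /scratch and copies anything it needs to keep long term into /projects. The directory and program names below are placeholders:

OUTDIR=/scratch/$USER/my-run # temporary working space (30-day retention)
mkdir -p "$OUTDIR"
<job-command> > "$OUTDIR/results.txt" # placeholder program writing to scratch
cp "$OUTDIR/results.txt" /projects/<your-project>/ # keep final products long term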

Configured modules

  • List the available modules on Monsoon: module avail
  • List the currently loaded modules in your login session: module list
  • Load a module: module load <module>
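
For example, a short interactive session that loads software before running a command might look like this (the module and command names are placeholders; pick a module listed by module avail):

salloc -n 1 --time=00:30:00 # request a small interactive allocation (30 minutes)
module load <module> # load the software your command needs
srun <job-command> # run inside the allocation
exit # release the allocation when finished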

View Additional Documentation