{"id":68,"date":"2018-01-16T19:55:17","date_gmt":"2018-01-16T19:55:17","guid":{"rendered":"https:\/\/in.nau.edu\/hpc\/?page_id=68"},"modified":"2024-08-14T09:41:35","modified_gmt":"2024-08-14T16:41:35","slug":"efficient-jobs","status":"publish","type":"page","link":"https:\/\/in.nau.edu\/arc\/overview\/efficient-jobs\/","title":{"rendered":"Efficient Jobs"},"content":{"rendered":"<!-- shortcode-right-column -->\n<div class=\"shortcode-right-column\" >\n    <div class=\"shortcode-right-column__container\"><\/p>\n<p><!-- shortcode-contact -->\n<div class=\"shortcode-contact\">\n    <div class=\"contact-header\">\n        <h3>Advanced Research Computing<\/h3>\n    <\/div>\n    <div class=\"contact-body\">\n                <a href=\"mailto:ask-arc@nau.edu\" aria-label=\"Advanced Research Computing: Email Address\" title=\"Email Address\">\n            <div class=\"contact-icon-container\">\n                <i class=\"fas fa-envelope\" aria-hidden=\"true\"><\/i>\n                <span class=\"sr-only\">Email:<\/span>\n            <\/div>\n            <div class=\"contact-email\">ask-arc&#8203;@nau.edu<\/div>\n        <\/a>\n                    <\/div>\n<\/div>\n\n<br \/>\n<\/div>\n<\/div>\n\n<h1>Efficient jobs<\/h1>\n<p>Submitting efficient jobs to Slurm is your ticket to the shortest queue times and maximum resources. If jobs are submitted in an inefficient manner, it can limit how much you and your group are able to use the cluster. Here are some of the benefits of submitting jobs efficiently:<\/p>\n<ul>\n<li>Faster job start times<\/li>\n<li>More resources for you<\/li>\n<li>More resources for your group<\/li>\n<li>Higher utilization of Monsoon<\/li>\n<li>More work being done in general<\/li>\n<\/ul>\n<p>How can you make your jobs more efficient? It&#8217;s pretty easy! 
If you run our <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">jobstats<\/span> wrapper script, it gives you a clear comparison of the parameters that matter most for a job&#8217;s efficiency:<\/p>\n<ul>\n<li>Job time<\/li>\n<li>CPU usage<\/li>\n<li>Memory usage<\/li>\n<\/ul>\n<p>Let&#8217;s look at a couple of examples of inefficient jobs:<\/p>\n<pre><code>       JobID JobName   ReqMem    MaxRSS ReqCPUS   UserCPU  Timelimit  Elapsed     State\r\n6124735      MetaSeQ 126000Mn                 8 44:12.097 2-00:00:00 00:14:33 COMPLETED\r\n6124735.bat+   batch 126000Mn     3360K       8 00:00.008            00:14:33 COMPLETED\r\n6124735.0      batch 126000Mn 70060316K       8 44:12.088            00:14:18 COMPLETED<\/code><\/pre>\n<p>The first line is the overall parent process for the job, the second line is the batch portion, and the third line is where the work actually took place. Lines with a job ID of <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">jobid.0<\/span> are the ones to examine closely when trimming down memory. 
The first line shows the overall stats for the job (excluding memory) and the total time across all steps in the job: submit, prep, and running of the analysis code.<\/p>\n<ul>\n<li>The memory estimate of 126GB (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">ReqMem<\/span>) is overestimated by about 45% versus the 70GB (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">MaxRSS<\/span>) that was actually used<\/li>\n<li>The time estimate of 2 days (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">Timelimit<\/span>) is overestimated by a huge amount, more than 99%, versus the elapsed 14 minutes (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">Elapsed<\/span>)<\/li>\n<li>The <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">UserCPU<\/span> number also looks off. Ideally, this number should be close to <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">n * elapsed<\/span>, where <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">n<\/span> is the number of cores requested. Here that would be roughly 8 * 14.5 minutes = 116 minutes, but only about 44 minutes were recorded. A mismatch like this can indicate something strange going on. For example, the application is possibly trying to start more threads than cores asked for. 
If this is happening to you, ask us to help you look into it.<\/li>\n<\/ul>\n<pre><code>       JobID    JobName   ReqMem    MaxRSS ReqCPUS    UserCPU  Timelimit  Elapsed     State\r\n6125571      metseq_re+    126Gn                 4   22:44:10 2-00:00:00 05:59:14 COMPLETED\r\n6125571.bat+      batch    126Gn     3364K       4  00:00.014            05:59:14 COMPLETED\r\n6125571.0         batch    126Gn 19047016K       4   22:44:09            05:59:10 COMPLETED<\/code><\/pre>\n<p>The job above is inefficient in its memory and time limit requests, but efficient in its CPU usage (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">UserCPU<\/span>).<\/p>\n<ul>\n<li>Compare ReqMem and MaxRSS on the third line: 126GB requested vs. about 18GB used<\/li>\n<li>The time limit is 2 days requested vs. about 6 hours elapsed<\/li>\n<li>The CPU time is right on, however. The expected value is about 24 hours (<span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">6 hours * 4 CPUs = ~24 hours of user CPU<\/span>), and we see ~22.7 hours.<\/li>\n<\/ul>\n<p><strong>Note:<\/strong> When the scheduler is looking for resources for your job, it has to find a window of time that matches all of your requests for resources. If time limit, memory, or CPU is overestimated, your job&#8217;s start time will suffer.<\/p>\n<p>Another thing to keep in mind regarding efficient jobs: each group on Monsoon is given a fixed number of CPU minutes that can be in use at any one time. Currently, we have the number set at <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">700,000<\/span>. 
With this number of minutes, a group can potentially run:<\/p>\n<ul>\n<li>20 jobs, 4 days (5760 minutes) in length, with 6 cores each: <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">20 * 5760 * 6 = 691200<\/span><\/li>\n<li>100 jobs, 1 day (1440 minutes) in length, with 4 cores each: <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">100 * 1440 * 4 = 576000<\/span><\/li>\n<li>500 jobs, 2 hours (120 minutes) in length, with 11 cores each: <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">500 * 120 * 11 = 660000<\/span><\/li>\n<li>1 job, 1.5 days (2160 minutes) in length, with 256 cores: <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">1 * 2160 * 256 = 552960<\/span><\/li>\n<\/ul>\n<p>Any submitted jobs that would put the group over <span style=\"font-size: 16px; font-family: monospace; border: 1px solid; border-radius: 4px; padding: 0px 4px 0px; border-color: #BBBBBB; background-color: white;\">700,000<\/span> remain pending until more minutes become available. Note: as a job runs, its contribution to the sum shrinks, because its remaining time limit keeps decreasing.<\/p>\n<p><strong>The takeaway from all of this:<\/strong>\u00a0the efficiency of your jobs will dictate how many Monsoon resources you and your group get &#8212; it&#8217;s that simple.<\/p>\n<p>Please consider running jobstats now and then to see how efficient your jobs are. 
The following would grab all of one&#8217;s jobs from January 1st to now:<\/p>\n<pre><code>jobstats -u userid -S 1\/1\/17<\/code><\/pre>\n<p><strong>Note:<\/strong> you can run this on any of your group members to take a look at how their jobs are faring.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Efficient jobs Submitting efficient jobs to Slurm is your ticket to shortest queue times, and maximum resources. If jobs are submitted in an inefficient manner, it can negatively affect you and your group&#8217;s ability to use the cluster as much as possible. Here are some of the benefits of submitting jobs efficiently: Faster job start [&hellip;]<\/p>\n","protected":false},"author":76,"featured_media":145,"parent":49,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"","_relevanssi_noindex_reason":"","ring_central_script_selection":"","footnotes":""},"class_list":["post-68","page","type-page","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/pages\/68","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/users\/76"}],"replies":[{"embeddable":true,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/comments?post=68"}],"version-history":[{"count":5,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/pages\/68\/revisions"}],"predecessor-version":[{"id":3610,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/
wp\/v2\/pages\/68\/revisions\/3610"}],"up":[{"embeddable":true,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/pages\/49"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/media\/145"}],"wp:attachment":[{"href":"https:\/\/in.nau.edu\/arc\/wp-json\/wp\/v2\/media?parent=68"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}