Monitoring Resource Usage on SeaWulf
Introduction
Efficient resource utilization is essential for the effective use of our computing infrastructure on SeaWulf. To help users monitor and optimize their resource usage (nodes, CPUs, memory, etc.), we have compiled this article, which outlines several tools and scripts available for tracking Slurm job resource consumption. It will guide you through using these tools to ensure your jobs are running efficiently, thereby helping to conserve resources and improve overall system performance.
CPU Load
An important aspect of this optimization is understanding CPU load, which represents the average number of processes trying to use the CPU over a specified time interval. For example, on a fully utilized 40-core node, you would expect the load to be around 40. A significantly lower load might indicate underutilization of resources, while a much higher load likely points to oversubscription, potentially degrading code performance.
The CPU load statistic is typically given as three values: the average number of processes either running or waiting in the run queue over the last 1, 5, and 15 minutes, respectively. By monitoring and adjusting CPU load, you can maintain optimal job efficiency and prevent performance bottlenecks, further enhancing resource utilization. In the following examples of our monitoring tools, CPU load values are highlighted in red.
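If you are already logged into a node, a quick way to check these three load averages is the standard uptime command. The output below is illustrative, showing roughly what you would expect on a fully utilized 40-core node:
[user@dn045 ~]$ uptime
 10:15:32 up 12 days,  3:41,  1 user,  load average: 39.85, 39.52, 38.90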
Monitoring Tools
get_resource_usage.py
The get_resource_usage.py script is a tool designed to help users monitor the resource usage of their Slurm jobs on SeaWulf. It provides a concise summary of CPU and memory utilization, making it easier to identify inefficiencies and optimize resource usage. Below, we delve into the script's functionality, usage options, and examples to help you make the most of this tool.
The script is available to all users at the following path:
/gpfs/software/hpc_tools/get_resource_usage.py
The script comes with several command-line options to tailor the output to your needs. Running the script with the --help option provides a detailed description of these options:
[user@milan2 ~]$ /gpfs/software/hpc_tools/get_resource_usage.py --help
usage: get_resource_usage.py [-h] [-a] [-u USER] [-l LOW] [-e HIGH] [-n NODE] [-j JOB]
Calculate CPU and Memory usage for each node running a Slurm job. If no options are specified, the script will report usage for the current user.
options:
-h, --help show this help message and exit
-u USER, --user USER Only report usage for this user
-l LOW, --low LOW Only report nodes with % CPU usage lower than this value
-e HIGH, --high HIGH Only report nodes with % CPU usage higher than this value
-n NODE, --node NODE Only report usage on this node
-j JOB, --job JOB Only report usage for this job ID
The primary function of get_resource_usage.py is to calculate and report CPU and memory usage for each node running a Slurm job. The script can filter and display data based on various criteria such as user, node, and job ID. Additionally, it can report only those nodes where the CPU usage percentage exceeds or falls below a specified threshold. For most users, however, running the script without any arguments provides all the necessary information.
Here is the recommended usage of the script:
[user@login2 ~]$ /gpfs/software/hpc_tools/get_resource_usage.py
In the example above, one Intel Skylake 40-core node is allocated, utilizing less than 26.5% of the CPU with a CPU load of 10.58. Additionally, memory usage is at a meager 14.2%. This output suggests an inefficient use of the compute node.
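If you need to narrow the report, the filtering options shown in the help output above can be used. For example, you might limit the report to a single job ID, or flag only the nodes where your jobs are using less than 25% of the CPUs (the job ID and threshold below are placeholders):
[user@login2 ~]$ /gpfs/software/hpc_tools/get_resource_usage.py -j 485898
[user@login2 ~]$ /gpfs/software/hpc_tools/get_resource_usage.py -u user -l 25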
ssh
To check resource usage using tools like top, htop, or glances, you'll need to ssh into the specific nodes where your jobs are active. These tools provide detailed insights into CPU and memory usage, which is crucial for monitoring performance and making necessary optimizations. By SSH-ing directly into the nodes where your jobs are running, you can get real-time updates on resource consumption, spot potential bottlenecks, and address them promptly to improve efficiency.
[user@milan2 ~]$ squeue -u user
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
485898 short-40c bash user R 0:23 1 dn045
[user@milan2 ~]$ ssh dn045
[user@dn045 ~]$
When you execute the command squeue -u <username>, it displays a list of your active jobs (assuming the Slurm module is loaded). For instance, the output might show a job running on a particular node, as indicated by its ID and node name. By SSH-ing into that node, you gain direct access to where your job is executing. This allows you to oversee its progress, address any issues, or interact with the environment as necessary.
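If you only want the node names for your running jobs (for example, to decide which node to ssh into), squeue's output-format option can help. The format string below is just one possibility, printing each job ID and its node list:
[user@milan2 ~]$ squeue -h -u user -t RUNNING -o "%i %N"
485898 dn045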
glances
glances is, in most cases, the best tool for real-time system resource monitoring on SeaWulf. It offers a detailed interface with comprehensive system resource monitoring capabilities, making it an attractive choice for users who prefer straightforward yet powerful tools. It helps identify resource-intensive processes and system bottlenecks, facilitating effective troubleshooting. Additionally, glances offers built-in plugins that provide additional functionality, such as network and disk I/O monitoring, expanding its utility beyond basic system resource tracking. This versatility enhances glances' effectiveness in providing a comprehensive overview of CPU, memory, and process data, aiding in efficient system performance analysis.
To utilize glances, start by SSH-ing into the computing node where your code is executing. Then, load the module using the following command:
[user@dn045 ~]$ module load glances
Finally, execute the glances command to access a dynamic overview of current processes and resource utilization.
[user@dn045 ~]$ glances
With glances, you can easily spot inefficiencies in resource allocation, such as certain processes monopolizing CPU resources or excessive memory usage, which indicate potential optimization opportunities. In the example above, however, the node shows optimal resource utilization.
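glances also accepts a few command-line options that can be useful on a busy node. For instance, you can slow down the refresh rate to reduce the monitoring overhead itself; the 5-second interval below is just an example value:
[user@dn045 ~]$ glances -t 5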
Read the glances documentation for more information on the available features.
htop
The htop command has been a reliable tool for monitoring real-time system resource usage on SeaWulf for many years. It offers a comprehensive display of CPU, memory, and process data, providing detailed insight into system performance.
To utilize this tool, begin by SSH-ing into the computing node where your code is running, then load the htop module with the following command:
[user@dn045 ~]$ module load htop
Finally, execute the htop command to access a dynamic overview of current processes and resource usage.
[user@dn045 ~]$ htop
In the example above, you can readily identify inefficiencies in resource utilization. For instance, the Amber command pmemd is seen occupying only a single core, highlighting inefficient utilization of the compute node and leading to sub-optimal performance.
In contrast, htop will also let you observe a program efficiently employing MPI to distribute tasks across 40 concurrent processes, fully utilizing all available CPU cores on the node for optimal performance.
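If the node is shared with processes from other users, you can restrict htop's display to your own processes with its user filter (substitute your actual username):
[user@dn045 ~]$ htop -u user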
Read the htop documentation for more information on the available features.
top
If you only need a subset of the features offered by the previous tools, the top command serves as a fundamental way to monitor real-time system resource usage on SeaWulf. It presents a clear overview of CPU, memory, and process data, allowing users to detect anomalies or resource-intensive processes without requiring additional modules.
To utilize this tool, begin by SSH-ing into the computing node where your code is running and run the following command:
[user@dn045 ~]$ top
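top supports a similar user filter, and once it is running you can press 1 to toggle a per-core view, which makes it easy to check whether the load is spread across all CPUs (substitute your actual username):
[user@dn045 ~]$ top -u user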
Read the top documentation for more information on the available features.
Optimizing Resource Usage
We hope you'll make use of these tools to ensure your jobs are efficiently utilizing the resources you've requested. If you find any discrepancies or inefficiencies, don't hesitate to take action to improve your resource usage. This could involve refining your job configurations, adjusting resource requests, or optimizing your code to better match the allocated resources. If you encounter any challenges or need guidance in enhancing your resource efficiency, our support team is here to assist you every step of the way.
Click here to submit a ticket to the HPC support site.
Shared Queues
Lastly, we want to highlight that if you find that your job doesn't need all or most of the resources on a node, we encourage you to utilize the "shared" queues. These queues allow for more efficient resource allocation by enabling multiple jobs to run simultaneously on the same node, maximizing resource utilization. For more information on how to use the "shared" queues and optimize your job submissions, please refer to our comprehensive FAQ article, where you'll find detailed guidance and instructions.
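As a rough illustration, a job that only needs a few cores and a modest amount of memory might request them explicitly and target a shared partition. The partition, module, and program names below are placeholders; check sinfo or the FAQ article for the exact shared queue names on SeaWulf:
#!/bin/bash
#SBATCH --job-name=small_job
#SBATCH -p short-40core-shared    # placeholder: substitute an actual shared partition name
#SBATCH --ntasks=4                # request only the CPUs your job needs
#SBATCH --mem=8G                  # and only the memory it needs
#SBATCH --time=01:00:00

module load your_app              # placeholder module
srun ./your_program               # placeholder executable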