SeaWulf Queues

Audience: Faculty, Guests, Researchers, Staff and Students

This KB Article References: High Performance Computing
Last Updated: February 13, 2024

How are the queues different?

SeaWulf's queues (also known as partitions) differ mainly in the maximum runtime and the number of nodes that a job can be allocated. Some sets of queues also offer different hardware. More specifically:

  • The debug-28core, short-28core, long-28core, extended-28core, medium-28core, and large-28core queues share a set of identical nodes that each have 28 Intel Haswell cores.
  • The short-40core, long-40core, extended-40core, medium-40core, and large-40core queues share a set of identical nodes that have 40 Skylake cores.
  • The short-96core, long-96core, extended-96core, medium-96core, and large-96core queues share a set of identical nodes that have 96 AMD EPYC Milan cores.
  • The short-96core-shared, long-96core-shared, and extended-96core-shared queues also share the same set of identical nodes that have 96 AMD EPYC Milan cores, but multiple jobs are allowed to run on the same node simultaneously.
  • The hbm-short-96core, hbm-long-96core, hbm-extended-96core, hbm-medium-96core, and hbm-large-96core queues share a set of identical nodes that have 96 Intel Sapphire Rapids cores.
  • The hbm-1tb-long-96core queue allocates jobs to 4 identical nodes that have 96 Intel Sapphire Rapids cores.  These nodes differ from the other hbm nodes in that they are configured in Cache mode and have 1 TB DDR5 memory.
  • The gpu, gpu-long, and gpu-large queues share a separate set of nodes similar to the 28-core nodes, but with 4x K80 24GB GPUs each.
  • The p100 and v100 queues each allocate jobs to a single node that has 2x Tesla P100 16GB or 2x V100 32GB GPUs, respectively.
  • The a100, a100-long, and a100-large queues allocate nodes that each have 4x A100 80GB GPUs and 64 Intel Xeon Ice Lake cores.
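A job is assigned to a queue at submission time with Slurm's standard `-p`/`--partition` option. A minimal batch-script sketch (the queue name comes from the lists above; the job name, core count, and program are placeholders):

```shell
#!/bin/bash
# Minimal Slurm batch script sketch. The partition (queue) is chosen with
# --partition; the job name, task count, and program are illustrative.
#SBATCH --job-name=my-test           # placeholder job name
#SBATCH --partition=short-96core     # queue from the lists above
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96         # matches the 96-core Milan nodes
#SBATCH --time=00:30:00              # must fit within the queue's max run time

./my_program                         # placeholder for your application
```

Submit it with `sbatch jobscript.sh` and Slurm will route it to the requested partition.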

What does each queue provide?

The following table details hardware information and resource limits on jobs submitted to each queue via Slurm:

 

Queues accessed from login1 and login2:

| Queue | CPU Architecture | Latest Advanced Vector/Matrix Extension supported | CPU cores per node | GPUs per node | Node memory (GB)¹ | Default run time | Max run time | Max # of nodes | Min # of nodes | Max # of simultaneous jobs per user |
|---|---|---|---|---|---|---|---|---|---|---|
| debug-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 1 hour | 1 hour | 8 | n/a | n/a |
| extended-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 8 hours | 7 days | 2 | n/a | 6 |
| gpu | Intel Haswell | AVX2 | 28 | 4 | 128 | 1 hour | 8 hours | 2 | n/a | 2 |
| gpu-long | Intel Haswell | AVX2 | 28 | 4 | 128 | 8 hours | 48 hours | 1 | n/a | 2 |
| gpu-large | Intel Haswell | AVX2 | 28 | 4 | 128 | 1 hour | 8 hours | 4 | n/a | 1 |
| p100 | Intel Haswell | AVX2 | 12 | 2 | 64 | 1 hour | 24 hours | 1 | n/a | 1 |
| v100 | Intel Haswell | AVX2 | 28 | 2 | 128 | 1 hour | 24 hours | 1 | n/a | 1 |
| large-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 4 hours | 8 hours | 80 | 24 | 1 |
| long-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 8 hours | 48 hours | 8 | n/a | 6 |
| medium-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 4 hours | 12 hours | 24 | 8 | 2 |
| short-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 1 hour | 4 hours | 12 | n/a | 8 |

¹A small subset of node memory is reserved for the OS and file system and is not available for user applications.

 

Queues accessed from milan1 and milan2:

| Queue | CPU Architecture | Latest Advanced Vector/Matrix Extensions supported | CPU cores per node | GPUs per node | Node memory (GB)¹ | Default run time | Max run time | Max # of nodes | Min # of nodes | Max # of simultaneous jobs per user | Multiple users can share the same node? |
|---|---|---|---|---|---|---|---|---|---|---|---|
| extended-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 8 hours | 7 days | 2 | n/a | 3 | No |
| hbm-extended-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 8 hours | 7 days | 2 | n/a | 3 | No |
| extended-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 7 days | 2 | n/a | 3 | No |
| extended-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 7 days | 1 | n/a | n/a | Yes |
| large-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 4 hours | 8 hours | 50 | 16 | 1 | No |
| hbm-large-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 4 hours | 8 hours | 38 | 16 | 1 | No |
| large-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 4 hours | 8 hours | 38 | 16 | 1 | No |
| long-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 8 hours | 48 hours | 6 | n/a | 3 | No |
| hbm-long-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 8 hours | 48 hours | 6 | n/a | 3 | No |
| hbm-1tb-long-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 1000 (1 TB DDR5 + 128 GB HBM configured as level 4 cache) | 8 hours | 48 hours | 1 | n/a | 1 | No |
| long-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 48 hours | 6 | n/a | 3 | No |
| long-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 48 hours | 3 | n/a | n/a | Yes |
| hbm-medium-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 4 hours | 12 hours | 16 | 6 | 1 | No |
| medium-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 4 hours | 12 hours | 16 | 6 | 1 | No |
| medium-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 4 hours | 12 hours | 16 | 6 | 1 | No |
| hbm-short-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 1 hour | 4 hours | 8 | n/a | 4 | No |
| short-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 1 hour | 4 hours | 8 | n/a | 4 | No |
| short-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 1 hour | 4 hours | 8 | n/a | 4 | No |
| short-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 1 hour | 4 hours | 4 | n/a | n/a | Yes |
| a100 | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 1 hour | 8 hours | 2 | n/a | 2 | Yes |
| a100-long | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 8 hours | 48 hours | 1 | n/a | 2 | Yes |
| a100-large | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 1 hour | 8 hours | 4 | n/a | 1 | Yes |

¹A small subset of node memory is reserved for the OS and file system and is not available for user applications.

 

In addition to the per-queue limits in the tables above, a user cannot be allocated more than 32 nodes at one time unless running jobs in one of the large queues, and a user may have at most 100 jobs queued at any given time.
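Your current usage can be checked against these limits with Slurm's standard query utilities:

```shell
# Show node availability and the time limit for one partition:
sinfo -p short-96core

# List your own pending and running jobs:
squeue -u "$USER"

# Count your queued jobs against the 100-job cap (-h suppresses the header):
squeue -u "$USER" -h | wc -l
```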
 

Which queue should I use?

In general, users should expect a trade-off between the amount of resources requested (number of nodes, job time) and how long a job will wait in the queue. In addition, some software, even software written with MPI support, may not meaningfully benefit from using multiple nodes or even all the cores on a single node. Therefore, instead of wasting resources (and potentially spending more time waiting in the queue), we recommend that users run small test jobs to determine what computational resources are required before submitting larger, production runs of their code. Based on these test results, you can then select the queue that best matches your workload's requirements.
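One way to run such a test is a short scaling job in one of the short queues; a sketch along these lines (queue, core count, and program name are placeholders):

```shell
#!/bin/bash
# Scaling-test sketch: run the same MPI program on 1, 2, and 4 nodes inside
# one short-queue allocation and compare wall times before a production run.
#SBATCH --partition=short-40core
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --time=01:00:00

for n in 1 2 4; do
    # srun launches each trial on a subset of the allocated nodes
    time srun --nodes=$n --ntasks=$((n * 40)) ./my_mpi_program
done
```

If doubling the node count barely reduces the wall time, request fewer nodes (and a smaller queue) for production runs.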

In addition:

  • Do not run CPU-only applications in queues that provide access to GPUs.  Please consult your software's documentation if you're unsure whether it can make use of GPUs.
  • For jobs that require relatively few computational resources, we recommend using one of the "shared" 96-core nodes, which allow multiple jobs to be run on the same node simultaneously.
  • For brief interactive jobs, try the debug-28core queue or one of the short queues.  These queues are suitable for testing or debugging your code.
  • Use the long queues if you're not sure how much time your job requires. Once you understand your job's needs, switch to a more suitable queue.
  • Try using the hbm-1tb-long-96core queue for jobs that require very large amounts of memory.
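For the brief interactive case above, Slurm's `srun --pty` is the usual mechanism (the `--gres=gpu:...` form for GPU queues follows the common Slurm convention; confirm the exact syntax in SeaWulf's documentation):

```shell
# Open an interactive shell on one debug node for up to an hour:
srun -p debug-28core --nodes=1 --ntasks=28 --time=01:00:00 --pty bash

# Interactive session on a GPU queue, requesting a single GPU:
srun -p a100 --ntasks=16 --gres=gpu:1 --time=01:00:00 --pty bash
```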



Getting Help


The Division of Information Technology provides support on all of our services. If you require assistance please submit a support ticket through the IT Service Management system.


For More Information Contact


IACS Support System