This Information is Intended for: Faculty, Guests, Researchers, Staff, Students
Last Updated: February 13, 2024
How are the queues different?
SeaWulf's queues (also known as partitions) mostly differ in the maximum runtime and number of nodes that can be allocated to jobs run on them. Some sets of queues offer different hardware. More specifically:
- The debug-28core, short-28core, long-28core, extended-28core, medium-28core, and large-28core queues share a set of identical nodes that have a max of 28 Haswell cores.
- The short-40core, long-40core, extended-40core, medium-40core, and large-40core queues share a set of identical nodes that have 40 Skylake cores.
- The short-96core, long-96core, extended-96core, medium-96core, and large-96core queues share a set of identical nodes that have 96 AMD EPYC Milan cores.
- The short-96core-shared, long-96core-shared, and extended-96core-shared queues also share the same set of identical nodes that have 96 AMD EPYC Milan cores, but multiple jobs are allowed to run on the same node simultaneously.
- The hbm-short-96core, hbm-long-96core, hbm-extended-96core, hbm-medium-96core, and hbm-large-96core queues share a set of identical nodes that have 96 Intel Sapphire Rapids cores.
- The hbm-1tb-long-96core queue allocates jobs to 4 identical nodes that have 96 Intel Sapphire Rapids cores. These nodes differ from the other hbm nodes in that they are configured in Cache mode and have 1 TB DDR5 memory.
- The gpu, gpu-long, and gpu-large queues share a set of identical nodes similar to those used by the 28-core queues, but with 4x Tesla K80 24GB GPUs each.
- The p100 and v100 queues each allocate jobs to a single node that has 2x Tesla P100 16GB or 2x Tesla V100 32GB GPUs, respectively.
- The a100, a100-long, and a100-large queues have 4x A100 80GB GPUs and 64 cores of Intel Xeon Ice Lake CPUs.
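You can confirm these partition properties yourself from a login node with Slurm's standard `sinfo` command. The sketch below (partition names taken from this page; the output-format specifiers are standard Slurm options) prints each partition's time limit, node count, cores per node, and memory per node:

```shell
# Query Slurm for partition limits; values should match the tables on this page.
# %P = partition, %l = time limit, %D = node count,
# %c = CPUs per node, %m = memory per node (MB)
sinfo -p short-96core,long-96core,short-96core-shared \
      -o "%.24P %.12l %.6D %.5c %.8m"
```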
What does each queue provide?
The following table details hardware information and resource limits on jobs submitted to each queue via Slurm:
Queues accessed from login1 and login2:
Queue | CPU Architecture | Latest Advanced Vector/Matrix Extensions supported | CPU cores per node | GPUs per node | Node memory (GB)¹ | Default run time | Max run time | Max # of nodes | Min # of nodes | Max # of simultaneous jobs per user |
---|---|---|---|---|---|---|---|---|---|---|
debug-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 1 hour | 1 hour | 8 | n/a | n/a |
extended-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 8 hours | 7 days | 2 | n/a | 6 |
gpu | Intel Haswell | AVX2 | 28 | 4 | 128 | 1 hour | 8 hours | 2 | n/a | 2 |
gpu-long | Intel Haswell | AVX2 | 28 | 4 | 128 | 8 hours | 48 hours | 1 | n/a | 2 |
gpu-large | Intel Haswell | AVX2 | 28 | 4 | 128 | 1 hour | 8 hours | 4 | n/a | 1 |
p100 | Intel Haswell | AVX2 | 12 | 2 | 64 | 1 hour | 24 hours | 1 | n/a | 1 |
v100 | Intel Haswell | AVX2 | 28 | 2 | 128 | 1 hour | 24 hours | 1 | n/a | 1 |
large-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 4 hours | 8 hours | 80 | 24 | 1 |
long-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 8 hours | 48 hours | 8 | n/a | 6 |
medium-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 4 hours | 12 hours | 24 | 8 | 2 |
short-28core | Intel Haswell | AVX2 | 28 | 0 | 128 | 1 hour | 4 hours | 12 | n/a | 8 |
¹A small subset of node memory is reserved for the OS and file system and is not available for user applications.
Queues accessed from milan1 and milan2:
Queue | CPU Architecture | Latest Advanced Vector/Matrix Extensions supported | CPU cores per node | GPUs per node | Node memory (GB)¹ | Default run time | Max run time | Max # of nodes | Min # of nodes | Max # of simultaneous jobs per user | Multiple users can share the same node? |
---|---|---|---|---|---|---|---|---|---|---|---|
extended-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 8 hours | 7 days | 2 | n/a | 3 | No |
hbm-extended-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 8 hours | 7 days | 2 | n/a | 3 | No |
extended-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 7 days | 2 | n/a | 3 | No |
extended-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 7 days | 1 | n/a | n/a | Yes |
large-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 4 hours | 8 hours | 50 | 16 | 1 | No |
hbm-large-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 4 hours | 8 hours | 38 | 16 | 1 | No |
large-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 4 hours | 8 hours | 38 | 16 | 1 | No |
long-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 8 hours | 48 hours | 6 | n/a | 3 | No |
hbm-long-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 8 hours | 48 hours | 6 | n/a | 3 | No |
hbm-1tb-long-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 1000 (1 TB DDR5 + 128 GB HBM configured as level 4 cache) | 8 hours | 48 hours | 1 | n/a | 1 | No |
long-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 48 hours | 6 | n/a | 3 | No |
long-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 8 hours | 48 hours | 3 | n/a | n/a | Yes |
hbm-medium-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 4 hours | 12 hours | 16 | 6 | 1 | No |
medium-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 4 hours | 12 hours | 16 | 6 | 1 | No |
medium-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 4 hours | 12 hours | 16 | 6 | 1 | No |
hbm-short-96core | Intel Sapphire Rapids | AMX, AVX512 & Intel DL Boost | 96 | 0 | 384 (256GB DDR5 + 128GB HBM) | 1 hour | 4 hours | 8 | n/a | 4 | No |
short-40core | Intel Skylake | AVX512 | 40 | 0 | 192 | 1 hour | 4 hours | 8 | n/a | 4 | No |
short-96core | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 1 hour | 4 hours | 8 | n/a | 4 | No |
short-96core-shared | AMD EPYC Milan | AVX2 | 96 | 0 | 256 | 1 hour | 4 hours | 4 | n/a | n/a | Yes |
a100 | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 1 hour | 8 hours | 2 | n/a | 2 | Yes |
a100-long | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 8 hours | 48 hours | 1 | n/a | 2 | Yes |
a100-large | Intel Ice Lake | AVX512 & Intel DL Boost | 64 | 4 | 256 | 1 hour | 8 hours | 4 | n/a | 1 | Yes |
¹A small subset of node memory is reserved for the OS and file system and is not available for user applications.
In addition to the limits in the tables above, users cannot use more than 32 nodes at one time unless running jobs in one of the large queues, and the maximum number of jobs that a user can have queued at any given time is 100.
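A batch script must keep its requests within the limits above. As an illustration, the sketch below (the program name `myprog` is a placeholder for your own MPI application) requests 2 Skylake nodes on short-40core for the queue's full 4-hour limit:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=test-run
#SBATCH -p short-40core        # 40-core Skylake partition (4 hour max)
#SBATCH --nodes=2              # within short-40core's 8-node maximum
#SBATCH --ntasks-per-node=40   # one MPI rank per core
#SBATCH --time=04:00:00        # request the queue's full 4-hour limit
#SBATCH --output=%x-%j.out     # job name and ID in the log file name

# myprog is a placeholder for your own MPI application
mpirun ./myprog
```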
Which queue should I use?
In general, expect a trade-off between the amount of resources requested (number of nodes, job time) and how long your job will wait in the queue. In addition, some software, even software written with MPI support, may not meaningfully benefit from multiple nodes or even from all the cores on a single node. Therefore, instead of wasting resources (and potentially spending more time waiting in the queue), we recommend running small test jobs to determine what computational resources your code requires before submitting larger production runs. Based on these test results, select the queue that best matches your workload's requirements.
In addition:
- Do not run CPU-only applications in queues that provide access to GPUs. Please consult your software's documentation if you're unsure whether it can make use of GPUs.
- For jobs that require relatively few computational resources, we recommend using one of the "shared" 96-core nodes, which allow multiple jobs to be run on the same node simultaneously.
- For brief interactive jobs, try the debug-28core queue or one of the short queues. These queues are suitable for testing or debugging your code.
- Use the long queues if you're not sure how much time your job requires. Once you understand your job's needs, switch to a more suitable queue.
- Try using the hbm-1tb-long-96core queue for jobs that require very large amounts of memory.
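For example, a small job suited to a shared queue can request just a slice of a node, leaving the remaining cores and memory for other users. In the sketch below the core count, memory request, and program name `myprog` are illustrative placeholders:

```shell
#!/usr/bin/env bash
#SBATCH -p short-96core-shared  # shared partition: multiple jobs per node
#SBATCH --nodes=1
#SBATCH --ntasks=8              # 8 of the node's 96 cores
#SBATCH --mem=32G               # 32 GB of the node's 256 GB
#SBATCH --time=01:00:00

# placeholder for a small, single-node workload
./myprog
```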