Rockfish Changes

We have made the following changes to Rockfish since October 2023:

  1. The Rockfish cluster, ARCH’s main resource, has reached the limit of its expansion capability and is being used at full capacity. It is a very powerful core facility with no remaining room for additional faculty condos.
    • The final configuration is: 815 regular compute nodes (39,872 cores), 28 bigmem nodes (1,344 cores), and 21 GPU nodes with 84 NVIDIA A100 GPUs, for a total peak performance of 4 PFLOPS.

      The ARCH team is working on a new Request for Proposals (RFP) to acquire Rockfish’s successor, hopefully by Fall 2024.

  2. Secure shell connection to compute nodes: users running jobs on a compute node can now “ssh” to the node where their jobs are running to check job status, memory usage, and the number of running processes. From a login node, run ssh cNNN (the name of the compute node where the job is running), then run “top” (or top -u $USER); see the example below.
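      As a minimal sketch of this workflow (the node name c123 below is hypothetical; use squeue to find the node assigned to your own job):

          # Find the node(s) where your job is running
          squeue -u $USER

          # Connect to that compute node and inspect your own processes
          ssh c123
          top -u $USER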
  3. New SLURM configuration (to be implemented after the downtime on November 7). Example submission commands for the new queues appear after this list.
    • “express” queue: a new queue for short jobs, limited to 4 cores and a maximum of 8 hours. It can be used to request interactive jobs or access via Open OnDemand.
    • “shared” queue: a new queue for serial, high-throughput, and small parallel jobs of up to 32 cores (single-node jobs). Time limit: 48 hours.
    • “parallel” queue: replaces the old defq queue and will run only parallel jobs that need 48 cores or more. Nodes in this queue will be dedicated to parallel jobs and rebooted once per week.
    • All other queues (bigmem, a100, and a100ic) will remain unchanged.
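      As a rough sketch of how the new queues might be requested once the change is live (this assumes the SLURM partition names match the queue names above; the script name myjob.sh is a placeholder, and sinfo shows the authoritative partition list):

          # Interactive session in the "express" queue (up to 4 cores, 8 hours)
          salloc --partition=express --ntasks=4 --time=8:00:00

          # Serial or small parallel batch job in the "shared" queue (single node, up to 32 cores, 48 hours)
          sbatch --partition=shared --ntasks=16 --time=48:00:00 myjob.sh

          # Multi-node parallel batch job in the "parallel" queue (48 cores or more)
          sbatch --partition=parallel --ntasks=96 --time=24:00:00 myjob.sh

          # Check the configured limits of any queue
          scontrol show partition express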


The SLURM re-configuration will improve job performance and make cluster utilization more effective.