ARCH Newsletter October 2023

Hello ARCH & Rockfish Community,

In this newsletter for the month of October: 

  • Rockfish Outage Announcement
  • Research Profile: Dr. Corey Oses
  • Short Tutorials: SLURM Scripts
  • Staff Spotlight: Joseph Biondo
  • What is New on Rockfish
  • Job Opportunities: Assistant Research Scientist
  • Krieger IT VAST Filesystem

Read below for more details about our announcements.

Rockfish Outage Announcement

All Rockfish systems will be unavailable on Tuesday, November 7, at 11:00 AM for approximately two hours while we upgrade the InfiniBand network. Compute nodes and filesystems will be inaccessible during this window. We strongly recommend that users copy any data they may need during the downtime to an external device beforehand.

Research Profile: Corey Oses

We sat down with Dr. Corey Oses, Assistant Professor of Materials Science and Engineering, to learn more about his work on the discovery of new materials for clean and renewable energy. Learn more about his work at the Entropy for Energy Laboratory here.

Short Tutorials: SLURM Scripts

This month we present a tutorial on how to write SLURM scripts. Learn the basics of SLURM scripts, as well as how to submit them for running job arrays and Message Passing Interface (MPI) programs. View our short tutorial here.
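As a quick taste of what the tutorial covers, here is a minimal batch-script sketch. The "express" partition name and its 4-core / 8-hour limits are the ones announced elsewhere in this newsletter; the filename and job name are placeholders.

```shell
# Write out a minimal SLURM batch script (sketch; filename and job name are
# hypothetical, partition limits are those announced in this newsletter).
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --partition=express       # new 4-core / 8-hour queue
#SBATCH --ntasks=1                # one task
#SBATCH --cpus-per-task=4         # up to the queue's 4-core limit
#SBATCH --time=08:00:00           # hh:mm:ss, within the 8-hour limit
echo "Running on $(hostname)"
EOF
echo "wrote hello_job.sh"
```

Submit the script from a login node with `sbatch hello_job.sh`.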

Staff Spotlight: Joseph Biondo

We continue our Staff Spotlight series to highlight the work done by one of our colleagues on the ARCH team. Today we are excited to speak with Joseph Biondo, Assistant Director, IT Architect, and Operations Lead for ARCH. Learn more about his work at ARCH here.

What is New on Rockfish

We have made the following changes on Rockfish since October 2023:

  1. The Rockfish cluster, ARCH’s main resource, has reached the limit of its expansion and is being used at full capacity. This very powerful core facility has no room left for additional faculty condos.
    • The final configuration is: 815 regular compute nodes (39,872 cores), 28 Bigmem nodes (1,344 cores), and 21 GPU nodes (84 NVIDIA A100 cards), for a peak performance of 4 PFLOPS.

      The ARCH team is working on a new Request For Proposal (RFP) to acquire the successor to Rockfish, hopefully by Fall 2024.

  2. Secure shell connection to compute nodes: users who are running jobs on a compute node can now “ssh” to the node where their jobs are running to check the status of the job, its memory usage, and the number of running processes. From a login node, run ssh cNNN (where cNNN is the name of the compute node running the job), then run “top” (or top -u $USER).
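      The check above can be scripted; the following is a hedged sketch (squeue and its flags are standard SLURM, but it assumes you have at least one running job, and it is guarded so it does nothing harmful off-cluster):

      ```shell
      # Sketch: take a one-shot snapshot of your processes on the node where
      # your first running job lives. Guarded for illustration off-cluster.
      if command -v squeue >/dev/null 2>&1; then
        node=$(squeue -u "$USER" -h -t RUNNING -o %N | head -n 1)  # node of first running job
        if [ -n "$node" ]; then
          ssh "$node" top -b -n 1 -u "$USER"   # batch-mode top: one snapshot, your processes only
        else
          echo "no running jobs found for $USER"
        fi
      else
        echo "squeue not found; run this from a Rockfish login node"
      fi | tee job_check.log
      ```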
  3. New SLURM configuration (to be implemented after the downtime on November 7)
    • “express” queue: a new queue for fast jobs. The limits are 4 cores and a maximum of 8 hours. This queue can be used to request interactive jobs or to request access via Open OnDemand.
    • “shared” queue: a new queue for serial, high-throughput, or small parallel jobs. It will run a combination of serial and parallel jobs of up to 32 cores (single-node jobs). Time limit: 48 hours.
    • “parallel” queue: this queue will replace the old defq queue and will run only parallel jobs that need 48 cores or more. Nodes in this queue will be dedicated to parallel jobs and will be rebooted once per week.
    • All other queues (bigmem, a100, and a100ic) will remain unchanged.
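Once the new queues go live, submissions might look like the following dry-run sketch. The `submit` wrapper just echoes the command so the sketch runs anywhere; on Rockfish you would use plain `sbatch`, and `job.sh` is a placeholder script name. Core and time limits are the ones listed above (no time limit was stated for “parallel”, so none is shown).

```shell
# Dry-run sketch of submissions to the reconfigured queues.
# `submit` is a stand-in that echoes the sbatch command instead of running it.
submit() { echo "sbatch $*"; }

submit -p express  -c 4  -t 8:00:00  job.sh   # express: up to 4 cores, 8 hours
submit -p shared   -n 32 -t 48:00:00 job.sh   # shared: single node, up to 32 cores
submit -p parallel -n 96 job.sh               # parallel: jobs of 48+ cores only
```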

The SLURM reconfiguration will improve job performance and make cluster utilization more effective.

Job Opportunities: Assistant Research Scientist

Are you interested in joining the HPC workforce? As a Research Computing and Data Specialist (RCD) or Research Software Engineer (RSE), you will have plenty of opportunities to participate in innovative projects, establish collaborations with different research groups, and support and enable cutting-edge research. ARCH has several openings at the assistant research scientist level. For more information, click here.

Krieger IT VAST Filesystem

ARCH is happy to report that a direct connection to the Krieger IT VAST Filesystem is now available for researchers who have purchased a storage allocation through KSAS. This all-flash storage is available to campus researchers and includes HORNET connectivity. You can view pricing information and request space using this form; a budget (IO or CC) is required.

Feel free to contact us for assistance on our services by submitting a help ticket on our website.

Thank you,