Condos and Colocation

ARCH provides colocation services at its facility at 5400 E. Lombard St., Baltimore, MD 21224. The facility is highly secure and has fast connectivity suitable for housing mission-critical servers. This document describes guidelines and requirements for the physical hosting of computer equipment in a shared facility for the benefit of the Johns Hopkins and UMCP research communities. The colocation service reduces the costs of designing, building, and maintaining HPC infrastructure.

Condominium Model

  • PIs are strongly encouraged to procure funding to integrate resources (condos) into the main cluster. Condos may consist of compute nodes, GPU nodes, sub-racks, management and InfiniBand switches, and any necessary software licenses.
  • PIs should discuss the configuration with the ARCH director in order to maintain a homogeneous system, as allowed by advances in technology. There is no charge to add condos to the existing shared cluster (Rockfish), but in exchange for administration of the condo, PIs will allow other users to run on their idle nodes.
  • PIs will have an allocation in walltime hours equivalent to their purchase; condo users may use as many cores as are available, up to that allocation (see the worked example after this list). PIs should purchase hardware with a minimum five-year warranty.
  • Hardware will be kept for a period of five (5) years and will be sent to surplus if it cannot be easily repaired.
  • Schools will cover the additional cost of power and cooling based on utilization.
  • Please contact us directly for more information on condos and co-location requests.
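
The condo allocation can be illustrated with a quick back-of-the-envelope calculation. The sketch below converts a purchased node into core-hours over the five-year hardware lifetime; the 48-core node size, and the interpretation of walltime hours as core-hours, are illustrative assumptions rather than ARCH pricing.

    # Illustrative sketch: convert a purchased condo node into a core-hour allocation.
    # The 48-core node size and the core-hour interpretation are assumptions for
    # illustration only; actual allocations are set by ARCH based on the purchase.

    CORES_PER_NODE = 48          # assumed core count of a purchased compute node
    HOURS_PER_YEAR = 365 * 24    # 8,760 wall-clock hours per year
    LIFETIME_YEARS = 5           # hardware is kept for five years (per the policy above)

    def condo_core_hours(nodes: int) -> int:
        """Total core-hours a condo of `nodes` nodes represents over its lifetime."""
        return nodes * CORES_PER_NODE * HOURS_PER_YEAR * LIFETIME_YEARS

    # A single 48-core node corresponds to roughly 2.1 million core-hours over five years.
    print(f"1 node : {condo_core_hours(1):,} core-hours")
    print(f"4 nodes: {condo_core_hours(4):,} core-hours")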

Colocation

Rationale

MARCC is the “address” for the infrastructure that supports high-performance computing at Hopkins. The ARCH flagship high-performance computing resource (the compute cluster) is currently known as the “Rockfish” cluster. Rockfish provides a shared facility where PIs from any School within Hopkins can contribute computational “condos” (compute clusters) in a relatively homogeneous environment. This environment not only provides stable system management and priority use of their cluster but, importantly, also increases the computational power available to everyone by load-balancing across the fluctuating demand for cycles. In addition to PI-provided “condos,” the resources are augmented by systems deployed using grants or contributions from various School Deans. ARCH recognizes that this common, shared infrastructure might not be ideal for all types of computing; therefore, a co-location service is offered to research groups.

Purpose

The Advanced Research Computing @ Hopkins group provides co-location services at its facility at 5400 E. Lombard St., Baltimore, MD 21224. The facility is highly secure and has fast connectivity suitable for housing mission-critical servers. This document describes guidelines and requirements for the physical hosting of computer equipment in a shared facility for the benefit of the Johns Hopkins University research community. The co-location service will appeal to research groups that want to reduce the cost of designing, building, operating, and maintaining HPC infrastructure for their cluster in situations where full integration into ARCH’s shared resource model, as in Rockfish, is not a good fit.

Security

Physical security

There are closed-circuit video cameras recording all activity inside and outside the data center. Only authorized personnel with card access are allowed in the facility and the data center. Guests are allowed in the facility only by advance request to the director of ARCH.

Cybersecurity

Customers must follow industry best practices to ensure the proper functioning of the data center. JHU IT/Network and the security office will coordinate security, firewalls, and access rules. Customers should utilize the Hopkins firewall as much as possible. All exceptions must be clearly justified and documented.

Services

Racks

All servers must fit in standard 42U racks (78.74 in × 23.62 in × 43.30 in). MARCC will provide racks, network trays, and power trays through a cost-recovery model.

Rack Space

Customers are only allowed to deploy equipment that has been approved by ARCH. Customers are expected to utilize only the rack space that is allocated to them and respect space allocated to other users.

Power

The co-location space is powered by BGE. MARCC does not provide Uninterruptible Power Supply (UPS) power, but backup generator power is available. Customers may provide their own UPS equipment only after consultation with MARCC and JHU Facilities. ARCH requires specific information on equipment power requirements: customers should provide the power consumption at idle, peak, and standard loads, as well as the plug type. Each rack should consume no more than 12 kW (an average of 300 watts per server). Customers with high-density racks should consult with MARCC regarding additional service charges. The standard circuits are 110/120 V or 208/220 V at 30 amps.
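
To make the rack power budget concrete, the short sketch below checks how many servers fit within the 12 kW per-rack limit at the quoted 300 W average, and how that load maps onto the standard 208 V, 30 A circuits. The 80% continuous-load derating is a common electrical rule of thumb used here as an assumption, not an ARCH specification.

    # Sketch of the per-rack power budget described above.
    # The 80% continuous-load derating is an assumption (common practice), not ARCH policy.

    RACK_LIMIT_W = 12_000   # 12 kW per-rack limit
    AVG_SERVER_W = 300      # average draw per server quoted above

    max_servers = RACK_LIMIT_W // AVG_SERVER_W
    print(f"Servers per rack at {AVG_SERVER_W} W average: {max_servers}")   # -> 40

    # Usable power of one 208 V, 30 A circuit at an assumed 80% continuous load:
    circuit_w = 208 * 30 * 0.80                                             # ~4,992 W
    print(f"Circuits needed for a fully loaded 12 kW rack: {RACK_LIMIT_W / circuit_w:.1f}")  # ~2.4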

Hardware

Higher priority will be given to servers that are under warranty and no older than five years. The equipment should be able to function properly at a cooling set point of 85°F (±3°). However, servers that are out of warranty or over five years old may be allocated space on an annual basis, and owners are responsible for removing them if the space is needed for higher-priority use.

Cooling

The facility has adequate cooling (N+1 redundant) to ensure continuous operation. However, customers are strongly encouraged to follow these guidelines to ensure the proper cooling environment is always maintained. The data center at MARCC is air-cooled; no water cooling is available.

Cables

Cables should rest on the provided cable trays; there is no space under the floor for cabling. Cable management should be done in accordance with local, state, and national electrical, fire, and safety standards and legal requirements.

Network

MARCC has a redundant 100 Gb/s connection to Internet2, and 100 Gb/s connectivity to the Homewood, East Baltimore, and UMCP campuses. Connections to individual research buildings vary according to the Hopkins Research Network (HORNET). Customers are required to purchase network hardware to connect to the MARCC network at 10 Gb/s or 40 Gb/s, and should plan for a network technology refresh cycle of about five years. All network switches, routers, and firewalls will be managed by IT@JH. Customers should discuss their network needs, security measures, and implementation options with MARCC and JHU IT.
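
When deciding between the 10 Gb/s and 40 Gb/s connection options, a rough estimate of bulk-transfer times can help size the uplink. The sketch below assumes ideal, uncontended links, so real throughput will be lower once protocol overhead and competing traffic are accounted for.

    # Rough data-transfer time estimates for the uplink speeds mentioned above.
    # Assumes ideal, uncontended links; real-world throughput will be lower.

    def transfer_hours(terabytes: float, link_gbps: float) -> float:
        """Hours to move `terabytes` of data over a `link_gbps` gigabit-per-second link."""
        bits = terabytes * 1e12 * 8
        return bits / (link_gbps * 1e9) / 3600

    for gbps in (10, 40):
        print(f"10 TB at {gbps} Gb/s: ~{transfer_hours(10, gbps):.1f} hours")   # ~2.2 h and ~0.6 h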

Access

Remote access, as well as physical access, to the facility is provided for authorized personnel. Automated monitoring for environmental conditions and security is available on a 24×7 basis.

Inventory

Customers agree to provide all necessary information about their equipment to allow MARCC to keep the inventory up-to-date. This includes hardware type, power and cooling requirements, and weight.

Warranty

All equipment should have at least a three-year warranty or maintenance contract. Equipment can be operated for up to five years and customers should plan for refresh cycles. Customers must provide updated information when servers are replaced.

Safety

Customers must follow industry standards to safely operate equipment. Any safety issues or concerns should be reported to the director of ARCH. In addition, safety issues or concerns on MARCC’s side will be reported to the requisite PI or Center/Institute Director.

End of Life

Customers must replace equipment that is over 5 years old. Proper procedures should be followed to recycle surplus old equipment responsibly.

Server Administration

Professional administration is required. Servers that are maintained by students will not be allowed in the co-location area. If there is no professional administration available from the PI/Center/Institute, MARCC may provide system administration for a fee determined on a case-by-case basis. This fee will depend on the size and complexity of the system. An MOU will be provided.

Support

MARCC will provide initial support for deployment, installation, network access, and facility access.

Cost

There is no cost to the customer for housing servers or for utilities; the expectation is that the customer’s School or research center/institute will pay the charges for space and utilities. ARCH support will be calculated based on the percentage of an FTE required to manage the systems. There will be some additional cost for racks, cable trays, electrical work, and any other needs, to be determined after discussions between customers, IT@JHU, and ARCH.

Conflict

The Deans of the Schools involved (typically the Vice Dean for Research and Senior Associate Deans for Finance) will be responsible for resolving any potential conflicts. In any case, every customer must agree to and sign an MOU that describes all requirements and responsibilities.

Requests

Please contact ARCH at help@rockfish.jhu.edu for more information or to start this process. 

MOU

Every customer must agree to the requirements by signing an MOU that describes requirements, responsibilities, and fees according to the hardware to be housed at MARCC. This MOU will be negotiated by both parties.

Audits

All hardware must pass security audits provided by IT@JHU.

As of June 29, 2020, the following additional policies apply:

  1. Highest priority for space will be given to large shared clusters such as Bluecrab and Rockfish; the larger the user base, the higher the priority.
  2. Priority for co-location space will be higher for larger clusters that cannot be placed in other locations or that have a compelling need to communicate with other clusters at ARCH.
  3. Machines placed at ARCH will be allowed to stay there for the lifetime of their service contract or five years from the new purchase date, whichever is shorter.
  4. Machines that are out of warranty or over five years old will be allocated space on an annual basis, and owners will be responsible for removing them if the space is needed for higher-priority use.

Responsibilities

MARCC / ARCH Responsibilities
  1. Provide adequate cyberinfrastructure (space, power and cooling)
  2. Provide basic support for initial setup
  3. Conduct an audit of the equipment to be co-located and start the inventory
  4. Provide restricted physical and virtual access to the facility
  5. Provide detailed information on power consumption on a monthly basis
  6. Approve equipment being housed in the space
PI Responsibilities
  1. Provide equipment (hardware) with warranty or maintenance support for at least three years. Any equipment 5+ years old may need to be replaced or recycled as surplus.
  2. Pay for any work needed to keep the equipment functioning, for example, power whip connections, network fees, etc.
  3. If UPS reserve power is needed, it is the PI’s responsibility to provide it, and it must be approved in advance by ARCH.
  4. Administration. Professional administration is required. If professional administration is not available, ARCH may be able to provide it at a cost depending on the size of the system. No students or other research personnel will be allowed to perform system administration.
  5. Implement security measures as needed (discuss with ARCH and IT networking/security).
  6. Provide and setup any firewalls (discuss with ARCH and IT networking/security)
  7. Racks are metered to keep track of power and cooling consumed.
  8. Obtain IP addresses and domain from the IT network group.
  9. Provide serial numbers to ARCH to be added to the inventory list.
School Responsibilities
  1. Pay for utilities (power and cooling) used by the equipment located at the data center. Schools will be billed monthly.
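
Because racks are metered and Schools are billed monthly, a simple estimate of the charge is the metered average draw multiplied by the hours in the billing period and the utility rate. The rate and cooling-overhead factor below are placeholder assumptions for illustration, not actual BGE or JHU billing figures.

    # Illustrative monthly utility estimate for a metered rack.
    # The $/kWh rate and cooling-overhead factor are placeholder assumptions,
    # not actual BGE or JHU billing figures.

    HOURS_PER_MONTH = 730      # average hours in a month
    RATE_USD_PER_KWH = 0.12    # assumed combined rate (placeholder)
    COOLING_OVERHEAD = 1.5     # assumed multiplier for cooling load (placeholder)

    def monthly_cost(avg_kw: float) -> float:
        """Estimated monthly charge for a rack drawing `avg_kw` kilowatts on average."""
        return avg_kw * HOURS_PER_MONTH * RATE_USD_PER_KWH * COOLING_OVERHEAD

    # A rack averaging 8 kW would run roughly $1,050/month under these assumptions.
    print(f"${monthly_cost(8.0):,.0f} per month")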

System Administration / Support

At a research group’s request, ARCH may provide cluster/server administration based on the following guidelines:

  1. Hardware: All hardware should be configured in consultation with ARCH. Power and thermal design settings need to be defined and aligned with facility capabilities for proper operation of the system, prior to placing orders.
    • The term “Hardware” includes compute, management, and login nodes; storage servers and JBODs; administration and management switches; InfiniBand switches and their respective cables; and any other peripherals.

2. Plan of use: A detailed description of how the system should be configured, including the following (an illustrative sketch of such a plan appears after this list):

  • Operating system
  • Cluster management software
  • Queueing software and plan/diagram for queues, user/group limits
  • Allocations process
  • User creation, approval, access, deletion, suspension policies
  • Storage allocations and policies
  • Resource accounting policies, including reporting frequency to customers
  • Plan for any additional resources (Open OnDemand, XDMoD, ColdFront)
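
As a concrete illustration of the information item 2 asks for, the sketch below captures a plan of use as structured data. Every value shown (operating system, partition names, limits, retention periods, services) is a hypothetical placeholder to be replaced by the group's actual plan agreed with ARCH.

    # Hypothetical plan-of-use sketch: every name and number is a placeholder,
    # not an ARCH default. The structure mirrors the items listed above.

    plan_of_use = {
        "operating_system": "Rocky Linux 9",               # assumed choice
        "cluster_management": "Warewulf",                  # assumed choice
        "scheduler": {
            "software": "Slurm",
            "partitions": [
                {"name": "condo",  "max_walltime_hours": 72, "allowed_groups": ["pi_lab"]},
                {"name": "shared", "max_walltime_hours": 48, "allowed_groups": ["all"]},
            ],
            "per_user_core_limit": 512,
        },
        "allocations": {"process": "quarterly review", "unit": "core-hours"},
        "accounts": {"approval": "PI sign-off", "inactivity_suspension_days": 180},
        "storage": {"home_gb_per_user": 50, "scratch_tb_per_group": 10, "purge_days": 30},
        "accounting": {"reporting_frequency": "monthly"},
        "additional_services": ["Open OnDemand", "XDMoD", "ColdFront"],
    }

    # A structure like this can be reviewed with ARCH and then translated into the
    # actual scheduler, storage, and account-management configuration at deployment.
    for section in plan_of_use:
        print(section)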

3. Provide a detailed schedule of installation, implementation, and plans for testing equipment.

4. Full system administration may be provided by ARCH, but additional FTEs may be required according to the size of the cluster. FTEs must be members of the ARCH team to ensure effective and proper functioning of the systems.

5. Scientific support: Training, workshops, scientific application installation and support, troubleshooting, and debugging. Additional FTEs may be required according to the size of the system. These FTEs must be members of the ARCH team.

6. Backup policies: Provide a detailed data retention policy, including frequency of backups.

7. Provide an estimated schedule of downtimes, including frequency and duration. Although downtimes are scheduled, they will be executed only as needed.
