During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-co...
The demand for more computational power to solve complex scientific problems has been driving the ph...
This paper presents various aspects of reliability, availability and serviceability (RAS) systems as...
These use cases describe the most common ways in which researchers use high-performance computing (H...
In order to address anticipated high failure rates, reliability, availability and serviceability hav...
In recent years, we have witnessed a growing interest in high performance computing (HPC) using a cl...
Our project is a multi-institutional research effort that adopts interplay of RELIABILITY, AVAILABIL...
Computer systems are often distributed across a network to provide services to the end-users. These ...
Supercomputers have played an essential role in the progress of science and engineering research. As...
Hardware support for high-performance computing (HPC) has so far been subject to significant advanc...
In today’s complex enterprise environments, providing continuous service for applications ...
The delivery of key services in domains ranging from finance and manufacturing to healthcare and tra...
High Performance Computing facilities face increased pressures to survive and thrive in the next mil...
Ultra-scale architectures for scientific high-end computing with tens to hundreds of thousands of pr...