The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models to minimize human interventions is an attempt to increase system availability and contain these costs. Building predictive models that are accurate enough to be useful in automatic management cannot be based on restricted log data from subsystems but requires a holistic approach to data analysis from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correla...
The power consumption of state-of-the-art supercomputers, because of their complexity and ...
As the field of supercomputing continues its relentless push towards greater speeds and higher level...
The complexity of modern computer systems makes performance modeling an invaluable resource for guid...
System- and application-level failures can be characterized by mining relevant log files ...
In addition to pushing what is possible computationally, state-of-the-art supercomputers a...
With the increasing scale and complexity of high performance computing (HPC) systems, reliability ma...
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and...
In this work, we design techniques to use energy instrumentation to study the health and workloads o...
Modern scientific discoveries are driven by an insatiable demand for computational resources. Hig...