This dissertation describes the design, implementation, and performance of two mechanisms that address reliability and system management problems associated with parallel computing clusters: thread migration and checkpoint/recovery. A unique aspect of this work is the integration of these two mechanisms. Although there has been considerable prior work on each of these mechanisms in isolation, their integration offers synergistic benefits to both functionality and performance. Used in conjunction, these mechanisms facilitate failure recovery as well as node addition and removal, with minimal disruption to executing applications. Our implementation differs from previous work in the following ways. First, by using thread migration instead of process ...
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-t...
Nowadays, clusters are widely used to execute scientific applications. These applications a...
Thread migration is established as a mechanism for achieving dynamic load sharing and data locality....
Clusters of industry-standard multiprocessors are emerging as a competitive alternative for large-sc...
This paper describes issues in the design and implementation of checkpointing and recovery modules f...
It has been observed on engineering and scientific data centers that the absence of a clear separati...
A network multicomputer is a multiprocessor in which the processors are connected by general-purpose...
In this paper we describe the way thread migration can be carried out in Distributed Shared Memory (...
This thesis examines memory management and rollback recovery in parallel architectures. Three memory...
Recent research efforts of parallel processing on non-dedicated clusters have focused on high execut...
Thread migration is established as a mechanism for achieving dynamic load sharing and data locality....
Large-scale distributed systems are very attractive for the execution of parallel applications requi...
We present a system for functional parallel computing on distributed memory machines with dynamic lo...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Nowadays, clusters are widely used to execute scientific applications. These applications are often ...