An important consideration in the design of high-performance multiprocessor systems is ensuring that computed results are correct in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems to achieve reliability, and Algorithm-Based Fault Tolerance (ABFT) has been suggested as a cost-effective concurrent error detection scheme. This research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on it, algorithms for both the design and analysis of ABFT systems were formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity furthe...
With the proliferation of parallel and distributed systems, it is an increasingly important problem ...
In Algorithm-based fault tolerance (ABFT), fault tolerance is tailored to the algorithm performed. M...
Multiprocessor systems which afford a high degree of parallelism are used in a variety of applicati...
The rapid progress in VLSI technology has reduced the cost of hardware, allowing multiple ...
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is ...
Efficient parallel algorithms proposed to solve many fundamental problems in scientific computation ...
Processor arrays can provide an attractive architecture for some applications. Featuring modularity,...
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for...
A new ABFT architecture is proposed to tolerate multiple soft-errors with low overheads. It memorize...
The design of survivable algorithms requires a solid foundation for executing them. While hardware t...
A complex computer system consists of billions of transistors, miles of wires, and many interacti...
PhD Thesis. This thesis describes the design and development of algorithms for fault-tolerant distr...
The mean time between failures (MTBF) of large supercomputers is decreasing, and future exascale comp...
Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance int...