The use of dynamic reconfiguration has been proposed to tolerate faults in large-scale partitionable parallel processing systems. If a processor develops a permanent fault during the execution of a task on a submachine A, there are three recovery options: migration of the task to another submachine, migration of the task to a subdivision of A, and redistribution of the task among the fault-free processors in A. Quantitative models of these reconfiguration schemes are developed to determine what information is needed to choose among them in a practical implementation. It is pointed out that in certain situations collecting precise values for all needed parameters is very difficult. Therefore, the model parameters are analyzed, togethe...
The single-program multiple-data (SPMD) paradigm is becoming the most widespread way to program commerc...
Fault tolerance can be defined as a concept of recovery that keeps a computer system operational by ...
This work deals with high performance computing on large scale platforms like computing grids. Compu...
Several parallel processing systems exist that can be partitioned and/or can operate in mul...
Fault-tolerance and dynamic partitioning are two important issues in the design of large-scale paral...
Various aspects of reliable computing are formalized and quantified with emphasis on efficient fault...
Architecture reconfiguration, the ability of a system to alter the active interconnection among modu...
Traditional reliability-related models for fault-tolerant systems are used to predict system reliabi...
Massively parallel computers, using thousands of processors, will be the future trend for producing ...
The occurrence of faults in multicomputers with hundreds or thousands of nodes is a likely event tha...
As the sizes of distributed memory multiprocessors increase, the likelihood of a fault removing one ...
Imperfect coverage and nonnegligible reconfiguration delay are known to have a deleterious e...
The management of redundancy in computer systems was studied and guidelines were provided for the de...
A general framework for the design and analysis of distributed fault-tolerant systems is proposed in...