Because of the ever-increasing execution scale, reliability and data management are becoming increasingly important for scientific applications. On the one hand, exascale systems are anticipated to be more susceptible to soft errors, e.g., silent data corruptions, due to the shrinking size of transistors and the growing number of components. These errors corrupt results without warning, making the output of the computation untrustworthy. On the other hand, scientific computing on exascale systems and advanced instruments produces large volumes of highly variable data at high velocity, and the I/O time for storing these data is prohibitive due to the I/O bottleneck in parallel file systems. In this work, we...
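To make the silent-data-corruption threat above concrete, the following is a minimal, illustrative sketch of the classic checksum idea behind algorithm-based fault tolerance (ABFT), not the scheme proposed in any of the papers summarized here: for a matrix-vector product y = Ax, the precomputed column-checksum row c = 1^T A satisfies c·x = sum(y) in exact arithmetic, so a mismatch beyond floating-point tolerance signals a corrupted result. The function names below are hypothetical.

```python
import numpy as np

def checked_matvec(A, x, tol=1e-8):
    """Compute y = A @ x and verify it with an ABFT-style checksum.

    The column-checksum row c = 1^T A satisfies c @ x == sum(y) in exact
    arithmetic, so a mismatch beyond floating-point tolerance indicates a
    silent data corruption in the computation of y.
    """
    c = A.sum(axis=0)          # precomputed checksum row (1^T A)
    y = A @ x                  # the (possibly faulty) computation
    expected = c @ x           # checksum of the correct result
    if abs(y.sum() - expected) > tol * max(1.0, abs(expected)):
        raise RuntimeError("silent data corruption detected in matvec")
    return y

# Usage: a bit flip injected into y would be caught by the checksum test.
A = np.random.rand(100, 100)
x = np.random.rand(100)
y = checked_matvec(A, x)
```

The same checksum relation extends to matrix-matrix products and underlies many ABFT schemes for dense linear algebra.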
This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hund...
A new ABFT architecture is proposed to tolerate multiple soft errors with low overheads. It memorize...
Advancement in computational speed is nowadays gained by using more processing units rather than fas...
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors o...
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fund...
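As a concrete companion to the iterative-methods abstract above, here is a minimal Jacobi iteration for a sparse system Ax = b; it is a generic textbook sketch under assumed names, not the particular method or fault-tolerance scheme studied in that work.

```python
import numpy as np
import scipy.sparse as sp

def jacobi(A, b, tol=1e-8, max_iter=1000):
    """Solve Ax = b with the Jacobi iteration (A sparse, diagonally dominant)."""
    D = A.diagonal()                   # diagonal entries of A
    R = A - sp.diags(D)                # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_new = (b - R @ x) / D        # x_{k+1} = D^{-1} (b - R x_k)
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new
        x = x_new
    return x

# Usage with a small diagonally dominant sparse system.
A = sp.csr_matrix(np.array([[4.0, 1.0, 0.0],
                            [1.0, 5.0, 2.0],
                            [0.0, 2.0, 6.0]]))
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b)
```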
Today's exascale scientific applications or advanced instruments are producing vast volumes of data,...
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors o...
Extremely large-scale scientific simulation applications have been very important in many scientific...
With the ever-increasing volumes of data produced by today's large-scale scientific simulations, ...
Today's scientific simulations require a significant reduction of the data size because of extrem...
Advancement in computational speed is nowadays gained by using more processing units rather than fas...
Today's scientific high-performance computing applications and advanced instruments are producing...
Because of the ever-increasing data being produced by today's high-performance computing (HPC) sc...
Error-bounded lossy compression is critical to the success of extreme-scale scientific research beca...
Today's scientific simulations are producing vast volumes of data that cannot be stored and transfer...
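To illustrate the guarantee that "error-bounded" lossy compression refers to in the abstracts above, below is a minimal uniform-quantization sketch in which every reconstructed value provably stays within a user-chosen absolute error bound of the original. It is a generic illustration with assumed names, and it omits the prediction and entropy-coding stages that production compressors add on top of such a quantizer.

```python
import numpy as np

def quantize(data, abs_err):
    """Map each value to an integer bin of width 2*abs_err.

    Reconstructing from the bin centers guarantees
    |original - reconstructed| <= abs_err for every element.
    """
    return np.round(data / (2.0 * abs_err)).astype(np.int64)

def dequantize(codes, abs_err):
    """Recover approximate values from the integer bin indices."""
    return codes * (2.0 * abs_err)

# Usage: the maximum pointwise error never exceeds the requested bound.
data = np.random.rand(1_000_000)
abs_err = 1e-3
codes = quantize(data, abs_err)      # integers, amenable to entropy coding
recon = dequantize(codes, abs_err)
assert np.max(np.abs(data - recon)) <= abs_err
```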