Abstract. As parallel file systems span larger and larger numbers of nodes in order to provide the performance and scalability necessary for modern cluster applications, the need for fault tolerance and high data availability in file systems has arisen. Modern parallel file systems spanning tens, hundreds, or even thousands of servers will require fault tolerance to avoid job failure and catastrophic data loss due to a single disk failure or server loss. Effective fault tolerance in parallel file systems must provide a high degree of data resiliency, consistency, and scalable performance. In this paper, we describe a data replication technique that meets the resiliency and consistency requirements of parallel file systems and provides scalable performanc...
We present a replication control protocol for distributed file systems that can guarantee strict con...
Dumping large amounts of related data simultaneously to local storage devices...
Replication is a key technique for improving fault tolerance. Replication can also improve applicati...
Distributed environments such as networks of workstations are becoming more cost-effecti...
The vulnerability of computer nodes due to component failures is a critical issue for cluster-based ...
Providing data availability in a high performance computing environment is very importan...
© 2005 Springer Verlag. Providing data availability in a high performance computing envir...
A parallel single level store (PSLS) system integrates a shared virtual memory and a parallel file s...
Distributed systems provide the opportunity for fault tolerance through replication. This dissertati...
Distributed file systems need to provide for fault tolerance. This is typically achieved with the re...
Abstract. As computational clusters increase in size, their mean-time-to-failure reduces drastically....
The introduction of Exascale storage into production systems will lead to an increase in the number ...
In this paper, we propose a new fault-tolerant model for replication in distributed-file...
Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically ...
Sharing data in scientific collaborations that involve many institutions around the world demands a...