Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of MapReduce, Resilient Data Sets and MillWheel has shown that dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exception...
Projections and reports about exascale failure modes conclude that we need to protect numerical simu...
The evolution of systems during their operational lifetime is becoming ine...
This paper describes a technique for implementing $k$-resilient objects - distributed objects that ...
We present a formal small-step structural operational semantics for a large fragment of X10, unifyin...
The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputer...
Maintaining performance in a faulty distributed computing environment is a major challenge in the de...
High-performance systems pose a number of challenges to traditional fault tolerance approaches. The...
2015-08-04. Future exascale high-performance computing (HPC) systems will be constructed using VLSI de...
The Global Matrix Library (GML) is a distributed matrix library in the X10 language. GML is designed...
High-Performance Computing (HPC) has passed the Petascale mark and is moving forward to Exascale. As...
Big data processing frameworks (MapReduce, Hadoop, Dryad) are hugely popular today because they grea...
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale...
Resilient objects are instances of distributed abstract data types that are tolerant to failures. D...
The path to exascale poses several challenges related to power, performance, resilience, productivit...
Integrating recent advancements in resilient algorithms and techniques into ex...