To support incremental replay of message-passing applications, processes must periodically checkpoint, and the content of some messages must be logged to break dependencies of the current execution state on past events. The paper presents a new adaptive logging algorithm that dynamically decides whether to log a message based on the dependencies the incoming message introduces on past events of the execution. The paper discusses implementation issues of the algorithm and evaluates its performance on several applications, showing how it improves on previously known schemes.

1. Introduction

Debugging long-running parallel/distributed programs requires the capability of incremental replay, i.e., of replaying selected intervals of an ...
Checkpointing is widely used in robust fault-tolerant applications. We present an efficient incremen...
Debugging concurrent programs is known to be difficult due to scheduling non-determinism. The techni...
Fault tolerance can allow processes executing in a computer system to survive failures within the sy...
Debugging of concurrent systems is a tedious and error-pron...
While a lot of work has been focused on design and programming of shared memory multi-core architect...
A common debugging strategy involves re-executing a program (on a given input) over and over, each t...
A message is in-transit with respect to a global state if its sending is recorded in this glob...
Causal-consistent reversible debugging is an innovative technique for debugging concurrent sy...
Significant time is spent by companies trying to reproduce and fix bugs. BugNet is a recent architec...
Debugging requires execution replay. Locations of bugs are rarely known in advance, so an execution ...
Reproducing a failure is the first and most important step in debugging because it enables us to und...
Logging is a well-established technique to record dynamic information during system execution. It ha...