Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale, and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug, in a specific code region and at a specific point during program execution. AutomaDeD creates a statistical model of the application's control-flow and timing behavior that organizes tasks into groups and identifies deviations from normal execu...
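The abstract describes AutomaDeD's model only at a high level. The following is therefore a simplified Python sketch of the general idea, not AutomaDeD's actual model. Each task's control flow is summarized as normalized frequencies of transitions between code regions. The task whose profile is farthest, on average, from all other tasks is flagged as the likeliest to have deviated. The region names, the L1 distance, and the helper functions are assumptions made for illustration. Timing data and the grouping of tasks into clusters are omitted.

```python
from collections import Counter


def transition_profile(trace):
    """Normalized frequency of each (region, next_region) transition in one task's trace."""
    edges = Counter(zip(trace, trace[1:]))
    total = sum(edges.values()) or 1  # avoid dividing by zero for one-event traces
    return {edge: count / total for edge, count in edges.items()}


def distance(p, q):
    """L1 distance between two sparse transition profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def most_anomalous_task(traces):
    """Return (task_id, score) for the task farthest, on average, from all other tasks."""
    if len(traces) < 2:
        raise ValueError("need at least two tasks to compare")
    profiles = {tid: transition_profile(t) for tid, t in traces.items()}
    scores = {}
    for tid, p in profiles.items():
        dists = [distance(p, q) for other, q in profiles.items() if other != tid]
        scores[tid] = sum(dists) / len(dists)
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


# Hypothetical traces: sequences of code regions visited by four tasks.
normal = ["init", "compute", "send", "compute", "send", "finalize"]
traces = {
    0: normal,
    1: normal,
    2: normal,
    3: ["init", "compute", "recv", "recv", "recv", "recv"],  # stuck receiving
}
print(most_anomalous_task(traces))  # task 3 has the highest score
```

Averaging distances to all other tasks works when most tasks behave alike, as in this SPMD-style example. In an application where tasks legitimately play different roles, one would first group tasks with similar profiles and then look for outliers within each group.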
Software defects, commonly known as bugs, present a serious challenge for system reliability and dep...
Companies spend significant time trying to reproduce and fix bugs. We recently proposed a har...
There are few runtime tools for modestly sized computing systems, with 10^3 processors, and above th...
Developing correct and efficient software for large-scale systems is a challenging task. Developers ...
Statistical debugging identifies program behaviors that are highly correlated with failures...
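The following sketch offers one concrete illustration of correlating behaviors with failures. It ranks hypothetical boolean predicates, meaning observations recorded in each run, by how much the failure rate among runs where the predicate held exceeds the overall failure rate. This score is a simplified stand-in chosen for illustration and is not necessarily the metric used by the work summarized above. The predicate names and run data are invented.

```python
from collections import defaultdict


def rank_predicates(runs):
    """Rank predicates by failure rate when true, minus the overall failure rate.

    runs: list of (set of predicate names observed true, whether the run failed).
    """
    if not runs:
        return []
    baseline = sum(1 for _, failed in runs if failed) / len(runs)
    true_count = defaultdict(int)  # runs in which each predicate was true
    fail_count = defaultdict(int)  # failing runs in which it was true
    for preds, failed in runs:
        for p in preds:
            true_count[p] += 1
            if failed:
                fail_count[p] += 1
    scores = {p: fail_count[p] / true_count[p] - baseline for p in true_count}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical runs: predicates observed true in each run, and the outcome.
runs = [
    ({"x > 0", "buf_full"}, True),
    ({"x > 0", "buf_full"}, True),
    ({"x > 0"}, False),
    ({"x > 0"}, False),
]
print(rank_predicates(runs))  # [('buf_full', 0.5), ('x > 0', 0.0)]
```

Subtracting the baseline matters here. A predicate such as `x > 0` that holds in every run is present in failing runs but carries no information about the failure, so its score drops to zero.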
Petascale platforms with O(10^5) and O(10^6) processing cores are driving advancements in ...
This dissertation presents a comprehensive solution to the problem of debugging parallel programs...
Debugging is a tedious and time-consuming process for software developers. Therefore, providing effe...
The ever-increasing parallelism in computer systems has made software more prone to concurrency fail...
When confronted with a buggy execution of a distributed system, which are commonplace for distributed ...
With the growing use of computers in almost every aspect of our lives, software failures have greate...
As today's distributed applications increase in complexity, it becomes increasingly difficult to ...
Concurrency bugs are problems due to incorrect interleaving of parallel tasks. They are often caused...
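The following is a minimal illustration of a bug caused by an incorrect interleaving. It assumes a hypothetical model in which two tasks each increment a shared counter using a separate read step and write step. Enumerating every interleaving that preserves each task's program order shows that only the two serial schedules produce the intended result. The task names and the helper functions are illustrative.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        taken = set(slots)  # positions occupied by a's steps
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in taken else next(ib) for i in range(n)]


def run(schedule):
    """Execute (task, op) steps against one shared counter; return its final value."""
    shared = 0
    local = {}  # each task's private copy of the counter
    for task, op in schedule:
        if op == "read":
            local[task] = shared
        else:  # "write": store the task's copy plus one
            shared = local[task] + 1
    return shared


def increment(name):
    """Hypothetical non-atomic increment: a read step followed by a write step."""
    return [(name, "read"), (name, "write")]


outcomes = {}
for schedule in interleavings(increment("T1"), increment("T2")):
    outcomes.setdefault(run(schedule), []).append(schedule)

for value, schedules in sorted(outcomes.items()):
    print(f"final counter = {value}: {len(schedules)} interleaving(s)")
# final counter = 1: 4 interleaving(s)
# final counter = 2: 2 interleaving(s)
```

Exhaustive enumeration is only feasible for tiny programs because the number of interleavings grows combinatorially with the number of steps and tasks. It is used here only to make the failure mode concrete.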
Detection, diagnosis and mitigation of performance problems in today's large-scale distributed an...