A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt’s flexible instrumentation allows system developers to collect this fine-grain lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that while active collection can be expensive, real-world analytics often incur modest runtime overheads (<36%) and it enables novel lineage-based d...
Multicore is here to stay. To keep up with the hardware innovation, software developers must move fro...
With the growing use of computers in almost every aspect of our lives, software failures have greate...
A data warehousing system collects data from multiple distributed sources and stores the integrated ...
Data-intensive scalable computing (DISC) systems facilitate large-scale analytics to mine "big data"...
The constantly increasing volume of data collected in every aspect of our daily lives has necessitat...
We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT...
Petascale platforms with O(10^5) and O(10^6) processing cores are driving advancements in ...
Statistical debugging identifies program behaviors that are highly correlated with failures. Tra...
Developing correct and efficient software for large scale systems is a challenging task. Developers ...
Data lineage forms an essential aspect of today's enterprise environment. MANTA Flow is a data linea...
Modern software projects are incredible feats of engineering that manage dozens of concurrent execut...
We present STATBench, an emulator of a scalable, lightweight, and effective tool to help debug extre...
Debugging parallel programs is an order of magnitude more complex than sequential ones, and yet, mos...
Many frameworks exist for programmers to develop and deploy Big Data applicati...