We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between con...
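The abstract above describes greedily injecting errors into a clean database so that they violate data quality constraints. As a hedged, toy sketch (not the paper's actual algorithm): given a table that is clean with respect to a functional dependency, a greedy error generator can perturb one tuple so that exactly one detectable violation appears. The table, attribute names, and the FD `City -> Zip` below are illustrative assumptions.

```python
def inject_fd_violation(rows, lhs, rhs):
    """Toy greedy error generator (illustrative, not the paper's method).

    Given rows clean w.r.t. the FD lhs -> rhs, overwrite the second tuple's
    lhs value to match the first tuple's, ensuring the rhs values differ --
    creating exactly one detectable FD violation.
    """
    # Verify the input is clean w.r.t. the FD before injecting.
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            if r1[lhs] == r2[lhs] and r1[rhs] != r2[rhs]:
                return None  # table was not clean to begin with
    if len(rows) < 2:
        return None  # need two tuples to create a violation
    dirty = [dict(r) for r in rows]
    dirty[1][lhs] = dirty[0][lhs]            # same LHS value ...
    if dirty[1][rhs] == dirty[0][rhs]:
        dirty[1][rhs] = dirty[0][rhs] + "X"  # ... but different RHS value
    return dirty

clean = [
    {"City": "Rome",  "Zip": "00100"},
    {"City": "Milan", "Zip": "20100"},
]
dirty = inject_fd_violation(clean, "City", "Zip")
# dirty[0] and dirty[1] now agree on City but differ on Zip.
```

The hard part the paper addresses, and this sketch ignores, is doing this at scale while controlling exactly how many violations are introduced and ensuring the injected errors do not interact to create unintended extra violations.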
Data quality is one of the most important problems in data management, since dirty data often leads ...
As data analytics becomes mainstream, and the complexity of the underlying data and computation grow...
Data cleansing approaches have usually focused on detecting and fixing errors with little attention...
Repairing erroneous or conflicting data that violate a set of constraints is an important problem in d...
High quality data is a vital asset for several businesses and applications. With flawed data costing...
The amount of data being collected is growing exponentially, both in academics as well as in busines...
Data ambiguity is inherent in applications such as data integration, location-based services, and se...
In large business applications, various data processing activities can be done...
The paper is concerned with the problem of automatic detection and correction of erroneous data into...
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then f...
Many systems and applications are data-driven, and the correctness of their operation relies hea...
An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, in...
Despite the increasing importance of data quality and the rich theoretical and practical contributio...