We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between con...
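The abstract above describes greedily injecting errors into a clean database so that they violate data quality constraints. As a hedged, toy sketch (not the paper's actual algorithm): given a table that is clean with respect to a functional dependency, a greedy error generator can perturb one tuple so that exactly one detectable violation appears. The table, attribute names, and the FD `City -> Zip` below are illustrative assumptions.

```python
def inject_fd_violation(rows, lhs, rhs):
    """Toy greedy error generator (illustrative, not the paper's method).

    Given rows clean w.r.t. the FD lhs -> rhs, overwrite the second tuple's
    lhs value to match the first tuple's, ensuring the rhs values differ --
    creating exactly one detectable FD violation.
    """
    # Verify the input is clean w.r.t. the FD before injecting.
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            if r1[lhs] == r2[lhs] and r1[rhs] != r2[rhs]:
                return None  # table was not clean to begin with
    if len(rows) < 2:
        return None  # need two tuples to create a violation
    dirty = [dict(r) for r in rows]
    dirty[1][lhs] = dirty[0][lhs]            # same LHS value ...
    if dirty[1][rhs] == dirty[0][rhs]:
        dirty[1][rhs] = dirty[0][rhs] + "X"  # ... but different RHS value
    return dirty

clean = [
    {"City": "Rome",  "Zip": "00100"},
    {"City": "Milan", "Zip": "20100"},
]
dirty = inject_fd_violation(clean, "City", "Zip")
# dirty[0] and dirty[1] now agree on City but differ on Zip.
```

The hard part the paper addresses, and this sketch ignores, is doing this at scale while controlling exactly how many violations are introduced and ensuring the injected errors do not interact to create unintended extra violations.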
Data quality is one of the most important problems in data management, since dirty data often leads ...
As data analytics becomes mainstream, and the complexity of the underlying data and computation grow...
Data cleansing approaches have usually focused on detecting and fixing errors with little attention...
Repairing erroneous or conflicting data that violate a set of constraints is an important problem in d...
High quality data is a vital asset for several businesses and applications. With flawed data costing...
The amount of data being collected is growing exponentially, both in academics as well as in busines...
Data ambiguity is inherent in applications such as data integration, location-based services, and se...
In large business applications, various data processing activities can be done...
The paper is concerned with the problem of automatic detection and correction of erroneous data into...
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then f...
Many systems and applications are data-driven, and the correctness of their operation relies hea...
An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, in...
Despite the increasing importance of data quality and the rich theoretical and practical contributio...