An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view mainte...
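The duplicate-correction idea described above can be sketched in a few lines: when a sampled record is duplicated in the full dataset, it is over-represented in a uniform sample, so its contribution to an avg estimate is down-weighted by its duplicate count. This is a minimal illustration of that weighting, not the SampleClean implementation; the function name and input shape are assumptions for the example.

```python
def duplicate_corrected_avg(sample):
    """Estimate the average over distinct entities from a uniform sample.

    sample: list of (value, num_duplicates) pairs, where num_duplicates is
    how many times the sampled record appears in the full (dirty) dataset.
    """
    # Each record's weight is scaled down by its duplication factor,
    # so an entity duplicated d times contributes as a single entity.
    weights = [1.0 / d for _, d in sample]
    total = sum(v / d for v, d in sample)
    return total / sum(weights)

# Example: entity with value 10 is duplicated twice; value 20 appears once.
# A naive mean of the dirty sample would be (10 + 10 + 20) / 3 ≈ 13.3,
# while the corrected estimate recovers the clean mean of 15.
est = duplicate_corrected_avg([(10, 2), (10, 2), (20, 1)])
```

In the full technique, such point estimates also come with confidence intervals derived from the sample variance, which is what makes the approximate query results statistically bounded.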
We study the problem of introducing errors into clean databases for the purpose of benchmarking data...
Data Cleaning is a long-standing problem, which is growing in importance with the mass of uncurated...
Aggregate query processing over very large datasets can be slow and prone to error due to dirty (mis...
Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on...
Data Analytics (DA) is a technology used to make correct decisions through proper analysis and predi...
Organizations collect a substantial amount of users' data from multiple sources to explore such data ...
The detection of duplicate tuples, corresponding to the same real-world entity, is an important task...
As data analytics becomes mainstream, and the complexity of the underlying data and computation grow...
Data quality affects machine learning (ML) model performances, and data scientists spend considerabl...
Digitally collected data suffers from many data quality issues, such as duplicate, incorrect,...
Recently, Big Data has become one of the important new factors in the business field. This needs to h...
Incomplete data is ubiquitous. When a user issues a query over incomplete data, the results may cont...
Data cleaning is the process of identifying and correcting the inconsistencie...
Data Cleaning, despite being a long-standing problem, has occupied the center stage again thanks to ...