We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when data from different sources are integrated. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for me...
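To make the distinction between set-based similarity and similarity between string instances concrete, here is a minimal Python sketch. It is not the measure proposed in the paper. The function names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the string similarity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Exact set-based overlap: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def string_sim(s, t):
    """Similarity in [0, 1] between two strings.

    difflib's ratio is only a stand-in here (assumption); any
    string similarity could be substituted.
    """
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()


def soft_overlap(a, b, threshold=0.8):
    """Fraction of values in `a` with a sufficiently similar value in `b`.

    Hypothetical illustration of combining set overlap with
    string-level similarity; the threshold is an arbitrary choice.
    """
    a, b = set(a), set(b)
    if not a:
        return 0.0
    matched = sum(
        1 for s in a if any(string_sim(s, t) >= threshold for t in b)
    )
    return matched / len(a)


# Two attribute value sets that describe the same cities, one with typos.
clean = ["New York", "Boston", "Chicago"]
dirty = ["New Yrok", "Boston", "Chicgo"]

exact_score = jaccard(clean, dirty)
soft_score = soft_overlap(clean, dirty)
print(exact_score, soft_score)
```

On this toy input the exact Jaccard score is low because only one value matches character for character, while the soft overlap also counts the misspelled variants as matches. This is the kind of gap on dirty data that motivates measures that look at string instances as well as set membership.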
Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological ap...
Data analysts spend more than 80% of their time on data cleaning and integration in the whole process of d...
String similarity join is an important operation in data integration and cleansing that finds simil...
This paper proposes a new measure for similarity between basket datasets. The new measure is calcula...
Sampling-based methods have previously been proposed for the problem of finding interestin...
Identifying similarities in large datasets is an essential operation in many applications such as bi...
In this paper we address the problem of data cleaning when multiple data sources are merged to creat...
We performed an investigation of how several data relationship discovery algorithms can be combined ...
Existing techniques for schema matching are classified as either schema-based, instance-based, or a ...
As an essential operation in data cleaning, the similarity join has attracted considerable attention...
Similarity Join plays an important role in data integration and cleansing, record linkage and data d...
The problem of identifying approximately duplicate records in databases is an essential step for dat...
Similarity Joins are some of the most useful and powerful data processing techniques. They...
The ability to handle noisy or imprecise data is becoming increasingly important in computing. In th...