We study the problem of discovering joinable datasets at scale. We approach the problem from a learning perspective relying on profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in a distributed and parallel fashion. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. In contrast to the state of the art, we define a novel notion of join quality that relies on a metric considering both the containment and the cardinality proportion between join candidate attributes. We implement our approach in a system called NextiaJD, and present experiments to show the pred...
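To make the two ingredients of the join quality notion concrete, the following is a minimal sketch of how containment and cardinality proportion between two candidate attributes might be computed. The function names, the use of distinct values, and the min/max form of the proportion are assumptions made for illustration; the abstract does not give the exact definitions, and this is not the NextiaJD implementation.

```python
def containment(a, b):
    """Fraction of the distinct values of `a` that also occur in `b`.

    Assumed definition: |set(a) & set(b)| / |set(a)|.
    """
    sa, sb = set(a), set(b)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)


def cardinality_proportion(a, b):
    """Ratio between the distinct-value counts of the two attributes.

    Assumed definition: smaller cardinality divided by larger cardinality,
    so the result always lies in [0, 1].
    """
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return min(len(sa), len(sb)) / max(len(sa), len(sb))


# Hypothetical attribute values from two different datasets.
countries = ["ES", "FR", "DE", "IT"]
orders = ["ES", "FR", "FR", "PT", "DE", "US", "UK", "NL"]

print(containment(countries, orders))
print(cardinality_proportion(countries, orders))
```

A high containment with a very low cardinality proportion would indicate that one attribute is almost fully covered but much smaller than the other, which is the kind of case a combined metric can distinguish from a balanced match.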
We present a simple conceptual framework to think about computing the relational join. Using this fr...
High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mob...
Cloud enabled systems have become a crucial component to efficiently process and analyze massive amo...
Similarity join is the problem of finding pairs of records with similarity score greater than some ...
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of s...
We present three novel algorithms for performing multi-dimensional joins and an in-depth survey and ...
Similarity Joins are recognized to be among the most useful data processing and analysis operations....
This is an extended version of our paper accepted for SISAP 2021. It additionally includes descripti...
We describe a method of inferring join plans for a set of relation instances, in the absence of any ...
Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techni...
Set similarity joins, which compute pairs of similar sets, constitute an important operator primitiv...
Conference Name: 19th International Conference on Database Systems for Advanced Applications, DASFAA ...
We investigate the problem of learning join queries from user examples. The us...
Join query is one of the most expressive and expensive data analytic tools in traditional database s...
Set similarity join is an essential operation in data integration and big data analytics that finds...