Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black-box operators by statically analyzing the general-purpose code of their user-defined functions...
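A minimal sketch of the kind of reordering test this abstract describes, assuming each operator's read and write field sets have already been conservatively estimated (e.g., by static analysis of its UDF). The names OperatorProps and may_reorder are illustrative, not the paper's API; the condition shown is one standard sufficient test for swapping two record-at-a-time operators:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorProps:
    """Conservatively estimated properties of a black-box operator.
    In the setting of the abstract, the read/write sets would come
    from static analysis of the UDF code; here they are given directly."""
    name: str
    read_set: frozenset   # fields the UDF may read
    write_set: frozenset  # fields the UDF may create or modify

def may_reorder(first: OperatorProps, second: OperatorProps) -> bool:
    """Sufficient (conservative) condition for swapping two successive
    record-at-a-time operators: neither operator reads or writes a
    field that the other writes."""
    return (
        first.write_set.isdisjoint(second.read_set)
        and second.write_set.isdisjoint(first.read_set)
        and first.write_set.isdisjoint(second.write_set)
    )

# Hypothetical operators: a map deriving 'revenue' from 'price' and
# 'qty', followed by a filter on 'country'.
mp  = OperatorProps("compute_revenue",
                    frozenset({"price", "qty"}), frozenset({"revenue"}))
flt = OperatorProps("filter_country",
                    frozenset({"country"}), frozenset())

assert may_reorder(mp, flt)  # safe to push the filter below the map
```

The test is deliberately conservative: if the analysis over-approximates a read or write set, the check may reject a valid reordering, but it never admits an unsafe one.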
Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain act...
Currently, we witness an increased interest in large-scale analytical data flows on non-relational d...
Organizations adopt different databases for big data, which is huge in volume and has different data...
Advanced database applications demand new data modeling constructs beyond those available in...
Recent decades have seen an explosion in the diversity and scale of data analytics tasks. While data...
Full support of parallelism in object-relational database systems (ORDBMSs) is desired. The parallel...
Data transformations are fundamental operations in legacy data migration, data integration, data cle...
In the last decade, the World Wide Web has grown from being a platform where users passively viewed ...
Classic query optimization in relational database systems relies on phases (algebraic, physical, cos...
Large-scale data analysis relies on custom code both for preparing the data for analysis as well as ...
Big data analytical systems, such as MapReduce, perform aggressive materialization of intermediate j...
Over the past decade, a number of data intensive scalable systems have been developed to process ext...
In recent years, complex data mining and machine learning algorithms have become more common in data...
Since the introduction of cost-based query optimization, the performance-critical role of interestin...
Traditionally, query optimizers assume a direct mapping from the logical entities modeling the data ...