Abstract: Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). We study the computation of linear regression, PCA, clustering and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross-products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and to score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to comp...
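As a rough illustration of the two summary matrices named in the abstract, the following sketch accumulates the linear sum of points and the quadratic sum of cross-products in one pass over the rows, then derives the mean and covariance from them. The names `n`, `L`, `Q`, `summarize` and `mean_and_covariance` are assumptions made for this example, and plain Python stands in for the SQL queries and UDFs the paper describes; it is not the authors' implementation.

```python
def summarize(points):
    """One pass over the rows: count n, linear sum L, quadratic sum Q."""
    d = len(points[0])
    n = 0
    L = [0.0] * d                       # L[a]    = sum over points of x[a]
    Q = [[0.0] * d for _ in range(d)]   # Q[a][b] = sum over points of x[a] * x[b]
    for x in points:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and population covariance from n, L, Q only."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


# Tiny made-up data set with d = 2 dimensions (illustrative values only).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
n, L, Q = summarize(points)
mean, cov = mean_and_covariance(n, L, Q)
```

Once `n`, `L` and `Q` are available, the data set does not need to be scanned again to obtain the mean and covariance, which is the sense in which a single table scan can suffice for this step.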
Database systems are islands of structure in a sea of unstructured data sources. Several real-world ...
Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. I...
Clustering is an important problem in Statistics and Machine Learning that is usually solved using L...
Multidimensional statistical models are generally computed outside a relational DBMS, exporting data...
In general, a relational DBMS provides limited capabilities to perform multidimensional statistical ...
This paper proposes a new approach based on the recent trend of self-tuning DBMS, by which the cost ...
We demonstrate F, a system for building regression models over database views. At its core lies the ...
Query optimizers in object-relational database management systems typically require users to provide...
For a wide variety of classification algorithms, scalability to large databases can be achieved by o...
In-database analytics is of great practical importance as it avoids the costly repeated loop data sc...
Data analytics has recently grown to include increasingly sophisticated techniques, such as...
We leverage vectorized User-Defined Functions (UDFs) to efficiently integrate unchanged machine lear...
Enterprise applications need sophisticated in-database analytics in addition to traditional online a...
One fundamental limitation of classical statistical modeling is the assumption that data is represen...
Large-scale data analysis relies on custom code both for preparing the data for analysis ...