The massive growth of modern datasets from different sources such as videos, social networks, and sensor data, coupled with limited resources in terms of time and space, raises challenging questions for existing machine learning algorithms. From the statistical point of view, having access to more data may be viewed as a blessing, as it provides a better view of the underlying (possibly stochastic) processes generating the data. At the same time, it greatly increases the cost of storing, communicating, and processing the data. This interplay between the computational and statistical aspects is one of the key challenges in large-scale machine learning. In this dissertation we propose a general approach for addressing these challenges. We stu...
Huge data sets containing millions of training examples with a large number of attributes are relati...
International audienceThe development of cluster computing frameworks has allowed practitioners to s...
We study the problem of constructing coresets for clustering problems with time series data. This pr...
Faced with massive data, is it possible to trade off (statistical) risk, and (computational) space a...
The last several years have seen the emergence of datasets of an unprecedented scale, and solving va...
Abstract How can we train a statistical mixture model on a massive data set? In this paper, we show ...
Thesis: Ph. D. in Computer Science and Engineering, Massachusetts Institute of Technology, Departmen...
In practice, machine learners often care about two key issues: one is how to obtain a more accurate...
The scalability problem in data mining involves the development of methods for handling large databa...
In the era of datasets of unprecedented sizes, data compression techniques are an attractive approac...
The advent of large-scale datasets has offered unprecedented amounts of information for building sta...
How can we train a statistical mixture model on a massive data set? In this work we show how to cons...
Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore ...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Comp...
Many existing procedures in machine learning and statistics are computationally intractable in the s...
Huge data sets containing millions of training examples with a large number of attributes are relati...
International audienceThe development of cluster computing frameworks has allowed practitioners to s...
We study the problem of constructing coresets for clustering problems with time series data. This pr...
Faced with massive data, is it possible to trade off (statistical) risk, and (computational) space a...
The last several years have seen the emergence of datasets of an unprecedented scale, and solving va...
Abstract How can we train a statistical mixture model on a massive data set? In this paper, we show ...
Thesis: Ph. D. in Computer Science and Engineering, Massachusetts Institute of Technology, Departmen...
In practice, machine learners often care about two key issues: one is how to obtain a more accurate...
The scalability problem in data mining involves the development of methods for handling large databa...
In the era of datasets of unprecedented sizes, data compression techniques are an attractive approac...
The advent of large-scale datasets has offered unprecedented amounts of information for building sta...
How can we train a statistical mixture model on a massive data set? In this work we show how to cons...
Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore ...
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Comp...
Many existing procedures in machine learning and statistics are computationally intractable in the s...
Huge data sets containing millions of training examples with a large number of attributes are relati...
International audienceThe development of cluster computing frameworks has allowed practitioners to s...
We study the problem of constructing coresets for clustering problems with time series data. This pr...