Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data c...
Distributed dataflow systems enable users to process large datasets in parallel on clusters of commo...
Data analytics frameworks enable users to process large datasets while hiding the complexity of scal...
Extensive data analysis has become the enabler for diagnostics and decision making in many modern sy...
Analyzing large datasets with distributed dataflow systems requires the use of clusters. Public clou...
Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public c...
Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of...
We address the problem of performance prediction for parallel programs executed on clusters of heter...
The success of modern applications depends on the insights they collect from their data repositories...
With the increasing adoption of distributed systems in both academia and industry, and with the incr...
Distributed data-parallel processing systems like MapReduce, Spark, and Flink are popular for analyz...
There is a huge and rapidly increasing amount of data being generated by social media, mobile applic...
Although parallel processing is a promising way of increasing performance cost-efficiently, it i...
We introduce a methodology for the study of the application-level performance of time-sharing parall...
Software service providers are increasingly adopting cloud-based solutions to maximize resource util...
Distributed dataflow systems like Spark or Flink enable users to analyze large datasets. Users creat...