A grey-box approach to benchmarking and performance modelling of data-intensive applications

Ceesay, Sheriffo

View/Open

SheriffoCeesayPhDThesis.pdf (2.716Mb)

Date

30/06/2021

Abstract

The advent of big data about a decade ago, coupled with its processing and storage challenges, gave rise to a multitude of data-intensive frameworks. These distributed parallel processing frameworks can process petabytes of data stored across a cluster of computing nodes. Companies and organisations can now process massive amounts of data to drive innovation and gain a competitive advantage. However, these new paradigms pose several research challenges because they differ inherently from the more mature traditional data processing and storage systems. Firstly, they are comparatively modern and support the execution of a wide variety of new data-intensive workloads with varying performance requirements. There is therefore a clear need to study and standardise ways to benchmark and compare them in order to identify and address performance bottlenecks. Secondly, they are highly configurable, giving users the freedom to tune the execution environment to the application's performance requirements. However, this freedom, together with the sheer number of configuration parameters, presents an additional challenge: it shifts the responsibility for tuning and optimising these parameters to the users. To address these broad challenges, in this research we developed a grey-box benchmarking and performance modelling framework focusing on two of the most common communication patterns for data-intensive applications. The use of communication patterns allowed us to classify and study varying but related data-intensive workloads using the same sets of requirements. Furthermore, we developed a multi-objective performance prediction framework that can answer various performance-related questions, such as how long an application takes to execute, which configuration parameters best satisfy constraints such as a deadline, and which cloud instances to recommend in order to minimise monetary cost.
To gauge the generality of this work, we validated the results on two internal clusters, and the results are consistent across both setups. We have also provided a REST API and web implementation for validation. The primary takeaway is that the research showcases a comprehensive approach that can be used to benchmark and model the performance of data-intensive applications.

DOI

https://doi.org/10.17630/sta/60

Type

Thesis, PhD Doctor of Philosophy

Rights

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

http://creativecommons.org/licenses/by-nc-nd/4.0/

Collections

Computer Science Theses

URI

https://hdl.handle.net/10023/23026

Except where otherwise noted within the work, this item's licence for re-use is described as Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International