SparkFlow : towards high-performance data analytics for Spark-based genome analysis

Filgueira, Rosa; Awaysheh, Feras M.; Carter, Adam; White, Darren J.; Rana, Omar

Files in this item

Name: Filgueira_2022_CCGrid_SparkFlow_AAM.pdf
Size: 753.2Kb
Format: PDF

Item metadata

dc.contributor.author	Filgueira, Rosa
dc.contributor.author	Awaysheh, Feras M.
dc.contributor.author	Carter, Adam
dc.contributor.author	White, Darren J.
dc.contributor.author	Rana, Omar
dc.date.accessioned	2022-03-24T09:26:10Z
dc.date.available	2022-03-24T09:26:10Z
dc.date.issued	2022-07-19
dc.identifier	278376911
dc.identifier	d80c456d-f3ae-4bb7-bc16-9085e042bda2
dc.identifier	85135742296
dc.identifier	000855065800111
dc.identifier.citation	Filgueira, R, Awaysheh, F M, Carter, A, White, D J & Rana, O 2022, SparkFlow : towards high-performance data analytics for Spark-based genome analysis. in 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, pp. 1007-1016, Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022), Taormina, Italy, 16/05/22. https://doi.org/10.1109/CCGrid54584.2022.00123	en
dc.identifier.citation	workshop	en
dc.identifier.isbn	9781665499576
dc.identifier.isbn	9781665499569
dc.identifier.uri	https://hdl.handle.net/10023/25076
dc.description.abstract	The recent advances in DNA sequencing technology triggered next-generation sequencing (NGS) research in full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisions on-demand). Our result demonstrates a trade-off inevitably between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper’s findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
dc.format.extent	771360
dc.language.iso	eng
dc.publisher	IEEE
dc.relation.ispartof	2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)	en
dc.subject	Big data	en
dc.subject	Scientific workflow	en
dc.subject	HPC	en
dc.subject	Genome analysis	en
dc.subject	Apache Spark	en
dc.subject	High-performance data analytics	en
dc.subject	QA75 Electronic computers. Computer science	en
dc.subject	QH301 Biology	en
dc.subject	QH426 Genetics	en
dc.subject	NS	en
dc.subject	SDG 3 - Good Health and Well-being	en
dc.subject	NIS	en
dc.subject.lcc	QA75	en
dc.subject.lcc	QH301	en
dc.subject.lcc	QH426	en
dc.title	SparkFlow : towards high-performance data analytics for Spark-based genome analysis	en
dc.type	Conference item	en
dc.contributor.institution	University of St Andrews. School of Computer Science	en
dc.identifier.doi	https://doi.org/10.1109/CCGrid54584.2022.00123

This item appears in the following Collection(s)

University of St Andrews Research