SparkFlow : towards high-performance data analytics for Spark-based genome analysis
Item metadata
dc.contributor.author | Filgueira, Rosa | |
dc.contributor.author | Awaysheh, Feras M. | |
dc.contributor.author | Carter, Adam | |
dc.contributor.author | White, Darren J. | |
dc.contributor.author | Rana, Omar | |
dc.date.accessioned | 2022-03-24T09:26:10Z | |
dc.date.available | 2022-03-24T09:26:10Z | |
dc.date.issued | 2022-07-19 | |
dc.identifier | 278376911 | |
dc.identifier | d80c456d-f3ae-4bb7-bc16-9085e042bda2 | |
dc.identifier | 85135742296 | |
dc.identifier | 000855065800111 | |
dc.identifier.citation | Filgueira , R , Awaysheh , F M , Carter , A , White , D J & Rana , O 2022 , SparkFlow : towards high-performance data analytics for Spark-based genome analysis . in 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) . IEEE , pp. 1007-1016 , Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022) , Taormina , Italy , 16/05/22 . https://doi.org/10.1109/CCGrid54584.2022.00123 | en |
dc.identifier.citation | workshop | en |
dc.identifier.isbn | 9781665499576 | |
dc.identifier.isbn | 9781665499569 | |
dc.identifier.uri | https://hdl.handle.net/10023/25076 | |
dc.description.abstract | The recent advances in DNA sequencing technology triggered next-generation sequencing (NGS) research in full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisions on-demand). Our results demonstrate an inevitable trade-off between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper’s findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm. | |
dc.format.extent | 771360 | |
dc.language.iso | eng | |
dc.publisher | IEEE | |
dc.relation.ispartof | 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) | en |
dc.subject | Big data | en |
dc.subject | Scientific workflow | en |
dc.subject | HPC | en |
dc.subject | Genome analysis | en |
dc.subject | Apache Spark | en |
dc.subject | High-performance data analytics | en |
dc.subject | QA75 Electronic computers. Computer science | en |
dc.subject | QH301 Biology | en |
dc.subject | QH426 Genetics | en |
dc.subject | NS | en |
dc.subject | SDG 3 - Good Health and Well-being | en |
dc.subject | NIS | en |
dc.subject.lcc | QA75 | en |
dc.subject.lcc | QH301 | en |
dc.subject.lcc | QH426 | en |
dc.title | SparkFlow : towards high-performance data analytics for Spark-based genome analysis | en |
dc.type | Conference item | en |
dc.contributor.institution | University of St Andrews. School of Computer Science | en |
dc.identifier.doi | 10.1109/CCGrid54584.2022.00123 |
Items in the St Andrews Research Repository are protected by copyright, with all rights reserved, unless otherwise indicated.