SparkFlow : towards high-performance data analytics for Spark-based genome analysis
Item metadata
dc.contributor.author | Filgueira, Rosa | |
dc.contributor.author | Awaysheh, Feras M. | |
dc.contributor.author | Carter, Adam | |
dc.contributor.author | White, Darren J. | |
dc.contributor.author | Rana, Omar | |
dc.date.accessioned | 2022-03-24T09:26:10Z | |
dc.date.available | 2022-03-24T09:26:10Z | |
dc.date.issued | 2022-07-19 | |
dc.identifier.citation | Filgueira, R, Awaysheh, F M, Carter, A, White, D J & Rana, O 2022, SparkFlow : towards high-performance data analytics for Spark-based genome analysis. in 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, pp. 1007-1016, Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022), Taormina, Italy, 16/05/22. https://doi.org/10.1109/CCGrid54584.2022.00123 | en |
dc.identifier.citation | workshop | en |
dc.identifier.isbn | 9781665499576 | |
dc.identifier.isbn | 9781665499569 | |
dc.identifier.other | PURE: 278376911 | |
dc.identifier.other | PURE UUID: d80c456d-f3ae-4bb7-bc16-9085e042bda2 | |
dc.identifier.other | Scopus: 85135742296 | |
dc.identifier.other | WOS: 000855065800111 | |
dc.identifier.uri | http://hdl.handle.net/10023/25076 | |
dc.description.abstract | Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisions on-demand). Our results demonstrate an inevitable trade-off between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper’s findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm. | |
dc.language.iso | eng | |
dc.publisher | IEEE | |
dc.relation.ispartof | 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) | en |
dc.rights | Copyright © 2022 IEEE. This work has been made available online in accordance with publisher policies or with permission. Permission for further reuse of this content should be sought from the publisher or the rights holder. This is the author created accepted manuscript following peer review and may differ slightly from the final published version. The final published version of this work is available at https://doi.org/10.1109/CCGrid54584.2022.00123 | en |
dc.subject | Big data | en |
dc.subject | Scientific workflow | en |
dc.subject | HPC | en |
dc.subject | Genome analysis | en |
dc.subject | Apache Spark | en |
dc.subject | High-performance data analytics | en |
dc.subject | QA75 Electronic computers. Computer science | en |
dc.subject | QH301 Biology | en |
dc.subject | QH426 Genetics | en |
dc.subject | NS | en |
dc.subject | NIS | en |
dc.subject.lcc | QA75 | en |
dc.subject.lcc | QH301 | en |
dc.subject.lcc | QH426 | en |
dc.title | SparkFlow : towards high-performance data analytics for Spark-based genome analysis | en |
dc.type | Conference item | en |
dc.description.version | Postprint | en |
dc.contributor.institution | University of St Andrews. School of Computer Science | en |
dc.identifier.doi | https://doi.org/10.1109/CCGrid54584.2022.00123 |
Items in the St Andrews Research Repository are protected by copyright, with all rights reserved, unless otherwise indicated.