Item metadata

dc.contributor.author: Filgueira, Rosa
dc.contributor.author: Awaysheh, Feras M.
dc.contributor.author: Carter, Adam
dc.contributor.author: White, Darren J.
dc.contributor.author: Rana, Omar
dc.date.accessioned: 2022-03-24T09:26:10Z
dc.date.available: 2022-03-24T09:26:10Z
dc.date.issued: 2022-07-19
dc.identifier: 278376911
dc.identifier: d80c456d-f3ae-4bb7-bc16-9085e042bda2
dc.identifier: 85135742296
dc.identifier: 000855065800111
dc.identifier.citation: Filgueira, R, Awaysheh, F M, Carter, A, White, D J & Rana, O 2022, SparkFlow: towards high-performance data analytics for Spark-based genome analysis. in 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, pp. 1007-1016, Workshop on Clusters, Clouds and Grids for Life Sciences (CCGrid Life 2022), Taormina, Italy, 16/05/22. https://doi.org/10.1109/CCGrid54584.2022.00123 [en]
dc.identifier.citation: workshop [en]
dc.identifier.isbn: 9781665499576
dc.identifier.isbn: 9781665499569
dc.identifier.uri: https://hdl.handle.net/10023/25076
dc.description.abstract: Recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research on a full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load-balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and the processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper’s findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
dc.format.extent: 771360
dc.language.iso: eng
dc.publisher: IEEE
dc.relation.ispartof: 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) [en]
dc.subject: Big data [en]
dc.subject: Scientific workflow [en]
dc.subject: HPC [en]
dc.subject: Genome analysis [en]
dc.subject: Apache Spark [en]
dc.subject: High-performance data analytics [en]
dc.subject: QA75 Electronic computers. Computer science [en]
dc.subject: QH301 Biology [en]
dc.subject: QH426 Genetics [en]
dc.subject: NS [en]
dc.subject: SDG 3 - Good Health and Well-being [en]
dc.subject: NIS [en]
dc.subject.lcc: QA75 [en]
dc.subject.lcc: QH301 [en]
dc.subject.lcc: QH426 [en]
dc.title: SparkFlow: towards high-performance data analytics for Spark-based genome analysis [en]
dc.type: Conference item [en]
dc.contributor.institution: University of St Andrews. School of Computer Science [en]
dc.identifier.doi: 10.1109/CCGrid54584.2022.00123

