Title: Improving the Performance of the Variant Calling Workflow for DNA Sequencing
Author: Heiss, Jonathan (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Epema, D.H.J. (mentor); Datema, E. (mentor); Al-Ars, Z. (graduation committee)
Degree granting institution: Delft University of Technology
Project: ICT Innovation (Cloud Computing and Services)
Date: 2017-08-30

Abstract: The growing volumes of DNA data produced by novel, efficient DNA sequencing methods pose new challenges for the computer systems used to process this genomic data. Big Data technologies in the Hadoop environment, in particular Apache Spark and the Hadoop Distributed File System (HDFS), are increasingly adopted in state-of-the-art bioinformatics tools. One application domain of such tools is the Variant Calling Workflow (VCW), which is the subject of this work. Applying Spark-based open source tools to execute the different VCW steps results in a tool chain of separate programs. The programs are executed consecutively, and each consumes the data produced by the preceding program as input. This data sharing represents an additional workload, as the data needs to be transformed into a file format, written to disk, and read from disk again by the next application.

In our first research question we examine whether performance can be increased by improving data sharing. As improvements we propose (1) eliminating the single output file generation that is native to most open source bioinformatics tools and (2) using the distributed in-memory file system Alluxio as an in-memory layer for data sharing between any two consecutive VCW applications.
In our experiments, (1) improved performance by 17 %, whereas (2) yielded no improvement.

In our second research question we investigate how data throughput can be improved by changing the execution mode of the VCW. The growth in DNA data mainly takes the form of a larger number of DNA samples. Hence, it is important to optimize the VCW execution for multi-sample input. As part of the second research question we propose different execution modes and show in our experiments that, for 3 input samples of 10 GB, concurrent workflow execution can improve the overall runtime of our VCW by 15 % compared to sequential execution.

Subject: Variant Calling, Workflow, DNA Sequencing, Spark, Hadoop, In-Memory File System, Performance
To reference this document use: http://resolver.tudelft.nl/uuid:7d02ec4a-0d99-453a-8950-f54287d91e2a
Part of collection: Student theses
Document type: master thesis
Rights: © 2017 Jonathan Heiss
Files: PDF MasterThesis_JonathanHeis ... 595513.pdf (1.65 MB)