Improving the Performance of the Variant Calling Workflow for DNA Sequencing

Heiss, Jonathan

Title

Improving the Performance of the Variant Calling Workflow for DNA Sequencing

Author

Heiss, Jonathan (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

Epema, D.H.J. (mentor)
Datema, E (mentor)
Al-Ars, Z. (graduation committee)

Degree granting institution

Delft University of Technology

Project

ICT Innovation (Cloud Computing and Services)

Date

2017-08-30

Abstract

The growing volumes of DNA data produced by novel, efficient DNA sequencing methods pose new challenges for the computer systems used to process this genomic data. Big data technologies in the Hadoop ecosystem, in particular Apache Spark and the Hadoop Distributed File System (HDFS), are increasingly adopted in state-of-the-art bioinformatics tools. One application domain of such tools is the Variant Calling Workflow (VCW), which is the subject of this work. Using Spark-based open source tools to execute the different VCW steps results in a tool chain of separate programs. The programs are executed consecutively, and each consumes the data produced by the preceding program as input. This data sharing represents an additional workload, as the data needs to be transformed into a file format, written to disk, and read from disk again by the next application.
In our first research question, we examine whether performance can be increased by improving data sharing. As improvement measures, we propose (1) eliminating the single output file generation that is native to most open source bioinformatics tools and (2) using the distributed in-memory file system Alluxio as an in-memory layer for data sharing between any two consecutive VCW applications. In our experiments, measure (1) yielded a performance improvement of 17 %, whereas measure (2) did not improve performance.
In our second research question, we investigate how data throughput can be improved by changing the execution mode of the VCW. The growth in DNA data mainly takes the form of a larger number of DNA samples. Hence, it is important to optimize VCW execution for multi-sample input. As part of the second research question, we propose different execution modes and show in our experiments that, for 3 input samples of 10 GB, concurrent workflow execution can improve the overall runtime of our VCW by 15 % compared to sequential execution.

Subject

Variant Calling
Workflow
DNA Sequencing
Spark
Hadoop
In-Memory File System
Performance

To reference this document use:

http://resolver.tudelft.nl/uuid:7d02ec4a-0d99-453a-8950-f54287d91e2a

Part of collection

Student theses

Document type

master thesis

Files

PDF

MasterThesis_JonathanHeis ... 595513.pdf

1.65 MB
