Parallelization of Variable Rate Decompression for GPU Acceleration

Author: Noordsij, Lennart (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Al-Ars, Zaid (mentor)
Degree granting institution: Delft University of Technology
Date: 2019-06-14

Abstract: Data movement has long been identified as the biggest challenge facing modern computer systems designers. To tackle this challenge, many novel data compression algorithms have been developed. These compression algorithms can be embedded into bandwidth-bound applications to reduce their memory traffic volume. As a result, data decompression is, in many instances, on the critical path of application execution, while the compression itself can happen offline or outside of the critical path. Fast data decompression is therefore of utmost importance. However, most existing parallel decompression schemes adopt a particular parallelization strategy suited to a particular hardware platform. Such an approach fails to harness the parallelism found in diverse modern hardware architectures. To this end, we propose multiple parallelization strategies for variable rate data decompression, aimed at utilizing parallel architectures efficiently. Our strategies are based on generating extra information during the encoding phase and passing this information in a side-channel to the decoder, which can then use this extra information to speed up the decoding process tremendously. To demonstrate the effectiveness of our strategies, we implement them in a state-of-the-art compression algorithm called ZFP and apply them to a real-life industrial application from ASML: a feed-forward control model for controlling wafer heat in EUV lithography machines. Our implementation is publicly available on GitHub.
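The core obstacle with variable rate formats is that block i can only be located by first decoding blocks 0..i-1, which serializes the decoder. The side-channel idea described in the abstract can be illustrated with a minimal Python sketch: the encoder records where each variable-rate block starts, so a decoder can jump to any block and decode it independently (e.g., one GPU thread per block). This is not the ZFP implementation; the length-prefixed toy format and the function names here are hypothetical, chosen only to illustrate the principle.

```python
def encode(blocks):
    """Encode each block with a variable number of bytes and record
    the starting offset of every block (the side-channel)."""
    payload = bytearray()
    offsets = []                      # side-channel: one offset per block
    for block in blocks:
        offsets.append(len(payload))
        payload.append(len(block))    # toy format: length prefix ...
        payload.extend(block)         # ... followed by the raw bytes
    return bytes(payload), offsets

def decode_block(payload, offset):
    """Decode a single block given its start offset. Each call is
    independent of all others, so all blocks can be decoded in parallel."""
    n = payload[offset]
    return payload[offset + 1 : offset + 1 + n]

blocks = [b"foo", b"quux", b"ab"]
payload, offsets = encode(blocks)
# Without the offsets, finding block i requires sequentially decoding
# blocks 0..i-1; with them, every block is directly addressable.
decoded = [decode_block(payload, off) for off in offsets]
assert decoded == blocks
```

The thesis reports that on a CPU this kind of side-channel can be kept very small relative to the compressed data; in this sketch the overhead is simply one offset per block.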
The application is dominated by matrix-vector multiplication (which is bandwidth-bound) and is executed on GPUs. We show that the parallelization strategies suited for multicore CPUs differ from those suited for GPUs. On a CPU, we achieve a near-optimal speedup with a side-channel overhead consistently below 0.04% of the compressed data size. On a GPU, we achieve a decoding throughput of more than 130 GiB/s, which allows us to execute the ASML application within the given time budget.

To reference this document use: http://resolver.tudelft.nl/uuid:ebf17e05-4d9f-4a73-8c77-c1e7073e932f
Embargo date: 2020-06-14
Part of collection: Student theses
Document type: master thesis
Rights: © 2019 Lennart Noordsij
Files: PDF, Parallelization_of_Variab ... ration.pdf (1.21 MB)