Title: The effects on speech detection of low sample frequency audio data
Author: Uno, Taichi (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Hung, H.S. (mentor); Vargas Quiros, J.D. (mentor); Baaijens, J.A. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science and Engineering
Project: CSE3000 Research Project
Date: 2022-06-24

Abstract: Interactions between humans and machines are now common in daily life. Audio recordings of human communication are a rich source of information, but it is considered privacy-invasive for machines to listen to them. Reducing the sampling frequency can preserve privacy by making conversation unclear while still allowing detection of whether someone is speaking. This paper investigates how audio data with a low sampling frequency hinders the detection of speech. To detect speaking, voice activity detection was applied, a signal processing technique that identifies which short segments of audio contain speech. Three state-of-the-art voice activity detectors (VADs) of two types were used in the experiment: one supervised method (pyannote) and two unsupervised methods (rVAD in pitch mode and in flatness mode). The unsupervised methods outperformed the supervised model, and rVAD in pitch mode performed best of the three. More specifically, the performance of the unsupervised VADs fell as the sample rate decreased, while the supervised VAD did not work well at higher sample frequencies. At sample rates of 8000 Hz or higher, rVAD in pitch mode performed at almost the same level as a state-of-the-art supervised VAD trained on a similar data set. Furthermore, it performed as well as a modern unsupervised VAD at sample frequencies of 2000 Hz or higher.
At sample rates of 1250 Hz or lower, none of the VADs performed at the same level as a state-of-the-art VAD. Regarding privacy, human listeners were observed to detect speaking better than computers, and humans can understand part or all of the speech content at 2000 Hz or higher. This implies that current technology is not sufficient to detect speech from downsampled, privacy-preserving audio. However, further research is still needed to verify the effects of the training set and its sample frequencies on the supervised method, along with properly designed scientific social experiments to test how well humans detect speech in audio with a reduced sample rate.

Subject: Speech detection; Low sample frequency audio; Voice Activity Detection
To reference this document use: http://resolver.tudelft.nl/uuid:80ac9f4d-3bdb-4374-9227-343aee94356f
Part of collection: Student theses
Document type: bachelor thesis
Rights: © 2022 Taichi Uno
Files: FINAL_Research_Paper.pdf (PDF, 369.33 KB)
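
Illustrative note on the sample rates in the abstract: a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist limit), which is why lowering the sample rate removes the higher-frequency content of speech. The Python sketch below is not taken from the thesis, and this record does not describe its downsampling procedure. The function names `nyquist_hz`, `alias_hz`, and `sample_tone` are hypothetical helpers introduced only for illustration.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at this sample rate."""
    return sample_rate_hz / 2.0


def alias_hz(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone sampled WITHOUT an anti-aliasing filter.

    Frequencies above the Nyquist limit fold back into the representable band.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_tone(tone_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine tone at the given rate."""
    return [
        math.sin(2 * math.pi * tone_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


if __name__ == "__main__":
    # Sample rates mentioned in the abstract.
    for rate in (8000, 2000, 1250):
        print(f"{rate} Hz sampling keeps content up to {nyquist_hz(rate)} Hz")

    # A 1000 Hz tone cannot be represented at 1250 Hz sampling (limit 625 Hz).
    print("1000 Hz tone at 1250 Hz sampling appears at",
          alias_hz(1000, 1250), "Hz")
```

In practice a resampler applies a low-pass filter before reducing the rate, so content above the new Nyquist limit is removed and does not fold back as in `alias_hz`.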