Few shot emotion recognition using intelligent voice assistants and wearables

Title: Few shot emotion recognition using intelligent voice assistants and wearables: Learning from few samples of speech and physiological signals
Author: Kapadia, Mihir (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Ali, Abdallah El (mentor); van der Veen, A.J. (graduation committee); Cesar, Pablo (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2022-03-29

Abstract:
Emotion recognition is one of the most widely studied areas of affective computing, and attempts have been made to design emotion recognition systems for everyday settings. The ubiquitous presence of intelligent voice assistants (IVAs) in households makes them a natural anchor for introducing emotion recognition technology to consumers. Existing systems lack such pipelines and rely on dictionary-based architectures; moreover, they lack conversational properties and are merely extensions of information retrieval engines. In this setting, we propose and develop emotion recognition pipelines suited to the interactions common with these IVAs. To augment existing emotion recognition pipelines that rely on audio information, we incorporate physiological information derived from wearables. Our proposed model uses multimodal embeddings with a Siamese network to recognize emotion from few samples. Physiological signals of blood volume pulse (BVP) and electrodermal activity (EDA) serve as additional input embeddings alongside two audio embeddings derived from the speech samples. We employ state-of-the-art training schedules for Siamese networks, which require only a very limited amount of training on support datasets via sample pair comparisons.
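The pair-comparison scheme described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the thesis's actual architecture: the embedding dimensions, the fusion of modalities by simple concatenation through a shared projection, and the exp(-L1) similarity score are all assumptions made for clarity. The key Siamese property it does show is that both samples in a pair pass through the *same* weights, so a query is scored by its embedding distance to a labelled support sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes, chosen only for illustration.
AUDIO_DIM, BVP_DIM, EDA_DIM, EMBED_DIM = 64, 16, 16, 32

# Shared projection weights: both branches of a Siamese network use the
# SAME parameters, so comparable inputs map to comparable embeddings.
W = rng.standard_normal((2 * AUDIO_DIM + BVP_DIM + EDA_DIM, EMBED_DIM)) * 0.05

def embed(audio_a, audio_b, bvp, eda):
    """Fuse two audio embeddings with BVP and EDA into one multimodal embedding."""
    x = np.concatenate([audio_a, audio_b, bvp, eda])
    return np.tanh(x @ W)

def similarity(sample1, sample2):
    """Score a pair via L1 distance between shared embeddings (1.0 = identical)."""
    d = np.abs(embed(*sample1) - embed(*sample2)).sum()
    return np.exp(-d)

def make_sample():
    """Random stand-in for one (audio, audio, BVP, EDA) multimodal sample."""
    return (rng.standard_normal(AUDIO_DIM), rng.standard_normal(AUDIO_DIM),
            rng.standard_normal(BVP_DIM), rng.standard_normal(EDA_DIM))

support = make_sample()       # one labelled support sample (few-shot setting)
query_other = make_sample()   # an unrelated query sample

# A query identical to the support scores higher than an unrelated one.
print(similarity(support, support) > similarity(support, query_other))  # True
```

In the few-shot setting this is used at inference time by comparing a query against the handful of labelled support samples per class and assigning the label of the most similar one; training would tune the shared weights on sample pairs, which the random `W` above merely stands in for.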
The performance of the model is evaluated using weighted binary accuracy and F1 scores. The proposed model is applied to two datasets representing two distinct experimental settings: the K-EmoCon dataset and the RECOLA dataset. We demonstrate an improvement over the state-of-the-art accuracy on the K-EmoCon dataset, with accuracies of 63.97% and 66.91% on the arousal and valence dimensions respectively. On the RECOLA dataset, the model performs moderately well, with 53.81% and 53.87% for the arousal and valence dimensions respectively. In addition, we present a study of the effect of varying the available support set for training. We make several salient observations across individual participants and identify how the label distributions affect model performance. Finally, we investigate the impact of real-world noise samples from the DEMAND dataset on the two datasets, and observe that the proposed model is robust, performing consistently well even in the presence of imputed noise.

Subject: Emotion Classification; Audio Classification; Deep Learning; few shot learning; Mel Spectrogram; Chatbot; Wearable Technology; Voice Assistants
To reference this document use: http://resolver.tudelft.nl/uuid:f5a8b7e6-488f-43a0-a417-66aab2c5c736
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Mihir Kapadia
Files: PDF Final_Thesis_Mihir_Kapadi ... 96278_.pdf (11.09 MB)