Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks

Mansouri-Benssassi, Esma; Ye, Juan


Files in this item

Name: 2019IJCNN_Esma.pdf
Size: 490.5Kb
Format: PDF


Item metadata

dc.contributor.author	Mansouri-Benssassi, Esma
dc.contributor.author	Ye, Juan
dc.date.accessioned	2019-11-25T10:30:01Z
dc.date.available	2019-11-25T10:30:01Z
dc.date.issued	2019-09-30
dc.identifier	262322181
dc.identifier	b2ef2d01-4026-403d-9efe-376ec43cd491
dc.identifier	85073258399
dc.identifier	000530893806020
dc.identifier.citation	Mansouri-Benssassi, E & Ye, J 2019, Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks. in 2019 International Joint Conference on Neural Networks, IJCNN 2019, 8852473, Proceedings of the International Joint Conference on Neural Networks, vol. 2019-July, Institute of Electrical and Electronics Engineers Inc., pp. 1-8, 2019 International Joint Conference on Neural Networks, IJCNN 2019, Budapest, Hungary, 14/07/19. https://doi.org/10.1109/IJCNN.2019.8852473	en
dc.identifier.citation	conference	en
dc.identifier.isbn	9781728119854
dc.identifier.issn	2161-4393
dc.identifier.other	ORCID: /0000-0002-2838-6836/work/68280979
dc.identifier.uri	https://hdl.handle.net/10023/18994
dc.description.abstract	Speech emotion recognition (SER) is an important part of affective computing and signal processing research. A number of approaches, especially deep learning techniques, have achieved promising results on SER. However, there are still challenges in translating temporal and dynamic changes in emotions through speech. Spiking Neural Networks (SNNs) have been demonstrated to be a promising approach in machine learning and pattern recognition tasks such as handwriting and facial expression recognition. In this paper, we investigate the use of SNNs for SER tasks and, more importantly, we propose a new cross-modal enhancement approach. This method is inspired by auditory information processing in the brain, where auditory information is preceded, enhanced and predicted by visual processing in multisensory audio-visual processing. We have conducted experiments on two datasets to compare our approach with state-of-the-art SER techniques in both uni-modal and multi-modal aspects. The results have demonstrated that SNNs can be an ideal candidate for modeling temporal relationships in speech features and that our cross-modal approach can significantly improve the accuracy of SER.
dc.format.extent	8
dc.format.extent	502359
dc.language.iso	eng
dc.publisher	Institute of Electrical and Electronics Engineers Inc.
dc.relation.ispartof	2019 International Joint Conference on Neural Networks, IJCNN 2019	en
dc.relation.ispartofseries	Proceedings of the International Joint Conference on Neural Networks	en
dc.subject	Multisensory integration	en
dc.subject	Speech Emotion Recognition	en
dc.subject	Spiking Neural Networks	en
dc.subject	Unsupervised learning	en
dc.subject	QA75 Electronic computers. Computer science	en
dc.subject	T Technology	en
dc.subject	Artificial Intelligence	en
dc.subject	Software	en
dc.subject	3rd-DAS	en
dc.subject.lcc	QA75	en
dc.subject.lcc	T	en
dc.title	Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks	en
dc.type	Conference item	en
dc.contributor.institution	University of St Andrews. School of Computer Science	en
dc.identifier.doi	10.1109/IJCNN.2019.8852473

This item appears in the following Collection(s)

University of St Andrews Research
