Article Details |
Article title | Evaluating deep learning architectures for Speech Emotion Recognition |
Publication year | 2017 |
English article length | 9 pages |
Cost | The English article is free to download. |
Database | Elsevier |
Article type | Research Article |
Base article | This is a base article. |
Indexing | MedLine – Scopus – Master Journal List – JCR |
Article classification | ISI |
English article format | |
Impact Factor (IF) | 8.446 in 2017 |
H-index | 121 in 2019 |
SJR | 2.359 in 2017 |
ISSN | 0893-6080 |
Quartile | Q1 in 2017 |
Related fields | Computer engineering, information technology |
Related specializations | Artificial intelligence, computer networks |
Presentation type | Journal |
Journal | Neural Networks |
University | School of Engineering – RMIT University – Melbourne VIC – Australia |
Keywords | Affective computing, Deep learning, Emotion recognition, Neural networks, Speech recognition |
DOI | http://dx.doi.org/10.1016/j.neunet.2017.02.013 |
Product code | E10738 |
Article table of contents: |
Abstract
1- Introduction
2- Related work
3- Deep learning: An overview
4- Proposed speech emotion recognition system
5- Experimental setup
6- Experiments and results
7- Discussion
8- Conclusion
References
Excerpt from the article: |
Abstract

Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances.

Introduction

In recent years, deep learning in neural networks has achieved tremendous success in various domains, which has led to multiple deep learning architectures emerging as effective models across numerous tasks. Feed-forward architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (ConvNets) have been particularly successful in image and video processing as well as speech recognition, while recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) RNNs have been effective in speech recognition and natural language processing (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). These architectures process and model information in different ways and have their own advantages and limitations. For instance, ConvNets are able to deal with high-dimensional inputs and learn features that are invariant to small variations and distortions (Krizhevsky, Sutskever, & Hinton, 2012), whereas LSTM-RNNs are able to deal with variable-length inputs and model sequential data with long-range context (Graves, 2008).
In this paper, we investigate the application of end-to-end deep learning to Speech Emotion Recognition (SER) and critically explore how each of these architectures can be employed in this task. SER can be regarded as a static or dynamic classification problem, which has motivated two popular formulations of the task in the literature (Ververidis & Kotropoulos, 2006): turn-based processing (also known as static modeling), which aims to recognize emotions from a complete utterance; or frame-based processing (also known as dynamic modeling), which aims to recognize emotions at the frame level. In either formulation, SER can be employed in stand-alone applications, e.g. emotion monitoring, or integrated into other systems for emotional awareness, e.g. integrating SER into Automatic Speech Recognition (ASR) to improve its capability in dealing with emotional speech (Cowie et al., 2001; Fayek, Lech, & Cavedon, 2016b; Fernandez, 2004). Frame-based processing is more robust since it does not rely on segmenting the input speech into utterances and can model intra-utterance emotion dynamics (Arias, Busso, & Yoma, 2013; Fayek, Lech, & Cavedon, 2015). However, empirical comparisons between frame-based processing and turn-based processing in prior work have demonstrated the superiority of the latter (Schuller, Vlasenko, Eyben, Rigoll, & Wendemuth, 2009; Vlasenko, Schuller, Wendemuth, & Rigoll, 2007).

∗ Corresponding author. E-mail addresses: haytham.fayek@ieee.org (H.M. Fayek), margaret.lech@rmit.edu.au (M. Lech), lawrence.cavedon@rmit.edu.au (L. Cavedon).
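The frame-based formulation described above can be illustrated with a minimal sketch: a classifier emits an emotion posterior for each speech frame, and an utterance-level decision is obtained by aggregating the frame posteriors. Everything here is an assumption for illustration, not the paper's method: the linear scorer with random weights stands in for a trained network, mean pooling stands in for whatever aggregation the authors use, and the four-class label set and 40-dimensional features are hypothetical.

```python
import numpy as np

# Hypothetical 4-class setup; the actual label set depends on the evaluation protocol.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_posteriors(frames, W, b):
    """Per-frame emotion posteriors from a linear scorer (placeholder for a trained network)."""
    return softmax(frames @ W + b)  # shape: (n_frames, n_classes)

def utterance_label(frames, W, b):
    """Frame-based decision: mean-pool frame posteriors, then take the argmax class."""
    p = frame_posteriors(frames, W, b).mean(axis=0)
    return EMOTIONS[int(np.argmax(p))]

# Toy utterance: 120 frames of 40-dimensional acoustic features (assumed dimensions).
rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 40))
W = rng.standard_normal((40, len(EMOTIONS)))
b = np.zeros(len(EMOTIONS))

print(utterance_label(frames, W, b))
```

Note how this contrasts with turn-based processing, where a single fixed-length representation of the whole utterance would be classified in one step; operating per frame is what lets the model capture intra-utterance emotion dynamics.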