Free English Article on Evaluating Deep Learning Architectures for Speech Emotion Recognition, Elsevier 2017

 

Article details
Title: Evaluating deep learning architectures for Speech Emotion Recognition
Year of publication: 2017
Length of the English article: 9 pages
Download cost of the English article: Free
Publisher database: Elsevier
Article type: Research Article
Base paper: This article is a base paper
Indexing: MedLine – Scopus – Master Journal List – JCR
Article category: ISI
Format of the English article: PDF
Impact factor (IF): 8.446 in 2017
H-index: 121 in 2019
SJR: 2.359 in 2017
ISSN: 0893-6080
Quartile: Q1 in 2017
Related fields: Computer Engineering, Information Technology
Related specializations: Artificial Intelligence, Computer Networks
Publication type: Journal
Journal: Neural Networks
University: School of Engineering – RMIT University – Melbourne VIC – Australia
Keywords: Affective computing, Deep learning, Emotion recognition, Neural networks, Speech recognition
DOI: http://dx.doi.org/10.1016/j.neunet.2017.02.013
Product code: E10738

 

Table of contents:
Abstract

1- Introduction

2- Related work

3- Deep learning: An overview

4- Proposed speech emotion recognition system

5- Experimental setup

6- Experiments and results

7- Discussion

8- Conclusion

References

Excerpt from the article:

Abstract

Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models’ performances.

Introduction

In recent years, deep learning in neural networks has achieved tremendous success in various domains that led to multiple deep learning architectures emerging as effective models across numerous tasks. Feed-forward architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (ConvNets) have been particularly successful in image and video processing as well as speech recognition, while recurrent architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) RNNs have been effective in speech recognition and natural language processing (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). These architectures process and model information in different ways and have their own advantages and limitations. For instance, ConvNets are able to deal with high-dimensional inputs and learn features that are invariant to small variations and distortions (Krizhevsky, Sutskever, & Hinton, 2012), whereas LSTM-RNNs are able to deal with variable length inputs and model sequential data with long range context (Graves, 2008). In this paper, we investigate the application of end-to-end deep learning to Speech Emotion Recognition (SER) and critically explore how each of these architectures can be employed in this task.

SER can be regarded as a static or dynamic classification problem, which has motivated two popular formulations in the literature to the task (Ververidis & Kotropoulos, 2006): turn-based processing (also known as static modeling), which aims to recognize emotions from a complete utterance; or frame-based processing (also known as dynamic modeling), which aims to recognize emotions at the frame level. In either formulation, SER can be employed in stand-alone applications; e.g. emotion monitoring, or integrated into other systems for emotional awareness; e.g. integrating SER into Automatic Speech Recognition (ASR) to improve its capability in dealing with emotional speech (Cowie et al., 2001; Fayek, Lech, & Cavedon, 2016b; Fernandez, 2004). Frame-based processing is more robust since it does not rely on segmenting the input speech into utterances and can model intra-utterance emotion dynamics (Arias, Busso, & Yoma, 2013; Fayek, Lech, & Cavedon, 2015). However, empirical comparisons between frame-based processing and turn-based processing in prior work have demonstrated the superiority of the latter (Schuller, Vlasenko, Eyben, Rigoll, & Wendemuth, 2009; Vlasenko, Schuller, Wendemuth, & Rigoll, 2007).

∗ Corresponding author. E-mail addresses: haytham.fayek@ieee.org (H.M. Fayek), margaret.lech@rmit.edu.au (M. Lech), lawrence.cavedon@rmit.edu.au (L. Cavedon).
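The difference between the turn-based and frame-based formulations described in the introduction can be illustrated with a small toy sketch. This is not the authors' system: `toy_posterior` is a hypothetical stand-in for a trained classifier that simply maps signal energy to a two-class distribution, and the class names, frame length, and hop size are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def split_into_frames(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Slice a 1-D signal into fixed-length frames; a trailing partial frame is dropped."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [list(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]


def toy_posterior(segment: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a trained model (assumption, not from the paper).

    Maps the mean energy of a segment to [p(neutral), p(aroused)].
    """
    energy = sum(x * x for x in segment) / len(segment)
    p_aroused = energy / (1.0 + energy)
    return [1.0 - p_aroused, p_aroused]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def frame_based_labels(frames: Sequence[Sequence[float]]) -> List[int]:
    """Frame-based (dynamic) formulation: one decision per frame."""
    return [argmax(toy_posterior(frame)) for frame in frames]


def turn_based_label(utterance: Sequence[float]) -> int:
    """Turn-based (static) formulation: one decision for the complete utterance."""
    return argmax(toy_posterior(utterance))


if __name__ == "__main__":
    # Toy utterance: a quiet first half followed by a loud second half.
    utterance = [0.1] * 8 + [2.0] * 8
    frames = split_into_frames(utterance, frame_len=4, hop=4)
    print("frame-based labels:", frame_based_labels(frames))
    print("turn-based label:", turn_based_label(utterance))
```

In this sketch the frame-based view yields a sequence of labels that can change within the utterance, while the turn-based view collapses the same signal into a single label, which is the intra-utterance information the frame-based formulation is meant to preserve.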
