Free English paper on improving multi-talker speech recognition – IEEE 2020


Article details
Paper title: Improving End-to-End Single-Channel Multi-Talker Speech Recognition
Publication year: 2020
Length: 10 pages
Download cost: the English paper is free to download
Publisher database: IEEE
Article type: Research Article (not suitable as a thesis base article)
Index: ISI
Format: PDF
Impact factor (IF): 3.398 (2019)
Conceptual model: none
Questionnaire: none
Variables: yes
References: yes
Related field: Computer Engineering
Related specialization: Artificial Intelligence
Publication venue: journal
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Affiliation: Shanghai Jiao Tong University, Shanghai, China
Keywords: Multi-talker mixed speech recognition, permutation invariant training, end-to-end model, knowledge distillation, curriculum learning
DOI: https://doi.org/10.1109/TASLP.2020.2988423
Product code: E15095

 

Table of contents:
Abstract

1- INTRODUCTION

2- SINGLE-CHANNEL MULTI-TALKER SPEECH RECOGNITION

3- END-TO-END MULTI-TALKER SPEECH RECOGNITION

4- EXPERIMENT

5- CONCLUSION

REFERENCES

 

Excerpt from the paper:

Abstract

Although significant progress has been made in single-talker automatic speech recognition (ASR), there is still a large performance gap between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy to recognize overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance the system performance, including speaker parallel attention, scheduled sampling, curriculum learning and knowledge distillation. More specifically, the speaker parallel attention extends the basic single shared attention module into multiple attention modules, one for each speaker, which can enhance the tracing and separation ability. Then scheduled sampling and curriculum learning are proposed to make the model better optimized. Finally, knowledge distillation transfers the knowledge from an original single-speaker model to the current multi-speaker model in the proposed end-to-end multi-talker ASR structure. Our proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 reading corpus. The experiments demonstrate that our proposed architectures can significantly improve multi-talker mixed speech recognition. The final system obtains more than 15% relative performance gains in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
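The core idea of permutation invariant training (PIT) mentioned in the abstract — scoring every assignment of model output streams to reference speakers and training against the cheapest one — can be sketched as follows. This is a minimal illustration under assumed pre-computed pairwise losses, not the paper's actual implementation:

```python
import itertools
import numpy as np

def pit_loss(pairwise_losses):
    """Permutation invariant training (PIT) objective.

    pairwise_losses[i][j] is the loss of pairing model output
    stream i with reference speaker j. PIT searches all speaker
    permutations and keeps the one with the minimum total loss,
    so the model is not penalized for emitting the speakers in
    an arbitrary order.
    """
    n = len(pairwise_losses)
    best_total, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        total = sum(pairwise_losses[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# Two output streams vs. two reference speakers: a 2x2 loss matrix.
pairwise = np.array([[0.9, 0.2],
                     [0.3, 1.1]])
loss, perm = pit_loss(pairwise)
# Pairing output 0 with speaker 1 and output 1 with speaker 0
# yields the smaller total loss (0.2 + 0.3).
```

Enumerating permutations costs O(n!) and is only practical for the small speaker counts (two or three) typical of this task; for larger n an assignment solver would be the natural substitute.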

INTRODUCTION

THANKS to the advances in deep learning, automatic speech recognition (ASR) has achieved huge progress. Deep neural network (DNN) and hidden Markov model (HMM) based hybrid systems have achieved very good performance, comparable with, or even surpassing, human performance [1]–[3]. Recently, there has been growing interest in developing end-to-end systems for speech recognition, in which multiple modules of the hybrid systems, such as the acoustic model (AM), lexicon model, and language model (LM), are folded into a single neural network model, so that they can be optimized simultaneously. Over the past few years, a variety of end-to-end (E2E) models have been proposed, and they can be mainly categorized into connectionist temporal classification (CTC) based models [4], [5], and sequence to sequence (S2S) based models [6], [7]. The combined model with both CTC and S2S [8] has also been designed to further improve the end-to-end ASR system. End-to-end systems have shown promising results in existing works [8]–[10]. On the other hand, although huge progress has been achieved in ASR, current systems mainly focus on single-talker speech, and there is still a large performance gap between single-talker and multi-talker speech recognition. Processing multi-talker mixed speech is a key problem, as such speech commonly occurs in complex real-world conditions, especially in cocktail-party scenarios [11]–[13].
