Free English article on an N-Gram system based on a spellchecking service – IEEE 2019

 

Article details
Article title: Dynamic N-Gram System Based on an Online Croatian Spellchecking Service
Publication year: 2019
Number of pages (English article): 8
Download cost: The English article is free to download.
Publisher database: IEEE
Article type: Research Article
Base article: No
Indexing: Scopus – Master Journals List – JCR
Article category: ISI
Format of the English article: PDF
Impact factor (IF): 4.641 (2018)
H-index: 56 (2019)
SJR: 0.609 (2018)
ISSN: 2169-3536
Quartile: Q2 (2018)
Conceptual model: None
Questionnaire: None
Variables: None
References: Yes
Related fields: Computer Engineering
Related specializations: Computer Systems Architecture
Presentation type: Journal
Journal / conference: IEEE Access
University: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb 10000, Croatia
Keywords: Croatian language, Heaps’ law, language modeling, lexical n-gram, n-gram system comparison
DOI: https://doi.org/10.1109/ACCESS.2019.2947898
Product code: E13871
Translation status: No ready-made translation of this article is available.

 

Table of contents:
Abstract
I. Introduction
II. Conventionally Created Croatian Corpora
III. About the Spellchecker
IV. Croatian N-Gram System Characteristics
V. Heaps’ Law Applied to Croatian N-Grams
Authors
Figures
References

 

Excerpt from the article:
Abstract

As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world’s largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps’ law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.
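
The abstract mentions modeling n-gram count behavior through Heaps’ law. In its usual form, Heaps’ law says that the number of distinct items V grows with the total number of items seen N roughly as V(N) = K · N^β. The sketch below is an illustration of that general form only, not the paper's procedure: it fits K and β by least squares on log-transformed data. The function name `fit_heaps` and the synthetic measurements are assumptions made for this example, and the excerpt does not state which exact formulation or fitting method the authors used.

```python
import math


def fit_heaps(total_counts, distinct_counts):
    """Fit V = K * N**beta by least squares on log V = log K + beta * log N.

    total_counts    -- N values (total items seen at each snapshot)
    distinct_counts -- V values (distinct items seen at each snapshot)
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n in total_counts]
    ys = [math.log(v) for v in distinct_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope and intercept of the ordinary least-squares line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic snapshots generated from known parameters (not data from the paper).
K_TRUE, BETA_TRUE = 10.0, 0.5
sizes = [10**3, 10**4, 10**5, 10**6]
distinct = [K_TRUE * n ** BETA_TRUE for n in sizes]

k_est, beta_est = fit_heaps(sizes, distinct)
print(f"K ~ {k_est:.3f}, beta ~ {beta_est:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the parameters they were generated from. Real snapshots of a growing n-gram system would scatter around the fitted line.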

Introduction

Lexical n-grams are nowadays an important data infrastructure in many areas of natural language processing (NLP), machine learning, text analytics, and data mining [1]. Many technologies take advantage of large-scale language models based on huge n-gram systems derived from gigantic corpora. “More words and less linguistic annotation” is a trend well expressed in [2]. The trend is strictly followed in the research presented here. Besides English [3], structured big data are the privilege of a dozen languages most advanced in NLP, those treated in the Google n-gram project [4]–[6]. Abundant linguistic data collection is a prerequisite for large-scale language modeling, but in many cases, it is hardly a feasible step in the machine processing of minority languages such as Croatian, which belongs to the subfamily of South Slavic languages and has approximately 4.5 million users, or less than 0.1% of the world’s population. It is clear that an enormous English or Chinese text corpus cannot be comparable in size with a Croatian one due to differences in the numbers of language users. However, statistical machine translation or speech recognition asks for language models of comparable size in order to produce the desired effectiveness. This means the n-gram system, from which language models are derived, in a minority language must be enriched to approximately the size of n-gram systems for world major languages.
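
To make the terms concrete, the following sketch counts lexical n-grams in a toy text and contrasts the two filtering styles named in the abstract: keeping an n-gram only if every word is in a dictionary, versus keeping it only if its count reaches a cutoff. It is a minimal illustration under stated assumptions. The helper names, the whitespace tokenization, the tiny word list, and the sample sentences are all hypothetical, and the excerpt does not describe how Hascheck or the Google n-gram project actually implement their filters.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences, n):
    """Count n-grams over sentences, using naive lowercase whitespace tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def dictionary_filter(counts, dictionary):
    """Keep n-grams whose words are all in the dictionary (assumed criterion)."""
    return {g: c for g, c in counts.items() if all(w in dictionary for w in g)}


def cutoff_filter(counts, min_count):
    """Keep n-grams seen at least min_count times."""
    return {g: c for g, c in counts.items() if c >= min_count}


# Toy input: "dann" stands in for a misspelling that is absent from the word list.
sentences = ["dobar dan svima", "dobar dan", "dobar dann"]
dictionary = {"dobar", "dan", "svima"}

bigrams = count_ngrams(sentences, 2)
kept_by_dictionary = dictionary_filter(bigrams, dictionary)
kept_by_cutoff = cutoff_filter(bigrams, 2)

print(kept_by_dictionary)
print(kept_by_cutoff)
```

In this toy case the dictionary filter keeps the rare but well-formed bigram ("dan", "svima") and drops the misspelled one, while a cutoff of 2 discards both because each occurs only once. That difference is one plausible reason a dictionary criterion could suit a language with comparatively little text.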
