Article details | |
Article title | Dynamic N-Gram System Based on an Online Croatian Spellchecking Service |
Publication year | 2019 |
Number of pages (English article) | 8 pages |
Publisher database | IEEE |
Manuscript type | Research Article |
Base paper | This article is not a base paper |
Index | Scopus – Master Journals List – JCR |
Article category | ISI |
English article format | |
Impact factor (IF) | 4.641 in 2018 |
H-index | 56 in 2019 |
SJR | 0.609 in 2018 |
ISSN | 2169-3536 |
Quartile | Q2 in 2018 |
Conceptual model | None |
Questionnaire | None |
Variables | None |
References | Included |
Related fields | Computer engineering |
Related specializations | Computer systems architecture |
Publication type | Journal |
Journal / conference | IEEE Access |
University | Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb 10000, Croatia |
Keywords | Croatian language, Heaps’ law, language modeling, lexical n-gram, n-gram system comparison |
DOI | https://doi.org/10.1109/ACCESS.2019.2947898 |
Table of contents: |
Abstract
I. Introduction
II. Conventionally Created Croatian Corpora
III. About the Spellchecker
IV. Croatian N-Gram System Characteristics
V. Heaps’ Law Applied to Croatian N-Grams
References
Excerpt from the article: |
Abstract
As an infrastructure able to accelerate the development of natural language processing applications, large-scale lexical n-gram databases are at present important data systems. However, deriving such systems for world minority languages as it was done in the Google n-gram project leads to many obstacles. This paper presents an innovative approach to large-scale n-gram system creation applied to the Croatian language. Instead of using the Web as the world’s largest text repository, our process of n-gram collection relies on the Croatian online academic spellchecker Hascheck, a language service publicly available since 1993 and popular worldwide. Our n-gram filtering is based on dictionary criteria, contrary to the publicly available Google n-gram systems in which cutoff criteria were applied. After 12 years of collecting, the size of the Croatian n-gram system reached the size of the largest Google Version 1 n-gram systems. Due to reliance on a service in constant use, the Croatian n-gram system is a dynamic one. System dynamics allowed modeling of n-gram count behavior through Heaps’ law, which led to interesting results. Like many minority languages, the Croatian language suffers from a lack of sophisticated language processing systems in many application areas. The importance of a rich lexical n-gram infrastructure for rapid breakthroughs in new application areas is also exemplified in the paper.
I. Introduction
Lexical n-grams are nowadays an important data infrastructure in many areas of natural language processing (NLP), machine learning, text analytics, and data mining [1]. Many technologies take advantage of large-scale language models based on huge n-gram systems derived from gigantic corpora. “More words and less linguistic annotation” is a trend well expressed in [2]. The trend is strictly followed in the research presented here.
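The contrast the abstract draws between dictionary-based filtering and count cutoffs can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the excerpt does not spell out the paper's actual dictionary criteria, so the sketch simply keeps an n-gram when every token appears in a word list. The function names, the toy tokens, and the tiny dictionary are hypothetical.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def filter_by_cutoff(counts, min_count):
    """Cutoff criterion: keep n-grams seen at least min_count times."""
    return {g: c for g, c in counts.items() if c >= min_count}


def filter_by_dictionary(counts, dictionary):
    """Dictionary criterion (assumed form): keep n-grams whose tokens are all known words."""
    return {g: c for g, c in counts.items() if all(t in dictionary for t in g)}


# Toy data, not taken from the paper; "dobarr" stands in for a misspelling.
tokens = "dobar dan svima dobar dan dobarr dan".split()
dictionary = {"dobar", "dan", "svima"}

bigram_counts = Counter(extract_ngrams(tokens, 2))
by_cutoff = filter_by_cutoff(bigram_counts, min_count=2)
by_dictionary = filter_by_dictionary(bigram_counts, dictionary)
```

In this toy run the cutoff discards every bigram seen only once, including well-formed ones, while the dictionary filter keeps rare but valid bigrams and drops only those containing the unknown token.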
Besides English [3], structured big data are the privilege of a dozen languages most advanced in NLP, those treated in the Google n-gram project [4]–[6]. Abundant linguistic data collection is a prerequisite for large-scale language modeling, but in many cases, it is hardly a feasible step in the machine processing of minority languages such as Croatian, which belongs to the subfamily of South Slavic languages and has approximately 4.5 million users, or less than 0.1% of the world’s population. It is clear that an enormous English or Chinese text corpus cannot be comparable in size with a Croatian one due to differences in the numbers of language users. However, statistical machine translation or speech recognition asks for language models of comparable size in order to produce the desired effectiveness. This means the n-gram system, from which language models are derived, in a minority language must be enriched to approximately the size of n-gram systems for world major languages.
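The abstract mentions modeling n-gram count behavior through Heaps’ law. In its standard form the law relates the number of distinct items V to the total number of items N as V(N) = K · N^β. A minimal sketch of how such a model can be fitted is a least-squares line in log-log space. The function name `fit_heaps` and the synthetic observations below are hypothetical, and this is not necessarily the fitting procedure used in the paper.

```python
import math


def fit_heaps(total_counts, distinct_counts):
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta)."""
    xs = [math.log(n) for n in total_counts]
    ys = [math.log(v) for v in distinct_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope in log-log space
    k = math.exp(mean_y - beta * mean_x)      # intercept mapped back to K
    return k, beta


# Synthetic observations generated from assumed parameters (not paper data).
true_k, true_beta = 30.0, 0.6
totals = [10**3, 10**4, 10**5, 10**6]
distincts = [true_k * n ** true_beta for n in totals]

k, beta = fit_heaps(totals, distincts)
```

Because the synthetic points lie exactly on the assumed curve, the fit recovers the generating parameters up to floating-point error. With real snapshots of a growing n-gram system, the residuals of the same fit would show how closely the counts follow the law.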