Free English Article on lexiDB: A Scalable Corpus Database Management System – IEEE 2017
Article details | |
Publication year | 2017 |
Number of pages (English article) | 5 pages |
Cost | The English article is free to download. |
Published in | IEEE |
Article type | ISI |
عنوان انگلیسی مقاله | lexiDB: A Scalable Corpus Database Management System |
Persian title of the article | lexiDB: سیستم مدیریت پایگاه داده پیکره‌ای مقیاس‌پذیر |
Format of the English article | |
Related fields | Computer Engineering |
Related specialisations | Software |
Venue | International Conference on Big Data |
Product code | E7296 |
Translation status | No ready translation of this article is available; you can order one via the button below. |
Free article download | Download the free English article |
Order a translation | Order a translation of this article |
Excerpt from the article: |
I. INTRODUCTION
Corpora utilised by corpus linguists have steadily grown in scale and complexity over the last fifty years. Beginning with relatively small corpora (although they were considered large at the time) of one million words, such as Brown [5] in the 1960s, the size of corpora has been increasing by an order of magnitude roughly every ten years. In the 1990s the British National Corpus (BNC) was created with one hundred million words, and corpora of interest to linguists now run to billions of words, with Historical Hansard and Early English Books Online (EEBO) being prime examples. In parallel with this growth in the size of the raw text used in corpora, there has also been an increase in the number of levels of annotation attached to such corpora. Beginning with simple part-of-speech (POS) tagging and lemmatisation, linguists now utilise more advanced annotation such as dependency parsing, semantic tags and historical spelling variants when conducting corpus analysis. This motivates the need for retrieval software and tools that are capable of supporting annotated corpus data at this scale and complexity.

In big data terms, increasing the ‘volume’ of corpora provides greater numbers of examples for mid- to low-frequency words and linguistic features, which is important for analysis purposes. The ‘variety’ of data included within a corpus is also important for improved representativeness and coverage of the types of language being studied. In this paper, we address issues of ‘velocity’ (the application of parallel or distributed methods), consideration of which is vitally important since the current crop of corpus linguistics retrieval tools is struggling to cope with the ever increasing scale of corpora.

Typically, corpus linguists rely on five main retrieval methods in order to perform their analysis: concordances, collocations, clusters (n-grams), keyword lists and frequency lists. Whilst other, more complex forms of analysis exist, they are often built on top of one or more of these basic methods, or are subtle variations of such queries. These query types are generally not fully or efficiently supported by traditional DBMSs (Database Management Systems) or IR (Information Retrieval) systems, as shown in previous work [1]. Some systems have limited support for keyword-in-context search (concordances), but in order to support these query types fully, corpus linguists must usually rely on a tool built on top of an existing retrieval or database system.

Software that can be used locally on desktop PCs is sometimes favoured by linguists, as it allows them the flexibility to use their own corpora and to perform analysis without reliance on anything more than a laptop. Tools such as WordSmith and AntConc allow users to perform corpus queries such as concordances and to generate frequency lists. However, these tools lack support for larger, billion-word-scale corpora. Other, server-based tools exist, such as Wmatrix [8], CQPweb [3], SketchEngine, KorAP [2] and corpus.byu. Often these tools are based on Open Corpus Workbench (CWB), existing relational DBMSs such as MySQL, or text indexers such as Lucene. These systems handle corpora of larger scale better, but are limited relative to the flexibility of local tools, as linguists often cannot add their own corpora or annotation, or are restricted in the size of corpora that can be added.
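To make the five core query types mentioned in the excerpt concrete, the following is a minimal, self-contained sketch of three of them (frequency lists, clusters/n-grams and keyword-in-context concordances) over a toy tokenised corpus. This is purely illustrative: it is not lexiDB's implementation, and all function names here are hypothetical.

```python
# Illustrative sketch (not lexiDB's implementation) of three of the five
# core corpus query types: frequency lists, clusters (n-grams), and
# concordances (keyword-in-context). All function names are hypothetical.
from collections import Counter

def frequency_list(tokens):
    """Count how often each token occurs in the corpus."""
    return Counter(tokens)

def ngrams(tokens, n=2):
    """Count contiguous clusters (n-grams) of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, window=3):
    """Return keyword-in-context lines: `window` tokens either side of each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

if __name__ == "__main__":
    text = "the cat sat on the mat and the cat saw the dog".split()
    print(frequency_list(text).most_common(3))  # e.g. [('the', 4), ('cat', 2), ...]
    print(ngrams(text, 2).most_common(2))       # most frequent bigram clusters
    print("\n".join(concordance(text, "cat")))  # KWIC lines for 'cat'
```

The remaining two query types are typically derived from the same primitives: collocations by comparing the frequency of words inside a concordance window against their corpus-wide frequency, and keyword lists by comparing the frequency lists of two corpora.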