Article details | |
Article title | Label-Aware Distributed Ensemble Learning: A Simplified Distributed Classifier Training Model for Big Data |
Publication year | 2019 |
Number of pages (English article) | 12 pages |
Cost | The English article is free to download. |
Publisher | Elsevier |
Writing type | Research Article |
Base article | This is not a base article. |
Index | Scopus – Master Journals List – JCR |
Article type | ISI |
English article format | |
Impact Factor (IF) | 3.643 in 2018 |
H-index | 16 in 2019 |
SJR | 0.984 in 2018 |
ISSN | 2214-5796 |
Quartile | Q1 in 2018 |
Conceptual model | None |
Questionnaire | None |
Variables | None |
References | Included |
Related fields | Computer Engineering, Information Technology Engineering |
Related specializations | Artificial Intelligence, Algorithms and Computation Engineering, Cloud Computing |
Presentation type | Journal |
Journal | Big Data Research |
Affiliation | School of Computing, Queen’s University, Kingston, ON, Canada |
Keywords | Big Data, Analytics, Distributed, Machine learning, Classification |
DOI | https://doi.org/10.1016/j.bdr.2018.11.001 |
Product code | E11523 |
Translation status | No prepared translation of this article is available; one can be ordered via the button below. |
Free download | Download the English article for free |
Order a translation | Order a translation of this article |
Table of contents:
Abstract
1- Introduction
2- Distributed classifier training: benefits and pitfalls
3- The Label-Aware Distributed Ensemble Learning (LADEL) model
4- Evaluation
5- Conclusions and future work
References
Excerpt from the article:
Abstract

Label-Aware Distributed Ensemble Learning (LADEL) is a programming model and an associated implementation for distributing the training of any classifier to handle Big Data. It only requires users to specify the training data source, the classification algorithm and the desired parallelization level. First, a distributed stratified sampling algorithm is proposed to generate stratified samples from large, pre-partitioned datasets in a shared-nothing architecture. It executes in a single pass over the data and minimizes inter-machine communication. Second, training of the specified classification algorithm is parallelized and executed on any number of heterogeneous machines. Finally, the trained classifiers are aggregated to produce the final classifier. Data miners can use LADEL to run any classification algorithm on any distributed framework, without any experience in parallel and distributed systems. The proposed LADEL model can be implemented on any distributed framework (Drill, Spark, Hadoop, etc.) to speed up the development of its data mining capabilities. It is also generic and can be used to distribute the training of any classification algorithm from any sequential single-node data mining library (Weka, R, scikit-learn, etc.). Distributed frameworks can implement LADEL to distribute the execution of existing data mining libraries without rewriting the algorithms to run in parallel. As a proof of concept, the LADEL model is implemented on Apache Drill to distribute the training of Weka’s classification algorithms. Our empirical studies show that LADEL classifiers achieve accuracy similar to, and sometimes better than, single-node classifiers, with significantly faster training and scoring times.

Introduction

Data mining is the process of discovering hidden patterns in data and using these patterns to predict the likelihood of future events.
Several problems can be addressed using data mining, such as:

• Classification: predict the category (discrete) of a new data point.
• Regression: predict the value (continuous) of a new data point.
• Clustering: split data points into categories.
• Association Rules: find relationships between attributes.

In this work, we focus on the classification problem and ways of making it Big Data ready. Classification is a supervised learning approach consisting of two phases: (1) Training: a classifier is built using historical labeled data (i.e., data with known categories), and (2) Scoring: the trained classifier is used to predict the categories of new data points (i.e., with unknown categories). With the large volume of Big Data, classifier training time and memory requirements are a real challenge. Scalable distributed data mining libraries like Apache Mahout [1], Cloudera Oryx [2], Oxdata H2O [3], MLlib [4] [5] and Deeplearning4j [6] implement distributed versions of the classification algorithms to run on Hadoop [7] and Spark [8]. Distributing classifier training significantly reduces the training time and enables processing of Big Data. However, the approach used by scalable libraries requires rewriting the classification algorithms to execute in parallel. The rewriting process is complex and time-consuming, and the quality of the modified algorithm depends entirely on the contributors’ expertise. Thus, scalable libraries fail to support as many algorithms as sequential single-node libraries like R [9], Weka [10], scikit-learn [11] and RapidMiner [12].
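The two classification phases described above (training on labeled data, then scoring new points) can be illustrated with a minimal, stdlib-only sketch. This is not code from the paper; the 1-nearest-neighbour rule and the spam/ham labels are purely illustrative stand-ins for any learner:

```python
# Minimal illustration of the two classification phases,
# using a 1-nearest-neighbour rule on one numeric feature.

def train(labeled):
    """Training phase: build a classifier from historical labeled data.
    For 1-NN, 'building' simply means storing the labeled points."""
    return list(labeled)

def score(model, point):
    """Scoring phase: predict the category of a new, unlabeled point
    by returning the label of the closest training point."""
    nearest = min(model, key=lambda row: abs(row[0] - point))
    return nearest[1]

# Historical labeled data: (feature value, known category)
model = train([(1.0, "spam"), (2.0, "spam"), (8.0, "ham")])
print(score(model, 7.5))  # 7.5 is closest to 8.0, so prints "ham"
```

With Big Data, it is the `train` step that becomes the bottleneck in time and memory, which motivates distributing it.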
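The three LADEL steps summarized in the abstract (per-partition stratified sampling in a single pass, parallel training, aggregation into a final classifier) can be sketched as follows. This is a hedged, stdlib-only sketch, not the paper's Drill/Weka implementation: the majority-vote aggregation rule and the `train_majority_stub` learner are assumptions introduced for illustration, and real training would run concurrently across machines rather than in a Python loop:

```python
import random
from collections import Counter, defaultdict

def stratified_sample(partition, fraction, rng):
    """Step 1 (sketch): one pass over a local partition, grouping rows by
    label and keeping roughly `fraction` of each label's rows, so every
    class is represented without inter-machine communication."""
    by_label = defaultdict(list)
    for features, label in partition:  # single pass over the local data
        by_label[label].append((features, label))
    sample = []
    for rows in by_label.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

def train_majority_stub(rows):
    """Stand-in for any single-node learner (e.g. a Weka algorithm):
    always predicts the majority label seen during training."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority

def ladel_train(partitions, fraction=0.5, seed=0):
    """Steps 2-3 (sketch): train one classifier per partition sample,
    then aggregate predictions by majority vote (an assumed rule)."""
    rng = random.Random(seed)
    models = [train_majority_stub(stratified_sample(p, fraction, rng))
              for p in partitions]

    def ensemble(features):
        votes = Counter(m(features) for m in models)
        return votes.most_common(1)[0][0]
    return ensemble

# Two pre-partitioned shards of labeled data; label "a" dominates both.
partitions = [
    [((0.1,), "a"), ((0.2,), "a"), ((0.9,), "b")],
    [((0.3,), "a"), ((0.8,), "b"), ((0.4,), "a")],
]
classifier = ladel_train(partitions, fraction=1.0)
print(classifier((0.5,)))  # both stub models vote "a", so prints "a"
```

The appeal of this shape is that the per-partition learner is a black box: any sequential algorithm from Weka, R or scikit-learn can be dropped in without rewriting it for parallel execution.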