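Personal Information
Professor
Doctoral Supervisor
Master's Supervisor
Gender: Male
Alma mater: Northeastern University (China)
Degree: Doctorate
Affiliation: School of Control Science and Engineering
Discipline: Applied Mathematics; Control Theory and Control Engineering
Office: Innovation Park Building, A0620
Contact: Phone: (+86-411) 84726020 (home), (+86-411) 84709380 (office); Fax: (+86-411) 84707579; Mobile: (+86-411) 13130042458
Email: xdliuros@dlut.edu.cn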
A parallel C4.5 decision tree algorithm based on MapReduce
Paper type: Journal article
Publication date: 2017-04-25
Journal: CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
Indexed in: SCIE, EI
Volume: 29
Issue: 8
ISSN: 1532-0626
Keywords: C4.5; decision trees; MapReduce; parallel computing
Abstract: In supervised classification, large training datasets are very common, and decision trees are widely used. However, because of bottlenecks such as memory restrictions, time complexity, or data complexity, many supervised classifiers, including the classical C4.5 tree, cannot directly handle big data. One solution to this problem is to design a highly parallelized learning algorithm. Motivated by this, we propose a parallelized C4.5 decision tree algorithm based on MapReduce (MR-C4.5-Tree) with 2 parallelized methods to build the tree nodes. First, an information entropy-based parallelized attribute selection method (MR-A-S), which operates on several subsets, is proposed for MR-C4.5-Tree to determine the best splitting attribute and the cut points. Then, a parallel data splitting method (MR-D-S) is presented to partition the training data into subsets. Finally, we introduce the MR-C4.5-Tree learning algorithm, which grows the tree in a top-down recursive way. In addition, the depth of the constructed decision tree, the number of samples, and the maximal class probability in each tree node are used as termination conditions to avoid the over-partitioning problem. Experimental studies show the feasibility and good performance of the proposed parallelized MR-C4.5-Tree algorithm.
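
The abstract gives no pseudocode, so the following is only a minimal single-process sketch of the general idea behind entropy-based attribute selection over several data subsets. It is not the authors' MR-A-S or MR-D-S implementation. All names (`map_counts`, `reduce_counts`, `best_split`, `should_stop`) and the default thresholds are hypothetical, attributes are assumed to be numeric with binary cut points at midpoints, and candidates are ranked by gain ratio as a simplification. Plain `map` and `reduce` stand in for a real MapReduce framework.

```python
from collections import Counter
from functools import reduce
from math import log2


def entropy(class_counts):
    """Shannon entropy (bits) of a Counter mapping class label -> count."""
    total = sum(class_counts.values())
    return -sum(c / total * log2(c / total) for c in class_counts.values() if c)


def map_counts(partition):
    """Map step: count (attribute index, value, label) triples in one subset."""
    counts = Counter()
    for features, label in partition:
        for j, value in enumerate(features):
            counts[(j, value, label)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: merge the partial counts of two subsets."""
    return a + b


def best_split(partitions):
    """Return (gain ratio, attribute index, cut point) of the best candidate."""
    totals = reduce(reduce_counts, map(map_counts, partitions), Counter())
    # Class distribution of the node, recovered from the counts of attribute 0.
    node = Counter()
    for (j, _, label), c in totals.items():
        if j == 0:
            node[label] += c
    n, base = sum(node.values()), entropy(node)
    best = (0.0, None, None)
    for j in sorted({key[0] for key in totals}):
        values = sorted({v for (a, v, _) in totals if a == j})
        left = Counter()  # running class counts for samples with value <= lo
        for lo, hi in zip(values, values[1:]):
            for label in node:
                left[label] += totals[(j, lo, label)]
            right = node - left
            n_left = sum(left.values())
            w = n_left / n
            gain = base - w * entropy(left) - (1 - w) * entropy(right)
            split_info = entropy(Counter({"L": n_left, "R": n - n_left}))
            ratio = gain / split_info
            if ratio > best[0]:
                best = (ratio, j, (lo + hi) / 2)
    return best


def should_stop(depth, class_counts, max_depth=10, min_samples=2, max_purity=0.95):
    """Termination test on depth, sample count, and maximal class probability."""
    n = sum(class_counts.values())
    return (depth >= max_depth or n < min_samples
            or max(class_counts.values()) / n >= max_purity)


# Two subsets of a toy training set: each record is ((x0, x1), label).
subset_a = [((1.0, 5.0), "a"), ((2.0, 1.0), "a")]
subset_b = [((3.0, 4.0), "b"), ((4.0, 2.0), "b")]
ratio, attribute, cut = best_split([subset_a, subset_b])
```

In this toy data, attribute 0 separates the two classes cleanly, so the sketch selects it with a cut point between 2.0 and 3.0. In a real deployment the map step would run on distributed data blocks and only the merged counts would reach the driver, which is what keeps the per-node memory footprint small.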