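Huang Degen (黄德根)

(Professor)

Doctoral supervisor, master's supervisor
Degree: PhD
Gender: Male
Alma mater: Dalian University of Technology
Affiliation: School of Computer Science and Technology
Email: huangdg@dlut.edu.cn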

Publications

Corpus Expansion for Neural CWS on Microblog-Oriented Data with lambda-Active Learning Approach

Publication date: 2019-03-11

Paper title: Corpus Expansion for Neural CWS on Microblog-Oriented Data with lambda-Active Learning Approach
Paper type: Journal article
Journal: IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
Indexed in: SCIE, EI
Volume: E101D
Issue: 3
Pages: 778-785
ISSN: 1745-1361
Keywords: Chinese word segmentation; active learning; deep neural networks; corpus expansion
Abstract: Microblog data contains rich information of real-world events with great commercial values, so microblog-oriented natural language processing (NLP) tasks have grabbed considerable attention of researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still not satisfying. One critical reason is that the existing microblog-oriented training corpus is inadequate to train effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, due to a large amount of partially overlapped sentences in the microblogs, it is difficult to select samples with high annotation values from raw microblogs during the active learning procedure. To select samples with higher annotation values, parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation values of a sample during the active learning procedure. Experiments on the benchmark datasets of NLPCC 2015 show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. Besides, the results also demonstrate that the performances of the DNNs trained on the extended corpus are significantly improved.
Publication date: 2018-03-01