Huang Degen (黄德根)

(Professor)

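Doctoral Supervisor, Master's Supervisor
Degree: Doctorate
Gender: Male
Alma mater: Dalian University of Technology
Affiliation: School of Computer Science and Technology
Email: huangdg@dlut.edu.cn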

Publications

Detecting New Words from Chinese Text Using Latent Semi-CRF Models

Posted: 2019-03-09

Paper title: Detecting New Words from Chinese Text Using Latent Semi-CRF Models
Paper type: Journal article
Journal: IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
Indexed in: SCIE, EI, Scopus
Volume: E93D
Issue: 6
Pages: 1386-1393
ISSN: 0916-8532
Keywords: natural language processing; new word detection; new words POS tagging; conditional random fields; latent-dynamic CRF; semi-CRF; latent semi-CRF
Abstract: Chinese new words and their part-of-speech (POS) tags are particularly problematic in Chinese natural language processing. With the rapid development of the internet and information technology, it is impossible to build a complete system dictionary for Chinese natural language processing, because new words outside the basic system dictionary are constantly being created. A latent semi-CRF model, which combines the strengths of the LDCRF (Latent-Dynamic Conditional Random Field) and the semi-CRF, is proposed to detect new words together with their POS tags simultaneously, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike in the original semi-CRF, the LDCRF is used to generate the candidate entities for training and testing the latent semi-CRF, which speeds up training and reduces the computational cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from the N-best outputs of the LDCRF. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to those found in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in model training and testing. The experimental results show that the proposed method is capable of detecting even low-frequency new words together with their POS tags. The proposed model performs competitively with the state-of-the-art models presented.
Publication date: 2010-06-01