Dalian University of Technology Homepage Platform Management System -- Huang Degen -- Natural Language Processing -- Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

Publications


Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

Posted: 2019-03-12

Title: Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement
Paper type: Journal article
Journal: ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING
Indexed in: SCIE
Volume: 17
Issue: 3
ISSN: 2375-4699
Keywords: Chinese word similarity; word embedding; prior knowledge
Abstract: Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including Word Similarity (WS). However, these approaches rely on a high-quality corpus and neglect prior knowledge. Lexicon-based methods draw on the human knowledge contained in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources to extend the context corpus. In the next stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. Finally, we exploit simple combination strategies based on math operations and a counter-fitting combination strategy based on an optimization method. To demonstrate our system's effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are 0.561/0.516 in Spearman/Pearson correlation coefficients, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which indicates its transferability. In addition, our system is not language-specific and can be applied to other languages, e.g., English.
Publication date: 2018-05-01
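
The abstract above mentions combining single similarity measures with simple math operations and evaluating against human ratings with Spearman and Pearson correlation. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the choice of mean/max/min as the "math operations", and all scores are hypothetical illustrations; the abstract does not specify them. The counter-fitting strategy is not shown.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def combine(score_lists, strategy="mean"):
    """Combine per-pair scores from several measures (assumed strategies)."""
    op = {"mean": mean, "max": max, "min": min}[strategy]
    return [op(scores) for scores in zip(*score_lists)]


# Made-up scores for four word pairs; NOT taken from the paper or PKU-500.
lexicon = [0.9, 0.2, 0.6, 0.1]
statistical = [0.7, 0.4, 0.5, 0.3]
embedding = [0.8, 0.1, 0.7, 0.2]
gold = [9.0, 2.5, 6.0, 1.0]  # hypothetical human ratings

combined = combine([lexicon, statistical, embedding], strategy="mean")
print("Spearman:", spearman(combined, gold))
print("Pearson:", pearson(combined, gold))
```

With real data, each list would hold one score per word pair in the evaluation set, and the two printed values would be reported as the Spearman/Pearson pair.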

