刘馨月

个人信息Personal Information

副教授

博士生导师

硕士生导师

性别:女

毕业院校:大连理工大学

学位:博士

所在单位:软件学院、国际信息与软件学院

学科:计算机软件与理论. 软件工程

电子邮箱:xyliu@dlut.edu.cn

扫描关注

论文成果

当前位置: 中文主页 >> 科学研究 >> 论文成果

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts

点击次数:

论文类型:期刊论文

发表时间:2018-01-01

发表刊物:IEEE ACCESS

收录刊物:SCIE

卷号:6

页面范围:43612-43621

ISSN号:2169-3536

关键字:Text mining; context modeling; natural language processing; topic model; short text

摘要:Short texts have become a kind of prevalent source of information, and discovering topical information from short text collections is valuable for many applications. Due to the length limitation, conventional topic models based on document-level word co-occurrence information often fail to distill semantically coherent topics from short text collections. On the other hand, word embeddings as a powerful tool have been successfully applied in natural language processing. Word embeddings trained on large corpus are encoded with general semantic and syntactic information of words, and hence they can be leveraged to guide topic modeling for short text collections as supplementary information for sparse co-occurrence patterns. However, word embeddings are trained on large external corpus and the encoded information is not necessarily suitable for training data set of topic models, which is ignored by most existing models. In this article, we propose a novel global and local word embedding-based topic model (GLTM) for short texts. In the GLTM, we train global word embeddings from large external corpus and employ the continuous skip-gram model with negative sampling (SGNS) to obtain local word embeddings. Utilizing both the global and local word embeddings, the GLTM can distill semantic relatedness information between words which can be further leveraged by Gibbs sampler in the inference process to strengthen semantic coherence of topics. Compared with five state-of-the-art short text topic models on four real-world short text collections, the proposed GLTM exhibits the superiority in most cases.