Dalian University of Technology Homepage Platform: 许侃 (Xu Kan), Short Text Clustering Based on Word Embeddings and EMD Distance

许侃 (Xu Kan)

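Senior Engineer

Gender: Male

Alma mater: Dalian University of Technology (大连理工大学)

Degree: Doctorate

Affiliation: School of Computer Science and Technology

Discipline: Computer Application Technology

Office: Room D0103, Innovation Park Building (创新园大厦)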

Short Text Clustering Based on Word Embeddings and EMD Distance (基于词向量和EMD距离的短文本聚类)

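Posted: 2024-09-18

Paper type: Journal article

Publication date: 2022-06-29

Journal: Journal of Shandong University (Natural Science) (山东大学学报理学版)

Volume: 52

Issue: 7

Pages: 66-72

ISSN: 1671-9352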

Abstract: Short text clustering plays an important role in data mining. Traditional short text clustering models suffer from high dimensionality, sparse data, and a lack of semantic information. To overcome the difficulties caused by sparse features, semantic ambiguity, the dynamic nature of short texts, and related factors, this paper proposes a text representation based on word embeddings and a short text clustering algorithm based on the moving distance between feature words. First, word embeddings that capture the semantics of the feature words are obtained by training the Continuous Skip-gram model on a large-scale corpus. Next, the similarity between feature words is computed using the Euclidean distance. EMD (Earth Mover's Distance) is then used to compute the similarity between short texts. Finally, the similarity between short texts is applied in the K-means clustering algorithm to cluster the short texts. Evaluation on three data sets shows that the proposed method outperforms traditional clustering algorithms.
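
The first three steps named in the abstract (word vectors, Euclidean distance between words, EMD between short texts) can be illustrated with a minimal sketch. This is not the authors' implementation. The 2-D vectors in `VECTORS` are made-up stand-ins for Skip-gram embeddings, every token is assumed to carry equal mass, and the helper names `word_distance`, `min_cost_flow`, and `emd` are hypothetical. EMD is solved here as a small min-cost flow problem using only the Python standard library. The K-means step is left out because the abstract does not say how the distances are fed into it.

```python
import math

# Made-up 2-D vectors standing in for Skip-gram embeddings (assumption).
VECTORS = {
    "cat": (0.0, 0.0),
    "dog": (1.0, 0.0),
    "car": (0.0, 3.0),
    "bus": (1.0, 3.0),
}


def word_distance(u, v):
    """Euclidean distance between the vectors of two words."""
    return math.dist(VECTORS[u], VECTORS[v])


def min_cost_flow(n_nodes, edges, source, sink, amount):
    """Send `amount` units from source to sink at minimum total cost.

    Successive shortest paths with Bellman-Ford; edges are (u, v, capacity, cost).
    """
    graph = []                      # graph[k] = [head, residual capacity, cost]
    adj = [[] for _ in range(n_nodes)]
    for u, v, cap, cost in edges:
        adj[u].append(len(graph))
        graph.append([v, cap, cost])
        adj[v].append(len(graph))   # the reverse edge sits at index k ^ 1
        graph.append([u, 0, -cost])
    total = 0.0
    while amount > 0:
        dist = [math.inf] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            updated = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for k in adj[u]:
                    v, cap, cost = graph[k]
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev_edge[v] = k
                        updated = True
            if not updated:
                break
        if dist[sink] == math.inf:
            raise ValueError("not enough capacity to route the requested flow")
        # Find the bottleneck on the shortest path, then push flow along it.
        push, v = amount, sink
        while v != source:
            k = prev_edge[v]
            push = min(push, graph[k][1])
            v = graph[k ^ 1][0]
        v = sink
        while v != source:
            k = prev_edge[v]
            graph[k][1] -= push
            graph[k ^ 1][1] += push
            v = graph[k ^ 1][0]
        total += push * dist[sink]
        amount -= push
    return total


def emd(text_a, text_b):
    """Earth Mover's Distance between two token lists with equal token mass."""
    n, m = len(text_a), len(text_b)
    if n == 0 or m == 0:
        raise ValueError("both texts must contain at least one token")
    # Integer masses avoid fractions: each token of A supplies m units and
    # each token of B absorbs n units, so both sides total n * m units.
    source, sink = n + m, n + m + 1
    edges = [(source, i, m, 0.0) for i in range(n)]
    edges += [(n + j, sink, n, 0.0) for j in range(m)]
    for i, a in enumerate(text_a):
        for j, b in enumerate(text_b):
            edges.append((i, n + j, n * m, word_distance(a, b)))
    return min_cost_flow(n + m + 2, edges, source, sink, n * m) / (n * m)


if __name__ == "__main__":
    texts = [["cat", "dog"], ["dog"], ["car", "bus"], ["bus", "car", "car"]]
    # Pairwise distance matrix that a clustering step could consume.
    for a in texts:
        print([round(emd(a, b), 3) for b in texts])
```

In this toy setup, texts about animals end up much closer to each other than to texts about vehicles even when they share no words, which is the property a semantic distance is meant to provide for sparse short texts. A production version would more likely load trained embeddings and call an optimized transport solver rather than this hand-written one.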

Note: newly added retrospective data
