Pan Donghua (潘东华)

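Personal Information

Associate Professor

Master's Supervisor

Gender: Male

Alma mater: Changchun Institute of Optics and Fine Mechanics

Degree: Master's

Affiliation: Institute of Systems Engineering

Email: gyise@dlut.edu.cn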

Publications

RESEARCH ON THEMATIC WORD EXTRACTION BASED ON HIGH QUALITY DATA SOURCES ON THE WEB

Paper type: Conference paper

Publication date: 2012-01-01

Indexed in: CPCI-S

Pages: 549-553

Keywords: High quality data source identification; Subject terms extraction; An improved TF-IDF algorithm

Abstract: Data source selection is one of the most important steps in domain thematic word extraction. Most previous work has focused on how to extract keywords from an existing corpus with good algorithms, while very little research has addressed how to find good data sources for collecting a text corpus. This paper studies how to use online web tools to identify high-quality data sources. Then, considering the characteristics of subject keywords, we propose an improved TF-IDF weight calculation formula for ranking keywords, and extract domain keywords from the documents by recalculating the weights of candidate words with the improved method. Finally, taking the field of Chinese herbal medicine as an example, our results show that the proposed method achieves substantially higher accuracy and a higher recall rate at much lower cost.
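
The abstract does not give the improved TF-IDF formula, so the sketch below shows only the standard TF-IDF baseline that such a method would presumably start from. The function names `tf_idf` and `top_terms`, the pre-tokenized input, and the choice to rank terms by summed weight are all illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Baseline TF-IDF only: weight = (count / doc length) * log(N / df).
    The paper's improved weighting is not reproduced here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(docs, k=3):
    """Rank candidate terms by summed TF-IDF weight (an assumed ranking rule)."""
    scores = Counter()
    for doc_weights in tf_idf(docs):
        scores.update(doc_weights)  # adds each term's weight to its running total
    return [term for term, _ in scores.most_common(k)]
```

With this baseline, a term that occurs in every document receives a weight of zero, which is one reason a domain-specific adjustment such as the one the abstract describes may be needed for subject keywords.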