RESEARCH ON THEMATIC WORD EXTRACTION BASED ON HIGH QUALITY DATA SOURCES ON THE WEB

Indexed by: Conference paper

Date of Publication:2012-01-01

Included Journals:CPCI-S

Page Number:549-553

Key Words:High quality data source identification; Subject terms extraction; An improved TF-IDF algorithm

Abstract: Data source selection is one of the most important steps in domain thematic word extraction. Most previous work has focused on how to extract keywords from an existing corpus with good algorithms, while very little research has addressed how to find good data sources for collecting the text corpus. This paper studies how to use online web tools to identify high-quality data sources. Then, considering the characteristics of subject keywords, we propose an improved TF-IDF weight calculation formula for keyword ranking, and extract domain keywords from the documents by recalculating the weights of candidate words with the improved method. Finally, taking the Chinese herbal medicine field as an example, our results show that the proposed method achieves considerably higher accuracy and a higher recall rate at much lower cost.
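
For orientation, the sketch below shows the standard TF-IDF weighting that the abstract takes as its starting point. It is not the paper's improved formula, which this record does not reproduce. The function names `tf_idf` and `rank_terms`, the tokenized input format, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Baseline TF-IDF only: tf = count / document length,
    idf = log(N / document frequency). The improved weighting
    proposed in the paper is NOT implemented here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def rank_terms(doc_weights, top_k=5):
    """Sort candidate terms by descending weight (ties broken alphabetically)."""
    return sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["ginseng", "root", "herb"],
    ["herb", "tea"],
    ["herb", "ginseng", "ginseng"],
]
weights = tf_idf(docs)
top_terms = rank_terms(weights[0], top_k=2)
```

A term that occurs in every document (here "herb") receives weight zero under this baseline, which is one reason a domain-specific adjustment to the weighting can matter for subject keywords.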
