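Personal Information
Professor
Doctoral Supervisor
Master's Supervisor
Gender: Female
Alma mater: Dalian University of Technology
Degree: Doctorate
Affiliation: School of Computer Science and Technology
Disciplines: Computer Application Technology; Computer Software and Theory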
Primary content block detection from Web page clusters through entropy and semantic distance
Paper type: Conference paper
Publication date: 2008-06-18
Indexed in: EI, Scopus
Abstract: This paper proposes a new method, the ENP-DOM Tree, which extends the Document Object Model (DOM) tree by adding two properties, entropy and relativity, to some nodes. Semantic distance is used to extract the primary content accurately from pages of the same source, based on three observations: noise blocks always have high entropy within a given website; primary content blocks usually contain few link words and many text words; and useful links appear in useful content blocks and are semantically close to the page title. The proposed method identifies primary content blocks with higher precision and recall and reduces the storage requirements of search engines, resulting in smaller indexes, faster searches, and better user satisfaction. Extensive experiments compare the proposed method with existing methods; the results show that it outperforms them in both recall and precision. © 2008 IEEE.
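
The three observations in the abstract can be illustrated with a minimal Python sketch. It is not the paper's ENP-DOM Tree implementation: the abstract gives no formulas, so everything here is an assumption made for illustration. The function names (`term_entropy`, `block_entropy`, `link_word_ratio`, `title_distance`) are hypothetical, entropy is taken to be the normalized Shannon entropy of a term's distribution across a site's pages, and Jaccard distance between word sets serves only as a crude lexical stand-in for semantic distance.

```python
import math
from collections import Counter


def term_entropy(counts_per_page):
    """Normalized Shannon entropy of one term's counts across a site's pages.

    Assumed definition: close to 1 when the term is spread evenly over
    all pages, 0 when it occurs on a single page only.
    """
    total = sum(counts_per_page)
    n = len(counts_per_page)
    if total == 0 or n < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts_per_page if c > 0)
    return h / math.log(n)


def block_entropy(block_words, pages):
    """Mean term entropy over the distinct words of a block.

    `pages` is a list of word lists, one per page of the same website.
    """
    page_counts = [Counter(page) for page in pages]
    words = set(block_words)
    if not words:
        return 0.0
    return sum(term_entropy([pc[w] for pc in page_counts]) for w in words) / len(words)


def link_word_ratio(link_words, text_words):
    """Fraction of a block's words that sit inside hyperlinks."""
    total = len(link_words) + len(text_words)
    return len(link_words) / total if total else 0.0


def title_distance(link_words, title_words):
    """Jaccard distance between link words and title words.

    A lexical stand-in (assumption) for the semantic distance in the abstract.
    """
    a, b = set(link_words), set(title_words)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


# Toy site: the navigation words repeat on every page, the body words do not.
pages = [
    ["home", "news", "login", "alpha", "beta"],
    ["home", "news", "login", "gamma", "delta"],
]
nav_entropy = block_entropy(["home", "news", "login"], pages)
body_entropy = block_entropy(["alpha", "beta"], pages)
```

On this toy input the navigation block scores higher entropy than the body block, matching the first observation. Any thresholds for separating noise from primary content would have to come from the paper itself.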