Hits:
Indexed by:会议论文
Date of Publication:2008-06-18
Included Journals:EI、Scopus
Abstract:A new method named ENP-DOM Tree is proposed in this paper, which extends the Document Object Module Tree by adding two properties, i.e., entropy and relativity, to some nodes. Semantic distance is used to extract the primary content accurately from the same source based on three facts: noise blocks always have high entropy property within a given website; primary content blocks are often made up of few link words and many text words; useful links are contained in a useful content blocks and have a close semantic distance with page titles. The proposed method can identify the primary content blocks with higher precision and recall rate and reduce the storage requirement for search engines; thus, result in smaller indexes, faster search time, and better user satisfaction. Extensive experiments are also conducted to evaluate the proposed method by comparison with existing methods. The experimental results show that the method outperforms existing methods with better satisfying recall rate and higher precision. © 2008 IEEE.