Primary content block detection from Web page clusters through entropy and semantic distance

Indexed by:Conference paper

Date of Publication:2008-06-18

Included Journals:EI, Scopus

Abstract:This paper proposes a new method, ENP-DOM Tree, which extends the Document Object Model (DOM) tree by adding two properties, entropy and relativity, to some nodes. Semantic distance is used to extract the primary content accurately from pages of the same source, based on three observations: noise blocks always have high entropy within a given website; primary content blocks usually consist of few link words and many text words; and useful links are contained in useful content blocks and have a close semantic distance to the page title. The proposed method identifies primary content blocks with higher precision and recall and reduces the storage requirement for search engines, resulting in smaller indexes, faster search, and better user satisfaction. Extensive experiments compare the proposed method with existing methods; the results show that it outperforms them in both recall and precision. © 2008 IEEE.
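
The first two observations in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the function names (`term_entropy`, `block_entropy`, `link_word_ratio`) are hypothetical, entropy is taken as the normalized Shannon entropy of a term's occurrence counts across the pages of one site, and a block's entropy is the mean over its terms. Semantic distance to the page title is not sketched.

```python
import math
from collections import Counter


def term_entropy(counts_per_page):
    """Normalized Shannon entropy of one term's counts across pages (assumed definition)."""
    total = sum(counts_per_page)
    n = len(counts_per_page)
    if total == 0 or n < 2:
        return 0.0
    h = 0.0
    for c in counts_per_page:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    # Divide by log(n) so the result lies in [0, 1].
    return h / math.log(n)


def block_entropy(block_terms, pages):
    """Mean term entropy of a block; `pages` is a list of token lists from one site."""
    page_counts = [Counter(page) for page in pages]
    terms = set(block_terms)
    if not terms:
        return 0.0
    return sum(
        term_entropy([pc[t] for pc in page_counts]) for t in terms
    ) / len(terms)


def link_word_ratio(link_words, text_words):
    """Fraction of a block's words that sit inside links."""
    total = link_words + text_words
    return link_words / total if total else 0.0


if __name__ == "__main__":
    # Toy site: "home" and "about" repeat on every page, the rest is page-specific.
    pages = [
        ["home", "about", "cats"],
        ["home", "about", "dogs"],
        ["home", "about", "fish"],
    ]
    print(block_entropy(["home", "about"], pages))  # navigation-like block
    print(block_entropy(["cats"], pages))           # content-like block
    print(link_word_ratio(2, 8))
```

Under these assumptions, a block whose terms recur evenly on every page scores near 1 and is a candidate noise block, while a block with page-specific terms and a low link-word ratio is a candidate primary content block. Any thresholds would need to be chosen empirically.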
