
Primary content block detection from Web page clusters through entropy and semantic distance

Release Time: 2019-03-11

Indexed by: Conference Paper

Date of Publication: 2008-06-18

Included Journals: Scopus, EI

Abstract: This paper proposes a new method, the ENP-DOM Tree, which extends the Document Object Model (DOM) tree by adding two properties, entropy and relativity, to some nodes. Semantic distance is used to extract the primary content accurately from pages of the same source, based on three observations: noise blocks always have high entropy within a given website; primary content blocks usually contain few link words and many text words; and useful links are contained in useful content blocks and have a close semantic distance to the page title. The proposed method identifies primary content blocks with higher precision and recall and reduces the storage requirement for search engines, resulting in smaller indexes, faster search, and better user satisfaction. Extensive experiments compare the proposed method with existing methods. The results show that it outperforms them in both recall and precision. © 2008 IEEE.
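
The three observations in the abstract can be illustrated with a few generic measures. The sketch below is not the paper's ENP-DOM algorithm: the abstract does not give the exact definitions of entropy, relativity, or semantic distance, so the function names, the use of Shannon entropy over token frequencies, and the Jaccard distance used as a stand-in for semantic distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of a token frequency distribution.

    Assumption: a generic entropy measure; how the paper assigns
    entropy to DOM nodes is not stated in the abstract.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def link_word_ratio(link_words, text_words):
    """Fraction of a block's words that appear inside links.

    A low value matches the observation that primary content blocks
    contain few link words and many text words.
    """
    total = len(link_words) + len(text_words)
    return len(link_words) / total if total else 0.0


def title_distance(words, title_words):
    """Jaccard distance between two word sets.

    Assumption: a simple lexical stand-in for the paper's semantic
    distance between a link and the page title (0.0 = identical sets).
    """
    a, b = set(words), set(title_words)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

For example, a hypothetical block with 2 link words and 6 text words has a link-word ratio of 0.25, and a link whose words all appear in the page title has a title distance of 0.0 under this stand-in measure.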
