Discovering informative content blocks from Web documents

Authors: Shian-Hua Lin, Jan-Ming Ho
DOI: 10.1145/775047.775134
Citations: 147
In this paper, we propose a new approach to discover informative contents from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions a page into several content blocks according to the HTML table tags in the page. Based on the occurrence of the features (terms) in the set of pages, it calculates the entropy value of each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method to dynamically select the entropy threshold that partitions blocks into either informative or redundant. Informative content blocks are the distinguishing parts of a page, whereas redundant content blocks are the parts common to many pages. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision rates are greater than 0.956. That is, using the approach, the informative blocks (news articles) of these sites can be automatically separated from semantically redundant contents such as advertisements, banners, navigation panels, and news categories. By adopting InfoDiscoverer as the preprocessor of information retrieval and extraction applications, retrieval and extraction precision will increase, and index size and extraction complexity will decrease.
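The pipeline described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the InfoDiscoverer implementation: it assumes each page has already been split into blocks of terms, computes a feature's entropy from its normalized frequency across pages (using the number of pages as the logarithm base so values fall in [0, 1]), takes a block's entropy to be the mean of its features' entropies, and uses a fixed threshold in place of the dynamic threshold selection the paper proposes. The function names and the 0.5 threshold are hypothetical.

```python
import math
from collections import defaultdict


def feature_entropies(pages):
    """Entropy of each term over its distribution across pages.

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms. Assumption: log base = number of pages, so a term spread
    evenly over all pages scores about 1 and a term on one page scores 0.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to compare")
    counts = defaultdict(lambda: [0] * n)
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts[term][i] += 1
    entropies = {}
    for term, row in counts.items():
        total = sum(row)
        h = 0.0
        for c in row:
            if c:
                p = c / total
                h -= p * math.log(p, n)
        entropies[term] = h
    return entropies


def block_entropy(block, entropies):
    """Assumed definition: mean entropy of the block's terms."""
    if not block:
        return 1.0  # treat an empty block as redundant
    return sum(entropies[t] for t in block) / len(block)


def classify_blocks(pages, threshold=0.5):
    """True marks an informative block (entropy at or below the threshold)."""
    entropies = feature_entropies(pages)
    return [
        [block_entropy(b, entropies) <= threshold for b in page]
        for page in pages
    ]


# Toy site: a shared navigation block plus one distinct article per page.
pages = [
    [["home", "news", "sports"], ["election", "results", "announced"]],
    [["home", "news", "sports"], ["storm", "hits", "coast"]],
    [["home", "news", "sports"], ["team", "wins", "final"]],
]
print(classify_blocks(pages))  # [[False, True], [False, True], [False, True]]
```

In this toy input the navigation terms appear on every page and score an entropy near 1, so their block is marked redundant, while each article's terms appear on a single page, score 0, and are marked informative.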
Conference: Knowledge Discovery and Data Mining - KDD, pp. 588-593, 2002
    • ...Several groups have developed generic content discovery algorithms based on heuristic rules and statistical models, eg, (Lin and Ho 2002), but ready to use software may be difficult to find in the public domain...

    Nigel Collier. Uncovering text mining: A survey of current work on web-based epidemic...

    • ...Features we define are inspired by web page segmentation and template extraction studies [6, 22, 19, 24, 17, 4]. Structural features have also been proposed in a number of other studies...
    • ...Lin and Ho [22] and Gupta et al. [19] proposed algorithms to extract content blocks from HTML pages using a DOM (Document Object Model)-based approach and an information theoretic approach, respectively...

    Jangwon Seo et al. Generalized link suggestions via web site clustering

    • ...In [5] S. H. Lin et al. propose a system, InfoDiscoverer to discover informative content blocks from web documents...

    Swe Swe Nyein. Mining contents in Web page using cosine similarity

    • ...Template detection algorithms [23, 33, 21, 10, 7, 9] are a different approach to content extraction in which collections of training documents based on the same template are used to learn a common structure...

    Tim Weninger et al. CETR: content extraction via tag ratios

    • ...So content extraction from Web news page has attracted many researchers [1][2][3][4][5][6][7] recently...
    • ...Reference [7] proposed an approach to partition a Web page into several content blocks according to HTML tables, and to discover informative content blocks based on statistics on the occurrence of the features (terms) in the set of pages...

    Yan Guo et al. ECON: An Approach to Extract Content from Web News Page
