Webpage Understanding: an Integrated Approach

Authors: Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Hsiao-Wuen Hon
DOI: 10.1145/1281192.1281288

Citations: 17
Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text contents inside HTML elements is still an open problem. Since many text contents on a webpage are often text fragments and not strictly grammatical, traditional natural language processing techniques, that typically expect grammatical sentences, are no longer directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text contents on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, integration of both page structure and text content understanding leads to an integrated solution of webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
Conference: Knowledge Discovery and Data Mining (KDD), pp. 903-912, 2007
    • ...Several supervised information extraction methods such as wrappers have been proposed to handle semi-structured documents like Web pages [5, 11, 9, 14, 21]...

    Tak-Lam Wong et al. Normalizing web product attributes and discovering domain ontology wit...

    • ...Recently, several wrapper learning approaches have been proposed for automatically learning wrappers from training examples [22], [26], [28], [34]...

    Tak-Lam Wong et al. Learning to Adapt Web Information Extraction Knowledge and Discovering...

    • ...Our latest work on webpage understanding [11] introduces a joint model of the Hierarchical Conditional Random Fields (HCRFs) model and the extended Semi-Markov Conditional Random Fields (Semi-CRFs) model to leverage the page structure understanding results in free text segmentation and labeling...
    • ...We can use the HCRF algorithm [9] together with the Semi-CRF algorithm [12] to detect the object block first and then label the attributes within the block [11]...
    • ...Our latest work on webpage understanding [11] makes the first attempt toward such an integrated solution...
    • ...In [11], the Semi-CRF model is designed to handle simple text fragment segmentation, such as the segmentation between city, state, and zip code...
    • ...The combination of structure understanding and text understanding is natural [26], [27], [11]...
    • ...For example, Zhu et al. [11] described a joint model that was able to segment and label the text within the vision node...
    • ...Observing the drawback of existing models, we propose our WebNLP framework. The differences between the model in [11] and the WebNLP framework are obvious...
    • ...It makes our model perform much better than the extended Semi-CRF model in [11] with only regular expression matching features and sequential structure features...
    • ...Then, the parameter learning method for the original HCRF model [11] is taken on the extended observation...
    • ...Semi-CRF framework. We name it the Basic HCRF and extended Semi-CRF (BHS) algorithm. It is the algorithm described in [11]...

    Chunyu Yang et al. Closing the Loop in Webpage Understanding

    • ...In recent years, there has been a surge of interest in extracting structured information from Web documents and converting the extracted data into database objects [2, 16, 20, 17]...

    Xiao Li et al. Extracting structured information from user queries with semi-supervis...

    • ...Logically coherent data blocks can still lack of grammars (see [31])...

    Milos Kudelka et al. Social Aspects of Web Page Contents
