CETR: content extraction via tag ratios

CETR: content extraction via tag ratios,10.1145/1772690.1772789,Tim Weninger,William H. Hsu,Jiawei Han

CETR: content extraction via tag ratios  
BibTex | RIS | RefWorks Download
We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
Conference: World Wide Web Conference Series - WWW , pp. 971-980, 2010
Cumulative Annual
View Publication
The following links allow you to view full publications. These links are maintained by other sources not affiliated with Microsoft Academic Search.