Academic
Publications
Finding near-duplicate web pages: a large-scale evaluation of algorithms

Finding near-duplicate web pages: a large-scale evaluation of algorithms,10.1145/1148170.1148222,Monika Rauch Henzinger

Finding near-duplicate web pages: a large-scale evaluation of algorithms   (Citations: 75)
BibTex | RIS | RefWorks Download
Broder et al.'s (3) shingling algorithm and Charikar's (4) ran- dom projection based approach are considered \state-of-the- art" algorithms for nding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for nding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on dier ent sites. Since Charikar's algorithm nds more near-duplicate pairs on dier ent sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al. 's algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
Cumulative Annual
View Publication
The following links allow you to view full publications. These links are maintained by other sources not affiliated with Microsoft Academic Search.
Sort by: