Reconstructing Websites for the Lazy Webmaster

Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen. Computing Research Repository.

(Citations: 5)
Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, "lazy" webmasters or concerned third parties may be able to recover some of their website from the Internet Archive. Other pages may also be salvaged from commercial search engine caches. We introduce the concept of "lazy preservation": digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches). We present Warrick, a tool to automate the process of website reconstruction from the Internet Archive, Google, MSN and Yahoo. Using Warrick, we have reconstructed 24 websites of varying sizes and composition to demonstrate the feasibility and limitations of website reconstruction from the public Web infrastructure. To measure Warrick's window of opportunity, we have profiled the time required for new Web resources to enter and leave search engine caches.
Journal: Computing Research Repository (CoRR), vol. abs/cs/051, 2005
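The reconstruction process the abstract describes can be thought of as a merge over multiple web repositories: for each URI of the lost site, query each repository and keep the freshest cached copy. The sketch below illustrates that idea only; the repository contents and function names are hypothetical stand-ins, not Warrick's actual interfaces to the Internet Archive or the search engine caches.

```python
from typing import Dict, Tuple

# Hypothetical repository snapshots: URI -> (crawl timestamp, cached content).
# In Warrick these would come from the Internet Archive and the
# Google/MSN/Yahoo cache interfaces; here they are stubbed for illustration.
REPOSITORIES = {
    "internet_archive": {
        "http://example.com/": (20050101, "<html>old home page</html>"),
        "http://example.com/a.html": (20050301, "<html>page A</html>"),
    },
    "google_cache": {
        "http://example.com/": (20051001, "<html>newer home page</html>"),
    },
}

def reconstruct(uris, repositories) -> Dict[str, str]:
    """For each URI, keep the most recently crawled copy found in any repository."""
    recovered: Dict[str, Tuple[int, str]] = {}
    for repo in repositories.values():
        for uri in uris:
            if uri in repo:
                ts, content = repo[uri]
                # Prefer the newer snapshot when several repositories hold the URI.
                if uri not in recovered or ts > recovered[uri][0]:
                    recovered[uri] = (ts, content)
    return {uri: content for uri, (ts, content) in recovered.items()}

site = reconstruct(
    ["http://example.com/", "http://example.com/a.html",
     "http://example.com/missing.html"],
    REPOSITORIES,
)
```

In this toy run the home page is recovered from the (newer) search engine cache snapshot, page A only exists in the archive, and the third URI is unrecoverable, mirroring the partial reconstructions the paper reports.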
Citation Contexts
    • ...More detail about the organization of the web collections and what the pages and images looked like can be found in [20, 26]...
    • ...We present 4 of the 24 results of the aggregate reconstructions in Table 3, ordered by percent of recovered URIs. The complete results can be seen in [20]...

    Frank McCown et al. Lazy preservation: reconstructing websites by crawling the crawlers

    • ...We have built a new type of crawler, a web-repository crawler, that is used for reconstructing lost websites when backups are unavailable [34, 49]...
    • ...As far as we know, our work in [34, 49] was the first to address the possibility of reconstructing websites by crawling web archives and search engine caches...
    • ...To quantify the difference between a reconstructed website and a lost website, we use the graphs produced by the websites to classify the recovered resources as first introduced in [34]...
    • ...In [34] we used shingling [8] to measure the difference between text-based resources...
    • ...A more detailed discussion of the algorithm can be found in [34]...
    • ...To better understand how the naïve, knowledgeable and exhaustive crawling policies affect our website reconstructions, we downloaded the 24 websites from our previous paper on website reconstruction [34]...
    • ...It may be surprising to some that resources over one year in age remain in the Google cache, but as our previous experiments have demonstrated [34], Google may make resources available in their cache long after the resources have been removed from a website...
    • ...In our previous study [34] we did not find a statistical correlation between website size, Google’s PageRank and reconstruction success...
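The shingling technique mentioned in the contexts above can be illustrated briefly: each document is reduced to the set of its overlapping w-word shingles, and the resemblance of two documents is the Jaccard similarity of the two sets. The following is a minimal sketch of that idea; the whitespace tokenization and default window size are simplifications, not the exact scheme used in the papers.

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of overlapping w-word shingles of a text."""
    words = text.split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard similarity of the two shingle sets (1.0 means identical)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# A recovered resource that differs by one word scores strictly between 0 and 1.
score = resemblance("the quick brown fox", "the quick brown cat")
```

A score near 1.0 indicates the cached copy closely matches the lost original, which is how recovered resources can be classified as identical, changed, or unrecovered.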

    Frank McCown et al. Evaluation of crawling policies for a web-repository crawler

    • ...As far as we are aware, our work [25] was the first to focus on the use of SE caches for digital preservation...
    • ...We have performed a preliminary investigation into using the WI for website reconstruction in [25] and give further details in [37]...
    • ...This representation was used in our website reconstruction experiments reported in [25]...
    • ...Although we know a lot about the archived material in the IA, we know very little about the type of content that is available in the SE caches; there has been no in depth analysis that we are aware of except the exploratory research we performed in [25]...

    Frank McCown. Website Reconstruction using the Web Infrastructure (Extended Abstract)

    • ...In a previous study [7] we demonstrated that a large percentage of a web site’s known content could be reconstructed using Google, Yahoo and MSN search engine caches...

    Joan A. Smith. Integrating Preservation Functions into the Apache Web Server
