The World Wide Web provides access to a wealth of data. Collecting and maintaining such large amounts of data requires automated extraction, since many extraction tasks would otherwise be infeasible. Modern web interfaces, however, are designed primarily for human users and deliver sophisticated interactions through client-side scripting and asynchronous server communication, which makes them difficult to process automatically. To address this, we introduce OXPath, a careful extension of XPath that facilitates data extraction from the deep web. OXPath exploits XPath's familiarity and theoretical foundations. Building on these, OXPath achieves favourable evaluation complexity and optimal page buffering, storing only a constant number of pages for non-recursive queries. Further, OXPath provides a lightweight interface that is easy to use and embed. This paper outlines the motivation, theoretical framework, current implementation, and preliminary results. We conclude with proposed future work on OXPath, including an investigation of how to deploy OXPath efficiently in a highly elastic (cloud) computing framework.
Conference: World Wide Web Conference Series - WWW, pp. 409-414, 2011
