Academic
Publications
Less is more: probabilistic models for retrieving fewer relevant documents

Less is more: probabilistic models for retrieving fewer relevant documents,10.1145/1148170.1148245,Harr Chen,David R. Karger

Less is more: probabilistic models for retrieving fewer relevant documents   (Citations: 81)
BibTex | RIS | RefWorks Download
Traditionally, information retrieval systems aim to maximize thenumber of relevant documents returned to a user within some windowof the top. For that goal, the probability ranking principle, whichranks documents in decreasing order of probability of relevance, isprovably optimal. However, there are many scenarios in which thatranking does not optimize for the users information need. Oneexample is when the user would be satisfied with some limitednumber of relevant documents, rather than needing all relevantdocuments. We show that in such a scenario, an attempt to returnmany relevant documents can actually reduce the chances of findingany relevant documents.We consider a number of information retrieval metrics from theliterature, including the rank of the first relevant result, the%no metric that penalizes a system only for retrieving no relevantresults near the top, and the diversity of retrieved results whenqueries have multiple interpretations. We observe that given aprobabilistic model of relevance, it is appropriate to rank so asto directly optimize these metrics in expectation. While doing somay be computationally intractable, we show that a simple greedyoptimization algorithm that approximately optimizes the givenobjectives produces rankings for TREC queries that outperform thestandard approach based on the probability ranking principle.
Cumulative Annual
View Publication
The following links allow you to view full publications. These links are maintained by other sources not affiliated with Microsoft Academic Search.
    • ...As shown by a number of authors, the computation of collectiondependent normalization is NP-hard in the general case [4, 6, 7, 21], although a simple greedy approximation is usually acceptable in practice...
    • ...In addition, we plan to compare the cascade measures to other simple methods for evaluating novelty and diversity, including ideas by Zhai et al. [21], Chen and Karger [6], and Sakai et al. [17]...

    Charles L. A. Clarkeet al. A comparative analysis of cascade measures for novelty and diversity

    • ...Examples of such applications range from exploratory and ambiguous keywords searches (e.g., jaguar, java, windows, eclipse) [1], [2], [3], diversification of structured databases [4], [5] and user personalized results [6], to topic summarization [7], [8], [9], or even to exclude near-duplicate results from multiple resources [10]...
    • ...In [1], a Bayesian network model is used as a blind negative relevance feedback to re-rank documents, assuming that in each position in the rank order all previous documents are irrelevant to the query...

    Marcos R. Vieiraet al. On query result diversification

    • ...Chen and Karger [5] use Bayesian retrieval models and condition selection of subsequent documents by making assumptions about the relevance of the previously retrieved documents...
    • ...Metrics such as search length (SL) [7] and k-call [5], and their aggregated forms, are well suited to evaluate diversification of search systems under single document assumptions...
    • ...Under these conditions, our strategy for selecting documents is similar to the greedy approach for optimizing k-call presented by Chen and Karger [5], using Pr(Ti|D) to select the documents most likely related to Ti...

    Michael J. Welchet al. Search result diversity for informational queries

    • ...Bookstein [4] and Chen and Karger [7] both consider information retrieval in the context of ambiguous queries...

    Sayan Bhattacharyaet al. Consideration set generation in commerce search

    • ...This has resulted in a large amount of research on producing diverse rankings of documents that avoid showing multiple documents with redundant content to users (e.g., [1, 8, 13, 28, 34, 36])...

    Filip Radlinskiet al. Detecting duplicate web documents using clickthrough data

Sort by: