Modern information retrieval research has evolved a standard workflow
that involves first indexing a document collection and then running
ad hoc queries sequentially to evaluate retrieval effectiveness
using standard test collections. This paper explores how aspects of
this workflow might change in a MapReduce cluster-based environment.
First, we present and evaluate two algorithms for inverted indexing
that take advantage of the programming model's sorting mechanism to
different extents. The running times of both algorithms scale
linearly in terms of collection size up to 102 million web pages.
Second, we show that it is possible to efficiently perform batch query
evaluation with MapReduce by scanning all postings lists in parallel,
as opposed to sequentially accessing each postings list. Third, we
explore an approach that forgoes inverted indexing altogether and
simply computes all query-document scores from document vectors
themselves. Experimental results challenge us to think differently
about previous assumptions in information retrieval, and show that
brute force approaches are surprisingly compelling under certain
circumstances: parallel scan of postings can effectively take
advantage of large clusters and parallel scan of documents fits
naturally with ranking functions that use document-level features.
The HCIL has a long, rich history of transforming the experience people have with new technologies. From understanding user needs, to developing and evaluating these technologies, the lab's faculty, staff, and students have been leading the way in HCI research and teaching.