What is information retrieval?

Information Retrieval (IR) research encompasses the development of algorithms, techniques and models for the retrieval of information from document repositories.

A classical problem in IR is the ad-hoc retrieval problem, in which the user enters a query describing the desired information and the system returns a list of documents, preferably ranked from most to least relevant. Broadly speaking, the solution consists of at least two steps: the first is search, and the second is assigning relevance to documents with respect to the query.
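The two steps can be sketched in a few lines of Python. This is a toy illustration only, not Compass's implementation: the function names (search, score, retrieve), the sample documents, and the term-counting relevance score are all assumptions made for the example.

```python
from collections import Counter


def search(query, docs):
    """Step 1 (search): ids of documents containing at least one query term."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())]


def score(query, text):
    """Step 2 (relevance): toy score, the number of query-term occurrences."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in set(query.lower().split()))


def retrieve(query, docs):
    """Search first, then order the hits from highest to lowest score."""
    hits = search(query, docs)
    return sorted(hits, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


# Hypothetical three-document repository.
docs = {
    "a": "ranking documents by relevance",
    "b": "cooking pasta at home",
    "c": "relevance ranking and relevance feedback",
}

print(retrieve("relevance ranking", docs))  # ['c', 'a']
```

Document "b" never reaches the scoring step because the search step filters it out; real systems replace both the matching rule and the score with far more careful models.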

When the document repository is large and heterogeneous, as is the case for the web, the result sets of the search step are often huge and unwieldy. So while the web provides a rich and diverse data set from which to obtain information, the tools currently available to users are less than perfect. Anyone who has spent considerable time doing research on the internet using popular tools such as the Google and Yahoo! search engines has experienced search operations returning too many irrelevant documents. In IR terminology this is known as the problem of precision, where precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved [1].
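As a small worked example of the definition above, the following sketch computes precision for a single result set. The document ids and the relevance judgments are invented for illustration, and returning 0.0 for an empty result set is a convention chosen here, not something the definition prescribes.

```python
def precision(retrieved, relevant):
    """Ratio of relevant documents retrieved to all documents retrieved."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical result set and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5", "d9"}

# 2 of the 5 retrieved documents are relevant.
print(precision(retrieved, relevant))  # 0.4
```

Note that "d9" is relevant but was never retrieved; it does not affect precision, which only looks at what the system returned.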

Precision, however, does not capture another quality of a result set that is also desirable to the user. The documents within the result set that are likely relevant to the user's needs are only helpful if the user can easily discover them. In IR this is encapsulated in the probability ranking principle, which can be stated as follows: "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data" [1]. Thus the second step in ad-hoc retrieval aims to place the most relevant results as early in the list as possible. A poor ranking is just as frustrating as an imprecise search. It is this second problem, that of ranking documents in the result set, that Compass addresses most directly.
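The difference between precision and ranking can be made concrete with a short sketch. The probability estimates below are invented, and the helper names (rank_by_probability, precision_at_k) are assumptions for this example rather than part of Compass. Precision over the first k results is used here only as a simple way to show the effect of ordering.

```python
def rank_by_probability(estimates):
    """Order document ids by decreasing estimated probability of relevance."""
    ordered = sorted(estimates.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ordered]


def precision_at_k(ranking, relevant, k):
    """Precision computed over only the first k documents of a ranking."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


# Hypothetical estimates of relevance for one query.
estimates = {"d1": 0.10, "d2": 0.85, "d3": 0.30, "d4": 0.05, "d5": 0.70}
relevant = {"d2", "d5"}

unranked = list(estimates)               # arbitrary order: d1, d2, d3, d4, d5
ranked = rank_by_probability(estimates)  # d2, d5, d3, d1, d4

# Same five documents, so precision over the whole set is identical (0.4),
# but the ranked list puts both relevant documents first.
print(precision_at_k(unranked, relevant, 5), precision_at_k(ranked, relevant, 5))
print(precision_at_k(unranked, relevant, 1), precision_at_k(ranked, relevant, 1))
```

Both orderings contain exactly the same documents, so a set-based measure cannot tell them apart; only the ordering determines how soon the user reaches a relevant document.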