GLARE 2018

1st International Workshop on Generalization in Information Retrieval: Can We Predict Performance in New Domains?

co-located with 27th ACM International Conference on Information and Knowledge Management (CIKM 2018)
22 October 2018, Turin, Italy



Research in IR puts a strong focus on evaluation, with many past and ongoing evaluation campaigns. However, most evaluations utilize offline experiments with single queries only, while most IR applications are interactive, with multiple queries in a session. Moreover, context (e.g., time, location, access device, task) is rarely considered. Finally, the large variance of search topic difficulty makes performance prediction especially hard.

Several types of prediction may be relevant in IR. One case is that we have a system and a collection and we would like to know what happens when we move to a new collection, keeping the same kind of task. In another case, we have a system, a collection, and a kind of task, and we move to a new kind of task. A further case is when collections are fluid, and the task must be supported over changing data.

Current approaches to evaluation mean that predictability can be poor, in particular:

  • Assumptions or simplifications made for experimental purposes may be of unknown or unquantified validity; they may be implicit. Collection scale (in particular, numbers of queries) may be unrealistically small or fail to capture ordinary variability.
  • Test collections tend to be specific, and to have assumed use-cases; they are rarely as heterogeneous as ordinary search. The processes by which they are constructed may rely on hidden assumptions or properties.
  • Test environments rarely explore cases such as poorly specified queries, or the different uses of repeated queries (re-finding versus showing new material versus query exploration, for example). Characteristics such as "the space of queries from which the test cases have been sampled" may be undefined.
  • Researchers typically rely on point estimates for the performance measures, instead of giving confidence intervals. Thus, we are not even able to make a prediction about the results for another sample from the same population. A related confound is that highly correlated measures (for example, Mean Average Precision (MAP) vs normalized Discounted Cumulative Gain (nDCG)) are reported as if they were independent; while, on the other hand, measures which reflect different quality aspects (such as precision and recall) are averaged (usually with a harmonic mean), thus obscuring their explanatory power.
  • Current analysis tools are focused on sensitivity (differences between systems) rather than reliability (consistency over queries).
  • Summary statistics are used to demonstrate differences, but the differences remain unexplained. Averages are reported without analysis of changes in individual queries.
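The points above about point estimates and harmonic means can be illustrated with a minimal sketch. It assumes a percentile bootstrap over per-topic scores, which is only one of several ways to obtain a confidence interval. The function names, the score values, and the parameter defaults are hypothetical and chosen for illustration only.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-topic score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-topic average precision values for one system.
ap_per_topic = [0.12, 0.45, 0.80, 0.05, 0.33, 0.67, 0.21, 0.90, 0.08, 0.52]

point_estimate = statistics.fmean(ap_per_topic)  # the usual MAP-style number
ci_low, ci_high = bootstrap_mean_ci(ap_per_topic)
print(f"mean = {point_estimate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")

# Two hypothetical systems with different precision/recall trade-offs
# receive the same harmonic mean, which hides the difference between them.
balanced = f1(0.5, 0.5)
recall_heavy = f1(1 / 3, 1.0)
print(f"F1 balanced = {balanced:.3f}, F1 recall-heavy = {recall_heavy:.3f}")
```

With only ten topics the interval is wide, which is the kind of uncertainty a single averaged figure does not convey.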

Perhaps the most significant issue is the gap between offline and online evaluation. Correlations between system performance, user behavior, and user satisfaction are not well understood, and offline predictions of changes in user satisfaction continue to be poor because the mapping from metrics to user perceptions and experiences remains unclear.

Important Dates

Submission deadline: July 9, 2018, extended to July 16, 2018

Notification of acceptance: July 30, 2018

Camera ready: August 27, 2018

Workshop day: October 22, 2018

Conference days: October 23-26, 2018

Call for Position Papers

General areas of interest include, but are not limited to, the following topics:

  • Measures: We need a better understanding of the assumptions and user perceptions underlying different metrics, as a basis for judging the differences between methods. In particular, the current practice of concentrating on global measures should be replaced by using sets of more specialized metrics, each emphasizing certain perspectives or properties. Furthermore, the relationships between system-oriented and user-/task-oriented evaluation measures should be determined, in order to obtain an improved prediction of user satisfaction and attainment of end-user goals.
  • Performance analysis: Instead of regarding only overall performance figures, we should develop rigorous and systematic evaluation protocols focused on explaining performance differences. Failure and error analysis should aim at identifying general problems, avoiding idiosyncratic behavior associated with characteristics of systems or data under evaluation.
  • Assumptions: The assumptions underlying our algorithms, evaluation methods, datasets, tasks, and measures should be identified and explicitly formulated. Furthermore, we need strategies for determining how much we are departing from them in new cases.
  • Application features: The gap between test collections and real-world applications should be reduced. Most importantly, we need to determine the features of datasets, systems, contexts, and tasks that affect the performance of a system.
  • Performance Models: We need to develop models of performance which describe how application features and assumptions affect the system performance in terms of the chosen measure, in order to leverage them for prediction of performance.

Papers should be formatted according to the ACM SIG Proceedings Template.

In addition to research papers (4-6 pages), we solicit short (1-page) position papers from interested participants.

Papers will be reviewed by members of the program committee through double-blind peer review, i.e., submissions must be anonymized. Selection will be based on originality, clarity, and technical quality. Papers should be submitted in PDF format to the following address:

Accepted papers will be published online as a volume of the CEUR-WS proceedings series.


Ian Soboroff, National Institute of Standards and Technology (NIST), USA

Nicola Ferro, University of Padua, Italy

Norbert Fuhr, University of Duisburg-Essen, Germany

Program Committee

  • Javed A. Aslam, Northeastern University, USA
  • Ben Carterette, University of Delaware, USA
  • Eric Gaussier, University Grenoble Alps, France
  • Julio Gonzalo, UNED, Spain
  • Gregory Grefenstette, INRIA Saclay -- Ile-de-France, France
  • Diane Kelly, University of Tennessee, USA
  • Joseph A. Konstan, University of Minnesota, USA
  • Claudio Lucchese, Ca' Foscari University of Venice, Italy
  • Maria Maistro, University of Padua, Italy
  • Josiane Mothe, University of Toulouse, France
  • Jian-Yun Nie, Université de Montréal, Canada
  • Raffaele Perego, ISTI CNR Pisa, Italy
  • Gianmaria Silvello, University of Padua, Italy
  • Ellen Voorhees, National Institute of Standards and Technology (NIST), USA
  • Arjen P. de Vries, Radboud University, The Netherlands
  • Justin Zobel, University of Melbourne, Australia


Keynote Talk