GLARE 2018

1st International Workshop on Generalization in Information Retrieval: Can We Predict Performance in New Domains?

co-located with 27th ACM International Conference on Information and Knowledge Management (CIKM 2018)
22 October 2018, Turin, Italy



Research in IR puts a strong focus on evaluation, with many past and ongoing evaluation campaigns. However, most evaluations utilize offline experiments with single queries only, while most IR applications are interactive, with multiple queries in a session. Moreover, context (e.g., time, location, access device, task) is rarely considered. Finally, the large variance of search topic difficulty makes performance prediction especially hard.

Several types of prediction may be relevant in IR. One case is that we have a system and a collection and we would like to know what happens when we move to a new collection, keeping the same kind of task. In another case, we have a system, a collection, and a kind of task, and we move to a new kind of task. A further case is when collections are fluid, and the task must be supported over changing data.

Current approaches to evaluation mean that predictability can be poor, in particular:

  • Assumptions or simplifications made for experimental purposes may be of unknown or unquantified validity; they may be implicit. Collection scale (in particular, numbers of queries) may be unrealistically small or fail to capture ordinary variability.
  • Test collections tend to be specific, and to have assumed use-cases; they are rarely as heterogeneous as ordinary search. The processes by which they are constructed may rely on hidden assumptions or properties.
  • Test environments rarely explore cases such as poorly specified queries, or the different uses of repeated queries (re-finding versus showing new material versus query exploration, for example). Characteristics such as "the space of queries from which the test cases have been sampled" may be undefined.
  • Researchers typically rely on point estimates for the performance measures, instead of giving confidence intervals. Thus, we are not even able to make a prediction about the results for another sample from the same population. A related confound is that highly correlated measures (for example, Mean Average Precision (MAP) vs normalized Discounted Cumulative Gain (nDCG)) are reported as if they were independent; while, on the other hand, measures which reflect different quality aspects (such as precision and recall) are averaged (usually with a harmonic mean), thus obscuring their explanatory power.
  • Current analysis tools are focused on sensitivity (differences between systems) rather than reliability (consistency over queries).
  • Summary statistics are used to demonstrate differences, but the differences remain unexplained. Averages are reported without analysis of changes in individual queries.
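To make the point about point estimates versus confidence intervals concrete, here is a minimal sketch of a percentile bootstrap interval for a mean effectiveness score. It is only an illustration under stated assumptions: the per-query average precision values are hypothetical, the function name `bootstrap_ci` is invented for this example, and the percentile bootstrap is just one of several ways to obtain an interval.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores.

    Hypothetical helper for illustration; not part of any IR toolkit.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-query average precision values for one system.
ap_per_query = [0.12, 0.85, 0.40, 0.05, 0.66, 0.31, 0.92, 0.18, 0.54, 0.27]

point_estimate = statistics.fmean(ap_per_query)  # this is the MAP
low, high = bootstrap_ci(ap_per_query)
print(f"MAP = {point_estimate:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only ten queries the interval is wide, which illustrates why a bare MAP value says little about what to expect on another sample of queries from the same population.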

Perhaps the most significant issue is the gap between offline and online evaluation. Correlations between system performance, user behavior, and user satisfaction are not well understood, and offline predictions of changes in user satisfaction remain poor because the mapping from metrics to user perceptions and experiences is unclear.

Important Dates

Submission deadline: July 9, 2018, extended to July 16, 2018

Notification of acceptance: July 30, 2018, moved to August 10, 2018

Camera ready: August 27, 2018

Workshop day: October 22, 2018

Conference days: October 23-26, 2018

Call for Position Papers

General areas of interest include, but are not limited to, the following topics:

  • Measures: We need a better understanding of the assumptions and user perceptions underlying different metrics, as a basis for judging the differences between methods. In particular, the current practice of concentrating on global measures should be replaced by using sets of more specialized metrics, each emphasizing certain perspectives or properties. Furthermore, the relationships between system-oriented and user-/task-oriented evaluation measures should be determined, in order to obtain better predictions of user satisfaction and attainment of end-user goals.
  • Performance analysis: Instead of regarding only overall performance figures, we should develop rigorous and systematic evaluation protocols focused on explaining performance differences. Failure and error analysis should aim at identifying general problems, avoiding idiosyncratic behavior associated with characteristics of systems or data under evaluation.
  • Assumptions: The assumptions underlying our algorithms, evaluation methods, datasets, tasks, and measures should be identified and explicitly formulated. Furthermore, we need strategies for determining how much we are departing from them in new cases.
  • Application features: The gap between test collections and real-world applications should be reduced. Most importantly, we need to determine the features of datasets, systems, contexts, and tasks that affect the performance of a system.
  • Performance Models: We need to develop models of performance which describe how application features and assumptions affect the system performance in terms of the chosen measure, in order to leverage them for prediction of performance.

Papers should be formatted according to the ACM SIG Proceedings Template.

Beyond research papers (4-6 pages), we will solicit short (1 page) position papers from interested participants.

Papers will be peer-reviewed by members of the program committee through double-blind peer review, i.e. authors must be anonymized. Selection will be based on originality, clarity, and technical quality. Papers should be submitted in PDF format to the following address:

Accepted papers will be published online as a volume of the CEUR-WS proceedings series.


Ian Soboroff, National Institute of Standards and Technology (NIST), USA

Nicola Ferro, University of Padua, Italy

Norbert Fuhr, University of Duisburg-Essen, Germany

Program Committee

  • Javed A. Aslam, Northeastern University, USA
  • Ben Carterette, University of Delaware, USA
  • Eric Gaussier, University Grenoble Alps, France
  • Julio Gonzalo, UNED, Spain
  • Gregory Grefenstette, INRIA Saclay -- Ile-de-France, France
  • Diane Kelly, University of Tennessee, USA
  • Joseph A. Konstan, University of Minnesota, USA
  • Claudio Lucchese, Ca' Foscari University of Venice, Italy
  • Maria Maistro, University of Padua, Italy
  • Josiane Mothe, University of Toulouse, France
  • Jian-Yun Nie, Université de Montréal, Canada
  • Raffaele Perego, ISTI CNR Pisa, Italy
  • Gianmaria Silvello, University of Padua, Italy
  • Ellen Voorhees, National Institute of Standards and Technology (NIST), USA
  • Arjen P. de Vries, Radboud University, The Netherlands
  • Justin Zobel, University of Melbourne, Australia


Keynote Talk

Justin Zobel
School of Computing & Information Systems, University of Melbourne, Australia


Professor Justin Zobel is a Redmond Barry Distinguished Professor at the University of Melbourne in the School of Computing & Information Systems, and is currently the university’s Pro-Vice Chancellor (Graduate & International Research). He received his PhD from Melbourne in 1991, and worked at RMIT until he returned to Melbourne in the late 2000s, where until recently he was Head of his School. In the research community, Professor Zobel is best known for his role in the development of algorithms for efficient web search, and is also known for research on measurement, bioinformatics, and fundamental algorithms. He is the author of three texts on graduate study and research methods, and has held a range of roles in the national and international computer science community.

Proxies and Decoys: Assumptions, Approximations, and Artefacts in Measurement of Search Systems

Research in information retrieval depends on the ability to undertake repeatable, robust measurements of search systems. Over several decades, the academic community has created measurement tools and measurement practices that are now widely accepted and used. However, these tools and practices not only have known flaws and shortcomings but remain imperfectly understood. This talk examines measurement in IR from the perspective of inconsistencies between quantitative measures of performance and the qualitative goals of IR research, and considers whether some of the shortcomings in measurement and predictivity arise from assumptions made for the purpose of producing standardised metrics. These issues suggest challenges to be addressed if measurement is to continue to support research that is enduring and defensible.


09:00-09:10 - Opening and Welcome   
09:10-09:35 - Keynote Justin Zobel, University of Melbourne, Australia
09:35-09:55 - Discussion
09:55-10:40 - Position Papers

The Challenges of Moving from Web to Voice in Product Search      

Amir Ingber, Alexa Shopping, Amazon Research, Israel
Arnon Lazerson, Alexa Shopping, Amazon Research, Israel
Liane Lewin-Eytan, Alexa Shopping, Amazon Research, Israel
Alexander Libov, Alexa Shopping, Amazon Research, Israel
Eliyahu Osherovich, Alexa Shopping, Amazon Research, Israel

Offline vs. Online Evaluation in Voice Product Search      

Amir Ingber, Alexa Shopping, Amazon Research, Israel
Liane Lewin-Eytan, Alexa Shopping, Amazon Research, Israel
Alexander Libov, Alexa Shopping, Amazon Research, Israel
Yoelle Maarek, Alexa Shopping, Amazon Research, Israel
Eliyahu Osherovich, Alexa Shopping, Amazon Research, Israel

Causality, prediction and improvements that (don’t) add up      

Norbert Fuhr, University of Duisburg-Essen, Germany

10:40-11:10 - Coffee Break
11:10-11:50 - Long Papers

Towards a Basic Principle for Ranking Effectiveness Prediction without Human Assessments: A Preliminary Study      

Enrique Amigó, UNED, Spain
Stefano Mizzaro, University of Udine, Italy
Damiano Spina, RMIT University, Australia

Novel Query Performance Predictors and their Correlations for Medical Applications      

Mohammad Bahrani, Queen Mary University of London, UK
Thomas Roelleke, Queen Mary University of London, UK

11:50-12:20 - Looking ahead

Report on the Dagstuhl Perspectives Workshop 17442 - Towards Cross-Domain Performance Modeling and Prediction: IR/RecSys/NLP   

Nicola Ferro, University of Padua, Italy


12:20-12:30 - Wrap Up and Closing
12:30-13:30 - Lunch