Task 2: Consumer Health Search

The 2021 CLEF eHealth IR Task on consumer health search builds on the information retrieval tasks that have run at CLEF eHealth since its inception. 

The consumer health search task follows the standard shared-challenge paradigm in information retrieval: it provides a test collection consisting of a set of documents and a set of topics. Participants must retrieve web pages that fulfil a given patient’s personalised information need. The retrieved pages need to satisfy the following criteria: information credibility, quality, and suitability.

This year’s challenge focuses on ad hoc retrieval and features a newly created test collection consisting of:

  • NEW document collection which includes social media.
  • NEW topic set representing layperson medical queries in set domains (e.g. diabetes), with queries covering both simple and complex information needs.

Participants in the challenge will be provided with access to this new test collection. Runs submitted by participants are pooled, and manual relevance assessments are conducted.

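Participants can take part in three subtasks:

  • Subtask 1: Adhoc Information Retrieval
  • Subtask 2: Weakly-supervised Information Retrieval
  • Subtask 3: Document Credibility Assessment

Timeline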

  • March 15th: training queries and document collection release
  • April 1st: test queries release
  • April 9th: FULL document collection release + FULL topic set release
  • May 1st: participants' run submission (extended to May 8th)
  • End of May: results release and working notes submission


Participants submission website: https://easychair.org/conferences/?conf=clefehealth2021runs

Instructions for submissions to Subtasks 1 and 2

Submitted runs should follow the standard TREC run format. Fields in the run file should be separated by white space. The width of the columns is not important, but every column must be present and separated from the next by some amount of white space. Each line of a run should contain the following fields: qid Q0 docno rank score tag, where:

  • qid is the query number (or query id)
  • Q0 is the literal Q0
  • docno is the id of a document returned by your system for qid
  • rank (1-999) is the rank of this response for this qid
  • score is a system-generated indication of the quality of the response: please ensure documents are listed in decreasing score value. Ties in score will be treated as per trec_eval convention.
  • tag is the identifier for the system, also called the run id.

Example run:

151001 Q0 3a6ac7fc-b2ea-4631-9438-f58ba0dfef41 1 1.73315273652 mySystem
151001 Q0 bc3b9dda-18d2-4ad5-9a37-26cbc10a3f7f 2 1.72581054377 mySystem
151001 Q0 fc3aa605-1103-494e-be6d-bd5331e7612a 3 1.72522727817 mySystem
151001 Q0 fefda1a5-39b6-486b-b88f-0e534da574d3 4 1.72522727817 mySystem
151001 Q0 341f81da-2f47-42c8-a37c-2df312fe165c 5 1.71374426875 mySystem
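The run format described above can be produced with a few lines of code. The following is a minimal sketch, not an official tool: the function name `format_run`, the six-decimal score formatting and the in-memory `scores` dictionary are assumptions made for illustration.

```python
def format_run(qid, scored_docs, tag, max_rank=999):
    """Return TREC run lines for one query, sorted by decreasing score.

    scored_docs maps a document id to the score assigned by the system.
    max_rank follows the rank range (1-999) given in the field description.
    """
    ranked = sorted(scored_docs.items(), key=lambda item: item[1], reverse=True)
    lines = []
    for rank, (docno, score) in enumerate(ranked[:max_rank], start=1):
        # Fields: qid Q0 docno rank score tag, separated by single spaces.
        lines.append(f"{qid} Q0 {docno} {rank} {score:.6f} {tag}")
    return lines


# Hypothetical scores for two documents of query 151001.
scores = {
    "bc3b9dda-18d2-4ad5-9a37-26cbc10a3f7f": 1.72581054377,
    "3a6ac7fc-b2ea-4631-9438-f58ba0dfef41": 1.73315273652,
}
run_lines = format_run("151001", scores, "mySystem")
print("\n".join(run_lines))
```

Sorting inside the function ensures that documents are listed in decreasing score order regardless of the order in which the system produced them.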

Number of submissions per team: up to 4 submissions per subtask are allowed for each team. Note that we may not be able to pool all of a team's submissions for relevance assessment.

For all subtasks:

Participants' runs must be submitted as a ZIP file containing:

  • a team description (plain text file with a brief team description),
  • a solution/run description (plain text file with a brief description of each system included in the submission),
  • system output files (e.g., predictions or IR system runs). One file per run, named after the team, the subtask and the run id. For example, run 1 for subtask 2 of the team Beluga should be named run1_subtask2_Beluga.
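As an illustration of the packaging described above, the following sketch builds such a ZIP file with the Python standard library. Only the run file naming pattern (run1_subtask2_Beluga) comes from the instructions; the names `team_description.txt`, `run_description.txt` and `Beluga_submission.zip`, and the helper functions themselves, are assumptions.

```python
import os
import tempfile
import zipfile


def run_filename(run_number, subtask, team):
    """Name a run file after the run id, the subtask and the team."""
    return f"run{run_number}_subtask{subtask}_{team}"


def build_submission(zip_path, team_description, run_description, runs):
    """Write a submission archive; runs maps a run file name to its text."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The two description file names below are assumed, not prescribed.
        archive.writestr("team_description.txt", team_description)
        archive.writestr("run_description.txt", run_description)
        for name, content in runs.items():
            archive.writestr(name, content)
    return zip_path


# Usage example in a temporary directory, with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Beluga_submission.zip")
    build_submission(
        path,
        team_description="Team Beluga: brief description.",
        run_description="run1: brief description of the system.",
        runs={run_filename(1, 2, "Beluga"): "151001 Q0 doc-a 1 1.5 run1\n"},
    )
    with zipfile.ZipFile(path) as archive:
        members = sorted(archive.namelist())
print(members)
```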

Download the instructions as a PDF.

Subtask 1: Adhoc Information Retrieval


The purpose of this subtask is to evaluate the ability of IR systems to provide users with relevant, understandable and credible documents. As in previous years, this subtask is centered on realistic use cases.

Document collection

The document collection used is the collection newly introduced in 2018, extended with additional webpages and social media content. This collection consists of over 5 million medical webpages from selected domains acquired from the CommonCrawl and other resources.

Participants have access to 2 separate crawls of documents: web documents and social media. The crawls have to be downloaded by participants with the scripts detailed below. Alternatively, registered participants can download an indexed version of the document collection.

1. Web documents

2. Social media documents

  • Crawler code for Twitter and Reddit: GitHub
  • List of document IDs: GitHub


The topics this year are all based on realistic search scenarios. Two sets of topics will be created for the task:

  • one topic set is based on discussions with multiple sclerosis and diabetes patients; the queries are manually generated by experts from established search scenarios
  • one topic set is based on use cases from discussion forums; the queries are extracted and manually selected from Google Trends to best fit each use case

The 5 training topics and 50 test topics contain a balanced sample of the two sets described above.

The topic set can be downloaded from: GitHub

Before the queries are released, participants can use the queries from 2020.


This year's challenge is as follows: given the queries, participants must retrieve the relevant documents from the provided document collection. The task considers relevance along 3 dimensions: topical relevance, understandability and credibility.

We will evaluate:

  • the ability of systems to retrieve relevant, readable and credible documents for the topics
  • the ability of systems to retrieve all kinds of documents (web or social media)

Participants can submit multiple runs for each subtask. The evaluation measures used are NDCG@10, BPref and RBP, as well as other metrics adapted to other relevance dimensions, such as uRBP.
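To make two of these measures concrete, here is a minimal sketch of NDCG@10 and RBP for a single query. It is not the official evaluation code: the linear gain, the log2 discount, the binarisation of grades for RBP and the persistence value p = 0.8 are all assumptions, and the document ids and relevance grades are invented.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_docs, qrels, k=10):
    """NDCG@k with linear gain; unjudged documents count as non-relevant."""
    gains = [qrels.get(doc, 0) for doc in ranked_docs]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def rbp(ranked_docs, qrels, p=0.8):
    """Rank-biased precision with binary relevance and persistence p."""
    return (1 - p) * sum(
        p ** i for i, doc in enumerate(ranked_docs) if qrels.get(doc, 0) > 0
    )


# Invented relevance grades and ranking for one query.
qrels = {"d1": 2, "d2": 0, "d3": 1}
ranking = ["d2", "d1", "d3"]
print(ndcg_at_k(ranking, qrels), rbp(ranking, qrels))
```

Averaging these per-query values over all topics gives the system-level score.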

Subtask 2: Weakly-supervised Information Retrieval


This task aims to evaluate the ability of Machine Learning-based ad-hoc IR models, trained with weak supervision, to retrieve relevant documents in the health domain. For more information on weak supervision in IR, please see: Dehghani et al. “Neural ranking models with weak supervision.” Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2017.


This task provides a large set of training queries in the health domain. Specifically, we share 167k+ real-world health-related queries extracted from commercial search engine query logs and synthetic (weak) relevance scores computed with a competitive IR system. As a development/test set, the participants have to make use of the dataset provided in Subtask 1 (topical relevance assessments, document collection, and topics).


Evaluation metrics that can be used are MAP@10, MRR@10, NDCG@10 as well as other IR metrics.
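For reference, MRR@10 and MAP@10 can be sketched as follows. Definitions of MAP@k vary; this version normalises by min(number of relevant documents, k), which is an assumption rather than the organisers' definition. The query ids, document ids and judgements are invented.

```python
def reciprocal_rank_at_k(ranked_docs, relevant, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked_docs[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_docs, relevant, k=10):
    """Average precision over the top k, normalised by min(|relevant|, k)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denominator = min(len(relevant), k)
    return total / denominator if denominator else 0.0


def mean_over_queries(metric, run, qrels, k=10):
    """Average a per-query metric over all queries in the run."""
    return sum(metric(run[q], qrels.get(q, set()), k) for q in run) / len(run)


# Invented run and binary relevance judgements.
run = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": {"b", "c"}, "q2": {"x"}}
mrr_10 = mean_over_queries(reciprocal_rank_at_k, run, qrels)
map_10 = mean_over_queries(average_precision_at_k, run, qrels)
print(mrr_10, map_10)
```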

Subtask 3: Document Credibility Assessment


The purpose of this task is the automatic assessment of the credibility of information that is disseminated online, through the Web and social media. Using the dataset related to Subtask 1, and the credibility labels associated with the documents, it is possible to test approaches for classifying or ranking information with respect to its credibility, including across two distinct types of content, i.e. Web pages and social media posts.


To assess the credibility of information, two distinct scenarios can be considered. In the first scenario, the objective is to assess the credibility of documents independently of specific information needs (topics); in this case, only documents provided for Subtask 1 with their associated credibility labels should be used. In the second scenario, credibility needs to be assessed in relation to an information need; in this case, the same document could have distinct credibility assessments in relation to the distinct topics it conveys, and both topics and documents related to Subtask 1 should be used.


Depending on whether the identification of credible information is approached as a classification or a ranking problem, different measures can be used to assess the effectiveness of the proposed approaches. In the case of classification, classic Machine Learning measures such as F1-measure, AUC and Accuracy can be used. If a ranking of credible information is provided instead, the effectiveness of the results can be assessed using measures such as P@k, MAP@k and NDCG@k, as well as other IR metrics.
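The two evaluation views can be illustrated with a short sketch: accuracy and F1 for the classification case, and P@k for the ranking case. Binary credibility labels (1 = credible, 0 = not credible) are an assumption made here for simplicity, and the labels and document ids are invented.

```python
def classification_scores(gold, predicted, positive=1):
    """Accuracy, precision, recall and F1 for the positive (credible) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def precision_at_k(ranked_docs, credible, k):
    """Fraction of the top k ranked documents that are labelled credible."""
    return sum(1 for doc in ranked_docs[:k] if doc in credible) / k


# Invented gold labels, predictions and ranking.
gold = [1, 0, 1, 1, 0]
predicted = [1, 1, 0, 1, 0]
scores = classification_scores(gold, predicted)
p_at_2 = precision_at_k(["d3", "d1", "d2"], credible={"d1", "d2"}, k=2)
print(scores, p_at_2)
```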

Useful links