CLEF eHealth 2019 – Task 2: Technology Assisted Reviews in Empirical Medicine


Evidence-based medicine has become an important strategy in health care and policy making. To practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles, which summarise all published evidence regarding a certain topic (e.g., a treatment or diagnostic test). To write a systematic review, researchers have to conduct a search that retrieves all relevant documents. This is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem. With medical libraries expanding rapidly, automating this process is becoming increasingly important.

The goal of this lab is to bring together academic, commercial, and government researchers who will conduct experiments and share results for a high-recall task that specialises in the medical domain, and to release a reusable test collection that can be used as a reference for comparing different retrieval approaches in the field of clinical systematic reviews.

To date, systematic reviews are conducted in multiple stages:

  1. Boolean Search: In the first stage, experts build a Boolean query expressing what constitutes relevant information. The query is then submitted to a medical database containing titles and abstracts of medical studies. The result is a set, A, of potentially interesting studies.
  2. Title and Abstract Screening: In the second stage, experts screen the titles and abstracts of the returned set and decide which of them hold potential value for their systematic review; these form a set D. Screening an abstract has a cost Ca. Therefore, screening all |A| abstracts has a cost of Ca*|A|.
  3. Document Screening: In the third stage, experts download the full text of the potentially relevant abstracts, D, identified in the previous stage and screen the content to decide whether these documents are indeed relevant. Screening a document has a cost Cd > Ca. The result of this second screening is the set of references to be included in the systematic review.
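The cost model implied by these stages can be sketched in a few lines. This is only an illustration, not part of the lab's tooling: the function name `screening_cost` and the example numbers are assumptions, and the full-text term Cd*|D| is inferred from the per-document cost given in stage 3.

```python
def screening_cost(num_abstracts, num_full_texts, cost_abstract, cost_document):
    """Total manual screening cost, Ca*|A| + Cd*|D|."""
    if cost_document <= cost_abstract:
        # The task description states Cd > Ca.
        raise ValueError("expected cost_document > cost_abstract")
    return cost_abstract * num_abstracts + cost_document * num_full_texts


# Hypothetical numbers: 2,000 abstracts returned by the Boolean query,
# 150 of them kept for full-text screening.
total = screening_cost(2000, 150, cost_abstract=1.0, cost_document=5.0)
print(total)  # 2750.0
```

Under this model, any method that lets reviewers stop before screening all |A| abstracts reduces the first term directly.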

Lab Overview

In previous years this lab focused exclusively on Diagnostic Test Accuracy (DTA) reviews. This year we will be increasing the complexity of the challenge and creating a new test collection that consists of a mixture of topics: DTA reviews and Intervention Reviews.

The task will focus on the first and second stages of the process, i.e. 1. Boolean Search and 2. Title and Abstract Screening. The lab will focus on the results obtained through one publicly available bibliographic database, PubMed, as the content of this database is free for everyone to use, unlike some of the subscription-based databases. At its core, the PubMed database has the annotated MEDLINE database from the National Library of Medicine. The database contains bibliographic data (title, abstract, indexing terms, author-assigned keywords, affiliation, etc.), although not all records carry the same level of detail: some records have a title, abstract, and indexing terms, whilst others might contain only the title and no abstract or indexing terms (more information is available in the PubMed tutorial).

This year the evaluation of the task will be more nuanced. Each review includes studies of varying quality in terms of sample size and risk of bias. Missing a small-sample study with a high risk of bias is less important than missing a small-sample study with a low risk of bias. Thus the “gain” attributed to included relevant documents will be weighted by their potential impact on the findings, so runs that miss important studies will be penalized more than those that miss unimportant ones.


  • Training set release: March 29th, 2019
  • Test set release: May 14th, 2019
  • Participant result submission: May 21st, 2019
  • Results returned back to participants: May 24th, 2019
  • Participants’ working notes papers submitted [CEUR-WS]: May 31st, 2019
  • Notification of Acceptance Participant Papers [CEUR-WS]: June 14th, 2019
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: mid July
  • CLEFeHealth2019 one-day lab session: Sept 2019


Task Description

Sub-Task 1: No Boolean Search

Prior to constructing a Boolean query, researchers have to design and write a search protocol that defines, in writing and in detail, what constitutes a relevant study for their review. In this experimental task of the TAR lab, participants will be provided with the relevant pieces of a protocol, in an attempt to complete the search effectively and efficiently while bypassing the construction of the Boolean query.


For each topic participants will be provided with

  1. Topic-ID
  2. The title of the review, written by Cochrane experts;
  3. Parts of the protocol, including the Objective of the review, but also other information such as the Target Condition, etc.
  4. The entire PubMed database (it can be downloaded directly from PubMed).

Data Set:

Participants will be provided with a test set of (a) 8 DTA reviews, (b) 20 Intervention reviews, (c) 1 Prognosis review, and (d) 2 Qualitative reviews.

Important: The data (training and test data) are available in the GitHub repository.

Sub-Task 2: Abstract and Title Screening

Given the results of the Boolean Search from stage 1 as the starting point, participants will be asked to rank the set of abstracts (A). The task has two goals: (i) to produce an efficient ordering of the documents, such that all of the relevant abstracts are retrieved as early as possible, and (ii) to identify a subset of A that contains all, or as many as possible, of the relevant abstracts for the least effort (i.e. the total number of abstracts to be assessed).


For each topic participants will be provided with

  1. Topic-ID
  2. The title of the review, written by Cochrane experts;
  3. The Boolean query manually constructed by Cochrane experts;
  4. The set of PubMed Document Identifiers (PIDs) returned by running the query in MEDLINE.

Data Set:

Participants will be provided with a test set of (a) 8 DTA reviews, (b) 20 Intervention reviews, (c) 1 Prognosis review, and (d) 2 Qualitative reviews. These are the same as in Sub-Task 1.

Important: The data (training and test data) are available in the GitHub repository.

Submission Guidelines

For both tasks the participants are asked to submit:

A ranking and a threshold: Automatic or manual methods (including iterative methods) will need to rank all abstracts in set A. The goal will be to retrieve relevant items as early in the ranking as possible. Automatic or manual methods (including iterative methods) will yield a rank threshold.

Important Note:

As in previous years, the test data also include the QRELs (i.e. the relevance judgements for the articles) so that participants can build interactive systems. QRELs come in two levels: relevance at the level of the abstract (what passes abstract screening) and relevance at the level of the full article content (what is ultimately included in the review).

Those who make use of the QRELs must ensure that the relevance of a document is requested only AFTER the document has been placed in the ranked list of results. That is, the ranked list should be built interactively: first place a document in the ranked list, then request its relevance, then place the next document in the ranked list, and so on.
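The protocol described here, rank first and look up relevance afterwards, can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not an official reference implementation: `interactive_ranking`, the `score` callback, and the toy data are all hypothetical names introduced for the example.

```python
def interactive_ranking(candidates, score, qrels):
    """Build a ranked list one document at a time.

    candidates: iterable of PIDs to rank
    score(pid, feedback): hypothetical model hook returning a relevance
        estimate; `feedback` maps already-ranked PIDs to their labels
    qrels: dict mapping PID -> 0/1 relevance label
    """
    remaining = set(candidates)
    ranked = []
    feedback = {}
    while remaining:
        # 1. Choose the next document using only labels of documents
        #    that are already in the ranked list.
        pid = max(sorted(remaining), key=lambda p: score(p, feedback))
        # 2. Commit it to the ranked list.
        ranked.append(pid)
        remaining.remove(pid)
        # 3. Only now request its relevance.
        feedback[pid] = qrels.get(pid, 0)
    return ranked


# Toy data (hypothetical PIDs and scores).
toy_qrels = {"101": 1, "102": 0, "103": 1}
toy_prior = {"101": 0.2, "102": 0.9, "103": 0.1}


def toy_score(pid, feedback):
    # A real system would re-estimate relevance from `feedback`;
    # this toy scorer just returns a fixed prior.
    return toy_prior[pid]


print(interactive_ranking(toy_qrels.keys(), toy_score, toy_qrels))
```

The key property is that `score` never sees the label of a document that has not yet been placed in the list.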


The lab will use TREC-style submissions. In TREC, a “run” is the output of a search system over ALL topics.


  • TOPIC-ID can be found in the released topics
  • THRESHOLD: 0/1, with 1 if you want to threshold on this particular rank (i.e. this is the last document to be shown). A single 1 should appear.
  • PID = PubMed Document Identifier
  • RANK = the rank of the document (in increasing order)
  • SCORE = the score of the ranking/classification algorithm
  • RUN-ID = an identifier for the submission
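As an illustration, a run could be serialised as below. The column order (TOPIC-ID, THRESHOLD, PID, RANK, SCORE, RUN-ID) is an assumption that simply follows the order of the field descriptions above, as are the whitespace separator, the function name `format_run`, and the example identifiers; the official format should be checked before submitting.

```python
def format_run(topic_id, ranked_pids, scores, threshold_rank, run_id):
    """Return one line per document for a single topic.

    Assumed column order: TOPIC-ID THRESHOLD PID RANK SCORE RUN-ID.
    threshold_rank is the 1-based rank of the last document to be shown.
    """
    lines = []
    for rank, (pid, score) in enumerate(zip(ranked_pids, scores), start=1):
        threshold = 1 if rank == threshold_rank else 0
        lines.append(f"{topic_id} {threshold} {pid} {rank} {score:.4f} {run_id}")
    return lines


# Hypothetical topic ID, PIDs, scores, and run name.
example_lines = format_run(
    "CD000001", ["111", "222", "333"], [0.9, 0.5, 0.1],
    threshold_rank=2, run_id="my_run",
)
for line in example_lines:
    print(line)
```

A complete run would concatenate these lines for all topics into one file.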

Number of Runs

  • Participants are allowed to submit as many runs as they wish
  • Participants are also encouraged to submit runs that result from their 2017 and 2018 frozen systems. Use 2017_XXX and 2018_XXX as the run name for these runs, where XXX is the name you wish to give to these runs.

Length of Runs

  • For Sub-Task 1: A maximum of 5,000 PIDs per topic.
  • For Sub-Task 2: All PIDs provided in the topic files.


The evaluation script is available at (

This year we will try to evaluate runs in two different ways: (a) similar to the previous year, runs will be evaluated on the basis of identifying the studies to be included (relevant documents), (b) different from previous years, runs will be evaluated on the basis of not only finding the studies to be included, but also finding high quality included studies before low quality included studies.

As mentioned earlier, we will focus on two characteristics of studies: sample size and risk of bias. For sample size, we will list for each included study the total number of participants in the clinical study. The authors of the included reviews also determined the risk of bias for all included studies. Risk of bias is a judgement by researchers on how likely the methodology and conduct of a study are to bias the end result. For the DTA reviews this was done using the QUADAS-2 tool, while for intervention reviews the Cochrane risk of bias tool was used. As both tools consist of multiple risk-of-bias domains, we will score a study as high risk of bias if one or more domains score high risk of bias.

A 5-point scale will be used in which the risk-of-bias score is combined with the relative sample size (the size of a sample compared to the median sample size; the median is preferred over the average because the numbers are often low). Gain will then be defined as 0 for irrelevant documents and 1 + n for relevant documents, where n is, for example, equal to:

0 = small study, high risk of bias

1 = large study, high risk of bias

2 = small/large study, unknown risk of bias

3 = small study, low risk of bias

4 = large study, low risk of bias
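The example scale above can be written out as a small function. This is a sketch only: the function name `gain`, the string labels for risk of bias, and the rule that "large" means strictly above the median sample size are assumptions, since the description gives the scale only as an example.

```python
def gain(relevant, sample_size=None, median_size=None, risk_of_bias="unknown"):
    """Gain of a document: 0 if irrelevant, otherwise 1 + n (n in 0..4)."""
    if not relevant:
        return 0
    if risk_of_bias == "unknown":
        # Sample size does not matter when the risk of bias is unknown.
        n = 2
    else:
        # Assumption: a study is "large" if it exceeds the median sample size.
        large = sample_size > median_size
        n = {
            ("high", False): 0,
            ("high", True): 1,
            ("low", False): 3,
            ("low", True): 4,
        }[(risk_of_bias, large)]
    return 1 + n


print(gain(True, sample_size=200, median_size=100, risk_of_bias="low"))  # 5
print(gain(True, sample_size=50, median_size=100, risk_of_bias="high"))  # 1
print(gain(False))  # 0
```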


  • Each run, for either task, will be evaluated with respect to its ability to rank relevant documents above non-relevant ones.
  • The evaluation will be performed over the entire ranking (independent of thresholding)
  • The measures that will be primarily reported will be:
    • Mean Average Precision
    • Recall vs. Number of Documents Shown
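For reference, the two ranking measures can be computed as follows. This is a minimal sketch using the standard binary-relevance definitions, not the official evaluation script; the function names and the toy ranking are assumptions.

```python
def average_precision(ranked_pids, relevant):
    """Average of precision values at the ranks of the relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, pid in enumerate(ranked_pids, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def recall_at(ranked_pids, relevant, k):
    """Fraction of relevant documents found in the first k shown."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_pids[:k])) / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over topics; run and qrels are keyed by topic."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)


# Toy example with hypothetical PIDs.
ranking = ["a", "b", "c", "d"]
rel = {"a", "c"}
print(round(average_precision(ranking, rel), 4))  # 0.8333
print(recall_at(ranking, rel, k=2))               # 0.5
```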


  • Each run, for either task, will be evaluated with respect to its ability to retrieve relevant documents with minimal effort.
  • The measures that will be primarily reported will be recall@threshold vs. number of shown documents, attempting to find a Pareto frontier.
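One way to read "Pareto frontier" here is the set of runs for which no other run shows fewer (or equally many) documents while achieving at least the same recall. The sketch below illustrates that reading; it is an assumption about how the trade-off might be analysed, not the organisers' procedure, and the function name and example numbers are hypothetical.

```python
def pareto_frontier(points):
    """Return the run IDs that are not dominated by any other run.

    points: list of (run_id, num_shown, recall) tuples, where fewer
    shown documents and higher recall at the threshold are both better.
    """
    frontier = []
    for run_id, shown, recall in points:
        dominated = any(
            s <= shown and r >= recall and (s < shown or r > recall)
            for _, s, r in points
        )
        if not dominated:
            frontier.append(run_id)
    return frontier


# Hypothetical runs: (run name, documents shown up to threshold, recall).
runs = [("A", 100, 0.80), ("B", 200, 0.95), ("C", 200, 0.70), ("D", 50, 0.50)]
print(pareto_frontier(runs))  # ['A', 'B', 'D']
```

In this toy example, run C is dominated by run A, which shows fewer documents and reaches higher recall.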

Article Quality

An effort will be made to consider not only the relevance but also the quality of the articles, taking into account indicators such as the risk of bias and the sample size of the trials reported in the studies.


Evangelos Kanoulas (University of Amsterdam)

Dan Li (University of Amsterdam)

Rene Spijker (Cochrane Netherlands; UMC, Utrecht University; AMC, University of Amsterdam)

Leif Azzopardi (University of Strathclyde)