CLEF eHealth 2018 – Task 2: Technology Assisted Reviews in Empirical Medicine

CLEF 2018 Conference at Avignon

CLEF 2018 eHealth – Task 2, Technology Assisted Reviews in Empirical Medicine, will have a dedicated time slot of one and a half hours on Tuesday, September 11, from 14:30 to 16:00. The tentative schedule of the TAR session is:

  • 14:30 – 14:45 – Welcome and brief overview of CLEF eHealth TAR
  • 14:45 – 15:00 – Christopher Norman, LIMSI-CNRS
  • 15:00 – 15:15 – Giorgio Maria Di Nunzio, UNIPD
  • 15:15 – 15:30 – Athanasios Lagopoulos, AUTH
  • 15:30 – 15:45 – Amal H Alharbi, U Sheffield
  • 15:45 – 16:00 – The future of CLEF eHealth TAR


Evidence-based medicine has become an important strategy in health care and policy making. To practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles, which summarise all published evidence on a certain topic (e.g., a treatment or diagnostic test). To write a systematic review, researchers have to conduct a search that retrieves all relevant documents. This is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem. With medical libraries expanding rapidly, the need to automate this process is of utmost importance.

The goal of this lab is to bring together academic, commercial, and government researchers who will conduct experiments and share results for a high-recall task that specialises in the medical domain, and to release a reusable test collection that can serve as a reference for comparing different retrieval approaches in the field of medical systematic reviews.

Systematic reviews are currently conducted in multiple stages:

  1. Boolean Search: In the first stage, experts build a Boolean query expressing what constitutes relevant information. The query is then submitted to a medical database containing titles and abstracts of medical studies. The result is a set, A, of potentially interesting studies.
  2. Title and Abstract Screening: In the second stage, experts screen the titles and abstracts of the returned set and decide which of them hold potential value for their systematic review, yielding a set D. Screening an abstract has a cost Ca. Therefore, screening all |A| abstracts has a cost of Ca*|A|.
  3. Document Screening: In the third stage, experts download the full text of the potentially relevant abstracts, D, identified in the previous stage and screen the content to decide whether these documents are indeed relevant. Screening a document has a cost Cd > Ca. The result of this second screening is the set of references to be included in the systematic review.
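The cost reasoning in the stages above can be illustrated with a minimal Python sketch. The function name and the numbers are hypothetical, and charging Cd for every document in D is an assumption that extends the stated abstract-screening cost Ca*|A| to the full-text stage.

```python
def screening_cost(num_abstracts, num_full_texts, cost_abstract, cost_document):
    """Manual effort for stages 2 and 3: Ca*|A| + Cd*|D| (illustrative only)."""
    if cost_document <= cost_abstract:
        # The stage description states Cd > Ca.
        raise ValueError("expected Cd > Ca")
    abstract_effort = cost_abstract * num_abstracts      # Ca * |A|
    full_text_effort = cost_document * num_full_texts    # Cd * |D| (assumed)
    return abstract_effort + full_text_effort


# Hypothetical review: 5000 abstracts returned, 200 kept for full-text screening.
print(screening_cost(5000, 200, cost_abstract=1.0, cost_document=10.0))
```

Under these made-up costs, most of the effort still goes into abstract screening, which is the stage the lab tries to shorten.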

Lab Overview

The lab will focus on Diagnostic Test Accuracy (DTA) reviews. Search in this area is generally considered the hardest, and a breakthrough in this field would likely be applicable to other areas as well.

The task will focus on the first and second stages of the process, i.e. Boolean Search and Title and Abstract Screening.


  • Training set release: Mid-February 2018 [released]
  • Test set release: Mid-March 2018 [released]
  • Qrels for semi-automatic runs release: 1 May 2018
  • Result submission: 5 May 2018
  • Participants’ working notes papers submitted [CEUR-WS]: 31 May 2018
  • Notification of Acceptance Participant Papers [CEUR-WS]: 15 June 2018
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 29 June 2018
  • CLEFeHealth2018 one-day lab session: Sept 2018 in Avignon


Task Description

Sub-Task 1: No Boolean Search

Prior to constructing a Boolean query, researchers have to design and write a search protocol that defines, in writing and in detail, what constitutes a relevant study for their review. In this experimental task of the TAR lab, participants will be provided with the relevant pieces of a protocol, in an attempt to complete the search effectively and efficiently while bypassing the construction of the Boolean query.


For each topic, participants will be provided with:

  1. The Topic-ID;
  2. The title of the review, written by Cochrane experts;
  3. A part of the protocol: the Objective;
  4. The entire PubMed database (it can be downloaded directly from PubMed).

Data Set:

Participants will be provided with a test set consisting of 30 topics of Diagnostic Test Accuracy (DTA) reviews. If time allows, the organizers will also produce the relevant parts of the protocols for the CLEF 2017 TAR data, to be used as training material. Important: The data will be placed in the GitHub repository.

Sub-Task 2: Abstract and Title Screening

Given the results of the Boolean Search from stage 1 as the starting point, participants will be asked to rank the set of abstracts (A). The task has two goals: (i) to produce an efficient ordering of the documents, such that all of the relevant abstracts are retrieved as early as possible, and (ii) to identify a subset of A that contains all, or as many as possible, of the relevant abstracts for the least effort (i.e. the total number of abstracts to be assessed).


For each topic, participants will be provided with:

  1. The Topic-ID;
  2. The title of the review, written by Cochrane experts;
  3. The Boolean query manually constructed by Cochrane experts;
  4. The set of PubMed Document Identifiers (PIDs) returned by running the query in MEDLINE.

Data Set:

Participants will be provided with a test set consisting of 30 topics of Diagnostic Test Accuracy (DTA) reviews. Participants can use the 42 CLEF 2017 TAR topics (excluding topics that were reviewed and found unreliable) as a training set. Important: The data will be placed in the GitHub repository.

Submission Guidelines

For both tasks the participants are asked to submit:

A ranking and a threshold: Automatic or manual methods (including iterative methods) will need to rank all abstracts in set A. The goal will be to retrieve relevant items as early in the ranking as possible. Automatic or manual methods (including iterative methods) will yield a rank threshold.


The lab will use TREC-style submissions. In TREC, a “run” is the output of a search system over ALL topics.


  • TOPIC-ID can be found in the released topics
  • THRESHOLD: 0/1, with 1 if you want to threshold on this particular rank (i.e. this is the last document to be shown). A single 1 should appear.
  • PID = PubMed Document Identifier
  • RANK = the rank of the document (in increasing order)
  • SCORE = the score of the ranking/classification algorithm
  • RUN-ID = an identifier for the submission
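As a rough illustration of these fields, the following Python sketch writes the lines of a run for one topic. It assumes one whitespace-separated line per document with the fields in the order listed above; that layout, the function name, and all identifiers in the example are assumptions, not the official specification.

```python
def format_run(topic_id, ranked, threshold_rank, run_id):
    """Build run lines for one topic.

    ranked: list of (pid, score) pairs, already sorted best-first.
    threshold_rank: rank of the last document to be shown (flagged with 1).
    """
    lines = []
    for rank, (pid, score) in enumerate(ranked, start=1):
        threshold = 1 if rank == threshold_rank else 0
        # Assumed field order: TOPIC-ID THRESHOLD PID RANK SCORE RUN-ID
        lines.append(f"{topic_id} {threshold} {pid} {rank} {score:.4f} {run_id}")
    return lines


# Hypothetical topic, PIDs, and run name.
for line in format_run("TOPIC1", [("111", 0.9), ("222", 0.5), ("333", 0.1)], 2, "my_run"):
    print(line)
```

A full run would concatenate such blocks for all topics into a single file.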

Number of Runs

  • Participants are allowed to submit up to 3 runs per task (hence a maximum of 6 runs for both tasks)
  • For Task 2, participants are also allowed (in fact encouraged) to submit ANY number of runs that result from their 2017 frozen systems. Use 2017_XXX as the run name for these runs.

Length of Runs

  • For Task 1: A maximum of 5,000 PIDs per topic, i.e. a total maximum of 150,000 lines in a run
  • For Task 2: All PIDs provided in the topic files


The evaluation script is available at

The evaluation will follow the evaluation implemented by TREC Total Recall. The assumption behind this evaluation is the following: The user of your system is the researcher that performs the abstract and title screening of the retrieved articles. Every time an abstract is returned (i.e. ranked) there is an incurred cost/effort, whether the abstract is irrelevant (in which case no further action will be taken) or relevant (and hence passed to the next stage of document screening) to the topic under review.


  • Each run, for either task, will be evaluated with respect to its ability to rank relevant documents above non-relevant ones.
  • The evaluation will be performed over the entire ranking (independent of thresholding)
  • The measures that will be primarily reported will be:
    • Mean Average Precision
    • Recall vs. Number of Documents Shown
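The two ranking measures above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names, not the official evaluation script; it uses the standard definition of average precision and assumes binary relevance.

```python
def average_precision(ranking, relevant):
    """Mean of precision values at the ranks where relevant documents occur."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, pid in enumerate(ranking, start=1):
        if pid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall_curve(ranking, relevant):
    """Recall after each document shown (index 0 = one document shown)."""
    relevant = set(relevant)
    if not relevant:
        return [0.0] * len(ranking)
    found, curve = 0, []
    for pid in ranking:
        found += pid in relevant
        curve.append(found / len(relevant))
    return curve


def mean_average_precision(per_topic):
    """per_topic: list of (ranking, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in per_topic) / len(per_topic)
```

For a ranking `["a", "b", "c", "d"]` with relevant documents `{"a", "c"}`, average precision is (1/1 + 2/3) / 2 and recall reaches 1.0 after three documents.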


  • Each run, for either task, will be evaluated with respect to its ability to retrieve relevant documents with minimal effort.
  • The measures that will be primarily reported will be recall@threshold vs. number of shown documents, attempting to find a Pareto frontier.
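One way to read the threshold evaluation is sketched below in Python: each run contributes a point (number of documents shown, recall at the threshold), and the Pareto frontier keeps the points that no other run beats on both axes. The function names and the example numbers are hypothetical, and the dominance rule is an assumption about how the frontier is formed.

```python
def recall_at_threshold(ranking, relevant, threshold_rank):
    """Recall over the documents shown up to and including threshold_rank."""
    relevant = set(relevant)
    shown = ranking[:threshold_rank]
    return len(relevant.intersection(shown)) / len(relevant)


def pareto_frontier(points):
    """points: list of (num_shown, recall) pairs, one per run.

    A point is dominated if another, different point shows no more
    documents and achieves at least the same recall.
    """
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Hypothetical runs: (documents shown, recall at threshold).
print(pareto_frontier([(100, 0.8), (200, 0.9), (150, 0.7), (300, 0.9)]))
```

In this example the runs at (150, 0.7) and (300, 0.9) are dominated, since another run reaches at least the same recall with fewer documents shown.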

All evaluation measurements available to participants:

  1. Area under the recall-precision curve, i.e. Average Precision (tar_eval: ap)
  2. Minimum number of documents returned to retrieve all R relevant documents (tar_eval: last_rel)
    • a measure for optimistic thresholding
  3. Work Saved over Sampling @ Recall (tar_eval: wss_100, and wss_95)
    • WSS@Recall = (TN + FN) / N – (1-Recall)
  4. Area under the cumulative recall curve normalized by the optimal area (tar_eval: norm_area)
    • optimal area = R * N – (R^2/2)
  5. Normalized cumulative gain @ 0% to 100% of documents shown (tar_eval: NCG@0 to NCG@100)
    • For the simple case in which judgments are binary, normalized cumulative gain @ % is simply Recall @ % of shown documents
  6. Cost-based measure
    1. Total cost (tar_eval: total_cost)
        • Cost: 0
        • Cost: C_A
        • Cost: C_A + C_A
    2. Total cost w/ penalty
      • Cost = Total Cost + Penalty
      • Penalty Cost
        • Uniform: (m / R) * (N – n) * C_P (tar_eval: total_cost_uniform)
          • the assumption behind this is that one needs to examine half of the remaining documents to find the remaining missing documents
          • N is the total number of documents in the collection
          • n is the number of documents shown to the user
          • (N – n) is the number of documents not shown to the user
          • m is the number of missing relevant documents
          • C_P = 2*C_A
        • Weighted: sum_{i=1}^{m} (1/2^i) * (N – n) * C_P (tar_eval: total_cost_weighted)
          • the assumption behind this is that one needs to examine half of what is left to find the 1st relevant document, then 1/4, then 1/8, etc.
    3. Reliability
      • Reliability = loss_r + loss_e (tar_eval: loss_er)
      • loss_r = (1-recall)^2 (tar_eval: loss_r)
      • loss_e = (n / (R+100) * 100/N)^2 (tar_eval: loss_e)
        • recall = nr / R (tar_eval: r)
        • nr is the number of relevant documents found and R the total number of relevant documents
        • n is the number of documents returned by the system, and
        • N is the size of the collection
        • for more information about this measure and its rationale, please refer to
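Several of the formulas listed above can be written out directly. The Python sketch below is an illustration under stated assumptions, not the tar_eval implementation: the function names are hypothetical, every relevant document is assumed to appear in the ranking, the second term of WSS uses the target recall level, and the area under the cumulative recall curve is computed with the trapezoidal rule so that a perfect ranking yields exactly R * N – (R^2/2).

```python
import math


def cumulative_found(ranking, relevant):
    """Number of relevant documents seen after each rank (index 0 = rank 1)."""
    relevant = set(relevant)
    found, counts = 0, []
    for pid in ranking:
        found += pid in relevant
        counts.append(found)
    return counts


def last_rel(ranking, relevant):
    """Rank of the last relevant document, i.e. documents needed for full recall."""
    relevant = set(relevant)
    return max(r for r, pid in enumerate(ranking, start=1) if pid in relevant)


def wss(ranking, relevant, target):
    """WSS@Recall = (TN + FN) / N - (1 - Recall), cut at the first rank reaching target."""
    counts = cumulative_found(ranking, relevant)
    N, R = len(ranking), len(set(relevant))
    needed = math.ceil(target * R)
    n = next(r for r, c in enumerate(counts, start=1) if c >= needed)
    # The N - n documents below the cut-off are exactly the TN + FN ones.
    return (N - n) / N - (1 - target)


def norm_area(ranking, relevant):
    """Area under the cumulative curve divided by the optimal area R*N - R^2/2."""
    counts = [0] + cumulative_found(ranking, relevant)
    N, R = len(ranking), len(set(relevant))
    area = sum((a + b) / 2 for a, b in zip(counts, counts[1:]))  # trapezoids (assumed)
    return area / (R * N - R ** 2 / 2)


def penalty_uniform(m, R, N, n, c_a=1.0):
    """(m / R) * (N - n) * C_P with C_P = 2 * C_A."""
    return (m / R) * (N - n) * (2 * c_a)


def penalty_weighted(m, N, n, c_a=1.0):
    """sum_{i=1}^{m} (1/2^i) * (N - n) * C_P with C_P = 2 * C_A."""
    return sum(1 / 2 ** i for i in range(1, m + 1)) * (N - n) * (2 * c_a)


def reliability(n, n_r, R, N):
    """loss_er = loss_r + loss_e, as defined in the list above."""
    loss_r = (1 - n_r / R) ** 2
    loss_e = (n / (R + 100) * 100 / N) ** 2
    return loss_r + loss_e
```

As a small check of the definitions, take a hypothetical ranking of 10 documents with relevant documents at ranks 1 and 3: full recall is reached after 3 documents, so WSS@100 is 7/10, and the trapezoidal area is 17 against an optimal area of 18.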


Evangelos Kanoulas (University of Amsterdam)

Dan Li (University of Amsterdam)

Rene Spijker (Cochrane Netherlands; UMC, Utrecht University; AMC, University of Amsterdam)

Leif Azzopardi (University of Strathclyde)