Task 2: User-Centred Health Information Retrieval
The 2015 CLEF eHealth Task 2 aims to evaluate the effectiveness of information retrieval systems when searching for health content on the web, with the objective of fostering research and development of search engines tailored to health information seeking.
This task is a continuation of the previous CLEF eHealth Task 3 that ran in 2013 and 2014, and embraces the TREC-style evaluation process: a shared collection of documents and queries, the contribution of runs from participants, and the subsequent formation of relevance assessments and evaluation of the participants' submissions.
In this year’s task, we explore queries that differ from those of previous years’ tasks. This year’s queries aim to mimic the queries of laypeople (i.e., not medical experts) who are confronted with a sign, symptom, or condition and attempt to find out more about the condition they may have. For example, when confronted with signs of jaundice, non-experts may use queries like “white part of eye turned green” to search for information that allows them to diagnose themselves or better understand their health conditions. These queries are often circumlocutory in nature: a long, ambiguous wording is used in place of the actual name of a condition or disease. Recent research has shown that these queries are used by health consumers and that current web search engines fail to effectively support them (Zuccon et al., Stanton et al.).
In addition to changes in query types, this year’s lab will introduce changes in the evaluation settings: to judge the effectiveness of a retrieval system, we will also consider the readability of the retrieved medical content, along with the usual topical assessments of relevance. Furthermore, the multilingual element added to the task last year will be further developed. This year, parallel queries in Arabic, Czech, French, German, Farsi and Portuguese will be offered, as well as baseline machine translations.
The collection is composed of a crawl of about one million documents, which have been made available to CLEF eHealth through the Khresmoi project. This collection consists of web pages covering a broad range of health topics, targeted at both the general public and healthcare professionals. Web pages in the corpus are predominantly from medical and health-related websites that have been certified by the Health on the Net (HON) Foundation as adhering to the HONcode principles (approximately 60–70% of the collection), as well as other commonly used health and medicine websites such as Drugbank, Diagnosia and Trip Answers. The crawled documents are provided in the dataset in their raw HTML (HyperText Markup Language) format along with their uniform resource locators (URLs). The dataset is made available for download on the web to registered participants on a secure password-protected server (see details below).
As described above, this year we will explore circumlocutory queries that users may pose when faced with signs and symptoms of a medical condition. To form the query set, we adopted the method recently investigated by Zuccon et al. and Stanton et al.
We provide a small set of sample (training) queries, in which both the query and narrative fields are provided, e.g.:
<top>
  <num>clef2015.training.1</num>
  <query>loss of hair on scalp in an inch width round</query>
  <narr>Documents should contain information allowing the user to understand they have alopecia</narr>
</top>
The narrative field is used to provide information to the assessors when performing relevance assessments. A test query will be very similar to the sample queries.
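Since the topic file is a sequence of TREC-style `<top>` blocks, participants typically parse it before indexing or retrieval. The sketch below shows one minimal way to do this in Python, assuming (as in the sample above) that each block is itself well-formed XML; `parse_topics` is a hypothetical helper, not part of any official toolkit.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<top> <num>clef2015.training.1</num> \
<query>loss of hair on scalp in an inch width round</query> \
<narr>Documents should contain information allowing the user to \
understand they have alopecia</narr> </top>"""

def parse_topics(text):
    """Parse a file of <top> blocks into a list of topic dicts.

    Assumes each block is well-formed XML; a wrapping root element is
    added so the whole file parses as one document."""
    root = ET.fromstring("<topics>%s</topics>" % text)
    topics = []
    for top in root.findall("top"):
        topics.append({
            "num": top.findtext("num"),
            "query": top.findtext("query"),
            "narr": top.findtext("narr"),
        })
    return topics

topics = parse_topics(SAMPLE)
print(topics[0]["num"], "->", topics[0]["query"])
```

Note that only the `query` field should be used for retrieval; the `narr` field is intended for the assessors, as described above.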
Multilingual versions will be provided for every query, including one automatic translation generated by Google Translate, e.g.:
<top>
  <num>clef2015.training.1</num>
  <orig_query>loss of hair on scalp in an inch width round</orig_query>
  <de_query>Haarverlust an der Kopfhaut Zoll Breite rund</de_query>
  <auto_de_to_en_query>Hair loss on the scalp inches wide around</auto_de_to_en_query>
</top>
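For a system that indexes the English collection, one simple baseline for the multilingual queries is to retrieve with the provided machine translation rather than the original-language query. The sketch below illustrates selecting the appropriate field; `query_for_run` is a hypothetical helper written for this example, not part of the task distribution.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<top> <num>clef2015.training.1</num> \
<orig_query>loss of hair on scalp in an inch width round</orig_query> \
<de_query>Haarverlust an der Kopfhaut Zoll Breite rund</de_query> \
<auto_de_to_en_query>Hair loss on the scalp inches wide around</auto_de_to_en_query> </top>"""

def query_for_run(top_xml, lang="de", use_auto_translation=True):
    """Pick which field of a multilingual topic to feed an English index.

    use_auto_translation=True selects the baseline machine translation
    (auto_<lang>_to_en_query); False selects the original-language query."""
    top = ET.fromstring(top_xml)
    if use_auto_translation:
        return top.findtext("auto_%s_to_en_query" % lang)
    return top.findtext("%s_query" % lang)

print(query_for_run(SAMPLE))
```

Systems with their own translation or cross-lingual retrieval components can of course start from the `<lang>_query` field instead.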
System evaluation will consider P@5, P@10, NDCG@5, and NDCG@10 (the main measure), which can be computed with the trec_eval evaluation tool, available at http://trec.nist.gov/trec_eval/.
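For reference, the two families of measures can be sketched as follows. This is a minimal illustration with made-up relevance grades, using one common NDCG formulation (exponential gain, log2 discount); trec_eval's exact implementation may differ in details, so participants should rely on the official tool for reported scores.

```python
import math

def precision_at_k(rels, k):
    """P@k over a ranked list of graded relevance labels (0 = not relevant)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k, all_grades):
    """NDCG@k: DCG of the run divided by DCG of the ideal ranking.

    all_grades holds every judged grade for the topic, used to build
    the ideal ranking."""
    def dcg(labels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(labels[:k]))
    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Toy example: grades of the top 5 retrieved documents for one topic,
# and the full pool of judged grades for that topic.
run = [2, 0, 1, 0, 2]
pool = [2, 2, 1, 1, 0, 0, 0]
print(precision_at_k(run, 5))   # 0.6
print(ndcg_at_k(run, 5, pool))
```

A perfect ranking of the judged documents yields NDCG@k of 1.0 by construction.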
We are also working on evaluation measures that combine assessments of topical relevance with the readability of the medical content, as suggested by Zuccon & Koopman. Information about this measure and the associated toolkit is included in the evaluation package.
Registration and Data Access
- Please register on the main CLEF 2015 registration page.
- Please fill in, print, and sign the end-user agreement, then scan it and send it by email to firstname.lastname@example.org.
- Sign up for PhysioNet: this is the website used to distribute the dataset.
- Log in to your account and request access to this project.
Apart from the 2015 data, registered users will have access to the queries used in 2013 and 2014.
- Collection release: 27th January 2015
- Training queries release: 6th March 2015 (updated from 14th February 2015)
- Test queries release: 20th March 2015 (updated from 20th February 2015)
- Result submission: 22nd April 2015 (updated from 15th April 2015)
The best (and likely the fastest) way to get your questions answered is to join one of the clef-ehealth mailing lists:
- https://groups.google.com/forum/#!forum/clef-ehealth-task-3 (exclusively for IR-related questions)
- https://groups.google.com/forum/#!forum/clef-ehealth-evaluation-lab-information (for all sub-tasks)
Guidelines and Submission Details
This document details the submission procedure and how your results will be judged.
Runs should be submitted using the EasyChair system at: https://easychair.org/conferences/?conf=clefehealth2015resul