Task 3 – Information Retrieval
Task 3. User-Centred Health Information Retrieval (IR)
Laypeople search differently from clinicians, both in how they formulate queries and in what they expect from retrieved documents. To support these needs, our goal is to develop methods and resources for the evaluation of Information Retrieval (IR) from the patients’ perspective. To this end, this year’s Task 3 is split into two parts:
Task 3a – monolingual information retrieval (IR) task
Task 3a is a standard TREC-style IR task using (a) the 2012 crawl of approximately one million medical documents made available in plain-text form by the EU-FP7 Khresmoi project (http://www.khresmoi.eu/), previously used in CLEF eHealth 2013’s Task 3 (this year we provide an improved, cleaned version of the collection distributed last year), and (b) a new 2014 set of English queries that members of the general public may realistically pose based on the content of their discharge summaries. Unlike last year, when a random disorder was selected from each discharge summary, this year’s queries are based on the main disorder diagnosed in the summary. Queries are generated from the discharge summaries used in Task 2. The goal of Task 3a is to retrieve documents relevant to these user queries.
Task 3b – multilingual IR task
Task 3b extends Task 3a by providing translations of the Task 3a queries into German, French and Czech. The goal in Task 3b is to develop techniques to translate these queries into English and then apply them to the retrieval task of Task 3a.
Tasks 3a and 3b will operate by distributing the test collection (document set; sample development queries in English, German, French and Czech; and result set) to registered task participants. Participants will have one month to explore the collection and develop retrieval techniques, after which the test queries for the task will be released. Post-submission relevance assessment will be conducted using a pool of the submitted runs. Result sets for the task and performance measures will be distributed to participants.
Participants are free to take part in Task 3a or Task 3b or both.
Task 3 Dataset
Document Collection
The dataset for Task 3 consists of a set of medical documents provided by the Khresmoi project. The collection covers a broad range of medical topics and does not contain any patient information. The documents come from several online sources, including websites certified by the Health On the Net (HON) Foundation, as well as well-known medical sites and databases (e.g. Genetics Home Reference, ClinicalTrials.gov, Diagnosia).
Topics
The topics are built from the discharge summaries provided by Task 2. Based on the main disorder diagnosed in each discharge summary, a medical professional generated topics containing the following fields:
- Title: the query text
- Description: a longer description of what the query means
- Narrative: the expected content of relevant documents
- Profile: main information about the patient (age, gender, condition)
- Discharge_summary: ID of the matching discharge summary
The training set contains 5 queries with matching relevance assessments; the test set contains 50 queries. For Task 3b, all queries have been professionally translated into German, French and Czech.
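To illustrate the topic structure, a topic with the fields above might be represented as follows. This is only a sketch: the field values, the ID, and the dictionary representation are invented, and the official distribution format may differ.

```python
# Hypothetical example of a Task 3 topic; all values are invented for illustration.
topic = {
    "title": "shortness of breath after heart surgery",
    "description": "What can cause shortness of breath in a patient "
                   "recovering from coronary bypass surgery?",
    "narrative": "Relevant documents discuss causes of post-operative "
                 "dyspnoea and their management.",
    "profile": "62-year-old male, post coronary artery bypass graft",
    "discharge_summary": "00123-DISCHARGE_SUMMARY",  # invented ID
}

# The five fields listed in the task description.
REQUIRED_FIELDS = {"title", "description", "narrative", "profile", "discharge_summary"}

def is_complete(t: dict) -> bool:
    """Check that a topic dictionary carries all five fields listed above."""
    return REQUIRED_FIELDS.issubset(t)

print(is_complete(topic))  # True for the example above
```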
Discharge Summaries (optional)
Registered lab participants may obtain access to the discharge summaries from Task 2. Obtaining the Task 2 dataset is not mandatory for participation in Task 3, but it can be used as an external resource if desired.
Task 3 Guidelines
Attention lab participants – you should now start writing your working notes papers! Submission deadline: June 7th.
Details on preparing working notes & link to the working notes submission system are available at: http://clef2014.clef-initiative.eu/index.php?page=Pages/instructions_for_authors.html
Participants will be provided with training and test data sets. The evaluation will be conducted using the withheld test queries. Participating teams are asked to stop development as soon as they download the test queries. Teams may use any outside resources in their algorithms.
Timeline:
- Run submission deadline: 1st of May (Hawaii time, UTC-10)
- Task result release: 1st of June
- Working notes submission deadline: 7th of June
Run Submission Guidelines:
- Runs description
Task 3a (monolingual information retrieval): Participants can submit up to seven ranked runs for the English (EN) queries. The top 1,000 documents returned for each query should be included.
Task 3b (multilingual information retrieval): Participants can also submit up to seven ranked runs for each of the German (DE), French (FR) and Czech (CZ) query sets. Again, the top 1,000 documents returned for each query should be included.
Description of the runs for each sub-task:
- Run 1 (mandatory) is a baseline: only the title and description fields of the query may be used, and no external resources (including discharge summaries, corpora, ontologies, etc.) may be used.
- Runs 2-4 (optional): any experiment WITH the discharge summaries.
- Runs 5-7 (optional): any experiment WITHOUT the discharge summaries.
One of Runs 2-4 and one of Runs 5-7 must use the IR technique of Run 1 as a baseline; this allows analysis of the impact of the discharge summaries and other techniques on the performance of the baseline Run 1. The optional runs must be ranked in order of priority (for Runs 2-4, Run 2 has the highest priority; for Runs 5-7, Run 5 has the highest priority).
Submitted runs should use the following naming convention: <TeamName>_<QueryLanguage>_Run<RunNumber>.<FileFormat>
For example: DCU_EN_Run1.dat , DCU_CZ_Run5.dat
Submitted runs have to follow the TREC format; the validity-checking tool ‘format-script-clefeHealth-task3.zip’ is available on PhysioNet.
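For reference, a TREC-format run file contains one line per retrieved document: query ID, the literal string "Q0", document ID, rank, score, and a run tag. The following sketch formats such lines and checks the file-naming convention above; the query and document IDs are invented, and the exact score formatting (here four decimal places) is a choice, not a task requirement.

```python
import re

def trec_line(qid: str, docno: str, rank: int, score: float, tag: str) -> str:
    """Format one result in standard TREC run format: qid Q0 docno rank score tag."""
    return f"{qid} Q0 {docno} {rank} {score:.4f} {tag}"

# File names should follow <TeamName>_<QueryLanguage>_Run<RunNumber>.<FileFormat>,
# with languages EN/DE/FR/CZ and run numbers 1-7.
RUN_NAME = re.compile(r"^[A-Za-z0-9]+_(EN|DE|FR|CZ)_Run[1-7]\.\w+$")

line = trec_line("qtest1", "doc-001", 1, 12.5, "DCU_EN_Run1")
print(line)  # qtest1 Q0 doc-001 1 12.5000 DCU_EN_Run1
print(bool(RUN_NAME.match("DCU_EN_Run1.dat")))  # True
print(bool(RUN_NAME.match("DCU_XX_Run8.dat")))  # False
```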
- Runs Submission
Runs should be submitted through EasyChair: https://www.easychair.org/conferences/?conf=clefehealth2014
- 1. Submit separately to each task by selecting “New Submission”. You will submit all runs for one task at the same time. After you have created a new submission, you can update it, but no updates of runs are accepted after the deadline has passed.
- 2. List all your team members as “Authors”. “Address for Correspondence” and “Corresponding author” refer to your team leader. Note: you can acknowledge people not listed as authors separately in the working notes (to be submitted by June 7 (see below)) – we wish this process to be very similar to defining the list of authors in scientific papers.
- 3. Please provide the task and your team name as “Title” (e.g., “Task 1a: Team NICTA” or “Task 1a using extra annotations: Team NICTA”) and a short description (max 100 words) of your team as “Abstract”. See the category list below the abstract field for the task names. If you submit to multiple tasks, please copy and paste the same description to all your submissions and use the same team name in all submissions.
- 4. Choose a “category” and one or more “Groups” to describe your submission. We allow up to 7 runs for Task 3.
- 5. Please provide 3-10 “Keywords” that describe the different runs in the submission, including methods (e.g., MetaMap, Support Vector Machines, Weka) and resources (e.g., Unified Medical Language System, expert annotation). You will provide a narrative description later in the process.
- 6. As “Paper”, please submit a zip file including the runs for this task. Submitted runs should use the naming convention <TeamName>_<QueryLanguage>_Run<RunNumber>.<FileFormat> (e.g. DCU_EN_Run1.dat, DCU_CZ_Run5.dat). Run ID 1 should refer to the mandatory baseline run; 2-4 to the optional runs generated using the discharge summaries; and 5-7 to the optional runs generated without them.
- 7. As the mandatory attachment file, please provide a txt file with a description of the submission. Please structure this file by using your run-file names above. For each run, provide a max 200 word summary of the processing pipeline (i.e., methods and resources). Be sure to describe differences between the runs in the submission.
- 8. Before June 7, 2014, please submit your working notes. Formatting and submission guidelines will be available soon.
- 9. Post-submission relevance assessment will be conducted on the test queries to generate the complete result set. The organizers will provide the evaluation results via the CLEFeHealth2014 EasyChair, including your ranking with respect to other teams as well as the value(s) of the official evaluation measure(s).
Evaluation Metrics:
Evaluation will focus on P@5, P@10, NDCG@5 and NDCG@10, but other suitable IR evaluation measures will also be computed for the submitted runs. The metrics can be computed with the trec_eval evaluation tool, available from http://trec.nist.gov/trec_eval/.
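As a rough guide to what these measures compute, here is a minimal sketch of P@k and nDCG@k over a single ranked list. The graded relevance values are invented, and this uses one common DCG formulation (linear gain, log2 discount); official scores should always be produced with trec_eval.

```python
import math

def precision_at_k(rels, k):
    """P@k: fraction of the top-k results judged relevant (rel > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """nDCG@k: DCG normalised by the DCG of an ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Invented graded judgements for the top 5 documents of one query.
rels = [2, 0, 1, 0, 2]
print(precision_at_k(rels, 5))  # 0.6
print(round(ndcg_at_k(rels, 5), 3))
```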
Working notes:
Participating groups in Task 3 must submit a report (working notes) describing their Task 3 experiments.
Task 3 – Getting Started
Registration
To participate, you must first register for CLEF 2014.
Registration link: http://147.162.2.122:8888/clef2014labs/
How to get the dataset
- Fill in the agreement form and send it, signed, to clefehealthtask3_at_khresmoi.eu
- Create an account on the PhysioNet website: click the link for “creating a PhysioNetWorks account” (near the middle of the page at https://physionet.org/pnw/login) and follow the instructions.
- Once logged in, go to our PhysioNet project webpage (https://physionet.org/works/ShareCLEFeHealth2013TASK3/) and request access to the data.
- Once approved, the organizers will add you to the PhysioNetWorks ShARe/CLEF eHealth 2013 Task 3 project as a reviewer. We will then send you an email confirming that you can go to the PhysioNetWorks website and click on the authorized users link to access the data (you will be asked to log in with your PhysioNetWorks account).
- (optional) Participants wishing to access the discharge summaries (the Task 2 dataset) must follow the Task 2 guidelines.
Timeline
- Document collection release: 15th December 2013
- Training queries and relevance assessments release: 31st January 2014
- Test queries release: 1st April 2014
- Submission of results: 1st May 2014
Run Submission System:
https://www.easychair.org/conferences/?conf=clefehealth2014
Contact information
To post questions about Task 3 to fellow participants and organisers, please join the Google Group web forum: https://groups.google.com/forum/?hl=en&fromgroups#!forum/clef-ehealth-task-3
To contact the organisers directly: clefehealthtask3_at_khresmoi.eu