CLEF eHealth 2020 – Task 2: Consumer Health Search

The 2020 CLEF eHealth Task 2 on consumer health search builds on the information retrieval tasks that have run at CLEF eHealth since its inception. The task follows the standard information retrieval shared-challenge paradigm: participants are provided with a test collection consisting of a set of documents and a set of topics, and develop retrieval techniques for it. Runs submitted by participants are pooled, and manual relevance assessment is conducted.

This year the lab proposes two subtasks:

  1. Adhoc subtask
  2. Spoken queries subtask

The document collection is common to all subtasks; only the topics change (they are provided in several versions).

Timeline

  • CLEF 2018 Collection Released (corpus + topics): January 2020 [released]
  • Result submission: **8 May 2020 at 23:55 GMT**
  • Results released: May-June 2020 (see task specific pages for details)
  • Participants’ working notes papers submitted [CEUR-WS]: 17 July 2020
  • Notification of Acceptance Participant Papers [CEUR-WS]: 14 August 2020
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 28 August 2020
  • CLEFeHealth2020 one-day lab session: **ONLINE, 22-25 September 2020**

Document Collection

The document collection used is clefehealth2018, the collection newly introduced in 2018. It consists of over 5 million medical webpages, from selected domains, acquired from CommonCrawl. Given the positive feedback received for this document collection, it is used again in the 2020 CHS task.

Document collection structure:

The corpus is divided into folders, one per domain name. Each folder contains a set of files, one per webpage from the given domain as captured by the CommonCrawl dump. In total, 2,021 domains were requested from the CommonCrawl dump of 2018-09; data was successfully acquired for 1,903 of them. For the remaining domains, the CommonCrawl API returned an error, corrupted data (10 retries were attempted), or incompatible data. Of the 1,903 crawled domains, 84 were not available in the CommonCrawl dump; for these, a folder exists in the corpus to represent the requested domain, but it is empty (meaning the domain was not available in the dump). Note that PDF documents were excluded from the data acquired from CommonCrawl. A complete list of domains and the size of the crawl data for each domain is available here.

The filename of a document, excluding its path, is used as the document id in the collection for relevance judgements (qrels). Web pages are kept in the original format as crawled from CommonCrawl, so they may be HTML, XHTML, XML, etc. Be mindful of this when setting up your parser for indexing this data (see, as an example, the information about the basic ElasticSearch index we distribute).
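As an illustration of how this layout can be traversed for indexing, here is a minimal sketch assuming the corpus has been unpacked locally and a local Elasticsearch instance is running; the path and index name are our own placeholders, and this is not the distributed index:

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x API shown)

CORPUS_DIR = "clefehealth2018"   # placeholder: local path to the unpacked corpus
es = Elasticsearch("http://localhost:9200")

for domain in sorted(os.listdir(CORPUS_DIR)):   # one folder per domain
    domain_dir = os.path.join(CORPUS_DIR, domain)
    if not os.path.isdir(domain_dir):
        continue
    for filename in os.listdir(domain_dir):     # one file per crawled web page
        with open(os.path.join(domain_dir, filename),
                  encoding="utf-8", errors="replace") as f:
            raw = f.read()                      # may be HTML, XHTML, XML, ...
        # The filename (without path) is the document id used in the qrels.
        es.index(index="clefehealth2018", id=filename,
                 document={"domain": domain, "raw": raw})
```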

The full collection, named clefehealth2018, occupies about 480 GB of space uncompressed. We have also created a subset of the corpus that contains a restricted number of websites, called clefehealth2018_B; this subset contains 1,653 website domains. It was created by removing a number of websites that were not strictly health-related (e.g. news websites). Note that this subset can be used in place of the full corpus for all tasks; however, in doing so, you may miss retrieving some of the relevant documents.

How to get the document collection?

The clefehealth2018 corpus is directly available for download at https://goo.gl/uBJaNi.

The collection is about 480 GB uncompressed and is distributed as a tar.gz compressed file; compressed, it is about 96 GB. Due to its size, we highly recommend using download manager software that can resume the download in case of failure. On Linux machines, this can be accomplished with wget -c.
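If a download manager is unavailable, resuming can also be scripted. Below is a minimal sketch in Python using requests, mirroring what wget -c does; the URL is a placeholder, not the actual download link:

```python
import os

import requests  # pip install requests

URL = "https://example.org/clefehealth2018.tar.gz"  # placeholder, not the actual link
OUT = "clefehealth2018.tar.gz"

# Resume from wherever a previous attempt stopped, as wget -c does.
offset = os.path.getsize(OUT) if os.path.exists(OUT) else 0
headers = {"Range": f"bytes={offset}-"} if offset else {}

with requests.get(URL, headers=headers, stream=True, timeout=60) as r:
    r.raise_for_status()
    mode = "ab" if r.status_code == 206 else "wb"  # 206 = server honoured the Range header
    with open(OUT, mode) as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```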

The clefehealth2018_B corpus is directly available for download at https://goo.gl/fGbwD5. It is about 294 GB uncompressed, and 57 GB compressed.

Alternatively, the corpus can be obtained by downloading the data directly from CommonCrawl using the script querycc.py, available on GitHub. Note that downloading the whole collection this way may take up to a week.

Topics

Historically, the CLEF eHealth IR task has released text queries representative of layperson medical information needs in various scenarios. In recent years, query variations issued by multiple laypeople for the same information need have been offered. In this year’s task, we extend this to spoken queries, generated by 6 individuals using the information needs derived for the 2018 challenge. We also provide textual transcripts of these spoken queries and automatic speech-to-text transcripts.

Topics for subtask 1: Adhoc IR

The topics are similar to the 2018 CHS task topics: 50 queries issued by the general public to the HON (Health on the Net) search service. These queries were manually selected by a domain expert from a sample of raw queries collected over a period of 6 months, so as to be representative of the type of queries posed to the search engine. Queries were not preprocessed; for example, any spelling mistakes that may be present have not been corrected. Queries are numbered using a 6-digit number with the following convention: the first 3 digits of a query ID identify the topic number (information need), ranging from 151 to 200; the last 3 digits identify the individual query creator.
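For illustration, a minimal sketch of this numbering convention (the function name is ours, for illustration only):

```python
def parse_qid(qid: str) -> tuple[int, int]:
    """Split a 6-digit query ID into (topic number, query-creator number)."""
    assert len(qid) == 6 and qid.isdigit()
    topic, creator = int(qid[:3]), int(qid[3:])
    assert 151 <= topic <= 200, "topic numbers range from 151 to 200"
    return topic, creator

print(parse_qid("151001"))  # -> (151, 1): topic 151, query creator 001
```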

Link to download the topics.

Topics for subtask 2: Spoken queries retrieval

All the queries from the adhoc subtask have been recorded by several users. Transcriptions of these audio files are also provided, produced using ESPnet, Librispeech, CommonVoice, and the Google API (with three models).

To get the audio queries and their transcriptions: fill in and sign the agreement and send it to hanna.suominen _at_ anu.edu.au. Please use the subject line “CLEF eHealth 2020 Task 2”. You will receive a link to download the data, valid for 24 hours.

Additional Resources

More resources can be found in the task’s 2020 GitHub repository.

Along with the collection, we make available the following additional resources:

  • An ElasticSearch index: https://goo.gl/exkdeA. This index is about 36 GB compressed (tar.gz).
  • An Indri (v5.9) index: https://goo.gl/uNKXcJ. This index is about 122 GB compressed (tar.gz).
  • A Terrier (v4.2) index: https://goo.gl/nUwLVo. This index is about 42 GB compressed (tar.gz).
  • Medical CBOW and Skipgram word embeddings (created using the TREC Medical Records collection): available at https://goo.gl/M2tWCf (scroll through the table to find the MedTrack embeddings).

Evaluation Methodology

The challenge this year is: given the query variants for a given information need, retrieve the relevant documents from the provided document collection. The task is divided into a number of subtasks, which can be completed using the spoken queries, the manual textual transcripts of the queries, or the provided automatic speech-to-text transcripts.

Participants can submit multiple runs for each subtask. The evaluation measures used for the ad-hoc search are NDCG@10, BPref, and RBP.
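For scoring runs locally before submission, a minimal sketch using the pytrec_eval library (an assumption on our part; the official evaluation may use different tooling, and RBP typically requires the separate rbp_eval tool). The file names are placeholders:

```python
import pytrec_eval  # pip install pytrec_eval

# qrels.txt and run.txt are placeholder file names in standard TREC formats.
with open("qrels.txt") as f:
    qrels = pytrec_eval.parse_qrel(f)
with open("run.txt") as f:
    run = pytrec_eval.parse_run(f)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10", "bpref"})
results = evaluator.evaluate(run)

for qid in sorted(results):
    print(qid, results[qid]["ndcg_cut_10"], results[qid]["bpref"])
```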

Submission

The format for the submission of runs should follow the standard TREC run format. Fields in the run result file should be separated using a space as the delimiter between columns. The width of the columns is not important, but all columns must be present, with some amount of whitespace between them. Each line of a run should contain the following fields: qid Q0 docno rank score tag, where:

  • qid is the query number (or query id)
  • Q0 is the literal Q0
  • docno is the id of a document returned by your system for qid
  • rank (1-999) is the rank of this response for this qid
  • score is a system-generated indication of the quality of the response: please ensure documents are listed in decreasing score value. Ties in score will be treated as per trec_eval convention.
  • tag is the identifier for the system, also called the run id.

Example run:

151001 Q0 3a6ac7fc-b2ea-4631-9438-f58ba0dfef41 1 1.73315273652 mySystem
151001 Q0 bc3b9dda-18d2-4ad5-9a37-26cbc10a3f7f 2 1.72581054377 mySystem
151001 Q0 fc3aa605-1103-494e-be6d-bd5331e7612a 3 1.72522727817 mySystem
151001 Q0 fefda1a5-39b6-486b-b88f-0e534da574d3 4 1.72522727817 mySystem
151001 Q0 341f81da-2f47-42c8-a37c-2df312fe165c 5 1.71374426875 mySystem
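A minimal sketch of producing a run file in this format (the scores, document ids, and file name below are placeholders):

```python
def write_run(results, tag, path="run.txt"):
    """results: dict mapping qid -> list of (docno, score), sorted by decreasing score."""
    with open(path, "w") as f:
        for qid, ranking in results.items():
            for rank, (docno, score) in enumerate(ranking, start=1):
                f.write(f"{qid} Q0 {docno} {rank} {score} {tag}\n")

# Hypothetical example: two retrieved documents for query 151001.
write_run({"151001": [("3a6ac7fc-b2ea-4631-9438-f58ba0dfef41", 1.73315273652),
                      ("bc3b9dda-18d2-4ad5-9a37-26cbc10a3f7f", 1.72581054377)]},
          tag="mySystem")
```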

Number of submissions per team: up to 4 submissions are allowed for each team. Note that we may not be able to pool all of a team’s submissions for relevance assessment.

Please remember to choose this submission topic on EasyChair at https://easychair.org/conferences/?conf=clefehealth2020runs. Submissions are due by 8 May 2020 at 23:55 GMT (see the timeline above).

Complete guidelines for submission are available in PDF.