CLEF eHealth 2020 – Task 2: Consumer Health Search

The 2020 CLEF eHealth Task 2 on consumer health search builds on the information retrieval tasks that have run at CLEF eHealth since the lab's inception. The task follows the standard information retrieval shared-challenge paradigm: participants are provided with a test collection consisting of a set of documents and a set of topics, and develop retrieval techniques for it. Runs submitted by participants are pooled, and manual relevance assessment is conducted.

This year the lab proposes 2 subtasks:

  1. Adhoc subtask
  2. Spoken queries subtask

The document collection is common to all subtasks; only the topics change (they are provided in several versions).

Timeline

  • clefehealth2018 collection released (corpus + topics): January 2020 [released]
  • Result submission: 1st of May 2020, Anywhere on Earth
  • Participants’ working notes papers submitted [CEUR-WS]: 31st May 2020
  • Notification of Acceptance Participant Papers [CEUR-WS]: 15th June 2020
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 29th June 2020
  • CLEFeHealth2020 one-day lab session: Sept 2020

Document Collection

The document collection used is the collection newly introduced in 2018, named clefehealth2018. This collection consists of over 5 million medical webpages from selected domains acquired from the CommonCrawl. Given the positive feedback received for this document collection, it will be used again in the 2020 CHS task.

Document collection structure:

The corpus is divided into one folder per domain name. Each folder contains a set of files, one per webpage from the given domain as captured by the CommonCrawl dump. In total, 2,021 domains were requested from the 2018-09 CommonCrawl dump; data was successfully acquired for 1,903 of them. For the remaining domains, the CommonCrawl API returned an error, corrupted data (10 retries were attempted), or incompatible data. Of the 1,903 crawled domains, 84 were not available in the CommonCrawl dump; for each of these, a folder exists in the corpus to represent the requested domain, but it is empty (meaning the domain was not available in the dump). Note that PDF documents were excluded from the data acquired from CommonCrawl. A complete list of domains, with the size of the crawled data for each domain, is available here.

The filename of a document, excluding its path, is used as the document ID in the collection for relevance judgements (qrels). Web pages are kept in the original format crawled from CommonCrawl, so they may be HTML, XHTML, XML, etc. Be mindful of this when setting up your parser for indexing this data (see, as an example, the information about the basic ElasticSearch index we distribute).
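Because pages may be HTML, XHTML, or XML in whatever state they were crawled, a lenient parser is advisable. As one possible approach (not the official indexing pipeline), Python's standard-library `html.parser`, which tolerates malformed markup, can pull visible text out of a raw page:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # >0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(raw_page):
    """Return the visible text of a crawled page as one string."""
    parser = TextExtractor()
    parser.feed(raw_page)
    return " ".join(parser.parts)
```

The extracted text can then be fed to whichever indexer you use; heavier-duty tools (e.g. the distributed ElasticSearch index) handle this for you.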

The full collection, named clefehealth2018, occupies about 480GB of space, uncompressed. We have also created a subset of the corpus, called clefehealth2018_B, which contains a restricted set of 1,653 website domains. This subset was created by removing a number of websites that were not strictly health-related (e.g. news websites). Note that this subset can be used in place of the full corpus for all tasks; however, in doing so, you may miss retrieving some of the relevant documents.

How to get the document collection?

The clefehealth2018 corpus is directly available for download at https://goo.gl/uBJaNi.

The collection is about 480GB uncompressed and is distributed as a tar.gz compressed file of about 96GB. Due to its size, we highly recommend using download manager software that can resume the download in case of failure. On Linux machines, this can be accomplished with wget -c.

The clefehealth2018_B corpus is directly available for download at https://goo.gl/fGbwD5. This corpus can be used in place of the full corpus for all tasks; however, in doing so, you may miss retrieving some of the relevant documents. The corpus is about 294GB uncompressed, and 57GB compressed.

Alternatively, the corpus can be obtained by downloading the data directly from the CommonCrawl. To do this, use the script querycc.py available on GitHub. Note that this may take up to 1 week to download the whole collection.

Topics

Historically, the CLEF eHealth IR task has released text queries representative of layperson medical information needs in various scenarios. In recent years, query variations issued by multiple laypeople for the same information need have been offered. In this year's task, we extend this to spoken queries. These spoken queries were generated by 6 individuals using the information needs derived for the 2018 challenge. We also provide textual transcripts of these spoken queries and automatic speech-to-text transcriptions.

Topics for subtask 1: Adhoc IR

The topics are similar to the 2018 CHS task topics: 50 queries issued by the general public to the HON (Health on the Net) search service. These queries were manually selected by a domain expert from a sample of raw queries collected over a period of 6 months, chosen to be representative of the type of queries posed to the search engine. Queries were not preprocessed; for example, any spelling mistakes that may be present have not been corrected. Queries are numbered using a 6-digit number with the following convention: the first 3 digits of a query ID identify the topic number (information need), ranging from 151 to 200, and the last 3 digits identify the individual query creator.

Topics for subtask 2: Spoken queries retrieval

All the queries from the adhoc task have been recorded by several users. Transcriptions of these audio files are also provided, generated using ESPnet, LibriSpeech, CommonVoice, and the Google Speech API (with three models).

To get the audio queries and their transcriptions: fill in and sign the agreement and send it to hanna.suominen _at_ anu.edu.au. Please use the subject line "CLEF eHealth 2020 Task 2". You will receive a link to download the data, valid for 24 hours.

Additional Resources

Along with the collection, we make available the following additional resources:

  • An ElasticSearch index: https://goo.gl/exkdeA. This index is about 36 GB compressed (tar.gz).
  • An Indri (v5.9) index: https://goo.gl/uNKXcJ. This index is about 122 GB compressed (tar.gz).
  • A Terrier (v4.2) index: https://goo.gl/nUwLVo. This index is about 42 GB compressed (tar.gz).
  • Medical CBOW and Skipgram word embeddings (created using the TREC Medical Records collection): available at https://goo.gl/M2tWCf (scroll through the table to find MedTrack embeddings)

Evaluation Methodology

The challenge this year is: given the query variants for an information need, retrieve the relevant documents from the provided document collection. The task is divided into a number of sub-tasks, which can be completed using the spoken queries, the textual transcripts of the queries, or the provided automatic speech-to-text transcripts.

Participants can submit multiple runs for each subtask. The evaluation measures used for the ad-hoc search are NDCG@10, BPref, and RBP.
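For intuition about two of these measures, the following is a minimal sketch of RBP (rank-biased precision, Moffat & Zobel) and NDCG@10. It uses linear gains with a log2(rank+1) discount; official scoring is done with standard tools (e.g. trec_eval), whose exact gain conventions may differ:

```python
import math

def rbp(rels, p=0.8):
    """Rank-biased precision: (1 - p) * sum_k rel_k * p^(k-1).

    `rels` are the relevance values of the ranked results, top first;
    `p` is the user persistence parameter (0.8 is a common choice).
    """
    return (1 - p) * sum(r * p ** k for k, r in enumerate(rels))

def dcg(rels, cutoff):
    # Linear-gain DCG with a log2(rank + 1) discount.
    return sum(r / math.log2(k + 2) for k, r in enumerate(rels[:cutoff]))

def ndcg(run_rels, all_rels, cutoff=10):
    """NDCG@cutoff: DCG of the run over the DCG of an ideal ranking.

    `all_rels` are all judged grades for the topic, so the ideal
    ranking comes from the full qrels, not just the retrieved list.
    """
    ideal = dcg(sorted(all_rels, reverse=True), cutoff)
    return dcg(run_rels, cutoff) / ideal if ideal > 0 else 0.0
```

For example, a run retrieving binary relevances [1, 0, 1] gets RBP(p=0.8) = 0.2 × (1 + 0 + 0.64) = 0.328, and a run in perfect relevance order scores NDCG@10 = 1.0.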