Task 3: Consumer Health Search Task

Task Overview

This task is a continuation of the previous CLEF eHealth information retrieval (IR) tasks that ran in 2013, 2014, 2015, 2016, 2017, and 2018. It embraces the TREC-style evaluation process, with a shared collection of documents and queries, the contribution of runs from participants, and the subsequent formation of relevance assessments and evaluation of the participants’ submissions.

This year’s IR task will continue the growth path established by the 2014, 2015, 2016, 2017 and 2018 CLEF eHealth information retrieval challenges.

The 2019 task uses a new set of queries (speech-to-text queries) compared to previous years.

Timeline

  • CLEF 2019 Collection Release (corpus + topics): data release begins January 2019
  • Result submission: 1 May 2019
  • Participants’ working notes papers submitted [CEUR-WS]: 24 May 2019
  • Submission system: <URL for result submission will be released in Spring 2019>
  • Notification of Acceptance Participant Papers [CEUR-WS]: 14 June 2019
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 29 June 2019
  • CLEFeHealth2019 one-day lab session: Sept 2019

Tasks Description

This year’s CLEF eHealth IR Consumer Health Search challenge will offer new queries (query variants, speech-to-text generated queries), and will use the same document collection as the CLEF eHealth 2018 IR Consumer Health Search Challenge. The document collection and training queries will be distributed in January 2019.

Dataset

The document corpus used in CLEF 2019 is the same as the corpus used in CLEF 2018. It consists of web pages acquired from CommonCrawl. An initial list of websites was identified for acquisition. The list was built by submitting the CLEF 2018 queries to the Microsoft Bing APIs (through the Azure Cognitive Services) repeatedly over a period of a few weeks**, and acquiring the URLs of the retrieved results. The domains of the URLs were then included in the list, except for some domains that were excluded for decency reasons (e.g. pornhub.com). The list was further augmented by including a number of known reliable health websites and other known unreliable health websites, from lists previously compiled by health institutions and agencies.

** Repeated submissions over time were performed because previous work has shown that Bing’s API results vary considerably over time, both in terms of results and effectiveness; see: Jimmy, G. Zuccon, G. Demartini, “On the Volatility of Commercial Search Engines and its Impact on Information Retrieval Research”, SIGIR 2018 (to appear).

Structure of the corpus

The corpus is divided into folders, one per domain name. Each folder contains one file per webpage from that domain, as captured by the CommonCrawl dump. In total, 2,021 domains were requested from the CommonCrawl dump of 2018-09. We successfully acquired data for 1,903 of them; for the remaining domains the CommonCrawl API returned an error, corrupted data (10 retries were attempted), or incompatible data. Of the 1,903 crawled domains, 84 were not available in the CommonCrawl dump: for each of these, an empty folder exists in the corpus to indicate that the domain was requested but was not available in the dump. Note that .pdf documents were excluded from the data acquired from CommonCrawl. A complete list of domains and the size of the crawl data for each domain is available at https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018collection_listofdomains.txt

Note that each file in a folder represents a web page. The document id used in the collection for each webpage (e.g. in the qrels) is the filename. The web page is in the original format as crawled from CommonCrawl, so it may be html, xhtml, xml, etc. Be mindful of this when setting up your parser for indexing this data (see, for example, the information about the basic ElasticSearch index we distribute).
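
As a rough guide, the Python sketch below iterates over the corpus and yields (document id, plain text) pairs. It is a minimal sketch only: the corpus root path is a placeholder and BeautifulSoup is just one option for stripping markup; it is not the pipeline used to build the distributed indices.

import os
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def iterate_corpus(corpus_dir):
    """Yield (doc_id, plain_text) pairs; the filename is the document id."""
    for domain in sorted(os.listdir(corpus_dir)):
        domain_dir = os.path.join(corpus_dir, domain)
        if not os.path.isdir(domain_dir):
            continue  # skip any stray files at the top level
        for filename in os.listdir(domain_dir):
            path = os.path.join(domain_dir, filename)
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                raw = f.read()
            # Pages may be html, xhtml, xml, etc.; a lenient parser copes with all of them.
            text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
            yield filename, text

# Example usage (path is a placeholder):
# for doc_id, text in iterate_corpus("clefehealth2018"):
#     print(doc_id, len(text))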

The full collection, named clefehealth2018, occupies about 480 GB of space uncompressed. We have also created a subset of the corpus that contains a restricted number of websites, called clefehealth2018_B; this subset contains 1,653 sites and was created by removing a number of websites that were not strictly health related (e.g. news websites). Note that this subset can be used in place of the full corpus for all tasks; however, in doing so you may fail to retrieve some of the relevant documents.

Queries

The query set for 2018 consists of 50 queries issued by the general public to the HON search service. Note that the queries may contain typos. The queries and the process used to obtain them are described in:

Goeuriot, L., Hanbury, A., Hegarty, B., Hodmon, J., Kelly, L., Kriewel, S., Lupu, M., Markonis, D., Pecina, P. and Schneller, P. (2014). D7.3 Meta-analysis of the second phase of empirical and user-centered evaluations. Public Technical Report, Khresmoi Project, August 2014.

(check section 4.1.3.3 of that Technical Report for some details about the queries).

Queries are formatted one per line in the tab-separated query file, with the first string being the query id, and the second string being the query text.
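
For example, a minimal Python sketch for loading such a file into a dictionary is given below (the file name is a placeholder for whatever name the distributed query file has):

def read_queries(path):
    """Parse the tab-separated query file: one query per line, 'query_id<TAB>query_text'."""
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            query_id, query_text = line.split("\t", 1)
            queries[query_id] = query_text
    return queries

# Example usage (file name is a placeholder):
# queries = read_queries("clef2018_queries.tsv")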

How to obtain the dataset

The clefehealth2018 corpus is directly available for download at https://goo.gl/uBJaNi

Note, the collection is about 480GB uncompressed and is distributed as a tar.gz compressed file; the compressed size is about 96 GB.

Alternatively, the corpus can be obtained by downloading the data directly from the CommonCrawl. To do this, use the script querycc.py available in GitHub at https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/querycc.py. Note that this may take up to 1 week to download.

The clefehealth2018_B corpus is directly available for download at https://goo.gl/fGbwD5. As noted above, this subset can be used in place of the full corpus for all tasks; however, in doing so you may fail to retrieve some of the relevant documents. The corpus is about 294GB uncompressed, and 57GB compressed.

The queries are made available for download in GitHub: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/

The health intent taxonomy is made available at: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_health_intent_taxonomy.csv.

Resources

As in previous years, along with the collection we make available the following resources:

  • An ElasticSearch index: <URL for download coming soon>. Note, this index is about 36 GB compressed (tar.gz).
  • An Indri index: <URL for download coming soon>. The Indri index is around 122 GB compressed (tar.gz).
  • A Terrier index: <URL for download coming soon>. This index is around 42 GB compressed (tar.gz).
  • Medical CBOW and Skipgram word embeddings (created using the TREC Medical Records collection): <URL for download coming soon> (scroll through the table to find the MedTrack embeddings; see the loading sketch after this list)
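
If you plan to use the word embeddings, the Python sketch below shows one way to load them with gensim. It assumes the embeddings are distributed in word2vec format and uses a placeholder file name; check the actual files and their format on the download page.

from gensim.models import KeyedVectors  # pip install gensim

# File name is a placeholder; set binary=True if the embeddings turn out to be
# distributed in binary word2vec format rather than text format.
vectors = KeyedVectors.load_word2vec_format("medtrack_cbow.vec", binary=False)
print(vectors.most_similar("diabetes", topn=5))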

How the ElasticSearch index was created

The example basic ElasticSearch v5.1.1 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus themselves. Note that this index has not been optimised beyond converting all text to lowercase, applying a standard stopword list (the one distributed with Terrier 4.2), and applying a standard rule-based stemming algorithm (Porter). Note that the retrieval model for this index has been set to the standard ElasticSearch BM25 with b=0.75 and k1=1.2.
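
For reference, b and k1 are the usual Okapi BM25 parameters (ElasticSearch/Lucene implements a close variant of this formula, differing mainly in details such as the IDF smoothing):

score(D, Q) = \sum_{t \in Q} IDF(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b \cdot |D| / avgdl)}

where f(t, D) is the frequency of term t in document D, |D| is the document length and avgdl is the average document length in the collection; b controls the strength of document-length normalisation and k1 controls how quickly the term-frequency contribution saturates.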

The ElasticSearch indexing configuration file is reported below.

{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
            "my_english": {
                "tokenizer": "standard",
                "filter": ["lowercase", "terrier_stopwords", "porter_stem"]
            }
        },
        "filter": {
          "terrier_stopwords": {
              "type": "stop",
              "stopwords_path": "stopwords/terrier-stop.txt"
          }
        }
      },
      "similarity": {
        "bm25_content": {
            "type": "BM25",
            "b": 0.75,
            "k1": 1.2
        }
      }
    },
    "mappings": {
      "docType": {
        "_source": {
            "enabled": true
            },
        "properties": {
            "content": {
                 "type": "text",
                 "similarity": "bm25_content",
                 "analyzer": "my_english"
            },
            "is_html": {
                "type": "boolean"
            }
         }
        }
    }
}

If you need a tutorial on how to use ElasticSearch for IR experiments, have a look at the Elastic4IR project.
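
As a starting point, below is a minimal Python sketch for querying the distributed index with its default BM25 ranking. The index name "clef2018" is a placeholder (check the actual name on your instance, e.g. via the _cat/indices API), while the "content" field comes from the mapping above.

from elasticsearch import Elasticsearch  # pip install "elasticsearch>=5,<6" to match the v5.1.1 index

es = Elasticsearch()  # connects to localhost:9200 by default

# "clef2018" is a placeholder index name; "content" is the field defined in the mapping above.
response = es.search(
    index="clef2018",
    body={"query": {"match": {"content": "hip replacement recovery time"}}},
    size=10,
)
for rank, hit in enumerate(response["hits"]["hits"], start=1):
    print(rank, hit["_id"], hit["_score"])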

How the Indri index was created

The example basic Indri version 5.9 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus themselves. After decompression, each document of the collection was wrapped in the traditional trectext tags:

<DOC> <DOCNO> DOC ID </DOCNO> ORIGINAL CONTENT </DOC>

The collection was then indexed using the following configuration:


<parameters>
  <corpus> <class> trectext </class> </corpus>
  <stemmer><name> krovetz </name> </stemmer>
</parameters>
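
For teams re-building the trectext files themselves, the Python sketch below shows one way to wrap crawled pages in those tags before running IndriBuildIndex with the parameters above; the paths are placeholders and this is not necessarily the exact script used to build the distributed index.

import os

def to_trectext(doc_id, content):
    """Wrap a crawled page in the TREC-text tags expected by Indri's 'trectext' parser."""
    return "<DOC>\n<DOCNO> {} </DOCNO>\n{}\n</DOC>\n".format(doc_id, content)

# Example: wrap every file of one domain folder into a single trectext file (paths are placeholders).
domain_dir = "clefehealth2018/example.org"
with open("example.org.trectext", "w", encoding="utf-8") as out:
    for filename in os.listdir(domain_dir):
        with open(os.path.join(domain_dir, filename), encoding="utf-8", errors="ignore") as f:
            out.write(to_trectext(filename, f.read()))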

How the Terrier index was created

Like the ElasticSearch and Indri indices, the example basic Terrier 4.2 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus themselves. The complete configuration file can be found at etc/terrier.properties; the main configurations used were:


TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR,SCRIPT,STYLE
TrecDocTags.casesensitive=true
FieldTags.process=TITLE,H1,A,META,ELSE
termpipelines=Stopwords,PorterStemmer
block.indexing=true
blocks.size=1

Useful Links

Organizers

Guido Zuccon (Queensland University of Technology)

Joao Palotti (Qatar Computing Research Institute)

Jimmy (Queensland University of Technology)

Lorraine Goeuriot (Université Grenoble Alpes)

Liadh Kelly (Maynooth University)

Working notes papers

<details coming soon>