CLEF eHealth 2018 – Task 3: Consumer Health Search

Task Overview

This task is a continuation of the CLEF eHealth information retrieval (IR) tasks that ran from 2013 to 2017. It embraces the TREC-style evaluation process: a shared collection of documents and queries, the contribution of runs from participants, and the subsequent formation of relevance assessments and evaluation of the participants' submissions.

This year's IR task continues the growth path of the 2014, 2015, 2016 and 2017 CLEF eHealth information retrieval challenges.

The 2018 task uses a new web corpus and a new set of queries compared to previous years.

Timeline

  • CLEF 2018 Collection Released (corpus + topics): April 2018 [released]
  • Result submission: 31st May 2018, everywhere on Earth — extended to June 8th. Submission system: https://easychair.org/conferences/?conf=clefehealth2018runs
  • Participants’ working notes papers submitted [CEUR-WS]: 31st May 2018 — extended to June 8th. Submission system: https://easychair.org/conferences/?conf=clef2018
  • Notification of Acceptance Participant Papers [CEUR-WS]: 15th June 2018
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 29th June 2018
  • CLEFeHealth2018 one-day lab session: Sept 2018 in Avignon

Tasks Description

IRTask 1: Ad-hoc Search 

This is a standard ad-hoc search task, aiming at retrieving information relevant to people seeking health advice on the web. A set of 50 queries is provided as input to the participating systems. Participants need to return a TREC result file containing a ranking of results in answer to each query. 

Note: the document id used in the collection for each webpage (e.g. in the qrels) is the filename of the corresponding file in the collection, with no path information. For example, 9db79442-a329-4948-bc0c-2b0aee114362 is the filename of a webpage in the corpus, and we use it to identify that webpage.

Evaluation measures for IRTask1: NDCG@10, BPref and RBP. trec_eval will be used for NDCG@10 and BPref: this is available for download at https://github.com/usnistgov/trec_eval

Queries for IRTask1: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task1_task4.txt (Attention: only the <en> ... </en> part of the query file should be used for IRTask1)
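
Since only the <en> ... </en> part of the query file is to be used for IRTask1, a simple way to pull out the English text is a regular expression. This is a minimal sketch: the `sample` string below is an invented illustration of the surrounding layout, not the actual file format.

```python
import re

def english_queries(raw: str) -> list[str]:
    """Extract only the <en> ... </en> parts of the query file,
    as required for IRTask1."""
    return [m.strip() for m in re.findall(r"<en>(.*?)</en>", raw, re.DOTALL)]

# Illustrative input only; the real file's surrounding layout may differ.
sample = "<query><id>151001</id><en>heart rate</en><fr>rythme cardiaque</fr></query>"
print(english_queries(sample))  # ['heart rate']
```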

Submission format: The format for the submission of runs should follow the standard TREC run format. Fields in the run result file should be separated using a space as the delimiter between columns. The width of the columns in the format is not important, but it is important to include all columns and have some amount of white space between the columns. Each run should contain the following fields: 

qid Q0 docno rank score tag

where:

  • qid is the query number (or query id)
  • Q0 is the literal Q0
  • docno is the id of a document returned by your system for qid
  • rank (1-999) is the rank of this response for this qid
  • score is a system-generated indication of the quality of the response: please ensure documents are listed in decreasing score value. Ties in score will be treated as per trec_eval convention.
  • tag is the identifier for the system, also called the run id.

Example run:

151001 Q0 3a6ac7fc-b2ea-4631-9438-f58ba0dfef41 1 1.73315273652 mySystem
151001 Q0 bc3b9dda-18d2-4ad5-9a37-26cbc10a3f7f 2 1.72581054377 mySystem
151001 Q0 fc3aa605-1103-494e-be6d-bd5331e7612a 3 1.72522727817 mySystem
151001 Q0 fefda1a5-39b6-486b-b88f-0e534da574d3 4 1.72522727817 mySystem
151001 Q0 341f81da-2f47-42c8-a37c-2df312fe165c 5 1.71374426875 mySystem
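
The six-column format above can be produced with a small helper; this is just a convenience sketch (the function name and default tag are our own), enforcing the rank range stated above.

```python
def format_run_line(qid, docno, rank, score, tag="mySystem"):
    """Produce one line in the standard six-column TREC run format:
    qid Q0 docno rank score tag (whitespace-separated)."""
    if not (1 <= rank <= 999):
        raise ValueError("rank must be between 1 and 999")
    return f"{qid} Q0 {docno} {rank} {score} {tag}"

line = format_run_line(151001, "3a6ac7fc-b2ea-4631-9438-f58ba0dfef41", 1, 1.73315273652)
print(line)
```

Remember to list documents in decreasing score order, as trec_eval re-sorts by score, not by the rank column.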

Number of submissions per team: Up to 4 submissions per team are allowed. Note we may not be able to pool all the submissions of a team for relevance assessments.

IRTask 2: Personalized Search 

This task builds on top of IRTask1 and follows on from a similar task introduced in 2017. Here, we aim to personalize the retrieved list of search results so as to match user expertise, measured by how likely the person is to understand the content of a document (with respect to the health information). To this aim, we will use the graded version of the uRBP evaluation measure, adopted since 2016, which evaluates using both topical relevance and other dimensions of relevance, such as understandability and trustworthiness.

We will further vary the parameters of this evaluation measure to evaluate personalisation to different users. Each topic has 7 query variations: the first 4 were issued by people with no medical knowledge, while the last 3 were issued by medical experts. When evaluating results for a query variation, we use a parameter alpha to capture user expertise. The parameter determines the shape of the gain curve, so that documents at the right understandability level obtain the highest gains, with decaying gains assigned to documents that do not suit the understandability level of the modelled user. We will use alpha=0.0 for query variation 1, alpha=0.2 for query variation 2, alpha=0.4 for query variation 3, alpha=0.5 for query variation 4, alpha=0.6 for query variation 5, alpha=0.8 for query variation 6 and, finally, alpha=1.0 for query variation 7. This models increasing levels of expertise across the query variations of a topic. The intuition behind this evaluation is that a person with no specific health knowledge (represented by query variation 1) would not understand complex and technical health material, while an expert (represented by query variation 7) would have little or no interest in reading introductory/basic material. A script that implements this evaluation measure, including the parameterized component, will be made available in early May.
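
To make the idea concrete, here is a rough sketch of an alpha-parameterised, understandability-biased RBP. The official script (released in early May) is authoritative; in particular, the gain shape below, which peaks when a document's understandability level matches the modelled expertise alpha, is only an assumption for illustration.

```python
def urbp(rels, unders, alpha, p=0.8):
    """Illustrative understandability-biased RBP sketch.

    rels[k], unders[k]: relevance (0/1) and understandability level in [0,1]
    of the document at rank k+1 (0 = introductory, 1 = highly technical).
    The gain peaks when understandability matches the modelled user
    expertise `alpha` and decays with the mismatch -- the exact gain
    curve used in the official script may differ.
    """
    score = 0.0
    for k, (rel, u) in enumerate(zip(rels, unders)):
        gain = rel * (1.0 - abs(u - alpha))  # assumed gain shape
        score += gain * p ** k
    return (1 - p) * score
```

Under this sketch, a novice (alpha=0.0) gets full gain from an easy relevant document and no gain from a highly technical one, and vice versa for an expert (alpha=1.0).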

Note that the 2016 and 2017 collections include assessments for understandability (for the same documents for which relevance was assessed), thus they can be used by teams for training. Understandability assessments are captured in the qunder files (similar to qrels, but for understandability); the 2016 assessments are available here.

Evaluation measures for IRTask2: uRBP (with alpha).

Queries for IRTask2: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task2_task3.txt

Submission format: The format for the submission of runs should follow the standard TREC run format. Note that this submission will contain results for more queries than IRTask1.

Number of submissions per team: Up to 4 submissions per team are allowed. Note we may not be able to pool all the submissions of a team for relevance assessments.

IRTask 3: Query Variations

IRTask2 treated query variations for a topic independently. IRTask3 instead explicitly explores the dependencies among query variations for the same information need. The task aims to foster research into building search systems that are robust to query variations.

For IRTask3 we ask participants to submit a single set of results for each topic (each topic has 7 query variations). Participants are informed of which query variations relate to the same topic, and should take these variations into account when building their systems.

Submissions will be evaluated using the same measures as IRTask1, but within the mean-variance evaluation framework (MVE). In this framework, the evaluation results for the query variations of a topic are averaged, and their variance is also accounted for, to compute a final system performance estimate. A script that implements the mean-variance evaluation framework will be made available in early May.
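
As a sketch of the aggregation described above (the official script is authoritative), a common instantiation of mean-variance evaluation rewards the mean score over a topic's query variations and penalises its variance; the trade-off parameter `b` below is an assumption.

```python
from statistics import mean, pvariance

def mve_score(per_variation_scores, b=1.0):
    """Mean-variance aggregation over the evaluation scores of the
    query variations of one topic: the mean is rewarded and the
    variance penalised. The trade-off parameter b and the exact
    formulation are assumptions; the official script may differ."""
    return mean(per_variation_scores) - b * pvariance(per_variation_scores)
```

A system that scores 0.5 on every variation thus beats one that averages 0.5 with large swings across variations, which is exactly the robustness this task is after.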

Evaluation measures for IRTask3: NDCG@10, BPref and RBP – in the MVE framework.

Queries for IRTask3: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task2_task3.txt

Submission format: The format for the submission of runs should follow the standard TREC run format. Note that this submission will contain results for the same number of topics as IRTask1 has queries (50); as the query id, you should use only the first 3 digits of a query id (e.g. 150, 151, …, 159, 160, …, 200), which identify the topic. 

Number of submissions per team: Up to 4 submissions per team are allowed. Note we may not be able to pool all the submissions of a team for relevance assessments.

IRTask 4: Multilingual Ad-hoc Search 

This task, similar to the corresponding task in previous CLEF eHealth IR challenges, offers parallel queries in the following languages: French, German, and Czech. Multilingual submissions will be evaluated using the IRTask1 evaluation metrics and runs, but results for the multilingual queries will be reported separately.

Evaluation measures for IRTask4: NDCG@10, BPref and RBP. trec_eval will be used for NDCG@10 and BPref: this is available for download at https://github.com/usnistgov/trec_eval

Queries for IRTask4: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task1_task4.txt

Submission format: The format for the submission of runs should follow the standard TREC run format. Note that you should create a different submission for each language, each with the same number of queries as IRTask1. To differentiate among runs for each language, make sure you properly name the filename of the run (e.g. TEAM_en.run, TEAM_de.run, etc.), and also the run id in the run file (e.g. myrun_en, myrun_de, etc.).

Number of submissions per team: Up to 4 submissions per team are allowed for each language, in addition to the English runs. Note we may not be able to pool all the submissions of a team for relevance assessments.

IRTask 5: Query Intent Identification

This task, introduced this year, requires participants to classify queries with respect to their underlying intent. A health intent taxonomy is provided at:

https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_health_intent_taxonomy.csv.

The taxonomy is a hierarchy of health information need intents with 8 high-level intents: (1) Disease/illness/syndrome/pathological condition, (2) Drugs and medicinal substances, (3) Healthcare, (4) Test & procedures, (5) First aid, (6) Healthy lifestyle, (7) Human anatomy, (8) Organ systems. Each high-level intent has at most 13 low-level intents. The figure below provides a snapshot of the health intent taxonomy, with example queries. Given a query, participants need to predict the correct intent underlying the query. Note that a query may have multiple intents. For each query, participants are asked to submit the top 3 intent predictions, in the form of the taxonomy id corresponding to the intent.

Evaluation measures for IRTask5: Mean Reciprocal Rank, nDCG@1, 2, 3. We will differentiate between matches with intents at the high level of the taxonomy and at the low level of the taxonomy.
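
Mean Reciprocal Rank over the top-3 predictions can be sketched as follows (function names are our own; the official scoring, including the high-level vs low-level distinction, may differ):

```python
def reciprocal_rank(predicted, gold):
    """Reciprocal rank of the first correct intent among the (up to 3)
    predicted taxonomy ids; 0 if none matches. `gold` is the set of
    true intents for the query (a query may have several)."""
    for rank, intent in enumerate(predicted, start=1):
        if intent in gold:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (predicted, gold) pairs."""
    return sum(reciprocal_rank(p, g) for p, g in runs) / len(runs)
```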

Queries for IRTask5: these are the same queries used for IRTask1 (50 queries, only English version):

https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_queries_task1_task4.txt

Submission format: The submission should follow the TREC run format, where instead of a document id, participants should list a low-level taxonomy id, e.g. 1.1, 1.4, 2.1. Note, for each query, submit a ranking of only 3 low-level taxonomy ids. 

Number of submissions per team: Up to 2 submissions per team.

Dataset

The document corpus used in CLEF 2018 consists of web pages acquired from CommonCrawl. An initial list of websites was identified for acquisition. The list was built by submitting the CLEF 2018 queries to the Microsoft Bing APIs (through the Azure Cognitive Services) repeatedly over a period of a few weeks**, and acquiring the URLs of the retrieved results. The domains of these URLs were then included in the list, except for some domains that were excluded for decency reasons (e.g. pornhub.com). The list was further augmented with a number of known reliable health websites and other known unreliable health websites, taken from lists previously compiled by health institutions and agencies.

** repeated submissions over time were performed because previous work has shown that Bing's API results vary considerably over time, both in terms of results returned and retrieval effectiveness; see: Jimmy, G. Zuccon, G. Demartini, "On the Volatility of Commercial Search Engines and its Impact on Information Retrieval Research", SIGIR 2018 (to appear).

Structure of the corpus

The corpus is divided into folders by domain name. Each folder contains a set of files, each corresponding to a webpage from that domain as captured by the CommonCrawl dump. In total, 2,021 domains were requested from the CommonCrawl dump of 2018-09. We successfully acquired data for 1,903 domains; for the remaining ones, the CommonCrawl API returned an error, corrupted data (10 retries were attempted), or incompatible data. Of the 1,903 crawled domains, 84 were not available in the CommonCrawl dump; for these, an empty folder exists in the corpus to represent the requested domain (signalling that it was not available in the dump). Note that .pdf documents were excluded from the data acquired from CommonCrawl. A complete list of domains and the size of the crawl data for each domain is available at https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018collection_listofdomains.txt

Note that each file in a folder represents a web page. The document id for each webpage that is used in the collection (e.g. for the qrels) is the filename. The web page is in the original format as crawled from CommonCrawl, thus it may be html, xhtml, xml, etc. Be mindful of this when setting up your parser for indexing this data (see as example the information about the basic ElasticSearch index we distribute).
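
Because pages come in whatever format CommonCrawl captured (html, xhtml, xml, etc.), indexing pipelines need a tolerant text extractor. A minimal stdlib sketch is below; a production indexer should handle malformed markup more defensively (the distributed ElasticSearch index is one ready-made alternative).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal sketch: collect visible text, skipping <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    return " ".join(parser.chunks)
```

When walking the corpus, remember that each file's bare filename (no path) is the document id to report in runs.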

The full collection, named clefehealth2018, occupies about 480GB of space, uncompressed. We have also created a subset of the corpus that contains a restricted number of websites. This is called clefehealth2018_B; this subset contains 1,653 sites. It was created by removing a number of websites that were not strictly health related (e.g. news websites). Note, this subset can be used in place of the full corpus for all tasks; however, in doing so you may fail to retrieve some of the relevant documents.

Queries

The query set for 2018 consists of 50 queries issued by the general public to the HON search service. Note that the queries may contain typos. The queries and the process to obtain them are described in:

Goeuriot, Lorraine and Hanbury, Allan and Hegarty, Brendan and Hodmon, John and Kelly, Liadh and Kriewel, Sascha and Lupu, Mihai and Markonis, Dimitris and Pecina, Pavel and Schneller, Priscille (2014) D7.3 Meta-analysis of the second phase of empirical and user-centered evaluations. Public Technical Report, Khresmoi Project, August 2014.

(check section 4.1.3.3 of that Technical Report for some details about the queries).

Queries are formatted one per line in the tab-separated query file, with the first string being the query id, and the second string being the query text.
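
Parsing the query file as described above is a one-liner per line; this sketch (names are our own, and the sample ids are illustrative) builds a qid-to-text mapping:

```python
def parse_queries(text: str) -> dict:
    """Parse the tab-separated query file: one query per line,
    with the query id first and the query text second."""
    queries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        qid, qtext = line.split("\t", 1)
        queries[qid] = qtext
    return queries

sample = "151001\theart rate\n151002\tlow blood pressure"
print(parse_queries(sample))
```

Note that queries may contain typos (they are real user queries), so do not "clean" them away when indexing or matching.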

How to obtain the dataset

The clefehealth2018 corpus is directly available for download at https://goo.gl/uBJaNi

Note, the collection is about 480GB uncompressed and is distributed as a tar.gz compressed file; the size compressed is about 96 GB.

Alternatively, the corpus can be obtained by downloading the data directly from the CommonCrawl. To do this, use the script querycc.py available in GitHub at https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/querycc.py. Note that this may take up to 1 week to download.

The clefehealth2018_B corpus is directly available for download at https://goo.gl/fGbwD5. Note, this corpus can be used in place of the full corpus for all tasks; however, in doing so you may fail to retrieve some of the relevant documents. The corpus is about 294GB uncompressed, and 57GB compressed.

The queries are made available for download in GitHub: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/

The health intent taxonomy is made available at: https://github.com/CLEFeHealth/CLEFeHealth2018IRtask/blob/master/clef2018_health_intent_taxonomy.csv.

Resources

Along with the collection, we make available the following resources:

  • An ElasticSearch index: available for download until May 31st, 2018 at https://goo.gl/exkdeA. Note, this index is about 36 GB compressed (tar.gz)
  • An Indri index: available for download until May 31st, 2018 at https://goo.gl/uNKXcJ. The indri index has around 122 GB compressed (tar.gz).
  • A Terrier index: available for download until May 31st, 2018 at https://goo.gl/nUwLVo. This index has around 42 GB compressed (tar.gz).
  • Medical CBOW and Skipgram word embeddings (created using the TREC Medical Records collection): available at https://goo.gl/M2tWCf (scroll through the table to find MedTrack embeddings)

How the ElasticSearch index was created

The example basic ElasticSearch v.5.1.1 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus. Note that this index has not been optimised beyond converting all text to lowercase, applying a standard stopword list (the one distributed with Terrier 4.2), and applying a standard rule-based stemming algorithm (Porter). Note that the retrieval model for this index has been set to the standard ElasticSearch BM25 with b=0.75 and k1=1.2.
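
For reference, the role of the b and k1 parameters mentioned above is shown by the textbook BM25 term contribution below. This is an illustrative sketch only: Lucene/ElasticSearch uses a slightly different idf smoothing, so scores will not match the index's exactly.

```python
import math

def bm25_term(tf, df, N, doclen, avg_doclen, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term.

    k1 controls term-frequency saturation; b controls document-length
    normalisation. tf: term frequency in the document; df: document
    frequency of the term; N: collection size."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doclen / avg_doclen))
    return idf * norm
```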

The ElasticSearch indexing configuration file is reported below.

{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
            "my_english": {
                "tokenizer": "standard",
                "filter": ["lowercase", "terrier_stopwords", "porter_stem"]
            }
        },
        "filter": {
          "terrier_stopwords": {
              "type": "stop",
              "stopwords_path": "stopwords/terrier-stop.txt"
          }
        }
      },
      "similarity": {
        "bm25_content": {
            "type": "BM25",
            "b": 0.75,
            "k1": 1.2
        }
      }
    },
    "mappings": {
      "docType": {
        "_source": {
            "enabled": true
            },
        "properties": {
            "content": {
                 "type": "text",
                 "similarity": "bm25_content",
                 "analyzer": "my_english"
            },
            "is_html": {
                "type": "boolean"
            }
         }
        }
    }
}

If you need a tutorial on how to use ElasticSearch for IR experiments, have a look at the Elastic4IR project.

How the Indri index was created

The example basic Indri version 5.9 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus. After uncompressing the collection, each document was wrapped in the traditional trectext tags:

<doc> <docno> DOC ID </docno> ORIGINAL CONTENT </doc>

And the collection was indexed using the following configuration:

<parameters>
  <corpus> <class> trectext </class> </corpus>
  <stemmer><name> krovetz </name> </stemmer>
</parameters>
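
The wrapping step above can be sketched as a small helper (the function name is our own); it reproduces the trectext shape shown earlier so the output is indexable with the configuration above:

```python
def to_trectext(docno: str, content: str) -> str:
    """Wrap a raw corpus file in the trectext tags used before
    Indri indexing, as described above."""
    return f"<doc> <docno> {docno} </docno> {content} </doc>"

print(to_trectext("9db79442-a329-4948-bc0c-2b0aee114362", "<html>...</html>"))
```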

How the Terrier index was created

Like the ElasticSearch and Indri indices, the example basic Terrier 4.2 index we distribute as part of the challenge is intended to be used by teams that do not have the possibility to process or index the corpus. The complete configuration file can be found at etc/terrier.properties, but the main configurations used were:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR
TrecDocTags.skip=SCRIPT,STYLE
TrecDocTags.casesensitive=true
FieldTags.process=TITLE,H1,A,META,ELSE
termpipelines=Stopwords,PorterStemmer
block.indexing=true
blocks.size=1


Organizers

Guido Zuccon (Queensland University of Technology)

Joao Palotti (Qatar Computing Research Institute)

Jimmy (Queensland University of Technology)

Lorraine Goeuriot (Université Grenoble Alpes)

Liadh Kelly (Maynooth University)

Working notes papers

The deadline to submit your working notes is May 31, 2018. Note that this is the same deadline to submit your runs. The working notes should provide a detailed description of the methods used by the participants, and any analysis they may have done of the results. Note that as the deadline for the submission of the runs is the same as that for working notes, relevance assessments will not be available at the time of writing the working notes. Authors may still want to analyse their results in terms of e.g. overlap of results across runs, top domains retrieved, etc.