CLEF eHealth 2018 – Task 1: Multilingual Information Extraction – ICD10 coding

Task Description

Task 1 challenges participants to extract information from written text, with a focus on corpora in less explored languages: this year, French, Hungarian and Italian. It builds upon the 2016 and 2017 tasks, which addressed the analysis of French biomedical text through the extraction of causes of death from a corpus of death reports in French (2016, 2017) and English (2017). The task can be treated as a named entity recognition and normalization task, but also as a text classification task. Each language can be addressed independently, but we encourage participants to explore multilingual approaches. Only fully automated means are allowed; human-in-the-loop approaches are not permitted.

The goal of the task is to automatically assign ICD10 codes to the text content of death certificates.

Timeline

  • Training set release: Mid-February 2018
  • Test set release: 27 April 2018
  • Result submission: 12 May 2018
  • Participants’ working notes papers submitted [CEUR-WS]: 31 May 2018
  • Notification of Acceptance Participant Papers [CEUR-WS]: 15 June 2018
  • Camera Ready Copy of Participant Papers [CEUR-WS] due: 29 June 2018
  • CLEFeHealth2018 one-day lab session: Sept 2018 in Avignon

Targeted Participants

The task is open for everybody. We particularly welcome academic and industrial researchers, scientists, engineers and graduate students in natural language processing, machine learning and biomedical/health informatics to participate. We also encourage participation by multi-disciplinary teams that combine technological skills with clinical expertise.

Data Set

The data set is the CépiDC Causes of Death Corpus. It comprises free-text descriptions of causes of death as reported by physicians in the standardized causes of death forms. Each document was manually coded by experts with ICD-10 codes per international WHO standards. The languages of the challenge this year are French, Hungarian* and Italian* (*confirmation of data use agreement pending).

Training Data

To obtain the training data, participants need to fill in the data use agreement **[TODO: this link does not work anymore]**. The Data Use Agreement is a binding document that requires participants to:

  • Use the corpus for research purposes only;
  • Submit the results of their system to the CLEF eHealth 2018 lab, including a description of their system that will appear in the lab Working Notes; the submitted results may be shared with the community at the discretion of the CLEF eHealth organizers.
  • Not redistribute or otherwise disclose the contents of the corpus to any unauthorized third party. This implies that the use of online translation services that involve sharing the text to be translated with the service provider is not permitted. Only short citations are permitted in publications, for illustration purposes.

French Version of ICD-10

If you are in Switzerland, you can use the French version of ICD-10, which can be downloaded from the following addresses:

https://www.bfs.admin.ch/bfs/fr/home/statistiques/sante/nomenclatures/medkk/instruments-codage-medical.html#par_headline_782102000

https://www.bfs.admin.ch/bfsstatic/dam/assets/1140604/master

After unzipping the above archive, you can extract a list of ICD codes and associated preferred terms using the following Unix command:

cut -d\; -f8,9 CIM10GM2016_CSV_S_FR_versionmétadonnée_codes_2016_12_01.txt
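The same code/label pairs can be loaded in Python, assuming the file keeps the semicolon-delimited layout targeted by the cut command above (the sample row below is hypothetical, for illustration only):

```python
# Sketch: load ICD-10 code/label pairs from the semicolon-delimited
# dictionary file (fields 8 and 9, as in the `cut` command above).

def load_icd10_terms(lines):
    """Map ICD-10 codes to their preferred terms."""
    terms = {}
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) >= 9:
            code, label = fields[7], fields[8]  # 8th and 9th fields
            terms[code] = label
    return terms

sample = "1;2;3;4;5;6;7;G20;Maladie de Parkinson"  # hypothetical row
print(load_icd10_terms([sample]))  # {'G20': 'Maladie de Parkinson'}
```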

Test Set and Submission Guidelines

The test data for CLEF eHealth 2018 Task 1 was released on 27 April 2018. Submissions are expected in the same format as the training data. Per the data use agreement, participants must submit at least one run, for at least one language of their choice. Participants may submit up to two runs per test dataset and may participate in any number of languages.

Runs should be submitted using the EasyChair system at: https://easychair.org/conferences/?conf=clefehealth2018runs

The task consists of extracting ICD10 codes from the raw lines of death certificate text. The process of identifying a single ICD code per certificate as the "primary cause" of death may build on this task, but is not evaluated here. The task is an information extraction task that relies on the supplied text to extract ICD10 codes from the certificates, line by line.

Please note that there are two formats for the data:

  1. The raw format presents the data as supplied by CépiDC. In this format, the raw text of death certificates (CausesBrutes) is presented separately from the associated metadata (Ident) and reference coding (CausesCalculees), which is supplied in the training data and expected as system output on the test data.
  2. The aligned format presents the data in an automatically integrated fashion, as described in Lavergne et al. (2016). In this format, the raw text of each death certificate is integrated with the metadata and has been aligned with a normalized version of the text (sometimes, dictionary entries) used by coders to make a coding decision.

The French dataset is available in both the raw and aligned formats, whereas the Hungarian and Italian datasets are only available in the raw format.

Sample text line (in the French aligned dataset):

Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse)

ICD codes expected to be associated with this text line:

G200

R600

The sample text is given with associated metadata (9 fields):

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL

The output will be expected in the following format:

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL;Maladie de Parkinson idiopathique;maladie Parkinson idiopathique;G200

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL;Angioedème membres sup;oedème membres supérieurs;R600

The output comprises the 9 input fields plus three additional fields: two text fields in which participants must report evidence supporting the predicted ICD10 code, and the code itself in the twelfth, final field. The tenth field should contain the excerpt of the original text that supports the ICD code prediction. If the system uses a dictionary or other lexical resource linking text to ICD10 codes (including the dictionaries supplied by the organizers), the eleventh field should contain the dictionary entry that supports the prediction.
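As a sketch (not the official tooling), an aligned output row can be assembled by appending the two evidence fields and the code to the input fields, reusing the sample line shown above:

```python
# Sketch: build a 12-field aligned output row from the 9 input fields
# plus supporting text, dictionary entry, and ICD-10 code.

def make_output_row(input_fields, support_text, dict_entry, icd_code):
    """Append the two evidence fields and the ICD-10 code to the 9 input fields."""
    assert len(input_fields) == 9, "expected the 9 input fields"
    return ";".join(input_fields + [support_text, dict_entry, icd_code])

input_fields = ["80147", "2013", "2", "85", "4", "5",
                "Maladie de Parkinson idiopathique Angioedème membres sup "
                "récent non exploré par TDM (à priori pas de cause médicamenteuse)",
                "NULL", "NULL"]
row = make_output_row(input_fields,
                      "Maladie de Parkinson idiopathique",
                      "maladie Parkinson idiopathique",
                      "G200")
print(row.count(";"))  # 11 separators, i.e. 12 fields
```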

Please note that in some cases, there is no ICD10 code associated with a given text line. In other cases, the ICD10 codes associated with a given line use the context provided in other lines of the same certificate.

Sample text:

14;2007;1;40;5;2;pendaison;NULL;NULL

14;2007;1;40;5;3;suicide ?;NULL;NULL

Sample codes associated:

14;2007;1;40;5;2;pendaison;NULL;NULL;;;

14;2007;1;40;5;3;suicide ?;NULL;NULL;2-1;suicide pendaison;X709

Replication

This year, CLEF eHealth task 1 does not offer a dedicated replication track. However, participants interested in replication and reproducibility are encouraged to participate in the CENTRE task.

Submission File Format – run submissions due May 4, 2018

Each submission must consist of the following items:

Address for Correspondence: address, city, post code, (state), country

Author(s): first name, last name, email, country, organisation

Title: Instead of entering a paper title, please specify your team name here. A good name is something short but identifying. For example, Mayo, Limsi, and UTHealthCCB have been used before. If your team also participated in CLEF eHealth 2018 task 2 or task 3, we ask that you please use the same team name for your task 1 submission.

Keywords: Instead of entering three or more keywords to characterize a paper, please use this field to describe your methods. We encourage using MeSH or ACM keywords.

Topics: please tick all relevant tracks among CépiDC and Replication, reflecting the runs you are submitting.

ZIP file: This file is an archive containing two description files and a folder hierarchy with the results of your runs, organized as follows:

file 1: Team description as team.txt (max 100 words): Please write a short general description of your team. For example, you may report that “5 PhD students, supervised by 2 Professors, collaborated” or “A multi-disciplinary approach was followed by a clinician bringing in content expertise, a computational linguist capturing this as features of the learning method and two machine learning researchers choosing and developing the learning method”.

file 2: Method description as methods.txt (max 100 words per method): Please write a short general description of the method(s) used for each run. Please include the following information: 1/ whether the method was a/ statistical, b/ symbolic (expert or rule-based), or c/ hybrid (i.e. a combination of a/ and b/); 2/ whether the method used the training data supplied (corpus, dictionaries); 3/ whether the method used outside data such as an additional corpus, annotations on the training corpus, or lexicons, with a brief description of these outside resources, including whether the data is public.

Folder: Runs should be stored in a folder called YourTeamName. Participants may submit up to two runs for each test file.

Your run folder should contain one subfolder for each of the languages you address: FR for French, HU for Hungarian, IT for Italian.

– the FR folder should be further subdivided into two folders: 1/ a folder called aligned, containing the runs for the aligned test files, i.e. test files where the metadata was integrated with the certificate text; 2/ a folder called raw, containing the runs for the raw test files, i.e. test files where the metadata was supplied separately from the certificate text.

– the IT and HU folders should contain only one folder called raw corresponding to the raw test files.
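As a sketch, the layout above can be created with a few lines of Python; "submission" is a placeholder root directory and "YourTeamName" should be replaced with your actual team name:

```python
# Sketch: create the expected run-folder layout under a placeholder
# root directory. "YourTeamName" is the placeholder used in the
# guidelines; replace it with your actual team name.
import os

def make_run_layout(root, team="YourTeamName"):
    for sub in ("FR/aligned", "FR/raw", "HU/raw", "IT/raw"):
        os.makedirs(os.path.join(root, team, sub), exist_ok=True)

make_run_layout("submission")
# creates submission/YourTeamName/FR/aligned, .../FR/raw, .../HU/raw, .../IT/raw
```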

Each aligned run should consist of a single .csv file containing 12 fields: the original 9 fields supplied as input, plus three additional fields (two optional text fields containing supporting evidence and one field with the extracted ICD10 codes).

Each raw run should consist of a single .csv file containing 6 fields: the first 3 fields supplied as input in the CausesBrutes files (DocID;YearCoded;LineID), plus three additional fields (two optional text fields containing supporting evidence and one field with the extracted ICD10 codes).

If you choose not to supply supporting information, the corresponding text fields must be empty. The files corresponding to each run should be named run1.csv and run2.csv. We recommend running the evaluation tool on your data to check format compliance.
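A rough format check, which is not the supplied evaluation tool and assumes field values contain no embedded semicolons, can count the semicolon-separated fields on each line:

```python
# Rough format check (not the supplied evaluation tool): verify that
# every line of a run file has the expected number of semicolon-separated
# fields -- 12 for aligned runs, 6 for raw runs. Assumes field values
# contain no embedded semicolons.

def check_run_lines(lines, fmt):
    expected = {"aligned": 12, "raw": 6}[fmt]
    bad = []
    for i, line in enumerate(lines, start=1):
        if len(line.rstrip("\n").split(";")) != expected:
            bad.append(i)
    return bad  # line numbers that fail the check

raw_lines = ["14;2007;2;;;", "14;2007;3;2-1;suicide pendaison;X709"]
print(check_run_lines(raw_lines, "raw"))  # [] means all lines are compliant
```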

Thank you for your participation!

Evaluation Methods

Results will be evaluated with the evaluation program supplied with the training data.

System performance will be assessed by precision, recall and F-measure for ICD code extraction at the line level.
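The program supplied with the training data is authoritative; as an illustration only, line-level evaluation can be sketched as micro-averaged precision, recall and F-measure over (document, line, code) triples, an assumption about the metric's exact granularity:

```python
# Minimal sketch of line-level evaluation: micro-averaged precision,
# recall and F-measure over (DocID, LineID, ICD-code) triples.
# The evaluation program supplied with the training data is
# authoritative; this only illustrates the metrics.

def prf(gold, predicted):
    """gold, predicted: sets of (doc_id, line_id, icd_code) triples."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {("80147", "5", "G200"), ("80147", "5", "R600")}
pred = {("80147", "5", "G200"), ("80147", "5", "J189")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```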

Registration and Data Access

  1. Please register on the main CLEF 2018 registration page at http://clef2018-labs-registration.dei.unipd.it/
  2. Please contact the task organizers to receive credentials to access the task datasets.