Task 1b: Clinical Named Entity Recognition
News:
- Submission deadlines – updated April 27, 2015
- Updated test set submission guidelines – updated April 24, 2015
The CLEFeHealth 2015 Task 1b addresses clinical named entity recognition in languages other than English. The aim is to automatically identify clinically relevant entities in medical text in French. Only fully automated means are allowed, that is, human-in-the-loop approaches are not permitted.
Targeted Participants
The task is open for everybody. We particularly welcome academic and industrial researchers, scientists, engineers and graduate students in natural language processing and biomedical/health informatics to participate. We also encourage participation by multi-disciplinary teams that combine technological skills with linguistic and/or medical expertise.
Data Set
The data set is called the QUAERO French Medical Corpus. It was developed in 2013 as a resource for named entity recognition and normalization (Névéol et al. 2014).
The corpus was created in the wake of the 2013 CLEF-ER challenge, with the purpose of building a gold standard set of normalized entities for French biomedical text. A subset of the MEDLINE titles and EMEA documents used in the 2013 CLEF-ER challenge was selected for human annotation and will be used in this challenge. The annotation process was guided by concepts in the Unified Medical Language System (UMLS):
1. Ten types of clinical entities, as defined by the following UMLS Semantic Groups (Bodenreider and McCray 2003), were annotated: Anatomy, Chemical and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures.
2. The annotations were made in a comprehensive fashion, so that nested entities were marked and entities could be mapped to more than one UMLS concept. In particular:
(a) If a mention can refer to more than one Semantic Group, all the relevant Semantic Groups should be annotated. For instance, the mention “récidive” (recurrence) in the phrase “prévention des récidives” (recurrence prevention) should be annotated with the category “DISORDER” (CUI C2825055) and the category “PHENOMENON” (CUI C0034897).
(b) If a mention can refer to more than one UMLS concept within the same Semantic Group, all the relevant concepts should be annotated. For instance, the mention “maniaques” (manic) in the phrase “patients maniaques” (manic patients) should be annotated with CUIs C0564408 and C0338831 (category “DISORDER”).
(c) Entities whose span overlaps with that of another entity should still be annotated. For instance, in the phrase “infarctus du myocarde” (myocardial infarction), the mention “myocarde” (myocardium) should be annotated with category “ANATOMY” (CUI C0027061) and the mention “infarctus du myocarde” with category “DISORDER” (CUI C0027051).
Annotations on the training set will be provided to participants in the BRAT standoff format, described here: http://brat.nlplab.org/standoff.html. Participants will also be expected to supply annotations in this format.
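For concreteness, below is a minimal Python sketch of reading and writing the two standoff record types used in this task (T entity lines and # AnnotatorNotes lines). The Entity class and helper names are illustrative assumptions, not part of the official task kit; the sketch ignores discontinuous spans (which BRAT encodes with a semicolon in the offset field), and the serialization of multiple CUIs per entity should be checked against the training data.

# Minimal reader/writer for the two BRAT standoff record types used here:
#   T<id><TAB><TYPE> <start> <end><TAB><mention text>
#   #<id><TAB>AnnotatorNotes T<id><TAB><CUI>
from dataclasses import dataclass, field

@dataclass
class Entity:
    tid: str      # e.g. "T1"
    etype: str    # e.g. "PROC"
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    text: str     # surface mention
    cuis: list = field(default_factory=list)  # filled from AnnotatorNotes

def read_ann(path):
    entities, notes = {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0].startswith("T"):
                etype, start, end = fields[1].split(" ")[:3]
                entities[fields[0]] = Entity(fields[0], etype, int(start), int(end), fields[2])
            elif fields[0].startswith("#"):
                notes.append(fields)
    for _, ref, cui in notes:            # ref is e.g. "AnnotatorNotes T1"
        entities[ref.split(" ")[1]].cuis.append(cui)
    return list(entities.values())

def write_ann(path, entities):
    with open(path, "w", encoding="utf-8") as f:
        for i, e in enumerate(entities, 1):
            f.write(f"T{i}\t{e.etype} {e.start} {e.end}\t{e.text}\n")
            if e.cuis:
                # One note line per entity; verify multi-CUI formatting
                # against the training data before submitting.
                f.write(f"#{i}\tAnnotatorNotes T{i}\t{' '.join(e.cuis)}\n")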
References
Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Resource for Medical Entity Recognition and Normalization. Proceedings of the Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing (BioTxtM 2014). 2014:24-30.
Bodenreider O, McCray AT. Exploring semantic groups through visual approaches. Journal of Biomedical Informatics. 2003;36:414-432.
Data Examples
MEDLINE title 1
La contraception par les dispositifs intra utérins
MEDLINE title 1 annotations (annotations are provided in standoff format and can be visualized using the BRAT rapid annotation tool, http://brat.nlplab.org/)
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149
MEDLINE title 2
Méningites bactériennes de l’ adulte en réanimation médicale .
MEDLINE title 2 annotations
T1 DISO 0 23 Méningites bactériennes
#1 AnnotatorNotes T1 C0085437
T2 LIVB 29 36 adulte
#2 AnnotatorNotes T2 C0001675
T3 PROC 40 60 réanimation médicale
#3 AnnotatorNotes T3 C0085559
EMEA document (excerpt)
(…)
Dans quel cas Tysabri est-il utilisé ?
Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).
(…)
EMEA document annotations (excerpt)
(…)
T9 CHEM 206 213 Tysabri
#9 AnnotatorNotes T9 C1529600
T10 CHEM 233 240 Tysabri
#10 AnnotatorNotes T10 C1529600
T11 PROC 261 271 traitement
#11 AnnotatorNotes T11 C0087111
T12 LIVB 276 283 adultes
#12 AnnotatorNotes T12 C0001675
T13 DISO 296 315 sclérose en plaques
#13 AnnotatorNotes T13 C0026769
T14 DISO 318 321 SEP
#14 AnnotatorNotes T14 C0026769
(…)
Training Set
The training data can be downloaded from https://clef2015.limsi.fr/train/CLEFeHealth2015_task1b_train.zip. The credentials needed to access the data are supplied to CLEF eHealth Task 1 registered participants upon request to the Task 1b organizers.
The data set includes the following documents:
– MEDLINE folder: 836 text (txt) files with corresponding annotation files (.ann) for gold standard normalized entities; 3 .conf configuration files.
– EMEA folder: 4 text (txt) files with corresponding annotation files (.ann) for gold standard normalized entities; 3 .conf configuration files.
– BRATEVAL folder: the Java source code of the tool that will be used for evaluation (brateval)
Test Set and Submission Guidelines
An independent test set was released on 23 April 2015, and submissions are due on 1 May 2015, 12pm (noon) CET for phase 1 and 4 May 2015, 12pm (noon) CET for phase 2 (see the phase descriptions below).
Test Set Description, updated on 23 April, 2015
The Phase 1 data set includes the following documents (released on 23 April 2015):
– MEDLINE folder: 832 text (txt) files
– EMEA folder: 12 text (txt) files
The Phase 2 data set includes the following documents (release scheduled for 1 May 2015):
– MEDLINE folder: 832 text (txt) files with corresponding annotation files (.ann) for gold standard entities (normalization not supplied); 3 .conf configuration files.
– EMEA folder: 4 text (txt) files with corresponding annotation files (.ann) for gold standard entities (normalization not supplied); 3 .conf configuration files.
Submission Guidelines, updated on 23 April, 2015
Documents in the test set are drawn from the same corpus as the training documents and were annotated by the same annotators using the same guidelines.
Teams are invited to submit runs in two subsequent phases: one for entity recognition and one for entity normalization.
Phase 1: entity recognition (submission deadline is 1 May 2015, 12 pm-noon CET)
– Only text files are supplied.
Sample text file:
La contraception par les dispositifs intra utérins
– Teams need to supply entity annotations in the BRAT standoff format used in the training data set.
Sample entities output expected for the above sample text file:
T1 PROC 3 16 contraception
T2 DEVI 25 50 dispositifs intra utérins
T3 ANAT 43 50 utérins
– Additionally, teams may supply normalization information for the entity annotations, also in the BRAT standoff format.
Sample normalized entities output expected for the above sample text file:
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149
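Note that the start/end values in the T lines are character offsets into the raw text file (“contraception” spans characters 3 to 16 above). The following is a small, hedged Python sketch of emitting such lines from system predictions; the predictions list is a hypothetical stand-in for your recognizer's output:

# Turn (type, start, end) predictions into BRAT T lines for one document.
text = "La contraception par les dispositifs intra utérins"
predictions = [("PROC", 3, 16), ("DEVI", 25, 50), ("ANAT", 43, 50)]  # hypothetical system output
for i, (etype, start, end) in enumerate(predictions, 1):
    # The mention text must be exactly the characters at [start, end) in the source file.
    print(f"T{i}\t{etype} {start} {end}\t{text[start:end]}")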
When submitting, the teams must select all the relevant tracks for their submission:
1.b.1: MEDLINE entities,
1.b.2: EMEA entities,
1.b.3: MEDLINE normalized entities,
1.b.4: EMEA normalized entities.
Phase 2: entity normalization (submission deadline is 4 May 2015, 12 pm-noon CET)
– Text files and gold-standard entity annotations are supplied.
Sample text file:
La contraception par les dispositifs intra utérins
Sample entities annotations supplied along with the above sample text file:
T1 PROC 3 16 contraception
T2 DEVI 25 50 dispositifs intra utérins
T3 ANAT 43 50 utérins
– Teams need to supply normalization information for the gold standard entity annotations, in the BRAT standoff format.
Sample normalization output expected for the above sample text file and gold standard entity annotations:
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149
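In other words, Phase 2 reduces to echoing each supplied gold T line and adding an AnnotatorNotes line carrying the CUI your system assigns. A minimal sketch, where normalize is a hypothetical stand-in for your normalization system:

# Phase 2 sketch: keep the gold T lines unchanged, add one note line per entity.
def normalize(etype, mention):
    # Hypothetical lookup; a real system would consult UMLS-derived resources.
    table = {"contraception": "C0700589",
             "dispositifs intra utérins": "C0021900",
             "utérins": "C0042149"}
    return table.get(mention)

gold_lines = ["T1\tPROC 3 16\tcontraception",
              "T2\tDEVI 25 50\tdispositifs intra utérins",
              "T3\tANAT 43 50\tutérins"]
for line in gold_lines:
    tid, type_span, mention = line.split("\t")
    print(line)                                  # echo the gold annotation
    cui = normalize(type_span.split(" ")[0], mention)
    if cui:
        print(f"#{tid[1:]}\tAnnotatorNotes {tid}\t{cui}")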
When submitting, the teams must select all the relevant tracks for their submission:
1.b.5: MEDLINE normalization,
1.b.6: EMEA normalization.
Each team is allowed to submit up to 2 runs for each track.
All submissions must be made using the lab’s EasyChair system by 1 May 2015, 12pm (noon) CET for participants submitting phase 1 runs, and by 4 May 2015, 12pm (noon) CET for participants submitting phase 2 runs. Participants submitting runs to both phases must first make a phase 1 submission by 1 May 2015 and update it with their phase 2 runs by 4 May 2015.
Each submission must consist of the following items:
Address for Correspondence: address, city, post code, (state), country
Author(s): first name, last name, email, country, organisation
Title: Instead of entering a paper title, please specify your team name here. A good name is something short but identifying. For example, Mayo, Limsi, and UTHealthCCB have been used before. If your team also participated in CLEF eHealth 2015 task 1a or task 2, we ask that you please use the same team name for your task 1b submission.
Keywords: Instead of entering three or more keywords to characterise a paper, please use this field to describe your methods. We encourage using MeSH or ACM keywords.
Topics: please tick all relevant tracks among 1.b.1, 1.b.2, etc., reflecting the runs you are submitting.
ZIP file: This file is an archive containing two files and several folders with the results of your runs, organized as follows:
file 1: Team description as team.txt (max 100 words): Please write a short general description of your team. For example, you may report that “5 PhD students, supervised by 2 Professors, collaborated” or “A multi-disciplinary approach was followed by a clinician bringing in content expertise, a computational linguist capturing this as features of the learning method and two machine learning researchers choosing and developing the learning method”.
file 2: Method description as methods.txt (max 100 words per method): Please write a short general description of the method(s) used for each run. Please include the following information in the description: (1) whether the method was (a) statistical, (b) symbolic (expert or rule-based), or (c) hybrid (i.e., a combination of (a) and (b)); (2) whether the method used the training data supplied (EMEA, MEDLINE or both); (3) whether the method used outside data such as an additional corpus, additional annotations on the training corpus, or lexicons, with a brief description of these outside resources including whether the data is public.
Folder 1: Runs for tracks 1.b.1, 1.b.3 and 1.b.5 should be stored in a folder called MEDLINE.
Runs for track 1.b.1 should be stored in a subfolder called entities, with in turn one subfolder for each run: run1 and run2.
Runs for track 1.b.3 should be stored in a subfolder called normalizedEntities, with in turn one subfolder for each run: run1 and run2.
Runs for track 1.b.5 should be stored in a subfolder called normalization, with in turn one subfolder for each run: run1 and run2.
Each run folder should contain 832 .ann files with the results of your system for the run. Please make sure they are formatted in the BRAT standoff annotation format described above (http://brat.nlplab.org/standoff.html) and exemplified in the training data set. We recommend running the evaluation tool on your data to check format compliance.
Folder 2: Runs for tracks 1.b.2, 1.b.4 and 1.b.6 should be stored in a folder called EMEA.
Runs for track 1.b.2 should be stored in a subfolder called entities, with in turn one subfolder for each run: run1 and run2.
Runs for track 1.b.4 should be stored in a subfolder called normalizedEntities, with in turn one subfolder for each run: run1 and run2.
Runs for track 1.b.6 should be stored in a subfolder called normalization, with in turn one subfolder for each run: run1 and run2.
Each run folder should contain 12 .ann files with the results of your system for the run. Please make sure they are formatted in the BRAT standoff annotation format described above (http://brat.nlplab.org/standoff.html) and exemplified in the training data set. We recommend running the evaluation tool on your data to check format compliance.
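The archive layout described above can also be assembled programmatically. The following sketch assumes local output folders of your own naming (out/...) and that team.txt and methods.txt already exist; all paths and run names are placeholders:

# Assemble the submission ZIP with the folder layout described above.
import os, zipfile

runs = {  # destination in ZIP -> local folder holding your .ann files (placeholders)
    "MEDLINE/entities/run1": "out/medline_entities_run1",        # track 1.b.1
    "MEDLINE/normalizedEntities/run1": "out/medline_norm_run1",  # track 1.b.3
    "EMEA/entities/run1": "out/emea_entities_run1",              # track 1.b.2
}

with zipfile.ZipFile("submission.zip", "w") as z:
    z.write("team.txt")
    z.write("methods.txt")
    for dest, src in runs.items():
        for name in sorted(os.listdir(src)):
            if name.endswith(".ann"):
                z.write(os.path.join(src, name), os.path.join(dest, name))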
Do not hesitate to post questions and comments on the task mailing list (clefehealth2015-task1b@limsi.fr) if you require further clarification or assistance.
Thank you for your participation!
Evaluation Methods
Results will be evaluated with the brateval program supplied with the training data. System performance will be assessed by precision, recall and F-measure for entity recognition and entity normalization.
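For a quick sanity check before submitting, exact-span scores can be approximated by comparing (type, start, end) tuples as below. This is only an approximation of exact matching and not a replacement for the official brateval scorer:

# Exact-match precision/recall/F-measure over (type, start, end) tuples.
def prf(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("PROC", 3, 16), ("DEVI", 25, 50), ("ANAT", 43, 50)}
pred = {("PROC", 3, 16), ("DEVI", 25, 50)}
print(prf(gold, pred))  # -> (1.0, 0.666..., 0.8)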