CLEF eHealth Task 2013 – Dataset

Dataset Creation for Tasks

The dataset for Tasks 1 and 2 consists of deidentified clinical free-text notes from the MIMIC II database, version 2.5 (mimic.physionet.org). Notes were authored in the ICU setting and note types include discharge summaries, ECG reports, echo reports, and radiology reports (for more information about the MIMIC II database, we refer the reader to the MIMIC User Guide).

For this evaluation, the training set contains 200 notes, and the test set contains 100 notes.

Task 1 – Disorders

Participants will be provided with an unannotated clinical report dataset and will be evaluated on their ability to (a) automatically identify the boundaries of disorder named entities in the text and (b) map the automatically identified named entities to SNOMED-CT codes. All participants will be evaluated on (a); participation in (b) is optional.

Annotation of disorder mentions was carried out as part of the ongoing ShARe (Shared Annotated Resources) project (clinicalnlpannotation.org). For this task in the evaluation lab, the focus is on the annotation of disorder mentions only. As such, there are two parts to the annotation: identifying a span of text as a disorder mention and mapping the span to a UMLS CUI (concept unique identifier). Each note was annotated by two professional coders trained for this task, followed by an open adjudication step. The annotation guidelines and examples of annotation for this task are available as part of the shared task materials.

A disorder mention is defined as any span of text which can be mapped to a concept in the SNOMED-CT terminology and which belongs to the Disorder semantic group. A concept is in the Disorder semantic group if it belongs to one of the following UMLS semantic types:
– Congenital Abnormality
– Acquired Abnormality
– Injury or Poisoning
– Pathologic Function
– Disease or Syndrome
– Mental or Behavioral Dysfunction
– Cell or Molecular Dysfunction
– Experimental Model of Disease
– Anatomical Abnormality
– Neoplastic Process
– Sign or Symptom
(Note that this definition of the Disorder semantic group does not include the Finding semantic type, and as such differs from the UMLS Semantic Groups definition available at semanticnetwork.nlm.nih.gov/SemGroups.)
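For illustration, the membership test amounts to a set lookup. The sketch below is our own helper, not part of the task materials; it assumes you can retrieve a concept's UMLS semantic types elsewhere, e.g. from the MRSTY.RRF file of a UMLS installation.

import java.util.Set;

// Sketch only: the semantic-type names are copied from the list above.
public class DisorderGroup {
    private static final Set<String> DISORDER_TYPES = Set.of(
        "Congenital Abnormality", "Acquired Abnormality",
        "Injury or Poisoning", "Pathologic Function",
        "Disease or Syndrome", "Mental or Behavioral Dysfunction",
        "Cell or Molecular Dysfunction", "Experimental Model of Disease",
        "Anatomical Abnormality", "Neoplastic Process", "Sign or Symptom");

    // A concept belongs to the Disorder group if any of its
    // semantic types is in the set above.
    public static boolean isDisorder(Set<String> semanticTypesOfConcept) {
        return semanticTypesOfConcept.stream().anyMatch(DISORDER_TYPES::contains);
    }
}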

Task 2 – Acronyms/Abbreviations

Participants will be provided with a clinical report dataset that has been previously annotated for acronym/abbreviation spans. Participants will be evaluated on their ability to map the existing annotations to UMLS codes.

Annotation of acronyms and abbreviations was carried out specifically for the CLEF 2013 eHealth Evaluation Lab. For this task, the focus is the normalization of pre-annotated acronyms/abbreviations to UMLS concepts. Annotators were instructed to annotate all acronyms/abbreviations that appeared in narrative text, rather than in lists. Participants will be provided with the spans of the acronyms/abbreviations, which were annotated by multiple nursing students trained for this task, followed by an open adjudication step. The goal for Task 2 is to map each annotation to the best-matching concept in the UMLS. Some of the annotations will not match any UMLS concept and will be assigned the value “CUI-less”. The annotation guidelines and examples of annotation for this task are available as part of the evaluation lab dataset.
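As a minimal sketch of what a Task 2 system must produce (our own illustration; the lookup table is a toy stand-in built from your own resources, not part of the task data), normalization with a “CUI-less” fallback might look like this:

import java.util.Map;

// Toy normalizer: abbrevToCui would be built from your own resources,
// e.g. a UMLS-derived abbreviation dictionary; keys are lowercased.
public class AbbrevNormalizer {
    private final Map<String, String> abbrevToCui;

    public AbbrevNormalizer(Map<String, String> abbrevToCui) {
        this.abbrevToCui = abbrevToCui;
    }

    // Returns the best-matching CUI for an annotated span, or "CUI-less".
    public String normalize(String annotatedSpan) {
        return abbrevToCui.getOrDefault(annotatedSpan.toLowerCase(), "CUI-less");
    }
}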

Task 3 – Medical-related documents

The dataset for Task 3 consists of a set of medical-related documents provided by the Khresmoi project. This collection covers a broad set of medical topics and does not contain any patient information. The documents in the collection come from several online sources, including websites certified by the Health On the Net organization, as well as well-known medical sites and databases (e.g. Genetics Home Reference, ClinicalTrials.gov, Diagnosia).

Queries and result sets will also be provided with the dataset. The queries have been manually generated by medical professionals from highlighted disorders identified in Task 1 (a manually extracted set). A mapping between each query and the matching Task 1 discharge summary will be provided; participants are free to access the discharge summaries (see the guidelines for obtaining the Tasks 1 and 2 dataset below). Obtaining the Tasks 1 and 2 datasets is not mandatory for participating in Task 3, but they can be used as an external resource if desired. A training set is provided, containing 5 queries and the matching result sets. The test set contains 50 queries.

Format of Annotations and System Output

Task 1 and 2 Annotation Format
The official annotation format for Tasks 1 and 2 is the same. Annotations are standoff and are in the following format (synthetic example):

report name || annotation type || cui || char start || char end
08100-027513-DISCHARGE_SUMMARY.txt||Disease_Disorder||C0332799||459||473

System results should be submitted in the same format. If an annotation contains disjoint (i.e., non-contiguous) spans, additional char start and char end values are appended. For example, in the sentence “Abdomen: no distention is noted.”, the single annotation for abdominal distention (C0235698) covers the spans 0-6 (“Abdomen”) and 12-21 (“distention”):

08100-027513-DISCHARGE_SUMMARY.txt||Disease_Disorder||C0332799||459||473||486||493

Note: If you are only participating in the boundary detection part of Task 1, leave the cui slot blank.
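Since submissions must follow this format exactly, it may help to see the fields parsed out. The following reader is a sketch of our own, not an official tool; it handles the || delimiter, the extra span pairs of disjoint annotations, and a blank cui slot.

import java.util.ArrayList;
import java.util.List;

// Sketch of a reader for one line of the standoff format above. Spans come in
// (start, end) pairs, so a disjoint annotation simply contributes additional
// pairs; a blank cui field is kept as the empty string.
public class StandoffAnnotation {
    public final String reportName;
    public final String annotationType;
    public final String cui;               // may be "" or "CUI-less"
    public final List<int[]> spans = new ArrayList<>();

    public StandoffAnnotation(String line) {
        String[] f = line.trim().split("\\|\\|", -1);
        reportName = f[0];
        annotationType = f[1];
        cui = f[2];
        for (int i = 3; i + 1 < f.length; i += 2) {
            spans.add(new int[] { Integer.parseInt(f[i].trim()),
                                  Integer.parseInt(f[i + 1].trim()) });
        }
    }
}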

To account for potential linefeed/newline differences among operating systems, we have provided a Java program, based on the flip utility (https://ccrma.stanford.edu/~craig/utility/flip/), for converting non-Unix linefeeds to Unix linefeeds. If you are not using a Unix or MacOS operating system, once you download the datasets, please run the program on the directory containing the corpus of text files to create a new directory with Unix linefeeds:

java -jar convertFilesToUnixFormat.jar <directory containing files> <new directory>
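If you would rather not run the jar, the conversion itself is straightforward; the following sketch (our own, equivalent in spirit to flip, not the provided program) normalizes Windows (\r\n) and old-Mac (\r) line endings to Unix (\n) for a single file:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Rewrites one file with all line endings normalized to Unix linefeeds.
// (Requires Java 11+ for Files.readString/writeString.)
public class ToUnixLinefeeds {
    public static void convert(Path in, Path out) throws IOException {
        String text = Files.readString(in, StandardCharsets.UTF_8);
        Files.writeString(out, text.replace("\r\n", "\n").replace("\r", "\n"));
    }
}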

Visualizing Task 1 and 2 Annotations

In addition to the official standoff annotation format, the annotations are provided in Knowtator XML format and can be visualized in three ways:

(1) Protege/Knowtator combination:
Protege 3.3.1: http://protege.cim3.net/download/old-releases/3.3.1/full/
Knowtator 1.9beta: http://sourceforge.net/projects/knowtator/files/
(2) eHOST: http://code.google.com/p/ehost/
(3) Evaluation Workbench: available with the datasets
Details on using the evaluation workbench are on the Evaluation page of this site.

Obtaining Datasets (Tasks 1 and 2)

1. Obtain a human subjects training certificate and create an account (https://mimic.physionet.org/gettingstarted/access/), following the instructions there.
2. Go to this site and accept the terms of the DUA: https://physionet.org/works/MIMICIIClinicalDatabase/access.shtml. You will receive an email telling you to fill in your information on the DUA and email it back with your human subjects training certificate. Fill out the DUA using the word “CLEF” in the description of the project and mail it back (pasted into the email) with your human subjects certificate attached. For the general research area for which the data will be used, enter “CLEF” (plus perhaps something more descriptive).
3. Once you are approved, the organizers will add you to the PhysioNetWorks ShARe/CLEF eHealth 2013 account as a reviewer. We will send you an email informing you that you can go to the PhysioNetWorks website and click on the authorized users link to access the data (it will ask you to log in using your PhysioNetWorks account login): https://physionet.org/works/ShAReCLEFeHealth2013

Obtaining Datasets (Task 3)