Bootstrapping a de-identification system for narrative patient records: cost-performance tradeoffs.

Title: Bootstrapping a de-identification system for narrative patient records: cost-performance tradeoffs.
Publication Type: Journal Article
Year of Publication: 2013
Authors: Hanauer, DA, Aberdeen, J, Bayer, S, Wellner, B, Clark, C, Zheng, K, Hirschman, L
Journal: Int J Med Inform
Date Published: 2013 Sep
Keywords: Computer Security, Confidentiality, Electronic Health Records, Humans, Information Dissemination, Software

PURPOSE: We describe an experiment to build a de-identification system for clinical records using the open source MITRE Identification Scrubber Toolkit (MIST). We quantify the human annotation effort needed to produce a system that de-identifies at high accuracy.

METHODS: Using two types of clinical records (history and physical notes, and social work notes), we iteratively built statistical de-identification models by annotating 10 notes, training a model, applying the model to another 10 notes, correcting the model's output, and training from the resulting larger set of annotated notes. This was repeated for 20 rounds of 10 notes each, and then an additional 6 rounds of 20 notes each, and a final round of 40 notes. At each stage, we measured precision, recall, and F-score, and compared these to the amount of annotation time needed to complete the round.

RESULTS: After the initial 10-note round (33 min of annotation time) we achieved an F-score of 0.89. After just over 8 h of annotation time (round 21) we achieved an F-score of 0.95. Number of annotation actions needed, as well as time needed, decreased in later rounds as model performance improved. Accuracy on history and physical notes exceeded that of social work notes, suggesting that the wider variety and contexts for protected health information (PHI) in social work notes is more difficult to model.

CONCLUSIONS: It is possible, with modest effort, to build a functioning de-identification system de novo using the MIST framework. The resulting system achieved performance comparable to other high-performing de-identification systems.
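The round schedule and the metrics named in METHODS can be illustrated with a short Python sketch. It is not taken from the paper or from MIST: the function names are hypothetical, the F-score is assumed to be the balanced F1 (the abstract does not say which variant was used), and the note totals come from a literal reading of "20 rounds of 10 notes each, 6 rounds of 20 notes each, and a final round of 40 notes".

```python
def round_schedule():
    """Notes per round, read literally from the abstract (an assumption)."""
    return [10] * 20 + [20] * 6 + [40]


def cumulative_notes(schedule):
    """Running total of notes annotated or corrected after each round."""
    totals = []
    running = 0
    for notes_in_round in schedule:
        running += notes_in_round
        totals.append(running)
    return totals


def precision_recall_f(true_pos, false_pos, false_neg):
    """Precision, recall, and balanced F-score (F1 assumed) from PHI counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    schedule = round_schedule()
    totals = cumulative_notes(schedule)
    print("rounds:", len(schedule))
    print("notes after round 21:", totals[20])
    print("notes after final round:", totals[-1])
    # Hypothetical counts, for illustration only (not data from the study).
    print(precision_recall_f(true_pos=90, false_pos=10, false_neg=10))
```

Under this reading the schedule has 27 rounds covering 360 notes, with round 21 being the first 20-note round. The counts passed to `precision_recall_f` are made up and only show how the three measures relate to one another.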

Alternate Journal: Int J Med Inform
PubMed ID: 23643147
Grant List: UL1RR024986 / RR / NCRR NIH HHS / United States
David Hanauer
University of Michigan Rogel Cancer Center at North Campus Research Complex
1600 Huron Parkway, Bldg 100, Rm 1004 
Mailing Address: 2800 Plymouth Rd, NCRC 100-1004
Ann Arbor, MI 48109-2800 
Ph. (734) 764-8848 Fax. (734) 615-0507

Research reported in this publication was supported by the National Cancer Institute of the
National Institutes of Health under Award Number P30CA046592. The content is solely the responsibility
of the authors and does not necessarily represent the official views of the
National Institutes of Health.

Research reported in this publication was supported by the National Cancer Institute of the
National Institutes of Health under Award Number P30CA046592 by the use of the following Cancer Center
Shared Resource(s): Biostatistics, Analytics & Bioinformatics; Flow Cytometry;
Transgenic Animal Models; Tissue and Molecular Pathology; Structure & Drug
Screening; Cell & Tissue Imaging; Experimental Irradiation; Preclinical
Imaging & Computational Analysis; Health Communications; Immune Monitoring;

Copyright © Cancer Center Informatics-2011 Regents of the University of Michigan