De-identification of unstructured clinical documents
July 13, 2017
NAACCR Annual Conference
Albuquerque, NM
Outline
• Background
• SEER evaluation of de-identification tools
• Next steps
• Conclusion
Use of clinical narrative text
• The majority of information is in free-text format
• Registries collect and store an increasing amount of clinical documents
• Registries generate narrative text
• Few data elements are abstracted
• This extremely rich data source can be used for research
• One major obstacle is that the documents contain personal identifying information (PII)
Specific applications
• SEER Virtual Tissue Repository Initiative
• SEER Natural Language Processing projects
• Uses at individual registries
Specific applications: NLP
Providing de-identified reports can accelerate the field of NLP for cancer surveillance.
SEER use of 2500 de-id reports:
• Linguamatics I2E
• IBM Watson
• HLA
• ASCO CancerLinQ
• DeepPhe
• Single academic partners
• NCI-DOE pilot
SEER Evaluation of De-identification tools
Two studies
De-identification evaluation protocol
• 5 SEER registries
• IRB approvals
• Pathology report selection
  • 4000 randomly selected from reports received in 2011
  • 800/registry
  • Stratified by cancer site
  • 160 each: breast, lung, CRC, prostate and other
• IMS provided technical instructions
• Each registry performed the de-identification
• Reviewed and compared de-id tool output to original report
• Recorded number of occurrences PII was missed, by PII category
• Automated count of de-id phrases by PII category
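The report selection described above amounts to stratified random sampling: a fixed number of reports per cancer-site stratum within each registry. The slides do not describe the software or data layout actually used, so the following is only a minimal sketch. It assumes each report is a dict with hypothetical `id` and `site` keys, and that "other" pools every site outside the four named ones.

```python
import random
from collections import defaultdict

SITES = ("breast", "lung", "crc", "prostate", "other")
PER_SITE = 160  # 5 strata x 160 = 800 reports per registry


def stratified_sample(reports, per_site=PER_SITE, seed=0):
    """Pick `per_site` reports at random from each cancer-site stratum.

    `reports` is an iterable of dicts with hypothetical keys "id" and "site".
    Sites outside the four named ones are pooled into the "other" stratum
    (an assumption about how the slide's "other" group was formed).
    """
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for report in reports:
        site = report["site"] if report["site"] in SITES else "other"
        by_site[site].append(report)

    sample = []
    for site in SITES:
        pool = by_site[site]
        if len(pool) < per_site:
            raise ValueError(f"only {len(pool)} reports for site {site!r}")
        # random.Random.sample draws without replacement.
        sample.extend(rng.sample(pool, per_site))
    return sample
```

Fixing the seed makes the draw reproducible, which helps when the same sample must later be re-reviewed against the de-identified output.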
Performance measurement
• De-identification rate
  • PII phrase level: N de-identified phrases / all PII phrases
  • PII at patient level: N patients w/ missed PII / 4000 (800 per registry)
  • Calculated for each PII category and overall, and per registry
• Limitations
  • N de-id phrases counted based on PII tag (includes over-scrubbing)
  • De-id rates for names of patients and providers cannot be calculated separately
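The two rates can be written out as a small calculation. The sketch below assumes the patient-level rate is reported as the share of patients with no missed PII, that is, 1 minus (patients with missed PII / total patients). This reading is consistent with the figures in the De-ID™ results tables, but the slides do not state it explicitly. The function names are illustrative.

```python
def phrase_level_rate(n_deidentified, n_missed):
    """De-identified PII phrases divided by all PII phrases."""
    total = n_deidentified + n_missed
    if total == 0:
        raise ValueError("no PII phrases in this category")
    return n_deidentified / total


def patient_level_rate(n_patients_with_missed_pii, n_patients):
    """Share of patients whose reports had no missed PII (assumed definition)."""
    return 1 - n_patients_with_missed_pii / n_patients


# "Names" row of the five-registry De-ID table:
# 13030 de-identified phrases, 88 missed, 19 of 4000 patients with a miss.
names_phrase = phrase_level_rate(13030, 88)
names_patient = patient_level_rate(19, 4000)
print(f"{names_phrase:.3f} {names_patient:.3f}")  # prints: 0.993 0.995
```

Because the de-identified count is taken from the tool's tags, over-scrubbed phrases inflate the numerator, so the phrase-level figure is best read as an upper bound.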
De-ID™
http://www.de-idata.com/
Performance of De-ID™ in five SEER registries

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 13030 | 88 | 13118 | 0.993 | 19 | 0.995 |
| Dates | 8717 | 31 | 8748 | 0.996 | 23 | 0.994 |
| Phone Numbers | 909 | 0 | 909 | 1.000 | 0 | 1.000 |
| Places | 1532 | 0 | 1532 | 1.000 | 0 | 1.000 |
| Street Addresses | 350 | 10 | 360 | 0.972 | 7 | 0.998 |
| Zip Codes | 844 | 0 | 844 | 1.000 | 0 | 1.000 |
| ID Numbers | 1358 | 77 | 1435 | 0.946 | 51 | 0.987 |
| Total PHI | 26740 | 206 | 26946 | 0.992 | 100 | 0.975 |
| Path Numbers | 1678 | 1310 | 2988 | 0.562 | 810 | 0.798 |
| Institutions | 1355 | 1673 | 3028 | 0.447 | 825 | 0.794 |
| Total de-id info | 29773 | 3189 | 32962 | 0.903 | 1735 | 0.566 |
NLM scrubber
Beta version tested
https://scrubber.nlm.nih.gov/
Performance of NLM scrubber in four SEER registries

| NLM scrubber tags | N phrases de-id | N phrases missed | Total N phrases | N patients not de-id | De-id rate, phrases | De-id rate, patients |
|---|---|---|---|---|---|---|
| Personal name (pt name + provider name) | 5130 | 0+8 | 5138 | 0 | 0.998 | 1.000 |
| Address | 466 | 1 | 467 | 1 | 0.998 | 0.999 |
| Alphanumeric (ssn + mrn + phone + path#) | 1420 | 0+0+0+179 | 1599 | 77 | 0.888 | 0.901 |
| Date | 1393 | 1 | 1394 | 1 | 0.999 | 0.999 |
| Total | 8409 | 189 | 8598 | 79 | 0.978 | 0.899 |
Other tools
• PARAT, Privacy Analytics
• MIST, MITRE
Summary
• Reasonable performance for PII (with the exception of the Seattle registry)
• Suboptimal for institution and pathology specimen IDs
• Inconsistency across reports and registries
• Registries' opinion: generally not satisfied
Next steps
• PII annotation on representative sample of ePath reports
• Customization and testing of high-potential de-identification tools
  • Latest version of NLM scrubber
  • BoB
PII Annotation Protocol for Narrative Clinical Text
• Annotation of PII: all PII is clearly marked and categorized in the text
• CDAP pipeline will be used for annotation
• Each registry will annotate a sample of reports
• PII annotated reports will be used for:
  • Customization and training of de-identification tools
  • Validation/testing of the tools prior to deployment
  • Validation/testing each time major revisions/versions of the tools are introduced
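One way annotated reports could support validation is to score a tool's output against the annotated PII spans, category by category. The sketch below is an illustration under stated assumptions, not the CDAP pipeline's format: the `(start, end, category)` span tuples are hypothetical, and so is the rule that an annotated span counts as removed only when a single tool span fully covers it.

```python
from collections import Counter


def recall_by_category(gold_spans, tool_spans):
    """Fraction of annotated PII spans fully covered by a tool span, per category.

    Spans are (start, end, category) tuples with `end` exclusive
    (a hypothetical format chosen for this example).
    """
    found, total = Counter(), Counter()
    for g_start, g_end, category in gold_spans:
        total[category] += 1
        covered = any(
            t_start <= g_start and g_end <= t_end
            for t_start, t_end, _ in tool_spans
        )
        if covered:
            found[category] += 1
    return {category: found[category] / total[category] for category in total}


# Toy example: the tool removes the name and the date but misses the path number.
gold = [(0, 8, "name"), (20, 30, "date"), (45, 53, "path_number")]
tool = [(0, 8, "name"), (18, 31, "date")]
print(recall_by_category(gold, tool))
```

Scoring against human annotation rather than the tool's own tags avoids counting over-scrubbed text as a success, and rerunning the same score after each tool revision gives a like-for-like comparison.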
Annotation Process
Proposed metrics/goals
• Patient name: > 99%
• Other names (relatives, providers, etc.): > 98%
• SSN: 100%
• Dates: > 98%
• Other identification numbers (MRN, account #, insurance plan #): > 99%
• Patient address (street, city, zip code): > 98%
• Patient phone, fax, email, URL: > 99%
• Specimen/slide/path report #: > 97%
• Institution/lab name: > 97%
• Institution address: > 97%
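These targets can be held as a small configuration and checked against measured rates. The following is a minimal sketch with hypothetical category keys. It encodes the ">" targets as strict lower bounds and the SSN target as exactly 100%, and mapping the results-table categories (for example "Path Numbers") onto these keys is an assumption.

```python
# Proposed targets from the slide; keys are hypothetical short names.
# Each value is (threshold, strict): strict=True means the rate must exceed it.
GOALS = {
    "patient_name": (0.99, True),
    "other_names": (0.98, True),
    "ssn": (1.00, False),
    "dates": (0.98, True),
    "other_id_numbers": (0.99, True),
    "patient_address": (0.98, True),
    "patient_contact": (0.99, True),
    "specimen_or_path_number": (0.97, True),
    "institution_name": (0.97, True),
    "institution_address": (0.97, True),
}


def unmet_goals(measured_rates, goals=GOALS):
    """Return the categories whose measured rate does not meet its target."""
    unmet = []
    for category, rate in measured_rates.items():
        threshold, strict = goals[category]
        ok = rate > threshold if strict else rate >= threshold
        if not ok:
            unmet.append(category)
    return unmet


# Illustrative input: two phrase-level rates from the five-registry De-ID table.
print(unmet_goals({"dates": 0.996, "specimen_or_path_number": 0.562}))
```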
Conclusion
• Existing de-identification tools generally have good performance
• De-identification rates are as good as or better than human de-identification
• Performance decreases with increased variability of reports (multiple institutions)
• Need for customization and testing prior to deployment
• Creation of an annotated sample of reports representative of the document corpora is highly suggested
• Governance: controlled access to the de-identified reports (e.g. DUA) is recommended
Resources
NISTIR 8053: De-Identification of Personal Information (Oct. 2015) http://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8053.pdf
NIST Special Publication 800-188: De-Identifying Government Datasets (second draft, Dec. 2016) http://csrc.nist.gov/publications/drafts/800-188/sp800_188_draft2.pdf
Acknowledgments:
SEER registries: CT, HI, KY, NM, and Seattle
NCI: Spencer Morris, Paul Fearn, Steve Friedman
IMS team: Rusty Shields, Dave Annett, Laurie Buck, Linda Coyle
NIH/NLM: Mehmet Kayaalp
USC: Stephane Meystre
Performance of De-ID™ in SEER Seattle registry

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 2972 | 39 | 3011 | 0.987 | 15 | 0.981 |
| Dates | 1520 | 8 | 1528 | 0.995 | 7 | 0.991 |
| Phone Numbers | 113 | 0 | 113 | 1.000 | 0 | 1.000 |
| Places | 255 | 0 | 255 | 1.000 | 0 | 1.000 |
| Street Addresses | 65 | 0 | 65 | 1.000 | 0 | 1.000 |
| Zip Codes | 105 | 0 | 105 | 1.000 | 0 | 1.000 |
| ID Numbers | 263 | 47 | 310 | 0.848 | 24 | 0.970 |
| Total PHI | 5293 | 94 | 5387 | 0.983 | 46 | 0.943 |
| Path Numbers | 571 | 221 | 792 | 0.721 | 140 | 0.825 |
| Institutions | 284 | 809 | 1093 | 0.260 | 350 | 0.563 |
| Total de-id info | 6148 | 1124 | 7272 | 0.845 | 536 | 0.330 |
Performance of De-ID™ in SEER Hawaii registry

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 2972 | 45 | 3017 | 0.985 | 2 | 0.998 |
| Dates | 1520 | 0 | 1520 | 1.000 | 0 | 1.000 |
| Phone Numbers | 113 | 0 | 113 | 1.000 | 0 | 1.000 |
| Places | 255 | 0 | 255 | 1.000 | 0 | 1.000 |
| Street Addresses | 65 | 0 | 65 | 1.000 | 0 | 1.000 |
| Zip Codes | 105 | 0 | 105 | 1.000 | 0 | 1.000 |
| ID Numbers | 236 | 0 | 236 | 1.000 | 0 | 1.000 |
| Total PHI | 5266 | 45 | 5311 | 0.992 | 2 | 0.998 |
| Path Numbers | 571 | 36 | 607 | 0.941 | 26 | 0.968 |
| Institutions | 284 | 45 | 329 | 0.863 | 45 | 0.944 |
| Total de-id info | 6121 | 126 | 6247 | 0.980 | 73 | 0.906 |
Performance of De-ID™ in SEER Kentucky registry

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 3647 | 2 | 3649 | 0.999 | 1 | 0.999 |
| Dates | 2974 | 10 | 2984 | 0.997 | 5 | 0.994 |
| Phone Numbers | 661 | 0 | 661 | 1.000 | 0 | 1.000 |
| Places | 801 | 0 | 801 | 1.000 | 0 | 1.000 |
| Street Addresses | 167 | 10 | 177 | 0.944 | 7 | 0.991 |
| Zip Codes | 559 | 0 | 559 | 1.000 | 0 | 1.000 |
| ID Numbers | 604 | 7 | 611 | 0.989 | 4 | 0.995 |
| Total PHI | 9413 | 29 | 9442 | 0.997 | 17 | 0.979 |
| Path Numbers | 385 | 57 | 442 | 0.871 | 44 | 0.945 |
| Institutions | 533 | 521 | 1054 | 0.506 | 186 | 0.768 |
| Total de-id info | 10331 | 607 | 10938 | 0.945 | 247 | 0.691 |
Performance of De-ID™ in SEER Connecticut registry

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 451 | 1 | 452 | 0.998 | 0 | 1.000 |
| Dates | 1022 | 13 | 1035 | 0.987 | 11 | 0.986 |
| Phone Numbers | 1 | 0 | 1 | 1.000 | 0 | 1.000 |
| Places | 40 | 0 | 40 | 1.000 | 0 | 1.000 |
| Street Addresses | 13 | 0 | 13 | 1.000 | 0 | 1.000 |
| Zip Codes | 15 | 0 | 15 | 1.000 | 0 | 1.000 |
| ID Numbers | 22 | 0 | 22 | 1.000 | 0 | 1.000 |
| Total PHI | 1564 | 14 | 1578 | 0.991 | 11 | 0.986 |
| Path Numbers | 17 | 472 | 489 | 0.035 | 182 | 0.767 |
| Institutions | 87 | 254 | 341 | 0.255 | 200 | 0.744 |
| Total de-id info | 1668 | 740 | 2408 | 0.693 | 393 | 0.482 |
Performance of De-ID™ in SEER New Mexico registry

| PHI type | De-id phrases (N) | Missed phrases (N) | All PHI phrases (N) | PII phrase de-id rate | Pts w/ missed PII (N) | Pt-level de-id rate |
|---|---|---|---|---|---|---|
| Names | 2988 | 1 | 2989 | 1.000 | 1 | 0.999 |
| Dates | 1681 | 0 | 1681 | 1.000 | 0 | 1.000 |
| Phone Numbers | 21 | 0 | 21 | 1.000 | 0 | 1.000 |
| Places | 181 | 0 | 181 | 1.000 | 0 | 1.000 |
| Street Addresses | 40 | 0 | 40 | 1.000 | 0 | 1.000 |
| Zip Codes | 60 | 0 | 60 | 1.000 | 0 | 1.000 |
| ID Numbers | 233 | 23 | 256 | 0.910 | 23 | 0.971 |
| Total PHI | 5204 | 24 | 5228 | 0.995 | 24 | 0.970 |
| Path Numbers | 134 | 524 | 658 | 0.204 | 418 | 0.478 |
| Institutions | 167 | 44 | 211 | 0.791 | 44 | 0.945 |
| Total de-id info | 5505 | 592 | 6097 | 0.903 | 486 | 0.393 |
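
The per-registry counts can be pooled to reproduce the five-registry figures. A short sketch using the Names rows of the five registry tables (variable names are illustrative):

```python
# (de-identified phrases, missed phrases) from the "Names" row of each registry table.
names_by_registry = {
    "Seattle": (2972, 39),
    "Hawaii": (2972, 45),
    "Kentucky": (3647, 2),
    "Connecticut": (451, 1),
    "New Mexico": (2988, 1),
}

# Pool by summing counts, not by averaging the per-registry rates,
# so that registries with more phrases carry proportionally more weight.
deidentified = sum(d for d, _ in names_by_registry.values())
missed = sum(m for _, m in names_by_registry.values())
pooled_rate = deidentified / (deidentified + missed)

print(deidentified, missed, round(pooled_rate, 3))  # prints: 13030 88 0.993
```

The pooled values match the Names row of the five-registry table (13030 de-identified, 88 missed, rate 0.993).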