Post on 20-Jan-2015
Mining Unstructured Data: Practical Applications
Alyona Medelyan @zelandiya Anna Divoli @annadivoli
New York London
Problem 1
Images: Ambro / FreeDigitalPhotos.net
How do lawyers scan, file, store & share clients’ case documents efficiently?
Images: slambo_42 @ flickr, Anoto AB @ flickr
EHR EMR PHR
How do doctors, patients & researchers distribute & share medical records efficiently?
Foreign Financial Institution
with IRS agreement
annual report 30% withholding tax
waiver
with waiver
without waiver
U.S. account holders U.S. ownership entities
30% withholding tax
Custodian bank without IRS agreement
The FATCA Legislation Takes effect 1 January 2013
Problem 3
How can a financial institution find U.S. citizens in masses of paperwork efficiently?
How much time do we actually spend on …
Searching, gathering info: 17
Writing emails: 14
Creating docs: 13
Analyzing info: 10
Reviewing docs: 9
Organizing docs: 7
Creating presentations: 7
Editing images: 6
Entering data: 6
Approving docs: 4
Publishing docs: 4
Translating docs: 1
Translates to annual costs: Search: 17h / week = $37,000 / year
IDC: Hidden cost of information average hours / week
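The conversion from weekly hours to annual cost can be sketched as follows. The slide only gives the result for search (17 h / week = $37,000 / year); the 52-week year and the hourly rate derived from it are assumptions made here for illustration, not figures from IDC.

```python
WEEKS_PER_YEAR = 52  # assumption: the slide does not say how many weeks were used


def implied_hourly_rate(hours_per_week, annual_cost_usd, weeks_per_year=WEEKS_PER_YEAR):
    """Hourly rate that would turn the weekly hours into the stated annual cost."""
    return annual_cost_usd / (hours_per_week * weeks_per_year)


def annual_cost(hours_per_week, hourly_rate_usd, weeks_per_year=WEEKS_PER_YEAR):
    """Annual cost of an activity that takes hours_per_week at a given hourly rate."""
    return hours_per_week * weeks_per_year * hourly_rate_usd


# Search: 17 h / week = $37,000 / year (both numbers from the slide).
rate = implied_hourly_rate(17, 37_000)
print(round(rate, 2))  # 41.86 USD per hour under the 52-week assumption
```

The same `annual_cost` function could be applied to the other activities in the chart, but only under the assumed rate, so the results would be estimates rather than IDC figures.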
introduction
unstructured data real life problems
unstructured data & text analytics
metadata in legal domain
healthcare records issues
compliance in finance
conclusions
Videos
Emails
Literature
Audio
News
Images
Social Media
Databases
Blogs
Text Mining Natural Language Processing
unstructured data
Opinion Mining
Business Intelligence
Document Organization
Data Extraction
Search
Machine Learning
Text Processing
Statistics Linguistics
What can one mine from unstructured data?
sentiment
keywords, tags
genre
categories, taxonomy terms
entities
names, patterns
biochemical entities …
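As a rough illustration of the "keywords, tags" item, here is a minimal frequency-based sketch. It is a naive baseline chosen for brevity, not the method used by the presenters or by any particular product; the stopword list and the `extract_keywords` name are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use much larger ones).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "are", "with"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = ("The lawyer filed the contract. The contract names the client; "
          "the client signed the contract.")
print(extract_keywords(sample, top_n=2))  # ['contract', 'client']
```

Raw frequency favors common words, so practical keyword extractors usually add weighting (for example against a background corpus) or a controlled vocabulary.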
People U.S. politicians News about U.S. politicians
News
Structured biological data
Unique identifiers
Literature references
Experts’ annotation (free text)
Structured & unstructured data interplay
scan
ocr
metadata
dms
save
Legal document processing pipeline
Images: Ambro / FreeDigitalPhotos.net
New York London
Assigning metadata (approximation)
15 docs per day × 3 min per doc = 0.75 h per day
0.75 h per day × 240 working days per year × $200 hourly charge
= $36,000 per year per lawyer
Keyword extraction: 0.0027 min per doc
≈ 10 min for a year’s worth of docs
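The approximation can be checked with a few lines of arithmetic. All input figures come from the slide; only the variable names are introduced here.

```python
DOCS_PER_DAY = 15
MANUAL_MIN_PER_DOC = 3
WORKING_DAYS_PER_YEAR = 240
HOURLY_CHARGE_USD = 200
AUTO_MIN_PER_DOC = 0.0027  # keyword extraction time per document

# Manual metadata assignment
manual_hours_per_day = DOCS_PER_DAY * MANUAL_MIN_PER_DOC / 60
manual_cost_per_year = manual_hours_per_day * WORKING_DAYS_PER_YEAR * HOURLY_CHARGE_USD

# Automatic keyword extraction over the same yearly volume
docs_per_year = DOCS_PER_DAY * WORKING_DAYS_PER_YEAR
auto_minutes_per_year = docs_per_year * AUTO_MIN_PER_DOC

print(manual_hours_per_day)             # 0.75 hours per day
print(manual_cost_per_year)             # 36000.0 USD per year per lawyer
print(docs_per_year)                    # 3600 documents per year
print(round(auto_minutes_per_year, 2))  # 9.72 minutes, i.e. roughly 10 minutes
```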
Image: jacockshaw @ flickr
Integrating metadata extraction with scanning
http://www.youtube.com/watch?v=kluVp25upag
metadata
dms
Efficient (legal) document processing pipeline
keywords tags
EHR EMR PHR
Images: slambo_42 @ flickr, Anoto AB @ flickr
EMR EHR
PHR
National Alliance for Health Information Technology (NAHIT)
definitions
Discontinued!
?
1. Name, birth date, blood type
2. Emergency contact(s)
3. Primary caregiver/phone number
4. Medicines, dosages, and how long taken
5. Allergies/allergic reactions
6. Date of last physical
7. Dates/results of tests and screenings
8. Major illnesses/surgeries and their dates
9. Chronic diseases
10. Family illness history
11. …
http://www.nlm.nih.gov/medlineplus/magazine/
PHI
de-identification process
Medical researchers use patient records for discoveries…
… records with removed PHI: information from structured fields but mostly from free text!
AMIA 2012
www.hcpro.com
siliconangle.com/blog/
www.information-age.com
“The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules”
“The Patient Safety and Quality Improvement Act of 2005 (PSQIA) Patient Safety Rule”
Names
Geographic subdivisions smaller than a State: street address, city, county, precinct, zip code…
Dates (except year): birth, admission, discharge…
Phone / Fax numbers
Email addresses
Social security #, medical record #, health plan beneficiary #, account #
PHI 18 identifiers!
Vehicle identifiers & serial numbers, incl. license plate numbers
Device identifiers & serial numbers
URLs / IP addresses
Biometric identifiers, including finger and voice prints
Face photo images & any comparable images
Any other unique IDs etc.
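A few of the identifiers listed above follow recognizable patterns and can be masked with regular expressions. The sketch below is a simplified illustration under stated assumptions: the patterns cover only common U.S.-style formats, the `scrub` function and label names are hypothetical, and names or addresses in free text would need entity extraction rather than regexes. It is not a compliant de-identification tool.

```python
import re

# Illustrative patterns only (assumptions); real records contain many more variants.
PHI_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]

# Dates written as MM/DD/YYYY: the year may stay, the day and month are removed.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")


def scrub(text):
    """Replace pattern-like identifiers with bracketed placeholders."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return DATE.sub(r"[DATE \1]", text)


note = "Seen 03/14/2011. Call 555-123-4567 or jane.doe@example.org; SSN 123-45-6789."
print(scrub(note))
# Seen [DATE 2011]. Call [PHONE] or [EMAIL]; SSN [SSN].
```

Keeping the year while dropping day and month mirrors the "Dates (except year)" item; everything else is replaced wholesale.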
Thanks for discussions:
Nigam Shah, Stanford
Eneida Mendonca, University of Wisconsin, Madison
Irena Spasic, Cardiff University
keywords tags
Images: slambo_42 @ flickr, Anoto AB @ flickr
FATCA COMPLIANCE – STEP 1 Detect U.S. citizenship indicators
Recommended Solution from FATCA Legislation:
• “Query an electronic database using standard queries in programming languages”
• “Adopt similar approaches as used for the Anti-money-laundering and Know-your-customer requirements”
• “Note that information, data, or files are not electronically searchable if they are stored as images”
Images: walmink, thomwatson @ flickr
FATCA COMPLIANCE – STEP 2 Contact client for additional info or a waiver
Actual Solution for the FATCA Legislation:
ocr
link analysis
entity extraction
analysis
gather the trail of the client’s data
convert all images to text
detect locations, bank numbers
auto-categorize
check & resolve inconsistencies
Efficient FATCA Compliance
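The "detect locations, bank numbers" step can be sketched on text that has already been through OCR. Everything below is a hypothetical illustration: the indicator types, the regular expressions, the sample state list, and the `find_us_indicators` name are assumptions, not rules taken from the FATCA legislation or from the presenters' system.

```python
import re

# Small sample of state abbreviations (assumption; a real list would be complete).
SAMPLE_STATES = {"NY", "CA", "TX", "FL", "IL"}

PHONE = re.compile(r"\+1[\s-]?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}\b")
STATE_ZIP = re.compile(r"\b([A-Z]{2})\s+(\d{5})(?:-\d{4})?\b")
KEYWORD = re.compile(r"\b(?:United States|USA)\b")


def find_us_indicators(text):
    """Return (indicator_type, matched_text) pairs found in OCR'd text."""
    hits = []
    for m in PHONE.finditer(text):
        hits.append(("us_phone", m.group()))
    for m in STATE_ZIP.finditer(text):
        if m.group(1) in SAMPLE_STATES:  # skip two-letter codes that are not states
            hits.append(("us_address", m.group()))
    for m in KEYWORD.finditer(text):
        hits.append(("us_keyword", m.group()))
    return hits


page = ("Place of birth: United States. Mail to 12 Main St, Albany, NY 12207. "
        "Tel +1 518-555-0100.")
print(find_us_indicators(page))
```

A document with any hit would then be routed to the later steps (categorization, checking and resolving inconsistencies) rather than being treated as a confirmed U.S. account.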
Alyona Medelyan, PhD @zelandiya
Anna Divoli, PhD @annadivoli
Natural Language Processing Text Mining Wikipedia Mining Machine Learning
Try out text analytics provided by the Pingar API!
Online demo: apidemo.pingar.com Free Sandbox account: pingar.com/get-the-api
Biomedical Text Mining Search User Interfaces Human Factors Knowledge Discovery