E-health 2009, Istanbul 23-25 September 2009: The Medical Information System - MedISys


e-health 2009, Istanbul 23-25 September 2009 1

The Medical Information System - MedISys

eHealth 2009
Second International ICST Conference on Electronic Healthcare for the 21st century
September 23-25, 2009 - Istanbul, Turkey

Erik van der Goot & the OPTIMA team (OPensource Text Information Mining and Analysis)

European Commission – Joint Research Centre (JRC)
Institute for the Protection and Security of the Citizen (IPSC)

[email protected]


JRC - who

Joint Research Centre (JRC): the European Commission’s research-based policy support organisation
http://www.jrc.ec.europa.eu/

IPSC - Institute for the Protection and Security of the Citizen, Ispra, Italy
http://ipsc.jrc.ec.europa.eu/


JRC - where


MedISys - Overview

Objective:
• Provide open source data collection and analysis for surveillance and epidemiology
• Replace manual scanning of multiple newspapers and web portals
• Support national and international Public Health (PH) organisations in monitoring issues of Public Health concern (e.g. CBRN)

Functionality:
– Gather, filter, classify, extract and aggregate health-related information
– Monitor trends, detect breaking news
– Visualise analysis results
– Alert users
– Allow customised views
– In combination with the RNS tool, allow manual moderation


Background - History

Based on JRC’s Europe Media Monitor (EMM) technology (EMM live since 2002; http://emm.newsbrief.eu).

On request / initiative of the EC’s Directorate General for Health and Consumer Protection (DG SANCO).

Password-protected service for Public Health bodies since 2005.

Public service since early 2007 (http://medusa.jrc.it/, restricted functionality).


Background - Media Monitoring

• EU Commission Media Monitoring (until 2001/2002)
– Traditional cut and paste for printed press only
– Monitoring of incoming news wires (e.g. Reuters, AFP)
– Simple keyword based filtering of wires
– Manual selection of printed press items
– Human classification of items

• Potential problems
– Not ‘real-time’ for mainstream media: printed press typically once a day
– Limited coverage: not all media is printed
– Inaccurate and incomplete classification: subjective and limited number of categories
– Labour intensive and expensive: limited number of articles per reviewer per day, requires topical knowledge and requires language knowledge


EMM History

• New Challenges (as seen in 2002)
– Enlargement (+10 countries): more media, more languages
– More use of electronic publishing (media)
– Electronic distribution of results (web+mobile)
– Automatic alerting functions

• New approach: EMM - a one stop shop for Media Monitoring
– Facilitate (not replace) human Media Monitoring activities
– Extend monitoring beyond the traditional news wires (Internet)
– Improve coverage, number of languages, analysis
– Apply automatic categorization and analysis to all sources
– Provide new services like automatic e-mail, sms, mobile editions etc.
– Provide editorial system to manage the information and produce newsletters etc.

Important: EMM is not Yet Another Internet Search Engine


EMM System Features

• Automatic language recognition
– Based on continuously updated language specific frequency tables

• Automated information/entity extraction
– 400,000 persons and organizations, based on a continuously updated list of entities with many language specific synonyms

• Geotagging
– Based on a homegrown harmonised multilingual geo-data set: about 600,000 place name variants in most languages covered by EMM, mostly national, regional and provincial capitals

• Improved Categorization Engine
– Boolean combinations, proximity, wildcards
– Support for Arabic and similar (automatic noun-prefix processing)
– Support for Chinese and similar (no whitespace)

• Tonality/Sentiment
– Simple bag of words approach
– Range from very negative to very positive
– Corrected for long term source bias
– Interesting for following reporting trends per category
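The tonality feature can be illustrated with a minimal sketch. The slide only states "bag of words" and "corrected for long term source bias"; the word lists, the score formula and the class name `BiasCorrectedTonality` below are assumptions made for illustration, not the EMM implementation.

```python
from collections import defaultdict

# Hypothetical word lists; a real lexicon would be language specific and far larger.
POSITIVE = {"recovery", "success", "improve", "cure"}
NEGATIVE = {"outbreak", "death", "crisis", "fear"}


def raw_tonality(text: str) -> float:
    """Bag-of-words score in [-1, 1]: (positive - negative) / sentiment words found."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


class BiasCorrectedTonality:
    """Subtracts each source's running mean score (assumed form of bias correction)."""

    def __init__(self) -> None:
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def score(self, source: str, text: str) -> float:
        raw = raw_tonality(text)
        n = self._count[source]
        baseline = self._sum[source] / n if n else 0.0
        # Update the long-term statistics only after computing the baseline.
        self._sum[source] += raw
        self._count[source] += 1
        return raw - baseline
```

With this correction, a source that habitually reports in negative terms only stands out when it is more negative than its own average.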


… more features

• Duplicate detection

• Metadata categorization
– Allows selection of articles based on any previously assigned meta-data

• Automated information linking
– Incremental topic based clustering and story tracking, geolocation
– 10 minute interval incremental clustering on last 4 hours worth of news (Top Stories on front page)

• Automatic detection of breaking news
– Cluster growth rate
– Flux of articles per category

• Indexing
– Index full text and most metadata

• Statistics/Trend analysis
– Quantitative analysis of reporting
– Maintain simple count statistics
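A rough sketch of breaking-news detection by cluster growth rate over the 4-hour window. Only the window length comes from the slide; the growth threshold, the minimum cluster size and all function names are hypothetical, and clustering itself is assumed to have already assigned a cluster id to each article.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)   # from the slide: last 4 hours worth of news
GROWTH_THRESHOLD = 2.0        # hypothetical: cluster at least doubled since last run


def prune(articles: list[tuple[datetime, str]], now: datetime) -> list[tuple[datetime, str]]:
    """Keep only (timestamp, cluster_id) pairs that fall inside the window."""
    return [(ts, cid) for ts, cid in articles if now - ts <= WINDOW]


def cluster_sizes(articles: list[tuple[datetime, str]]) -> dict[str, int]:
    """Number of articles currently in each cluster."""
    sizes: dict[str, int] = {}
    for _, cid in articles:
        sizes[cid] = sizes.get(cid, 0) + 1
    return sizes


def breaking(previous: dict[str, int], current: dict[str, int], min_size: int = 5) -> list[str]:
    """Clusters that are large enough and grew by GROWTH_THRESHOLD since the previous run."""
    hits = []
    for cid, size in current.items():
        before = previous.get(cid, 0)
        if size >= min_size and size >= GROWTH_THRESHOLD * max(before, 1):
            hits.append(cid)
    return sorted(hits)
```

In this sketch `breaking` would be called once per 10-minute run, comparing the sizes from the previous run with the current ones.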


…and more features

• Event extraction
– Language independent event grammars used to parse clusters, using language dependent resources to fill the grammar slots
– Currently for 5 languages (en, fr, it, pt, ru): violent events, humanitarian events


Development time line

[Timeline figure, 2002 to 2009 (marks at 2002, 2004, 2006, 2009), with two tracks: EMM/RNS and MedISys. Labels: continuous development, new features, NewsExplorer, EMM system redesign, RNS redesign, domain specific application, redesign based on EMM, first version 2005.]


MedISys System Overview

[Diagram components: EMM Open Source Monitoring Engine; MedISys NewsBrief; NewsDesk Service (a.k.a. RNS); Editorial Interface]


Problems to solve

• Find relevant information
– Millions of new articles/blogs/items/tweets published on Internet each day

• Deliver the information to the right user
– Allow for many (possibly overlapping) categories to meet specific needs

• Timely
– Right now if possible

In short: Deliver targeted information timely to the right user


Approach

Wide coverage
• Many sources
– Local, Regional, National and International coverage
• Many languages
– Multilinguality & cross-lingual information access

Fast coverage
• High frequency monitoring of sites, some sites every 5 minutes

Overcome the information overflow
• Categorization, aggregation, duplicate identification, clustering
• Customisability of MedISys NewsBrief
• Search functions
• RNS tool for manual moderation and targeted dissemination


Input data

~ 2200 Sources (world-wide, but primary focus on Europe)
• ~ 4,000 HTML web pages+RSS feeds
• ~ 100 specialist medical sites
• ~ 20 commercial newswires
• Specialist pay-for sources (LexisMed)
• 24/7, near continuous monitoring

~80,000 new articles/items per day

Converts dirty html with adverts, menus, html tags, ‘related stories’, etc. into clean and standardised Unicode-encoded RSS format

Use RSS when available

Perform full content analysis
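The conversion from dirty HTML to a clean RSS item could look roughly like the sketch below. It uses only the Python standard library; the set of skipped tags and the function name `html_to_rss_item` are assumptions for illustration, and real pages would need site-specific rules to drop adverts and ‘related stories’ blocks.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumed boilerplate containers; actual sites need per-source rules.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}


class TextExtractor(HTMLParser):
    """Collects text outside of SKIP_TAGS and discards all markup."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_rss_item(title: str, link: str, html: str) -> str:
    """Return a Unicode string holding one RSS <item> with the cleaned text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = " ".join(parser.parts)
    return ET.tostring(item, encoding="unicode")
```

Keeping the output in one standard format means every later step (categorization, clustering, entity extraction) can treat all sources alike.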


MedISys Screenshots


MedISys – Current subscribers and users include …

Supranational organisations
• Directorate General Health and Consumer Protection (SANCO)
• European Centre for Disease Prevention and Control, Stockholm (ECDC)
• European Food Safety Authority (EFSA)
• World Health Organisation (WHO)

National Public Health organisations
• Swiss Federal Office of Public Health
• Icelandic Ministry of Health
• Spanish Ministry of Sanitation & Ministry of Health and Consumer Protection
• Institut de Veille Sanitaire (France)
• Global Public Health Intelligence Network (Canada)
• Danish Emergency Management Agency
• Italian Ministry of Health and Ministry of Defence
• Dutch Institute of Public Health & Food and Consumer Product Safety Authority

The (general?) public

Currently ~ 1000 visitors, ~ 37000 hits per day on public system


Importance of multilingual information gathering

Locations mentioned in MedISys medical articles across languages

[Figure: map panels comparing Italian - German, English - French, Spanish - Portuguese]


Multilingual and cross lingual analysis (1)

Data processing layer:

• Detect ‘known entities’ across languages using large multilingual set of name variants (updated daily)
• Geo-locate the articles using large multilingual geo-database
• Apply content based categorization using multilingual category definitions

Name variants (example):
Barack Obama (eu, yo)
Barak Obama (az, wo)
Baraque Obama (pt)
Барак Обама (ba, uk)
Барак Хуссейн Обама (ru)
باراك أوباما (ar)
باراك اوباما (ar, fa)
بارک اوباما (ur)
באראק אבאמא (yi)
ברק אובאמה (he)
Բարաք Օբամա (hy)
ބަރާކް އޮބާމާ (dv)
バラク・オバマ (ja)
บารัค โอบามา (th)
贝拉克·奥巴马 (zh)

Disease term variants (example):
Influenza-A-Virus; influenzavirus tipo A; swine-origin influenza; sjevernoameričk gripe; pandemia influenzale; mexicaanse griep; мексиканск грипп; североамериканск грипп; pandemija svinjske; sjevernoameričke gripe; grippe nouvelle; gripă porcină; svinjski grip; sikainfluenssa; svininfluensa; Schweineinfluenza; Porzine Influenza; Schweinegrippe; influenza porcina; prasečí chřipka


Multilingual and cross lingual analysis (2)

Data presentation layer:

• ‘Convenience’ links to external Machine Translation programs, where available
• Display of other MedISys categories, and of persons and organisations found in text
• Display on-line English translation of Chinese and Arabic


Aggregation of multilingual information

Documents from all languages get classified according to the same countries and categories.

An increase in the number of media reports on any country-category combination is detected independently of the reporting language.

Graphs and alerts may show events not yet reported in your own language.


Detection using statistics

Detect abnormal flux of reporting for a particular country/category combination
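As an illustration, the sketch below counts articles per country/category pair regardless of language and flags a day whose count deviates strongly from recent history. The slides do not name the statistic MedISys uses; the z-score test, the threshold of 3.0 and the dictionary field names are assumptions.

```python
from collections import Counter
from statistics import mean, stdev


def count_by_pair(articles: list[dict]) -> Counter:
    """Count articles per (country, category), ignoring the reporting language."""
    return Counter((a["country"], a["category"]) for a in articles)


def is_abnormal(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """history: daily counts for one (country, category) pair on previous days."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma >= z_threshold
```

Because the counts are keyed on country and category only, a surge reported solely in, say, Spanish still raises the alert for users who read other languages.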


Recent case


News Clusters mostly about Category
Sat. 02-05-2009, Influenza A


Categorized and Clustered News
Sat. 02-05-2009, Influenza A


PULS Event detection

Results from Helsinki University


Category definitions – Example: haemorrhagic fever

• Terms (single or multi-word)
– Cumulative weights with threshold

• Case forcing
– Upper case characters in pattern only match upper case in text (useful for acronyms etc.)

• Wild cards
– Single letters (_)
– Zero, one or more letters (%)
– Adjacent words (+)

• Boolean combinations of term lists
– And, or, not
– Using proximity operator (within X words)
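The pattern rules above can be sketched by translating each term into a regular expression and summing the weights of the terms that match. This is an assumed reading of the slide, not the MedISys engine: the example terms and weights are invented, each term is counted once per text, and Boolean combinations and the proximity operator are left out.

```python
import re

LETTER = r"[^\W\d_]"  # one Unicode letter


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a term pattern into a regex (assumed semantics of _, % and +)."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(LETTER)                        # exactly one letter
        elif ch == "%":
            parts.append(LETTER + "*")                  # zero, one or more letters
        elif ch == "+":
            parts.append(r"\s+")                        # next word must be adjacent
        elif ch.islower():
            parts.append("[" + ch + ch.upper() + "]")   # lower case matches either case
        else:
            parts.append(re.escape(ch))                 # case forcing: upper case is literal
    return re.compile(r"\b" + "".join(parts) + r"\b")


def score(text: str, weighted_terms: dict[str, int]) -> int:
    """Sum the weights of all patterns found at least once in the text."""
    return sum(w for p, w in weighted_terms.items() if pattern_to_regex(p).search(text))


def matches_category(text: str, weighted_terms: dict[str, int], threshold: int) -> bool:
    return score(text, weighted_terms) >= threshold


# Hypothetical definition, for illustration only.
HAEMORRHAGIC_FEVER = {
    "h%emorrhagic+fever": 50,
    "ebola": 30,
    "marburg+virus": 30,
    "CCHF": 30,
    "outbreak%": 10,
}
```

For example, `matches_category("Ebola haemorrhagic fever outbreak reported", HAEMORRHAGIC_FEVER, 60)` is true, while a text containing only lower-case "cchf" scores nothing because the acronym pattern is case forced.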


Customisability of MedISys

Add more news sources or new categories (on request of recognised users/partners), e.g.
• Events: Cricket World Cup, Rugby World Cup, UEFA Euro 2008
• New diseases
• Other classes, e.g. deliberate release of chemicals

Output formats: web pages, email alerts, or RSS feed to integrate into your environment.

Email alerts: daily vs. breaking news only
• For daily notification: specify hour
• For breaking news: level-dependent
• User-selected languages only


Customisability: Filter by language/news source/category


Rapid News Service - RNS (restricted to subscribed users)

Allows MedISys users to further customise their view of the news
• Selection of specific languages and feeds

Allows human moderation
• Manual selection of news items
• Drag and drop compilation of newsletters

Allows moderators to forward news items to user groups
• Via SMS alerts, emails or newsletters

Allows user management

Shows overview of relative activity of each category over time


RNS moderation: Editing interface for newsletter

Manual selection of news items, drag and drop compilation of newsletters.


RNS moderation: Alert overview page

Time line shows overview of relative activity of each category over time.


MedISys - Summary

High coverage: helps monitor a large number of multilingual media reports.

Includes tools to help beat the information overflow:
• via clustering, duplicate detection, categorization, information aggregation, visualisation, mapping
• further means are being implemented, e.g. multilingual medical event extraction

Special features of MedISys:
• Fully automatic (moderation possible)
• Real time (10-minute updates), 24/7
• High multilinguality (43 languages)
• Multilingual information aggregation

Part of EMM family of applications, active team: much new functionality to come.