Transforming Data into Knowledge › ~a78khan › docs › Information...Transforming Data into...

Post on 05-Jul-2020

3 views 0 download

Transcript of Transforming Data into Knowledge › ~a78khan › docs › Information...Transforming Data into...

Transforming DataTransforming Datainto Knowledgeinto Knowledge

Presented by:

Atif Khan

using Knowledge Engineering

&Machine Learning

Scope

Building Blocks ontology inference machine learning

DBpedia

InfoTrellisALLSight

Tools

01-24-2013 3

Motivation

Data Streams » Information?

.....1101111000011

.....1101100011111

.....000010010100

.....1101111111011

.....110110001100

.....011010010100

01-24-2013 4

Motivation

Information » KnowledgeCharles Smith @profcharlesJust finished registering for KDD 2012 in Beijing http://www.kdd.org/kdd2012/

Charles Smith shared a linkI am off to knowledge discovery and data mining conference in Beijing. Looking forward to Michael Jordan's keynote address

01-24-2013 5

Motivation

Information » Knowledge

KDD2012

Beijing

URLwww..kdd2012/

CharlesSmith

CharlesSmith

Conference

Knowledge Discoveryand Data Mining

Beijing

MichaelJordan

01-24-2013 6

Motivation

Information » Knowledge

KDD2012

Beijing

URLwww..kdd2012/

CharlesSmith

CharlesSmith

Conference

Knowledge Discoveryand Data Mining

Beijing

registered for

website

heldAt

MichaelJordan

01-24-2013 7

Motivation

Information » Knowledge

KDD2012

Beijing

URLwww..kdd2012/

CharlesSmith

CharlesSmith

Conference

Knowledge Discoveryand Data Mining

Beijing

registered for

website

heldAt

isa

heldAt

isKeyNoteSpeaker

attending

MichaelJordan

01-24-2013 8

Motivation

Information » Knowledge

KDD2012

Beijing

URLwww..kdd2012/

CharlesSmith

CharlesSmith

Conference

Knowledge Discoveryand Data Mining

Beijing

MichaelJordan

same as

registered for

same as

same as

website

heldAt

isa

heldAt

isKeyNoteSpeaker

attending

01-24-2013 9

Motivation

Information » Knowledge■ Charles Smith is an academic

■ KDD is a conference about data mining and knowledge discovery

■ Michael Jordan is an influential academic in data mining community

■ Charles Smith and Michael Jordan will both be in Beijing during KDD 2012

01-24-2013 10

Building Blocks

ontology inference

machinelearning

01-24-2013 11

Hybrid MDSS - Holmes**

Holmes■ a hybrid medical decision support system (MDSS)

Based on ■ ontological knowledge representation

■ logic-based inference

■ machine learning to deal with noise

**Atif Khan, John Doucette, Robin Cohen. "Validation of an Ontological Medical Decision Support System for Patient Treatment Using a Repository of Patient Data: Insights into the Value of Machine Learning". ACM Transactions on Intelligent Systems and Technology (TIST) 2012.

01-24-2013 12

Hybrid MDSS

Line of Inquiry■ which patients can be prescribed

what sleep medications?

Considerations■ patient-centric & evidence-based

■ automated

■ easy to explain and validate

■ noise tolerant

01-24-2013 13

Hybrid MDSS

Datasets

CDC–Behavioral Risk Factor Surveillance System (BRFSS) – 2010

01-24-2013 14

Hybrid MDSS

Datasets

multi-dimensionala. demographic informationb. medical informationc. behavioural Information

01-24-2013 15

Hybrid MDSS

Datasets

Characteristicsa. 450K+ individualsb. 400+ attributes/recordc. high “missingness”d. numeric coding

01-24-2013 16

Hybrid MDSS

Datasets

sleeping pill prescription protocol(HTML)

01-24-2013 17

Hybrid MDSS

Datasets

drug-to-drug interactions(HTML)

01-24-2013 18

Ontology Model

Drug

Patient

Disease

Condition

01-24-2013 19

Ontology Model

Drug

Patient

Disease

Condition

Pain pills

Sleepingpills

Age GenderMale

Female

MedicalRecord

BRFSSField

01-24-2013 20

Ontology Model

Drug

Patient

Disease

Condition

Pain pills

Sleepingpills

Age GenderMale

Female

MedicalRecord

BRFSSField

isa

01-24-2013 21

Ontology Model

Drug

Patient

Disease

Condition

Pain pills

Sleepingpills

Age GenderMale

Female

MedicalRecord

BRFSSField

hasContraIndication

isa

01-24-2013 22

Ontology Model

Drug

Patient

Disease

Condition

Pain pills

Sleepingpills

Age GenderMale

Female

MedicalRecord

BRFSSField

hasContraIndication

isa

hasRecord

isTakingprescribedTo hasC

ondition

hasAgehasGender

hasField

hasDisease

01-24-2013 23

An Ontology Model

Defines■ taxonomy

● a hierarchy of concepts

■ relationships

Scope - “domain of discourse”■ e.g. medical decision support system

01-24-2013 24

Mapping Raw Data

Patient1

MedicalRecord1

GenderField

2hasValue

AgeField

66hasValue

SnoreField

1hasValue

01-24-2013 25

Mapping Raw Data

Patient1

MedicalRecord1

GenderField

2hasValue

Female

hasGender

AgeField

66hasValue

SnoreField

1hasValue

01-24-2013 26

Mapping Raw Data

Patient1

MedicalRecord1

GenderField

2hasValue

Female

hasGender

AgeField

66hasValue hasCondition

Elderly

SnoreField

1hasValue

01-24-2013 27

Mapping BRFSS Data

Patient1

MedicalRecord1

GenderField

2hasValue

Female

hasGender

AgeField

66hasValue hasCondition

Elderly

SnoreField

1hasValue hasDisease

SleepApnea

01-24-2013 28

Mapping Expert Knowledge

Ramelteon

Eszopiclone

Insomnia

prescribedFor

01-24-2013 29

Mapping Expert Knowledge

Ramelteon

Eszopiclone

LungDisease

LiverDisease

Depression

SleepApnea

Asthma

Insomnia

hasContraIndication

prescribedFor

01-24-2013 30

Mapping Expert Knowledge

Ramelteon

Eszopiclone

Pregnancy

BreastFeeding

Elderly

AlcoholAbuse

LungDisease

LiverDisease

Depression

SleepApnea

Asthma

Insomnia

hasContraIndication

prescribedFor

01-24-2013 31

Mapping Expert Knowledge

Ramelteon

Eszopiclone

Pregnancy

BreastFeeding

Elderly

AlcoholAbuse

Propoxyphene

LungDisease

LiverDisease

Depression

SleepApnea

Asthma

Insomnia

hasContraIndication

prescribedFor

01-24-2013 32

Mapping Expert Knowledge

:Propoxyphene a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:Wygesic a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.

:Trycet a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:PropoxypheneCompound65 a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:Propacet100 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.

:Aspirin a :Drug; :isPrescribedFor :Pain.

:Tylenol1 a :Drug; :isPrescribedFor :Pain.

:Tylenol2 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :SleepingMedication.

N3 representation

01-24-2013 33

Mapping Expert Knowledge

:Propoxyphene a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:Wygesic a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.

:Trycet a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:PropoxypheneCompound65 a :Drug; :isPrescribedFor :Pain; :hasContraIndication

:Eszopiclone.

:Propacet100 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.

:Aspirin a :Drug; :isPrescribedFor :Pain.

:Tylenol1 a :Drug; :isPrescribedFor :Pain.

:Tylenol2 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :SleepingMedication.

N3 representation

01-24-2013 34

Knowledge Representation-Recap

Why?■ to create, maintain

and share informationin a precise manner without ambiguity of meaning

How?■ ontologies

“Now! That should clear up a few things around here!”

http://photos1.blogger.com/blogger2/1715/1669/1600/larson-oct-1987.gif

01-24-2013 35

Knowledge-Inference

a discovery processto find implied knowledge

using explicitly defined information

01-24-2013 36

Knowledge-Inference

logic-based

a discovery processto find implied knowledge

using explicitly defined information

ontology

01-24-2013 37

Knowledge-Inference: example

What do we know about Mary?

■ Mary is grandmother

■ Mary is a grand parent

■ Mary is woman

Mary EmilyJames

hasChild

hasChild

01-24-2013 38

Knowledge-Inference: example

What do we know about Mary?

■ Mary is grandmother

■ Mary is a grand parent

■ Mary is woman

Mary EmilyJames

hasChild

hasChild

if a person has a child, and that child also has a child, then the person is a grandparent

01-24-2013 39

Inference Rules

Drug-to-Drug Interactions■ If a patient is taking an existing drug (D1) and

D1 has contraindication to another drug D2 then drug D2 should not be prescribed to the patient

{ ?P a :Patient. ?D1 a :Drug.?D2 a :Drug.

?P :isTaking ?D1.?D1 :hasContraIndication ?D2.

} => {?P :cannotBeGiven ?D2}.

01-24-2013 40

Inference Rules

Drug-to-Disease Interactions■ If a patient has a condition that has a contra

indication to a drugthen the patient should not be given the drug

{ ?P a :Patient.?D a :Drug.?DIS a :Disease.

?P :hasDisease ?DIS.?D :hasContraIndication ?DIS.

} => {?P :cannotBeGiven ?D}.

01-24-2013 41

Putting it all Together

triplestore

reasonerquery

inferencerules

01-24-2013 42

Putting it all Together

triplestore

reasonerquery

inferencerules

Result

Proof

01-24-2013 43

Putting it all Together

Result

Proof

logic-based, can be verified by traversing the knowledge graph

query answer

triplestore

reasonerquery

inferencerules

01-24-2013 44

But What about Noise?

Mary Emily?

hasChild

hasChild

Noise■ cripples knowledge-based solutions

■ limits inference capability

01-24-2013 45

But What about Noise?

Noise■ cripples knowledge-based solutions

■ limits inference capability

Mary EmilyJames

hasChild

?

01-24-2013 46

Data is Almost Always Noisy

use “machine learning”

to deal with noise

01-24-2013 47

Machine Learning?

feature 2age

feature 1income

query: given a person's age and income predict if they are happy or sad

01-24-2013 48

Machine Learning?

feature 2age

feature 1income

query: given a person's age and income predict if they are happy or sad

happy people

sad people

01-24-2013 49

Machine Learning?

query: is he a happy or sad person ?

happy person

sad person

01-24-2013 50

Machine Learning 101

Machine Learning■ classification: predict class of an instance

■ regression: prediction of a numeric value

■ clustering: group similar items together

01-24-2013 51

Machine Learning 101

Machine Learning■ classification: predict class of an instance

■ regression: prediction of a numeric value

■ clustering: group similar items together

01-24-2013 52

Machine Learning 101

Machine Learning■ classification: predict class of an instance

01-24-2013 53

Classification

0. Train■ build a classifier based on known exemplars

1. Predict

2. Update

3. Evaluate

01-24-2013 54

Dealing with Noise

query: is John depressed ?

but we don't have access to that information

01-24-2013 55

Dealing with Noise

query: is John depressed ?

depressed

Training exemplars

unknown

healthy

01-24-2013 56

Dealing with Noise

answer:

sameAs( , ) = 0.14

sameAs( , ) = 0.85

sameAs( , ) = 0.01

query: is John depressed ?

01-24-2013 57

Dealing with Noise

answer:

sameAs( , ) = 0.14

sameAs( , ) = 0.85

sameAs( , ) = 0.01

query: is John depressed ?

01-24-2013 58

Recap

a hybrid decision supportsystem with ontology-based knowledge representation and logic- based reasoning augmented with machine learning classification fornoise tolerance

01-24-2013 59

Recap

a hybrid decision support system with ontology-based knowledge representation and logic-based reasoning augmented with machine learning classification for noise tolerance

01-24-2013 60

Taming Data in the Wild

01-24-2013 61

Taming Data in the Wild

“Data-to-knowledge” ■ an expensive journey

however, connecting the “data dots” is the key

“Linked Data”*■ can make this a reality

* http://linkeddata.org/

01-24-2013 62

Linked Data

Brief Summary■ Tim Berners-Lee's vision

■ URIs* to identify 'things'

■ HTTP-based dereferencing of URIs

■ structured representation (RDF**)

■ hyperlink 'things' together

Tim Berners-Lee

http://www.w3.org/Press/Stock/Berners-Lee/2001-eur-head-quarter.jpg

* URI = Universal Resource Identifier ** RDF = Resource Description Framework

01-24-2013 63

DBpedia

Extract Information from Wikipedia■ unstructured information

● articles (free text + noise)

■ structured components● infobox templates● categorisation information● images● geo-coordinates, ● external web links

http://wiki.dbpedia.org

01-24-2013 64

DBpedia

Knowledge Engineering using Ontology*■ 359 classes

● in a subsumption hierarchy

■ 1,775 different properties

*http://wiki.dbpedia.org/Ontology

01-24-2013 65

DBpedia

Knowledge Engineering using Ontology

http://wiki.dbpedia.org/Datasets

“English version of the DBpedia knowledge base currently

describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including

764,000 persons, 573,000 places (including

387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000

films and 18,000 video games), 192,000 organizations (including 45,000 companies

and 42,000 educational institutions), 202,000 species and 5,500 diseases.”

01-24-2013 66

DBpedia – Interlinked Datasets

2007

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2007-05-01.png

01-24-2013 67

2009

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2009-03-05_colored.png

01-24-2013 68

2011

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

01-24-2013 69

DBpedia in Action

http://www.visualdataweb.org/relfinder/relfinder.php

Demo/Video

01-24-2013 70

know your customer

ALLSight Platform

01-24-2013 71

AllSight – 360 View of a Customer

01-24-2013 72

AllSight – 360 View of a Customer

01-24-2013 73

AllSight – 360 View of a Customer

01-24-2013 74

AllSight – 360 View of a Customer

01-24-2013 75

product

Example

location

transactions

customer

01-24-2013 76

Example

female

teenager

adult

senior

male

tech buff

shopaholic

Bieber nation

customer

01-24-2013 77

01-24-2013 78

01-24-2013 79

01-24-2013 80

Under the Hood

Data Enrichment■ taxonomies/ontologies

■ dictionaries/catalogues

Matching (decision making)■ knowledge-based rules

■ machine learning

01-24-2013 81

Challenges

Entity Resolution

Pre-processing

Feature Selection

Training (& retraining)

Noise

01-24-2013 82

Tools of the Trade

01-24-2013 83

Some Basic Tools

Knowledge Engineering■ Protégé

■ Web Ontology Language – OWL

■ Resource Description Framework (RDF)● XML● Notation 3 (N3)

01-24-2013 84

Some Basic Tools

Semantic Reasoners■ CWM

■ Euler Sharp

■ Jenna

■ FuXi

01-24-2013 85

Some Basic Tools

Machine Learning Toolkits■ Weka

■ Stanford NLP

■ Apache Mahout*

■ LIBLINEAR*

■ LIBSVM*

■ numpy, scipy (python libs)

01-24-2013 86

In Summary

The collaborative work of this group has the potential to unlock data to create knowledge

There are many■ uses cases that can benefit from this work

■ tools available to facilitate this process

01-24-2013 87

Thank You!!

01-24-2013 88

Confidence in Your Data

* Adler on Data Governancehttps://www.ibm.com/developerworks/mydeveloperworks/blogs/adler/entry/the_ibm_big_data_governance_summit3?lang=en

“Velocity, Volume, and Variety without Veracity creates Vulnerability”*

01-24-2013 89

Confidence in Your Data

“Velocity, Volume, and Variety without Veracity creates Vulnerability”*

* Adler on Data Governancehttps://www.ibm.com/developerworks/mydeveloperworks/blogs/adler/entry/the_ibm_big_data_governance_summit3?lang=en

evaluate confidence values