Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data

Post on 27-Jan-2015

Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data," WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009. http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data

Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data

WSU & AFRL Window-on-Science Seminar on Data Mining

Amit P. Sheth, LexisNexis Ohio Eminent Scholar

Director, Kno.e.sis center, Wright State University

knoesis.org

Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers

Data & Knowledge Ecosystem

Data Mining

Knowledge Discovery

Understanding & Perception

Integration

Search

Analysis (e.g., Patterns)

Browsing

Insight

Situational Awareness

Decision Support

Transactional Data

Observational Data

Multimedia Data

Experimental Data

Textual Data: Scientific Literature, Web Pages, News, Blogs, Reports, Wiki, Forums, Comments, Tweets

Structured, Semistructured, Unstructured Data

Some examples of R&D we have done

• Semantic Search & Ranking of Stories and Reports – connecting the dots applications (insider threat, financial risk analysis)

• Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge

• Semantic Integration, Analysis and Decision Support over Sensor Data

• Extracting taxonomy/domain model from Wikipedia

• Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)

• Understanding User Generated Content (on Social Networking Sites)*
  – What are people talking about
  – How people write
  – Why people write

With application to
  - Artist Popularity Ranking
  - Advertisement on Social Media
  - Identifying Social Signals – spatio-temporal-thematic analysis of Citizen Sensor Data

* Meena Nagarajan

Text

Multimedia Content and Web data

Metadata Extraction

Patterns / Inference / Reasoning

Domain Models

Meta data / Semantic Annotations

Relationship Web

Search, Integration, Analysis, Discovery, Question Answering, Situational Awareness

Sensor Data

RDB

Structured and Semi-structured data

Insider threat demo (semantic search/querying, ranking, …)

Knowledge Discovery from Scientific Literature

Cartic Ramakrishnan

What Knowledge Discovery is NOT

• Search
  – Keyword-in-document-out
  – Keywords are fully specified features of expected outcome
  – Searching for prospective mining sites

• Mining
  – Know where to look
  – Underspecified characteristics of what is sought are available
  – Patterns

What is knowledge discovery?

• “knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther

• “discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan

• Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge

Element of surprise – Swanson’s discoveries

Magnesium

Migraine

PubMed

?

Stress

Spreading Cortical Depression

Calcium Channel Blockers

Swanson’s Discoveries

Associations discovered through keyword searches followed by manual analysis of text to establish possible relevant relationships

11 possible associations found

Knowledge Discovery over text

Text

Extraction of Semantics from text

Semantic Metadata Guided

Knowledge Explorations

Assigning interpretation to text

Semantic Metadata Guided

Knowledge Discovery

Triple-based Semantic Search

Semantic browser

Subgraph discovery

Semantic metadata in the form of semi-structured data

Information Extraction via Ontology assisted text mining – Relationship extraction

Biologically active substance

Lipid

Disease or Syndrome

affects

causes

affects

causes

complicates

Fish Oils ??????? Raynaud’s Disease

instance_of instance_of

UMLS Semantic Network

MeSH

PubMed

9284 documents

4733 documents

5 documents

Background knowledge and Data used

• UMLS – A high level schema of the biomedical domain
  – 136 classes and 49 relationships
  – Synonyms of all relationships – using variant lookup (tools from NLM)
  – 49 relationships + their synonyms = ~350 verbs

• MeSH
  – 22,000+ topics organized as a forest of 16 trees
  – Used to query PubMed

• PubMed
  – Over 16 million abstracts
  – Abstracts annotated with one or more MeSH terms

Method – Parse Sentences in PubMed

SS-Tagger (University of Tokyo)

SS-Parser (University of Tokyo)

(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )

• Entities (MeSH terms) in sentences occur in modified forms
  • “adenomatous” modifies “hyperplasia”
  • “An excessive endogenous or exogenous stimulation” modifies “estrogen”

• Entities can also occur as composites of 2 or more other entities
  • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
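To make the modified and composite entities concrete, here is a minimal Python sketch that reads the bracketed parse shown above and lists its noun phrases. The helper names (`parse_tree`, `words`, `phrases`) are illustrative assumptions, not part of SS-Parser or of the rules used in this work.

```python
# Sketch only: a tiny reader for Penn-Treebank-style bracketed parses.
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read()


def words(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    collected = []
    for child in node[1]:
        collected.extend(words(child))
    return collected


def phrases(node, label):
    """Yield the text of every subtree carrying the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(words(node))
    for child in node[1]:
        yield from phrases(child, label)


tree = parse_tree(PARSE)
noun_phrases = list(phrases(tree, "NP"))
for phrase in noun_phrases:
    print(phrase)
```

The nested noun phrases expose both the composite entity ("adenomatous hyperplasia of the endometrium") and its parts ("adenomatous hyperplasia", "the endometrium"), which is the structure the extraction rules need to work over.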

Preliminary Results

• Swanson’s discoveries – Associations between Migraine and Magnesium [Hearst99]

  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet aggregability
  • magnesium can suppress platelet aggregability

• Data sets generated using these entities (marked red above) as boolean keyword queries against PubMed

• Bidirectional breadth-first search used to find paths in resulting RDF
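A minimal Python sketch of bidirectional breadth-first search, assuming the RDF is reduced to an undirected edge list. The toy `EDGES` below simply restate the eight associations listed above; they are not the extracted RDF, and the function names are illustrative.

```python
from collections import deque

# Toy graph: one undirected edge per association listed on the slide.
EDGES = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]


def _join(parents, meet):
    """Stitch the two half-paths together at the meeting node."""
    path, node = [], meet
    while node is not None:           # walk back to the source
        path.append(node)
        node = parents[0][node]
    path.reverse()
    node = parents[1][meet]
    while node is not None:           # walk forward to the target
        path.append(node)
        node = parents[1][node]
    return path


def bidirectional_bfs(edges, source, target):
    """Return a path (list of nodes) linking source and target, or None."""
    neighbours = {}
    for a, b in edges:                # assumption: edges are undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if source not in neighbours or target not in neighbours:
        return None
    if source == target:
        return [source]

    parents = ({source: None}, {target: None})    # one search tree per end
    frontiers = (deque([source]), deque([target]))
    while frontiers[0] and frontiers[1]:
        # Always grow the smaller frontier, one whole level at a time.
        side = 0 if len(frontiers[0]) <= len(frontiers[1]) else 1
        own, other = parents[side], parents[1 - side]
        for _ in range(len(frontiers[side])):
            node = frontiers[side].popleft()
            for nxt in sorted(neighbours[node]):
                if nxt in own:
                    continue
                own[nxt] = node
                if nxt in other:      # the two searches have met
                    return _join(parents, nxt)
                frontiers[side].append(nxt)
    return None


path = bidirectional_bfs(EDGES, "migraine", "magnesium")
print(" -> ".join(path))
```

Searching from both ends keeps each frontier small, which matters when the graph is built from thousands of abstracts rather than eight statements.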

Paths between Migraine and Magnesium

Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifiers
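The interestingness test can be sketched in a few lines of Python, assuming a path is a list of (subject, predicate, object) triples. The two sample paths are hypothetical arrangements of labels for illustration, not extracted output.

```python
# Predicates that only describe entity structure, per the criterion above.
STRUCTURAL = {"hasPart", "hasModifiers"}


def is_interesting(path):
    """True if at least one edge carries a named, non-structural relationship."""
    return any(predicate not in STRUCTURAL for _, predicate, _ in path)


# Hypothetical paths for illustration only.
named_path = [
    ("migraine", "caused_by", "platelet abnormality"),
    ("platelet abnormality", "hasPart", "platelet"),
]
structural_path = [
    ("platelet aggregation", "hasPart", "platelet"),
    ("platelet aggregation", "hasModifiers", "collagen induced"),
]

print(is_interesting(named_path), is_interesting(structural_path))
```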

An example of such a path

platelet(D001792)

collagen(D003094)

migraine(D008881)

magnesium(D008274)

me_3142by_a_primary_abnormality_of_platelet_behavior

me_2286_13%_and_17%_adp_and_collagen_induced_platelet_aggregation

caused_by

hasPart

hasPart

stimulated

stimulated

hasPart

CONCLUSION

Rules over parse trees are able to extract structure from sentences

Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships

Swanson’s discovery can be automated – if recall can be improved – what hurts recall?

Unsupervised Joint Extraction of Compound Entities and Relationship

Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth, "Unsupervised Discovery of Compound Entities for Relationship Extraction," EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns

Joint Extraction approach

• Dependency parse – Stanford Parser

governor

dependent

amod = adjectival modifier

nsubjpass = nominal subject in passive voice
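As a rough illustration of how the two relations named above can be combined, the Python sketch below attaches amod dependents to a head noun and reports the subject of a passive verb. The dependency list is hand-written for the phrase "adenomatous hyperplasia was observed", not Stanford Parser output, and the functions are assumptions rather than the published algorithm.

```python
# Each dependency is (relation, governor, dependent); tokens are (position, word).
DEPENDENCIES = [
    ("amod", (2, "hyperplasia"), (1, "adenomatous")),
    ("nsubjpass", (4, "observed"), (2, "hyperplasia")),
]


def compound_entity(head, dependencies):
    """Join a head noun with its adjectival modifiers, in sentence order."""
    tokens = [head] + [
        dependent
        for relation, governor, dependent in dependencies
        if relation == "amod" and governor == head
    ]
    return " ".join(word for _, word in sorted(tokens))


def passive_subjects(dependencies):
    """Pair each passive verb with its (expanded) subject entity."""
    return [
        (governor[1], compound_entity(dependent, dependencies))
        for relation, governor, dependent in dependencies
        if relation == "nsubjpass"
    ]


print(passive_subjects(DEPENDENCIES))
```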

Algorithm

Relationship head

Subject head

Object head Object head

Preliminary results

Extracted Triples

Semantic Metadata Guided Knowledge Explorations and Discovery

Results

Hypothesis Driven retrieval of Scientific Literature

PubMed

Complex Query

Supporting document sets retrieved

Migraine

Stress

Patient

affects

isa

Magnesium

Calcium Channel Blockers

inhibit

Keyword query: Migraine[MH] + Magnesium[MH]
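For comparison with the complex query, the keyword query above can be turned into a PubMed search URL. This Python sketch assumes the "+" means a boolean AND and targets NCBI's E-utilities `esearch` endpoint; `mesh_query_url` is a hypothetical helper that only builds the URL and makes no network call.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_query_url(*headings, retmax=20):
    """Build an esearch URL that ANDs together MeSH-heading terms."""
    # Assumption: "+" in the slide's keyword query is read as boolean AND.
    term = " AND ".join(f'"{heading}"[MH]' for heading in headings)
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


url = mesh_query_url("Migraine", "Magnesium")
print(url)
print(parse_qs(urlparse(url).query)["term"][0])
```

A query like this returns documents that mention both headings, whereas the hypothesis-driven approach asks for documents that support specific relationships between them.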

Applications

• Triple-based semantic search
• Semantic Browser

Knowledge Discovery = Extraction + Heuristic Aggregation

Leonardo Da Vinci

The Da Vinci code

The Louvre

Victor Hugo

The Vitruvian man

Santa Maria delle Grazie

Et in Arcadia Ego

Holy Blood, Holy Grail

Harry Potter

The Last Supper

Nicolas Poussin

Priory of Sion

The Hunchback of Notre Dame

The Mona Lisa

Nicolas Flamel

painted_by

painted_by

painted_by

painted_by

member_of

member_of

member_of

written_by

mentioned_in

mentioned_in

displayed_at

displayed_at

cryptic_motto_of

displayed_at

mentioned_in

mentioned_in

Undiscovered Public Knowledge

Understanding, Analyzing, Mining

Social Media

Meena Nagarajan, Karthik Gomadam

mumbai, india

november 26, 2008

another chapter in the war against civilization

and

the world saw it

Through the eyes of the people

the world read it

Through the words of the people

PEOPLE told their stories to PEOPLE

A powerful new era in Information dissemination had taken firm ground

Making it possible for us to

create a global network of citizens

Citizen Sensors – Citizens observing, processing, transmitting, reporting

Image Metadata

latitude: 18° 54′ 59.46″ N, longitude: 72° 49′ 39.65″ E

Geocoder (Reverse Geo-coding)

Address to location database

18 Hormusji Street, Colaba

Nariman House
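Before a reverse geocoder can be queried, the degree/minute/second values in the image metadata need converting to decimal degrees. A minimal Python sketch, where `dms_to_decimal` is an illustrative helper name:

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


latitude = dms_to_decimal(18, 54, 59.46, "N")
longitude = dms_to_decimal(72, 49, 39.65, "E")
print(round(latitude, 4), round(longitude, 4))
```

The result is roughly 18.9165°N, 72.8277°E, the point that the reverse geocoder resolves to a street address.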

Identify and extract information from tweets

Spatio-Temporal Analysis

Structured Meta Extraction

Income Tax Office

Vasant Vihar

Research Challenge #1

• Spatio-Temporal and Thematic analysis
  – What else happened “near” this event location?
  – What events occurred “before” and “after” this event?
  – Any message about “causes” for this event?

Spatial Analysis…. Which tweets originated from an address near 18.916517°N 72.827682°E?

Which tweets originated on Nov 27th 2008, from 11 PM to 12 PM?

Giving us

Which tweets originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008 between 11 PM and 12 PM?
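The combined question is a filter on distance and time. The Python sketch below uses the haversine formula for distance and made-up tweets; the 1 km radius, the tweet records, and the reading of the time window as 11 PM to midnight are all assumptions for illustration.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km: mean Earth radius


def near_and_during(tweets, lat, lon, radius_km, start, end):
    """Keep tweets within radius_km of (lat, lon) and inside [start, end)."""
    return [
        tweet
        for tweet in tweets
        if haversine_km(tweet["lat"], tweet["lon"], lat, lon) <= radius_km
        and start <= tweet["time"] < end
    ]


# Hypothetical tweets: nearby and in the window, nearby but early, far away.
TWEETS = [
    {"id": "a", "lat": 18.9170, "lon": 72.8280, "time": datetime(2008, 11, 27, 23, 15)},
    {"id": "b", "lat": 18.9170, "lon": 72.8280, "time": datetime(2008, 11, 27, 21, 0)},
    {"id": "c", "lat": 19.0760, "lon": 72.8777, "time": datetime(2008, 11, 27, 23, 30)},
]
START = datetime(2008, 11, 27, 23, 0)   # assumed window: 11 PM to midnight
END = datetime(2008, 11, 28, 0, 0)

hits = near_and_during(TWEETS, 18.916517, 72.827682, 1.0, START, END)
print([tweet["id"] for tweet in hits])
```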

Research Challenge #2: Understanding and Analyzing Casual Text

• Casual text
  – Microblogs are often written in SMS style language
  – Slang, abbreviations

Understanding Casual Text

• Not the same as news articles or scientific literature
  – Grammatical errors
    • Implications on NL parser results
  – Inconsistent writing style
    • Implications on learning algorithms that generalize from corpus

Nature of Microblogs

• Additional constraint of limited context
  – Max. of x chars in a microblog
  – Context often provided by the discourse

• Entity identification and disambiguation

• Pre-requisite to other sophisticated information analytics

NL understanding is hard to begin with..

• Not so hard
  – “commando raid appears to be nigh at Oberoi now”
    • Oberoi = Oberoi Hotel, Nigh = high

• Challenging
  – new wing, live fire @ taj 2nd floor on iDesi TV stream
    • Fire on the second floor of the Taj hotel, not on iDesi TV

Social Context surrounding content

• Social context in which a message appears is also an added valuable resource

• Post 1:
  – “Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”

• Follow up post
  – that is Nariman House, not (Hareemane)

Understanding content … informal text

• I say: “Your music is wicked”

• What I really mean: “Your music is good”

Structured text (biomedical literature)

Multimedia Content and Web data

Web Services

Semantic Metadata:
Smile is a Track
Lil transliterates to Lily Allen
Lily Allen is an Artist

Informal Text (Social Network chatter)

Your smile rocks Lil

Urban Dictionary

MusicBrainz Taxonomy

Artist: Lily Allen
Track: Smile

Sentiment expression: Rocks
Transliterates to: cool, good
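The annotation of "Your smile rocks Lil" can be sketched in Python with three small lookup tables. These dictionaries are hand-made stand-ins for MusicBrainz and Urban Dictionary, and `annotate` is an illustrative name; no real service is called.

```python
import re

# Hand-made stand-ins for the domain model and the slang dictionary.
TRACKS = {"smile": "Smile"}
ARTIST_ALIASES = {"lil": "Lily Allen", "lily": "Lily Allen"}
SLANG_SENTIMENT = {"rocks": ["cool", "good"], "wicked": ["good"]}


def annotate(comment):
    """Tag artist, track and sentiment words found in an informal comment."""
    found = {"artist": None, "track": None, "sentiment": []}
    for token in re.findall(r"[a-z]+", comment.lower()):
        if token in ARTIST_ALIASES:
            found["artist"] = ARTIST_ALIASES[token]
        elif token in TRACKS:
            found["track"] = TRACKS[token]
        elif token in SLANG_SENTIMENT:
            found["sentiment"].extend(SLANG_SENTIMENT[token])
    return found


print(annotate("Your smile rocks Lil"))
print(annotate("Your music is wicked"))
```

A plain lookup like this cannot tell the track "Smile" from the ordinary word "smile"; resolving that ambiguity needs the surrounding context, such as whose page the comment was posted on.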

Example: Pulse of a Community

• Imagine millions of such informal opinions
  – Individual expressions to mass opinions

• “Popular artists” lists from MySpace comments

Lily Allen

Lady Sovereign

Amy Winehouse

Gorillaz

Coldplay

Placebo

Sting

Keane

Joss Stone

What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding

Semantics with the help of

1. Domain Models
2. Domain Models
3. Domain Models

(ontologies, folksonomies)

Domain Knowledge: A key driver

• Places that are nearby ‘Nariman house’
  – Spatial query

• Messages originated around this place
  – Temporal analysis

• Messages about related events / places
  – Thematic analysis

Research Challenge #3

But Where does the Domain Knowledge come from?

• Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care,…)

• Community driven knowledge extraction
  – How to create models that are “socially scalable”?
  – How to organically grow and maintain this model?

Building models…seed word to hierarchy creation using WIKIPEDIA

Seed Query

BWikipedia

Fulltext Concept Search

Wikigraph-Based expansion

Graph Search

Graph Search

Graph Search

Hierarchy Creation

Query: “cognition”

Identifying relationships: Hard, harder than many hard things

But NOT that Hard, When WE do it

Games with a purpose

• Get humans to give their solitaire time
  – Solve real hard computational problems
  – Image tagging, Identifying part of an image
  – Tag a tune, Squigl, Verbosity, and Matchin
  – Pioneered by Luis von Ahn

OntoLablr

• Relationship Identification Game

• leads to
• causes

Explosion Traffic congestion

• How do you get comprehensive situational awareness by merging “human sensing” and “machine sensing”?

Research Challenge #4: Semantic Sensor Web

Semantically Annotated O&M

<swe:component name="time">
  <swe:Time definition="urn:ogc:def:phenomenon:time" uom="urn:ogc:def:unit:date-time">
    <sa:swe rdfa:about="?time" rdfa:instanceof="time:Instant">
      <sa:sml rdfa:property="xs:date-time"/>
    </sa:swe>
  </swe:Time>
</swe:component>
<swe:component name="measured_air_temperature">
  <swe:Quantity definition="urn:ogc:def:phenomenon:temperature" uom="urn:ogc:def:unit:fahrenheit">
    <sa:swe rdfa:about="?measured_air_temperature" rdfa:instanceof="senso:TemperatureObservation">
      <sa:swe rdfa:property="weather:fahrenheit"/>
      <sa:swe rdfa:rel="senso:occurred_when" resource="?time"/>
      <sa:swe rdfa:rel="senso:observed_by" resource="senso:buckeye_sensor"/>
    </sa:swe>
  </swe:Quantity>
</swe:component>

<swe:value name="weather-data">2008-03-08T05:00:00,29.1</swe:value>

Semantic Sensor ML – Adding Ontological Metadata

Person

Company

Coordinates

Coordinate System

Time Units

Timezone

Spatial Ontology

Domain Ontology

Temporal Ontology

Mike Botts, "SensorML and Sensor Web Enablement," Earth System Science Center, UAB Huntsville

Semantic Query

• Semantic Temporal Query

• Model-references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries

• Supported semantic query operators include:
  – contains: user-specified interval falls wholly within a sensor reading interval (also called inside)
  – within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside)
  – overlaps: user-specified interval overlaps the sensor reading interval

• Example SPARQL query defining the temporal operator ‘within’
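The three operators reduce to comparisons on interval endpoints. The Python sketch below restates the definitions above as predicates over (start, end) pairs; it is not the SPARQL used in the Semantic Sensor Web, and treating endpoints as inclusive is an assumption.

```python
from datetime import datetime


def contains(user, reading):
    """User-specified interval falls wholly within the sensor reading interval."""
    return reading[0] <= user[0] and user[1] <= reading[1]


def within(user, reading):
    """Sensor reading interval falls wholly within the user-specified interval."""
    return user[0] <= reading[0] and reading[1] <= user[1]


def overlaps(user, reading):
    """The two intervals share at least one instant (inclusive endpoints assumed)."""
    return user[0] <= reading[1] and reading[0] <= user[1]


# Hypothetical intervals around the 2008-03-08T05:00:00 observation time.
reading = (datetime(2008, 3, 8, 5, 0), datetime(2008, 3, 8, 6, 0))
user = (datetime(2008, 3, 8, 4, 0), datetime(2008, 3, 8, 7, 0))

print(contains(user, reading), within(user, reading), overlaps(user, reading))
```

Here the one-hour reading sits inside the three-hour user window, so `within` and `overlaps` hold while `contains` does not.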

Kno.e.sis’ Semantic Sensor Web

Semantic Sensor Web demo (online)

Semantic Sensor Web demo (local)

Synthetic but realistic scenario

• an image taken from a raw satellite feed

• an image taken by a camera phone with an associated label, “explosion.”

• Textual messages (such as tweets) using STT analysis

• Correlating to get

Create better views (smart mashups)

Extracting Social Signals

• what are the important topics of discussion and concern in different parts of the world on a particular day

• how different cultures or countries are reacting to the same event or situation (e.g., Mumbai Attack)

• how a situation such as financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on).

Twitris Demo

A few more things

• Use of background knowledge

• Event extraction from text
  – time and location extraction
    • Such information may not be present
    • Someone from Washington DC can tweet about Mumbai

• Scalable semantic analytics
  – Subgraph and pattern discovery
    • Meaningful subgraphs like relevant and interesting paths
    • Ranking paths

The Sum of the Parts

Spatio-Temporal analysis
  – Find out where and when

+ Thematic
  – What and how

+ Semantic Extraction from text, multimedia and sensor data
  – tags, time, location, concepts, events

+ Semantic models & background knowledge
  – Making better sense of STT
  – Integration

+ Semantic Sensor Web
  – The platform

= Situational Awareness

KNO.E.SIS as a case study of world class research based higher education environment

http://knoesis.org

Kno.e.sis Center Labs (3rd Floor, Joshi)

Amit Sheth
• Semantic Science Lab
• Semantic Web Lab
• Service Research Lab

TK Prasad
• Metadata and Languages Lab

Shaojun Wang
• Statistical Machine Learning

Pascal Hitzler
• Formal Semantics & Reasoning Lab

Michael Raymer
• Bioinformatics Lab

Guozhu Dong
• Data Mining Lab

Keke Chen
• Data Intensive Analysis and Computing Lab

KNO.E.SIS MEMBERS – A SUBSET

Exceptional students

• Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants.

• Successfully competed with two Stanford PhDs; 1000+ citations within 2 years of his graduation.

• “BTW, Meena is an absolute find.  If all of your other students are as talented, you are very lucky.  …  I’d definitely like to work with more interns of her caliber, ... ”[Dr. Kevin Haas, Director of Search at Yahoo!]

• “It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Knoesis Center.” [Dr. Alpers Caglayan – looking to hire Kno.e.sis grads]

Funding, Collaboration, etc

• UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!

• NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google

• 70% Federal, 19% State, 11% Industry

• Students intern at the best industry labs & national labs

• Graduates very successful

Interested in more background?

• Semantics-Empowered Social Computing
• Semantic Sensor Web
• Traveling the Semantic Web through Space, Theme and Time
• Relationship Web: Blazing Semantic Trails between Web Resources
• Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy

Contact/more details: amit @ knoesis.org

Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas

Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).