Adaptive Multi-modal Data Mining and Fusion For Autonomous … · 2012-12-31 · Outline of...

Adaptive Multi-modal Data Mining and Fusion For Autonomous Intelligence Discovery

Edward J. Wegman, Ph.D.

Yasmin H. Said, Ph.D.

Outline of Presentation

• Problem Description

• Background in Text Mining

• Outline of System

• Arabic Language Tool

• Geospatial Tool

• Integration of Text and Images

• Streaming Documents

Problem Description

• Consider the plight of an analyst who is faced with multimedia sources that stream in data constantly.

• Data can be structured text, unstructured text, voice, images, and video.

• The data are likely not in English, are likely to be massive in scale, and are streaming.

• Our premise: The analyst needs a tool that integrates and filters the data and presents the items most likely to be useful for his or her consideration.

• The tool should be a query system that operates transparently and without significant human fine-tuning.

Text Mining

• The proposed tool has its roots in text mining.

• Text mining uses statistical, mathematical, and computer science techniques to extract subtle and unanticipated information and relationships from sets of documents.

• These sets of documents are called corpora.

• Two important methods:

– Cross-corpus discovery.

– Clustering.

Cross-Corpus Discovery

• Test case examples

– 1200 Science News abstracts.

– 350 Naval Research ILIR documents.

The Approach: Text Data Mining Via MST Exploration

[Flow diagram stages:]

• Multi-Discipline Document Set

• Feature Extraction (Denoising, stemming, BPM, TPM)

• Interpoint Distance Calculation

• Minimal Spanning Tree (MST) Calculation

• MST Layout Via Spring Based Models

• Cluster Determination and Exploration

• Cross Corpora Associations

Feature Extraction - Bigram and Trigram Proximity Matrix

“The wise young man sought his father in the crowd.”
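A minimal sketch of how a bigram proximity matrix might be built for the example sentence. It assumes that entry (i, j) counts how often word i is immediately followed by word j after denoising; the stop list `STOP_WORDS` and the function name are hypothetical, and stemming is omitted for brevity.

```python
import re

# Hypothetical stop list for illustration; the slides do not give the denoising list.
STOP_WORDS = {"the", "his", "in", "a", "an", "of", "and"}


def bigram_proximity_matrix(sentence):
    """Return (vocab, matrix) where matrix[i][j] counts word i directly followed by word j."""
    # Lowercase, keep alphabetic tokens, and drop stop words (denoising).
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOP_WORDS]
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    # Each adjacent pair of surviving tokens contributes one count.
    for first, second in zip(tokens, tokens[1:]):
        matrix[index[first]][index[second]] += 1
    return vocab, matrix


vocab, bpm = bigram_proximity_matrix(
    "The wise young man sought his father in the crowd."
)
for word, row in zip(vocab, bpm):
    print(f"{word:>7} {row}")
```

A trigram proximity matrix would follow the same pattern with triples of adjacent words; a sparse dictionary keyed by word pairs is likely a better fit than a dense matrix for a realistic lexicon.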

MST Classifier Complexity Characterization

Insight: the number of cross-class edges can be used as a surrogate for classification complexity. These cross-class (corpora) edges will be used in our scheme to facilitate the cross-corpora discovery process.
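The idea of counting cross-class edges can be sketched as follows. This is an illustration only: it builds a minimal spanning tree with Prim's algorithm from a precomputed interpoint distance matrix and then lists the edges whose endpoints carry different corpus labels. The toy distances and the labels "A" and "B" are made up.

```python
def minimal_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix; returns a list of (i, j) edges."""
    n = len(dist)
    if n == 0:
        return []
    # best[v] = (distance from v to the current tree, tree vertex achieving it)
    best = {v: (dist[0][v], 0) for v in range(1, n)}
    edges = []
    while best:
        v = min(best, key=lambda u: best[u][0])
        _, parent = best.pop(v)
        edges.append((parent, v))
        # Vertex v is now in the tree; see whether it offers shorter links.
        for u in best:
            if dist[v][u] < best[u][0]:
                best[u] = (dist[v][u], v)
    return edges


def cross_class_edges(edges, labels):
    """Edges of the tree that join documents from different classes (corpora)."""
    return [(i, j) for i, j in edges if labels[i] != labels[j]]


# Toy example: four documents placed on a line, two per corpus.
positions = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["A", "A", "B", "B"]

mst = minimal_spanning_tree(dist)
cross = cross_class_edges(mst, labels)
print(mst, cross)
```

With well-separated corpora only one tree edge crosses between them; a larger count would suggest more interleaving, and the crossing edges themselves point at candidate cross-corpus associations.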

The Environment (Opening Screen)

Mathematics and Computer Sciences vs. Physical Sciences and Technology: Second Strongest Association in MST

Anthropology and Archaeology vs. Medical Sciences: Strongest Associated Articles in the MST

Anthropology and Archaeology vs. Medical Sciences: Strongest Associated Articles Comparison

A Duplicate in the ILIR Database in the Advanced Naval Materials Category

NAVSTO

FY99/FY00

Duplicate entries for L. MERWIN and C. RICE

ORGANICALLY MODIFIED CERAMICS FOR CORROSION CONTROL


Two Closely Related Articles in the Human Performance Factors and the Information Technology and Operations Categories

NUWC

FY01

Dr. Susan S. Kirschenbaum

ADAPTIVE GROUPWARE FOR PLANNING

NUWC

FY99

S. S. KIRSCHENBAUM

TRAINING A SYSTEM

Two Articles in the Information Technology and Operations Category That Are Identical

NAVSTO

FY99/FY00

L. VENETSKY

DIRECT ADAPTIVE, GRADIENT DESCENT, AND GENETIC ALGORITHM TECHNIQUES FOR FUZZY CONTROLLERS

NAVSTO

FY99/FY00

L. VENETSKY

MISSION SCENARIO CLASSIFICATION USING PARAMETER SPACE CONCEPT LEARNING

Document Clustering

• An obvious statement: “It is extremely useful to group documents that are similar.”

• Ultimately, “document” should be interpreted in a multimedia sense.

Test Data for this Example

• Our test bed for text data was collected by the Linguistic Data Consortium in 1997.

– The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995.

• Features

– The human classifiers claimed 25 clusters in their limited document database.

– Just as before, we denoise and stem the text data.

Text Example - Clusters

Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008

Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%

Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%

Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102

Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50

Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26

Text Example - Clusters

Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008

Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%

Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%

Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165

Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69

Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
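The slides do not define ISim and ESim. The sketch below assumes one common reading: ISim is the average pairwise cosine similarity among documents inside a cluster, and ESim is the average cosine similarity between documents inside the cluster and documents outside it. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_similarities(vectors, assignment, cluster):
    """Return (isim, esim) for one cluster under the assumed definitions above."""
    inside = [v for v, c in zip(vectors, assignment) if c == cluster]
    outside = [v for v, c in zip(vectors, assignment) if c != cluster]
    # Every unordered pair of distinct documents inside the cluster.
    within = [cosine(a, b) for i, a in enumerate(inside) for b in inside[i + 1:]]
    # Every pair with one document inside and one outside.
    across = [cosine(a, b) for a in inside for b in outside]
    isim = sum(within) / len(within) if within else 0.0
    esim = sum(across) / len(across) if across else 0.0
    return isim, esim


# Toy term vectors: two identical documents in cluster 0, one unrelated in cluster 1.
vectors = [[1, 0], [1, 0], [0, 1]]
assignment = [0, 0, 1]
isim, esim = cluster_similarities(vectors, assignment, 0)
print(isim, esim)
```

Under this reading, the reported values (ISim well above ESim) indicate clusters that are tighter internally than they are similar to the rest of the collection.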

Outline of System

Four core capabilities:

• Text and image mining for feature extraction

• Multi-modal data fusion

• Agent-based adaptive information filtering

• Cognitively friendly information visualization

Outline of System

[System diagram. Components shown:]

• Inputs: unstructured text; structured text; relational database; email; Internet chat record; speech audio; intercepted phone calls; recorded conversations; static imagery (geo-spatial); video (geo-spatial)

• Text processing: speech recognition engine; structured text feature extractor; unstructured text feature extractor; text miner; textual information filter; text filtering agent; filter parameters

• Image processing: image feature extractor; image miner; image filter; image filtering agent; filter parameters

• Analyst side: personal user agent; human analyst; KQML links

Arabic Language Tool




• Our fundamental premise is that Arabic language documents, open source and otherwise, provide valuable insight.

• Open source documents are streaming.

• Not enough Arabic language experts are available to translate everything.

• We need a system for English language queries to an Arabic language text database.


• Basic functionality

– Arabic language documents are processed in the background: stemmed, denoised, clustered, and converted to bigrams.

• Bigrams are attached as metadata.

– English language query is translated to Arabic.

• Query is divided into multiple bigrams.

– Reduced Arabic language document set is presented to analyst for consideration and translation.
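The query flow above can be sketched roughly as follows, assuming the translated query and the document metadata are both already reduced to stemmed tokens. The translation step itself is not shown, the placeholder tokens stand in for stemmed Arabic terms, and the function names and overlap scoring are hypothetical.

```python
def bigrams(tokens):
    """Set of adjacent token pairs in a token sequence."""
    return set(zip(tokens, tokens[1:]))


def filter_documents(query_tokens, doc_bigrams, min_overlap=1):
    """Return document ids ranked by how many query bigrams appear in their metadata."""
    query = bigrams(query_tokens)
    scored = [(len(query & grams), doc_id) for doc_id, grams in doc_bigrams.items()]
    # Highest overlap first; ties broken by document id for a stable order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [doc_id for score, doc_id in ranked if score >= min_overlap]


# Bigram metadata attached during background processing (placeholder tokens).
doc_bigrams = {
    "doc1": bigrams(["ira", "ceas", "fire"]),
    "doc2": bigrams(["light", "water", "reactor"]),
    "doc3": bigrams(["declar", "ceas", "fire", "talk"]),
}

# Tokens of the query after translation and stemming (assumed to be done upstream).
reduced_set = filter_documents(["ceas", "fire", "talk"], doc_bigrams)
print(reduced_set)
```

Only the documents sharing at least one bigram with the query are passed on, which is the reduced set the analyst would then review and have translated.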


• Status

– Eiman Alshammari, a native Arabic speaker, is the graduate student developing the tool.

– We met with the Arabic Language Data Mining Group in Cairo and secured cooperation and an Arabic language corpus.

• Professor Aly Fahmy, Dean of the Faculty of Computers and Information, Cairo University.

• Dr. Amir Atiya, Associate Professor of Computer Engineering, Cairo University.

• Dr. Ahmed S. Moussa, Program Manager, Smart Village.

– We met with representatives of King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.

• Dr. Turki Saud Mohammed Al-Saud, Vice President Research Institutes

• Dr. Mansour M. Alghamidi, Director, Computers and Electronics

• Dr. Ibrahim A. Al-Kharashi, Arabic Language Projects

– Project is underway … Eiman is anxious to graduate.

Geospatial Tool



Geospatial Tool

• Basic Functionality

– Develop a geospatial visualization tool for both display and query.

– Locate source IP addresses.

– Locate imagery and video sources geospatially based on geospatial metadata.

– Query geospatial coordinates for multimedia documents in the database.
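One way such a coordinate query could work is a radius search over documents that carry latitude and longitude metadata. The sketch below uses the haversine great-circle distance; the record layout, document ids, and approximate city coordinates are illustrative assumptions, not part of the described system.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def documents_near(documents, lat, lon, radius_km):
    """Ids of documents whose geospatial metadata falls within radius_km of (lat, lon)."""
    return [
        doc["id"]
        for doc in documents
        if haversine_km(lat, lon, doc["lat"], doc["lon"]) <= radius_km
    ]


# Hypothetical multimedia records with geospatial metadata (approximate city coordinates).
documents = [
    {"id": "img-cairo", "lat": 30.0444, "lon": 31.2357},
    {"id": "vid-riyadh", "lat": 24.7136, "lon": 46.6753},
    {"id": "txt-fairfax", "lat": 38.8462, "lon": -77.3064},
]

nearby = documents_near(documents, 30.0, 31.0, radius_km=100.0)
print(nearby)
```

A linear scan is enough for a sketch; a database of realistic size would more likely rely on a spatial index.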


Geospatial Tool

• Status

– Felix Mihai and In-ja Youn are the graduate students developing this tool.

– The basic map functionality is available.

– The IP locator is underway.

– A geospatially located satellite image database is also available (MISR imagery).

– Graduate student funding is a problem, for Felix in particular.

Integration of Text and Images




• Functionality Desired

– Attach metadata to images and to text, either endogenously or exogenously.

• Be able to query an image for related text documents.

– E.g., Who is this a picture of? What is this a picture of?

• Be able to query a text document to identify related images.

– E.g., Find me a picture of this named person. Find me a picture of this facility.


• Two approaches

– The bigram proximity matrix (for text documents) and the gray level co-occurrence matrix (for images) have the same basic structure.

• Work is underway to develop and exploit this characteristic.

– Integrated text and image documents (such as news documents, video with voice) may be deconstructed to provide metadata for each other.

• Not yet implemented (Google Images does this for web pages).
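The structural parallel can be seen in a small sketch of a gray level co-occurrence matrix: entry (i, j) counts how often a pixel with gray level i has a neighbor with gray level j at a fixed offset, much as a bigram proximity matrix counts one word following another. The function name, the default offset (one pixel to the right), and the toy image are assumptions made for illustration.

```python
def cooccurrence_matrix(image, levels, offset=(0, 1)):
    """Count gray-level pairs (pixel, neighbor) for a fixed (row, column) offset."""
    d_row, d_col = offset
    matrix = [[0] * levels for _ in range(levels)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            rr, cc = r + d_row, c + d_col
            # Skip neighbors that fall outside the image.
            if 0 <= rr < len(image) and 0 <= cc < len(image[rr]):
                matrix[value][image[rr][cc]] += 1
    return matrix


# Toy 2 x 3 image with three gray levels (0, 1, 2).
image = [
    [0, 0, 1],
    [1, 2, 2],
]
glcm = cooccurrence_matrix(image, levels=3)
for row in glcm:
    print(row)
```

Rows index the "first" gray level and columns the "second", mirroring the first and second word of a bigram.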


• Status

– Peter Mburu is the graduate student identified to work on this part of the project.

• Work has just begun … this is a hard problem.

• Peter is very bright, but not yet in candidacy.

Streaming Documents



Streaming Documents

• Functionality Desired

– Process streaming text documents.

• Vector space representation of a document.

• Streaming documents imply evolving lexicon.

– Recursive computation of document frequency.

– Use evolving lexicon.

• Track evolving sense of documents.

• Introduce new query terms.

• Classify new documents.
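A minimal sketch of recursive document-frequency updating over an evolving lexicon. It assumes a standard TF-IDF style weight, tf(w, d) * log(N / df(w)), where tf is the number of times word w is in document d, df is the number of documents that contain w, and N is the size of the corpus so far. The class name, method names, and the exact weighting form are assumptions, not the authors' formula.

```python
import math
from collections import Counter


class StreamingLexicon:
    """Keeps document frequencies and corpus size up to date as documents arrive."""

    def __init__(self):
        self.doc_freq = Counter()  # word -> number of documents containing it
        self.corpus_size = 0       # number of documents seen so far

    def add(self, tokens):
        """Fold one new document into the running counts (new words extend the lexicon)."""
        self.corpus_size += 1
        self.doc_freq.update(set(tokens))

    def weights(self, tokens):
        """Assumed TF-IDF weights for a document, using the current lexicon state."""
        counts = Counter(tokens)
        return {
            word: tf * math.log(self.corpus_size / self.doc_freq[word])
            for word, tf in counts.items()
            if self.doc_freq[word] > 0
        }


lexicon = StreamingLexicon()
lexicon.add(["korea", "nuclear", "talk"])
second = ["ireland", "talk", "talk"]
lexicon.add(second)
second_weights = lexicon.weights(second)
print(second_weights)
```

Each arriving document updates the counts in time proportional to its length, so weights for later documents reflect the lexicon as it stands at that moment rather than requiring a pass over the whole corpus.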

Streaming Documents

[Formula slide; annotated quantities:]

• Number of times word w is in document d.

• Number of documents that contain word w.

• Size of the corpus.

Streaming Documents

• Status

– Graduate students, Elizabeth Leeds Hohman and Loulwah Al-Samait, are separately working on streaming documents.

• Elizabeth is developing a visual representation using graph theory of streaming document clusters.

• Loulwah is developing a method for understanding evolving sense of documents.

– Theory development is relatively advanced; system development is less so.

• Project has been underway about 4 months.

Work Left to Be Done!

• Lots!

– Progress is good, and a number of bright students are working on the project.

– The Arabic Text Tool should be in hand by December.

– The Geospatial Tool is fairly advanced, but Felix has no funding and is fragile.

– The Text and Image Integration is at early stages and is probably the most difficult conceptually.

– The Streaming Text Tools are advanced theoretically, but system development is not yet underway.

– Filtering tasks and system integration have not yet begun.

– But, we have only been at it for four months.

Contact Information

Edward J. Wegman, Ph.D.

Center for Computational Data Science

George Mason University, MS 6A2

Fairfax, VA 22030-4444

[email protected]

Yasmin H. Said, Ph.D.

Center for Computational Data Science

George Mason University, MS 6A2

Fairfax, VA 22030-4444

[email protected]
