Adaptive Multi-modal Data Mining and Fusion For Autonomous … · 2012-12-31 · Outline of...
Adaptive Multi-modal Data Mining
and Fusion For Autonomous
Intelligence Discovery
Edward J. Wegman, Ph.D.
Yasmin H. Said, Ph.D.
Outline of Presentation
• Problem Description
• Background in Text Mining
• Outline of System
• Arabic Language Tool
• Geospatial Tool
• Integration of Text and Images
• Streaming Documents
Problem Description
• Consider the plight of an analyst, who is faced with
multimedia sources that stream in data constantly.
• Data can be structured text, unstructured text, voice, images,
and video.
• The data are likely not in English, massive in scale, and streaming.
• Our premise: The analyst needs a tool that integrates and filters the data,
then presents the data most likely to be useful for the analyst's consideration.
• The tool should be a query system that must operate
transparently and without significant human fine tuning.
Text Mining
• The proposed tool is rooted in text mining.
• Text mining uses statistical, mathematical, and computer science techniques to extract subtle and unanticipated information and relationships from sets of documents.
• These sets of documents are called corpora.
• Two important methods:
– Cross-corpus discovery.
– Clustering.
Cross Corpus Discovery
• Test case examples
– 1200 Science News abstracts.
– 350 Naval Research ILIR documents.
The Approach: Text Data Mining Via MST Exploration
[Figure: flowchart with the following stages]
• Multi-Discipline Document Set
• Feature Extraction (Denoising, stemming, BPM, TPM)
• Interpoint Distance Calculation
• Minimal Spanning Tree (MST) Calculation
• MST Layout Via Spring Based Models
• Cluster Determination and Exploration
• Cross Corpora Associations
Feature Extraction -
Bigram and Trigram Proximity Matrix
“The wise young man sought his father in the crowd.”
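A minimal sketch of how the example sentence could be turned into a bigram proximity matrix (BPM) and a set of trigram counts. It assumes the BPM entry (i, j) counts how often word i is immediately followed by word j. The stop list and the function names are illustrative assumptions, and stemming is omitted; none of this is taken from the authors' implementation.

```python
from collections import Counter

# Assumed stop list for illustration only; the slides do not give one.
STOP_WORDS = {"the", "his", "in", "a", "an", "of"}

def tokenize(sentence):
    """Lowercase, strip punctuation, and drop stop words (a crude denoising step)."""
    words = [w.strip(".,;:!?\"'“”").lower() for w in sentence.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def proximity_counts(tokens, n):
    """Count each run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_matrix(tokens):
    """Return (lexicon, matrix); matrix[i][j] counts word i immediately followed by word j."""
    lexicon = sorted(set(tokens))
    index = {w: k for k, w in enumerate(lexicon)}
    matrix = [[0] * len(lexicon) for _ in lexicon]
    for (a, b), count in proximity_counts(tokens, 2).items():
        matrix[index[a]][index[b]] = count
    return lexicon, matrix

sentence = "The wise young man sought his father in the crowd."
tokens = tokenize(sentence)
lexicon, bpm = bigram_matrix(tokens)
trigrams = proximity_counts(tokens, 3)
```

With this stop list the sentence reduces to six tokens, so the matrix holds five bigram counts and there are four distinct trigrams.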
MST Classifier Complexity Characterization
Insight: the number of cross class edges can be used as a surrogate for
classification complexity. These cross class (corpora) edges will be used in our
scheme to facilitate the cross-corpora discovery process.
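The insight above can be sketched as follows: build a minimal spanning tree over the interpoint distances, then count the edges whose endpoints carry different class (corpus) labels. This sketch uses Prim's algorithm on a dense distance matrix; the function names and the toy one-dimensional data are assumptions for illustration, not the authors' code.

```python
def minimal_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns (i, j) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def cross_class_edges(edges, labels):
    """Keep only the MST edges that join documents with different labels."""
    return [(i, j) for i, j in edges if labels[i] != labels[j]]

# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
labels = ["A", "A", "A", "B", "B", "B"]
dist = [[abs(p - q) for q in points] for p in points]
edges = minimal_spanning_tree(dist)
cross = cross_class_edges(edges, labels)
```

Well-separated classes give a single cross class edge, as here; heavily overlapping classes would give many, which is the sense in which the count acts as a complexity surrogate.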
The Environment (Opening Screen)
Mathematics and Computer Sciences vs.
Physical Sciences and Technology Second
Strongest Association in MST
Anthropology and Archaeology vs.
Medical Sciences Strongest Associated
Articles in the MST
Anthropology and Archaeology vs. Medical Sciences
Strongest Associated Articles Comparison
A Duplicate in the ILIR Database in the Advanced
Naval Materials Category
NAVSTO
FY99/FY00
Duplicate entries for L. MERWIN and C. RICE
ORGANICALLY MODIFIED CERAMICS FOR CORROSION CONTROL
Two Closely Related Articles in the Human Performance
Factors and the Information Technology and Operations Categories
NUWC
FY01
Dr. Susan S. Kirschenbaum
ADAPTIVE GROUPWARE FOR PLANNING
NUWC
FY99
S. S. KIRSCHENBAUM
TRAINING A SYSTEM
Two Articles in the Information Technology and
Operations Category that are Identical
NAVSTO
FY99/FY00
L. VENETSKY
DIRECT ADAPTIVE, GRADIENT DESCENT,
AND GENETIC ALGORITHM TECHNIQUES
FOR FUZZY CONTROLLERS
NAVSTO
FY99/FY00
L. VENETSKY
MISSION SCENARIO CLASSIFICATION USING
PARAMETER SPACE CONCEPT LEARNING
Document Clustering
• An obvious statement: “It is extremely useful to
group documents that are similar.”
• Ultimately, “document” should be interpreted in a
multimedia sense.
Test Data for this Example
• Our test bed for text data was collected by the Linguistic Data Consortium in 1997.
– The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995.
• Features
– The human classifiers claimed 25 clusters in their limited document database.
– Just as before, we denoise and stem the text data.
Text Example - Clusters
Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008
Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein
5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%
Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%,
fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam
1.5%
Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107,
minist 104, govern 104, polit 104, talk 102
Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94,
republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process
66, gerri.adam 59, british.govern 50
Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major
43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader
30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26
Text Example - Clusters
Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008
Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim
5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%,
south.korea 1.5%
Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim
3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%,
simpson 0.8%
Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204,
offici 196, pyongyang 179, presid 167, talk 165
Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean
147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water
71, presid.clinton 69
Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53,
chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam
37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29
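The slides do not define ISim and ESim. A minimal sketch, assuming ISim is the average pairwise cosine similarity among the documents inside a cluster and ESim is the average cosine similarity between documents in the cluster and documents outside it. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(values):
    """Arithmetic mean, with 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def isim_esim(vectors, labels, cluster):
    """Return (internal, external) average similarity for one cluster (assumed definitions)."""
    inside = [v for v, lab in zip(vectors, labels) if lab == cluster]
    outside = [v for v, lab in zip(vectors, labels) if lab != cluster]
    within = [cosine(a, b) for i, a in enumerate(inside) for b in inside[i + 1:]]
    across = [cosine(a, b) for a in inside for b in outside]
    return mean(within), mean(across)

# Toy term-count vectors over a three-word lexicon.
vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]
isim, esim = isim_esim(vectors, labels, 0)
```

Under these assumed definitions, a tight and well-separated cluster shows a high ISim and a low ESim, which matches the pattern in the two clusters listed above.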
Outline of System
Four core capabilities:
• Text and image mining for feature extraction
• Multi-modal data fusion
• Agent-based adaptive information filtering
• Cognitively friendly information visualization
Outline of System
[Figure: system architecture diagram. Labels recovered from the figure: unstructured text; structured text; relational database; email; Internet chat record; speech audio (intercepted phone calls, recorded conversations); speech recognition engine; static imagery (geo-spatial); video (geo-spatial); structured text feature extractor; unstructured text feature extractor; image feature extractor; text miner; image miner; text filtering agent (textual information filter, filter parameters); image filtering agent (image filter, filter parameters); personal user agent; human analyst; KQML.]
Arabic Language Tool
• Our fundamental premise is that Arabic language documents, open source and otherwise, provide valuable insight.
• Open source documents are streaming.
• Not enough Arabic language experts are available to translate everything.
• We need a system for English language queries to an Arabic language text database.
Arabic Language Tool
• Basic functionality
– Arabic language documents are background processed,
stemmed, denoised, clustered, bigrammed.
• Bigrams are attached as metadata.
– English language query is translated to Arabic
• Query is divided into multiple bigrams.
– Reduced Arabic language document set is presented to
analyst for consideration and translation.
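A minimal sketch of the query path described above, assuming each document carries its bigram set as metadata and that documents are ranked by how many query bigrams they share. The word-for-word dictionary lookup is a hypothetical stand-in for the translation step, and the `ar_*` strings are placeholders for stemmed Arabic tokens, not real Arabic.

```python
def bigrams(tokens):
    """Set of adjacent token pairs."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}

def translate_query(english_tokens, dictionary):
    """Hypothetical word-for-word lookup; unknown words are dropped."""
    return [dictionary[w] for w in english_tokens if w in dictionary]

def rank_documents(query_tokens, doc_bigrams):
    """Return ids of documents sharing at least one query bigram, best match first."""
    query = bigrams(query_tokens)
    scores = {doc_id: len(query & bg) for doc_id, bg in doc_bigrams.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, score in ranked if score > 0]

# Placeholder dictionary and corpus (illustrative only).
dictionary = {"peace": "ar_peace", "talks": "ar_talks", "resume": "ar_resume"}
corpus = {
    "doc1": ["ar_peace", "ar_talks", "ar_resume", "ar_cairo"],
    "doc2": ["ar_oil", "ar_price", "ar_talks"],
    "doc3": ["ar_peace", "ar_talks", "ar_fail"],
}
# Background processing: attach bigrams to each document as metadata.
doc_bigrams = {doc_id: bigrams(tokens) for doc_id, tokens in corpus.items()}

query = translate_query(["peace", "talks", "resume"], dictionary)
reduced_set = rank_documents(query, doc_bigrams)
```

Only the documents in `reduced_set` would go to the analyst for consideration and translation.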
Arabic Language Tool
• Status
– A native Arabic speaker, Eiman Alshammari, is the graduate student developing the tool.
– We met with the Arabic Language Data Mining Group in Cairo and secured cooperation and an Arabic language corpus.
• Professor Aly Fahmy, Dean of the Faculty of Computers and Information, Cairo University.
• Dr. Amir Atiya, Associate Professor of Computer Engineering, Cairo University.
• Dr. Ahmed S. Moussa, Program Manager, Smart Village.
– We met with representatives of King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.
• Dr. Turki Saud Mohammed Al-Saud, Vice President Research Institutes
• Dr. Mansour M. Alghamidi, Director, Computers and Electronics
• Dr. Ibrahim A. Al-Kharashi, Arabic Language Projects
– Project is underway … Eiman is anxious to graduate.
Geospatial Tool
• Basic Functionality
– Develop a geospatial visualization tool for both display and query.
– Locate source IP addresses.
– Locate imagery and video sources geospatially based on geospatial metadata.
– Query geospatial coordinates for multimedia documents in the database.
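A minimal sketch of the coordinate query, assuming each multimedia document carries latitude and longitude metadata and that the query is a simple bounding box. The `Document` record, the function names, and the approximate coordinates are illustrative assumptions. Boxes that cross the 180° meridian are not handled.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    kind: str      # e.g. "text", "image", "video"
    lat: float     # degrees north
    lon: float     # degrees east

def within_box(doc, south, west, north, east):
    """True if the document's coordinates fall inside the bounding box."""
    return south <= doc.lat <= north and west <= doc.lon <= east

def query_box(docs, south, west, north, east):
    """Return ids of all documents located inside the bounding box."""
    return [d.doc_id for d in docs if within_box(d, south, west, north, east)]

# Approximate, illustrative coordinates.
docs = [
    Document("img-cairo", "image", 30.04, 31.24),
    Document("txt-riyadh", "text", 24.71, 46.68),
    Document("vid-fairfax", "video", 38.85, -77.31),
]
hits = query_box(docs, south=20.0, west=25.0, north=35.0, east=50.0)
```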
Geospatial Tool
• Status
– Felix Mihai and In-ja Youn are graduate students developing this tool.
– The basic map functionality is available.
– IP locator is underway.
– Geospatially located satellite image database is also available (MISR imagery).
– Graduate student funding is a problem for Felix in particular.
Integration of Text and Images
• Functionality Desired
– Attach metadata to images and to text either endogenously or exogenously
• Be able to query an image for related text documents
– E.g., Who is this a picture of? What is this a picture of?
• Be able to query a text document to identify related images
– E.g., Find me a picture of this named person. Find me a picture of this facility.
Integration of Text and Images
• Two approaches
– The bigram proximity matrix (for text documents) and the gray level co-occurrence matrix (for images) have the same basic structure.
• Work is underway to develop and exploit this characteristic.
– Integrated text and image documents (such as news documents, video with voice) may be deconstructed to provide metadata for each other.
• Not yet implemented (Google image search does this for webpages).
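The shared structure of the two matrices can be sketched as follows: both count how often one symbol is immediately followed by another, with gray levels as the symbols for images and lexicon indices as the symbols for text. This sketch assumes a one-pixel horizontal offset for the gray level co-occurrence matrix and does not symmetrize it; the function names are hypothetical.

```python
def cooccurrence_matrix(sequence, levels):
    """m[a][b] counts how often symbol a is immediately followed by symbol b."""
    m = [[0] * levels for _ in range(levels)]
    for a, b in zip(sequence, sequence[1:]):
        m[a][b] += 1
    return m

def glcm_horizontal(image, levels):
    """Gray level co-occurrence matrix for the offset 'one pixel to the right'."""
    total = [[0] * levels for _ in range(levels)]
    for row in image:
        row_counts = cooccurrence_matrix(row, levels)
        for a in range(levels):
            for b in range(levels):
                total[a][b] += row_counts[a][b]
    return total

def bigram_proximity(token_ids, lexicon_size):
    """Bigram proximity matrix over integer-coded tokens; same counting rule."""
    return cooccurrence_matrix(token_ids, lexicon_size)

# A 3x3 image with gray levels 0..2.
image = [[0, 0, 1],
         [1, 2, 2],
         [0, 1, 2]]
glcm = glcm_horizontal(image, levels=3)

# A token stream "a b a b c" coded as a=0, b=1, c=2.
bpm = bigram_proximity([0, 1, 0, 1, 2], lexicon_size=3)
```

Both results are square count matrices indexed by symbol pairs, so any distance or feature defined on one could in principle be applied to the other.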
Integration of Text and Images
• Status
– Peter Mburu is the graduate student identified
to work on this part of the project
• Work has just begun … this is a hard problem.
• Peter is very bright, but not yet in candidacy.
Streaming Documents
• Functionality Desired
– Process streaming text documents.
• Vector space representation of a document.
• Streaming documents imply evolving lexicon.
– Recursive computation of document frequency.
– Use evolving lexicon.
• Track evolving sense of documents.
• Introduce new query terms.
• Classify new documents.
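A minimal sketch of recursive document-frequency computation over an evolving lexicon. Each arriving document updates the running counts, and words never seen before simply extend the lexicon, so no pass over earlier documents is needed. The TF-IDF weighting and the class and method names are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter

class StreamingTfIdf:
    """Running corpus size and document frequencies for a document stream."""

    def __init__(self):
        self.n_docs = 0            # number of documents seen so far
        self.doc_freq = Counter()  # word -> number of documents containing it

    def update(self, tokens):
        """Fold one new document into the running counts."""
        self.n_docs += 1
        self.doc_freq.update(set(tokens))  # each word counts once per document

    def weights(self, tokens):
        """Term weights for a document under the current (evolving) statistics."""
        tf = Counter(tokens)
        return {
            w: count * math.log(self.n_docs / self.doc_freq[w])
            for w, count in tf.items()
            if self.doc_freq[w] > 0
        }

model = StreamingTfIdf()
stream = [
    ["korea", "nuclear", "talk"],
    ["ireland", "talk", "talk"],
    ["korea", "reactor"],
]
lexicon_sizes = []
for doc in stream:
    model.update(doc)
    lexicon_sizes.append(len(model.doc_freq))  # lexicon grows as new words arrive
latest = model.weights(stream[-1])
```

Because the weights depend on counts that keep changing, a vector computed early in the stream goes stale; a real system would need a policy for when to recompute.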
Streaming Documents
[Equation not preserved; its annotated terms:]
• Number of times word w is in document d.
• Number of documents that contain word w.
• Size of the corpus.
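The three quantities labeled above are the ingredients of the standard TF-IDF term weight. A possible reconstruction under that assumption follows; the symbols tf, df, and N are introduced here for illustration, and the exact form used on the original slide may differ.

```latex
\mathrm{tfidf}(w, d) \;=\; \mathrm{tf}(w, d)\,\log\frac{N}{\mathrm{df}(w)}
% tf(w, d): number of times word w is in document d
% df(w):    number of documents that contain word w
% N:        size of the corpus
```

In a streaming setting both N and df(w) change as documents arrive, so the weight of a word in a given document is not fixed.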
Streaming Documents
• Status
– Graduate students, Elizabeth Leeds Hohman and Loulwah Al-Samait, are separately working on streaming documents.
• Elizabeth is developing a visual representation using graph theory of streaming document clusters.
• Loulwah is developing a method for understanding evolving sense of documents.
– Theory development is relatively advanced; system development is less so.
• Project has been underway about 4 months.
Work Left to Be Done!
• Lots!
– Progress is good and a number of bright students are working on the project
– The Arabic Text Tool should be in hand by December.
– The Geospatial Tool is fairly advanced, but Felix has no funding and is fragile.
– The Text and Image Integration is at early stages and is probably the most difficult conceptually.
– The Streaming Text Tools are advanced theoretically, but system development is not yet underway.
– Filtering tasks and system integration have not yet begun.
– But, we have only been at it for four months.
Contact Information
Edward J. Wegman, Ph.D.
Center for Computational Data Science
George Mason University, MS 6A2
Fairfax, VA 22030-4444
Yasmin H. Said, Ph.D.
Center for Computational Data Science
George Mason University, MS 6A2
Fairfax, VA 22030-4444