Pattern Analysis & Machine Intelligence Research Group
UNIVERSITY OF WATERLOO
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
Theme Leader: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
PAMI Research Group, University of Waterloo
Knowledge Extraction and LO Mining
GOAL:
Develop data mining and knowledge extraction techniques and tools for learning object (LO) repositories.
These tools can provide context and facilitate interaction, efficient organization and delivery, navigation, and retrieval.
Theme Overview
LO Mining:
Knowledge Extraction
  From Text
    Syntactic: keyword-based, keyphrase-based
    Semantic: concept-based
  From Images: image features, shape features
  From Text + Images: describing images with text, enriching text with images
Tagging and Organizing
  Classification (MCS, Data Partitioning, Imbalanced Classes)
  Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
Matching and Ranking
  LO Similarity and Ranking
  Association Rules / Social Networks
  Reinforcement Learning
  Specialized / Personalized Search
Types of Data in LORNET
LCMS: Course > Module > Lesson > LO
  Subject matter: Text, Images, Flash, Applets, Metadata, Interaction Logs
Discussion Board: Board > Thread > Post
  Discussions: Text, Interaction Logs
LOR: Metadata Record
  LO descriptors: Metadata
TELOS: Semantic Layer, Resource
  Resources: Metadata, Semantic References
LO Mining Scenarios
Tasks per environment, grouped as Knowledge Extraction, Tagging / Organizing, and Matching / Ranking.
TELOS
  Knowledge Extraction: Ontology Construction
  Tagging / Organizing: Grouping Components
  Matching / Ranking: Finding & Ranking Components
E-Learning Design Environment (LMS)
  Knowledge Extraction: Extracting LO Summary, Extracting LO Concepts, Extracting Image Description
  Tagging / Organizing: Grouping LOs
  Matching / Ranking: Finding Similar LOs, Ranking LOs
Learning Object Content MS (LCMS)
  Knowledge Extraction: Summarizing Documents, Extracting Concepts from Documents
  Tagging / Organizing: Grouping Documents, Tagging Documents
  Matching / Ranking: Finding Similar Topics, Finding Similar Profiles, Building Social Networks, Detecting Plagiarism
LO Repository
  Knowledge Extraction: Extracting Metadata, Extracting Ontologies
  Tagging / Organizing: Classifying LOs, Building LO Clusters, Detecting Duplicate LOs
  Matching / Ranking: Ranking LOs, Metadata Matching
LO Mining and Knowledge Extraction
Applications / Services
  LO Automatic Tagging, LO Grouping/Ranking, LO Similarity, LO Summarization, LO Recommendation, ...
Data Mining Algorithms
  Text Mining: Parsing, Tokenization, Keyword/Phrase Extraction
  Semantic Analysis: NLP, Ontologies, Knowledge Representation
  Categorization: Classification, Clustering
  Learning from Interactions: Reinforcement Learning, Multi-Agent Systems
  Image Mining: Feature Extraction, Shape Analysis, Indexing and Retrieval
Data Mining Foundations
  Math & Statistics: Vectors, Matrices, Statistics
  Data Representation: Features, Feature Types, Normalization, Discretization
  Data Structures: Arrays, Lists, Trees, Graphs
  Data Access: Data Sources, Data Readers/Writers, Data Converters
Projects Overview
Data: Text Document, Image, Interaction Logs
Information Extraction: Analyzing content to extract relevant information
  Keyword Extraction
  Summarization
  Concept Extraction
  Social Network Analysis
Categorization: Organizing LOs according to their content
  Classification: Traditional, MCS, Imbalanced
  Clustering: Traditional, Ensembles, Distributed
Personalization: Providing user-specific results
  Reinforcement Learning: Traditional, Opposition-based
Image Mining: Describing and finding relevant images
  CBIR: Traditional, Fusion-based
Integration and Applications
  In Progress
  Publications
  Theme and Industry Collaboration
  Software Components
Information Extraction: Summarization
LO Content Package Summarization
Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
The most relevant sentences are extracted for each document.
Planned functionality: summarization of whole modules or lessons (as opposed to single documents).
Benefits: Provide a summarized overview of learning objects for quick browsing and access to learning material.
Scenarios: Learning Management Systems can call the summarization component to produce summaries for content packages.
Data courtesy of the University of Saskatchewan.
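The slide does not say which weighting scheme or ranking function the summarizer uses. As a rough sketch of extractive summarization by term weighting and sentence ranking, the Python below assumes a smoothed TF-IDF weight (term frequency within the document, rarity across the collection) and scores each sentence by its average term weight. The function names `split_sentences`, `tokenize`, and `summarize` are illustrative and are not part of the Theme 4 components.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens; a real system would also drop stop words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(documents, top_n=2):
    """Return the top_n highest-scoring sentences of each document, in original order."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: number of documents in the collection containing each term.
    df = Counter(t for toks in doc_tokens for t in set(toks))
    summaries = []
    for text, toks in zip(documents, doc_tokens):
        tf = Counter(toks)
        # Assumed weighting: term frequency times a smoothed inverse document frequency.
        weight = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        sentences = split_sentences(text)
        scored = []
        for idx, sent in enumerate(sentences):
            words = tokenize(sent)
            # Average weight, so that long sentences are not favoured automatically.
            score = sum(weight.get(w, 0.0) for w in words) / len(words) if words else 0.0
            scored.append((score, idx))
        best = sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]
        summaries.append([sentences[i] for _, i in sorted(best, key=lambda p: p[1])])
    return summaries
```

Calling `summarize(texts, top_n=3)` on the text files extracted from one content package would give a three-sentence summary per file; summarizing a whole module would need an extra step that ranks sentences across files.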
Information Extraction: Concept Extraction
[Figure: concept extraction pipeline from Text to Concepts. Language-dependent components: Text Pre-processor, Sentence Separator, Natural Language Processing (POS Tagger, Syntax Parser, Semantic Role Labeler / Semantic Parser). Language-independent components: Concept-based Model, consisting of a Concept-based Statistical Analyzer (tf: term frequency; ctf: conceptual term frequency) and a Conceptual Ontological Graph (COG) Representation.]

F-measure of Hierarchical Clustering
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.723         0.925           +27.94%
ACM       0.697         0.918           +31.70%
Brown     0.581         0.906           +55.93%

Entropy of Hierarchical Clustering
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.251         0.012           -95.21%
ACM       0.317         0.043           -86.43%
Brown     0.385         0.018           -95.32%

Precision of Search
Dataset   Single-Term   Concept-based   Improvement
Cran      0.536         0.901           +68.09%
Reuters   0.591         0.897           +51.77%

Recall of Search Result
Dataset   Single-Term   Concept-based   Improvement
Cran      0.486         0.827           +70.16%
Reuters   0.452         0.841           +86.06%

Panel labels: Concept-Based Statistical Analyser; Conceptual Ontological Graph (COG) Ranking
Information Extraction: Keyword Extraction
Semantic Keyword Extraction
Tasks
• Developing tools and techniques to extract semantic keywords, toward facilitating metadata generation.
• Developing algorithms to enrich metadata (tags), which can be applied in index-based multimedia retrieval.
Progress
• Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo-ontology).
Makrehchi, M. and Kamel, ICDM07, WI 07
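The definition of the inclusion index is not given on the slide, so it is not reproduced here. To illustrate only what "asymmetric dependency between terms" means, the sketch below uses a much simpler stand-in: the conditional document co-occurrence P(b | a), which generally differs from P(a | b). The functions `inclusion` and `broader_term` and the toy documents are hypothetical.

```python
def inclusion(docs, a, b):
    """Stand-in measure: fraction of documents containing term a that also contain b."""
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if b in d) / len(with_a)


def broader_term(docs, a, b):
    """Guess which term is more general: the one the other term 'points to' more strongly."""
    a_to_b = inclusion(docs, a, b)
    b_to_a = inclusion(docs, b, a)
    if a_to_b == b_to_a:
        return None  # symmetric dependency: no direction can be inferred
    return b if a_to_b > b_to_a else a


# Toy collection: each document is represented as a set of terms.
toy_docs = [
    {"clustering", "kmeans"},
    {"clustering", "ensemble"},
    {"clustering"},
    {"kmeans", "clustering"},
]
```

In the toy collection every document mentioning "kmeans" also mentions "clustering", but only half of the "clustering" documents mention "kmeans". An asymmetry of this kind is what a taxonomy extractor can use to place "clustering" above "kmeans".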
Information Extraction: Keyword Extraction
Rule-based Keyword Extraction
Learn rules to find keywords in English sentences.
Rules represent sentence fragments
• Specific enough for reliable keyword extraction.
• General enough to be applied to unseen sentences.
Rule generalization
• Begin with an exact sentence fragment.
• Merge it with another by moving the words that differ to the lowest common level in the part-of-speech hierarchy.
• Keep the merged rule if it does not reduce the precision and recall of keyword extraction; keep the original rules otherwise.
Keyword extraction
• Find the sequence of rules that best covers an unseen sentence.
• Extract keywords according to the rules.
Results
• Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination: learns 20 rules from the first 50 training rules, and 13 additional rules from the next 220 training rules.
• Both precision and recall values increase during training: precision (blue) increases 10%, and recall (red) shows a slight upward trend.
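The merge step of rule generalization can be sketched as follows. The part-of-speech hierarchy, its node names, and the rule encoding (a tuple of hierarchy nodes, one per word position) are all hypothetical; the slide does not specify them. The precision/recall check that decides whether a merged rule is kept is left out.

```python
# Hypothetical toy hierarchy: each node maps to its parent; "ANY" is the root.
PARENT = {
    "cat": "NOUN", "dog": "NOUN", "report": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
    "the": "DET", "a": "DET",
    "NOUN": "CONTENT", "VERB": "CONTENT",
    "DET": "FUNCTION",
    "CONTENT": "ANY", "FUNCTION": "ANY",
}


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def lowest_common_level(x, y):
    """Most specific hierarchy node that covers both x and y."""
    covers_y = set(ancestors(y))
    for node in ancestors(x):
        if node in covers_y:
            return node
    return "ANY"  # unknown words only share the root


def merge_rules(rule_a, rule_b):
    """Generalize two equal-length rules position by position; None if lengths differ."""
    if len(rule_a) != len(rule_b):
        return None
    return tuple(lowest_common_level(a, b) for a, b in zip(rule_a, rule_b))


def matches(rule, words):
    """A rule covers a fragment if each rule slot is the word or one of its ancestors."""
    return len(rule) == len(words) and all(
        slot in ancestors(word) for slot, word in zip(rule, words)
    )
```

Merging `("the", "cat", "sleeps")` with `("the", "dog", "runs")` keeps the shared word and lifts the two differing positions to `NOUN` and `VERB`, so the merged rule also covers the unseen fragment "the report runs".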
Categorization: Ensemble-based Clustering
Consensus Clustering
• Categorization of learning objects using proposed consensus clustering algorithms.
• The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
• Consensus clustering can offer several advantages over a single data clustering: improved clustering accuracy, better scalability of clustering algorithms to large volumes of data objects, and greater robustness through reduced sensitivity to outlier data objects or noisy attributes.
Tasks
• Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.
• Development of consensus algorithms to aggregate the individual clusterings.
• Development of solutions for the cluster symbolic-label matching problem.
• Empirical analysis on real-world data and validation of the proposed method.
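One common way to aggregate an ensemble is a co-association (evidence accumulation) matrix. It is shown here only as a generic illustration and is not claimed to be one of the group's proposed consensus algorithms. The sketch counts how often each pair of objects is clustered together and then links pairs that agree in more than an assumed threshold fraction of the clusterings. Because only pairs are compared, the arbitrary cluster labels of the individual clusterings never need to be matched.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n = len(clusterings[0])
    return {
        (i, j): sum(1 for labels in clusterings if labels[i] == labels[j]) / len(clusterings)
        for i, j in combinations(range(n), 2)
    }


def consensus(clusterings, threshold=0.5):
    """Consensus labels: connected components of the 'co-clustered often enough' graph."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), fraction in co_association(clusterings).items():
        if fraction > threshold:
            parent[find(j)] = find(i)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    seen = {}
    return [seen.setdefault(find(x), len(seen)) for x in range(n)]
```

For example, `consensus([[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1]])` recovers two groups even though the second clustering uses swapped labels and the third misplaces one object.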
Categorization using cluster ensemble
Dataset                     # samples   # attributes   # classes   K-means mean error rate (%)   Ensemble mean error rate (%)
Synthetic1                  1000        8              5           17.41                         0
Yahoo! (text)               2340        1458           6           38.23                         16.24
Texture (image)             5500        40             11          37.99                         11.54
Optical Digit Recognition   500         64             10          27.31                         16.40
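How the error rates in this table were computed is not stated. Since cluster labels are arbitrary symbols, one common convention, assumed here, is to report the error under the best one-to-one matching of clusters to classes. The brute-force sketch below tries every matching, which is practical only for a handful of clusters; larger problems need an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def clustering_error_rate(true_labels, cluster_labels):
    """Error rate in percent under the best one-to-one cluster-to-class matching."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # If there are more clusters than classes, surplus clusters map to None
    # and all of their members count as errors.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best_correct = 0
    for assignment in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t
        )
        best_correct = max(best_correct, correct)
    return 100.0 * (1 - best_correct / len(true_labels))
```

With true classes `[0, 0, 0, 1, 1, 1]` and clusters `["b", "b", "a", "a", "a", "a"]`, the best matching is b to 0 and a to 1, which leaves one of six objects misplaced.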
Categorization: Distributed Clustering
• Peer nodes are arranged into groups called “neighborhoods”.
• Multiple neighborhoods are formed at each level of the hierarchy.
• The size of each neighborhood is determined through a network partitioning factor.
• Each neighborhood has a designated supernode.
• Supernodes of level h form the neighborhoods for level h+1.
• Clustering is done within neighborhood boundaries, then merged up the hierarchy through the supernodes.
Benefits
• Significant speedup over centralized clustering and flat peer-to-peer clustering.
• Multiple levels of clusters.
• Distributed summarization of clusters using CorePhrase keyphrase extraction.
Scenarios
• Distributed knowledge discovery in hierarchical organizations.
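[Figure: HP2PC Architecture. Levels h = 0, 1, 2, ..., H-1, H up to the Root; each level is partitioned into neighborhoods (Q), each with a supernode (S).]
[Figure: HP2PC Example, 3-level network, 16 nodes.]
h = 0, β = 0.2: $P^{(0)} = \{p^{(0)}_1, \dots, p^{(0)}_{16}\}$, $Q^{(0)} = \{Q^{(0)}_1, \dots, Q^{(0)}_4\}$
h = 1, β = 0.33: $P^{(1)} = \{p^{(1)}_1, p^{(1)}_2, p^{(1)}_3, p^{(1)}_4\}$, $Q^{(1)} = \{Q^{(1)}_1, Q^{(1)}_2\}$
h = 2, β = 0: $P^{(2)} = \{p^{(2)}_1, p^{(2)}_2\}$, $Q^{(2)} = \{Q^{(2)}_1\}$
h = 3
Hierarchical P2P Document Clustering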
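The construction of the hierarchy can be sketched as below. Three things are assumptions rather than details from the slides: the rule that turns the partitioning factor β into a neighborhood count, 1 + floor(β(n - 1)), chosen because it reproduces the 16 → 4 → 2 → 1 example; the split of peers into contiguous, near-equal neighborhoods; and the choice of the first peer of each neighborhood as its supernode. The clustering and merging performed at each level are not shown.

```python
import math


def build_hierarchy(peers, betas):
    """Build one level per partitioning factor; supernodes become the next level's peers."""
    levels = []
    current = list(peers)
    for beta in betas:
        n = len(current)
        # Assumed sizing rule: beta = 0 gives one neighborhood, beta = 1 one per peer.
        n_neighborhoods = 1 + math.floor(beta * (n - 1))
        # Split the peers into contiguous, near-equal neighborhoods.
        neighborhoods = [
            current[k * n // n_neighborhoods:(k + 1) * n // n_neighborhoods]
            for k in range(n_neighborhoods)
        ]
        # Assumption: the first peer of each neighborhood acts as its supernode.
        supernodes = [q[0] for q in neighborhoods]
        levels.append(
            {"peers": current, "neighborhoods": neighborhoods, "supernodes": supernodes}
        )
        current = supernodes
    return levels


# The 16-node example, with 1/3 standing in for the rounded 0.33 on the slide.
example = build_hierarchy(range(1, 17), [0.2, 1 / 3, 0])
```

Each entry of `example` lists the peers, neighborhoods, and supernodes of one level; the single supernode of the last level plays the role of the root.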
Categorization: Multiple Classifier Systems
Tasks
• To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles).
• To develop evaluation measures in order to estimate various types of cooperation in the system.
• To gain insight into the impact of changes in the cooperative components with respect to system performance, using the proposed evaluation measures.
• To apply these findings to optimize existing ensemble methods.
• To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computational complexity.
Progress
• Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
• Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
• Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
• Investigated the applications of the proposed training methods (CDS and CO-CDS) to LO classification.
Categorization: Imbalanced Class Distribution
Objective
• Advance classification of multi-class imbalanced data.
Tasks
• To develop the cost-sensitive boosting algorithm AdaC2.M1.
• To improve the identification performance on the important classes.
• To balance classification performance among several classes.
Categorization: Imbalanced Class Distribution
Class Distribution
Ind.   Size   Dist.
C1     49     7.84%
C2     288    46.08%
C3     288    46.08%

Performance of Base Classification and AdaBoost (R = recall, P = precision, F = F-measure)
                   C4.5                HPWR (Od=3)
Class   Meas.   Base    AdaBoost    Base    AdaBoost
C1      R       0       5.11        10.70   44.06
C1      P       N/A     6.50        11.82   32.89
C1      F       N/A     5.84        10.83   35.84
C2      R       73.21   92.28       88.31   87.43
C2      P       69.53   88.75       86.79   91.99
C2      F       72.29   90.38       87.43   89.64
C3      R       67.94   91.36       87.63   88.42
C3      P       73.89   87.88       87.07   89.91
C3      F       71.91   89.42       86.99   89.03
G-measure       0       11.46       33.32   68.50

Balanced performance among classes, evaluated by G-mean
                   C4.5                            HPWR (Od=3)
Class   Meas.   Base    AdaBoost   AdaC2.M1    Base    AdaBoost   AdaC2.M1
C1      R       0       5.11       77.58       10.70   44.06      65.72
C1      P       N/A     6.50       14.12       11.82   32.89      30.83
C2      R       73.21   92.28      64.73       88.31   87.43      83.12
C2      P       69.53   88.75      97.24       86.79   91.99      91.38
C3      R       67.94   91.36      65.23       87.63   88.42      83.95
C3      P       73.89   87.88      93.22       87.07   89.91      90.81
G-mean          0       11.46      68.42       33.32   68.50      76.08
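The slides do not define G-mean. The sketch below assumes the usual multi-class definition, the geometric mean of the per-class recalls, which agrees with the C4.5 base column: a recall of 0 on C1 forces a G-mean of 0 however well the other classes are classified. Recomputing G-mean from the other columns will not necessarily reproduce the reported values, since it is not stated how those were aggregated across runs. The function names are illustrative.

```python
def per_class_recall(true_labels, predicted_labels):
    """Recall of each class: correctly predicted members divided by class size."""
    recalls = {}
    for cls in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == cls)
        hits = sum(
            1 for t, p in zip(true_labels, predicted_labels) if t == cls and p == cls
        )
        recalls[cls] = hits / total
    return recalls


def g_mean(recalls):
    """Geometric mean of a sequence of per-class recalls."""
    product = 1.0
    for r in recalls:
        product *= r
    return product ** (1.0 / len(recalls))
```

A classifier with recalls 0.0, 0.9, and 0.9 scores a G-mean of 0, whereas plain accuracy on a 49/288/288 class split would still look respectable. This is why G-mean suits the goal of balanced performance among classes.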
Personalization
Opposition-based Reinforcement Learning (ORL) for Personalizing Image Search
• Goal: a reliable technique that assists users and facilitates and enhances the learning process.
• The personalized ORL tool helps the user find the desired images among the search results.
• The tool gathers the images returned by a search and selects a sample of them.
• By presenting the sample and interacting with the user, it learns the user’s preferences.
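The slides give no algorithmic detail, so the following is only a simplified, stateless illustration of the opposition idea as it is commonly described: whenever the user's feedback on one action is observed, the opposite action is updated at the same time with the opposite reward, so that each interaction yields two updates. The action names, the `opposite` mapping, and the learning rate are hypothetical.

```python
def update_preferences(q, action, reward, opposite, alpha=0.5):
    """Move the value of the taken action toward the reward, and the value of
    its opposite action (if one is defined) toward the negated reward."""
    q[action] += alpha * (reward - q[action])
    opp = opposite.get(action)
    if opp is not None:
        q[opp] += alpha * (-reward - q[opp])
    return q


# Hypothetical image styles the tool could sample from the search results.
q_values = {"bright": 0.0, "dark": 0.0, "sketch": 0.0}
opposite_of = {"bright": "dark", "dark": "bright"}

# The user liked a bright sample image (reward +1).
update_preferences(q_values, "bright", 1.0, opposite_of)
```

After this single piece of feedback the estimate for "bright" rises and the estimate for "dark" falls by the same amount, while "sketch", which has no defined opposite, is unchanged.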
Image Mining: CBIR
Content-based image retrieval
Build an IR system that can retrieve images based on textual cues, image content, and NL queries.
[Figure: Image Retrieval Tool Set. Data: images, rich documents. Inputs: Query Image (QI), Query Text (QT), Query Document. Outputs: documents that contain QI, images that match QI, images that contain QT, NL description of an image, automated image tagging.]
Illustrative Example
[Figure: comparison of IZM, FD, MTAR, and the proposed approach. Accuracy labels shown: 70%, 55%, 60%, 95%.]
Experimental Results (Cont’d)
[Figure: the performance of the proposed approach.]
Integration and Applications
Progress
Finished core parts of the common data mining framework.
Built components and services from theme researchers’ work around the data mining framework.
Provided documentation for the data mining framework and software components.
Launched a web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/
Integration and Applications
Progress
Core parts of the common data mining framework are available, including:
• Vector and matrix manipulation.
• Document parsing and tokenization.
• Statistical term and sentence analysis.
• Similarity calculation using multiple distance functions.
• IMS Content Package compliant parser.
Components and tools built around the common data mining framework:
• Metadata extraction from single documents; supports Dublin Core encoding.
• Document similarity calculation using cosine similarity.
• Single document and content package summarization.
• Building of standard text datasets from large document collections.
Integration with TELOS:
• Developed C# TELOS connector for integrating Theme 4 components.
• Worked on component manifest specification with Theme 6.
• Provided metadata extraction as part of a complete scenario for TELOS components integration.
• The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.
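As an illustration of the document similarity calculation mentioned above, the sketch below computes cosine similarity over raw term-frequency vectors. It is written in Python for brevity and is not the Theme 4 component; the tokenizer and the use of unweighted counts are assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase alphanumeric tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

`cosine_similarity(term_vector(doc1), term_vector(doc2))` is close to 1 for documents with the same term proportions and 0 for documents with no terms in common. Because the measure depends only on the direction of the vectors, document length does not affect it.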
Industry Collaboration
Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
Vestech provided opportunities for researchers to work on speech technologies.
Desire2Learn opened job opportunities for LORNET researchers.
Software Components
Scenarios for Use of Software Components (Environment, Data Types, Tasks)
Legend: Integrated, Ready, In Progress, Year 5
Learning Object Repository
  Data types: Metadata, Structured Text, Categorical
  Tasks:
  • Automatic metadata extraction
  • LO automatic classification
  • LO organization through clustering
  • Multiple organization strategies through cluster ensembles
e-Learning Environment
  Data types: Structured Text, Images, Object Relationships, Context
  Tasks:
  • Extracting concepts from LOs
  • Summarizing documents
  • Grouping LOs
  • Tagging LOs
  • Discovering similar topics
  • Discovering similar peers
  • Building social networks
  • Detecting plagiarism
  • LO recommendation using similarity ranking
  • Personalization / specialization through reinforcement learning
TELOS
  Data types: Metadata, Ontology
  Tasks:
  • Ontology construction and unification
  • Finding relations between components
  • Ranking components
  • Grouping components
  • Tagging components
Overview of Components
General Tools
  • C# Connector for TELOS
  • Common Data Mining Framework
Standard Text Mining Tools
  • Metadata Extractor
  • Document Summarizer
  • Content Package Summarizer
  • Document Similarity
  • LO Recommender
  • Metadata Harvester
  • Keyword Extractor
  • Taxonomy Extractor
  • Metadata Enrichment Tools
Concept-based and Semantic Text Mining Tools
  • Metadata Extractor
  • LO Search Engine
  • Document Similarity
  • Document Classifier
  • Document Clusterer
  • Semantic-based Ontology Representation
  • Semantic Metadata Matching
  • POS Rule-Learning System
  • Triplet Representation System
Categorization Tools
  • LO Classifier
  • LO Multiple Classifier
  • LO Clusterer
  • LO Ensemble Clusterer
  • LO Consensus Clusterer
  • LO Distributed Clusterer
User-centric Tools
  • Personalized Search Engine
  • Social Network Learner
Image Mining Tools
  • Content-based Image Search
  • Personalized Image Search
  • Consensus-based Fusion for Image Retrieval
Publications
Project                                          Papers (accepted / published)   Papers (submitted / in prep)   Theses (completed / in progress)
4.1 Information Extraction from Text             11                              7                              3/2
4.2 Semantic Knowledge Synthesis from Text       10                              4                              4/1
4.3 Knowledge Discovery through Categorization   12                              10                             4/1
4.4 Knowledge from Interaction                   8                               3                              1/2
4.5 Knowledge from Image Mining                  10                              3                              2/1
Total                                            51                              27                             14/7 = 21
Theme 4 Team
Leader: M. Kamel
PIs: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc. PIs: Wong, DiMarco
Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury
Funding: CRC/CFI/OIT, NSERC, PAMI Lab, PDS, Vestech, Desire2Learn
Graduated:
  R. Khoury, PhD 07
  L. Chen, PhD 07
  M. Makrehchi, PhD 07
  K. Hammouda, PhD 07
  R. Dara, PhD 07
  Y. Sun, PhD 07
  K. Shaban, PhD 06
  Y. Sun, PhD 06
  M. Hussin, PhD 05
  Jan Bakus, PhD 05
  A. Adegorite, MASc 04
  A. Khandani, MASc 05
  S. Podder, MASc 04
Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo
Canada
www.pami.uwaterloo.ca
www.pami.uwaterloo.ca/projects/lornet/software/
www.pami.uwaterloo.ca/kamel.html publications