Data Mining: Efficiently Extracting Interpretable and ... · – Efficient data structures •...
Transcript of Data Mining: Efficiently Extracting Interpretable and ... · – Efficient data structures •...
1
Data Mining Efficiently Extracting Interpretable and Actionable Patterns
Tao Li School of Computing and Information Sciences
Florida International University
2
Data Mining CharacteristicsWe are drowning in information
But starving for knowledge------John Naisbett
Data Mining
An Engineering Process
An Interdisciplinary Field
A Collection of Functionalities
A Combination of Theory and Application
bull Data mining (knowledge discovery from data) ndash Efficiently extracting interpretable and actionable
patternsknowledge from large amount of data collectionsndash Patterns are Interesting
bull non-trivial implicit previously unknown and potentially useful
3
Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data
collectionsndash Performance efficiency effectiveness and scalability
bull Novel algorithmsndash High dimensional large volumendash Efficient data structures
bull Parallel distributed and incremental mining methods
ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction
ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction
bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting
ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior
bull Security Privacy and Data Integrity
bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system
management Personalized Search
4
My Research DirectionsResearch Interests
Data Mining Machine Learning and Information Retrieval
Data Mining Machine Learning and Information Retrieval
Practical ProblemsPractical Problems
Generic ToolsGeneric ToolsTheoryTheory
Tools and LibrariesWorkflow and System
ManagementEvent MiningLog ParserVisualization Tools
Personal Music AssistantTools for Bioinformatics
Gene SelectionTissue Classification
Sequence analysisLiterature Assistant
Algorithm IssuesMatrix-based Learning
Framework
Semi-supervised LearningLearning from
heterogeneous data types
Learning from non-vectorial data
Stream data mining
Incremental learning
ApplicationsComputing System Management
Help Desk ApplicationsSelf-managing SystemsSecurity Management
Music Information RetrievalPersonalized Music Service
Adaptive Workflow ManagementProcess MiningWorkflow Optimization
BioinformaticsMicroarray data analysisSequence Analysis
5
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
2
Data Mining CharacteristicsWe are drowning in information
But starving for knowledge------John Naisbett
Data Mining
An Engineering Process
An Interdisciplinary Field
A Collection of Functionalities
A Combination of Theory and Application
bull Data mining (knowledge discovery from data) ndash Efficiently extracting interpretable and actionable
patternsknowledge from large amount of data collectionsndash Patterns are Interesting
bull non-trivial implicit previously unknown and potentially useful
3
Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data
collectionsndash Performance efficiency effectiveness and scalability
bull Novel algorithmsndash High dimensional large volumendash Efficient data structures
bull Parallel distributed and incremental mining methods
ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction
ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction
bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting
ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior
bull Security Privacy and Data Integrity
bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system
management Personalized Search
4
My Research DirectionsResearch Interests
Data Mining Machine Learning and Information Retrieval
Data Mining Machine Learning and Information Retrieval
Practical ProblemsPractical Problems
Generic ToolsGeneric ToolsTheoryTheory
Tools and LibrariesWorkflow and System
ManagementEvent MiningLog ParserVisualization Tools
Personal Music AssistantTools for Bioinformatics
Gene SelectionTissue Classification
Sequence analysisLiterature Assistant
Algorithm IssuesMatrix-based Learning
Framework
Semi-supervised LearningLearning from
heterogeneous data types
Learning from non-vectorial data
Stream data mining
Incremental learning
ApplicationsComputing System Management
Help Desk ApplicationsSelf-managing SystemsSecurity Management
Music Information RetrievalPersonalized Music Service
Adaptive Workflow ManagementProcess MiningWorkflow Optimization
BioinformaticsMicroarray data analysisSequence Analysis
5
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
3
Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data
collectionsndash Performance efficiency effectiveness and scalability
bull Novel algorithmsndash High dimensional large volumendash Efficient data structures
bull Parallel distributed and incremental mining methods
ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction
ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction
bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting
ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior
bull Security Privacy and Data Integrity
bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system
management Personalized Search
4
My Research DirectionsResearch Interests
Data Mining Machine Learning and Information Retrieval
Data Mining Machine Learning and Information Retrieval
Practical ProblemsPractical Problems
Generic ToolsGeneric ToolsTheoryTheory
Tools and LibrariesWorkflow and System
ManagementEvent MiningLog ParserVisualization Tools
Personal Music AssistantTools for Bioinformatics
Gene SelectionTissue Classification
Sequence analysisLiterature Assistant
Algorithm IssuesMatrix-based Learning
Framework
Semi-supervised LearningLearning from
heterogeneous data types
Learning from non-vectorial data
Stream data mining
Incremental learning
ApplicationsComputing System Management
Help Desk ApplicationsSelf-managing SystemsSecurity Management
Music Information RetrievalPersonalized Music Service
Adaptive Workflow ManagementProcess MiningWorkflow Optimization
BioinformaticsMicroarray data analysisSequence Analysis
5
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
4
My Research DirectionsResearch Interests
Data Mining Machine Learning and Information Retrieval
Data Mining Machine Learning and Information Retrieval
Practical ProblemsPractical Problems
Generic ToolsGeneric ToolsTheoryTheory
Tools and LibrariesWorkflow and System
ManagementEvent MiningLog ParserVisualization Tools
Personal Music AssistantTools for Bioinformatics
Gene SelectionTissue Classification
Sequence analysisLiterature Assistant
Algorithm IssuesMatrix-based Learning
Framework
Semi-supervised LearningLearning from
heterogeneous data types
Learning from non-vectorial data
Stream data mining
Incremental learning
ApplicationsComputing System Management
Help Desk ApplicationsSelf-managing SystemsSecurity Management
Music Information RetrievalPersonalized Music Service
Adaptive Workflow ManagementProcess MiningWorkflow Optimization
BioinformaticsMicroarray data analysisSequence Analysis
5
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
5
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
6
Example of Call Center Cases
bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE
ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then
reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
7
Motivationbull Problems
ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous
problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be
accurate
bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction
issues improve customer satisfaction
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
8
Two Different Approaches
bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most
similar past cases
bull Problem Probingndash Find probequestion set which maximizes
utility (benefit ndash cost)
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
9
Semantic Feature ExtractionTo detect sentence relations
sentence 1
sentence 2SENNA
(NEC Parser)
frames of sentence 1
frames of sentence 2
finding word relations among words with the same semantic role in each frame pair using
WordNet
Features
Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]
Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb
Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B
verb
frame
FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical
hellip hellip
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
10
Case Study ndash Online Help Desk
Interface
Most related emails
15496txt 18697txt 11002txt3191txt
Can you switch these two computers
Slides from Dingding Wang
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
11
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
12
SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them
High return Low risk Low expense hellip
Information overloading problem users often can not form a query that returns answers exactly matching their needs
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
13
Existing Solutions ndashIterative Refinement
bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1
Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt
10 and expense lt 2ndash hellip
bull Problem too time consuming
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
14
Existing Solutions - Ranking
bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD
2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
15
Existing Solutions - Ranking
bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower
returns)ndash User 3 prefers a mix of high return funds and low
risk fundsndash One ranking function can not fit all
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
16
Existing Solutions -Categorization
Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)
The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)
User can navigate this tree and show records in a node
Categorization is complementary to ranking (rank records in a node)
Root(193)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
0 lt= 3 Yr return lt 10 (93)
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
17
Existing Solutions -Categorization
Problem differences between users are not consideredRoot(193)
0 lt= 3 Yr return lt 10 (93)
10 lt= 3 Yr return lt 20(71)
3 Yr return gt= 20 (29)
10 lt=1 Yr return lt 20 (55)
10 lt= Std lt 20 (48)
5 lt= Std lt 10 (7)
0 lt= 1 Yr return lt 10 (63)
0 lt= Std lt 5 (38)
User 1 3
User 2 3 User 2 3
This tree does not help user 2 and 3 much
We try to address this issue for categorization
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
18
Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference
(eg high return funds low risk funds)
bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
19
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
20
System Log Databull Performance Metric
ndash CPUMemorySwap utilization workload average response timebull Application-level Log
ndash Windows NT system log Websphere application log Oracle activitybull Failure data
ndash System crash dumps error reportsbull Network Traffic Data
ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces
bull Request Datandash Apache and Microsoft IIS log
bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures
bull Othersndash Network-based alert logs Probes
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
21
logs
logs
logs
HistoricalData Collection
Realtime Analysis
Offline Analysis
Knowledge Base
Real Time Management
Anomaly Detection
Fault Diagnosis
Problem Determination
CorrelationDependencyKnowledge
SummarizationVisualizationRule Construction
Temporal Pattern DiscoveryKnowledge Management
PlanningActions
The Integrated Framework on Data-Driven Computing System Management
Log Data Organization
Componentlogs Active Data Collection
Log Adapter
Situation Identification
and Categorization
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
22
Raw data-gtData table
Aggregationsummarization
Event Plots
Pattern MiningPattern Mining
OutagesTemporal patternsEvent bursts
Morning rebootingperiodic patterns
Weekly rebootingPeriodic patterns
Hosts
Knowledge validation completion and construction
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
23
A Case Study
State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
24
Pattern Visualization in the Case Study
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
25
ERN in the Case Study
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
26
Other Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
27
bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key
Performance Indicators) to support better decision-making
bull Focus leading indicator discovery
ndash Using operational metrics to adaptively manage amp improve workflow efficiency
bull Focus workflow exception detection and tracing
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
28
Examples of KPI Types
Leading Indicators(Value Drivers)
Lagging Indicators(Outcomes )
Number of clients that sales people meet face to face each week
Sales revenue
Complex repairs completed successfully during the first call or visit
Customer satisfaction
Number of parts for which orders exceed forecasts within 30 days of scheduled delivery
Per unit manufacturing costs
Number of customers who are delinquent paying their first bill
Customer churn
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
29
Methodology for Leading Indicator Learning
Selected Leading Indicators or Leading Indicator Combinations
clustering
Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality
hellip
hellip
hellipThe root KPI may be the key leading indicator
Time seriesproduction model(eg assumption data model)
Leading Indicator Selection based on Domain Knowledge
sequence KPIs
Reduced Time series
define KPIs
correlation
dimensionalityreduction
If no update modelIs Model Valid
New Business Concern(s)
If yes
Leading Indicators
Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
30
Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
31
Biological Data SourcesGene Expression Text LiteratureSequence Data
Clustering Protein ExpressionProtein-protein Interactions
Transcription FactorBinding SitesGene Ontology
Gene Clusters
Function Co-regulation
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
32
Problem We are Studyingbull Marker Gene Identification
ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms
bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR
bull RankGene 11
bull How to perform clustering using data from multiple diverse sources
bull How to perform clustering using partial prior knowledge
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
33
Some Projects in My Group
bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence
ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
34
Business Continuity Information Network (BCIN)
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
35
BCIN
bull Central questionndash How to help businesses recover and resume
operations quicker after a major hurricane (or other) disaster
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
3620082008--44--66 Florida International UniversityFlorida International University 3636
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
3720082008--44--66 Florida International UniversityFlorida International University 3737
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
3820082008--44--66 Florida International UniversityFlorida International University 3838
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
3920082008--44--66 Florida International UniversityFlorida International University 3939
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
40Florida International UniversityFlorida International University
Situation Analysis
John Smith Asst Manager Store 2345
Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs
Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff
Facility Non-Func 10107 1001am R Carr
FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia
33
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
41
Acknowledgementsbull Funding
ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM
bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)
bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-
42
Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli
- Data Mining Efficiently Extracting Interpretable and Actionable Patterns
- Research Challenges
- My Research Directions
- Some Projects in My Group
- Example of Call Center Cases
- Motivation
- Two Different Approaches
- Semantic Feature Extraction
- Case Study ndash Online Help Desk
- Some Projects in My Group
- SQL-Query-Result Navigation
- Existing Solutions ndash Iterative Refinement
- Existing Solutions - Ranking
- Existing Solutions - Ranking
- Existing Solutions - Categorization
- Existing Solutions - Categorization
- Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
- Some Projects in My Group
- System Log Data
- Knowledge validation completion and construction
- A Case Study
- Pattern Visualization in the Case Study
- ERN in the Case Study
- Other Projects in My Group
- Examples of KPI Types
- Methodology for Leading Indicator Learning
- Some Projects in My Group
- Biological Data Sources
- Problem We are Studying
- Some Projects in My Group
- Business Continuity Information Network (BCIN)
- BCIN
- Acknowledgements
- Question
-