
III 1

Sorin Alexe RUTCOR, Rutgers University, Piscataway, NJ e-mail: salexe@rutcor.rutgers.edu URL: rutcor.rutgers.edu/~salexe

Datascope - a new tool for Logical Analysis of Data (LAD)

DIMACS Mixer Series, September 19, 2002

III 2

LAD - Problem

Dataset

Hidden Function

LAD Approximation

III 3

LAD - Patterns

Positive Pattern

Negative Pattern
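A minimal sketch of what a positive pattern could look like in code, assuming a pattern is a conjunction of interval conditions on a few features that covers at least one positive observation and no negative one (the pure-pattern case). The toy data and the names `covers` and `is_positive_pattern` are hypothetical and are not taken from Datascope.

```python
from typing import Dict, List, Tuple

# A pattern is modeled as a conjunction of interval conditions:
# feature name -> (lower bound, upper bound), both bounds inclusive.
Pattern = Dict[str, Tuple[float, float]]
Observation = Dict[str, float]


def covers(pattern: Pattern, obs: Observation) -> bool:
    """True if the observation satisfies every condition of the pattern."""
    return all(lo <= obs[f] <= hi for f, (lo, hi) in pattern.items())


def is_positive_pattern(
    pattern: Pattern, positives: List[Observation], negatives: List[Observation]
) -> bool:
    """Pure positive pattern: covers some positive and no negative observation."""
    hits_positive = any(covers(pattern, o) for o in positives)
    hits_negative = any(covers(pattern, o) for o in negatives)
    return hits_positive and not hits_negative


# Hypothetical toy data.
positives = [{"x1": 3.5, "x2": 1.0}, {"x1": 4.0, "x2": 0.5}]
negatives = [{"x1": 1.0, "x2": 1.2}, {"x1": 3.8, "x2": 2.5}]

# "x1 >= 3.0 and x2 <= 1.5"
p = {"x1": (3.0, float("inf")), "x2": (float("-inf"), 1.5)}
print(is_positive_pattern(p, positives, negatives))
```

A negative pattern is the mirror image: call the same check with the two observation lists swapped.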

III 4

LAD - Theories, Models, Classifications

Positive Theory

Negative Theory

Model

III 5

Datascope Functions

Support Set Identification

Space Discretization

Pattern Detection

Model Construction

Discriminant / Prognostic Index

Classification

Feature Analysis

III 6

Datascope Dataflow

[Diagram: dataflow between the following components]

Raw Data

Pre-Processing

Discretization

Cutpoints, Support Set

Significant Features

Feature Analysis

Pandect Generation

Pattern Report

Pattern Space

Theories/Models

Discriminant Construction

User Excel Model

Diagnosis, Prognosis

Risk Stratification

Matlab Solver

Internal Solver

III 7

1. Support Set Identification

Selects Small Subset of Significant Features

Preserves Hidden Knowledge

Feature Ranking Criteria:

Statistical Correlation with Outcome

Combinatorial Entropy

Distribution Monotonicity

Class Separation

Envelope Eccentricity

E.g., 10 proteins selected out of 15,144
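As an illustration of one of the ranking criteria listed above (statistical correlation with outcome), the sketch below scores each feature by the absolute Pearson correlation with a 0/1 outcome and keeps the top k. It is not Datascope's procedure; the function names and the toy columns are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features(
    columns: Dict[str, List[float]], outcome: Sequence[float], k: int
) -> List[str]:
    """Names of the k features with the largest |correlation| with the outcome."""
    scores = {name: abs(pearson(col, outcome)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three features, four observations.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0],  # increases with the outcome
    "f2": [5.0, 5.0, 5.0, 5.0],  # constant, carries no information
    "f3": [2.0, 1.0, 2.0, 3.0],  # moderately related
}
outcome = [0, 0, 1, 1]
print(rank_features(columns, outcome, k=2))
```

The other criteria on the slide (entropy, monotonicity, class separation, envelope eccentricity) would slot in as alternative scoring functions in place of `pearson`.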

III 8

Data

Spreadsheet Oriented: OLE (via Clipboard) / Excel Spreadsheet / dBase tables

Training / Test Generation: Bootstrap, k-Folding, Jackknife

New Features

Correlation
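The three training/test generation schemes named above can be sketched as index generators. This is a generic illustration using textbook definitions, not Datascope's implementation; the function names and the `Split` type are hypothetical.

```python
import random
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, test indices)


def bootstrap(n: int, rng: random.Random) -> Split:
    """Train on n draws with replacement; test on the rows never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


def k_fold(n: int, k: int, rng: random.Random) -> Iterator[Split]:
    """Shuffle once, then hold out each of the k folds in turn."""
    idx = list(range(n))
    rng.shuffle(idx)
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        yield [i for i in idx if i not in held_out], test


def jackknife(n: int) -> Iterator[Split]:
    """Leave one observation out at a time."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


rng = random.Random(0)
folds = list(k_fold(10, 5, rng))
boot_train, boot_test = bootstrap(10, rng)
print(len(folds), len(boot_train), len(list(jackknife(4))))
```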

III 9

Data: Training/Test

III 10

2. Space Discretization

Criteria:

Entropy

Correlation with Output

Bins (equipartitioning)

Intervals

Clustered

Class Separation

Parameter Choice:

User Defined

Minimizing Support Set

Quality Measures:

Entropy

Separability
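To make the entropy criterion concrete, here is a small sketch that picks a single cutpoint for one numeric feature by minimizing the size-weighted class entropy of the two resulting intervals. It is a standard textbook construction and only an assumption about what an entropy-based discretization might do; `entropy` and `best_cutpoint` are hypothetical names.

```python
import math
from typing import Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = sum(1 for y in labels if y == c) / n
        h -= p * math.log2(p)
    return h


def best_cutpoint(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (cutpoint, weighted entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutpoint fits between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best[1]:
            best = (cut, h)
    return best


# Hypothetical toy feature that separates the two classes cleanly.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
print(best_cutpoint(values, labels))
```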

III 11

[Figure: discretization results by criterion. Panels: Entropy, Correlation with Output, Bins, Intervals, Clustered, Class Separation]

III 12

3. Generation of Maximal Patterns

Pattern Type Selection:

Prime

Cones

Intervals

Spanned

Parameter Bound Settings:

Prevalence: % of positive observations, % of negative observations

Homogeneity: on positive patterns, on negative patterns

Degree

Post-Generation Filters:

By Characteristics

Maximality

Strongness
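The three bounded parameters can be illustrated with a short sketch. It assumes the usual LAD reading of the terms: degree is the number of conditions in the pattern, prevalence is the share of one class's observations that the pattern covers, and homogeneity is the share of covered observations that belong to the pattern's own class. The data and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Pattern = Dict[str, Tuple[float, float]]  # feature -> inclusive (low, high)
Observation = Dict[str, float]


def covers(p: Pattern, o: Observation) -> bool:
    return all(lo <= o[f] <= hi for f, (lo, hi) in p.items())


def pattern_stats(
    p: Pattern, positives: List[Observation], negatives: List[Observation]
) -> Dict[str, float]:
    """Degree, prevalence on each class, and homogeneity as a positive pattern."""
    cov_pos = sum(covers(p, o) for o in positives)
    cov_neg = sum(covers(p, o) for o in negatives)
    covered = cov_pos + cov_neg
    return {
        "degree": len(p),
        "prevalence_pos": cov_pos / len(positives),
        "prevalence_neg": cov_neg / len(negatives),
        "homogeneity_pos": cov_pos / covered if covered else 0.0,
    }


def within_bounds(stats: Dict[str, float], max_degree: int,
                  min_prevalence: float, min_homogeneity: float) -> bool:
    """Keep a positive pattern only if it satisfies all three bounds."""
    return (stats["degree"] <= max_degree
            and stats["prevalence_pos"] >= min_prevalence
            and stats["homogeneity_pos"] >= min_homogeneity)


# Hypothetical toy data: the pattern covers 3 of 4 positives and 1 of 4 negatives.
positives = [{"x": 5.0}, {"x": 6.0}, {"x": 7.0}, {"x": 1.0}]
negatives = [{"x": 0.0}, {"x": 1.0}, {"x": 2.0}, {"x": 5.5}]
stats = pattern_stats({"x": (4.0, 10.0)}, positives, negatives)
print(stats, within_bounds(stats, max_degree=3, min_prevalence=0.5, min_homogeneity=0.7))
```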

III 13

Positive Patterns

[Table columns: Pattern Definition, Training Set, Test Set]

III 14

Negative Patterns

[Table columns: Pattern Definition, Training Set, Test Set]

III 15

4. Theories and Models

Pandect

Theory Selection via:

Greedy

Bottleneck Greedy

Lexicographic Greedy

Set Covering Heuristics

Model Selection:

2 Set-Covering Problems

Quadratic Set-Covering Problem
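The plain greedy option can be sketched as the classic greedy set-covering heuristic: repeatedly pick the pattern that covers the most still-uncovered observations. This is the generic heuristic under the assumption that a theory is a subset of the pandect covering all observations of one class; it is not a description of Datascope's internals, and the pattern names are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(coverage: Dict[str, Set[int]], universe: Set[int]) -> List[str]:
    """Pick patterns greedily until every coverable observation is covered."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pattern covering the largest number of still-uncovered observations.
        best = max(coverage, key=lambda name: len(coverage[name] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break  # the remaining observations are covered by no pattern
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical pandect: pattern name -> indices of positive observations it covers.
coverage = {"P1": {0, 1, 2}, "P2": {2, 3}, "P3": {3, 4}, "P4": {0}}
theory = greedy_cover(coverage, {0, 1, 2, 3, 4})
print(theory)
```

A model in this reading would combine one such cover of the positive observations with one of the negative observations, which matches the "2 Set-Covering Problems" item on the slide.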

III 16

4. Example (Model)

III 17

5. Example (Classification)

III 18

III 19

III 20

III 21

5. Discriminants

Weight Selection Methods:

Direct

1. Prognostic Index

2. Weighted Prognostic Index

LP-Based

3. Distance Maximizing Separator (SVM)

4. Cost Minimizing Separator

5. Expected Value Separator

NLP-Based

6. Regression in Pattern Space (ANN)

7. Best Correlation with Output

(weighted sums of patterns)
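A discriminant of the "weighted sum of patterns" kind can be sketched as follows. The equal-weight default is one common way to define a prognostic index (share of triggered positive patterns minus share of triggered negative patterns); whether Datascope uses exactly this normalization is an assumption. The function names and the threshold parameter are hypothetical.

```python
from typing import Optional, Sequence


def discriminant(
    pos_hits: Sequence[int],
    neg_hits: Sequence[int],
    pos_w: Optional[Sequence[float]] = None,
    neg_w: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of triggered positive patterns minus triggered negative ones.

    pos_hits / neg_hits are 0/1 flags saying which patterns cover the observation.
    With no weights given, each side uses equal weights 1/len(hits).
    """
    if pos_w is None:
        pos_w = [1.0 / len(pos_hits)] * len(pos_hits)
    if neg_w is None:
        neg_w = [1.0 / len(neg_hits)] * len(neg_hits)
    pos = sum(w * h for w, h in zip(pos_w, pos_hits))
    neg = sum(w * h for w, h in zip(neg_w, neg_hits))
    return pos - neg


def classify(score: float, threshold: float = 0.0) -> int:
    """+1 positive, -1 negative, 0 unclassified (score within the threshold band)."""
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0


# Observation triggering 3 of 4 positive patterns and 1 of 2 negative patterns.
score = discriminant([1, 1, 0, 1], [0, 1])
print(score, classify(score))
```

The LP-based and NLP-based methods on the slide would differ only in how `pos_w` and `neg_w` are chosen.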

III 22

[Figure: discriminant results by weight selection method. Panels: Prognostic Index, Weighted Prognostic Index, Expected Value Separator, Distance Maximizing Separator, Cost Minimizing Separator, Best Correlation with Output]

III 23

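[Figure: worked accuracy calculation combining sensitivity and specificity; the numeric expression is not recoverable from the extracted text]

Accuracy

Sensitivity

Specificity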

III 24

III 25

Reporting

Cutpoints

Discretized Space

Pandect

Coverage of Observations by Patterns

Pattern Report (Compact/Full Versions)

Theories/Models

Attribute Analysis

Log File

III 26

Pattern Space

Training: + + + + + + - - - Patterns

Test: + + + + + + - - - Patterns

Positive Observations

Unclassified Observations

Negative Observations
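One way to read the pattern-space view is as a 0/1 matrix with one row per observation and one column per pattern. The sketch below builds such a matrix under that assumption; the three patterns and the observations are hypothetical, and an all-zero row stands for an observation that no pattern covers.

```python
from typing import Callable, Dict, List

Observation = Dict[str, float]
Pattern = Callable[[Observation], bool]


def to_pattern_space(
    observations: List[Observation], patterns: List[Pattern]
) -> List[List[int]]:
    """Row i, column j is 1 when pattern j covers observation i."""
    return [[int(p(o)) for p in patterns] for o in observations]


# Hypothetical patterns: two positive ones followed by one negative one.
patterns: List[Pattern] = [
    lambda o: o["x1"] >= 3.0,
    lambda o: o["x1"] >= 2.0 and o["x2"] <= 1.0,
    lambda o: o["x2"] > 2.0,
]

observations = [
    {"x1": 3.5, "x2": 0.5},  # triggers both positive patterns
    {"x1": 1.0, "x2": 3.0},  # triggers the negative pattern
    {"x1": 1.0, "x2": 1.5},  # triggers nothing: unclassified
]

rows = to_pattern_space(observations, patterns)
for r in rows:
    print(r)
```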

III 27

Clustered Pattern Space

III 28

Validation Procedures

Raw Data

Stratified Random Partition

Bootstrap, K-Folding, Jackknife

LAD Model on Training Set

Performance Evaluation

Accuracy, Sensitivity, Specificity
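The validation flow can be sketched end to end with a stratified random partition and the three reported measures. The standard definitions are assumed (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = share of correct predictions); how Datascope scores unclassified observations is not covered here. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Random partition that keeps each class's share in both parts."""
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


labels = [1] * 6 + [0] * 4
train_idx, test_idx = stratified_split(labels, 0.5, random.Random(1))
metrics = evaluate([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(len(train_idx), len(test_idx), metrics)
```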

III 29

Special Features

User Model Generation (Excel Files)

Datascope Macro Language: Multiple and Complex Experiments

Interface with Other Applications (Datascope Server)

III 30

Performance

Comparative results for 5 datasets from the Irvine repository: LAD and 33 other algorithms

Dataset   Naive   Best (B)   Worst (W)   LAD (L)   (L-B)/(W-B)   Accuracy
bcw       35      3          9           3.5        0.08          99.48%
bld       42      28         43          27.8      -0.01         100.28%
hea       44      14         34          14.7       0.04          99.19%
pid       33      22         31          21.5      -0.06         100.64%
vot       39      4          6           4.6        0.30          99.38%
average                                             0.07          99.79%

Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms," Machine Learning, 40, 203-229 (2000)

http://www.ics.uci.edu/~mlearn/MLRepository.html
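The two rightmost columns of the performance table can be recomputed from the Best (B), Worst (W) and LAD (L) figures. The sketch assumes that those figures are error rates in percent, that the first derived column is (L - B)/(W - B), and that the Accuracy column is the ratio (100 - L)/(100 - B); these readings are inferred from the numbers rather than stated on the slide. `Decimal` with half-up rounding is used so that two-decimal rounding behaves like hand calculation.

```python
from decimal import Decimal, ROUND_HALF_UP

# (Best B, Worst W, LAD L) per dataset, copied from the slide's table.
ROWS = {
    "bcw": ("3", "9", "3.5"),
    "bld": ("28", "43", "27.8"),
    "hea": ("14", "34", "14.7"),
    "pid": ("22", "31", "21.5"),
    "vot": ("4", "6", "4.6"),
}


def q(x: Decimal) -> Decimal:
    """Round to two decimals, halves away from zero."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def position(b: Decimal, w: Decimal, l: Decimal) -> Decimal:
    """Where LAD falls between the best (0) and worst (1) algorithm."""
    return (l - b) / (w - b)


def relative_accuracy(b: Decimal, l: Decimal) -> Decimal:
    """LAD accuracy as a percentage of the best algorithm's accuracy."""
    return (100 - l) / (100 - b) * 100


results = {}
for name, (b, w, l) in ROWS.items():
    b, w, l = Decimal(b), Decimal(w), Decimal(l)
    results[name] = (position(b, w, l), relative_accuracy(b, l))
    print(name, q(results[name][0]), f"{q(results[name][1])}%")

avg_pos = sum(r[0] for r in results.values()) / len(results)
avg_acc = sum(r[1] for r in results.values()) / len(results)
print("average", q(avg_pos), f"{q(avg_acc)}%")
```

Under these assumptions the recomputed values agree with the slide's table to two decimals, which supports the reading but does not confirm how the authors defined the columns.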

III 31

LAD Case Studies

Assessing Long-Term Mortality Risk After Exercise Electrocardiography

Ovarian Cancer Detection Using Proteomic Data

Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays

Cell Proliferation on Medical Implants

Country Risk Rating