Data Mining Solutions (Westphal & Blaxton, 1998) Dr. K. Palaniappan.

Oct. 7, 1999 Dr. K. Palaniappan 2

Data Mining Tasks (review)

(1) Classification or identification - automatically label input records

(2) Estimation or regression - predict magnitude of response or other missing field given input records

(3) Segmentation or clustering - group the input records into meaningful sub-populations

(4) Description or visualization - looking for gems and diamonds among pebbles
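
As an illustration of task (2), the sketch below fits a straight line by ordinary least squares and uses it to estimate a missing response value. The field names and numbers are made up for the example and do not come from the book or the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical records: (mailings sent, responses received).
mailings = [1, 2, 3, 4]
responses = [3, 5, 7, 9]

slope, intercept = fit_line(mailings, responses)

# Estimate the missing response field for a record with 5 mailings.
estimate = slope * 5 + intercept
print(slope, intercept, estimate)
```

A single input field keeps the arithmetic visible; a real estimation task would usually involve several input fields and a held-out test set.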

Data Mining Tasks/Algorithms (handout)

Classification - supervised induction
  Analyze historical data to create a model that can predict future behavior
  Common tools - neural networks, decision trees, if-then-else rules

Clustering - partition database into similar groups or segments
  Expert interpretation and modification of clusters needed

Association - establish relationships about items occurring together

Sequence Discovery - identification of associations over time

Visualization - understanding relationships using visual methods
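
The association task can be sketched by counting how often pairs of items occur together. The example below computes pair support (the fraction of transactions containing both items). The basket data is hypothetical, and this is only the counting step, not a full association-rule miner.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return {(item_a, item_b): fraction of transactions containing both}."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for basket in transactions:
        # Sort so each pair has one canonical ordering; set() drops repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Pairs with high support are candidates for "items occurring together"; sequence discovery would add a time order to the same counting idea.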

Data Mining Tasks/Algorithms (handout)

Cluster analysis
Linkage analysis
Time series analysis
Categorization analysis
Visualization

Algorithms/Technologies
  Neural networks
  Decision trees
  Time series (?)
  Genetic algorithms
  Hybrid approaches
  Fuzzy logic
  Statistics

Data Modeling (elaboration)

Data abstraction
  Grouping, binning, categorization, histogramming of data useful for summarization
  1-D vs 2-D vs higher dimensional data
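
A minimal sketch of 1-D binning and histogramming, assuming equal-width bins (one of several possible binning schemes). The ages are hypothetical.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Summarize values as a list of (bin_start, bin_end, count) tuples."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = Counter()
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts.get(i, 0))
            for i in range(n_bins)]


# Hypothetical customer ages.
ages = [18, 22, 25, 31, 38, 41, 47, 52, 58, 60]
histogram = equal_width_bins(ages, 3)
for start, end, count in histogram:
    print(f"{start:5.1f} to {end:5.1f}: {'*' * count}")
```

The same idea extends to 2-D by binning each axis separately and counting records per cell.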

Data Modeling (elaboration)

Descriptive data
  State-based knowledge
  Set of attributes used to describe discrete objects
  Declarative information - organization structures, credit reports, vendor profiles

Transactional data
  Episodic info about time and place of events
  Links between object classes to represent traits or conditions

Problem Definition

Knowledge representation using hierarchical frameworks
  Objects --> Relationships --> Networks --> Applications --> Systems

Procedural vs declarative knowledge
  Episodic data tagged with temporal and spatial information (sequence, “knowing how”)
  Semantic data more commonly analyzed (factual, “knowing that”)

Metaknowledge

Data Preparation & Analysis

Define data mining goals

Planning questions:
  Ready access to all data sources
  Data format
  Integration of data from multiple sources and data bases
  Visual or nonvisual analytical methods
  Visualization for display
  Important patterns

Sample Datasets

http://www.kdnuggets.com/datasets.html (Fedstats, Statlog, UC Irvine, KDD)

ftp://208.144.240.175/kddcup

ftp://www.epsilon.com/kddcup (fund raising mailing response dataset)

Accessing and Preparing Data

Capitalization, concatenation, representation format, augmentation, abstraction, unit conversion, exclusion

Limiting scope - select appropriate dimensions to extract

Structuring extractions - number of records and time
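
Several of the preparation steps listed above (capitalization, concatenation, unit conversion, exclusion) can be sketched on a single record. All field names here are hypothetical; a real extraction would follow the source schema.

```python
def prepare_record(raw):
    """Normalize one raw record; returns a new dict and leaves the input untouched."""
    first = raw["first_name"].strip().title()
    last = raw["last_name"].strip().title()
    return {
        # Concatenation: build one name field from two.
        "full_name": f"{first} {last}",
        # Capitalization: store state codes in a single consistent case.
        "state": raw["state"].strip().upper(),
        # Unit conversion: inches to centimeters.
        "height_cm": round(raw["height_in"] * 2.54, 1),
        # Exclusion: fields not copied here (e.g. "notes") are dropped.
    }


raw_record = {
    "first_name": " jane",
    "last_name": "DOE ",
    "state": "mo",
    "height_in": 65,
    "notes": "entered by temp staff",
}
record = prepare_record(raw_record)
print(record)
```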

Accessing and Preparing Data

Extraction using data sampling vs report generation (examining the entire dataset)

Maintaining consistency and integrity - keep track of processing history, data keys, query generation code

Data sources and preprocessing - databases (SAP, Oracle, Peoplesoft, Access, FoxPro, LotusNotes, etc), word processors, spreadsheets
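
One way to combine sampling with a processing history is to draw a seeded random sample and record how it was drawn, so the extraction can be repeated later. This is a sketch under that assumption; the history fields are illustrative, not a prescribed format.

```python
import random


def extract_sample(records, k, seed):
    """Draw a simple random sample of k records and describe how it was drawn."""
    rng = random.Random(seed)  # private generator so the seed fully determines the draw
    sample = rng.sample(records, k)
    history = {
        "method": "simple random sample without replacement",
        "seed": seed,
        "population": len(records),
        "sample_size": k,
    }
    return sample, history


# Hypothetical data keys standing in for full records.
records = list(range(1000))
sample, history = extract_sample(records, 100, seed=1999)
print(history)
```

Re-running with the same seed and the same input order reproduces the sample, which is what makes the recorded history useful.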

Accessing and Preparing Data

Data integration
  Multisource
  Multiformat
  Multiplatform
  Multisecurity
  Multimedia
  Multiaccess

Converting data
  Long and short data structures

Accessing and Preparing Data

Data cleanup
  Up to 80% of time in data mining process
  Errors - data entry (mistyping, incomplete screens), missing data, incompatible formats, tampering/improper coding

Disambiguation
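
A minimal cleanup pass might separate usable records from those with missing fields or incompatible date formats. The sketch below assumes three accepted date layouts and two required fields; both choices are hypothetical.

```python
from datetime import datetime

# Assumed set of date layouts seen across the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first matching layout, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean(records, required=("id", "date")):
    """Split records into (cleaned, rejected-with-reason)."""
    cleaned, rejected = [], []
    for rec in records:
        missing = [f for f in required if not str(rec.get(f) or "").strip()]
        if missing:
            rejected.append((rec, "missing: " + ", ".join(missing)))
            continue
        parsed = parse_date(rec["date"])
        if parsed is None:
            rejected.append((rec, "unrecognized date format"))
            continue
        # Store every date in one representation (ISO 8601).
        cleaned.append({**rec, "date": parsed.isoformat()})
    return cleaned, rejected


raw = [
    {"id": "1", "date": "1999-10-07"},
    {"id": "2", "date": "10/07/1999"},
    {"id": "", "date": "1999-10-07"},   # incomplete data entry
    {"id": "4", "date": "Oct 7"},       # incompatible format
]
cleaned, rejected = clean(raw)
print(len(cleaned), "cleaned;", [reason for _, reason in rejected])
```

Keeping the rejected records with a reason, rather than silently dropping them, makes it possible to review how much data the cleanup step discards.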

Visual Methods for Analyzing Data

Discover overall trends
Discover smaller hidden patterns
Make unbiased observations/descriptions about data

Cognitive limitations
  Short-term memory - attentional limitation (absorbing multiple pages of tabular or text-based output)
  Long-term memory - reliance on associations not being presented

Cognitive Strengths

Linkage analysis - e.g. telephone calling patterns

Scheme-based visualization
  Positioning algorithms - reveal object clustering, hierarchical relationships, organizational networks, geopositional or landscape displays

Cognitive Strengths

Manipulating display characteristics of objects or records
  Source data -> Data object -> Object attributes -> Visualization
  Attributes - color, shape, size, x-pos, y-pos, elevation, intensity, alignment, label, image, orientation, link

Coding attribute information - up to 20 or more dimensions can be displayed
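
The pipeline from source data to display attributes can be sketched as a function that maps each record's fields onto visual properties. The field names, the color rule, and the size range below are all hypothetical choices for illustration.

```python
# Hypothetical mapping from an object's kind to a display shape.
SHAPES = {"person": "circle", "company": "square"}


def to_glyph(record, max_amount):
    """Map one data object onto display attributes (six dimensions encoded)."""
    return {
        "label": record["name"],
        "x_pos": record["longitude"],
        "y_pos": record["latitude"],
        # Size grows linearly with amount, from 4 up to 20 display units.
        "size": 4 + 16 * record["amount"] / max_amount,
        "color": "red" if record["flagged"] else "gray",
        "shape": SHAPES.get(record["kind"], "triangle"),
    }


source_data = [
    {"name": "Acme", "kind": "company", "longitude": -92.3, "latitude": 38.9,
     "amount": 500.0, "flagged": False},
    {"name": "J. Doe", "kind": "person", "longitude": -90.2, "latitude": 38.6,
     "amount": 2000.0, "flagged": True},
]
max_amount = max(r["amount"] for r in source_data)
glyphs = [to_glyph(r, max_amount) for r in source_data]
for g in glyphs:
    print(g)
```

Each additional display attribute (elevation, intensity, orientation, and so on) would encode one more dimension of the record in the same way.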

Analyzing Structural Features

Out-of-bounds values - e.g. landscape display or scatter diagram of trauma patients

Missing data - e.g. clustering of cellular communications data

Anomalous data - e.g. unusual airline flights
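
The slide describes finding these features visually. A non-visual counterpart, sketched here, flags missing and out-of-bounds values against per-field limits. The fields and limits are hypothetical and are not clinical guidance.

```python
# Hypothetical plausibility limits per field: (lowest allowed, highest allowed).
BOUNDS = {"age": (0, 120), "systolic_bp": (40, 300)}


def audit(records, bounds):
    """Return (record_index, field, problem) for each missing or out-of-bounds value."""
    findings = []
    for i, rec in enumerate(records):
        for field, (lo, hi) in bounds.items():
            value = rec.get(field)
            if value is None:
                findings.append((i, field, "missing"))
            elif not lo <= value <= hi:
                findings.append((i, field, "out-of-bounds"))
    return findings


# Hypothetical patient records.
patients = [
    {"age": 34, "systolic_bp": 120},
    {"age": 212, "systolic_bp": 110},  # likely a data-entry error
    {"age": 51},                       # blood pressure never recorded
]
findings = audit(patients, BOUNDS)
for finding in findings:
    print(finding)
```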

Analyzing Network Structures

Object - link - network

Interconnectivity
  Articulation points - data objects that connect two or more subnetworks, e.g. detecting bottlenecks

Identification of subnetworks or discrete networks

Missing connections - entities detached from main network
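
Subnetworks, detached entities, and articulation points can all be found with a connected-components search. The sketch below uses a brute-force test (remove each object in turn and see whether the network splits further) rather than a linear-time algorithm such as Tarjan's, which is acceptable only for small networks. The objects and links are hypothetical.

```python
def build_graph(nodes, links):
    """Undirected adjacency sets; nodes without links are kept as isolates."""
    graph = {n: set() for n in nodes}
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def components(graph, skip=None):
    """Connected components, optionally pretending node `skip` was removed."""
    seen, parts = set(), []
    for start in graph:
        if start == skip or start in seen:
            continue
        stack, part = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            part.add(node)
            for neighbor in graph[node]:
                if neighbor != skip and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        parts.append(part)
    return parts


def articulation_points(graph):
    """Nodes whose removal increases the number of components."""
    base = len(components(graph))
    return {n for n in graph if len(components(graph, skip=n)) > base}


nodes = ["A", "B", "C", "D", "E", "F", "G"]
links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
graph = build_graph(nodes, links)

subnetworks = components(graph)
detached = [part for part in subnetworks if len(part) == 1]
bottlenecks = articulation_points(graph)
print(subnetworks, detached, bottlenecks)
```

Here C and D are the bottlenecks: removing either one disconnects part of the main network, while G is detached from it entirely.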

Analyzing Network Structures

Strong/Weak linkages - strength of relationships within the network

Fan-Out frequency - degree of connectivity, good indicator of unusual behavior

Pathway analysis - connectivity of objects across a series of linkages

Commonality linkages - e.g. fraud detection, reducing marketing costs, minimizing transportation and delivery costs
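
Using call records as an example, link strength, fan-out, and pathways can be sketched with a counter, a set per caller, and a breadth-first search. The call data is hypothetical, and treating links as directed (caller to callee) is an assumption of this sketch.

```python
from collections import Counter, defaultdict, deque

# Hypothetical call records: (caller, callee).
calls = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "D"),
         ("C", "D"), ("A", "B"), ("D", "E")]

# Strong/weak linkages: how many times each directed link occurs.
link_strength = Counter(calls)

# Fan-out: number of distinct objects each caller links to.
callees = defaultdict(set)
for caller, callee in calls:
    callees[caller].add(callee)
fan_out = {caller: len(targets) for caller, targets in callees.items()}


def shortest_pathway(calls, source, target):
    """Breadth-first search for the fewest-link path, or None if unreachable."""
    neighbors = defaultdict(set)
    for a, b in calls:
        neighbors[a].add(b)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Sorted so ties are broken the same way on every run.
        for nxt in sorted(neighbors[path[-1]]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


pathway = shortest_pathway(calls, "A", "E")
print(link_strength.most_common(1), fan_out, pathway)
```

An object with unusually high fan-out, or a link much stronger than its neighbors, is the kind of feature the slide suggests examining for unusual behavior.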

Analyzing Network Structures

Emergent patterns of connectivity
  Groups, liaisons, attached isolates, detached isolates

Analyzing Temporal Patterns

Trend
Cycle
Seasonal
Irregular
Absolute time cycle of events
Contiguous time cycle event
Visualizing temporal patterns
Anacapa presentation methods
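
The trend, seasonal, and irregular components can be separated with a classical additive decomposition. The sketch below assumes an odd seasonal period and a centered moving average for the trend; it is one common textbook approach, not necessarily the method used in the book, and the series is synthetic.

```python
def decompose(series, period):
    """Additive split: series = trend + seasonal + irregular (odd period only)."""
    n = len(series)
    if period % 2 == 0 or n < 2 * period:
        raise ValueError("need an odd period and at least two full periods")
    half = period // 2

    # Trend: centered moving average over one full period (undefined at the ends).
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period

    # Seasonal: average detrended value at each position within the period.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    seasonal = [sum(b) / len(b) for b in buckets]

    # Irregular: whatever trend and seasonal effects do not explain.
    irregular = [None if trend[i] is None
                 else series[i] - trend[i] - seasonal[i % period]
                 for i in range(n)]
    return trend, seasonal, irregular


# Synthetic series: upward trend (0, 1, 2, ...) plus a repeating pattern of period 3.
pattern = [2, -1, -1]
series = [t + pattern[t % 3] for t in range(12)]
trend, seasonal, irregular = decompose(series, period=3)
print(trend[1:-1], seasonal)
```

Because the synthetic series contains no noise, the irregular component comes out as zero wherever the trend is defined; real data would leave a residual to inspect.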