Presentation for data mining

Prentice Hall*CIS 674Introduction to Data MiningSrinivasan [email protected] Hours: TTH 2-3:18PM DL317

Prentice Hall

Prentice Hall*Introduction OutlineDefine data miningData mining vs. databasesBasic data mining tasksData mining developmentData mining issuesGoal: Provide an overview of data mining.

Prentice Hall

Prentice Hall*IntroductionData is produced at a phenomenal rateOur ability to store has grownUsers expect more sophisticated informationHow?

UNCOVER HIDDEN INFORMATIONDATA MINING

Prentice Hall

Prentice Hall*Data Mining Objective: Fit data to a modelPotential Result: Higher-level meta information that may not be obvious when looking at raw dataSimilar termsExploratory data analysisData driven discoveryDeductive learning

Prentice Hall

Prentice Hall*Data Mining AlgorithmObjective: Fit Data to a ModelDescriptivePredictivePreferential Questions Which technique to choose?ARM/Classification/ClusteringAnswer: Depends on what you want to do with data?Search Strategy Technique to search the dataInterface? Query Language?Efficiency

Prentice Hall

Prentice Hall*Database Processing vs. Data Mining ProcessingQueryWell definedSQL

QueryPoorly definedNo precise query language

Output Precise Subset of database Output Fuzzy Not a subset of database

Prentice Hall

Prentice Hall*Query ExamplesDatabase

Data Mining Find all customers who have purchased milk Find all items which are frequently purchased with milk. (association rules) Find all credit applicants with last name of Smith. Identify customers who have purchased more than $10,000 in the last month. Find all credit applicants who are poor credit risks. (classification) Identify customers with similar buying habits. (Clustering)

Prentice Hall

Prentice Hall*Data Mining Models and Tasks

Prentice Hall

Prentice Hall*Basic Data Mining TasksClassification maps data into predefined groups or classesSupervised learningPattern recognitionPredictionRegression is used to map a data item to a real valued prediction variable.Clustering groups similar data together into clusters.Unsupervised learningSegmentationPartitioning

Prentice Hall

Prentice Hall*Basic Data Mining Tasks (contd)Summarization maps data into subsets with associated simple descriptions.CharacterizationGeneralizationLink Analysis uncovers relationships among data.Affinity AnalysisAssociation RulesSequential Analysis determines sequential patterns.

Prentice Hall

Prentice Hall*Ex: Time Series AnalysisExample: Stock MarketPredict future valuesDetermine similar patterns over timeClassify behavior

Prentice Hall

Prentice Hall*Data Mining vs. KDDKnowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

Prentice Hall

Knowledge Discovery ProcessData mining: the core of knowledge discovery process.

Data CleaningData IntegrationDatabasesPreprocessed DataTask-relevant DataData transformationsSelectionData MiningKnowledge Interpretation

Prentice Hall*KDD Process Ex: Web LogSelection: Select log data (dates and locations) to usePreprocessing: Remove identifying URLs Remove error logsTransformation: Sessionize (sort and group)Data Mining: Identify and count patternsConstruct data structureInterpretation/Evaluation:Identify and display frequently accessed sequences.Potential User Applications:Cache predictionPersonalization

Prentice Hall

Prentice Hall*Data Mining DevelopmentSimilarity MeasuresHierarchical ClusteringIR SystemsImprecise QueriesTextual DataWeb Search Engines

Bayes TheoremRegression AnalysisEM AlgorithmK-Means ClusteringTime Series AnalysisNeural NetworksDecision Tree AlgorithmsAlgorithm Design TechniquesAlgorithm AnalysisData StructuresRelational Data ModelSQLAssociation Rule AlgorithmsData WarehousingScalability Techniques

HIGH PERFORMANCEDATA MINING

Prentice Hall

Prentice Hall*KDD IssuesHuman InteractionOverfitting Outliers InterpretationVisualization Large DatasetsHigh Dimensionality

Prentice Hall

Prentice Hall*KDD Issues (contd)Multimedia DataMissing DataIrrelevant DataNoisy DataChanging DataIntegrationApplication

Prentice Hall

Prentice Hall*Social Implications of DMPrivacy ProfilingUnauthorized use

Prentice Hall

Prentice Hall*Data Mining MetricsUsefulnessReturn on Investment (ROI)AccuracySpace/Time

Prentice Hall

Prentice Hall*Database Perspective on Data MiningScalabilityReal World DataUpdatesEase of Use

Prentice Hall

Prentice Hall*Outline of Todays ClassStatistical BasicsPoint EstimationModels Based on SummarizationBayes TheoremHypothesis TestingRegression and CorrelationSimilarity Measures

Prentice Hall

Prentice Hall*Point EstimationPoint Estimate: estimate a population parameter.May be made by calculating the parameter for a sample.May be used to predict value for missing data.Ex: R contains 100 employees99 have salary informationMean salary of these is $50,000Use $50,000 as value of remaining employees salary. Is this a good idea?

Prentice Hall

Prentice Hall*Estimation ErrorBias: Difference between expected value and actual value.

Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:

Why square?Root Mean Square Error (RMSE)

Prentice Hall

Prentice Hall*Jackknife EstimateJackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.Treat the data like a populationTake samples from this populationUse these samples to estimate the parameterLet (hat) be an estimate on the entire pop.Let (j)(hat) be an estimator of the same form with observation j deletedAllows you to examine the impact of outliers!

Prentice Hall

Prentice Hall*Maximum Likelihood Estimate (MLE)Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.Joint probability for observing the sample data by multiplying the individual probabilities. Likelihood function:

Maximize L.

Prentice Hall

Prentice Hall*MLE ExampleCoin toss five times: {H,H,H,H,T}Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:

However if the probability of a H is 0.8 then:

Prentice Hall

Prentice Hall*MLE Example (contd)General likelihood formula:

Estimate for p is then 4/5 = 0.8

Prentice Hall

Prentice Hall*Expectation-Maximization (EM)Solves estimation with incomplete data.Obtain initial estimates for parameters.Iteratively use estimates for missing data and continue until convergence.

Prentice Hall

Prentice Hall*EM Example

Prentice Hall

Prentice Hall*EM Algorithm

Prentice Hall

Prentice Hall*Bayes Theorem ExampleCredit authorizations (hypotheses): h1=authorize purchase, h2 = authorize after further identification, h3=do not authorize, h4= do not authorize but contact policeAssign twelve data values for all combinations of credit and income:

From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%.

Prentice Hall

1

2

3

4

Excellent

x1

x2

x3

x4

Good

x5

x6

x7

x8

Bad

x9

x10

x11

x12

Prentice Hall*Bayes Example(contd)Training Data:

Prentice Hall

ID

Income

Credit

Class

xi

1

4

Excellent

h1

x4

2

3

Good

h1

x7

3

2

Excellent

h1

x2

4

3

Good

h1

x7

5

4

Good

h1

x8

6

2

Excellent

h1

x2

7

3

Bad

h2

x11

8

2

Bad

h2

x10

9

3

Bad

h3

x11

10

1

Bad

h4

x9

Prentice Hall*Other Statistical MeasuresChi-SquaredO observed valueE Expected value based on hypothesis.Jackknife Estimateestimate of parameter is obtained by omitting one value from the set of observed values.RegressionPredict future values based on past valuesLinear Regression assumes linear relationship exists.y = c0 + c1 x1 + + cn xnFind values to best fit the dataCorrelation

Prentice Hall

Prentice Hall*Similarity MeasuresDetermine similarity between two objects.Similarity characteristics:

Alternatively, distance measure measure how unlike or dissimilar objects are.

Prentice Hall

Prentice Hall*Similarity Measures

Prentice Hall

Prentice Hall*Distance MeasuresMeasure dissimilarity between objects

Prentice Hall

Prentice Hall*Information Retrieval Information Retrieval (IR): retrieving desired information from textual data.Library ScienceDigital LibrariesWeb Search EnginesTraditionally keyword basedSample query:Find all documents about data mining.

DM: Similarity measures; Mine text/Web data.

Prentice Hall

Prentice Hall*IR Query Result Measures and ClassificationIRClassification

Prentice Hall

Presentation for data mining

Documents

Transcript of Presentation for data mining