Data Mining Techniques and Applications, 1 st edition Hongbo Du ISBN 978-1-84480-891-5 © 2010...
Data Mining Techniques and Applications, 1st edition, Hongbo Du
ISBN 978-1-84480-891-5 © 2010 Cengage Learning
Chapter Two
Principles of data mining
Chapter Overview
• The process of data mining
• Approaches of data mining
• Categories of data mining problems
• Information patterns to be discovered
• Overview of data mining solutions
• Importance of evaluation
• Undertaking a data mining task in Weka
• Review of basic concepts in statistics and probability
Data Mining Process
[Figure: Input Data → Preparing Input Data → Mining Patterns → Post-processing Patterns → Output Patterns. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]
Data Mining Process
• Preparation
[Figure: Original Data sets → Collected Data set → Target Data set → Pre-Processed Data set → Formatted Data set]
– Integrating data
– Getting necessary data details
– Selecting relevant features
– Selecting relevant records
– Data cleaning
– Dealing with unknown data
– Data transformation
– Formatting data into a form acceptable to the mining tool
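The preparation steps listed above can be sketched in a few lines of Python. This is only an illustration: the toy records, the field names and the rule of filling an unknown age with the mean of the known ages are assumptions made for the example, not something the chapter prescribes.

```python
# Toy records with hypothetical values, used only to illustrate the steps.
raw_records = [
    {"id": 1, "country": "UK", "age": "22", "units": "360"},
    {"id": 2, "country": " uk", "age": "", "units": "360"},  # unknown age
    {"id": 3, "country": "FRANCE", "age": "24", "units": "345"},
]


def prepare(records):
    """Turn collected records into a small formatted data set."""
    # Selecting relevant features: keep only country and age.
    selected = [{"country": r["country"], "age": r["age"]} for r in records]
    # Data cleaning: normalise the spelling of country names.
    for r in selected:
        r["country"] = r["country"].strip().upper()
    # Dealing with unknown data: fill a missing age with the mean known age
    # (an assumed rule; other strategies are equally possible).
    known = [int(r["age"]) for r in selected if r["age"] != ""]
    fill = round(sum(known) / len(known))
    # Data transformation: convert age strings to integers.
    for r in selected:
        r["age"] = int(r["age"]) if r["age"] != "" else fill
    return selected


formatted = prepare(raw_records)
print(formatted)
```

In a real project each step would normally be reviewed with a domain expert, since the choice of features and the treatment of unknown values can change the patterns found later.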
Data Mining Process
• Mining
– Determining data mining tasks
– Assigning roles for data for certain tasks
– Selecting data mining solution(s) to each task
– Setting necessary parameters for the solution
– Collecting result patterns
[Figure: the Formatted Data set is passed to the mining solutions Solution1 (p1, p2, …, pn), Solution2 (t1, t2, …, tr) and Solution3 (w1, w2, …, wm), each with its own parameter settings, and the solutions produce Patterns]
Data Mining Process
• Post-processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
[Figure: Patterns → Evaluation criteria (accept / reject) → Valid Patterns → Selection criteria → Selected Patterns → Pattern Interpretation → Knowledge learnt]
Data Mining Process
• Roles of participants in data mining
– Participants include:
  • Data miners / data analysts: main participants of a DM project
  • Domain experts: main collaborators of a DM project
  • Decision makers: clients of a DM project
– Risk of human bias in the discovery process
– Important roles of domain experts
  • Pattern interpretation (for usefulness)
  • Pattern evaluation (for significance)
  • Mining options (for suitable tasks, limited)
  • Advisory on data pre-processing (for suitable operations, limited)
– Balancing the strengths of human and machine
Data Mining Approaches
• Hypothesis testing approach
– Top-down, led by a hypothesis statement
– Procedure:
  1. Forming a hypothesis statement
  2. Collecting and selecting data of relevance
  3. Conducting data analysis and collecting patterns
  4. Interpreting the patterns to accept/reject the hypothesis
• Discovery approach
– Bottom-up, without a hypothesis in mind
– Procedure:
  1. Collecting and preparing data of interest
  2. Conducting data analysis and discovering possible patterns
  3. Evaluating the importance and interestingness of the patterns
Data Mining Approaches
• Discovery approach (cont’d)
– Directed discovery (supervised learning):
  • Certain aspects of the outcome, i.e. the goal, of the discovery have been specified. The discovery is to find those patterns satisfying the goal, e.g. patterns relating to the outcome of a class variable.
– Undirected discovery (unsupervised learning):
  • There is no specification of the goal of the discovery. The discovery is to find those patterns of some kind of significance, e.g. associative links among some attribute values.
Data Mining: Problems & Patterns
• Classification
– Construct a classification model to determine the class of a given record
[Figure: (a) Model Development Phase: Example Data Set → Model Construction Method → Classification Model. (b) Model Use Phase: Unseen Data Record with undetermined class (input features, class?) → Classification Model → Data Record with the determined class (input features, class Ci).]
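The two phases can be illustrated with a deliberately simple model. The sketch below is an assumption made for illustration, not one of the methods named in this chapter: it learns the majority class for each value of a single input feature. The (Major Subject, Degree Class) pairs are taken from the 12-student example table used in this chapter; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# (Major Subject, Degree Class) pairs from the 12-student example table.
examples = [
    ("Computing", "1st Class"), ("Computing", "2nd Lower"),
    ("Psychology", "2nd Lower"), ("Accounting", "1st Class"),
    ("Psychology", "Pass"), ("History", "2nd Upper"),
    ("Computing", "1st Class"), ("Psychology", "3rd Class"),
    ("History", "2nd Upper"), ("Accounting", "1st Class"),
    ("History", "2nd Upper"), ("Law", "Pass"),
]


def build_model(examples):
    """Model development phase: majority class per feature value."""
    counts = defaultdict(Counter)
    overall = Counter()
    for value, label in examples:
        counts[value][label] += 1
        overall[label] += 1
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = overall.most_common(1)[0][0]  # used for unseen feature values
    return rules, default


def classify(model, value):
    """Model use phase: determine the class of an unseen record."""
    rules, default = model
    return rules.get(value, default)


model = build_model(examples)
print(classify(model, "History"))  # a subject seen during development
print(classify(model, "Physics"))  # an unseen subject falls back to the default
```

With only 12 examples such a model is far from reliable, which is one reason the chapter stresses evaluation before a model is trusted.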
Data Mining: Problems & Patterns
• Various forms of classification models
– Instance space
– Neural network
– Decision tree
– List of ordered classification rules
– Function (linear regression)
– Many more …
Data Mining: Problems & Patterns
• Cluster detection
– Measure similarity among data objects and group them into clusters accordingly
[Figure: Input data points → Clustering Method → Cluster Memberships of Data Points]
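As a rough illustration of grouping by similarity, the sketch below runs a one-dimensional K-means on the Age column of the student example table. The choice of K-means, of two clusters and of the minimum and maximum ages as starting centroids are all assumptions made for the example.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster numbers by repeatedly assigning each point to its nearest centroid."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: similarity is measured as absolute distance.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[k]
            )
        if new_centroids == centroids:  # no change, so the clusters are stable
            break
        centroids = new_centroids
    return labels, centroids


# Ages of students 1 to 12 from the example table.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
labels, centroids = kmeans_1d(ages, centroids=[min(ages), max(ages)])
print(labels, centroids)
```

Different starting centroids or a different number of clusters can give different memberships, so the result should be read as one possible grouping rather than the definitive one.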
Data Mining: Problems & Patterns
• Forms of clustering results
– Clusters of various shapes
– Ellipse-shaped clusters
– Hierarchical clustering results
Data Mining: Problems & Patterns
• Association rule mining
– Discover significant relationships between data objects
[Figure: Association Mining Method producing rules of the form X → Y]
• Various associations
– Between values, e.g. Apple → Coke
– Between categories of values, e.g. Food → Magazine
– Between values of attributes, e.g. Married:yes → OwnHouse:yes
– Over a time period, e.g. year 1: Database → year 2: Data Mining
Data Mining: Problems & Patterns
• An example

| StudentID | Gender | Country | Major Subject | Age | TotalUnits | Degree Class |
|---|---|---|---|---|---|---|
| 1 | M | UK | Computing | 22 | 360 | 1st Class |
| 2 | F | UK | Computing | 21 | 360 | 2nd Lower |
| 3 | M | FRANCE | Psychology | 24 | 345 | 2nd Lower |
| 4 | M | SPAIN | Accounting | 23 | 360 | 1st Class |
| 5 | F | UK | Psychology | 22 | 300 | Pass |
| 6 | F | USA | History | 30 | 345 | 2nd Upper |
| 7 | M | UK | Computing | 35 | 360 | 1st Class |
| 8 | F | FRANCE | Psychology | 25 | 360 | 3rd Class |
| 9 | F | GERMANY | History | 23 | 360 | 2nd Upper |
| 10 | M | UK | Accounting | 22 | 360 | 1st Class |
| 11 | M | SPAIN | History | 20 | 345 | 2nd Upper |
| 12 | F | UK | Law | 45 | 300 | Pass |

Classification model? Clusters? Association rules?
Data Mining Solutions: An Overview
• Classification solutions
– Decision tree, e.g. ID3
– k nearest neighbour (kNN), e.g. PEBLS
– Rules, e.g. Sequential Cover
– Bayes' theorem, e.g. Naïve Bayes
– Artificial neural network
• Clustering solutions
– Partition-based methods, e.g. K-means
– Hierarchical methods, e.g. agglomeration
– Density-based methods, e.g. DBSCAN
– Model-based methods, e.g. Expectation-Maximisation
– Graph-based methods, e.g. Chameleon
Data Mining Solutions: An Overview
• Association rule solutions
– Greedy methods, e.g. Apriori
– Graph-based methods, e.g. FP-Growth
– Methods for various associations
  • Boolean associations
  • Generalised associations (multi-level associations)
  • Quantitative associations (multidimensional associations)
  • Sequential associations (sequential patterns)
Since one type of data mining problem can be transformed into another, some solutions for one type can also be applied to another type.
Evaluation of Patterns
• Importance of evaluating result patterns
– A classification model must be accurate enough to be credible
– Clusters must genuinely exist
– Association rules must have enough strength to be believed
– Data descriptions must be general enough to cover a large part of the data set
How do we evaluate the discovered patterns?
Evaluation of Patterns
• Possible measures of interestingness
– Objective measures based on data and pattern
  • Conciseness of pattern, e.g. minimum description length
  • Coverage, e.g. coverage for classification rules
  • Reliability, e.g. accuracy of a classification model
  • Peculiarity, e.g. measures of difference from the norm
  • Diversity, e.g. tendency of clusters
– Subjective measures based on domain knowledge
  • Novelty
  • Surprisingness
  • Usefulness
  • Applicability
Evaluation of Patterns
• Commonly used measures
– Accuracy rate or error rate for classification models (see section 6.5.1)
  • True positive
  • False positive
  • False negative
– Quality of clusters (see section 4.5.1)
  • Quality of a cluster
  • Overall quality of all clusters
– Strengths of associations (see sections 8.1.2 and 8.6)
  • Support
  • Confidence
  • Lift
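The three association measures can be computed directly from record counts. The sketch below uses the commonly quoted definitions (support as the fraction of records matching both sides of the rule, confidence as the fraction of antecedent records that also match the consequent, and lift as confidence divided by the overall fraction matching the consequent); the book's own sections may state them slightly differently. The rule Gender = M → Degree Class = 1st Class is chosen only as an illustration, with the data taken from the student example table.

```python
# (Gender, Degree Class) for students 1 to 12 in the example table.
records = [
    ("M", "1st Class"), ("F", "2nd Lower"), ("M", "2nd Lower"),
    ("M", "1st Class"), ("F", "Pass"), ("F", "2nd Upper"),
    ("M", "1st Class"), ("F", "3rd Class"), ("F", "2nd Upper"),
    ("M", "1st Class"), ("M", "2nd Upper"), ("F", "Pass"),
]


def rule_measures(records, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(records)
    n_x = sum(1 for r in records if antecedent(r))
    n_y = sum(1 for r in records if consequent(r))
    n_xy = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift


support, confidence, lift = rule_measures(
    records,
    antecedent=lambda r: r[0] == "M",
    consequent=lambda r: r[1] == "1st Class",
)
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 suggests the antecedent and consequent occur together more often than independence would predict, although with 12 records any such reading is tentative.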
Data Mining in Weka Explorer
• The roadmap
[Figure: roadmap through the Weka Explorer in three numbered steps (1), (2) and (3), showing the Preprocess Tab page, the Cluster, Classify and Associate Tab pages, and the Tree Visualiser window]
Data Mining in Weka Explorer
• Preprocess
[Screenshot of the Preprocess tab page, with the following callouts]
– Open data set from different sources
– Generate random data set
– Save data set into a file
– Display & edit data
– Filters for pre-processing
– Data summary
– Attribute display, selection & removal from the opened data set
– Selected attribute summary
– Selected attribute visualisation
– Visualise all attributes
– Feedback messages
Data Mining in Weka Explorer
• Classify (as an example)
[Screenshot of the Classify tab page, with the following callouts]
– Method selection & parameter setting
– Test option setting
– Task list. Menu of options available with right click.
– Result display window
Data Mining in Weka Explorer
• Classify (as an example)
[Screenshots with the following callouts]
– Method list
– Selecting a specific method
– Selecting & changing parameters
Data Mining in Weka Explorer
• Visualisation
[Screenshots with the following callouts]
– An example decision tree
– Scatter plot of data objects of different classes
Probability & Statistics: A Brief Review
• Where are probability and statistics used?
– Patterns found from data are probabilistic in nature
– Used in various measures of evaluation, e.g. the confidence measure of association rules
– Used in the data exploration stage for better understanding, e.g. maximum, minimum, mean, variance, skewness
– Used during the mining process to assist the discovery of patterns, e.g. information gain for decision tree induction
– Used as a part of patterns, e.g. naïve Bayes, Gaussian mixture model
– Used in comparison of patterns, e.g. a classification model with significantly better accuracy
Probability & Statistics: A Brief Review
• Probability and conditional probability
– Probability of an event, P(E), and its meaning when P(E) = 0, P(E) = 1 and 0 < P(E) < 1
– Probabilities of multiple events: P(E and F), and P(E or F) = P(E) + P(F) – P(E and F)
– Mutually exclusive events: P(E and F) = 0 and P(E or F) = P(E) + P(F)
– Conditional probability of event E given event F: P(E|F) = P(E and F)/P(F)
– Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E)
Probability & Statistics: A Brief Review
• Probability & conditional probability (example)

| StudentID | Gender | Country | Major Subject | Age | TotalUnits | Degree Class |
|---|---|---|---|---|---|---|
| 1 | M | UK | Computing | 22 | 360 | 1st Class |
| 2 | F | UK | Computing | 21 | 360 | 2nd Lower |
| 3 | M | FRANCE | Psychology | 24 | 345 | 2nd Lower |
| 4 | M | SPAIN | Accounting | 23 | 360 | 1st Class |
| 5 | F | UK | Psychology | 22 | 300 | Pass |
| 6 | F | USA | History | 30 | 345 | 2nd Upper |
| 7 | M | UK | Computing | 35 | 360 | 1st Class |
| 8 | F | FRANCE | Psychology | 25 | 360 | 3rd Class |
| 9 | F | GERMANY | History | 23 | 360 | 2nd Upper |
| 10 | M | UK | Accounting | 22 | 360 | 1st Class |
| 11 | M | SPAIN | History | 20 | 345 | 2nd Upper |
| 12 | F | UK | Law | 45 | 300 | Pass |

P(Gender = M) = 6/12 = 1/2
P(Gender = M and Gender = F) = 0
P(Gender = M or Gender = F) = 1
P(Gender = F | Country = UK) = 1/2
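The four probabilities in this example can be checked by counting records. The following sketch uses exact fractions; the helper name `prob` and the lambda-based event encoding are choices made for the illustration.

```python
from fractions import Fraction

# (Gender, Country) for students 1 to 12 in the example table.
students = [
    ("M", "UK"), ("F", "UK"), ("M", "FRANCE"), ("M", "SPAIN"),
    ("F", "UK"), ("F", "USA"), ("M", "UK"), ("F", "FRANCE"),
    ("F", "GERMANY"), ("M", "UK"), ("M", "SPAIN"), ("F", "UK"),
]


def prob(event, given=lambda s: True):
    """P(event | given), estimated as a relative frequency over the records."""
    pool = [s for s in students if given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))


def is_male(s):
    return s[0] == "M"


def is_female(s):
    return s[0] == "F"


def from_uk(s):
    return s[1] == "UK"


p_male = prob(is_male)
p_male_and_female = prob(lambda s: is_male(s) and is_female(s))
p_male_or_female = prob(lambda s: is_male(s) or is_female(s))
p_female_given_uk = prob(is_female, given=from_uk)
print(p_male, p_male_and_female, p_male_or_female, p_female_given_uk)
```

The conditional probability is obtained by restricting the pool to UK students first, which is the counting equivalent of P(E|F) = P(E and F)/P(F).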
Probability & Statistics: A Brief Review
• Probability distribution of random variables
– Discrete random variable: P(X = x)
– Continuous random variable: P(a ≤ X < b)
[Figure: distribution curve with the 68% and 95% regions marked]
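Assuming the 68% and 95% marks refer to the familiar bands of one and two standard deviations around the mean of a normal distribution (the transcript does not say so explicitly), they can be reproduced with the standard library. The helper name `prob_between` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_between(dist, a, b):
    """P(a <= X < b) for a continuous random variable X with distribution dist."""
    return dist.cdf(b) - dist.cdf(a)


within_one_sd = prob_between(standard_normal, -1.0, 1.0)
within_two_sd = prob_between(standard_normal, -2.0, 2.0)
print(round(within_one_sd, 4), round(within_two_sd, 4))
```

For a continuous variable the probability of any single value is zero, which is why the interval form P(a ≤ X < b) is used instead of P(X = x).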
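Probability & Statistics: A Brief Review
• Basic Statistics
– Sample mean, median and mode: $\bar{x} = \frac{\sum x_i}{n}$
  e.g. $\overline{age} = 26$, $median_{age} = 23$, $mode_{age} = 22$
– Variance and standard deviation: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, $s = \sqrt{s^2}$
  e.g. $s_{age}^2 = 53.636$, $s_{age} = 7.324$
– Skewness: $skewness = \frac{3(\bar{x} - Median)}{s}$
  e.g. $skewness_{age} = \frac{3(26 - 23)}{7.324} = 1.229$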
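The figures for the Age column can be reproduced with Python's `statistics` module, as sketched below. The skewness line applies the median-based formula shown on this slide rather than a library function.

```python
import statistics

# Ages of students 1 to 12 from the example table.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)
var_age = statistics.variance(ages)  # sample variance, divides by n - 1
sd_age = statistics.stdev(ages)      # square root of the sample variance
skew_age = 3 * (mean_age - median_age) / sd_age

print(mean_age, median_age, mode_age)
print(round(var_age, 3), round(sd_age, 3), round(skew_age, 3))
```

The positive skewness reflects the few older students (35 and 45) pulling the mean above the median.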
Probability & Statistics: A Brief Review
• Confidence interval estimate
– The sample mean is only an estimate of the true mean of the data population.
– Central limit theorem: sample means follow a normal distribution in which:
  a. The mean is the true population mean $\mu_X$
  b. The standard deviation is $\sigma / \sqrt{n}$
– Based on the central limit theorem, and using the sample standard deviation in place of the true one, the following expression is used to estimate the interval for the true mean at a confidence level of $1 - \alpha$:
  $P\left(\bar{x} - t\,\frac{s}{\sqrt{n}} \le \mu_X \le \bar{x} + t\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$
Probability & Statistics: A Brief Review
• Confidence interval estimate (example)
For this data set, n = 12, $\overline{age} = 26$ and $s_{age} = 7.324$. At a confidence level of 95%, i.e. $1 - \alpha = 0.95$ and $\alpha/2 = 0.025$, with n – 1 = 11, t = 2.201. The interval estimate is:
$P\left(26 - 2.201 \times \frac{7.324}{\sqrt{12}} \le \mu_{age} \le 26 + 2.201 \times \frac{7.324}{\sqrt{12}}\right) = 0.95$
The interval is estimated as [21.347, 30.653] at a confidence level of 95%.
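The arithmetic behind this interval can be checked with a short sketch. The value t = 2.201 is taken from the slide rather than computed, since the Python standard library does not provide a t-distribution table; the variable names are illustrative.

```python
import math

n = 12            # sample size
mean_age = 26     # sample mean
sd_age = 7.324    # sample standard deviation
t_value = 2.201   # t for alpha/2 = 0.025 and n - 1 = 11, as given on the slide

# Half-width of the interval: t * s / sqrt(n).
margin = t_value * sd_age / math.sqrt(n)
lower = mean_age - margin
upper = mean_age + margin
print(round(lower, 3), round(upper, 3))
```

A larger sample would shrink the margin in proportion to the square root of n, giving a tighter estimate of the true mean.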
Probability & Statistics: A Brief Review
• Hypothesis testing
– As an introduction to statistical inference and statistical significance.
– Procedure:
  a. Forming null and alternative hypotheses
  b. Deciding the level of significance p
  c. Determining a test statistic and calculating its value
  d. Comparing the calculated value against the known value and deciding whether the null hypothesis should be rejected
Probability & Statistics: A Brief Review
• Hypothesis testing (example)
– Assuming $\mu_{age} = 25$
– Hypotheses:
  Null: $\overline{age} = \mu_{age}$
  Alternative: $\overline{age} \ne \mu_{age}$
– Calculating the statistic t as:
  $t = \frac{\overline{age} - \mu_{age}}{s_{age}/\sqrt{n}} = \frac{26 - 25}{7.324/\sqrt{12}} = 0.473$
  This is less than t = 2.201 for p/2 = 0.025 and n – 1 = 11.
– Conclusion: the null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.
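The test statistic and the decision can be reproduced as follows. As before, the critical value 2.201 is copied from the slide rather than derived, and the variable names are illustrative.

```python
import math

n = 12
mean_age = 26          # sample mean
assumed_mean = 25      # population mean assumed for the test
sd_age = 7.324         # sample standard deviation
critical_t = 2.201     # for p/2 = 0.025 and n - 1 = 11, as given on the slide

# Test statistic: (sample mean - assumed mean) / (s / sqrt(n)).
t_stat = (mean_age - assumed_mean) / (sd_age / math.sqrt(n))

# Two-sided test: reject only if |t| exceeds the critical value.
reject_null = abs(t_stat) > critical_t
print(round(t_stat, 3), reject_null)
```

Not rejecting the null hypothesis does not prove that the population mean is 25; it only means this sample gives no significant evidence against it.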
Chapter Summary
• The data mining process involves preparation of data, mining of patterns and post-processing of the patterns.
• Top-down and bottom-up approaches are both useful. The discovery approach can be directed or undirected.
• Three main streams of data mining tasks and various forms of patterns and models are introduced.
• Specific solutions are required for specific types of problems.
• The importance of evaluation of patterns must be appreciated.
• The normal procedure of conducting data mining in Weka is explained.
• Some important basic concepts in probability and statistics are reviewed.
References
Read Chapter 2 of Data Mining Techniques and Applications
Useful further references
Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, Chapter 1
Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed. Wiley Computer Publishing, Chapters 1 – 2