Data warehousing and mining furc
Motivation: “Necessity is the Mother of Invention”
Data Explosion Problem
1. Automated data collection tools (e.g. web, sensor networks) and mature
database technology lead to tremendous amounts of data stored in databases,
data warehouses and other information repositories.
2. Currently enterprises are facing data explosion problem.
Electronic Information an Important Asset for Business Decisions
1. With the growth of electronic information, enterprises began to realize that the
accumulated information can be an important asset in their business decisions.
2. There is a potential business intelligence hidden in the large volume of data.
3. This intelligence can be the secret weapon on which the success of a business
may depend.
Extracting Business Intelligence (Solution)
1. It is not a simple matter to discover business intelligence
from a mountain of accumulated data.
2. What is required are techniques that allow the
enterprise to extract the most valuable information.
3. The field of Data Mining provides such techniques.
4. These techniques can find novel (previously unknown) patterns
that may assist an enterprise in understanding the
business better and in forecasting.
Data Mining vs SQL, EIS, and OLAP
• SQL. SQL is a query language, difficult for business people to use
• EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, requiring the user to notice interesting patterns themselves.
• Data Mining picks out interesting patterns. The user can then use visualization tools to investigate further.
An Example of OLAP Analysis and its Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs: e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have to have written a very complex SQL query instead of just simply clicking to drill-down).
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
• But imagine if you had to do this manual investigation for all of the 10,000 products in your range ! Here, OLAP gives way to Data Mining.
Walking Sticks Sales by City
Karachi: 50   Lahore: 10   Islamabad: 400

Walking Sticks Sales in Islamabad by Age
Less than 20: 10   20 to 60: 30   Older than 60: 360

[Chart: Age Distribution by City, showing the share of Younger than 20, 20 to 60, and Older than 60 residents in Karachi, Lahore and Islamabad]
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction (top-down): from known rules (expertise) and data to decisions.
  Expert System: Rules + Data → Decisions
• Data Mining = Data-Driven Induction (bottom-up): from data about past decisions to discovered rules (general rules induced from the data).
  Data Mining: Data (including past decisions) → Rules
Difference b/w Machine Learning and Data Mining
Machine Learning techniques are designed to deal with limited
amounts of artificial intelligence data, whereas Data Mining
techniques deal with the large amounts of data held in databases.
Data Mining (Knowledge Discovery in Databases)
Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large databases.
What is not Data Mining? (Deductive) query processing; expert systems or small ML/statistical programs.
Data Mining (Example)
Random Guessing vs. Potential Knowledge
Suppose we have to forecast the probability of rain in Islamabad for any
particular day.
Without any prior knowledge, the probability of rain would be 50% (a pure
random guess).
If we had a lot of weather data, we could extract potential rules using
Data Mining, which could then forecast the chance of rain better than
random guessing.
Example: The Rule
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is a 66.6% chance of rain.

Temperature  Humidity  Windy  Rain
hot          high      false  No
hot          high      true   Yes
hot          high      false  Yes
mild         high      false  No
cool         normal    false  No
cool         normal    true   Yes
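The 66.6% figure is the confidence of the rule: among the records matching its antecedent, the fraction whose Rain value is Yes. A minimal sketch (variable names are my own) of computing it from the toy weather table:

```python
# Rule confidence: P(Rain = Yes | Temperature = hot, Humidity = high),
# estimated from the six records on the slide.

records = [  # (temperature, humidity, windy, rain)
    ("hot",  "high",   False, "No"),
    ("hot",  "high",   True,  "Yes"),
    ("hot",  "high",   False, "Yes"),
    ("mild", "high",   False, "No"),
    ("cool", "normal", False, "No"),
    ("cool", "normal", True,  "Yes"),
]

# Records matching the rule's antecedent.
matching = [r for r in records if r[0] == "hot" and r[1] == "high"]
# Of those, how many are rainy?
rainy = [r for r in matching if r[3] == "Yes"]

confidence = len(rainy) / len(matching)
print(f"{confidence:.1%}")  # 2 of 3 matching records are rainy -> 66.7%
```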
The Data Mining Process
• Step 0: Determine Business Objective
  - e.g. forecasting the probability of rain.
  - Must have relevant prior knowledge and the goals of the application.
• Step 1: Prepare Data
  - Noisy and missing values handling (Data Cleaning).
  - Data transformation (Normalization/Discretization).
  - Attribute/feature selection.
• Step 2: Choose the Function of Data Mining
  - Classification, clustering, association rules.
• Step 3: Choose the Mining Algorithm
  - Selection of the correct algorithm depends on the quality and density of the data.
• Step 4: Data Mining
  - Search for patterns of interest; a typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
  - Visualization/representation of interesting patterns, etc.
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
Databases → Data Integration → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation
Data Mining: On What Kind of Data?
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories:
   time-series and temporal data, text databases, multimedia databases, data streams (sensor network data), the WWW.
Data Mining Functionalities (1)
Data Preprocessing: Handling Missing and Noisy Data (Data Cleaning). Techniques we will cover:
▪ Missing values imputation using mean, median and mode.
▪ Missing values imputation using K-Nearest Neighbor.
▪ Missing values imputation using Association Rules Mining.
▪ Missing values imputation using Fault-Tolerant Patterns (will be a research project).
▪ Data binning for noisy data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
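As a minimal sketch of the first technique in the list, the missing categorical values in the table above (Refund in records 2 and 4, Country in record 4) can be filled with the most frequent value of each column. The record layout and helper name are my own:

```python
# Mode imputation: replace each missing categorical value with the most
# frequent value observed in that column.
from collections import Counter

rows = [  # (tid, refund, country, income_k, cheat); None marks a missing value
    (1, "Yes", "USA",       125, "No"),
    (2, None,  "UK",        100, "No"),
    (3, "No",  "Australia",  70, "No"),
    (4, None,  None,        120, "No"),
    (5, "No",  "NZL",        95, "Yes"),
]

def impute_mode(rows, col):
    """Replace None in column `col` with the column's most common value."""
    mode = Counter(r[col] for r in rows if r[col] is not None).most_common(1)[0][0]
    return [tuple(mode if i == col and v is None else v for i, v in enumerate(r))
            for r in rows]

rows = impute_mode(rows, 1)  # Refund: mode is "No" (2 of the 3 known values)
rows = impute_mode(rows, 2)  # Country: all known values are unique, so Counter
                             # breaks the tie by first insertion order
print(rows[1][1], rows[3][1])  # both imputed as "No"
```

Mode imputation ignores the other attributes of the record; the Association Rules and K-Nearest Neighbor variants listed above exploit them instead.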
Data Mining Functionalities (1)
Data Preprocessing: Data Transformation (Discretization and Normalization). With the help of data transformation, rules become more general and compact. General and compact rules increase the accuracy of classification.
Age  Discretized Age
15   Child
18   Child
40   Young
33   Young
55   Old
48   Old
12   Child
23   Young

Bins: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
Before discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.
After discretization:
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then Buy_Computer = No.
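The discretization step itself can be sketched as a simple binning function; the bin boundaries come from the slide, the function name is my own:

```python
# Discretize the numeric Age attribute into the Child / Young / Old labels
# using the bins Child = (0 to 20), Young = (21 to 47), Old = (48 to 120).

def discretize_age(age):
    if age <= 20:
        return "Child"
    elif age <= 47:
        return "Young"
    else:
        return "Old"

ages = [15, 18, 40, 33, 55, 48, 12, 23]
print([discretize_age(a) for a in ages])
# -> ['Child', 'Child', 'Young', 'Young', 'Old', 'Old', 'Child', 'Young']
```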
Data Mining Functionalities (1)
Data Preprocessing: Attribute Selection/Feature Selection
▪ Selection of those attributes which are most relevant to the data mining task.
▪ Advantage 1: decreases the processing time of the mining task.
▪ Advantage 2: generalizes the rules.
Example
▪ If our mining goal is to find which countries cheat most, and on which taxable income,
▪ then obviously the Date attribute will not be an important factor in our mining task.
Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/2002  No      Australia  120K            Yes
26/02/2002  No      NZL        95K             Yes
Data Mining Functionalities (1)
Data Preprocessing: We will cover the following Attribute/Feature Selection techniques:
▪ Principal Component Analysis
▪ Wrapper based
▪ Filter based
Data Mining Functionalities (2)
Association Rule Mining: in the Association Rule Mining framework we have to find all the
rules in a transactional/relational dataset whose support (frequency) is greater than some minimum support (min_sup) threshold (provided by the user).
For example, with min_sup = 50% (2 of 4 transactions):

Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
5000            Butter, Ice cream, Cake

Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
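The support column can be reproduced with a single counting pass over the transactions. This is only a sketch of the counting step (a full Apriori run would also enumerate candidate itemsets level by level); the itemsets checked are the ones on the slide:

```python
# Count itemset support: the number of transactions containing every item
# of the itemset, then compare against the min_sup threshold.

transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Ice cream", "Cake"},
}

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)

min_sup = 2  # 50% of 4 transactions
for itemset in [{"Butter"}, {"Bread"}, {"Egg"},
                {"Bread", "Butter"}, {"Bread", "Butter", "Egg"}]:
    print(sorted(itemset), support(itemset), support(itemset) >= min_sup)
```

Every itemset on the slide meets min_sup = 2, which is why all five appear in the frequent-itemset table.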
Data Mining Functionalities (2)
Association Rule Mining: topics we will cover
- Frequent Itemset Mining algorithms (Apriori, FP-Growth, Bit-vector).
- Fault-Tolerant/Approximate Frequent Itemset Mining.
- N-Most Interesting Frequent Itemset Mining.
- Closed and Maximal Frequent Itemset Mining.
- Incremental Frequent Itemset Mining.
- Sequential Patterns.
Projects
▪ Mining Fault-Tolerant patterns using Pattern-Growth.
▪ An application of Fault-Tolerant Frequent Patterns is missing values imputation (course project).
Data Mining Functionalities (2)
Classification and Prediction: finding models (functions) that describe and distinguish classes or concepts for future prediction.
Example: classify rainy/un-rainy cities based on the Temperature, Humidity and Windy attributes.
The previous business decisions must be known (Supervised Learning).

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes
Rule
• If Temperature = hot & Humidity = high then Rain = Yes.

Prediction of unknown records:
City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?
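Applying the learned rule to the unseen records is straightforward; a minimal sketch (the function name and the two-class fallback to "No" are my own simplifications, since a real classifier would learn many rules):

```python
# Apply the discovered rule
#   "Temperature = hot AND Humidity = high -> Rain = Yes"
# to records whose Rain value is unknown.

def predict_rain(temperature, humidity):
    return "Yes" if temperature == "hot" and humidity == "high" else "No"

unseen = [("Muree", "hot", "high"), ("Sibi", "mild", "low")]
for city, temp, hum in unseen:
    print(city, predict_rain(temp, hum))  # Muree -> Yes, Sibi -> No
```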
Data Mining Functionalities (2)
Cluster Analysis: group data to form new classes from unlabeled data. The business decisions are unknown (also called Unsupervised Learning).
Example: cluster cities based on the Temperature, Humidity and Windy attributes.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

(3 clusters)
Data Mining Functionalities (3)
Outlier Analysis
Outlier: a data object that does not comply with the general behavior of the data.
It can be considered noise or an exception, but is quite useful in fraud detection and rare events analysis.
Are All the “Discovered” Patterns Interesting?
A data mining system/query may generate thousands of patterns; not all of them are interesting.
Suggested approach: query-based, constraint mining.
Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness. Can a data mining system find all the interesting patterns? Remember that most problems in Data Mining are NP-Complete; there is no globally best solution for any single problem.
Search for only the interesting patterns: Optimization. Can a data mining system find only the interesting patterns? Approaches:
▪ First generate all the patterns and then filter out the uninteresting ones.
▪ Generate only the interesting patterns: constraint-based mining (give threshold factors to the mining step).
Reading Assignment
Book chapter: Chapter 1 of “Data Mining: Concepts and Techniques” by Jiawei Han and Micheline Kamber.
Data Mining ------- Where?
Some nice resources:
- ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD): http://www.acm.org/sigs/sigkdd/
- Knowledge Discovery Nuggets: www.kdnuggests.com
- IEEE Transactions on Knowledge and Data Engineering: http://www.computer.org/tkde/
- IEEE Transactions on Pattern Analysis and Machine Intelligence: http://www.computer.org/tpami/
- Data Mining and Knowledge Discovery. Publisher: Springer Science+Business Media B.V. (formerly Kluwer Academic Publishers B.V.): http://www.kluweronline.com/issn/1384-5810/
- Current and previous offerings of Data Mining courses at Stanford, CMU, MIT and Helsinki.
Text and Reference Material
The course will be mainly based on research literature; the following texts may however be consulted:
1. Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”.
2. David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Prentice Hall of India, 2004.
3. Sushmita Mitra and Tinku Acharya. “Data Mining: Multimedia, Soft Computing and Bioinformatics”. John Wiley and Sons, 2003.
4. Usama M. Fayyad et al. “Advances in Knowledge Discovery and Data Mining”. The MIT Press, 1996.
Data Mining
Data Quality: Missing Values Imputation using Mean, Median and the k-Nearest Neighbor Approach
Distance Measures
Data Quality
Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
Why? Almost all Data Mining algorithms induce knowledge strictly from data, so the quality of the knowledge extracted highly depends on the quality of the data.
There are two main problems in data quality:
- Missing data: the data is not present.
- Noisy data: the data is present but not correct.
Missing/noisy data sources: hardware failure, data transmission errors, data entry problems, refusal of respondents to answer certain questions.
Effect of Noisy Data on Results Accuracy

Training data:
age    income  student  buys_computer
<=30   high    yes      yes
<=30   high    no       yes
>40    medium  yes      no
>40    medium  no       no
>40    low     yes      yes
31…40  ?       no       yes
31…40  medium  yes      yes

Testing data (actual data):
age    income  student  buys_computer
<=30   high    no       ?
>40    medium  yes      ?
31…40  medium  yes      ?

Discover only those rules whose support (frequency) is >= 2:
• If age <= 30 and income = high then buys_computer = yes
• If age > 40 and income = medium then buys_computer = no

Due to the missing value in the training dataset, the accuracy of prediction decreases and becomes 66.7%.
Imputation of Missing Data (Basic)
Imputation is a term that denotes a procedure that replaces the missing values in a dataset with plausible values, i.e. by considering the relationships among the correlated attribute values of the dataset.

Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
?            cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

If we consider only {attribute#2}, then the value “cool” appears in 4 records.
Probability of imputing value (20) = 75%
Probability of imputing value (30) = 25%
Imputation of Missing Data (Basic)
Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
?            cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

For {attribute#4}, the value “true” appears in 3 records.
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the combination {“cool”, “high”} appears in 2 complete records, both with value 20.
Probability of imputing value (20) = 100%
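The probability computation above can be sketched directly: condition on some known attributes of the incomplete record, then look at the distribution of attribute 1 among the matching complete records. Column indices and the helper name are my own:

```python
# Imputation probabilities for attribute 1, conditioned on other attributes.
from collections import Counter

rows = [  # (attr1, attr2, attr3, attr4); None marks the missing value
    (20,   "cool", "high",   False),
    (None, "cool", "high",   True),
    (20,   "cool", "high",   True),
    (20,   "mild", "low",    False),
    (30,   "cool", "normal", False),
    (10,   "mild", "high",   True),
]

def imputation_probs(rows, conditions):
    """P(attr1 = v) among complete rows matching all (column, value) pairs."""
    matches = [r[0] for r in rows
               if r[0] is not None and all(r[c] == v for c, v in conditions)]
    counts = Counter(matches)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}

print(imputation_probs(rows, [(3, True)]))                 # {20: 0.5, 10: 0.5}
print(imputation_probs(rows, [(1, "cool"), (2, "high")]))  # {20: 1.0}
```

Conditioning on more correlated attributes (here attribute 2 and attribute 3 together) narrows the matching records and sharpens the imputation probability, which is the idea behind the association-rules-based imputation mentioned earlier.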
Randomness of Missing Data Missing data randomness is divided into three classes.
1. Missing completely at random (MCAR):- It occurs when the probability of instance (case) having missing value for an attribute does not depend on either the known attribute values or missing data attribute.
2. Missing at random (MAR):- It occurs when the probability of instance (case) having missing value for an attribute depends on the known attribute values, but not on the missing data attribute.
3. Not missing at random (NMAR):- When the probability of an instance having a missing value for an attribute could depend on the value of that attribute.
Methods of Treating Missing Data
Ignoring and discarding data: there are two main ways to discard data with missing values.
- Discard all records which have missing data (also called discard case analysis).
- Discard only those attributes which have a high level of missing data.
Imputation using mean/median or mode: one of the most frequently used methods (a statistical technique).
- Replace missing values of numeric (continuous) attributes using the mean/median (the median is robust against noise).
- Replace missing values of discrete attributes using the mode.
Methods of Treating Missing Data
Replace missing values using a prediction/classification model:
- Advantage: it considers the relationships among the known attribute values and the missing values, so the imputation accuracy is very high.
- Disadvantage: if no correlation exists between some missing attribute values and the known attribute values, the imputation cannot be performed.
- Alternative approach: use a hybrid combination of a prediction/classification model and mean/mode. First try to impute the missing value using the prediction/classification model, then fall back on the median/mode. We will study more about this topic in Association Rules Mining.
Similarity and Dissimilarity
Similarity
- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1].
Dissimilarity
- Numerical measure of how different two data objects are.
- Lower when objects are more alike.
- Minimum dissimilarity is often 0; the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.
Distance Measures
Remember: K-Nearest Neighbors are determined on the basis of some kind of “distance” between points.
Two major classes of distance measure:
1. Euclidean: based on the position of points in some k-dimensional space.
2. Non-Euclidean: not related to position or space.
Scales of Measurement
Applying a distance measure largely depends on
the type of input data
Major scales of measurement:
1. Nominal Data (aka Nominal Scale Variables)
▪ Typically classification data, e.g. m/f.
▪ No ordering, e.g. it makes no sense to state that M > F.
▪ Binary variables are a special case of nominal scale variables.
2. Ordinal Data (aka Ordinal Scale)
▪ Ordered, but differences between values are not important.
▪ e.g., political parties on a left-to-right spectrum given labels 0, 1, 2.
▪ e.g., Likert scales: rank your degree of satisfaction on a scale of 1..5.
▪ e.g., restaurant ratings.
Scales of Measurement
Applying a distance function largely depends on
the type of input data
Major scales of measurement:
3. Numeric Data (aka interval scaled)
▪ Ordered and equal intervals; measured on a linear scale.
▪ Differences make sense.
▪ e.g., temperature (C, F), height, weight, age, date.
Scales of Measurement
• Only certain operations can be performed on certain scales of measurement.

Scale           Permitted operations
Nominal Scale   1. Equality  2. Count
Ordinal Scale   + 3. Rank (cannot quantify the difference)
Interval Scale  + 4. Quantify the difference
Axioms of a Distance Measure
d is a distance measure if it is a function from pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 for x ≠ y.
Some Euclidean Distances
L2 norm (also common or Euclidean distance): the most common notion of “distance”:
d(i,j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
L1 norm (also Manhattan distance): the distance if you had to travel along coordinates only:
d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
Examples: L1 and L2 norms
x = (5,5), y = (9,8)
L2 norm: dist(x,y) = sqrt(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7
Another Euclidean Distance
L∞ norm: d(x,y) = the maximum of the differences between x and y in any dimension, i.e. d(x,y) = max_k |x_k − y_k|.
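The three norms can be sketched as small functions (the function names are my own) and checked against the worked example x = (5,5), y = (9,8):

```python
# L2 (Euclidean), L1 (Manhattan) and L-infinity (maximum coordinate
# difference) distances between two points of equal dimension.
import math

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def linf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2(x, y), l1(x, y), linf(x, y))  # 5.0 7 4
```

All three satisfy the distance axioms above; they differ only in how they combine the per-dimension gaps (4 and 3 here), which is why K-Nearest Neighbor results can change with the choice of norm.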