Data mining

Motivation: Data Flood

Data explosion problem

• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.

• We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

• Data warehousing and on-line analytical processing

• Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Data mining (knowledge mining from data) is an area of research and practice focused on discovering novel patterns in data using algorithms and computers. It is good at finding the hidden patterns in a dataset by analyzing correlations among attribute values.

Today we have software that can search through massive data haystacks looking for lots of interesting and usable needles.

Data Mining Tasks

• Classification
• Regression
• Segmentation
• Association analysis
• Anomaly detection
• Sequence analysis
• Time-series analysis
• Text categorization
• Advanced insights discovery
• Others

Data Mining Problems

• What other products are purchased together with a digital camera?
– Based on previous purchases (shopping cart)
– E.g., if a digital camera is purchased, flash memory, a battery, and a printer are also purchased.

Association Analysis

• Similar questions:
– What products to recommend in on-line stores such as Amazon.com, movie rental, wireless themes, etc.
– What items should be displayed together in a store.
– What genes appear together in toxic mushrooms.

Data Mining Problems (cont.)

• Is this student going to go to college?

– Based on Gender, ParentIncome, ParentEncouragement, IQ, etc.

– E.g., if ParentEncouragement=Yes and IQ>100, College=Yes

Classification (prediction)

• Similar questions:

– Is this a spam email? (spam filtering)

– How good/bad is your credit? (credit scoring)

– Recognition of hand-written letters (pen recognition)

– What is this gene like? (bioinformatics)

– Does this person behave like a terrorist?
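The example rule on this slide can be written as a tiny classifier. This is only a sketch: the student records are made up, the function name predict_college is hypothetical, and answering "No" whenever the rule does not fire is an assumption, since the slide only states the "Yes" case.

```r
# Hypothetical student records (made-up values, for illustration only)
students <- data.frame(
  ParentEncouragement = c("Yes", "Yes", "No", "No"),
  IQ                  = c(115, 95, 120, 90)
)

# The example rule from the slide:
# if ParentEncouragement = Yes and IQ > 100, then College = Yes.
# Returning "No" otherwise is an assumption made for this sketch.
predict_college <- function(encouragement, iq) {
  ifelse(encouragement == "Yes" & iq > 100, "Yes", "No")
}

students$College <- predict_college(students$ParentEncouragement, students$IQ)
print(students)
```

A real classifier would learn such rules from labeled historical data rather than have them written by hand.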


• What is the age of a person?

– Based on Hobby, MaritalStatus, NumberOfChildren, Income, HouseOwnership, NumberOfCars, …

– E.g., If MaritalStatus=Yes, Age = 20+4*NumberOfChildren+0.0001*Income+…

Regression (prediction)

• Similar questions:

– What’s the sales amount of ice cream next month? (sales prediction)

– What’s the stock price of A next week? (stock prediction)

– What’s the income of a customer? (marketing)

– What’s the life-time of a software bug? (bug tracking)


• Who are my Web visitors?
– Identify similar groups based on demographics, visiting patterns
– E.g., daily news readers, email users, shoppers, short-stayers, etc.

Segmentation (clustering)

• Similar questions:
– Identify groups of genes (bioinformatics)
– Identify groups of locations of cholera incidents in London (spatial data mining)
– Identify groups of customers of merchants (Amazon, E-Bay, MSN, WalMart, etc.) (target marketing)
– Identify groups of documents (text categorization)
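As a rough illustration of segmentation, the sketch below groups a few Web visitors with k-means, one common clustering method (the slides do not say which method is meant). The visitor data, the two features, and the choice of two segments are all assumptions made for the example.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical visitors: pages viewed and minutes spent per visit (made-up values)
visitors <- data.frame(
  pages   = c(2, 3, 1, 25, 30, 28),
  minutes = c(1, 2, 1, 40, 45, 38)
)

# Ask for two segments; nstart tries several random starts and keeps the best one
fit <- kmeans(visitors, centers = 2, nstart = 10)

# Attach the segment label found for each visitor
visitors$segment <- fit$cluster
print(visitors)
```

The cluster numbers themselves are arbitrary labels; what matters is which visitors end up in the same group, here the short-stayers versus the heavy readers.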


• Could this network packet be from a virus attack?
– Predict likelihood of the network packet pattern

Anomaly detection (outlier detection)

• Similar questions:
– Are the hospital lab results normal? (adverse drug effect detection)
– Is this credit transaction fraudulent? (fraud detection)
– Does this person behave unusually, maybe warranting a high-level security check?
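One very simple way to flag outliers is a robust z-score: measure how far each value lies from the median, in units of the median absolute deviation. This is a swapped-in illustration, not the likelihood model the slide refers to; the packet sizes and the cutoff of 3 are assumptions.

```r
# Hypothetical packet sizes in bytes (made-up values); one is unusually large
packet_bytes <- c(512, 498, 505, 520, 490, 501, 515, 4096)

# Robust z-score: distance from the median, scaled by the median absolute deviation
center <- median(packet_bytes)
spread <- mad(packet_bytes)
score  <- abs(packet_bytes - center) / spread

# Flag values more than 3 robust deviations from the median (cutoff is an assumption)
is_anomaly <- score > 3
print(packet_bytes[is_anomaly])
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would otherwise inflate the very scale it is being compared against.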

Data mining and machine learning

• Machine learning focuses on creating computer algorithms that can use pre-existing inputs to refine and improve their own capabilities for dealing with future inputs.

• Machine learning is not exactly the same thing as data mining and vice versa. Not all data mining techniques rely on what researchers would consider machine learning.

• Machine learning is used in areas like robotics that we don’t commonly think of when we are thinking of data mining as such.

• Data mining is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends.

Data mining as a step in the process of knowledge discovery.

1. Data cleaning (to remove noise and inconsistent data).

2. Data integration (where multiple data sources may be combined).

3. Data selection (where data relevant to the analysis task are retrieved from the database).

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance).

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns).

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures).

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).

According to this view, data mining is only one step in the entire process. We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data.

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies.

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.

User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task.

Data mining typically consists of four processes:

1) Data preparation.
2) Exploratory data analysis.
3) Model development.
4) Interpretation of results.

Step 1, data preparation, involves making sure that the data are organized in the right way, that missing data fields are filled in, that inaccurate data are located and repaired or deleted, and that data are "recoded" as necessary to make them amenable to the kind of analysis we have in mind.

Step 2, exploratory data analysis, means getting to know the data using histograms and other visualization tools, and looking for preliminary hints that will guide our model choice. The exploration process also involves figuring out the right values for key parameters.

Step 3, choosing and developing a model, is by far the most complex and most interesting of the activities of a data miner. It is here that you test out a selection of the most appropriate data mining techniques. Depending upon the structure of a dataset, there may be dozens of options, and choosing the most promising one has as much art in it as science.

Step 4, the interpretation of results, focuses on making sense out of what the data mining algorithm has produced. This is the most important step from the perspective of the data user, because this is where an actionable conclusion is formed.
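Steps 1 and 2 might look roughly like this in base R. The customer table is made up, and the specific choices (treating ages above 120 as errors, filling gaps with the median) are assumptions for the sketch; real projects pick these rules to fit their data.

```r
# Hypothetical customer records (made-up values, for illustration only)
customers <- data.frame(
  age    = c(34, NA, 45, 29, 230),   # one missing value and one impossible value
  gender = c("M", "F", "F", "m", "F")
)

# Step 1: data preparation
# Treat impossible ages as missing (the cutoff of 120 is an assumption)
customers$age[which(customers$age > 120)] <- NA
# Fill missing ages with the median of the known ages (one of many possible choices)
customers$age[is.na(customers$age)] <- median(customers$age, na.rm = TRUE)
# Recode gender into consistent categories
customers$gender <- factor(toupper(customers$gender))

# Step 2: exploratory data analysis
print(summary(customers))
hist(customers$age, main = "Age distribution", xlab = "Age")
```

Steps 3 and 4 depend on the technique chosen, so they are not sketched here.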

"association rules mining"

Confidence: how frequently a particular pair occurs among all the carts in which the first item is present.

Support: the proportion of all shopping carts in which a particular pairing occurs.

We can also evaluate a long list of these rules using a value called lift.

Lift: takes into account the support for a rule, but also gives more weight to rules where the LHS (left-hand side) and/or the RHS (right-hand side) occur less frequently. Formally, lift is the confidence of the rule divided by the support of the RHS, so a lift above 1 means the two sides occur together more often than they would if they were independent. In other words, lift favors situations where LHS and RHS are not abundant but where the relatively few occurrences always happen together. The larger the value of lift, the more "interesting" the rule may be.
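The three measures can be computed by hand on a handful of carts. The sketch below uses made-up carts and a hypothetical helper named support(); lift is computed with the usual definition, confidence divided by the support of the RHS.

```r
# Toy shopping carts (made-up data, for illustration only)
carts <- list(
  c("milk", "butter", "bread"),
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "jam"),
  c("milk", "butter", "jam")
)

# Proportion of carts that contain every item in `items`
support <- function(items, carts) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

# Rule under study: if milk (LHS) then butter (RHS)
lhs <- "milk"
rhs <- "butter"

rule_support    <- support(c(lhs, rhs), carts)            # carts with both items
rule_confidence <- rule_support / support(lhs, carts)     # among carts with milk
rule_lift       <- rule_confidence / support(rhs, carts)  # relative to butter's baseline

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

For these five carts the rule has support 0.6, confidence 0.75, and lift 1.25: butter shows up in carts with milk somewhat more often than it does overall.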

We can get started with association rules mining very easily using the R package known as "arules" and the Groceries data set, which is ready to be analyzed. So we are skipping right to Step 2 in our four-step process, exploratory data analysis:

> install.packages("arules")
> library("arules")

You can make the Groceries data set ready with this command:

> data(Groceries)

Run the summary() function on Groceries so that we can see what is in there:

> summary(Groceries)

Notes

• Groceries is stored as an itemMatrix in sparse format: a rectangular data structure with 9835 rows and 169 columns. It is called "sparse" because very few of these items exist in any given grocery basket.

• When an item appears in a basket, its cell contains a one; if an item is not in a basket, its cell contains a zero. Every cart has at least one item.

• The output also shows us which items occur in grocery baskets most frequently.

• Any non-zero amount of whole milk is represented by a one. Other data mining techniques could take advantage of knowing the exact amount of a product, but association rules mining does not need to know that amount.

• The item "yogurt" appeared in 1372 out of 9835 rows, or about 14% of cases. So we can set the support parameter to somewhere around 10%-15% in order to get a manageable number of items.

• An item that occurs only very rarely in the grocery baskets is unlikely to be of much use to us in terms of creating meaningful rules.
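To make the ones-and-zeros layout concrete, the sketch below builds a small basket-by-item matrix in base R. The three baskets are made up (they are not rows of Groceries), and the matrix here is an ordinary dense one; the sparse format used by arules stores only the ones.

```r
# Three made-up baskets (illustration only, not taken from Groceries)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk"),
  c("rolls/buns", "yogurt", "whole milk")
)

items <- sort(unique(unlist(baskets)))

# One row per basket, one column per item: 1 if the item is present, else 0
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items
print(incidence)

# Share of baskets containing each item (that item's support)
print(colMeans(incidence))
```

Note that quantities are lost on purpose: a basket with three cartons of whole milk would still get a single 1 in that column.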

We want to focus our attention on items that occur with some meaningful frequency in the dataset:

> itemFrequencyPlot(Groceries,support=0.1)
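The same information can be read as numbers rather than bars. This sketch assumes the arules package is installed and uses its itemFrequency() function; the variable names freq and frequent are chosen for the example.

```r
library(arules)
data(Groceries)

# Relative frequency (support) of every single item
freq <- itemFrequency(Groceries, type = "relative")

# Keep the items that clear the same 10% threshold used for the plot
frequent <- sort(freq[freq >= 0.1], decreasing = TRUE)
print(round(frequent, 3))
```

Seeing the exact values can help when choosing a support threshold for the rule search.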

Bar graph

The term "apriori" refers to the specific algorithm that R will use to scan the data set for appropriate rules. The Apriori algorithm is used for finding rules in transaction data.

• Rules are in the form of "if LHS then RHS." Each rule states that when the thing or things on the left-hand side of the equation occur(s), the thing on the right-hand side occurs a certain percentage of the time.

• For example, if Milk and Butter occur together in 10% of the grocery carts (that is "support"), and Milk (by itself, ignoring Butter) occurs in 25% of the carts, then the confidence of the Milk/Butter rule is 0.10/0.25 = 0.40.

> apriori(Groceries,parameter=list(support=0.005,confidence=0.5))

Apriori

The "minlen" and "maxlen" parameters also have sensible defaults: these refer to the minimum and maximum number of items in an item set that will be considered when generating rules.

Obviously you can’t generate a rule unless you have at least one item in an item set.

Later we will examine ways of making sense out of a large number of rules, but for now let’s agree that the run above produces too many rules to examine. Raising the support threshold to 0.01 leaves 15 rules, which we store in a data structure called ruleset:

> ruleset <- apriori(Groceries,parameter=list(support=0.01,confidence=0.5))

The inspect() command
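The slide's output is not reproduced in this transcript. A minimal sketch of the call, assuming the arules package is installed and the rule set is built with the thresholds given above:

```r
library(arules)
data(Groceries)

ruleset <- apriori(Groceries,
                   parameter = list(support = 0.01, confidence = 0.5))

# Print each rule with its LHS, RHS, support, confidence, and lift
inspect(ruleset)
```

Each printed row is one "if LHS then RHS" rule, numbered in the order the rules are stored.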

Notes

Rules 7 and 8 have the highest level of lift: the fruits and vegetables involved in these two rules have a relatively low frequency of occurrence, but their support and confidence are both relatively high.

Contrast these two rules with Rule 1, which also has high confidence, but which has low support and lower lift. The reason for the lower lift is that milk is a frequently occurring item, so there is not much novelty to that rule. On the other hand, the combination of fruits, root vegetables, and other vegetables suggests a need to find out more about customers whose carts may contain only vegetarian or vegan items.

To gain better insights, we can use a data visualization package to help explore this possibility.

The R package called arulesViz has methods of visualizing the rule sets generated by apriori() that can help us examine a larger set of rules. First, install and load the arulesViz package:

> install.packages("arulesViz")
> library(arulesViz)

Lowering the thresholds generates 357 rules, which we can then plot:

> ruleset <- apriori(Groceries,parameter=list(support=0.005,confidence=0.35))
> plot(ruleset)

Notes

The lift is shown by the darkness of a dot that appears on the plot. The darker the dot, the closer the lift of that rule is to 4.0.

The support of the rules ranges from somewhere below 1% all the way up above 7%, but all of the rules with high lift seem to have support below 1%. On the other hand, there are rules with high lift and high confidence, which sounds quite positive.

Next, focus on a smaller set of rules that only have the very highest levels of lift:

> goodrules <- ruleset[quality(ruleset)$lift > 3.5]

Note that the use of the square brackets with our data structure ruleset allows us to index only those elements that meet the lift condition.

> inspect(goodrules)
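The same selection can also be written with the subset() and sort() methods that arules provides for rule sets. This is an alternative sketch, assuming arules is installed; it is meant to give the same rules as the bracket indexing, listed from highest lift to lowest.

```r
library(arules)
data(Groceries)

ruleset <- apriori(Groceries,
                   parameter = list(support = 0.005, confidence = 0.35))

# Same filter as the bracket indexing, written with subset()
goodrules <- subset(ruleset, subset = lift > 3.5)

# List the selected rules from highest lift to lowest
inspect(sort(goodrules, by = "lift"))
```

Sorting first makes it easier to see which of the high-lift rules stand out most.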

Notes

It seems evident that shoppers are purchasing particular combinations of items that go together in recipes. The first three rules really seem like soup! Rules four and five seem like a fruit platter with dip.

We might recommend that recipes be published along with coupons, and that popular recipes, such as for homemade soup, have all of their ingredients grouped together in the store along with signs saying, "Mmmm, homemade soup!"

R Functions Used in This Chapter

• apriori() - Uses the algorithm of the same name to analyze a transaction data set and generate rules.

• itemFrequencyPlot() - Shows the relative frequency of commonly occurring items in the sparse occurrence matrix.

• inspect() - Shows the contents of the data object generated by apriori() that contains the association rules.

• install.packages() - Downloads and installs a package from the CRAN repository.

• summary() - Provides an overview of the contents of a data structure.

REFERENCES

• Book: Introduction to Data Science

• Book: Data Mining: Concepts and Techniques, Second Edition

• Slides: Dr. Bassel Alkteeb

THANK YOU
