Association Mining

www.edureka.co/r-for-analytics Know The Science Behind Product Recommendation

Transcript of Association Mining

Page 1: Association Mining

Know The Science Behind Product Recommendation

Page 2: Association Mining

Objectives

At the end of this session, you will be able to explain:

What data mining is

What Business Analytics is

The stages of analytics / data mining

What R is

An overview of Machine Learning

What Association Rule Mining is

A use case

Page 3: Association Mining

Business Analytics

Why is Business Analytics becoming popular these days?

Cost of storing data

Cost of processing data

Page 4: Association Mining

Stages of Analytics / Data Mining

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Page 5: Association Mining

What is R

R is a programming language

R is an environment for statistical analysis

R is data analysis software

Page 6: Association Mining

R : Characteristics

An effective and fast data handling and storage facility

A suite of operators for calculations on arrays, lists, vectors, etc.

A large, integrated collection of tools for data analysis and visualization

Graphical facilities for data analysis and display, either on screen or on paper

A well-developed and effective programming language, based on the ‘S’ language

A complete range of packages to extend and enrich the functionality of R

Page 7: Association Mining

Who Uses R : Domains

Telecom

Pharmaceuticals

Financial Services

Life Sciences

Education, etc

Page 8: Association Mining

Common Machine Learning Algorithms

Types of Learning

Supervised Learning

Algorithms

Naïve Bayes

Support Vector Machines

Random Forests

Decision Trees

Unsupervised Learning

Algorithms

K-means

Fuzzy Clustering

Hierarchical Clustering

Gaussian mixture models

Self-organizing maps

Page 9: Association Mining

Association Rule Mining

Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars

Customers who purchase maintenance agreements are very likely to purchase large appliances

When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

Page 10: Association Mining

What is Association Rule Mining?

In data mining, Association Rule Mining is a popular and well-researched method for discovering interesting relations between variables in large databases.

It is intended to identify strong rules discovered in databases using different measures of interestingness.

For example, a rule found in the sales data of a supermarket might indicate that if a customer buys onions and potatoes together, he or she is also likely to buy hamburger meat.

Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement.

Page 11: Association Mining

How Good Is an Association Rule?

Here we have 5 customers. Each customer is given a basket, and their purchases are as follows:

Customer   Items Purchased
1          OJ, soda
2          Milk, OJ, window cleaner
3          OJ, detergent
4          OJ, detergent, soda
5          Window cleaner, soda

Here, customer 1 purchases OJ (orange juice) and soda; customer 2 purchases milk, OJ and window cleaner; customer 3 purchases OJ and detergent; customer 4 purchases OJ, detergent and soda; and customer 5 purchases window cleaner and soda.

Now let's form a matrix to analyze the data above and draw inferences.

Page 12: Association Mining

How Good Is an Association Rule?

                 OJ   Window cleaner   Milk   Soda   Detergent
OJ               4    1                1      2      2
Window cleaner   1    2                1      1      0
Milk             1    1                1      0      0
Soda             2    1                0      3      1
Detergent        2    0                0      1      2

Simple patterns derived from the above observation:

OJ and soda are purchased together in 2 of the 5 baskets, a count matched only by OJ and detergent

Detergent is never purchased with milk or window cleaner

Milk is never purchased with soda or detergent

Co-occurrence of Products
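The co-occurrence counts can be reproduced with a few lines of code. The following is a minimal sketch in Python (not the R used in this session); the names `baskets`, `items` and `cooccurrence` are hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement

# The five baskets from the example (one set of items per customer).
baskets = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]

# Row/column order used in the matrix.
items = ["OJ", "window cleaner", "milk", "soda", "detergent"]


def cooccurrence(baskets, items):
    """Count, for every pair of items, the baskets that contain both.

    The diagonal entry (a, a) is the number of baskets containing a.
    """
    counts = {a: {b: 0 for b in items} for a in items}
    for a, b in combinations_with_replacement(items, 2):
        n = sum(1 for basket in baskets if a in basket and b in basket)
        counts[a][b] = n
        counts[b][a] = n  # the matrix is symmetric
    return counts


matrix = cooccurrence(baskets, items)
for a in items:
    print(f"{a:15}", [matrix[a][b] for b in items])
```

Each printed row corresponds to one row of the matrix, in the same item order.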

Page 13: Association Mining

Association Rule Mining

The following three measures are the key constraints used to select association rules:

Support

The support of an itemset X: Supp(X) = proportion of transactions in the data set which contain X.

Confidence

The confidence of a rule: Conf(X => Y) = Supp(X ∪ Y) / Supp(X)

Lift

The lift of a rule: Lift(X => Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))

Now let's calculate the support and confidence for two rules from the five-customer grocery example:

Rule              Support   Confidence
{Soda} => {OJ}    0.4       0.6667
{OJ} => {Soda}    0.4       0.5
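The three measures can be checked directly against the five baskets. The following is a minimal sketch in Python (not the R used in this session); the function names `support`, `confidence` and `lift` are hypothetical helpers that simply follow the definitions above.

```python
# The five baskets from the example (one set of items per customer).
baskets = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(itemset, baskets):
    """Proportion of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(x, y, baskets):
    """Conf(X => Y) = Supp(X u Y) / Supp(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)


def lift(x, y, baskets):
    """Lift(X => Y) = Supp(X u Y) / (Supp(X) * Supp(Y))."""
    joint = support(set(x) | set(y), baskets)
    return joint / (support(x, baskets) * support(y, baskets))


for x, y in [({"soda"}, {"OJ"}), ({"OJ"}, {"soda"})]:
    print(
        sorted(x), "=>", sorted(y),
        "support", round(support(x | y, baskets), 4),
        "confidence", round(confidence(x, y, baskets), 4),
        "lift", round(lift(x, y, baskets), 4),
    )
```

Lift is symmetric in X and Y, so both rules have the same lift of 0.4 / (0.6 × 0.8), roughly 0.83. A value below 1 means soda and OJ appear together slightly less often than would be expected if they were bought independently.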

Page 14: Association Mining

Association Rule Mining

The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.

‘arules’ provides the infrastructure for representing, manipulating and analyzing transaction data and patterns.

‘arulesViz’ provides various visualization techniques for association rules and itemsets. This package extends the package arules.

Page 15: Association Mining

Association Rule Mining

Syntax - apriori(data, parameter = NULL, appearance = NULL, control = NULL)

apriori() - The apriori function is provided by the ‘arules’ package. It employs a level-wise search for frequent itemsets.
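To illustrate what a level-wise search means, here is a minimal sketch in Python run on the five-customer baskets from the earlier slides. It is not the implementation used by arules; the function name `frequent_itemsets` and the threshold of 0.4 are assumptions made for illustration only.

```python
from itertools import combinations

# The five baskets from the earlier example.
baskets = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def frequent_itemsets(baskets, min_support):
    """Level-wise search: frequent k-itemsets are built only from
    frequent (k-1)-itemsets, so infrequent branches are never explored."""
    n = len(baskets)

    def supp(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: single items that meet the support threshold.
    singles = {frozenset([item]) for b in baskets for item in b}
    level = {s for s in singles if supp(s) >= min_support}

    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = supp(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if supp(c) >= min_support}
    return result


freq = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 0.4 (two of five baskets), milk is dropped at the first level, so no candidate containing milk is ever generated. The candidate {OJ, soda, detergent} is pruned because its subset {soda, detergent} is not frequent.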

Page 16: Association Mining

Association Rule Mining

Going through 1098 rules manually is not an efficient option.

Let us make use of the ‘Viz’ in arulesViz and visualize the rules.

Page 17: Association Mining

Association Rule Mining

Now let's plot the rules as a scatter plot.

A scatter plot is a mathematical diagram that displays the values of two variables for a set of data.

The data is displayed as a collection of points.

A scatter plot can be used when one variable is under the control of the experimenter.

Conclusion:

It can be seen that rules with high lift have relatively low support.

The most interesting rules reside on the support/confidence border.

Page 18: Association Mining

Association Rule Mining

After mining the association rules, the support, confidence and lift values for the Groceries data are as shown below:

Page 19: Association Mining

Association Rule Mining

Page 20: Association Mining

Association Rule Mining

Let us zoom into the plot to observe the significant inferences:

Conclusion:

The most interesting rules according to ‘lift’ can be seen at the top-center.

There are 3 rules with “butter” and one other item in the antecedent and “whipped/sour cream” as the consequent.

Page 21: Association Mining