Applications of Data Mining to Internet Marketing Alan Montgomery Associate Professor Carnegie Mellon University Graduate School of Industrial Administration e-mail: [email protected] web: http://www.andrew.cmu.edu/user/alm3 Hong Kong University, 9 July 2003 © 2003 by Alan Montgomery, All rights reserved.

Page 1: Applications of Data Mining to Internet Marketing Alan Montgomery Associate Professor Carnegie Mellon University Graduate School of Industrial Administration.

Applications of Data Mining to Internet Marketing

Alan Montgomery
Associate Professor

Carnegie Mellon University
Graduate School of Industrial Administration

e-mail: [email protected]
web: http://www.andrew.cmu.edu/user/alm3

Hong Kong University, 9 July 2003

© 2003 by Alan Montgomery, All rights reserved.

Page 2:

Outline

• Background on Data Mining and Internet Marketing

• Web Mining as a basis for Interactive Marketing

• Using text processing algorithms to classify content

Page 3:

Interactive Marketing/CRM

The reason we are interested in web mining is that we can use it for interactive marketing.

Page 4:

Internet Marketing

• Learning about customers
  – Knowledge acquisition
  – Customer differentiation

• Customization of the product and product experience
  – Creating what customers want
  – Remembering what customers want
  – Anticipating what customers want
  – Determining the price sensitivity of the customer

Page 5:

Example: American Airlines

• Has the capability to build custom pages for each of the airline’s 2 million registered users

• Prior to the web, there was no cost-efficient way to tell millions of customers about a special fare available only this weekend, to a destination you personally will find attractive

Source

Page 6:

Interactive Marketing Requires...

• Ability to identify end-users
• Ability to differentiate customers based on their value and their needs
• Ability to interact with your customers
• Ability to customize your products and services based on knowledge about your customers

Peppers, Rogers, and Dorf (1999)

Information is key!

Page 7:

Data Mining

• Searches for patterns in the growing flood of data
  – World Wide Web
  – Transaction Data
  – Remotely sensed data
  – Genomic data

• Applies algorithms (neural networks, genetic algorithms, Bayes nets, …) that can automatically find patterns and potentially learn and improve from experience (machine learning)

Page 8:

The Ultimate User Interface

Page 9:

The Sophisticated Marketer

• Creating what customers want
• Remembering what customers want
• Anticipating what customers want
• Determining the price sensitivity of the customer

Page 10:

The Naïve Consumer

Page 11:

Example

Page 12:

What is this user doing?

Page 13:

Text Classification

Categorizing Web Viewership Using Statistical Models of Web Navigation and Text Classification

Page 14:

User Demographics
Sex: Male
Age: 22
Occupation: Student
Income: < $30,000
State: Pennsylvania
Country: U.S.A.

{Business} {Business} {Business} {Sports}

{Sports} {???} {News} {News}

{???} {???}

Page 15:

Information Available

Clickstream Data

• Panel of representative web users collected by Jupiter Media Metrix

• Sample of 30 randomly selected users who browsed during April 2002
  – 38k URL viewings
  – 13k unique URLs visited
  – 1,550 domains

• Average user
  – Views 1,300 URLs
  – Active for 9 hours/month

Classification Information

• Dmoz.org - Pages classified by human experts

• Page Content - Text classification algorithms from computer science and information retrieval

Page 16:

Dmoz.org

• Largest, most comprehensive human-edited directory of the web

• Constructed and maintained by volunteers (open-source); original set donated by Netscape

• Used by Netscape, AOL, Google, Lycos, Hotbot, DirectHit, etc.

• Over 3m sites classified, 438k categories, 43k editors (Dec 2001)

Categories
1. Arts
2. Business
3. Computers
4. Games
5. Health
6. Home
7. News
8. Recreation
9. Reference
10. Science
11. Shopping
12. Society
13. Sports
14. Adult

Page 17:

Problem

• Web is very large and dynamic and only a fraction of pages can be classified
  – 147m hosts (Jan 2002, Internet Domain Survey, isc.org)
  – 1b+ (?) web pages

• Only a fraction of the web pages in our panel are categorized
  – 1.3% of web pages are exactly categorized
  – 7.3% categorized within one level
  – 10% categorized within two levels
  – 74% of pages have no classification information

Page 18:

Text Classification

Page 19:

Background

• Information Retrieval
  – Overview (Baeza-Yates and Ribeiro-Neto 2000, Chakrabarti 2000)
  – Naïve Bayes (Joachims 1997)
  – Support Vector Machines (Vapnik 1995 and Joachims 1998)
  – Feature Selection (Mladenic and Grobelnik 1998, Yang and Pedersen 1998)
  – Latent Semantic Indexing
  – Language Models (MacKay and Peto 1994)

Page 20:

Result: Document Vector

term       count
home       2
game       8
hit        4
runs       6
threw      2
ejected    1
baseball   5
major      2
league     2
bat        2
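
A minimal sketch of how a document vector like the one above could be produced from raw page text. The tokenizer (lowercase, letters only) and the sample sentence are assumptions made for illustration; the slides do not describe the preprocessing actually used.

```python
import re
from collections import Counter


def document_vector(text):
    """Return a term-frequency vector (term -> count) for a piece of text."""
    # Lowercase the text and keep runs of letters as tokens. A fuller pipeline
    # would probably also drop stop words and apply stemming (not shown here).
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical page text, invented for this example.
sample = "The game went to extra innings. A late hit won the game."
vector = document_vector(sample)

# Show the most frequent terms and their counts.
for term, count in vector.most_common(3):
    print(term, count)
```

Each page then becomes a sparse map from terms to counts, which is the form the classification slides work with.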

Page 21:

Classifying Document Vectors

Test Document: home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2

{News Class}: bush 58, congress 92, tax 48, cynic 16, politician 23, forest 9, major 3, world 29, summit 31, federal 64

{Sports Class}: game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26

{Shopping Class}: sale 87, customer 28, cart 24, game 16, microsoft 31, buy 93, order 75, pants 21, nike 8, tax 19

Which class does the test document belong to? ? ? ?


Page 23:

Classifying Document Vectors

Test Document: home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2

{News Class}: bush 58, congress 92, tax 48, cynic 16, politician 23, forest 9, major 3, world 29, summit 31, federal 64

{Sports Class}: game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26

{Shopping Class}: sale 87, customer 28, cart 24, game 16, microsoft 31, buy 93, order 75, pants 21, nike 8, tax 19

P( {News} | Test Doc) = 0.02
P( {Sports} | Test Doc) = 0.91
P( {Shopping} | Test Doc) = 0.07

Page 24:

Classifying Document Vectors

Test Document: home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2

{Sports Class}: game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26

P( {Sports} | Test Doc) = 0.91

Page 25:

Classification Model

• A document is a vector of term frequency (TF) values; each category has its own term distribution

• Words in a document are generated by a multinomial model of the term distribution in a given class:

$$d \sim M\{n,\ p = (p_1^c, p_2^c, \ldots, p_{|V|}^c)\}$$

• Classification:

$$\arg\max_{c \in C} \{P(c \mid d)\} = \arg\max_{c \in C} \Big\{ P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{n_i^c} \Big\}$$

|V| : vocabulary size
$n_i^c$ : # of times word i appears in class c
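
A rough sketch of a multinomial naive Bayes classifier in the spirit of the rule above, run on the example term counts from the "Classifying Document Vectors" slides. Several choices here are assumptions not stated on the slides: equal class priors P(c), add-one (Laplace) smoothing when estimating P(w_i | c) from the class counts, and using the test document's own word counts as the exponents. The function names are hypothetical.

```python
import math

# Term counts copied from the example class vectors on the slides.
CLASS_COUNTS = {
    "News": {"bush": 58, "congress": 92, "tax": 48, "cynic": 16, "politician": 23,
             "forest": 9, "major": 3, "world": 29, "summit": 31, "federal": 64},
    "Sports": {"game": 97, "football": 32, "hit": 45, "goal": 84, "umpire": 23,
               "won": 12, "league": 58, "baseball": 39, "soccer": 21, "runs": 26},
    "Shopping": {"sale": 87, "customer": 28, "cart": 24, "game": 16, "microsoft": 31,
                 "buy": 93, "order": 75, "pants": 21, "nike": 8, "tax": 19},
}

TEST_DOC = {"home": 2, "game": 8, "hit": 4, "runs": 6, "threw": 2,
            "ejected": 1, "baseball": 5, "major": 2, "league": 2, "bat": 2}


def log_scores(doc, class_counts, alpha=1.0):
    """Return log P(c) + sum_i n_i * log P(w_i | c) for every class c."""
    # Vocabulary = every term seen in the document or in any class.
    vocab = set(doc)
    for counts in class_counts.values():
        vocab.update(counts)

    log_prior = math.log(1.0 / len(class_counts))  # assumed equal priors
    scores = {}
    for label, counts in class_counts.items():
        # Smoothed denominator so unseen words get a small nonzero probability.
        denom = sum(counts.values()) + alpha * len(vocab)
        score = log_prior
        for word, n in doc.items():
            score += n * math.log((counts.get(word, 0) + alpha) / denom)
        scores[label] = score
    return scores


def classify(doc, class_counts):
    """Pick the class with the highest log score."""
    scores = log_scores(doc, class_counts)
    return max(scores, key=scores.get)


print(classify(TEST_DOC, CLASS_COUNTS))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. Because this toy version is not the fitted model from the talk, its posterior probabilities need not match the figures shown on the slides.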

Page 26:

Results

• 25% correct classification
• Compare with random guessing of 7%
• More advanced techniques perform slightly better:
  – Shrinkage of word term frequencies (McCallum et al 1998)
  – n-gram models
  – Support Vector Machines

Page 27:

User Browsing Model

Page 28:

User Browsing Model

• Web browsing is “sticky” or persistent: users tend to view a series of pages within the same category and then switch to another topic

• Example:

{News} {News} {News}

Page 29:

A Markov Model of Browsing

[Diagram: Markov chain with nodes "1. News", "1. Other", "2. News", "2. Other"]

Page 30:

Markov Switching Model

            arts  business  computers  games  health  home  news  recreation  reference  science  shopping  society  sports  adult
arts         83%        4%         5%     2%      1%    2%    6%          3%         2%       6%        2%       3%      4%     1%
business      3%       73%         5%     3%      2%    3%    6%          2%         3%       3%        3%       2%      3%     2%
computers     5%       11%        79%     3%      3%    7%    5%          3%         4%       4%        5%       5%      2%     2%
games         1%        3%         2%    90%      1%    1%    1%          1%         0%       1%        1%       1%      1%     0%
health        0%        0%         0%     0%     84%    1%    1%          0%         0%       1%        0%       1%      0%     0%
home          0%        1%         1%     0%      1%   80%    1%          1%         0%       1%        1%       1%      0%     0%
news          1%        1%         1%     0%      1%    0%   69%          0%         0%       1%        0%       1%      1%     0%
recreation    1%        1%         1%     0%      1%    1%    1%         86%         1%       1%        1%       1%      1%     0%
reference     0%        1%         1%     0%      1%    0%    1%          0%        85%       2%        0%       1%      1%     0%
science       1%        0%         0%     0%      1%    1%    1%          0%         1%      75%        0%       1%      0%     0%
shopping      1%        3%         2%     1%      1%    2%    1%          1%         0%       1%       86%       1%      1%     0%
society       1%        1%         2%     0%      2%    1%    3%          1%         2%       2%        0%      82%      1%     1%
sports        2%        1%         1%     0%      0%    0%    3%          1%         1%       0%        0%       1%     85%     0%
adult         1%        1%         1%     0%      0%    0%    1%          0%         0%       0%        0%       1%      0%    93%

             16%       10%        19%    11%      2%    3%    2%          6%         3%       2%        7%       6%      5%     7%

Pooled transition matrix, heterogeneity across users
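
As a hedged illustration of where transition probabilities like these come from, the sketch below estimates P(next category | current category) from a single sequence of page categories. The sequence is invented, the function name is hypothetical, and the sketch ignores the pooling and user heterogeneity mentioned on the slide.

```python
from collections import Counter, defaultdict


def transition_shares(sequence):
    """Estimate P(next | current) from consecutive pairs in one sequence."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        pair_counts[current][nxt] += 1
    # Normalize each current-state's counts so they sum to 1.
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }


# Hypothetical browsing session, invented for illustration.
views = ["news", "news", "news", "sports", "sports", "news", "news", "business"]
probs = transition_shares(views)

for current, row in probs.items():
    print(current, row)
```

A large share on the "stay in the same category" transition is what the slides describe as sticky browsing.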

Page 31:

Implications

• Suppose we have the following sequence:

{News} ? {News}

• Best guess: the probability of a news → news transition is 69%
• Using Bayes' rule, we can determine that there is a 97% probability that the unknown page is news (unconditional = 2%, conditional on last observation = 69%)
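
The Bayes' rule step can be sketched as follows for a first-order Markov chain: the probability that the unknown middle page is in category c is proportional to P(c | previous = news) × P(next = news | c). The numbers are the "news" column and "news" row of the rounded pooled matrix from the Markov Switching Model slide; reading columns as the current category and rows as the next category is an assumption, based on each column summing to roughly 100%. The function name is hypothetical.

```python
CATEGORIES = ["arts", "business", "computers", "games", "health", "home", "news",
              "recreation", "reference", "science", "shopping", "society",
              "sports", "adult"]

# "news" column of the pooled matrix, read as P(next = c | current = news).
FROM_NEWS = [0.06, 0.06, 0.05, 0.01, 0.01, 0.01, 0.69,
             0.01, 0.01, 0.01, 0.01, 0.03, 0.03, 0.01]

# "news" row of the pooled matrix, read as P(next = news | current = c).
TO_NEWS = [0.01, 0.01, 0.01, 0.00, 0.01, 0.00, 0.69,
           0.00, 0.00, 0.01, 0.00, 0.01, 0.01, 0.00]


def middle_page_posterior(from_prev, to_next):
    """P(middle = c | previous page, next page) for a first-order Markov chain."""
    joint = [p * q for p, q in zip(from_prev, to_next)]
    total = sum(joint)
    return [j / total for j in joint]


posterior = dict(zip(CATEGORIES, middle_page_posterior(FROM_NEWS, TO_NEWS)))
print(f"P(middle page is news) = {posterior['news']:.3f}")
```

With these rounded, pooled values the result comes out somewhat above the slide's 97%; the slide's figure presumably reflects unrounded or user-specific probabilities, which are not available here.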

Page 32:

Results

Page 33:

Methodology

Bayesian setup to combine information from:

• Known categories based on exact matches
• Text classification
• Markov Model of User Browsing
  – Introduce heterogeneity by assuming that conditional transition probability vectors are drawn from a Dirichlet distribution
• Similarity of other pages in the same domain
  – Assume that the category of each page within a domain follows a Dirichlet distribution, so if we are at a “news” site then pages are more likely to be classified as “news”
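
To illustrate why a Dirichlet assumption makes pages on a "news" site more likely to be classified as news, here is a minimal sketch of the standard Dirichlet-multinomial update: the posterior mean of each category's probability is (prior pseudo-count + observed count) / (sum of both). The prior values, the observed counts, and the function name are all hypothetical; the slides do not give the model's actual parameters.

```python
def dirichlet_posterior_mean(prior, counts):
    """Posterior mean category probabilities given Dirichlet pseudo-counts
    and observed category counts for pages in one domain."""
    total = sum(prior.values()) + sum(counts.get(c, 0) for c in prior)
    return {c: (prior[c] + counts.get(c, 0)) / total for c in prior}


# Hypothetical symmetric prior over three categories.
prior = {"news": 1.0, "sports": 1.0, "shopping": 1.0}

# Hypothetical counts of already-categorized pages within one domain.
domain_counts = {"news": 8, "sports": 1}

posterior = dirichlet_posterior_mean(prior, domain_counts)
for category, p in posterior.items():
    print(category, round(p, 3))
```

The more categorized pages a domain has in one category, the more strongly an uncategorized page from that domain is pulled toward it, while the prior keeps every category possible.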

Page 34:

Findings

Random guessing        7%
Text Classification   25%
+ Domain Model        41%
+ Browsing Model      78%

Page 35:

Findings about Text Classification

Page 36:

Key Points of Text Processing

Can turn text and qualitative data into quantitative data

• Each technique (text classification, browsing model, or domain model) performs only fairly well (~25% classification)

• Combining these techniques together results in very good (~80%) classification rates

Page 37:

Applications

• Newsgroups
  – Gather information from newsgroups and determine whether consumers are responding positively or negatively

• E-mail
  – Scan e-mail text for similarities to known problems/topics

• Better Search engines
  – Instead of experts classifying pages we can mine the information collected by ISPs and classify it automatically

• Adult filters
  – US Appeals Court struck down Children’s Internet Protection Act on the grounds that technology was inadequate

Page 38:

Conclusions

Page 39:

Conclusions

• Interactive Marketing provides a foundation for understanding how marketers may use data mining in e-business

• Clickstream data provides a powerful raw input that requires effort to turn it into useful knowledge
  – User profiling predicts ‘who you are’ from ‘where you go’
  – Path analysis predicts ‘what you want’ from ‘what you view’
  – Text processing can turn qualitative data into quantitative data

What is your company doing with clickstream data?

Page 40:

Future Directions for Learning

• Across full mixed-media data
• Across multiple internal databases, web, newsfeeds
• Active experimentation
• Learn decisions rather than predictions
• Cumulative, life-long learning
• Programming languages with learning embedded