Processing and Analyzing Large log from Search Engine Meng Dou 13/9/2012.


Page 1:

Processing and Analyzing Large log from Search Engine

Meng Dou, 13/9/2012

Page 2:

Big data

Web-browsing data, social network communications, sensor data -> behavior data. Google and Facebook, for example, are Big Data companies.

Opportunities:
Behavior data (such as web logs) can be used to improve and support business processes, through data mining, process mining, and so on.

Challenges:
• Big data processing
• Extracting useful information that reflects user behavior from massive logs
• Instance data management
• Data analysis

Page 3:

Figure (architecture diagram, two stacks).

General big data stack:
• Cloud Storage of unstructured data: Distributed File System (HDFS); key-value databases (HBase, Cassandra, MongoDB)
• Big Data processing: cloud computing (Map/Reduce framework)
• Big Data access: Hive, NoSQL
• Analytic applications: BI/Reporting, Data Mining, Machine Learning, Process Mining

Stack used in this project:
• Raw data: Web Data in the Distributed File System (HDFS)
• Cloud computing (Map/Reduce framework)
• Instance data: NoSQL (Cassandra)
• Process Mining

Page 4:

Case study: Search Engine Company

• News, Page, Image, Maps, Music, navigation

Dataset: 66 million clicks in one month, 2.2 million clicks per day -> generate behavior in 10 minutes

User Behavior:
• Visiting path (Referer)
• Searching result effectiveness
• Abs Clicking Behavior
• Source and destination of user visits
• Robot behavior recognition and analysis
• Visiting page layout
• Behavior comparison and product improvement
• User grouping and recommendation

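Figure (page navigation diagram).

Pages: Home page; Image home page; News home page; Commentary home page; Web results page; Commentary results page; Image results page; Image transition page; News transition page; News topic page; News results page; External page

Actions: Web search; Image search; News search; Commentary search; Page switch; Web result click; Image click; News click; Click full text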

Page 5:

Data features

• Contains massive information in a well-recorded format

• Large scale, with high growth potential

• Real-time analysis

Page 6:

Existing tools

Data extraction: XESame, ProM Import

Process mining: ProM
1) With a large data set, analysis is slow and in most situations the tool crashes
2) Offline analysis -> real-time analysis

Cloud storage / non-relational DB

Instance data (XES)

Extracting data from cloud

Page 7:

System Structure

Log processing

Understandable model

Extracting useful information that reflects user behavior from massive logs

Page 8:

Convert raw log to instance data (event log) with Map/Reduce
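The instance data targeted by this conversion is an XES event log. As a rough illustration only, the sketch below builds a minimal XES-style log with Python's standard library. The function name `build_xes`, the sample case, and its timestamps are hypothetical; the element and key names (`trace`, `event`, `concept:name`, `time:timestamp`) follow the XES standard, but the extension and global declarations that tools such as ProM may expect are left out.

```python
import xml.etree.ElementTree as ET


def build_xes(traces):
    """Build an XES <log> element from {case_id: [(activity, iso_timestamp), ...]}."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # The trace name carries the case ID.
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for activity, iso_timestamp in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": iso_timestamp})
    return log


# Hypothetical sample: one case with three activities and made-up timestamps.
SAMPLE = {
    "CaseID1": [
        ("A", "2012-09-13T10:00:00+00:00"),
        ("D", "2012-09-13T10:00:05+00:00"),
        ("E", "2012-09-13T10:00:09+00:00"),
    ]
}

if __name__ == "__main__":
    print(ET.tostring(build_xes(SAMPLE), encoding="unicode"))
```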

Page 9:

Raw events in log order: A E B D C F D A E F G

Map (one CaseID+Timestamp+Activity record per event):
CaseID1+T1+A, CaseID1+T3+E, CaseID2+T3+B, CaseID3+T2+D, CaseID2+T1+C, CaseID3+T3+F, CaseID1+T2+D, CaseID2+T4+A, CaseID3+T1+E, CaseID4+T1+F, CaseID2+T2+G

Sort and Partition (grouped by case, ordered by time):
CaseID1+T1+A, CaseID1+T2+D, CaseID1+T3+E
CaseID2+T1+C, CaseID2+T2+G, CaseID2+T3+B, CaseID2+T4+A
CaseID3+T1+E, CaseID3+T2+D, CaseID3+T3+F
CaseID4+T1+F

Reduce (one trace per case):
A D E
C G B A
E D F
F

If the number of events in an XLog exceeds 5000, that XLog is written out as its own file, to avoid exceeding the heap size of the machine.

Output files: XESName_0.xes, XESName_1.xes, XESName_2.xes
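The map, sort/partition, and reduce steps of this example can be imitated in a few lines of plain Python. This is a single-process sketch, not the Hadoop job used in the project: the function names are hypothetical, and the way traces are packed into output logs is only one reading of the 5000-event rule (whole traces are added until the running event count exceeds the threshold).

```python
from collections import defaultdict

MAX_EVENTS_PER_LOG = 5000  # threshold stated on the slide

# The eleven (case ID, timestamp, activity) records from the example.
RAW = [
    ("CaseID1", "T1", "A"), ("CaseID1", "T3", "E"), ("CaseID2", "T3", "B"),
    ("CaseID3", "T2", "D"), ("CaseID2", "T1", "C"), ("CaseID3", "T3", "F"),
    ("CaseID1", "T2", "D"), ("CaseID2", "T4", "A"), ("CaseID3", "T1", "E"),
    ("CaseID4", "T1", "F"), ("CaseID2", "T2", "G"),
]


def map_phase(raw_records):
    """Map: emit ((case_id, timestamp), activity) for every raw record."""
    for case_id, timestamp, activity in raw_records:
        yield (case_id, timestamp), activity


def shuffle_and_sort(pairs):
    """Sort and partition: group by case ID, ordered by timestamp within a case."""
    cases = defaultdict(list)
    for (case_id, _timestamp), activity in sorted(pairs):
        cases[case_id].append(activity)
    return cases


def reduce_phase(cases, max_events=MAX_EVENTS_PER_LOG):
    """Reduce: pack traces into logs, closing a log once it exceeds max_events."""
    logs, current, count = [], [], 0
    for case_id in sorted(cases):
        current.append((case_id, cases[case_id]))
        count += len(cases[case_id])
        if count > max_events:
            logs.append(current)
            current, count = [], 0
    if current:
        logs.append(current)
    return logs


if __name__ == "__main__":
    grouped = shuffle_and_sort(map_phase(RAW))
    for i, log in enumerate(reduce_phase(grouped)):
        print(f"XESName_{i}.xes", log)
```

Lowering `max_events` (for example to 5) makes the same input split across several output logs, which mirrors the numbered XESName_*.xes files.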

Page 10:

fileSize | logNum | OnePCTime | MapReduceTime | MapNum | ReduceNum
8.84 MB | 36,422 | 5 s 921 ms | 7 s | 3 | 15
65.8 MB | 218,177 | 30 s 846 ms | 25 s | 3 | 15
112 MB | 772,241 | 48 s 559 ms | 30 s | 3 | 15
One day (371 MB) | 2,200,000 | 2.5 minutes | 1.3 minutes | 40 | 15
One week | 15,000,000 | 20 minutes (expected) | 2.5 minutes | 280 | 15
One month | 66,000,000 | 2 hours (expected) | 6 minutes | 1200 | 15

Cluster: 14 nodes, CPU: Intel Xeon 2.40 GHz, RAM: 2 GB
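To make the comparison in the table easier to read, the short sketch below computes the ratio of single-PC time to Map/Reduce time for each row. The timings are copied from the table; the one-week and one-month single-PC times are the expected values given there, not measurements, and the `speedup` helper is a hypothetical name.

```python
# (label, single-PC seconds, Map/Reduce seconds), copied from the table.
RUNS = [
    ("8.84 MB", 5.921, 7.0),
    ("65.8 MB", 30.846, 25.0),
    ("112 MB", 48.559, 30.0),
    ("One day", 2.5 * 60, 1.3 * 60),
    ("One week", 20 * 60, 2.5 * 60),   # single-PC time is an expected value
    ("One month", 2 * 3600, 6 * 60),   # single-PC time is an expected value
]


def speedup(single_pc_seconds, mapreduce_seconds):
    """Ratio of single-PC time to Map/Reduce time (above 1 means Map/Reduce is faster)."""
    return single_pc_seconds / mapreduce_seconds


if __name__ == "__main__":
    for label, one_pc, map_reduce in RUNS:
        print(f"{label:>10}: {speedup(one_pc, map_reduce):.2f}x")
```

On these figures the smallest file runs slower under Map/Reduce than on one PC, while the ratio grows to roughly 8x for one week and 20x for one month of logs.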

Page 11:

Process Discovery

• Alpha miner
• Heuristic miner
• Fuzzy miner
• Sequence model

One instance/case is defined as one visit by one visitor. A visitor is identified by:
• IP+UA
• CookieID

The activity definition varies with the analysis requirements.
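One possible way to turn this case definition into code is sketched below. Several parts are assumptions that do not come from the slides: the `LogRecord` fields, preferring CookieID and falling back to IP+UA when no cookie is present, and splitting visits after 30 minutes of inactivity.

```python
from dataclasses import dataclass
from typing import Optional

VISIT_GAP_SECONDS = 30 * 60  # assumption: inactivity gap that ends a visit


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical shape of one parsed log line."""
    ip: str
    user_agent: str
    cookie_id: Optional[str]
    timestamp: float  # seconds
    activity: str


def visitor_key(rec):
    """Identify the visitor by CookieID if present, otherwise by IP+UA."""
    return rec.cookie_id if rec.cookie_id else f"{rec.ip}|{rec.user_agent}"


def assign_cases(records, gap=VISIT_GAP_SECONDS):
    """Return {case_id: [activities]}, where a case is one visit of one visitor."""
    cases = {}
    last_seen = {}  # visitor key -> (time of last event, visit index)
    for rec in sorted(records, key=lambda r: r.timestamp):
        key = visitor_key(rec)
        previous = last_seen.get(key)
        if previous is None:
            visit = 0
        elif rec.timestamp - previous[0] > gap:
            visit = previous[1] + 1  # long pause: start a new visit
        else:
            visit = previous[1]
        last_seen[key] = (rec.timestamp, visit)
        cases.setdefault(f"{key}#{visit}", []).append(rec.activity)
    return cases
```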

Page 12:

Behavior analysis

User behavior pattern | Range | Activity | Data selection
Interaction between channels | all | ContentType |
Web map visiting path | all | Referer/URL |
Webpage layout | news | ContentType+PageType+Block | (Channel=news) AND (PageType=195)
Webpage layout | image | ContentType+PageType+Block | (Channel=image) AND (PageType=435)
Searching result | all | |
Behavior grouping | all | |
Registration | | |
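The "Data selection" and "Activity" columns of the table above can be read as a filter followed by a label-building rule. The sketch below shows that reading for the webpage layout rows. The field names (Channel, PageType, ContentType, Block) are taken from the table; the dictionary record layout, the helper names, and the sample values are hypothetical.

```python
def select(records, channel, page_type):
    """Keep records matching (Channel = channel) AND (PageType = page_type)."""
    return [
        r for r in records
        if r["Channel"] == channel and r["PageType"] == page_type
    ]


def activity_name(record, fields=("ContentType", "PageType", "Block")):
    """Build the activity label by joining the chosen fields with '+'."""
    return "+".join(str(record[field]) for field in fields)


# Hypothetical sample records; real logs will carry different values.
SAMPLE_RECORDS = [
    {"Channel": "news", "PageType": 195, "ContentType": "news", "Block": "top"},
    {"Channel": "news", "PageType": 200, "ContentType": "news", "Block": "side"},
    {"Channel": "image", "PageType": 435, "ContentType": "image", "Block": "grid"},
]

if __name__ == "__main__":
    for rec in select(SAMPLE_RECORDS, "news", 195):
        print(activity_name(rec))
```

For the "Interaction between channels" row, the same helper could be called with `fields=("ContentType",)` and no filter.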

Page 13:

Page 14:

Behavior analysis

Page 15:

Active visitor’s visiting path

Page 16:

Behavior analysis

Page 17:

Main page

Page 18:

Page 19:

Sequence model

Page 20:

Page 21:

XES statistics

Page 22:

Conclusion

This is a good project for getting into the data analysis field, combining web data analysis, process mining, and cloud computing technology.

Future work:
1 More algorithms and technologies should be applied to this data set.
2 Behavior comparison and user recommendation still need to be accomplished.
3 Can process mining analyze behavior that does not follow a fixed pattern?

1 Log sampling
2 Detect incorrect entries in logs before applying analysis technologies to them.
3 Extend ProM or XESame with a function for converting data from a key-value database or cloud storage to an event log.

Page 23:

Feedback

1 What is the real question?
2 Why process mining?

Page 24:

Thank you!

Meng Dou, 13/9/2012