Presented by Andrew Wong 9th Annual IUG meeting at HKU Library 8 December 2009.


Page 1:

LOGS MINER: PORTAL FOR DATA MINING WEB ACCESS LOGS

Presented by Andrew Wong

9th Annual IUG meeting at HKU Library, 8 December 2009

Page 2:

Agenda
• Definitions
• Motivations
• Architecture of Logs Miner
• Logs Miner User Interface
• Logs Miner reports
• Benefits
• Future development

Page 3:

Definitions

Web data mining: “application of data mining methodologies, techniques, and models to variety of data forms, structures, and usage patterns that comprise the World Wide Web” (Markov, Z. & Larose, D. T. 2007)

Three scopes of Web data mining:
• Web content mining
• Web structure mining
• Web log mining

Page 4:

Definitions

Web log mining
• Discovers user access patterns from Web usage logs
• Is also called web usage mining
• Three processing stages:
  1. Pre-processing
  2. Pattern discovery
  3. Pattern analysis

Page 5:

Purposes for web log mining
• Identify and classify different groups of patrons
• Understand the search patterns of different groups of patrons
• Adapt web user interfaces to suit users' needs
• Provide statistical data for collection management

Page 6:

Web logs

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=10486796160015392754)"

lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

• Web logs provide a large amount of information on user actions

Page 7:

Web logs

Field                          Value
Remote host field              lbz000.ust.hk
Date/Time field                [16/Nov/2009:12:03:26 +0800]
HTTP request                   "GET /catalog/ HTTP/1.1"
Status code field              200
Transfer Volume (Bytes) field  20283
User agent field               "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
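The field breakdown above can be illustrated with a small parser. This is only a sketch: the regular expression and the field names (`host`, `request`, `agent`, and so on) are assumptions chosen for illustration, and it is not how AWStats or Logs Miner parses logs internally.

```python
import re
from datetime import datetime

# Assumed pattern for one Apache combined-format line; group names are illustrative.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one log line into a dict of fields, or return None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" transfer volume (as in the 304 responses) means no body was sent.
    rec["size"] = None if rec["size"] == "-" else int(rec["size"])
    return rec

# The sample line from the slide, with straight quotes.
sample = (
    'lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] '
    '"GET /catalog/ HTTP/1.1" 200 20283 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
    'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"'
)
record = parse_line(sample)
print(record["host"], record["request"], record["status"], record["size"])
```

The ninth field, shown as "-" in the sample, is the referrer; it is absent from the table above but is listed among the Common Log Format fields on the next slide.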

Page 8:

Various types of web log

Common Log Format – usually used by Apache web server logs and Apache Tomcat logs
e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Microsoft IIS Log Format
e.g. ILLiad, Class Registration Form

2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0

Include:
• Remote host field
• Date field
• Time field
• HTTP request field
• Status code field
• Transfer Volume (Bytes)
• Referrer field
• User agent field

Page 9:

Various types of web log

Microsoft Streaming Server
e.g. Streaming video

143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 - 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0

Fields only for streaming server:
• Video codec
• Audio codec
• Duration
• Client’s player

Page 10:

Web logfile analysis tools

Tools used to analyze web access logs:
• AccessWatch v1.33
• Analog 6.0
• Pwebstats
• RefStats 1.2
• INNOPAC Millennium Web Report – Search Statistics

Others:
• AWStats
• Sawmill Analytics
• Webalizer

Page 11:

Motivations
• Create a portal for storing and analyzing all the different web access logs
• Provide an interface for querying web access logs to generate dynamic statistical reports

Page 12:

AWStats as core
• Able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C) and streaming server log files
• Can also analyze non-standard log formats
• Works from the command line and from a browser as a CGI
• Build a web interface to query the data (Logs Miner)
• Pre-process the raw log data, running large-scale queries in a cron job

Page 13:

AWStats as core
• Unlimited log file size
• Reports the number of unique visitors and visits
• Provides plug-ins to extend its functionality
• Open source

Page 14:

Requirements for AWStats
• Web log files: the raw data must contain web log components such as client IP address, status code, HTTP request field……
• Any OS platform that supports Perl

Page 15:

System configuration of Logs Miner:
• PC-level workstations
• CentOS release 5.4
• Apache web server 2.0
• Perl v5.8.8
• AWStats 6.9

Page 16:

Logs Miner architecture

[Architecture diagram; recoverable labels: Raw logs (Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives …..); Preprocessing; AWStats; Pattern discovery, pattern analysis; Logs Miner UI; Reports: AWStats reports, Access statistics, Customized report]

Page 17:

Logs Miner user interface
• A portal for mining web access log data and retrieving information about the usage of multiple web applications.
• Built on top of AWStats, an open source log analyzer.
• Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.

Page 18:

Logs Miner user interface

URL: https://lbnx16.ust.hk/mining

• Includes 20+ applications
• Provides three types of report
• Filters by URL or host
• Generates yearly or monthly reports
• Query box that supports regular expressions

Page 19:

Logs Miner user interface

URL: https://lbnx16.ust.hk/mining

Tips for constructing query strings

Page 20:

Three types of reports
• AWStats reports
• Access statistics
  - filtered by URL / Host
• Customized reports

Page 21:

AWStats report

Page 22:

AWStats report

Page 23:

AWStats report

Reports the number of:
- unique visitors
- visits
These numbers exclude visits from robots.

Page 24:

AWStats report

Page 25:

AWStats report

Created by plugin: geoip

Page 26:

AWStats report

Work in progress

HKUST's iPhone application for receiving Library information and searching SmartCAT

Page 27:

Access statistics report

Query box that supports regular expressions

Page 28:

Access statistics report – filtered by URL

Page 29:

Access statistics report – filtered by Host

Page 30:

Example (1) – Usage of a database

Database title: Cambridge Journals Online
URL: http://library.ust.hk/cgi/db/cambridge.pl?subscribedTo
Server name: library.ust.hk (Library web server)
Parameters: /cgi/db/cambridge.pl?subscribedTo
Include pattern: cgi\/db\/cambridge\.pl.+
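To see what this include pattern selects, it can be tried against a few request paths. The sketch below uses Python's `re` module and assumes the pattern is applied as an unanchored search; how Logs Miner applies it internally is not stated in the slides. Only the first path comes from the slide; the other two are hypothetical.

```python
import re

# Include pattern copied from the slide.
include = re.compile(r"cgi\/db\/cambridge\.pl.+")

requests = [
    "/cgi/db/cambridge.pl?subscribedTo",  # path from the slide
    "/cgi/db/cambridge.pl",               # hypothetical: nothing after ".pl"
    "/cgi/db/other.pl?subscribedTo",      # hypothetical: a different script
]

# Keep only the requests the pattern matches somewhere in the path.
matched = [path for path in requests if include.search(path)]
print(matched)
```

Because the pattern ends in `.+`, a request for the bare script with no query string is not counted.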

Page 31:

Example (1) – Usage of a database

Page 32:

Example (1) – Usage of a database

Page 33:

Example (2) – Usage of a document in the HKUST Institutional Repository

Document: Long, Jiafu 2005, Autoinhibition of X11/Mint scaffold proteins revealed by the closed ……
URL: http://repository.ust.hk/dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Server name: repository.ust.hk (HKUST Institutional Repository)
Parameters: /dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Include pattern: \/1783\.1\/2496\/1\/nsmb958\.pdf
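The backslashes before the dots in this pattern matter: an unescaped `.` matches any character. A minimal sketch in Python's `re` syntax, where the look-alike path is hypothetical and exists only to show the difference:

```python
import re

# Pattern as given on the slide, with the dots escaped.
escaped = re.compile(r"\/1783\.1\/2496\/1\/nsmb958\.pdf")
# The same pattern with the dots left unescaped, for comparison.
unescaped = re.compile(r"/1783.1/2496/1/nsmb958.pdf")

real = "/dspace/bitstream/1783.1/2496/1/nsmb958.pdf"          # path from the slide's URL
lookalike = "/dspace/bitstream/1783x1/2496/1/nsmb958_pdf"      # hypothetical path

results = {
    "escaped_real": bool(escaped.search(real)),
    "escaped_lookalike": bool(escaped.search(lookalike)),
    "unescaped_lookalike": bool(unescaped.search(lookalike)),
}
print(results)
```

Only the escaped form rejects the look-alike path, so it is the safer choice when counting hits for one specific document.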

Page 34:

Example (2) – Usage of a document in the HKUST Institutional Repository

Page 35:

Example (2) – Usage of a document in the HKUST Institutional Repository

Page 36:

Example (3) – Access by particular group

Number of accesses to the Library web page from Library public workstations

Library web page
URL: http://library.ust.hk/
Server name: library.ust.hk (Library web server)

Client naming convention:
• OPAC workstation (lbb[nnn].ust.hk)
• IC workstation (lbc[nnn].ust.hk)
• Computer Lab (lba[nnn].ust.hk)

Include pattern: lb(a|b|c)[\d]+\.ust\.hk
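A host filter of this kind can be checked against sample hostnames. The sketch assumes the intended pattern is `lb(a|b|c)[\d]+\.ust\.hk` (matching the naming convention listed on the slide) and that it is applied as an unanchored search. All hostnames except `lbz000.ust.hk`, which appears in the sample logs, are hypothetical.

```python
import re

# Assumed intended include pattern for public workstations (lba, lbb, lbc).
public_ws = re.compile(r"lb(a|b|c)[\d]+\.ust\.hk")

hosts = [
    "lbb012.ust.hk",   # hypothetical OPAC workstation
    "lbc101.ust.hk",   # hypothetical IC workstation
    "lba007.ust.hk",   # hypothetical Computer Lab machine
    "lbz000.ust.hk",   # staff workstation from the sample logs
    "pc1.cse.ust.hk",  # hypothetical non-library campus host
]

# Keep only hosts that follow the public-workstation naming convention.
public = [h for h in hosts if public_ws.search(h)]
print(public)
```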

Page 37:

Example (3) – Access by particular group

Page 38:

Example (3) – Access by particular group

Page 39:

Example (4) – Exclude particular group

Number of accesses to Digital Archives from the HKUST campus, excluding HKUST Library staff

Digital university archives
URL: http://archives.ust.hk/
Server name: archives.ust.hk (Digital Archives)

Client naming convention:
• Library staff workstation (lbz[nnn].ust.hk)

Page 40:

Example (4) – Exclude particular group

Include pattern: ^.+\.ust\.hk$

Exclude pattern: lbz.+\.ust\.hk
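Combining an include and an exclude pattern amounts to "matches the first and does not match the second". The sketch below assumes the exclude pattern is meant to be `lbz.+\.ust\.hk` and that both patterns are applied as regular-expression searches on the client hostname; the function name `counted` and all hostnames except `lbz000.ust.hk` are hypothetical.

```python
import re

include = re.compile(r"^.+\.ust\.hk$")   # any host under ust.hk
exclude = re.compile(r"lbz.+\.ust\.hk")  # assumed intended pattern for staff workstations

def counted(host):
    """True if the host is on campus and is not a library staff workstation."""
    return bool(include.search(host)) and not exclude.search(host)

hosts = [
    "lbz000.ust.hk",        # staff workstation from the sample logs
    "lbb012.ust.hk",        # hypothetical OPAC workstation
    "pc1.cse.ust.hk",       # hypothetical campus host
    "crawler.example.com",  # hypothetical off-campus host
]

kept = [h for h in hosts if counted(h)]
print(kept)
```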

Page 41:

Example (4) – Exclude particular group

Page 42:

Example (5) – Number of virtual visits
• A virtual visit is defined as a user’s request on the library’s website in order to use one of the services provided by the library.
• One Key Performance Indicator: virtual visits per capita
• Includes the main web applications:
  - Library web server
  - INNOPAC
  - SmartCAT (next-generation catalog)
  - HKUST Institutional Repository
  - Digital Archives
  - HKUST ILLiad

Page 43:

Example (5) – Number of virtual visits

Reports the number of:
• Visits
  - a visit is counted when a unique IP accesses a page and then requests other pages with less than an hour between consecutive requests

Page 44:

Example (5) – Number of virtual visits

[Diagram: a sequence of requests, each made within an hour of the previous one, counts as a single visit]
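The visit definition above can be sketched as a simple counting rule: a request starts a new visit when its IP has not been seen before, or was last seen an hour or more earlier. This is an illustration of the definition as worded on the slides, not AWStats's actual implementation; the function name `count_visits` and the sample hits are hypothetical.

```python
from datetime import datetime, timedelta

def count_visits(hits, gap=timedelta(hours=1)):
    """Count visits in an iterable of (ip, timestamp) pairs.

    A hit opens a new visit if its IP is new or its previous hit
    was at least `gap` earlier.
    """
    last_seen = {}
    visits = 0
    for ip, ts in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(ip)
        if prev is None or ts - prev >= gap:
            visits += 1
        last_seen[ip] = ts
    return visits

t0 = datetime(2009, 11, 16, 12, 0)
hits = [
    ("10.0.0.1", t0),
    ("10.0.0.1", t0 + timedelta(minutes=30)),   # same visit
    ("10.0.0.1", t0 + timedelta(minutes=80)),   # 50 minutes after previous hit: same visit
    ("10.0.0.1", t0 + timedelta(minutes=200)),  # 120-minute gap: new visit
    ("10.0.0.2", t0 + timedelta(minutes=5)),    # different IP: its own visit
]
total_visits = count_visits(hits)
print(total_visits)  # 3
```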

Page 45:

Example (5) – Number of virtual visits

Applications         Unique visitors  Visits     Pages      Visits/visitor  Pages/visit
Library web server   413,324          1,018,811  6,078,913  2.46            5.96
IR                   94,596           133,458    632,256    1.41            4.73
Digital Archives     1,497            3,511      90,489     2.34            25.77
E-Journal            21,833           42,768     376,473    1.95            8.8
E-theses             25,848           34,956     116,664    1.35            3.33
HKUST ILLiad         8,039            18,548     138,109    2.3             7.44
SmartCat             4,202            9,398      288,787    2.23            30.72
Streaming Videos     778              1,233      4,073      1.58            3.30
Total                570,117          1,262,683  7,725,764  2.21            6.11

Virtual visits in 2009: 1,262,683 (2.21 visits/visitor, 6.11 pages/visit)
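The two ratio columns follow directly from the counts: visits per visitor is visits divided by unique visitors, and pages per visit is pages divided by visits. The sketch below recomputes the Total row. Truncating to two decimals is an inference from the published figures, not something the slides state, and the helper name `truncate2` is hypothetical.

```python
import math

def truncate2(x):
    """Cut a positive number to two decimals without rounding up (assumed convention)."""
    return math.floor(x * 100) / 100

# Totals taken from the table.
unique_visitors = 570_117
visits = 1_262_683
pages = 7_725_764

visits_per_visitor = truncate2(visits / unique_visitors)
pages_per_visit = truncate2(pages / visits)
print(visits_per_visitor, pages_per_visit)  # 2.21 6.11
```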

Page 46:

Customized reports
• Built-in customized reports provide a full picture of page visit figures for similar pages

From HKUST Library Web Server (http://library.ust.hk):
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides

Page 47:

Customized reports

Subset:
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides

Page 48:

Customized reports

HKUST library web sitemap

Page 49:

Customized reports

Page 50:

Customized reports

Add more customized report templates:
• E-Journal list
• Library Forms
• ……

Page 51:

Benefits of Logs Miner
• Central place for storing, processing and analyzing web log data
• Combines usage data from different server logs
• Statistical reports can be generated dynamically
• Flexible querying interface enables users to construct their own statistical reports in real time

Page 52:

Privacy issues
• From web access logs, an individual client’s actions can be tracked
• Protected by firewall, file permissions and user authentication
• The Logs Miner user interface can only be accessed from the library network

IMPORTANT: As data retrieved in your searches or reports may contain usage patterns of our users, please be careful not to re-distribute such information outside of the HKUST Library.

Page 53:

Future development
• Include more web applications, such as the HKUST PowerSearch server (federated search of the Library’s subscription resources)
• Create more customized report templates, such as an E-journal list

Page 54:

References

Han, J., & Kamber, M. 2006. Data mining: Concepts and techniques (2nd ed.). Amsterdam: Morgan Kaufmann.

Liu, H., & Keselj, V. 2007. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering, 61(2): 304.

Markov, Z., & Larose, D. T. 2007. Data mining the web: Uncovering patterns in web content, structure, and usage. Hoboken, N.J.: Wiley-Interscience.

Page 55:

Thank you!

Email address: [email protected]