Web Servers


Page 1: Web Servers

Web Servers & Log Analysis

• What can we learn from looking at Web server logs?
  - What server resources were requested
  - When the files were requested
  - Who requested them (where IP address = who)
  - How they requested them (browser types & OS)
• Some assumptions
  - A request for a resource means the user did receive it
  - A resource is viewable & understandable to each user
  - Users are identified within a loose set of parameters
• How does knowing request patterns affect or help IA?

Page 2: Web Servers

Types of Web Server Logs

• Proxy-based
  - Web access servers to control access or cache popular files
• Client-based
  - Local cache files
  - Browser History file(s)
• Network-based
  - Routers, firewalls & access points
• Server-based
  - Web servers to serve content

Page 3: Web Servers

Using Web Servers

• The Apache Software Foundation
• Microsoft Internet Information Server (Services)
• These applications “Serve”
  - Text - HTML, XML, plain text
  - Graphics - jpeg, gif, png
  - CGI, servlets, XMLHttpRequest & other logic
  - other MIME types such as movies & sound
• Most servers can log these files
  - Daily, weekly or monthly
  - Cannot always log CGI or related logic (specifically or “out of the box”)

Page 4: Web Servers

How Servers Work

• Hypertext Transfer Protocol - http

1. A file is requested from the browser

2. The request is transferred via the network

3. The server receives the request (& logs it)

4. The server provides the file (& logs it)

5. The browser displays the file

• Almost all Web servers work this way
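
The five steps above can be sketched in a few lines of Python. This is only a simulation under stated assumptions, not how Apache or IIS is implemented: the in-memory `DOCUMENT_ROOT`, the `handle_request` function and the client address `192.0.2.10` are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "document root"; a real server reads files from disk.
DOCUMENT_ROOT = {"/index.html": b"<html><body>Hello</body></html>"}


def handle_request(request_line, client_host, access_log):
    """Steps 3 and 4: receive a request, answer it, and log the outcome."""
    method, path, protocol = request_line.split()
    body = DOCUMENT_ROOT.get(path)
    status = 200 if body is not None else 404
    size = len(body) if body is not None else 0
    timestamp = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S %z")
    # One log line per request, loosely modelled on common access-log layouts.
    access_log.append(
        f'{client_host} - - [{timestamp}] "{request_line}" {status} {size}'
    )
    return status, body or b""


log = []
# Steps 1 and 2 (browser request, network transfer) are reduced to a call.
status, body = handle_request("GET /index.html HTTP/1.1", "192.0.2.10", log)
missing_status, _ = handle_request("GET /nope.html HTTP/1.1", "192.0.2.10", log)
print(status, missing_status, len(log))  # 200 404 2
```

Note that the failed request is logged too, which is why failed requests show up in log reports alongside successful ones.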

Page 5: Web Servers

Types of Server Logs

• Access Log
  - Logs information such as page served or time served
• Referer Log
  - Logs name of the server and page that links to current served page
  - Not always
  - Can be from any Web site
• Agent Log
  - Logs browser type and operating system
    • Mozilla
    • Windows

Page 6: Web Servers

Log File Format

• Extended Log File Format - W3C Working Draft WD-logfile-960323
• key advantage:
  - computer storage cost decreases while paper cost rises
• every server generates slightly different logs

Page 7: Web Servers

Extended Log File Formats

• WWW Consortium Standards
• Will automatically record much of what is programmatically done now.
  - faster
  - more accurate
  - standard baselines for comparison
  - graphics standards

Page 8: Web Servers

What is a log file?

• A delimited text file with information about what the server is doing
  - IP Address or Domain name
  - Date/Time
  - Method used & Page Requested
  - Protocol, Response Code & Bytes Returned
  - Referring Page (sometimes)
  - UserAgent & Operating System

p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] "GET /images/sanchez.jpg HTTP/1.1" 200 - "http://www.ischool.utexas.edu/research/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"
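
As a sketch of how the fields listed above can be pulled out of such a line, the Python below parses the sample entry with a regular expression. The pattern and the `parse_line` helper are assumptions written for this one layout; since every server generates slightly different logs, a real log may need a different pattern.

```python
import re

# Assumed layout: host, two unused fields, [time], "request", status, bytes,
# then an optional quoted referer and quoted user agent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the bytes column means no byte count was recorded.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


sample = ('p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] '
          '"GET /images/sanchez.jpg HTTP/1.1" 200 - '
          '"http://www.ischool.utexas.edu/research/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"')
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["referer"])
```

Lines that do not match return `None`, which is one way to count the "corrupt logfile lines" that analysis tools report.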

Page 9: Web Servers

In search of Reliable Data

• Not as Foolproof as Paper
  - You can see when someone is reading a page
  - You can know the page is turned
  - You can know the book is checked out
• No State Information
  - The same person or another person could be reading page 1 then page 2
  - You really can’t tell how many users you have
• Server Hits not perfectly Representative
  - Counters inaccurate
  - Caching & Robots can influence counts (+ & -)
• Floods/Bandwidth can Stop “intended” usage
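
One common way to reduce the robot effect is to drop requests that look automated before counting hits. The sketch below is a rough heuristic only: the marker list, the `looks_like_robot` helper and the sample requests are assumptions, and no such list catches every robot.

```python
# Assumed, incomplete list of substrings that often appear in robot user agents.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def looks_like_robot(user_agent, path):
    """Heuristic: robots tend to fetch /robots.txt or announce themselves."""
    agent = user_agent.lower()
    return path == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS)


requests = [
    ("Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)", "/index.html"),
    ("Googlebot/2.1 (+http://www.google.com/bot.html)", "/index.html"),
    ("ExampleFetcher/1.0", "/robots.txt"),
]
human_hits = [r for r in requests if not looks_like_robot(*r)]
print(len(requests), len(human_hits))  # 3 1
```

Caching works in the opposite direction: pages served from a proxy or browser cache never reach the server, so they cannot be recovered from the log at all.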

Page 10: Web Servers

What is a “hit”?

• Technically, a hit is simply any file requested from the server
  - That is logged
  - That represents (usually) part of a request to “see” a whole Web page
• Hits combine to represent a “page view”
• Page views combine to represent an “episode” or “session”
  - Episode is one activity or question a user performs or requests on a Web site
  - Session is a series of episodes that embodies all the interactions a user undertakes using a Web site (per time, based on averages around 30 min.)
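
The hit, page view and session distinction can be sketched as follows. This is a minimal illustration, assuming that a "page" is any path ending in `.html`, `.htm` or `/`, that a host name or IP address stands in for a user, and that 30 minutes of inactivity ends a session. `count_sessions` and the sample hits are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed cutoff, per the slide
PAGE_SUFFIXES = (".html", ".htm", "/")    # assumed definition of a "page"


def count_sessions(hits):
    """hits: iterable of (host, datetime, path). Returns (page_views, sessions)."""
    page_views = 0
    sessions = 0
    last_seen = {}  # host -> time of that host's most recent hit
    for host, when, path in sorted(hits, key=lambda h: h[1]):
        if path.endswith(PAGE_SUFFIXES):
            page_views += 1
        previous = last_seen.get(host)
        # A new session starts on a host's first hit or after a long gap.
        if previous is None or when - previous > SESSION_TIMEOUT:
            sessions += 1
        last_seen[host] = when
    return page_views, sessions


start = datetime(2004, 9, 1, 8, 0)
hits = [
    ("192.0.2.10", start, "/index.html"),
    ("192.0.2.10", start + timedelta(seconds=1), "/images/logo.gif"),
    ("192.0.2.10", start + timedelta(minutes=5), "/research/"),
    ("192.0.2.10", start + timedelta(minutes=50), "/index.html"),
    ("192.0.2.20", start + timedelta(minutes=2), "/index.html"),
]
page_views, sessions = count_sessions(hits)
print(len(hits), page_views, sessions)  # 5 4 3
```

Five hits become four page views (the image is a hit but not a page) and three sessions (the first host returns after a 45-minute gap).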

Page 11: Web Servers

Making Servers More Reliable

• Keep system setups simple
  - unique file and directory names
  - clear, consistent structure
• Configure CMS for logging/serving
• Use an FTP server for file transfer
  - frees up logs and server!
• Judicious use of links
• Wise MIME types
  - some hard/impossible to log

Page 12: Web Servers

Clever Web Server Setup

• Redirect CGI to find referrer
• Use a database
  - store web content
  - record usage data
• create state information with programming
  - NSAPI
  - ActiveX
• Have contact information
• Have purpose statements

Page 13: Web Servers

Managing Log Files

• Backup
• Store Results or Logs?
• Beginning New Logs
• Posting Results

Page 14: Web Servers

Log Analysis Tools

• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl Scripts
• Data Mining & Business Intelligence tools

Page 15: Web Servers

WebTrends

• A whole industry of analytics
• Most popular commercial application

Page 16: Web Servers

Log Analysis Cumulative Sample

• Program started at Tue-03-Dec-2005 01:20 local time.
• Analysed requests from Thu-28-Jul-2004 20:31 to Mon-02-Dec-1996 23:59 (858.1 days).
• Total successful requests: 4 282 156 (88 952)
• Average successful requests per day: 4 990 (12 707)
• Total successful requests for pages: 1 058 526 (17 492)
• Total failed requests: 88 633 (1 649)
• Total redirected requests: 14 457 (197)
• Number of distinct files requested: 9 638 (2 268)
• Number of distinct hosts served: 311 878 (11 284)
• Number of new hosts served in last 7 days: 7 020
• Corrupt logfile lines: 262
• Unwanted logfile entries: 976
• Total data transferred: 23 953 Mbytes (510 619 kbytes)
• Average data transferred per day: 28 582 kbytes (72 946 kbytes)
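
A report like this boils down to tallying log entries by response code. The sketch below shows one way to do that. The grouping is an assumption (2xx and 304 as successful, other 3xx as redirected, 4xx and 5xx as failed) and may differ from what a given analysis tool does; `summarize` and the sample entries are hypothetical.

```python
from collections import Counter


def summarize(entries):
    """entries: iterable of (host, path, status, bytes_sent) tuples."""
    totals = Counter()
    hosts, files = set(), set()
    for host, path, status, bytes_sent in entries:
        if 200 <= status < 300 or status == 304:
            totals["successful"] += 1
            totals["bytes"] += bytes_sent
            hosts.add(host)   # distinct hosts served
            files.add(path)   # distinct files requested
        elif 300 <= status < 400:
            totals["redirected"] += 1
        else:
            totals["failed"] += 1
    totals["distinct_hosts"] = len(hosts)
    totals["distinct_files"] = len(files)
    return totals


entries = [
    ("192.0.2.10", "/index.html", 200, 5120),
    ("192.0.2.10", "/images/logo.gif", 304, 0),
    ("192.0.2.20", "/old-page.html", 301, 0),
    ("192.0.2.20", "/index.html", 200, 5120),
    ("192.0.2.30", "/missing.html", 404, 0),
]
report = summarize(entries)
print(dict(report))

# The per-day average in the sample is total requests divided by days covered.
average_per_day = round(4_282_156 / 858.1)
print(average_per_day)  # 4990
```

The last two lines reproduce the "Average successful requests per day" figure from the totals given in the sample.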

Page 17: Web Servers

How about the iSchool Web site?

• Our server files are collected constantly
  - Daily
  - Weekly
  - Monthly
  - Even yearly
• What does a quick look tell us?
  - How well is the server working?
    • Uptime, server errors, logging errors
  - How popular is our site?
    • Number of hits, popular files
  - Who is visiting the site?
    • Countries, types of companies
  - What searches led people here?
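
The last question can often be answered from the referring page, because many search engines put the query in the URL. The sketch below assumes the query sits in a parameter named `q`, `p` or `query`; that list, the `search_terms` helper and the example URLs are illustrative assumptions, and engines that hide the query will yield nothing.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names; each search engine chooses its own.
QUERY_PARAMETERS = ("q", "p", "query")


def search_terms(referer):
    """Return the list of search terms found in a referring URL, if any."""
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMETERS:
        if name in params:
            return params[name][0].split()
    return []


print(search_terms("http://www.google.com/search?q=information+architecture&hl=en"))
# ['information', 'architecture']
print(search_terms("http://www.ischool.utexas.edu/research/"))
# []
```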

Page 18: Web Servers

UT & its Web server logs

• UT Web log reports
(Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00).

Successful requests: 39,826,634 (39,596,364)

Average successful requests per day: 5,690,083 (5,656,623)

Successful requests for pages: 4,189,081 (4,154,717)

Average successful requests for pages per day: 598,499 (593,530)

Failed requests: 442,129 (439,467)

Redirected requests: 1,101,849 (1,093,606)

Distinct files requested: 479,022 (473,341)

Corrupt logfile lines: 427

Data transferred: 278.504 Gbytes (276.650 Gbytes)

Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)

Page 19: Web Servers

Neat Analysis Tricks

• use a search engine to find references
  - “link:www.ischool.utexas.edu/~donturn”
• key to using unique names
  - use many engines
    • update times different
    • blocking mechanisms are different
• use Web searches (or Yahoo, Bloglines…)
  - look for references
  - look for IP addresses of users

Page 20: Web Servers

Neat Tricks, cont.

• Walking up the Links
  - follow URLs upward
• Reverse Sort
  - look for relations
• Use your own robot to index
  - Test

Page 21: Web Servers

Web Surveys, an alternative

• Surveys actually ask users what they did, what they sought & if it helped
• GVU, Nielsen and GNN
  - Qualitative questions
    • phone
    • web forms
  - Self-selected sample problems
    • random selection
    • oversample

Page 22: Web Servers

Analysis of a Very Large Search Log

• What kinds of patterns can we find?
• Request = query and results page
• 280 GB - Six Weeks of Web Queries
  - Almost 1 Billion Search Requests, 850M valid, 575M queries
  - 285 Million User Sessions (cookie issues)
  - Large volume, less trendy
  - Why are unique queries important?
• Web Users:
  - Use Short Queries in short sessions - 63.7% one request
  - Mostly Look at the First Ten Results only
  - Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)

Page 23: Web Servers

Analysis of a Very Large Search Log

• 2.35 Average Terms Per Query
  - 0 = 20.6% (?)
  - 1 = 25.8%
  - 2 = 26.0%
  - 0, 1 & 2 combined = 72.4%
• Operators Per Query
  - 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans
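
Figures such as average terms per query and the share of 0, 1 and 2 term queries can be computed from a query log with a simple count. The sketch below assumes terms are separated by whitespace; `term_distribution` and the five sample queries are hypothetical, and the study itself may have tokenized differently.

```python
from collections import Counter


def term_distribution(queries):
    """Return (mean terms per query, {term count: percent of queries})."""
    counts = Counter(len(q.split()) for q in queries)
    total = len(queries)
    mean_terms = sum(n * c for n, c in counts.items()) / total
    shares = {n: 100.0 * c / total for n, c in sorted(counts.items())}
    return mean_terms, shares


queries = ["", "austin", "ut ischool", "web server logs", "log analysis"]
mean_terms, shares = term_distribution(queries)
print(mean_terms, shares)

# The slide's 72.4% is the sum of the 0, 1 and 2 term shares.
combined = round(20.6 + 25.8 + 26.0, 1)
print(combined)  # 72.4
```

An empty query counts as zero terms, which is how a zero-term share can appear in such a table.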

Page 24: Web Servers

Real Life Information Retrieval

• 51K Queries from Excite (1997)
• Search Terms per Query = 2.21 (mean)
• Number of Terms
  - 1 = 31%   2 = 31%   3 = 18%   (80% Combined)
• Logic & Modifiers (by User)
  - Infrequent
  - AND, “+”, “-”
• Logic & Modifiers (by Query)
  - 6% of Users
  - Less Than 10% of Queries
  - Lots of Mistakes
• Uniqueness of Queries
  - 35% successive
  - 22% modified
  - 43% identical

Page 25: Web Servers

Real Life Information Retrieval

• Queries per user 2.8
• Sessions
  - Flawed Analysis (User ID)
  - Some Revisits to Query (Result Page Revisits)
• Page Views
  - Accurate, but not by User
• Use of Relevance Feedback (more like this)
  - Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
  - Typos
  - Misspellings
  - Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

Page 26: Web Servers

Downie & Web Usage

• Server logs are like library usage
• User-based analyses
  - who
  - where
  - what
• File-based analyses
  - amount
• Request analyses
  - conform (loosely) to Zipf’s Law
• Byte-based analyses
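
Zipf's Law says, roughly, that the frequency of an item is inversely proportional to its popularity rank, so rank times frequency stays about constant. A quick way to eyeball this for requests is to rank files by request count, as sketched below. `rank_frequency` and the request counts are made up for illustration and chosen to fit the law exactly; real logs conform only loosely.

```python
from collections import Counter


def rank_frequency(paths):
    """Return (rank, path, requests, rank * requests) rows, most popular first."""
    ranked = Counter(paths).most_common()
    return [(rank, path, n, rank * n)
            for rank, (path, n) in enumerate(ranked, start=1)]


# Hypothetical request counts: 12, 6, 4 and 3 requests for four files.
paths = (["/index.html"] * 12 + ["/research/"] * 6
         + ["/people/"] * 4 + ["/courses/"] * 3)
rows = rank_frequency(paths)
for row in rows:
    print(row)
# Under Zipf's Law the last column stays roughly constant (here exactly 12).
```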

Page 27: Web Servers

Web use analysis & IA?

• Another tool to begin to understand how people use your Web-provided resources
• With a small amount of setup, you can learn a large amount
• Server use can be integrated into site usage for users
  - Lists of popular pages & more interlinking pages
  - Adding search terms that found the page to related pages
  - Adjust metadata to reflect searches that find pages
  - Add pages to the site index or site map
• First-cut usability information
  - Pages 1 & 2 were accessed, but not 3 - Why?
  - Navigation usage, link ordering and design understanding
  - Knowing what browsers & OS helps tailor design and media types

Page 28: Web Servers

BREAK!

• No Presentation this week
  - Next week: Asset management, content management & version control
• Break up media development work
• Examine current pages, style sheets & designs
• Set up next set of pair & individual deliverables

Page 29: Web Servers

Media Development work

• We need to find & create graphics for the new site
• Content about:
  - Austin
  - UT
  - iSchool
  - People at the iSchool
  - Students at work in the iSchool (classes, labs)
• Screen grab from videos
• Search the Web for copyright free images
• Take our own pictures

Page 30: Web Servers

Current Pages & Designs

• First version of main iSchool page template and CSS complete
• Secondary page template & CSS complete
  - Some secondary pages already built
• Index page template set
• Site map page initially set
  - Big Map
  - Main pages map

Page 31: Web Servers

Next steps

• In class
  - Test & evaluate current CSS and templates
  - Improvise secondary home page based on initial design
  - Examine new Alumni section
  - Examine new Course Listing page
• For homework
  - Complete secondary page migration to new design
  - Rotate design work
    • Alumni
    • Site Map
    • Home page design ideas
  - Picture/Media creation work