Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol 7 of 110 4/17/2007...

55
Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html 1 of 110 4/17/2007 4:12 PM Table of Contents | All Slides | Link List | CSCI E-12 Hypertext Transfer Protocol April 17, 2007 Harvard University Division of Continuing Education Extension School Course Web Site: http://cscie12.dce.harvard.edu/ Copyright 1998-2007 David P. Heitmeyer Instructor email: [email protected] Course staff email: [email protected] Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html 2 of 110 4/17/2007 4:12 PM HyperText Transfer Protocol GET / HTTP is a stateless protocol. Cookies provide a mechanism to "maintain state". Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/ Maintaining State with Cookies HTTP State Management Mechanism http://www.ics.uci.edu/pub/ietf/http/rfc2109.txt Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/ Persistent Client State HTTP Cookies http://www.netscape.com/newsref/std/cookie_spec.html view plain print ? minerva% telnet www.npr.org 80 1. Trying 216.35.221.77... 2. Connected to www.npr.org. 3. Escape character is '^]'. 4. GET / HTTP/1.1 5. Host: www.npr.org 6. 7. HTTP/1.1 200 OK 8. Date: Tue, 10 Apr 2007 20:07:33 GMT 9. Server: Apache 10. Set-Cookie: Apache=140.247.197.240.289451144786054516; path=/ 11. Cache-Control: max-age=0 12. Expires: Tue, 10 Apr 2007 20:07:33 GMT 13. Transfer-Encoding : chunked 14. Content-Type: text/html 15. 16. 76c 17. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 18. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 19. <html xmlns="http://www.w3.org/1999/xhtml"> 20. <head> 21. <title>NPR - National Public Radio - News, Arts, World, US.</title> 22. <!-- content removed --> 23. </html> 24.

Transcript of Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol 7 of 110 4/17/2007...

Page 1: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

1 of 110 4/17/2007 4:12 PM

Table of Contents | All Slides | Link List | CSCI E-12

Hypertext Transfer ProtocolApril 17, 2007

Harvard University Division of Continuing Education

Extension School

Course Web Site: http://cscie12.dce.harvard.edu/

Copyright 1998-2007 David P. Heitmeyer

Instructor email: [email protected] Course staff email: [email protected]

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

2 of 110 4/17/2007 4:12 PM

HyperText Transfer Protocol

GET /

HTTP is a stateless protocol. Cookies provide a mechanism to "maintain state".

Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/

Maintaining State with Cookies

HTTP State Management Mechanism http://www.ics.uci.edu/pub/ietf/http/rfc2109.txtCookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/Persistent Client State HTTP Cookies http://www.netscape.com/newsref/std/cookie_spec.html

view plain print ?

minerva% telnet www.npr.org 80 1.Trying 216.35.221.77... 2.Connected to www.npr.org. 3.Escape character is '^]'. 4.GET / HTTP/1.1 5.Host: www.npr.org 6. 7.HTTP/1.1 200 OK 8.Date: Tue, 10 Apr 2007 20:07:33 GMT 9.Server: Apache 10.Set-Cookie: Apache=140.247.197.240.289451144786054516; path=/ 11.Cache-Control: max-age=0 12.Expires: Tue, 10 Apr 2007 20:07:33 GMT 13.Transfer-Encoding: chunked 14.Content-Type: text/html 15. 16.76c 17.<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 18. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 19.<html xmlns="http://www.w3.org/1999/xhtml"> 20.<head> 21.<title>NPR - National Public Radio - News, Arts, World, US.</title> 22.<!-- content removed --> 23.</html> 24.

Page 2: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

3 of 110 4/17/2007 4:12 PM

Cookie Example

Server returns cookie to HTTP client ("Set-Cookie" response header)HTTP client returns cookie to server ("Cookie" request header)

ESPN Cookies

Set-Cookie:SWID=C8F9AF31-F170-42BF-9471-50A95DA24C17;path=/;expires=Tue, 10-Apr-2027 03:20:59 GMT;domain=.go.com;

Set-Cookie:DE2=dXNhO21hO2NhbWJyaWRnZTt0MTs1OzQ7NDs1MDY7MDQyLjM4MDstMDcxLjEzNpath=/; expires=Tue, 17 Apr 2007 03:00:00 GMT; domain=.go.com

view plain print ?

minerva% lwp-request -USed http://www.espn.com/ 1.GET http://espn.go.com/ 2.User-Agent: lwp-request/2.07 3. 4.GET http://www.espn.com/ --> 301 Moved Permanently 5.GET http://espn.go.com/ --> 200 OK 6.Cache-Control: no-cache 7.Date: Tue, 10 Apr 2007 03:20:58 GMT 8.Pragma: no-cache 9.From: SPORTBARWEB08 10.Accept-Ranges: bytes 11.ETag: "802e571f7bc71:1762" 12.Server: Microsoft-IIS/5.0 13.Vary: Accept-Encoding 14.Content-Length: 122217 15.Content-Type: text/html; charset=iso-8859-1 16.Content-Type: text/html; charset=windows-1252 17.Last-Modified: Tue, 10 Apr 2007 03:19:21 GMT 18.Cache-Expires: Tue, 10 Apr 2007 03:24:22 GMT 19.Client-Date: Tue, 10 Apr 2007 03:21:02 GMT 20.Client-Peer: 198.105.193.43:80 21.Client-Response-Num: 1 22.P3P: CP="CAO DSP COR CURa ADMa DEVa TAIa PSAa PSDa IVAi IVDi CONi OUR SAMo OTRo BUS PHY O23.Refresh: 3600 24.Set-Cookie: SWID=C8F9AF31-F170-42BF-9471-50A95DA24C17; path=/; expires=Tue, 10-Apr-2027 025.Set-Cookie: DE2=dXNhO21hO2NhbWJyaWRnZTt0MTs1OzQ7NDs1MDY7MDQyLjM4MDstMDcxLjEzNTs4NDA7MjI7O26.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

4 of 110 4/17/2007 4:12 PM

Cookie Properties/Attributes

nameexpiresdomainpathsecure

HTTP State Management Mechanism, RFC 2965

RFC 2109, February 1997RFC 2965, October 2000

namecommentcomment URLdiscarddomainmax-agepathportsecureversion

Additional Cookie Notes

Client: 300 total cookies4 kb per cookie20 cookies per server or domain

Page 3: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

5 of 110 4/17/2007 4:12 PM

Cookie Example: Server Sets a Cookie

Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi

Set-Cookie HTTP Response Header:

Set-Cookie: YourName=David%20P.%20Heitmeyer; domain=cscie12.dce.harvard.edu; path=/http/;expires=Fri, 13-May-2005 18:05:04 GMT

view plain print ?

minerva% telnet 140.247.197.240 80 1. Trying 140.247.197.240... 2.Connected to 140.247.197.240. 3.Escape character is '^]'. 4.GET /http/cookie.cgi?name=David%20P.%20Heitmeyer HTTP/1.1 5.Host: cscie12.dce.harvard.edu 6.Connection: close 7. 8.HTTP/1.1 200 OK 9.Connection: close 10.Date: Wed, 13 Apr 2005 18:05:04 GMT 11.Server: Apache/2.0.49 (Fedora) 12.Content-Type: text/html; charset=ISO-8859-1 13.Client-Date: Wed, 13 Apr 2005 18:05:04 GMT 14.Client-Peer: 140.247.197.240:80 15.Client-Response-Num: 1 16.Client-Transfer-Encoding: chunked 17.Set-Cookie: YourName=David%20P.%20Heitmeyer; \ 18. domain=cscie12.dce.harvard.edu; \ 19. path=/http/; \ 20. expires=Fri, 13-May-2005 18:05:04 GMT 21. 22.<?xml version="1.0" encoding="iso-8859-1"?> 23.<!DOCTYPE html 24. PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 25. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 26.<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"><head><title>For27.</head><body> 28.<h1>Hello, David P. Heitmeyer</h1> 29.</body></html> 30.Connection closed by foreign host. 31.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

6 of 110 4/17/2007 4:12 PM

Cookie Example: Returning a Cookie

Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi

view plain print ?

minerva% telnet 140.247.197.240 80 1.Trying 140.247.197.240... 2.Connected to 140.247.197.240. 3.Escape character is '^]'. 4.GET /http/cookie.cgi HTTP/1.1 5.Cookie: YourName=David%20P.%20Heitmeyer 6.Host: cscie12.dce.harvard.edu 7.Connection: close 8. 9.HTTP/1.1 200 OK 10.Connection: close 11.Date: Wed, 13 Apr 2005 18:11:40 GMT 12.Server: Apache/2.0.49 (Fedora) 13.Content-Type: text/html; charset=ISO-8859-1 14.Client-Date: Wed, 13 Apr 2005 18:11:40 GMT 15.Client-Peer: 140.247.197.240:80 16.Client-Response-Num: 1 17.Client-Transfer-Encoding: chunked 18. 19.<?xml version="1.0" encoding="iso-8859-1"?> 20.<!DOCTYPE html 21. PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 22. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 23.<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"> 24.<head><title>Form</title></head><body> 25.<h1>Hello, David P. Heitmeyer</h1> 26.

Page 4: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

7 of 110 4/17/2007 4:12 PM

Your Cookies

Firefox Webdeveloper Toolbar has a "Cookies" section. screenshot

Mozilla Cookie Manager

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

8 of 110 4/17/2007 4:12 PM

Cookies and Session IDs

A UserID or SessionID (a long character/number string that is uniquely assigned) is often stored incookie. The SessionID is used as the key or identifier when storing information about the user orsession.

For example, a user logs in to a site. If the username and password match, the server sets a cookie("Set-Cookie") in the browser that contains a session id; the server also makes an entry in websitedatabase that maps the session id to the username. When the cookie is returned, the session id isread and the username is looked up in the database.

Page 5: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

9 of 110 4/17/2007 4:12 PM

Google Cookie Example

Using Google's "Preference" page and setting:

Search Language preference to: English, French, GermanSafeSearch Filtering: Strict FilteringNumber of Results: 50

The Cookie name is: PREF The Value is:ID=bb504f37cd318aa9:FF=1:LR=lang_en|lang_fr|lang_de:LD=en:NR=50:TM=1113416195:LM=111

This cookie contains a session id as well as the values of certain preferences in a colon-separateddata structure.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

10 of 110 4/17/2007 4:12 PM

Cookies and Ad Tracking

Page 6: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

11 of 110 4/17/2007 4:12 PM

Method: POST

Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi

view plain print ?

minerva% telnet 140.247.197.240 80 1.Trying 140.247.197.240... 2.Connected to 140.247.197.240. 3.Escape character is '^]'. 4.POST /http/cookie.cgi HTTP/1.1 5.Host: cscie12.dce.harvard.edu 6.Content-Length: 10 7.Content-Type: application/x-www-form-urlencoded 8. 9.name=David 10.HTTP/1.1 200 OK 11.Date: Wed, 13 Apr 2005 19:31:11 GMT 12.Server: Apache/2.0.49 (Fedora) 13.Set-Cookie: YourName=David; domain=cscie12.dce.harvard.edu; path=/http/; expires=Fri, 13-14.Content-Length: 319 15.Connection: close 16.Content-Type: text/html; charset=ISO-8859-1 17. 18.<?xml version="1.0" encoding="iso-8859-1"?> 19.<!DOCTYPE html 20. PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 21. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 22.<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"> 23.<head><title>Form</title> 24.</head><body> 25.<h1>Hello, David</h1> 26.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

12 of 110 4/17/2007 4:12 PM

WebDAV: an extension of HTTP

Web-based Distributed Authoring and Versioning

WebDAV Resources http://www.webdav.org/From the WebDAV Resources :

WebDAV stands for "Web-based Distributed Authoring and Versioning". It is a setof extensions to the HTTP protocol which allows users to collaboratively edit andmanage files on remote web servers.

Page 7: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

13 of 110 4/17/2007 4:12 PM

HTTP Resources

W3C HTTP http://www.w3.org/Protocols/HTTP Pocket Reference http://www.oreilly.com/catalog/httppr/ by Clinton Wong (O'Reilly).Illustrated Guide to HTTP http://www.manning.com/hethmon/ by Paul Hethmon (Manning Publications; ISBN 0138582262) see sample chapters and resources online.

Other Readings:

W3C Recommendations Reduce 'World Wide Wait' http://www.w3.org/Protocols/NL-PerfNote.htmlApache Week: HTTP version 1.1 http://www.apacheweek.com/features/http11WebTechniques: HTTP 1.1: What's in it for Me? http://www.webtechniques.com/archives/1997/08/webm/Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

14 of 110 4/17/2007 4:12 PM

Apache HTTP Server

Apache Software FoundationApache HTTP Server Project

Apache 1.3Apache 2.x

Apache ModulesPHPPerlPythonmany, many others

Page 8: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

15 of 110 4/17/2007 4:12 PM

Apache: The Most Widely Used Web Server on the PublicInternet

Netcraft Web Server Survey

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

16 of 110 4/17/2007 4:12 PM

In this Unit: Configuring Apache with .htaccess files

Custom Error DocumentsRedirectRewriteDirectory IndexSetting HTTP Headers

ExpiresHeaders

Access ControlRequiring a Secure Connection (SSL)

Page 9: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

17 of 110 4/17/2007 4:12 PM

Apache Configuration Overview

Server Configuration (httpd.conf) Unless you are the server administrator, you generally will not have access to this account. Onthe DCE systems, you do not have read or write access to this file. Server configuration isread at server start or restart.Per Directory (.htaccess) Certain configuration directives for Apache can be placed within per-directory .htaccess files..htaccess file is read on a per request basis.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

18 of 110 4/17/2007 4:12 PM

.htaccess File Example

document root: /home/e12/htdocs filename: .htaccess location: /home/e12/htdocs/apache/.htaccess contents:

filename: status404.html location: /home/e12/htdocs/apache/status404.html

http://cscie12.dce.harvard.edu/apache/ZZZ.html

ErrorDocument 404 status404.html 1.

Page 10: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

19 of 110 4/17/2007 4:12 PM

Scope of .htaccess files

Directives within .htaccess files apply to the directory that contains the .htaccess file and all its descendants.

Directives within the file, /home/e12/htdocs/.htaccess would apply to all files within and "under" the public_html directory for the user cscie12.

Directives within the file, /home/e12/htdocs/assignments/.htaccess would apply to all files within and "under" the public_html/assignments directory for the usercscie12.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

20 of 110 4/17/2007 4:12 PM

Problems You Will Have with .htaccess files

Internal Server ErrorCan't "see" the fileIncorrect Permissions

Page 11: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

21 of 110 4/17/2007 4:12 PM

Problems You will encounter when using .htaccess files

500 Internal Server Error If you see begin seeing 500 Internal Server Error responses from the server after you havecreated or edited an .htaccessfile, the most likely cause of the problem is incorrect permissions and/or an error in the directivesyntax.

Permissions on the .htaccessfile are not set correctly. Just like HTML and image files, the server must be able to read the.htaccess file. The simplest way to allow that is to make your .htaccess file readable by "other".

Syntax Error. An error in the syntax of a directive the .htaccess file will result in a 500 Internal Server Error. In addition, correct usage of a directive that is not allowed in the.htaccess file will result in a 500status code. Whether or not a directive is allowed depends upon the server configuration file(httpd.conf; AllowOverride) and the directive itself.

view plain print ?

minerva% pwd 1./home/courses/j/h/jharvard/public_html 2.minerva% ls -l .htaccess 3.-rw------- 1 jharvard founder 349 Nov 27 00:03 .htaccess 4.minerva% chmod o+r .htaccess 5.minerva% ls -l ~/public_html/.htaccess 6.-rw----r-- 1 jharvard founder 349 Nov 27 00:03 .htaccess 7.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

22 of 110 4/17/2007 4:12 PM

Problems You will encounter when using .htaccess files

You can't "see" your .htaccess file.

HTTP The web server is typically configured to deny requests for .htaccess files. For example, the file corresponding to the URL, http://cscie12.dce.harvard.edu/.htaccess exists and is readable by the Web server, but if we try to follow the link, we get a 403 Forbidden response.UNIX The ls command will not list files or directories that begin with a '.' (dot). In order to see the .htaccess file when you do a directory listing, use the -a (all) option:SFTP Sometimes your SFTP program will hide the "dot" files unless explicitly told to show them.

Page 12: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

23 of 110 4/17/2007 4:12 PM

Apache Configuration Sections

Configuration directives can be limited by using "sections", such as

DirectoryLocationFilesVirtualHostDirectoryMatchLocationMatchFilesMatch

Within .htaccess

Note that only Files and FilesMatch can be used within .htaccess files.

Examples:

Examples:

<Files .htaccess> 1. Order allow,deny 2. Deny from all 3.</Files> 4.

# deny access to any tilde backup files 1.<Files *~> 2. Order allow,deny 3. Deny from all 4.</Files> 5.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

24 of 110 4/17/2007 4:12 PM

Configuring Apache with .htaccess files

Custom Error DocumentsRedirectRewriteDirectory IndexSetting HTTP Headers

ExpiresHeaders

Access Control

Page 13: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

25 of 110 4/17/2007 4:12 PM

Custom Error Documents

.htaccess

ErrorDocument directive http://www.apache.org/docs/2.0/mod/core.html#errordocumentCustom Error Responses http://www.apache.org/docs/2.0/custom-error.html

ErrorDocument 401 /apache/status401.html 1.ErrorDocument 403 /apache/status403.html 2.ErrorDocument 404 /apache/status404.html 3.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

26 of 110 4/17/2007 4:12 PM

HTTP Redirect

Fight Linkrot!

Do your part to Fight Linkrot! (Jakob Nielson's Alertbox, http://www.useit.com/alertbox/980614.html )

RedirectRewriteMeta http-equiv refresh

Redirecting Requests

HTTP Status Codes: 301 Moved permanently 302 Moved temporarily

Redirecting client requests can be very useful:

URL moves to a new location Do your part to Fight Linkrot! (Jakob Nielson's Alertbox,http://www.useit.com/alertbox/980614.html )

resource removedsite structure is reorganized

Provide "friendly" or additional URLs to access a resource

Page 14: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

27 of 110 4/17/2007 4:12 PM

Redirect

Redirect directive http://www.apache.org/docs/2.0/mod/mod_alias.html#redirect

.htaccess

Try it:

http://cscie12.dce.harvard.edu/apache/dce.htmlhttp://cscie12.dce.harvard.edu/apache/church_st

Redirect 302 /apache/dce.html http://www.dce.harvard.edu/ 1.Redirect 301 /apache/church_st http://map.harvard.edu/level3.cfm?mapname=camb_allston2.

view plain print ?

minerva% telnet cscie12.dce.harvard.edu 80 1.Trying 140.247.197.240... 2.Connected to cscie12.dce.harvard.edu. 3.Escape character is '^]'. 4.GET /apache/dce.html HTTP/1.1 5.Host: cscie12.dce.harvard.edu 6.Connection: close 7. 8.HTTP/1.1 302 Found 9.Date: Wed, 13 Apr 2005 20:03:10 GMT 10.Server: Apache/2.0.49 (Fedora) 11.Location: http://www.dce.harvard.edu/ 12.Content-Length: 302 13.Connection: close 14.Content-Type: text/html; charset=iso-8859-1 15.

view plain print ?

minerva% lwp-request -USed http://cscie12.dce.harvard.edu/apache/dce.html 1.GET http://www.dce.harvard.edu/ 2.User-Agent: lwp-request/2.06 3. 4.GET http://cscie12.dce.harvard.edu/apache/dce.html --> 302 Found 5.GET http://www.dce.harvard.edu/ --> 200 OK 6.Connection: Close 7.Date: Wed, 13 Apr 2005 20:01:26 GMT 8.Accept-Ranges: bytes 9.Server: Orion/2.0.6 10.Content-Length: 3619 11.Content-Type: text/html 12.Content-Type: text/html; charset=iso-8859-1 13.Last-Modified: Wed, 27 Oct 2004 18:45:00 GMT 14.Client-Date: Wed, 13 Apr 2005 20:01:49 GMT 15.Client-Peer: 140.247.198.100:80 16.Client-Response-Num: 1 17.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

28 of 110 4/17/2007 4:12 PM

Rewrite

mod_rewrite http://www.apache.org/docs/2.0/mod/mod_rewrite.htmlA Users Guide to URL Rewriting with the Apache Webserver http://www.engelschall.com/pw/apache/rewriteguide/

Rewrite uses regular expressions to match on a pattern and rewrite to a new location. For example,the Derek Bok Center site used to be a "user" account and had the "~bok_cen/" base. When movedto its own virtual host, all of the "~bok_cen" requests could be rewritten to the new site with a singlerewrite rule.

Old URL: http://www.fas.harvard.edu/~bok_cen/tf/resources.html(.*) matches on: /tf/resources.htmlNew URL: http://bokcenter.fas.harvard.edu/tf/resources.html

# rewrite for Bok Center 1.RewriteRule ^/~?bok_cen(.*) http://bokcenter.fas.harvard.edu$1 [R=301] 2.

Page 15: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

29 of 110 4/17/2007 4:12 PM

Examples of Rewrite Uses

Provide a standard mechanism to access course Web sites within HarvardCollege.

http://www.courses.harvard.edu/<4 digit catalog number>

For example, Chemistry 7 has a catalog number of 5118, so the URL for the course Web site can be reached through:

http://www.courses.harvard.edu/5118

The "real" location of the site is:

http://my.harvard.edu/icb/icb.do?course=fas-chem7

HASCS Site Restructure

Dozens of rewrite directives were put in place when the HASCS site was restructured so that linksto documents within the previous site would get redirected to the appropriate page in the new site.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

30 of 110 4/17/2007 4:12 PM

Rewrite: Can be conditional

Rewrite rules can conditional (match against almost any environment variable).

Here we match on host and user-agent to deliver an error page explaining that their browser is not supported.

RewriteEngine On 1.RewriteCond %{HTTP_USER_AGENT} ^Lynx 2.RewriteRule ^(index.html)?$ text/ [R=302] 3.

# rewrite rule to catch IE Mac browsers since 1.# the PIN Service does not support them as of 10/16/2006 2.RewriteCond %{HTTP_HOST} "^login.icommons.harvard.edu$" 3.RewriteCond %{HTTP_USER_AGENT} "MSIE 5.*\; Mac_PowerPC" 4.RewriteRule ^/pinproxy.* /pin_error_ie_mac.html [R,L] 5.

Page 16: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

31 of 110 4/17/2007 4:12 PM

An aside: Text-only sites and "link"

Meta-information can be used to describe alternate content.

W3C Web Content Accessibility Guidelines: alternate pages http://www.w3.org/TR/WAI-WEBCONTENT-TECHS/#alt-pages

In ~cscie12/public_html/index2.html

Lynx view of index2.html provides the text-only version as a

link:

view plain print ?

<link title="Text-only version" 1. rel="alternate" 2. href="http://cscie12.dce.harvard.edu/text/index.html" 3. media="aural, braille, tty"/> 4.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

32 of 110 4/17/2007 4:12 PM

Meta Refresh

Note: redirection may also be achieved on some browsers by using the http-equiv attribute of the <meta> element. More information and examples are provided athttp://www.fas.harvard.edu/~web/tutorial/meta/refresh/ . The recommended method is to do it at the server level.

view plain print ?

<!-- in head --> 1.<!-- will redirect in 10 seconds --> 2.<meta http-equiv="Refresh" content="10; URL=http://www.harvard.edu/"/> 3.

Page 17: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

33 of 110 4/17/2007 4:12 PM

Directory Index and Listings

Note: Remember the difference between a directory having rwx-----x and rwx---r-x permissions?

DirectoryIndex http://www.apache.org/docs/2.0/mod/mod_dir.html Would you prefer main.html or overview.html to be the default files returned when a directoryis requested?mod_autoindex http://www.apache.org/docs/2.0/mod/mod_autoindex.html Provides for automatic indexing of a directory.

DirectoryIndex

DirectoryIndex index.html main.html overview.html slide1.html 1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

34 of 110 4/17/2007 4:12 PM

More Control over Directory Listings

mod_autoindex

Basic

Custom

The details:

Page 18: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

35 of 110 4/17/2007 4:12 PM

view plain print ?

minerva% pwd /home10/c/s/cscie12/public_html/autoindex/ex2 1.minerva% ls -la 2.total 28 3.drwxr-xr-x 2 cscie12 courses 8192 Nov 27 13:28 . 4.drwxr-xr-x 6 cscie12 courses 8192 Nov 27 13:11 .. 5.-rw-r--r-- 1 cscie12 courses 207 Nov 27 13:12 .htaccess 6.-rw-r--r-- 1 cscie12 courses 147 Nov 27 13:09 HEADER.html 7.-rw-r--r-- 1 cscie12 courses 66 Nov 27 13:09 README.html 8.-rw-r--r-- 1 cscie12 courses 4168 Nov 27 12:58 client-server.gif 9.-rw-r--r-- 1 cscie12 courses 906 Nov 27 12:58 slide1.html 10.-rw-r--r-- 1 cscie12 courses 743 Nov 27 12:58 slide2.html 11.-rw-r--r-- 1 cscie12 courses 1208 Nov 27 12:58 slide3.html 12.minerva% cat .htaccess 13.IndexOptions FancyIndexing 14.IndexOptions IconsAreLinks IconHeight=22 IconWidth=20 \ 15. NameWidth=* ScanHTMLTitles SuppressLastModified \ 16. SuppressSize SuppressColumnSorting \ 17. SuppressHTMLPreamble 18.IndexIgnore *.gif .. 19.minerva% 20.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

36 of 110 4/17/2007 4:12 PM

Setting HTTP Headers

ExpiresHeaders

Page 19: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

37 of 110 4/17/2007 4:12 PM

Expires

Module mod_expires http://www.apache.org/docs/2.0/mod/mod_expires.html

.htaccess

Or, expire based upon modification time of document:

From the Apache mod_expires documentation:

This module controls the setting of the Expires HTTP header in server responses. The expirationdate can set to be relative to either the time the source file was last modified, or to the time of theclient access.

The Expires HTTP header is an instruction to the client about the document's validity andpersistence. If cached, the document may be fetched from the cache rather than from the sourceuntil this time has passed. After that, the cache copy is considered "expired" and invalid, and anew copy must be obtained from the source.

ExpiresActive On 1. 2.ExpiresByType text/html A3600 3.# HTML expires in 1 hour 4. 5.ExpiresByType image/gif A2592000 6.# GIF expires in 30 days 7. 8.ExpiresByType image/jpeg A2592000 9.# JPEG expires in 30 days 10. 11.ExpiresByType image/png A2592000 12.# PNG expires in 30 days 13. 14.# types not specified 15.ExpiresDefault "now plus 1 day" 16.# expires in 1 day 17.

ExpiresActive On 1.ExpiresByType text/html M86400 2.# HTML expires 1 day after it was last modified 3.ExpiresDefault M86400 4.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

38 of 110 4/17/2007 4:12 PM

Do not cache

If you do not want your page cached, set these HTTP response headers:

In .htaccess in Apache, this would translate to:

view plain print ?

Cache-control: no-cache 1.Pragma: no-cache 2.Expires: <set to now> 3.

ExpiresDefault "now" 1.Header set Pragma "no-cache" 2.

Page 20: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

39 of 110 4/17/2007 4:12 PM

Headers

mod_headers http://www.apache.org/docs/2.0/mod/mod_headers.html

The optional headers module allows for the customization of HTTP response headers. Headers canbe merged, replaced or removed. The server will always add a "Server" and "Date" header to the HTTP response.

view plain print ?

Header set Author "David P. Heitmeyer" 1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

40 of 110 4/17/2007 4:12 PM

Usertrack with Cookies

mod_usertrack

.htaccess file:

view plain print ?

minerva% lwp-request -USed http://cscie12.dce.harvard.edu/apache/usertrack/ 1.GET http://cscie12.dce.harvard.edu/apache/usertrack/ 2.User-Agent: lwp-request/1.38 3. 4.GET http://cscie12.dce.harvard.edu/apache/usertrack/ --> 200 OK 5.Connection: close 6.Date: Wed, 13 Apr 2005 20:32:54 GMT 7.Accept-Ranges: bytes 8.Server: Apache/2.0.49 (Fedora) 9.Content-Length: 59 10.Content-Type: text/html; charset=UTF-8 11.Client-Date: Wed, 13 Apr 2005 20:32:54 GMT 12.Client-Peer: 140.247.197.240:80 13.Client-Response-Num: 1 14.Set-Cookie2: MyCookie=140.247.197.240.1113424374035983; \ 15. path=/; max-age=2858400; \ 16. domain=.dce.harvard.edu; version=1 17.

CookieTracking on 1.CookieStyle RFC2965 2.CookieName MyCookie 3.CookieExpires "1 month 3 days 2 hours" 4.CookieDomain .dce.harvard.edu 5.

Page 21: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

41 of 110 4/17/2007 4:12 PM

WWW Access Control

You can implement access control on all or part of your Web site so that:

users must provide a username and password (Basic Authentication);users' computers must be within a particular domain

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

42 of 110 4/17/2007 4:12 PM

Basic Authentication: Warning

Basic Authentication alone does not provide the security and privacy to adequately protecttruly confidential or personal information.

Basic Authentication is analogous to simply "closing a door" to parts of your Web site. It will preventthe casual or polite users from "opening the door", but will not prevent someone mildly determinedto walking in.

Two issues that contribute to the lack of security and privacy are:

the content is transmitted over the network in plaintextthe usernames and passwords (submitted with each HTTP request) is transmitted over thenetwork in plaintext

Page 22: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

43 of 110 4/17/2007 4:12 PM

HTTP: Authenticate

view plain print ?

minerva% telnet 140.247.197.240 80 1.Trying 140.247.197.240... 2.Connected to 140.247.197.240. 3.Escape character is '^]'. 4.HEAD /apache/access/example1/ HTTP/1.1 5.Host: cscie12.dce.harvard.edu 6. 7.HTTP/1.1 401 Authorization Required 8.Connection: close 9.Date: Wed, 13 Apr 2005 20:44:39 GMT 10.Server: Apache/2.0.49 (Fedora) 11.WWW-Authenticate: Basic realm="Basic Authentication Tutorial 1" 12.Content-Length: 492 13.Content-Type: text/html; charset=iso-8859-1 14.Client-Date: Wed, 13 Apr 2005 20:44:39 GMT 15.Client-Peer: 140.247.197.240:80 16.Client-Response-Num: 1 17.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

44 of 110 4/17/2007 4:12 PM

HTTP: Authentication/Authorization

The username:password is sent MIME BASE 64 encoded (not encrypted).

view plain print ?

minerva% telnet 140.247.197.240 80 1.Trying 140.247.197.240... 2.Connected to 140.247.197.240. 3.Escape character is '^]'. 4.HEAD /apache/access/example1/ HTTP/1.1 5.Host: cscie12.dce.harvard.edu 6.Authorization: Basic Z3Vlc3Q6Z3Vlc3Q= 7. 8.HTTP/1.1 200 OK 9.Connection: close 10.Date: Wed, 13 Apr 2005 20:47:53 GMT 11.Accept-Ranges: bytes 12.Server: Apache/2.0.49 (Fedora) 13.Content-Length: 124 14.Content-Type: text/html; charset=UTF-8 15.Client-Date: Wed, 13 Apr 2005 20:47:53 GMT 16.Client-Peer: 140.247.197.240:80 17.Client-Response-Num: 1 18. 19. 20.

Page 23: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

45 of 110 4/17/2007 4:12 PM

Access Control Documentation

Apache

Apache FAQ has a section on user authentication.Using User Authentication from Apache WeekRelevant Apache Module and Directive Documentation

mod_access modulemod_auth modulerequire directivesatisfy directive

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

46 of 110 4/17/2007 4:12 PM

Implementing Access Control

To implement access control, you must create a file name '.htaccess' that contains with the properconfiguration instructions. You may also need to create a ".htpasswd" file using the utility"htpasswd" and a ".htgroup" file.

htpasswd program.htaccess filehtpasswd filehtgroup file

Page 24: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

47 of 110 4/17/2007 4:12 PM

htpasswd file

.htpasswd This file contains usernames and encrypted passwords (username:enc_passwd). It is created andmanaged with the utility, "htpasswd", which can be run from the command line.

This file should notlie within your public_html. It should reside at the root level of your home directory (for example,/home/courses/j/h/jharvard/.htpasswd

This file needs to be readable by the Web Server.

Sample content:

view plain print ?

minerva% which htpasswd 1./usr/bin/htpasswd 2.minerva% htpasswd 3.Usage: htpasswd [-c] passwordfile username 4.The -c flag creates a new file. 5. 6.

view plain print ?

minerva% more ~e12/.htpasswd.demo 1.guest:79WeSn3vYGsKQ 2.guest2:wGcgIYLtHNIpM 3.guest3:j9VzpSX/C8Kr2 4.guest4:CjHmW1PWNFwXM 5.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

48 of 110 4/17/2007 4:12 PM

htgroup file

.htgroup This file contains group definitions (group_name:member1 member2 ...).

This file should notlie within your public_html. It should reside at the root level of your home directory (for example,/home/courses/j/h/jharvard/.htgroup

This file needs to be readable by the Web Server.

Page 25: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

49 of 110 4/17/2007 4:12 PM

Access Control Examples

For the examples given, the user "cscie12" is used. You should substitute your username andhome directory appropriately.

The following .htpasswd.demo and .htgroup.demo files are used:

/home/e12/.htpasswd.demo The .htpasswd.demo was generated by using the utility "htpasswd"

Password for "guest" (and all other entries) is "guest". Entries for guest2, guest3, and guest4 arecreated without the "-c" flag, since the .htpasswd.demo file already exists.

Contents of file:

.htgroup.demo Contents of file:

view plain print ?

minerva% htpasswd 1.Usage: htpasswd [-c] passwordfile username 2.The -c flag creates a new file. 3.minerva% htpasswd -c /home/e12/.htpasswd.demo guest 4.Adding password for guest 5.New password: ***** 6.Re-type password: ***** 7.

guest:79WeSn3vYGsKQ 1.guest2:PR4APgA.4CKO. 2.guest3:5DbCMPbSDstj2 3.guest4:htPnr8jT4bI5E 4.

view plain print ?

VIP: guest guest4 1. 2.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

50 of 110 4/17/2007 4:12 PM

Access Control Example 1

Any valid user in .htpasswd.demo is allowed access

The"AuthName" is the description that is displayed by the browser in the Basic Authentication dialogbox.

Contents of sample .htaccess file:

Demonstration of Example 1 You may login as any of the following users (username:password): guest:guest guest2:guest guest3:guest guest4:guest

AuthName "Basic Authentication Tutorial 1" 1.AuthType Basic 2.AuthUserFile /home/e12/.htpasswd.demo 3.require valid-user 4.

view plain print ?

minerva% lwp-request -USed -C"guest:iforgot" http://cscie12.dce.harvard.edu/apache/access1.GET http://cscie12.dce.harvard.edu/apache/access/example1/ 2.Authorization: Basic Z3Vlc3Q6aWZvcmdvdA== 3.User-Agent: lwp-request/2.06 4. 5.GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required6.GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required7.Connection: close 8.Date: Wed, 13 Apr 2005 20:53:42 GMT 9.Server: Apache/2.0.49 (Fedora) 10.WWW-Authenticate: Basic realm="Basic Authentication Tutorial 1" 11.Content-Length: 492 12.Content-Type: text/html; charset=iso-8859-1 13.Client-Date: Wed, 13 Apr 2005 20:53:42 GMT 14.Client-Peer: 140.247.197.240:80 15.Client-Response-Num: 1 16.Client-Warning: Credentials for 'guest' failed before 17.Title: 401 Authorization Required 18.X-Pad: avoid browser bug 19. 20.minerva% lwp-request -USed -C"guest2:guest2" http://cscie12.dce.harvard.edu/apache/access21.GET http://cscie12.dce.harvard.edu/apache/access/example1/ 22.Authorization: Basic Z3Vlc3QyOmd1ZXN0Mg== 23.User-Agent: lwp-request/2.06 24. 25.GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required26.GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 200 OK 27.Connection: close 28.Date: Wed, 13 Apr 2005 20:59:05 GMT 29.Accept-Ranges: bytes 30.Server: Apache/2.0.49 (Fedora) 31.Content-Length: 124 32.Content-Type: text/html; charset=UTF-8 33.Client-Date: Wed, 13 Apr 2005 20:59:05 GMT 34.Client-Peer: 140.247.197.240:80 35.Client-Response-Num: 1 36.

Page 26: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

51 of 110 4/17/2007 4:12 PM

Access Control Example 2

Only certain users in .htpasswd.demo are allowed access

Contents of sample .htaccess file:

Demonstration of Example 2 Only guest2 and guest3 are authorized: guest2:guest guest3:guest

Unauthorized: guest:guest guest4:guest

AuthName "Basic Authentication Tutorial 2" 1.AuthType Basic 2.AuthUserFile /home/e12/.htpasswd.demo 3.require user guest2 guest3 4.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

52 of 110 4/17/2007 4:12 PM

Access Control Example 3

Only members of a particular group are allowed access

Contents of .htaccess file:

Contents of .htgroup.demo file:

Demonstration of Example 3 Only members of the group "VIP" (as defined by /home/e12/.htgroup.demo) are authorized (guestand guest4): guest:guest guest4:guest

Unauthorized: guest2:guest guest3:guest

AuthName "Basic Authentication Tutorial 3" 1.AuthType Basic 2.AuthUserFile /home/e12/.htpasswd.demo 3.AuthGroupFile /home/e12/.htgroup.demo 4.require group VIP 5.

view plain print ?

VIP: guest guest4 1. 2.

Page 27: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

53 of 110 4/17/2007 4:12 PM

Access Control Example 4

Only certain computers are allowed access

Contents of sample .htaccess file:

Demonstration of Example 4 Computers that are on the Harvard network (computers with hostnames ending in .harvard.edu orwith IP addreses beginning with 128.103 or 140.247) will have access, others will be denied.

order deny,allow 1.deny from all 2.allow from 140.247 3.allow from 128.103 4.allow from .harvard.edu 5.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

54 of 110 4/17/2007 4:12 PM

Access Control Example 5

Only certain computers are denied access

Contents of sample .htaccess file:

Demonstration of Example 5 Connections from within the domain 'fas.harvard.edu' will be denied.

order allow,deny 1.allow from all 2.deny from .fas.harvard.edu 3.

Page 28: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

55 of 110 4/17/2007 4:12 PM

Access Control Example 6

Certain computers are allowed in; others must provide a username and password

Contents of sample .htaccess file:

Demonstration of Example 6 Connection from within ".yale.edu" will be allowed; others must provide a valid username andpassword.

order deny,allow 1.deny from all 2.allow from .yale.edu 3.AuthType Basic 4.AuthUserFile /home/e12/.htpasswd.demo 5.AuthName "Basic Authentication Tutorial 6" 6.require valid-user 7.satisfy any 8.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

56 of 110 4/17/2007 4:12 PM

Access Control Example 7

Only certain computers are allowed in and users must provide a valid username andpassword.

Contents of sample .htaccess file:

Demonstration of Example 7 and satisfy all

order deny,allow 1.deny from all 2.allow from .harvard.edu 3.AuthType Basic 4.AuthUserFile /home/e12/.htpasswd.demo 5.AuthName "Basic Authentication Tutorial 7" 6.require valid-user 7.satisfy all 8.

Page 29: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

57 of 110 4/17/2007 4:12 PM

Requiring SSL (https://)

SSL (Secure Socket Layer) is a protocol that encrypts data between the client and the server. httpsis HTTP over SSL. More details in our last lecture on Security and Privacy.

Contents of sample .htaccess file:

Allowed: https://www.people.fas.harvard.edu/~heitmey/secure/index.htmlForbidden: http://www.people.fas.harvard.edu/~heitmey/secure/index.html

SSLRequireSSL 1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

58 of 110 4/17/2007 4:12 PM

Details about enabling .htaccess and allowed directives

Context: can these directives be in .htaccess files?AllowOverride: is the server configured to allow this group of directives to be overriden in thislocation?Is the required module loaded?

Page 30: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

59 of 110 4/17/2007 4:12 PM

Legal Directives I: Context

Certain Apache directives are legal within .htaccess files. Some are not. See the Apache Documentation for details. Specifically, look at the Context line that is given for thedirective in question.

Apache Core Features http://www.apache.org/docs/2.0/mod/core.htmlApache Module List http://www.apache.org/docs/2.0/mod/standard Apache Directives http://www.apache.org/docs/2.0/mod/directives.html

The following is an excerpt from the Apache HTTP Server Version 1.3 documentation

ErrorDocument directive

Syntax: ErrorDocument error-code document Context: server config, virtual host, directory, .htaccess Status: core Override: FileInfo Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.

Also, the "a" indicator on the Apache Quick Reference Card indicates that the directive is valid within an .htaccess file.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

60 of 110 4/17/2007 4:12 PM

Legal Directives II: AllowOverride

Users are allowed to override certain aspects of the main server configuration. The main server configuration file (httpd.conf) contains an AllowOverride directive that determines which directives within .htaccess files Apache will process. The Override line that is given for eachdirective in the Apache documentationindicates which configuration directive must be active in order to use that directive with an .htaccessfile.

For the FAS system, the main server configuration file has the following directive in place for users'public_html directories:

The following is an excerpt from the Apache HTTP Server Version 1.3 documentation

ErrorDocument directive

Syntax: ErrorDocument error-code document Context: server config, virtual host, directory, .htaccess Status: core Override: FileInfo Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.

AllowOverride FileInfo AuthConfig Limit Indexes Options 1. 2.

Page 31: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

61 of 110 4/17/2007 4:12 PM

Legal Directives III: Apache Modules

Apache is distributed with several modules. These modules may or may not be active within the Apache server with which you are working. The Core features will always be available.

For example, if the Rewrite Module (mod_rewrite) has not been activated, none of the Rewritedirectives will be available to use.

Refer to the Status and Modulelines in the documentation for each directive and to the documentation for the specific Apacheinstallation you are using.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

62 of 110 4/17/2007 4:12 PM

Apache Modules

On the Apache (Apache/2.0.51) minerva.dce.harvard.edu web server, the following Apachemodules are active:

LoadModule access_module modules/mod_access.soLoadModule auth_module modules/mod_auth.soLoadModule auth_anon_module modules/mod_auth_anon.soLoadModule auth_dbm_module modules/mod_auth_dbm.soLoadModule auth_digest_module modules/mod_auth_digest.soLoadModule ldap_module modules/mod_ldap.soLoadModule auth_ldap_module modules/mod_auth_ldap.soLoadModule include_module modules/mod_include.soLoadModule log_config_module modules/mod_log_config.soLoadModule env_module modules/mod_env.soLoadModule mime_magic_module modules/mod_mime_magic.soLoadModule cern_meta_module modules/mod_cern_meta.soLoadModule expires_module modules/mod_expires.soLoadModule deflate_module modules/mod_deflate.soLoadModule headers_module modules/mod_headers.soLoadModule usertrack_module modules/mod_usertrack.soLoadModule setenvif_module modules/mod_setenvif.soLoadModule mime_module modules/mod_mime.soLoadModule dav_module modules/mod_dav.soLoadModule status_module modules/mod_status.soLoadModule autoindex_module modules/mod_autoindex.soLoadModule asis_module modules/mod_asis.soLoadModule info_module modules/mod_info.soLoadModule dav_fs_module modules/mod_dav_fs.soLoadModule vhost_alias_module modules/mod_vhost_alias.soLoadModule negotiation_module modules/mod_negotiation.soLoadModule dir_module modules/mod_dir.soLoadModule imap_module modules/mod_imap.soLoadModule actions_module modules/mod_actions.soLoadModule speling_module modules/mod_speling.soLoadModule userdir_module modules/mod_userdir.soLoadModule alias_module modules/mod_alias.soLoadModule rewrite_module modules/mod_rewrite.soLoadModule proxy_module modules/mod_proxy.soLoadModule proxy_ftp_module modules/mod_proxy_ftp.soLoadModule proxy_http_module modules/mod_proxy_http.soLoadModule proxy_connect_module modules/mod_proxy_connect.soLoadModule cache_module modules/mod_cache.soLoadModule suexec_module modules/mod_suexec.soLoadModule disk_cache_module modules/mod_disk_cache.soLoadModule file_cache_module modules/mod_file_cache.soLoadModule mem_cache_module modules/mod_mem_cache.soLoadModule cgi_module modules/mod_cgi.soLoadModule dav_svn_module /usr/lib/httpd/modules/mod_dav_svn.soLoadModule authz_svn_module /usr/lib/httpd/modules/mod_authz_svn.so

Page 32: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

63 of 110 4/17/2007 4:12 PM

Webmaster Tools

Site Icons ('favicon.ico')Web Robots

Link CheckingSearch Robots

Other Webmaster ToolsHTML/CSS ValidationAccessibility ComplianceWeb Site MirroringConverting HTML to other formatsHTTP Server Performance

Log Analysis

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

64 of 110 4/17/2007 4:12 PM

Site Icons

favicon.ico at root of web siteor "link" element in "head" element of XHTML/HTML document

MSIE uses http://www.somesite.com/favicon.ico for icons in the bookmark list.

Firefox uses favicon.ico or link element, rel="icon", in the location bar, bookmark list and tab display.

The code in the 'head' of the XHTML would look something like:

view plain print ?

<link rel="icon" href="images/mozilla-16.png" type="image/png"/> 1.<link rel="shortcut icon" href="images/mozilla.ico" type="image/x-icon"/> 2.

Page 33: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

65 of 110 4/17/2007 4:12 PM

SEO: Search Engine Optimization

Make your site ready for search engines

well-formed (and hopefully valid) HTML/XHTML.use mark-up language for headings and liststitles that stand on their own"meta" keywords and description

An example using O'Reilly OnLamp.com

In "head" element of page:

view plain print ?

<meta name="keywords" content="ONLamp.com,O'Reilly Network,oreillynet, 1.oreillynet.com,O'Reilly,OREILLY,o'reilly network,o'reilly, 2.onlamp.com,lamp,lampp,linux,apache,mysql,perl,python, 3.php,linux,bsd,web development,server development reference, 4.technical information,open source" /> 5. 6.<meta name="description" content="Welcome to ONLamp.com, 7.the high performance web development site from the O'Reilly Network 8.offering comprehensive Lamp developer information and resources. 9.O'Reilly Network's ONLamp site features original articles, 10.news and commentary." /> 11.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

66 of 110 4/17/2007 4:12 PM

Firefox as a Web Development Tool

Web Developer Extension

Firefox Extension - Live HTTP Headers

Firefox Extension - Firebug

Page 34: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

67 of 110 4/17/2007 4:12 PM

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

68 of 110 4/17/2007 4:12 PM

xurl and churl

xurl. A simple Perl script that extract the links for a single page. Adapted from The Perl Cookbook.minerva% xurl URL

churl. A simple Perl script that will check the links for a single page. Adapted from The Perl Cookbook.minerva% churl URL

Page 35: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

69 of 110 4/17/2007 4:12 PM

xurl

view plain print ?

minerva% xurl http://www.extension.harvard.edu/ 1.http://dceweb.harvard.edu/prod/sowxcrq.taf?school=EXT 2.http://dceweb.harvard.edu/prod/sswckce.taf?wgrp=EXT 3.http://my.extension.harvard.edu/ 4.http://www.extension.harvard.edu/2006-07/courses/ 5.http://www.extension.harvard.edu/2006-07/courses/DistanceEd/courses/ 6.http://www.extension.harvard.edu/2006-07/courses/citations.jsp 7.http://www.extension.harvard.edu/2006-07/forms/ 8.http://www.extension.harvard.edu/2006-07/help/directory.jsp;jsessionid=HGELDPGDCCIK 9.http://www.extension.harvard.edu/2006-07/images/go2.jpg 10.http://www.extension.harvard.edu/2006-07/images/home.jpg 11.http://www.extension.harvard.edu/2006-07/images/profiles/default-5.jpg 12.http://www.extension.harvard.edu/2006-07/images/snaps.jpg 13.http://www.extension.harvard.edu/2006-07/images/veri.gif 14.http://www.extension.harvard.edu/2006-07/news/;jsessionid=HGELDPGDCCIK 15.http://www.extension.harvard.edu/2006-07/news/chaisson.jsp;jsessionid=HGELDPGDCCIK 16.http://www.extension.harvard.edu/2006-07/news/creatures.jsp;jsessionid=HGELDPGDCCIK 17.http://www.extension.harvard.edu/2006-07/news/earthday.jsp;jsessionid=HGELDPGDCCIK 18.http://www.extension.harvard.edu/2006-07/news/gittleman.jsp;jsessionid=HGELDPGDCCIK 19.http://www.extension.harvard.edu/2006-07/news/retirement.jsp;jsessionid=HGELDPGDCCIK 20.http://www.extension.harvard.edu/2006-07/news/volunteers.jsp;jsessionid=HGELDPGDCCIK 21.http://www.extension.harvard.edu/2006-07/overview/ 22.http://www.extension.harvard.edu/2006-07/overview/tradition.jsp 23.http://www.extension.harvard.edu/2006-07/overview/video/ 24.http://www.extension.harvard.edu/2006-07/overview/welcome.jsp 25.http://www.extension.harvard.edu/2006-07/programs/ 26.http://www.extension.harvard.edu/2006-07/programs/default.jsp#cert 27.http://www.extension.harvard.edu/2006-07/programs/info.jsp 28.http://www.extension.harvard.edu/2006-07/register/ 29.http://www.extension.harvard.edu/2006-07/register/financial/ 30.http://www.extension.harvard.edu/2006-07/register/financial/finaid.jsp 31.http://www.extension.harvard.edu/2006-07/register/guidelines/calendar/ 32.http://www.extension.harvard.edu/2006-07/register/guidelines/international.jsp 33.http://www.extension.harvard.edu/2006-07/register/policies/transcripts.jsp 34.http://www.extension.harvard.edu/2006-07/stylesheets/home-print.css 35.http://www.extension.harvard.edu/2006-07/stylesheets/home-screen.css 36.http://www.extension.harvard.edu/DistanceEd/ 37.http://www.extension.harvard.edu/chooser;jsessionid=HGELDPGDCCIK 38.http://www.google-analytics.com/urchin.js 39.https://dceweb.harvard.edu/prod/gowlogn3.taf 40.javascript:popUp('/2006-07/snapshots/') 41.javascript:popUp2('/2006-07/profiles/default.jsp?n=5') 42.mailto:[email protected] 43.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

70 of 110 4/17/2007 4:12 PM

churl

view plain print ?

minerva$ churl http://www.extension.harvard.edu/ 1.http://www.extension.harvard.edu/: 2.200 OK http://dceweb.harvard.edu/prod/sowxcrq.taf?school=EXT 3.200 OK http://dceweb.harvard.edu/prod/sswckce.taf?wgrp=EXT 4.200 OK http://my.extension.harvard.edu/ 5.200 OK http://www.extension.harvard.edu/2006-07/courses/ 6.200 OK http://www.extension.harvard.edu/2006-07/courses/DistanceEd/courses/ 7.200 OK http://www.extension.harvard.edu/2006-07/courses/citations.jsp 8.200 OK http://www.extension.harvard.edu/2006-07/forms/ 9.200 OK http://www.extension.harvard.edu/2006-07/help/directory.jsp;jsessionid=KPLBEAHDCC10.200 OK http://www.extension.harvard.edu/2006-07/images/go2.jpg 11.200 OK http://www.extension.harvard.edu/2006-07/images/home.jpg 12.200 OK http://www.extension.harvard.edu/2006-07/images/profiles/default-5.jpg 13.200 OK http://www.extension.harvard.edu/2006-07/images/snaps.jpg 14.200 OK http://www.extension.harvard.edu/2006-07/images/veri.gif 15.200 OK http://www.extension.harvard.edu/2006-07/news/;jsessionid=KPLBEAHDCCIK 16.200 OK http://www.extension.harvard.edu/2006-07/news/chaisson.jsp;jsessionid=KPLBEAHDCCI17.200 OK http://www.extension.harvard.edu/2006-07/news/creatures.jsp;jsessionid=KPLBEAHDCC18.200 OK http://www.extension.harvard.edu/2006-07/news/earthday.jsp;jsessionid=KPLBEAHDCCI19.200 OK http://www.extension.harvard.edu/2006-07/news/gittleman.jsp;jsessionid=KPLBEAHDCC20.200 OK http://www.extension.harvard.edu/2006-07/news/retirement.jsp;jsessionid=KPLBEAHDC21.200 OK http://www.extension.harvard.edu/2006-07/news/volunteers.jsp;jsessionid=KPLBEAHDC22.200 OK http://www.extension.harvard.edu/2006-07/overview/ 23.200 OK http://www.extension.harvard.edu/2006-07/overview/tradition.jsp 24.200 OK http://www.extension.harvard.edu/2006-07/overview/video/ 25.200 OK http://www.extension.harvard.edu/2006-07/overview/welcome.jsp 26.200 OK http://www.extension.harvard.edu/2006-07/programs/ 27.200 OK http://www.extension.harvard.edu/2006-07/programs/default.jsp#cert 28.200 OK http://www.extension.harvard.edu/2006-07/programs/info.jsp 29.200 OK http://www.extension.harvard.edu/2006-07/register/ 30.200 OK http://www.extension.harvard.edu/2006-07/register/financial/ 31.200 OK http://www.extension.harvard.edu/2006-07/register/financial/finaid.jsp 32.200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/calendar/ 33.200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/international.jsp 34.200 OK http://www.extension.harvard.edu/2006-07/register/policies/transcripts.jsp 35.200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-print.css 36.200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-screen.css 37.200 OK http://www.extension.harvard.edu/DistanceEd/ 38.200 OK http://www.extension.harvard.edu/chooser;jsessionid=KPLBEAHDCCIK 39.200 OK http://www.google-analytics.com/urchin.js 40.SKIP https://dceweb.harvard.edu/prod/gowlogn3.taf 41.SKIP javascript:popUp('/2006-07/snapshots/') 42.SKIP javascript:popUp2('/2006-07/profiles/default.jsp?n=5') 43.SKIP mailto:[email protected] 44. 45.

Page 36: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

71 of 110 4/17/2007 4:12 PM

Page Weight

Page weight of http://www.harvard.edu/

Firefox Web Developer Tool Bar

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

72 of 110 4/17/2007 4:12 PM

timefetch

view plain print ?

[dheitmey@minerva dheitmey]$ timefetch -rv http://www.harvard.edu/ 1. 0.007 1.6kb: http://www.harvard.edu/images/global/shield1.gif 2. 0.007 7.4kb: http://www.harvard.edu/images/global/shield2.gif 3. 0.007 11.9kb: http://www.harvard.edu/images/global/banner.gif 4. 0.006 2.6kb: http://www.harvard.edu/images/global/shield3.gif 5. 0.007 0.2kb: http://www.harvard.edu/images/global/home2.gif 6. 0.006 0.1kb: http://www.harvard.edu/images/global/nav_bullet.gif 7. 0.006 0.7kb: http://www.harvard.edu/images/global/admissions.gif 8. 0.006 0.4kb: http://www.harvard.edu/images/global/employment.gif 9. 0.006 0.3kb: http://www.harvard.edu/images/global/libraries.gif 10. 0.006 0.3kb: http://www.harvard.edu/images/global/museums.gif 11. 0.006 0.2kb: http://www.harvard.edu/images/global/arts.gif 12. 0.006 0.4kb: http://www.harvard.edu/images/global/president.gif 13. 0.006 0.1kb: http://www.harvard.edu/images/global/nav2_bullet.gif 14. 0.006 0.4kb: http://www.harvard.edu/images/global/administration.gif 15. 0.006 0.3kb: http://www.harvard.edu/images/global/schools.gif 16. 0.006 0.6kb: http://www.harvard.edu/images/global/neighbors.gif 17. 0.006 0.3kb: http://www.harvard.edu/images/global/athletics.gif 18. 0.006 0.3kb: http://www.harvard.edu/images/global/alumni.gif 19. 0.006 0.3kb: http://www.harvard.edu/images/global/search.gif 20. 0.006 0.7kb: http://www.harvard.edu/includes/inst_image/titles/062.gif 21. 0.008 28.7kb: http://www.harvard.edu/includes/inst_image/images/062.jpg 22. 0.006 0.8kb: http://www.harvard.edu/images/home/schools_header.gif 23. 0.006 0.0kb: http://www.harvard.edu/images/home/spacer.gif 24. 0.006 0.4kb: http://www.harvard.edu/images/home/sch_bus.gif 25. 0.006 0.4kb: http://www.harvard.edu/images/home/sch_eng.gif 26. 0.006 0.3kb: http://www.harvard.edu/images/home/sch_under.gif 27. 0.006 0.4kb: http://www.harvard.edu/images/home/sch_gov.gif 28. 0.006 0.5kb: http://www.harvard.edu/images/home/sch_dce.gif 29. 0.006 0.5kb: http://www.harvard.edu/images/home/sch_grad.gif 30. 0.006 0.3kb: http://www.harvard.edu/images/home/sch_dental.gif 31. 0.006 0.2kb: http://www.harvard.edu/images/home/sch_law.gif 32. 0.006 0.3kb: http://www.harvard.edu/images/home/sch_des.gif 33. 0.006 0.3kb: http://www.harvard.edu/images/home/sch_med.gif 34. 0.006 0.3kb: http://www.harvard.edu/images/home/sch_div.gif 35. 0.006 0.4kb: http://www.harvard.edu/images/home/sch_health.gif 36. 0.006 0.4kb: http://www.harvard.edu/images/home/sch_ed.gif 37. 0.006 0.5kb: http://www.harvard.edu/images/home/sch_rad.gif 38. 0.006 0.1kb: http://www.harvard.edu/images/home/event.gif 39. 0.006 0.1kb: http://www.harvard.edu/images/home/research.gif 40. 0.006 0.1kb: http://www.harvard.edu/images/home/video.gif 41. 0.008 36.1kb: http://www.harvard.edu/images/home/news/070416a.jpg 42. 0.008 46.9kb: http://www.harvard.edu/images/home/news/070415.jpg 43. 0.006 0.4kb: http://www.harvard.edu/images/home/news/othernews.gif 44. 0.007 19.1kb: http://www.harvard.edu/images/home/news/drew_t.jpg 45. 0.006 0.4kb: http://www.harvard.edu/images/global/about.gif 46. 0.006 0.1kb: http://www.harvard.edu/images/global/footer_bull.gif 47. 0.006 0.4kb: http://www.harvard.edu/images/global/directories.gif 48. 0.006 0.4kb: http://www.harvard.edu/images/global/contact.gif 49. 0.006 0.3kb: http://www.harvard.edu/images/global/infotech.gif 50. 0.006 0.3kb: http://www.harvard.edu/images/global/news.gif 51. 0.006 0.4kb: http://www.harvard.edu/images/global/siteguide.gif 52.---------------------------------------------------- 53. 0.014 21.5kb: http://www.harvard.edu/ 54. 0.445 190.8kb: http://www.harvard.edu/ (incl.: img) 55.

Page 37: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

73 of 110 4/17/2007 4:12 PM

view plain print ?

minerva% timefetch -h 1.Usage: /usr/local/bin/timefetch [dhFjrvXz] [-f host] [-a attempts] [-b broken_images] 2. [-s size] [-T text] [-t timeout] http://url [http://url ... ] 3./usr/local/bin/timefetch -h for help message 4. 5. -a Number of attempts for the initial page fetch. 6. -b Minimum number of broken images to trigger alarm. 7. -d Debug: view all kinds of marginally useful output. 8. -f Force host: before doing recursive downloads, munge each URL 9. and replace the host in the URL with some other host. 10. -h Help: print this help message. 11. -j Java: download java applets as well. 12. -F No frames: If the page is a frameset, do *not* fetch the frames. 13. Default is to fetch them. 14. -r Recursive: download all images and calculate cumulative time. 15. -s Minimum size for the entire document (in kilobytes). 16. -t Timeout value for HTTP requests. 17. -T HTML text to scan for (such as "</html>"). Not case sensitive. 18. -v Verbose: print out URLs as they are downloaded. 19. -X Don't exit on errors, just try to continue. 20. -z Exit immediately on errors in fetching the main page. 21. 22. NOTE: This program always downloads embedded frames and prints 23. a cumulative total for frames and framesets, even if you did not 24. specify a recursive download. 25.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

74 of 110 4/17/2007 4:12 PM

timefetch examples

timefetch is in need of updating (does not parse CSS and will not get images referenced in CSS), but can still be a useful tool

timefetch will show the actual download time (often not useful) and the total kilobytes downloaded (often useful). Warning: timefetch will not execute JavaScript, nor does it fetch images included byCSS).

Compare to:

view plain print ?

[dheitmey@minerva dheitmey]$ timefetch -rv http://cscie12.dce.harvard.edu/ 1. 0.089 17.9kb: http://images.amazon.com/images/P/059610197X.01._AA240_SCLZZZZZZZ_.jpg 2. 1.054 11.2kb: http://ec1.images-amazon.com/images/P/0596009879.01._AA240_SCLZZZZZZZ_V373.---------------------------------------------------- 4. 0.788 17.6kb: http://isites.harvard.edu/icb/icb.do?keyword=k12622 5. 2.033 46.6kb: http://cscie12.dce.harvard.edu/ (incl.: img) 6.

Page 38: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

75 of 110 4/17/2007 4:12 PM

Web Robots

Robots, Spider, Crawlers

As they "spider" a site, the robots can perform various actions, such as:

Gathering content for search engines or a website mirrorValidating, checking, or processing content

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

76 of 110 4/17/2007 4:12 PM

Spidering Behavior: an example with Lynx

After lynx is done, here's a look at the files we have:

the "lnkNNNNNNNN.dat" files contain the text dump of the pages lynx retrievedthe "traverse.dat" files contain the list of link that lynx retrivedthe "reject.dat" files contain a list of URLs that lynx did not fetch (due to the fact that they are outside the "realm" as specified on the command line).

view plain print ?

minerva% lynx -traversal \ 1.> -crawl \ 2.> -realm cscie12.dce.harvard.edu \ 3.> http://cscie12.dce.harvard.edu/lecture_notes/2006-07/20070410/toc.html 4.

view plain print ?

minerva% ls -l 1.total 540 2.-rw------- 1 dheitmey teaching 3559 Apr 17 15:34 lnk00000000.dat 3.-rw------- 1 dheitmey teaching 59557 Apr 17 15:34 lnk00000001.dat 4.-rw------- 1 dheitmey teaching 13851 Apr 17 15:34 lnk00000002.dat 5.-rw------- 1 dheitmey teaching 757 Apr 17 15:34 lnk00000003.dat 6.-rw------- 1 dheitmey teaching 3562 Apr 17 15:34 lnk00000004.dat 7..... truncated ... 8.-rw------- 1 dheitmey teaching 838 Apr 17 15:34 lnk00000101.dat 9.-rw------- 1 dheitmey teaching 2659 Apr 17 15:34 lnk00000102.dat 10.-rw------- 1 dheitmey teaching 28336 Apr 17 15:34 reject.dat 11.-rw------- 1 dheitmey teaching 20385 Apr 17 15:34 traverse2.dat 12.-rw------- 1 dheitmey teaching 7709 Apr 17 15:34 traverse.dat 13.

Page 39: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

77 of 110 4/17/2007 4:12 PM

Robots Exclusion Standard (RES)

The Web Robots Pageshttp://www.robotstxt.org/wc/robots.htmlA Standard for Robot Exclusionhttp://www.robotstxt.org/wc/norobots.htmlThe Web Robots FAQ

RES provides two mechanisms to instruct robots that visit your site:

robots.txt file1.robots meta tag2.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

78 of 110 4/17/2007 4:12 PM

robots.txt

Two directives:

User-AgentDisallow

Note: robots.txt must lie at the root level of the server.

Examples of robots.txt files:

http://www.fas.harvard.edu/robots.txthttp://www.foxnews.com/robots.txtfind them at a couple of your favorite web sites

Why won't the following robots.txt files do anything useful? (they aren't at the root level of server)

http://www.people.fas.harvard.edu/~jharvard/robots.txthttp://www.fas.harvard.edu/computing/robots.txt

Page 40: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

79 of 110 4/17/2007 4:12 PM

robots.txt Examples

http://www.npr.org/robots.txt

http://www.foxnews.com/robots.txt

Disallow all robots from certain areas: 1.User-agent: * 2.Disallow: /cgi-bin 3.Disallow: /ramfiles/ 4.Disallow: /*.smil 5.Disallow: /*.asx 6.Disallow: /*.ram 7.Disallow: /*.rmm 8.Disallow: /*.js 9.Disallow: /*.au 10.Disallow: /stations/force/force_localization.php? 11.Disallow: /rundowns/segment.php? 12.

User-agent: * 1.Disallow: / 2. 3.User-agent: fusionbot 4.User-agent: Googlebot 5.Disallow: /printer_friendly_story 6. 7.User-agent: Mediapartners-Google* 8.Disallow: /printer_friendly_story 9. 10.User-agent: Teoma 11.Disallow: /printer_friendly_story 12. 13.User-agent: yahoo-newscrawler 14.Disallow: /printer_friendly_story 15. 16.User-agent: Yahoo! Slurp 17.Disallow: /printer_friendly_story 18. 19.User-agent: newslookup-bot 20.Disallow: /printer_friendly_story 21. 22.User-agent: gsa-crawler 23.Disallow: /printer_friendly_story 24.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

80 of 110 4/17/2007 4:12 PM

Robots meta element

name="robots"content

index or noindexfollow or nofollow

The Robots meta element can be used on a per document basis.

OK to index page; OK to follow links on page

OK to index page; Don'tfollow links on page

Don't index page; OK to follow links on page

Don't index page; Don't follow links on page

view plain print ?

<meta name="robots" content="index,follow"/> 1.

view plain print ?

<meta name="robots" content="index,nofollow"/> 1.

view plain print ?

<meta name="robots" content="noindex,follow"/> 1.

view plain print ?

<meta name="robots" content="noindex,nofollow"/> 1.

Page 41: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

81 of 110 4/17/2007 4:12 PM

Link Checking Robots

Check the links on a single page; or on an entire site.If following links, will do a get request, otherwise it should do a head request.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

82 of 110 4/17/2007 4:12 PM

Examples of Link Checking Robots

churlchecklinkwebbotcheckbotwebchecklynx

Page 42: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

83 of 110 4/17/2007 4:12 PM

checklink

W3C Link Checkerhttp://validator.w3.org/docs/checklink

Use Onlinehttp://validator.w3.org/checklinkUse command line:

Perl, Free

view plain print ?

minerva% checklink URL 1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

84 of 110 4/17/2007 4:12 PM

checklink

view plain print ?

minerva% checklink --help 1.W3C checklink version 3.6.2.26 (c) 1999-2004 W3C 2.Usage: checklink <options> <uris> 3.Options: 4. -s/--summary Result summary only. 5. -b/--broken Show only the broken links, not the redirects. 6. -e/--directory Hide directory redirects, for example 7. http://www.w3.org/TR -> http://www.w3.org/TR/ 8. -r/--recursive Check the documents linked from the first one. 9. -D/--depth n Check the documents linked from the first one 10. to depth n (implies --recursive). 11. -l/--location uri Scope of the documents checked in recursive mode. 12. By default, for example for 13. http://www.w3.org/TR/html4/Overview.html 14. it would be http://www.w3.org/TR/html4/ 15. -n/--noacclanguage Do not send an Accept-Language header. 16. -L/--languages Languages accepted (default: *). 17. -q/--quiet No output if no errors are found. Implies -s. 18. -v/--verbose Verbose mode. 19. -i/--indicator Show progress while parsing. 20. -u/--user username Specify a username for authentication. 21. -p/--password password Specify a password. 22. --hide-same-realm Hide 401's that are in the same realm as the 23. document checked. 24. -t/--timeout value Timeout for HTTP requests. 25. -d/--domain domain Regular expression describing the domain to 26. which the authentication information will be 27. sent. 28. --masquerade "base1 base2" Masquerade base URI base1 as base2. See manual 29. page for more information. 30. -y/--proxy proxy Specify an HTTP proxy server. 31. -h/--html HTML output. 32. -?/--help Show this message. 33. -V/--version Output version information. 34. 35.See "perldoc Net::FTP" for information about various environment variables 36.affecting FTP connections and "perldoc Net::NNTP" for setting a default 37.NNTP server for news: URIs. 38. 39.The W3C_CHECKLINK_CFG environment variable can be used to set the 40.configuration file to use. See details in the full manual page, it can 41.be displayed with: 42. perldoc /usr/local/bin/checklink 43. 44.More documentation at: http://www.w3.org/2000/07/checklink 45.Please send bug reports and comments to the www-validator mailing list: 46. [email protected] (with 'checklink' in the subject) 47. Archives are at: http://lists.w3.org/Archives/Public/www-validator/ 48.

Page 43: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

85 of 110 4/17/2007 4:12 PM

checklink

view plain print ?

minerva% checklink -r -D 0 -s -b \ 1.> http://cscie12.dce.harvard.edu/lecture_notes/20070131/ 2.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

86 of 110 4/17/2007 4:12 PM

webbot

webbot is part of the W3C Libwww package. http://www.w3.org/Robot/

view plain print ?

minerva% webbot 1. 2.W3C OpenSource Software 3.----------------------- 4. 5. Webbot version 5.4.0 6. using the W3C libwww library version 5.4.0. 7. 8. See "http://www.w3.org/Robot/User/CommandLine" for help 9. See "http://www.w3.org/Robot/User/" for user information 10. See "http://www.w3.org/Robot/" for general information 11. 12. Please send feedback to the <[email protected]> mailing list, 13. see "http://www.w3.org/Library/#Forums" for details 14.

Page 44: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

87 of 110 4/17/2007 4:12 PM

webbot example

view plain print ?

minerva% webbot -img \ 1.> -depth 99 \ 2.> -prefix http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \ 3.> -include http://cscie12.dce.harvard.edu/lecture_notes/20060131/ 4.> -404 404.log 5.> -l clf.log 6.> -referer referer.log 7.> -reject reject.log 8.> http://cscie12.dce.harvard.edu/lecture_notes/20060131/ 9....content removed... 10.Robot....... Received element 0, attribute 5 with anchor 0x8073700 11.Robot....... Found `http://cscie12.dce.harvard.edu/' - 12............. Already checked 13.Robot....... Received element 0, attribute 5 with anchor 0x8073688 14.Robot....... Found `http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide1.html' - 15............. Already checked 16.Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide0.html 17. 2 outstanding requests 18.Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/verit19. 1 outstanding request 20.Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/KUSea21. Everything is finished... 22. 23.Accessed 62 documents in 2.61 seconds (23.79 requests pr sec) 24. Did a GET on 53 document(s) and downloaded 182K bytes of document bodies (71396.825. Did a HEAD on 9 document(s) with a total of 49K bytes 26. 27.Raw Log files: 28. Logged 62 entries in general log file `clf.log' 29. Logged 61 entries in referer log file `referer.log' 30. Logged 51 entries in rejected log file `reject.log' 31. Logged 0 entries in not found log file `404.log' 32.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

88 of 110 4/17/2007 4:12 PM

checkbot

Checkbot http://degraaff.org/checkbot/

minerva% checkbot

Checkbot Example Output

Page 45: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

89 of 110 4/17/2007 4:12 PM

Checkbot Example

view plain print ?

minerva% checkbot --help 1.Checkbot 1.75 command line options: 2. 3. --cookies Accept cookies from the server 4. --debug Debugging mode: No pauses, stop after 25 links. 5. --mailto address Mail brief synopsis to address when done. 6. --noproxy domains Do not proxy requests to given domains. 7. --verbose Verbose mode: display many messages about progress. 8. --url url Start URL 9. --match match Check pages only if URL matches `match' 10. If no match is given, the start URL is used as a match 11. --exclude exclude Exclude pages if the URL matches 'exclude' 12. --filter regexp Run regexp on each URL found 13. --ignore ignore Ignore URLs matching 'ignore' 14. --suppress file Use contents of 'file' to suppress errors in output 15. --file file Write results to file, default is checkbot.html 16. --note note Include Note (e.g. URL to report) along with Mail message. 17. --proxy URL URL of proxy server for HTTP and FTP requests. 18. --internal-only Only check internal links, skip checking external links. 19. --sleep seconds Sleep this many seconds between requests (default 0) 20. --style url Reference the style sheet at this URL. 21. --timeout seconds Timeout for http requests in seconds (default 120) 22. --interval seconds Maximum time interval between updates (default 10800) 23. --dontwarn codes Do not write warnings for these HTTP response codes 24. --enable-virtual Use only virtual names, not IP numbers for servers 25. --language Specify 2-letter language code for language negotiation 26. 27.Options --match, --exclude, and --ignore can take a perl regular expression 28.as their argument 29. 30.Use 'perldoc checkbot' for more verbose documentation. 31.Checkbot WWW page : http://degraaff.org/checkbot/ 32.Mail bugs and problems: [email protected] 33.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

90 of 110 4/17/2007 4:12 PM

checkbot

Results are given in an HTML page: checkbot.html

view plain print ?

minerva% checkbot 1.> --url http://cscie12.dce.harvard.edu/lecture_notes/20060131/ 2.> --verbose 3.

Page 46: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

91 of 110 4/17/2007 4:12 PM

For the Programmer: Writing Your Own

The Perl modules, LWP and WWW::Robot make writing robots almost trivial.

Examples in Perl Cookbook, published by O'ReillyPerl and LWP by Sean Burke, published by O'Reilly

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

92 of 110 4/17/2007 4:12 PM

Other Webmaster Tools

Checking HTML PagesWeb Site MirroringDocument Version ControlMonitor HTTP Server Performance

Page 47: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

93 of 110 4/17/2007 4:12 PM

Checking HTML Pages

HTML tidy, http://tidy.sourceforge.net/W3C HTML Validation, http://validator.w3.org/W3C CSS Validation, http://jigsaw.w3.org/css-validator/WebXact (WAI and Section 508 Compliance), http://webxact.watchfire.com/Watchfire.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

94 of 110 4/17/2007 4:12 PM

Web Site Mirroring

GNU wget http://www.gnu.org/software/wget/wget.htmlw3mir http://langfeldt.net/w3mir/

Page 48: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

95 of 110 4/17/2007 4:12 PM

GNU wget

GNU wget http://www.gnu.org/software/wget/wget.html

view plain print ?

minerva% wget --help 1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

96 of 110 4/17/2007 4:12 PM

w3mir

w3mir http://langfeldt.net/w3mir/

w3miris a all purpose HTTP copying and mirroring tool. The main focus of w3mir is to create and maintaina browseable copy of one, or several, remote WWW site(s). Used to the max w3mir can retrieve thecontents of several related sites and leave the mirror browseable via a local web server, or from a filesystem, such as directly from a CDROM.

Page 49: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

97 of 110 4/17/2007 4:12 PM

HTTP Server Stress Test

ApacheBench (ab) is part of the Apache HTTP Server Distribution http://www.apache.org/httpd.html

minerva% ab -h

Apache JMeter http://jakarta.apache.org/jmeter/index.html

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

98 of 110 4/17/2007 4:12 PM

Apache Bench (ab)

view plain print ?

minerva% /usr/bin/ab -n 10000 -c 10 http://cscie12.dce.harvard.edu/lecture_notes/ 1.This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0 2.Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ 3.Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ 4. 5.Benchmarking cscie12.dce.harvard.edu (be patient) 6.Completed 1000 requests 7.Completed 2000 requests 8.Completed 3000 requests 9.Completed 4000 requests 10.Completed 5000 requests 11.Completed 6000 requests 12.Completed 7000 requests 13.Completed 8000 requests 14.Completed 9000 requests 15.Finished 10000 requests 16. 17. 18.Server Software: Apache/2.0.51 19.Server Hostname: cscie12.dce.harvard.edu 20.Server Port: 80 21. 22.Document Path: /lecture_notes/ 23.Document Length: 1040 bytes 24. 25.Concurrency Level: 10 26.Time taken for tests: 23.653070 seconds 27.Complete requests: 10000 28.Failed requests: 0 29.Write errors: 0 30.Total transferred: 12097254 bytes 31.HTML transferred: 10406240 bytes 32.Requests per second: 422.78 [#/sec] (mean) 33.Time per request: 23.653 [ms] (mean) 34.Time per request: 2.365 [ms] (mean, across all concurrent requests) 35.Transfer rate: 499.43 [Kbytes/sec] received 36. 37.Connection Times (ms) 38. min mean[+/-sd] median max 39.Connect: 1 8 4.2 9 39 40.Processing: 6 14 4.3 14 44 41.Waiting: 0 8 4.2 8 36 42.Total: 18 23 4.9 21 47 43. 44.Percentage of the requests served within a certain time (ms) 45. 50% 21 46. 66% 21 47. 75% 22 48. 80% 23 49. 90% 31 50. 95% 38 51. 98% 39 52. 99% 40 53. 100% 47 (longest request) 54.

Page 50: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

99 of 110 4/17/2007 4:12 PM

Apache Bench (ab)

view plain print ?

minerva% /usr/sbin/ab -n 1000 -c 10 http://cscie12.dce.harvard.edu/tools/webcube.cgi 1.This is ApacheBench, Version 1.3d <$Revision: 1.67 $> apache-1.3 2.Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ 3.Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/ 4. 5.Benchmarking cscie12.dce.harvard.edu (be patient) 6.Completed 100 requests 7.Completed 200 requests 8.Completed 300 requests 9.Completed 400 requests 10.Completed 500 requests 11.Completed 600 requests 12.Completed 700 requests 13.Completed 800 requests 14.Completed 900 requests 15.Finished 1000 requests 16. 17. 18.Server Software: Apache/2.0.49 19.Server Hostname: cscie12.dce.harvard.edu 20.Server Port: 80 21. 22.Document Path: /tools/webcube.cgi 23.Document Length: 58163 bytes 24. 25.Concurrency Level: 10 26.Time taken for tests: 98.986587 seconds 27.Complete requests: 1000 28.Failed requests: 0 29.Write errors: 0 30.Total transferred: 58323402 bytes 31.HTML transferred: 58171098 bytes 32.Requests per second: 10.10 [#/sec] (mean) 33.Time per request: 989.866 [ms] (mean) 34.Time per request: 98.987 [ms] (mean, across all concurrent requests) 35.Transfer rate: 575.39 [Kbytes/sec] received 36. 37.Connection Times (ms) 38. min mean[+/-sd] median max 39.Connect: 0 0 0.1 0 1 40.Processing: 191 981 966.2 719 9244 41.Waiting: 188 810 750.6 586 6321 42.Total: 191 981 966.2 719 9244 43. 44.Percentage of the requests served within a certain time (ms) 45. 50% 719 46. 66% 1065 47. 75% 1278 48. 80% 1468 49. 90% 1956 50. 95% 2452 51. 98% 4006 52. 99% 5499 53. 100% 9244 (longest request) 54.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

100 of 110 4/17/2007 4:12 PM

HTTP Server Monitoring

Nagios http://www.nagios.org/Cricket http://cricket.sourceforge.net/ Monitoring Apache with Cricket

Page 51: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

101 of 110 4/17/2007 4:12 PM

Web Logs

Common Tools

AnalogAnalog + Report MagicAWStatsWebTrends

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

102 of 110 4/17/2007 4:12 PM

HTTP Server Logs

access logerror logreferer log (no longer common)user-agent log (no longer common)

error log:

[Thu Dec 2 11:26:08 1999] [notice] Apache/1.3.9 (Unix) configured -- resuming normal operat[Thu Dec 2 11:27:19 1999] [notice] caught SIGTERM, shutting down[Mon Dec 6 19:15:04 1999] [notice] Apache/1.3.9 (Unix) configured -- resuming normal operat[Mon Dec 6 19:27:33 1999] [notice] caught SIGTERM, shutting down

access log:

is03.fas.harvard.edu - - [02/Dec/1999:11:26:42 -0500] "GET /server-status HTTP/1.0" 200 1544140.247.30.103 - - [02/Dec/1999:11:26:48 -0500] "GET / HTTP/1.0" 200 1622is03.fas.harvard.edu - - [02/Dec/1999:11:26:56 -0500] "GET /server-info HTTP/1.0" 200 45662140.247.30.104 - - [06/Dec/1999:19:16:58 -0500] "GET / HTTP/1.0" 200 1622140.247.27.63 - - [06/Dec/1999:19:17:08 -0500] "GET / HTTP/1.1" 200 1622140.247.27.63 - - [06/Dec/1999:19:17:09 -0500] "GET /apache_pb.gif HTTP/1.1" 200 2326is04.fas.harvard.edu - - [06/Dec/1999:19:18:32 -0500] "GET /server-status HTTP/1.0" 200 1546

Page 52: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

103 of 110 4/17/2007 4:12 PM

Data in Access Logs

http://www.apache.org/docs/mod/mod_log_config.html

Typical Data:

TimeIP address / HostnameUsername (if under Authentication)RequestUser-AgentReferrer URLResponse StatusBytes returned

Possible Data:

The contents of a specified environment variableFilenameThe request protocolThe contents of specified HTTP request headersThe contents of specified HTTP response headersRemote logname (from identd, if supplied)The request methodThe canonical Port of the server serving the request.The process ID of the child that serviced the request.The query stringFirst line of requestThe time taken to serve the request.The URL path requested.The canonical ServerName of the server serving the request.The server name according to the UseCanonicalName setting.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

104 of 110 4/17/2007 4:12 PM

Log Formats

Common Log Format (CLF)host ident auth_user date request status bytes

User-Agent Logdate user-agent

Referer Logdate referrer-url request-url

Combined Log Formathost ident authuser date request status bytes referrer user-agent

Custom Log Formats in Apache

Page 53: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

105 of 110 4/17/2007 4:12 PM

Combined Log Format

damhab.103.60.61 - - [14/Sep/1999:11:37:46 -0400] "GET /~cscie12/syllabus/workshops.html HTTdamhab.103.60.61 - - [14/Sep/1999:11:37:46 -0400] "GET /~cscie12/discussion/ HTTP/1.0" 404 2vicfux.115.63.27 - - [14/Sep/1999:11:38:28 -0400] "GET /~cscie12/ HTTP/1.0" 200 7281 "http:/vicfux.115.63.27 - - [14/Sep/1999:11:38:56 -0400] "GET /~cscie12/images/dce2.gif HTTP/1.0" 2vicfux.115.63.27 - - [14/Sep/1999:11:38:58 -0400] "GET /~cscie12/images/dot.gif HTTP/1.0" 20vicfux.115.63.27 - - [14/Sep/1999:11:38:58 -0400] "GET /~cscie12/images/syllabus.gif HTTP/1.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

106 of 110 4/17/2007 4:12 PM

Web Server Logs: Two perspectives

Server AdministratorContent Provider

Page 54: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

107 of 110 4/17/2007 4:12 PM

Web Server Logs: What we would like to know and what we can know

What is the busiest time?1.How long do they stay?2.How long did it take to fulfill a request?3.How many requests were there for a specific resource?4.Where are the users coming from?5.What browsers are people using?6.What pages have they been to?7.How many were looking versus buying?8.What requests resulted in errors (status 404, etc)?9.Where do the users go when they leave the site?10.Do they come back?11.

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

108 of 110 4/17/2007 4:12 PM

Complicating Issues

HTTP is a stateless protocolLocal CacheProxy CacheProxy ServersShared Computers

Page 55: Hypertext Transfer Protocol - Harvard University · Hypertext Transfer Protocol  7 of 110 4/17/2007 4:12 PM Your Cookies Firefox Webdeveloper ...

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

109 of 110 4/17/2007 4:12 PM

Log Rotation

approximately 200 to 250 bytes per line (request)For example, 1,000,000 requests per day (12 requests per second)log grows at 2.8 kb per second238 Mb for 1,000,000 requestscompressed (gzip'ed) logs are 7 to 10% of original size!

Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html

110 of 110 4/17/2007 4:12 PM

Tools for Log Analysis

Analog http://www.analog.cx/ Stephen TurnerUNIX, Windows, MacOS, othersFree!!

Report Magic

WebTrends Log Analyzer http://www.webtrends.com/

Table of Contents | All Slides | Link List | CSCI E-12