Web Engineering Basic Technologies: Protocols and Web
Basic Web Technologies
• HTTP and HTML
• Web Servers
• Proxy Servers
• Content Delivery Networks
Where we will be later in the course ...
• Supporting a range of client devices
World-Wide Web
• Series of Protocols
• URL/URI unique identification of resources
• URI examples
  • http://www.inf.ethz.ch/education
  • mailto: [email protected]
  • ftp://ftp.inf.ethz.ch/ed/report.txt
  • tel:+41-44-6321234
  • ....
• URL is a URI that provides information about how to locate a resource
• HTTP Hypertext Transfer Protocol
• HTML Hypertext Mark-Up Language
• Web Browsers
  • Internet Explorer, Mozilla Firefox …..
HTML
<html>
<head>
<title>Michael's Personal Home Page</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<h1> Michael </h1>
<img src="michael.bmp" align="right"/>
<h2>Work</h2>
Michael works at <a href="http://www.ethz.ch"> ETH Zurich </a>
<h2>Personal</h2>
<address> CNB E106 <br/> Zurich <br/> Switzerland </address>
</body>
</html>
HTML …
• Based on Hypertext Style of Navigation
• Simple and Easy to Publish on Web
• Structure, Content or Presentation?
  • wide use of table elements for formatting layout
  • address elements describe the content
• Problems of
  • link maintenance
  • document interpretation
• Flexible
  • unknown tags ignored by browsers => easy to extend with customised tags
HTML ......
• Document meta data can be included in header
<head>
<title>Michael's Personal Home Page</title>
<meta name="keywords" content="web, databases, java">
<meta name="authors" content="michael">
<meta http-equiv="expires" content="25 Mar 2006">
<meta http-equiv="Refresh" content="15">
</head>
• Keywords used by search engines
HTML5 : The Next Generation of HTML
• New standard for HTML, XHTML and HTML DOM
• Work in progress, most browsers now have some support
• Cooperation between W3C and Web Hypertext Application Technology Working Group (WHATWG)
• One goal was to have a clearer separation of content and presentation
  • HTML5 - content
  • CSS3 - layout as well as look and feel
• Second goal to make it easier to process documents and their content
• Third goal to reduce the need for plug-ins
Key Features of HTML5
• Tags to support a stronger document model to make it easier to identify logical elements of documents
  • section, article, aside, details, header, footer …
• Support for other media types
  • video, audio …
• Take over some of the things normally handled by JavaScript such as form field validation
  • form field input types such as email, url, dates, numbers ….
  • far richer set of event attributes
• Support for client-side storage
  • replacing cookies
HTTP 1.0
• hypertext transfer protocol
• one object transferred per connection
HTTP request
GET /www/globis.html HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent: Lynx/2.4 libwww/2.14
HTTP result
HTTP/1.0 200 OK
Date: Thursday, 23-April-98 09:00:05 GMT
Server: NCSA/1.4.2
MIME-version: 1.0
Content-type: text/html
Content-length: 3500

<html>……</html>
Note the blank line between the header and the body of the message.
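The role of the blank line can be illustrated with a small Python sketch. It is not part of the original slides; the function name `split_http_message` and the shortened sample response are assumptions made for illustration only.

```python
def split_http_message(raw: str):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # Normalise line endings so the sketch accepts both CRLF and LF.
    # (A real parser would leave the body bytes untouched.)
    normalized = raw.replace("\r\n", "\n")
    # The first blank line separates the header block from the body.
    head, _, body = normalized.partition("\n\n")
    lines = head.split("\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical, shortened response modelled on the slide.
raw_response = (
    "HTTP/1.0 200 OK\r\n"
    "Server: NCSA/1.4.2\r\n"
    "Content-type: text/html\r\n"
    "Content-length: 13\r\n"
    "\r\n"
    "<html></html>"
)
status, headers, body = split_http_message(raw_response)
print(status)
print(headers["content-type"])
print(body)
```

Everything before the blank line is treated as header fields, everything after it as the document itself.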
HTML Form
• Submitting the form below with user "moira" and password "fred" sends the request: GET /cgi-bin/globis.pl?user=moira&pass=fred
<html>
...
<form action="/cgi-bin/globis.pl" method="GET">
Name: <input type="text" name="user" size=10>
<br/>
Password: <input type="password" name="pass" size=6>
<br/>
<input type="submit" value="ENTER">
<input type="reset" value="CLEAR">
</form>
</html>
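How a browser turns the form fields into a GET request can be sketched in Python as follows. This is an illustration under assumptions, not browser code: `build_get_request_line` is a hypothetical helper, and only the standard-library function `urllib.parse.urlencode` is relied on.

```python
from urllib.parse import urlencode


def build_get_request_line(action: str, fields: dict) -> str:
    """Encode form fields as a query string and append them to the action URL."""
    query = urlencode(fields)  # joins name=value pairs with "&" and escapes them
    return f"GET {action}?{query} HTTP/1.0"


request_line = build_get_request_line(
    "/cgi-bin/globis.pl", {"user": "moira", "pass": "fred"}
)
print(request_line)
```

Because the password ends up in the URL, GET is a poor choice for login forms in practice; the slide uses it only to make the encoding visible.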
Introducing Dynamic Content
• need to introduce some mechanism to execute programs on the server side and dynamically generate HTML documents
CGI Programming
• Common Gateway Interface
• Executes Programs on Server Side
CGI Result
CGI Programs
• Can be written in any language
• Desirable Features
  • ease of text manipulation
  • ability to access environment variables
  • ability to interface with other services
• Commonly Used Languages
  • Perl, C/C++, Tcl, Java
Accessing Form Data
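#!/usr/local/bin/perl
# The raw form data of a GET request arrives in QUERY_STRING
$query_string = $ENV{'QUERY_STRING'};
…..
# Split a single "name=value" pair into field name and value
($field_name, $param) = split(/=/, $query_string);
…..
# Send only one header block per response: either the Location redirect
# below, or a Content-type header (print "Content-type: text/html", "\n\n";)
# when the script returns HTML directly
if ($user eq "moira") {
    print "Location: /globis.html", "\n\n";
} else ……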
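The same idea can be sketched in Python for comparison. This is a minimal illustration, not the original script: the function name `handle_request` and the HTML returned for unknown users are assumptions, since the slide elides that branch.

```python
from urllib.parse import parse_qs


def handle_request(query_string: str) -> str:
    """Return the CGI output (headers plus optional body) for a query string."""
    # parse_qs maps each field name to a list of values; keep the first one.
    fields = {name: values[0] for name, values in parse_qs(query_string).items()}
    if fields.get("user") == "moira":
        # Redirect: a Location header followed by a blank line, no body.
        return "Location: /globis.html\n\n"
    # Hypothetical fallback page; the original slide does not show this branch.
    return "Content-type: text/html\n\n<html><body>Unknown user</body></html>"


print(handle_request("user=moira&pass=fred"))
print(handle_request("user=someone"))
```

In a real CGI program the query string would be read from the `QUERY_STRING` environment variable and the result written to standard output.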
Unix Environment Variables
SERVER_NAME
REMOTE_HOST
REMOTE_ADDR
REMOTE_USER
REQUEST_METHOD
QUERY_STRING
.......
GET and POST
• Two methods for sending Form Data
• GET
  • appends form data to URL
GET /cgi-bin/globis.pl?name=globis HTTP/1.0
• POST
  • form data read from standard input
POST /cgi-bin/globis.pl HTTP/1.0
....
user=moira&pass=fred
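The difference between the two methods, seen from the server-side program, can be sketched as follows. This is an illustrative Python sketch: `read_form_data` is a hypothetical helper, the environment is passed in as a plain dictionary, and `CONTENT_LENGTH` is a standard CGI variable that does not appear on the slides. A real CGI program reads bytes from standard input; the sketch uses a text stream to stay short.

```python
import io
from urllib.parse import parse_qs


def read_form_data(environ: dict, stdin) -> dict:
    """Return the form fields regardless of whether GET or POST was used."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # POST: the encoded fields are the request body, delivered on stdin.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the encoded fields follow the "?" in the URL.
        raw = environ.get("QUERY_STRING", "")
    return {name: values[0] for name, values in parse_qs(raw).items()}


get_fields = read_form_data(
    {"REQUEST_METHOD": "GET", "QUERY_STRING": "name=globis"}, io.StringIO("")
)
post_fields = read_form_data(
    {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": "20"},
    io.StringIO("user=moira&pass=fred"),
)
print(get_fields)
print(post_fields)
```

The encoding of the fields is the same in both cases; only the place where the program finds them differs.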
Server Side Includes
• Directives included in HTML Documents
  • execute programs
  • output data such as environment variables
Server Side Includes ...
<html>
<head><title>Globis</title></head>
<body>
<h1>Welcome to <!--#echo var="SERVER_NAME" --></h1>
......
<address>Moira(<!--#echo var="DATE_LOCAL" -->)</address>
</body>
</html>
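What the server does with an echo directive can be approximated in a few lines of Python. This is a simplified sketch, not how any particular web server implements it: the function `expand_echo`, the host name `www.example.org` and the placeholder used for unknown variables are all assumptions.

```python
import re

# Matches <!--#echo var="NAME" --> with or without a space before "-->".
ECHO_DIRECTIVE = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')


def expand_echo(html: str, variables: dict) -> str:
    """Replace each echo directive with the value of the named variable."""
    def substitute(match):
        # Unknown variables get a visible placeholder (a choice made for this sketch).
        return variables.get(match.group(1), "(none)")
    return ECHO_DIRECTIVE.sub(substitute, html)


page = '<h1>Welcome to <!--#echo var="SERVER_NAME" --></h1>'
expanded = expand_echo(page, {"SERVER_NAME": "www.example.org"})
print(expanded)
```

The client never sees the directive, only the substituted text, which is why the server has to parse every such document before sending it.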
Server Side Includes ......
Configure Server to say
• documents which should be parsed
  • AddType text/x-server-parsed-html .shtml
  • AddType text/x-server-parsed-html .html
• directives supported
  • Includes - display environment variables etc.
  • Exec - execute External Programs
Where to Cache?
• Caching can occur at many different levels and locations in web architectures
• Four fundamental ways for implementing a caching mechanism
  • browser caching
  • proxy caching
  • reverse proxy caching/server accelerators
  • content delivery networks (CDN)
• We will go on to look at each of these in turn
Browser Caching
• Every browser contains cache of HTML docs & multimedia files
• Browser cache is a directory on the user’s hard disk
• Advantages
  • simple
  • universal
• Disadvantages
  • applies only to static resources
  • can be by-passed by content provider who can add suitable HTTP headers to response or directives to HTML page forcing browser not to cache
Proxy Caching
• A proxy cache lies between a community of users (e.g. D-INFK, ETHZ) and the public internet
• Works on same basic principles as browser cache, but on much larger scale (may be hundreds or thousands of users)
• Proxy caches sometimes implemented together with firewalls which control flow of requests/responses between intranet and internet
• Client requests have to somehow be routed to proxy server
  • can be done through browser’s proxy setting
  • interception proxies have requests redirected to them by underlying network
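The basic behaviour of a proxy cache, serving a stored copy on a hit and contacting the origin server on a miss, can be sketched in Python. This is a deliberately minimal illustration: the class `ProxyCache`, the callback `fetch_from_origin` and the fake origin are hypothetical, and freshness checks are left out.

```python
class ProxyCache:
    """Keeps copies of fetched documents and serves them to later requests."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # function: url -> document body
        self._store = {}                 # url -> cached document body
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        if url in self._store:
            # Cache hit: answer without contacting the origin server.
            self.hits += 1
            return self._store[url]
        # Cache miss: fetch from the origin server and keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


origin_requests = []

def fake_origin(url: str) -> str:
    """Stand-in for the origin server; records every request it receives."""
    origin_requests.append(url)
    return f"<html>content of {url}</html>"

proxy = ProxyCache(fake_origin)
first = proxy.get("http://some.host/path/doc.html")   # miss
second = proxy.get("http://some.host/path/doc.html")  # hit
print(proxy.misses, proxy.hits, len(origin_requests))
```

Two users asking for the same document cause only one request to the origin server, which is where the savings in bandwidth and server load come from.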
proxy server: cache miss
• http://some.host/path/doc.html
• http_proxy=http://www_proxy.my.domain
proxy server: cache hit
HTTP/1.0
• GET URL
• HEAD URL
HTTP/1.1 200 OK
Date: Wed, 10 May 2000 09:33:08 GMT
Server: Apache/1.3.12 (Win32)
Last-Modified: Mon, 01 May 2000 13:37:40 GMT
Content-type: text/html
Content-length: 907
• GET URL
If-Modified-Since: Sunday, 05 Mar 2000 13:00:00 GMT
HEAD is similar to GET but only asks for the response header rather than the content
Browser and proxy caches
• All caches have a set of rules used to determine what can be cached and when to use cached resources
  • some rules set in protocols
  • some set by cache software (e.g. browser)
  • some set by cache administrator
• Many of these rules based on information in the HTTP request/response header
  • added by server/browser
  • explicitly generated by content provider
  • may be based on type of request or type of content
  • example: HTTP header containing Cache-Control: no-cache
General caching rules
• If response’s header says not to keep it, it won’t be cached
• If no validator (e.g. a Last-Modified header) is present on a response, it will be considered uncacheable
• If request is authenticated or secure, it won’t be cached
• A cached object is considered fresh (i.e. able to be sent to client without checking with origin server) if
  • it has an expiry time or other age-controlling header set and is still fresh
  • if object already seen and browser cache set to check once a session
  • if proxy cache has seen object recently and long time since modified
• If a representation is stale, the origin server will be asked to validate it
HTTP header information for caching
• Example
HTTP validators and validation
• Validation used by servers and caches to communicate when an object has changed
• Most common validator is Last-Modified time
  • if cache has object with last-modified time t, generate If-Modified-Since t request to server to check if object still current
• HTTP 1.1 introduced ETags as another kind of validator
  • every time object changes, server generates a unique identifier ETag which is included in HTTP response header of object request
  • server controls how ETags generated
• Most modern web servers generate both ETags and Last-Modified validators automatically for static content
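The server's side of validation can be sketched in Python. This is a simplified illustration, not a complete implementation of the HTTP rules: the function `validate` is hypothetical, `If-None-Match` is the HTTP/1.1 request header that carries a cached ETag (it is not named on the slides), and the dates are assumed to be in the standard `Sun, 05 Mar 2000 13:00:00 GMT` form.

```python
from email.utils import parsedate_to_datetime


def validate(request_headers: dict, etag: str, last_modified: str) -> int:
    """Return 304 if the cache's copy is still current, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # ETag validation: an identical identifier means an unchanged object.
        return 304 if if_none_match == etag else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        # Last-Modified validation: compare the two timestamps.
        changed = (
            parsedate_to_datetime(last_modified)
            > parsedate_to_datetime(if_modified_since)
        )
        return 200 if changed else 304
    # No validator supplied: send the full object.
    return 200


current_etag = '"v2"'  # hypothetical identifier chosen by the server
modified = "Mon, 01 May 2000 13:37:40 GMT"

stale = validate({"If-Modified-Since": "Sun, 05 Mar 2000 13:00:00 GMT"}, current_etag, modified)
current = validate({"If-Modified-Since": "Wed, 10 May 2000 09:33:08 GMT"}, current_etag, modified)
same_tag = validate({"If-None-Match": '"v2"'}, current_etag, modified)
print(stale, current, same_tag)
```

A 304 response carries no body, so a successful validation costs one round trip but saves retransmitting the object.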
HTTP cache-control
• max-age=[seconds]
max amount of time page considered fresh; relative to time of request
• s-maxage=[seconds]
similar to max-age, except only applies to shared caches (e.g. proxies)
• public
marks authenticated responses as cacheable; normally if HTTP authentication required, responses uncacheable
• no-cache
forces cache to submit request to original server every time for validation before releasing cached copy
• no-store
instructs cache not to keep a copy under any circumstances
• must-revalidate
instructs cache to obey any freshness information given about an object; counteracts some conditions in which cache may serve stale representations
• proxy-revalidate
similar to must-revalidate, but only applies to proxy caches
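How a cache might interpret some of these directives can be sketched in Python. This is an illustrative reading of the list above under simplifying assumptions: the function names are hypothetical, only `no-store`, `no-cache`, `max-age` and `s-maxage` are handled, and a response without age information is treated as needing validation, whereas real caches may apply heuristics.

```python
def parse_cache_control(header: str) -> dict:
    """Turn 'public, max-age=3600' into {'public': True, 'max-age': '3600'}."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.lower()] = value.strip('"') if sep else True
    return directives


def freshness_lifetime(directives: dict, shared_cache: bool):
    """Seconds the copy may be served without validation; None = do not store."""
    if "no-store" in directives:
        return None
    if "no-cache" in directives:
        return 0  # may be stored, but must be validated before every use
    if shared_cache and "s-maxage" in directives:
        return int(directives["s-maxage"])  # applies to shared caches only
    if "max-age" in directives:
        return int(directives["max-age"])
    return 0


parsed = parse_cache_control("public, max-age=3600, s-maxage=600")
print(freshness_lifetime(parsed, shared_cache=True))
print(freshness_lifetime(parsed, shared_cache=False))
print(freshness_lifetime(parse_cache_control("no-store"), shared_cache=True))
```

With this header a proxy would revalidate after ten minutes while a browser could reuse its copy for an hour.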
What doesn’t work
• HTML meta tags example
<meta http-equiv="expires" content="Thu, 26 May 2005 10:50:00 GMT">
<meta http-equiv="pragma" content="no-cache">
  • easy to use, but are not very effective
  • HTML not usually read by proxy servers
  • few browsers honour such specifications
• Pragma HTTP headers
  • can include in HTTP header
Pragma: no-cache
  • HTTP specification does not specify how these should be handled and many browsers ignore it
Problems of proxy servers
• Connections to servers still required
• Still a high server load
• Servers lose control over their documents
  • authorisation
  • billing
  • access statistics
Prefetching caches
• Request objects from the server without an explicit request
• Based on
  • access patterns
  • object analysis (HTML documents, ...)
  • explicit subscriptions
• Reduces latency
• If level of prefetching too high then may pay severe penalties in terms of
  • increased network traffic
  • server load
Proxy servers
• advantages
  • reduce latency, network bandwidth and server load
  • opportunity to analyse an organisation's usage patterns
  • transparent to clients and servers
• disadvantages
  • additional resources
  • single point of failure
  • chance that users get stale data from the cache
Reverse proxy caching
• Reverse proxy caches are also intermediaries, but instead of being deployed by network administrators are deployed by the webmasters themselves (i.e. server side)
• Improve web site’s
  • performance
  • reliability
  • scalability
• Typically some form of load balancer used to make one or more gateway caches look like the origin server to clients
• Sometimes known as “gateway caches”, “surrogate caches”, or “server accelerators”
Content Delivery Networks (CDNs)
• A content delivery network distributes gateway caches throughout the Internet (or part of it) and sells caching to interested web sites
  • Akamai (http://www.akamai.com)
• Original idea:
  • when a client requests a page from the origin server, the server returns a page with rewritten links that point to a node of the CDN so that further client requests are handled by the CDN
• CDN serves requests using multiple cache nodes, selecting the optimal copy of the page given the geographical location of the user and the real-time network traffic conditions
Content Delivery Networks (CDNs) ...
• CDNs now perform dynamic request routing using the Internet's Domain Name System (DNS)
• DNS is a distributed directory mapping fully qualified domain names (FQDN) to IP addresses
• To determine an FQDN's address, a DNS client sends a request to its local DNS server which then queries a set of authoritative servers
• When local DNS receives a response, it sends address to DNS client and saves it in cache
• Each DNS record has a time-to-live (TTL) field that tells DNS server how long it may cache result
• Normally the association of FQDN to IP address is static
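The caching behaviour of a local DNS server can be sketched in Python. This is a minimal illustration under assumptions: the class `DnsCache` is hypothetical, the host name and address are documentation examples, and the current time is passed in explicitly (in seconds) so that the behaviour is easy to follow.

```python
class DnsCache:
    """Remembers FQDN-to-address answers until their TTL has run out."""

    def __init__(self):
        self._entries = {}  # fqdn -> (address, expiry time in seconds)

    def store(self, fqdn: str, address: str, ttl: int, now: float) -> None:
        self._entries[fqdn] = (address, now + ttl)

    def lookup(self, fqdn: str, now: float):
        entry = self._entries.get(fqdn)
        if entry is None:
            return None  # never seen: authoritative servers must be queried
        address, expires_at = entry
        if now >= expires_at:
            # TTL expired: discard the entry and force a fresh query.
            del self._entries[fqdn]
            return None
        return address


cache = DnsCache()
cache.store("www.example.org", "192.0.2.10", ttl=300, now=0)
print(cache.lookup("www.example.org", now=100))  # within the TTL
print(cache.lookup("www.example.org", now=400))  # after the TTL
```

The longer the TTL, the fewer queries reach the authoritative servers, but the longer a changed address takes to become visible.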
Content Delivery Networks (CDNs) ......
• CDNs use modified DNS servers for CDN server selection
• Results of a DNS query to one of these servers may vary depending on source of request and network condition
• To enable fast reaction to dynamic resource changes, the answer returned by the CDN's DNS server has a small TTL
• This approach is largely transparent to client and works for any web content
• Issues
  • usually assumed client close to their local DNS servers
  • single request from a local DNS server can represent differing number of web clients (hidden load factor)
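The server selection performed by a CDN's DNS server can be sketched in Python. This is a toy model, not the algorithm of any real CDN: the function `select_cdn_node`, the region labels, the load figures and the 20-second TTL are all assumptions chosen for illustration.

```python
def select_cdn_node(resolver_region: str, nodes: list) -> dict:
    """Pick a cache node for a DNS query and return the answer with a small TTL."""
    # Prefer nodes in the region of the requesting local DNS server;
    # fall back to all nodes if the region has none.
    nearby = [node for node in nodes if node["region"] == resolver_region]
    candidates = nearby or nodes
    # Among the candidates, choose the node with the lowest current load.
    best = min(candidates, key=lambda node: node["load"])
    # A small TTL lets the CDN redirect clients quickly when conditions change.
    return {"address": best["address"], "ttl": 20}


# Hypothetical cache nodes with documentation-range addresses.
nodes = [
    {"address": "192.0.2.1", "region": "eu", "load": 0.7},
    {"address": "192.0.2.2", "region": "eu", "load": 0.3},
    {"address": "198.51.100.1", "region": "us", "load": 0.1},
]
print(select_cdn_node("eu", nodes))
print(select_cdn_node("asia", nodes))
```

The sketch only sees the region of the local DNS server, not of the web client itself, which is exactly the first issue listed above.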