Application layer protocols The World Wide Web. Architectural Overview Static Web Documents Dynamic...

Application layer protocols

The World Wide Web

The World Wide Web• Architectural Overview• Static Web Documents• Dynamic Web Documents• HTTP – The HyperText Transfer Protocol• Performance Ehnancements• The Wireless Web

WWW - Architectural Overview

• The World Wide Web (WWW) is a system/framework of interlinked hypertext documents spread out over millions of machines all over the Internet

• The Web was conceived in 1989 at CERN, the European center for nuclear research

• The system was designed with a goal "to link and access information of various kinds as a web of nodes in which the user can browse at will"


• The Web consists of a vast, collection of documents (Web pages - or just “pages”), each of it containing links to other pages anywhere in the world

• Users can follow a link by clicking on it, which then takes them to the desired page

• The idea of having such links (one page pointing to another), now called hypertext, was invented in MIT in 1945, long before the Internet was invented


• Pages are viewed with a program called a browser, that fetches the page requested, interprets the text and formatting commands on it, and displays it, properly formatted, on the screen

• A page starts with a title, contains some information, and ends with the e-mail address of the page's maintainer

• Strings of text that are links to other pages are called hyperlinks

• Hyperlinks are typically highlighted, by underlining, displaying them in a special color, or both

Architectural Overview

(a) A Web page (b) The page reached by clicking on Department of Animal Psychology.


• When a user is interested in some topic/issue on a current page, he/she can learn more about it by clicking on its (underlined) name

• The browser then fetches the page to which the name is linked and displays it

• The new page can be on the same machine as the first one or somewhere else - the user cannot tell


• The basic model of how the Web works is as follows:• A browser is displaying a Web page on the client

machine and when the user clicks on a line of text that is linked to a page on the abcd.com server, the browser follows the hyperlink by sending a message to the abcd.com server asking it for the page

• Next the page arrives and is displayed• If the user clicks on a hyperlink to a page on the

xyz.com server , the browser then sends a request to that machine for the page, and so on …

Architectural Overview

The parts of the Web model.

WWW – the client side

• Browser - a program that can display a Web page and detect a mouse click to items on the displayed page

• When an item is selected, the browser follows the hyperlink and fetches the page selected

• A hyperlink needs a way to refer to any other page on the Web and for this purpose is using URLs (Uniform Resource Locators) – like:

• http://en.wikipedia.org/wiki/World_Wide_Web


• A URL has three parts: • - the name of the protocol (http),• - the DNS name of the machine where the

page is located (en.wikipedia.org), and • - the path of the resource to be fetched or the

program to be run (/wiki/World_Wide_Web)• When a user clicks on a hyperlink, the browser

carries out a series of steps in order to fetch the page pointed to

WWW – the client side• The browser determines the URL (what was selected)• The browser asks DNS for the IP address of

en.wikipedia.org• DNS replies with 91.198.174.225• The browser makes a TCP connection to port 80 on

91.198.174.225• It then sends over a request asking for file

/wiki/World_Wide_Web• The en.wikipedia.org server sends the file

/wiki/World_Wide_Web• The TCP connection is released• The browser displays all the text in

/wiki/World_Wide_Web• The browser fetches and displays all images in this file


• To allow a browser to understand a Web page, it is written in a standardized language called HTML, which describes Web pages

• A browser is basically an HTML interpreter • Most browsers have buttons and features to

make it easier to navigate the Web (back, forward, start page, bookmark, etc)

• Web pages can also contain icons, line drawings, maps, and photographs, each of which can (optionally) be linked to another page

WWW – the client side• Not all pages contain HTML – there may be also

document in PDF, an icon in GIF, a video in MPEG, etc.

• Browsers have a problem when encounter a page they can’t interpret

• When a server returns a page, it also returns some additional information about the page – what is the MIME (Multipurpose Internet Mail Extensions) type of the page

• MIME is an Internet standard that extends the format of email to support


• Pages of type text/html are just displayed directly, as are pages in a few other built-in types

• If the MIME type is not one of the built-in ones, the browser consults its table of MIME types to tell it how to display the page

• Two options exist:– plug-ins – helper applications

WWW – the client side• Plug-in - a code module that the browser fetches

from a special directory on the disk and installs as an extension to itself – it is a dynamic code module designed to extend the capabilities of the browser by integrating a third party application program into the browser

• Plug-ins run inside the browser and have access to the current page, so they can modify its appearance

• After the plug-in has done its job (after user moves to another Web page), it is removed from the browser's memory

The Client Side

(a) A browser plug-in. (b) A helper application.

The Client Side

• Each browser has a set of procedures that all plug-ins must implement so the browser can call the plug-in

• The browser makes a set of its own procedures available to the plug-in, to provide services to plug-ins

• Before a plug-in can be used, it must be installed - the user goes to the plug-in's Web site and downloads an installation file

The Client Side

• The other option to extend a browser is to use a helper application - a complete program, running as a separate process

• The helper offers no interface to the browser and makes no use of browser services

• It just accepts the name of a scratch file where the content file has been stored, opens the file, and displays the contents

• Helpers are large programs that exist independently of the browser - Adobe's Acrobat Reader for displaying PDF files or Microsoft Word. Some programs (such as Acrobat) have a plug-in that invokes the helper itself

The Client Side• Many helper applications use the MIME type

application - many subtypes have been defined, like: application/pdf for PDF files and application/msword for Word files

• Browsers can be configured to handle a virtually unlimited number of document types with no changes to the browser

• Modern Web servers are often configured with hundreds of type/subtype combinations and new ones are often added every time a new program is installed

The Client Side• Browsers can also open local files, rather than

fetching them from remote Web servers• Local files do not have MIME types, so the browser

needs some way to determine which plug-in or helper to use for types other than its built-in types (as text/html and image/jpeg)

• For local files, helpers can be associated with a file extension as well as with a MIME type

• With the standard configuration, opening foo.pdf will open it in Acrobat and opening bar.doc will open it in Word

The Client Side• The helper application runs outside of the browser

window• An advantage of helper applications over plug-ins is

multitasking between a helper application and the browser window - if the browser is closed down, the helper application lives on

• Helper applications cannot display the contents of a file in the context of a Web page. If the file being read is a graphic, the helper application displays only the image, not the image embedded in the Web page

The Client Side• The ability to extend the browser with a large

number of new types is convenient but can also lead to trouble

• When a browser fetches a file with extension .exe - an executable program, hence - no helper

• To run the program? Security? • A Web page with pictures (of movie stars) can

be linked to a virus

The Client Side

• A single click on a picture then causes a program to be fetched and run on the user's machine (potentially hostile behavior)

• To prevent unwanted guests like this, browsers like Internet Explorer can be configured to be selective about running unknown programs automatically, but not all users understand how to manage the configuration

The server side

• When the user types in a URL or clicks on a line of hypertext, the browser parses the URL and interprets the part between http:// and the next slash as a DNS name to look up

• The browser establishes a TCP connection (with the IP address of the server) to port 80 on the server

• Then it sends a command containing the rest of the URL (the name of a file on that server)

• The server returns the file for the browser to display

The server side

• The server has a main loop, in which:1. accepts a TCP connection from a client (a

browser)2. gets the name of the file requested3. gets the file (from disk)4. returns the file to the client5. releases the TCP connection

• A problem - every request requires a disk access to get the file

The server side• A high-end SCSI disk has an average access

time of around 4 msec, which limits the server to at most 250 requests/sec, less if large files have to be read often - too low for big Web site

• Solution - maintain a cache in memory of the n most recently used files

• Before going to disk to get a file, the server checks the cache - If there, file can be served directly from memory (no disk access)

The server side• Effective caching requires a large amount of

main memory and some extra processing time to check the cache and manage its contents

• Still - the savings in time with caching are nearly always worth the overhead and expense

• The server can be multithreaded - design, where the server consists of a front-end module that accepts all incoming requests and k processing modules

The server side

• The k + 1 threads all belong to the same process so the processing modules all have access to the cache within the process' address space

• When a request comes in, the front end accepts it and builds a short record describing it, it then hands the record to one of the processing modules

• In another possible design, the front end is eliminated and each processing module tries to acquire its own requests, but a locking protocol is then required to prevent conflicts

The Server Side

A multithreaded Web server with a front end and processing modules

The Server Side

• The processing module first checks the cache (if the file is there) if YES - it updates the record to include a pointer to the file in the record, if NO the processing module starts a disk operation to read it into the cache

• When the file comes in from the disk, it is put in the cache and also sent back to the client – so modules can be actively working on other requests, while some wait for a disk operation (better performance having multiple disks)

The Server Side• Modern Web servers do more than just

accept file names and return files• The front end passes each incoming request

to the first available module, which then carries out some of the following steps:

1. Resolve the name of the Web page requested (some names are empty file names and need to be expanded to some default file name)

2. Authenticate the client (verifying the client's identity – a page may not be available to public)

The Server Side

3. Perform access control on the client (if there are restrictions in regard to identity and location)

4. Perform access control on the Web page (if there are access restrictions for this page)

5. Check the cache (is page there? – if not – step 6)6. Fetch the requested page from disk7. Determine the MIME type to include in the response8. Take care of miscellaneous odds and ends (ex -

building a user profile or gathering certain statistics)

The Server Side9. Return the reply (the result) to the client10. Make an entry in the server log for admin purposes

With too many requests come in each second, the CPU will not be able to handle the processing load, no matter how many disks are used in parallel

Solution - to add more nodes (computers), possibly with replicated disks to avoid having the disks become the next bottleneck

The server side – farm model

Server farm model - a front end still accepts incoming requests but sprays them over multiple CPUs rather than multiple threads to reduce the load on each computerThe individual machines may themselves be multithreaded and pipelined Problem with server farms - there is no shared cache - each processing node has its own memory—unless an expensive shared-memory multiprocessor is used

The Server Side

A server farm.

Server farm

• A server farm may have a front end keep track of where it sends each request and send subsequent requests for the same page to the same node

• So each node is a specialist in certain pages so that cache space is not wasted by having every file in every cache

• Another problem with server farms - client's TCP connection terminates at the front end, so the reply must go through the front end

Server farm

• Possible solution of front-end load - TCP handoff

• The TCP end point is passed to the processing node so it can reply directly to the client, shown as (3) in the next slide on the right

• This handoff is done in a way that is transparent to the client

The Server Side

(a) Normal request-reply message sequence.(b) Sequence when TCP handoff is used.

URLs—Uniform Resource Locators

• Web pages may contain pointers to other Web pages – need mechanisms for naming and locating pages

• Issues:1. What is the page called?2. Where is the page located?3. How can the page be accessed?

• If every page were somehow assigned a unique name, there would not be any ambiguity in identifying pages, still the problem would not be solved – how to find the address, what language?

URLs—Uniform Resource Locators• Each page is assigned a URL (Uniform Resource

Locator) that effectively serves as the page's worldwide name

• URLs have three parts: 1. the protocol (also known as the scheme – http), 2. the DNS name of the machine on which the page is

located (val.acad.bg) and 3. a local name uniquely indicating the specific page

(usually just a file name on the machine where it resides - path relative to the default Web directory )


• http://val.acad.bg/cworks/KMK_slides_current/11. Mobile phones-2013.ppt

• Many sites have built-in shortcuts for file names - often a null file name defaults to the organization's main home page

• When the file named is a directory, this implies a file named index.html. Finally, ~user/ might be mapped onto user's WWW directory, and then onto the file index.html in that directory

http://val.acad.bg/cworks/KMK_slides_current/11

http://val.acad.bg/cworks/KMK_slides_current/11


• See how hypertext works - to make a piece of text clickable, the page writer must provide two items of information: the clickable text to be displayed and the URL of the page to go to if the text is selected

• When the text is selected, the browser 1) looks up the host name using DNS, 2) gets the host's IP address, 3) establishes a TCP connection to the host - over that connection, it sends the file name using the specified protocol

• This URL scheme is open-ended - browsers use multiple protocols to get at different kinds of resources

URLs – Uniform Resource Locaters

Some common URLsGopher protocol - a TCP/IP application layer protocol designed for

distributing, searching, and retrieving documents over the Internet - effective predecessor of the World Wide Web (no images, only text)

URLs – Uniform Resource Locaters

• URLs have been designed to not only allow users to navigate the Web, but to deal with FTP, news, Gopher, e-mail, and telnet as well, making all the specialized user interface programs for those other services unnecessary and thus integrating nearly all Internet access into a single program, the Web browser

• A URL points to one specific host. For pages that are heavily referenced, it is desirable to have multiple copies far apart, to reduce the network traffic

Statelessness and Cookies

• The Web is basically stateless - no concept of a login session

• The browser sends a request to a server and gets back a file. Then the server forgets that it has ever seen that particular client

• At first, when the Web was just used for retrieving publicly available documents, this model was perfectly adequate. But later, problems appeared

Statelessness and Cookies• Some Web sites require clients to register (and

possibly pay money) to use them – so - how servers can distinguish between requests from registered users and everyone else?

• If a user wanders around an electronic store, tossing items into her shopping cart from time to time, how does the server keep track of the contents of the cart?

• Users can set up a detailed initial page for a portal like Yahoo with only the information they want (e.g., their stocks and their favorite sports teams), but how can the server display the correct page if it does not know who the user is?


• observing their IP addresses? No …- many users work on shared computers, the IP address merely identifies the computer, not the user

• many ISPs use NAT, so all outgoing packets from all users bear the same IP address. From the server's point of view, all the ISP's thousands of customers use the same IP address

Cookies

• Netscape devised a much-criticized technique called cookies

• The name derives from programmer slang - a program calls a procedure and gets something back that it may need to present later to get some work done

• Cookie is a small piece of data sent from a website and stored in a user's web browser while a user is browsing a website

• Cookies were later formalized in RFC 2109

Cookies

• When a client requests a Web page, the server can supply additional information along with the requested page - this may include a cookie, which is a small (at most 4 KB) file (or string)

• Browsers store offered cookies in a cookie directory on the client's hard disk unless the user has disabled cookies

• Cookies are not executable programs

Cookies

• In principle, a cookie could contain a virus, but since cookies are treated as data, there is no official way for the virus to actually run and do damage

• It is always possible for some hacker to exploit a browser bug to cause activation

• A cookie may contain up to five fields (see next slide) - the domain tells where the cookie came from

Cookies

• Browsers are supposed to check that servers are not lying about their domain

• Each domain may store no more than 20 cookies per client

• The Path is a path in the server's directory structure that identifies which parts of the server's file tree may use the cookie. It is often /, which means the whole tree.


Some examples of cookies.

Cookies

• The Content field takes the form name = value - name and value can be anything the server wants cookie's content is stored in this field.

• The Expires field specifies when the cookie expires - if absent, the browser discards the cookie when it exits (nonpersistent cookie). With a time and date the cookie is kept until it expires (times are given in Greenwich Mean Time). To remove a cookie from a client's hard disk, a server just sends it again, but with an expiration time in the past.

• The Secure field can be set to indicate that the browser may only return the cookie to a secure server. This feature is used for e-commerce, banking, and other secure applications

Use of cookies

• Just before a browser sends a request for a page to some Web site, it checks its cookie directory to see if any cookies there were placed by the domain the request is going to

• If there are cookies – they are included in the request message

• The server gets them and interprets them

Use of cookies

• Having customer’s ID, a browser can build an appropriate Web page to display (ex – poker hand, horse race, or page with stock prices, or football results, etc), or number of items selected in a cart (for shopping), in such case the cookie is returned on every new page request and when the customer decides to bye selected stuff, the cookie contains the full list of purchases

Use of cookies

• Cookies can also be used for the server's benefit - to keep track of number unique visitors and number of pages visited

• hackers have exploited numerous bugs in the browsers to capture cookies not intended for them like credit card numbers (sometimes put in cookies)

• Cookies could be used also to secretly collect information about users' Web browsing habits

Use of cookies

• Some users configure their browsers to reject all cookies - problems with legitimate Web sites that use cookies

• Solution - users sometimes install cookie-eating software - special programs that inspect each incoming cookie upon arrival and accept or discard it depending on choices the user has given it (e.g., about which Web sites can be trusted)

• So the user has better control over which cookies are accepted and which are rejected. Browsers like Mozilla have elaborate user-controls over cookies built in

Static Web Documents

• Basis of the Web - transferring Web pages from server to client

• In the simplest form, Web pages are static – i.e. files on some server waiting to be retrieved

• Web pages are written in a language called HTML (HyperText Markup Language) - it allows users to produce Web pages that include text, graphics, and pointers to other Web pages

The HyperText Markup Language

• Markup languages thus contain explicit commands for formatting

• In HTML, <b> means start boldface mode, and </b> means leave boldface mode

• Advantage of a markup language - writing a browser for it is straightforward: the browser simply has to understand the markup commands

A Web page• A Web page consists of a head and a body, each enclosed by

<html> and </html> tags (most browsers do not complain if these tags are missing)

• The head is bracketed by the <head> and </head> tags and the body is bracketed by the <body> and </body> tags

• The strings inside the tags are called directives• Most HTML tags have this format, that is they use,

<something> to mark the beginning of something and </something> to mark its end

• Most browsers have a menu item VIEW SOURCE or something like that. Selecting this item displays the current page's HTML source, instead of its formatted output

HTML – HyperText Markup Language

(a) The HTML for a sample Web page. (b) The formatted page.

(b)

HTML

A selection of common HTML tags. some can have additional parameters.

HTML

• HTML allows images to be included in-line on a Web page

• The <img> tag specifies that an image is to be displayed at the current position in the page - can have several parameters and the src parameter gives the URL of the image

• An HTML table consists of one or more rows, each consisting of one or more cells

Forms

(a) An HTML table. (b) A possible rendition

of this table.

Forms in HTML

• Options for customers to fill registration cards, to type in search words

• Forms – contain boxes or buttons allowing users to fill in info or make choices and then send the info to the page’s owner

• Forms use the <input> tag for this purpose - it has a variety of parameters for determining the size, nature, and usage of the box displayed

Forms

(a) The HTML for an order form.

(b) The formatted page.

(b)

Forms

• Three kinds of input boxes are used in the form above

• - type “text” – user must type in a string • - type “radio ” – when a choice is made among

2 or more alternatives (see radio value=“…” (<select> tags as drop-down menu could also be provided)

• - type “checkbox” – can be on or off – only 1 of the set must be chosen

Forms

• When the user clicks the submit button, the browser packages the collected information into a single long line and sends it back to the server for processing\

• The & is used to separate fields and + is used to represent space

• For the example above, the line might look like the slide bellow (broken into three lines here because the page is not wide enough)

Forms

A possible response from the browser to the server with information filled in by the user.

The string would be sent back to the server as one line, not three, if a checkbox is not selected, it is omitted from the string. It is up to the server to make sense of this string

XML and XSL

• HTML, with or without forms, does not provide any structure to Web pages

• There is a need for structured Web pages and separating the content from the formatting

• Two new languages have been developed to allow Web pages to be structured for automated processing - XML (eXtensible Markup Language) and XSL (eXtensible Style Language)

XML

• In the example bellow, an XML document defines a structure “book_list” – a list of books, each of which has 3 fields (title, author and year of publ)

• Structures may have repeated fields, optional fields and alternative fields

• Fields are permitted to be subdivided further• XML defines ONLY the book list with 3 books,

it does not say how to display the Web page

XML and XSL

A simple Web page in XML.

XSL

• To provide the formatting information, the XSL definition is needed - in a second file, book_list.xsl

• This file is a style sheet that tells how to display the page

• A sample XSL file for formatting is given bellow - after some necessary declarations, including the URL of the XSL standard, the file contains tags starting with <html> and <body> - they define the start of the Web page

XML and XSL

A style sheet in XSL.

Dynamic Web Documents

• In recent years, more and more content has become dynamic, that is, generated on demand, rather than stored on disk

• When a user fills in a form and clicks on the submit button, a message is sent to the server indicating that it contains the contents of a form, along with the fields the user filled in

• the message is given to a program or script to process


• processing involves using the user-supplied information to look up a record in a database on the server's disk and generate a custom HTML page to send back to the client

• The HTML page might display a form containing the list of items in the cart and the user's last-known shipping address along with a request to verify the information and to specify the method of payment


Steps in processing the information from an HTML form.

Dynamic Web Documents• Forms and interactive Web pages are managed by a

system called the CGI (Common Gateway Interface)• CGI allow Web servers to talk to back-end programs

and scripts that can accept input (e.g., from forms) and generate HTML pages in response

• These are scripts written in the Perl scripting language (Perl scripts are easier and faster to write than programs) and are in a directory called cgi-bin, which is visible in the URL

• Sometimes another scripting language, Python, is used instead of Perl


• A script can be invoked with some parameters (for ex. – filled and submitted reg. form)

• The script then parses the parameters, makes an entry in the database for the new customer, and sends back an HTML page providing a registration number and a telephone number for the help desk

• Another option is to use PHP (PHP: Hypertext Preprocessor) – embedded scripting language


• The server has to understand PHP (just as a browser has to understand XML to interpret Web pages written in XML) - typically, servers expect Web pages containing PHP to have file extension php rather than html or htm

• In the next slide, the php script generates a Web page telling what it knows about the browser invoking it - this information is in the variable HTTP_USER_AGENT

Dynamic Web DocumentsA sample HTML page with embedded PHP.


• PHP is good at handling forms and is simpler than using a CGI script

• On the slide, in a) is a normal HTML page with a form in it - displays two text boxes, one with a request for a name and one with a request for an age – after the two boxes are filled the server parses the string sent back, putting the name in the name variable and the age in the age variable

• It then processes the action.php file, in (b) as a reply by executing the PHP commands

• Finally, the HTML file is sent back - c)


(a) A Web page containing a form. (b) A PHP script for handling the output of the form. (c) Output from the PHP script when the inputs are "Barbara" and 24 respectively.


• PHP is open source code and freely available - designed specifically to work well with Apache, which is also open

• There are two other techniques:• - JSP (JavaServer Pages), which is similar to PHP, except

that the dynamic part is written in the Java programming language

• - ASP (Active Server Pages) - Microsoft's version of PHP and JavaServer Pages, uses Microsoft's proprietary scripting language, Visual Basic Script, for generating the dynamic content

Application layer protocols The World Wide Web. Architectural Overview Static Web Documents Dynamic...

Documents

Transcript of Application layer protocols The World Wide Web. Architectural Overview Static Web Documents Dynamic...