se3316a_2013_02_intro_html

61
7/18/2019 se3316a_2013_02_intro_html http://slidepdf.com/reader/full/se3316a201302introhtml 1/61 SE 3316 © Jagath Samarabandu 1 SE 3316a - WEB TECHNOLOGIES Chapter 1 HTML and Web Pages  [email protected] TEB 351 519-661-2111 x80058 Dr. Jagath Samarabandu Slides based on those from Møller et. al.

description

intro_html

Transcript of se3316a_2013_02_intro_html

Page 1: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 1/61

SE 3316 © Jagath Samarabandu 1

SE 3316a - WEB TECHNOLOGIES

Chapter 1

HTML and Web Pages

 [email protected]

TEB 351

519-661-2111 x80058

Dr. Jagath Samarabandu

Slides based on those from Møller et. al.

Page 2: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 2/61

SE 3316 © Jagath Samarabandu 2

Objectives The history of HTML

URLs and related schemes Survivor's guides to HTML and CSS

Limitations of HTML Unicode

The World Wide Web Consortium (W3C)

Page 3: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 3/61

SE 3316 © Jagath Samarabandu 3

Hypertext Collections of document connected by

hyperlinks Paul Otlet, philosophical treatise (1934)

Vannevar Bush, hypothetical Memex system(1945)

Ted Nelson introduced hypertext (1968)

Hypermedia generalizes hypertext beyondtext

Page 4: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 4/61

SE 3316 © Jagath Samarabandu 4

Markup Languages

Notation for adding formal structure to text

Charles Goldfarb, the INLINE system (1970) Standard Generalized Markup Language,

SGML (1986) DTD, element, attribute, tag, entity:

Page 5: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 5/61

SE 3316 © Jagath Samarabandu 5

SGML Example

<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)><!ATTLIST greeting style (big|small) "small"><!ENTITY hi "Hello">

]><greeting style="big"> &hi; world! </greeting>

Page 6: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 6/61

SE 3316 © Jagath Samarabandu 6

The Origins of the WWW

WWW was invented by Tim Berners-Lee at

CERN (1989) Hypertext across the Internet (replacing FTP)

Three constituents: HTML + URL + HTTP HTML is an SGML language for hypertext

URL is a notation for locating files on servers

HTTP is a high-level protocol for file transfers

Page 7: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 7/61

SE 3316 © Jagath Samarabandu 7

The Design of HTML

Simple, purist design principles

HTML describes the logical structure of adocument

Browsers are free to interpret tags differently HTML is a lightweight file format

Size of file containing just ”Hello World!”:− MS Word: 11,274 bytes

− PDF: 4915 bytes

− HTML: 28 bytes

Page 8: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 8/61

SE 3316 © Jagath Samarabandu 8

The History of HTML

1992: HTML 1.0, Tim Berners-Lee original proposal

1994: HTML 2.0, standard with best features 1996: Competing Netscape and Explorer features

1996: HTML 3.2, the Browser Wars end

1997: HTML 4.0, stylesheets are introduced 1999: HTML 4.01, we have a winner!

2000: XHTML 1.0, an XML version of HTML 4.01

2001: XHTML 1.1, modularization 2002: XHTML 2.0, simplified (abandoned in 2009)

2008: XHTML5/HTML5 initiative (current draft Aug. 6, 2013)

Page 9: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 9/61

SE 3316 © Jagath Samarabandu 9

Uniform Resource Locator - URL

Consists of a scheme (http), a server

(example.com ), and a path (/ex/one) A Web resource is located by a URL

http://example.com/ex/one/

Relative URLsgml/dtd.html

Fragment identifierhttp://example.com/ex/page.html#sec1

Subset of the generic URI

Page 10: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 10/61

SE 3316 © Jagath Samarabandu 10

Uniform Resource Identifier - URI

scheme:scheme-specific-part

Conventions about use of /, #, and ?− “/” for separating hierarchical structure

“?” for separating a query from query-ableresource− “#” for separating URI from “fragment identifier”

http is most common scheme. Other schemes: mailto, file, ftp

Official registry of schemes at IANA

Page 11: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 11/61

SE 3316 © Jagath Samarabandu 11

Uniform Resource Name - URN

A pointer to a resource without referring to a

particular location.− E.g. urn:isbn:0-471-94128-X

Not widely used at time of introduction

Some applications in semantic web withinResource Description Framework (RDF)

Page 12: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 12/61

SE 3316 © Jagath Samarabandu 12

Internationalized Resource Identifier

Syntax of URI is restricted to US-ASCII

IRI is being developed to allow internationalcharacters in a URI

Uses a mapping that is rather complex E.g. http://www.blåbærgrød.dk/blåbærgrød.html

Maps to: http://www.xn--blbrgrd-fxak7p.dk/bl%E5b%E6rgr%F8d.html

Page 13: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 13/61

Data Flow Models in WWW

Sep 10, 2012 SE 3316 © Jagath Samarabandu 13

Server

Server

Server

Browser

Browser

Browser

Static pages

Server side

processing

Server sideprocessing Javascript

engine

Client sideprocessing

Asynchronouscommunication

Page 14: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 14/61

SE 3316 © Jagath Samarabandu 14

Survivor’s Guide to HTML

Overall structure of an HTML document

<html><head><title>The Title of the Document</title></head><body bgcolor="white">...

</body></html>

Page 15: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 15/61

SE 3316 © Jagath Samarabandu 15

Simple Formatting

<html><head><title>Good Advice</title>

</head><body><h1>Good Advice for Everyday Life</h1><h2>For UNIX programmers</h2>

<b>Never</b> type:<p><tt>rm -rf /*</tt><p>on your computer.<h2>For Nuclear Scientists</h2><b>Never</b> press the

<i>Big <font color="red">Red</font> Button</i>.</body>

</html>

Page 16: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 16/61

SE 3316 © Jagath Samarabandu 16

More Formatting<html>

<head><title>Things To Do</title>

</head><body><ol>

<li>Feed the cat.<li>Try out the shell command:<pre>foreach x ( `ls` )

cat $x | tr "aeiouy" "x" > $xend</pre>

<li>Buy ticket for Timbuktu.

</ol></body>

</html>

Page 17: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 17/61

SE 3316 © Jagath Samarabandu 17

Hyperlinks: Source Document

<html><head>

<title>Source Document</title></head><body>

<a href="target.html#danger">Look here</a>.</body>

</html>

Page 18: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 18/61

SE 3316 © Jagath Samarabandu 18

Hyperlinks: Target Document

<html><head>

<title>Target Document</title></head><body>

...<a name="danger"></a>

<h2>Chapter 17: Dangerous Shell Commands</h2> Never execute a shell command thatinadvertently changes all vowels tothe character 'x'.

</body></html>

Page 19: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 19/61

SE 3316 © Jagath Samarabandu 19

Tables<table border="1">

<tr><td>PostScript</td><td align="right">11,274 bytes</td>

</tr><tr><td>PDF</td><td align="right">4,915 bytes</td>

</tr><tr><td>MS Word</td><td align="right">19,456 bytes</td>

</tr>

<tr><td>HTML</td><td align="right">28 bytes</td>

</tr></table>

Page 20: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 20/61

SE 3316 © Jagath Samarabandu 20

Tables for Alignment

<table width="100%"><tr><td align="left"><a href="index.html"><img src="home.gif" border="0"></a><a href="info.html"><img src="info.gif" border="0"></a>

</td><td align="right"><a href="links.html"><img src="left.gif" border="0"></a><a href="survivor.html"><img src="right.gif” border="0"></a>

</td></tr>

</table><h1>Using Tables</h1>

Page 21: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 21/61

SE 3316 © Jagath Samarabandu 21

Fill-Out Forms<form method="get"action="http://www.google.com/search"><input type="text" name="q">

<input type="submit" name="btnG" value="GoogleSearch"></form>

Tip:

Use http://www.brics.dk/ixwt/echo as the action

URL will echo the GET and PUT variables

Page 22: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 22/61

SE 3316 © Jagath Samarabandu 22

GUI Elements<input name="foo" type="text" size="20"><hr><input name="bar" type="radio" value="s">Small<input name="bar" type="radio" value="m">Medium <input name="bar" type="radio" value="l">Large

<hr><input name="baz" type="checkbox" value="c">Cheese<input name="baz" type="checkbox"value="p">Pepperoni<input name="baz" type="checkbox"value="a">Anchovies

<hr><select name="bar"><option value="s">Small<option value="m">Medium <option value="l">Large</select><hr><select name="baz" multiple><option value="c">Cheese<option value="p">Pepperoni<option value="a">Anchovies

</select>

Page 23: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 23/61

SE 3316 © Jagath Samarabandu 23

GUI Elements ... contd.

<hr><textarea name="foo" rows="5" cols="20">

 Write something here...

</textarea><hr><input name="foo" type="password" value="tomato"><hr><input name="foo" type="file">

<hr><input name="foo" type="hidden" value="you can't seethis"><hr><input name="qux" type="image" src="Denmark.gif"><hr>

<input type="submit" value="Submit this form"><hr><input type="reset" value="Reset this form">

Page 24: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 24/61

SE 3316 © Jagath Samarabandu 24

Logical Versus Physical

Logical structure the page starts with a

header

the entries are written ina list

numbers are

emphasized

Physical layout headers are centered,

huge, and grey

lists have square bullets

emphasis is rendered inbold-style italics

Phone Numbers

John Doe, (202) 555-1414  Jane Dow, (202) 555-9132  Jack Doe, (212) 555-1742 

Page 25: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 25/61

SE 3316 © Jagath Samarabandu 25

Cascading Style Sheets – CSS

Cascading Stylesheets separate structure fromlayout

The essential concepts are selectors and properties

Properties describe physical layout of specific HTML

tags CSS 2.1 has 98 properties that can be set

A stylesheet associates properties with selectedtags

Avoids duplication

Page 26: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 26/61

SE 3316 © Jagath Samarabandu 26

Some CSS Properties and Values

font-style normal, italics, oblique

font-size 12pt, larger, 150%, 1.5em

text-align

line-height normal, 1.2em, 120%

display

color red, yellow, rgb(212,120,20)

left, right, center, justify

block, inline, list-item, none

Page 27: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 27/61

SE 3316 © Jagath Samarabandu 27

Applying a Stylesheet

h1 { color: #888; font: 50px/50px "Impact"; text-align: center; }ul { list-style-type: square; }em { font-style: italic; font-weight: bold; }

<html><head><title>Phone Numbers</title>

<link href="style.css" rel="stylesheet" type="text/css"></head><body><h1>Phone Numbers</h1><ul>

<li>John Doe, <em>(202) 555-1414</em><li>Jane Dow, <em>(202) 555-9132</em><li>Jack Doe, <em>(212) 555-1742</em>

</ul></body>

</html>

Page 28: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 28/61

SE 3316 © Jagath Samarabandu 28

Structure of a Stylesheet

A selector is a list of tag names

For each selector, some properties areassigned values:− b {color: red; font-size: 12pt}

− i {color: green}

Longer selectors give context sensitivity:

− table b {color: blue; font-size: 12pt}− form b {color: yellow; font-size: 12pt}

− i {color: green}

Page 29: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 29/61

SE 3316 © Jagath Samarabandu 29

CSS Selectors

The most specific selector is chosen to apply

E F matches any F element that is a descendant ofE element

E > F matches any F element that is a child element

of E E.foo matches any E element whose class attribute

contains foo

E#foo matches any E element with ID equal to foo Other selectors include *, + and pseudo-classes

Page 30: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 30/61

SE 3316 © Jagath Samarabandu 30

Specificity in Action

<head><style type="text/css"> b {color: red;}

 b b {color: blue;} b.foo {color: green;} b b.foo {color: yellow;} b.bar {color: maroon;}</style>

<title>CSS Test</title></head>

<body><b class=foo>Hey!</b><b>Wow!<b>Amazing!</b><b class=foo>Impressive!</b><b class=bar>k00l!</b><i>Fantastic!</i>

</b></body>

Hey!Hey!Hey!Hey! Wow!Wow!Wow!Wow! Amazing!Amazing!Amazing!Amazing! Impressive!Impressive!Impressive!Impressive! K00l!K00l!K00l!K00l!   Fantastic!Fantastic!Fantastic!Fantastic!Green Red Blue ellow aroon Red

Page 31: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 31/61

SE 3316 © Jagath Samarabandu 31

HTML Validity

HTML has a formal syntax specification

800 lines of DTD notation A validator gives syntax errors for invalid

documents

Valid documents may contain this logo:

Most HTML documents on the Web are

invalid: Lets check a few. Shall we!

Page 32: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 32/61

SE 3316 © Jagath Samarabandu 32

Validation Errors

<html><body>

<table><b>123</i></table></body>

</html>

Check errors on following HTML markup

Page 33: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 33/61

SE 3316 © Jagath Samarabandu 33

Reasons for Invalidity

Ignorance of the HTML standard

Lack of testing ”This page is optimized for the XYZ browser”

”This page is best viewed in 1024x768” Automatic tools generate invalid HTML output

Forgiving browsers try to interpret invalidinput

Page 34: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 34/61

SE 3316 © Jagath Samarabandu 34

How Does This Look?

<h2>Lousy HTML</h1><li><a>This is not very</b> good.

<li><i>In fact, it is quite bad</em></ul>But the browser does <a naem="goof">something.

Page 35: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 35/61

SE 3316 © Jagath Samarabandu 35

Problems with Invalidity

There are several different browsers

Each browsers has many differentimplementations

Each implementation must interpret invalidHTML

There are many arbitrary choices to make

As a result, the HTML standard has beenundermined

HTML renders differently for most clients

Page 36: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 36/61

SE 3316 © Jagath Samarabandu 36

A Standard for Invalid HTML

The HTML Tidy tool tries to save the situation

Invalid HTML is transformed to (almost) validHTML

Still many arbitrary choices, but now weagree

<h2>Lousy HTML</h1><li><a>This is not very</b> good.

<li><i>In fact, it is quite bad</em></ul>But the browser does <a naem="goof">something.

Page 37: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 37/61

SE 3316 © Jagath Samarabandu 37

HTML Recipes<h1>Rhubarb Cobbler</h1><h2>Wed, 4 Jun 95</h2>This recipe is suggested by Jane Dow.Rhubarb Cobbler made with bananas as the main sweetener.

It was delicious.

<table><tr><td> 2 1/2 cups <td> diced rhubarb<tr><td> 2 tablespoons <td> sugar<tr><td> 2 <td> fairly ripe bananas

<tr><td> 1/4 teaspoon <td> cinnamon<tr><td> dash of <td> nutmeg</table>

<i>Combine all and use as cobbler, pie, or crisp.</i><p>This recipe has 170 calories, 28% from fat,58% from carbohydrates, and 14% from protein.<p>Related recipes: <a href="#GardenQuiche">Garden Quiche</a>is also yummy.

Page 38: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 38/61

SE 3316 © Jagath Samarabandu 38

Limitations of HTML

HTML is designed for hypertext, not for

recipes Structure and presentation is intertwined

HTML validation is less than recipe validation

HTML standards have been undermined

We need a special Recipe Markup Language!

Page 39: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 39/61

SE 3316 © Jagath Samarabandu 39

Bytes vs. Characters

HTML files are represented as text files

A text file is logically a sequence ofcharacters but physically a sequence of bytes

Several mappings exist:− ASCII

− EBCDIC

− Unicode Unicode aims to cover all characters in all

past or present written languages

Page 40: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 40/61

SE 3316 © Jagath Samarabandu 40

Unicode

Unicode Characters−

Mapping characters to numbers− Numbers are called “code points”

Glyphs− Pictorial representation of each code point− Handled by fonts

Encoding/Decoding− Convention for mapping code points to a serial bit

stream (“code units”) and vice versa− UTF-8 (8 bit code units), UTF-16 (16 bits)

Page 41: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 41/61

SE 3316 © Jagath Samarabandu 41

Unicode Characters

A character is a symbol that appears in a text

letters of the alphabet− pictograms (like ©)

− accents

Unicode characters are abstract entities:− LATIN CAPITAL LETTER A (A)

− LATIN CAPITAL LETTER A WITH RING ABOVE (Å)

− HIRAGANA LETTER SA (さ)− SINHALA LETTER SA (ස)

− RUNIC LETTER THURISAZ THURS THORN ( )

සමරබ

Page 42: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 42/61

SE 3316 © Jagath Samarabandu 42

Unicode Glyphs

A glyph is a graphical presentation

A typical example is: Å This may represent several characters:

− LATIN CAPITAL LETTER A WITH RING ABOVE

− ANGSTROM SIGN Or even a sequence of characters:

LATIN CAPITAL LETTER A− COMBINING RING ABOVE

Some characters even result in several glyphs

Page 43: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 43/61

SE 3316 © Jagath Samarabandu 43

Unicode Code Points

A code point is a unique number assigned to everyUnicode character

Code points are between 0 and 1,114,112

Current standard (6.1) defines 110,116 code points

The character HIRAGANA LETTER SA is assignedthe code point 12,373

Written as U+3055 (4+ digit hex of code point)

Code point 0 through 127 coincide with ASCII

Some code point are never assigned

Page 44: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 44/61

SE 3316 © Jagath Samarabandu 44

Unicode Character Encoding

A character encoding interprets a sequenceof bytes as a sequence of code points

The bytes are first parsed into code units

Code units have a fixed length

One or more code units may be required todenote a code point

Examples are UTF-8, UTF-16, UTF-32

UTF

Page 45: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 45/61

SE 3316 © Jagath Samarabandu 45

UTF-8

A code unit is a single byte

A code point is from 1 to 4 code units Code units between 0 and 127 directly

represent the corresponding code points

  110XXXXX indicates that 2 code units are used

  1110XXXX indicates that 3 code units are used

  11110XXX indicates that 4 code units are used The remaining code units looks like 10XXXXXX

UTF 8 E l

Page 46: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 46/61

SE 3316 © Jagath Samarabandu 46

UTF-8 Example

11100011 10000001 10010101

  11100011 10000001 10010101 11000001010101

12,373 HIRAGANA LETTER SA

UTF-8 files start with the special sequence“EF BB BF” (11101111 10111011 10111111)

What does this stand for?

UTF 16

Page 47: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 47/61

SE 3316 © Jagath Samarabandu 47

UTF-16

A code unit consists of 2 bytes

Code points below 65,536 are in a single code unit Higher code points are represented as:

−   110110XXXXXXXXXX 110111XXXXXXXXXX

− (after subtracting 65,536) This makes sense because Unicode assign no code

points between the numbers:

− 1101100000000000 (55,296)

− and

− 1101111111111111 (57,343)

B t O d

Page 48: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 48/61

SE 3316 © Jagath Samarabandu 48

Byte Order

When reading several bytes at once, we mustconsider the byte order of the architecture

UTF-16 starts text with the special code point:−   1111111011111111 (65,279, 0xFEFF)

− called zero-width non-breaking space The dual code point

− 1111111111111110 (65,534)− is never assigned

UTF-16LE and UTF-16BE may avoid this

UTF 16 E l

Page 49: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 49/61

SE 3316 © Jagath Samarabandu 49

UTF-16 Example

11111110 11111111 00110000 01010101

  11111110 11111111 00110000 01010101 00110000 01010101

12,373 HIRAGANA LETTER SA

U i d i J

Page 50: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 50/61

SE 3316 © Jagath Samarabandu 50

Unicode in Java

Java represents characters as UTF-16 codeunits – not as UTF-16 code points!

A pragmatic choice to use only 16 bits

The length function on strings may be wrong

(see String.codePointCount) Some strings may represent illegal data

ISO 8859 1

Page 51: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 51/61

SE 3316 © Jagath Samarabandu 51

ISO-8859-1

Another popular character encoding

Only 256 code points Single byte code units

Coincides with ASCII on code points 0-127 Cannot represent general Unicode

In all, there are hundreds of differentencodings...

Character Encodings in HTML

Page 52: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 52/61

SE 3316 © Jagath Samarabandu 52

Character Encodings in HTML

The document may declare its own encoding:<meta http-equiv="Content-Type"

content="text/html; charset=UTF-8">

Rest of the document is assumed to be

encoded in UTF-8 Unicode characters may be represented

using numeric character references as:&#12373;

How to Input Unicode

Page 53: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 53/61

SE 3316 © Jagath Samarabandu 53

How to Input Unicode

Needs software support for

− localization (L10n),− multilingualization (m17n),

− internationalization (i18n)

Input methods− Allows entering characters not found on the input

device

− E.g. Chinese characters using a QWERTYkeyboard

Developing i18n Ready Software

Page 54: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 54/61

SE 3316 © Jagath Samarabandu 54

Developing i18n Ready Software

Old days of using ascii strings are gone!

Represent all strings using Unicode Ensure that character data is decoded when

reading and encoded when writing

Use Unicode aware functions for manipulatingstrings

World Wide Web Consortium (W3C)

Page 55: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 55/61

SE 3316 © Jagath Samarabandu 55

World Wide Web Consortium (W3C)

Develops HTML, CSS, and most Webtechnology

Founded in 1994

Has 380 companies and organizations as

members Is directed by Tim Berners-Lee

Located at MIT (US), Inria (France), Keiko(Japan)

W3C Players

Page 56: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 56/61

SE 3316 © Jagath Samarabandu 56

W3C Players

Members ($50,000 per year)

Team Advisory board

Technical Architecture Group Working Groups

W3C Documents

Page 57: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 57/61

SE 3316 © Jagath Samarabandu 57

W3C Documents

Working Drafts

Candidate Recommendations Proposed Recommendations

Recommendations Working Group Notes

Member Submissions

Staff Comments

Team Submissions

W3C Principles

Page 58: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 58/61

SE 3316 © Jagath Samarabandu 58

W3C Principles

Consensus among members

Limited intellectual property rights Free Web access to technical reports (unlike

ISO)

WHATWG

Page 59: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 59/61

SE 3316 © Jagath Samarabandu 59

WHATWG

Web Hypertext Application Technology WorkingGroup

Founded by individuals from Apple, Mozilla,Opera

Focuses on HTML5 standard. Differs slightly from W3C HTML5

Aims to have better involvement with browservendors- From WHATWG FAQ 

Summary

Page 60: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 60/61

SE 3316 © Jagath Samarabandu 60

Summary

History and structure of HTML, CSS,URL/URI/URN/IRI and Unicode

Survivor’s guides to these technologies

Limitations of HTML for general data (the

recipe collection example)

Essential Online Resources

Page 61: se3316a_2013_02_intro_html

7/18/2019 se3316a_2013_02_intro_html

http://slidepdf.com/reader/full/se3316a201302introhtml 61/61

SE 3316 © Jagath Samarabandu 61

Essential Online Resources

http://www.w3.org/TR/html5/ 

http://www.w3.org/Addressing/  http://www.w3.org/Style/CSS/ 

http://validator.w3.org/  http://www.w3.org/ 

http://www.whatwg.org/