Post on 27-Jun-2020
UFCEKG‐20‐2 Data, Schemas & Applications
Lecture 3: Data Representation, XML & RSS
N. H. N. D. de Silva (Slides adapted from Prakash Chatterjee, UWE)
Last week:
o introduction to the web
o URI schemes & encoding
o HTTP protocol
o media types
o request / response cycle
o GET, POST, PUT and DELETE
o introduction to mashups
o simple mashup example with forms
Feb 2013 2N. H. N. D. de Silva
WWW: definition
The World Wide Web (abbreviated as WWW or W3, commonly known as the Web) is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them via hyperlinks.
Wikipedia : World Wide Web
Concept originally proposed by Sir Tim Berners‐Lee (1989), based on earlier hypertext systems. Berners‐Lee and Belgian computer scientist Robert Cailliau proposed in 1990 to use hypertext "to link and access information of various kinds as a web of nodes in which the user can browse at will", and they publicly introduced the project in December of the same year.
Problem: How to encode data for communication
Competing constraints
o Data must be serialised into a character stream
o Communicate the meaning of the data as well as the data
o Error‐free
o Minimal size
o Handle multi‐lingual text
Bank of America Market Data Mirrors
Solutions
o Card file based
o csv
o xls ‐ Excel file format
o XML
o SQL export
o JSON ‐ JavaScript Object Notation
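To make the trade-offs concrete, the sketch below serialises one record as JSON and as XML and checks that both survive a round trip through UTF-8 bytes. The record, its field names and the `record` element name are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Invented record; the greeting exercises multi-lingual text handling.
record = {"id": "42", "title": "UWE best in West", "greeting": "こんにちは"}

# JSON: field names travel with the values.
json_text = json.dumps(record, ensure_ascii=False)

# XML: the same record as child elements of a hypothetical <record> element.
root = ET.Element("record")
for name, value in record.items():
    ET.SubElement(root, name).text = value
xml_text = ET.tostring(root, encoding="unicode")

# Both are character streams: encode to bytes for transmission and decode
# on arrival to check that nothing was lost.
for text in (json_text, xml_text):
    assert text.encode("utf-8").decode("utf-8") == text

print(json_text)
print(xml_text)
```

Both forms carry the field names alongside the values, which costs size but communicates some of the meaning.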
The Medabar in Asmara, Eritrea Google Map
Card-based
Examples
o ATCO‐CIF for timetables
o IGES for Computer‐Aided Design
Characteristics
o Based on old 80‐column punched cards
o Multiple record types
o Fixed field widths
o No formal language to define the format
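A fixed-width record is read by slicing each line at agreed column positions, with the first column selecting the record type. The sketch below is illustrative only: the record types (`H`, `S`) and the field layout are invented and are not taken from ATCO‐CIF or IGES.

```python
# Hypothetical card layout (NOT a real standard): column 0 holds the record
# type, and each record type has its own fixed field positions.
LAYOUTS = {
    "H": [("route", 1, 5), ("operator", 5, 15)],   # header card
    "S": [("stop", 1, 11), ("time", 11, 15)],      # stop card
}


def parse_card(line):
    """Split one fixed-width line into named fields using its record type."""
    record_type = line[0]
    fields = {"type": record_type}
    for name, start, end in LAYOUTS[record_type]:
        fields[name] = line[start:end].strip()
    return fields


cards = [
    "H0042FirstBus  ",
    "SBristolTM 0915",
    "SAlveston  0948",
]
records = [parse_card(card) for card in cards]
for record in records:
    print(record)
```

The layout table lives in the program, not in the data, which is the practical consequence of having no formal language to define the format.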
CSV
Examples
o Alveston (Bristol) weather data
o World Health Organization (WHO) ‐ generated estimates of TB mortality, prevalence, incidence (including incidence of HIV+TB) and case detection rate
o 1000 Songs ‐ Google Spreadsheet
Characteristics
o Data values separated by a common separator character ‐ space, comma or tab
o Column position is significant
o Lines separated by newlines ‐ coding depends on OS ‐ line feed (x0A) on Unix, carriage‐return (x0D) line feed on Windows, carriage‐return on old Macs
o Separator must not occur in data values, or some other convention is needed ‐ quotes around the value, or an escape character
o Column headings may be the first line
o Only tables ‐ all lines the same
o All columns required ‐ problem for space‐separated data
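The quoting convention can be seen with Python's standard `csv` module. This is a minimal sketch: the rows are invented sample data, not taken from the Alveston or WHO files.

```python
import csv
import io

# Invented sample rows; some values contain the separator or a quote.
rows = [
    ["station", "note", "temp_c"],
    ["Alveston", "cold, wet and windy", "7.5"],
    ["Bristol", 'sign says "closed"', "9.0"],
]

# Write: the csv module quotes values that contain the separator and doubles
# embedded quote characters. The line terminator is set explicitly because
# newline coding differs between operating systems.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)

# Read back: the first line is treated as the column headings.
parsed = list(csv.DictReader(io.StringIO(text)))
print(parsed[0]["note"])
```

Reading the text back recovers the original values, including the embedded comma and quotes.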
Tagged record structures
Data with optional data and repeated data need more complex structures. Many have been developed for specific domains:
o MARC library catalogue records
o EDIFACT for commercial Electronic Data Interchange (EDI)
o EDIF ‐ LISP‐based nested data
o EXIF data embedded in a JPEG image
XML
A generic data format based on tagged elements in a tree structure.
Developed from GML, via SGML.
GML, a document mark‐up language, was developed by Charles Goldfarb at IBM in 1969.
Examples
o Alveston WDL config file
o UWE news RSS feed
Tree with Buddhist prayer flags
XML domain vocabularies
XML defines only the rules for a well‐formed document. The allowable tags, their structuring and order in a document, the range of allowable values and the meaning of those tags depend on the XML application ‐ called a vocabulary.
There are now hundreds of XML vocabularies designed for every sort of data
o XHTML ‐ the version of HTML which conforms to XML
o SVG ‐ graphics
o TransXChange for timetables
o RSS and Atom for news
XML processing vocabularies
There are also vocabularies for languages for processing XML
o XSLT ‐ for transforming XML documents
o XSL‐FO ‐ for transforming to PDF documents
o XML Schema ‐ for defining XML vocabularies
o XProc ‐ for defining XML pipelines
Problem: News dissemination
I want to disseminate news about my project/company, and allow interested people to read it, e.g. the university wants to spread the news about successful staff
Solution 1: HTML page
Publish a page of news on the website in HTML
Problems
o how do visitors know when it's changed?
o news from different universities cannot be easily combined – (why?)
Solution: email
Encourage interested users to subscribe to your company newsletter.
Problems
o Subscription is a barrier
o Clutters up email boxes
o Can look like spam
o List management and emailing overhead
Solution: Create XML document for news
UWE makes up its own set of additional tags
<newsItem date="2007-10-2">
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlewinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>
Problems
o someone has to design this language
o has to be translated to HTML to display
o a reader has to understand multiple new tags from different sources
o needs to be distinguished from standard HTML
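The "has to be translated to HTML" problem can be sketched in a few lines of Python. The example uses a well-formed copy of the news item (straight quotes, matching lower-case `contact` tags). The function name `news_to_html` and the HTML layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

NEWS_XML = """
<newsItem date="2007-10-2">
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlewinks again</newsBody>
  <contact>press@uwe.ac.uk</contact>
</newsItem>
"""


def news_to_html(xml_text):
    """Translate one <newsItem> into a small HTML fragment (hypothetical layout)."""
    item = ET.fromstring(xml_text)
    title = escape(item.findtext("newsTitle"))
    body = escape(item.findtext("newsBody"))
    contact = escape(item.findtext("contact"))
    date = escape(item.get("date"))
    return (
        f"<h2>{title}</h2>\n"
        f"<p>{body}</p>\n"
        f'<p>{date}, contact: <a href="mailto:{contact}">{contact}</a></p>'
    )


html_fragment = news_to_html(NEWS_XML)
print(html_fragment)
```

Every home-made vocabulary needs its own translator like this one, which is part of the case for a shared standard.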
Aside: Namespaces
Problem
How to distinguish in a document XML tags from different vocabularies?
Solution
o define a (global) unique URI for the vocabulary
o use an arbitrary prefix ‐ news: for all tags in the same vocabulary ‐ unique within a document
o link the prefix to the vocabulary in the document
<h1>UWE news</h1>
<p>
  <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
    <news:Title>UWE best in West</news:Title>
    <news:Body>UWE wins tiddlewinks again</news:Body>
    <news:Contact>press@uwe.ac.uk</news:Contact>
  </news:item>
</p>
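A parser resolves each prefix to its URI, so a consumer looks elements up by URI, not by prefix. The sketch below uses Python's `xml.etree.ElementTree`. The wrapping `<div>` is an assumption added to give the fragment a single root, and the prefix must be declared as `xmlns:news="…"` for the `news:` tags to parse.

```python
import xml.etree.ElementTree as ET

DOC = """
<div>
  <h1>UWE news</h1>
  <p>
    <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
      <news:Title>UWE best in West</news:Title>
      <news:Contact>press@uwe.ac.uk</news:Contact>
    </news:item>
  </p>
</div>
"""

# The prefix used for lookup is arbitrary; only the URI has to match.
NS = {"n": "http://www.uwe.ac.uk/news"}

root = ET.fromstring(DOC)
item = root.find(".//n:item", NS)
title = item.findtext("n:Title", namespaces=NS)

print(item.tag)                # ElementTree expands the prefix to {URI}localname
print(title)
print(root.find(".//Title"))   # a lookup without the namespace finds nothing
```

The lookup prefix `n` differs from the document's `news` prefix on purpose, to show that only the URI identifies the vocabulary.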
Solution: RSS
o Standardize on one (or several!) sets of standard tags
o Tags are machine‐readable to identify news items in a list of web sites
o RSS 2.0
  o Really Simple Syndication
  o Rich Site Summary
o Atom ‐ a more recent format
o Differences ‐ dates (RFC 822 v RFC 3339 timestamps), multi‐lingual content
Characteristics
o Structure: rss / channel / item tree
o Items in reverse chronological order
o Few mandatory tags
o Namespaces allow additional vocabularies to be added
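The date difference matters to any program that reads both formats. The sketch below parses one instant written both ways, using only the Python standard library. The Atom-style string is an assumed equivalent of the RSS-style one, written for this example.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

rss_date = "Mon, 13 Oct 2008 14:28:30 GMT"   # RFC 822 style, as used in RSS 2.0
atom_date = "2008-10-13T14:28:30+00:00"      # RFC 3339 style, as used in Atom

rss_dt = parsedate_to_datetime(rss_date)     # timezone-aware datetime
atom_dt = datetime.fromisoformat(atom_date)  # timezone-aware datetime

print(rss_dt.isoformat())
print(rss_dt == atom_dt)
```

Once both are parsed into timezone-aware values, items from RSS and Atom feeds can be compared and sorted together.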
Example RSS ‐ UWE news
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
  <title>UWE News</title>
  <link>http://www.uwe.ac.uk</link>
  <description>Latest UWE press releases</description>
  <image>
    <url>http://info.uwe.ac.uk/common/assets/2004Design/logoNoBorder.gif</url>
    <title>University of the West of England</title>
    <link>http://www.uwe.ac.uk</link>
  </image>
  <pubDate>Fri, 13 Oct 2008 15:15:44 GMT</pubDate>
  <item>
    <title>New research looks to transport users for solutions</title>
    <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
    <description>'Ideas in Transit' is a new initiative which will look to transport users' experiences and creativity as a source of innovation to tackle the UK's transport problems....
    </description>
  </item>
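Because every RSS 2.0 feed shares the rss / channel / item structure, one short program can read any of them. The sketch below uses a shortened feed modelled on the UWE example. The closing tags are added so that the text is well-formed, and the encoding declaration is dropped because the feed is held in a Python string.

```python
import xml.etree.ElementTree as ET

# Shortened feed modelled on the UWE example, completed with closing tags.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
<channel>
  <title>UWE News</title>
  <link>http://www.uwe.ac.uk</link>
  <description>Latest UWE press releases</description>
  <item>
    <title>New research looks to transport users for solutions</title>
    <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
  </item>
</channel>
</rss>
"""

root = ET.fromstring(FEED)          # the <rss> root element
channel = root.find("channel")      # exactly one channel per feed
print(channel.findtext("title"))

# Collect each item's title and link into plain dictionaries.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.findall("item")
]
for entry in items:
    print(entry["title"], "->", entry["link"])
```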
Example RSS ‐ BBC Finance News
<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/shared/bsp/xsl/rss/nolsol.xsl"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
<channel>
  <title>BBC News | Business | UK Edition</title>
  <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
  <description>Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.
  </description>
  <language>en-gb</language>
  <lastBuildDate>Mon, 13 Oct 2008 14:28:30 GMT</lastBuildDate>
  <copyright>Copyright: (C) British Broadcasting Corporation, see http://news.bbc.co.uk/1/hi/help/rss/4498287.stm for terms and conditions of reuse</copyright>
  <docs>http://www.bbc.co.uk/syndication/</docs>
  <ttl>15</ttl>
  <image>
    <title>BBC News</title>
    <url>http://news.bbc.co.uk/nol/shared/img/bbc_news_120x60.gif</url>
    <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
  </image>
  <item>
    <title>UK banks receive £37bn bail-out</title>
    <description>The UK government says it is to inject a total of up to £37bn into Royal
    …..
  </item>
RSS aggregation
Problem
How to keep track of multiple feeds
Solution
http://www.youtube.com/watch?v=0klgLsSxGsU&feature=player_embedded#t=0s
o Application needed which is stateful – remembers what items you have read
o Integrates multiple feeds into one ‘magazine’
o Polls RSS providers on a regular basis
Feed integrators such as Bloglines and Google Reader reduce the load on the provider and provide some filtering. There is an RSS reader integrated into MyUWE.
RSS Aggregation with Bloglines
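The three requirements (stateful, integrating, polling) can be sketched as a small class. This is a toy under stated assumptions, not how Bloglines or Google Reader work. The feeds are in-memory strings standing in for HTTP fetches, the `Aggregator` class and `rss` helper are hypothetical, and the `example.org` links are placeholders.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def rss(channel_title, entries):
    """Build a tiny RSS 2.0 string from (title, link, pubDate) tuples."""
    items = "".join(
        f"<item><title>{title}</title><link>{link}</link>"
        f"<pubDate>{date}</pubDate></item>"
        for title, link, date in entries
    )
    return (f'<rss version="2.0"><channel><title>{channel_title}</title>'
            f"{items}</channel></rss>")


class Aggregator:
    """Stateful reader: remembers which item links have already been seen."""

    def __init__(self):
        self.seen = set()

    def poll(self, feeds):
        """Return unseen items from all feeds, newest first."""
        fresh = []
        for xml_text in feeds:
            channel = ET.fromstring(xml_text).find("channel")
            source = channel.findtext("title")
            for item in channel.findall("item"):
                link = item.findtext("link")
                if link in self.seen:
                    continue
                self.seen.add(link)
                when = parsedate_to_datetime(item.findtext("pubDate"))
                fresh.append((when, source, item.findtext("title")))
        return sorted(fresh, reverse=True)


uwe = rss("UWE News", [("Transport research", "http://example.org/uwe/1",
                        "Mon, 13 Oct 2008 15:15:44 GMT")])
bbc = rss("BBC Business", [("Bank bail-out", "http://example.org/bbc/1",
                            "Mon, 13 Oct 2008 14:28:30 GMT")])

reader = Aggregator()
first = reader.poll([uwe, bbc])
second = reader.poll([uwe, bbc])   # nothing new on the second poll
for when, source, title in first:
    print(when.isoformat(), source, title)
print(len(second))
```

A real aggregator would fetch each feed over HTTP on a timer and persist the `seen` set between runs.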
RSS as a tree structure
o UWE news
o BBC Finance news
o Earthquakes
XML Characteristics
o strings enclosed in tags which provide a humanly readable name for the element ‐ so‐called self‐describing
o elements may be nested to create hierarchical data structures
o element tags may be repeated
o element names can be relative to their parent
o element structure can be formally defined
Aside: Self‐describing
o Element names provide a clue about the meaning of the data, but not enough
  o names are ambiguous
  o names may be misleading
  o what units?
  o what accuracy?
  o what origin? ‐ leads to need for meta‐data
    o who created
    o when
    o what license to use
    o why
XML terminology
XML documents are tree‐structures, with each node bounded by an opening and a closing tag
o Element: the opening tag, attributes, the body of the element and the closing tag. Elements are not elemental!
o Tag name: the name in angle brackets ‐ must conform to rules, may have a prefix
o Attribute: a name="value" pair attached to an element. Names follow the same rules as tag names.
o Parent: all elements except the root have one parent
o Child: an element nested in another parent element
o Root: every document has a single root element with no parent
o Mixed Content: an element may contain a mixture of text and other elements
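These terms map directly onto what a parser exposes. The sketch below walks a one-line document, invented for this example, with Python's `xml.etree.ElementTree`. ElementTree keeps no parent pointers, so the child-to-parent map is built by hand.

```python
import xml.etree.ElementTree as ET

DOC = '<news date="2008-10-14"><p>UWE is <b>best</b> in the West</p></news>'

root = ET.fromstring(DOC)          # Root: the single element with no parent
print(root.tag, root.attrib)       # tag name and attributes

# ElementTree does not store parent links, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

p = root.find("p")
b = p.find("b")
print(parent_of[b].tag)            # parent of <b>
print([child.tag for child in p])  # children of <p>

# Mixed content: text before the first child is .text; text following a
# child's closing tag is stored as that child's .tail.
print(repr(p.text), repr(b.text), repr(b.tail))
print("".join(p.itertext()))
```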
Basic XML rules
o A single root element
o Tags must be properly nested
o An element must be closed:
  o Open and closing tag <p>... </p>
  o Empty element <br /> or <hr size="3"/>
Other formatting rules
o XML names are case sensitive, no spaces, restricted character set
o Attribute values must be single or double‐quoted
o Special characters coded as references: &#xA; (a line feed), &gt; (>)
o Some characters have special meaning, e.g. < is the start of a tag within XML data, & is the first character of an entity reference. In XML data these have to be encoded as &lt; and &amp; or enclosed in <![CDATA[ ....]]>
o Preferably use standard formats for representing values, e.g. 2008-10-14 for a date
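The rules can be checked by handing small fragments to a parser and seeing which it rejects. This is a sketch using Python's standard library. The helper name `is_well_formed` is an assumption, and the check covers well-formedness only, not validity against a vocabulary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    "nested": is_well_formed("<p><b>ok</b></p>"),
    "overlapping": is_well_formed("<p><b>bad</p></b>"),
    "unclosed": is_well_formed("<p><br></p>"),
    "empty element": is_well_formed("<p><br /></p>"),
    "case mismatch": is_well_formed("<contact>x</Contact>"),
    "two roots": is_well_formed("<a/><b/>"),
    "raw ampersand": is_well_formed("<t>Fish & chips</t>"),
    "CDATA": is_well_formed("<t><![CDATA[Fish & chips < 5]]></t>"),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")

# Escape data before placing it in element content.
raw = "Fish & chips < 5"
element = f"<t>{escape(raw)}</t>"
print(element)
print(ET.fromstring(element).text == raw)
```

Escaping on output and letting the parser decode on input keeps the data unchanged across the round trip.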