IN350 Class 2: Document Properties and Markup Languages

August 30, 2001
Judith A. Molka-Danielsen

Reference: Parts of Chapter 6 handout, Chapter 1: XML A Primer


Overview

• Review Properties of Documents.
• Introduce the concept of Markup Languages.
• Begin to talk about XML.
• Visit the Lab (room 076) and match student groups with accounts on Oracle.
• Create a table and populate it.

(In the BIG picture, this course IN350 is about theories and issues for document processing and management. Some of the practical exercises will be to create indexes and do searches, and have nothing to do with XML.)

Classes of document processing

Text Processing: Initially computers were used to do tedious repetitive calculations (billing transactions) on information.

Often the calculations required preprocessing or typesetting of text.

Other issues include information storage (with compression algorithms to store it efficiently), storage methods (indexing), and approaches to information retrieval.

Finally there was the preparation and processing of text for presentation purposes.

Classes of document processing

Document Processing: In the 1980s, technologies like the PC, Ethernet, laser printers, graphical user interfaces with bitmap displays, and object-based text processing allowed individuals to process documents. A text processing system called Scribe (by Brian Reid at CMU) represented a new kind of processing.

In text processors like IBM's Script, the user marked up text in terms of syntax characteristics, such as "12 point bold courier".

But Scribe formatted in terms of structural characteristics like, "heading". This was a transition to document processing.

Classes of document processing

Hypertext Processing: In the 1990s we saw the development of internetworks, and ubiquitous interfaces (windows).

Tim Berners-Lee at CERN, the European particle physics laboratory, created HTML and the URL (Uniform Resource Locator) scheme. A simple standardized form of markup, structural in the tradition of Scribe and defined as an application of SGML, could then be used to describe documents, and a naming scheme would allow for the universal identification of documents.

So documents could be viewed in graphical format, and large collections could be linked across multiple networks. This is hypertext processing.

Properties of Documents

Syntax - can express structure, presentation style, semantics, and external actions. It can be implicit in the contents of a document or expressed in a language.

Structure - tells how the elements relate to each other within the document. A structural element like a section can have a formatting style associated with it.

Presentation Style - is how the document is displayed or printed. It can be embedded in the document, as with TeX and LaTeX macros, or it can be defined separately, as CSS is for HTML documents. Presentation style can be determined by the author (in applications or languages) or by the reader (Web browser).

Semantics - the meaning within a language, can be associated with use.

Characteristics continued...

Metadata - information about the organization of the data. Data about the data. Such as, author, publication date, subject codes, etc.

XML (Chapter 1: Structured Label Information)

There is a difference between Data and Documents. Documents are formatted.

WYSIWYG word processors have problems:

• They make documents that are for one output medium (printer, online).
• Proprietary codes are used for both style and format.
• It is hard to convert old document collections (for example, to merge LaTeX and Word).
• Formats like "headline" only mean BIG font size and carry no meaning within the document.
• People use too many options within a document (30 fonts on a page).

Text and formats

File formats:

• Word processing - binary formats include Word and WordPerfect.
• Text - ASCII (American Standard Code for Information Interchange), standardized as ANSI X3.4. Alternatively there is 16-bit Unicode (ISO/IEC 10646).
• Raster graphics - TIFF (Tagged Image File Format), GIF (Graphics Interchange Format), JPEG (Joint Photographic Experts Group).
• Vector graphics - an example of a standard is CGM (Computer Graphics Metafile).
• Printing - PostScript, PDF, EPS, PCL, LCDS, XML printing formats, ISO/IEC 10180 Standard Page Description Language (SPDL), ISO 8613 Open Document Architecture (ODA).
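The difference between ASCII and Unicode text noted above can be seen in a few lines of Python. This is only a small sketch: the sample strings are arbitrary choices made for illustration, and UTF-16 is used as one common 16-bit encoding of Unicode.

```python
# ASCII is a 7-bit code: every character maps to a value below 128.
ascii_text = "Molde"  # arbitrary sample string
ascii_bytes = ascii_text.encode("ascii")

# A letter such as "ø" has no ASCII code, so encoding it as ASCII fails.
unicode_text = "Høgskolen"  # arbitrary sample string
try:
    unicode_text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# UTF-16 (big-endian, no byte-order mark) uses one 16-bit unit,
# that is two bytes, for each character in this sample.
utf16_bytes = unicode_text.encode("utf-16-be")

print(list(ascii_bytes))
print(fits_in_ascii)
print(len(unicode_text), len(utf16_bytes))
```

Characters outside the Basic Multilingual Plane need two 16-bit units in UTF-16, so the two-bytes-per-character relation holds for this sample but not for every string.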

Text and formats

File formats continued:

• Multimedia - MPEG (Moving Picture Experts Group), AVI (Audio Video Interleave).
• Email header - RFC 822.
• SMTP - Simple Mail Transfer Protocol, RFC 821.
• POP - Post Office Protocol.
• IMAP - Internet Message Access Protocol (more advanced than POP).
• MIME - Multipurpose Internet Mail Extensions (attachments).
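As a rough illustration of how the email header fields and a MIME attachment fit together, the sketch below builds a message with Python's standard `email` package. The addresses, subject, and attachment content are invented for the example, and nothing is actually sent over SMTP.

```python
from email.message import EmailMessage

# Header fields in the RFC 822 style (all values are made up for illustration).
msg = EmailMessage()
msg["From"] = "teacher@example.org"
msg["To"] = "student@example.org"
msg["Subject"] = "IN350 notes"

# The plain-text body becomes the first MIME part.
msg.set_content("The notes are attached.")

# Adding an attachment turns the message into a multipart MIME message.
msg.add_attachment(
    b"<note>hello</note>",
    maintype="application",
    subtype="xml",
    filename="note.xml",
)

print(msg.get_content_type())
for part in msg.iter_parts():
    print(part.get_content_type(), part.get_filename())
```

Printing `msg.as_string()` shows the header block followed by the parts, each separated by a MIME boundary line.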

Text and formats

File formats continued -

For document interchange between applications there is RTF (rich text format).

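Compression and archive formats include ARJ and ZIP. uuencode/uudecode is a related tool, but it is an encoding rather than a compression format: it turns binary files into text so they can pass through text-only channels, and the encoded file is larger than the original.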
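The contrast between compressing data and encoding it as text can be sketched with the Python standard library. Two assumptions apply: `zlib` implements DEFLATE, the method commonly used inside ZIP files, but this does not create a ZIP archive; and `binascii.b2a_uu` encodes a single uuencode line of at most 45 bytes rather than a whole file.

```python
import binascii
import zlib

# Repetitive sample data (arbitrary), 280 bytes long.
data = b"markup " * 40

# Compression: the output is smaller and decompresses to the original.
compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# uuencoding: the output is printable text and is larger than the input.
chunk = data[:45]
uu_line = binascii.b2a_uu(chunk)
decoded = binascii.a2b_uu(uu_line)

print(len(data), len(compressed))
print(len(chunk), len(uu_line))
```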

What is Markup?

• Markup is everything in a document that is not content. Typesetters used procedural markup to lay out instructions for how a document should look (16 pt bold Helvetica).

• Word processing software like Microsoft Word uses procedural markup, with a specific set of markup codes. The codes apply to a single physical way of presenting information, such as on a printed page. They do not define the appearance on other media like CD-ROM or the Internet.

• Descriptive markup, or generic markup, describes the structure of the document rather than the appearance. Content is separate from style. You can publish on all media using the same structure instruction set.
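The point about descriptive markup, one structure published to several media, can be sketched as follows. The element names `memo`, `heading`, and `para` and both rendering functions are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Descriptive markup: the tags say what each piece is, not how it looks.
SOURCE = "<memo><heading>Lab visit</heading><para>Meet in room 076.</para></memo>"


def to_plain_text(root):
    """Render for a text-only medium: headings are upper-cased and underlined."""
    lines = []
    for child in root:
        if child.tag == "heading":
            lines.append(child.text.upper())
            lines.append("=" * len(child.text))
        elif child.tag == "para":
            lines.append(child.text)
    return "\n".join(lines)


def to_html(root):
    """Render for a browser: each structural element maps to an HTML tag."""
    mapping = {"heading": "h1", "para": "p"}
    parts = []
    for child in root:
        tag = mapping[child.tag]
        parts.append(f"<{tag}>{escape(child.text)}</{tag}>")
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(to_plain_text(root))
print(to_html(root))
```

The source document is never edited to change its appearance; only the rendering function changes per medium.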

SGML

SGML (Standard Generalized Markup Language, ISO 8879, 1986) specifies a standard method for describing the structure of a document. Structural elements are, for example, title, chapter, and paragraph. It is an extensible Meta Language. It can support an infinite variety of document structures, such as information bulletins, technical manuals, parts catalogs, design specifications, reports, letters, and memos.

The Document Type Definition (DTD) describes the structure of the document (like a database schema in a database). The DTD provides a framework of elements (chapters, headers). The DTD specifies rules for the relationship between elements, e.g. a chapter header must come after the start of a chapter. A document instance is a document whose content is tagged in conformance with a DTD. A DTD can be applied throughout the whole organization.

SGML continued

SGML uses tagging to identify the content's position within a DTD structure. So we insert tags around the content. You can nest elements. A parser program verifies that a document follows the rules of a DTD. The parser checks whether the document is structurally correct.

Documents can be ported to different formats for different output media (printer, screen, CD-ROM, speaker, TV).

Style is usually handled separately by style sheets, like Cascading Style Sheets (CSS).
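The kind of structural check a validating parser performs can be imitated in a few lines. This is a deliberately simplified sketch: the `RULES` table is a hypothetical stand-in for a DTD, not DTD syntax, and Python's standard library parser does not itself validate against DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table standing in for a DTD: for each element, the child
# that must come first and the set of children that are allowed at all.
RULES = {
    "book": {"first": "title", "allowed": {"title", "chapter"}},
    "chapter": {"first": "header", "allowed": {"header", "paragraph"}},
}


def validate(element, errors=None):
    """Collect rule violations for an element and all of its descendants."""
    errors = [] if errors is None else errors
    rule = RULES.get(element.tag)
    if rule is not None:
        children = [child.tag for child in element]
        if not children or children[0] != rule["first"]:
            errors.append(f"<{element.tag}> must start with <{rule['first']}>")
        for tag in children:
            if tag not in rule["allowed"]:
                errors.append(f"<{tag}> is not allowed inside <{element.tag}>")
    for child in element:
        validate(child, errors)
    return errors


GOOD = "<book><title>T</title><chapter><header>H</header><paragraph>P</paragraph></chapter></book>"
BAD = "<book><title>T</title><chapter><paragraph>P</paragraph><header>H</header></chapter></book>"

print(validate(ET.fromstring(GOOD)))
print(validate(ET.fromstring(BAD)))
```

The second document is well nested but breaks the rule that a chapter begins with its header, which is the distinction between a document that merely parses and one that conforms to its document type.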

HTML

HTML (first version in 1992; version 4.01 in December 1999) is a tagging language that can be used on the World Wide Web for text formatting and linking documents. It adopts the syntax of SGML and is an application of SGML described by a particular DTD. HTML is not an extensible language: authors cannot add their own tags. HTML supports style sheets written in the CSS language (color, font, and layout for web pages) and framesets to partition the browser window.

XHTML takes a modular approach to allow the support of markup tags in smaller client devices like cell phones, TVs, cars, kiosks, etc.

Chapter 1: positive comments on HTML

HTML uses tags to separate content (text) from format (structure, appearance).

It lets amateurs control markup (good and bad)

HTML tags were used for appearance formatting, but little attention was paid to content structuring.

Chapter 1: negative comments on HTML

HTML did not offer enough custom control over the WYSIWYG environment.

Things looked different in different browsers (reader interpreted, not author interpreted).

Navigating through hypertext requires user memory.

Designing hypertext (document collections) for easy searching is hard to do. Spiders, crawlers, robots, and the AltaVista index all try to index the Web.

Chapter 1: comments on CSS

Cascading Style Sheets helped HTML by freeing it from tags like <font> and <b> that carry format information. That information is put in the style sheet instead.

It lets tags like <header> carry structure information.

CSS is a styling tool that can work with other markup languages like XML.

Chapter 1: comments on CSS

The Document

Formatting
• Structure
• Appearance

Content
• Information
• Data

Structure - HTML does this a little bit.

Appearance - or presentation. Before, HTML did this with tags like <b>, but now all appearance control should be taken out of HTML documents and put in CSS or XSL files.

Chapter 1: a migration to XML was needed

• Binary files (in native formats) compress tightly for efficient transmission, but they are complex and proprietary. (A minus for XML: files are larger, because with markup there is more to store and transfer.)

• Changing documents between applications is hard. Data must be saved in text formats and moved, and conversions were not always good. (With XML, writers define formats and standards for loading, saving, and open transfer.)

• Lock-in let MS sell new versions of Word that could read the old format and save in a new format, and then old versions could not read the files in the new format.

• XML will handle document description and data description. Structure and labels will not be lost in the move.

XML

XML (Extensible Markup Language, XML 1.0, 1998) is also a meta language in that it describes other languages. There is no predefined list of elements.

Elements are specified using a DTD or Schema. Style sheets can also be used to specify the output format of each element (XSL).

XML is based on SGML, but it is a subset and is considered easier to program. Most current versions of browsers can also display XML.
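Because XML has no predefined list of elements, an author can invent tags and a generic parser will still read them, provided the document is well formed. A minimal sketch using Python's standard library follows; the element names `course`, `class`, and `topic` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Author-defined elements: none of these names come from a fixed vocabulary.
DOCUMENT = """<course code="IN350">
  <class number="2">
    <topic>Document Properties</topic>
    <topic>Markup Languages</topic>
  </class>
</course>"""

root = ET.fromstring(DOCUMENT)
topics = [topic.text for topic in root.iter("topic")]


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(root.tag, root.attrib["code"], topics)
print(is_well_formed("<course><class></course>"))  # <class> is never closed
```

Well-formedness only means the tags nest and close properly. Whether the elements are the right ones in the right order is a separate question, answered by a DTD or Schema.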

XML related standards

XPath - specifications for the data model and grammar for navigating an XML document.

XSL - eXtensible Stylesheet Language; includes a language for transforming XML documents (XSLT) and a formatting vocabulary (XSL-FO).

XSLT - eXtensible Stylesheet Language Transformations; defines a transformation language to convert XML documents into other formats.

XLL - eXtensible Linking Language; allows logic to be placed on linking.
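To give a feel for XPath navigation, the sketch below uses the limited XPath subset supported by Python's `xml.etree.ElementTree`. The catalog document and its element names are invented for illustration, and a full XPath engine supports far more than is shown here.

```python
import xml.etree.ElementTree as ET

# Made-up parts catalog used only for this example.
CATALOG = """<catalog>
  <part id="p1"><name>Bolt</name><price>0.25</price></part>
  <part id="p2"><name>Nut</name><price>0.10</price></part>
  <manual><name>Assembly guide</name></manual>
</catalog>"""

root = ET.fromstring(CATALOG)

# Child steps: names of parts only.
part_names = [e.text for e in root.findall("./part/name")]

# Descendant step: every <name> anywhere in the document.
all_names = [e.text for e in root.findall(".//name")]

# Attribute predicate: the price of the part whose id is "p2".
nut_price = root.find("./part[@id='p2']/price").text

print(part_names)
print(all_names)
print(nut_price)
```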

XML related standards & groups

OAGIS - The Open Applications Group's (www.openapplications.org) Integration Specification for interoperability between ERP packages.

OASIS-ebXML - Organization for the Advancement of Structured Information Standards (OASIS) Electronic Business XML (www.ebxml.org).

FinXML - Financial Markup Language (www.finxml.com) supports a universal standard for data interchange within the capital market.

FpML - Financial Products Markup Language (www.fpml.org) enables e-commerce activities in the financial derivatives field.

OFX - Open Financial Exchange (www.ofx.net) for the electronic exchange of financial data.

Other languages

MathML - tags for presenting formulas

SMIL - language for scheduling multimedia (Synchronized Multimedia Integration Language). It uses XML markup to identify and manage the presentation of files containing text, images, sound and video in multi-media presentations.

RDF - Resource Description Framework, a framework for expressing metadata that can be written in XML.

HyTime - an SGML architecture that specifies the generic hypermedia structure of documents. Allows for the design of metaDTDs, for complex multimedia presentations, such as providing music with other media presentation.

For more information on markup languages, see http://www.w3.org/

XML Technologies for Oracle

XML parser. Used to parse, construct, and validate XML documents.

XPath engine. A utility that searches in-memory XML documents, using the declarative syntax of XPath, another element of the XML standard.

XSLT processor. Supports XSLT in the Oracle database, allowing you to transform XML documents into different formats.

XML SQL utility. A utility that helps produce XML documents from SQL and lets you easily insert XML-based data into Oracle database tables.

XSQL pages. A technology that lets you assemble XML data declaratively and then publish that data with XSLT.
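The idea behind the XML SQL utility, producing an XML document from the rows of a SQL query, can be sketched without an Oracle installation. The example below is an assumption-laden stand-in: it uses SQLite from Python's standard library instead of Oracle, the table, its columns, and its rows are hypothetical, and the element names `ROWSET` and `ROW` are chosen for illustration only.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Create a table and populate it (in-memory SQLite stands in for Oracle).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO documents (id, title, author) VALUES (?, ?, ?)",
    [(1, "Class notes", "Student A"), (2, "Lab report", "Student B")],
)


def documents_to_xml(connection):
    """Wrap each row of the query in a ROW element, one child per column."""
    cursor = connection.execute("SELECT id, title, author FROM documents ORDER BY id")
    columns = [description[0] for description in cursor.description]
    rowset = ET.Element("ROWSET")
    for row in cursor:
        row_element = ET.SubElement(rowset, "ROW")
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column.upper()).text = str(value)
    return rowset


xml_text = ET.tostring(documents_to_xml(conn), encoding="unicode")
print(xml_text)
```

Going the other way, inserting XML-based data into a table, would parse such a document and issue one INSERT per `ROW` element.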