XML-talk

Introduction to XML

Sasha Schwarzman, 202-777-7518 [email protected]

AGU

14 February 2003

1. XML by example

1.1. Credit card statement (paper)

Cardmember Statement

ACCOUNT NUMBER: 4444888822221111
AVAILABLE CREDIT: 5,000
CLOSING DATE: 11/25/02
PAYMENT DUE DATE: 12/15/02

CARDMEMBER STATEMENT SUMMARY

TRANS DATE  POST DATE  REFERENCE NUMBER  DESCRIPTION OF TRANSACTION   CREDITS  CHARGES
1023        1025       2416QZP           Townhouse Store #3306 DC                10.65
1027        1027       2422KQ12          Wazuri DC                               55.00
1103        1103       7422120F          Payment – Thank you         1,000.00

1.2. Credit card statement (XML)
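
A minimal sketch of how the paper statement in 1.1 might be marked up, wrapped in a few lines of Python that parse it and list the transactions. Apart from Description and Transaction, which appear in this talk, every element name below is an assumption made for illustration, not a published statement format.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: apart from Description and Transaction, the element
# names are assumptions chosen to mirror the labels on the paper statement.
STATEMENT_XML = """\
<Statement>
  <AccountNumber>4444888822221111</AccountNumber>
  <AvailableCredit>5,000</AvailableCredit>
  <ClosingDate>11/25/02</ClosingDate>
  <PaymentDueDate>12/15/02</PaymentDueDate>
  <Transaction>
    <TransDate>1023</TransDate>
    <PostDate>1025</PostDate>
    <ReferenceNumber>2416QZP</ReferenceNumber>
    <Description>Townhouse Store #3306 DC</Description>
    <Charge>10.65</Charge>
  </Transaction>
  <Transaction>
    <TransDate>1027</TransDate>
    <PostDate>1027</PostDate>
    <ReferenceNumber>2422KQ12</ReferenceNumber>
    <Description>Wazuri DC</Description>
    <Charge>55.00</Charge>
  </Transaction>
  <Transaction>
    <TransDate>1103</TransDate>
    <PostDate>1103</PostDate>
    <ReferenceNumber>7422120F</ReferenceNumber>
    <Description>Payment -- Thank you</Description>
    <Credit>1,000.00</Credit>
  </Transaction>
</Statement>
"""

# Parsing succeeds only if the markup is well-formed.
statement = ET.fromstring(STATEMENT_XML)

# Each piece of information is now addressable by name.
for transaction in statement.findall("Transaction"):
    print(transaction.findtext("ReferenceNumber"),
          transaction.findtext("Description"))
```

Unlike the paper layout, the markup says what each value is (a posting date, a charge, a credit) rather than where it sits on the page.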

1.3. XML building blocks

XML deals with documents

A document is a basic unit of XML information, composed of elements and other markup in an orderly package

<Description> Payment -- Thank you </Description>

Start tag (markup): <Description>

Character data: Payment -- Thank you

End tag (markup): </Description>

Element: the start tag, character data, and end tag together

An element is an identifiable, named component of a document

can have content (but doesn’t have to): data, other elements

can be a pointer to information (cross-reference, link)

must have one start tag and one end tag (or a single empty-element tag)

elements can nest but cannot overlap

An attribute provides additional information about an element

<Transaction Category="Groceries">

found inside start tag

may be required or implied

an element may have multiple attributes
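
The building blocks above can be inspected with Python's standard-library parser. This is only a sketch: the Reviewed attribute is hypothetical, added to show an element carrying more than one attribute.

```python
import xml.etree.ElementTree as ET

# An element: start tag, character data, end tag.
description = ET.fromstring("<Description>Payment -- Thank you</Description>")
print(description.tag)   # the element's name
print(description.text)  # the character data between the tags

# An element with no content but two attributes in its start tag.
# Category comes from the talk; Reviewed is a hypothetical second attribute.
transaction = ET.fromstring(
    '<Transaction Category="Groceries" Reviewed="no"></Transaction>'
)
print(transaction.attrib)  # attributes as a name-to-value mapping
print(len(transaction))    # number of nested child elements
```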

1.4. Credit card statement DTD

• What a DTD can provide for (structure, sequence, in-document linking, selected occurrence indicators) and what it cannot (datatyping, flexible occurrence indicators)
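
A sketch of what a small DTD for the statement could look like, embedded as an internal subset and read back with Python's expat bindings. The element names and content models are assumptions for illustration, not the DTD shown in the talk. The code only reports the declarations it sees; it does not validate the document against them.

```python
from xml.parsers import expat

# Hypothetical DTD. It fixes structure and sequence (AccountNumber first,
# then one or more Transactions) and uses the "+" occurrence indicator.
# It cannot say "between 2 and 5 Transactions", and #PCDATA is plain text,
# so it cannot require Amount to be a decimal number.
DOC = """<!DOCTYPE Statement [
  <!ELEMENT Statement (AccountNumber, Transaction+)>
  <!ELEMENT AccountNumber (#PCDATA)>
  <!ELEMENT Transaction (Description, Amount)>
  <!ATTLIST Transaction Category CDATA #IMPLIED>
  <!ELEMENT Description (#PCDATA)>
  <!ELEMENT Amount (#PCDATA)>
]>
<Statement>
  <AccountNumber>4444888822221111</AccountNumber>
  <Transaction Category="Groceries">
    <Description>Townhouse Store #3306 DC</Description>
    <Amount>10.65</Amount>
  </Transaction>
</Statement>"""

declared_elements = []
declared_attributes = []

def on_element_decl(name, model):
    # Called once per <!ELEMENT ...> declaration.
    declared_elements.append(name)

def on_attlist_decl(elname, attname, type_, default, required):
    # Called once per attribute declared in an <!ATTLIST ...>.
    declared_attributes.append((elname, attname, type_, bool(required)))

parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.AttlistDeclHandler = on_attlist_decl
parser.Parse(DOC, True)

print(declared_elements)
print(declared_attributes)
```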

1.5. Document types and their instances

• Invoice

• Sales catalog

• Dictionary

• Journal article

1.6. Validating parser

1.6.1. What a parser does

• Is the document well-formed? (for stand-alone docs)

• Does a DTD conform to XML specs?

• Does a document instance conform to the DTD?

1.6.2. What a parser does not do

• Check semantics (“gobbledygook” might be meaningless but valid as far as a validating parser is concerned)

• Check what a DTD cannot enforce (datatyping, flexible occurrence indicators)
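
The well-formedness check can be sketched with Python's standard library. Note the assumption: xml.etree.ElementTree is a non-validating parser, so this covers only the first question in 1.6.1. Checking a document against a DTD needs a validating parser, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Meaningless content is still well-formed: the parser does not check semantics.
print(is_well_formed("<gobbledygook>xyzzy</gobbledygook>"))

# Overlapping elements are rejected: elements can nest but cannot overlap.
print(is_well_formed("<a><b>text</a></b>"))

# A start tag without its end tag is rejected too.
print(is_well_formed("<a>text"))
```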

1.7. Credit card statement in XML environment

2. Components of an XML system

• Document instance

• DTD/Schema

• Validating parser

• Processing system

2.1.1. Document

Two kinds:

Well-formed

Valid (has a model)

Usually created:

Manually, using an XML editor (Epic, XMetaL)

Programmatically, from a database, another XML document, or by conversion from another format (LaTeX, MSWord)
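
Creating a document programmatically might look like the following sketch. The list of dictionaries stands in for rows fetched from a database, and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for database rows; the field names are assumptions.
rows = [
    {"ReferenceNumber": "2416QZP", "Description": "Townhouse Store #3306 DC"},
    {"ReferenceNumber": "2422KQ12", "Description": "Wazuri DC"},
]

# Build one Transaction element per row, one child element per field.
statement = ET.Element("Statement")
for row in rows:
    transaction = ET.SubElement(statement, "Transaction")
    for field, value in row.items():
        ET.SubElement(transaction, field).text = value

# Serialize the tree to a string of XML markup.
xml_text = ET.tostring(statement, encoding="unicode")
print(xml_text)
```

Because the library writes the tags, the output is well-formed by construction; whether it is valid still depends on the model it is checked against.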

2.1.2. DTD

The modeling mechanism specified by the XML standard

models one type of information

is a set of rules describing how documents of that type can be marked up

2.2. Processing system

XML DOES NOT DO ANYTHING! Your software CAN!

Start/stop behavior:

Run a script, load a database, create a “form letter” and fill in contents

Link

Format (start bold, end bold)

Process:

Extract selected elements (e.g., metadata)

Rearrange/resequence content

Rename, add content

Count how many
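
The "Process" operations above can be sketched in a few lines of Python. The element names PostDate and Posted are hypothetical; the point is that the software, not the XML, does the work.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for three transactions, deliberately out of order.
STATEMENT_XML = """\
<Statement>
  <Transaction><PostDate>1103</PostDate><Description>Payment -- Thank you</Description></Transaction>
  <Transaction><PostDate>1025</PostDate><Description>Townhouse Store #3306 DC</Description></Transaction>
  <Transaction><PostDate>1027</PostDate><Description>Wazuri DC</Description></Transaction>
</Statement>
"""
statement = ET.fromstring(STATEMENT_XML)

# Extract selected elements.
descriptions = [d.text for d in statement.iter("Description")]

# Rearrange/resequence content: order the transactions by posting date.
statement[:] = sorted(statement, key=lambda t: t.findtext("PostDate"))

# Rename: change every PostDate element to Posted.
for post_date in statement.iter("PostDate"):
    post_date.tag = "Posted"

# Count how many.
count = len(statement.findall("Transaction"))

print(descriptions)
print([t.findtext("Posted") for t in statement])
print(count)
```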

3. XML origins

3.1. What is markup?

Information added to a document that enhances its meaning in certain ways, in that it identifies the parts and how they relate to each other.

3.2. Pre-electronic (traditional) markup

Set this header in 12-point Helvetica Medium italic on a 14-point text body, justified on a 22-pica slug with indents of 1 en on left and none on the right.

3.3. Markup language

A set of symbols that can be placed in the text of a document to demarcate and label the parts of that document

3.4. Specific markup languages

Tell the formatter what action to take: "carriage return", "center the following lines", "go to the next page", etc.

3.4.1. RTF, Script, etc.

Script example

.sp (skip one line)

.bf roman 12 (change font size)

.bd .ce Chapter 1. Introduction (center "Chapter 1. Introduction" and print it in bold)

3.4.2. WYSIWYG word processors, DTP, and professional typesetting systems

• WordPerfect, MSWord, WordStar, MacWrite

• Quark, Ventura

• XYVision, Penta, Miles 33

Proprietary, not interchangeable, structure and presentation inextricably intertwined. Retrieval, cross-referencing difficult.

3.5. Generic markup languages

Use descriptive tags rather than formatting codes. Indicate the logical structure of documents. Separate formatting from structure/content.

3.5.1. Macro-based languages

• LaTeX for TeX

• Syspub for Waterloo Script

• ms for nroff

LaTeX example

\to{Mr. Smith} stands for 3 commands:

\noindent

\settabs 6 \columns

\+TO:&Mr. Smith\cr

3.6. SGML

• 1960s. GCA’s “GenCode” (Graphics Communications Association)

• 1969. IBM’s GML. Generalized Markup Language (Charles Goldfarb, Edward Mosher, and Raymond Lorie)

• 1978. ANSI working group, headed by Charles Goldfarb, formed to develop a standard text-description language based on GML as a format for text interchange

• 1983. SGML developed. DoD and IRS adopt SGML. DoD develops CALS (Computer-Aided Acquisition and Logistic Support) as an SGML application. (CALS tables still in use.) AAP develops DTDs for books and journals. SGML spreads in Europe and North America

• 1986. ISO ratifies SGML as a standard (ISO 8879:1986)

3.7. HTML

• Early 1990s. Tim Berners-Lee and Anders Berglund of the European particle physics lab CERN develop HyperText Markup Language (Berglund designed a publishing system to test SGML in the 1980s)

• HTML is an application of SGML for hypertext documents

• Both a step forward (Web, wide adoption, public interest in markup) and a step back (generic coding principles compromised: one (!) doc type used for all purposes, many tags purely presentational)

HTML example

HTML tags format

4. XML

1998. W3C group under Jon Bosak: a simplified version of SGML, with 80% of SGML's power and 20% of its complexity

4.1. What XML can do

XML can be used to tag…

Content (what type of information is this?)

City, state, zip

Part number

Debit, credit, payment

Question, answer

Genus, species

Indications, counter-indications

Structure (what part of the document is this?)

Paragraph, sub-section, section, chapter, list

Table, figure, formula, video

Author block, signature block, address block

Pointers (location, navigation, linking, and other relationships)

Hypertext links

Cross-references

Indexing terms

Metadata (information about data)

Bibliographic/cataloging information (author, title, publication date)

Index terms and keywords (search terms)

Revision, version, edition

Status, tracking information

Data sources

Editor’s and reviewer’s comments

Abstracts, highlights, “teasers”, “blurbs”

Rendering/Processing (if you MUST) – how text should behave, display, or print

normally handled through a stylesheet but…

position of graphic on the page (floating, centered)

line break in titles

tables

author’s whimsy (“I want this word bold just because”)

4.2. XML is…

A subset of SGML.

A meta-language that describes the concepts and rules to build domain-specific markup languages

A family of technologies/standards (mostly W3C Recommendations): XSLT, XSL, XPointer, XPath, XQuery, XLink, DOM, SAX, etc.

XML can be used:

for document modeling

for data interchange

4.3. XML applications (domain-specific markup languages)

Device/media-oriented:

• XHTML – Web

• WML – wireless markup language

• VoxML – spoken word markup language

Discipline-oriented:

• MathML – mathematical markup language

• CML – chemical markup language

Industry-oriented:

• Airlines/aircraft

• Semiconductors

Process-oriented:

• SVG – Scalable Vector Graphics

4.4. XML is not…

• a programming language. Does not replace C++, Java, Perl, etc.

• a user interface

• a presentation format

• a text formatting or processing system

• a standard set of document types

• a standard or recommended set of tags

• UNICODE

• a database

• user-unfriendly

5. XML in a publishing environment

5.1. Uncontrolled inputs, controlled outputs

[Figure: workflow diagram. Labeled inputs: WordPerfect, MSWord, LaTeX, HTML, PostScript. Labeled components: XML converter, XML document, XML DB, composition engine, XSLT stylesheet, CrossRef MDDB. Labeled outputs: low-res PDF, high-res PDF, XML article, HTML, A&I services, TOCs, indices, search interfaces, hand held computer, cell phone, telephone.]

5.2. Integrated environment with controlled inputs and outputs

Example: technical manual (aircraft, automobile, etc.)

Conceptual configuration of a database-centered XML-aware system (adapted from The SGML Implementation Guide by B. Travis and D. Waldt)

[Figure: a master database (text objects, graphics, works in progress) with the activities drawn around it: authoring, editing, reviewing, copy-editing, converting, imaging, composing, publishing, abstracting and indexing, searching, archiving, revising, tracking, referencing and linking, translating, assigning.]

6. XML advantages

• Encode (mark up) data only once. Create a single information repository

• Separates content/structure from presentation/formatting

• Software/hardware independent

• Interoperability: common language for a community to agree on data content; machine-to-machine communication.

• Portability

• Preservation

• Non-proprietary/open industry standard

• Reuse/re-purposing (many outputs)

• Enables semantically complex searching and retrieval

• Cuts down on the number of required converters (saves software development costs)
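
The "encode once, many outputs" idea can be sketched as one XML source rendered two ways. The Article, Title, and Para element names and both rendering functions are hypothetical; a production system would more likely use an XSLT stylesheet and a composition engine, as in the 5.1 diagram.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source article: content and structure only, no formatting.
ARTICLE_XML = """\
<Article>
  <Title>Introduction to XML</Title>
  <Para>Encode data only once.</Para>
</Article>
"""
article = ET.fromstring(ARTICLE_XML)

def to_plain_text(doc):
    """Render for a text-only device: upper-case title, blank line, paragraphs."""
    lines = [doc.findtext("Title").upper()]
    lines += [para.text for para in doc.findall("Para")]
    return "\n\n".join(lines)

def to_html(doc):
    """Render for the Web: the same content wrapped in presentational HTML tags."""
    parts = [f"<h1>{escape(doc.findtext('Title'))}</h1>"]
    parts += [f"<p>{escape(para.text)}</p>" for para in doc.findall("Para")]
    return "\n".join(parts)

print(to_plain_text(article))
print(to_html(article))
```

Presentation decisions live entirely in the two functions, so adding a third output means adding a third function, not touching the source document.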