XML and Localization

Post on 03-Jul-2015


An overview of XML and how it is used in the localization world


XML and LOCALIZATION

An overview by @Fantpmas from @YamagataEurope

What is XML? And why do you people love acronyms so much?

XML stands for eXtensible Markup Language

You can write your own language/dialect

A language to store data in a human readable format

XML is designed to carry data, not to display data like HTML

XML doesn't do anything on its own, nada, zilch!

A sample XML document (Don't worry it's all plain text)

The root element

3 child elements
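The sample document from the slide is not reproduced in this transcript. As a minimal sketch, assuming a hypothetical `book` document with three child elements (all names and values are made up), Python's standard `xml.etree.ElementTree` module can show the root and its children:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one root element with 3 child elements.
SAMPLE = """<book>
  <title>XML and Localization</title>
  <author>Jane Doe</author>
  <year>2015</year>
</book>"""

root = ET.fromstring(SAMPLE)
child_names = [child.tag for child in root]

print(root.tag)      # name of the root element
print(child_names)   # names of its child elements, in document order
```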

An XML element in detail

Start tag End tag

Attribute

Element content

Attribute value
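The labels above map directly onto what a parser reports. A small sketch, assuming a hypothetical `title` element with a `lang` attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical element: start tag with one attribute, content, end tag.
element = ET.fromstring('<title lang="en">XML and Localization</title>')

print(element.tag)     # name shared by the start tag and the end tag
print(element.attrib)  # attribute names mapped to attribute values
print(element.text)    # element content
```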

XML elements can be empty

is the same as

Self-closing element
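To illustrate the equivalence, here is a sketch with a hypothetical `note` element. Both spellings parse to the same empty element, so they serialize identically:

```python
import xml.etree.ElementTree as ET

# Hypothetical element name; both spellings describe the same empty element.
with_end_tag = ET.fromstring("<note></note>")
self_closing = ET.fromstring("<note/>")

# Neither element has content, so their serialized forms are identical.
same = ET.tostring(with_end_tag) == ET.tostring(self_closing)
print(same)
```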

There are rules to follow

When all rules are abided by, the XML is well-formed

XML well-formedness rules (not exhaustive)

• There must be a root element

• Elements must follow naming rules

• All elements must be closed

• Element names are case sensitive

• Elements must be properly nested

• Attribute values must be quoted

• Attributes can only appear once in the same start tag

• Some characters cannot be used as such

• Entities must be declared

There must be a root element

Elements must follow naming rules

Names can only start with

• A letter (in any language, including accented letters)

• A colon

• An underscore

筆者 筆者

Elements must follow naming rules

Names cannot contain

• White spaces

• Most punctuation characters except colon, underscore, hyphen, dot, middle dot

• Symbol characters

筆 者 筆 者
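The naming rules can be tried out by letting a parser decide. A minimal sketch using Python's standard library; the helper `is_well_formed` and the element names are made up for illustration, and the verdict is that of one particular parser (expat):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical element names, one per case.
print(is_well_formed("<筆者>Yamagata</筆者>"))   # name starts with a letter
print(is_well_formed("<_note>ok</_note>"))       # name starts with an underscore
print(is_well_formed("<1note>bad</1note>"))      # name starts with a digit
print(is_well_formed("<筆 者>bad</筆 者>"))       # white space inside the name
```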

All elements must be closed

Element names are case sensitive

Elements must be properly nested
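The three element rules above can be checked with a parser. In this sketch the snippets are hypothetical and each one breaks exactly one rule; the error messages come from Python's bundled parser and may differ in other tools:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, each breaking exactly one rule.
BROKEN = {
    "unclosed element": "<doc><p>text</doc>",
    "case mismatch": "<doc><Title>text</title></doc>",
    "bad nesting": "<doc><b><i>text</b></i></doc>",
}

errors = {}
for rule, snippet in BROKEN.items():
    try:
        ET.fromstring(snippet)
    except ET.ParseError as exc:
        errors[rule] = str(exc)   # keep the parser's message for each failure

for rule, message in errors.items():
    print(f"{rule}: {message}")
```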

Attribute values must be quoted

Single or double quotes

Attention to those darn quotes

If double quotes are used around an attribute value, you cannot use a literal double quote inside it: switch to single quotes or escape it as &quot;. The same applies to single quotes (&apos;).

Attributes must be unique in tags
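A short sketch of the attribute rules, again relying on the parser's verdict. The helper `parses` and the attribute names and values are hypothetical:

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed according to the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical attribute names and values.
print(parses('<p lang="en">x</p>'))             # double quotes
print(parses("<p lang='en'>x</p>"))             # single quotes
print(parses("<p lang=en>x</p>"))               # unquoted value
print(parses('<p lang="en" lang="fr">x</p>'))   # same attribute twice
print(parses('<p title="say "hi"">x</p>'))      # raw double quote inside double quotes

# Two ways to carry a double quote inside an attribute value.
a = ET.fromstring("<p title='say \"hi\"'>x</p>").get("title")
b = ET.fromstring('<p title="say &quot;hi&quot;">x</p>').get("title")
print(a == b)
```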

Some characters cannot be used

• < and & need to be escaped into entities: &lt; and &amp;

• Most control characters (for example backspace; tab, line feed and carriage return are allowed)
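Escaping by hand is error-prone, so it is usually left to a library. A minimal sketch with `escape` from Python's standard `xml.sax.saxutils`; the text and the `item` element are made up:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "Fish & chips < 5 EUR"   # hypothetical text content
escaped = escape(raw)          # replaces &, < and > with entities
document = f"<item>{escaped}</item>"

print(escaped)
# Parsing the document gives back the original, unescaped text.
print(ET.fromstring(document).text == raw)
```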

A word about entities

Entities are used to represent characters or a sequence of characters that needs to be repeated throughout a document

Syntax: &name; (ampersand, entity name, semicolon)

Predefined XML entities

5 predefined character entities; only 2 (&lt; and &amp;) are obligatory

&lt; < less than

&gt; > greater than

&amp; & ampersand

&apos; ' apostrophe

&quot; " quotation mark
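As a quick check of the table, this sketch parses each predefined entity inside a hypothetical `c` element and compares the result with the character it stands for:

```python
import xml.etree.ElementTree as ET

# The five predefined entities and the characters they stand for.
PREDEFINED = {
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
    "&apos;": "'",
    "&quot;": '"',
}

results = {}
for entity, char in PREDEFINED.items():
    parsed = ET.fromstring(f"<c>{entity}</c>").text
    results[entity] = parsed == char   # True when the parser yields the expected character
    print(entity, parsed)
```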

Entities must be declared

Except for the predefined entities, all entities must be declared in the Document Type Definition

Entity

DTD Entity declaration
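The original slide example is not part of this transcript. As a sketch, assuming a hypothetical entity named `company` declared in an internal DTD, Python's standard parser expands the declared entity and rejects the same reference when the declaration is missing:

```python
import xml.etree.ElementTree as ET

# Hypothetical entity named "company", declared in an internal DTD.
DECLARED = """<!DOCTYPE doc [
  <!ENTITY company "Yamagata Europe">
]>
<doc>Translated by &company;</doc>"""

# Same content, but without the declaration.
UNDECLARED = "<doc>Translated by &company;</doc>"

expanded = ET.fromstring(DECLARED).text
print(expanded)

try:
    ET.fromstring(UNDECLARED)
    undeclared_ok = True
except ET.ParseError as exc:
    undeclared_ok = False
    print("not well-formed:", exc)
```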

Other constructs

• XML declaration

• Stylesheet declaration

• Document Type declaration

• Comments

• CDATA
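A sketch of a hypothetical document that uses four of these constructs (the stylesheet name `preview.xsl` is made up). Python's default parser skips the comment and hands the CDATA content over as plain text:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: XML declaration, stylesheet declaration,
# comment and CDATA section.
DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="preview.xsl"?>
<!-- A comment: skipped by the default parser -->
<doc><![CDATA[<b>not markup</b> & not an entity]]></doc>"""

root = ET.fromstring(DOCUMENT)
print(root.tag)
print(root.text)   # CDATA content arrives as plain text
```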

Document Type Definition

A DTD defines the structure of an XML document

How to declare DTDs

DTDs can be internal

DTD

How to declare DTDs

DTDs can be external

XML Schema

XML Schema (*.xsd) is an XML based alternative to DTD

DTDs in the localization world

Don't be scared, but XML really is everywhere

• TMX

• TBX

• XLIFF

• TTX

• SRX

• Qt Linguist TS

• DITA

• ...

Encoding

All XML parsers must support at least UTF-8 and UTF-16. The default encoding is UTF-8. It is always a good idea to specify the encoding.
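A sketch of what specifying the encoding looks like in practice, assuming a hypothetical `city` element. Python's `ET.tostring` writes the encoding into the XML declaration when `xml_declaration=True` is passed (a keyword available from Python 3.8):

```python
import xml.etree.ElementTree as ET

root = ET.Element("city")   # hypothetical element name
root.text = "東京"

# Serialize the same element in the two encodings every parser must support.
utf8_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
utf16_bytes = ET.tostring(root, encoding="utf-16", xml_declaration=True)

# First line of the UTF-8 output: the XML declaration with the encoding.
print(utf8_bytes.decode("utf-8").splitlines()[0])

# Both byte strings differ, but parse back to the same content.
print(ET.fromstring(utf8_bytes).text == ET.fromstring(utf16_bytes).text)
```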

Byte Order Mark

A character to indicate the byte order of an XML document

In UTF-8 it's optional and not even recommended

In UTF-16 it's used to indicate endianness: little-endian or big-endian

If you see these at the start of a file, something's wrong:
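A BOM is easy to detect by looking at the first bytes of a file. A minimal sketch using the constants in Python's `codecs` module; the helper `detect_bom` is made up and deliberately ignores UTF-32. The last line shows one common symptom, a UTF-8 BOM read with a legacy code page, which may or may not be what the slide displayed:

```python
import codecs

BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
]

def detect_bom(data: bytes):
    """Return the encoding signalled by a leading BOM, or None (UTF-32 not handled)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

with_bom = codecs.BOM_UTF8 + b"<doc/>"
print(detect_bom(with_bom))
print(detect_bom(b"<doc/>"))

# A UTF-8 BOM decoded with a legacy code page shows up as stray characters.
stray = codecs.BOM_UTF8.decode("cp1252")
print(repr(stray))
```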

Complementary technologies

What? There's more of this geek stuff!?

Extensible Stylesheet Language Transformations (XSLT)

It's XML to transform another XML document!

XSL Transformations

XML → XSLT → (X)HTML, XML or TXT

How to apply an XSLT

Declare the stylesheet in the XML file itself

Use an application like XMLSpy or xmlstarlet

XSLT localization examples

• Convert a TTX to a two-column HTML or CSV

• Convert a TMX to a TBX

• Convert a TMX to a TXT (for spell-check in MS Word)

• Convert multilingual XML to TMX/TBX

• Generate HTML preview for XML in SDL Trados Studio

• Prepare XML files for translation
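The deck does these conversions with XSLT. Python's standard library has no XSLT processor, so the sketch below is not XSLT: it shows the same kind of conversion (a TMX to a two-column CSV) with a plain script instead. The TMX content is hypothetical and heavily trimmed, the helper `tmx_to_rows` is made up, and segments with inline tags would need extra handling:

```python
import csv
import io
import xml.etree.ElementTree as ET

# The parser reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, heavily trimmed TMX content (a real TMX file also has a header).
TMX = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en"><seg>Hello</seg></tuv>
    <tuv xml:lang="nl"><seg>Hallo</seg></tuv>
  </tu>
  <tu>
    <tuv xml:lang="en"><seg>Thank you</seg></tuv>
    <tuv xml:lang="nl"><seg>Dank u</seg></tuv>
  </tu>
</body></tmx>"""

def tmx_to_rows(tmx_text, source="en", target="nl"):
    """Return one (source, target) pair per translation unit."""
    rows = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            segments[tuv.get(XML_LANG)] = tuv.findtext("seg", default="")
        rows.append((segments.get(source, ""), segments.get(target, "")))
    return rows

rows = tmx_to_rows(TMX)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)   # two columns: source, target
print(buffer.getvalue())
```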

XPath

It's a query language to select nodes from an XML document

It's used in XSLT

Will select all elements that have an attribute called

and whose value is

And also in SDL Trados Studio file types
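The attribute name and value from the slide are missing in this transcript. As a sketch, assuming a hypothetical attribute `translate` with the value `no`, the limited XPath support in Python's `xml.etree.ElementTree` can run this kind of query. Full XPath would write it as `//*[@translate='no']`; ElementTree wants the relative form with a leading dot:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the attribute name and values are made up.
DOC = """<doc>
  <p translate="no">Product code X-100</p>
  <p translate="yes">Press the button.</p>
  <section><note translate="no">Internal remark</note></section>
</doc>"""

root = ET.fromstring(DOC)

# Every element, at any depth, whose attribute translate equals "no".
locked = root.findall(".//*[@translate='no']")
locked_tags = [element.tag for element in locked]
print(locked_tags)
```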

Is XML good for localization? Yes, but not always

XML is great for localization

• Unicode supported by default

• Metadata gives more information about content

• Separates content from formatting (to some extent)

• Human readable

• Easily transformable using XSLT

• Excellent for single-sourcing

But bad XML is bad

• Translatable content in attributes

• No metadata to distinguish between content e.g. mixed languages, translatable vs not translatable

• CDATA is just plain cheating

• Bad implementations of standards (XLIFF)

And also

• Multilingual XML can be challenging (XSLT can help)

東京

• Big files and one-liners can cause processing problems (pretty-printing can help)
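A minimal pretty-printing sketch with `xml.dom.minidom` from Python's standard library, on a hypothetical one-liner. Keep in mind that the added line breaks and indentation are extra white space, which can matter in content where white space is significant:

```python
from xml.dom import minidom

# Hypothetical one-liner document.
ONE_LINER = "<doc><p>First</p><p>Second</p></doc>"

# toprettyxml puts each element on its own, indented line.
pretty = minidom.parseString(ONE_LINER).toprettyxml(indent="  ")
print(pretty)
```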

Tools, tools, tools

• Altova XMLSpy: all-round XML editor

• Altova DiffDog: compare XML files

• xmlstarlet: command line XML toolkit

• EditPad Pro for all encoding/BOM matters

"Specification is only theory. In practice, there is only the parser."

@Tnkrd