From Code to XLIFF: Bridging the Chasm

Dr. Stephen Flinter, Connect Global Solutions

LRC Conference – 19 November 2003

Agenda

• The XLIFF Transformation Problem

• Current approaches

• Grammar based approach – XPG

• XPG & XML

• Summary

The Problem

• XLIFF has made the representation of resources translation/localisation friendly

• Non-trivial to convert existing files to XLIFF

• Adding new file formats can be painful

XLIFF Transformation

Definition: XLIFF transformation is the process by which files in native formats are converted into XLIFF and, after translation, converted from XLIFF back to their native formats.

File formats include: Java, .properties, XML, HTML, custom.

Architecture

[Diagram: Original Material → Extract → Non-Localization Data (Skeleton) and Localization Data (Translation Units) → Merge → Translated Material]
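The extract/merge flow in this architecture can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the toy key=value input format, the `%%%n%%%` placeholder syntax, and the names `SkeletonRoundTrip`, `Extracted`, `extract` and `merge` are all hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.List;

public class SkeletonRoundTrip {
    // Hypothetical placeholder marking where a translation unit was removed.
    static String placeholder(int index) {
        return "%%%" + index + "%%%";
    }

    /** Result of extraction: the skeleton plus the ordered translation units. */
    static final class Extracted {
        final String skeleton;
        final List<String> units;

        Extracted(String skeleton, List<String> units) {
            this.skeleton = skeleton;
            this.units = units;
        }
    }

    // Toy extractor: in a key=value file, the text after '=' is translatable.
    static Extracted extract(String original) {
        StringBuilder skeleton = new StringBuilder();
        List<String> units = new ArrayList<>();
        for (String line : original.split("\n", -1)) {
            int eq = line.indexOf('=');
            if (eq >= 0) {
                units.add(line.substring(eq + 1));
                skeleton.append(line, 0, eq + 1).append(placeholder(units.size() - 1));
            } else {
                skeleton.append(line);
            }
            skeleton.append('\n');
        }
        skeleton.setLength(skeleton.length() - 1); // drop the extra trailing newline
        return new Extracted(skeleton.toString(), units);
    }

    // Merge: put each (translated) unit back where its placeholder sits.
    static String merge(String skeleton, List<String> translated) {
        String result = skeleton;
        for (int i = 0; i < translated.size(); i++) {
            result = result.replace(placeholder(i), translated.get(i));
        }
        return result;
    }
}
```

Merging the untranslated units back into the skeleton should reproduce the original file, which is a convenient sanity check for any extractor/merger pair.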

.com Business Model

• Parody of the .com business model that has been floating around the web:

– Get lots of users

– ???

– Profit

XLIFF Transformation Model

• The XLIFF transformation model could be described in similar terms:

– Native file format

– ???

– XLIFF

Architecture

[Diagram: the same extract/merge architecture (Original Material → Extract → Skeleton and Translation Units → Merge → Translated Material), annotated with "Magic happens here" and "A little magic happens here"]

Current Approaches

Current Approaches to XLIFF

• Use XLIFF as native format

• Use commercial tools

• Use regular expressions & scripts

XLIFF as Native Format

• Use XLIFF from software development onwards

• No transformation required

• Preferred approach in the long term

Disadvantages

• Requires significant changes to the software development process

• How to handle legacy resources?

– Back to the original problem

Commercial Tools

• Tool support for XLIFF is improving all the time.

• Advantages of support and expertise of tool developer.

Disadvantages

• However, many tools still only read XLIFF, and won’t generate XLIFF from native formats

• Won’t necessarily support all formats required

• Can be difficult to identify in-line tags

Scripts and Regular Expressions

• Use a scripting language (e.g. Perl, Python, WordBasic)

• Encode rules to extract translatable resources using regular expressions

Examples

String                      Regular Expression

"Translatable text"         /"([^"]*)"/

id1 = Translatable text     /.* = (.*)/
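As a rough illustration, the two patterns from the examples above could be applied in Java as follows. The class and method names (`RegexExtractor`, `extractQuoted`, `extractValue`) are made up for this sketch; only the two regular expressions come from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractor {
    // /"([^"]*)"/ : text between a pair of double quotes.
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");
    // /.* = (.*)/ : everything after " = " on a key/value line.
    static final Pattern KEY_VALUE = Pattern.compile(".* = (.*)");

    /** Returns every quoted string found on the line, without the quotes. */
    static List<String> extractQuoted(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Returns the value part of a "key = value" line, or null if the line does not match. */
    static String extractValue(String line) {
        Matcher m = KEY_VALUE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }
}
```

Both patterns handle the two sample strings, which is what makes the approach look simple at first.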

Advantages

• Superficially simple to develop

• Plenty of powerful RE languages (especially Perl) available

• Full control and ownership of how the formats are managed

Disadvantages

• Error prone – difficult to cover all situations

• To remove all errors, often have to add many parsing rules

• Has to be redone for every new file type

• REs have to change to handle inline tags

Other Examples

print("First string");

print("Second" + " string");

print("Third \"string\"");

print("Fourth {0} string");
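To see where a naive quote-matching expression goes wrong on these four statements, consider the sketch below. It assumes the simple pattern `"([^"]*)"` from the earlier example; the class name `NaiveRegexPitfalls` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveRegexPitfalls {
    // Naive rule: a string is whatever sits between two double quotes.
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Applies the naive rule to one line of source code. */
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // One unit, as intended.
        System.out.println(extract("print(\"First string\");"));
        // Two separate units, although a translator needs the whole phrase.
        System.out.println(extract("print(\"Second\" + \" string\");"));
        // The escaped quotes end the match early and the word "string" is lost.
        System.out.println(extract("print(\"Third \\\"string\\\"\");"));
        // One unit, but {0} is not recognised as an inline tag.
        System.out.println(extract("print(\"Fourth {0} string\");"));
    }
}
```

Each failure can be patched with a further rule, but every patch is another place where the expression can drift away from what the language grammar actually allows.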

Summary

This approach is doomed to failure because of the disconnect between the grammar of the language and the regular expressions used to identify strings.

Grammar Based Approach

A New Approach

• With this approach, we look at the language grammar (EBNF)

• Identify grammar productions that can hold translatable text

• Generate a parser that accepts instances of the grammar and emits XLIFF

Grammar-based Architecture

[Diagram: the extract/merge architecture (Original Material → Extract → Non-Localization Data (Skeleton) and Localization Data (Translation Units) → Merge → Translated Material), with two additions: the Original Grammar and the XLIFF Parser Generator]

Architecture

• New component: XLIFF parser generator (XPG)

• Accepts a JavaCC grammar

• Allows one or more productions to be marked as translatable

• Generates the "extract" and "merge" programs

JavaCC

• JavaCC: Java Compiler Compiler

• Modelled after lex & yacc

• Works on EBNF-type grammars rendered as JavaCC .jj files

• JavaCC grammars are available for most modern programming languages.

Big Win

Direct, one-to-one correspondence between the grammar and the mechanism for identifying strings.

Advantages

• Consistent high quality

– Guaranteed to work in every case – for all instances of the grammar.

• Painless

– No scripting/regular expressions required

– Extractor and merger generated automatically

• Fast

– Just need to identify the strings in the grammar

Example

• Extract from Java BNF

<literal> ::= <integer literal> |
              <floating-point literal> |
              <boolean literal> |
              <character literal> |
              <string literal> |
              <null literal>

<string literal> ::= " <string characters>? "

<string characters> ::= <string character> |
                        <string characters> <string character>

<string character> ::= <input character> except " and \ |
                       <escape character>

JavaCC Extract

void Literal() :
{}
{
    <INTEGER_LITERAL> |
    <FLOATING_POINT_LITERAL> |
    <CHARACTER_LITERAL> |
    <STRING_LITERAL> |
    BooleanLiteral() |
    NullLiteral()
}

<STRING_LITERAL>

< STRING_LITERAL:
    "\""
    (   (~["\"","\\","\n","\r"])
      | ("\\"
          ( ["n","t","b","r","f","\\","'","\""]
          | ["0"-"7"] ( ["0"-"7"] )?
          | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )*
    "\""
>

Identifying <STRING_LITERAL>

• We identify the <STRING_LITERAL> as a language item that may contain strings

• XPG then generates a new grammar, which compiles to the extractor.

• The extractor then generates XLIFF.

Modified JavaCC Grammar

void Literal() :
{}
{
    <INTEGER_LITERAL> |
    <FLOATING_POINT_LITERAL> |
    <CHARACTER_LITERAL> |
    StringLiteral() |
    BooleanLiteral() |
    NullLiteral()
}

StringLiteral()

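void StringLiteral() :
{ Token t; }
{
    t = <STRING_LITERAL>
    {
        // Strip the surrounding double quotes from the matched token.
        String s = t.image.substring(1, t.image.length() - 1);
        // pw and id are assumed to be declared elsewhere in the generated parser
        // (an output writer and a running trans-unit counter).
        pw.println("<trans-unit id=\"" + id++ + "\">");
        pw.println("<source>" + s + "</source>");
        pw.println("</trans-unit>");
    }
}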

Other XPG Tasks

• Create XLIFF surrounding tags

• Create skeleton file

• Embed code for handling inline tags
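A minimal sketch of the first task, wrapping extracted strings in the surrounding XLIFF elements, might look like this. It assumes XLIFF 1.1 style element and attribute names (`xliff`, `file`, `body`, `trans-unit`, `source`) and a hard-coded English source language; the class `XliffWriter` and its methods are hypothetical, not XPG's actual output code. It also escapes XML special characters, which an extractor needs to do for any string containing `<` or `&`.

```java
import java.util.List;

public class XliffWriter {
    /** Escapes characters that are unsafe in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps the extracted source strings in a minimal XLIFF-style document. */
    static String toXliff(String originalFile, String datatype, List<String> sources) {
        StringBuilder sb = new StringBuilder();
        sb.append("<xliff version=\"1.1\">\n");
        sb.append("  <file original=\"").append(escapeXml(originalFile))
          .append("\" source-language=\"en\" datatype=\"")
          .append(escapeXml(datatype)).append("\">\n");
        sb.append("    <body>\n");
        int id = 1; // running trans-unit identifier
        for (String source : sources) {
            sb.append("      <trans-unit id=\"").append(id++).append("\">\n");
            sb.append("        <source>").append(escapeXml(source)).append("</source>\n");
            sb.append("      </trans-unit>\n");
        }
        sb.append("    </body>\n");
        sb.append("  </file>\n");
        sb.append("</xliff>\n");
        return sb.toString();
    }
}
```

The output here is not validated against the XLIFF schema; a production writer would more likely use an XML library than string concatenation.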

Inline Tags

• Example:

– "Click on the {0} button to start the {1} job"

• The {0} and {1} constitute inline tags

• Not part of grammar itself

• Can vary from application to application

• We must be able to extract these based on regular expressions:

– {[0-9]+}

XPG and Inline Tags

• Embeds code to read a set of regular expressions from a file.

• When the extractor identifies a string:

– Executes RE on string

– Moves matches to XLIFF inline tag
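The inline-tag step could be sketched as below. Several things here are assumptions: the expression is hard-coded rather than read from a file, the XLIFF `<ph>` (placeholder) element is chosen as the inline tag, and the class name `InlineTagger` is invented for the example. Note that Java's regex syntax needs the braces escaped, so `{[0-9]+}` is written as `\{[0-9]+\}`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineTagger {
    // The slide's expression {[0-9]+}, with the braces escaped for Java.
    static final Pattern PLACEHOLDER = Pattern.compile("\\{[0-9]+\\}");

    /** Wraps every match of the inline-tag expression in an XLIFF <ph> element. */
    static String tagInline(String source) {
        Matcher m = PLACEHOLDER.matcher(source);
        StringBuffer out = new StringBuffer();
        int id = 1; // ids are numbered per string in this sketch
        while (m.find()) {
            String ph = "<ph id=\"" + id++ + "\">" + m.group() + "</ph>";
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(ph));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the example string, this yields `Click on the <ph id="1">{0}</ph> button to start the <ph id="2">{1}</ph> job`. The text around the tags would still need XML escaping before it is written to the XLIFF file.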

Final Architecture

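[Diagram: the grammar-based architecture (Original Material → Extract → Non-Localization Data (Skeleton) and Localization Data (Translation Units) → Merge → Translated Material, with the Original Grammar and the XLIFF Parser Generator), now with a further input: Inline Tags Regular Expressions]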

XPG & XML

XPG and XML Applications

• A similar approach can be applied to XML Schemas

• Uses XSLT & DOM rather than JavaCC

• Can identify XML tags and attributes that may contain text
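For the DOM side, a sketch along these lines shows the idea of pulling text out of elements and attributes that have been marked as translatable. It uses only the standard JAXP/DOM API; the class `XmlExtractor`, its method names, and the idea of passing the translatable names in as plain lists are assumptions made for illustration, not a description of how XPG works.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlExtractor {
    /** Collects the text of the named elements and attributes, in document order. */
    static List<String> extract(String xml, List<String> elements, List<String> attributes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> units = new ArrayList<>();
            walk(doc.getDocumentElement(), elements, attributes, units);
            return units;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    // Depth-first walk: attributes of an element first, then its own text, then its children.
    private static void walk(Element e, List<String> elements,
                             List<String> attributes, List<String> units) {
        for (String name : attributes) {
            if (e.hasAttribute(name)) {
                units.add(e.getAttribute(name));
            }
        }
        if (elements.contains(e.getTagName())) {
            units.add(e.getTextContent().trim());
        }
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                walk((Element) child, elements, attributes, units);
            }
        }
    }
}
```

For example, with `item` marked as a translatable element and `tooltip` as a translatable attribute, the input `<menu><item tooltip="Opens a file">Open</item><id>42</id></menu>` yields the units "Opens a file" and "Open", while the content of `id` is left alone.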

Summary

• XPG is an approach to XLIFF transformation that corresponds to the grammar of the language being transformed.

• This ensures consistent, error-free and rapid XLIFF transformation.

• The XPG approach is suitable for computer languages and markup