From Code to XLIFF: Bridging the Chasm

Dr. Stephen Flinter

Connect Global Solutions

LRC Conference – 19 November 2003

Agenda

• The XLIFF Transformation Problem

• Current approaches

• Grammar based approach – XPG

• XPG & XML

• Summary

The Problem

• XLIFF has made the representation of resources translation/localisation friendly

• Non-trivial to convert existing files to XLIFF

• Adding new file formats can be painful

XLIFF Transformation

Definition: XLIFF Transformation is the process by which native file formats are transformed into XLIFF and, after translation, from XLIFF back to their native format.

File formats include: Java, .properties, XML, HTML, custom.

Architecture

[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material]
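The extract/merge split can be illustrated with a minimal Java sketch. It is not the tooling described in this talk: the `RoundTrip` class, the `SkeletonLine` helper, the `u1`, `u2` ids and the simple "key = value" input format are all assumptions chosen for illustration. Extraction separates each line into fixed skeleton text and a translation unit; merging recombines the skeleton with translated units.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the extract/merge split for "key = value" lines.
class RoundTrip {
    // One skeleton entry: fixed text plus, for translatable lines, a unit id.
    static final class SkeletonLine {
        final String text;
        final String unitId; // null when the line carries no translatable text
        SkeletonLine(String text, String unitId) { this.text = text; this.unitId = unitId; }
    }

    final List<SkeletonLine> skeleton = new ArrayList<>();   // non-localization data
    final Map<String, String> units = new LinkedHashMap<>(); // localization data

    void extract(List<String> lines) {
        int next = 1;
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0 || line.trim().startsWith("#")) {
                skeleton.add(new SkeletonLine(line, null)); // comment or blank line
                continue;
            }
            String unitId = "u" + next++;
            units.put(unitId, line.substring(eq + 1).trim());
            skeleton.add(new SkeletonLine(line.substring(0, eq + 1), unitId));
        }
    }

    // Rebuild the file; untranslated units fall back to the source text.
    List<String> merge(Map<String, String> translations) {
        List<String> out = new ArrayList<>();
        for (SkeletonLine s : skeleton) {
            if (s.unitId == null) { out.add(s.text); continue; }
            out.add(s.text + " " + translations.getOrDefault(s.unitId, units.get(s.unitId)));
        }
        return out;
    }

    public static void main(String[] args) {
        RoundTrip rt = new RoundTrip();
        rt.extract(Arrays.asList("# greeting", "hello = Hello world", "bye = Goodbye"));
        Map<String, String> fr = new HashMap<>();
        fr.put("u1", "Bonjour le monde");
        fr.put("u2", "Au revoir");
        rt.merge(fr).forEach(System.out::println);
    }
}
```

The hard part, and the subject of the rest of the talk, is deciding reliably which parts of the original material belong in the skeleton and which are translation units.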

.com Business Model

• Parody of the .com business model that has been floating around the web:

– Get lots of users

– ???

– Profit

XLIFF Transformation Model

• The XLIFF transformation model could be described in similar terms:

– Native file format

– ???

– XLIFF

Architecture

[Diagram: the extract/merge pipeline again (Original Material, Extract, Skeleton and Translation Units, Merge, Translated Material), now annotated "Magic happens here" and "A little magic happens here"]

Current Approaches

Current Approaches to XLIFF

• Use XLIFF as native format

• Use commercial tools

• Use regular expressions & scripts

XLIFF as Native Format

• Use XLIFF from software development onwards

• No transformation required

• Preferred approach in the long term

Disadvantages

• Requires significant changes to the software development process

• How to handle legacy resources?

– Back to the original problem

Commercial Tools

• Tool support for XLIFF is improving all the time.

• Advantages of support and expertise of tool developer.

Disadvantages

• However, many tools still only read XLIFF, and won’t generate XLIFF from native formats

• Won’t necessarily support all formats required

• Can be difficult to identify in-line tags

Scripts and Regular Expressions

• Use a scripting language (e.g. perl, python, WordBasic)

• Encode rules to extract translatable resources using regular expressions

Examples

String                       Regular Expression

"Translatable text"          /"([^"]*)"/

id1 = Translatable text      /.* = (.*)/

Advantages

• Superficially simple to develop

• Plenty of powerful RE languages (especially perl) available

• Full control and ownership of how the formats are managed

Disadvantages

• Error prone – difficult to cover all situations

• To remove all errors, often have to add many parsing rules

• Has to be redone for every new file type

• REs have to change to handle inline tags

Other Examples

print("First string");

print("Second" + " string");

print("Third \"string\"");

print("Fourth {0} string");
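A minimal Java sketch shows why these cases are awkward for the quoted-string pattern from the Examples slide. The `RegexPitfall` class name and the `extract` helper are assumptions made for this illustration. The pattern handles the first line, splits the second message into two unrelated fragments, cuts the third at the escaped quote, and returns the fourth without recognising `{0}` as a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: apply the naive quoted-string pattern to the sample lines.
class RegexPitfall {
    // A quote, any run of non-quote characters, then a quote.
    static final Pattern NAIVE = Pattern.compile("\"([^\"]*)\"");

    static List<String> extract(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(code);
        while (m.find()) {
            found.add(m.group(1)); // text between the two quotes
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "print(\"First string\");",
            "print(\"Second\" + \" string\");",
            "print(\"Third \\\"string\\\"\");",   // source text: print("Third \"string\"");
            "print(\"Fourth {0} string\");"
        };
        for (String s : samples) {
            System.out.println(s + "  ->  " + extract(s));
        }
    }
}
```

Each failure can be patched with another rule, which is how the rule set grows for every new corner case and file type.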

Summary

This approach is doomed to failure because of the disconnect between the grammar of the language and the regular expressions used to identify strings.

Grammar Based Approach

A New Approach

• With this approach, we look at the language grammar (EBNF)

• Identify grammar productions that can hold translatable text

• Generate a parser that accepts instances of the grammar and emits XLIFF

Grammar-based Architecture

[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material. Additional boxes: Original Grammar, XLIFF Parser Generator]

Architecture

• New component: XLIFF parser generator (XPG)

• Accepts a JavaCC grammar

• Allows one or more productions to be marked as translatable

• Generates the "extract" and "merge" programs

JavaCC

• JavaCC: Java Compiler Compiler

• Modelled after lex & yacc

• Works on EBNF-type grammars rendered as JavaCC .jj files

• JavaCC grammars are available for most modern programming languages

Big Win

Direct, one-to-one correspondence between the grammar and the mechanism for identifying strings.

Advantages

• Consistent high quality

– Guaranteed to work in every case, for all instances of the grammar

• Painless

– No scripting/regular expressions required

– Extractor and merger generated automatically

• Fast

– Just need to identify the strings in the grammar

Example

• Extract from Java BNF

<literal> ::= <integer literal> |
              <floating-point literal> |
              <boolean literal> |
              <character literal> |
              <string literal> |
              <null literal>

<string literal> ::= " <string characters>? "

<string characters> ::= <string character> |
                        <string characters> <string character>

<string character> ::= <input character> except " and \ |
                       <escape character>

JavaCC Extract

void Literal() :
{}
{
  <INTEGER_LITERAL> |
  <FLOATING_POINT_LITERAL> |
  <CHARACTER_LITERAL> |
  <STRING_LITERAL> |
  BooleanLiteral() |
  NullLiteral()
}

<STRING_LITERAL>

< STRING_LITERAL:
    "\""
    (   (~["\"","\\","\n","\r"])
      | ("\\"
          ( ["n","t","b","r","f","\\","'","\""]
          | ["0"-"7"] ( ["0"-"7"] )?
          | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )*
    "\""
>

Identifying <STRING_LITERAL>

• We mark <STRING_LITERAL> as a language item that may contain translatable strings

• XPG then generates a new grammar, which compiles to the extractor.

• The extractor then generates XLIFF.

Modified JavaCC Grammar

void Literal() :
{}
{
  <INTEGER_LITERAL> |
  <FLOATING_POINT_LITERAL> |
  <CHARACTER_LITERAL> |
  StringLiteral() |
  BooleanLiteral() |
  NullLiteral()
}

StringLiteral()

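void StringLiteral() :
{ Token t; }
{
  t = <STRING_LITERAL>
  {
    // Strip the surrounding double quotes from the token image.
    String s = t.image.substring(1, t.image.length() - 1);
    // Escape XML markup characters so the output stays well-formed
    // (this step is an addition to the slide's code).
    s = s.replace("&", "&amp;").replace("<", "&lt;");
    // pw (a PrintWriter) and id (an int counter) are assumed to be parser fields.
    pw.println("<trans-unit id=\"" + id++ + "\">");
    pw.println("<source>" + s + "</source>");
    pw.println("</trans-unit>");
  }
}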

Other XPG Tasks

• Create XLIFF surrounding tags

• Create skeleton file

• Embed code for handling inline tags
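As a rough illustration of the "surrounding tags" task, the Java sketch below wraps a list of extracted strings in an XLIFF envelope. It is not XPG's generated code: the `XliffWriter` class, its method names and the choice of XLIFF version 1.1 are assumptions made for this example. The element and attribute names (`xliff`, `file`, `body`, `trans-unit`, `source`, `original`, `source-language`, `datatype`) follow the XLIFF 1.x specification.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: wrap extracted strings in an XLIFF 1.1 envelope.
class XliffWriter {
    // Escape characters that are special in XML content and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String toXliff(String original, String datatype, String sourceLang, List<String> sources) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<xliff version=\"1.1\" xmlns=\"urn:oasis:names:tc:xliff:document:1.1\">\n");
        sb.append("  <file original=\"").append(escape(original))
          .append("\" source-language=\"").append(escape(sourceLang))
          .append("\" datatype=\"").append(escape(datatype)).append("\">\n");
        sb.append("    <body>\n");
        int id = 1; // sequential ids, as in the StringLiteral() action
        for (String source : sources) {
            sb.append("      <trans-unit id=\"").append(id++).append("\">\n");
            sb.append("        <source>").append(escape(source)).append("</source>\n");
            sb.append("      </trans-unit>\n");
        }
        sb.append("    </body>\n  </file>\n</xliff>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toXliff("Demo.java", "java", "en",
                Arrays.asList("First string", "Save & close")));
    }
}
```

The skeleton file would be written alongside this output so that the merge step can place each translated unit back by its id.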

Inline Tags

• Example:

– "Click on the {0} button to start the {1} job"

• The {0} and {1} constitute inline tags

• Not part of the grammar itself

• Can vary from application to application

• We must be able to extract these based on regular expressions:

– {[0-9]+}

XPG and Inline Tags

• Embeds code to read a set of regular expressions from a file.

• When the extractor identifies a string:

– Executes the REs on the string

– Moves matches to XLIFF inline tags
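The inline-tag step might look roughly like the Java sketch below. It is an illustration under assumptions, not the code XPG embeds: the `InlineTags` class and `markInline` method are hypothetical, the pattern is hard-coded rather than read from a file, and the XLIFF `<ph>` placeholder element is one plausible choice of inline tag. The braces are escaped because Java's regex syntax treats a bare `{` as the start of a quantifier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: wrap {0}-style placeholders in XLIFF <ph> inline elements.
class InlineTags {
    // The slide's expression {[0-9]+}, with braces escaped for java.util.regex.
    static final Pattern PLACEHOLDER = Pattern.compile("\\{[0-9]+\\}");

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String markInline(String source) {
        Matcher m = PLACEHOLDER.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        int id = 1;   // inline tag ids are numbered per string
        while (m.find()) {
            out.append(escape(source.substring(last, m.start())));
            out.append("<ph id=\"").append(id++).append("\">")
               .append(escape(m.group())).append("</ph>");
            last = m.end();
        }
        out.append(escape(source.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(markInline("Click on the {0} button to start the {1} job"));
    }
}
```

Because the pattern runs only on text the parser has already identified as a string, it never has to distinguish strings from surrounding code.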

Final Architecture

[Diagram: Original Material -> Extract -> Non-Localization Data (Skeleton) + Localization Data (Translation Units) -> Merge -> Translated Material. Additional boxes: Original Grammar, XLIFF Parser Generator, Inline Tags Regular Expressions]

XPG & XML

XPG and XML Applications

• A similar approach can be applied to XML Schemas

• Uses XSLT & DOM rather than JavaCC

• Can identify XML tags and attributes that may contain text
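A minimal Java sketch of the DOM side of this idea follows. It is an illustration only: the `XmlExtract` class, the `extract` method and the sample element and attribute names (`item`, `tooltip`) are assumptions, and the translatable names are passed in directly rather than derived from a schema. It uses only the standard JAXP and W3C DOM APIs and leaves XSLT out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: collect text from one element name and one attribute name.
class XmlExtract {
    static List<String> extract(String xml, String elementName, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> units = new ArrayList<>();
            NodeList all = doc.getElementsByTagName("*"); // every element, document order
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                if (e.getTagName().equals(elementName)) {
                    units.add(e.getTextContent().trim()); // element text
                }
                if (e.hasAttribute(attrName)) {
                    units.add(e.getAttribute(attrName));  // attribute text
                }
            }
            return units;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<menu><item tooltip=\"Open a file\">Open</item>"
                   + "<item tooltip=\"Quit\">Exit</item><code>42</code></menu>";
        System.out.println(extract(xml, "item", "tooltip"));
    }
}
```

As with the grammar-based extractor, the XML parser does the structural work, so only the list of translatable tags and attributes needs to be supplied per format.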

Summary

• XPG is an approach to XLIFF transformation that corresponds to the grammar of the language being transformed.

• This ensures consistent, error-free and rapid XLIFF transformation.

• The XPG approach is suitable for computer languages and markup