26th Internationalization and Unicode Conference San Jose, September 2004

Getting Started with ICU

Vladimir Weinstein, Eric Mader

Agenda

Getting & setting up ICU4C

Using conversion engine

Using break iterator engine

Getting & setting up ICU4J

Using collation engine

Using message formats

Getting ICU4C

http://oss.software.ibm.com/icu/

Get the latest release

Get the binary package

Source download for modifying build options

CVS for bleeding edge:

:pserver:[email protected]:/usr/cvs/icu

Setting up ICU4C

Unpack binaries

If you need to build from source

– Windows:

• MSVC .Net 2003 Project
• Cygwin + MSVC 6
• just Cygwin

– Unix: runConfigureICU

• make install
• make check

Testing ICU4C

Windows - run: cintltst, intltest, iotest

Unix - make check (again)

See it for yourself:

    #include <stdio.h>
    #include "unicode/utypes.h"
    #include "unicode/ures.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* NULL package name and "" locale: open ICU's own root resource bundle */
        UResourceBundle *res = ures_open(NULL, "", &status);
        if(U_SUCCESS(status)) {
            printf("everything is OK\n");
        } else {
            printf("error %s opening resource\n", u_errorName(status));
        }
        ures_close(res);
        return 0;
    }

Conversion Engine - Opening

ICU4C uses open/use/close paradigm

Open a converter:

    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(encoding, &status);
    if(U_FAILURE(status)) {
        /* process the error situation, die gracefully */
    }

Almost all APIs use UErrorCode for status

Check the error code!

What Converters are Available

ucnv_countAvailable() – get the number of available converters

ucnv_getAvailableName() – get the name of a particular converter

A lot of frameworks allow this examination
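
As one illustration of another framework offering the same kind of examination, here is a minimal Java sketch that lists the charsets known to the JDK. It uses the JDK's own java.nio.charset.Charset class, not ICU, and the class name ListCharsets and the helper countAvailable() are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class ListCharsets {
    // Number of charsets this JVM knows about
    // (similar in spirit to ucnv_countAvailable(), but this is the JDK, not ICU).
    static int countAvailable() {
        return Charset.availableCharsets().size();
    }

    public static void main(String[] args) {
        // Map from canonical charset name to Charset object, sorted by name.
        SortedMap<String, Charset> available = Charset.availableCharsets();
        System.out.println(available.size() + " charsets available");
        for (String name : available.keySet()) {
            System.out.println(name);
        }
    }
}
```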

Converting Text Chunk by Chunk

    char buffer[DEFAULT_BUFFER_SIZE];
    char *bufP = buffer;
    len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                          source, sourceLen, &status);
    if(U_FAILURE(status)) {
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            /* len now holds the required size: allocate and convert again */
            status = U_ZERO_ERROR;
            bufP = (char *)malloc((len + 1) * sizeof(char));
            len = ucnv_fromUChars(cnv, bufP, len + 1,
                                  source, sourceLen, &status);
        } else {
            /* other error, die gracefully */
        }
    }
    /* do interesting stuff with the converted text */

Converting Text Character by Character

Works only from code page to Unicode

    UChar32 result;
    /* ucnv_getNextUChar() takes a const char ** source pointer */
    const char *source = start;
    const char *sourceLimit = start + len;
    while(source < sourceLimit) {
        result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
        if(U_FAILURE(status)) {
            /* die gracefully */
        }
        /* do interesting stuff with the converted text */
    }

Converting Text Piece by Piece

    while((!feof(f)) &&
          ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) {
        source = inBuf;
        sourceLimit = inBuf + count;
        do {
            target = uBuf;
            targetLimit = uBuf + uBufSize;
            ucnv_toUnicode(conv, &target, targetLimit,
                           &source, sourceLimit, NULL,
                           feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                           /* is true (when no more data will come) */
                           &status);
            if(status == U_BUFFER_OVERFLOW_ERROR) {
                // simply ran out of space – we'll reset the
                // target ptr the next time through the loop.
                status = U_ZERO_ERROR;
            } else {
                // Check other errors here and act appropriately
            }
            text.append(uBuf, target-uBuf);
            count += target-uBuf;
        } while (source < sourceLimit); // while simply out of space
    }

Clean up!

Whatever is opened needs to be closed

Converters use ucnv_close

Sample uses conversion to convert code page data from a file

Break Iteration - Introduction

Four types of boundaries:

– Character, word, line, sentence

Points to a boundary between two characters

Index of character following the boundary

Use current() to get the boundary

Use first() to set iterator to start of text

Use last() to set iterator to end of text

Break Iteration - Navigation

Use next() to move to next boundary

Use previous() to move to previous boundary

Returns DONE if can’t move boundary

Break Iteration – Checking a position

Use isBoundary() to see if position is boundary

Use preceding() to find the last boundary before a position

Use following() to find the first boundary after a position
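
The navigation and position-checking calls above can be sketched in Java. This is only an illustration under stated assumptions: it uses the JDK's java.text.BreakIterator, which exposes the same method names (first, next, isBoundary, preceding, following, DONE) as the ICU class, and the class name BoundaryDemo and helper boundaries() are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryDemo {
    // Collect every word boundary by walking first()/next() until DONE.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(boundaries(text));

        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        System.out.println(it.isBoundary(5)); // is offset 5 a boundary?
        System.out.println(it.preceding(6));  // last boundary strictly before offset 6
        System.out.println(it.following(6));  // first boundary strictly after offset 6
        it.last();                            // move to the end of the text
        System.out.println(it.next() == BreakIterator.DONE); // cannot move further
    }
}
```

Note that preceding() and following() never return the offset they are given, even when that offset is itself a boundary.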

Break Iteration - Opening

Use the factory methods:

    Locale locale = …; // locale to use for break iterators
    UErrorCode status = U_ZERO_ERROR;

    BreakIterator *characterIterator =
        BreakIterator::createCharacterInstance(locale, status);

    BreakIterator *wordIterator =
        BreakIterator::createWordInstance(locale, status);

    BreakIterator *lineIterator =
        BreakIterator::createLineInstance(locale, status);

    BreakIterator *sentenceIterator =
        BreakIterator::createSentenceInstance(locale, status);

Don’t forget to check the status!

Set the text

We need to tell the iterator what text to use:

    UnicodeString text;

    readFile(file, text);
    wordIterator->setText(text);

Reuse iterators by calling setText() again.

Break Iteration - Counting words in a file:

    int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
    {
        UErrorCode status = U_ZERO_ERROR;
        UnicodeString word;
        UnicodeSet letters(UnicodeString("[:letter:]"), status);

        int32_t wordCount = 0;
        int32_t start = wordIterator->first();

        for(int32_t end = wordIterator->next();
            end != BreakIterator::DONE;
            start = end, end = wordIterator->next()) {
            /* count a segment only if it contains at least one letter */
            text.extractBetween(start, end, word);
            if(letters.containsSome(word)) {
                wordCount += 1;
            }
        }

        return wordCount;
    }
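
For comparison, the same word-counting loop can be sketched in Java. The assumptions here: the JDK's java.text.BreakIterator stands in for the ICU class, Character.isLetter stands in for the [:letter:] UnicodeSet test, and the class name WordCounter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordCounter {
    static int countWords(BreakIterator wordIterator, String text) {
        wordIterator.setText(text);
        int wordCount = 0;
        int start = wordIterator.first();
        for (int end = wordIterator.next();
             end != BreakIterator.DONE;
             start = end, end = wordIterator.next()) {
            // Count the segment only if it contains at least one letter,
            // so runs of whitespace, punctuation and digits are skipped.
            if (text.substring(start, end).codePoints().anyMatch(Character::isLetter)) {
                wordCount += 1;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        System.out.println(countWords(it, "Hello, world! 42 times."));
    }
}
```

The iterator is passed in rather than created inside the method, so one iterator can be reused for many texts by calling setText() again.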

Break Iteration – Breaking lines

    int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                          int32_t location)
    {
        int32_t len = text.length();

        /* skip any whitespace and control characters at location */
        while(location < len) {
            UChar c = text[location];

            if(!u_isWhitespace(c) && !u_iscntrl(c)) {
                break;
            }

            location += 1;
        }

        /* preceding() takes an offset; previous() takes no argument */
        return breakIterator->preceding(location + 1);
    }

Break Iteration – Cleaning up

Use delete to delete the iterators

    delete characterIterator;
    delete wordIterator;
    delete lineIterator;
    delete sentenceIterator;

Useful Links

Homepage: http://oss.software.ibm.com/icu/

API documents: http://oss.software.ibm.com/icu/apiref/index.html

User guide: http://oss.software.ibm.com/icu/userguide/

Getting ICU4J

Easiest – pick a .jar file from the download section at http://oss.software.ibm.com/icu4j

Use the latest version if possible

For sources, download the source .jar

For bleeding edge, use the latest CVS

:pserver:[email protected]:/usr/cvs/icu4j

Setting up ICU4J

Check that you have the appropriate JDK version

Try the test code (ICU4J 3.0 or later):

    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.util.UResourceBundle;

    public class TestICU {
        public static void main(String[] args) {
            UResourceBundle resourceBundle =
                UResourceBundle.getBundleInstance(null, ULocale.getDefault());
        }
    }

Add ICU’s jar to classpath on command line

Run the test suite

Building ICU4J

Need ant in addition to JDK

Use ant to build

We also like Eclipse

Collation Engine

More on collation in a couple of hours!

Used for comparing strings

Instantiation:

    ULocale locale = new ULocale("fr");
    Collator coll = Collator.getInstance(locale);
    // do useful things with the collator

Lives in com.ibm.icu.text.Collator

String Comparison

Works fast

You get the result as soon as it is ready

Use when you don’t need to compare the same strings many times

int compare(String source, String target);
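
A minimal sketch of direct string comparison follows. It assumes the JDK's java.text.Collator, which has the same compare(String, String) method as com.ibm.icu.text.Collator; the class name CollatorDemo and helper sorted() are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorDemo {
    // Sort a handful of strings with a locale-sensitive collator.
    static List<String> sorted(Locale locale, String... words) {
        Collator coll = Collator.getInstance(locale);
        List<String> list = new ArrayList<>(Arrays.asList(words));
        list.sort(coll); // Collator is a Comparator, so it can drive the sort
        return list;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.FRENCH);
        // Negative, zero or positive, like String.compareTo
        System.out.println(coll.compare("apple", "Mango"));
        System.out.println(sorted(Locale.FRENCH, "zebra", "apple", "Mango"));
    }
}
```

Unlike a plain String sort, which places the uppercase "Mango" first, the collator orders by letter first and treats case as a lesser difference.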

Sort Keys

Used when multiple comparisons are required

Indexes in databases

ICU4J has two sort key classes: CollationKey and RawCollationKey

Compare only sort keys generated by the same type of collator

CollationKey class

JDK compatible

Saves the original string

Compare keys with compareTo method

Get the bytes with toByteArray method

We used CollationKey as a key for a TreeMap structure
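
Using a CollationKey as a TreeMap key might look like the sketch below. Since the slide describes the class as JDK compatible, the sketch assumes the JDK's java.text.CollationKey, which offers compareTo, toByteArray and getSourceString; the class name SortKeyDemo and helper sortedByKey() are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class SortKeyDemo {
    static List<String> sortedByKey(Locale locale, String... words) {
        Collator coll = Collator.getInstance(locale);
        // TreeMap orders its entries with CollationKey.compareTo(),
        // so each string is analysed by the collator only once.
        Map<CollationKey, String> map = new TreeMap<>();
        for (String w : words) {
            map.put(coll.getCollationKey(w), w);
        }
        List<String> result = new ArrayList<>();
        for (CollationKey key : map.keySet()) {
            result.add(key.getSourceString()); // the key keeps the original string
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortedByKey(Locale.ENGLISH, "zebra", "apple", "Mango"));
    }
}
```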

RawCollationKey class

Does not store the original string

Get it by using getRawCollationKey method

Mutable class, can be reused

Simple and lightweight

Message Format - Introduction

Assembles a user message from parts

Some parts fixed, some supplied at runtime

Order different for different languages:

– English: My Aunt’s pen is on the table.

– French: The pen of my Aunt is on the table.

Pattern string defines how to assemble parts:

– English: {0}''s {2} is {1}.

– French: {2} of {0} is {1}.

Get pattern string from resource bundle

Message Format - Example

    String person = …; // e.g. "My Aunt"
    String place = …; // e.g. "on the table"
    String thing = …; // e.g. "pen"

    String pattern = resourceBundle.getString("personPlaceThing");
    MessageFormat msgFmt = new MessageFormat(pattern);
    Object arguments[] = {person, place, thing};
    String message = msgFmt.format(arguments);

    System.out.println(message);
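
To see the reordering at work without a resource bundle, here is a self-contained sketch. It assumes the JDK's java.text.MessageFormat and hard-codes the two pattern strings from the introduction slide in place of a bundle lookup; the class name PersonPlaceThing and helper render() are hypothetical.

```java
import java.text.MessageFormat;

public class PersonPlaceThing {
    // Patterns copied from the introduction slide; a real program
    // would load them from a resource bundle for the current locale.
    static final String ENGLISH_ORDER = "{0}''s {2} is {1}.";
    static final String FRENCH_ORDER = "{2} of {0} is {1}.";

    static String render(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        System.out.println(render(ENGLISH_ORDER, "My Aunt", "on the table", "pen"));
        System.out.println(render(FRENCH_ORDER, "My Aunt", "on the table", "pen"));
    }
}
```

The same three arguments produce "My Aunt's pen is on the table." with the first pattern and "pen of My Aunt is on the table." with the second. Only the pattern changes, not the code.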

Message Format – Different data types

We can also format other data types, like dates

We do this by adding a format type:

    String pattern = "On {0, date} at {0, time} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };

    System.out.println(fmt.format(args));

This will output:

    On Jul 17, 2004 at 2:15:08 PM there was a power failure.

Message Format – Format styles

Add a format style:

    String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };

    System.out.println(fmt.format(args));

This will output:

    On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.

Message Format – Format style details

Format Type   Format Style   Sample Output
number        (none)         123,456.789
              integer        123,457
              currency       $123,456.79
              percent        12%
date          (none)         Jul 17, 2004
              short          7/17/04
              medium         Jul 17, 2004
              long           July 17, 2004
              full           Saturday, July 17, 2004
time          (none)         2:15:08 PM
              short          2:15 PM
              medium         2:15:08 PM
              long           2:15:08 PM PDT
              full           2:15:08 PM PDT

Message Format – No format type

If no format type, data formatted like this:

Data Type   Sample Output
Number      123,456.789
Date        7/17/04 2:15 PM
String      on the table
others      output of toString() method

Message Format – Counting files

Pattern to display number of files:

    There are {1, number, integer} files in {0}.

Code to use the pattern:

    String pattern = resourceBundle.getString("fileCount");
    MessageFormat fmt = new MessageFormat(pattern);
    String directoryName = … ;
    int fileCount = … ;
    Object args[] = {directoryName, new Integer(fileCount)};

    System.out.println(fmt.format(args));

This will output messages like:

    There are 1,234 files in myDirectory.

Message Format – Problems counting files

If there’s only one file, we get:

    There are 1 files in myDirectory.

Could fix by testing for special case of one file

But, some languages need other special cases:

– Dual forms

– Different form for no files

– Etc.

Message Format – Choice format

Choice format handles all of this

Use special format element:

    There {1, choice, 0#are no files| 1#is one file| 1<are {1, number, integer} files} in {0}.

Using this pattern with the same code we get:

    There are no files in thisDirectory.
    There is one file in thatDirectory.
    There are 1,234 files in myDirectory.
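
The choice pattern can be tried with a short, self-contained sketch. It assumes the JDK's java.text.MessageFormat, which supports the same choice format element, and fixes the locale to Locale.US so the digit grouping is predictable; the class name FileCountMessage and helper render() are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class FileCountMessage {
    // Same pattern as on the slide, written without optional spaces.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String render(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, fileCount}; // fileCount is autoboxed to Integer
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(render("thisDirectory", 0));
        System.out.println(render("thatDirectory", 1));
        System.out.println(render("myDirectory", 1234));
    }
}
```

The third alternative itself contains a format element, so the chosen string is formatted again with the same arguments. That is how the count appears as 1,234.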

Message Format – Choice format patterns

Selects a string based on number

If string is a format element, process it

Splits real line into two or more ranges

Range specifiers separated by vertical bar (“|”)

Lower limit, separator, string

Separator indicates type of lower limit:

Separator   Lower Limit
#           inclusive
≤           inclusive
<           exclusive

Message Format – Choice pattern details

Here’s our pattern again:

    There {1, choice, 0#are no files| 1#is one file| 1<are {1, number, integer} files} in {0}.

First range is [0..1)

– Really (-∞..1)

Second range is [1..1]

Third range is (1..∞)
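
The three ranges can be checked directly with a small sketch. It assumes the JDK's java.text.ChoiceFormat and replaces the nested number element with the plain text "are many files" to keep the example short; the class name ChoiceRanges and helper pick() are hypothetical.

```java
import java.text.ChoiceFormat;

public class ChoiceRanges {
    // Three ranges: below 1, exactly 1, and above 1.
    static final String RANGES = "0#are no files|1#is one file|1<are many files";

    static String pick(double n) {
        ChoiceFormat cf = new ChoiceFormat(RANGES);
        return cf.format(n);
    }

    public static void main(String[] args) {
        // Values below the first limit still select the first string.
        for (double n : new double[] {-5, 0, 0.5, 1, 1.5, 1234}) {
            System.out.println(n + " -> " + pick(n));
        }
    }
}
```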

Message Format – Other details

Format style can be a pattern string

– Format type number: use DecimalFormat pattern

– Format type date, time: use SimpleDateFormat pattern

Quoting in patterns

– Enclose special characters in single quotes

– Use two consecutive single quotes to represent one

The '{' character, the '#' character and the '' character.
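
Both details can be sketched briefly. The assumptions: this uses the JDK's java.text.MessageFormat with Locale.US, and quoting rules in current ICU4J releases may differ in detail from the JDK behaviour shown here. The class name QuotingDemo and its helpers are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class QuotingDemo {
    // Quoted special characters are copied literally; '' yields one apostrophe.
    static String quoted() {
        String pattern = "The '{' character, the '#' character and the '' character.";
        return new MessageFormat(pattern, Locale.US).format(new Object[0]);
    }

    // A DecimalFormat pattern used as the format style of a number element.
    static String twoDecimals(double value) {
        MessageFormat fmt = new MessageFormat("Value: {0,number,#.##}", Locale.US);
        return fmt.format(new Object[] {value});
    }

    public static void main(String[] args) {
        System.out.println(quoted());
        System.out.println(twoDecimals(3.14159));
    }
}
```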

Useful Links

Homepage: http://oss.software.ibm.com/icu4j/

API documents: http://oss.software.ibm.com/icu4j/doc/

User guide: http://oss.software.ibm.com/icu/userguide/