Unicode Support in ICU for Java

Doug Felt

dougfelt@us.ibm.com

Globalization Center of Competency, San Jose, CA


Overview

• What is ICU4J?

• ICU and the JDK, a brief history

• Benefits and tradeoffs of ICU4J

• Features of ICU4J

• Performance of ICU4J

• Using ICU4J

• Conclusion and References


What is ICU4J?

• Internationalization library

– Sister project of ICU (C/C++)

– Open-source, non-viral license

– Sponsored by IBM

• Unicode Standard compliant, up-to-date

• 100% Pure Java

• Enhances and extends JDK functionality

• Over five years of continuous development


ICU and Java, a History

• Started with Java 1.1 internationalization

– Much code contributed by IBM/Taligent

– IBM provided support, bug fixes, enhancements

• Became open-source project in 2000

– ICU4C code started with port from Java

• Continued contributions to Java since then

– TextLayout, OpenType layout, Normalization


Collaboration with Java Teams

• We continue to work with the Java internationalization and Graphics2D teams

• We participate in Java expert groups (e.g. JSR 204, Supplementary Character Support)

• Differences

– perspectives (conformance, features versus size)

– processes (open source versus corporate/JSR)

– timetable (twice a year versus every two years)


Benefits

• Fully implements current standards

– Unicode collation, normalization, break iteration

– Updated more frequently than Java

• Full CLDR data

• Improved performance

• Open source, open license, customizable

• Compatible with ICU C/C++ libraries and data

• Runs on JDK 1.4

– Get supplementary character support without moving to 1.5


Tradeoffs

• Not built-in, unlike Java i18n support

• Some API differences

– But generally a superset of the Java API

– Some differences unavoidable due to class restrictions

– Rule syntax differs to varying degrees

• Data differences

– ICU4J uses its own CLDR data, not the JVM’s data

• Size

– Can trim ICU4J, but it will always be larger than 0K


Features of ICU4J

• Collation

• Normalization

• Break Iteration

• UnicodeSet and Transforms

• Character Properties

• Locale data

• Other

– Calendars, Formatters, IDNA, StringPrep, IMEs


Collation

• Full UCA (Unicode Collation Algorithm)

– Java does not implement UCA collation

• Locale data

– Over 60 tailorings for locale-specific collation

– Variants: Pinyin, stroke, traditional, etc.

• Performance, compared with Java

– sorting: 2 to 20 times faster

– sort key generation: 1.5 to 4 times faster

– sort key length: 2/3 to 1/4 the length of Java sort keys
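The sort-key figures above matter when the same strings are compared many times: each string is converted once to a binary sort key, and the keys are compared instead of the strings. A minimal sketch of that pattern follows. So that it runs without extra jars, it uses the JDK's java.text.Collator and CollationKey as a stand-in; the parallel ICU4J class com.ibm.icu.text.Collator is assumed to be usable the same way. The class name SortKeyDemo and the sample words are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; uses the JDK collator as a stand-in for ICU4J's.
class SortKeyDemo {

    // Sorts words by precomputed sort keys instead of repeated string compares.
    static String[] sortWithKeys(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Build each sort key once; this is the cost the slide calls
        // "sort key generation".
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }

        // CollationKey is Comparable, so key comparison needs no collator.
        Arrays.sort(keys);

        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.ENGLISH)));
    }
}
```

Shorter keys reduce both the memory held by the key array and the time spent in each key comparison, which is why sort key length is listed alongside speed.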


Normalization

• Java does not provide normalization APIs

– Java uses ICU’s implementation internally

– Useful for searching, string equivalence, simplifying processing of text

• Full implementation of Unicode standard

– NFC, NFD, NFKC, NFKD

– Also provides FCD ‘quick check’ for optimization
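To make "string equivalence" concrete: two strings that look identical can differ code unit by code unit until both are brought to the same normalization form. The sketch below is an illustration, not ICU4J code. It uses java.text.Normalizer, which the JDK gained in Java 6 (after this talk), so that it runs without the ICU4J jar; ICU4J's own normalizer is assumed to give the same results for these forms. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; JDK 6+ normalizer used as a stand-in for ICU4J's.
class NormalizationDemo {

    // True if a and b are canonically equivalent (equal after NFC).
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    // Canonical decomposition: precomposed characters become base + marks.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility composition: also folds ligatures and similar variants.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // e with acute as a single code point
        String decomposed = "e\u0301";  // e followed by combining acute accent

        System.out.println(composed.equals(decomposed));           // raw comparison
        System.out.println(canonicallyEqual(composed, decomposed)); // after NFC
        System.out.println(nfd(composed).length());
        System.out.println(nfkc("\uFB01"));                         // "fi" ligature
    }
}
```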


Break Iteration

• Fully conforms to Unicode specifications

– supplementary characters, Hangul

• Tags

– e.g., “what kind of word was this”

• Title case iteration

• Rule-based, dictionary-based for Thai


Unicode Set and Transforms

• UnicodeSet

– collections of characters based on properties

– logical set operations, flexible

– “[[:mark:]&[\u0600-\u067f]]”

• Transliterator

– general transformations, with chaining and editing

– converts between scripts, e.g. Greek/Latin, Devanagari/Gujarati

– rule-based, rules for common conversions supplied

• UScriptRun
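The UnicodeSet pattern on this slide reads as an intersection: characters whose general category is a mark, restricted to the range U+0600 to U+067F. In ICU4J the pattern string would be handed to UnicodeSet itself; the sketch below is not that implementation. It only spells out what the intersection selects, using JDK character data so that it runs without the ICU4J jar. The class and method names are hypothetical, and the exact membership depends on the Unicode version of the JVM.

```java
import java.util.BitSet;

// Hypothetical demo class: a hand-rolled equivalent of "[[:mark:]&[start-end]]".
class MarkSetDemo {

    // True if the code point's general category is Mn, Mc, or Me.
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Intersection of the mark property with the inclusive range [start, end].
    static BitSet marksInRange(int start, int end) {
        BitSet set = new BitSet(end + 1);
        for (int cp = start; cp <= end; cp++) {
            if (isMark(cp)) {
                set.set(cp);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet arabicMarks = marksInRange(0x0600, 0x067F);
        // U+064B ARABIC FATHATAN is a nonspacing mark; U+0627 ARABIC LETTER ALEF is a letter.
        System.out.println(arabicMarks.get(0x064B));
        System.out.println(arabicMarks.get(0x0627));
    }
}
```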


Character Properties

• All Unicode character properties

– over 80, Java provides access to about 10

• All defined code points

• Current with latest Unicode release

– ICU4J 3.0 uses Unicode 4.0.1 data

• Fast access to character data


Locale Data

• Standard data, included with ICU4J

– CLDR (Common Locale Data Repository)

– Ensures same data is available everywhere

– Can share resource data with ICU4C applications

• More locales, more kinds of data

– ~230 locales, compared to ~130 for Java

– Can modularize to include only the data you need

• RFC3066bis support (language_script_region)

– e.g., zh_Hans, zh_Hant

– keywords (orthogonal variants)
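A short sketch of what the language_script_region structure looks like from code. It is an illustration under stated assumptions: it uses the JDK's java.util.Locale (Java 7 and later) with hyphenated BCP 47 tags, rather than ICU4J's ULocale and the underscore IDs shown on the slide, so that it runs without the ICU4J jar. The Unicode extension key "co" is used here as a rough counterpart of a collation keyword. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class; JDK Locale used as a stand-in for ICU4J's ULocale.
class LocaleTagDemo {

    // Splits a language tag into "language|script|region"; absent parts are empty.
    static String describe(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return locale.getLanguage() + "|" + locale.getScript() + "|" + locale.getCountry();
    }

    // Reads the collation type carried by the tag's Unicode extension, or null if none.
    static String collationType(String languageTag) {
        return Locale.forLanguageTag(languageTag).getUnicodeLocaleType("co");
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hans-CN"));          // script and region both present
        System.out.println(describe("zh-Hant"));             // script only, no region
        System.out.println(collationType("es-ES-u-co-trad")); // orthogonal to the base locale
    }
}
```

The script subtag is what lets Simplified and Traditional Chinese be named independently of region, and the keyword travels with the locale without changing its language, script, or region.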


Performance of ICU4J

• Instantiation times are comparable

– Common instantiate and reuse model

– ICU4J and Java both use caches to limit impact

• Collation performance faster

– faster sorting, smaller sort keys

• Performance is difficult to measure

– JVM makes a difference

– ICU4J performs well in spot tests

– Use a scenario that matters to you to test
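One way to act on the last point is a small spot test of your own. The sketch below is a deliberately naive timing loop, not the harness behind any numbers in this talk: it times JDK character property lookups and reports nanoseconds per operation. The class name, the iteration counts, and the choice of lookups are all assumptions; to compare against ICU4J you would time the parallel UCharacter calls in the same loop. A dedicated benchmarking tool would control for JIT warm-up far better than this does.

```java
// Hypothetical spot test; results vary with JVM, hardware, and warm-up.
class PropertyTiming {

    // Written to after each run so the JIT cannot discard the loop body.
    static volatile int sink;

    // Returns average nanoseconds per property lookup over the given iterations.
    static double nanosPerOp(int iterations) {
        int checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            int cp = i & 0xFFFF; // cycle through the Basic Multilingual Plane
            checksum += Character.getType(cp);
            checksum += Character.toLowerCase(cp);
            checksum += Character.getDirectionality(cp);
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return (double) elapsed / (3.0 * iterations); // three lookups per iteration
    }

    public static void main(String[] args) {
        nanosPerOp(1_000_000); // warm-up run, result discarded
        System.out.println(nanosPerOp(5_000_000) + " ns/op");
    }
}
```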


Property Data Timings

JVM            ICU4J        Java         (J-I)/I

Sun 1.4.1      89 ns/op     101 ns/op     13%

Sun 1.5.0b2    117 ns/op    102 ns/op    -13%

IBM 1.4.1      50 ns/op     66 ns/op      32%

Test machine: 1.13 GHz PIII, Win2K

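Nanoseconds per operation for character property access (getType, toLowerCase, getDirectionality) on three JVMs. The last column is the relative difference (Java - ICU4J) / ICU4J, so a positive value means ICU4J was faster.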
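The (J-I)/I column can be recomputed from the two timing columns. The sketch below does that arithmetic for the three rows of the table; the class and method names are hypothetical.

```java
// Hypothetical helper that recomputes the (J-I)/I column of the timing table.
class RelativeDifference {

    // (Java - ICU4J) / ICU4J: positive when ICU4J takes fewer ns per operation.
    static double relativeDifference(double icu4jNs, double javaNs) {
        return (javaNs - icu4jNs) / icu4jNs;
    }

    // The same value as a percentage rounded to the nearest whole number.
    static long percent(double icu4jNs, double javaNs) {
        return Math.round(100.0 * relativeDifference(icu4jNs, javaNs));
    }

    public static void main(String[] args) {
        System.out.println("Sun 1.4.1:   " + percent(89, 101) + "%");
        System.out.println("Sun 1.5.0b2: " + percent(117, 102) + "%");
        System.out.println("IBM 1.4.1:   " + percent(50, 66) + "%");
    }
}
```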


Sizes of ICU4J

• Full jar file: 2,700K

• Modular builds for common subsets

– normalizer: 420K

– collator: 1,400K

– calendar: 1,300K

– break iterator: 1,300K

– basic properties: 500K

– full properties: 1,200K

– formatting: 2,200K

– transforms: 1,500K


Using ICU4J

• Jar file, just add to class path

– Or roll into your distribution, it’s Open Source!

– Modular builds help you to trim ICU4J’s code

– Data can be trimmed to further reduce size

• Parallel APIs

– APIs on parallel classes are generally a superset

– Change import (one line change) or change class name

– Some differences unavoidable (our supplementary support for Java 1.4 can’t add API to String)


Code Examples (1)

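import com.ibm.icu.text.BreakIterator;

// text is the String to analyze; doSomething stands for your own handling.
BreakIterator b = BreakIterator.getWordInstance();
b.setText(text);
for (int pos = b.first();
     pos != BreakIterator.DONE;
     pos = b.next()) {
    doSomething(pos); // pos is the offset of a word boundary in text
}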


Code Examples (2)

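import com.ibm.icu.lang.UCharacter;

// Fragment of a method body: returns true if text contains an unpaired surrogate.
// A valid surrogate pair is read as one supplementary code point, so only a
// lone surrogate has type SURROGATE here.
int cp, pos = 0;
while (pos < text.length()) {
    cp = UCharacter.codePointAt(text, pos);
    if (UCharacter.getType(cp) == UCharacter.SURROGATE) return true;
    pos += UCharacter.charCount(cp); // advance by 1 or 2 chars
}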


Code Examples (3)

import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.Collator;
import java.util.Arrays;

// Spanish (Spain), with the traditional collation variant selected by keyword.
ULocale ulocale = new ULocale("es_ES@collation=traditional");
Collator col = Collator.getInstance(ulocale);
String[] list = ...
Arrays.sort(list, col);


Conclusion

• ICU4J is not for you if

– you have tight size constraints

– you require the Java runtime behavior

• ICU4J is for you if

– you need full compliance with current standards

– you need current or additional locale and property data

– you need customizability

– you need features missing from Java (normalization)

– you need additional performance


References

• ICU4J

– http://oss.software.ibm.com/icu4j/

• Java

– http://java.sun.com/

– http://www.ibm.com/java/

• Unicode, CLDR

– http://www.unicode.org/

– http://www.unicode.org/cldr/