26th Internationalization and Unicode Conference San Jose, September 2004
Getting Started with ICU
Vladimir Weinstein, Eric Mader
Agenda
Getting & setting up ICU4C
Using conversion engine
Using break iterator engine
Getting & setting up ICU4J
Using collation engine
Using message formats
Getting ICU4C
http://oss.software.ibm.com/icu/
Get the latest release
Get the binary package
Source download for modifying build options
CVS for bleeding edge:
:pserver:[email protected]:/usr/cvs/icu
Setting up ICU4C
Unpack binaries
If you need to build from source
– Windows:
• MSVC .NET 2003 project
• Cygwin + MSVC 6
• Cygwin only
– Unix: runConfigureICU
• make install
• make check
Testing ICU4C
Windows - run: cintltst, intltest, iotest
Unix - make check (again)
See it for yourself:
#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ures.h"

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* Open the root resource bundle from ICU's default data */
    UResourceBundle *res = ures_open(NULL, "", &status);
    if(U_SUCCESS(status)) {
        printf("everything is OK\n");
    } else {
        printf("error %s opening resource\n", u_errorName(status));
    }
    ures_close(res);
    return 0;
}
Conversion Engine - Opening
ICU4C uses open/use/close paradigm
Open a converter:
UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open(encoding, &status);
if(U_FAILURE(status)) {
    /* process the error situation, die gracefully */
}
Almost all APIs use UErrorCode for status
Check the error code!
What Converters are Available
ucnv_countAvailable() – get the number of available converters
ucnv_getAvailableName() – get the name of a particular converter
Many frameworks let you enumerate the available converters this way
Converting Text Chunk by Chunk
char buffer[DEFAULT_BUFFER_SIZE];
char *bufP = buffer;
int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                              source, sourceLen, &status);
if(U_FAILURE(status)) {
    if(status == U_BUFFER_OVERFLOW_ERROR) {
        /* len now holds the required size: retry with a heap buffer */
        status = U_ZERO_ERROR;
        bufP = (char *)malloc((len + 1) * sizeof(char));
        len = ucnv_fromUChars(cnv, bufP, len + 1,
                              source, sourceLen, &status);
    } else {
        /* other error, die gracefully */
    }
}
/* do interesting stuff with the converted text */
Converting Text Character by Character
Works only from code page to Unicode
UChar32 result;
const char *source = start;
const char *sourceLimit = start + len;
while(source < sourceLimit) {
    /* Returns one code point and advances source past its bytes */
    result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        /* die gracefully */
    }
    /* do interesting stuff with the converted text */
}
Converting Text Piece by Piece
while((!feof(f)) &&
      ((count=fread(inBuf, 1, BUFFER_SIZE, f)) > 0) ) {
    source = inBuf;
    sourceLimit = inBuf + count;
    do {
        target = uBuf;
        targetLimit = uBuf + uBufSize;
        ucnv_toUnicode(conv, &target, targetLimit,
                       &source, sourceLimit, NULL,
                       feof(f)?TRUE:FALSE, /* pass 'flush' when eof is true */
                                           /* (when no more data will come) */
                       &status);
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            // simply ran out of space - we'll reset the
            // target ptr the next time through the loop.
            status = U_ZERO_ERROR;
        } else {
            // Check other errors here and act appropriately
        }
        text.append(uBuf, target-uBuf);
        total += target-uBuf; // UChars produced so far (kept apart from the byte count)
    } while (source < sourceLimit); // while simply out of space
}
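The same piece-by-piece idea can be sketched in Java. This is not ICU code: it uses the JDK's java.nio.charset.CharsetDecoder as a stand-in for the ICU4C converter, and the class name, method name, and tiny buffer sizes are assumptions chosen for illustration. It shows the two things the loop above has to handle: an output buffer that fills up, and a multi-byte character that is split across two input chunks.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecode {
    // Decodes UTF-8 bytes that arrive chunkSize bytes at a time (chunkSize >= 1).
    static String decodeInChunks(byte[] data, int chunkSize) {
        if (data.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus the tail of an incomplete character.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        // Deliberately tiny so that the overflow path is exercised.
        CharBuffer out = CharBuffer.allocate(4);
        StringBuilder text = new StringBuilder();

        int pos = 0;
        while (pos < data.length) {
            int n = Math.min(chunkSize, data.length - pos);
            in.put(data, pos, n);
            pos += n;
            in.flip();
            boolean endOfInput = (pos == data.length); // the 'flush' flag
            CoderResult result;
            do {
                result = decoder.decode(in, out, endOfInput);
                if (result.isError()) {
                    throw new IllegalArgumentException("bad input: " + result);
                }
                drain(out, text); // empty the output buffer, then go again
            } while (result.isOverflow());
            in.compact(); // keep the bytes of a character split across chunks
        }
        // Let the decoder emit anything it is still holding.
        while (decoder.flush(out).isOverflow()) {
            drain(out, text);
        }
        drain(out, text);
        return text.toString();
    }

    private static void drain(CharBuffer out, StringBuilder text) {
        out.flip();
        text.append(out);
        out.clear();
    }

    public static void main(String[] args) {
        byte[] data = "na\u00EFve caf\u00E9 \u20AC5".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeInChunks(data, 3));
    }
}
```

The inner loop plays the role of the do/while above: an overflow result only means the output buffer is full, so the code drains it and calls the decoder again with the same input.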
Clean up!
Whatever is opened needs to be closed
Converters use ucnv_close
The sample uses a converter to convert code page data read from a file
Break Iteration - Introduction
Four types of boundaries:
– Character, word, line, sentence
The iterator points to a boundary between two characters
A boundary is reported as the index of the character that follows it
Use current() to get the boundary
Use first() to set iterator to start of text
Use last() to set iterator to end of text
Break Iteration - Navigation
Use next() to move to next boundary
Use previous() to move to previous boundary
Both return DONE if the iterator can’t move any further
Break Iteration – Checking a position
Use isBoundary() to see if a position is a boundary
Use preceding() to find the last boundary before a position
Use following() to find the first boundary after a position
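A minimal Java sketch of the navigation and position-checking calls. It uses the JDK's java.text.BreakIterator so that it runs without extra jars; ICU4J offers a class of the same name in com.ibm.icu.text, and the assumption here is that swapping the import is close to all that is needed. The class and method names are illustrative only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryWalk {
    // Collects every word boundary, walking forward with first()/next().
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(wordBoundaries(text));   // [0, 5, 6, 11]

        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        System.out.println(it.isBoundary(5));       // true: end of "Hello"
        System.out.println(it.following(2));        // 5: first boundary after offset 2
        System.out.println(it.preceding(5));        // 0: last boundary before offset 5
    }
}
```

Note that following() and preceding() return a boundary strictly after or before the given offset, so preceding(5) skips the boundary at 5 itself.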
Break Iteration - Opening
Use the factory methods:
Locale locale = …; // locale to use for break iterators
UErrorCode status = U_ZERO_ERROR;

BreakIterator *characterIterator =
    BreakIterator::createCharacterInstance(locale, status);
BreakIterator *wordIterator =
    BreakIterator::createWordInstance(locale, status);
BreakIterator *lineIterator =
    BreakIterator::createLineInstance(locale, status);
BreakIterator *sentenceIterator =
    BreakIterator::createSentenceInstance(locale, status);
Don’t forget to check the status!
Set the text
We need to tell the iterator what text to use:
UnicodeString text;
readFile(file, text);
wordIterator->setText(text);
Reuse iterators by calling setText() again.
Break Iteration - Counting words in a file:
int32_t countWords(BreakIterator *wordIterator, UnicodeString &text)
{
    // Assumes wordIterator->setText(text) has already been called
    UErrorCode status = U_ZERO_ERROR;
    UnicodeString word;
    UnicodeSet letters(UnicodeString("[:letter:]"), status);

    int32_t wordCount = 0;
    int32_t start = wordIterator->first();

    for(int32_t end = wordIterator->next();
        end != BreakIterator::DONE;
        start = end, end = wordIterator->next()) {
        // Count only the segments that contain at least one letter
        text.extractBetween(start, end, word);
        if(letters.containsSome(word)) {
            wordCount += 1;
        }
    }

    return wordCount;
}
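For comparison, a Java sketch of the same word-counting loop. It is an approximation, not the ICU sample: it uses the JDK's java.text.BreakIterator, and Character.isLetter stands in for the UnicodeSet letter test. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordCount {
    // Counts the segments between word boundaries that contain a letter.
    static int countWords(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);

        int count = 0;
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            // Punctuation, spaces and digits-only segments are skipped.
            if (text.substring(start, end).codePoints().anyMatch(Character::isLetter)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! 42 times."));
    }
}
```

The letter test matters because the word iterator also reports boundaries around spaces and punctuation, so counting every segment would overcount.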
Break Iteration – Breaking lines
int32_t previousBreak(BreakIterator *breakIterator, UnicodeString &text,
                      int32_t location)
{
    int32_t len = text.length();

    // Skip trailing whitespace and control characters
    while(location < len) {
        UChar c = text[location];

        if(!u_isWhitespace(c) && !u_iscntrl(c)) {
            break;
        }

        location += 1;
    }

    // preceding() takes an offset; previous() takes no argument
    return breakIterator->preceding(location + 1);
}
Break Iteration – Cleaning up
Use delete to delete the iterators
delete characterIterator;
delete wordIterator;
delete lineIterator;
delete sentenceIterator;
Useful Links
Homepage: http://oss.software.ibm.com/icu/
API documents: http://oss.software.ibm.com/icu/apiref/index.html
User guide: http://oss.software.ibm.com/icu/userguide/
Getting ICU4J
Easiest: pick a .jar file from the download section at http://oss.software.ibm.com/icu4j
Use the latest version if possible
For sources, download the source .jar
For bleeding edge, use the latest CVS
:pserver:[email protected]:/usr/cvs/icu4j
Setting up ICU4J
Check that you have the appropriate JDK version
Try the test code (ICU4J 3.0 or later):
import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.UResourceBundle;

public class TestICU {
    public static void main(String[] args) {
        UResourceBundle resourceBundle =
            UResourceBundle.getBundleInstance(null, ULocale.getDefault());
    }
}
Add ICU’s jar to classpath on command line
Run the test suite
Building ICU4J
Need ant in addition to JDK
Use ant to build
We also like Eclipse
Collation Engine
More on collation in a couple of hours!
Used for comparing strings
Instantiation:
ULocale locale = new ULocale("fr");
Collator coll = Collator.getInstance(locale);
// do useful things with the collator
Lives in com.ibm.icu.text.Collator
String Comparison
Works fast
You get the result as soon as it is ready
Use when you don’t need to compare the same strings many times
int compare(String source, String target);
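A small sketch of direct string comparison. It uses the JDK's java.text.Collator, which has the same compare(String, String) method, so that it runs without the ICU4J jar; with ICU4J the import would be com.ibm.icu.text.Collator instead. The class name, method name, and word list are made up for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CompareDemo {
    // Sorts a copy of the list, calling compare() for every pair it examines.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(coll); // a Collator is also a Comparator
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "apple", "Cherry");

        // Collation order: apple, banana, Cherry
        System.out.println(sortForLocale(words, Locale.ENGLISH));

        // Plain code unit order puts the capital letter first
        List<String> binary = new ArrayList<>(words);
        binary.sort(null);
        System.out.println(binary);
    }
}
```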
Sort Keys
Used when multiple comparisons are required
Indexes in databases
ICU4J has two sort key classes: CollationKey and RawCollationKey
Only compare sort keys generated by the same type of collator
CollationKey class
JDK compatible
Saves the original string
Compare keys with compareTo method
Get the bytes with toByteArray method
We used CollationKey as a key for a TreeMap structure
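A sketch of the TreeMap idea, under the assumption that the JDK's java.text.CollationKey is an acceptable stand-in for the ICU4J class (both offer compareTo, toByteArray and getSourceString). The word-counting task, class name, and method name are invented for illustration and are not the sample from the talk.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SortKeyDemo {
    // Counts occurrences of each word, keeping the words in collation order.
    static List<String> countByKey(List<String> words, Locale locale) {
        Collator coll = Collator.getInstance(locale);
        // Each string is turned into a sort key once; the map then
        // compares keys with compareTo, never the strings themselves.
        Map<CollationKey, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(coll.getCollationKey(word), 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<CollationKey, Integer> e : counts.entrySet()) {
            // The key saves the original string
            result.add(e.getKey().getSourceString() + "=" + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countByKey(
            Arrays.asList("pear", "apple", "Banana", "apple"), Locale.ENGLISH));
    }
}
```

Every key in the map comes from the same collator, which is what makes comparing them meaningful.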
RawCollationKey class
Does not store the original string
Get it by using getRawCollationKey method
Mutable class, can be reused
Simple and lightweight
Message Format - Introduction
Assembles a user message from parts
Some parts fixed, some supplied at runtime
Order different for different languages:
– English: My Aunt’s pen is on the table.
– French: The pen of my Aunt is on the table.
Pattern string defines how to assemble parts:
– English: {0}''s {2} is {1}.
– French: {2} of {0} is {1}.
Get pattern string from resource bundle
Message Format - Example
String person = …; // e.g. "My Aunt"
String place = …; // e.g. "on the table"
String thing = …; // e.g. "pen"

String pattern = resourceBundle.getString("personPlaceThing");
MessageFormat msgFmt = new MessageFormat(pattern);
Object arguments[] = {person, place, thing};
String message = msgFmt.format(arguments);

System.out.println(message);
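A self-contained variant of this example, assuming the JDK's java.text.MessageFormat in place of com.ibm.icu.text.MessageFormat and hard-coded pattern strings in place of the resource bundle lookup. The class and method names are illustrative.

```java
import java.text.MessageFormat;

class WordOrderDemo {
    // Fills the same three arguments into whatever pattern is supplied.
    static String describe(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        // In real code these would come from a resource bundle per language.
        String englishOrder = "{0}''s {2} is {1}.";
        String frenchOrder = "{2} of {0} is {1}.";

        System.out.println(describe(englishOrder, "My Aunt", "on the table", "pen"));
        // My Aunt's pen is on the table.
        System.out.println(describe(frenchOrder, "My Aunt", "on the table", "pen"));
        // pen of My Aunt is on the table.
    }
}
```

The code is identical for both languages; only the pattern changes the order of the parts.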
Message Format – Different data types
We can also format other data types, like dates
We do this by adding a format type:
String pattern = "On {0, date} at {0, time} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };

System.out.println(fmt.format(args));
This will output:
On Jul 17, 2004 at 2:15:08 PM there was a power failure.
Message Format – Format styles
Add a format style:
String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
MessageFormat fmt = new MessageFormat(pattern);
Object args[] = {new Date(System.currentTimeMillis()), // 0
                 "a power failure"                     // 1
                };

System.out.println(fmt.format(args));
This will output:
On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.
Message Format – Format style details
Format Type   Format Style   Sample Output
number        (none)         123,456.789
              integer        123,457
              currency       $123,456.79
              percent        12%
date          (none)         Jul 17, 2004
              short          7/17/04
              medium         Jul 17, 2004
              long           July 17, 2004
              full           Saturday, July 17, 2004
time          (none)         2:15:08 PM
              short          2:15 PM
              medium         2:15:08 PM
              long           2:15:08 PM PDT
              full           2:15:08 PM PDT
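The number rows of the table can be reproduced with a short sketch. It assumes the JDK's java.text.MessageFormat and a fixed Locale.US; the date and time rows are left out because their output depends on the current time, the time zone, and the locale data in use. The class and method names are hypothetical.

```java
import java.text.MessageFormat;
import java.util.Locale;

class NumberStyleDemo {
    // Formats a single argument with the given pattern for US English.
    static String render(String pattern, Object value) {
        return new MessageFormat(pattern, Locale.US).format(new Object[] {value});
    }

    public static void main(String[] args) {
        double amount = 123456.789;
        System.out.println(render("{0,number}", amount));          // 123,456.789
        System.out.println(render("{0,number,integer}", amount));  // 123,457
        System.out.println(render("{0,number,currency}", amount)); // $123,456.79
        System.out.println(render("{0,number,percent}", 0.12));    // 12%
    }
}
```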
Message Format – No format type
If no format type, data formatted like this:
Data Type   Sample Output
Number      123,456.789
Date        7/17/04 2:15 PM
String      on the table
others      output of toString() method
Message Format – Counting files
Pattern to display number of files:
There are {1, number, integer} files in {0}.
Code to use the pattern:
String pattern = resourceBundle.getString("fileCount");
MessageFormat fmt = new MessageFormat(pattern);
String directoryName = … ;
int fileCount = … ;
Object args[] = {directoryName, new Integer(fileCount)};

System.out.println(fmt.format(args));
This will output messages like:
There are 1,234 files in myDirectory.
Message Format – Problems counting files
If there’s only one file, we get:
There are 1 files in myDirectory.
Could fix by testing for special case of one file
But, some languages need other special cases:
– Dual forms
– Different form for no files
– Etc.
Message Format – Choice format
Choice format handles all of this
Use special format element:
There {1, choice, 0#are no files| 1#is one file| 1<are {1, number, integer} files} in {0}.
Using this pattern with the same code we get:
There are no files in thisDirectory.
There is one file in thatDirectory.
There are 1,234 files in myDirectory.
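A runnable sketch of the choice pattern. It assumes the JDK's java.text.MessageFormat, which accepts the same choice syntax, and a fixed Locale.US so that the thousands separator is predictable. The pattern is inlined rather than read from a resource bundle, and the class and method names are made up.

```java
import java.text.MessageFormat;
import java.util.Locale;

class FileCountDemo {
    // Argument 0 is the directory name, argument 1 the number of files.
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    static String fileMessage(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, fileCount};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(fileMessage("thisDirectory", 0));
        System.out.println(fileMessage("thatDirectory", 1));
        System.out.println(fileMessage("myDirectory", 1234));
    }
}
```

The third branch contains a nested format element, so the selected string is itself processed as a pattern and the count is formatted as 1,234.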
Message Format – Choice format patterns
Selects a string based on number
If string is a format element, process it
Splits real line into two or more ranges
Range specifiers separated by vertical bar (“|”)
Each range specifier: lower limit, separator, string
Separator indicates type of lower limit:
Separator   Lower Limit
#           inclusive
≤           inclusive
<           exclusive
Message Format – Choice pattern details
Here’s our pattern again:
There {1, choice, 0#are no files| 1#is one file| 1<are {1, number, integer} files} in {0}.
First range is [0..1)
– Really [-∞..1)
Second range is [1..1]
Third range is (1..∞]
Message Format – Other details
Format style can be a pattern string
– Format type number: use DecimalFormat pattern
– Format type date, time: use SimpleDateFormat pattern
Quoting in patterns
– Enclose special characters in single quotes
– Use two consecutive single quotes to represent one
– Example: The '{' character, the '#' character and the '' character.
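The quoting rules can be checked with a few lines of Java. This sketch assumes the JDK's java.text.MessageFormat; the class and method names are illustrative.

```java
import java.text.MessageFormat;

class QuotingDemo {
    // Runs a pattern that takes no arguments through MessageFormat.
    static String render(String pattern) {
        return MessageFormat.format(pattern);
    }

    public static void main(String[] args) {
        String pattern = "The '{' character, the '#' character and the '' character.";
        System.out.println(render(pattern));
        // The { character, the # character and the ' character.
    }
}
```

The quotes around { and # are consumed, and the doubled quote comes out as a single one.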
Useful Links
Homepage: http://oss.software.ibm.com/icu4j/
API documents: http://oss.software.ibm.com/icu4j/doc/
User guide: http://oss.software.ibm.com/icu/userguide/