Java Course 7: Text processing, Charsets & Encodings
Text processing, Charsets & Encodings
Java course - IAG0040
Anton Keks, 2011
Lecture 7, Slide 2
Java course – IAG0040, Anton Keks
String processing
● The following classes provide String processing: String, StringBuilder/StringBuffer, StringTokenizer
● All primitives can be converted to/from Strings using their wrapper classes (e.g. Integer, Float, etc.)
● java.util.regex provides regular expressions
● The java.text package provides classes and interfaces for parsing and formatting text, dates, numbers, and messages in a manner independent of natural languages
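As a minimal sketch of the wrapper-class conversions mentioned above (the class name PrimitiveConversionSketch and the helper parseOrDefault are made up for this illustration):

```java
public class PrimitiveConversionSketch {

    /** Hypothetical helper: parses an int, falling back to a default on bad input. */
    public static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback; // the text was not a valid decimal integer
        }
    }

    public static void main(String[] args) {
        // String -> primitive via the wrapper classes
        int n = Integer.parseInt("42");
        double d = Double.parseDouble("2.5");
        boolean b = Boolean.parseBoolean("true");

        // primitive -> String
        String text = Integer.toString(n) + " " + Double.toString(d) + " " + String.valueOf(b);
        System.out.println(text);                        // 42 2.5 true

        // other radixes are supported as well
        System.out.println(Integer.toString(255, 16));   // ff
        System.out.println(Integer.parseInt("ff", 16));  // 255

        System.out.println(parseOrDefault("abc", -1));   // -1
    }
}
```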
Locales
● Java also supports locales, just like most OSs
● A java.util.Locale object represents a specific geographical, political, or cultural region.
– There is a default locale, which is used by some String operations (e.g. toUpperCase) and formatters in the java.text package.
– A Locale is initialized with an ISO 2-letter language code (lower case), an ISO 2-letter country code (upper case), and a variant. The latter two are optional
● e.g. “de”, “et_EE”, “en_GB”
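A small sketch of why the locale matters for String operations. The class name LocaleSketch is made up. Locale.forLanguageTag() is used here instead of the Locale constructor, which still works but is deprecated in recent JDKs.

```java
import java.util.Locale;

public class LocaleSketch {

    /** Upper-cases text using the rules of an explicit locale. */
    public static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale estonian = Locale.forLanguageTag("et-EE");
        System.out.println(estonian.getLanguage()); // et
        System.out.println(estonian.getCountry());  // EE
        System.out.println(estonian);               // et_EE

        // Turkish has a dotted capital I (U+0130), so the result differs from English
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ENGLISH)); // TITLE
        System.out.println(upper("title", turkish));        // T, then U+0130, then TLE

        // toUpperCase() without an argument uses this default locale
        System.out.println("Default locale: " + Locale.getDefault());
    }
}
```

Passing the locale explicitly avoids surprises when the same code runs on a machine with a different default locale.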
Localization
● ResourceBundle classes can be used for localization of your programs
– ResourceBundles contain locale-specific objects, e.g. Strings
– ListResourceBundle and PropertyResourceBundle are simple implementations
– ResourceBundle.getBundle(...) returns a locale-specific bundle
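A hedged sketch of the two simple implementations. The names BundleSketch, EnglishTexts and estonianTexts, and the keys and texts, are invented for the example. To stay self-contained, the bundles are created directly; in a real program they would normally be classes or .properties files on the classpath, looked up with ResourceBundle.getBundle(...).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ListResourceBundle;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class BundleSketch {

    /** Hypothetical bundle with English texts, defined in code. */
    public static class EnglishTexts extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"greeting", "Hello"},
                {"farewell", "Goodbye"},
            };
        }
    }

    /** Builds a bundle from .properties-style text (normally a file on the classpath). */
    public static ResourceBundle estonianTexts() {
        try {
            return new PropertyResourceBundle(
                    new StringReader("greeting=Tere\nfarewell=Head aega\n"));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle en = new EnglishTexts();
        ResourceBundle et = estonianTexts();
        System.out.println(en.getString("greeting")); // Hello
        System.out.println(et.getString("greeting")); // Tere
    }
}
```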
Natural language comparison
● String.compareTo() does lexicographical comparison, i.e. compares character codes
● Collators are used for locale-sensitive comparison/sorting, according to the rules of the specific language/locale
– java.text.Collator implements Comparator<Object>, so it can be passed to Collections.sort() or Arrays.sort() for Strings
– Use Collator.getInstance(...) for obtaining one
– RuleBasedCollator is the common implementation; it allows specifying your own rules
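The difference between the two kinds of comparison can be sketched as follows (CollatorSketch and its methods are illustrative names, not part of the lecture):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSketch {

    /** Sorts by raw character codes, as String.compareTo() does. */
    public static String[] sortByCharCodes(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts according to the collation rules of the given locale. */
    public static String[] sortForLocale(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};

        // 'B' (66) sorts before 'a' (97) when comparing character codes
        System.out.println(Arrays.toString(sortByCharCodes(words)));
        // [Banana, apple, cherry]

        // the English collator orders by letter first, case second
        System.out.println(Arrays.toString(sortForLocale(Locale.ENGLISH, words)));
        // [apple, Banana, cherry]
    }
}
```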
StringBuffer vs String
● A StringBuilder (and StringBuffer) is a mutable counterpart of String
● Always use it when doing complex String processing, especially when doing a lot of concatenations in a loop
● The Java compiler has traditionally used StringBuilder internally in place of the '+' operator (since Java 9, javac emits an invokedynamic call instead, with the same result)
– String s = a + b + 25; is the same as
– String s = new StringBuilder().append(a).append(b).append(25).toString();
– There are many different append() methods for all primitive types as well as any objects. For an arbitrary object, toString() is called.
● StringBuffer, StringBuilder, and String implement CharSequence
● StringBuilder has the same methods as StringBuffer, but is a bit faster, because it is not thread-safe (not internally synchronized)
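A minimal sketch of concatenation in a loop with a single StringBuilder (joinWithCommas is a made-up helper):

```java
public class StringBuilderSketch {

    /** Builds "1, 2, ..., count" with one mutable buffer instead of many temporary Strings. */
    public static String joinWithCommas(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= count; i++) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(i); // append(int) overload, no manual conversion needed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinWithCommas(5)); // 1, 2, 3, 4, 5

        // the buffer can also be modified in place
        StringBuilder sb = new StringBuilder("abc");
        sb.insert(0, '>').reverse();
        System.out.println(sb); // cba>

        // String and StringBuilder can both be passed as CharSequence
        CharSequence seq = sb;
        System.out.println(seq.length()); // 4
    }
}
```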
Formatting and Parsing
● Locale-specific formatting and parsing is provided by java.text.
● java.text.Format is an abstract base class for
– DateFormat (SimpleDateFormat) – date and time. Calendar is used for manipulation of date and time.
– NumberFormat (ChoiceFormat, DecimalFormat) – numbers, currencies, percentages, etc
– MessageFormat – for complex concatenated messages
– all of them provide various format and parse methods
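– DateFormat and NumberFormat can be obtained for the default or a specified locale using the provided static factory methods (e.g. getInstance(), getDateInstance(), getCurrencyInstance())
– the concrete classes (SimpleDateFormat, DecimalFormat, ChoiceFormat, MessageFormat) can be created directly, specifying a custom format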
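A sketch of formatting and parsing with explicit locales. The class and method names are invented. Checked ParseExceptions are wrapped only to keep the example short.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class FormatSketch {

    public static String formatNumber(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    public static double parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Parses an ISO-style date and formats it again with a custom pattern. */
    public static String reformatDate(String isoDate) {
        try {
            Date date = new SimpleDateFormat("yyyy-MM-dd", Locale.US).parse(isoDate);
            return new SimpleDateFormat("dd.MM.yyyy", Locale.US).format(date);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static String message(String name, int items) {
        return new MessageFormat("{0} has {1} items", Locale.US)
                .format(new Object[] {name, items});
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.5, Locale.US));      // 1,234,567.5
        System.out.println(formatNumber(1234567.5, Locale.GERMANY)); // 1.234.567,5
        System.out.println(parseNumber("1.234,5", Locale.GERMANY));  // 1234.5
        System.out.println(reformatDate("2011-03-15"));              // 15.03.2011
        System.out.println(message("Cart", 3));                      // Cart has 3 items
    }
}
```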
Regular expressions
● Regular expressions allow easy searching and matching of textual data. They are built into many languages, like Perl and PHP, and are widely used on the Unix command line
● Regular expression classes are in the java.util.regex package.
● In Java, they are written as Strings, but must be 'compiled' by Pattern.compile() before use.
● However, many String methods provide convenient 'shortcuts', like split(), matches(), replaceFirst(), replaceAll(), etc.
● Pattern is an immutable compiled representation, which can be used for creation of mutable Matcher objects.
● Use Patterns directly in case you intend to reuse the regexp
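A sketch contrasting a reused, precompiled Pattern with the String shortcut methods (RegexSketch and findNumbers are illustrative names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {

    // Compiled once and reused; a Pattern is immutable
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Returns all runs of digits found in the text. */
    public static List<String> findNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(text); // a Matcher is mutable and tied to one input
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("room 12, floor 3, seat 457")); // [12, 3, 457]

        // String shortcuts compile the regexp on every call
        System.out.println("2011-03-15".matches("\\d{4}-\\d{2}-\\d{2}"));   // true
        System.out.println(Arrays.toString("a, b ,c".split("\\s*,\\s*")));  // [a, b, c]
        System.out.println("a1b22c".replaceAll("\\d+", "#"));               // a#b#c
    }
}
```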
Regular Expressions (cont)
● Read the javadoc of the Pattern class!
– . (a dot) matches any character (line terminators only with the DOTALL flag)
– [...] matches any one of the characters listed inside the brackets
– \s, \S, \d, \w, etc. save you typing sometimes (note: double escaping is needed within String literals, e.g. “\\s”)
– ?, +, * specify the number of occurrences of the preceding element: 0 or 1, 1 or more, any number respectively
– () groups a part of the expression (captured groups can be accessed individually)
– | means 'or', e.g. (dog|cat) matches both “dog” and “cat”
– ^ and $ match the beginning and end of the input, or of each line with the MULTILINE flag
– \b matches a word boundary
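The constructs above can be tried out with a few one-liners. This is only a sketch; the helper names and sample inputs are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexConstructsSketch {

    /** True if the whole input matches the regexp. */
    public static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns the first captured group of the first match, or null if there is none. */
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(matches("c.t", "cat"));          // true  (dot = any character)
        System.out.println(matches("[bc]at", "rat"));       // false (r is not listed)
        System.out.println(matches("colou?r", "color"));    // true  (u is optional)
        System.out.println(matches("\\d+", ""));            // false (+ needs at least one)
        System.out.println(matches("\\d*", ""));            // true  (* allows zero)
        System.out.println(matches("(dog|cat)s?", "cats")); // true

        // groups are numbered from 1; group 0 is the whole match
        System.out.println(firstGroup("(\\w+)@(\\w+)", "mail anton@example now")); // anton

        // \b prevents matching inside a longer word
        System.out.println("one cat, concat".replaceAll("\\bcat\\b", "dog")); // one dog, concat

        // ^ matches at line starts only with MULTILINE
        System.out.println(Pattern.compile("^b").matcher("a\nb").find());                    // false
        System.out.println(Pattern.compile("^b", Pattern.MULTILINE).matcher("a\nb").find()); // true
    }
}
```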
Scanning
● java.util.Scanner can be used for parsing Strings, InputStreams, Readers, or Files
● It uses either built-in or custom regular expressions for parsing input data, and it is sensitive to either the default or a specified Locale
● The default delimiter is a run of whitespace characters; a custom delimiter may be set using the useDelimiter() method
● It implements Iterator<String>, therefore has hasNext() and next() methods, as well as various type-specific methods, e.g. hasNextInt(), nextInt(), etc., and finding and skipping facilities
● Can be used for parsing the standard input:
– Scanner s = new Scanner(System.in); int n = s.nextInt();
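A sketch of scanning a String instead of the standard input, with the default and a custom delimiter (ScannerSketch, sumInts and sumCsv are made-up names). The locale is fixed to Locale.US so that number parsing does not depend on the machine's default locale.

```java
import java.util.Locale;
import java.util.Scanner;

public class ScannerSketch {

    /** Sums all whitespace-separated tokens that are integers, skipping everything else. */
    public static int sumInts(String input) {
        Scanner s = new Scanner(input).useLocale(Locale.US);
        int sum = 0;
        while (s.hasNext()) {
            if (s.hasNextInt()) {
                sum += s.nextInt();
            } else {
                s.next(); // not an int, skip the token
            }
        }
        s.close();
        return sum;
    }

    /** Sums comma-separated decimal numbers using a custom delimiter. */
    public static double sumCsv(String line) {
        Scanner s = new Scanner(line).useDelimiter("\\s*,\\s*").useLocale(Locale.US);
        double sum = 0;
        while (s.hasNextDouble()) {
            sum += s.nextDouble();
        }
        s.close();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("3 apples and 4 pears, 10 total")); // 17
        System.out.println(sumCsv("1.5, 2.5 ,4"));                     // 8.0
    }
}
```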
Charsets and encodings
● In the 21st century, there is no excuse for any programmer not to know charsets and encodings well
● Charsets map glyphs (symbols) to numeric codes
● Charsets are represented by character encodings (actual bits and bytes that are stored in files)
● Fonts must support charsets in order to display texts in respective encodings properly
● Example:
– Glyph (symbol): A
– Numeric code: 65 (ASCII charset)
– Encoding: 0x41 == 1000001 b (ASCII 7-bit encoding)
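The same example can be reproduced in Java as a small sketch (AsciiSketch and its helpers are illustrative names):

```java
import java.nio.charset.StandardCharsets;

public class AsciiSketch {

    /** Numeric code of a character. */
    public static int codeOf(char c) {
        return c;
    }

    /** Bits of a single byte as text, without leading zeros. */
    public static String bits(byte b) {
        return Integer.toBinaryString(b & 0xFF);
    }

    public static void main(String[] args) {
        char symbol = 'A';
        byte[] encoded = String.valueOf(symbol).getBytes(StandardCharsets.US_ASCII);

        System.out.println(codeOf(symbol));           // 65
        System.out.printf("0x%02X%n", encoded[0]);    // 0x41
        System.out.println(bits(encoded[0]));         // 1000001
    }
}
```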
ASCII
● American Standard Code for Information Interchange
● Created in 1963, ANSI in 1967, ISO-646 in 1972
● Allowed for text exchange between computers
● Only 7 bits are defined, nowadays called US-ASCII
● 0-31 and 127 – control chars
● 32-126 – printable (32 is the space)
● Was designed for the English language
ASCII extensions
● ASCII is enough for only a few languages: Latin, English, Hawaiian and Swahili
● For most other languages a number of 8-bit ASCII extensions were developed, incompatible with each other
● ISO-8859 was an attempt to standardize them by defining the upper 128 characters in 8-bit wide bytes
– All of them have the lower 128 characters (7 bits) the same as ASCII
– ISO-8859-1 (Latin-1) – Western European
– ISO-8859-4 – Northern European, ISO-8859-13 – Baltic, WIN-1257 – MS Baltic (modified ISO)
– ISO-8859-5, KOI8-R – Cyrillic, WIN-1251 – MS Cyrillic (different from ISO)
– Many of them are still used today in legacy systems or formats
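The incompatibility can be sketched by decoding one and the same byte with several of these charsets. LegacyCharsetSketch is a made-up name. Only ISO-8859-1 is guaranteed to exist on every JVM, so the other charsets are checked with Charset.isSupported() first.

```java
import java.nio.charset.Charset;

public class LegacyCharsetSketch {

    /** Decodes a single byte value using the named charset. */
    public static String decode(int byteValue, String charsetName) {
        return new String(new byte[] {(byte) byteValue}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-1", "ISO-8859-5", "KOI8-R", "windows-1251"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + " is not available on this JVM");
                continue;
            }
            // the upper half (here 0xE4) means a different character in each charset
            String upper = decode(0xE4, name);
            // the lower half (here 0x41) is always the ASCII character
            String lower = decode(0x41, name);
            System.out.printf("%-12s 0xE4 -> U+%04X, 0x41 -> %s%n",
                    name, (int) upper.charAt(0), lower);
        }
        // ISO-8859-1 prints U+00E4; where available, ISO-8859-5 gives U+0444,
        // KOI8-R gives U+0414 and windows-1251 gives U+0434
    }
}
```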
Unicode (UCS, ISO-10646)
● Unicode solves the problem of incompatible charsets
● Unicode defines standardized numeric codes (code points) for most glyphs used in the world
– Code points are abstract – they don't define representation
– First 256 code points correspond to ISO-8859-1
– 16 bit BMP (Basic Multilingual Plane) – most modern languages (including Chinese, Japanese, etc)
– More planes for other scripts (mathematical symbols, musical notation, ancient alphabets, etc)
● Apart from UCS, Unicode defines formatting and combining rules as well (e.g. for bidirectional text)
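Two of the claims above can be checked with a short sketch (the class and method names are invented): the first 256 code points coincide with ISO-8859-1, and code points outside the BMP exist.

```java
import java.nio.charset.StandardCharsets;

public class CodePointSketch {

    /** Checks that every byte decoded as ISO-8859-1 yields the code point with the same number. */
    public static boolean latin1MatchesFirst256CodePoints() {
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1);
            if (s.codePointAt(0) != b) {
                return false;
            }
        }
        return true;
    }

    /** True if the code point belongs to the Basic Multilingual Plane (U+0000..U+FFFF). */
    public static boolean inBmp(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(latin1MatchesFirst256CodePoints()); // true
        System.out.println(inBmp(0x4E2D));  // true  (a CJK ideograph)
        System.out.println(inBmp(0x1D11E)); // false (musical symbol G clef)
    }
}
```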
Unicode encodings
● Define representation of code points in bits and bytes
● Fixed-width UCS-2 (2 bytes, BMP only) and UCS-4 (4 bytes)
● UTF (Unicode Transformation Format)
– All of them can encode any Unicode code point
– UTF-8 – variable size from 1 to 4 bytes (the original design allowed up to 6; usually no longer than 3 bytes, compatible with ASCII), the most popular and compact
– UTF-16 – 2 or 4 bytes, 2 bytes for BMP code points, 4 bytes for other planes
– UTF-32 – constant size, 4 bytes per character, 'raw' unicode
– UTF-7 – 7-bit safe encoding (less popular nowadays)
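The sizes listed above can be observed with a sketch like the following (EncodingSizeSketch and encodedLength are made-up names). UTF-16BE is used instead of UTF-16 so that no byte order mark is added to the output. UTF-32BE is not guaranteed on every JVM, so it is checked first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {

    /** Number of bytes the text occupies in the given encoding. */
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                   // U+0041, ASCII
            "\u00E4",                              // U+00E4, a with diaeresis
            "\u20AC",                              // U+20AC, euro sign
            new String(Character.toChars(0x1F600)) // U+1F600, outside the BMP
        };
        for (String s : samples) {
            StringBuilder line = new StringBuilder();
            line.append(String.format("U+%04X", s.codePointAt(0)));
            line.append(" UTF-8: ").append(encodedLength(s, StandardCharsets.UTF_8));
            line.append(" UTF-16: ").append(encodedLength(s, StandardCharsets.UTF_16BE));
            if (Charset.isSupported("UTF-32BE")) {
                line.append(" UTF-32: ").append(encodedLength(s, Charset.forName("UTF-32BE")));
            }
            System.out.println(line);
        }
        // UTF-8 lengths are 1, 2, 3, 4 bytes; UTF-16 lengths are 2, 2, 2, 4 bytes
    }
}
```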
Charsets and Java
● char and String are UTF-16
– Beware that length(), indexOf(), etc. operate on chars (UTF-16 code units, which may be surrogates), not Unicode code points, and can therefore return 'logically wrong' values in case of 4-byte characters – this was a performance decision
● Encoding conversions are built-in
– Encoded text is binary data for Java, therefore stored in bytes
– There always exists the default encoding (historically the one the OS uses; UTF-8 since Java 18)
– Charset class is provided for encoding/decoding, enumeration, etc
– s.getBytes(...) - encodes a String
– new String(...) - decodes raw bytes to a String
– System.out and System.in automatically convert to/from the default encoding
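A closing sketch of these points: chars versus code points, and encoding/decoding with an explicit charset (the class and helper names are made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JavaStringEncodingSketch {

    /** Number of Unicode code points, as opposed to length(), which counts chars. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Encodes the text to bytes and decodes it back with the same charset. */
    public static String roundTrip(String s, Charset charset) {
        byte[] encoded = s.getBytes(charset); // String -> bytes
        return new String(encoded, charset);  // bytes -> String
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which needs a surrogate pair in UTF-16
        String text = "a" + new String(Character.toChars(0x1F600));
        System.out.println(text.length());    // 3 (chars)
        System.out.println(codePoints(text)); // 2 (code points)

        // UTF-8 can represent everything, so the round trip is lossless
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true

        // US-ASCII cannot represent U+00E4; it is replaced by '?'
        System.out.println(roundTrip("\u00E4", StandardCharsets.US_ASCII));       // ?

        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

Specifying the charset explicitly in getBytes(...) and new String(...) keeps the result independent of the default encoding of the machine.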