Oracle Unicode

Unicode Support in Oracle9i Database

Topics

• New Customer Requirements

• Existing Unicode Support in 8i

• New Unicode Features in 9i

• Character Semantics Support in 9i

• NCHAR as a reliable Unicode data type in 9i

• VARCHAR2 vs. NVARCHAR2 for Unicode

• UTF-8 or UTF-16 for NCHAR

• Unicode Access Interface

• Unicode Migration and Compatibility

• Conclusion

Requirements

• Consistent character length semantics

– Column definition close to visual length

– SQL standard recommends varchar2(10) to hold 10 characters

• Reliable Unicode data type

– To be independent of the database character set

– To enable gradual migration to Unicode

– To enable third-party vendors/components for Unicode

• Easy programming for other environments

– Java, XML, Windows NT

• High information density for performance

– Storage efficiency for Asian or European data

Existing Unicode Support in 8i

• UTF-8 as database character set

– UTF8 for ASCII based platform

– UTFE for EBCDIC based platform

• UTF-8 as client character set

• UTF-16 for OCI bind/define buffer

• UCHAR/UVARCHAR as UTF-16 in Pro*C

• UTF-16 for ODBC and OLEDB

• UTF-16 for JDBC

• Unicode binary sort

New Unicode features in Oracle9i

• Character semantics support for text columns

• NCHAR as a reliable Unicode data type

• UTF-16 support for Oracle Call Interface (OCI)

• Complete Unicode support for ODBC/OLEDB/JDBC

• Unicode and ISO 14651 based multilingual sort

• Unicode-enabled Oracle utilities such as SQL*Loader

• Unicode-based locale builder for locale customization

Length Semantics Support in Oracle9i

• New length semantics syntax

– CHAR ( size [BYTE | CHAR] )

– VARCHAR2 ( size [BYTE | CHAR] )

• It meets the ANSI SQL standard

– The standard defines size in characters, but most vendors implemented it in bytes

• It fulfills customer requirements

– Portable database schema

– Character set independent

– Same data size across server, client, and middle tier

– Easy migration to Unicode support
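
The gap between byte and character length semantics can be illustrated outside the database. What follows is a minimal Python sketch, not Oracle code: the helper names `fits_byte_semantics` and `fits_char_semantics` are hypothetical, and UTF-8 is assumed as the storage encoding.

```python
def fits_byte_semantics(value: str, size: int, encoding: str = "utf-8") -> bool:
    """Model of a column sized in bytes: size limits the encoded byte count."""
    return len(value.encode(encoding)) <= size


def fits_char_semantics(value: str, size: int) -> bool:
    """Model of a column sized in characters: size limits the character count."""
    return len(value) <= size


ascii_name = "Unicode DB"              # 10 characters, 10 bytes in UTF-8
german_name = "Größenmaße"             # 10 characters, 13 bytes in UTF-8
japanese_name = "東京都港区赤坂一丁目"  # 10 characters, 30 bytes in UTF-8

for name in (ascii_name, german_name, japanese_name):
    print(name, fits_byte_semantics(name, 10), fits_char_semantics(name, 10))
```

All three values are ten characters long, yet only the ASCII one fits a ten-byte limit, which is why a schema sized in characters stays portable across character sets.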

Character Semantics Support in 9i - Cont.

• Character semantics column

– Explicit with qualifier: varchar2(30 char)

– Implicit: varchar2(30) with NLS_LENGTH_SEMANTICS set to ‘char’

• Same semantics support for PL/SQL variables

• Character length constraint checking

• SQL functions for different flavors of length semantics in Unicode

– like/like2/like4/likec

– lengthb/length/length2/length4/lengthc

– substrb/substr/substr2/substr4/substrc

– compose/decompose
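
The different length flavors can be approximated in a few lines of Python. This is a hedged sketch of the ideas only: the function names are hypothetical, the mapping to LENGTHB, LENGTH2, LENGTH4 and LENGTHC is an assumption about what each flavor counts, and `length_c` is a rough approximation that simply skips combining marks.

```python
import unicodedata


def length_b(s: str, encoding: str = "utf-8") -> int:
    """Bytes in the given encoding (the idea behind a byte-based length)."""
    return len(s.encode(encoding))


def length_2(s: str) -> int:
    """UTF-16 code units: a supplementary character counts as 2."""
    return len(s.encode("utf-16-le")) // 2


def length_4(s: str) -> int:
    """Code points: a supplementary character counts as 1."""
    return len(s)


def length_c(s: str) -> int:
    """Approximate 'complete characters' by not counting combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# 'e' + COMBINING ACUTE ACCENT + one supplementary character (U+10400)
sample = "e\u0301\U00010400"

# Composition and decomposition, here via Unicode normalization forms
composed = unicodedata.normalize("NFC", "e\u0301")
decomposed = unicodedata.normalize("NFD", "\u00e9")

print(length_b(sample), length_2(sample), length_4(sample), length_c(sample))
```

The same three-code-point string yields four different lengths depending on the unit counted, which is the reason a family of functions exists instead of a single one.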

Character Semantics Support in 9i - Cont.

• UTF-16 semantics

– UTF8 encodes a surrogate pair as two three-byte sequences

– It has the same semantics as UTF-16, so varchar2(10 char) matches wchar(10)

– It has the same binary sorting order as UTF-16

• UTF-32 semantics

– AL32UTF8 follows the UTF-8 standard by encoding a supplementary character (a UTF-16 surrogate pair) in 4 bytes

– It has the same semantics as UTF-32 in code points and the same binary order

• Conversion between UTF8 and AL32UTF8

– AL32UTF8 can be used on the client for UTF-8 compliance
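
The two encodings of a supplementary character, and the sort orders that follow from them, can be reproduced with a short Python sketch. The function names are hypothetical, and the "UTF8 style" encoder is an assumption based on the description above (each UTF-16 code unit encoded separately), not Oracle's implementation.

```python
def encode_utf8_style(s: str) -> bytes:
    """Encode each UTF-16 code unit on its own: a surrogate pair takes 3 + 3 bytes."""
    utf16 = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' allows a lone surrogate code unit to be encoded.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def encode_al32utf8_style(s: str) -> bytes:
    """Standard UTF-8: a supplementary character takes 4 bytes."""
    return s.encode("utf-8")


supplementary = "\U00010400"  # outside the Basic Multilingual Plane
bmp_high = "\uff5e"           # FULLWIDTH TILDE, near the top of the BMP
pair = [bmp_high, supplementary]

# Binary sort order of the same two characters under each encoding
order_utf16 = sorted(pair, key=lambda c: c.encode("utf-16-be"))
order_utf8_style = sorted(pair, key=encode_utf8_style)
order_al32 = sorted(pair, key=encode_al32utf8_style)
order_utf32 = sorted(pair, key=lambda c: c.encode("utf-32-be"))

print(encode_utf8_style(supplementary).hex(), encode_al32utf8_style(supplementary).hex())
```

In this sketch the six-byte form sorts the supplementary character before the high BMP character, as UTF-16 does, while the four-byte form sorts it after, as UTF-32 does.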

Reliable Unicode Data Type Support

• NCHAR, NVARCHAR2, NCLOB

- 8i NCHAR: any fixed width character set

- Defined in SQL standard

• Unicode character set encoding

- UTF-8, UTF-16

- Independent of DB character set

• Character length semantics only

- Avoid migration issues in the future

• Support Unicode in non-Unicode database

Inter-operability With Other Data Types

• Explicit Conversion Functions

- TO_NCHAR()

- TO_CHAR()

- ROWIDTONCHAR()

- CHARTOROWID()

- TO_CLOB()

- TO_NCLOB()

- TO_NUMBER()

- TO_DATE()

- TO_TIMESTAMP()

- TO_TIMESTAMP_TZ()

- TO_YMINTERVAL()

…...

Inter-operability - cont.

• Implicit Conversion

- Between NCHAR and CHAR types

- Between NCHAR and NUMBER, DATE, ROWID, RAW, CLOBs etc.

• Conversion Direction:

- Insert/select into/update/assignment operations: convert to target

- Comparison, concatenation: convert SQL CHAR to SQL NCHAR to avoid any data loss

- SQL function: convert to first string parameter

• Makes migration to SQL NCHAR much easier

Data Loss Exception Handling

• NLS Parameter:

- NLS_NCHAR_CONV_EXCP

- Can be changed dynamically in each session

- Effective for both explicit and implicit conversions

• Smoothness of operation vs. accuracy of operation
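
The trade-off between smooth and accurate operation has a close analogy in any character set conversion: either a character that cannot be represented is silently replaced, or the conversion fails. The Python sketch below shows that choice only. The `convert` helper and its `raise_on_loss` flag are hypothetical, and how the flag corresponds to the values of NLS_NCHAR_CONV_EXCP is an assumption, since the slide does not list them.

```python
def convert(text: str, target_charset: str, raise_on_loss: bool) -> bytes:
    """Convert text to target_charset.

    raise_on_loss=True  -> fail when a character cannot be represented
    raise_on_loss=False -> substitute a replacement character and continue
    """
    errors = "strict" if raise_on_loss else "replace"
    return text.encode(target_charset, errors=errors)


unicode_value = "price: 10\u20ac"  # EURO SIGN is not part of ISO 8859-1

# Smooth operation: the statement succeeds but the euro sign is lost.
lossy = convert(unicode_value, "iso-8859-1", raise_on_loss=False)

# Accurate operation: the data loss is reported instead.
try:
    convert(unicode_value, "iso-8859-1", raise_on_loss=True)
    failed = False
except UnicodeEncodeError:
    failed = True

print(lossy, failed)
```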

SQL Unicode String Processing

• Same level of support as CHAR

- NCHAR can be used the same way as CHAR

• SQL functions support for NCHAR

- SUBSTR, LENGTH, INSTR, LIKE, CONCAT, LPAD/RPAD, LTRIM, RTRIM, NLS_SORT, NLS_UPPER, NLS_LOWER etc.

- UNISTR, ASCIISTR

• Mixed type arguments

- CONCAT(nchar,char) - result type is based on first string parameter

• Easy programming
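
UNISTR and ASCIISTR convert between Unicode text and an ASCII-only form with `\XXXX` escapes. The sketch below imitates that round trip in Python under stated assumptions: the names `asciistr_like` and `unistr_like` are hypothetical, escapes are taken to be four-digit UTF-16 code units, and literal backslashes in the input are not handled.

```python
import re


def asciistr_like(s: str) -> str:
    """Replace each non-ASCII UTF-16 code unit with a \\XXXX escape."""
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


def unistr_like(s: str) -> str:
    """Turn \\XXXX escapes back into characters."""
    def to_unit(match):
        return chr(int(match.group(1), 16))

    with_units = re.sub(r"\\([0-9A-Fa-f]{4})", to_unit, s)
    # Round-trip through UTF-16 so that a surrogate pair becomes one character.
    return with_units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


escaped = asciistr_like("caf\u00e9")
restored = unistr_like(escaped)
print(escaped, restored)
```

An escaped form like this lets Unicode literals travel through scripts and tools that are limited to ASCII.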

Unicode Database vs. Unicode Data Type

• Code point semantics for UTF8 make an Oracle database a virtual UTF-16 database

– There is no need to use NVARCHAR2 except for more compact storage of Asian data

– The migration effort is minimal as there is no need to convert VARCHAR2 into NVARCHAR2

– It is recommended to use one-step migration for a new system

• NCHAR/NVARCHAR2

– It allows incremental migration to Unicode

– It is always a Unicode column

– It can use UTF-16 encoding natively

NCHAR Choice between UTF8 and AL16UTF16

• UTF-8

- ASCII compatible

- Internet friendly: HTML, XML etc.

- More space efficient for western languages

• UTF-16

- More space efficient for Asian languages

- Faster in string processing

- Supported by Java, Windows etc.
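
The space-efficiency claims are easy to check by measuring encoded sizes. The following Python sketch uses three short sample strings chosen purely for illustration; the helper name `storage_bytes` is hypothetical.

```python
def storage_bytes(text: str) -> dict:
    """Bytes needed to store text in UTF-8 and in UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {
    "English": "database",  # 8 ASCII characters
    "German": "Größe",      # 5 characters, 2 of them non-ASCII
    "Japanese": "日本語",    # 3 CJK characters
}

for language, text in samples.items():
    print(language, storage_bytes(text))
```

ASCII text doubles in size under UTF-16, while each CJK character shrinks from three bytes to two, so the better choice depends on the dominant script in the data.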

Programming Interfaces

• OCI Unicode Support

- Support UTF-16 bind/define buffers

- Unicode metadata, SQL_TEXT, and error messages through mode parameter

- Unicode interface support independent of server or client character set

- Character length semantics

• PL/SQL

• Pro*C/C++: Unicode support through UCHAR, UVARCHAR

• JDBC

• ODBC/OLEDB

Migration, Conversion and Compatibility

• Old NCHAR to 9i NCHAR migration

• Migration to Unicode Columns

ALTER TABLE tname MODIFY (col NCHAR(n))

• Convert whole database to Unicode database

Migration, Conversion and Compatibility - Cont.

• Character length semantics

- Database schema

ALTER TABLE tname MODIFY (col CHAR(n CHAR))

- Modify application to be in sync with database length semantics

Example: PL/SQL migration

1. Set NLS_LENGTH_SEMANTICS to CHAR

2. Apply %ROWTYPE, or explicit CHAR qualifier

3. Change substrb, lengthb and instrb to substr, length and instr
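
Step 3 matters because byte-oriented truncation can split a multibyte character. The Python sketch below shows the general hazard with UTF-8 data. The helper names are hypothetical, and Oracle's own handling of partial characters in its byte-oriented functions is not modeled here.

```python
def substr_bytes(value: str, n_bytes: int, encoding: str = "utf-8") -> bytes:
    """Byte-oriented truncation: may cut a multibyte character in half."""
    return value.encode(encoding)[:n_bytes]


def substr_chars(value: str, n_chars: int) -> str:
    """Character-oriented truncation: always returns whole characters."""
    return value[:n_chars]


city = "東京都"  # 3 characters, 9 bytes in UTF-8

cut_by_bytes = substr_bytes(city, 4)  # one full character plus one stray byte
cut_by_chars = substr_chars(city, 2)  # two whole characters

try:
    cut_by_bytes.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

print(cut_by_chars, byte_cut_is_valid)
```

Application code written against byte counts therefore needs review once the schema moves to character semantics, even where it worked with single-byte data.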

Summary

• Flexible and complete Unicode support

– Character semantics on UTF-8 or Unicode data type

– All major access interfaces support Unicode

• High performance through high information density

– UTF-8 for Western scripts

– UTF-16 for Asian scripts

• Easy programming

– Same length semantics between database and other components

• Easy migration

– One-step migration or gradual migration
