db2 unicode-dbcs

DB2 UDB for z/OS & OS/390
Character Conversion & Unicode Fundamentals
Chris Crone
Senior Software Engineer / [email protected]

This is a basic survival lesson in character conversion and Unicode. What is a CCSID? What does DB2 do with it? Why do I care? What does Unicode do for me? If you care about data integrity and run on more than one operating system in a global business, this is the basic information you need to survive.

Agenda

Character Conversion Fundamentals
Unicode Fundamentals
Unicode Support in DB2 UDB for z/OS & OS/390 V7
Things to Look Out For
Example Scenarios
Summary

Chris Crone 04:07 PM 04/16/02 1-2

This is the agenda for a basic survival lesson in character conversion and Unicode. What is a CCSID, what does DB2 do with it, why do I care, and what does Unicode do for me? In a global economy, this is the basic information you need to survive.

While these are the fundamentals, they are not simple, and there are many misconceptions. Getting some of the information back correctly does not guarantee that you are always getting the data back without loss.

Terminology

For the purposes of this presentation:

UNICODE
  UTF-8
  UTF-16

ASCII
  ASCII is a generic term that refers to all ASCII CCSIDs that DB2 currently supports

EBCDIC
  EBCDIC is a generic term that refers to all EBCDIC CCSIDs that DB2 currently supports

CCSID
  Coded Character Set Identifier
  Used by DB2 to tag string data

When I use the term UNICODE, I mean the UTF-16 and UTF-8 encodings. See www.unicode.org for more.

When I use the term ASCII, I mean any generic ASCII CCSID (like 850, 819, 437), whether Single Byte Character Set (SBCS), Mixed, or Double Byte Character Set (DBCS).

When I use the term EBCDIC, I mean any generic EBCDIC CCSID (like 500, 37), whether SBCS, Mixed, or DBCS.

CCSIDs are used by DB2 to tag string data. A CCSID precisely defines the encoding of the data.

Character Conversion Fundamentals


When we store data in some EBCDIC CCSID and display it on a PC in some ASCII CCSID, it must be translated. There are many translations, and we're just getting started.

What is Character Conversion

[Figure: code page charts for CCSID 437 and CCSID 1252]

This slide depicts two common PC codepages. Note the differences in them.

Codepage 1252 defines things like the Euro, and the full Latin-1 character set with all the accented characters.

Codepage 437 defines only a partial list of the Latin-1 accented characters.

Because the codepages do not contain exactly the same set of characters, data cannot necessarily be converted from one to the other without loss.

Note also that the same characters are assigned to different code points. Look at the ae ligature: æ and Æ are '91'x and '92'x in CCSID 437, but 'E6'x and 'C6'x in CCSID 1252.
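The differences between the two codepages can be checked with a few lines of Python. This is only a sketch: it assumes that Python's built-in "cp437" and "cp1252" codecs are acceptable stand-ins for CCSIDs 437 and 1252, and the helper names are made up for illustration.

```python
# Compare how two PC code pages encode the same characters.
# Assumption: Python's "cp437" and "cp1252" codecs stand in for CCSIDs 437 and 1252.

def code_points(text, codec):
    """Return the encoded bytes of text as an uppercase hex string."""
    return text.encode(codec).hex().upper()

def is_representable(text, codec):
    """True if every character in text exists in the target code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for ch in ("\u00e6", "\u00c6"):  # ae ligature, lowercase and uppercase
    print(ch, "437:", code_points(ch, "cp437"), "1252:", code_points(ch, "cp1252"))

# The Euro exists in 1252 but not in 437, so a 1252 -> 437 conversion can lose data.
print("Euro in 1252:", is_representable("\u20ac", "cp1252"))
print("Euro in 437: ", is_representable("\u20ac", "cp437"))
```

The same character lands on a different byte value in each codepage, and a character that exists in only one of them has nowhere to go in the other.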

Conversion Methods

Native DB2
  SYSIBM.SYSSTRINGS (V2.3)

ICONV
  Uses LE base services (V6) - Non-Strategic
  Requires OS/390 V2R9

OS/390 V2 R8/R9/R10 & z/OS support for Unicode (V7)
  Conversion Services
  Requires OS/390 V2R8 and above + APAR OW44581
  Code and program directory
    http://www6.software.ibm.com/dl/os390/unicodespt-p
  Documentation
    http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunpde00.pdf
    http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunuge00.pdf
  Information APAR II13048 and II03049


DB2 uses three methods for conversion:

SYSSTRINGS - These are the conversion services that were introduced in DB2 V2R3 and the ones most people are familiar with.

ICONV - Introduced in DB2 V6 and requiring OS/390 V2R9 and above. This was our first attempt at leveraging OS/390 infrastructure to perform conversion. It is non-strategic because the OS/390 V2 R8/R9/R10 support for Unicode provides more functionality with better performance.

OS/390 V2 R8/R9/R10 support for Unicode - Starting in V7, DB2 will be leveraging this service for most future character conversion support.

Native DB2

Based on SYSIBM.SYSSTRINGS
High performance
Added with DB2 V2R3
Support for
  Single byte, Mixed, Double byte
  ASCII, EBCDIC
Uses a combination of 256-byte conversion tables and special two-stage lookup tables

The native conversion services that DB2 ships are based on support that was added in DB2 V2R3.

They are high performance and rely on a cached copy of SYSIBM.SYSSTRINGS rows to perform conversion.

A large number of ASCII and EBCDIC single byte, mixed, and double byte conversions are supported.

These conversions use a combination of conversion tables contained in the TRANSTAB field of SYSSTRINGS and two-stage conversion tables (for mixed and double byte conversions) specified by the TRANSPROC field of SYSSTRINGS.

Central repository for OS/390 system
High performance
  Uses HW instructions available in z900 GA2
  See appendix for complete list of HW instructions
  Uses page fixed tables in a data space
Conversion image built by off-line utility
  CUNMIUTL - see sample in hlq.SCUNJCL (CUNJIUTL)
Administered via OS/390 console
  SET UNI
  DISPLAY UNI

OS/390 V2 R8/R9/R10 & z/OS support for Unicode


With the introduction of the OS/390 support for Unicode system service, OS/390 now has a central repository for conversion that can be used by applications, middleware, and subsystems.

This service is designed to be high performance and utilizes new HW instructions and page fixed conversion tables to perform the conversions.

This service uses a conversion image that is built by the off-line utility CUNMIUTL. A customer specifies which conversions will be supported by the conversion image. The conversion image is managed via the OS/390 console, not DB2.

The SET UNI command specifies the image to be loaded.
The DISPLAY UNI command displays information about the currently loaded conversion image.

Conversion Services Example

//CUNMIUTL EXEC PGM=CUNMIUTL
//SYSPRINT DD SYSOUT=*
//TABIN    DD DISP=SHR,DSN=hlq.SCUNTBL
//SYSIMG   DD DSN=hlq.IMAGES(CUNIMG00),DISP=SHR
//SYSIN    DD *
 /********************************************
  * INPUT STATEMENTS FOR THE IMAGE GENERATOR *
  ********************************************/
 CONVERSION 00850,01047,ER; /*ASCII -> EBCDIC */
 CONVERSION 01047,00850,ER; /*EBCDIC -> ASCII */
 CONVERSION 00037,1200,ER; /*EBCDIC 037 -> UCS-2 */
 CONVERSION 1200,00037,ER; /*UCS-2 -> EBCDIC 037*/
 CONVERSION 00500,1200,ER; /*Latin-1 EBC -> UCS-2 */
 CONVERSION 1200,00500,ER; /*UCS-2 -> Latin-1 EBC*/
 CONVERSION 01047,1200,ER; /*EBCDIC 1047 -> UCS-2 */
 CONVERSION 1200,01047,ER; /*UCS-2 -> EBCDIC 1047*/
 CONVERSION 01208,1200,ER; /*UnicodeCCSID-> UCS-2 */
 CONVERSION 1200,01208,ER; /*UCS-2 -> UnicodeCCSI*/
 CONVERSION 01383,1200,ER; /*Simp Chines -> UCS-2 */
 CONVERSION 1200,01383,ER; /*UCS-2 -> Simp Chines*/
 CONVERSION 00932,1200,ER; /*Jpn MCCSID -> UCS-2 */
 CONVERSION 1200,00932,ER; /*UCS-2 -> Jpn MCCSID */
 CONVERSION 00939,1200,ER; /*Jpn-ExtEng -> UCS-2 */
 CONVERSION 1200,00939,ER; /*UCS-2 -> Jpn-ExtEng */
 CONVERSION 00300,1200,ER; /*Jpn GCCSID -> UCS-2 */
 CONVERSION 1200,00300,ER; /*UCS-2 -> Jpn GCCSID */
 CONVERSION 00500,00850,ER; /*Latin-1 EBC -> ASCII */
 CONVERSION 00850,00500,ER; /*ASCII -> Latin-1 EBC*/

/*

Here's an example of a job that runs the CUNMIUTL utility (see the CUNJIUTL sample).

Note the specification of ER (enforced subset, round trip) after each CCSID pair. The ER specification is required by DB2.

Conversion Services Configuration

Which conversions should be configured
  CCSID 367 (7-Bit ASCII) <-> ASCII & EBCDIC System CCSID(s)
  CCSID 1208 (UTF-8) <-> ASCII & EBCDIC System CCSID(s)
  CCSID 1200 (UTF-16) <-> ASCII & EBCDIC System CCSID(s)
  Client CCSID(s) <-> Unicode CCSIDs (367, 1208, 1200)
  Additional ASCII or EBCDIC conversions
    Starting with V7, most new code conversion support will be via conversion services. Native DB2 conversions will continue to be supported and used, but in most cases, not enhanced
  Other
    Conversions needed to LOAD/UNLOAD data
    Conversions needed to support application encoding bind option, DECLARE VARIABLE, or CCSID overrides


Which conversions should be configured?

All DB2 conversions involving Unicode are supported via the OS/390 conversion services. Any conversion involving Unicode must be configured.

ASCII and EBCDIC conversions that are not supported by the native DB2 conversion methods (that is, not supported via SYSIBM.SYSSTRINGS) are also supported via the conversion services.

Conversion Services Example

14.34.14 d uni,all
14.34.15 CUN3000I 14.34.14 UNI DISPLAY 097
 ENVIRONMENT: CREATED  12/11/2000 AT 09.13.53
              MODIFIED 12/11/2000 AT 09.13.53
              IMAGE CREATED 12/06/2000 AT 17.10.01
 SERVICE:     CUNMCNV CUNMCASE
 STORAGE:     ACTIVE 50 PAGES
              LIMIT 524287 PAGES
 CASECONV:    NONE
 CONVERSION:  00500-00367-ER        00500-01208-ER
              00500-01200(13488)-ER 00367-00500-ER
              00367-01208-ER        00367-01200(13488)-ER
              01208-00500-ER        01208-00367-ER
              01208-01200-ER        01200(13488)-00500-ER
              01200(13488)-00367-ER 01200-01208-ER

Here's an example of output from a DISPLAY UNI command.

Note that the date the image was created is displayed.

The number of pages used by the image is also displayed. This is important because these pages are page fixed, so they take up dedicated memory on the machine.

Finally, the list of conversions supported in this image is displayed. Note that for CCSID 1200 conversions, the base CCSID that the conversion was created from is also displayed in parentheses.

Round Trip - VS - Enforced Subset

Round Trip (RT) Conversions
  Designed to preserve codepoints that are not representable in both codepages
Enforced Subset (ES) Conversions
  Codepoints that are not representable are converted to the SUB character
DB2 uses a combination of RT and ES conversions
  Trend is toward ES conversions
  Continue to use RT conversions in some cases for compatibility reasons


DB2 conversions are either Round Trip or Enforced Subset.

Round Trip conversions attempt to avoid loss of data by mapping unrepresented codepoints to unused (or unlikely to be used) codepoints.
  Data loss can be avoided or delayed
  Can cause strange conversions

Enforced Subset conversions map any unrepresented codepoints to the substitution (SUB) character.
  Data loss occurs immediately
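The enforced subset behavior can be imitated in a few lines of Python. This is a rough sketch, not how DB2 or the OS/390 conversion services implement it: it assumes Python's "cp437" codec as the target code page and uses the ASCII SUB control character ('1A'x) as the substitution byte. The function name is hypothetical.

```python
# Sketch of an enforced subset style conversion into a single-byte code page.
# Assumptions: Python's "cp437" codec is the target; SUB is the ASCII '1A'x byte.
SUB = b"\x1a"

def enforced_subset(text, codec):
    """Encode text, replacing every unrepresentable character with SUB."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += SUB  # the original code point is lost at this moment
    return bytes(out)

original = "10 \u20ac f\u00fcr \u00c5"        # contains a Euro sign
converted = enforced_subset(original, "cp437")
back = converted.decode("cp437")

print(converted.hex().upper())
print(back == original)  # False: the Euro came back as the SUB character
```

A round trip table would instead map the Euro to some otherwise unused code point, so converting back could restore it, at the price of showing an odd character to anyone who reads the intermediate data.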

Conversions can cause the length of a string to change

Expanding Conversions
  When data converted from one CCSID to another expands
  For example
    Å - 'C5'x in CCSID 819 -> 'C385'x in CCSID 1208

Contracting Conversions
  When data converted from one CCSID to another contracts
  For example
    Å - '00C5'x in CCSID 1200 -> 'C5'x in CCSID 819

Expanding and Contracting Conversions

There are some cases where a conversion causes the length of the data to change.

Expanding conversions cause the length of the data to grow.
Contracting conversions cause the length of the data to shrink.
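The length change is easy to see by encoding one string three ways. This is a sketch that assumes Python's "latin-1", "utf-8", and "utf-16-be" codecs correspond to CCSIDs 819, 1208, and 1200; the helper name is made up.

```python
# Byte length of the same text in three encodings.
# Assumption: latin-1 ~ CCSID 819, utf-8 ~ CCSID 1208, utf-16-be ~ CCSID 1200.

def byte_lengths(text):
    """Return the encoded length of text, in bytes, per encoding."""
    return {
        "819": len(text.encode("latin-1")),
        "1208": len(text.encode("utf-8")),
        "1200": len(text.encode("utf-16-be")),
    }

word = "\u00c5ngstr\u00f6m"   # 8 characters, two of them accented
lengths = byte_lengths(word)
print(lengths)

# 819 -> 1208 expands, 1200 -> 819 contracts.
print(lengths["1208"] > lengths["819"], lengths["819"] < lengths["1200"])
```

The practical consequence is that a column or host variable sized for the source CCSID may be too small once the data has been converted.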

What are CCSIDs used for?

DB2 uses CCSIDs to describe data stored in the DB2 subsystem
DB2 supports specification of CCSIDs at a subsystem level
With V7, DB2 supports 3 encoding schemes
  ASCII
  EBCDIC
  UNICODE
Data is comparable only within a single encoding scheme


So now that we know all about CCSIDs, what are they used for?

DB2 uses CCSIDs just like we use data type and length. They are part of the metadata that describes the data being stored in DB2.

In V7, DB2 supports specification of three sets of CCSIDs. These three sets represent the three encoding schemes (ASCII, EBCDIC, and Unicode) that DB2 supports.

DB2 supports the specification of these CCSIDs at the subsystem level. Once these values have been specified, they should not be changed.

Installation

Specification of CCSIDs is performed at installation via install panel DSNTIPF

DSNTIPF     INSTALL DB2 - APPLICATION PROGRAMMING DEFAULTS PANEL 1
===> _
Enter data below:
 1  LANGUAGE DEFAULT      ===> IBMCOB   ASM,C,CPP,COBOL,COB2,IBMCOB,FORTRAN,PLI
 2  DECIMAL POINT IS      ===> .        . or ,
 3  MINIMUM DIVIDE SCALE  ===> NO       NO or YES for a minimum of 3 digits
                                        to right of decimal after division
 4  STRING DELIMITER      ===> DEFAULT  DEFAULT, " or ' (COBOL or COB2 only)
 5  SQL STRING DELIMITER  ===> DEFAULT  DEFAULT, " or '
 6  DIST SQL STR DELIMTR  ===> '        ' or "
 7  MIXED DATA            ===> NO       NO or YES for mixed DBCS data
 8  EBCDIC CCSID          ===> 0        CCSID of your SBCS or MIXED DATA
 9  ASCII CCSID           ===> 0        CCSID of SBCS or mixed data.
10  Unicode CCSID         ===> 1208     CCSID of Unicode UTF-8 data.
11  DEF ENCODING SCHEME   ===> EBCDIC   EBCDIC, ASCII, or UNICODE
12  LOCALE LC_CTYPE       ===>
13  APPLICATION ENCODING  ===> EBCDIC   EBCDIC, ASCII, UNICODE ccsid (1-65533)
14  DECIMAL ARITHMETIC    ===> DEC15    DEC15,DEC31,15,31
15  USE FOR DYNAMICRULES  ===> YES      YES or NO
16  DESCRIBE FOR STATIC   ===> NO       Allow DESCRIBE for STATIC SQL. NO or YES.

Install panel DSNTIPF is used to specify CCSID information.

Options 8, 9, and 10 are where the CCSIDs for the three encoding schemes are specified. Notice that the ASCII and EBCDIC CCSIDs are initialized to 0 and the Unicode CCSID is initialized to 1208.

The ASCII and EBCDIC CCSIDs are not pre-filled; these values need to be set by the customer.

The EBCDIC value should be set to the CCSID that the customer's 3270 emulators, CICS, and IMS transactions use. The ASCII value should be set to the CCSID that is most commonly used by workstations in the customer shop (1252 for example).

The Unicode value is pre-filled with 1208 and cannot be changed. This value specifies the mixed CCSID for Unicode tables.

Other things to note on this panel:
Option 11 - This specifies the default encoding scheme for objects created in the DB2 subsystem.
Option 13 - This option specifies the default application encoding. Changing this value should be done with great care.

Installation (continued)

Information from DSNTIPF ends up in job DSNTIJUZ

Mixed System (left-hand example)

DSNHDECM ASCCSID=1088,
         AMCCSID=949,
         AGCCSID=951,
         SCCSID=833,
         MCCSID=933,
         GCCSID=834,
         USCCSID=367,
         UMCCSID=1208,
         UGCCSID=1200,
         ENSCHEME=EBCDIC,
         APPENSCH=EBCDIC,
         MIXED=YES
         END

Non-Mixed System (right-hand example)

DSNHDECM ASCCSID=819,
         AMCCSID=65534,
         AGCCSID=65534,
         SCCSID=37,
         MCCSID=65534,
         GCCSID=65534,
         USCCSID=367,
         UMCCSID=1208,
         UGCCSID=1200,
         ENSCHEME=EBCDIC,
         APPENSCH=EBCDIC,
         MIXED=NO
         END


The information from panel DSNTIPF flows to job DSNTIJUZ.

In the case on the left, we have a Mixed = Yes system that is set up to support Korea. The ASCII and EBCDIC system CCSIDs that would actually have been specified on panel DSNTIPF to produce this specification are the mixed CCSIDs, 949 and 933.

For mixed systems, and for the Unicode CCSID, the mixed CCSID is specified on install panel DSNTIPF and DB2 picks the corresponding single byte and graphic (double byte) CCSIDs.

In the case on the right, we have a Mixed = No system that is set up to support US English.

Note that the user specified 819 and 37 for the ASCII and EBCDIC single byte CCSIDs and that DB2 used the value 65534 for the ASCII and EBCDIC mixed and graphic (double byte) CCSIDs. 65534 is a reserved value that means no CCSID.

Also note that the default encoding scheme and the default application encoding flow to this job as well.

Note there is a bug in DSNTIJUZ and DSNHDECM: these ship with CCSID 500 as the default.

CCSIDs are stored in the following places
  SYSIBM.SYSDATABASE
  SYSIBM.SYSTABLESPACE
  SYSIBM.SYSVTREE
  SYSIBM.SYSPLAN (V7)
  SYSIBM.SYSPACKAGE (V7)
  Plans and Packages (SCT02 and SPT01)
  Directory (DSNDB01) (V5)
  DECP

In ENCODING_SCHEME column of - Stored as 'A', 'E', 'U', or blank (default)
  SYSIBM.SYSDATATYPES
  SYSIBM.SYSDATABASE
  SYSIBM.SYSPARMS
  SYSIBM.SYSTABLESPACE
  SYSIBM.SYSTABLES

Where Is Encoding Information stored?

Once we've specified CCSIDs for our system, what does DB2 do with them?

DB2 stores CCSIDs in the catalog, the directory, in bound statements, and of course the DECP.

The value stored in these areas depends on what release of DB2 was used to create the object, and the value in the DECP at the time the object was created.

If a value is 0, it is assumed that the object is EBCDIC.

In general, DB2 does not support changing a CCSID once it is specified in a DECP. The exceptions are:
  Changing from 0 to a valid value
  Changing from a CCSID that does not support the Euro symbol to a CCSID that supports the Euro symbol (37 -> 1140 for instance)

Note that this sort of change is disruptive and requires special steps. It should be undertaken only after the documentation has been read and the process is thoroughly understood.

Encoding information is also stored in some catalog tables.

DSNTIPF Mixed Data Option

Mixed = No systems have support for
  SBCS Data - Pure single byte data
  Mixed Data
    Unicode UTF-8 MBCS (1-4 bytes/char) data. No support for ASCII/EBCDIC mixed data
  Graphic Data
    Unicode UTF-16 (2 or 4 bytes/char) data. No support for ASCII/EBCDIC DBCS data

Mixed = Yes systems have support for
  SBCS Data - Pure single byte data
  Mixed Data - Single & double byte data in a single string
  Graphic Data - Pure double byte data


Mixed = Yes systems are used in the Far East, primarily China, Japan, and Korea. They offer support for SBCS, Mixed, and DBCS ASCII and EBCDIC data. Mixed = No systems are used elsewhere in the world and, for ASCII and EBCDIC, only have support for SBCS data.

Unicode data is always considered mixed, regardless of the Mixed = Yes/Mixed = No setting of the system.

Prior to DB2 V7, Mixed = No systems allowed columns declared FOR MIXED DATA or GRAPHIC to be created in EBCDIC tables. As of DB2 V7, Mixed = No systems only allow these types of columns in Unicode tables.

Mixed Data

SBCS Data
  SBCS can be compared to mixed without conversion to mixed because it is a subset of the mixed repertoire. This is true for ASCII, EBCDIC, and Unicode

Mixed Data
  Capable of representing SBCS and MBCS data
  EBCDIC SO, SI ('0E'x, '0F'x) delineate DBCS data
    AB<A> -> 'C1C20E42C10F'x
  ASCII uses first byte code point: if the first byte is within a certain range, say 'A0'x - 'AF'x, then it is the first byte of a DBCS character.
    For example 'A055'x would be a DBCS character
    Some CCSIDs have several first byte code point ranges.
  UTF-8 data uses the high order bit to indicate MBCS data
    For example 'EFBC91'x is a three byte UTF-8 character

Graphic Data
  ASCII or EBCDIC - DBCS characters, no shift or first byte code points needed
  Unicode - DBCS characters. Surrogates take two DBCS characters
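The shift-out/shift-in convention for mixed EBCDIC data can be sketched as a small scanner. This is illustrative only and not DB2's validation logic: the function name is hypothetical, and the only rules checked are that every SO has a matching SI and that the bytes between them come in pairs.

```python
# Split a mixed EBCDIC byte string into single-byte and double-byte runs.
SO, SI = 0x0E, 0x0F  # EBCDIC shift-out / shift-in control bytes

def split_mixed_ebcdic(data):
    """Return a list of ('SBCS', bytes) and ('DBCS', bytes) runs."""
    runs, current, mode = [], bytearray(), "SBCS"

    def flush():
        # Emit the bytes collected so far under the mode they were read in.
        if current:
            runs.append((mode, bytes(current)))
            current.clear()

    for byte in data:
        if mode == "SBCS" and byte == SO:
            flush()
            mode = "DBCS"
        elif mode == "DBCS" and byte == SI:
            if len(current) % 2:
                raise ValueError("odd number of bytes between SO and SI")
            flush()
            mode = "SBCS"
        else:
            current.append(byte)
    if mode == "DBCS":
        raise ValueError("SO without a matching SI")
    flush()
    return runs

# The AB<A> example from the slide.
runs = split_mixed_ebcdic(bytes.fromhex("C1C20E42C10F"))
print(runs)
```

For the slide's example this yields one SBCS run ('C1C2'x, the letters AB) followed by one DBCS run ('42C1'x, the double byte A).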

Mixed Data

Mixed = Yes systems use a CCSID triplet. That is to say, there is an SBCS CCSID, a Mixed CCSID, and a DBCS CCSID. On these systems, the SBCS CCSID is a subset of the Mixed CCSID. Because of this, SBCS and Mixed data can be compared without converting.

SBCS columns are created by specifying the "FOR SBCS DATA" clause on CREATE, like

CREATE TABLE T1 (C1 CHAR(10) FOR SBCS DATA);

Mixed columns are the default on Mixed = Yes systems, and can be explicitly specified by using the "FOR MIXED DATA" clause on CREATE.

DBCS data is stored in Graphic columns.

When Does Conversion Occur?

Local
  Generally, conversion does not occur for local applications
  When dealing with ASCII/Unicode tables
  When specified by application
    CCSID Override in SQLDA (V2.3 & above)
    Declare Variable (V7)
    Application Encoding Bind Option (V7)
    Current Application Encoding Special Register (V7)

Remote
  Automatically when needed
    DRDA Receiver Makes Right


There are no hard and fast rules as to when a conversion occurs. The short answer is that conversion occurs when necessary.

Some of the cases when we do conversion are listed here.

Unicode Fundamentals

Why Unicode?

Unicode is a single character set that encodes all of the world's scripts (sort of).
The Unicode standard provides a cross-platform, cross-vendor method of encoding data that enables lossless representation and manipulation

Before Unicode
  Many standards
    ANSI
    JIS
    TISI
  Provided by various vendors
    IBM
      ASCII (pSeries, xSeries) and EBCDIC (zSeries and iSeries)
    HP
    Microsoft


Sort of, because new characters are being added all the time, so at any given time an implementation of the standard is somewhat behind.

Sort of, because there is one standard that contains several implementations of the standard.

Prior to Unicode, many different standards and vendor implementations existed.

Unicode attempts to standardize the representation and manipulation of data across vendors and platforms.


Four forms of Unicode

UTF-8
  Unicode Transformation Format in 8 bits
UCS-2
  Universal Character Set coded in 2 octets
UTF-16
  Unicode Transformation Format in 16 bits
UTF-32
  Unicode Transformation Format in 32 bits
  Introduced with Unicode Technical Report #19 to replace UCS-4

There are currently four forms of Unicode that are being promoted by the Unicode standards organization: UTF-8, UCS-2, UTF-16, and UTF-32.

UTF-16 is the preferred format (according to UTR#19). UCS-2 is the precursor to UTF-16.

UTF-8 (CCSID 1208)

ASCII Safe UNICODE (maps to 7-Bit ASCII)
  Bytes '00'x - '7F'x = 7-Bit ASCII
  Bytes '00'x - '7F'x represented by single byte chars
  Chars above '7F'x are encoded by 2-6 byte chars
Most characters take 2-3 bytes
  Most Japanese, Chinese, and Korean characters take 3 bytes
  Most Extended Latin characters take 2 bytes
Surrogates take 4 bytes


UTF-8 is represented by CCSID 1208. This is a growing CCSID, which means that as characters are added to the Unicode standard, they will be added to this CCSID.

UTF-8 is also commonly called ASCII safe Unicode.
  The first 128 code points ('00'x - '7F'x) are the same as CCSID 367, which is a 7-bit ASCII CCSID
  Other characters are represented as multi-byte (MBCS) characters, so a character takes 1-4 bytes

One nice feature of UTF-8 is that since it is an 8-bit encoding, it does not have any big endian/little endian issues.
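Because the first byte of every UTF-8 character says how long the character is, a byte string can be walked without decoding it. The sketch below assumes well-formed input limited to the 1-4 byte range mentioned above; both function names are hypothetical.

```python
# Determine UTF-8 character boundaries from the lead byte alone.
# Assumption: input is well-formed UTF-8 using at most 4 bytes per character.

def utf8_sequence_length(lead_byte):
    """Number of bytes in the character that starts with lead_byte."""
    if lead_byte <= 0x7F:
        return 1              # high-order bit off: 7-bit ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")

def count_characters(data):
    """Count characters in UTF-8 bytes by hopping from lead byte to lead byte."""
    count = index = 0
    while index < len(data):
        index += utf8_sequence_length(data[index])
        count += 1
    return count

sample = "A\u00c5\u9860\U000200d0".encode("utf-8")
print(len(sample), "bytes,", count_characters(sample), "characters")
```

This is why byte length and character length must be kept apart when sizing columns for UTF-8 data.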

UCS-2 (CCSID 13488, 17584)

Basic Multilingual Plane - BMP(0)
Pure double byte characters
  64K characters in repertoire
'0000'x - '00FF'x represent 8-bit ASCII
  '00'x prepended to 8-bit ASCII characters
'0100'x - 'FFFF'x represent additional characters
  Greek -> '0370'x - '03FF'x
  Cyrillic -> '0400'x - '04FF'x
  ...

UCS-2 is represented by CCSIDs 13488 and 17584. 13488 corresponds to Unicode Version 2, and 17584 corresponds to Unicode Version 3.

When people say Unicode, without qualifying the encoding format, this is usually what they mean.

Other characters are allocated in blocks (there's a block for Greek chars, a block for Cyrillic chars, and so on).

UTF-16 (CCSID 1200)

UCS-2 with surrogate support
  Uses two two-byte characters to represent additional characters
~1 million characters in repertoire
  BMP1-BMP16 (additional 16 planes)
    Supplementary Multilingual Plane (SMP) - Plane 1
      U+10000..U+1FFFF
    Supplementary Ideographic Plane (SIP) - Plane 2
      U+20000..U+2FFFF
    Supplementary Special Purpose Plane (SSP) - Plane 14
      U+E0000..U+EFFFF
    BMP15 and BMP16 are reserved for private use


UTF-16 is represented by CCSID 1200, which is also a growing CCSID.

UTF-16 is a superset of UCS-2 and uses reserved sections of BMP0 to map an additional 16 planes.

Version 3.1 of the Unicode standard defines the first characters in the surrogate area.
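The mapping from a supplementary code point to its pair of two-byte surrogates is simple arithmetic, sketched here with a hypothetical helper and checked against Python's own UTF-16 encoder.

```python
# Compute the UTF-16 surrogate pair for a code point outside BMP0.

def to_surrogates(code_point):
    """Return (high, low) surrogate values for U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need surrogates")
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

high, low = to_surrogates(0x200D0)
print(f"U+200D0 -> {high:04X} {low:04X}")

# Cross-check against the built-in encoder (big endian).
print(chr(0x200D0).encode("utf-16-be").hex().upper())
```

The high and low surrogate ranges ('D800'x - 'DBFF'x and 'DC00'x - 'DFFF'x) are the reserved sections of BMP0 referred to above, so 1024 x 1024 combinations cover the 16 additional planes.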

UTF-32

Each character is 4 bytes
  Range is restricted to values '00000000'x - '0010FFFF'x
  Represents the same repertoire as UTF-16
UCS-4
  Implemented by Sun Solaris and HP-UX as base Unicode data type
XPG/4 standard requires fixed width character format
  zSeries, pSeries looking at UTF-32 implementations to support surrogate characters in C/C++ applications

For completeness, I'm mentioning UTF-32 and UCS-4. DB2 is not implementing any support for these implementations of the Unicode standard, although other vendors have.

Endianness

Big Endian
  pSeries, zSeries, iSeries, Sun, HP
  Most significant byte is leftmost
    For a 4 byte word - Byte order 0,1,2,3
Little Endian
  Intel based machines including xSeries
  Least significant byte is leftmost
    For a 4 byte word - Byte order 3,2,1,0
UTF-8 - not affected by endianness issues
UTF-16 and UTF-32 are affected by endianness issues
  Big Endian
    'A' = x'0041' for UTF-16 or x'00000041' for UTF-32
  Little Endian
    'A' = x'4100' for UTF-16 or x'41000000' for UTF-32

Note: A BYTE is always ordered as leftmost most significant bit to rightmost least significant bit. Bit order within a byte is always 7,6,5,4,3,2,1,0


DB2 and DRDA manipulate and store data in Big Endian format.

Little Endian clients convert data to Big Endian before putting it on the wire.
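The byte orders can be reproduced with Python's explicit big endian and little endian codecs. This is a sketch for illustration; the dictionary name is made up.

```python
# The letter 'A' in each Unicode encoding form, in both byte orders.
ENCODINGS = {
    "UTF-8": "utf-8",
    "UTF-16 big endian": "utf-16-be",
    "UTF-16 little endian": "utf-16-le",
    "UTF-32 big endian": "utf-32-be",
    "UTF-32 little endian": "utf-32-le",
}

for name, codec in ENCODINGS.items():
    print(f"{name:22} x'{'A'.encode(codec).hex().upper()}'")

# A little endian client swapping to big endian before sending data:
little = "A".encode("utf-16-le")
big = little.decode("utf-16-le").encode("utf-16-be")
print(big.hex().upper())
```

UTF-8 has only one row because a one-byte code unit has no byte order to get wrong.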

Character Examples

A, a, 9, Å (The character A with Ring accent), U+9860, U+200D0

ASCII
  '41'x, '61'x, '39'x, 'C5'x, 'CDDB'x (ccsid 939), N/A

UTF-8
  '41'x, '61'x, '39'x, 'C385'x, 'E9A1A0'x, 'F0A08390'x
  Note: 'C5'x becomes double byte in UTF-8

UTF-16 (Big Endian format)
  '0041'x, '0061'x, '0039'x, '00C5'x, '9860'x, 'D840DCD0'x

UTF-32 (Big Endian format)
  '00000041'x, '00000061'x, '00000039'x, '000000C5'x, '00009860'x, '000200D0'x

Note: UCS-2/UTF-16 and UCS-4/UTF-32 are using a technique called Zero Extension

Now that we know all about Unicode, here are some examples of what this stuff looks like.

Note that A-Ring takes two bytes to represent in UTF-8, and that other characters can take three or four bytes to represent.
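The Unicode rows of the example table can be regenerated with Python's standard codecs, which is a convenient way to check values like these by hand. The names below are illustrative only.

```python
# Regenerate the UTF-8, UTF-16, and UTF-32 rows for the six example characters.
EXAMPLES = ["A", "a", "9", "\u00c5", "\u9860", "\U000200d0"]
FORMS = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))

def encode_row(ch):
    """Return the hex representation of ch in each big endian encoding form."""
    return {name: ch.encode(codec).hex().upper() for name, codec in FORMS}

for ch in EXAMPLES:
    row = encode_row(ch)
    print(f"U+{ord(ch):04X}", row["UTF-8"], row["UTF-16"], row["UTF-32"])
```

Only the last character, U+200D0, lies outside BMP0, which is why it alone needs four bytes in UTF-8 and a surrogate pair in UTF-16.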

DB2 for z/OS & OS/390 V7 Enhancements for Unicode


Requirement

Enable Unicode on DB2 UDB for OS/390 and z/OS
Support vendors implementing Unicode applications
Support needs of multinational companies
Support data from more than one country/language in one DB2 subsystem

For V7 our challenge was to enable Unicode data storage on DB2, without regressing function or performance for our ASCII and EBCDIC customers.

We wanted to meet the needs of ERP and CRM vendors, as well as address the needs of customer written applications that have a need to store multinational data.

Solution

Allow UNICODE to be specified as the Encoding Scheme (ES) at the
  System Level
    UNICODE CCSIDs (Install)
    Similar to ASCII/EBCDIC System CCSIDs
  Database
    create database mydb ccsid unicode
  Table Space
    create tablespace myts in mydb ccsid unicode
  Table
    create table t1 (c1 char(10)) ccsid unicode
  Other
    create procedure mysp (in in_parm char(10) ccsid unicode) ...


The Unicode support we have added with DB2 V7 is similar to the support we have for ASCII, which was added in V5.

This support allows ASCII, EBCDIC, and Unicode objects to coexist in a single DB2 subsystem.

Specification of the encoding scheme is made at the system level as well as the object level.

Storage

Storage of Unicode Data
  Char/VarChar/CLOB FOR SBCS DATA
    (7-bit) ASCII, this is a subset of UTF-8 - CCSID 367
  Char/VarChar/CLOB FOR MIXED DATA
    UTF-8 - CCSID 1208
  Graphic/VarGraphic/DBCLOB
    UTF-16 - CCSID 1200

Data stored in Unicode tables in DB2 will be stored in one of the following CCSIDs: 367, 1208, 1200.

Parsing

Parsing will be in EBCDIC
  Conversion to EBCDIC system CCSID from
    ASCII
    EBCDIC
    UNICODE
Need to ensure that literal values are convertible to System EBCDIC CCSID.
  If substitution occurs in statement text being converted to EBCDIC - SQLCODE +335 issued
  Use Host Variables or Parameter markers where conversion to system EBCDIC CCSID is an issue


Parsing for DB2 V7 is in EBCDIC. This means that all statements sent to DB2 will be converted to the system EBCDIC CCSID, and then parsed.

Since statements are converted to EBCDIC, there is a possibility of data loss when the data is converted.

Since all DB2 keywords, such as SELECT, are representable on all EBCDIC code pages, there shouldn't be a problem with statement text.

Literals contained in the statement are subject to data loss. SQLCODE +335 has been added to alert the user of this sort of data loss.
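A client-side pre-check along these lines can flag statements whose literals would not survive conversion. This is only a sketch under stated assumptions: Python's "cp037" codec stands in for a system EBCDIC CCSID of 37, the function name is hypothetical, and DB2 itself reports the condition with a warning rather than through anything like this code.

```python
# Find characters in statement text that have no code point in the
# system EBCDIC CCSID. Assumption: Python's "cp037" codec ~ CCSID 37.

def unconvertible_characters(statement, codec="cp037"):
    """Return the characters in statement that the target CCSID cannot hold."""
    lost = []
    for ch in statement:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost

safe = "INSERT INTO T1 (C1) VALUES('123')"
risky = "INSERT INTO T1 (C1) VALUES('\u03a9')"   # Greek capital omega literal

print(unconvertible_characters(safe))    # nothing would be substituted
print(unconvertible_characters(risky))   # the literal would be substituted
```

When the check reports anything, passing the value in a host variable or parameter marker keeps it out of the statement text and so out of the EBCDIC parse.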

Catalog

The catalog will be encoded in the default EBCDIC CCSID

Object Names will need to be convertible to Default EBCDIC CCSID

  Database Name
  Table space/Index space Name
  External Names (UDF, SP, Exits, Fieldproc...)

It may not be possible to generate identifiers from all clients, so they should really be limited to the common subset that is representable on all clients.

The DB2 V7 catalog will remain EBCDIC, so names stored in the catalog must also be convertible to EBCDIC without loss.

Since the DB2 catalog contains data stored in the system EBCDIC default CCSID, it is possible to have things like Japanese or Cyrillic names for things like columns (depending on your system CCSID). However, since all clients may not be capable of producing these characters, users should really be limited to the subset of Latin-1 characters.

Note: To store Japanese, Chinese, or Korean in the DB2 catalog, a MIXED=YES system is needed and the data being stored in the catalog must conform to the rules for well formed mixed data.

Unicode Literals

Literals
  UTF-8 literals (char/varchar/clob)
    Conform to normal rules for character strings
      INSERT INTO T1 (C1) VALUES('123');
  UTF-16 literals (graphic/vargraphic/dbclob)
    Specified as UTF-8 literals
      INSERT INTO T1 (C1) VALUES('123');
    Specified as Graphic literals
      INSERT INTO T1 (C1) VALUES( ); --Unicode U+BBC0 U+BBC1


Specification of literals for Unicode tables is essentially the same as it is for ASCII or EBCDIC tables.

The one thing to note here is that character literals may be specified for UTF-16 columns. These character literals may be used any place a graphic literal would normally be used.

For instance in the values clause of an insert statement

Host Variables and Parameter Markers

ASCII/EBCDIC/UNICODE -> UNICODE
  Char or Graphic -> UTF-8 or UTF-16
UNICODE -> ASCII/EBCDIC/UNICODE
  UTF-8 or UTF-16 -> Char or Graphic
UTF-8 <-> UTF-16
Applications don't need to change just because the back end data store changes

When dealing with Unicode tables, we have torn down the barrier between CHAR and GRAPHIC.

This means your back end data store can be either UTF-8 or UTF-16, and you can use ASCII, EBCDIC, or Unicode character or graphic host variables. DB2 will perform the necessary conversions to/from the CCSID of the host variable even if the host variable doesn't match the column type. (For ASCII and EBCDIC back end data stores, in most cases char and graphic are incompatible.)

Declare Variable

DECLARE VARIABLE statement
  New way to allow CCSID to be specified for host variables
  Example
    EXEC SQL DECLARE :HV1 VARIABLE CCSID UNICODE;
    EXEC SQL DECLARE :HV2 VARIABLE CCSID 37;
  Precompiler directive to treat hostvar as a specific CCSID
  Useful for PREPARE/EXECUTE IMMEDIATE statement text
    EXEC SQL PREPARE S1 FROM :HV2;
  May be used with any character host variable on input or output


The new DECLARE VARIABLE statement can be used to specify the CCSID of a particular host variable.

This is a precompiler directive that causes the precompiler to specify the CCSID of the host variable in any SQLDA that the precompiler generates to reference the host variable. This directive works for both input and output host variables.

Application Encoding

New Application Encoding Scheme
  System Default
    Determines Encoding Scheme when none is explicitly specified
  Bind Option
    Allows explicit specification of ES at an application level
    Affects Static SQL - Provides default for dynamic
    System Default used if bind option not specified
  Special Register
    Allows explicit specification of ES at the application level
    Affects Dynamic SQL
    Initialized with Bind Option
  Option is ignored when packages are executed remotely
    DRDA specifies the input CCSID; data flows as is to the client

Also new to DB2 V7 is the specification of Application Encoding Scheme.

This allows a default Application Encoding to be specified
  Preset to EBCDIC

The Application Encoding Scheme can also be specified on BIND PLAN or PACKAGE
  If not specified, the system default value is used for the bind option
  Plans/Packages bound prior to V7 are assumed to be EBCDIC
  The option applies to Static SQL

The Application Encoding Scheme special register can be used to affect dynamic SQL
  Initial value is the value of the Bind Option.

Application Encoding (continued)

Example
  Assume package MY_PACK is bound with APPLICATION ENCODING(UNICODE)
    All Char input/output host variables for static statements are assumed to be in CCSID 1208
    All Graphic input/output host variables for static statements are assumed to be in CCSID 1200
    Initial value for Application Encoding special register will be 1208
  Declare Variable statement or CCSID overrides can be used for overriding bind option or special register

In this example, package MY_PACK is bound with APPLICATION ENCODING(UNICODE).
  Character host variables will be treated as CCSID 1208.
  Graphic host variables will be treated as CCSID 1200.
  The initial value of the APPLICATION ENCODING special register will be 1208.
The DECLARE VARIABLE statement can be used to override the bind option or special register for host variables.
For statements that use a DESCRIPTOR, as in FETCH USING DESCRIPTOR, CCSID overrides can be coded by hand in the SQLDA.
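
The precedence described above can be sketched in a few lines of Python. This is only an illustrative model, not DB2 code: the function name `effective_ccsid` and the `declared_ccsid` parameter are hypothetical, and the only CCSID defaults modeled are the two named on the slide for a package bound with APPLICATION ENCODING(UNICODE).

```python
# Hypothetical model of the CCSID assumed for a host variable in a package
# bound with APPLICATION ENCODING(UNICODE). All names are illustrative.
UNICODE_DEFAULTS = {"char": 1208, "graphic": 1200}  # UTF-8 and UTF-16


def effective_ccsid(host_var_type, declared_ccsid=None):
    """Return the CCSID assumed for a host variable.

    declared_ccsid stands in for a DECLARE VARIABLE (or hand-coded SQLDA)
    override; when it is None, the bind option default for the type applies.
    """
    if declared_ccsid is not None:
        return declared_ccsid
    return UNICODE_DEFAULTS[host_var_type]


print(effective_ccsid("char"))       # bind option default for char
print(effective_ccsid("graphic"))    # bind option default for graphic
print(effective_ccsid("char", 37))   # explicit override wins
```

The point of the sketch is simply that an explicit override on the host variable takes priority, and the bind option supplies the value otherwise.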

ODBC/SQLJ/JDBC

ODBC Support
  Support for wide character APIs (UCS-2/UTF-16)
  See ODBC Guide and Reference (SC26-9941-01)
  Example
    SQLRETURN SQLPrepare  (SQLHSTMT hstmt, SQLCHAR  *szSqlStr, SQLINTEGER cbSqlStr);
    SQLRETURN SQLPrepareW (SQLHSTMT hstmt, SQLWCHAR *szSqlStr, SQLINTEGER cbSqlStr);
SQLJ/JDBC Support
  Remove current support for converting to EBCDIC before calling the engine
  Let the DB2 engine determine where conversion is necessary

ODBC support for Unicode is included as part of the effort to support ODBC 3.0.
SQLJ and JDBC already support Unicode, but changes have been made to exploit Unicode support in DB2 V7.

COBOL

Enterprise COBOL for z/OS and OS/390 V3R1 supports Unicode
NATIONAL is used to declare UTF-16 variables
  MY-UNISTR pic N(10). -- declares a UTF-16 variable
N and NX literals
  N'123'
  NX'003100320033'
Conversions
  NATIONAL-OF converts to UTF-16
  DISPLAY-OF converts to a specific CCSID

  Greek-EBCDIC pic X(10) value " ".
  UTF16STR pic N(10).
  UTF8STR pic X(20).
  Move Function National-of(Greek-EBCDIC, 00875) to UTF16STR.
  Move Function Display-of(UTF16STR, 01208) to UTF8STR.

COBOL has recently added support for Unicode characters. Included in this support:
  New NATIONAL data type
  N and NX literals
  Conversion operations
  More
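
For readers without a COBOL compiler at hand, the two-step conversion on the slide (Greek EBCDIC CCSID 875 to UTF-16, then UTF-16 to UTF-8) can be approximated in Python. This is a sketch under assumptions: the sample text is invented for illustration, and Python's `cp875` codec is used as a stand-in for CCSID 875.

```python
# Sample data chosen for illustration: Greek capital alpha, beta, gamma.
text = "\u0391\u0392\u0393"

# Bytes as they would sit in an EBCDIC Greek field (cp875 stands in for CCSID 875).
greek_ebcdic = text.encode("cp875")

# Roughly what NATIONAL-OF(Greek-EBCDIC, 00875) does: EBCDIC Greek -> UTF-16.
utf16str = greek_ebcdic.decode("cp875").encode("utf-16-be")

# Roughly what DISPLAY-OF(UTF16STR, 01208) does: UTF-16 -> UTF-8.
utf8str = utf16str.decode("utf-16-be").encode("utf-8")

print(len(greek_ebcdic), len(utf16str), len(utf8str))
```

Note that the three letters occupy three bytes in EBCDIC but six bytes in both UTF-16 and UTF-8, which is why the receiving fields on the slide are declared larger than the source field.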

Joins, Sub-queries, Unions...

Support will be consistent with ASCII support
No mixing of ES in:
  Queries, joins, sub-queries, unions, ...
  For example: SELECT T1C1, T2C1 FROM T1, T2 WHERE ... fails if T1 and T2 are not the same ES

You cannot reference the DB2 catalog in a query against an ASCII or Unicode table

As in prior releases, you cannot reference tables from more than one encoding scheme in a single statement.
In the example, the statement fails because T1 and T2 are not of the same encoding scheme.

Predicates

Predicates limited to 255 bytes (except LIKE)
Basic predicate
  SELECT ... WHERE C1 = :HG1
  (where C1 is UTF-8 and :HG1 is UTF-16)
LIKE predicate
  SELECT ... WHERE C1 LIKE :HG1 ESCAPE :HG2;
  (where C1 is UTF-8 and :HG1 and :HG2 are UTF-16)
IN predicate
  SELECT ... WHERE C1 IN (:HG1, :HV1);
  (where C1 is UTF-8, :HG1 is UTF-16, and :HV1 is character)

For queries against Unicode tables, host variables used in a predicate may be specified as UTF-16 or UTF-8 regardless of the data type of the column. This allows the back-end data store to change without changing the application.
  When data is primarily Latin-1, it may be more efficient to store the data in UTF-8.
  When data is not primarily Latin-1, it may be more efficient to store the data in UTF-16.
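
The storage trade-off between UTF-8 and UTF-16 is easy to see by counting bytes. The following Python sketch uses two invented sample strings (an English word and its Japanese katakana equivalent); the helper name `sizes` is hypothetical. It also shows that the same value compares equal whichever encoding it arrives in, once both sides are converted to a common form.

```python
def sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


latin = "Database"                                # 8 ASCII letters
japanese = "\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9"  # 6 katakana characters

print(sizes(latin))     # UTF-8 is the smaller encoding here
print(sizes(japanese))  # UTF-16 is the smaller encoding here

# A UTF-8 column value and a UTF-16 host variable holding the same text
# compare equal after conversion to a common representation.
column_value = japanese.encode("utf-8")
host_variable = japanese.encode("utf-16-be")
same = column_value.decode("utf-8") == host_variable.decode("utf-16-be")
```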

Scalar Functions

Functions
  LENGTH, SUBSTR, POSSTR, LOCATE
    Byte oriented for SBCS and Mixed (UTF-8)
    Double-byte character oriented for DBCS (UTF-16)
Cast functions
  UTF-16/UTF-8 are accepted anywhere char is accepted (char, date, time, integer...)
    SELECT DATE(graphic column) FROM T1;
    SELECT INTEGER(graphic column) FROM T1;
  UTF-8 (CCSID 1208) is the result data type/CCSID for character functions (char(float_col)...)

All built-in functions (BIFs) have been extended to support Unicode.
Some BIFs, such as LENGTH, SUBSTR, POSSTR, and LOCATE, are byte oriented for UTF-8 and double-byte character oriented for UTF-16.
Many new functions were added in V7. Among them, the CCSID_ENCODING function helps users determine the encoding scheme (ASCII, EBCDIC, or UNICODE) of a particular CCSID.
UTF-16 data is accepted in casting functions such as DATE or INTEGER.
Functions that return character strings return UTF-8 (CCSID 1208).
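
What "byte oriented" versus "double-byte character oriented" means in practice can be shown with a short Python sketch. It does not call DB2; it only mimics the counting rules described above on an invented sample value, so treat it as an illustration of the idea rather than of the exact function semantics.

```python
value = "caf\u00e9"  # 'cafe' with an accented final letter: 4 characters

utf8 = value.encode("utf-8")       # the accented letter takes 2 bytes
utf16 = value.encode("utf-16-be")  # every character here takes 2 bytes

length_utf8 = len(utf8)         # byte-oriented count
length_utf16 = len(utf16) // 2  # double-byte-unit count

# A byte-oriented substring can cut a multi-byte character in half.
first_four_bytes = utf8[:4]
try:
    first_four_bytes.decode("utf-8")
    split_character = False
except UnicodeDecodeError:
    split_character = True

print(length_utf8, length_utf16, split_character)
```

The same four-character value reports a different length depending on the column's encoding, and a byte-based substring of the UTF-8 form can end in the middle of a character.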

Routines

UDFs, UDTFs, and SPs will all be enabled to allow Unicode parameters
Parameters will be converted as necessary between char (UTF-8) and graphic (UTF-16)
Date/Time/Timestamp passed as UTF-8 (ISO format)

User-written routines (User Defined Functions, User Defined Table Functions, and Stored Procedures) have been extended to support Unicode.
  Parameters will be converted as necessary.
  Date, Time, and Timestamp values are passed to the routine as a UTF-8 character string. These values will be in the ISO format as specified in the DB2 SQL Reference.

Utilities

LOAD Utility
  UTF-16 <-> UTF-8
  SBCS/MIXED -> DBCS
  DBCS -> SBCS/MIXED
  ASCII/EBCDIC <-> UNICODE
UNLOAD Utility
  ASCII/EBCDIC <-> UNICODE
  No support for
    SBCS/MIXED -> DBCS
    DBCS -> SBCS/MIXED

The LOAD utility has been extended to support conversion to and from Unicode.
Additionally, the LOAD utility will support conversion between character and graphic as long as a conversion exists:
  Character in load data set -> graphic column
  Graphic in load data set -> character column
The UNLOAD utility, new for V7, supports conversion to and from Unicode, but does not support conversion between character and graphic.

Limits

Index key size remains 255
CHAR limit is still 255 bytes
Varying-length string limit is still 32K bytes
Strings > 32K bytes: use LOBs

We haven't changed any of these limits. These are the same limits as we had for V6.
The limit on index key sizes is something to watch out for.
  Unicode data can take from one to three times the space needed to store ASCII or EBCDIC data.
For character strings longer than 255 bytes, use VARCHAR. Varying-length strings are still limited to 32704 bytes. For longer strings, use LOBs.
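
Because the limits are in bytes, not characters, a value that fits comfortably in ASCII or EBCDIC may not fit once it is stored as Unicode. A minimal Python check, assuming the 255-byte limit quoted above and using an invented helper name `fits_in_char` and invented sample data:

```python
CHAR_LIMIT_BYTES = 255  # limit quoted on the slide


def fits_in_char(text, encoding="utf-8"):
    """True if the encoded value fits within the 255-byte limit."""
    return len(text.encode(encoding)) <= CHAR_LIMIT_BYTES


ascii_value = "A" * 200       # 200 characters, 1 byte each in UTF-8
kanji_value = "\u6f22" * 100  # 100 characters, 3 bytes each in UTF-8

print(fits_in_char(ascii_value))               # 200 bytes
print(fits_in_char(kanji_value))               # 300 bytes in UTF-8
print(fits_in_char(kanji_value, "utf-16-be"))  # 200 bytes in UTF-16
```

Only 100 characters of the second value already exceed the limit in UTF-8, which is the kind of surprise to check for when sizing columns and index keys.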

Things to look out for


UTF-8 and UTF-16 are compatible just about everywhere, but you will pay a conversion cost. It is best to match the DB2 data definition to the Unicode model the application is using
  If the application uses UTF-8, DB2 tables should be UTF-8
  If the application uses UTF-16, DB2 tables should be UTF-16
Collation
  Unicode collation is more like ASCII collation than EBCDIC
    Numbers come before letters
    Uppercase characters come before lowercase
  UTF-8 and UTF-16 collations are not the same if surrogates are involved

UTF-8 and UTF-16 are very compatible, but the cost of conversion, even with hardware support, can be high.
In addition to the conversion costs, a mismatch in data types can have other effects, such as on predicate indexability.
Matching the application and the back-end data store will provide the best performance.
Collation of Unicode data will be more like that of ASCII data.
  When surrogates are involved, UTF-8 and UTF-16 do not collate the same.
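
Both collation points can be demonstrated by sorting on the raw encoded bytes, which is a reasonable stand-in for a binary collating sequence. The Python sketch below assumes binary ordering, uses Python's `cp037` codec as a stand-in for an EBCDIC CCSID, and picks two sample code points (U+FF5E and U+10000) purely for illustration.

```python
def binary_sort(items, encoding):
    """Sort strings by their encoded bytes, as a binary collation would."""
    return sorted(items, key=lambda s: s.encode(encoding))


# Digits, uppercase, lowercase: Unicode orders like ASCII, not like EBCDIC.
sample = ["a", "A", "1"]
unicode_order = binary_sort(sample, "utf-8")
ebcdic_order = binary_sort(sample, "cp037")

# A character high in the Basic Multilingual Plane versus one that needs a
# surrogate pair in UTF-16.
bmp = "\uff5e"
supplementary = "\U00010000"
utf8_order = binary_sort([supplementary, bmp], "utf-8")
utf16_order = binary_sort([supplementary, bmp], "utf-16-be")

print(unicode_order, ebcdic_order)
print(utf8_order == utf16_order)
```

The surrogate pair for the supplementary character begins with a value in the D800 range, which sorts below FF5E in UTF-16, while in UTF-8 the four-byte sequence sorts above the three-byte one. That is the whole source of the difference.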


Storage size does not equal rendered size
  Japanese characters take 3 bytes to store 1 character in UTF-8
  Latin-1 accented characters take two bytes in UTF-8
  Unicode has combining characters that allow something like A-ring to be represented as A followed by a combining ring. Combining characters can add to the size needed for both UTF-8 and UTF-16 columns
    Å can be represented as
      '00C5'x (or 'C385'x for UTF-8)
      '0041030A'x (or '41CC8A'x for UTF-8)
Client Connections
  Clients need to be compatible
  Can use two-stage conversions
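
The two representations of A-ring can be produced with Python's standard `unicodedata` module, which makes the size difference concrete. This is an illustration only; the decomposed form shown is the letter A followed by the combining ring above (U+030A).

```python
import unicodedata

precomposed = "\u00c5"                                  # single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # 'A' + combining ring

for label, value in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(
        label,
        value.encode("utf-16-be").hex().upper(),
        value.encode("utf-8").hex().upper(),
    )
```

Both strings render as the same character, yet the decomposed form needs twice the space in UTF-16 and three bytes instead of two in UTF-8, and the two forms do not compare equal byte for byte.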

There are many issues that need to be dealt with in a Unicode environment.
  Some of these are storage related and affect things like database definitions.
  Some of these are application related and affect things like rendering characters for printing or display.
Expanding and contracting conversions are very common in Unicode environments. The sizes of these expansions and contractions are not easy to calculate because of things like combining characters.
When a connection is made between a client and server using DRDA, CCSIDs are exchanged. If a conversion is not available to convert between CCSIDs, the connection will fail.
For conversions that would not normally be available, two-stage converters can be used.
  For example, converters from Chinese CCSIDs to Japanese CCSIDs aren't normally available; however, we could convert from Chinese to Unicode and from Unicode to Japanese.
  It is possible to create two-stage converters using the OS/390 support for Unicode.
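
The two-stage idea is simply "source to Unicode, then Unicode to target". A minimal Python sketch follows, assuming Python's `gb2312` and `shift_jis` codecs as stand-ins for a Chinese and a Japanese CCSID; the function name `two_stage_convert` and the sample text are invented for illustration and have nothing to do with the OS/390 conversion services themselves.

```python
def two_stage_convert(data, source_encoding, target_encoding):
    """Convert bytes via Unicode: stage 1 decodes, stage 2 re-encodes."""
    unicode_text = data.decode(source_encoding)   # source -> Unicode
    return unicode_text.encode(target_encoding)   # Unicode -> target


# Two ideographs that exist in both the Chinese and Japanese character sets.
chinese_bytes = "\u4e2d\u6587".encode("gb2312")
japanese_bytes = two_stage_convert(chinese_bytes, "gb2312", "shift_jis")

print(chinese_bytes.hex(), japanese_bytes.hex())
```

A two-stage conversion only succeeds for characters present in both code pages; anything missing from the target needs a substitution policy, which this sketch does not attempt to model.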


UTF-16 and SPUFI or DSNTEP2
  SPUFI and DSNTEP2 really aren't UTF-16 aware
  In most cases, you should use CHAR(graphic column) when selecting data. For example, use:
    SELECT CHAR(g1) FROM T1
  Not
    SELECT g1 FROM T1
Hex constants are character based.
  INSERT INTO T1 (g1) VALUES(x'0041'); -- will result in x'00000041', not x'0041' as you might expect.
  Because hex constants are character based, DB2 will convert from UTF-8 to UTF-16 for you: x'00' -> x'0000' and x'41' -> x'0041'.

SPUFI and DSNTEP2 are not designed to work with GRAPHIC data on a MIXED = NO subsystem.
Hex constants are character based, not graphic based.
  Hex constants should be used with UTF-16 with care.
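
The hex constant surprise is easier to believe once you reproduce it. The Python sketch below mimics the described behavior: the hex digits are taken as UTF-8 character bytes, and each resulting character is then converted to UTF-16. The function name `hex_constant_to_graphic` is hypothetical, and this models only the conversion described above, not DB2's parser.

```python
def hex_constant_to_graphic(hex_digits):
    """Treat hex digits as UTF-8 character data, then convert to UTF-16."""
    as_characters = bytes.fromhex(hex_digits).decode("utf-8")
    return as_characters.encode("utf-16-be").hex().upper()


print(hex_constant_to_graphic("0041"))  # two characters: NUL and 'A'
print(hex_constant_to_graphic("41"))    # one character: 'A'
```

In other words, to store the single UTF-16 character x'0041' through a hex constant, the constant to write is x'41'.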

Example Scenarios

Pre-V7

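[Diagram: two DB2 V6 subsystems with Mixed = No, one using EBCDIC CCSID 500 and one using EBCDIC CCSID 37. Clients shown: DRDA with CCSID 819, DRDA with CCSID 850, DRDA with CCSID 290/930/300 (Japanese), a 3270 emulator with CCSID 37, and a 3270 emulator with CCSID 500.]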

Even though CCSID 37 and CCSID 500 are both Latin-1 code pages and are compatible, you should not connect to a CCSID 500 system with a CCSID 37 emulator.
  Characters such as [, ], |, and ! are not represented the same way in CCSID 37 and CCSID 500, so these characters will be corrupted when returned to another user via DRDA or a 3270 data stream.
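
The corruption can be reproduced off the mainframe. The Python sketch below assumes Python's `cp500` and `cp037` codecs as stand-ins for CCSID 500 and CCSID 37; the function name `through_mismatched_pages` is invented for illustration. It stores text under one code page and reads the same bytes back under the other.

```python
def through_mismatched_pages(text, stored_as="cp500", read_as="cp037"):
    """Encode under one EBCDIC code page and decode the bytes under another."""
    return text.encode(stored_as).decode(read_as)


# Letters and digits share code points in both pages, so they survive.
print(through_mismatched_pages("ABC123"))

# The characters called out above do not survive the mismatch.
for character in "[]|!":
    print(repr(character), "->", repr(through_mismatched_pages(character)))
```

Because most of the data comes back intact, this kind of mismatch can go unnoticed for a long time, which is exactly why it is dangerous.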

V7 and Beyond

[Diagram: two DB2 V7 subsystems with Unicode and Mixed = No, one using EBCDIC CCSID 500 and one using EBCDIC CCSID 37. Clients shown: DRDA with CCSID 850, DRDA with CCSID 912 (Czech), DRDA with CCSID 290/930/300 (Japanese), a 3270 emulator with CCSID 37, and a 3270 emulator with CCSID 500.]

With V7, connections from clients that were not possible in the past are now possible. This doesn't mean that there aren't challenges:
  Conversions need to be defined
  Correct behavior may depend on the Application Encoding bind option specification

Summary

Character Conversion Fundamentals


Unicode Support in DB2 UDB for OS/390 V7


Example Scenarios

Appendix - CCSID Information and Documentation

Installation Guide
  Appendix A. Character conversion
SQL Reference
  Character sets and code pages
  Character conversion
  Conversion rules for string assignment
  Conversion rules for string comparison
  Character conversion in unions and concatenations
  Selecting the result CCSID
  SQL descriptor area (SQLDA)
Administration Guide
  Choosing string or numeric data types

There are many sections in the DB2 documentation that deal with character conversion. Some of the more important ones are shown here.
I have listed the section titles for the books. I haven't listed page numbers or section numbers because these may vary depending on the form of the book (paper, PDF, or BookManager) you use.
You should become familiar with these sections of the documentation if character conversion is occurring on your system.

Appendix - Catalog

Catalog changes for Unicode in DB2 for z/OS & OS/390 V7
Added Columns
  SYSPLAN
    RELBOUND (indicates release when plan was last bound or rebound)
    ENCODING_CCSID (bind option value)
  SYSPACKAGE
    RELBOUND (indicates release when package was last bound or rebound)
    ENCODING_CCSID (bind option value)
  SYSVIEWS
    RELCREATED (indicates release when view was created)
  SYSTABLES
    RELCREATED (indicates release when table was created)
Updated Columns (updated for ENCODING UNICODE)
  SYSDATABASE
  SYSTABLESPACE
  SYSTABLES
  SYSPARMS
  SYSDATATYPES

There were many changes to the catalog for DB2 V7. The changes related to Unicode support are listed here.

Appendix - zSeries Unicode Support

The UTF-8 <-> UTF-16 instructions are used when DB2 converts from char to graphic or graphic to char. These are used in DB2 V7 if running on G5, G6, zSeries 900, zSeries 800 or OS/390 V2R8 or later.

  CUUTF - Convert UTF-16 to UTF-8
  CUTFU - Convert UTF-8 to UTF-16

The following two instructions are similar to CLCLE and MVCLE. DB2 will use these instructions to perform comparison and padding on UTF-16 data because you can specify a two byte padding character. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on zSeries 900 or OS/390 V2R8:

  CLCLU - Compare logical long UNICODE
  MVCLU - Move logical long UNICODE

These instructions pack/unpack ASCII (also UNICODE UTF-8) and UNICODE (UTF-16) data. These instructions are used when DB2 converts a character string to a decimal or internal date/time/timestamp or a decimal value to a character string. DB2 UDB for OS/390 and z/OS V7 will use these instructions if running on zSeries 900 or OS/390 V2R8:

  PKU - Pack Unicode
  PKA - Pack ASCII
  UNPKU - Unpack Unicode
  UNPKA - Unpack ASCII

These instructions are all used when DB2 performs conversion. For instance from ASCII SBCS to UNICODE UTF-16, DB2 will use the TROT (one byte to two byte characters). DB2 indirectly uses these instructions via the Conversion System Services (available in OS/390 V2R8 and above) in DB2 UDB for z/OS & OS/390 V7:

  TRTT - Translate Two to Two
  TRTO - Translate Two to One
  TROT - Translate One to Two
  TROO - Translate One to One

The z/Architecture zSeries 900 and zSeries 800 processors provide instructions that support Unicode data processing. The support is detailed here.

References

DB2 UDB Server for OS/390 Version 7 and z/OS Presentation Guide
  Redbook on DB2 V7, SG24-6121
DB2 Universal Database Administration Guide, SC09-2946
  Appendix E - National Language Support
The Unicode Standard Version 3.0
  The Unicode Consortium, Addison-Wesley
  www.unicode.org
  http://www.ibm.com/developerworks/unicode/
Character Data Representation Architecture: Reference & Registry, SC09-2190
National Language Design Guide Volume 2, SE09-8002

In appendix E of the DB2 UDB Admin Guide, there is a discussion of NLS issues and how to set/override the codepage at the client using the codepage keyword on NT, and LANG variable on AIX.
