db2 unicode-dbcs
DB2 UDB for z/OS & OS/390
Character Conversion & Unicode Fundamentals
Chris Crone
Senior Software Engineer
This is a basic survival lesson in character conversion and Unicode. What is a CCSID? What does DB2 do with it? Why do I care? What does Unicode do for me? If you care about data integrity and run on more than one operating system in a global business, this is the basic information you need to survive.
Agenda
Character Conversion Fundamentals
Unicode Fundamentals
Unicode Support in DB2 UDB for z/OS & OS/390 V7
Things to Look Out For
Example Scenarios
Summary
Chris Crone 04:07 PM 04/16/02 1-2
This is the agenda for a basic survival lesson in character conversion and Unicode. What is a CCSID, what does DB2 do with it, why do I care, and what does Unicode do for me? In a global economy, this is the basic information you need to survive.
While these are the fundamentals, they are not simple. There are many misconceptions. Getting some of the information back correctly does not guarantee that you are always getting the data back without loss.
Terminology
For the purposes of this presentation:
UNICODE
  UTF-8
  UTF-16
ASCII
  ASCII is a generic term that refers to all ASCII CCSIDs that DB2 currently supports
EBCDIC
  EBCDIC is a generic term that refers to all EBCDIC CCSIDs that DB2 currently supports
CCSID
  Coded Character Set Identifier
  Used by DB2 to tag string data
When I use the term UNICODE, I mean the UTF-16 and UTF-8 encodings. See www.unicode.org for more.
When I use the term ASCII, I mean any generic ASCII CCSID (like 850, 819, 437), whether Single Byte Character Set (SBCS), Mixed, or Double Byte Character Set (DBCS).
When I use the term EBCDIC, I mean any generic EBCDIC CCSID (like 500, 37), whether SBCS, Mixed, or DBCS.
CCSIDs are used by DB2 to tag string data. A CCSID precisely defines the encoding of the data.
Character Conversion Fundamentals
When we store data in some EBCDIC CCSID and display it on a PC in some ASCII CCSID, it must be translated. There are many translations, and we're just getting started.
What is Character Conversion
[Figure: code page charts for CCSID 437 and CCSID 1252]
This slide depicts two common PC codepages. Note the differences between them.
  Codepage 1252 defines things like the Euro and the full Latin-1 character set with all the accented characters.
  Codepage 437 defines only a partial list of the Latin-1 accented characters.
Because the codepages do not contain exactly the same set of characters, data cannot necessarily be converted from one to the other without potential loss of data.
Note also that the same characters are assigned to different code points.
  Look at the ae ligature: æ and Æ are '91'x and '92'x in CCSID 437, but 'E6'x and 'C6'x in CCSID 1252.
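The code page differences described above can be checked with a short sketch. It assumes that Python's built-in cp437 and cp1252 codecs are acceptable stand-ins for CCSIDs 437 and 1252; the helper name code_point is made up for this illustration.

```python
def code_point(ch, codec):
    """Return the hex byte(s) for ch in codec, or None if ch is not representable."""
    try:
        return ch.encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


# Same characters, different code points; the Euro exists only in 1252.
for ch in ("æ", "Æ", "€"):
    print(ch, "437:", code_point(ch, "cp437"), "1252:", code_point(ch, "cp1252"))
```

A result of None marks a character that cannot survive a conversion into that code page, which is exactly the potential data loss the slide warns about.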
Conversion Methods
Native DB2
  SYSIBM.SYSSTRINGS (V2.3)
ICONV
  Uses LE base services (V6) - Non-Strategic
  Requires OS/390 V2R9
OS/390 V2 R8/R9/R10 & z/OS support for Unicode (V7)
  Conversion Services
  Requires OS/390 V2R8 and above + APAR OW44581
  Code and program directory: http://www6.software.ibm.com/dl/os390/unicodespt-p
  Documentation: http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunpde00.pdf http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunuge00.pdf
  Information APAR II13048 and II03049
DB2 uses three methods for conversion:
SYSSTRINGS - These are the conversion services that were introduced in DB2 V2R3 and the ones most people are familiar with.
ICONV - Introduced in DB2 V6 and requiring OS/390 V2R9 and above. This was our first attempt at leveraging OS/390 infrastructure to perform conversion. It is non-strategic because the OS/390 V2 R8/R9/R10 support for Unicode provides more functionality with better performance.
OS/390 V2 R8/R9/R10 support for Unicode - Starting in V7, DB2 will be leveraging this service for most future character conversion support.
Native DB2
Based on SYSIBM.SYSSTRINGS
High Performance
Added with DB2 V2R3
Support for
  Single byte, Mixed, Double byte
  ASCII EBCDIC
Uses a combination of 256 byte conversion tables and special two stage look up tables
The native conversion services that DB2 ships are based on support that was added in DB2 V2R3.
  They are high performance and rely on a cached copy of SYSIBM.SYSSTRINGS rows to perform conversion.
  A large number of ASCII and EBCDIC, Single byte, Mixed, and Double byte conversions are supported.
  These conversions use a combination of conversion tables contained in the TRANSTAB field of SYSSTRINGS and two stage conversion tables (for Mixed and Double byte conversions) specified by the TRANSPROC field of SYSSTRINGS.
Central repository for OS/390 system
High performance
  Uses HW instructions available in z900 GA2
    See appendix for complete list of HW instructions
  Uses page fixed tables in a data space
Conversion image built by off-line utility
  CUNMIUTL - see sample in hlq.SCUNJCL (CUNJIUTL)
Administered via OS/390 Console
  SET UNI
  DISPLAY UNI
OS/390 V2 R8/R9/R10 & z/OS support for Unicode
With the introduction of the OS/390 support for Unicode system service, OS/390 now has a central repository for conversion that can be used by applications, middleware, and subsystems.
This service is designed to be high performance and utilizes new HW instructions and page fixed conversion tables to perform the conversions.
This service uses a conversion image that is built by the off-line utility CUNMIUTL. A customer specifies which conversions will be supported by the conversion image.
The conversion image is managed via the OS/390 console, not DB2.
  The SET UNI command specifies the image to be loaded.
  The DISPLAY UNI command displays information about the currently loaded conversion image.
Conversion Services Example
//CUNMIUTL EXEC PGM=CUNMIUTL
//SYSPRINT DD SYSOUT=*
//TABIN DD DISP=SHR,DSN=hlq.SCUNTBL
//SYSIMG DD DSN=hlq.IMAGES(CUNIMG00),DISP=SHR
//SYSIN DD *
 /********************************************
  * INPUT STATEMENTS FOR THE IMAGE GENERATOR *
  ********************************************/
 CONVERSION 00850,01047,ER; /*ASCII -> EBCDIC */
 CONVERSION 01047,00850,ER; /*EBCDIC -> ASCII */
 CONVERSION 00037,1200,ER; /*EBCDIC 037 -> UCS-2 */
 CONVERSION 1200,00037,ER; /*UCS-2 -> EBCDIC 037*/
 CONVERSION 00500,1200,ER; /*Latin-1 EBC -> UCS-2 */
 CONVERSION 1200,00500,ER; /*UCS-2 -> Latin-1 EBC*/
 CONVERSION 01047,1200,ER; /*EBCDIC 1047 -> UCS-2 */
 CONVERSION 1200,01047,ER; /*UCS-2 -> EBCDIC 1047*/
 CONVERSION 01208,1200,ER; /*UnicodeCCSID-> UCS-2 */
 CONVERSION 1200,01208,ER; /*UCS-2 -> UnicodeCCSI*/
 CONVERSION 01383,1200,ER; /*Simp Chines -> UCS-2 */
 CONVERSION 1200,01383,ER; /*UCS-2 -> Simp Chines*/
 CONVERSION 00932,1200,ER; /*Jpn MCCSID -> UCS-2 */
 CONVERSION 1200,00932,ER; /*UCS-2 -> Jpn MCCSID */
 CONVERSION 00939,1200,ER; /*Jpn-ExtEng -> UCS-2 */
 CONVERSION 1200,00939,ER; /*UCS-2 -> Jpn-ExtEng */
 CONVERSION 00300,1200,ER; /*Jpn GCCSID -> UCS-2 */
 CONVERSION 1200,00300,ER; /*UCS-2 -> Jpn GCCSID */
 CONVERSION 00500,00850,ER; /*Latin-1 EBC -> ASCII */
 CONVERSION 00850,00500,ER; /*ASCII -> Latin-1 EBC*/
/*
Here's an example of the CUNJIUTL sample job, which runs the CUNMIUTL utility.
Note the specification of ER (enforced subset, round trip) after each CCSID pair. The ER specification is required by DB2.
Conversion Services Configuration
Which conversions should be configured
  CCSID 367 (7-Bit ASCII) <-> ASCII & EBCDIC System CCSID(s)
  CCSID 1208 (UTF-8) <-> ASCII & EBCDIC System CCSID(s)
  CCSID 1200 (UTF-16) <-> ASCII & EBCDIC System CCSID(s)
  Client CCSID(s) <-> Unicode CCSIDs (367, 1208, 1200)
Additional ASCII or EBCDIC Conversions
  Starting with V7, most new code conversion support will be via conversion services. Native DB2 conversions will continue to be supported and used, but in most cases, not enhanced.
Other
  Conversions needed to LOAD/UNLOAD data
  Conversions needed to support application encoding bind option, DECLARE VARIABLE, or CCSID overrides
Which conversions should be configured:
All DB2 conversions involving Unicode are supported via the OS/390 conversion services. Any conversion involving Unicode must be configured.
ASCII and EBCDIC conversions not supported by native DB2 conversion methods: other conversions that are not supported via SYSIBM.SYSSTRINGS are supported via the conversion services.
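The pairing rule above (each system CCSID to and from each Unicode CCSID) can be turned into image generator input mechanically. The following is a minimal sketch only: the function name is hypothetical, the statement layout is copied from the sample job, and zero-padding every CCSID to five digits is an assumption (the sample mixes padded and unpadded forms). Client CCSIDs and Unicode-to-Unicode pairs would have to be added separately.

```python
from itertools import product

UNICODE_CCSIDS = (367, 1208, 1200)  # 7-bit ASCII, UTF-8, UTF-16


def conversion_statements(system_ccsids, technique="ER"):
    """Build CONVERSION statements in both directions between each
    system CCSID and each Unicode CCSID."""
    lines = []
    for sys_ccsid, uni_ccsid in product(system_ccsids, UNICODE_CCSIDS):
        lines.append(f"CONVERSION {sys_ccsid:05d},{uni_ccsid:05d},{technique};")
        lines.append(f"CONVERSION {uni_ccsid:05d},{sys_ccsid:05d},{technique};")
    return lines


# Example: an EBCDIC system CCSID of 37 and an ASCII system CCSID of 819.
for line in conversion_statements([37, 819]):
    print(line)
```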
Conversion Services Example
14.34.14 d uni,all
14.34.15 CUN3000I 14.34.14 UNI DISPLAY 097
 ENVIRONMENT: CREATED 12/11/2000 AT 09.13.53
              MODIFIED 12/11/2000 AT 09.13.53
              IMAGE CREATED 12/06/2000 AT 17.10.01
 SERVICE: CUNMCNV CUNMCASE
 STORAGE: ACTIVE 50 PAGES
          LIMIT 524287 PAGES
 CASECONV: NONE
 CONVERSION: 00500-00367-ER 00500-01208-ER
             00500-01200(13488)-ER 00367-00500-ER
             00367-01208-ER 00367-01200(13488)-ER
             01208-00500-ER 01208-00367-ER
             01208-01200-ER 01200(13488)-00500-ER
             01200(13488)-00367-ER 01200-01208-ER
Here's an example of output from a DISPLAY UNI command.
The output shows when the image was created.
It also shows the number of pages used by the image. This is important because these pages are page fixed, so they take up dedicated memory on the machine.
Finally, it lists the conversions that are supported in this image.
  Note that for CCSID 1200 conversions, the base CCSID that the conversion was created from is also displayed in parentheses.
Round Trip - VS - Enforced Subset
Round Trip (RT) Conversions
  Designed to preserve codepoints that are not representable in both codepages
Enforced Subset (ES) Conversions
  Codepoints that are not representable are converted to SUB character
DB2 uses a combination of RT and ES conversions
  Trend is toward ES conversions
  Continue to use RT conversions in some cases for compatibility reasons
DB2 conversions are either Round Trip or Enforced Subset.
Round Trip conversions attempt to avoid loss of data by mapping unrepresented codepoints to unused (or unlikely to be used) codepoints
  Data loss can be avoided or delayed
  Can cause strange conversions
Enforced Subset conversions map any unrepresented codepoints to the sub-character
  Data loss occurs immediately
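The enforced subset behavior can be imitated in a few lines. This is a sketch, not DB2's conversion code: it assumes Python's cp437 codec stands in for CCSID 437 and uses the ASCII SUB control character X'1A' as the substitution character (an EBCDIC target would use X'3F'). The function name is made up for the illustration.

```python
SUB = b"\x1a"  # ASCII substitute character, assumed here as the SUB code point


def enforced_subset(text, codec, sub=SUB):
    """Convert text to codec, replacing every unrepresentable character
    with the SUB character. Returns (bytes, number_of_substitutions)."""
    out = bytearray()
    lost = 0
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += sub
            lost += 1
    return bytes(out), lost


converted, lost = enforced_subset("5€", "cp437")
print(converted.hex().upper(), "substitutions:", lost)
```

Once the Euro has become SUB, converting back cannot recover it, which is the immediate data loss noted above. A round trip table would instead map the Euro to some unused code point so that the reverse conversion could restore it.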
Conversions can cause the length of a string to change
Expanding Conversions
  When data converted from one CCSID to another expands
  For example: Å - 'C5'x in CCSID 819 -> 'C385'x in CCSID 1208
Contracting Conversions
  When data converted from one CCSID to another contracts
  For example: Å - '00C5'x in CCSID 1200 -> 'C5'x in CCSID 819
Expanding and Contracting Conversions
There are some cases where a conversion causes the length of the data to change.
  Expanding conversions cause the length of the data to grow.
  Contracting conversions cause the length of the data to shrink.
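The length change is easy to reproduce. The sketch below assumes Python's latin-1, utf-8, and utf-16-be codecs correspond to CCSIDs 819, 1208, and 1200; the helper name is hypothetical.

```python
def converted_length(data, source_codec, target_codec):
    """Return the byte length of data after conversion between two encodings."""
    return len(data.decode(source_codec).encode(target_codec))


a_ring_819 = bytes.fromhex("C5")     # Å in CCSID 819
a_ring_1200 = bytes.fromhex("00C5")  # Å in CCSID 1200

# Expanding: 1 byte in CCSID 819 becomes 2 bytes in CCSID 1208.
print(converted_length(a_ring_819, "latin-1", "utf-8"))
# Contracting: 2 bytes in CCSID 1200 become 1 byte in CCSID 819.
print(converted_length(a_ring_1200, "utf-16-be", "latin-1"))
```

The practical consequence is that a value which fits a column in one encoding may not fit a column of the same byte length in another.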
What are CCSIDs used for?
DB2 uses CCSIDs to describe data stored in the DB2 subsystem
  DB2 supports specification of CCSIDs at a subsystem level
  With V7, DB2 supports 3 encoding schemes
    ASCII
    EBCDIC
    UNICODE
  Data is comparable only within a single encoding scheme
So now that we know all about CCSIDs, what are they used for?
DB2 uses CCSIDs just like we use data type and length. They are part of the metadata that describes the data being stored in DB2.
In V7, DB2 supports specification of three sets of CCSIDs. These three sets of CCSIDs represent the three encoding schemes (ASCII, EBCDIC, and Unicode) that DB2 supports.
DB2 supports the specification of these CCSIDs at the subsystem level. Once these values have been specified, they should not be changed.
Installation
Specification of CCSIDs is performed at installation via install panel DSNTIPF
DSNTIPF INSTALL DB2 - APPLICATION PROGRAMMING DEFAULTS PANEL 1
===>_
Enter data below:
 1 LANGUAGE DEFAULT     ===> IBMCOB   ASM,C,CPP,COBOL,COB2,IBMCOB,FORTRAN,PLI
 2 DECIMAL POINT IS     ===> .        . or ,
 3 MINIMUM DIVIDE SCALE ===> NO       NO or YES for a minimum of 3 digits
                                      to right of decimal after division
 4 STRING DELIMITER     ===> DEFAULT  DEFAULT, " or ' (COBOL or COB2 only)
 5 SQL STRING DELIMITER ===> DEFAULT  DEFAULT, " or '
 6 DIST SQL STR DELIMTR ===> '        ' or "
 7 MIXED DATA           ===> NO       NO or YES for mixed DBCS data
 8 EBCDIC CCSID         ===> 0        CCSID of your SBCS or MIXED DATA
 9 ASCII CCSID          ===> 0        CCSID of SBCS or mixed data.
10 Unicode CCSID        ===> 1208     CCSID of Unicode UTF-8 data.
11 DEF ENCODING SCHEME  ===> EBCDIC   EBCDIC, ASCII, or UNICODE
12 LOCALE LC_CTYPE      ===>
13 APPLICATION ENCODING ===> EBCDIC   EBCDIC, ASCII, UNICODE ccsid (1-65533)
14 DECIMAL ARITHMETIC   ===> DEC15    DEC15,DEC31,15,31
15 USE FOR DYNAMICRULES ===> YES      YES or NO
16 DESCRIBE FOR STATIC  ===> NO       Allow DESCRIBE for STATIC SQL. NO or YES.
Install panel DSNTIPF is used to specify CCSID information.
Options 8, 9, and 10 are where the CCSIDs for the three encoding schemes are specified.
Notice that the ASCII and EBCDIC CCSIDs are initialized to 0 and the Unicode CCSID is initialized to 1208.
  The ASCII and EBCDIC CCSIDs are not pre-filled; these values need to be set by the customer.
  The EBCDIC value should be set to the CCSID that the customer's 3270 emulators, CICS, and IMS transactions use. The ASCII value should be set to the CCSID that is most commonly used by workstations in the customer shop (1252 for example).
  The Unicode value is pre-filled with 1208 and cannot be changed. This value specifies the mixed CCSID for Unicode tables.
Other things to note on this page:
  Option 11 - This specifies the default encoding scheme for objects created in the DB2 subsystem.
  Option 13 - This option specifies the default application encoding. Changing this value should be done with great care.
Installation (continued)
Information from DSNTIPF ends up in Job DSNTIJUZ
Mixed System
  DSNHDECM ASCCSID=1088, AMCCSID=949, AGCCSID=951, SCCSID=833, MCCSID=933, GCCSID=834, USCCSID=367, UMCCSID=1208, UGCCSID=1200, ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=YES
  END
Non-Mixed System
  DSNHDECM ASCCSID=819, AMCCSID=65534, AGCCSID=65534, SCCSID=37, MCCSID=65534, GCCSID=65534, USCCSID=367, UMCCSID=1208, UGCCSID=1200, ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=NO
  END
The information from panel DSNTIPF flows to Job DSNTIJUZ.
In the case on the left, we have a Mixed = Yes system that is set up to support Korea. The ASCII and EBCDIC system CCSIDs that would actually have been specified on panel DSNTIPF to result in this specification are the Mixed CCSIDs, 949 and 933.
  For mixed systems, and for the Unicode CCSID, the Mixed CCSID is specified on install panel DSNTIPF and DB2 will pick the corresponding Single byte and Graphic (Double byte) CCSIDs.
In the case on the right, we have a Mixed = No system that is set up to support US English.
  Note that the user specified 819 and 37 for the ASCII and EBCDIC Single byte CCSIDs and that DB2 used the value 65534 for the ASCII and EBCDIC Mixed and Graphic (Double byte) CCSIDs. 65534 is a reserved value that means no CCSID.
Also note that the Default Encoding and Default Application Encoding flow to this job.
Note there is a bug in DSNTIJUZ and DSNHDECM - these ship with CCSID 500 as default.
CCSIDs are stored in the following places
  SYSIBM.SYSDATABASE
  SYSIBM.SYSTABLESPACE
  SYSIBM.SYSVTREE
  SYSIBM.SYSPLAN (V7)
  SYSIBM.SYSPACKAGE (V7)
  Plans and Packages (SCT02 and SPT01)
  Directory (DSNDB01) (V5)
  DECP
In the ENCODING_SCHEME column of the following tables, stored as 'A', 'E', 'U', or blank (default)
  SYSIBM.SYSDATATYPES
  SYSIBM.SYSDATABASE
  SYSIBM.SYSPARMS
  SYSIBM.SYSTABLESPACE
  SYSIBM.SYSTABLES
Where Is Encoding Information stored?
Once we've specified CCSIDs for our system, what does DB2 do with them?
DB2 stores CCSIDs in the catalog, the directory, in bound statements, and of course the DECP.
  The value stored in these areas depends on what release of DB2 was used to create the object, and the value in the DECP at the time the object was created.
  If a value is 0, it is assumed that the object is EBCDIC.
In general, DB2 does not support changing a CCSID once it is specified in a DECP.
  The exceptions are:
    Changing from 0 to a valid value
    Changing from a CCSID that does not support the Euro symbol to a CCSID that supports the Euro symbol (37 -> 1140 for instance).
  Note that this sort of change requires special, disruptive changes and should be undertaken only after the documentation has been read and the process is thoroughly understood.
Encoding information is also stored in some catalog tables.
DSNTIPF Mixed Data Option
Mixed = No systems have support for
  SBCS Data - Pure single byte data
  Mixed Data
    Unicode UTF-8 MBCS (1-4 bytes/char) data. No support for ASCII/EBCDIC mixed data
  Graphic Data
    Unicode UTF-16 (2 or 4 bytes/char) data. No support for ASCII/EBCDIC DBCS data
Mixed = Yes systems have support for
  SBCS Data - Pure single byte data
  Mixed Data - Single & double byte data in a single string
  Graphic Data - Pure double byte data
Mixed = Yes systems are used in the Far East, primarily China, Japan, and Korea. They offer support for SBCS, Mixed, and DBCS ASCII and EBCDIC data. Mixed = No systems are used elsewhere in the world and only have support for SBCS ASCII and EBCDIC data.
Unicode data is always considered mixed, regardless of the Mixed = Yes/Mixed = No setting of the system.
Prior to DB2 V7, Mixed = No systems allowed creation of FOR MIXED DATA and GRAPHIC columns in EBCDIC tables. As of DB2 V7, Mixed = No systems only allow specification of these types of columns in Unicode tables.
Mixed Data
SBCS Data
  SBCS can be compared to mixed without conversion to mixed because it is a subset of the mixed repertoire. This is true for ASCII, EBCDIC, and Unicode
Mixed Data
  Capable of representing SBCS and MBCS data
  EBCDIC SO, SI ('0E'x, '0F'x) delineate DBCS data
    AB<A> -> 'C1C20E42C10F'x
  ASCII uses first byte code point: if the first byte is within a certain range, say 'A0'x - 'AF'x, then it is the first byte of a DBCS character.
    For example 'A055'x would be a DBCS character
    Some CCSIDs have several first byte code point ranges.
  UTF-8 data uses the high order bit to indicate MBCS data
    For example 'EFBC91'x is a three byte UTF-8 character
Graphic Data
  ASCII or EBCDIC - DBCS characters, no shift or first byte code points needed
  Unicode - DBCS characters. Surrogates take two DBCS characters
Mixed Data
Mixed = Yes systems use a CCSID triplet. That is to say, there is an SBCS CCSID, a Mixed CCSID, and a DBCS CCSID. On these systems, the SBCS CCSID is a subset of the Mixed CCSID. Because of this, SBCS and Mixed data can be compared without converting.
SBCS columns are created by specifying the "FOR SBCS DATA" clause on CREATE,
  like CREATE TABLE T1 (C1 CHAR(10) FOR SBCS DATA);
Mixed columns are the default on Mixed = Yes systems, and can be explicitly specified by using the "FOR MIXED DATA" clause on CREATE.
DBCS data is stored in Graphic columns.
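To make the shift-out/shift-in convention for EBCDIC mixed data concrete, here is a small sketch that splits a mixed string into its single byte and double byte runs. The function name is made up, and scanning byte by byte for SO and SI is a simplification that assumes neither value occurs inside a DBCS character.

```python
SO, SI = 0x0E, 0x0F  # shift-out starts DBCS data, shift-in ends it


def split_mixed_ebcdic(data):
    """Split mixed EBCDIC bytes into ('SBCS' | 'DBCS', bytes) runs."""
    runs = []
    mode = "SBCS"
    current = bytearray()
    for byte in data:
        if mode == "SBCS" and byte == SO:
            if current:
                runs.append(("SBCS", bytes(current)))
            current = bytearray()
            mode = "DBCS"
        elif mode == "DBCS" and byte == SI:
            if len(current) % 2:
                raise ValueError("odd number of bytes between SO and SI")
            runs.append(("DBCS", bytes(current)))
            current = bytearray()
            mode = "SBCS"
        else:
            current.append(byte)
    if mode == "DBCS":
        raise ValueError("SO without a matching SI")
    if current:
        runs.append(("SBCS", bytes(current)))
    return runs


# The slide's example: AB<A> -> 'C1C20E42C10F'x
print(split_mixed_ebcdic(bytes.fromhex("C1C20E42C10F")))
```

The two checks that raise ValueError correspond to the usual rules for well formed mixed data: shifts must be paired, and the data between them must be an even number of bytes.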
When Does Conversion Occur?
Local
  Generally, conversion does not occur for local applications
  When dealing with ASCII/Unicode tables
  When specified by application
    CCSID Override in SQLDA (V2.3 & above)
    Declare Variable (V7)
    Application Encoding Bind Option (V7)
    Current Application Encoding Special Register (V7)
Remote
  Automatically when needed
    DRDA Receiver Makes Right
There are no hard and fast rules as to when a conversion occurs. The short answer is that conversion occurs when necessary.
Some of the cases when we do conversion are listed here.
Unicode Fundamentals
Why Unicode?
Unicode is a single character set that encodes all of the world's scripts (sort of).
The Unicode standard provides a cross platform, cross vendor method of encoding data that enables lossless representation and manipulation
Before Unicode
  Many Standards
    ANSI
    JIS
    TISI
  Provided by various vendors
    IBM
      ASCII (pSeries, xSeries) and EBCDIC (zSeries and iSeries)
    HP
    Microsoft
Sort of, because new characters are being added all the time, so at any given time an implementation of the standard is somewhat behind.
Sort of, because there is one standard that contains several implementations of the standard.
Prior to Unicode, many different standards and vendor implementations existed.
Unicode attempts to standardize the representation and manipulation of data across vendors and platforms.
Unicode Fundamentals
Four forms of Unicode
UTF-8
  Unicode Transformation Format in 8 bits
UCS-2
  Universal Character Set coded in 2 octets
UTF-16
  Unicode Transformation Format in 16 bits
UTF-32
  Unicode Transformation Format in 32 bits
  Introduced with Unicode Technical Report #19 to replace UCS-4
There are currently 4 forms of Unicode in use. The Unicode standards organization promotes UTF-8, UTF-16, and UTF-32; UTF-16 is the preferred format (according to UTR#19).
UCS-2 is the precursor to UTF-16.
UTF-8 (CCSID 1208)
ASCII Safe UNICODE (maps to 7-Bit ASCII)
  Bytes '00'x - '7F'x = 7-Bit ASCII
  Bytes '00'x - '7F'x represented by single byte chars
  Chars from '80'x upward are encoded by 2-6 byte chars (characters in the Unicode range, up to U+10FFFF, need at most 4 bytes)
    Most characters take 2-3 bytes
    Most Japanese, Chinese, and Korean characters take 3 bytes
    Most Extended Latin characters take 2 bytes
    Surrogates take 4 bytes
UTF-8 is represented by CCSID 1208. This is a growing CCSID, which means that as characters are added to the Unicode standard, they will be added to this CCSID.
UTF-8 is also commonly called ASCII safe Unicode.
  The first 128 characters are the same as CCSID 367, which is a 7-bit ASCII CCSID.
  Other characters are represented as MBCS characters, so a UTF-8 character takes 1-4 bytes.
One nice feature of UTF-8 is that since it is an 8-bit encoding, it does not have any big endian/little endian issues.
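Because the high order bits of the first byte say how long each character is, a UTF-8 string can be split into characters without a lookup table. The sketch below illustrates this for well formed input only; both function names are made up and the code does not validate continuation bytes.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the UTF-8 character starting with lead_byte occupies."""
    if (lead_byte & 0x80) == 0x00:  # 0xxxxxxx: 7-bit ASCII
        return 1
    if (lead_byte & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (lead_byte & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (lead_byte & 0xF8) == 0xF0:  # 11110xxx
        return 4
    raise ValueError(f"{lead_byte:#04x} is not a valid UTF-8 lead byte")


def split_utf8(data):
    """Split well formed UTF-8 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


# A, A with ring, and U+9860 take 1, 2, and 3 bytes respectively.
print([c.hex().upper() for c in split_utf8("A\u00c5\u9860".encode("utf-8"))])
```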
UCS-2 (CCSID 13488, 17584)
Basic Multilingual Plane - BMP(0)
Pure Double Byte Characters
  64K characters in Repertoire
  '0000'x - '00FF'x Represent 8 bit ASCII (ISO 8859-1)
    '00'x prepended to 8 Bit ASCII characters
  '0100'x - 'FFFF'x Represent additional characters
    Greek -> '0370'x - '03FF'x
    Cyrillic -> '0400'x - '04FF'...
UCS-2 is represented by CCSIDs 13488 and 17584. 13488 corresponds to Unicode Version 2, and 17584 corresponds to Unicode Version 3.
When people say Unicode without qualifying the encoding format, this is usually what they mean.
Other characters are allocated in blocks (there's a block for Greek chars, a block for Cyrillic chars, ...).
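The relationship between 8-bit data and the first 256 UCS-2 code points can be shown directly. This sketch assumes that "8 bit ASCII" means ISO 8859-1 (CCSID 819, Python's latin-1 codec) and that UCS-2 is stored big endian; the function name is hypothetical.

```python
def zero_extend(latin1_bytes):
    """Widen ISO 8859-1 bytes to big endian UCS-2 by putting '00'x in front of each byte."""
    out = bytearray()
    for b in latin1_bytes:
        out += bytes((0x00, b))
    return bytes(out)


print(zero_extend("A\u00c5".encode("latin-1")).hex().upper())  # same bytes as UTF-16 big endian

# Characters outside that range live in their own blocks.
print(0x0370 <= ord("α") <= 0x03FF)  # Greek block
print(0x0400 <= ord("Я") <= 0x04FF)  # Cyrillic block
```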
UTF-16 (CCSID 1200)
UCS-2 with Surrogate Support
  Uses two two-byte characters to represent additional characters
~1 Million characters in repertoire
  BMP1-BMP16 (additional 16 planes).
    Supplementary Multilingual Plane (SMP) - Plane 1: U+10000..U+1FFFF
    Supplementary Ideographic Plane (SIP) - Plane 2: U+20000..U+2FFFF
    Supplementary Special Purpose Plane (SSP) - Plane 14: U+E0000..U+EFFFF
    BMP15 and BMP16 are reserved for private use
UTF-16 is represented by CCSID 1200, which is also a growing CCSID.
UTF-16 is a superset of UCS-2 and uses reserved sections of BMP0 to map an additional 16 planes.
Version 3.1 of the Unicode standard defines the first characters in the surrogate area.
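The mapping from a supplementary plane code point to its two surrogate characters is simple arithmetic, sketched below with a hypothetical function name. The example value U+200D0 lies in Plane 2.

```python
def to_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a code point beyond the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x200D0)
print(f"{high:04X} {low:04X}")
```

Twenty bits of offset is what gives the additional 16 planes of 64K code points each.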
UTF-32
Each character is 4 bytes
  Range is restricted to values '00000000'x - '0010FFFF'x
  Represents the same repertoire as UTF-16
UCS-4
  Implemented by Sun Solaris and HP/UX as base Unicode data type
    XPG/4 standard requires fixed width character format
  zSeries, pSeries looking at UTF-32 implementations to support surrogate characters in C/C++ applications
For completeness, I'm mentioning UTF-32 and UCS-4. DB2 is not implementing any support for these implementations of the Unicode standard, although other vendors have.
Endianness
Big Endian
  pSeries, zSeries, iSeries, Sun, HP
  Most significant byte is leftmost
    For a 4 byte word - Byte order 0,1,2,3
Little Endian
  Intel based machines including xSeries
  Least significant byte is leftmost
    For a 4 byte word - Byte order 3,2,1,0
UTF-8 - not affected by endianness issues
UTF-16 and UTF-32 are affected by endianness issues
  Big Endian
    'A' = x'0041' for UTF-16 or x'00000041' for UTF-32
  Little Endian
    'A' = x'4100' for UTF-16 or x'41000000' for UTF-32
Note: A BYTE is always ordered as leftmost most significant bit to rightmost least significant bit. Bit order within a byte is always 7,6,5,4,3,2,1,0
DB2 and DRDA manipulate and store data in Big Endian format.
Little Endian clients convert data to Big Endian before putting it on the wire.
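The byte order difference, and the swap a little endian client performs, can be seen with Python's explicit-endian codecs. This is an illustrative sketch with made-up function names, not the DRDA client logic.

```python
CODECS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def encodings_of(ch):
    """Return the hex form of ch in each encoding and byte order."""
    return {codec: ch.encode(codec).hex().upper() for codec in CODECS}


def to_big_endian(little_endian_utf16):
    """Re-encode little endian UTF-16 bytes as big endian."""
    return little_endian_utf16.decode("utf-16-le").encode("utf-16-be")


print(encodings_of("A"))
print(to_big_endian(bytes.fromhex("4100")).hex().upper())
```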
Character Examples
A, a, 9, Å (the character A with ring accent), U+9860, U+200D0
ASCII
  '41'x, '61'x, '39'x, 'C5'x, 'CDDB'x (ccsid 939), N/A
UTF-8
  '41'x, '61'x, '39'x, 'C385'x, 'E9A1A0'x, 'F0A08390'x
  Note: 'C5'x becomes double byte in UTF-8
UTF-16 (Big Endian format)
  '0041'x, '0061'x, '0039'x, '00C5'x, '9860'x, 'D840DCD0'x
UTF-32 (Big Endian format)
  '00000041'x, '00000061'x, '00000039'x, '000000C5'x, '00009860'x, '000200D0'x
Note: UCS-2/UTF-16 and UCS-4/UTF-32 are using a technique called Zero Extension
Now that we know all about Unicode, here are some examples of what this stuff looks like.
Note that A-Ring takes two bytes to represent in UTF-8, and that other characters can take three or four bytes to represent.
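The Unicode rows of the example can be regenerated with a few lines, which is a handy way to check hex dumps by hand. The helper name below is made up, and Python's big endian codecs are assumed to match the formats shown on the slide.

```python
EXAMPLES = ["A", "a", "9", "\u00c5", "\u9860", "\U000200D0"]


def hex_forms(ch):
    """Return the (UTF-8, UTF-16 BE, UTF-32 BE) hex strings for one character."""
    return tuple(
        ch.encode(codec).hex().upper()
        for codec in ("utf-8", "utf-16-be", "utf-32-be")
    )


for ch in EXAMPLES:
    print(f"U+{ord(ch):04X}", *hex_forms(ch))
```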
DB2 for z/OS & OS/390 V7 Enhancements for Unicode
Requirement
Enable Unicode on DB2 UDB for OS/390 and z/OS
Support vendors implementing Unicode applications
Support needs of multinational companies
Support data from more than one country/language in one DB2 subsystem
For V7 our challenge was to enable Unicode data storage on DB2 without regressing function or performance for our ASCII and EBCDIC customers.
We wanted to meet the needs of ERP and CRM vendors, as well as address the needs of customer written applications that need to store multinational data.
Solution
Allow UNICODE to be specified as the Encoding Scheme (ES) at the
  System Level
    UNICODE CCSIDs (Install)
    Similar to ASCII/EBCDIC System CCSIDs
  Database
    create database mydb ccsid unicode
  Table Space
    create tablespace myts in mydb ccsid unicode
  Table
    create table t1 (c1 char(10)) ccsid unicode
  Other
    create procedure mysp (in in_parm char(10) ccsid unicode) ...
The Unicode support we have added with DB2 V7 is similar to the support we have for ASCII, which was added in V5.
This support allows ASCII, EBCDIC, and Unicode objects to coexist in a single DB2 subsystem.
Specification of the encoding scheme is made at the system level as well as the object level.
Storage
Storage of Unicode Data
  Char/VarChar/CLOB FOR SBCS DATA
    (7-bit) ASCII, which is a subset of UTF-8 - CCSID 367
  Char/VarChar/CLOB FOR MIXED DATA
    UTF-8 - CCSID 1208
  Graphic/VarGraphic/DBCLOB
    UTF-16 - CCSID 1200
Data stored in Unicode tables in DB2 will be stored in one of the following CCSIDs: 367, 1208, 1200.
Parsing
Parsing will be in EBCDIC
  Conversion to EBCDIC system CCSID from
    ASCII
    EBCDIC
    UNICODE
Need to ensure that literal values are convertible to System EBCDIC CCSID.
  If substitution occurs in statement text being converted to EBCDIC - SQLCODE +335 issued
  Use Host Variables or Parameter markers where conversion to system EBCDIC CCSID is an issue
Parsing for DB2 V7 is in EBCDIC.
This means that all statements sent to DB2 will be converted to the system EBCDIC CCSID, and then parsed.
Since statements are converted to EBCDIC, there is a possibility of data loss when the data is converted.
  Since all DB2 keywords, such as SELECT, are representable on all EBCDIC code pages, there shouldn't be a problem with statement text.
  Literals contained in the statement are subject to data loss.
  SQLCODE +335 has been added to alert the user of this sort of data loss.
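An application can check ahead of time whether a statement's literals would survive conversion to the system EBCDIC CCSID. The sketch below is an approximation under stated assumptions: Python's cp037 and cp1140 codecs stand in for EBCDIC CCSIDs 37 and 1140, the function name is made up, and a strict encode failure is used to predict where DB2 would substitute and issue its warning.

```python
def survives_ebcdic(statement, codec="cp037"):
    """Return True if every character of the statement text is representable
    in the given EBCDIC code page, so no substitution would be needed."""
    try:
        statement.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


print(survives_ebcdic("INSERT INTO T1 (C1) VALUES('Å')"))            # Latin-1 literal
print(survives_ebcdic("INSERT INTO T1 (C1) VALUES('Ω')"))            # Greek literal
print(survives_ebcdic("INSERT INTO T1 (C1) VALUES('€')", "cp1140"))  # Euro-enabled CCSID
```

When the check fails, the safer route is the one the slide recommends: pass the value through a host variable or parameter marker instead of a literal.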
Catalog
The catalog will be encoded in the default EBCDIC CCSID
  Object names will need to be convertible to the default EBCDIC CCSID
    Database Name
    Table space/Index space Name
    External Names (UDF, SP, Exits, Fieldproc...)
  Identifiers may not be producible from all clients, so they should be limited to the common subset that is representable on all clients.
The DB2 V7 catalog will remain EBCDIC, so names stored in the catalog must also be convertible to EBCDIC without loss.
Since the DB2 catalog contains data stored in the system EBCDIC default CCSID, it is possible to have things like Japanese or Cyrillic names for things like columns (depending on your system CCSID). However, since all clients may not be capable of producing these characters, users should really be limited to the subset of Latin-1 characters.
Note: To store Japanese, Chinese, or Korean in the DB2 catalog, a MIXED=YES system is needed and the data being stored in the catalog must conform to the rules for well formed mixed data.
Unicode Literals
Literals
  UTF-8 literals (char/varchar/clob)
    Conform to normal rules for character strings
      INSERT INTO T1 (C1) VALUES('123');
  UTF-16 literals (graphic/vargraphic/dbclob)
    Specified as UTF-8 literals
      INSERT INTO T1 (C1) VALUES('123');
    Specified as Graphic literals
      INSERT INTO T1 (C1) VALUES( ); --Unicode U+BBC0 U+BBC1
Specification of literals for Unicode tables is essentially the same as it is for ASCII or EBCDIC tables.
The one thing to note here is that character literals may be specified for UTF-16 columns. These character literals may be used any place a graphic literal would normally be used.
For instance in the values clause of an insert statement
Host Variables and Parameter Markers
ASCII/EBCDIC/UNICODE -> UNICODE
  Char or Graphic -> UTF-8 or UTF-16
UNICODE -> ASCII/EBCDIC/UNICODE
  UTF-8 or UTF-16 -> Char or Graphic
UTF-8 <-> UTF-16
Applications don't need to change just because the back end data store changes
When dealing with Unicode tables, we have torn down the barrier between CHAR and GRAPHIC.
This means your back end data store can be either UTF-8 or UTF-16, and you can use ASCII, EBCDIC, or Unicode character or graphic host variables. DB2 will perform the necessary conversions to/from the CCSID of the host variable even if the host variable doesn't match the column type. (For ASCII and EBCDIC back end data stores, char and graphic are incompatible in most cases.)
Declare Variable
DECLARE VARIABLE statement
  New way to allow CCSID to be specified for host variables
  Example
    EXEC SQL DECLARE :HV1 VARIABLE CCSID UNICODE;
    EXEC SQL DECLARE :HV2 VARIABLE CCSID 37;
  Precompiler directive to treat hostvar as a specific CCSID
  Useful for PREPARE/EXECUTE IMMEDIATE statement text
    EXEC SQL PREPARE S1 FROM :HV2;
  May be used with any character host variable on input or output
The new DECLARE VARIABLE statement can be used to specify the CCSID of a particular host variable.
This is a precompiler directive that causes the precompiler to specify the CCSID of the host variable in any SQLDA that the precompiler generates to reference the host variable. This directive works for both input and output host variables.
Application Encoding
New Application Encoding Scheme
System Default
  Determines Encoding Scheme when none is explicitly specified
Bind Option
  Allows explicit specification of ES at an application level.
  Affects Static SQL - Provides default for dynamic
  System Default used if bind option not specified
Special Register
  Allows explicit specification of ES at the application level.
  Affects Dynamic SQL
  Initialized with Bind Option
OPTION is ignored when packages are executed remotely
  DRDA specified Input CCSID, data flows as is to client
Also new to DB2 V7 is the specification of Application Encoding Scheme.
This allows a default Application Encoding to be specified
  Preset to EBCDIC
The Application Encoding Scheme can also be specified on BIND PLAN or PACKAGE
  If not specified, the system default value is used for the bind option
  Plans/Packages bound prior to V7 are assumed to be EBCDIC
  The option applies to Static SQL
The Application Encoding Scheme special register can be used to affect dynamic SQL
  Initial value is the value of the Bind Option.
Application Encoding (continued)
Example
Assume Package MY_PACK is bound with APPLICATION ENCODING(UNICODE)
  All Char input/output host variables for static statements are assumed to be in CCSID 1208
  All Graphic input/output host variables for static statements are assumed to be in CCSID 1200
  Initial Value for Application Encoding Special register will be 1208
Declare Variable statement or CCSID overrides can be used for overriding bind option or special register
In this example, package MY_PACK is bound with APPLICATION ENCODING(UNICODE).
Character host variables will be treated as CCSID 1208
Graphic host variables will be treated as CCSID 1200
Initial value of the APPLICATION ENCODING special register will be 1208
The DECLARE VARIABLE statement can be used to override the bind option/special register for host variables.
For statements that use a DESCRIPTOR, as in FETCH USING DESCRIPTOR, CCSID overrides can be coded by hand in the SQLDA.
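To make the two CCSIDs concrete, here is a small Python sketch of what the same string looks like as a character (CCSID 1208) and as a graphic (CCSID 1200) host variable. The CCSID-to-codec table and the helper name encode_for_ccsid are assumptions made for illustration only (UTF-8 for 1208, big-endian UTF-16 for 1200); this is not DB2 code.

```python
# Assumed mapping from the CCSIDs on the slide to Python codec names.
# 1208 is treated as UTF-8 and 1200 as big-endian UTF-16 (illustrative only).
CCSID_TO_CODEC = {1208: "utf-8", 1200: "utf-16-be"}


def encode_for_ccsid(text: str, ccsid: int) -> bytes:
    """Hypothetical helper: return the bytes a host variable tagged with
    the given CCSID would carry for this text."""
    return text.encode(CCSID_TO_CODEC[ccsid])


sample = "A\u00e9"  # 'A' followed by e-acute

char_bytes = encode_for_ccsid(sample, 1208)     # character host variable
graphic_bytes = encode_for_ccsid(sample, 1200)  # graphic host variable

print(char_bytes.hex())     # 41c3a9
print(graphic_bytes.hex())  # 004100e9
```

The same two characters occupy three bytes as character data and four bytes as graphic data, which is why the tag on the host variable matters.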
ODBC/SQLJ/JDBC
ODBC Support
Support for Wide Character APIs (UCS2/UTF-16)
See ODBC Guide and Reference (SC26-9941-01)
Example
SQLRETURN SQLPrepare (SQLHSTMT hstmt, SQLCHAR *szSqlStr, SQLINTEGER cbSqlStr );
SQLRETURN SQLPrepareW (SQLHSTMT hstmt, SQLWCHAR *szSqlStr, SQLINTEGER cbSqlStr );
SQLJ/JDBC Support
Remove current support for converting to EBCDIC before calling engine. Let DB2 engine determine where conversion is necessary
ODBC support for Unicode is included as part of the effort to support ODBC 3.0.
SQLJ and JDBC already support Unicode, but changes have been made to exploit Unicode support in DB2 V7.
COBOL
Enterprise COBOL for z/OS and OS/390 V3R1 supports Unicode
NATIONAL is used to declare UTF-16 variables
MY-UNISTR pic N(10). -- declares a UTF-16 variable
N and NX literals
N'123' NX'003100320033'
Conversions
NATIONAL-OF converts to UTF-16
DISPLAY-OF converts to a specific CCSID
Greek-EBCDIC pic X(10) value " ".
UTF16STR pic N(10).
UTF8STR pic X(20).
Move Function National-of(Greek-EBCDIC, 00875) to UTF16STR.
Move Function Display-of(UTF16STR, 01208) to UTF8STR.
COBOL has recently added support for Unicode characters. Included in this support:
New NATIONAL data type
N and NX literals
Conversion operations
More
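For readers who do not have a COBOL compiler handy, the two conversion steps can be imitated in Python. This is only an analogy, not the COBOL runtime: I am assuming Python's cp875 codec stands in for CCSID 875, utf-16-be for a NATIONAL item, and utf-8 for CCSID 1208. The Greek sample text is my own choice.

```python
# Sample Greek text (five capital letters), chosen for illustration.
text = "\u0391\u0398\u0397\u039d\u0391"

# Start with Greek EBCDIC bytes, as a PIC X item tagged with CCSID 875 would hold.
greek_ebcdic = text.encode("cp875")

# Rough equivalent of NATIONAL-OF(Greek-EBCDIC, 00875): EBCDIC -> UTF-16.
utf16str = greek_ebcdic.decode("cp875").encode("utf-16-be")

# Rough equivalent of DISPLAY-OF(UTF16STR, 01208): UTF-16 -> UTF-8.
utf8str = utf16str.decode("utf-16-be").encode("utf-8")

print(len(greek_ebcdic), len(utf16str), len(utf8str))  # 5 10 10
```

Note that the five Greek letters take 5 bytes in EBCDIC but 10 bytes in either Unicode form, which is consistent with the slide declaring the UTF-8 target as X(20) for an X(10) source.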
Joins, Sub-queries, Unions...
Support will be consistent with ASCII support
No mixing of ES in:
Queries, Joins, Sub-queries, Unions....
For example: Select T1C1, T2C1 from T1,T2 where... fails if T1 and T2 are not the same ES
You cannot reference the DB2 catalog in a query against an ASCII or Unicode table
As in prior releases, you cannot reference tables from more than one encoding scheme in a single statement.
The example fails because T1 and T2 do not use the same encoding scheme.
Predicates
Predicates limited to 255 bytes (except LIKE)
Basic Predicate
SELECT ... WHERE C1 = :HG1
(where C1 is UTF-8 and :HG1 is UTF-16)
LIKE Predicate
SELECT ... WHERE C1 LIKE :HG1 ESCAPE :HG2;
(where C1 is UTF-8 and :HG1 and :HG2 are UTF-16)
IN Predicate
SELECT ... WHERE C1 IN (:HG1, :HV1);
(where C1 is UTF-8, :HG1 is UTF-16, and :HV1 is character)
For queries against Unicode tables, host variables used in a predicate may be specified as UTF-16 or UTF-8 regardless of the data type of the column.
This allows the back end data store to change without changing the application.
When data is primarily Latin-1, it may be more efficient to store UTF-8 data
When data is not primarily Latin-1, it may be more efficient to store data in UTF-16
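The storage trade-off is easy to check for yourself. The sketch below is plain Python, not DB2, and the two sample strings are my own; it simply counts the bytes each encoding needs.

```python
def storage_bytes(text: str) -> dict:
    """Return the number of bytes needed to hold text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


latin = "Cr\u00e8me br\u00fbl\u00e9e"   # 12 characters, mostly ASCII
japanese = "\u65e5\u672c\u8a9e"          # 3 Japanese characters

latin_sizes = storage_bytes(latin)
japanese_sizes = storage_bytes(japanese)

print(latin_sizes)     # {'utf-8': 15, 'utf-16': 24}
print(japanese_sizes)  # {'utf-8': 9, 'utf-16': 6}
```

The Latin-1 text is smaller in UTF-8, while the Japanese text is smaller in UTF-16, which is the reasoning behind the two recommendations above.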
Scalar Functions
Functions
LENGTH, SUBSTR, POSSTR, LOCATE
Byte Oriented for SBCS and Mixed (UTF-8)
Double-Byte Character Oriented for DBCS (UTF-16)
Cast functions
UTF-16/UTF-8 are accepted anywhere char is accepted (char, date, time, integer...)
SELECT DATE(graphic column) FROM T1;
SELECT INTEGER(graphic column) FROM T1;
UTF-8 is result data type/CCSID 1208 for character functions (char(float_col)...)
All Built In Functions (BIFs) have been extended to support Unicode.
Some BIFs, such as LENGTH, SUBSTR, POSSTR, and LOCATE, are byte oriented for UTF-8 and double-byte character oriented for UTF-16.
Many new functions were added in V7. The CCSID_ENCODING function has been added to help users determine the encoding (ASCII, EBCDIC, or UNICODE) of a particular CCSID.
UTF-16 data is accepted in casting type functions such as DATE or INTEGER.
Functions that return character strings return UTF-8 (CCSID 1208).
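The difference between byte-oriented and double-byte-oriented functions can be modeled in a few lines of Python. The helper names below (length_char, length_graphic, substr_char) are hypothetical and only imitate the counting rules described above; they are not DB2 functions, and DB2's exact handling of a split character may differ.

```python
def length_char(text: str) -> int:
    """Model of LENGTH on a UTF-8 (character) value: counts bytes."""
    return len(text.encode("utf-8"))


def length_graphic(text: str) -> int:
    """Model of LENGTH on a UTF-16 (graphic) value: counts 2-byte units."""
    return len(text.encode("utf-16-be")) // 2


def substr_char(text: str, start: int, length: int) -> bytes:
    """Model of a byte-oriented SUBSTR (1-based start) on UTF-8 data."""
    data = text.encode("utf-8")
    return data[start - 1:start - 1 + length]


sample = "caf\u00e9"  # 4 characters, the last one accented

print(length_char(sample))        # 5
print(length_graphic(sample))     # 4
print(substr_char(sample, 4, 1))  # b'\xc3'
```

The last call returns only the first byte of the two-byte accented character, so a byte-oriented substring on UTF-8 data can cut a character in half if the offsets are computed in characters.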
Routines
UDFs, UDTFs, and SPs will all be enabled to allow Unicode parameters
Parameters will be converted as necessary between char (UTF-8) and graphic (UTF-16)
Date/Time/Timestamp passed as UTF-8 (ISO Format)
User written routines (User Defined Functions, User Defined Table Functions, and Stored Procedures) have been extended to support Unicode.
Parameters will be converted as necessary.
Date, Time, and Timestamp values are passed to the routine as a UTF-8 character string. These values will be in the ISO format as specified in the DB2 SQL Reference.
Utilities
LOAD Utility
UTF-16 <-> UTF-8
SBCS/MIXED -> DBCS
DBCS -> SBCS/MIXED
ASCII/EBCDIC <-> UNICODE
UNLOAD Utility
ASCII/EBCDIC <-> UNICODE
No support for SBCS/MIXED -> DBCS or DBCS -> SBCS/MIXED
The load utility has been extended to support conversion to and from Unicode.
Additionally, the load utility will support conversion between character and graphic as long as a conversion exists.
Character in load dataset -> graphic column
Graphic in load dataset -> character column
The unload utility, new for V7, supports conversion to/from Unicode, but does not support conversion between character and graphic.
Limits
Index Key Size - remains 255
Char limit still 255 bytes
Varying length string limit still 32K bytes
Strings > 32K bytes - use LOBs
We haven't changed any of these limits. These are the same limits as we had for V6.
The limit on index key sizes is something to watch out for.
Unicode data can take from 1-3 times the space needed to store ASCII or EBCDIC data
For character strings longer than 255 bytes, use VARCHAR. Varying length strings are still limited to 32704 bytes. For longer strings, use LOBs.
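Because the limits are in bytes, not characters, it can help to check sizes before you pick a column definition. Here is a rough Python sketch; the function name fits_in_char and the sample strings are made up for illustration, and only the 255-byte figure comes from the slide.

```python
CHAR_LIMIT_BYTES = 255  # fixed-length character limit quoted on the slide


def fits_in_char(text: str, codec: str = "utf-8",
                 limit: int = CHAR_LIMIT_BYTES) -> bool:
    """Hypothetical check: does the encoded text fit within the byte limit?"""
    return len(text.encode(codec)) <= limit


ascii_text = "A" * 200         # 200 characters, 200 bytes in UTF-8
japanese_text = "\u65e5" * 100  # 100 characters, 300 bytes in UTF-8

print(fits_in_char(ascii_text))                   # True
print(fits_in_char(japanese_text))                # False
print(fits_in_char(japanese_text, "utf-16-be"))   # True
```

A hundred Japanese characters overflow the limit in UTF-8 but fit in UTF-16 (200 bytes), so the encoding choice affects whether a value fits as well as how much space it takes.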
Things to look out for
UTF-8 and UTF-16 are compatible just about everywhere, but you will pay a conversion cost. It is best to match the DB2 data definition to the UNICODE model the application is using
If application uses UTF-8, DB2 tables should be UTF-8
If application uses UTF-16, DB2 tables should be UTF-16
Collation
Unicode collation is more like ASCII collation than EBCDIC
Numbers come before letters
Uppercase characters come before lowercase
UTF-8 and UTF-16 collations are not the same when surrogates are involved
UTF-8 and UTF-16 are very compatible, but the cost of conversion, even with hardware support, can be high.
In addition to the conversion costs, there can be other effects, such as predicate indexability, that may be affected by a mismatch in data types.
Matching applications and back end data store will optimize applications and provide the best performance.
Collation of Unicode data will be more like ASCII data.
When surrogates are involved, UTF-8 and UTF-16 do not collate the same.
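Both collation points can be seen by sorting on the raw bytes. This is a Python sketch, not DB2; I am assuming a plain binary sort and using Python's cp037 codec as a stand-in for an EBCDIC CCSID. The sample characters are my own.

```python
def binary_sort(chars, codec):
    """Sort characters by their encoded bytes (a simple binary collation)."""
    return sorted(chars, key=lambda c: c.encode(codec))


# Digits, uppercase, and lowercase: Unicode order versus EBCDIC order.
unicode_order = binary_sort(["a", "A", "1"], "utf-8")
ebcdic_order = binary_sort(["a", "A", "1"], "cp037")
print(unicode_order)  # ['1', 'A', 'a']
print(ebcdic_order)   # ['a', 'A', '1']

# A BMP character near the top of the range versus a supplementary
# character, which needs a surrogate pair in UTF-16.
bmp_char = "\uff21"
supplementary_char = "\U00010000"
utf8_order = binary_sort([supplementary_char, bmp_char], "utf-8")
utf16_order = binary_sort([supplementary_char, bmp_char], "utf-16-be")

print(utf8_order == [bmp_char, supplementary_char])   # True
print(utf16_order == [supplementary_char, bmp_char])  # True
```

In UTF-16 the surrogate pair starts with X'D800', which sorts ahead of X'FF21', while in UTF-8 the four-byte form of the supplementary character sorts after everything in the BMP. That is the mismatch the slide warns about.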
Things to look out for
Storage size does not equal rendered size
Japanese characters take 3 bytes to store 1 character in UTF-8
Latin-1 accented characters take two bytes in UTF-8
UNICODE has things called combining characters that allow something like A-Ring to be represented as A and Combining Character Ring. Combining characters can add to the size needed for both UTF-8 and UTF-16 columns
Å can be represented as
'00C5'x (or 'C385'x for UTF-8)
'0041030A'x (or '41CC8A'x for UTF-8)
Client Connections
Clients need to be compatible
Can use two stage conversions
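The A-Ring case from the storage-size bullets can be reproduced with Python's standard unicodedata module. This is just a sketch to show the byte counts; the variable names are mine.

```python
import unicodedata

precomposed = "\u00c5"  # A-Ring as a single code point
# NFD normalization splits it into 'A' plus the combining ring (U+030A).
decomposed = unicodedata.normalize("NFD", precomposed)

for label, value in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label,
          value.encode("utf-16-be").hex(),  # UTF-16 form
          value.encode("utf-8").hex())      # UTF-8 form
# precomposed 00c5 c385
# decomposed 0041030a 41cc8a
```

Both forms render as one character on screen, but the decomposed form needs 4 bytes in UTF-16 and 3 bytes in UTF-8 instead of 2 and 2, so column sizes based on a count of rendered characters can come up short.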
There are many issues that need to be dealt with in a Unicode environment.
Some of these are storage related and affect things like database definitions
Some of these are application related and affect things like rendering characters for printing or display
Expanding and contracting conversions are very common in Unicode environments. The sizes of these expansions and contractions are not easy to calculate because of things like combining characters.
When a connection is made between a client and server using DRDA, CCSIDs are exchanged. If a conversion is not available to convert between CCSIDs, the connection will fail.
For conversions that would not normally be available, two stage converters can be used.
For example, converters from Chinese CCSIDs to Japanese CCSIDs aren't normally available; however, we could convert from Chinese to Unicode and from Unicode to Japanese.
It is possible to create two stage converters using the OS/390 support for Unicode.
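The two stage idea is simple enough to sketch in Python: decode the source bytes to Unicode, then encode from Unicode to the target. The function name two_stage_convert is hypothetical, and I am using Python's gb2312 and shift_jis codecs as stand-ins for a Chinese and a Japanese CCSID; the OS/390 conversion services work differently in detail.

```python
def two_stage_convert(data: bytes, source_codec: str, target_codec: str) -> bytes:
    """Hypothetical two stage converter: source -> Unicode -> target."""
    as_unicode = data.decode(source_codec)   # stage 1: source to Unicode
    return as_unicode.encode(target_codec)   # stage 2: Unicode to target


text = "\u65e5\u672c"  # two ideographs present in both character sets

chinese_bytes = text.encode("gb2312")
japanese_bytes = two_stage_convert(chinese_bytes, "gb2312", "shift_jis")

print(chinese_bytes != japanese_bytes)             # True
print(japanese_bytes.decode("shift_jis") == text)  # True
```

The byte values change but the characters survive. A character with no counterpart in the target code page would raise UnicodeEncodeError in this sketch, which is the Python analog of a conversion that loses data.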
Things to look out for
UTF-16 and SPUFI or DSNTEP2
SPUFI and DSNTEP2 really aren't UTF-16 aware
In most cases, you should use CHAR(graphic column) when selecting data. For example, use:
SELECT CHAR(g1) FROM T1
Not
SELECT g1 FROM T1
Hex constants are character based.
INSERT INTO T1 (g1) VALUES(x'0041'); -- will result in x'00000041', not x'0041' as you might expect. Because hex constants are character based, DB2 will convert from UTF-8 to UTF-16 for you: x'00' -> x'0000' and x'41' -> x'0041'.
SPUFI and DSNTEP2 are not designed to work with GRAPHIC data on a MIXED = NO subsystem.
Hex constants are character based, not graphic based.
Use hex constants with UTF-16 data with care.
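The hex constant surprise can be reproduced outside DB2. In this Python sketch the function name hex_constant_to_graphic is hypothetical; it treats the constant's bytes as UTF-8 characters and then converts them to UTF-16, which is the behavior described for the INSERT example.

```python
def hex_constant_to_graphic(hex_digits: str) -> str:
    """Model of a character-based hex constant assigned to a UTF-16 column.

    The bytes are read as UTF-8 characters and then converted to
    big-endian UTF-16; the result is returned as a hex string.
    """
    as_characters = bytes.fromhex(hex_digits).decode("utf-8")
    return as_characters.encode("utf-16-be").hex()


print(hex_constant_to_graphic("0041"))  # 00000041
print(hex_constant_to_graphic("41"))    # 0041
```

Each byte of the constant is treated as its own character, so X'0041' becomes two UTF-16 characters. In this model, writing X'41' is what stores the single character 'A'.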
Example Scenarios
Pre-V7
[Diagram: two DB2 V6 subsystems, both Mixed = No, one with EBCDIC CCSID 500 and one with EBCDIC CCSID 37. Clients shown: DRDA - CCSID 819, DRDA - CCSID 850, DRDA - CCSID 290/930/300 (Japanese), 3270 CCSID 37, 3270 CCSID 500]
Even though 37 and 500 are both Latin-1 code pages and are compatible, you should not connect into a CCSID 500 system with a CCSID 37 emulator.
Characters such as [, ], |, and ! are not represented the same on CCSID 37 and CCSID 500, so these characters will be corrupted when returned to another user via DRDA or 3270 data stream.
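You can see the 37 versus 500 mismatch with Python's built-in cp037 and cp500 codecs, which I am assuming are faithful stand-ins for the two CCSIDs. This is a sketch of the problem, not of what DB2 or an emulator does internally.

```python
problem_chars = "[]|!"

# Characters whose code points differ between the two code pages.
differing = [ch for ch in problem_chars
             if ch.encode("cp037") != ch.encode("cp500")]
print(differing)  # ['[', ']', '|', '!']

# Letters and digits share the same code points, which is why the
# mismatch is easy to miss in casual testing.
same_for_letters = "ABC123".encode("cp037") == "ABC123".encode("cp500")
print(same_for_letters)  # True

# Data stored by a CCSID 500 system, displayed by a CCSID 37 emulator.
stored = "[".encode("cp500")
misread = stored.decode("cp037")
print(misread == "[")  # False
```

Most of the data looks fine, and only a handful of special characters come back wrong, which is exactly the "some of the information back correctly" trap.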
V7 and Beyond
[Diagram: two DB2 V7 with Unicode subsystems, both Mixed = No, one with EBCDIC CCSID 500 and one with EBCDIC CCSID 37. Clients shown: DRDA - CCSID 850, DRDA - CCSID 912 (Czech), DRDA - CCSID 290/930/300 (Japanese), 3270 CCSID 37, 3270 CCSID 500]
With V7, connections from clients that were not possible in the past are now possible. This doesn't mean that there aren't challenges:
Conversions need to be defined
Correct behavior may depend on the Application Encoding bind option specification
Summary
Character Conversion Fundamentals
Unicode Fundamentals
Unicode Support in DB2 UDB for OS/390 V7
Things to look out for
Example Scenarios
Appendix - CCSID Information and Documentation
Installation Guide
Appendix A. Character conversion
SQL Reference
Character sets and code pages
Character conversion
Conversion rules for string assignment
Conversion rules for string comparison
Character conversion in unions and concatenations
Selecting the result CCSID
SQL descriptor area (SQLDA)
Administration Guide
Choosing string or numeric data types
There are many sections in the DB2 documentation that deal with character conversion. Some of the more important ones are shown here.
I have listed the section titles for the books. I haven't listed page numbers or section numbers because these may vary depending on the form of book (paper, PDF, or Bookmanager) you use.
You should become familiar with these sections of the documentation if character conversion is occurring on your system.
Appendix - Catalog
Catalog changes for Unicode in DB2 for z/OS & OS/390 V7
Added Columns
SYSPLAN: RELBOUND (indicates release when plan was last bound or rebound), ENCODING_CCSID (bind option value)
SYSPACKAGE: RELBOUND (indicates release when package was last bound or rebound), ENCODING_CCSID (bind option value)
SYSVIEWS: RELCREATED (indicates release when view was created)
SYSTABLES: RELCREATED (indicates release when table was created)
Updated Columns (Updated for ENCODING UNICODE)
SYSDATABASE
SYSTABLESPACE
SYSTABLES
SYSPARMS
SYSDATATYPES
There were many changes to the Catalog for DB2 V7.
The changes related to Unicode support are listed here.
Appendix - zSeries Unicode Support
The UTF-8 <-> UTF-16 instructions are used when DB2 converts from char to graphic or graphic to char. These are used in DB2 V7 if running on G5, G6, zSeries 900, zSeries 800 or OS/390 V2R8 or later.
CUUTF - Convert UTF-16 to UTF-8
CUTFU - Convert UTF-8 to UTF-16
The following two instructions are similar to CLCLE and MVCLE. DB2 will use these instructions to perform comparison and padding on UTF-16 data because you can specify a two byte padding character. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on zSeries 900 or OS/390 V2R8:
CLCLU - Compare logical long UNICODE
MVCLU - Move logical long UNICODE
These instructions pack/unpack ASCII (also UNICODE UTF-8) and UNICODE (UTF-16) data. These instructions are used when DB2 converts a character string to a decimal or internal date/time/timestamp or a decimal value to a character string. DB2 UDB for OS/390 and z/OS V7 will use these instructions if running on zSeries 900 or OS/390 V2R8:
PKU - Pack Unicode
PKA - Pack ASCII
UNPKU - Unpack Unicode
UNPKA - Unpack ASCII
These instructions are all used when DB2 performs conversion. For instance, to convert from ASCII SBCS to UNICODE UTF-16, DB2 will use TROT (one byte to two byte characters). DB2 indirectly uses these instructions via the Conversion System Services (available in OS/390 V2R8 and above) in DB2 UDB for z/OS & OS/390 V7:
TRTT - Translate Two to Two
TRTO - Translate Two to One
TROT - Translate One to Two
TROO - Translate One to One
The z/Architecture zSeries 900 and zSeries 800 processors provide instructions which support Unicode data processing.
The support is detailed here
References
DB2 UDB Server for OS/390 Version 7 and z/OS Presentation Guide - Redbook on DB2 V7, SG24-6121
DB2 Universal Database Administration Guide - SC09-2946, Appendix E - National Language Support
The Unicode Standard Version 3.0 - The Unicode Consortium, Addison-Wesley
www.unicode.org
http://www.ibm.com/developerworks/unicode/
Character Data Representation Architecture: Reference & Registry - SC09-2190
National Language Design Guide Volume 2 - SE09-8002
In appendix E of the DB2 UDB Admin Guide, there is a discussion of NLS issues and how to set/override the codepage at the client using the codepage keyword on NT, and LANG variable on AIX.