UAX #15: Unicode Normalization Forms · 2008-01-29 · Chapter 2, General Structure, and . Chapter...

UAX #15: Unicode Normalization Forms http://www.unicode.org/reports/tr15/tr15-28.html

1 of 36 1/29/2008 10:28 AM

Technical Reports

Proposed Update to

Unicode Standard Annex #15

UNICODE NORMALIZATION FORMSVersion Unicode 5.1.0 (draft 5)Authors Mark Davis ([email protected]), Martin DürstDate 2007-01-10This Version http://www.unicode.org/reports/tr15/tr15-28.htmlPreviousVersion

http://www.unicode.org/reports/tr15/tr15-27.html

Latest Version http://www.unicode.org/reports/tr15/Revision 28

Summary

This annex describes specifications for four normalized forms of Unicode text. With theseforms, equivalent text (canonical or compatibility) will have identical binary representations.When implementations keep strings in a normalized form, they can be assured thatequivalent strings have a unique binary representation.

Status

This is a draft document which may be updated, replaced, or superseded by otherdocuments at any time. Publication does not imply endorsement by the UnicodeConsortium. This is not a stable document; it is inappropriate to cite this document asother than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard,but is published online as a separate document. The Unicode Standard may requireconformance to normative content in a Unicode Standard Annex, if so specified in theConformance chapter of that version of the Unicode Standard. The version numberof a UAX document corresponds to the version of the Unicode Standard of which itforms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback].Related information that is useful in understanding this annex is found in Unicode StandardAnnex #41, “Common References for Unicode Standard Annexes.” For the latest versionof the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports,see [Reports]. For more information about versions of the Unicode Standard, see[Versions].

Contents

[email protected] BoxL2/08-059


2 of 36 1/29/2008 10:28 AM

1 Introduction1.1 Concatenation

2 Notation3 Versioning and Stability

3.1 Stability of Normalized Forms3.2 Stability of the Normalization Process3.3 Guaranteeing Process Stability3.4 Forbidding Characters3.5 Stabilized Strings

4 Conformance5 Specification6 Composition Exclusion Table7 Examples and Charts8 Design Goals9 Implementation Notes10 Decomposition11 Code Sample12 Legacy Encodings13 Programming Language Identifiers14 Detecting Normalization Forms

14.1 Stable Code Points15 Conformance Testing16 Hangul17 Intellectual Property18 Corrigenda19 Canonical Equivalence20 Corrigendum 5 Sequences21 Stream-Safe Text Format

21.1 Buffering with Unicode NormalizationAcknowledgmentsReferencesModifications

1 Introduction

For round-trip compatibility with existing standards, Unicode has encoded many entitiesthat are really variants of the same abstract character. The Unicode Standard defines twoequivalences between characters: canonical equivalence and compatibility equivalence.Canonical equivalence is a fundamental equivalency between characters or sequences ofcharactersthat represent the same abstract character, and when correctly displayed should alwayshave the same visual appearance and behavior. Figure 1 illustrates this equivalence.

Figure 1. Canonical Equivalence

Combining sequence Ç ↔ C ◌̧Ordering of combining marks q+ ̇+ ̣↔q + ̣+ ̇Hangul 가 ↔ᄀ +ᅡ


3 of 36 1/29/2008 10:28 AM

Singleton Ω ↔ ΩCompatibility equivalence is a weaker equivalence between characters or sequences ofcharacters that represent the same abstract character, but may have different visualappearance or behavior. The visual representations of the variant characters are typicallya subset of the possible visual representations of the nominal character, but representvisual distinctions that may be significant in some contexts but not in others, requiringgreater care in the application of this equivalence. If the visual distinction is stylistic, thenmarkup or styling could be used to represent the formatting information. However, somecharacters with compatibility decompositions are used in mathematical notation torepresent distinction of a semantic nature; replacing the use of distinct character codes byformatting may cause problems. Figure 2 illustrates this equivalence.

Figure 2. Compatibility Equivalence

Font variants ℌ ℍBreaking differences -Cursive forms ن ن ن نCircled ①Width, size, rotated ｶカ︷ {Superscripts/subscripts ⁹ ₉Squared characters ㌀Fractions ¼Others ǆ

Both canonical and compatibility equivalences are explained in more detail in Chapter 2,General Structure, and Chapter 3, Conformance, of The Unicode Standard in [Unicode]. Inaddition, the Unicode Standard describes several forms of normalization in Section 5.6,Normalization.These Normalization Forms are designed to produce a unique normalized form for anygiven string. Two of these forms are precisely specified in Section 3.7, Decomposition, in[Unicode]. In particular, the standard defines a canonical decomposition format, which canbe used as a normalization for interchanging text. This format allows for binary comparisonwhile maintaining canonical equivalence with the original unnormalized text.


4 of 36 1/29/2008 10:28 AM

The standard also defines a compatibility decomposition format, which allows for binarycomparison while maintaining compatibility equivalence with the original unnormalized text.The latter can also be useful in many circumstances, because it folds the differencesbetween characters that are inappropriate in those circumstances. For example, thehalfwidth and fullwidth katakana characters will have the same compatibility decompositionand are thus compatibility equivalents; however, they are not canonical equivalents.

Both of these formats are normalizations to decomposed characters. While Section 3.7,Decomposition, in [Unicode] also discusses normalization to composite characters (alsoknown as decomposable or precomposed characters), it does not precisely specify aformat. Because of the nature of the precomposed forms in the Unicode Standard, there ismore than one possible specification for a normalized form with composite characters.

This annex provides a unique specification for normalization, and a label for eachnormalized form as shown in Table 1.

Table 1. Normalization Forms

Title Description SpecificationNormalization Form D(NFD)

Canonical Decomposition Sections 3.7, 3.11, and 3.12[Unicode];also summarized under Section10, Decomposition

Normalization Form C(NFC)

CanonicalDecomposition,followed by CanonicalComposition

Section 5, Specification

NormalizationForm KD (NFKD)

CompatibilityDecomposition

Sections 3.7, 3.11, and 3.12[Unicode];also summarized under Section10, Decomposition

NormalizationForm KC (NFKC)

CompatibilityDecomposition,followed by CanonicalComposition

Section 5, Specification

As with decomposition, there are two forms of normalization that convert to compositecharacters: Normalization Form C and Normalization Form KC. The difference betweenthese depends on whether the resulting text is to be a canonical equivalent to the originalunnormalized text or a compatibility equivalent to the original unnormalized text. (In NFKCand NFKD, a K is used to stand for compatibility to avoid confusion with the C standing forcomposition.) Both types of normalization can be useful in different circumstances.

Figures 3–6illustrate different ways in which source text can be normalized. In the first three figures,the NFKD form is always the same as the NFD form, and the NFKC form is always thesame as the NFC form, so for simplicity those columns are omitted. For consistency, all ofthese examples use Latin characters, although similar examples are found in other scripts.

Figure 3. Singletons


5 of 36 1/29/2008 10:28 AM

Certain characters are known as singletons. They never remain in the text afternormalization. Examples include the angstrom and ohm symbols, which map to theirnormal letter counterparts a-with-ring and omega, respectively.

Figure 4. Canonical Composites

Many characters are known as canonical composites, or precomposed characters. In theD forms, they are decomposed; in the C forms, they are usually precomposed. (Forexceptions, see Section 6, Composition Exclusion Table.)

Normalization provides a unique order for combining marks, with a uniform order for all Dand C forms. Even when there is no precomposed character, as with the “q” with accentsin Figure 5, the ordering may be modified by normalization.

Figure 5. Multiple Combining Marks


6 of 36 1/29/2008 10:28 AM

The example of the letter “d” with accents shows a situation where a precomposedcharacter plus another accent changes in NF(K)C to a different precomposed characterplus a different accent.

Figure 6. Compatibility Composites

In the NFKC and NFKD forms, many formatting distinctions are removed, as shown inFigure 6. The “fi” ligature changes into its components “f” and “i”, the superscript formattingis removed from the “5”, and the long “s” is changed into a normal “s”.

Normalization Form KC does not attempt to map character sequences to compatibilitycomposites. For example, a compatibility composition of “office” does not produce“o\uFB03ce”, even though “\uFB03” is a character that is the compatibility equivalent of thesequence of three characters “ffi”. In other words, the composition phase of NFC andNFKC are the same—only their decomposition phase differs, with NFKC applyingcompatibility decompositions.

All of the definitions in this annex depend on the rules for equivalence and decompositionfound in Chapter 3, Conformance, of [Unicode] and the decomposition mappings in theUnicode Character Database [UCD].

Note:Text exclusively containing only ASCII characters (U+0000..U+007F) is leftunaffected by all of the Normalization Forms. This is particularly important forprogramming languages (see Section 13, Programming Language Identifiers).

Normalization Form C uses canonical composite characters where possible, and maintainsthe distinction between characters that are compatibility equivalents. Typical strings ofcomposite accented Unicode characters are already in Normalization Form C.Implementations of Unicode that restrict themselves to a repertoire containing nocombining marks (such as those that declare themselves to be implementations at level 1as defined in ISO/IEC 10646-1) are already typically using Normalization Form C.(Implementations of later versions of 10646 need to be aware of the versioningissues—see Section 3, Versioning and Stability.)

The W3C Character Model for the World Wide Web [CharMod] uses Normalization Form Cfor XML and related standards (that document is not yet final, but this requirement is not


7 of 36 1/29/2008 10:28 AM

expected to change). See the W3C Requirements for String Identity, Matching, and StringIndexing [CharReq] for more background.

Normalization Form KC additionally folds the differences between compatibility-equivalentcharacters that are inappropriately distinguished in many circumstances. For example, thehalfwidth and fullwidth katakanacharacters will normalize to the same strings, as will Roman numerals and their letterequivalents. More complete examples are provided in Section 7, Examples and Charts.

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because theyerase many formatting distinctions, they will prevent round-trip conversion to and frommany legacy character sets, and unless supplanted by formatting markup, they mayremove distinctions that are important to the semantics of the text. It is best to think ofthese Normalization Forms as being like uppercase or lowercase mappings: useful incertain contexts for identifying core meanings, but also performing modifications to the textthat may not always be appropriate. They can be applied more freely to domains withrestricted character sets, such as in Section 13, Programming Language Identifiers.

To summarize the treatment of compatibility composites that were in the source text:

Both NFD and NFC maintain compatibility composites.Neither NFKD nor NFKC maintains compatibility composites.None of the forms generate compatibility composites that were not in the source text.

For a list of all characters that may change in any of the Normalization Forms (aside fromreordering), see Normalization Charts [Charts].

1.1 Concatenation

In using normalization functions, it is important to realize that none of the NormalizationForms are closed under string concatenation. That is, even if two strings X and Y arenormalized, their string concatenation X+Y is not guaranteed to be normalized. This evenhappens in NFD, because accents are canonically ordered, and may rearrange around thepoint where the strings are joined. Consider the string concatenation examples shown inTable 2.

However, it is possible to produce an optimized function that concatenates two normalizedstrings and doesguarantee that the result is normalized. Internally, it only needs to normalize charactersaround the boundary of where the original strings were joined, within stable code points.For more information, see Section 14.1, Stable Code Points.

Table 2. String Concatenation

Form String1 String2 Concatenation Correct NormalizationNFD a ^ . (dot under) a ^ . a . ^NFC a ^ a ^ âNFC ᄀ ᅡ ᆨ ᄀ ᅡ ᆨ 각

However, all of the Normalization Forms are closed under substringing. For example, ifone takes a substring of a normalized string X, from offsets 5 to 10, one is guaranteed thatthe resulting string is still normalized.


8 of 36 1/29/2008 10:28 AM

2 Notation

All of the definitions in this annex depend on the rules for equivalence and decompositionfound in Chapter 3, Conformance, of [Unicode] and the Character Decomposition Mappingand Canonical Combining Class property in the Unicode Character Database [UCD].Decomposition mustbe done in accordance with these rules. In particular, the decomposition mappings found inthe Unicode Character Database must be applied recursively, and then the string put intocanonical order based on the characters’ combining classes.

Table 3 lists examples of the notational conventions used in this annex.

Table 3. Notation Examples

Example Notation DescriptioncombiningClass(X) The combining class of a character X"...\uXXXX..." The Unicode character U+XXXX embedded within a string"...\UXXXXXXXX..." The Unicode character U+XXXXXXXX embedded within a stringB-C A single character that is equivalent to the sequence of

characters B + Cki, am, and kf Conjoining jamo types (initial, medial, final) represented by

subscripts"c¸" c followed by a nonspacing cedilla: spacing accents (without a

dotted circle) may be used to represent nonspacing accentsNFX(S) Any Normalization Form: NFD(S), NFKD(S), NFC(S), and

NFKC(S) are the possibilitiestoNFX(s) A function that produces the the normalized form of a string s

according to the definition of Normalization Form XisNFC(s) A binary property of a string s: isNFX(s) is true if and only if

toNFX(s) is identical to s; see also Section 14, DetectingNormalization Forms.

X ≈ Y X is canonically equivalent to YX[a, b] The substring of X that includes all code units after offset a

and before offset b; for example, if X is “abc”, then X[1,2] is“b”

Additional conventions used in this annex:

A sequence of characters may be represented by using plus signs between thecharacter names or by using string notation.

1.

An offset into a Unicode string is a number from 0 to n, where n is the length of thestring and indicates a position that is logically between Unicode code units (or at thevery front or end in the case of 0 or n, respectively).

2.

Unicode names may be shortened, as shown in Table 4.3.

Table 4. Character Abbreviation

Abbreviation Full Unicode Name


9 of 36 1/29/2008 10:28 AM

E-grave LATIN CAPITAL LETTER E WITH GRAVEka KATAKANA LETTER KAhw_ka HALFWIDTH KATAKANA LETTER KAten COMBINING KATAKANA-HIRAGANA VOICED SOUND MARKhw_ten HALFWIDTH KATAKANA VOICED SOUND MARK

3 Versioning and Stability

It is crucial that Normalization Forms remain stable over time. That is, if a string that doesnot have any unassigned characters is normalized under one version of Unicode, it mustremain normalized under all future versions of Unicode. This is the backward compatibilityrequirement. To meet this requirement, a fixed version for the composition process isspecified, called the composition version. The composition version is defined to beVersion 3.1.0 of the Unicode Character Database. For more information, see

Versions of the Unicode Standard [Versions]Unicode 3.1 [Unicode3.1]Unicode Character Database [UCD]

To see what difference the composition version makes, suppose that a future version ofUnicode were to add the composite Q-caron. For an implementation that uses that futureversion of Unicode, strings in Normalization Form C or KC would continue to contain thesequence Q + caron, and not the new character Q-caron, because a canonicalcomposition for Q-caron was not defined in the composition version. See Section 6,Composition Exclusion Table, for more information.

It would be possible to add more compositions in a future version of Unicode, as long asthe backward compatibility requirement is met. It requires that for any new composition XY→ Z, at most one of X or Y was defined in a previous version of Unicode. That is, Z mustbe a new character, and either X or Y must be a new character. However, the UnicodeConsortium strongly discourages new compositions, even in such restricted cases.

In addition to fixing the composition version, future versions of Unicode must be restrictedin terms of the kinds of changes that can be made to character properties. Because of this,the Unicode Consortium has a clear policy to guarantee the stability of NormalizationForms.

3.1 Stability of Normalized Forms

A normalized string is guaranteed to be stable; that is, once normalized, a string isnormalized according to all future versions of Unicode.

More precisely, if a string has been normalized according to a particular version of Unicodeandcontains only characters allocated in that version, it will qualify as normalized according toany future version of Unicode.

3.2 Stability of the Normalization Process

The process


10 of 36 1/29/2008 10:28 AM

of producing a normalized string from an unnormalized string has the same results undereach version of Unicode, except for certain edge cases addressed in the followingcorrigenda:

Three corrigenda correct certain data mappings for a total of seven characters:

Corrigendum #2, “U+FB1D Normalization” [Corrigendum2]Corrigendum #3, “U+F951 Normalization” [Corrigendum3]Corrigendum #4, “Five Unihan Canonical Mapping Errors” [Corrigendum4]

Corrigendum #5, “Normalization Idempotency” [Corrigendum5], fixed a problem inthe description of the normalization process for some instances of particularsequences. Such instances never occur in meaningful text.

3.3 Guaranteeing Process Stability

Unicode provides a mechanism for those implementations that require not only normalizedstrings, but also the normalization process, to be absolutely stable between two versions(including the edge cases mentioned in Section 3.2, Stability of the NormalizationProcess). This, of course, is true only where the repertoire of characters is limited to thosecharacter present in the earlier version of Unicode.

To have the newer implementation produce the same results as the older version (forcharacters defined as of the older version):

Premap a maximum of seven (rare) characters according to whatever corrigendacame between the two versions (see [Errata]).

For example, for a Unicode 4.0 implementation to produce the same results asUnicode 3.2, the five characters mentioned in [Corrigendum4] are premappedto the old values given in version 4.0 of the UCD data file [Corrections].

1.

If the earlier version is before Unicode 4.1 and the later version is 4.1 or later, reorderthe sequences listed in Table 11 of Section 20, Corrigendum 5 Sequences, asfollows:

From: first_character intervening_character(s) last_characterTo: first_character last_character intervening_character(s)

2.

Apply the newer version of normalization.3.

Note:For step 2, in most implementations it is actually more efficient (and much simpler) toparameterize the code to provide for both pre- and post-Unicode 4.1 behavior. Thistypically takes only one additional conditional statement.

3.4 Forbidding Characters

An alternative approach for certain protocols is to forbid characters that differ innormalization status across versions. The characters and sequences affected are not inany practical use, so this may be viable for some implementations. For example, whenupgrading from Unicode 3.2 to Unicode 5.0, there are three relevant corrigenda:

Corrigendum #3, “U+F951 Normalization” [Corrigendum3]


11 of 36 1/29/2008 10:28 AM

Corrigendum #4, “Five Unihan Canonical Mapping Errors” [Corrigendum4] The five characters are U+2F868, U+2F874, U+2F91F, U+2F95F, and U+2F9BF.Corrigendum #5, “Normalization Idempotency” [Corrigendum5]

The characters in Corrigenda #3 and #4 are all extremely rare Han characters. Theyare compatibility characters included only for compatibility with a single East Asiancharacter set standard each: U+F951 for a duplicate character in KS X 1001, and theother five for CNS 11643-1992. That’s why they have canonical decompositionmappings in the first place.

The duplicate character in KS X 1001 is a rare character in Korean to begin with—ina South Korean standard, where the use of Han characters at all is uncommon inactual data. And this is a pronunciation duplicate, which even if it were used wouldvery likely be inconsistently and incorrectly used by end users, because there is novisual way for them to make the correct distinctions.

The five characters from CNS 11643-1992 have even less utility. They are minorglyphic variants of unified characters—the kinds of distinctions that are subsumedalready within all the unified Han ideographs in the Unicode Standard. They are fromPlanes 4–15 of CNS 11643-1992, which never saw any commercial implementationin Taiwan. The IT systems in Taiwan almost all implemented Big Five instead, whichwas a slight variant on Planes 1 and 2 of CNS 11643-1986, and which included noneof the five glyph variants in question here.

As for Corrigendum #5, it is important to recognize that none of the affectedsequences occur in any well-formed text in any language. See Section20, Corrigendum 5 Sequences.

For more information, see Section 18, Corrigenda.

3.5 Stabilized Strings

In certain protocols, there is a requirement for a normalization process for stabilizedstrings. The key concept is that for a given normalization form, once a Unicode string hasbeen successfully normalized according to the process, it will never change ifsubsequently normalized again, in any version of Unicode, past or future. To meet thisneed, the Normalization Process for Stabilized Strings (NPSS) is defined. NPSS adds toregular normalization the requirement that an implementation must abort with an error if itencounters any characters that are not in the current version of Unicode.

Examples:

Sample Characters Required Behavior for Unicode Version

3.2 4.0 4.1 5.0

U+0234 (ȴ) LATIN SMALL LETTER L WITH CURL(added in Unicode 4.0)

Abort Accept Accept Accept

U+0237 (ȷ) LATIN SMALL LETTER DOTLESS J(added in Unicode 4.1)

Abort Abort Accept Accept

U+0242 (ƀ) LATIN SMALL LETTER GLOTTAL STOP(added in Unicode 5.0)

Abort Abort Abort Accept


12 of 36 1/29/2008 10:28 AM

Once a string has been normalized by the NPSS for a particular normalization form, it willnever change if renormalized for that same normalization form by an implementation thatsupports any version of Unicode, past or future. For example, if an implementationnormalizes a string to NFC, following the constraints of NPSS (aborting with an error if itencounters any unassigned code point for the version of Unicode it supports), the resultingnormalized string would be stable: it would remain completely unchanged if renormalizedto NFC by any conformant Unicode normalization implementation supporting a prior or afuture version of the standard.

The one caveat is that this applies to implementations that apply Corrigenda #2 through#5: see Section 3.2 Stability of the Normalization Process. A protocol that requires stabilityeven across implementations that do not apply these corrigenda is a restricted protocol.Such a protocol must use a restricted NPSS, a process that also aborts with an error ifencounters any problematic characters or sequences, as discussed in Section 3.4Forbidding Characters.

Note that NPSS defines a process, not another normalization form. The resulting string issimply in a particular normalization form. If a different implementation applies the NPSSagain to that string, then depending on the version of Unicode supported by the otherimplementation, either the same string will result, or an error will occur. Given a string thatis purported to have been produced by the NPSS for a given normalization form, what animplementation can determine is one of the following three conditions:

definitely produced by NPSS (it is normalized, and contains no unassignedcharacters)

1.

definitely not produced by NPSS (it is not normalized)2.may or may not have been produced by NPSS (it contains unassigned charactersbut is otherwise normalized)

3.

The additional data required for the stable normalization process can be easilyimplemented with a compact lookup table. Most libraries supplying normalization functionsalso supply the required property tests, and in those normalization functions it isstraightforward for them to provide an additional parameter which invokes the stabilizedprocess.

Editorial Notes:

This definition depends on an anticipated further tightening of the UnicodeStability Policies [Policies] such that normalization of assigned characters willnot change in future versions of Unicode.That tightening will also cause a few other changes to the rest of Section3 Versioning and Stability, and changes to Section 18 Corrigenda.The data in Section 20 Corrigendum 5 Sequences should be made available ina machine-readable format.

4 Conformance

UAX15-C1. A process that produces Unicode text that purports to be in a NormalizationForm shall do so in accordance with the specifications in this annex.

UAX15-C2. A process that tests Unicode text to determine whether it is in a Normalization


13 of 36 1/29/2008 10:28 AM

Form shall do so in accordance with the specifications in this annex.

UAX15-C3. A process that purports to transform text into a Normalization Form must beable to pass the conformance test described in Section 15, Conformance Testing.

UAX15-C4. A process that purports to transform text according to the Stream-Safe TextFormat must do so in accordance with the specifications in this annex.

UAX15-C5. A process that purports to transform text according to the NormalizationProcess for Stabilized Strings must do so in accordance with the specifications in thisdocument.

The specifications for Normalization Forms are written in terms of a process for producinga decomposition or composition from an arbitrary Unicode string. This is a logicaldescription—particular implementations can have more efficient mechanisms as long asthey produce the same result. See C18 in Chapter 3, Conformance, of [Unicode] and thenotes following. Similarly, testing for a particular Normalization Form does not requireapplying the process of normalization, so long as the result of the test is equivalent toapplying normalization and then testing for binary identity.

5 Specification

This section specifies the format for Normalization Forms C and KC. It uses four definitionsD1, D2, D3, D4, and two rules R1 and R2. In these definitions and rules, and inexplanatory text, the term “character” is used. It should be interpreted as meaning “codepoint,” because the algorithm applies to any sequence of code points, including thosecontaining code points that are not assigned characters.

All combining character sequences start with a character of combining class zero. Forsimplicity, the following term is defined for such characters:

D1. A character S is a starterif it has a combining class of zero in the Unicode Character Database. Any other characteris a non-starter.

Because of the definition of canonical equivalence, the order of combining characters withthe same combining class makes a difference. For example, a-macron-breve is not thesame as a-breve-macron. Characters cannot be composed if that would change thecanonical order of the combining characters.

D2. In any character sequence beginning with a starter S, a character C is blocked from Sif and only if there is some character B between S and C, and either B is a starter or it hasthe same or higher combining class as C.

This definition is to be applied only to strings that are already canonically orcompatibility decomposed.

When B blocks C, changing the order of B and C would result in a character sequence thatis not canonically equivalent to the original. See Section 3.11, Canonical OrderingBehavior [Unicode].

If a combining character sequence is in canonical order, then testing whether a characteris blocked requires looking at only the immediately preceding character.

The process of forming a composition in Normalization Form C or KC involves two steps:


14 of 36 1/29/2008 10:28 AM

Decomposing the string according to the canonical (or compatibility, respectively)mappings of the Unicode Character Database that correspond to the latest version ofUnicode supported by the implementation.

1.

Composing the resulting string according to the canonical mappings of thecomposition version of the Unicode Character Database by successively composingeach unblocked character with the last starter.

2.

Figure 7 shows a sample of the how the composition process works. The gray cubesrepresent starters, and the white cubes represent non-starters. In the first step, the stringis fully decomposed and reordered. This is represented by the downwards arrows. In thesecond step, each character is checked against the last non-starter and starter, andcombined if all the conditions are met. This is represented by the curved arrows pointing tothe starters. Note that in each case, all of the successive white boxes (non-starters) areexamined plus one additional gray box (starter). Examples are provided in Section 7,Examples and Charts, and a code sample is provided in Section 11, Code Sample.

Figure 7. Composition Process

A precise notion is required for when an unblocked character can be composed with astarter. This uses the following two definitions.

D3. A primary compositeis a character that has a canonical decomposition mapping in the Unicode CharacterDatabase (or has a canonical Hangul decomposition) but is not in the Section 6,Composition Exclusion Table.

Note: Hangul syllable decomposition is considered a canonical decomposition. See[Unicode] and Section 16, Hangul.

D4. A character X can be primary combined with a character Y if and only if there is aprimary composite Z that is canonically equivalent to the sequence .

Based upon these definitions, the following rules specify the Normalization Forms C andKC.

R1. Normalization Form C

The Normalization Form C for a string S is obtained by applying the following process, orany other process that leads to the same result:

Generate the canonicaldecomposition for the source string S according to the decomposition mappings inthe latest supported version of the Unicode Character Database.

1.

Iterate through each character C in that decomposition, from first to last. If C is not2.


15 of 36 1/29/2008 10:28 AM

blocked from the last starter L and it can be primary combined with L, then replace Lby the composite L-C and remove C.

The result of this process is a new string S', which is in Normalization Form C.

R2. Normalization Form KC

The Normalization Form KC for a string S is obtained by applying the following process, orany other process that leads to the same result:

Generate the compatibilitydecomposition for the source string S according to the decomposition mappings inthe latest supported version of the Unicode Character Database.

1.

Iterate through each character C in that decomposition, from first to last. If C is notblocked from the last starter L and it can be primary combined with L, then replace Lby the composite L-C and remove C.

2.

The result of this process is a new string S', which is in Normalization Form KC.

R3. Normalization Process for Stabilized Strings

The Normalization Process for Stabilized Strings for a given normalization form (NFD,NFC, NFKD, or NFKC) is the same as the corresponding process for generating that form,except that:

The process must be aborted with an error if the string contains any code point withthe property value General_Category=Unassigned, according to the version ofUnicode used for the normalization process.

6 Composition Exclusion Table

There are four classes of characters that are excluded from composition:

Script-specifics:precomposed characters that are generally not the preferred form for particularscripts.

These cannot be computed from information in the Unicode CharacterDatabase.An example is U+0958 (क़) DEVANAGARI LETTER QA.

1.

Post composition version: precomposed characters that are added after Unicode3.0 [Unicode3.0] and whose decompositions exist in prior versions of Unicode. Thisset will be updated with each subsequent version of Unicode. For more information,see Section 3, Versioning and Stability.

These cannot be computed from information in the Unicode CharacterDatabase.An example is U+2ADC (⫝̸) FORKING.

2.

Singletons:characters having decompositions that consist of single characters (as describedbelow).

These are computed from information in the Unicode Character Database.An example is U+2126 (Ω) OHM SIGN.

3.

Non-starter decompositions: precomposed characters whose decompositions start4.


16 of 36 1/29/2008 10:28 AM

with a non-starter.These are computed from information in the Unicode Character Database.An example is U+0344 ( ̈́) COMBINING GREEK DIALYTIKA TONOS.

Two characters may have the same canonical decomposition in the Unicode CharacterDatabase. Table 5 shows an example.

Table 5. Same Canonical Decomposition

Source Same Decomposition212B (Å) ANGSTROM SIGN 0041 (A) LATIN CAPITAL LETTER A + 030A (°) COMBINING RING

ABOVE00C5 (Å) LATIN CAPITAL LETTER A WITH RINGABOVE

The Unicode Character Database will first decompose one of the characters to the other,and then decompose from there. That is, one of the characters (in this case, U+212BANGSTROM SIGN) will have a singleton decomposition. Characters with singletondecompositions are included in Unicode for compatibility with certain preexistingstandards. These singleton decompositions are excluded from primary composition.

When a character with a canonical decomposition is added to Unicode, it must be addedto the composition exclusion table if there is at least one character in its decompositionthat existed in a previous version of Unicode. If there are no such characters, then it ispossible for it to be added or omitted from the composition exclusion table. The choice ofwhether to do so or not rests upon whether it is generally used in the precomposed form ornot.

Data File

The Composition Exclusion Table is available as machine-readable data file [Exclusions].

All four classes of characters are included in this file, although the singletons andnon-starter decompositions are commented out, as they can be computed from thedecomposition mappings in the Unicode Character Database.

A derived property containing the complete list of exclusions, Comp_Ex, is availableseparately in the Unicode Charactger Database [UCD] and is described in the UCDdocumentation [UCDDoc]. Implementations can avoid computing the singleton andnon-starter decompositions from the Unicode Character Database by using the Comp_Exproperty instead.

7 Examples and Charts

This section provides some detailed examples of the results when each of theNormalization Forms is applied. The Normalization Charts [Charts] provide charts of all thecharacters in Unicode that differ from at least one of their Normalization Forms (NFC,NFD, NFKC, NFKD).

Basic Examples

The basic examples in Table 6 do not involve compatibility decompositions. Therefore, ineach case Normalization Forms NFD and NFKD are identical, and Normalization FormsNFC and NFKC are also identical.


17 of 36 1/29/2008 10:28 AM

Table 6. Basic Examples

Original NFD, NFKD NFC, NFKC Notesa D-dot_above D + dot_above D-dot_above Both decomposed and

precomposed canonicalsequences produce the sameresult.

b D + dot_above D + dot_above D-dot_above

c D-dot_below+ dot_above

D + dot_below+ dot_above

D-dot_below+ dot_above

The dot_above cannot becombined with the D becausethe D has already combinedwith the interveningdot_below.

d D-dot_above+ dot_below



e D + dot_above+ dot_below



f D + dot_above+ horn+ dot_below

D + horn+ dot_below+ dot_above

D-dot_below+ horn+ dot_above

There may be interveningcombining marks, so long asthe result of the combinationis canonically equivalent.

g E-macron-grave E + macron+ grave

E-macron-grave Multiple combiningcharacters are combined withthe base character.h E-macron

+ graveE + macron+ grave

E-macron-grave

i E-grave+ macron

E + grave+ macron

E-grave+ macron

Characters will not becombined if they would notbe canonical equivalentsbecause of their ordering.

j angstrom_sign A + ring A-ring Because Å (A-ring) is thepreferred composite, it is theform produced for bothcharacters.

k A-ring A + ring A-ring

Effect of Compatibility Decompositions

The examples in Table 7 and Table 8 illustrate the effect of compatibility decompositions.When text is normalized in forms NFD and NFC, as in Table 7, compatibility-equivalentstrings do not result in the same strings. However, when the same strings are normalizedin forms NFKD and NFKC, as shown in Table 8, they do result in the same strings. Thetables also contain an entry showing that Hangul syllables are maintained under allNormalization Forms.

Table 7. NFD and NFC Applied to Compatibility-Equivalent Strings

Original NFD NFC Notesl "Äffin" "A\u0308ffin" "Äffin" The ffi_ligature (U+FB03)


18 of 36 1/29/2008 10:28 AM

is not decomposed,because it has acompatibility mapping, nota canonical mapping. (SeeTable 8.)m "Ä\uFB03n" "A\u0308\uFB03n" "Ä\uFB03n"

n "Henry IV" "Henry IV" "Henry IV" Similarly, the ROMANNUMERAL IV (U+2163) isnot decomposed.o "Henry \u2163" "Henry \u2163" "Henry \u2163"

p ga ka + ten ga Different compatibilityequivalents of a singleJapanese character will notresult in the same string inNFC.

q ka + ten ka + ten gar hw_ka

+ hw_tenhw_ka + hw_ten hw_ka

+ hw_tens ka + hw_ten ka + hw_ten ka + hw_tent hw_ka + ten hw_ka + ten hw_ka + tenu kaks ki + am + ksf kaks Hangul syllables are

maintained undernormalization.

Table 8. NFKD and NFKC Applied to Compatibility-Equivalent Strings

Original NFKD NFKC Notesl' "Äffin" "A\u0308ffin" "Äffin" The ffi_ligature (U+FB03) is

decomposed in NFKC (whereit is not in NFC).m' "Ä\uFB03n" "A\u0308ffin" "Äffin"

n' "Henry IV" "Henry IV" "Henry IV" Similarly, the resultingstrings here are identical inNFKC.o' "Henry \u2163" "Henry IV" "Henry IV"

p' ga ka + ten ga Different compatibilityequivalents of a singleJapanese character will resultin the same string in NFKC.

q' ka + ten ka + ten gar' hw_ka

+ hw_tenka + ten ga

s' ka + hw_ten ka + ten gat' hw_ka + ten ka + ten gau' kaks ki + am + ksf kaks Hangul syllables are

maintained undernormalization.*

*In earlier versions of Unicode, jamo characters like ksf had compatibility mappings to kf +sf. These mappings were removed in Unicode 2.1.9 to ensure that Hangul syllables aremaintained.

8 Design Goals

The following are the design goals for the specification of the Normalization Forms and arepresented here for reference. The first goal is a fundamental conformance feature of thedesign.


19 of 36 1/29/2008 10:28 AM

Goal 1: Uniqueness

The first, and by far the most important, design goal for the Normalization Forms isuniqueness. Two equivalent strings will have precisely the same normalized form. Moreexplicitly,

If two strings x and y are canonical equivalents, thentoNFC(x) = toNFC(y)toNFD(x) = toNFD(y)

1.

If two strings are compatibility equivalents, thentoNFKC(x) = toNFKC(y)toNFKD(x) = toNFKD(y)

2.

All of the transformations are idempotent: that is,toNFC(toNFC(x)) = toNFC(x)toNFD(toNFD(x)) = toNFD(x)toNFKC(toNFKC(x)) = toNFKC(x)toNFKD(toNFKD(x)) = toNFKD(x)

3.

Goal 1.3 is a consequence of Goals 1.2 and 1.1, but is stated here for clarity.

Goal 2: Stability

The second major design goal for the Normalization Forms is stability of characters thatare not involved in the composition or decomposition process.

If X contains a character with a compatibility decomposition, then toNFD(X) andtoNFC(X) still contain that character.

1.

As much as possible, if there are no combining characters in X, then toNFC(X) = X.

The only characters for which this is not true are those in the Section 6,Composition Exclusion Table.

2.

Irrelevant combining marks should not affect the results of composition. See examplef in Section 7, Examples and Charts, where the horn character does not affect theresults of composition.

3.

Goal 3: Efficiency

The third major design goal for the Normalization Forms is to allow efficientimplementations.

It is possible to implement efficient code for producing the Normalization Forms. Inparticular, it should be possible to produce Normalization Form C very quickly fromstrings that are already in Normalization Form C or are in Normalization Form D.

1.

Normalization Forms that compose do not have to produce the shortest possibleresults, because that can be computationally expensive.

2.

9 Implementation Notes

There are a number of optimizations that can be made in programs that produceNormalization Form C. Rather than first decomposing the text fully, a quick check can be


20 of 36 1/29/2008 10:28 AM

made on each character. If it is already in the proper precomposed form, then no work hasto be done. Only if the current character is combining or in Section 6, CompositionExclusion Table, does a slower code path need to be invoked. (This code path will need tolook at previous characters, back to the last starter. See Section 14, DetectingNormalization Forms, for more information.)

The majority of the cycles spent in doing composition are spent looking up the appropriatedata. The data lookup for Normalization Form C can be very efficiently implemented,because it has to look up only pairs of characters, not arbitrary strings. First a multistagetable (also known as trie; see [Unicode] Chapter 5, Implementation Guidelines) is used tomap a character c to a small integer i in a contiguous range from 0 to n. The code for doingthis looks like:

i = data[index[c >> BLOCKSHIFT] + (c & BLOCKMASK)];

Then a pair of these small integers are simply mapped through a two-dimensional array toget a resulting value. This yields much better performance than a general-purpose stringlookup in a hash table.

Because the Hangul compositions and decompositions are algorithmic, memory storagecan be significantly reduced if the corresponding operations are done in code. See Section16, Hangul, for more information.

Note:Any such optimizations must be carefully checked to ensure that they still produceconformant results. In particular, the code must still be able to pass the testdescribed in Section 15, Conformance Testing.

For more information on useful implementation techniques, see Section 14, DetectingNormalization Forms, and [UTN5].

10 Decomposition

For those reading this annex online, the following summarizes the canonicaldecomposition process. For a complete discussion, see Sections 3.7, Decomposition , and3.11, Canonical Ordering Behavior [Unicode].

Canonical decompositionis the process of taking a string, recursively replacing composite characters using theUnicode canonical decomposition mappings (including the algorithmic Hangul canonicaldecomposition mappings; see Section 16, Hangul), and putting the result in canonicalorder.

Compatibility decompositionis the process of taking a string, recursively replacing composite characters using both theUnicode canonical decomposition mappings (including the algorithmic Hangul canonicaldecomposition mappings) andthe Unicode compatibility decomposition mappings, and putting the result in canonicalorder.

A string is put into canonical order by repeatedly replacing any exchangeable pair by thepair in reversed order. When there are no remaining exchangeable pairs, then the string isin canonical order. Note that the replacements can be done in any order.


21 of 36 1/29/2008 10:28 AM

A sequence of two adjacent characters in a string is an exchangeable pair if the combiningclass (from the Unicode Character Database) for the first character is greater than thecombining class for the second, and the second is not a starter; that is, ifcombiningClass(first) > combiningClass(second) > 0. See Table 9.

Table 9. Examples of Exchangeable Pairs

Sequence CombiningClasses

Status

230, 202 exchangeable, because 230 > 202 0, 230 not exchangeable, because 0


22 of 36 1/29/2008 10:28 AM

12 Legacy Encodings

While the Normalization Forms are specified for Unicode text, they can also be extended tonon-Unicode (legacy) character encodings. This is based on mapping the legacy characterset strings to and from Unicode using definitions D5 and D6.

D5. An invertible transcodingT for a legacy character set L is a one-to-one mapping from characters encoded in L tocharacters in Unicode with an associated mapping T-1 such that for any string S in L,T-1(T(S)) = S.

Most legacy character sets have a single invertible transcoding in common use. In a fewcases there may be multiple invertible transcodings. For example, Shift-JIS may have twodifferent mappings used in different circumstances: one to preserve the '\' semantics of5C16, and one to preserve the '¥' semantics.

The character indexes in the legacy character set string may be different from characterindexes in the Unicode equivalent. For example, if a legacy string uses visual encoding forHebrew, then its first character might be the last character in the Unicode string.

If transcoders are implemented for legacy character sets, it is recommended that the resultbe in Normalization Form C where possible. See Unicode Technical Report #22,“Character Mapping Tables,” for more information.

D6. Given a string S encoded in L and an invertible transcoding T for L, the NormalizationForm X of S under Tis defined to be the result of mapping to Unicode, normalizing to Unicode NormalizationForm X, and mapping back to the legacy character encoding—forexample, T-1(NFX(T(S))). Where there is a single invertible transcoding for that characterset in common use, one can simply speak of the Normalization Form X of S.

Legacy character sets are classified into three categories based on their normalizationbehavior with accepted transcoders.

Prenormalized. Any string in the character set is already in Normalization Form X.For example, ISO 8859-1 is prenormalized in NFC.

1.

Normalizable. Although the set is not prenormalized, any string in the set can benormalized to Normalization Form X.

For example, ISO 2022 (with a mixture of ISO 5426 and ISO 8859-1) isnormalizable.

2.

Unnormalizable.Some strings in the character set cannot be normalized into Normalization Form X.

For example, ISO 5426 is unnormalizable in NFC under common transcoders,because it contains combining marks but not composites.

3.

13 Programming Language Identifiers

This section has been moved to Unicode Standard Annex #31, “Identifier and PatternSyntax” [UAX 31].

14 Detecting Normalization Forms


23 of 36 1/29/2008 10:28 AM

The Unicode Character Database supplies properties that allow implementations to quicklydetermine whether a string x is in a particular Normalization Form—for example, isNFC(x).This is, in general, many times faster than normalizing and then comparing.

For each Normalization Form, the properties provide three possible values for eachUnicode code point, as shown in Table 10.

Table 10. Description of Quick_Check Values

Valuee DescriptionNO The code point cannot occur in that Normalization Form.YES The code point is a starter and can occur in the Normalization Form. In

addition, for NFKC and NFC, the character may compose with a followingcharacter, but it never composes with a previous character.

MAYBE The code point can occur, subject to canonical ordering, but withconstraints. In particular, the text may not be in the specifiedNormalization Form depending on the context in which the characteroccurs.

Code that uses this property can do a very fast first pass over a string to determine theNormalization Form. The result is also either NO, YES, or MAYBE. For NO or YES, theanswer is definite. In the MAYBE case, a more thorough check must be made, typically byputting a copy of the string into the Normalization Form and checking for equality with theoriginal.

Even the slow case can be optimized, with a function that does not perform acomplete normalization of the entire string, but instead works incrementally, onlynormalizing a limited area around the MAYBE character. See Section 14.1, StableCode Points.

This check is much faster than simply running the normalization algorithm, because itavoids any memory allocation and copying. The vast majority of strings will return adefinitive YES or NO answer, leaving only a small percentage that require more work. Thesample below is written in Java, although for accessibility it avoids the use ofobject-oriented techniques.

public int quickCheck(String source) { short lastCanonicalClass = 0; int result = YES; for (int i = 0; i < source.length(); ++i) { char ch = source.charAt(i); short canonicalClass = getCanonicalClass(ch); if (lastCanonicalClass > canonicalClass && canonicalClass != 0) { return NO; } int check = isAllowed(ch); if (check == NO) return NO; if (check == MAYBE) result = MAYBE; lastCanonicalClass = canonicalClass; } return result;}

public static final int NO = 0, YES = 1, MAYBE = -1;

The isAllowed() call should access the data from Derived Normalization Properties file[NormProps] for the Normalization Form in question. (For more information, see the UCD


24 of 36 1/29/2008 10:28 AM

documentation [UCDDoc].) For example, here is a segment of the data for NFC:

...

0338 ; NFC_MAYBE # Mn COMBINING LONG SOLIDUS OVERLAY

...

F900..FA0D ; NFC_NO # Lo [270] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDE

...

These lines assign the value NFC_MAYBE to the code point U+0338, and the valueNFC_NO to the code points in the range U+F900..U+FA0D. There are no MAYBE valuesfor NFD and NFKD: the quickCheckfunction will always produce a definite result for these Normalization Forms. All charactersthat are not specifically mentioned in the file have the values YES.

The data for the implementation of the isAllowed() call can be accessed in memory with ahash table or a trie (see Section 9, Implementation Notes); the latter will be the fastest.

There is also a Unicode Consortium stability policy that canonical mappings are alwayslimited in all versions of Unicode, so that no string when decomposed with NFD expands tomore than 3× in length (measured in code units). This is true whether the text is in UTF-8,UTF-16, or UTF-32. This guarantee also allows for certain optimizations in processing,especially in determining buffer sizes. See also Section 21, Stream-Safe Text Format.

14.1 Stable Code Points

It is sometimes useful to distinguish the set of code points that are stable under a particularNormalization Form. They are the set of code points never affected by that particularnormalization process. This property is very useful for skipping over text that does notneed to be considered at all, either when normalizing or when testing normalization.

Formally, each stable code point CP fulfills all of the following conditions:

CP has canonical combining class 0.1.CP is (as a single character) not changed by this Normalization Form.2.

In case of NFC or NFKC, each stable code point CP fulfills all of the following additionalconditions:

CP can never compose with a previous character.3.CP can never compose with a following character.4.CP can never change if another character is added.5.

Example. In NFC, a-breve satisfies all but (5), but if one adds an ogonek it changes toa-ogonek plus breve. So a-breve is not stable in NFC. However, a-ogonek is stable inNFC, because it does satisfy (1–5).

Concatenation of normalized strings to produce a normalized result can be optimized usingstable code points. An implementation can find the last stable code point L in the firststring, and the first stable code point F in the second string. The implementation has tonormalize only the range from (and including) L to the last code point before F. The resultwill then be normalized. This can be a very significant savings in performance whenconcatenating large strings.


25 of 36 1/29/2008 10:28 AM

Because characters with the Quick_Check=YES property value satisfy conditions 1–3, theoptimization can also be performed using the Quick_Check property. In this case, theimplementation finds the last code point L with Quick_Check=YES in the first string and thefirst code point F with Quick_Check=YES in the second string. It then normalizes the rangeof code points starting from (and including) L to the code point just before F.

15 Conformance Testing

Implementations must be thoroughly tested for conformance to the normalizationspecification. The Normalization Conformance Test [Test15] file is available for testingconformance. This file consists of a series of fields. When Normalization Forms are appliedto the different fields, the results shall be as specified in the header of that file.

16 Hangul

Because the Hangul compositions and decompositions are algorithmic, memory storagecan be significantly reduced if the corresponding operations are done in code rather thanby simply storing the data in the general-purpose tables. Here is sample code illustratingalgorithmic Hangul canonical decomposition and composition done according to thespecification in Section 3.12, Combining Jamo Behavior [Unicode]. Although coded inJava, the same structure can be used in other programming languages.

The canonical Hangul decompositions specified here and in Section 3.12, CombiningJamo Behavior, in [Unicode] directly decompose precomposed Hangul syllable charactersinto two or three Hangul Jamo characters. This differs from all other canonicaldecompositions in two ways. First, they are arithmetically specified. Second, they directlymap to more than two characters. The canonical decomposition mapping for all othercharacters maps each character to one or two others. A character may have a canonicaldecompositionto more than two characters, but it is expressed as the recursive application of mappings toat most a pair of characters at a time.

Hangul decomposition could also be expressed this way. All LVT syllables decompose intoan LV syllable plus a T jamo. The LV syllables themselves decompose into an L jamo plusa T jamo. Thus the Hangul canonical decompositions are fundamentally the same as theother canonical decompositions in terms of the way they decompose. This analysis canalso be used to produce more compact code than what is given below.

Common Constants

static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7, LCount = 19, VCount = 21, TCount = 28, NCount = VCount * TCount, // 588 SCount = LCount * NCount; // 11172

Hangul Decomposition

public static String decomposeHangul(char s) { int SIndex = s - SBase; if (SIndex < 0 || SIndex >= SCount) { return String.valueOf(s); } StringBuffer result = new StringBuffer(); int L = LBase + SIndex / NCount; int V = VBase + (SIndex % NCount) / TCount; int T = TBase + SIndex % TCount;


26 of 36 1/29/2008 10:28 AM

result.append((char)L); result.append((char)V); if (T != TBase) result.append((char)T); return result.toString(); }

Hangul Composition

Notice an important feature of Hangul composition: whenever the source string is not inNormalization Form D or Normalization Form KD, one cannot just detect charactersequences of the form and . It is also necessary to catch the sequences ofthe form . To guarantee uniqueness, these sequences must also be composed.This is illustrated in step 2.

public static String composeHangul(String source) { int len = source.length(); if (len == 0) return ""; StringBuffer result = new StringBuffer(); char last = source.charAt(0); // copy first char result.append(last);

for (int i = 1; i < len; ++i) { char ch = source.charAt(i);

// 1. check to see if two current characters are L and V

int LIndex = last - LBase; if (0


27 of 36 1/29/2008 10:28 AM

performed. For example, the trailing consonants kf + sf may be combined into ksf. Inaddition, some Hangul input methods do not require a distinction on input between initialand final consonants, and change between them on the basis of context. For example, inthe keyboard sequence mi + em + ni + si + am, the consonant ni would be reinterpreted asnf, because there is no possible syllable nsa. This results in the two syllables men and sa.

However, none of these additional transformations are considered part of the UnicodeNormalization Forms.

Hangul Character Names

Hangul decomposition is also used to form the character names for the Hangul syllables.While the sample code that illustrates this process is not directly related to normalization, itis worth including because it is so similar to the decomposition code.

public static String getHangulName(char s) { int SIndex = s - SBase; if (0 > SIndex || SIndex >= SCount) { throw new IllegalArgumentException("Not a Hangul Syllable: " + s); } StringBuffer result = new StringBuffer(); int LIndex = SIndex / NCount; int VIndex = (SIndex % NCount) / TCount; int TIndex = SIndex % TCount; return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex] + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex]; }

static private String[] JAMO_L_TABLE = { "G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H" };

static private String[] JAMO_V_TABLE = { "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I" };

static private String[] JAMO_T_TABLE = { "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T", "P", "H" };

17 Intellectual Property

Transcript of letter regarding disclosure of IBM Technology(Hard copy is on file with the Chair of UTC and the Chair of NCITS/L2)

Transcribed on 1999-03-10

February 26, 1999

The Chair, Unicode Technical Committee

Subject: Disclosure of IBM Technology - Unicode Normalization Forms

The attached document entitled “Unicode Normalization Forms” does not require IBMtechnology, but may be implemented using IBM technology that has been filed for US


28 of 36 1/29/2008 10:28 AM

Patent. However, IBM believes that the technology could be beneficial to thesoftware community at large, especially with respect to usage on the Internet,allowing the community to derive the enormous benefits provided by Unicode.

This letter is to inform you that IBM is pleased to make the Unicode normalizationtechnology that has been filed for patent freely available to anyone using them inimplementing to the Unicode standard.

Sincerely,

W. J. Sullivan,Acting Director of National Language Supportand Information Development

18 Corrigenda

The Unicode Consortium has well-defined policies in place to govern changes that affectbackward compatibility. For information on these stability policies, especially regardingnormalization, see Unicode Policies [Policies]. In particular:

Once a character is encoded, its canonical combining class and decompositionmapping will not be changed in a way that will destabilize normalization.

What this means is:

If a string contains only characters from a given version of the Unicode Standard (forexample, Unicode 3.1.1), and it is put into a normalized form in accordance with thatversion of Unicode, then it will be in normalized form according to any future versionof Unicode.

This guarantee has been in place for Unicode 3.1 and after. It has been necessary tocorrect the decompositions of a small number of characters since Unicode 3.1, as listed inthe Normalization Corrections data file [Corrections], but such corrections are inaccordance with the above principles: all text normalized on old systems will test asnormalized in future systems. All text normalized in future systems will test as normalizedon past systems. What may change, for those few characters, is that unnormalized textmay normalize differently on past and future systems.

It is straightforward for any implementation with a future version of Unicode to support allpast versions of normalization. For an implementation of Unicode Version X to support aversion of NFC that precisely matches a older Unicode Version Y, the following two stepsare taken:

Before applying the normalization algorithm, map the characters that were correctedto their old values in Unicode Version Y.

Use the table in [Corrections] for this step, by including any code points thathave a version later than Y and less than or equal to X.For example, for a Unicode 4.0 implementation to duplicate Unicode 3.2results, exactly five characters must be mapped.

1.


29 of 36 1/29/2008 10:28 AM

In applying the normalization algorithm, handle any code points that were not definedin Unicode Version X as if they were unassigned.

That is, the code points will not decompose or compose, and their canonicalcombining class will be zero.The Derived_Age property in the Unicode Character Database [UCD] can beused for the set of code points in question.

2.

[Unicode4.1] corrected a definitional problem with D2.

19 Canonical Equivalence

This section describes the relationship of normalization to respecting (or preserving)canonical equivalence. A process (or function) respects canonical equivalence whencanonical-equivalent inputs always produce canonical-equivalent outputs. For a functionthat transforms one string into another, this may also be called preserving canonicalequivalence. There are a number of important aspects to this concept:

The outputs are not required to be identical, only canonically equivalent.1.Not all processes are required to respect canonical equivalence. For example:

A function that collects a set of the General_Category values present in a stringwill and should produce a different value for thanfor , even though they arecanonically equivalent.A function that does a binary comparison of strings will also find these twosequences different.

2.

Higher-level processes that transform or compare strings, or that perform otherhigher-level functions, must respect canonical equivalence or problems will result.

3.

The canonically equivalent inputs or outputs are not just limited to strings, but are alsorelevant to the offsetswithin strings, because those play a fundamental role in Unicode string processing.

Offset P into string X is canonically equivalent to offset Q into string Y if and only ifboth of the following conditions are true:

X[0, P] ≈ Y[0, Q], andX[P, len(X)] ≈ Y[Q, len(Y)]

This can be written as PX ≈ QY. Note that whenever X and Y are canonically equivalent, itfollows that 0X ≈ 0Y and len(X)X ≈ len(Y)Y.

Example 1. Given X = and Y = ,

0X ≈ 0Y1X ≈ 2Y2X ≈ 3Y1Y has no canonically equivalent offset in X

The following are examples of processes that involve canonically equivalent strings and/oroffsets.


30 of 36 1/29/2008 10:28 AM

Example 2. When isWordBreak(string, offset) respects canonical equivalence, then

isWordBreak(, 1) = isWordBreak(, 2)

Example 3. When nextWordBreak(string, offset) respects canonical equivalence, then

nextWordBreak(, 0) = 1 if and only if nextWordBreak(, 0) = 2

Respecting canonical equivalence is related to, but different from, preserving a canonicalNormalization Form NFx (where NFx means either NFD or NFC). In a process thatpreserves a Normalization Form, whenever any input string is normalized according to thatNormalization Form, then every output string is also normalized according to that form. Aprocess that preserves a canonical Normalization Form respects canonical equivalence,but the reverse is not necessarily true.

In building a system that as a whole respects canonical equivalence, there are two basicstrategies, with some variations on the second strategy.

Ensure that each system component respects canonical equivalence.A.Ensure that each system component preserves NFx, and one of the following:

Reject any non-NFx text on input to the whole system.1.Reject any non-NFx text on input to each component.2.Normalize to NFx all text on input to the whole system.3.Normalize to NFx all text on input to each component.4.All three of the following:

Allow text to be marked as NFx when generated.a.Normalize any unmarked text on input to each component to NFx.b.Reject any marked text that is not NFx.c.

5.

B.

There are trade-offs for each of these strategies. The best choice or mixture of strategieswill depend on the structure of the components and their interrelations, and howfine-grained or low-level those components are. One key piece of information is that it ismuch faster to check that text is NFx than it is to convert it. This is especially true in thecase of NFC. So even where it says “normalize” above, a good technique is to first check ifnormalization is required, and perform the extra processing only if necessary.

Strategy A is the most robust, but may be less efficient.Strategies B1 and B2 are the most efficient, but would reject some data, includingthat converted 1:1 from some legacy code pages.Strategy B3 does not have the problem of rejecting data. It can be more efficient thanA: because each component is assured that all of its input is in a particularNormalization Form, it does not need to normalize, except internally. But it is lessrobust: any component that fails can “leak” unnormalized text into the rest of thesystem.Strategy B4 is more robust than B1 but less efficient, because there are multiplepoints where text needs to be checked.Strategy B5 can be a reasonable compromise; it is robust but allows for all text input.

20 Corrigendum 5 Sequences

Table 11 shows all of the problem sequences relevant to Corrigendum 5. It is important to


31 of 36 1/29/2008 10:28 AM

emphasize that none of these sequences will occur in any meaningful text, because noneof the intervening characters shown in the sequences occur in the contexts shown in thetable.

Table 11. Problem Sequences

First Character InterveningCharacter(s)

Last Character

09C7 BENGALI VOWEL SIGN E

One or morecharacterswith anon-zeroCanonicalCombiningClasspropertyvalue—forexample, anacute accent.

09BE BENGALI VOWEL SIGN AA or09D7 BENGALI AU LENGTH MARK

0B47 ORIYA VOWEL SIGN E 0B3E ORIYA VOWEL SIGN AA or0B56 ORIYA AI LENGTH MARK or0B57 ORIYA AU LENGTH MARK

0BC6 TAMIL VOWEL SIGN E 0BBE TAMIL VOWEL SIGN AA or0BD7 TAMIL AU LENGTH MARK

0BC7 TAMIL VOWEL SIGN EE 0BBE TAMIL VOWEL SIGN AA0B92 TAMIL LETTER O 0BD7 TAMIL AU LENGTH MARK0CC6 KANNADA VOWEL SIGN E 0CC2 KANNADA VOWEL SIGN UU or

0CD5 KANNADA LENGTH MARK or0CD6 KANNADA AI LENGTH MARK

0CBF KANNADA VOWEL SIGN I OR0CCA KANNADA VOWEL SIGN O

0CD5 KANNADA LENGTH MARK

0D47 MALAYALAM VOWEL SIGN EE 0D3E MALAYALAM VOWEL SIGN AA0D46 MALAYALAM VOWEL SIGN E 0D3E MALAYALAM VOWEL SIGN AA or

0D57 MALAYALAM AU LENGTH MARK

1025 MYANMAR LETTER U 102E MYANMAR VOWEL SIGN II0DD9 SINHALA VOWEL SIGN KOMBUVA 0DCF SINHALA VOWEL SIGN AELA-PILLA or

0DDF SINHALA VOWEL SIGN GAYANUKITTA

[1100-1112] HANGUL CHOSEONGKIYEOK..HIEUH (19 instances)

[1161-1175] HANGUL JUNGSEONG A..I (21instances)

[:HangulSyllableType=LV:]* [11A8..11C2] HANGUL JONGSEONGKIYEOK..HIEUH (27 instances)

*This table is constructed on the premise that the text is being normalized (thus theprocessing is in the middle of R1.2 or R2.2), and that the first character has already beencomposed if possible. If the table is used external to normalization to assess whether anyproblem sequences occur, then the implementation must also catch cases that arecanonical equivalents. That is only relevant to the case [:HangulSyllableType=LV:]; theequivalent sequences of where x is in [1100..1112] and y is in [1161..1175] mustalso be detected.

21 Stream-Safe Text Format


32 of 36 1/29/2008 10:28 AM

There are certain protocols that would benefit from using normalization, but that haveimplementation constraints. For example, a protocol may require buffered serialization, inwhich only a portion of a string may be available at a given time. Consider the extremecase of a string containing a digit 2 followed by 10,000 umlauts followed by one dot-below,then a digit 3. As part of normalization, the dot-below at the end must be reordered toimmediately after the digit 2, which means that 10,003 characters need to be consideredbefore the result can be output.

Such extremely long sequences of combining marks are not illegal, even though for allpractical purposes they are not meaningful. However, the possibility of encountering suchsequences forces a conformant, serializing implementation to provide large buffer capacityor to provide a special exception mechanism just for such degenerate cases. TheStream-Safe Text Format specification addresses this situation.

D7. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format ifit would not contain any sequences of non-starters longer than 30 characters in lengthwhen normalized to NFKD.

Such a string can be normalized in buffered serialization with a buffer size of 32characters, which would require no more than 128 bytes in any Unicode EncodingForm.Incorrect buffer handling can introduce subtle errors in the results. Any bufferedimplementation should be carefully checked against the normalization test data.The value of 30 is chosen to be significantly beyond what is required for any linguisticor technical usage. While it would have been feasible to chose a smaller number, thisvalue provides a very wide margin, yet is well within the buffer size limits of practicalimplementations.NFKD was chosen for the definition because it produces the potentially longestsequences of non-starters from the same text.

D8. Stream-Safe Text Processis the process of producing a Unicode string in Stream-Safe Text Format by processingthat string from start to finish, inserting U+034F COMBINING GRAPHEME JOINER (CGJ) within longsequences of non-starters. The exact position of the inserted CGJs are determinedaccording to the following algorithm, which describes the generation of an output stringfrom an input string:

If the input string is empty, return an empty output string.1.Set nonStarterCount to zero.2.For each code point C in the input string:

Produce the NFKD decomposition S.a.If nonStarterCount plus the number of initial non-starters in S is greater than 30,append a CGJ to the output string and set the nonStarterCount to zero.

b.

Append C to the output string.c.If there are no starters in S, increment nonStarterCount by the number of codepoints in S; otherwise, set nonStarterCount to the number of trailingnon-starters in S (which may be zero).

d.

3.

Return the output string.4.

The Stream-Safe Text Process ensures not only that the resulting text is in Stream-SafeText Format, but that any normalization of the result is also in Stream-Safe Text Format.This is true for any input string that does not contain unassigned code points. The


33 of 36 1/29/2008 10:28 AM

Stream-Safe Text Process preserves all of the four normalization forms defined in thisdocument (NFC, NFD, NFKC, NFKD). However, normalization and the Stream-Safe TextProcess do not commute. That is, normalizing an arbitrary text to NFC, followed byapplying the Stream-Safe Text Process, is not guaranteed to produce the same result asapplying the Stream-Safe Text Process to that arbitrary text, followed by normalization toNFC.

It is important to realize that if the Stream-Safe Text Process does modify the input text byinsertion of CGJs, the result will not be canonically equivalent to the original. TheStream-Safe Text Format is designed for use in protocols and systems that accept thelimitations on the text imposed by the format, just as they may impose their own limitations,such as removing certain control codes.

However, the Stream-Safe Text Format will not modify ordinary texts. Where it modifies anexceptional text, the resulting string would no longer be canonically equivalent to theoriginal, but the modifications are minor and do not disturb any meaningful content. Themodified text contains all of the content of the original, with the only difference being thatreordering is blocked across long groups of non-starters. Any text in Stream-Safe TextFormat can be normalized with very small buffers using any of the standard NormalizationForms.

Implementations can optimize this specification as long as they produce the same results.In particular, the information used in Step 3 can be precomputed: it does not require theactual normalization of the character. For efficient processing, the Stream-Safe TextProcess can be implemented in the same implementation pass as normalization. In such acase, the choice of whether to apply the Stream-Safe Text Process can be controlled by aninput parameter.

21.1 Buffering with Unicode Normalization

Using buffers for normalization requires that characters be emptied from the buffercorrectly. That is, as decompositions are appended to the buffer, periodically the end of thebuffer will be reached. At that time, the characters in the buffer up to but not including thelast character with the property value Quick_Check=Yes (QC=Y) must be canonicallyordered (and if NFC and NFKC are being generated, must also be composed), and onlythen flushed. For more information on the Quick_Check property, see Section 14 DetectingNormalization Forms.

Consider the following example. Text is being normalized into NFC with a buffer size of 40.The buffer has been successively filled with decompositions, and has two remaining slots.The decomposition takes three characters, and wouldn't fit. The last character with QC=Yis the "s", marked in red below.

Buffer

T h e c ◌́ a ... p ◌̃ q r ◌́ s ◌́ 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39

Decomposition

u ◌̃ ◌́0 1 2


34 of 36 1/29/2008 10:28 AM

Thus the buffer up to but not including "s" needs to be composed, and flushed. Once thisis done, the decomposition can be appended, and the buffer is left in the following state:

s ◌́ u ◌̃ ◌́ ... 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39

Implementations may also canonically order (and compose) the contents of the buffer asthey go; the key requirement is that they cannot compose a sequence until a followingcharacter with the property QC=Y is encountered. For example, if that had been done inthe above example, then during the course of filling the buffer, we would have had thefollowing state, where "c" is the last character with QC=Y.

T h e c ◌́ 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39

When the "a" (with QC=Y) is to be appended to the buffer, it is then safe to compose the"c" and all subsequent characters, and then enter in the "a", marking it as the lastcharacter with QC=Y.

T h e ć a 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39

Acknowledgments

Mark Davis and Martin Dürst created the initial versions of this annex. Mark Davis hasadded to and maintains the text.

Thanks to Kent Karlsson, Marcin Kowalczyk, Rick Kunst, Per Mildner, Sadahiro Tomoyuki,Markus Scherer, Dick Sites, Sadahiro Tomoyuki, Ienup Sung, Erik van der Poel, and KenWhistler for feedback on this annex, including earlier versions. Asmus Freytag extensivelyreformatted the text for publication as part of the book.

References

For references for this annex, see Unicode Standard Annex #41, “Common References forUnicode Standard Annexes.”

Modifications

The following summarizes modifications from previous revisions of this annex.

Revision 28

Added Section 3.5 Stabilized StringsAdded R3. Normalization Process for Stabilized StringsAdded UAX15-C5Added a number of clarifications, including:


35 of 36 1/29/2008 10:28 AM

A new Section 21.1 Buffering with Unicode NormalizationA note at the bottom of Section 20 Corrigendum 5 Sequences on theinterpretation of the tableA note in Section 6 Composition Exclusion Table on when characters areadded to the table.

Draft 3:clarified wording in Compatibility decomposition and Hangul Composition

Draft 4:Updated to 5.1.0, since that is now the next release.

Draft 5:Renumbered D5 and D6 in Section 21 Stream-Safe Text FormatAdded note explaining why NFKD is used.Added note explaining that the process does not commute.Rearranged the text slightly to improve the flow.

Revision 27

Replaced Figure 3 with 4 modified figures, added commentary, renumbered figuresChanged “document” to “annex” as appropriateChanged Annexes to Sections and renumberedClarified the definition and use of the QuickTest propertyAdded Section 21, Stream-Safe Text FormatAdded examples of composition exclusionsChanged C1..C3 to UAX15-C1..UAX15-C1.Added clarifications of stability:

Section 3.1, Stability of Normalized FormsSection 3.2, Stability of the_Normalization ProcessSection 3.3, Guaranteeing Process StabilitySection 20, Corrigendum 5 Sequences

Major reformatting to better follow the style in The Unicode StandardExtensive copy-editing and minor editing for clarity

Revision 26 being a proposed update, only changes between versions 27 and 25 arenoted here.

Revision 25

Minor editingAdded note in Section 18, Corrigenda about PRI #29Changed “Tracking Number” to RevisionAdded note that D2 is only to be applied to strings that are already canonicallydecomposed.

Revision 24

As per PRI-29,changed D2 to add “or higher”Changed Goal 1 to clarify that it is a conformance requirement.


36 of 36 1/29/2008 10:28 AM

Added to section 10 to explain Hangul decomposition mappings.Numerous editorial changes

Revision 23

Updated References.Added description of Stable Code Points.Described notation toNFC(x) and isNFC(x), in Notation.Clarified the section on Concatenation.Copied reference to charts in the Introduction.Added pointer to UTN #5 Canonical Equivalences in Applications in ImplementationNotes.Rewrote Section 18, Corrigendafor clarity, and to describe the use of Normalization Corrections.Added Section 19, Canonical Equivalence.Added Acknowledgments.

Note: this does not include people who contributed feedback to previousversions.

Minor editing

Revision 22

Added reference to Corrigendum #3, “U+F951 Normalization,” changing the title ofSection 18Changed references to Unicode 3.1Cleaned up links to versioned files

Copyright © 1998-2008Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, andassumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages inconnection with or arising out of the use of the information or programs contained or accompanying this technical report.The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

UAX #15: Unicode Normalization Forms · 2008-01-29 · Chapter 2, General Structure, and . Chapter...

Documents

Transcript of UAX #15: Unicode Normalization Forms · 2008-01-29 · Chapter 2, General Structure, and . Chapter...