
Page 1: ICANN IDN TLD Variant Issues Project - Unicode · ICANN IDN TLD Variant Issues Project Presentation to the Unicode Technical Committee Andrew Sullivan (consultant) ajs@anvilwalrusden.com

ICANN IDN TLD Variant Issues Project

Presentation to the Unicode Technical Committee

Andrew Sullivan (consultant) ajs@anvilwalrusden.com

L2/11-426

I’m a consultant

Blame me for mistakes here, not staff or ICANN


Background

•  DNS labels were always in (a subset of) ASCII

•  Lots of people don’t normally use ASCII

•  Internationalizing Domain Names in Applications (IDNA) was invented to help


Reminder: two flavours

IDNA2003

IDNA2008


Basic problem

•  IDNA (2003 & 2008) expands DNS label repertoire

•  The LDH pattern does not fit perfectly in other languages, scripts, or both

•  People want DNS labels to work like parts of natural language


What makes a DNS label?

•  DNS labels are octets

•  Preferred syntax (RFC 1035) is Letters, Digits, and Hyphen (“LDH”)

•  Special DNS rule for ASCII

•  Case insensitive but case-preserving
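The points above can be sketched in a few lines of Python. This is a minimal illustration, not a DNS implementation: the function names are made up for this example, the pattern follows the RFC 1035 preferred syntax (start with a letter, end with a letter or digit), and the 63-octet limit is the usual DNS label limit rather than something stated on the slide.

```python
import re

# RFC 1035 "preferred name syntax" for one label: starts with a letter,
# ends with a letter or digit, letters/digits/hyphens in between.
LDH_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
MAX_LABEL_OCTETS = 63  # DNS limit on the length of a single label


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the LDH preferred syntax."""
    return len(label) <= MAX_LABEL_OCTETS and LDH_LABEL.fullmatch(label) is not None


def labels_match(a: str, b: str) -> bool:
    """Compare two ASCII labels ignoring ASCII case only.

    The labels themselves are left untouched (case-preserving);
    only the comparison folds case. Non-ASCII input raises an error.
    """
    return a.encode("ascii").lower() == b.encode("ascii").lower()


print(is_ldh_label("example"))             # True
print(is_ldh_label("-bad"))                # False
print(labels_match("Example", "eXAMPLE"))  # True
```

The special rule applies to ASCII only: a label containing other octets gets no case folding from the DNS itself, which is one reason IDNA has to deal with case separately.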


IDNA

•  Permit non-LDH characters in label

•  Be as compatible as practical with deployed software

•  No changes to deployed DNS software or protocol


IDNA2003

•  Provide a list of code points that are allowed

•  Map cases that are troublesome (e.g. ZWNJ, upper-to-lowercase) using Nameprep

•  To the extent there’s an installed base, this is it
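Python's standard library happens to ship an IDNA2003 implementation (the `idna` codec and `encodings.idna.nameprep`), so the mapping behaviour can be observed directly. The sketch below is only an illustration of the slide's two examples; `to_ascii_2003` is a hypothetical helper name.

```python
from encodings.idna import nameprep


def to_ascii_2003(label: str) -> bytes:
    """Encode one label with Python's built-in IDNA2003 ("idna") codec."""
    return label.encode("idna")


# Upper case is mapped to lower case before encoding.
print(nameprep("B\u00DCCHER"))        # bücher
# ZWNJ (U+200C) is mapped to nothing, so it silently disappears.
print(nameprep("a\u200Cb"))           # ab
# The mapped label is what ends up in the DNS, in ASCII-compatible form.
print(to_ascii_2003("B\u00DCCHER"))   # b'xn--bcher-kva'
```

Because the mapping happens before lookup, two strings that differ only by case or by a ZWNJ cannot be registered as different names under IDNA2003.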


IDNA2008

•  Attempt to address some perceived limitations of IDNA2003

•  Permits or disallows code points based on code point properties

•  Certain incompatibilities with IDNA2003
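One commonly cited incompatibility is LATIN SMALL LETTER SHARP S (U+00DF): IDNA2003 maps it to "ss", while IDNA2008 treats it as a valid code point in its own right. That example is not on the slide. The sketch below shows the IDNA2003 side with Python's built-in codec; `a_label_unmapped` is a hypothetical helper that merely Punycode-encodes the string as typed, and it is not an IDNA2008 implementation.

```python
def a_label_2003(label: str) -> bytes:
    """IDNA2003 ToASCII for one label, using Python's built-in codec."""
    return label.encode("idna")


def a_label_unmapped(label: str) -> bytes:
    """Punycode-encode the label exactly as typed, with no Nameprep mapping.

    Illustrative only: this skips every IDNA2008 validity check.
    """
    return b"xn--" + label.encode("punycode")


word = "stra\u00DFe"  # "strasse" spelled with sharp s

print(a_label_2003(word))      # b'strasse' (sharp s mapped away)
print(a_label_unmapped(word))  # a different, xn-- prefixed label
```

The same user input can therefore resolve to two different DNS names depending on which flavour the application implements.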


What’s a variant?

Exactly


Origins of variants

•  Starts because of Simplified Chinese/Traditional Chinese issue

•  JET Guidelines (RFC 3743)

•  Became model for other issues, not always related


Things people have claimed

•  Characters that are substitutable

•  “Same words” or “same meaning”

•  Sometimes a constraint on child names, sometimes not


Why now?

•  ccTLD IDN “Fast Track” process delegated some

•  Not uncontroversial

•  New gTLDs under development

•  If we’re going to create “variants”, we should be able to say what they are.


IDN Variant Issues Project

[Figure not recoverable; one brace in it is labelled “We are here”]


Comment period to 14 Nov

http://www.icann.org/en/announcements/announcement-4-03oct11-en.htm

and

http://www.icann.org/en/public-comment/


Reports are only about the root

While some of the conclusions may apply to other types of zones, the reports discuss variants for TLDs only


A planned constraint for TLDs

Current rule is “only letters” (strictly, General Category {Ll, Lo, Lm, Mn})

•  No numerals

•  No HYPHEN-MINUS

•  No ZWNJ/ZWJ


From the guidebook
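The "only letters" rule is easy to check mechanically, because it is stated in terms of Unicode General Category. The following is a rough sketch using Python's `unicodedata`; the function names are invented here, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# General Category values the rule admits, as listed on the slide.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn"}


def offending_code_points(label: str) -> list[tuple[str, str]]:
    """List (code point, category) pairs that violate the rule."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


def passes_letters_only_rule(label: str) -> bool:
    """Return True if every code point is in an allowed category."""
    return not offending_code_points(label)


print(passes_letters_only_rule("example"))  # True
print(offending_code_points("ab-c"))        # [('U+002D', 'Pd')]
print(offending_code_points("abc1"))        # [('U+0031', 'Nd')]
print(offending_code_points("a\u200Cb"))    # [('U+200C', 'Cf')]
```

The three bullets fall out of the category test: digits are Nd, HYPHEN-MINUS is Pd, and ZWNJ/ZWJ are Cf.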


Restrictions suggested in report

•  No combining marks

•  No digits

•  No archaic

•  No Quranic marks


Arabic team


ZWNJ

•  Arguments for and against

•  Refinement of IDNA2008 context rule

•  Issue is lack of shape change

•  Questions about resulting variants


Arabic team


Groups of characters

•  Identical shape at some position (e.g. YEH)

•  Similar shape at some position (e.g. ALEF w/ HAMZA ABOVE)

•  Interchangeable use (e.g. KAF vs SWASH KAF)


Arabic team


“NFC” issues

•  Not exactly an issue with NFC

•  Example: U+06C7 vs. U+0648, U+064F

•  Perhaps could be caught by “confusables” algorithms?


Arabic team
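The example on this slide can be checked with Python's `unicodedata`: U+06C7 has no canonical decomposition, so NFC leaves it distinct from the sequence U+0648, U+064F. The contrast case (ALEF WITH HAMZA ABOVE) is added here for illustration and is not on the slide; `nfc_equivalent` is a hypothetical helper name.

```python
import unicodedata


def nfc_equivalent(a: str, b: str) -> bool:
    """Return True if the two strings are identical after NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


single = "\u06C7"          # ARABIC LETTER U
sequence = "\u0648\u064F"  # ARABIC LETTER WAW + ARABIC DAMMA

# NFC does not unify the pair from the slide.
print(nfc_equivalent(single, sequence))            # False

# Contrast: this pair IS canonically equivalent, so NFC handles it.
print(nfc_equivalent("\u0623", "\u0627\u0654"))    # True
```

That is the sense in which this is "not exactly" an NFC issue: normalization behaves as specified, but it does not cover every pair a reader would take to be the same.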


Recommendations

•  Whenever there is a variant, all resulting labels are available to the applicant

•  It is up to the applicant which ones to activate


Arabic team
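To get a feel for what "all resulting labels" means, here is a small sketch that enumerates them from a table of interchangeable characters. The table is a one-entry placeholder built from the KAF / SWASH KAF example, not the team's actual table, and the function name is made up.

```python
from itertools import product

# Placeholder groups of mutually substitutable code points (assumption:
# a real table would come from the script team's report).
VARIANT_GROUPS = [
    {"\u0643", "\u06AA"},  # ARABIC LETTER KAF, ARABIC LETTER SWASH KAF
]

# Map each code point to the sorted members of its group.
VARIANTS = {ch: sorted(group) for group in VARIANT_GROUPS for ch in group}


def variant_labels(label: str) -> set[str]:
    """Return every label obtained by substituting variant characters."""
    choices = [VARIANTS.get(ch, [ch]) for ch in label]
    return {"".join(combo) for combo in product(*choices)}


label = "\u0643\u0627\u0643"  # KAF, ALEF, KAF
print(len(variant_labels(label)))  # 4
```

The set grows multiplicatively with each variant-bearing position, which is part of why leaving activation to the applicant matters in practice.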


Focus on Chinese Language

•  Reports in principle about “script”, but report primarily about Chinese

•  Some consideration of effects on Japanese and Korean


Chinese team


RFC 3743, experience

•  Experience at other levels of DNS

•  RFC 3743 a good fit for CJK use


Chinese team
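As a loose illustration of the RFC 3743 idea, the sketch below derives the all-simplified and all-traditional forms of a label from a variant table and bundles them with the requested label. The table holds a single, hand-picked pair and the function names are invented; the real mechanism uses per-registry language variant tables and distinguishes activated from reserved variants, which this sketch does not model.

```python
# Hypothetical one-pair table: code point -> (simplified, traditional).
TABLE = {
    "\u56FD": ("\u56FD", "\u570B"),  # simplified form of "country"
    "\u570B": ("\u56FD", "\u570B"),  # traditional form of "country"
}


def preferred_variants(label: str) -> tuple[str, str]:
    """Return (all-simplified, all-traditional) forms of the label."""
    simplified = "".join(TABLE.get(ch, (ch, ch))[0] for ch in label)
    traditional = "".join(TABLE.get(ch, (ch, ch))[1] for ch in label)
    return simplified, traditional


def package(label: str) -> set[str]:
    """Bundle the requested label with its preferred variants."""
    return {label, *preferred_variants(label)}


requested = "\u4E2D\u56FD"  # two characters; only the second has a variant
print(preferred_variants(requested) == ("\u4E2D\u56FD", "\u4E2D\u570B"))  # True
print(len(package(requested)))                                           # 2
```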


Two fundamental cases

•  Traditional vs Simplified

•  Variation due to Source Separation Rule (e.g. U+6237 versus U+6236)


Chinese team


Focus on reducing confusion

•  Mainly interested in confusion of strings between languages

•  Unlike Chinese and Arabic, no strong recommendation that “everything works”


Cyrillic team


Different from other cases

•  Many more languages than some other scripts

•  Extremely fraught political environment:

    •  Cyrillic vs. Latin

    •  Cyrillic vs. Arabic

    •  Many spelling & character reforms


Cyrillic team


One language can cause issues

•  Substitutions in one language obliterate differences in others

•  E.g. U+0435 vs U+0451, U+0433 vs U+0491

•  Some characters not on keyboards


Cyrillic team
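A short sketch of the first bullet: if the two substitutions from the slide were applied as a script-wide folding, strings that some languages keep distinct would collapse into one. The folding table and function name are assumptions made for this example, and the word pair is an illustration that does not come from the report.

```python
# Hypothetical script-wide folding built from the slide's two pairs.
FOLD = str.maketrans({
    "\u0451": "\u0435",  # CYRILLIC SMALL LETTER IO -> IE
    "\u0491": "\u0433",  # CYRILLIC SMALL LETTER GHE WITH UPTURN -> GHE
})


def fold(label: str) -> str:
    """Apply the folding to every character of the label."""
    return label.translate(FOLD)


with_io = "\u0432\u0441\u0451"  # spelled with IO
with_ie = "\u0432\u0441\u0435"  # spelled with IE

print(with_io == with_ie)              # False: distinct strings
print(fold(with_io) == fold(with_ie))  # True: indistinguishable after folding
```

A substitution that is harmless where the two letters are interchangeable becomes lossy for any language in which they are separate letters.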


Interaction with other scripts

•  Issue of relation to Greek and Latin raised

•  Declared out of scope, but problematic


Cyrillic team


Very different issues

•  Confusing similarity a high priority issue

•  Especially worried about URL bar display

•  Concern about ill-formed akshars


Devanagari team


Environment issues

•  Display of Devanagari script can be problematic

    •  Rendering engines

    •  Fonts


Devanagari team


ZWJ and ZWNJ

•  Some Devanagari-using languages rely on ZWJ

    •  Even if there is a precomposed version that will do

•  ZWNJ needed for noun paradigms

    •  Use in TLDs not clear


Devanagari team
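For context, IDNA2008 admits ZWJ and ZWNJ only where a context rule is satisfied, and for Indic scripts the relevant condition is that the joiner directly follows a virama. The sketch below checks only that virama clause; the function name is invented, and the ZWNJ rule's additional joining-type clause is omitted because Python's `unicodedata` does not expose Joining_Type.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value assigned to viramas


def joiners_follow_virama(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ directly follows a virama.

    Partial check: this is only the virama clause of the context rules.
    """
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"

print(joiners_follow_virama(KA + VIRAMA + ZWJ + SSA))  # True
print(joiners_follow_virama(KA + ZWJ + SSA))           # False
```

Passing this check says nothing about whether the resulting label is distinguishable on screen from the one without the joiner, which is where the variant question comes in.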


Inter-script issues

•  Relationship between Devanagari and other Brahmi-derived scripts?

•  Ruled out of scope, but may be important


Devanagari team


Unusual case

•  Greek alone in studied scripts in being used for only one language


Greek team


Additional restrictions

•  Team recommends excluding ancient characters

•  Team recommends sticking to Monotonic characters


Greek team


Sigma and Tonos

•  IDNA2003 maps upper case to lower case: Tonos can be lost

•  IDNA2003 maps away final form sigma

•  Transformations in applications in IDNA2008


Greek team
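Both IDNA2003 effects can be seen with the Nameprep implementation in Python's standard library. The sample word is chosen for illustration only, and `idna2003_form` is a hypothetical wrapper name.

```python
from encodings.idna import nameprep


def idna2003_form(label: str) -> str:
    """Return the label after IDNA2003 Nameprep mapping."""
    return nameprep(label)


# Lower case, with Tonos on the second omicron and a final sigma.
lower_with_tonos = "\u03BF\u03B4\u03CC\u03C2"
# The same word typed in capitals, written without Tonos.
upper_no_tonos = "\u039F\u0394\u039F\u03A3"

mapped_lower = idna2003_form(lower_with_tonos)
mapped_upper = idna2003_form(upper_no_tonos)

# Final sigma (U+03C2) is mapped to small sigma (U+03C3).
print(mapped_lower.endswith("\u03C3"))  # True
# Lower-casing the capitals cannot restore the Tonos.
print(mapped_upper == mapped_lower)     # False
```

IDNA2008 leaves such mappings to the application, so whether the two inputs above end up as the same label depends on software outside the protocol.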


Final sigma

•  Recommend registering final form sigmas wherever requested

•  Also register without the final sigma (i.e. with small sigma in place of final sigma)


Greek team


Tonos

•  Recommend registering with Tonos where requested

•  Also register with Tonos stripped


Greek team
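Taken together, the final-sigma and Tonos recommendations imply a small set of labels per request. The sketch below derives that set. Combining the two transformations is an assumption made here, the function names are invented, and Tonos stripping is approximated by removing the combining acute accent after canonical decomposition.

```python
import unicodedata

FINAL_SIGMA = "\u03C2"
SMALL_SIGMA = "\u03C3"
COMBINING_ACUTE = "\u0301"  # what Tonos decomposes to under NFD


def without_final_sigma(label: str) -> str:
    """Replace final-form sigma with small sigma."""
    return label.replace(FINAL_SIGMA, SMALL_SIGMA)


def without_tonos(label: str) -> str:
    """Strip Tonos by decomposing, dropping the accent, recomposing."""
    decomposed = unicodedata.normalize("NFD", label)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_ACUTE, ""))


def labels_to_register(requested: str) -> set[str]:
    """Return the requested label plus its sigma/Tonos alternatives."""
    labels = {requested}
    for transform in (without_final_sigma, without_tonos):
        labels |= {transform(label) for label in labels}
    return labels


requested = "\u03BF\u03B4\u03CC\u03C2"  # has both a Tonos and a final sigma
print(len(labels_to_register(requested)))  # 4
```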


Dimotiki and Katharevousa

•  Recommendation that, if Katharevousa string is requested, the “same” Dimotiki “word” is blocked

•  Only report that requests variant behaviour because of whole-string meaning


Greek team


The impossible dream

•  There are too many relationships among characters in Latin-using languages

•  There’s no way to decide

•  Therefore, no variants


Latin team


Remember, please comment

Open until 14 November

http://www.icann.org/en/public-comment/


Questions
