CLDR: The Common Locale Data Repository
Locales for the World
Lisa Moore, George Rhoten, Mark Davis, Steven Loomis
LRC – XI The Localisation Factory, Dublin, Ireland, October 2006

Agenda
Why CLDR?
CLDR data
Tools and vetting
Today and the future

Locales – does anything stay the same?
"Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

Locales – the many differences
Locales specify user preferences
Linguistic and cultural differences
• Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes
Even in the same locale, interoperability issues across platforms
Global economics has increased the need for greater globalization support in computer systems
Everyone expects more!

Add the Universal Character Encoding
Unicode: Unique character codes for all languages
…

The Need for Common Locale Data
Computing environments often contain a variety of operating systems and software.
Historically locale sensitive data research has been done by individuals and/or companies.
Because of political changes, it is easy for locale data to become out of date.
It is difficult to get complete agreement on correctness.

Common Locale Data Project
Began as Common XML Locale Repository (CXLR) developed by OpenI18N in 2003
CLDR project began in 2004
Hosted by Unicode Consortium
• http://www.unicode.org/cldr/
Goals:
• Common, necessary software locale data for all world languages
• Collect and maintain locale data
• XML format for effective interchange
• Freely available

CLDR in use (partial list)
Libraries and Environments
• ICU – International Components for Unicode
• JDK – Java Development Kit
Operating Systems
• Solaris
• AIX
• MacOS X
Applications
• OpenOffice.org
• Acrobat
• ModernBill

What is a Locale?
A locale is an identifier referring to linguistic and cultural preferences
• en_US, en_GB, ja_JP
These preferences can change over time due to cultural and political reasons
• Introduction of new currencies, like the Euro
• Standard sorting of Spanish changes
Many of these preferences have varying degrees of standardization
• 12 and 24 hour format in the United States
This is a very broad topic

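As a rough illustration of how identifiers such as en_US break down, the sketch below splits a locale identifier into language, optional script, and optional territory. The name `parse_locale_id` and the simplified subtag rules are assumptions made for this example; the complete identifier syntax is defined by the LDML specification.

```python
import re
from typing import NamedTuple, Optional

class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    territory: Optional[str]

# Simplified pattern (an assumption): 2-3 letter language, optional 4-letter
# script, optional 2-letter territory, separated by "_" or "-".
_LOCALE_RE = re.compile(
    r"^([A-Za-z]{2,3})(?:[_-]([A-Za-z]{4}))?(?:[_-]([A-Za-z]{2}))?$"
)

def parse_locale_id(identifier: str) -> LocaleId:
    """Split a locale identifier into its subtags, normalizing their case."""
    match = _LOCALE_RE.match(identifier)
    if match is None:
        raise ValueError(f"unsupported locale identifier: {identifier!r}")
    language, script, territory = match.groups()
    return LocaleId(
        language.lower(),
        script.title() if script else None,
        territory.upper() if territory else None,
    )

for identifier in ("en_US", "en_GB", "ja_JP"):
    print(identifier, parse_locale_id(identifier))
```

Variants and keywords are deliberately left out to keep the sketch short.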
Types of Locale Data
Dates/time/calendar formats
Number/currency formats
Measurement system
Collation specification
• Sorting
• Searching
• Matching
Translated names for language, territory, script, timezones, currencies, …
Script and characters used by a language

Locale Data Markup Language
Locale data described using XML
CLDR data uses LDML
Structure of CLDR controlled by Locale Data Markup Language (LDML) specification
• http://unicode.org/reports/tr35

LDML Data Categories
<ldml>
<identity>
<localeDisplayNames>
<layout>
<characters>
<delimiters>
<measurement>
<dates>
<numbers>
<posix>
<collations>

Names
<localeDisplayNames>
Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR.
Most of this information is at the language level, since it typically does not vary by territory, only language.
An example: ICU Locale Explorer

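Because display names usually live at the language level, a lookup for a territory-specific locale can fall back to its language. The sketch below shows one way this might work; the `DISPLAY_NAMES` table, the empty `ga_IE` entry, and both function names are hypothetical and only illustrate the idea.

```python
from typing import Dict, Iterator, Optional

# Hypothetical in-memory data: names are stored for the language ("ga"),
# while the territory-level locale ("ga_IE") adds nothing of its own.
DISPLAY_NAMES: Dict[str, Dict[str, str]] = {
    "ga": {"aa": "Afar", "ab": "Abcáisis"},
    "ga_IE": {},
}

def fallback_chain(locale_id: str) -> Iterator[str]:
    """Yield the locale itself, then each parent made by dropping the last subtag."""
    parts = locale_id.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()

def language_display_name(locale_id: str, language_code: str) -> Optional[str]:
    """Return the first display name found along the fallback chain, if any."""
    for candidate in fallback_chain(locale_id):
        name = DISPLAY_NAMES.get(candidate, {}).get(language_code)
        if name is not None:
            return name
    return None

print(language_display_name("ga_IE", "ab"))
```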
Names Examples
From ga.xml (Irish):
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abcáisis</language>…
  <scripts>
    <script type="Arab">Araibis</script>…
  <territories>
    <territory type="AD">Andóra</territory>
    <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>…

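A consumer of this data might read the display names with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`. The closing tags and the `<ldml>` wrapper are filled in here only so that the fragment parses, and the helper name `display_names` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# A completed version of the excerpt: closing tags and the <ldml> wrapper are
# added so the fragment parses; the real ga.xml holds many more entries.
LDML_SNIPPET = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <scripts>
      <script type="Arab">Araibis</script>
    </scripts>
    <territories>
      <territory type="AD">Andóra</territory>
      <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""

def display_names(xml_text: str, group: str, item: str) -> dict:
    """Map each type code to its display name, e.g. group="territories", item="territory"."""
    root = ET.fromstring(xml_text)
    return {
        element.get("type"): (element.text or "").strip()
        for element in root.findall(f"./localeDisplayNames/{group}/{item}")
    }

territories = display_names(LDML_SNIPPET, "territories", "territory")
print(territories["AE"])
```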
Characters
<characters>
Allows for creation of exemplar character sets. An exemplar set specifies the set of characters that must be present in order to properly render the language.
Auxiliary exemplar set defines additional characters that may appear in foreign words or phrases.
Lower case only

Date Formats
<dates>
Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.)
Defines formatting for dates, times, eras and time zones
• wide, abbreviated, or narrow
• Date and time formats use patterns of letters to define proper formatting
Week information
Relative day/time translations (for example, yesterday, tomorrow, etc.)
An example: ICU Locale Explorer

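To give a feel for how patterns of letters drive formatting, here is a minimal sketch that handles only a few numeric fields (`y`, `M`, `d`, `H`, `m`). The function name `format_with_pattern` is an assumption, and text fields such as month or day names, quoting, and the other pattern letters in the LDML specification are left out.

```python
import re
from datetime import datetime

# Numeric fields only; each maps a pattern letter to the value it formats.
_FIELDS = {
    "y": lambda dt: dt.year,
    "M": lambda dt: dt.month,
    "d": lambda dt: dt.day,
    "H": lambda dt: dt.hour,
    "m": lambda dt: dt.minute,
}

def format_with_pattern(dt: datetime, pattern: str) -> str:
    """Replace each run of a supported pattern letter with a zero-padded number."""
    def replace_field(match):
        run = match.group(0)
        letter = run[0]
        if letter not in _FIELDS:
            raise ValueError(f"unsupported pattern letter: {letter!r}")
        if letter == "M" and len(run) > 2:
            raise ValueError("text month names are not covered by this sketch")
        value = _FIELDS[letter](dt)
        if letter == "y" and len(run) == 2:
            value %= 100  # "yy" means a two-digit year
        # The run length gives the minimum number of digits.
        return str(value).zfill(len(run))

    # A field is a run of one repeated ASCII letter; everything else is literal.
    return re.sub(r"([A-Za-z])\1*", replace_field, pattern)

release = datetime(2006, 7, 17, 9, 5)
print(format_with_pattern(release, "dd/MM/yyyy"))
print(format_with_pattern(release, "yyyy-MM-dd HH:mm"))
```

The same date renders differently only because the pattern string changes, which is why the pattern itself is stored as locale data.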
Characters / Dates Examples
From ga.xml (Irish):
<characters>
  <exemplarCharacters>[a á b-e é f-i í j-o ó p-u ú v-z]</exemplarCharacters>
  <exemplarCharacters type="auxiliary">[ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ]</exemplarCharacters>
</characters>…
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">Domh</day>
    <day type="mon">Luan</day>…

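One use of an exemplar set is checking whether text stays within the characters a language needs. The sketch below expands the Irish main exemplar pattern from the excerpt and reports letters that fall outside it. It supports only single characters and simple `x-y` ranges, and both function names are assumptions for illustration.

```python
import unicodedata

def expand_exemplar_set(pattern: str) -> set:
    """Expand a simple exemplar pattern of single characters and x-y ranges."""
    body = unicodedata.normalize("NFC", pattern).strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a pattern enclosed in [ ]")
    characters = set()
    for token in body[1:-1].split():
        if len(token) == 3 and token[1] == "-":
            start, end = ord(token[0]), ord(token[2])
            characters.update(chr(code) for code in range(start, end + 1))
        elif len(token) == 1:
            characters.add(token)
        else:
            raise ValueError(f"unsupported token: {token!r}")
    return characters

def uncovered_characters(text: str, exemplars: set) -> set:
    """Return the letters of text (lower-cased) that the exemplar set lacks."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalized if ch.isalpha() and ch not in exemplars}

IRISH_MAIN = expand_exemplar_set("[a á b-e é f-i í j-o ó p-u ú v-z]")
print(uncovered_characters("Éire", IRISH_MAIN))
```

Text is lower-cased before the check because the exemplar sets are lower case only.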
Time Zone Names
<timeZoneNames>
Based on Olson time zone database
Localized display names for standard, daylight, and generic representations of time zones.
Short and long display names.

Numbers
<numbers>
Specifies proper localized formatting of numeric quantities
• Decimal
• Scientific
• Currency
• Percentages
Includes localized decimal, thousands separators, currency symbols, etc.

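The effect of localized decimal and thousands separators can be sketched in a few lines. The function below takes the separators as parameters rather than reading them from CLDR, groups digits in threes, and uses a fixed number of fraction digits; all of these are simplifying assumptions, and `format_decimal` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def format_decimal(value, decimal_sep: str, group_sep: str, fraction_digits: int = 2) -> str:
    """Format value with the given separators, grouping integer digits in threes."""
    quantum = Decimal(1).scaleb(-fraction_digits)  # e.g. Decimal("0.01")
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    sign = "-" if rounded < 0 else ""
    integer_part, _, fraction_part = f"{abs(rounded):f}".partition(".")
    groups = []
    while len(integer_part) > 3:
        groups.insert(0, integer_part[-3:])
        integer_part = integer_part[:-3]
    groups.insert(0, integer_part)
    result = sign + group_sep.join(groups)
    if fraction_digits > 0:
        result += decimal_sep + fraction_part
    return result

print(format_decimal("1234.57", decimal_sep=",", group_sep="."))
print(format_decimal("1234.57", decimal_sep=".", group_sep=","))
```

The first call produces the style of the "1.234,57" amount quoted earlier in the talk; the second swaps the two separators.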
Time Zones / Currencies
From ga.xml (Irish) and root.xml:
<timeZoneNames>
  <zone type="Europe/Dublin">
    <long>
      <standard>Meán-Am Greenwich</standard>
      <daylight>Am Samhraidh na hÉireann</daylight>
    </long>…
<numbers>
  <currencies>
    <currency type="EUR">
      <displayName>Euro</displayName>
      <symbol>€</symbol>…

Delimiters
<delimiters>
Specifies a primary and a secondary pair of delimiter characters to be used for bracketing quotations in text

Delimiters Example
From fr.xml (French):
<delimiters>
  <quotationStart>«</quotationStart>
  <quotationEnd>»</quotationEnd>
  <alternateQuotationStart>“</alternateQuotationStart>
  <alternateQuotationEnd>”</alternateQuotationEnd>
</delimiters>

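An application could use these four values to bracket quotations, switching to the alternate pair for a quotation nested inside another. The sketch below assumes that the pairs simply alternate with nesting depth; the `Delimiters` class and `quote` function are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Delimiters:
    quotation_start: str
    quotation_end: str
    alternate_quotation_start: str
    alternate_quotation_end: str

# Values taken from the fr.xml excerpt.
FRENCH = Delimiters("«", "»", "“", "”")

def quote(text: str, delimiters: Delimiters, depth: int = 0) -> str:
    """Wrap text in primary marks at even nesting depth, alternate marks at odd depth."""
    if depth % 2 == 0:
        return f"{delimiters.quotation_start}{text}{delimiters.quotation_end}"
    return f"{delimiters.alternate_quotation_start}{text}{delimiters.alternate_quotation_end}"

inner = quote("non", FRENCH, depth=1)
print(quote(f"Elle a dit {inner}", FRENCH))
```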
Collation
<collations>
Information in collation directory, not main
XML version of Java/ICU collation syntax
Unicode collation algorithm is the base
• http://unicode.org/reports/tr10
Allows tailoring of the UCA on a per locale basis.

Collation Example
From collations/root.xml:
<collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
  <collation type="standard">
    <rules>
      ...
      <s>ā</s>
      <t>Ā</t>
      <s>á</s>
      <t>Á</t>
      <s>ǎ</s>
      <t>Ǎ</t>
      <s>à</s>
      <t>À</t>…

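The `<s>` and `<t>` elements mark secondary (accent) and tertiary (case) differences. The toy sort key below illustrates only the idea of comparing in levels: base letters first, then accents, then case. It is not the Unicode Collation Algorithm, and its accent order is plain code point order rather than the order tailored in the excerpt; the name `toy_sort_key` is an assumption.

```python
import unicodedata

def toy_sort_key(text: str):
    """Three-level key: base letters, then accents per letter, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and primary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch.lower())
            secondary.append("")
            tertiary.append(ch.isupper())  # lower case sorts before upper case
    return (tuple(primary), tuple(secondary), tuple(tertiary))

words = ["Ā", "a", "á", "A", "ā", "b"]
print(sorted(words, key=toy_sort_key))
```

Accents and case matter only when the base letters are equal, so "ab" still sorts before "Ac".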
CLDR Tools
Export
• ICU resource bundle generation
• POSIX locale generator
• OpenOffice.org format export
Survey tool
• http://www.unicode.org/cgi-bin/cldr-survey

Vetting Process for Data
Collect from different platforms, experts, submissions: new or revised
• References to external sources strongly encouraged
• Must be before freeze date for release
• Use Survey Tool to Collect Data

Causes of Conflicting Data
Typographical errors
• Canda instead of Canada
Regional differences
• German spelling is different between countries
Parts of speech
• “март 2004” versus “3 марта” when the Russian word for March is used in a date
Context of usage
• Normal German sorting versus German phonebook sorting
Standards versus common use
• “Republic of Laos” versus “Laos”
Individual preferences
• 24 hour time format versus 12 hour time format

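Whatever the cause, a conflict shows up as two or more distinct values submitted for the same data item. The sketch below flags such items so that vetters can review them. The `Submission` record, its field names, and the sample data are hypothetical and do not describe how the Survey Tool stores submissions.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple

class Submission(NamedTuple):
    locale: str       # e.g. "en"
    item: str         # hypothetical key naming the data item
    value: str        # the submitted value
    contributor: str  # who submitted it

def find_conflicts(submissions: Iterable[Submission]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the items for which more than one distinct value was submitted."""
    values_by_item = defaultdict(set)
    for submission in submissions:
        key = (submission.locale, submission.item)
        values_by_item[key].add(submission.value.strip())
    return {key: values for key, values in values_by_item.items() if len(values) > 1}

SAMPLE = [
    Submission("en", "territory/CA", "Canada", "contributor-1"),
    Submission("en", "territory/CA", "Canda", "contributor-2"),
    Submission("en", "territory/LA", "Laos", "contributor-1"),
    Submission("en", "territory/LA", "Laos", "contributor-2"),
]
print(find_conflicts(SAMPLE))
```

Detecting the conflict is the easy part; deciding which value is right still needs the references and expert review described in the vetting process.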
Latest Release: CLDR 1.4
Released: July 17, 2006
360 locales:
• 121 languages
• 142 territories
25% more data
17,000 new or modified data items
Over 100 different contributors

Challenges
Complex Formats
Experts knowledgeable both in technology and a specific language
• Collation
• Exemplar characters
• Etc…
Require close interaction of CLDR experts with language experts

Getting Involved
Simplest – anyone!
• Use CLDR
• Bug report / feature request
More Involved
• Vetting, Assessment, Tools, Policies, Decisions, …
• Any Unicode member eligible to name representatives including country liaison members

Example Country Process (Finland)
Finnish Ministry of Education made CLDR data a major goal, 2004-06
• Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
• Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
• Over 30 different parties represented: commercial, non-commercial, individuals
• Results expected to lead to new/revised national standards

For More Information
Unicode
• http://www.unicode.org/
CLDR
• http://www.unicode.org/cldr/
LDML specification
• http://unicode.org/reports/tr35