Post on 25-Dec-2015
Recent Experiences on Measuring Languages in Cyberspace – 2007, UNESCO Headquarters, Room XVI
S. T. Nandasara, Lecturer, USCS, University of Colombo, Sri Lanka
Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan
Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan
Asian Languages on the Web
Introduction of Asian Languages
Survey Objectives and Methodology
Asian Language Presence on the Web
Multilingualism in the Asian Web
Script and Encoding Issues
Asian Language Resource Network (ALRN) Project
Asian Languages on the Web
Give an overview of Asian languages on the web
Describe the state of multilingualism in Asian country domains
    Defined at various levels, from a personal or document level to a societal level
    Multiple language presence in each country domain
    Give an overview of cross-border languages
Shed light on script and encoding issues of Asian languages
    To what extent is UCS/Unicode employed for Asian languages?
    What scripts are actually used to represent a specific language?
    To what extent are locally developed encodings used?
Define a future agenda that can guide us in realizing the vision of creating an observation-collection instrument for Asian languages
Survey Objectives
A web crawler (UbiCrawler) was used. It traces links within pages and recursively crawls to gather the newly discovered pages.
The collection of downloaded web pages was passed to the language identification engine.
The language properties of the pages were then identified.
Survey Methodology
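The link-tracing step can be illustrated with a small sketch. It is not UbiCrawler's implementation: the in-memory `WEB` dictionary stands in for real HTTP fetching, the names `WEB`, `ALLOWED_TLDS`, `tld` and `crawl` are hypothetical, and a queue is used in place of literal recursion.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the web: URL -> list of outgoing links.
WEB = {
    "http://a.example.lk/": ["http://a.example.lk/p1", "http://b.example.th/"],
    "http://a.example.lk/p1": ["http://a.example.lk/", "http://c.example.com/"],
    "http://b.example.th/": ["http://a.example.lk/p1"],
    "http://c.example.com/": [],
}

# A small illustrative subset of the surveyed ccTLDs.
ALLOWED_TLDS = {"lk", "th"}


def tld(url: str) -> str:
    """Return the last label of the host name, e.g. 'lk'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def crawl(seeds):
    """Trace links breadth-first, staying inside the allowed ccTLDs."""
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or tld(url) not in ALLOWED_TLDS:
            continue
        seen.add(url)
        # A real crawler would download and parse the page here;
        # this sketch just looks the links up in WEB.
        queue.extend(WEB.get(url, []))
    return seen


pages = crawl(["http://a.example.lk/"])
print(sorted(pages))
```

Starting from one seed, the sketch collects the three pages under .lk and .th and skips the .com page, which is the kind of domain restriction a country-domain survey needs.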
Focused on web pages in 42 country domains in Asia.
The crawl was begun from a seed file containing 13,286 URLs
The list of ccTLDs contains ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.
The Asia crawl started on 5 July 2006 at 11:00 hrs and ended on 19 July 2006 at 19:03 hrs.
A total of 107,141,679 web pages were downloaded, amounting to 652,710,237,381 bytes.
Web Pages Collected
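As a quick check on the scale of the crawl, the figures on this slide can be turned into derived quantities. The sketch below uses only the numbers stated above; the derived values (duration, pages per second, average page size) are my own arithmetic, not figures reported by the authors.

```python
from datetime import datetime

# Figures taken from the slide.
CCTLDS = (
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk "
    "mm mn mv my np om ph pk ps qa sa sg sy th tj tm tp tr uz vn ye"
).split()
start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)
pages = 107_141_679
total_bytes = 652_710_237_381

# Derived quantities (not stated in the source).
duration_s = (end - start).total_seconds()
pages_per_second = pages / duration_s
avg_page_bytes = total_bytes / pages

print(f"ccTLDs in list: {len(CCTLDS)}")
print(f"crawl duration: {end - start}")
print(f"pages per second: {pages_per_second:.1f}")
print(f"average page size: {avg_page_bytes:.0f} bytes")
```

The list does contain 42 ccTLDs, and the crawl ran for a little over 14 days.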
Downloaded Pages by ccTLD – Top 10
Country ccTLD Pages Percentage
Israel il 30,943,029 29.48%
Thailand th 12,556,807 11.96%
Turkey tr 11,363,633 10.83%
Malaysia my 6,865,800 6.54%
Kazakhstan kz 6,441,378 6.14%
Singapore sg 5,771,191 5.50%
Indonesia id 5,742,097 5.47%
Vietnam vn 4,490,288 4.28%
India in 4,262,378 4.06%
Iran ir 4,022,270 3.83%
Downloaded Pages by ccTLD – Least 10
Country ccTLD Pages Percentage
Iraq iq 0 0.00%
East Timor tp 13,213 0.01%
Myanmar mm 16,759 0.02%
Yemen ye 34,128 0.03%
Maldives mv 37,393 0.04%
Bhutan bt 44,594 0.04%
Syria sy 51,555 0.05%
Qatar qa 52,888 0.05%
Kuwait kw 59,152 0.06%
The language identification engine used was LIM (Language Identification Module).
LIM consists of two components.
The first is the training component. The training data consists of translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nations Office of the High Commissioner for Human Rights.
The second is the identification component.
LIM can simultaneously detect the triplet of language, script and encoding scheme.
Language Identification Process
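The two-component design (train on reference texts, then identify) can be sketched as below. This is an assumption-laden illustration, not LIM itself: the slides do not say how LIM scores pages, so byte trigram overlap is swapped in as a simple stand-in, the class `TinyIdentifier` is hypothetical, and the short training strings are toy samples rather than the UDHR translations.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


class TinyIdentifier:
    """Toy identifier keyed by (language, script, encoding) triplets."""

    def __init__(self, n: int = 3):
        self.n = n
        self.profiles = {}

    def train(self, language, script, encoding, text):
        # Training component: encode the sample text in the given
        # encoding and store its byte n-gram profile.
        key = (language, script, encoding)
        self.profiles[key] = byte_ngrams(text.encode(encoding), self.n)

    def identify(self, data: bytes):
        # Identification component: pick the triplet whose profile
        # shares the most n-grams with the input bytes.
        grams = byte_ngrams(data, self.n)

        def score(profile):
            return sum(min(count, profile[g]) for g, count in grams.items())

        return max(self.profiles, key=lambda k: score(self.profiles[k]))


# Toy training samples (assumptions, not UDHR text).
thai_sample = "ภาษาไทยเป็นภาษาราชการของประเทศไทย"
turkish_sample = "Türkçe bir dildir"

identifier = TinyIdentifier()
identifier.train("Thai", "Thai", "tis-620", thai_sample)
identifier.train("Thai", "Thai", "utf-8", thai_sample)
identifier.train("Turkish", "Latin", "iso8859-9", turkish_sample)
identifier.train("Turkish", "Latin", "utf-8", turkish_sample)

print(identifier.identify("ประเทศไทย".encode("tis-620")))
print(identifier.identify("ประเทศไทย".encode("utf-8")))
```

Because the profiles are built from encoded bytes rather than decoded characters, the same Thai text yields different triplets depending on whether it arrives as TIS-620 or UTF-8, which is how one pass can report language, script and encoding together.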
Chinese, Japanese and Korean are excluded from the analysis.
Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani
Discovered 55 Asian languages
No of web pages per 1000 population
Number of pages by language – Top 10
Language     Script   Speaker population   Total number of pages   No. of pages per 1000 speakers
Hebrew       Hebrew            4,612,000              11,957,314                          2592.65
Thai         Thai             21,000,000               7,752,785                           369.18
Turkish      Latin            59,000,000               3,959,328                            67.11
Vietnamese   Latin            66,897,000               2,006,469                            29.99
Arabic       Arabic          280,000,000               1,671,122                             5.97
Tatar        Latin             7,000,000               1,575,442                           225.06
Farsi        Latin            33,000,000               1,293,880                            39.21
Javanese     Latin            75,000,000               1,267,981                            16.91
Indonesian   Latin           140,000,000                 866,238                             6.19
Malay        Latin            17,600,000                 432,784                            24.59
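The last column of this table is the page count divided by the speaker population, scaled to 1000 speakers. A minimal sketch of that arithmetic, using three rows from the table (the function name `pages_per_1000` is my own):

```python
def pages_per_1000(pages: int, speakers: int) -> float:
    """Pages per 1000 speakers, rounded to two decimals as in the table."""
    return round(pages / speakers * 1000, 2)


# (language, speaker population, total number of pages) from the table.
rows = [
    ("Hebrew", 4_612_000, 11_957_314),
    ("Thai", 21_000_000, 7_752_785),
    ("Arabic", 280_000_000, 1_671_122),
]

for language, speakers, pages in rows:
    print(language, pages_per_1000(pages, speakers))
```

The ratio explains why Hebrew ranks far above Arabic here even though Arabic has many more speakers: the measure rewards page counts relative to population, not absolute size.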
Number of pages by language – Least 10
Language   Script     Speaker population   Total number of pages   No. of pages per 1000 speakers
Cebuano    Latin              15,230,000                   1,107                             0.07
Telugu     Telugu             73,000,000                   1,072                             0.01
Saraiki    Arabic             15,020,000                   1,036                             0.07
Lao        Lao                 4,000,000                     799                             0.20
Gujarati   Gujarati           44,000,000                     765                             0.02
Pashto     Arabic              9,585,000                     259                             0.03
Kannada    Kannada            33,663,000                     164                             0.00
Urdu       Arabic             54,000,000                      70                             0.00
Khmer      Khmer               7,063,200                      65                             0.01
Hani       Latin                 747,000                      63                             0.08
Multilingualism by Country Domain
The most recent version of Ethnologue lists close to seven thousand languages around the world. More than 2600 of them are spoken in the Asian region.
Large-scale linguistic diversity is observable in Asia. Among the 2600, only around 51 languages are recognized by Asian governments as official or national language(s).
Indonesia has the richest diversity of languages in the region.
It is interesting to note that there are significantly more pages in Javanese than in either Indonesian or Malay.
The major languages found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and the Philippines can be categorized into a single root, the Malay language, spoken in different dialects.
Javanese has a dominating web presence in Indonesia.
The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet.
Cross-Border Languages
Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web.
Two categories of languages were defined.
The first category is “local languages”, which are the officially recognized language(s) and home speakers’ languages of the state.
The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used as a language of communication among the peoples of different nations.
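The two-category definition can be turned into per-domain percentages along the following lines. This is a sketch under stated assumptions: the function `category_shares` is hypothetical, the page counts are made up for illustration and are not survey results, and checking the local list first is my own choice so that a cross-border language such as Arabic counts as local in a country where it is official.

```python
# Cross-border languages named on the slide.
CROSS_BORDER = {"English", "French", "Russian", "Arabic"}


def category_shares(page_counts, local_languages):
    """Return the percentage of pages per category for one country domain."""
    totals = {}
    for language, count in page_counts.items():
        if language in local_languages:
            category = "Local"      # local status takes precedence
        elif language in CROSS_BORDER:
            category = language     # e.g. "English", "Russian"
        else:
            category = "Others"
        totals[category] = totals.get(category, 0) + count
    grand_total = sum(totals.values())
    return {c: 100 * n / grand_total for c, n in totals.items()}


# Made-up page counts for a hypothetical country domain.
counts = {"Sinhala": 200, "Tamil": 100, "English": 600, "Russian": 50, "German": 50}
shares = category_shares(counts, local_languages={"Sinhala", "Tamil"})
print(shares)
```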
[Figure: Cross-Border Language Presence, West Asia. Stacked bars from 0% to 100% for Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran and Afghanistan; series: %Local, %Arabic, %Others, %Russian, %English.]
[Figure: Cross-Border Language Presence, Central Asia. Stacked bars from 0% to 100% for Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan and Mongolia; series: %Local, %Arabic, %Others, %Russian, %English.]
[Figure: Cross-Border Language Presence, South East Asia. Stacked bars from 0% to 100% for Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam and Singapore; series: %Local, %Arabic, %Others, %Russian, %English.]
[Figure: Cross-Border Language Presence, South Asia. Stacked bars from 0% to 100% for Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal and Bangladesh; series: %Local, %Arabic, %Others, %Russian, %English.]
Chinese: 普通話
Japanese (Nihongo): 日本語
Korean (Hankuko): 한국어 [韓國語]
English: English
Portuguese: Português
Arabic (Alarabia): العربية
Urdu: اردو
Sindhi: سنڌي
Pashto/Pakhto: پښتو
Dari: دری
Kashmiri: काऽशुर / كٲشُر
Punjabi/Panjabi: ਪੰਜਾਬੀ / پنجابی
Hindi: हिन्दी
Marathi: मराठी
Sanskrit: संस्कृतम्
Assamese: অসমীয়া
Bengali: বাংলা
Gujarati: ગુજરાતી
Malayalam: മലയാളം
Tamil: தமிழ்
Telugu: తెలుగు
Kannada: ಕನ್ನಡ
Maldivian (Dhivehi): ދިވެހި
Thai: ภาษาไทย
Vietnamese: Tiếng Việt
Turkish (Türkçe): Türkçe
Turkmen: түркменче
Azeri/Azerbaijani (Cyrillic): Азәрбајҹан дили
Kyrgyz: Кыргыз
Kazakh: Қазақ / قازاق
Uighur (Uyghur): Уйғур / ئۇيغۇر
Uzbek (Cyrillic): Ўзбек
Tatar: татарча / تاتارچا
Filipino (Tagalog): Tagalog
Indonesian: Indonesia
Bahasa Melayu (Malay): Bahasa Melayu
Balinese: Bahasa Bali
Fijian: vaka-Viti
Tahitian: Te Reo Tahiti
Maori: Te Reo Māori
Hawaiian: Ōlelo Hawai'i
Script Diversity of Asia
Devanagari script used by:
More than 480 million speakers: Hindi
More than 10 million speakers: Marathi, Nepali
More than 1 million speakers: Awadhi, Bhojpuri, Braj Bhasha, Chhattisgarhi, Konkani, Kachchi, Marwari, Maithili, Magahi
Scholars’ language: Sanskrit
Less than 1 million speakers: Garhwali, Mundari, Newari, Begheli, Bhatneri, Bathi, Bateri, Bhili, Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji, Khadiya, Khortha, Kului, Kumaoni, Kurku, Kurukh, Kurmali, Palpa, Panchpargania, Santali, Nagpuri, Kankan, Limbu, Sherpa
Same Script Shared by Various Languages
Script Region                         Encoded   PDFs   Images
Latin                                     253      2        1
Cyrillic                                   19      4        3
Arabic                                      1      2        7
Ideographic                                 3      0        0
Indic                                       -      7       12
Others                                      1     10        7
Speaker Population in Millions [1]      4,644    254      905
Representation of the UDHR Document by Major Script Grouping
[1] Cumulative speaker population based on Ethnologue, “Languages of the World”, 15th ed. (2005)
UDHR Document by Major Script Grouping
UTF-8 Encoding in Selected Languages
Language     UTF-8 encoded documents   Documents encoded otherwise   Examples of other encodings found [1]
Vietnamese   1,934,392 (96.4%)         72,077 (3.6%)                 TCVN, VIQR, VPS
Mongolian    48,834 (95.5%)            2,300 (4.5%)                  Latin-Cyrillic
Hindi, Bhojpuri, Magahi, Marathi, Nepali, Sanskrit, Tamang
             81,800 (78.4%)            22,544 (21.6%)                Agra, Arjun, Kiran, Kruti, Hungama, Naidunia, Shivaji, Shree, Shusha
Sinhala      4,793 (44.5%)             5,977 (55.5%)                 Metta, Kaputa
Arabic       400,933 (24.0%)           1,270,189 (76.0%)             Latin-Arabic
Telugu       178 (16.6%)               894 (83.4%)                   Shree, TLH
Tamil        566 (14.9%)               3,232 (85.1%)                 Amudham, Kumudam, Shree, Vikatan
Hebrew       1,468,344 (12.3%)         10,488,970 (87.7%)            Latin-Hebrew
Thai         207,901 (2.7%)            7,544,884 (97.3%)             TIS 620
Burmese      24 (0.7%)                 3,261 (99.3%)                 WinResearcher
Turkish      20,591 (0.5%)             3,938,737 (99.5%)             Latin-Turkish
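The percentages in this table follow directly from the two document counts in each row. The sketch below recomputes two of them and also shows one simple way to test whether a byte string is valid UTF-8. The strict-decoding check is an assumption for illustration and is not how the survey classified pages (that was done by LIM); both function names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def utf8_share(utf8_docs: int, other_docs: int) -> float:
    """Percentage of documents encoded in UTF-8, to one decimal place."""
    return round(100 * utf8_docs / (utf8_docs + other_docs), 1)


sample = "ภาษาไทย"
print(is_valid_utf8(sample.encode("utf-8")))    # same text, UTF-8 bytes
print(is_valid_utf8(sample.encode("tis-620")))  # same text, TIS-620 bytes

# Counts from the Vietnamese and Thai rows of the table.
print(utf8_share(1_934_392, 72_077))
print(utf8_share(207_901, 7_544_884))
```

One caveat of the strict-decoding check is that pure ASCII text also passes as valid UTF-8, so on its own it cannot separate UTF-8 pages from legacy pages that happen to contain only ASCII.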
To create a network of qualified Asian partners to specify and support the development of high priority Language Resources (LRs) for Asian Languages in a systematic, standards-driven, collaborative and learning context.
The project will focus on:
    identifying the state of the art of LRs in the region,
    assessing priority requirements through consultations with language research, industry and communication players, and
    establishing a protocol and standards for developing an LR Network for the languages spoken in the region.
ALRN Mission
Asian Language Resources – Agenda
ALRN Action Plan
The project will focus on South, South East, Central and West Asian languages
Act as an umbrella with Asian Language Resources (ALR)
Accommodate secure and sustainable UTF-based encoding
Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL)
Corpus collection from the web using LO’s crawler/language identifier
Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.)
Multilingual Terminology Dictionary
Information standards for language corpus building
Liaison with international organizations such as UNESCO, UDHR, etc.
Information resource sharing web site (www.language-resource.net)
Asian Academy of Languages … ?
Thank you, Danke schön, Merci, Gracias, Obrigado, Grazie, Danke, Spaciba, Ευχαριστώ
(The exact number of languages may never be determined)
Language Presence in Asian Countries
(Half of the world’s languages are spoken in only eight countries)
Language Diversity
Country   Number of Languages   Population   Official or National Languages
Indonesia 742 245,452,739 Indonesian
India 427 1,095,351,995 Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Marwari, Nepali, Oriya, Panjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu
China 241 1,313,973,713 Chinese, Zhuang, Uighur, Hmong, Hani
Philippines 180 89,468,677 Filipino, English
Malaysia 147 24,385,858 Malay
Nepal 125 28,287,147 Nepali, Gurung, Tamang
Myanmar 109 47,382,633 Burmese
Vietnam 93 84,402,966 Vietnamese
Laos 82 6,368,481 Lao
Thailand 75 64,631,595 Thai
Iran 74 68,688,433 Arabic, Farsi
Pakistan 69 165,803,560 Urdu, Panjabi, Sindhi, English
Afghanistan 45 31,056,997 Dari, Pashto
Bangladesh 38 147,365,352 Bengali
Bhutan 24 2,279,723 Dzongkha
Iraq 23 26,783,383 Arabic, Kurdi
Cambodia 19 13,881,427 Khmer
Brunei 17 379,444 Malay, English
Mongolia 12 2,832,224 Halh Mongolian
Sri Lanka 8 20,222,240 Sinhala, Tamil, English
Asian Language Recognition