The Script Encoding Initiative E-MELD August 4, 2002 Deborah Anderson, Dept. of Linguistics, UC Berkeley

The Script Encoding Initiative

E-MELD, August 4, 2002

Deborah Anderson, Dept. of Linguistics, UC Berkeley

Levels of Representation

BASE FLOOR: Character Encoding

Example: A with a ring above = hex C5
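The example above can be checked directly. The following is a minimal sketch in Python (the language choice and the variable names are assumptions, not part of the talk). It shows that the same character has one code point, hex C5, but different byte sequences depending on the encoding used to store it.

```python
import unicodedata

# "A with a ring above" as a single Unicode character.
ch = "\u00c5"

# The character's formal Unicode name and its code point.
name = unicodedata.name(ch)
code_point = ord(ch)

# The same character serialized under two different encodings.
latin1_bytes = ch.encode("latin-1")  # one byte
utf8_bytes = ch.encode("utf-8")      # two bytes

print(name, hex(code_point), latin1_bytes, utf8_bytes)
```

In ISO 8859-1 (Latin-1) the byte value coincides with the code point, which is why the slide can give a single hex value; in UTF-8 the same character takes two bytes.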

Levels of Representation

UPPER LEVEL: Higher level markup: HTML, XML, TEI, other tagsets

BASE FLOOR: Character Encoding

Problem

Euripides’ Alcestis:

Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
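The kind of corruption at issue here arises when bytes written under one encoding are read under another. A minimal sketch in Python, assuming the text was stored as UTF-8 and then misread as Latin-1 (the slide does not say which encodings were actually involved, so this pairing is an assumption):

```python
# Two words of polytonic Greek, stored as UTF-8 bytes.
original = "ὁρᾷς γὰρ"
stored = original.encode("utf-8")

# Reading those bytes as Latin-1 never fails, because every byte maps to
# some Latin-1 character, but the result is unreadable.
misread = stored.decode("latin-1")

# The damage is reversible only while no bytes have been dropped or replaced.
recovered = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(recovered == original)
```

Once software along the way substitutes or discards bytes it does not recognize, the round trip no longer works and the text has to be restored from another source.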

Background to the Problem

Different countries and/or vendors had their own character encoding systems

Interoperability was low or non-existent

Fonts were created using non-standard (ad hoc) character encodings

Unicode (www.unicode.org)

Unicode

Fully synchronized with ISO 10646

Unicode aims to be universal.

Unicode 3.2 now has nearly 95,200 encoded characters.

Those code points above the first “plane” (Basic Multilingual Plane) are encoded in UTF-16 with pairs of 16-bit units. This “surrogate” technique allows over 1 million characters to be encoded.
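The surrogate arithmetic can be sketched as follows. This is an illustrative Python sketch rather than anything shown in the talk; the function name `to_surrogates` is hypothetical, and U+10330 (GOTHIC LETTER AHSA) is chosen only as a sample code point above the Basic Multilingual Plane.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above the Basic Multilingual Plane")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


# 10 bits in each half of the pair give 1024 * 1024 supplementary code points.
SUPPLEMENTARY_CODE_POINTS = 1024 * 1024

high, low = to_surrogates(0x10330)

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
encoded = chr(0x10330).encode("utf-16-be")

print(hex(high), hex(low), encoded.hex(), SUPPLEMENTARY_CODE_POINTS)
```

The 1,048,576 supplementary code points, together with the 65,536 of the first plane, account for the "over 1 million" figure.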

Character Proposal Process

Must pass two standards bodies:

1. Unicode Technical Committee

2. ISO WG2 (ISO/IEC JTC1/SC2/WG2)

Proposals require close review

Needs letters of support from the user community (scholars/modern speakers)

Process from first proposal until final passage takes approx. 3-5 years

Current Situation

52 scripts are now covered, but 90+ remain (primarily historic and minority scripts)

Work has been almost entirely voluntary to date

Outstanding script proposals will require substantial scholarly input and contact with modern speakers (for minority scripts)

Less commercial interest in the remaining scripts

Scripts Missing from Unicode (http://www.unicode.org/sei/alpha-script-list.html)

MINORITY SCRIPTS: New Tai Lue, Lepcha, N'ko, Ol Cemet', Meithei/Manipuri, Pahawh Hmong, Cham, Saurashtra, Lanna, Tifinagh, Chakma

HISTORIC SCRIPTS: Egyptian hieroglyphs, Sumero-Akkadian cuneiform, Phoenician, Carian, Lycian, Luwian, Aztec pictographs, Mayan hieroglyphs, Avestan, Old Persian cuneiform

Sample Unicode Proposal

Proposals contain:

background to the script for the general user and implementer

a description of the characters’ properties

sample from running texts

inventory of the characters of a script: a graphic representation (a “glyph”) and a name

a list of recent reference works.

Solution: Script Encoding Initiative

April 2002: started the SEI at UC Berkeley in conjunction with the Unicode VP

Will fund Unicode proposal authors and font creators

Proposals are screened by Unicode VP

Brings the university into the international character encoding standards process

Helps promote Unicode, which will assure longevity and stability for linguistic data

Results to Date

$5,000 seed-funding from an anonymous donor

No corporate funding forthcoming; still in need of stable funding base

Have already set priorities for scripts, ranking minority scripts higher

We have created a preliminary website listing the scripts, so experts can look at the list and see the current status of proposals.

Impact of Project

Allow minority populations to access their own script for communication with others and the world

Permit online courseware in these scripts

Allow academic scholars to use the script in their research and online publications

Provide assistance for linguists working on those languages currently without a script

Help build a storehouse of information on the world’s scripts

Needs

Funding: Applied for NEH, but this requires raising matching funds

Participation from linguists to answer questions and identify problems/missing characters

Recommendation: linguists may want to include a line item for Unicode proposals in their government grant proposals

How to Help

Contact Script Encoding Initiative: [email protected]

Website: www.linguistics.berkeley.edu/~dwanders

Unicode website: www.unicode.org

Problem: How to handle characters now outside Unicode

Results of a TEI Working Group meeting:

1. use Private Use Area

2. use entities

Other alternatives: Continue to use transcription/transliteration schemes (with ASCII) or non-standard encodings
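The two TEI suggestions can be combined: a document carries named entities, and processing software expands them to Private Use Area code points. The sketch below is illustrative only. The entity names `signA` and `signB`, their code point assignments, and the function names are hypothetical, and nothing here reflects an actual TEI recommendation or tool.

```python
import re
import unicodedata

# Hypothetical entity names mapped to Private Use Area code points.
# These assignments mean nothing outside this sketch; any real project
# would have to document and share its own table.
ENTITY_TO_PUA = {"signA": "\ue000", "signB": "\ue001"}


def is_private_use(ch: str) -> bool:
    """True if the character has the Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def expand_entities(text: str) -> str:
    """Replace known &name; references; leave unknown ones untouched."""
    def substitute(match: re.Match) -> str:
        return ENTITY_TO_PUA.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


expanded = expand_entities("word &signA;&signB; &unknown;")
print([hex(ord(c)) for c in expanded if is_private_use(c)])
```

Private Use Area values have no agreed meaning between parties, so text encoded this way is only interpretable alongside the mapping table, which is one reason such schemes are a stopgap rather than a substitute for formal encoding.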