Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -

Page 1

Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -

Tsutomu SUZUKI ([email protected])

Waseda University Library

4th Hong Kong INNOPAC Users Group Meeting

December 2003

Page 2

WASEDA University Overview

Founded in 1882. Now has:

-- 10 undergraduate schools
-- 14 graduate schools
-- 5 large campus libraries & 27 small libraries
-- 2 university museums
-- 44,576 undergraduate and 6,147 graduate students (as of end of April 2002)

Page 3

Library Overview (as of March 31, 2002)

4,705,597 books (2,980,352 CJK books + 1,725,245 Western books)

49,615 journal titles (19,509 currently subscribed)

879,336 items checked out / year

ILL transactions:
-- 13,951 requests to other libraries
-- 18,491 requests received from other libraries

Total number of Central Library visits: 1,197,731 (2002.4 – 2003.3)

Page 4

Current Status of Our INNOPAC

Recent record numbers (Oct. 29, 2003) from M-I-F-S:
-- 1,752,690 bibliographic records
-- 3,434,122 item records
-- 52,133 check-in records

Public Catalog Searches from “ANALYZE patron searches”:
-- 5,149,322 searches (2002.4 - 2003.3)

Page 5

Unicode Port on WEBPAC

On November 17th, the Unicode OPAC was released to the public. (Some character code troubles still remain...)

Downloading Chinese & Korean bib data from OCLC.

Record Maintenance: AnzioWin

Number of the C & K bib records (as of 11th Nov.):
-- 15,971 bibs of Chinese materials
-- 157 bibs of Korean materials

Page 6

Appearance - Chinese record -

Page 7

Appearance - Korean record -

Page 8

Character code issues

Columns on the slide: Display, Search, Glyph (NG = no good)

Case1: Mapping Error: NG, NG
Case2: Shift_JIS to EACC issue: NG
Case3: EACC layers related issue: NG
Case4: Duplication codes in EACC: NG
Case5: Not Unified character in UNICODE: NG

Page 9

Case1: Mapping Error

The screen below shows my patron record on Millennium Circulation. One of the Katakana characters, “Zu”, is not displayed properly.

Page 10

Case1: Mapping Error

If I search for “suzuki” on the Unicode OPAC, “zu” is ignored and records for “suki” are hit.

Page 11

Case1: Mapping Error

SJIS: 253A -> EACC: 69253A -> UNICODE: (not mapped; should be 30BA)

This EACC character is NOT mapped to any UNICODE character. It should be mapped to 30BA in UNICODE.
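A minimal Python sketch of how one missing table entry could produce both symptoms: the character disappears from the display, and a search for "suzuki" behaves like a search for "suki". Only the 69253A / 30BA pair comes from the slide. The table name, the neighbouring SU and KI entries, and the drop-on-miss behaviour are assumptions for illustration, not Innopac's actual implementation. As a side check, Python's standard codecs report 253A as the JIS X 0208 code of this character and 8359 as its Shift_JIS byte pair, so the value labelled "SJIS" on the slide appears to be the JIS code.

```python
# Hypothetical EACC -> Unicode table. Only 0x69253A -> U+30BA is taken from
# the slide; the SU and KI entries are assumed (0x69 prefix + JIS X 0208 code).
EACC_TO_UNICODE = {
    0x692539: "\u30B9",  # KATAKANA LETTER SU (assumed entry)
    0x69252D: "\u30AD",  # KATAKANA LETTER KI (assumed entry)
}


def eacc_to_text(codes, table):
    """Convert EACC codes to text, silently dropping unmapped codes (assumed behaviour)."""
    return "".join(table[c] for c in codes if c in table)


SUZUKI = [0x692539, 0x69253A, 0x69252D]  # SU, ZU, KI

# With the incomplete table, ZU is lost.
broken = eacc_to_text(SUZUKI, EACC_TO_UNICODE)

# The fix proposed on the slide: map 69253A to U+30BA.
fixed_table = {**EACC_TO_UNICODE, 0x69253A: "\u30BA"}
fixed = eacc_to_text(SUZUKI, fixed_table)

# Code values of ZU according to Python's standard codecs.
zu = "\u30BA"
jis_code = zu.encode("iso2022_jp")[3:5].hex().upper()  # two bytes between the escape sequences
sjis_code = zu.encode("shift_jis").hex().upper()

print(broken, fixed, jis_code, sjis_code)
```

Dropping unmapped codes silently is what makes this kind of error hard to notice. A converter that raised an error or logged the unmapped code would surface the gap in the table at load time.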

Page 12

Case2: Shift-JIS to EACC Issue

When I search for this Han character on the Shift_JIS OPAC, Innopac returns only 9 records.

Page 13

Case2: Shift-JIS to EACC Issue

SJIS: 97E9 -> EACC: 214930

According to OCLC CJK 3.11, the EACC character "215D58" is not assigned any glyph, but the mapping from Shift_JIS to EACC works fine.

Page 14

Case2: Shift-JIS to EACC Issue

On the other hand, when I search for this Han character on the Unicode OPAC, Innopac returns more than 2,000 records!

Page 15

Case2: Shift-JIS to EACC Issue

SJIS: 97E9 -> EACC: 214930
UNICODE: 6FDB -> EACC: 455564
(No relationship between the two EACC codes)

These Shift_JIS and Unicode characters have the same glyph, but Innopac stores them at two different EACC code positions. Therefore we can NOT search for both characters at once.

Page 16

Case2: Shift-JIS to EACC Issue

SJIS: 97E9 -> EACC: 214930
UNICODE: 6FDB -> EACC: 455564

One of the solutions: change the mapping of this Shift_JIS character from 214930 to 455564.
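The effect of the proposed remapping can be sketched in Python. The code values are the ones on the slides. The pairing of UNICODE 6FDB with EACC 455564 is read off the flattened diagram. The index contents, the record ids, and the `search` helper are made-up illustrations, not the way Innopac stores its index.

```python
# Mapping values are from the slides; the index contents are made-up sample data.
SJIS_TO_EACC = {0x97E9: 0x214930}
UNICODE_TO_EACC = {0x6FDB: 0x455564}

# Hypothetical search index: EACC code -> record ids.
INDEX = {
    0x214930: ["b1"],
    0x455564: ["b2", "b3", "b4"],
}


def search(code, mapping, index):
    """Translate an input code to EACC and look it up in the index."""
    return index.get(mapping.get(code), [])


# Before the fix, the two OPACs look under different EACC codes.
before_sjis = search(0x97E9, SJIS_TO_EACC, INDEX)
before_unicode = search(0x6FDB, UNICODE_TO_EACC, INDEX)

# Proposed fix: remap the Shift_JIS character from 214930 to 455564.
fixed_sjis_to_eacc = {**SJIS_TO_EACC, 0x97E9: 0x455564}
after_sjis = search(0x97E9, fixed_sjis_to_eacc, INDEX)

print(before_sjis, before_unicode, after_sjis)
```

In this sketch, record "b1" is still filed under 214930 after the remapping, so existing records stored with the old code would presumably also need converting or re-indexing. The slides do not say how that would be handled.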

Page 17

Case3: EACC Layers Related Issue

Shift_JIS Telnet Screen Sample (my record). The data is displayed correctly.

Page 18

Case3: EACC Layers Related Issue

SJIS: 97E9 -> EACC: 215D58

In the Shift_JIS environment, there is no trouble searching for or displaying this character.

Page 19

Case3: EACC Layers Related Issue

We can see the same data properly on Millennium.

{69253a} is a separate problem, already mentioned in Case 1.

Page 20

Case3: EACC Layers Related Issue

Reviewing the same data AFTER editing an element (NOTE) on Millennium. EACC character codes are displayed directly in one of the name fields and in the address.

Page 21

Case3: EACC Layers Related Issue

We can see the data correctly on Millennium even after editing.

Page 22

Case3: EACC Layers Related Issue

SJIS: 97E9 -> EACC: 215D58
EACC: 4B5D58
UNICODE: 9234

Relationship between the two EACC codes: same code position on other layers.

Page 23

Case3: EACC Layers Related Issue

SJIS: 97E9 -> EACC: 215D58
EACC: 4B5D58 ({4B5D58}: no character assigned)
UNICODE: 9234

Relationship between the two EACC codes: same code position on other layers.

If records including this character are saved on Millennium, this Han character is NOT stored as the original EACC code (215D58).
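The "same code position on other layers" relationship can be checked mechanically. The sketch below assumes that a three-byte EACC code consists of a first byte selecting the plane (and therefore the layer) followed by a two-byte position, which is how the 215D58 / 4B5D58 pair on the slide reads. The function names are hypothetical.

```python
def split_eacc(code):
    """Split a 3-byte EACC code into (first byte, remaining two bytes)."""
    return code >> 16, code & 0xFFFF


def same_position_other_layer(a, b):
    """True when two codes differ only in the first byte."""
    (plane_a, pos_a), (plane_b, pos_b) = split_eacc(a), split_eacc(b)
    return plane_a != plane_b and pos_a == pos_b


# Case 3 pair from the slide: only the first byte differs.
case3 = same_position_other_layer(0x215D58, 0x4B5D58)

# Case 2 and Case 4 pairs for contrast: the positions differ as well.
case2 = same_position_other_layer(0x214930, 0x455564)
case4 = same_position_other_layer(0x21442D, 0x276163)

print(case3, case2, case4)
```

A check like this could be run over records after saving to flag codes that have drifted to another layer, such as a 215D58 that comes back as 4B5D58.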

Page 24

Case4: Duplication codes in EACC

Page 25

Case4: Duplication codes in EACC

There are more than 1,000 records for “matsu” on the Shift_JIS OPAC.

Page 26

Case4: Duplication codes in EACC

There is ONLY one record for “matsu” on the Unicode OPAC.

(The screen below shows the direct hit result.)

Page 27

Case4: Duplication codes in EACC

SJIS: 8FBC
UNICODE: 677E
EACC: 21442D
EACC: 276163

We can DISPLAY both 21442D and 276163 in the Unicode OPAC, but only 276163 is searchable.

Because of this EACC code duplication, the search results are NOT the same between the Shift_JIS OPAC and the Unicode OPAC.
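One way to make both OPACs return the same result is to fold the duplicate codes onto a single code both when indexing and when searching. The Python sketch below illustrates that idea. The two EACC codes are from the slide, but the choice of 21442D as the canonical code, the sample records, and all function names are assumptions.

```python
# The two EACC codes are from the slide; which one is treated as canonical
# is an arbitrary choice made for this sketch.
EACC_CANONICAL = {0x276163: 0x21442D}


def canonical(code):
    """Fold a duplicate EACC code onto its canonical code."""
    return EACC_CANONICAL.get(code, code)


def build_index(records):
    """records: iterable of (record_id, [EACC codes]); returns code -> set of ids."""
    index = {}
    for record_id, codes in records:
        for code in codes:
            index.setdefault(canonical(code), set()).add(record_id)
    return index


def search(index, code):
    """Look up a code after folding it the same way the index was built."""
    return index.get(canonical(code), set())


# Made-up sample data: two records stored with 21442D, one with 276163.
RECORDS = [("b1", [0x21442D]), ("b2", [0x21442D]), ("b3", [0x276163])]
index = build_index(RECORDS)

hits_a = search(index, 0x21442D)
hits_b = search(index, 0x276163)
print(sorted(hits_a), sorted(hits_b))
```

Because folding happens at both ends, the stored records keep their original codes for display while either code finds all three records.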

Page 28

Case5: Not Unified characters in UNICODE

Do you think these two characters are the same or not?

UNICODE: 5618    UNICODE: 5653

Page 29

Case5: Not Unified characters in UNICODE

The result of searching “uso” on the Shift_JIS OPAC.

Page 30

Case5: Not Unified characters in UNICODE

The same search on the Unicode OPAC. The result does not seem correct.

Page 31

Case5: Not Unified Characters in UNICODE

When the other “uso” is input by picking it from the code table, the result is the same as on the Shift_JIS OPAC.

Page 32

Case5: Not Unified Characters in UNICODE

SJIS: 8952 -> EACC: 21373B
UNICODE: 5653 -> EACC: 21373B
UNICODE: 5618 -> NOT HIT!

Page 33

Case5: Not Unified Characters in UNICODE

SJIS: 8952 -> EACC: 21373B
UNICODE: 5653 -> EACC: 21373B
UNICODE: 5618

This 5618 should be normalized as 5653 in searching.
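The standard Unicode normalization forms do not merge these two code points, so a search system needs its own variant-folding table on top of them. The Python sketch below shows the idea with the single pair from the slide. The table and the `search_key` function are hypothetical, and a real table would need many more entries.

```python
import unicodedata

# Custom fold table: U+5618 -> U+5653 follows the slide's proposal.
VARIANT_FOLD = str.maketrans({"\u5618": "\u5653"})


def search_key(text):
    """Apply standard NFKC normalization, then the custom variant folding."""
    return unicodedata.normalize("NFKC", text).translate(VARIANT_FOLD)


a, b = "\u5618", "\u5653"

# NFKC alone leaves the two characters distinct.
nfkc_equal = unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# With the custom folding both produce the same search key.
folded_equal = search_key(a) == search_key(b)

# U+5618 is the character that Shift_JIS encodes as 8952.
sjis_bytes = a.encode("shift_jis").hex().upper()

print(nfkc_equal, folded_equal, sjis_bytes)
```

Applying the same `search_key` to both the indexed text and the query keeps the stored data unchanged while making the two forms match in searching.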

Page 34

Normalization issue

Some special characters are ignored when searching on the Unicode OPAC. In this sample, the “Cho-on”, the Japanese prolonged sound symbol, does not work.

This search means “Harry Potter” in Katakana form.
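The slides do not say why the prolonged sound symbol is dropped. One way such a symptom could arise is a search-key routine that keeps only characters of Unicode category Lo (other letter): the Cho-on, U+30FC, has category Lm (modifier letter), so it would be filtered out together with punctuation. The Python sketch below contrasts that with a routine that keeps every letter category. Both functions and the exact query string are assumptions for illustration.

```python
import unicodedata


def key_letters_only(text):
    """Keep only characters in category Lo; this drops U+30FC (category Lm)."""
    return "".join(ch for ch in text if unicodedata.category(ch) == "Lo")


def key_keep_choon(text):
    """Keep every letter category (L*), so the prolonged sound mark survives."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


# "Harry Potter" in Katakana, with a middle dot between the two words (assumed form).
query = "\u30CF\u30EA\u30FC\u30FB\u30DD\u30C3\u30BF\u30FC"

choon_category = unicodedata.category("\u30FC")
lossy_key = key_letters_only(query)
better_key = key_keep_choon(query)

print(choon_category, lossy_key, better_key)
```

Both routines drop the middle dot, which is punctuation, but only the second keeps the Cho-on, so words that differ only by a long vowel stay distinct.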

Page 35

Example of NOT unified characters (Case5)

Unicode: 6236, 6237, 6238

Page 36

Related Documents & Information

The Library of Congress Homepage: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media -- CHARACTER SETS: Part 3 -- Code Table 9: EAST ASIAN (June 16, 2003)
http://www.loc.gov/marc/specifications/specchareacc.html

The Unicode Standard, Version 3.0. The Unicode Consortium. ISBN 0201616335 (Version 4.0 has now been released)

OCLC CJK and its contents in HELP
http://www.oclc.org/cjk/

Page 37

Unicode OPAC in Japan

University of Tokyo: Multilingual OPAC, the University of Tokyo
http://mulopac.dl.itc.u-tokyo.ac.jp/

National Diet Library: NDL Asian Language Materials OPAC
http://asiaopac.ndl.go.jp/index_e.html

Page 38

Thank you!!

The Best Solution

Unicode + normalization scheme