Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -



1

Character Codes Related Problems- UNICODE OPAC and Millennium at WASEDA Univ. Library -

Tsutomu SUZUKI (tsutomu@waseda.jp)

Waseda University Library

4th Hong Kong INNOPAC Users Group Meeting

December 2003

2

WASEDA University Overview

Founded in 1882. Now has:

-- 10 undergraduate schools
-- 14 graduate schools
-- 5 large campus libraries & 27 small libraries
-- 2 university museums
-- 44,576 undergraduate and 6,147 graduate students (as of end April, 2002)

3

Library Overview (as of March 31, 2002)

4,705,597 books (2,980,352 CJK books + 1,725,245 Western books)

49,615 journal titles (19,509 current subscriptions)

879,336 items checked out / year

ILL transactions:
-- 13,951 requests to other libraries
-- 18,491 requests received from other libraries

Total number of Central Library visits: 1,197,731 (2002.4 – 2003.3)

4

Current Status of Our INNOPAC

Recent record numbers (Oct. 29, 2003) from M-I-F-S

-- 1,752,690 bibliographic records
-- 3,434,122 item records
-- 52,133 check-in records

Public Catalog Searches from “ANALYZE patron searches”

5,149,322 searches (2002.4- 2003.3)

5

Unicode Port on WEBPAC

On November 17th, the Unicode OPAC was released to the public. (Some character code troubles still remain.)

Downloading Chinese & Korean bib data from OCLC.

Record Maintenance: AnzioWin

Number of C & K bib records (as of 11th Nov.):
-- 15,971 bibs of Chinese materials
-- 157 bibs of Korean materials

6

Appearance - Chinese record -

7

Appearance - Korean record -

8

Character code issues (table columns: Display / Search / Glyph)

Case1: Mapping Error - NG, NG

Case2: Shift_JIS to EACC issue - NG

Case3: EACC layers related issue - NG

Case4: Duplication codes in EACC - NG

Case5: Not Unified character in UNICODE - NG

9

Case1: Mapping Error

The screen below shows my patron record on Millennium Circulation. One of the Katakana characters, “Zu”, is not displayed properly.

10

Case1: Mapping Error

If I search for “suzuki” on the Unicode OPAC, “zu” is ignored and records for “suki” are hit.

11

Case1: Mapping Error

SJIS: 253A EACC: 69253A

SJIS EACC UNICODE

This EACC character is NOT mapped to any UNICODE character. It should be mapped to 30BA in UNICODE.

UNICODE:30BA
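
The effect of the missing mapping can be reproduced with a toy lookup table. This is a minimal sketch, assuming the converter silently skips any EACC code that has no Unicode entry; only the pair 69253A / U+30BA comes from the slide, while the function name and the EACC codes used for "su" and "ki" are assumptions made for illustration.

```python
# Toy EACC -> Unicode table. Only 69253A / U+30BA comes from the slide;
# the two entries below use assumed EACC codes for illustration.
EACC_TO_UNICODE = {
    "692539": "\u30b9",  # KATAKANA LETTER SU (assumed code)
    "69252D": "\u30ad",  # KATAKANA LETTER KI (assumed code)
    # "69253A" (KATAKANA LETTER ZU) is missing: this is the mapping error
}


def eacc_to_text(codes, table):
    """Convert EACC codes to text, silently skipping unmapped codes."""
    return "".join(table[c] for c in codes if c in table)


suzuki = ["692539", "69253A", "69252D"]  # su, zu, ki

broken = eacc_to_text(suzuki, EACC_TO_UNICODE)
fixed = eacc_to_text(suzuki, {**EACC_TO_UNICODE, "69253A": "\u30ba"})

print(broken)  # "zu" is dropped, leaving "su" + "ki"
print(fixed)   # all three characters survive
```

With the entry missing, both display and search see only "suki", which matches the behaviour shown on the two previous slides.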

12

Case2: Shift-JIS to EACC Issue

When I search for this hanji on the Shift_JIS OPAC, Innopac returns only 9 records.

13

Case2: Shift-JIS to EACC Issue

SJIS: 97E9 EACC: 214930

SJIS EACC UNICODE

According to OCLC CJK 3.11, the EACC character “215D58” is not assigned any glyph, but the mapping from Shift_JIS to EACC works fine.

14

Case2: Shift-JIS to EACC Issue

On the other hand, when I searched for this hanji on the Unicode OPAC, Innopac returned more than 2,000 records!

15

Case2: Shift-JIS to EACC Issue

UNICODE: 6FDB

EACC: 214930

SJIS EACC UNICODE

These Shift_JIS and Unicode characters have the same glyph, but Innopac stores them at two different EACC code positions. Therefore we can NOT search for both characters at once.

SJIS: 97E9

EACC: 455564

No relationship

16

Case2: Shift-JIS to EACC Issue

UNICODE: 6FDB

EACC: 214930

SJIS EACC UNICODE

SJIS: 97E9

EACC: 455564

One of the solutions

Change the mapping of this Shift_JIS character from 214930 to 455564.
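
A minimal sketch of why the two OPACs disagree and what the proposed remapping changes. It assumes the search index is keyed by EACC code, that the Shift_JIS character currently maps to 214930 and the Unicode character to 455564 (as the proposed solution implies), and it uses placeholder record ids rather than the real result counts. All names are hypothetical.

```python
# Hypothetical index: EACC code -> ids of records containing that code.
INDEX = {
    "214930": ["b1", "b2"],        # records keyed under the first code
    "455564": ["b3", "b4", "b5"],  # records keyed under the second code
}

# Input-side mappings. The pairing of codes is an assumption read off the
# slides, not a verified conversion table.
SJIS_TO_EACC = {"97E9": "214930"}
UNICODE_TO_EACC = {"6FDB": "455564"}


def search(code, mapping, index):
    """Map an input code point to EACC, then look it up in the index."""
    return index.get(mapping[code], [])


before_sjis = search("97E9", SJIS_TO_EACC, INDEX)
before_uni = search("6FDB", UNICODE_TO_EACC, INDEX)

# Proposed fix: point the Shift_JIS character at the same EACC code.
SJIS_FIXED = {**SJIS_TO_EACC, "97E9": "455564"}
after_sjis = search("97E9", SJIS_FIXED, INDEX)

print(before_sjis != before_uni)  # the two OPACs disagree
print(after_sjis == before_uni)   # after remapping they agree
```

In this sketch the records already stored under 214930 are no longer reachable after the remap, so existing data would presumably need converting as well.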

17

Case3: EACC Layers Related Issue

Shift_JIS Telnet Screen Sample (my record). The data is displayed correctly.

18

Case3: EACC Layers Related Issue

SJIS: 97E9 EACC: 215D58

SJIS EACC UNICODE

In the Shift_JIS environment, there is no trouble searching for or displaying this character.

19

Case3: EACC Layers Related Issue

We can see the same data properly on Millennium.

{69253a} is a separate problem, already mentioned in Case 1.

20

Case3: EACC Layers Related Issue

Reviewing the same data AFTER editing an element (NOTE) on Millennium. EACC character codes are displayed directly in one of the name fields and in the address.

21

Case3: EACC Layers Related Issue

We can see the data correctly on Millennium even after editing.

22

Case3: EACC Layers Related Issue

SJIS: 97E9 EACC: 215D58

EACC: 4B5D58

SJIS EACC UNICODE

UNICODE: 9234

Relationship

Same code position on other layers

23

Case3: EACC Layers Related Issue

SJIS: 97E9 EACC: 215D58

EACC: 4B5D58

SJIS EACC UNICODE

UNICODE: 9234

No character assigned

{4B5D58}

If records including this character are saved on Millennium, this hanji is NOT stored with its original EACC code (215D58).

Relationship

Same code position on other layers
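
The "same code position on other layers" relationship can be checked mechanically. This is a minimal sketch, assuming an EACC code is written as six hexadecimal digits in which the first byte selects the layer and the remaining two bytes give the position within it; the helper name is hypothetical.

```python
def split_eacc(code):
    """Split a six-hex-digit EACC code into (layer, position)."""
    code = code.strip("{}").upper()
    if len(code) != 6:
        raise ValueError(f"expected 6 hex digits, got {code!r}")
    int(code, 16)  # raises ValueError if the string is not hexadecimal
    return code[:2], code[2:]


original = split_eacc("215D58")   # code stored via Shift_JIS
resaved = split_eacc("{4B5D58}")  # code seen after saving on Millennium

same_position = original[1] == resaved[1]
same_layer = original[0] == resaved[0]
print(original, resaved, same_position, same_layer)
```

The two codes share the position 5D58 and differ only in the layer byte, which is the pattern the slide labels "Relationship".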

24

Case4: Duplication codes in EACC

25

Case4: Duplication codes in EACC

There are more than 1,000 records for “matsu” on the Shift_JIS OPAC.

26

Case4: Duplication codes in EACC

There is ONLY one record for “matsu” on the Unicode OPAC.

(The screen below shows the direct-hit result.)

27

Case4: Duplication codes in EACC

UNICODE: 677E

EACC: 21442D

SJIS EACC UNICODE

We can DISPLAY both 21442D and 276163 in the Unicode OPAC, but only 276163 is searchable.

Because of this EACC code duplication, the search results are NOT the same between the Shift_JIS OPAC and the Unicode OPAC.

SJIS: 8FBC

EACC: 276163
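
One way to picture the duplicate-code problem: display uses a many-to-one table (EACC to Unicode), while search needs the inverse, and a one-to-one inverse can keep only one EACC code per character. The sketch below assumes that is what happens; the index contents and function names are hypothetical, and the record ids are placeholders.

```python
from collections import defaultdict

# Both EACC codes display as U+677E (from the slide).
EACC_TO_UNICODE = {"21442D": "\u677e", "276163": "\u677e"}

# Hypothetical index: EACC code -> record ids (counts are placeholders).
INDEX = {"21442D": ["b1", "b2", "b3"], "276163": ["b4"]}

# A one-to-one inverse keeps a single EACC code per Unicode character;
# later entries overwrite earlier ones, so only 276163 survives here.
single_inverse = {u: e for e, u in EACC_TO_UNICODE.items()}

# A one-to-many inverse keeps every EACC code for the character.
multi_inverse = defaultdict(list)
for eacc, uni in EACC_TO_UNICODE.items():
    multi_inverse[uni].append(eacc)


def search_single(ch):
    """Search using only the one EACC code kept by the 1:1 inverse."""
    return INDEX.get(single_inverse[ch], [])


def search_all(ch):
    """Search under every EACC code that displays as this character."""
    hits = []
    for eacc in multi_inverse[ch]:
        hits.extend(INDEX.get(eacc, []))
    return hits


print(search_single("\u677e"))  # only the record under 276163
print(search_all("\u677e"))     # records under both codes
```

Expanding the query to every duplicate code, or folding the duplicates to one code at indexing time, would make the two OPACs return the same set.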

28

Case5: Not Unified characters in UNICODE

Do you think these two characters are the same or not?

UNICODE: 5618 UNICODE: 5653

29

The result of searching “uso” on Shift_JIS OPAC.

Case5: Not Unified characters in UNICODE

30

The same search on the Unicode OPAC. The result does not seem correct.

Case5: Not Unified characters in UNICODE

31

Case5: Not Unified Characters in UNICODE

When the other “uso” is entered by picking it from the code table, the result is the same as on the Shift_JIS OPAC.

32

Case5: Not Unified Characters in UNICODE

UNICODE: 5618

EACC: 21373B

SJIS EACC UNICODE

UNICODE: 5653

SJIS: 8952

NOT HIT!

33

Case5: Not Unified Characters in UNICODE

UNICODE: 5618

EACC: 21373B

SJIS EACC UNICODE

UNICODE: 5653

SJIS: 8952

This 5618 should be normalized as 5653 in searching.
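
The folding proposed here can be sketched as a small variant table applied to both the indexed text and the query. This is an illustration only: the table holds just the one pair from the slide, and the function name is hypothetical. The sketch also shows that standard Unicode normalization (NFKC) leaves the two characters distinct, which is why a separate folding step is needed.

```python
import unicodedata

# Variant folding table: U+5618 -> U+5653, as proposed on the slide.
# A real table would need many more pairs; this one entry is illustrative.
VARIANT_FOLD = {0x5618: 0x5653}


def search_key(text):
    """Fold known variant characters so both forms share one search key."""
    return text.translate(VARIANT_FOLD)


# Standard normalization does not unify the two code points.
nfkc_a = unicodedata.normalize("NFKC", "\u5618")
nfkc_b = unicodedata.normalize("NFKC", "\u5653")

print(nfkc_a == nfkc_b)                              # False
print(search_key("\u5618") == search_key("\u5653"))  # True
```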

34

Normalization issue

Some special characters are ignored when searching on the Unicode OPAC. In this sample, “Cho-on”, the Japanese prolonged sound symbol, does not work.

This search means “Harry Potter” in Katakana form.
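
A minimal sketch of a search-key builder that keeps the prolonged sound mark. It assumes the key is built by dropping everything except letters and digits; under that rule U+30FC survives, because Unicode classifies it as a letter rather than punctuation. The function name and the exact Katakana spelling (with a middle dot between the two words) are assumptions.

```python
import unicodedata


def search_key(text):
    """Keep letters and digits (Unicode categories L* and N*) only."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] in ("L", "N")
    )


# "Harry Potter" in Katakana, written with a middle dot (assumed spelling).
title = "\u30cf\u30ea\u30fc\u30fb\u30dd\u30c3\u30bf\u30fc"

key = search_key(title)
print(unicodedata.category("\u30fc"))  # Lm: the prolonged sound mark is a letter
print(unicodedata.category("\u30fb"))  # the middle dot is in a punctuation category
print(key)                             # the middle dot is removed, both Cho-on marks stay
```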

35

Example of NOT unified characters (Case5)

Unicode: 6236, 6237, 6238

36

Related Documents & Information

The Library of Congress Homepage: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media -- CHARACTER SETS: Part 3 -- Code Table 9: EAST ASIAN (June 16, 2003)
http://www.loc.gov/marc/specifications/specchareacc.html

The Unicode Standard, Version 3.0. The Unicode Consortium. ISBN 0201616335 (Version 4.0 has since been released)

OCLC CJK and its contents in HELP
http://www.oclc.org/cjk/

37

Unicode OPAC in Japan

University of Tokyo: Multilingual OPAC, the University of Tokyo
http://mulopac.dl.itc.u-tokyo.ac.jp/

National Diet Library: NDL Asian Language Materials OPAC
http://asiaopac.ndl.go.jp/index_e.html

38

Thank you!!

The Best Solution

Unicode + normalization scheme