Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -
1
Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -
Tsutomu SUZUKI
Waseda University Library
4th Hong Kong INNOPAC Users Group Meeting
December 2003
2
WASEDA University Overview
Founded in 1882. Now has:
-- 10 undergraduate schools
-- 14 graduate schools
-- 5 large campus libraries & 27 small libraries
-- 2 university museums
-- 44,576 undergraduate and 6,147 graduate students (as of end of April 2002)
3
Library Overview (as of March 31, 2002)
4,705,597 books (2,980,352 CJK books + 1,725,245 Western books)
49,615 journal titles (19,509 current subscriptions)
879,336 items checked out / year
ILL transactions:
-- 13,951 requests to other libraries
-- 18,491 requests received from other libraries
Total number of Central Library visits: 1,197,731 (2002.4 – 2003.3)
4
Current Status of Our INNOPAC
Recent record numbers (Oct. 29, 2003) from M-I-F-S:
-- 1,752,690 bibliographic records
-- 3,434,122 item records
-- 52,133 check-in records
Public catalog searches from “ANALYZE patron searches”:
-- 5,149,322 searches (2002.4 - 2003.3)
5
Unicode Port on WEBPAC
On November 17th, the Unicode OPAC was released to the public. (Some character code troubles still remain.)
Downloading Chinese & Korean bib data from OCLC.
Record maintenance: AnzioWin
Number of C & K bib records (as of Nov. 11):
-- 15,971 bibs of Chinese materials
-- 157 bibs of Korean materials
6
Appearance - Chinese record -
7
Appearance - Korean record -
8
Character code issues (columns: Display, Search, Glyph)
Case1 Mapping Error: NG NG
Case2 Shift_JIS to EACC issue: NG
Case3 EACC layers related issue: NG
Case4 Duplication codes in EACC: NG
Case5 Not Unified character in UNICODE: NG
9
Case1: Mapping Error
The screen below shows my patron record in Millennium Circulation. One of the Katakana characters, “Zu”, is not displayed properly.
10
Case1: Mapping Error
If I search for “suzuki” on the Unicode OPAC, “zu” is ignored and records for “suki” are hit.
11
Case1: Mapping Error
SJIS: 253A EACC: 69253A
SJIS EACC UNICODE
This EACC character is NOT mapped to any UNICODE character. It should be mapped to 30BA in UNICODE.
UNICODE:30BA
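The effect of the missing mapping can be illustrated with a small lookup table. This is only a sketch: `EACC_TO_UNICODE` and `eacc_to_text` are hypothetical names, not Innopac internals, and it assumes that unmapped codes are silently dropped, which matches the “suki” result on slide 10 but is not confirmed. Only the 69253A to 30BA pair comes from the slide; the other two entries assume the same "69 + JIS code" pattern.

```python
# Hypothetical EACC -> Unicode table. The entry for "zu" (69253A) is missing,
# as described on the slide. The "su" and "ki" entries are assumptions.
EACC_TO_UNICODE = {
    0x692539: 0x30B9,  # KATAKANA LETTER SU (assumed)
    0x69252D: 0x30AD,  # KATAKANA LETTER KI (assumed)
}


def eacc_to_text(codes, table):
    """Convert EACC codes to text, silently dropping unmapped codes."""
    return "".join(chr(table[c]) for c in codes if c in table)


suzuki = [0x692539, 0x69253A, 0x69252D]  # su, zu, ki

# With the gap in the table, "zu" disappears and only "su" + "ki" remain.
broken = eacc_to_text(suzuki, EACC_TO_UNICODE)

# Fix proposed on the slide: map EACC 69253A to U+30BA.
fixed_table = {**EACC_TO_UNICODE, 0x69253A: 0x30BA}
fixed = eacc_to_text(suzuki, fixed_table)

print(broken)
print(fixed)
```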
12
Case2: Shift-JIS to EACC Issue
When I search for this hanji on the Shift_JIS OPAC, Innopac returns only 9 records.
13
Case2: Shift-JIS to EACC Issue
SJIS: 97E9 EACC: 214930
SJIS EACC UNICODE
According to OCLC CJK 3.11, the EACC character “215D58” is not assigned any glyph, but the mapping from Shift_JIS to EACC works fine.
14
Case2: Shift-JIS to EACC Issue
On the other hand, when I searched for this hanji on the Unicode OPAC, Innopac returned more than 2,000 records!
15
Case2: Shift-JIS to EACC Issue
UNICODE: 6FDB
EACC: 214930
SJIS EACC UNICODE
These Shift_JIS and Unicode characters have the same glyph, but Innopac stores them at two different EACC code positions. Therefore we can NOT search for both characters at once.
SJIS: 97E9
EACC: 455564
No relationship
16
Case2: Shift-JIS to EACC Issue
UNICODE: 6FDB
EACC: 214930
SJIS EACC UNICODE
SJIS: 97E9
EACC: 455564
One of the solutions
Change the mapping of this Shift_JIS character from 214930 to 455564.
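A minimal sketch of why the two OPACs return different hit counts, and what the remapping changes. It assumes, reading slides 13 to 16 together, that Shift_JIS 97E9 currently maps to EACC 214930 while Unicode 6FDB maps to EACC 455564. The table names, the toy `index`, and the record ids are all hypothetical.

```python
# Hypothetical mapping tables (an assumed reading of the slides).
SJIS_TO_EACC = {0x97E9: 0x214930}
UNICODE_TO_EACC = {0x6FDB: 0x455564}

# Toy index keyed by EACC code: ids of records containing that character.
index = {
    0x214930: {"rec-a"},
    0x455564: {"rec-b", "rec-c"},
}


def search(code, table):
    """Return the records indexed under the EACC code that `code` maps to."""
    return index.get(table.get(code), set())


# Before the change, the same glyph finds different records per OPAC.
before_sjis = search(0x97E9, SJIS_TO_EACC)
before_unicode = search(0x6FDB, UNICODE_TO_EACC)

# Proposed change: point the Shift_JIS character at 455564 instead.
remapped = {**SJIS_TO_EACC, 0x97E9: 0x455564}
after_sjis = search(0x97E9, remapped)

print(sorted(before_sjis), sorted(before_unicode), sorted(after_sjis))
```

In this sketch, records already stored under 214930 are no longer reachable after the remapping, so existing data would presumably need converting as well.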
17
Case3: EACC Layers Related Issue
Shift_JIS Telnet screen sample (my record). The data is displayed correctly.
18
Case3: EACC Layers Related Issue
SJIS: 97E9 EACC: 215D58
SJIS EACC UNICODE
In the Shift_JIS environment, there is no trouble searching for or displaying this character.
19
Case3: EACC Layers Related Issue
We can see the same data properly on Millennium.
{69253a} is a separate problem, already mentioned in Case 1.
20
Case3: EACC Layers Related Issue
Reviewing the same data AFTER editing an element (NOTE) on Millennium. EACC character codes are displayed directly in one of the name fields and in the address.
21
Case3: EACC Layers Related Issue
We can see the data correctly on Millennium even after editing.
22
Case3: EACC Layers Related Issue
SJIS: 97E9 EACC: 215D58
EACC: 4B5D58
SJIS EACC UNICODE
UNICODE: 9234
Relationship
Same code position on other layers
23
Case3: EACC Layers Related Issue
SJIS: 97E9 EACC: 215D58
EACC: 4B5D58
SJIS EACC UNICODE
UNICODE: 9234
No character assigned
{4B5D58}
If a record including this character is saved on Millennium, this hanji is NOT stored with its original EACC code (215D58).
Relationship
Same code position on other layers
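The "same code position on other layers" relationship can be checked mechanically. The sketch below treats the first byte of a 3-byte EACC code as the layer and the last two bytes as the position; `split_eacc` and `same_position` are hypothetical helper names. It also cross-checks the slide's Shift_JIS and Unicode values with Python's built-in `shift_jis` codec.

```python
def split_eacc(code):
    """Split a 3-byte EACC code into (layer byte, 2-byte position)."""
    return code >> 16, code & 0xFFFF


def same_position(a, b):
    """True when two EACC codes differ at most in the layer byte."""
    return split_eacc(a)[1] == split_eacc(b)[1]


original = 0x215D58    # code stored in the Shift_JIS environment
after_edit = 0x4B5D58  # code shown after saving on Millennium

layer_a, pos_a = split_eacc(original)
layer_b, pos_b = split_eacc(after_edit)
print(f"{layer_a:02X} {pos_a:04X}")
print(f"{layer_b:02X} {pos_b:04X}")
print(same_position(original, after_edit))

# Cross-check: U+9234 encodes to Shift_JIS 97E9, as on the slide.
sjis_bytes = "\u9234".encode("shift_jis")
print(sjis_bytes.hex().upper())
```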
24
Case4: Duplication codes in EACC
25
Case4: Duplication codes in EACC
There are more than 1,000 records for “matsu” on the Shift_JIS OPAC.
26
Case4: Duplication codes in EACC
There is ONLY one record for “matsu” on the Unicode OPAC.
(The screen below shows the direct hit result.)
27
Case4: Duplication codes in EACC
UNICODE: 677E
EACC: 21442D
SJIS EACC UNICODE
We can DISPLAY both 21442D and 276163 in Unicode OPAC, but only 276163 is searchable.
Because of this EACC code duplication, the search results are NOT the same between the Shift_JIS OPAC and the Unicode OPAC.
SJIS: 8FBC
EACC: 276163
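One possible way to make both codes searchable is to fold duplicate EACC codes onto a single canonical code at both index time and query time. This is a sketch under assumptions: `EACC_DUPLICATES`, `canonical`, and `build_index` are hypothetical names, and choosing 276163 as the canonical code is arbitrary (it is simply the one the slide reports as searchable).

```python
# Hypothetical fold table: duplicate EACC code -> canonical EACC code.
EACC_DUPLICATES = {0x21442D: 0x276163}


def canonical(code):
    """Return the canonical EACC code for `code` (itself if not a duplicate)."""
    return EACC_DUPLICATES.get(code, code)


def build_index(records):
    """Index record ids under the canonical form of each EACC code."""
    index = {}
    for rec_id, codes in records.items():
        for code in codes:
            index.setdefault(canonical(code), set()).add(rec_id)
    return index


# Two toy records that store "matsu" under different EACC codes.
records = {"rec-1": [0x21442D], "rec-2": [0x276163]}
index = build_index(records)

# Either code now finds both records.
hits_a = index.get(canonical(0x21442D), set())
hits_b = index.get(canonical(0x276163), set())
print(sorted(hits_a), sorted(hits_b))
```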
28
Case5: Not Unified characters in UNICODE
Do you think these two characters are the same or not?
UNICODE: 5618 UNICODE: 5653
29
The result of searching “uso” on Shift_JIS OPAC.
Case5: Not Unified characters in UNICODE
30
The same search on the Unicode OPAC. The result does not seem correct.
Case5: Not Unified characters in UNICODE
31
Case5: Not Unified Characters in UNICODE
If we input the other “uso” by picking it from the code table, the result is the same as on the Shift_JIS OPAC.
32
Case5: Not Unified Characters in UNICODE
UNICODE: 5618
EACC: 21373B
SJIS EACC UNICODE
UNICODE: 5653
SJIS: 8952
NOT HIT
!
33
Case5: Not Unified Characters in UNICODE
UNICODE: 5618
EACC: 21373B
SJIS EACC UNICODE
UNICODE: 5653
SJIS: 8952
This 5618 should be normalized as 5653 in searching.
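A minimal sketch of such search-time normalization, using Python's standard `unicodedata` module. Standard Unicode normalization (NFKC here) leaves 5618 and 5653 distinct, so an extra fold table is needed. `VARIANT_FOLD` and `search_key` are hypothetical names, and the table holds only the single pair from this slide.

```python
import unicodedata

# Hypothetical search-time fold table: variant form -> preferred form.
VARIANT_FOLD = str.maketrans({"\u5618": "\u5653"})


def search_key(text):
    """Apply NFKC, then fold variants that NFKC leaves distinct."""
    return unicodedata.normalize("NFKC", text).translate(VARIANT_FOLD)


# NFKC alone does not unify the two characters.
nfkc_only = unicodedata.normalize("NFKC", "\u5618")
print(nfkc_only == "\u5653")

# With the fold table, both forms produce the same search key.
print(search_key("\u5618") == search_key("\u5653"))
```

Applying the same `search_key` to both the indexed text and the query is what makes the two forms match.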
34
Normalization issue
Some special characters are ignored when searching on the Unicode OPAC. In this sample, “Cho-on”, the Japanese prolonged sound symbol, does not work.
The search term here is “Harry Potter” in Katakana form.
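How Innopac builds its search keys is not shown in the slides. As one plausible illustration only, the sketch below shows how a key builder that keeps just "other letter" (Lo) characters would drop Cho-on, because U+30FC belongs to the "modifier letter" (Lm) category. `naive_key` and `letter_key` are hypothetical names.

```python
import unicodedata

TITLE = "ハリー・ポッター"  # "Harry Potter" in Katakana

CHO_ON = "\u30fc"      # KATAKANA-HIRAGANA PROLONGED SOUND MARK
MIDDLE_DOT = "\u30fb"  # KATAKANA MIDDLE DOT


def naive_key(text):
    """Keep only 'Lo' characters; this also drops Cho-on (category 'Lm')."""
    return "".join(ch for ch in text if unicodedata.category(ch) == "Lo")


def letter_key(text):
    """Keep every letter category, so Cho-on survives; punctuation is dropped."""
    return "".join(
        ch for ch in text if unicodedata.category(ch).startswith("L")
    )


print(unicodedata.category(CHO_ON))
print(naive_key(TITLE))
print(letter_key(TITLE))
```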
35
Example of NOT unified characters (Case5)
Unicode: 6236, 6237, 6238
36
Related Documents & Information
The Library of Congress Homepage: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media -- CHARACTER SETS: Part 3 -- Code Table 9: EAST ASIAN (June 16, 2003) http://www.loc.gov/marc/specifications/specchareacc.html
The Unicode Standard Version 3.0. The Unicode Consortium. ISBN 0201616335 (Version 4.0 has now been released)
OCLC CJK and its contents in HELP http://www.oclc.org/cjk/
37
Unicode OPAC in Japan
University of Tokyo: Multilingual OPAC, the University of Tokyo http://mulopac.dl.itc.u-tokyo.ac.jp/
National Diet Library: NDL Asian Language Materials OPAC http://asiaopac.ndl.go.jp/index_e.html
38
Thank you!!
The Best Solution
Unicode + normalization scheme