Indigenous Language Newspapers as a Digital Library Collection: The Niupepa Example Department of...
1
Indigenous Language Newspapers as a Digital Library Collection:
The Niupepa Example
Department of Computer Science
Te Taka Keegan
2
Overview
NZ Language & Publishing History
The Niupepa Collection
Digital Libraries
Image Capture
Text Capture
NZDL environment
User Interface
Conclusion
history | Niupepa | digital library | image capture | text capture | NZDL | user interface | conclusion
3
New Zealand Encounter History
Some Māori say they were resident when the ‘land rose from the sea’. Other traditions give migrations occurring between 900-1200 AD.
First European sighting was in 1642 by Abel Tasman. In 1769 the islands were charted by James Cook.
In the 1790s whalers and sealers began to arrive, trading with Māori and often settling. They were followed by missionaries.
4
…New Zealand Encounter History
In 1840 the Treaty of Waitangi was signed, establishing the British colony.
In 1845-1872 the NZ Land Wars were fought, resulting in significant land confiscations.
For 100 years the Colonial Government pursued policies of assimilation and integration to ‘civilize the native savage’.
In the 1970s Government policies began to favour multiculturalism.
5
NZ Māori Language History…
At the time when the Treaty was signed Māori language was used in everyday activities, e.g. trade, schooling, religion.
After 50 years of encounter history the Māori population had reduced to ~30% because of wars, diseases, and land alienation.
The Colonial attitudes & legislation had a devastating effect on Māori language. The Education Acts (1867 & 1871) prohibited the use of Māori language in education.
6
…NZ Māori Language History
Following WWII there was a significant Māori urban shift, which often meant an abandonment of Māori language and custom.
Research undertaken in 1979 suggested the death of the ‘living’ language was imminent.
The 1980s saw a significant resurgence in Māori language initiatives, especially schooling, and in Government support.
Currently, 83% of Māori have little or no fluency, and 1/3 of fluent speakers are over 60. However, the outlook is promising.
7
Māori Language Publishing…
Māori is traditionally an oral language; certain characteristics are lost in the print media.
It was the Missionaries who initiated printing in the 1830s to enable reading of scripture.
It has been estimated that 50% of Māori were literate in the 1850-1900 period.
Consequently many periodicals, letters, and Government proceedings were written in Māori in that period.
8
…Māori Language Publishing
The Māori population recovery in the 1900s was not accompanied by a language recovery.
Consequently, from 1900 to the 1950s there was a sharp decline in the amount of material published in Māori.
In the 1970s the Ministry of Education began publishing in the Māori language, and a demand has arisen with the recent resurgence initiatives.
In 1987 the Māori Language Commission was established and the Māori language was given official status.
9
What is the Niupepa Collection?
Libraries throughout NZ hold various papers from the time period when Māori language publishing was flourishing.
In 1988/1989 the Alexander Turnbull Library undertook to source the best copies of these periodicals and preserve them on microfilm.
This collection is called the “Niupepa Collection” and is available on 408 microfiche at most libraries.
10
What is the Niupepa Collection?
11
What’s in the Niupepa Collection?
There are approximately 18,000 pages in 40 periodicals. They were mostly written in Māori or for a Māori audience.
They are written from 3 distinct perspectives: Missionaries, Government, and Māori themselves.
The language is written in a very clear style.
It is an invaluable source of historic information.
examples to follow
12
Digital Library considerations
Two major stages to the digital capture of legacy print material:
1. capture of a digital facsimile image;
   - digital photograph of the original image comprising individual dots or pixels
2. extraction of text using OCR;
   - need electronic text as a record of the actual characters if text is to be able to be searched
13
Image Capture
Images were available on 35mm film
Photographs of good quality, but original print material varied substantially in quality and information density
...examples
14
Image Capture
Quality or fidelity of image depends on:
1. density of dots (dots per inch, dpi)
2. sensitivity of dot values (bits/pixel)
Both parameters influence storage required:
A4 page at 72 dpi b/w => 63 Kbytes
A4 page at 300 dpi and 256 grey levels => 9000 Kbytes
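The storage figures above follow from pixel count times bit depth. The sketch below assumes uncompressed images, A4 taken as 8.27 x 11.69 inches, and 1 Kbyte = 1000 bytes; the function name is illustrative and not from the slides.

```python
import math

A4_INCHES = (8.27, 11.69)  # A4 is 210 mm x 297 mm

def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scan at the given density and depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px * bits_per_pixel / 8)

# Black and white needs 1 bit per pixel; 256 grey levels need 8 bits.
bitonal_72 = raw_image_bytes(*A4_INCHES, dpi=72, bits_per_pixel=1)
grey_300 = raw_image_bytes(*A4_INCHES, dpi=300, bits_per_pixel=8)

print(f"A4, 72 dpi, b/w:              {bitonal_72 / 1000:.0f} Kbytes")
print(f"A4, 300 dpi, 256 grey levels: {grey_300 / 1000:.0f} Kbytes")
```

Under these assumptions the first case comes to about 63 Kbytes and the second to about 8,700 Kbytes, which the slide rounds to 9000.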
15
Image Capture
A large variation in quality and information density meant significant performance differences across periodicals
Experiments showed scanning density of 300dpi on original page produced uniform OCR results.
For large format newspaper, this meant ~20 million pixels for a one-page image
16
Automated image capture
Automated capture from 35mm film carried out by New Zealand Micrographics
Because of set-up costs, two images captured at the same time:
1. bitonal image for Internet delivery, ~200 Kbytes each
2. grey-scale image for OCR, ~5-10 Mbytes each
17
Bitonal and grey-scale images
18
Image format
Images stored as “tagged image format” (.tif) files, one file per frame of film (~19000 frames)
8 CDs for entire collection in bitonal form
90+ CDs for entire collection in grey-scale form
19
Cropping and Splitting
Many images were of double-page spreads
For consistent granularity (1 page per file), these images were split into two separate images prior to text extraction
Extraneous image material (outside the text) was cropped, reducing size
Skewed images were straightened
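As a rough illustration of the splitting and cropping steps, here is a minimal sketch on a tiny bitonal image held as a list of pixel rows (1 = ink, 0 = blank). Splitting at the exact horizontal midpoint is an assumption; the slides do not say how the split point was chosen, and straightening is not shown.

```python
def split_spread(rows):
    """Split a double-page image (list of pixel rows) at its midpoint."""
    mid = len(rows[0]) // 2
    left = [row[:mid] for row in rows]
    right = [row[mid:] for row in rows]
    return left, right

def crop_to_ink(rows, ink=1):
    """Crop blank margins, keeping the bounding box of the ink pixels."""
    ink_rows = [y for y, row in enumerate(rows) if ink in row]
    if not ink_rows:
        return rows  # blank page: nothing to crop to
    ink_cols = [x for x in range(len(rows[0]))
                if any(row[x] == ink for row in rows)]
    top, bottom = ink_rows[0], ink_rows[-1]
    first, last = ink_cols[0], ink_cols[-1]
    return [row[first:last + 1] for row in rows[top:bottom + 1]]

# A made-up 4 x 8 "spread" with a blank border around two small pages.
spread = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
left, right = split_spread(spread)
left_page = crop_to_ink(left)
right_page = crop_to_ink(right)
```

Each half becomes its own one-page image, and the crop removes everything outside the printed area.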
20
Text Capture
OCR software identifies shapes in the image to create a corresponding page of text
Recognition accuracy depends on image quality, visual quality of original, font, and software sophistication.
OCR can be time-consuming and expensive, depending on quality of images captured
Electronic text has low storage requirements, say 4 Kbytes for a single-spaced A4 page
21
Text Capture
An index file helps manage large numbers of files with similar file names
Each image can give rise to several different files:
1. Digital facsimile grey-scale image
2. Digital facsimile bitonal image
3. Cropped/Split image ready for OCR
4. Text file
5. Reduced Resolution image for WWW
6. Preview Image for WWW
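One way to picture such an index file is a table with one row per page and one column per derived file. The sketch below is only an illustration: the CSV layout, the column names, and every file suffix are hypothetical, since the slides do not describe the index format used in the project.

```python
import csv
import io

# Hypothetical suffixes for the six derived files listed above.
DERIVED = {
    "grey": "_grey.tif",
    "bitonal": "_bw.tif",
    "ocr_input": "_ocr.tif",
    "text": ".txt",
    "web": "_web.gif",
    "preview": "_prev.gif",
}

def index_row(page_id):
    """Build one index row: the page id plus the name of each derived file."""
    row = {"page_id": page_id}
    row.update({kind: page_id + suffix for kind, suffix in DERIVED.items()})
    return row

def write_index(page_ids, out):
    """Write the index as CSV to any text file object."""
    writer = csv.DictWriter(out, fieldnames=["page_id", *DERIVED])
    writer.writeheader()
    for page_id in page_ids:
        writer.writerow(index_row(page_id))

buf = io.StringIO()
write_index(["01_01_02_05", "01_01_02_06"], buf)
index_text = buf.getvalue()
```

Because every derived name is computed from the page identifier, a missing file can be spotted by checking the index against the directory listing.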
22
File Renaming
Should occur before the OCR process so that correct text file names are generated
A consistent naming convention will match image file names with text file names
A self-explanatory naming convention assists interface programming, e.g. 01_01_02_05 is used to represent Niupepa 1, Volume 1, Issue 2, Page 5
This is very time-consuming without a renaming program…
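The convention in the example, four numeric fields joined by underscores, is easy to handle in interface code. The sketch below assumes zero-padding to at least two digits per field, as in the one example given; the class and function names are illustrative.

```python
from typing import NamedTuple

class PageId(NamedTuple):
    niupepa: int
    volume: int
    issue: int
    page: int

def parse_page_id(name):
    """Parse a name such as '01_01_02_05' into its four numeric fields."""
    parts = name.split("_")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a niupepa_volume_issue_page name: {name!r}")
    return PageId(*(int(p) for p in parts))

def format_page_id(pid):
    """Rebuild the file-name stem, padding each field to two digits."""
    return "_".join(f"{n:02d}" for n in pid)

pid = parse_page_id("01_01_02_05")
print(pid)                  # PageId(niupepa=1, volume=1, issue=2, page=5)
print(format_page_id(pid))  # 01_01_02_05
```

Because the same stem is shared by the image and text files, the interface can move between them without a lookup table.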
23
File Renamer Example
File Renamer Ultra 2000, by Techalchemy
24
OCR software
We tested many OCR programs and finally settled on ABBYY FineReader 6.0 Corporate, mainly for the following reasons:
1. It supports the Māori language (though not fully) and 175 other languages
2. It does not try to write English words in non-English texts
3. It is very accurate
4. It is cost effective
25
OCR software
Other characteristics of FineReader that are appropriate for this work:
1. You can train a ‘recognition pattern’
2. You can input user dictionaries and maintain them in the proofing process
3. You can define the character set
4. It is user friendly and easy to run, so new staff require minimal training
26
OCR software
27
NZDL – Greenstone software
supports a range of document styles, forms and languages;
a wide range of collection sizes;
different interface languages;
different browsing and searching structures;
different storage and delivery techniques;
28
Building the collection
Niupepa collection comprises two main sets of documents:
1. extracted electronic text (for searching)
2. digital facsimile pages (for viewing)
Niupepa is one of the standard collections from the NZDL site, delivered over the Internet via a standard web browser;
Can also be delivered on CD-ROM, as with all Greenstone collections.
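The pairing of searchable text with viewable facsimiles can be sketched as a small inverted index whose hits are page identifiers, each of which resolves to an image file. This is a toy illustration only and not how Greenstone builds its indexes; the sample texts are made up, and the image naming in `facsimile_for` is hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def build_index(pages):
    """pages maps page id -> extracted text; returns word -> set of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word.strip(PUNCTUATION)].add(page_id)
    return index

def search(index, query):
    """Return the sorted page ids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

def facsimile_for(page_id):
    """Hypothetical layout: one facsimile image per page id."""
    return f"{page_id}.gif"

# Made-up sample text, not taken from the collection.
pages = {
    "01_01_02_05": "te reo Maori",
    "01_01_02_06": "te whenua",
}
index = build_index(pages)
results = [facsimile_for(p) for p in search(index, "te reo")]
```

The search runs against the extracted text, but what the reader is shown is the facsimile page that the hit points to.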
29
User Interface
nzdl.org/niupepa
The default language is Māori. The collection is primarily in Māori, it was funded as a Māori language resource, and it makes a statement about the use of Māori.
The user interface can easily be switched to English or one of nine other languages
30
User Interface
There are 3 methods of access
1. Browse by Title; selecting a particular newspaper, issue, and then the page
2. Browse by Date; selecting a particular time period, and then the 1st page of the issue
3. Search by Content; entering words or phrases for full text searching
31
User interface issues
Because of poor quality images, not all OCR could be done. Full text search is not available for < 2% of total series.
Indigenous languages often use non-ASCII characters. Unicode is used to correctly display these characters.
Indigenous languages often require new word generation at the user interface
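Māori long vowels written with macrons (ā, ē, ī, ō, ū) are the non-ASCII characters at issue here. The sketch below shows why Unicode normalisation matters when such text is compared or searched: the same word can be encoded in two ways. The macron-folding helper is an optional convenience added for illustration and is not described in the slides.

```python
import unicodedata

def nfc(text):
    """Normalise to composed form so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", text)

def fold_macrons(text):
    """Strip combining marks, e.g. to match queries typed without macrons."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

precomposed = "M\u0101ori"   # 'ā' as the single code point U+0101
decomposed = "Ma\u0304ori"   # 'a' followed by combining macron U+0304

print(precomposed == decomposed)       # False: different code points
print(nfc(decomposed) == precomposed)  # True after normalisation
print(fold_macrons(precomposed))       # Maori
```

Normalising both the stored text and the query to the same form before indexing avoids missed matches between the two encodings.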
32
User interface enhancements
We were fortunate to be able to make available 2 additional sets of information on the various Newspaper publications:
1. Commentaries: historic research material including bibliographic information, background, subject matter, and current physical locations
2. Abstracts in English: a hypertext-linked summary that gives non-Māori-speaking readers insight into what was written
33
User interface future developments
Possibilities exist for future development on:
1. A graphical timeline where the time period may be selected by moving an adjustable slider along a timeline.
2. Generating a search by selecting a certain location on a map and returning all the pages associated with that area.
3. Highlighting the areas of the facsimile image that match the search criteria.
34
Usage
48 of the top 50 search terms are in Māori
41 of the 2nd top 50 search terms are in Māori
35
Usage
[Chart: Māori - English Comparisons for Niupepa Site. Series: Māori, English. Vertical axis: 0 to 100,000. Horizontal axis: 2000 (283), 2001 (336), 2002 (297), 2003 (69).]
36
Usage
37
Conclusion...
A unique collection of historical indigenous language newspapers covering a 100-year period has been captured in digital form;
Not only is the collection preserved, but it is made much more widely and conveniently accessible;
Full-text search, although a costly option, adds significant utility and value to the collection;
There are difficulties in carrying out OCR with minority languages, but these can be overcome;
38
…conclusion
The experience gained, and the techniques learned and developed, are equally applicable to a wide range of legacy text collections...
…however, they are particularly pertinent to collections of historical indigenous language documents…
…and digital collections of this type have the potential to contribute significantly to the promotion and preservation of language and culture.