Faculty of Computer Science & Information Technology
GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS
LANGUAGES OF SARAWAK (NERSIL)
YONG SOO FONG
Master of Computer Science
2013
GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS LANGUAGES OF
SARAWAK (NERSIL)
YONG SOO FONG
A thesis submitted in
fulfillment of the requirements for the degree of Master of Computer Science
Faculty of Computer Science and Information Technology
UNIVERSITI MALAYSIA SARAWAK
2013
Declaration
No portion of the work referred to in this report has been submitted in support of an
application for another degree or qualification of this or any other university or institution
of higher learning.
………………………………….
YONG SOO FONG 24th September 2013
Acknowledgements
At the end of my thesis, I would like to thank everyone who made this thesis a success
and an unforgettable experience for me.
First and foremost, I would like to express my sincerest gratitude to my supervisor,
Assoc. Prof. Dr. Alvin Yeo Wee, for his constructive comments, and his strong support
throughout this work.
Secondly, I am extremely grateful to Assoc. Prof. Dr. Bali Ranaivo-Malançon. I
thank her for her guidance and great effort in training me in the field of computational
linguistics. I attribute my Master's degree to her encouragement and effort; without her,
this thesis would not have been completed.
I am thankful to my best friend, Amy Chong, for her selfless support and
encouragement, and also for her grammatical editing of my thesis.
I would like to acknowledge the financial, academic and technical support of
Universiti Malaysia Sarawak, particularly the award of the Vice Chancellor's Research
Scholarship, which provided the necessary financial support for this research.
Finally, I take this opportunity to express my profound and deepest gratitude to
my beloved parents and my siblings for their love and continuous support, both
spiritually and financially.
Abstract
The aim of this research is to create the first Named Entity Recognition (NER) system for
the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal
of NERSIL is to achieve good accuracy in the identification and
classification of named entities (NEs). The NEs considered in this research are Person,
Location, Organisation, Date, Time, Monetary and Percentage. Generally, all these NEs
carry important information about the text itself. Thus, they are targets for extraction.
NER approaches can be categorised broadly as rule-based, machine learning-based, and
hybrid approaches. The rule-based approach relies on hand-crafted
linguistic grammars. The machine learning-based approach needs a large amount of annotated
training data, which is unavailable for SILs. The hybrid approach combines the rule-based
and machine learning-based approaches. NERSIL requires special attention because the
existing NER approaches cannot be applied directly to SILs.
In this thesis, an NER system built by extending and modifying the existing NER
approaches is presented. There are three main processes: the non-modified ANNIE (A
Nearly-New IE system) NER, ANNIE adapted to SILs, and finally the context
investigation. Firstly, the input texts are submitted to an English NER, in this case
ANNIE, on the assumption that some NEs that appear in English texts will also occur in
SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules
for recognised NEs. Next, new rules for the unrecognised NEs are written and new
gazetteers for SILs are built in order to identify more NEs. However, the first two
processes are not enough to provide good accuracy in recognising all NEs. Thus,
context investigation is needed. Context investigation includes frequency analysis,
triggered words filtering, and concordance analysis. The context of an NE (the words to
its left or right) is investigated.
Finally, an NER system designed for SILs is a contribution to world knowledge.
The design can be improved by incorporating machine translation and WordNet,
and by adding more noise filtering (e.g. context filtering and morphological filtering). With
more research and future studies, this NER system could reach a level of performance
comparable to that of English NER systems.
Abstrak
The aim of this Master's thesis is to produce the first Named Entity
Recognition (NER) system for the indigenous languages of Sarawak, called
NERSIL. The main goal of NERSIL is to achieve good accuracy
in the identification and classification of named entities (NEs).
The NEs considered in this study are Person, Location, Organisation, Date,
Time, Monetary and Percentage. In general, all these NEs carry important
information about the text itself. Thus, they are targets for extraction.
NER approaches can be broadly categorised as rule-based,
machine learning-based, and hybrid approaches.
The rule-based approach relies on linguistic grammars.
The machine learning-based approach requires a large amount of annotated
training data, which does not currently exist for the indigenous languages of Sarawak. The
hybrid approach is a combination of the rule-based and
machine learning-based approaches. NERSIL requires special attention because
the existing NER approaches cannot be applied directly.
An NER system developed by extending and modifying the existing NER approaches
is presented in this thesis. There are three main processes: the non-modified ANNIE (A
Nearly-New IE system), ANNIE adapted to the indigenous languages
of Sarawak, and finally the context investigation. Firstly, the input texts are submitted
to an English NER, in this case ANNIE, on the assumption that some NEs that
appear in English texts will also occur in the indigenous languages of Sarawak.
At that stage, the rules for unrecognised NEs are distinguished from the rules for
recognised NEs. Next, new rules for the unrecognised NEs are written
and new gazetteers are built for the indigenous languages of Sarawak in order to identify
more NEs. However, the first two processes are not enough to provide
good accuracy in recognising all NEs. Thus, context investigation is needed.
Context investigation includes frequency analysis, triggered words filtering, and
concordance analysis. The context of an NE (to the left or right of the NE) is investigated.
Finally, an NER system designed for the indigenous languages of Sarawak is an
advancement of world knowledge. In addition, the design can be improved by
using machine translation and WordNet, and by adding more filtering
(such as context filtering and morphological filtering). With more research
and future studies, this NER system will eventually reach a high level of performance
like that of English NER.
Table of Contents
Declaration
Acknowledgements
Abstract
Abstrak
Table of Contents
List of Published Papers
List of Figures
List of Tables
List of Abbreviations
Chapter 1 INTRODUCTION
1.1 Definitions: Named Entity (NE) and Named Entity Recognition (NER)
1.2 Background of SILs
1.3 Problem Statement
1.4 Objectives of the Study
1.5 Scope of the Study
1.6 Significance of the Study
1.7 Organisation of the Thesis
Chapter 2 LITERATURE REVIEW
2.1 Introduction
2.2 Named Entity Recognition
2.2.1 Named Entity (NE) Types
2.2.2 Problems in NEs
2.2.3 Applications of NER
2.3 Features of NEs
2.3.1 Word-level Features
2.3.2 List Lookup Features
2.3.3 Document and Corpus Features
2.4 NER Approaches
2.4.1 Rule-based Approach
2.4.2 Machine Learning-based Approach
2.4.3 Hybrid-based Approach
2.4.4 Summary of the Three Major NER Approaches
2.4.5 NER via Machine Translation
2.5 Some Existing NER Systems
2.5.1 ANNIE
2.5.2 Freeling
2.5.3 Text Pro
2.5.4 ClearForest
2.5.5 Summary of the Existing of NER systems
2.6 Summary of the Literature Review of NER
Chapter 3 METHODOLOGY
3.1 Introduction
3.2 Define the Research Problems
3.3 Review the Literature
3.4 Propose a Solution to the Problems
3.4.1 NERSIL Overall Framework
3.4.2 Requirements
3.5 Collect data
3.6 Implement and Iteratively Improve
3.7 Evaluation and Discussion
3.8 Summary
Chapter 4 EXPERIMENTS, RESULTS ANALYSIS AND DISCUSSION
4.1 Introduction
4.2 Experiments Description and Setup
4.2.1 Data Set
4.2.2 Evaluation Metrics
4.3 Result Analysis on Iban Corpus
4.3.1 Results from Non-modified ANNIE NER
4.3.2 Results from Adapted ANNIE for Iban
4.3.3 Context Investigation: Results from Frequency Analysis
4.3.4 Context Investigation: Results from Triggered Words Filtering
4.3.5 Context Investigation: Results from Concordance Analysis
4.3.6 Performance of NERSIL
4.4 Results Analysis on Bau Bidayuh Corpus
4.4.1 Results from Non-modified ANNIE NER
4.4.2 Results from Adapted ANNIE for Bau Bidayuh
4.4.3 Context Investigation: Results from Frequency Analysis
4.4.4 Context Investigation: Results from Triggered Words Filtering
4.4.5 Context Investigation: Results from Concordance Analysis
4.4.6 Performance of NERSIL
4.5 Summary of the Results
4.6 Discussion
Chapter 5 CONCLUSION AND FUTURE WORK
5.1 Introduction
5.2 Research Contributions
5.3 Limitations
5.4 Future Works
5.5 Summary
References
Appendix A: The Most Frequent Top 30 Words in Iban Corpus
Appendix B: The Context of Iban Language
Appendix C: The Most Frequent Top 30 Words in Bau Bidayuh Corpus
Appendix D: The Context of Bau Bidayuh Language
List of Published Papers
1. Yong Soo Fong, Bali Ranaivo-Malançon, & Alvin Yeo Wee. “NERSIL – the Named-
Entity Recognition System for Iban Language”. The 25th Pacific Asia Conference on
Language, Information and Computation (PACLIC 25), Singapore, 16-18 December
2011.
2. Yong Soo Fong, Bali Ranaivo-Malançon, & Alvin Yeo Wee. “Discovering Triggered
Word for Iban-Entity Recogniser”. Proceedings of the Sixth International Workshop
on Malay and Indonesian Language Engineering (MALINDO 2012), Universiti
Malaysia Sarawak, 21 June 2012.
List of Figures
Figure 1.1: Organisation of the Thesis
Figure 2.1: 200 Extended Named Entity (ENE) Categories (Sekine & Nobata, 2004)
Figure 2.2: NER in Newswire Domain (Institute for InfoComm Research, 2004)
Figure 2.3: NER in Biomedical Domain (Institute for InfoComm Research, 2004)
Figure 2.4: NER Approaches
Figure 2.5: Tool Features of NER (Marrero et al., 2009)
Figure 2.6: Screenshot of ANNIE (Cunningham et al., 2012)
Figure 2.7: Screenshot of Freeling 3.0
Figure 2.8: Screenshot of TextPro
Figure 2.9: Screenshot of ClearForest
Figure 2.10: F-measure in Entity Identification and Classification (Marrero et al., 2009)
Figure 3.1: Research Methodology Process
Figure 3.2: Conceptual Design of the Proposed Framework
Figure 3.3: ABBYY FineReader's Process
Figure 3.4: Output of After Performing OCR and After Correction
Figure 3.5: ANNIE Works with the Set of Core PRs (Maynard, 2004)
Figure 3.6: Results of non-modified ANNIE NER on Iban Text
Figure 3.7: Adapted ANNIE to Iban
Figure 3.8: Context Investigation
Figure 3.9: Input and Output of Concordance Analysis
Figure 3.10: Screenshot of ABBYY FineReader
Figure 3.11: Screenshot of GATE's Framework (Cunningham et al., 2012)
Figure 3.12: Screenshot of VIM Editor
Figure 3.13: Screenshot of AntConc
Figure 4.1: NE Distribution in the Iban Corpus
Figure 4.2: Annotation Tool Using GATE's ANNIE System
Figure 4.3: NE Distribution in the Bau Bidayuh Corpus
Figure 4.4: Annotation Diff Tool
Figure 4.5: NEs recognised by Non-modified ANNIE NER (Iban Corpus)
Figure 4.6: Results from Adapted ANNIE for Iban (Iban Corpus)
Figure 4.7: The Relationship between the Frequency and Ranking (Iban Corpus)
Figure 4.8: Class of Triggered Word (Iban Corpus)
Figure 4.9: NEs recognised by Non-modified ANNIE NER (Bau Bidayuh Corpus)
Figure 4.10: NEs recognised by Adapted ANNIE for Bau Bidayuh (Bau Bidayuh Corpus)
List of Tables
Table 2.1: Output of Machine Translation (using NER) (Ishak et al., 2008)
Table 2.2: Word-level Features of NEs (Nadeau & Sekine, 2007)
Table 2.3: List Lookup Features of NEs (Nadeau & Sekine, 2007)
Table 2.4: Document and Corpus Features of NEs (Nadeau & Sekine, 2007)
Table 2.5: Results NER for Indonesian Language (Budi et al., 2005)
Table 2.6: Strengths and Weaknesses of Each Approach
Table 2.7: Results by Entity Type (Marrero et al., 2009)
Table 2.8: Default Resources of ANNIE (Cunningham et al., 2012)
Table 2.9: Analysis Services Available for Each Language (Padró & Stanilovsky, 2012)
Table 2.10: Summary of the Literature Review
Table 3.1: List of Contexts Features for Iban NEs
Table 3.2: List of Word-Level Features
Table 3.3: Examples of JAPE Rules for Iban Person
Table 3.4: Structure of JAPE Rules (Maynard, 2004)
Table 3.5: Hardware Requirements for Each of the Software
Table 3.6: Running Time Evaluation for ANNIE
Table 3.7: Summary of the Framework
Table 4.1: Total No. of Word Types, Total No. of Word Token, Size of Data Set
Table 4.2: Differences between Iban Text and English Text
Table 4.3: Number of Iban Jape Rules
Table 4.4: Number of Jape Rules which Reused and Created for Iban
Table 4.5: Gazetteers which Created for Iban
Table 4.6: The Most Frequently Occurring Words (Top Ten) (Iban Corpus)
Table 4.7: Results from Triggered Word Filtering (Iban Corpus)
Table 4.8: Type of Word for the Most Frequently Occurring Word (Top Ten) (Iban Corpus)
Table 4.9: The Most (top ten) frequently occurring in Five Different Sets of Data
Table 4.10: Probability of a Word at Rank r
Table 4.11: Probability of Triggered Word in Each Category of NEs Class (Iban Corpus)
Table 4.12: Performance of NERSIL (Iban Corpus)
Table 4.13: The Most Frequently Occurring Words (top ten) (Bau Bidayuh Corpus)
Table 4.14: Results from Triggered Words Filtering (Bau Bidayuh Corpus)
Table 4.15: Results from Native Speakers
Table 4.16: Performance of NERSIL (Bau Bidayuh Corpus)
Table 4.17: Summary of the Results
List of Abbreviations
The following is a list of abbreviations used in this thesis:
IE: Information Extraction
MUC: Message Understanding Conference
NER: Named Entity Recognition
GATE: General Architecture for Text Engineering
ANNIE: A Nearly-New Information Extraction System
NERSIL: Named Entity Recognition for Sarawak Indigenous Languages
NLP: Natural Language Processing
IR: Information Retrieval
HMM: Hidden Markov Models
MEMM: Maximum Entropy Markov Model
SILs: Sarawak Indigenous Languages
SVM: Support Vector Machines
Chapter 1 INTRODUCTION
1.1 Definitions: Named Entity (NE) and Named Entity Recognition (NER)
The number of online electronic documents is growing exponentially, and more and more
important information is becoming available as text. It is therefore very difficult to
identify the relevant information quickly and accurately, and because the identification
task is complex, it should be supported by computational tools. Many
technologies, such as Information Extraction (IE), have been developed to deal with this
tremendous amount of information. Named Entity Recognition
(NER) is one of the important sub-tasks of IE. The NER process is divided into two
successive parts. The first part consists of identifying proper names in a given text. The
second part concerns the classification of these proper names into semantic classes such as
Person, Organisation, Location, Date, Time, Monetary and Percentage. Much
work has been done on NER for English and other languages that are deemed “big” languages. This
has generated interest among researchers in finding ways to develop NER for
Sarawak Indigenous Languages (SILs). The background of SILs is described in
detail in Section 1.2.
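The two successive parts described above can be sketched as follows. This is a minimal illustration in Python, not the NERSIL implementation: the capitalisation pattern, the tiny gazetteer and all function names are assumptions made for the example.

```python
import re

# Hypothetical mini-gazetteer; real lists would be far larger.
GAZETTEER = {
    "Person": {"Shakespeare"},
    "Location": {"Sarawak", "Kuching"},
    "Organisation": {"Universiti Malaysia Sarawak"},
}

# Part 1 pattern (an assumption): one or more consecutive capitalised words.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def identify(text):
    """Part 1: return candidate proper names found in the text."""
    return CANDIDATE.findall(text)


def classify(candidate):
    """Part 2: assign a semantic class by gazetteer lookup."""
    for label, names in GAZETTEER.items():
        if candidate in names:
            return label
    return "Unknown"


def recognise(text):
    """Run both parts and return (candidate, class) pairs."""
    return [(c, classify(c)) for c in identify(text)]
```

A capitalisation pattern of this kind also picks up ordinary sentence-initial words, which end up in the "Unknown" class; this is one reason identification alone is not sufficient.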
The identification and classification of rigid designators such as name expressions,
numeric expressions and temporal expressions in raw texts are very important in
numerous Natural Language Processing (NLP) applications. According to the online
Collinsdictionary.com (2012), a rigid designator is an expression that identifies the
same individual in every possible world. For example, “Shakespeare” is a rigid
designator, since it is possible that Shakespeare might not have been
a playwright, but not that he might not have been Shakespeare (Collinsdictionary.com,
2012). Entities referred to by rigid designators, in the sense defined by Kripke
(1980), are called Named Entities (NEs). Generally speaking, NEs are proper nouns. NEs also
cover names of sports and adventure activities, and terms for biological species and
substances. Different lists of NE types have been proposed, including the Message
Understanding Conference (MUC) list (Grishman & Sundheim, 1996), the Conference on
Computational Natural Language Learning (CoNLL) list (Sang & Meulder, 2003), and
Sekine's list (Sekine & Nobata, 2004). Because these differing lists can be confusing for
researchers, NE types are explained in more detail in Chapter 2.
NER approaches can be broadly divided into three main types: the rule-based approach, the
machine learning-based approach, and the hybrid-based approach. The rule-based approach
relies on hand-crafted linguistic grammars. The machine learning-based approach needs a
huge amount of annotated training data, which is often unavailable for SILs. The
hybrid-based approach combines the two in order to overcome their respective weaknesses.
In general, the rule-based approach provides better results than the other two
approaches. NERSIL requires special attention because the existing NER approaches
cannot be applied directly to SILs.
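To make the idea of a hand-crafted rule concrete, the following sketch expresses one such rule as a Python regular expression. It is only an illustration: it is not a JAPE grammar of the kind used by ANNIE, and the trigger words, the pattern and the function name are assumptions chosen for the example.

```python
import re

# Illustrative left-context trigger words (English titles). An SIL system
# would need its own language-specific list; these are assumptions.
PERSON_TRIGGERS = ["Dr.", "Prof.", "Mr.", "Mrs."]

# Rule: a trigger word followed by one or more capitalised words is a Person.
_TRIGGER = "|".join(re.escape(t) for t in PERSON_TRIGGERS)
_NAME = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"
PERSON_RULE = re.compile(r"(?:" + _TRIGGER + r")\s+(" + _NAME + r")")


def find_persons(text):
    """Return the names that directly follow a Person trigger word."""
    return PERSON_RULE.findall(text)
```

For instance, `find_persons("Yesterday Dr. Jane Doe visited Kuching.")` returns only the name after the title, because neither "Yesterday" nor "Kuching" is preceded by a trigger word.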
In conclusion, it is possible to build a strong NER system for SILs by using a conventional
rule-based approach that extends and modifies the existing NER approaches.
1.2 Background of SILs
According to Dewan Bahasa dan Pustaka (Malay for the Institute of Language and
Literature), there are 63 indigenous languages in Sarawak, an East Malaysian state.
According to the 2009 edition of Ethnologue, an encyclopaedic reference work on the
languages of the world, the number of individual languages listed for Malaysia (Sarawak)
is 46 (Lewis, 2009). Of these 46 languages, 44 are living languages and 2 have no known
speakers (Lewis, 2009). Examples of SILs are Iban, Bidayuh (Bau-Jagoi), and Melanau
(Matu-Daro). These SILs have received relatively little research attention. Zahid (2008)
reported that the lack of research on SILs is due to a reliance on
researchers coming from outside Sarawak, such as from Peninsular Malaysia and other parts
of the world. So far, the research that has been conducted has mostly been confined to
collecting basic word lists with the aim of capturing structural characteristics in terms
of phonology, morphology and syntax.
Among the indigenous languages of Sarawak, Iban has received considerable research attention.
The Iban are the largest ethnic group, making up about 44% of the population of Sarawak
(Berita Publishing Group, 1994). The Iban language is the vernacular of the Iban people.
The online Theborneopost.com (2011) reported that the Sarawak government
continues to promote the Iban language as an international lingua franca. In 2008, the Iban
language was introduced as one of the subjects for the Malaysian fifth-year secondary school
examination. In comparison with the other indigenous languages in Sarawak,
Iban already has its own orthography system along with a few dictionaries and grammar
books (Suhaila et al., 2008). Apart from the Malaysian government initiative, the
Sarawak Language Technology (SaLT) research group has also embarked on research on
the application of information and communication technology (ICT) to the preservation
of SILs. Nevertheless, the Iban language is still considered an under-resourced language,
although it has a few NLP tools such as a morphological analyser and generator, a
syntactic parser, a part-of-speech tagger, and a spell checker. Some of these
tools are still works in progress.
Moreover, researchers face a number of challenges in the development of NER for SILs
as there are certain restrictions. Below are the restrictions stated by Zahid (2008):
“SILs have not yet been systematically romanised”
“SILs are written in the Roman script likes many of its neighbouring languages
such as Iban, Bidayuh, and Malay”
“SILs do not boast an extensive corpus that is able to provide a reliable resources
about the language's syntax, morphology and phonology”
“Have limited word list which fails to reveal the phonological, syntactic, and
lexical variations of the language”
1.3 Problem Statement
NER systems are now available for European languages (such as English and French) and
even for East Asian languages (Chinese, Japanese, Korean, and Vietnamese). However, for
under-resourced languages such as the SILs, the problem of NER is still far from being
solved.
Although many insights can be gained from the methods used for English, many issues
still need to be considered. One significant issue is that researchers do not have
deep linguistic knowledge of the SILs. In addition, linguistic resources are scarce,
and spelling is non-standard and variable. Finally, no NER system for SILs exists.
Thus, an approach for developing the first NER system for SILs will be proposed.
To summarise:
• Spelling is not standardised and varies.
• No NER system exists for SILs.
• An NER system is an important component of many NLP applications such as
information extraction, machine translation, and question answering. Thus,
building NER for SILs (NERSIL) will open the possibility of creating many NLP
applications for SILs.
• Researchers do not have deep linguistic knowledge of SILs.
1.4 Objectives of the Study
With the problem description as a basis, the present research has the following objectives.
The main objective of this research is
To design and develop a generic NER for SILs (NERSIL)
The specific objectives are as follows:
i. To define a framework for developing NERSIL
ii. To design generic rules and build gazetteers
iii. To investigate the contexts of NEs for SILs
iv. To automatically evaluate the accuracy of NERSIL
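As an illustration of objective iv, the sketch below shows one way an automatic evaluation could be computed. The thesis refers only to "accuracy" at this point; the use of exact-match precision, recall, and F1 is an assumption (these measures are common in NER evaluation), and the names Entity and evaluate are hypothetical and not part of NERSIL.

```python
from typing import Dict, NamedTuple, Set


class Entity(NamedTuple):
    """One annotated named entity (hypothetical representation)."""
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # e.g. "Person", "Location", "Date"


def evaluate(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Compare predicted entities with gold annotations.

    An entity counts as correct only if its span and its label
    both match a gold entity exactly (assumed matching criterion).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up annotations: the second prediction has the right span
# but the wrong label, so it is counted as an error.
gold = {Entity(0, 7, "Location"), Entity(20, 25, "Date"), Entity(30, 36, "Person")}
predicted = {Entity(0, 7, "Location"), Entity(20, 25, "Time"), Entity(30, 36, "Person")}
scores = evaluate(gold, predicted)
print(scores)
```

Requiring both the span and the label to match is the strictest choice; a more lenient evaluation could give partial credit for overlapping spans.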
1.5 Scope of the Study
The scope of this study is limited to the NER field. Only three types of NEs are
considered: name expressions, numeric expressions, and time expressions. Name
expressions include Person, Organisation, and Location. Numeric expressions include
Monetary and Percentage. Time expressions include Time and Date. The target languages
are two SILs: Iban and Bau Bidayuh.
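The entity types in scope can be written down as a small data structure, together with a minimal sketch of how a gazetteer lookup and surface patterns might tag them. This is an illustration only: the gazetteer entries, the regular expressions, the sample sentence, and the function tag are all hypothetical, and a plain dictionary with regular expressions stands in for whatever rule formalism is actually used.

```python
import re
from typing import Dict, List, Tuple

# The three categories and seven NE types named in the scope.
NE_TYPES: Dict[str, List[str]] = {
    "name": ["Person", "Organisation", "Location"],
    "numeric": ["Monetary", "Percentage"],
    "time": ["Time", "Date"],
}

# Hypothetical gazetteer: known surface forms mapped to an NE type.
GAZETTEER: Dict[str, str] = {
    "Kuching": "Location",
    "Universiti Malaysia Sarawak": "Organisation",
}

# Hypothetical surface patterns for two of the numeric types.
PATTERNS = [
    ("Percentage", re.compile(r"\d+(?:\.\d+)?\s?%")),
    ("Monetary", re.compile(r"RM\s?\d+(?:\.\d{2})?")),
]


def tag(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) triples sorted by start offset."""
    found = []
    # Gazetteer lookup: exact string match for each listed entry.
    for entry, label in GAZETTEER.items():
        for match in re.finditer(re.escape(entry), text):
            found.append((match.start(), match.end(), label))
    # Pattern rules: regular expressions over the raw text.
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label))
    return sorted(found)


sample = "Tickets to Kuching rose 5% to RM 50."  # made-up sentence
for start, end, label in tag(sample):
    print(sample[start:end], label)
```

Exact string matching is the simplest possible lookup; it does not handle the spelling variation noted in the problem statement, which a fuller system would need to address.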
1.6 Significance of the Study
The significance of this study lies in proposing a solution for developing NER for
SILs. The proposed solution may also apply to other under-resourced languages. This
study will also make indigenous languages more accessible to researchers and to people
who are interested in the local culture. This may in turn encourage further work and
help to preserve the culture.
1.7 Organisation of the Thesis
This thesis is divided into five chapters: introduction, literature review,
methodology, results analysis and discussion, and conclusion and future work.
Chapter One provides an overview of this thesis by defining NE and NER and presenting
the background of SILs, the problem statement, the objectives, the scope, and the
significance of the research.
Chapter Two presents the background of NER, NEs, and NER applications. A review of
NER approaches and existing NER systems is also presented. At the end of the chapter, a
summary of the literature review is outlined.
Chapter Three lays out the research methodology. Each step in the proposed framework
is discussed in detail. This chapter also describes the environment required to
implement the proposed framework, namely the software and hardware requirements.
Chapter Four covers the setup of the experiments. The results and analysis are
presented in several graphs and tables, followed by a discussion.
Chapter Five concludes this thesis by restating the contributions and the limitations,
and by presenting ideas retained for future work.
Figure 1.1 gives the big picture of this research: what, why, and how. It also
summarises the background information, so that readers can understand the relevance of
the work and become familiar with the NER area.
Figure 1.1: Organisation of the Thesis
[Diagram labels: WHAT, WHY, HOW; Chapters 1 to 5; Natural Language Processing,
Information Extraction, Named Entity Recognition; Background of NER; Named Entity
Types (Person, Location, Organization, Time, Date, Monetary, Percentage); NER
Applications (Machine Translation, Question Answering, Automatic Text Summarization);
NER Approaches (Rule-Based: patterns, gazetteers; Machine Learning: HMM, CRF, SVM,
MEM; Hybrid: rule-based + machine learning); Existing NER system (ANNIE); NER Related
Works; Analysis of Literature Reviews; Proposed Framework (Pre-processing, Gazetteers
Building, Rules Building, Context Investigation: concordance analysis, frequency
analysis, triggered word filtering); Non-modified ANNIE NER; Adapted ANNIE to Iban/Bau
Bidayuh; Performance; Result Analysis and Discussion; Conclusion and Future Work.]
Chapter 2 LITERATURE REVIEW
2.1 Introduction
In the previous chapter, NE and NER were defined and the background of the SILs was
briefly introduced. The problem statement, objectives, scope, and significance of the
study were also identified. This chapter examines NER in more detail. It covers NE
types, problems related to NEs, and applications of NER. It also reviews the
achievements and limitations of recent work on NER approaches, as well as some
existing NER systems.
2.2 Named Entity Recognition
Nowadays, most knowledge is stored and communicated as natural language text, and many
resources are freely available on the Internet. To make this knowledge available in
a structured form for deeper analysis, technologies from the field of IE are necessary.
NER is a fundamental task in information extraction.
In the 1990s, NER was successfully applied to English following evaluation conferences
such as the Message Understanding Conference (MUC). This success was possible because
English has very rich tagged corpora. In addition, researchers had obtained good
linguistic insights into the use of NEs. Thus, English is the most popular