LexBank: A Multilingual Lexical Resource for Low-Resource
Languages
by
Feras Ali Al Tarouti
M.S., King Fahd University of Petroleum & Minerals, 2008
B.S., University of Dammam, 2001
A Dissertation submitted to the Graduate Faculty of the
University of Colorado at Colorado Springs
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2016
© Copyright by Feras Ali Al Tarouti 2016. All Rights Reserved.
This dissertation for Doctor of Philosophy degree by
Feras Ali Al Tarouti
has been approved for the
Department of Computer Science
by
Jugal Kalita, Chair
Tim Chamillard
Rory Lewis
Khang Nhut Lam
Sudhanshu Semwal
Date
Al Tarouti, Feras A. (Ph.D., Computer Science)
LexBank: A Multilingual Lexical Resource for Low-Resource Languages
Dissertation directed by Professor Jugal Kalita
In this dissertation, we present new methods to create essential lexical resources for
low-resource languages. Specifically, we develop methods for enhancing automatically cre-
ated wordnets. As a baseline, we start by producing core wordnets, for several languages,
using methods that need limited freely available resources for creating lexical resources
(Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in
wordnets we create. Next, we introduce a new method to automatically add glosses to the
synsets in our wordnets. Our techniques use limited resources as input to ensure that they
can be felicitously used with languages that currently lack many original resources. Most
existing research works with languages that have significant lexical resources available,
which are costly to construct. To make our created lexical resources publicly available,
we developed LexBank which is a Web-based system that provides language services for
several low-resource languages.
To my mother, father and my wife.
Acknowledgments
I would like to express my appreciation to my wife and the mother of my kids Omima for
the unlimited support she gave to me during my journey toward my Ph.D. I am also very
grateful to the support and guidance provided by my advisor Dr. Jugal Kalita. In addition, I
would like to thank my dissertation committee members: Dr. Sudhanshu Semwal, Dr. Tim
Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam for their guidance and consultation.
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Assamese Language . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1.1 Assamese Script . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1.2 Assamese Morphology . . . . . . . . . . . . . . . . . . 5
1.2.1.3 Assamese Syntax . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Vietnamese Language . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2.1 Vietnamese Script . . . . . . . . . . . . . . . . . . . . . 6
1.2.2.2 Vietnamese Morphology . . . . . . . . . . . . . . . . . 7
1.2.2.3 Vietnamese Syntax . . . . . . . . . . . . . . . . . . . . 8
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Case Study: The Current Status and Challenges of Processing Information in Arabic 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Fundamentals of Arabic . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Arabic Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Arabic Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Arabic Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Literature Review 20
3.1 Automatic Construction of Wordnets . . . . . . . . . . . . . . . . . . . . . 20
3.2 Wordnet Management Tools . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Creating Bilingual Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Automatically Constructing Structured Wordnets 31
4.1 Constructing Core Wordnets . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Constructing Wordnet Semantic Relations . . . . . . . . . . . . . . . . . . 33
4.3 Experiment and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Enhancing Automatic Wordnet Construction Using Word Embeddings 39
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Similarity Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Generating Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Removing Irrelevant Words in Synsets . . . . . . . . . . . . . . . . . . . . 42
5.5 Validating Candidate Relations . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 Selecting Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.7.1 Generating Vector Representations of Wordnets Words . . . . . . . 45
5.7.2 Producing Word Embeddings for Arabic . . . . . . . . . . . . . . . 47
5.8 Evaluation & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6 Selecting Glosses for Wordnet Synsets Using Word Embeddings 53
6.1 Creating Language Model Using Word Embedding . . . . . . . . . . . . . 53
6.2 Generating Vector Representation of Wordnet Synsets . . . . . . . . . . . . 53
6.3 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec . 56
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.4.1 Using Synset2vec to Select Glosses for PWN Synsets . . . . . . . . 58
6.4.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Viet-
namese Synsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4.3 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7 LexBank: a Multilingual Lexical Resource 65
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2 Database Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2.1 The system settings database . . . . . . . . . . . . . . . . . . . . . 66
7.2.1.1 Users_Info . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2.1.2 System_log . . . . . . . . . . . . . . . . . . . . . . . . 67
7.2.2 The lexical resources database . . . . . . . . . . . . . . . . . . . . 67
7.2.2.1 CoreWordnet . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2.2.2 Sem_Relations . . . . . . . . . . . . . . . . . . . . . . . 68
7.2.2.3 WordnetGlosses . . . . . . . . . . . . . . . . . . . . . . 68
7.2.2.4 Sem_Relations_Eval_Data . . . . . . . . . . . . . . . . 69
7.2.2.5 Sem_Relations_Eval_Response . . . . . . . . . . . . . . 69
7.2.2.6 WordnetGlosses_Eval_Data . . . . . . . . . . . . . . . . 70
7.2.2.7 WordnetGlosses_Eval_Response . . . . . . . . . . . . . 70
7.3 Application layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.4 Web Interface Design & Implementation . . . . . . . . . . . . . . . . . . . 72
7.4.1 Registration Form . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.4.2 Log-in Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.4.3 The Main Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.4.4 Searching Wordnet By Lexeme Web Form . . . . . . . . . . . . . . 77
7.4.5 Searching Wordnet By OffsetPos Web Form . . . . . . . . . . . . . 78
7.4.6 Evaluating Semantic Relations Between Synsets Web Form . . . . 80
7.4.7 Evaluating Wordnet Synsets Glosses Web Form . . . . . . . . . . . 83
7.4.8 Users Management Web Form . . . . . . . . . . . . . . . . . . . . 85
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8 Conclusions 88
9 Future Work 91
9.1 Extending Bilingual Dictionaries . . . . . . . . . . . . . . . . . . . . . . . 91
9.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets . . . . 93
9.2 Integrating Part-of-speech Tagging into Wordnet Construction . . . . . . . 95
9.3 Wordnet Expansion Using Word Embeddings . . . . . . . . . . . . . . . . 96
9.4 Producing Vector Representation for Multi-word Lexemes . . . . . . . . . 97
9.5 Vector Representation for Multi-lingual Wordnets . . . . . . . . . . . . . 97
Bibliography 98
Appendices 110
A Data Processing Software Code 110
A.1 computCosineSim.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.2 GenerateVectorForSynset.py . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.3 GenerateVectorForGloss.py . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.4 ComputeGlossSynsetSimilarity.py . . . . . . . . . . . . . . . . . . . . . . 114
B Microsoft SQL Server Tables 115
C LexBank Utility Class 127
List of Tables
3.1 A list of the Java libraries tested in (Finlayson, 2014). . . . . . . . . . . . . 25
3.2 A comparison between some of the Java libraries for accessing the PWN. . 26
4.1 Wordnet semantic relations. . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Size, coverage and precision of the core wordnet we create for Arabic,
Assamese and Vietnamese. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Precision of the semantic relations established for our Arabic wordnet. . . . 37
5.1 An example of cosine similarity between words in a candidate synset . . . . 44
5.2 The weighted average similarity between related words in AWN. . . . . . . 47
5.3 Comparison between the weighted similarity average obtained using dif-
ferent word2vec settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Comparison between the number of synsets in AWN and our Arabic word-
net using different threshold values. . . . . . . . . . . . . . . . . . . . . . 48
5.5 Precision of the Arabic wordnet we create. . . . . . . . . . . . . . . . . . . 49
5.6 Precision of the Assamese wordnet we create. . . . . . . . . . . . . . . . . 49
5.7 Precision of the Vietnamese wordnet we create. . . . . . . . . . . . . . . . 49
5.8 Examples of related words and their cosine similarity from our Arabic
wordnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.9 Examples of related words and their cosine similarity from our Assamese
wordnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.10 Examples of related words and their cosine similarity from our Vietnamese
wordnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.1 Meanings of the noun “spill” and its synonyms. . . . . . . . . . . . . . . . 55
6.2 Cosine similarity between the different synset vectors and glosses of the
word “abduction” in PWN. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3 The precision of selecting glosses for PWN synsets . . . . . . . . . . . . . 60
6.4 Examples of Arabic glosses we produce in our Arabic wordnet. . . . . . . . 61
6.5 Examples of Assamese glosses we produce in our Assamese wordnet. . . . 62
6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet. . 63
6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese
synsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Figures
3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al.,
2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 IWND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Core wordnet mapping to structured wordnet. . . . . . . . . . . . . . . . . 34
4.3 Creating wordnet semantic relations using intermediate wordnet. . . . . . . 35
4.4 The effect of missing synsets in recovering wordnet semantic relations us-
ing intermediate wordnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese
and Vietnamese wordnets. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 A histogram of synonyms, semantically related words, and non-related
words extracted from AWN. . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.1 An example of creating a vector for a wordnet synset that includes more
than one word. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 An example of creating vectors for wordnet synsets that share a single word. 56
7.1 An overview of LexBank system. . . . . . . . . . . . . . . . . . . . . . . . 65
7.2 LexBank web site map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.3 The registration web form . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.4 Sequence diagram of the registration process . . . . . . . . . . . . . . . . 75
7.5 The log-in web form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.6 Sequence diagram of the log-in process . . . . . . . . . . . . . . . . . . . 76
7.7 The main menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.8 The Web form for searching wordnet by lexeme. The form is showing the
result of searching the Arabic lexeme (مصر) which means Egypt. . . . . . 78
7.9 Sequence diagram of the process of searching wordnet using lexeme . . . . 79
7.10 The Web form for searching wordnet by OffsetPos. The form is showing
the result of searching the Arabic synset (08897065-n). . . . . . . . . . . . 80
7.11 Sequence diagram of the process of searching wordnet using OffsetPos. . . 81
7.12 The Web form for evaluating semantic relations between synsets in a word-
net. The form is showing an example of evaluating a hyponymy relation
between the two Assamese lexemes radiotelegraph and radio. . . . . . . . . 81
7.13 Sequence diagram of the process of evaluating the relation between two
lexemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.14 The Web form for evaluating wordnet synsets glosses. The form is showing
an example of evaluating Arabic synset (13108841-n). . . . . . . . . . . . 83
7.15 Sequence diagram of the process of evaluating wordnet synset glosses. . . . 84
7.16 The Web form for managing users in LexBank. . . . . . . . . . . . . . . . 85
7.17 Sequence diagram of the process of managing users in LexBank. . . . . . . 86
9.1 The IW approach for creating a new bilingual dictionary . . . . . . . . . . 92
9.2 Extending bilingual dictionaries using structured wordnets . . . . . . . . . 94
Chapter 1
INTRODUCTION
1.1 Motivation
A lexical resource is a classified group of lexical units that provides some linguistic
information. The lexical units can be morphemes, words or multi-word phrases. The basic
unit of a lexical resource is usually called a lexical entry. Some lexical resources can
be used by humans directly, while other lexical resources are machine readable. Lexical
resources form the basis of most Natural Language Processing (NLP) applications.
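The notion of a lexical entry can be illustrated with a minimal data structure; the field names below are our own illustration, not drawn from any particular resource:

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    """A minimal lexical entry: one lexical unit plus linguistic information."""
    lemma: str                                  # the lexical unit (morpheme, word, or phrase)
    pos: str                                    # part of speech
    gloss: str = ""                             # human-readable definition
    synonyms: list = field(default_factory=list)

entry = LexicalEntry(lemma="helicopter", pos="n",
                     gloss="an aircraft without wings that obtains its lift "
                           "from the rotation of overhead blades",
                     synonyms=["chopper", "whirlybird", "eggbeater"])
```

A machine-readable resource is then, at its simplest, a collection of such entries indexed by lemma.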
There are many types of lexical resources. Based on its type, a lexical resource
can provide syntactic, morphological, phonological or semantic information. Lexicons,
monolingual dictionaries, bilingual dictionaries and wordnets are examples of lexical
resources. There are a few fortunate languages, such as English and Chinese, which
have a relatively large number of high-quality lexical resources. These languages are
usually called resource-rich. Most of the lexical resources of the resource-rich languages
have been painstakingly created by hand by researchers over many years. Unfortunately,
most other languages lack many of those lexical resources. Those languages
which lack lexical resources are called resource-low or resource-poor languages.
While some of those languages might have some resources, others barely have
any lexical resources. Especially poor in this context are the endangered languages around
the world.
One important resource that is very helpful in computational processing and in human
language learning is a thesaurus providing synonyms and antonyms of words. An extended
version of a thesaurus that provides additional relations among words in the computational
context is usually called a wordnet. A wordnet is a structured lexical ontology
that groups words based on their meaning into sets called synsets. For example,
the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that
has the gloss: an aircraft without wings that obtains its lift from the rotation of overhead
blades. The wordnet connects synsets with each other based on semantic relations. Word-
nets are used in many applications such as word sense disambiguation, machine translation,
information retrieval, text classification and text summarization.
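The structure just described can be sketched as a small data model; the synset identifiers below follow the PWN OffsetPos style, but the fragment itself is illustrative:

```python
# A toy wordnet fragment: synsets group synonymous words and carry a gloss;
# semantic relations (here, hypernymy) link synsets by identifier. The
# identifiers are PWN-style but the data is illustrative.
wordnet = {
    "03512147-n": {
        "words": ["helicopter", "chopper", "whirlybird", "eggbeater"],
        "gloss": ("an aircraft without wings that obtains its lift "
                  "from the rotation of overhead blades"),
        "hypernyms": ["02686568-n"],
    },
    "02686568-n": {
        "words": ["aircraft"],
        "gloss": "a vehicle that can fly",
        "hypernyms": [],
    },
}

def hypernym_glosses(synset_id):
    """Follow hypernym links and return the glosses of parent synsets."""
    return [wordnet[h]["gloss"] for h in wordnet[synset_id]["hypernyms"]]
```

Applications such as word sense disambiguation traverse exactly these relation links to compare candidate senses.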
The Princeton WordNet (PWN) is the original English version of such a wordnet and
has been painstakingly produced with diligent manual work augmented by the development
of computational tools, over several decades at Princeton University. Similar complete
wordnets have also been produced for a small number of additional languages such as
French (Sagot and Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and
Watanabe, 2006). Efforts to produce wordnets for a variety of other languages have been
proposed, but most are moving slowly, such as the efforts to construct the Asian wordnets
(Charoenporn et al., 2008) and Indian wordnets (Bhattacharyya, 2010).
Another important type of resource is the bilingual dictionary, an essential tool for
human language learners. Most existing (online) bilingual dictionaries are between two
resource-rich languages or between a resource-rich language and a resource-poor language.
It is fortunate that many endangered languages have one bilingual dictionary, created usu-
ally by explorers, evangelists or other scholars. However, dictionaries or translators for
translations between two resource-poor languages do not really exist. Wiktionary1, a dic-
tionary created by volunteers, supports over 171 languages, although coverage is poor for
many of them. The online machine translators developed by Google2 and Microsoft3 pro-
vide pairwise translations, including translations for single words, for 90 and 51 languages,
respectively. While this is a wide range of languages, these machine translators still leave
out many widely-spoken languages, not to mention endangered ones.
In previous work we focused on developing new techniques that leverage existing
resources for resource-rich languages to build bilingual dictionaries, and core wordnets
and other resources such as simple translators for resource-poor languages, including a few
endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources
to the next level by improving the functionality, quality and coverage of these resources.
We present several new techniques that we did not use in our previous work. Our ultimate
goal is to produce an integrated multilingual lexical resource available online, one that
includes several important individual resources for several languages. We believe that our
resources will help researchers, speakers, learners and other users of these languages.
1.2 Research Focus
The goal of this dissertation is to create and make available multilingual lexical re-
sources for several languages by bootstrapping from a limited number of existing resources.
Our study has the potential not only to construct new lexical resources, but also to provide
support for communities using languages with limited resources. Additionally, our re-
1 http://en.wiktionary.org/wiki/Wiktionary:Main_Page
2 http://translate.google.com/
3 http://www.bing.com/translator
search presents novel approaches to generate new lexical resources from a limited number
of existing resources.
The main focus of our work is to collect data from disparate sources, develop algo-
rithms for mining and integrating such data, produce lexical resources, and evaluate the
resources with regard to the quality and quantity of entries. To develop and test our ideas, we
work with a few languages with in-house expertise. These include Assamese (asm), Arabic
(arb), English (eng) and Vietnamese (vie). In Chapter 2 we present a detailed introduction
to Arabic. Next, we present a brief introduction to Assamese and Vietnamese.
1.2.1 Assamese Language
Assamese is an Indo-European language that is spoken by more than 15 million
people (Hinkle et al., 2013). It is mainly used in the Indian states of Assam, Arunachal
Pradesh, Meghalaya, Nagaland and West Bengal. Assamese has 4 dialects: Standard As-
samese, Jharwa, Mayang and Western Assamese (Gordon and Grimes, 2005). We present
a brief description of the script, morphology and syntax of Assamese.
1.2.1.1 Assamese Script
Assamese script consists of 37 consonants, 11 vowels, 147 conjuncts and a few punc-
tuation marks (Hinkle et al., 2013). Unlike in English, where written letters may have
variable pronunciations, Assamese written letters have one pronunciation each. A consonant that
does not occur at the end of a word is assumed to have an implicit vowel a following it. How-
ever, when several consonants need to be pronounced together, they are usually written
using a new conjunct letter.
When a vowel follows a consonant, the vowel is not written explicitly, but implicitly
as an operator. These operators are attached to consonants in different positions (Hinkle
et al., 2013). They can appear to the left, right, below or above the consonant. Foreign
words can appear in Assamese script as transliterations. However, it is not unusual to write
foreign words in foreign alphabets within a piece of Assamese text.
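In Unicode, Assamese is written with the Bengali script block, where a conjunct is encoded as consonant + virama + consonant and the rendering engine selects the conjunct glyph. A brief illustration:

```python
import unicodedata

# Assamese uses the Bengali Unicode block. A consonant cluster (conjunct)
# is stored as consonant + virama + consonant; the renderer chooses the
# conjunct glyph. Example: ka + virama + ssa.
ka, virama, ssa = "\u0995", "\u09CD", "\u09B7"
conjunct = ka + virama + ssa          # three code points, one visual conjunct

print(unicodedata.name(virama))       # BENGALI SIGN VIRAMA
```

The implicit vowel described above is likewise a property of the consonant letters themselves; the virama is what suppresses it inside a cluster.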
1.2.1.2 Assamese Morphology
Assamese has two types of morphological transformations: derivational
and inflectional. Around 48% of Assamese words are constructed using these two types
of transformation (Sharma et al., 2008). The derivational transformation in Assamese is
usually performed by changing the vowel component in the word, while the inflectional
transformation is performed by adding prefixes or suffixes to the word. Assamese is well-
known for its complex suffixes. It is common in Assamese that a word includes a sequence
of suffixes. Four to six suffixes in sequence are not uncommon (Saharia et al., 2009).
In Assamese, suffixes are used for many purposes. The most common purpose of
suffixes is determination (Sharma et al., 2008). In fact, a large number of the Assamese
suffixes are determiners. As in other languages, some determiners are attached to nouns
and pronouns to make them specific. This is similar to using this and that in English.
Unlike in many other languages, such as English, where affixes are used, in Assamese
determiners are also used to change a singular noun to plural.
1.2.1.3 Assamese Syntax
Assamese has relatively free word order, which means that sentences can be written
in different word orders and still have the same meaning. The normal form of a simple
Assamese sentence is Subject+Object+Verb (SOV) (Sarma, 2012), although other orders
are acceptable.
1.2.2 Vietnamese Language
Vietnamese, the first language of Vietnam, is an Austroasiatic language that arose
in Indo-China (Thompson, 1987). It is the first language of more than 75 million peo-
ple living in Vietnam (Gordon and Grimes, 2005). Also, due to emigration, it is the first
language of many people living around the world, especially in East and Southeast Asia.
Vietnamese, which is also called Annamese, has five main dialects that differ mainly in
their sound systems. The five main dialects of Vietnamese are: Northern Vietnamese,
North-central Vietnamese, Mid-Central Vietnamese, South-Central Vietnamese and South-
ern Vietnamese (Wikipedia, 2016a). In the next sections, we present a brief description of
the script, morphology and syntax of Vietnamese.
1.2.2.1 Vietnamese Script
Old Vietnamese texts were written using Chinese characters. In the 17th century, the
Latin alphabet was introduced to Vietnamese by the French. By the beginning of the 20th
century, the Romanized version of Vietnamese became dominant (Thompson, 1987).
Compared to other languages, Vietnamese has a large number of vowels. It has 11
single vowels in addition to three types of composed vowels: centering diphthongs, clos-
ing diphthongs and triphthongs (Gordon and Grimes, 2005). These vowels are created
by combining single vowels together. Vowels are modified by diacritics. The diacritics,
which can be written above or below a vowel, are used to specify the tone of the vowel.
These tones have different lengths, pitch heights, pitch melodies and phonations. There are
25 consonants in Vietnamese. Consonants are represented in written script by a variable
number of letters. Some of the consonants are represented using one letter and other conso-
nants are represented by a digraph, which is a combination of two letters. There are some
consonants which are represented by more than one digraph or letter (Wikipedia, 2016a).
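The layering of a quality mark and a tone mark on a single vowel can be observed directly in Unicode, where canonical decomposition (NFD) separates the base letter from its combining marks. For instance, for the vowel ế (as in tiếng):

```python
import unicodedata

# The vowel ế carries both a quality mark (circumflex) and a tone mark
# (acute). Canonical decomposition (NFD) separates base letter and marks.
vowel = "\u1EBF"  # ế: Latin small letter e with circumflex and acute
decomposed = unicodedata.normalize("NFD", vowel)
names = [unicodedata.name(c) for c in decomposed]
# names: ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT',
#         'COMBINING ACUTE ACCENT']
```

One precomposed code point thus unpacks into a base vowel plus two diacritics, mirroring the description above.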
1.2.2.2 Vietnamese Morphology
In Vietnamese, the majority of words are polysyllabic (Noyer, 1998). Poly-
syllabic words are words composed of two or more syllables. Polymorphemic
words in Vietnamese are constructed in three ways: combining two words, adding affixes
to a stem, or reduplication. Words formed using reduplication morphology are constructed
by duplicating a word or a part of a word. There are a small number of affixes in Vietnamese.
Most of them are in the form of prefixes and suffixes. One distinct characteristic of Viet-
namese is that it does not have any number, gender, case or tense distinctions (Wikipedia,
2016b). However, usually a noun classifier is used as a determiner and is added after the
word to specify those characteristics.
1.2.2.3 Vietnamese Syntax
Vietnamese sentences follow the Subject+Verb+Object (SVO) word order. To dis-
tinguish between verbs and nouns in a Vietnamese sentence, a copula is used before the
nouns. Noun phrases are usually composed of a noun and a modifier. The modifier can
be a numerator, classifier, prepositional phrase or other description word. Like in other
languages, pronouns are used to substitute the nouns and noun phrases.
1.3 Research Contributions
The resources created in Khang's PhD dissertation (Lam, 2015) and reported in (Lam
et al., 2014a,b, 2015b) have many gaps. For example, the wordnets have only synsets, which are
sets of synonyms for words. In this dissertation work, we develop algorithms and models
to automatically establish the semantic relations between synsets in our previously created
core wordnets for our languages of focus using both pre-existing resources, as well as by
bootstrapping with resources we create ourselves. Following are the contributions produced
by this thesis:
• We construct the rest of the structure for our core wordnets with acceptable qual-
ity. We focus on the construction of wordnet semantic relations such as Hypernyms,
Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms between the
synsets. We believe that our work contributes significantly to the repository of re-
sources for languages that lack them.
• We present a method to enhance the quality of the wordnets we create in the first
task by filtering out mistakenly created synsets and relations. In this task, we use one
of the state-of-the-art techniques, word embeddings (Mikolov et al., 2013).
This method gives a solution to the problem of wrong translations produced by the
translation method.
• We produce an approach to create a vector representation for synsets. This approach
aims to provide a better way of representing meaning. This representation can be
used in several areas. In this task we use it to automatically extract glosses from
corpora for the wordnet synsets we create in the previous tasks. It can also be used for
the word-sense disambiguation (WSD) problem, which occurs with words that have
multiple meanings.
• Then, based on the vector representation of synsets, we present a novel approach
to add a gloss for each synonym set (synset) in our core wordnets. A gloss is a
definition or a sentence that clarifies the meaning of the synset. Glosses are mostly
added manually by humans or generated automatically using a rule-based generation
approach (Cucchiarelli et al., 2004).
• Finally, we present LexBank, a system that makes our created resources publicly
available. We design and implement the system so that it provides useful services,
in a friendly manner, to users who seek linguistic resources. We aim to make our
system flexible and expandable so it can accommodate additional new languages
and resources.
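The semantic-relation construction in the first contribution, detailed in Chapter 4, can be sketched as projecting PWN relations onto target-language synsets through shared OffsetPos identifiers; the data and function below are a simplified illustration, not our exact implementation:

```python
# Simplified sketch: if synsets of a new wordnet are aligned with PWN
# synsets via shared OffsetPos identifiers, PWN semantic relations can be
# projected onto the new wordnet, skipping relations whose endpoint
# synsets were not created. Identifiers here are PWN-style but toy data.
pwn_hypernyms = {"02084071-n": ["02083346-n"]}  # relation on the PWN side
target_synsets = {"02084071-n", "02083346-n"}   # synsets built for the new language

def project_relations(pwn_relations, created):
    """Copy each relation whose source and target synsets both exist."""
    projected = {}
    for src, targets in pwn_relations.items():
        if src not in created:
            continue
        kept = [t for t in targets if t in created]
        if kept:
            projected[src] = kept
    return projected
```

Missing synsets simply drop the relation, which is why Chapter 4 measures the percentage of relations recovered.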
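The embedding-based filtering in the second contribution can be sketched as follows; the vectors and the threshold are toy values for illustration (the actual thresholds are selected empirically in Chapter 5):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d "embeddings"; real vectors would come from word2vec.
emb = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "feline": np.array([0.8, 0.2, 0.1]),
    "car":    np.array([0.0, 0.1, 0.9]),  # a wrong translation that slipped in
}

def filter_synset(words, threshold=0.4):
    """Keep words whose mean similarity to the rest exceeds the threshold."""
    kept = []
    for w in words:
        others = [cosine(emb[w], emb[o]) for o in words if o != w]
        if sum(others) / len(others) >= threshold:
            kept.append(w)
    return kept
```

A word produced by a wrong translation sits far from the rest of the candidate synset in embedding space and is removed.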
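For the synset vector representation of the third contribution, a simple baseline is the centroid of the member words' embeddings; the sketch below uses toy vectors, while Chapter 6 describes the actual Synset2Vec construction:

```python
import numpy as np

# Baseline: represent a synset by the centroid of its members' embeddings.
# The vectors are toy values for illustration.
emb = {
    "helicopter": np.array([0.7, 0.2]),
    "chopper":    np.array([0.6, 0.3]),
    "whirlybird": np.array([0.8, 0.1]),
}

def synset_vector(words):
    """Average the member word vectors to get one vector per synset."""
    return np.mean([emb[w] for w in words], axis=0)

v = synset_vector(["helicopter", "chopper", "whirlybird"])  # [0.7, 0.2]
```

Because the centroid is pulled toward the shared meaning of the members, it separates the senses of an ambiguous word that belongs to several synsets.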
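Gloss selection in the fourth contribution can be framed as a nearest-neighbor search over candidate sentences from a corpus; the vectors below are toy stand-ins for corpus sentence representations:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_gloss(synset_vec, candidates):
    """candidates: (sentence, sentence_vector) pairs extracted from a corpus.
    Return the sentence whose vector is closest to the synset vector."""
    return max(candidates, key=lambda sv: cosine(synset_vec, sv[1]))[0]

# Toy vectors standing in for synset and sentence embeddings.
synset_vec = np.array([1.0, 0.0])
candidates = [
    ("an aircraft lifted by rotating blades", np.array([0.9, 0.1])),
    ("a tool for beating eggs",               np.array([0.1, 0.9])),
]
best = pick_gloss(synset_vec, candidates)
```

The highest-scoring corpus sentence then serves as a candidate gloss for the synset, to be verified by human evaluators.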
Chapter 2
CASE STUDY: THE CURRENT STATUS AND CHALLENGES OF
PROCESSING INFORMATION IN ARABIC
Since Arabic is one of the languages we use in our experiments throughout this dis-
sertation, we present the current status of Arabic language processing as an example in this
chapter.
2.1 Introduction
According to Ethnologue (Gordon and Grimes, 2005), Arabic is the official language
of more than 223 million people in 25 countries, which makes it one of the most widely
spoken languages in the world. Arabic is the language of Islam, which is the religion of
1.6 billion people around the world. Muslims are required to use Arabic to read the Quran
(the Holy Book of Islam) and to perform the rituals of Islam. There are around 30 major
dialects in Arabic. These dialects have different phonologies, morphologies, syntax and
even lexicons (Habash, 2010). However, these dialects are not used as official languages
by themselves. They are used for informal speech. For formal writing and speaking, the
official Modern Standard Arabic (MSA) is used. MSA was developed based on Classical
Arabic, which is the language of historical literature. Nowadays, dialects are commonly
used for writing in social media, but they are rarely used in books, newspapers and
in literary writing. Even though most Arabs can speak MSA, it is not the natively spoken
language of any region (Diab and Habash, 2007). This coexistence between MSA and
dialects is problematic for Arabic language processing. This happens to be a problem in
most widely spoken languages in the world.
One important survey (Farghaly and Shaalan, 2009) discussed the importance of
research in the field of Arabic processing from two perspectives. First, the perspective of
non-Arabic speakers who need to process a huge amount of Arabic texts. The Department
of Homeland Security in the United States is a good example. With increasing security
risks, there is a crucial need to be able to understand the meaning of Arabic documents
and retrieve important information from them, such as names, organizations and places. The
second perspective is that of Arabic speakers. Machine translation, information retrieval,
summarization, and linguistic tools are some of the applications requested by
Arabic speakers.
In the rest of this chapter, we give a summary of the features that make the process-
ing of Arabic text so challenging and some of the solutions and resources that have been
designed to address these challenges. First, in Section 2, we discuss the fundamental
issues in Arabic: the script, the morphology, and the syntax. Then,
in Section 3, we discuss three of the most valuable resources for Arabic processing. These
are The Penn Arabic Treebank (PATB), The Prague Arabic Dependency Treebank (PADT),
and The Columbia Arabic Treebank (CATiB).
2.2 Fundamentals of Arabic
In this section we discuss the script, morphology and syntax of Arabic.
2.2.1 Arabic Script
Arabic is written right to left. The Arabic script is also used by languages
such as Kurdish, Urdu, Persian and Pashto (Habash, 2010). One important aspect of Arabic
is that most Arabic letters are composed of two parts: a base form and a mark. There
are three kinds of marks in Arabic letters. The first kind consists of dots which are used
to distinguish between letters that share the same base form. Examples of letters that
share the same base form are the letters (ب) “ba”, (ت) “ta”, and (ث) “tha”. The second kind
of mark is the Hamza mark (ء), which can be written above some letters, as in (أ) “u”, or
under some letters, as in (إ) “I”. Unfortunately, people often misspell words by not writing
such marks, making it hard to distinguish between similar letters and causing ambiguity in
the text. It is also important to notice that Hamza (ء) can also be considered a letter by
itself. An example of a word that has the Hamza letter is the word (سماء), which means
“sky”. The third kind of mark is the Hamza mark that distinguishes the letter (ك) “Kaf”
from the letter (ل) “Lam”.
Most letters in Arabic have several shapes. The shape of a written letter is determined
based on the position of that letter in the word. Let us take the letter (ق) “qaf” as an
example. If it appears at the beginning of the word, it will have the shape (قـ), whereas it
will have the shape (ـقـ) if it appears in the middle of the word, and the shape (ـق) if it
is at the end of the word. All word processors select the appropriate letter shape based on
the rules which govern these shapes, and therefore, there is only one key for each letter.
Inflectional morphology is also a factor that governs the shape of some Arabic letters.
The Arabic letter “Hamza” is a good example of that. The word (أصدقاء), which means
“friends”, becomes (أصدقائي) instead of (أصدقاءي) when we add the letter (ي), which
represents the possessive pronoun “my”.
In Arabic, each letter is mapped to one unvarying sound, which makes it a phonetic
language. For example, the Arabic letter (س) always has the pronunciation /s/. On the
other hand, the letter “s” in English has three pronunciations: /z/, /s/, and /sh/ as in “nose”,
“salt”, and “sugar”, respectively. However, in Arabic a short vowel may be added to the
letter to change its sound. There are three short vowels in Arabic, which means that each
letter has three more sounds in addition to the original sound. There are no dedicated letters
to represent short vowels. The short vowels may be specified in the written language using
optional diacritics. To show how the short vowels change the sound of Arabic letters, let us
look at the Arabic letter (س) again. We said that (س) is pronounced as /s/; however, if we
add the short vowel “Dhamma”, it will be pronounced as “su” and it may be written, with
the “Dhamma” diacritic, as (سُ). If we add the short vowel “Kasra”, it will be pronounced
as “si” and it may be written, with the “Kasra” diacritic, as (سِ). Keep in mind that in MSA, the
writing of the diacritics is optional, although a change in a diacritic of a letter can change
the meaning of the word and may even change the morphological structure of the sentence.
Clearly, this is a major source of ambiguity in Arabic processing (Diab and Habash, 2007).
Obviously, with all these problems caused by the Arabic script, Arabic input text
has to be pre-processed to enhance recognition during the actual processing. This pre-
processing, which is called normalization, aims to standardize different Arabic script varia-
tions. There are several solutions which have been proposed to normalize the Arabic script.
For example, Larkey et al. (2002) normalized the corpus, the queries, and the dictionaries
of Arabic using the following steps. They first unified the encoding and removed punctuation
in the text. Then they removed all the diacritics and the elongation character called “tatweel”.
After that, they removed the Hamza mark (ء) from the letter “Alif” to standardize all the
variations (أ), (إ), and (آ) to (ا). Also, they replaced (ى) with (ي), (ة) with (ه), and (�)
with (�). The Stanford Natural Language Processing Group adopted a similar procedure in
the Stanford Arabic Statistical Parser (Green and Manning, 2010). The normalization pro-
cess, as you might expect, does not come without a price. Since the purpose of all these
removed marks is to resolve ambiguity, normalizing the script variants increases the
probability of ambiguity (Farghaly and Shaalan, 2009).
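The normalization steps described above can be sketched in a few lines. This is a simplified illustration, not the authors' code; the exact rule inventory (which diacritics, which letter mappings) is an assumption based on the description in the text, and production normalizers also handle encoding unification and punctuation.

```python
import re

# Sketch of Larkey-style Arabic normalization (assumed, partial rule set).
DIACRITICS = re.compile('[\u064B-\u0652]')    # fathatan .. sukun (optional marks)
TATWEEL = '\u0640'                            # elongation character

def normalize(text):
    text = DIACRITICS.sub('', text)           # strip optional diacritics
    text = text.replace(TATWEEL, '')          # remove tatweel
    for alif in ('\u0623', '\u0625', '\u0622'):   # hamza/madda Alif variants
        text = text.replace(alif, '\u0627')       # -> bare Alif
    text = text.replace('\u0649', '\u064A')   # alef maksura -> ya
    text = text.replace('\u0629', '\u0647')   # ta marbuta -> ha
    return text
```

Applying `normalize` to a word with diacritics and tatweel collapses all the script variants to a single canonical spelling, which is exactly what makes matching easier and ambiguity higher.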
Unlike English, there are no silent letters in Arabic. An example of a silent letter
in English is the letter “p” in the word “pneumonia”. There are no new sounds in Arabic
produced by combining two letters. For instance, in English, “c” and “h” are combined
to produce three distinct sounds: the sound at the beginning of “cheese”, the sound at the
beginning of “character”, and the sound at the beginning of “chef.”
It is well known that the process of splitting text into sentences is an essential step
in many Natural Language Processing (NLP) applications. In English, this is a relatively
easy task since English sentences start with an uppercase letter and finish with a period.
However, splitting Arabic sentences is not as easy since there is no capital
form for Arabic letters (Chinese, Japanese, and Korean have no capitalization either). In
addition, punctuation rules in Arabic are not strict, so many people do not use punctuation properly. In
fact, Arabic writers excessively use coordinations, subordinations and logical connectives
to conjoin the sentences (Farghaly and Shaalan, 2009). Hence, it is not unusual for an
Arabic article to have a complete paragraph which does not include any periods other than
the period at the end of the paragraph. Therefore, Arabic texts must go through
complicated preprocessing.
The lack of capitalization obviously makes it hard to detect named entities (Darwish,
2013) which is an essential part of Information Retrieval (IR). In English, extracting named
entities such as cities, names of people, addresses and organizations is done with the help of
capitalization and punctuation. For example, to recognize a name like “Barack H. Obama”,
a simple algorithm can be used to search for a capitalized word followed by an initial
with an optional period followed by another capitalized word. We are not claiming that NER
in English is straightforward or simple in general, but since Arabic does not have these
features, new methods must be used to address the problem of named entity recognition
(Darwish, 2013).
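The capitalization-based heuristic described above can be written as a toy regular expression. This is purely illustrative (real NER systems are far more involved, as the text notes), and the pattern is our own simplification:

```python
import re

# Toy heuristic: a capitalized word, then one or more further capitalized
# words, optionally preceded by a single-letter initial with a period.
NAME = re.compile(r'\b[A-Z][a-z]+(?:\s(?:[A-Z]\.?\s)?[A-Z][a-z]+)+\b')

NAME.findall('He said Barack H. Obama met reporters in Chicago yesterday.')
```

The single word “Chicago” is not matched because the toy pattern requires at least two capitalized words; precisely this kind of cheap surface cue is unavailable in Arabic.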
2.2.2 Arabic Morphology
Arabic has a very rich and complex morphology (Attia, 2008). Similar to the other
Semitic languages, morphology in Arabic is of two types, derivational and inflectional.
Derivational morphology is the process of creating new words. This is done by mapping
a root to a pattern. The root holds the meaning while the pattern changes the structure of
the root generating a new word with a different part-of-speech. This type of derivational
morphology is called nonlinear morphology (Bhattacharya et al., 2005). On the other
hand, inflectional morphology is the process of modifying the words with features to create
plural, feminine, or definite forms of the word (Habash, 2010).
A morpheme is “a linguistic form which bears no partial phonetic-semantic resem-
blance to any other form” (Bloomfield, 1933). Morphological processing in NLP is the
process of decomposing a word into morphemes. This is a relatively easy task in con-
catenative morphology. However, in languages with nonconcatenative morphology, like
Arabic, it is a much harder task. In Arabic, words are built by merging a consonantal root
and a vocalism (McCarthy, 1981). The root holds a semantic field while the vocalism
specifies the grammatical form. An example showing the nonconcatenative morphology
of Arabic would be the word (كتب) “katab”, which means “to write”. It is composed by
associating the root (كتب) /k-t-b/, which has the meaning of “writing”, with a vocalism
that specifies the grammatical form.
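The root-and-pattern interdigitation described above can be illustrated with a toy function. This is not any published analyzer; the patterns and derived forms below are standard textbook examples of the k-t-b root, transliterated into Latin letters for simplicity:

```python
# Toy illustration of nonconcatenative (root-and-pattern) morphology:
# digits 1..3 in the pattern are slots for the root consonants.
def interdigitate(root, pattern):
    """root: 3 consonants, e.g. 'ktb'; pattern mixes vowels with slots 1/2/3."""
    out = []
    for ch in pattern:
        out.append(root[int(ch) - 1] if ch in '123' else ch)
    return ''.join(out)

interdigitate('ktb', '1a2a3')    # 'katab'   (perfective verb, "he wrote")
interdigitate('ktb', '1aa2i3')   # 'kaatib'  ("writer")
interdigitate('ktb', 'ma12uu3')  # 'maktuub' ("written")
```

Decomposition is the inverse problem: given only the surface form, the analyzer must recover both the root and the pattern, which is why Arabic morphological analysis is so much harder than stripping prefixes and suffixes.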
Several approaches have been used to decompose Arabic words. The first approach
recovers the root by extracting all prefixes and suffixes from the word, then, stripping the
rest of the word using a lexicon of roots (Hlal, 1985). This approach is very common;
however, it requires a lexicon of all possible Arabic roots, prefixes, infixes and suffixes
(Beesley, 1996; Shaalan et al., 2006). Buckwalter introduced another approach in his mor-
phological analyzer (BAMA) (Buckwalter, 2004). Rather than recovering the root, BAMA
recovers the stem and considers it the main building block for the Arabic word. The stem
is recovered by just removing the prefixes and the suffixes. Therefore, BAMA decomposes
the Arabic word into three parts: Arabic stems, Arabic prefixes and Arabic suffixes.
The decomposition process searches for the prefixes and the suffixes in the word
that satisfy constraints governing the possibility of combining them with the stem in the
word. BAMA has a bidirectional transliteration scheme between the Arabic script and the Latin script.
This means that developers can work with unstructured Arabic texts without any Arabic
language knowledge. For this reason, many recent statistical Arabic NLP (ANLP) systems use BAMA as
the foundation for machine translation and information retrieval. However, BAMA has the
limitation of giving a general analysis that includes all possible cases of the word without
considering the context of the input text. A more refined result can be obtained using a
disambiguation module that considers the context of the input text to eliminate the
incorrect analyses (Habash and Rambow, 2005).
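The BAMA-style decomposition described above can be sketched schematically. The real analyzer uses large prefix, stem, and suffix tables plus compatibility constraints between them; the tiny tables below (in Buckwalter-style transliteration) are invented for illustration:

```python
# Schematic BAMA-style split: enumerate (prefix, stem, suffix) segmentations
# and keep those whose parts appear in the (toy) lexica.
PREFIXES = {'', 'wa', 'Al', 'waAl'}   # e.g. conjunction wa-, article Al-
SUFFIXES = {'', 'At', 'ha', 'hm'}     # e.g. plural -At, pronominal clitics
STEMS = {'ktAb', 'drs'}               # toy stem lexicon

def analyze(word):
    """Return all (prefix, stem, suffix) splits licensed by the toy lexica."""
    results = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and suf in SUFFIXES and stem in STEMS:
                results.append((pre, stem, suf))
    return results

analyze('waAlktAb')   # [('waAl', 'ktAb', '')]
```

Because several segmentations may be licensed for a single surface form, the output is a set of analyses, which is exactly why a context-sensitive disambiguation module is needed on top.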
Dialectal Arabic differs from MSA morphologically, lexically and phonologically
(Habash et al., 2013). Furthermore, there are no standard orthographies and no language
academies in dialectal Arabic. Therefore, the tools and resources designed for MSA do
not work with dialectal Arabic. Recently, several research efforts have focused on Arabic
dialectal texts (Habash and Rambow, 2005; Habash et al., 2013; Zaidan and Callison-
Burch, 2014). The state-of-the-art dialectal Arabic morphological analyzer is the Columbia
Arabic Language and dialect Morphological Analyzer (CALIMA) (Habash et al., 2013).
Arabic is an agglutinative language, which means that Arabic words usually include
affixes and clitics that represent different parts-of-speech. Let us take the word (كتبتها)
“katabtuha”, which means “I wrote it”. This word is a verb that has the subject and the
object attached to it. The subject is the diacritic on the fourth letter (ت) “ta”, while the
object is the suffix (ها) “ha”. This is just a simple example, whereas words usually have
more complex structures that include other clitics to specify the gender, person, number and
voice. Hence, due to complex phonological rules, the decomposition of words in Arabic is
relatively more difficult.
2.2.3 Arabic Syntax
According to Habash (2010), there are two kinds of sentences in Arabic: sentences
that start with a verb (V-SENT), and sentences that start with a noun (N-SENT). Verb-subject-
object (VSO) is the primary structure of a V-SENT sentence in Classical and Modern
Standard Arabic. However, the object-verb-subject (OVS) and subject-verb-object (SVO)
orders are also commonly used. As we mentioned before, Arabic is a pro-drop language, which
means that subjectless sentences are perfectly grammatical in Arabic. Also, unlike English,
equational sentences like “He a journalist” are allowed without the need for a
“to be” verb. Russian, Hungarian, Hebrew, and the Quechuan languages also allow this type
of sentence.
In Arabic, constituent questions are usually formed by starting with
a wh-phrase. However, it is grammatically correct if the constituent question does not start
with the wh-phrase. For example, the question (أكلت ماذا أمس) literally means “you
eat what yesterday?”. Furthermore, relative clauses in Arabic are connected using relative
pronouns. For example, in the sentence (أحببت البيت الذي اشتريته) there are two clauses:
(أحببت البيت), which means “I liked the house”, and (الذي اشتريته), which means “which
I bought”. The two clauses are connected using the relative pronoun (الذي), which means
“which”. The Arabic relative pronoun must agree in number and gender with the noun
it modifies.
2.3 Summary
In this chapter, we presented a short overview of information processing in Arabic.
We summarized challenges that face developers and researchers when processing Arabic
text due to many of its features. The lack of capitalization, dropped subjects, missing
short vowels and the nonconcatenative nature are some of these features. In addition, there
are many dialects in Arabic, which are used in informal speech and writing. These
dialects must be treated differently when processing Arabic texts. Much research has been
conducted to address the challenges of Arabic text processing. Some valuable resources
and techniques have been presented for Arabic. However, more work needs to be done to
give Arabic developers and speakers the support they need.
Chapter 3
LITERATURE REVIEW
In this chapter, we provide a summary of the main existing approaches for creating
lexical resources. We focus on two types of lexical resources: wordnets and bilingual
dictionaries.
3.1 Automatic Construction of Wordnets
A wordnet is a lexical ontology of words. There are two ways to construct wordnets
for languages that do not possess such resources: manual construction and automatic con-
struction. We intend to use automatic construction using core wordnets we have created in
our earlier work (Lam et al., 2014a,b, 2015b) and other existing resources that are freely
available. Other efforts are underway to manually (or mostly manually) create wordnets in
a variety of languages, although progress seems slow all around.
High-quality wordnets have been developed for a small number of languages. Word-
nets, other than the Princeton WordNet (Fellbaum, 1998; Miller, 1995), are typically con-
structed by one of two approaches. The first approach, which is called the expansion ap-
proach, translates the PWN to target languages (Akaraputthiporn et al., 2009; Barbu, 2007;
Bilgin et al., 2004; Kaji and Watanabe, 2006; Lam et al., 2014b; Lindén and Niemi, 2014;
Oliver and Climent, 2012; Sagot and Fišer, 2008; Saveski and Trajkovski, 2010). In con-
trast, the second approach, which is called the merge approach, builds the semantic taxon-
omy of a wordnet in a target language, and then aligns it with the Princeton WordNet by
generating translations (Borin and Forsberg, 2014; Gunawan and Saputra, 2010; Maziarz
et al., 2013; Rigau et al., 1998). To construct the taxonomic relations between words,
definitions of words are first retrieved from machine-readable dictionaries. Then, a genus
disambiguation process, which is the process of finding a word with a broad meaning that
more specific words fall under, is performed using the definitions to construct a hierarchical
class of concepts. Next, classes are merged with the synsets in the PWN using a bilingual
dictionary to form the target wordnet.
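The final merge step described above can be sketched schematically: taxonomy classes in the target language are linked to PWN synsets through a bilingual dictionary. The class name, translations, and synset identifier below are invented for illustration and do not come from any specific published system:

```python
# Schematic merge-approach linking: target-language genus classes -> PWN synsets.
def link_classes(classes, bidict, pwn_index):
    """classes: target-language genus words; bidict: target word -> English
    translations; pwn_index: English word -> synset ids."""
    links = {}
    for cls in classes:
        candidates = set()
        for en in bidict.get(cls, []):
            candidates |= set(pwn_index.get(en, []))
        links[cls] = candidates
    return links

link_classes(['pan'],                       # hypothetical target-language class
             {'pan': ['bread']},            # toy bilingual dictionary
             {'bread': ['07679356-n']})     # toy English -> synset index
```

When a translation is polysemous, a class receives several candidate synsets, and a further disambiguation step must choose among them.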
The expansion approach dominates the merge approach in popularity. Wordnets gen-
erated using the merge approach may have different structures from the Princeton Word-
Net. In contrast, wordnets created using the expansion approach have the same structure
as the Princeton WordNet, which provides for a level of uniformity among them, pos-
sibly at the cost of some natural language-specific expressiveness (Leenoi et al., 2008).
Many approaches to construct wordnets are semi-automatic and can be used
only for languages that have some existing lexical resources. As a result, any attempt to
build wordnets for resource-poor languages using these methods would be doomed from
the start. Moreover, while wordnets are always difficult to evaluate, it is even harder to eval-
uate machine-created wordnets in resource-poor languages because these languages do not
have gold standards to compare with, and frequently do not have easily-accessible experts
to evaluate such resources.
Crouch clusters documents first using a complete link clustering algorithm and gener-
ates thesaurus classes or synonym lists based on user-supplied parameters (Crouch, 1990).
Curran and Moens evaluate the performance and efficiency of thesaurus extraction meth-
ods and also propose an approximation method that provides for better time complexity
with little loss in accuracy (Curran and Moens, 2002a,b). Ramirez and Matsumoto develop
a multilingual Japanese-English-Spanish thesaurus using two freely available resources:
Wikipedia and the Princeton WordNet (Ramírez et al., 2013). They extract translation tu-
ples from Wikipedia articles in these languages, disambiguate them by mapping to wordnet
senses, and extract a multilingual thesaurus with a total of 25,375 entries. One thing we
must note about all these approaches is that they are resource-hungry, requiring a large cor-
pus of Wikipedia or non-Wikipedia documents and wordnets. For example, Lin works with
a 64-million word English corpus to produce a high quality thesaurus with about 10,000
entries (Lin, 1998). Ramirez and Matsumoto have the entire Wikipedia at their disposal
with millions of articles in three languages, although for experiments they use only about
13,000 articles in total (Ramírez et al., 2013). Furthermore, Miller and Gurevych (2014)
work with more than 19 thousand Wiktionary senses and 16 thousand Wikipedia
articles to produce a three-way alignment of WordNet, Wiktionary, and Wikipedia. When
we work with low-resource or endangered languages, we do not have the luxury of collect-
ing such big corpora or accessing even a few thousand articles from Wikipedia or the entire
Web. Many such languages have no or very limited Web presence. As a result, we have to
work with whatever limited resources are available.
In this work we propose approaches to generate synonyms, hypernyms, hyponyms
and some other semantic relations. To enhance the quality of wordnets we create, several
approaches are used to measure relatedness between concepts or words. Some potential
approaches for measuring semantic relationships are a dictionary-based approach (Kozima
and Furugori, 1993) and a thesaurus-based approach (Hirst and St-Onge, 1998).
Oliver presented approaches for constructing wordnets using the expand
model (Vossen, 1998) and made them available through a Python toolkit (Oliver, 2014). The authors
designed three strategies that use three types of resources to construct wordnets: dictio-
naries, semantic networks (Navigli and Ponzetto, 2010) and parallel corpora. While the
construction approaches of wordnets using dictionaries and semantic networks were direct,
the authors used machine translation and automatic sense-tagging to construct their word-
nets using parallel corpora. A toolkit1 provides access to the three construction methods
as well as some freely available lexical resources. To test their dictionary-based
approach, the authors constructed wordnets for six languages: Spanish, Catalan, French,
Italian, German and Portuguese with precision between 48.09% and 84.8%. Using their
semantic network based approach, the authors constructed wordnets for the six languages
with precision between 49.43% and 94.58%. The parallel corpus based approach with
machine translation achieved precision between 70.26% and 93.81%, while with auto-
matic sense-tagging it achieved between 75.35% and 82.44%. The authors stated that their
automatically-calculated precision value is very prone to errors.
Another example of constructing wordnets using dictionary based methods is JAWS
(Mouton and de Chalendar, 2010). JAWS is a French wordnet for nouns constructed by
translating wordnet nouns using a bilingual dictionary and a syntactic language model. The
construction of JAWS starts with copying the structure (the synsets with no words) of the
source wordnet. Then, the phrases that are available in the bilingual dictionary are used to
fill out the initial synsets. Finally, the language model is used to incrementally add new
phrases to JAWS. An improved version of JAWS is called WoNeF (Pradet et al., 2014).
1 http://lpg.uoc.edu/wn-toolkit
The new improved wordnet includes parts of speech and was evaluated using a gold stan-
dard produced by two annotators. In addition, WoNeF uses a better translation selection
algorithm that uses machine learning to select variable thresholds for translations.
In (Lam et al., 2014b), we presented three methods to construct wordnet synsets
for several resource-rich and resource-poor languages. We used some publicly available
wordnets, a machine translator and a single bilingual dictionary. Our algorithms translate
synsets of existing wordnets to a target language T, then apply a ranking method on the
translation candidates to find the best translations in T. The approaches we used are applicable
to any language which has at least one existing bilingual dictionary translating from English
to it.
In the first approach, which we call the direct translation approach (DR), for each
synset in PWN, we directly translate the words from English to the target language. In
the second approach, which we call IW, we extract candidates from several intermediate
wordnets rather than just using PWN to disambiguate the translation. In the third approach,
which we call IWND, we try to reduce the number of bilingual dictionaries we use in the
second approach. When the intermediate wordnet is not PWN, we translate the extracted
words from the wordnets to English, and then we use a single bilingual dictionary to trans-
late the words from English to the target language. In all of these methods, after extracting
the candidates, we use a ranking method to select the best translations and insert them as a
synset in the target wordnet.
Library URL
CICWN http://fviveros.gelbukh.com/wordnet.html
extJWNL http://extjwnl.sourceforge.net/
Javatools http://www.mpi-inf.mpg.de/yago-naga/javatools/
Jawbone http://sites.google.com/site/mfwallace/jawbone/
JawJaw http://www.cs.cmu.edu/~hideki/software/jawjaw/
JAWS http://lyle.smu.edu/~tspell/jaws/
JWI http://projects.csail.mit.edu/jwi/
JWNL http://sourceforge.net/apps/mediawiki/jwordnet/
URCS http://www.cs.rochester.edu/research/cisd/wordnet/
WNJN http://wnjn.sourceforge.net/
WNPojo http://wnpojo.sourceforge.net/
WordnetEJB http://wnejb.sourceforge.net/
Table 3.1. A list of the Java libraries tested in (Finlayson, 2014).
3.2 Wordnet Management Tools
Maintaining wordnets is an important area of research. The manual construction of
a wordnet is an intensive process that requires a large number of specialists to work for
several years. Furthermore, a wordnet is not static. The meanings of many phrases change
through time and new phrases appear every year. For example, the country Sudan was
divided into two countries Sudan and South Sudan in 2011. If one searches the PWN 3.1 for
Sudan, only the senses corresponding to the old Sudan show up since the new sense has not
yet been added. Moreover, the representation of wordnets evolves over time. For example,
many old wordnets were upgraded to provide the XML representation. In addition, as this
section shows, many wordnets are built based on the PWN. Every time PWN gets updated,
these wordnets must be updated also to preserve the alignment with PWN. All the previous
issues show the need for wordnet maintenance tools.
One recent work on tools for maintaining wordnets is by Mladenovic et al. (2014).
The tools are designed to provide upgrade, cleaning, validation, search, import and
export functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another
recent work develops a Java library, called JWI, for accessing the PWN and com-
pares it with eleven other libraries (Finlayson, 2014). The comparison between the li-
braries was based on five features: special requirements, supported similarity metrics, ability to
edit the wordnet, whether they require the Maven build tool, and the minimum
compatible Java version. Table 3.1 shows the tested libraries and Table 3.2 shows a sum-
mary of the comparison.
Library Standalone Similarity Metrics Editing Maven Minimum Java
CICWN Yes No No No 1.6
extJWNL No No Yes Yes 1.6
Javatools Yes Yes No No 1.6
Jawbone Yes Yes No No 1.6
JawJaw Yes Yes No No 1.5
JAWS Yes No No No 1.4
JWI Yes Yes No No 1.5
JWNL No Yes No Yes 1.4
URCS Yes No No No 1.6
WNJN No No No No 1.5
WNPojo No No No No 1.6
WordnetEJB No No No No 1.6
Table 3.2. A comparison between some of the Java libraries for accessing the PWN.
Another wordnet management tool was also presented recently for the IndoWordNet2
(Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management
Tool3 (CSS), provides an interactive user interface for creating new language synsets and
linking them to other Indian language wordnets. The CSS tool uses a role-based access
control to restrict the access to the wordnet. Figure 3.1 shows an overview of the CSS tool.
2 http://www.cfilt.iitb.ac.in/indowordnet/
3 http://indradhanush.unigoa.ac.in/conceptspace
Figure 3.1: An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014).
Sense marking is the process of tagging words with senses in a corpus. It is a necessary
task in preparing training data for machine learning techniques. Since sense marking is an
intensive process, sense marking tools are very handy. For example, the Indian Institute
of Technology Bombay has developed a sense marker tool for the IndoWordNet (Prab-
hugaonkar et al., 2014). The sense marking tool shows a highlighted word in a piece of text
and asks the annotator to choose the most appropriate sense from the available senses. The
tool also allows the annotator to add new senses that do not exist in the wordnet.
3.3 Creating Bilingual Dictionaries
Bilingual dictionaries are essential lexical resources which we use in our approaches.
The majority of low-resource languages have bilingual dictionaries to provide phrase trans-
lation between them and rich-resource languages. However, only relatively few bilingual
dictionaries are available for translation between low-resource languages. Several meth-
ods have been presented to automatically construct such dictionaries between low-resource
languages. Since the wordnets we create in this dissertation are aligned with each other, we
believe that they can be good resources for phrase translation between languages. In this
section, we discuss some methods for automatically creating bilingual dictionaries.
Given two input dictionaries L1-Lp and Lp-L2, a naïve method to create a new bilin-
gual dictionary L1-L2 may use Lp as a pivot in a straightforward transitive approach.
However, if a word has more than one sense, being a polysemous word, this method may
introduce incorrect translations. After computing an initial bilingual dictionary, past re-
searchers have used several approaches to mitigate the effect of ambiguity in word senses.
Methods used for disambiguation use wordnet distance between source and target words in
some way, or look at dictionary entries in both forward and backward directions and use
the amount of overlap to compute disambiguation scores (Ahn and Frampton, 2006; Bond
and Ogura, 2008; Gollins and Sanderson, 2001; Lam and Kalita, 2013; Shaw et al., 2013;
Soderland et al., 2010; Tanaka and Umemura, 1994).
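The naïve transitive approach and its polysemy problem can be sketched concretely. The entries below are invented for illustration: a source word with two senses picks up a spurious translation through an ambiguous pivot word:

```python
from collections import defaultdict

# Naïve pivot composition: L1 -> Lp -> L2, with no disambiguation.
def pivot_dictionary(d1p, dp2):
    """d1p: L1 word -> set of pivot translations; dp2: pivot word -> set of L2."""
    d12 = defaultdict(set)
    for w1, pivots in d1p.items():
        for p in pivots:
            d12[w1] |= dp2.get(p, set())
    return dict(d12)

d1p = {'banc': {'bank', 'bench'}}                  # toy L1 -> English entries
dp2 = {'bank': {'Bank', 'Ufer'}, 'bench': {'Bank'}}  # toy English -> German
pivot_dictionary(d1p, dp2)   # 'Ufer' (riverbank) leaks in via polysemous 'bank'
```

The disambiguation methods cited above exist precisely to filter out candidates like the spurious one here, e.g. by back-translation overlap or wordnet distance.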
Researchers have also merged information from several sources such as parallel cor-
pora or comparable corpora (Nerima and Wehrli, 2008; Otero and Campos, 2010) and a
wordnet (István and Shoichi, 2009; Lam and Kalita, 2013) to address the ambiguity prob-
lem. Some researchers extract bilingual dictionaries directly from monolingual corpora,
parallel corpora or comparable corpora using statistical methods (Bouamor et al., 2013;
Brown, 1997; Haghighi et al., 2008; Héja, 2010; Ljubešic and Fišer, 2011; Nakov and Ng,
2009; Yu and Tsujii, 2009).
Obviously, the quality and quantity of existing resources strongly affect the accura-
cies of newly-created dictionaries. For instance, Nerima and Wehrli create new English-
German and English-Italian bilingual dictionaries with 21,600 and 26,834 entries, respec-
tively, from 76,311 entries in an English-French dictionary, 45,492 entries in a German-
French dictionary, and 36,672 entries in a French-Italian dictionary (Nerima and Wehrli,
2008). Given parallel corpora of Lithuanian consisting of 1,765,000 tokens and Hungarian
including 2,121,000 tokens, Héja can extract only 2,616 correct translation candidates with
accuracy over a certain threshold from 4,025 translation candidates (Héja, 2010). Thus,
new bilingual dictionaries created using current approaches have very few entries com-
pared to the size of the input dictionaries. Furthermore, most resource-poor languages do
not have any corpora, or even online documents. Some languages have only one very small
bilingual dictionary, such as the Karbi-English dictionary of 2,341 words.
In (Lam et al., 2015b), we present approaches to automatically build a large num-
ber of new bilingual dictionaries for low-resource languages, especially resource-poor and
endangered languages, using a single input bilingual dictionary. Our algorithms produce
translations of words in a source language to many target languages using publicly avail-
able wordnets and a machine translator (MT). Our approaches may produce any bilingual
dictionary as long as one of the two languages is English or has a wordnet linked to the
PWN. Using our approaches and starting with 5 available bilingual dictionaries, we cre-
ated 48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the
popular MTs: Google and Bing.
3.4 Summary
In this chapter, we have discussed the existing methods for the automatic construc-
tion of wordnets. We have also discussed several tools and systems for managing wordnets.
Moreover, we covered some of the approaches for automatically creating bilingual dictio-
naries.
Google Translate: http://translate.google.com/
Bing Translator: http://www.bing.com/translator
Chapter 4
AUTOMATICALLY CONSTRUCTING STRUCTURED
WORDNETS
The core idea behind wordnet is to group words which are synonyms, or roughly syn-
onymous, into lexical categories that are called synsets. Then, semantic relations between
these synsets are established in a hierarchical manner. In this chapter, we present a method
to automatically construct the wordnet semantic relations such as Hypernyms, Hyponyms,
Member Meronyms, Part Meronyms and Part Holonyms using PWN.
4.1 Constructing Core Wordnets
In (Lam et al., 2014b) we introduced an approach, which we refer to as the IWND
approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows,
in IWND, to create wordnet synsets for a target language T we used existing wordnets and
a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every
synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the
offset for a synset with a particular part-of-speech (POS). Notice here that each synset
may have one or more words, each of which may be in one or more synsets. Words in a
synset have the same sense. Then, we extracted the corresponding synsets for each offset-
POS from existing wordnets linked to PWN, in several languages. Next, we translated
the extracted synsets in each language to T to produce synset candidates using MT or a
dictionary. Then, we applied a ranking method on these candidates to find the correct
words for a specific offset-POS in T.
Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).
33
The ranking method we used in (Lam et al., 2014b) is based on the occurrence count of a candidate. Specifically, the rank of a word w, denoted rank_w, is computed as:

rank_w = (occur_w / numCandidates) × (numDstWordNets / numWordNets)

where:
- numCandidates is the total number of translation candidates of an offset-POS,
- occur_w is the occurrence count of the word w among the numCandidates candidates,
- numWordNets is the number of intermediate wordnets used, and
- numDstWordNets is the number of distinct intermediate wordnets that have words translated to the word w in the target language.
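The ranking above can be sketched in Python. The function name and the input layout (one candidate list per intermediate wordnet) are our own illustrative assumptions, not the authors' implementation.

```python
# A sketch of the IWND ranking formula. Illustrative only.
from collections import Counter

def rank_candidates(translations_per_wordnet):
    """Rank translation candidates for one offset-POS.

    translations_per_wordnet: one list of candidate words per
    intermediate wordnet."""
    all_candidates = [w for words in translations_per_wordnet for w in words]
    num_candidates = len(all_candidates)
    num_wordnets = len(translations_per_wordnet)
    occur = Counter(all_candidates)
    ranks = {}
    for w in occur:
        # distinct intermediate wordnets whose synset translated to w
        num_dst = sum(1 for words in translations_per_wordnet if w in words)
        ranks[w] = (occur[w] / num_candidates) * (num_dst / num_wordnets)
    return ranks
```

A word proposed by several distinct intermediate wordnets thus outranks one proposed repeatedly by a single wordnet.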
4.2 Constructing Wordnet Semantic Relations
Synsets in wordnet are linked in hierarchal fashion. The hierarchy in wordnet is
established using the super-subordinate relation between synsets. For example, nouns are
linked using hyperonymy which is a relation between general synsets and specific one. An
example of a hyperonymy relation is the relation between the synsets {food, solid_food}
and {baked_goods}. Hyperonymy relation is transitive, for example, the synset {bread},
which is a hyponym of the synset {baked_goods}, is also a hyponym of the synset {food,
solid_food}. Table 4.1 shows the semantic relations available in wordnet(Wikipedia, 2015).
In (Lam et al., 2014b), we constructed core wordnets, which essentially means that we created synsets with no connections between them. As Figure 4.2 shows, our goal is to recover the taxonomy of synsets. To establish the semantic relations between the synsets
Phrase Type | Relation | Definition
Nouns | Hypernyms | Y is a hypernym of X if every X is a (kind of) Y
Nouns | Hyponyms | Y is a hyponym of X if every Y is a (kind of) X
Nouns | Coordinate terms | Y is a coordinate term of X if X and Y share a hypernym
Nouns | Meronyms | Y is a meronym of X if Y is a part of X
Nouns | Holonyms | Y is a holonym of X if X is a part of Y
Verbs | Hypernyms | The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
Verbs | Troponyms | The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
Verbs | Entailments | The verb Y is entailed by X if by doing X you must be doing Y
Verbs | Coordinate terms | Those verbs sharing a common hypernym

Table 4.1. Wordnet semantic relations.
Figure 4.2: Core wordnet mapping to structured wordnet.
we created in (Lam et al., 2014b), we rely on Princeton WordNet (Fellbaum, 2005) as an
intermediate resource.
As Figure 4.3 shows, to construct the links between synsets in our wordnet for language T, we extract each synset_i from wordnet_t and map it to synset_j, the corresponding synset in the Princeton WordNet. Then, for each synset_j in Princeton WordNet, we extract each semantic relation r_j and the linked synset_k. Next, we check the availability of synset_k in wordnet_t. Finally, if synset_k is available in wordnet_t, we add a relation between synset_i and synset_k to wordnet_t.
Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
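The mapping step can be sketched as follows. We assume synsets are identified by their PWN offset-POS keys and that PWN relations are given as (source, relation, target) triples; both are illustrative data layouts, not the actual implementation.

```python
# Sketch of the relation-recovery step described above.
def recover_relations(target_synsets, pwn_relations):
    """Copy a PWN relation into wordnet_t only when both endpoint
    synsets exist in the target wordnet."""
    recovered = []
    for offset_i, relation, offset_k in pwn_relations:
        if offset_i in target_synsets and offset_k in target_synsets:
            recovered.append((offset_i, relation, offset_k))
    return recovered
```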
Note that although we used some disambiguation methods when we created the core wordnets, some words are still misplaced. This causes some false classification of synset relations. Another challenge is that translation leads to the loss of some information. For example, it is very important to distinguish between classes and instances in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance will not be translated into the target language as a class, and vice versa. Furthermore, as Figure 4.4 shows, since the core wordnets are automatically created, some synsets will be missing in the target languages. This leads to fragmentation in the recovered links. All of these issues must be observed and dealt with to obtain acceptable accuracy.
Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using an intermediate wordnet.
4.3 Experiment and Evaluation
In this section, we generate the semantic relations between synsets in three wordnets: Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the algorithms we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets for the three languages. Next, we apply our method, which is presented in Section 4.2, to link the synsets. The algorithm was able to recover a total of 206,766 relations
Language | Synsets | Coverage | Precision (/4.00)
Arabic | 93,383 | 59.95% | 3.82
Assamese | 107,616 | 36.95% | 3.78
Vietnamese | 55,451 | 36.20% | 3.75

Table 4.2. Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
Relation | Precision
SimilarTo | 75.62%
Hypernym | 70.41%
Hyponym | 71.23%
MemberMeronym | 77.54%
PartHolonym | 84.29%
Average | 75.82%

Table 4.3. Precision of the semantic relations established for our Arabic wordnet.
between the Arabic synsets, 139,502 relations between the Assamese synsets and 146,172
relations between the Vietnamese synsets. As Figure 4.5 shows, most of the recovered
relations are hyponym and hypernym relations.
To evaluate our algorithm, we evaluated the relations recovered for the Arabic wordnet. We asked three Arabic speakers to evaluate a sample of 500 relations. The sample consists of the following relations: 100 “hypernym” relations, 100 “hyponym” relations, 100 “similar to” relations, 100 “MemberMeronym” relations and 100 “PartHolonym” relations. The evaluation was done using true/false questions, where True gives a score of 1 and False gives a score of 0 to the relation.
As Table 4.3 shows, the precision of the algorithm was between 70.41%, for the “hypernym” relation, and 84.29%, for the “PartHolonym” relation. The average precision score was 75.82%.
Figure 4.5: Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
4.4 Summary
In this chapter, we presented an approach that automatically constructs semantic relations between synsets in a wordnet. The approach depends on the PWN to establish the links between the synsets. We conducted an experiment to evaluate our algorithm. Our approach produces semantic relations between the Arabic synsets with 75.82% precision.
Chapter 5
ENHANCING AUTOMATIC WORDNET CONSTRUCTION USING
WORD EMBEDDINGS
In the previous chapters we have shown that a wordnet for a new language, possibly
resource-poor, can be constructed automatically by translating wordnets of resource-rich
languages. The quality of these constructed wordnets is affected by the quality of the resources, such as dictionaries and translation methods, used in the construction process.
Recent work shows that vector representation of words (word embeddings) can be used to
discover related words in text. In this chapter, we propose a method that performs such
similarity computation using word embeddings to improve the quality of automatically
constructed wordnets.
5.1 Introduction
It is well known that one way to find semantically related words is to use context as a lead (Firth, 1957; Harris, 1954). Words that share the same neighbors are usually somehow related to each other. For example, consider the two sentences:
“He rides his bike to the park everyday” and
“He rides his bicycle to the park everyday”.
One can conclude that the words “bike” and “bicycle” are similar or semantically related since they appear in similar contexts. This observation led researchers to what are called distributional methods, which are widely used today. In these methods, also known as vector semantics and word embeddings, co-occurrences of words in a corpus are represented as vectors in a multidimensional space, forming a word-word matrix (Jurafsky and Martin, 2009).
Since corpora contain large numbers of distinct words, these vectors are usually long and sparse. The sparseness is caused by the fact that a word often co-occurs with a limited number of other words in a given corpus. For these reasons, special algorithms are used to process and store these sparse vectors. Also, the co-occurrence of a word is usually limited to a specific window of words before and after the word. According to (Jurafsky and Martin, 2009), there are two types of co-occurrence: first-order co-occurrence and second-order co-occurrence. The first type describes words that appear next to each other, while in the second type the words share similar surrounding words.
In order to reduce the effect of stop words, i.e., words that co-occur with most other words, the pointwise mutual information measure (PMI) (Fano and Hawkins, 1961) is usually used rather than the raw co-occurrences. This measure considers the probability of two words co-occurring compared to other pairs in the corpus. The PMI between two words w1 and w2 is

PMI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ]. (5.1)

where:
P(w1) is the probability of word w1,
P(w2) is the probability of word w2, and
P(w1, w2) is the probability of w1 in the context of w2.
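To make Equation 5.1 concrete, the following toy computation estimates PMI from a token list. The windowed co-occurrence counting scheme is a simplification we assume for illustration, not a prescribed procedure.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2, window=1):
    """Toy PMI per Equation 5.1: unigram probabilities from counts and
    a joint probability from co-occurrence inside a +/- window."""
    n = len(tokens)
    unigram = Counter(tokens)
    pair = 0
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        # words within the window around each occurrence of w1
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pair += context.count(w2)
    p1, p2, p12 = unigram[w1] / n, unigram[w2] / n, pair / n
    return math.log2(p12 / (p1 * p2)) if p12 > 0 else float("-inf")
```

On the two example sentences above, "rides" and "his" always co-occur, so their PMI is high; a stop word co-occurring with everything would score near zero.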
5.2 Similarity Metrics
There are many ways to compute similarity between vectors (Jurafsky and Martin, 2009). Next, we list three common metrics used to measure similarity or relatedness between two vectors A and B of size N.

• Cosine Similarity: the most common measure used in natural language processing. It produces similarity values from 0 to 1 when using raw co-occurrences or PMI, where words with cosine similarity near 1 are supposedly very similar and words with cosine similarity near 0 are supposedly unrelated. Cosine similarity is usually measured using the following formula:
cosine(A, B) = ( Σ_{i=1}^{N} A_i B_i ) / ( sqrt(Σ_{i=1}^{N} A_i²) · sqrt(Σ_{i=1}^{N} B_i²) ). (5.2)
• Jaccard Measure: introduced by (Jaccard, 1912) and adapted by (Grefenstette, 2012) for use with vectors. The Jaccard similarity is computed using the following formula:

Jaccard_sim(A, B) = ( Σ_{i=1}^{N} min(A_i, B_i) ) / ( Σ_{i=1}^{N} max(A_i, B_i) ). (5.3)
• Dice Measure: originally used with binary vectors and adapted by (Curran, 2004) for semantic similarity. The Dice similarity measure can be computed using the following equation:

Dice_sim(A, B) = ( 2 Σ_{i=1}^{N} min(A_i, B_i) ) / ( Σ_{i=1}^{N} (A_i + B_i) ). (5.4)
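The three metrics can be sketched directly from Equations 5.2 through 5.4; these are plain illustrative implementations, not optimized library code. Note the Jaccard denominator uses max, following the standard weighted Jaccard measure.

```python
import math

# Plain implementations of Equations 5.2-5.4 for dense vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    return (sum(min(x, y) for x, y in zip(a, b)) /
            sum(max(x, y) for x, y in zip(a, b)))

def dice(a, b):
    return (2 * sum(min(x, y) for x, y in zip(a, b)) /
            sum(x + y for x, y in zip(a, b)))
```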
5.3 Generating Word Embeddings
In order to validate the synsets we create using translation and to obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural network to predict the vector representation of words within a multi-dimensional language model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words (CBOW). In the SG version, the neural network predicts words adjacent to a given word on either side, while in the CBOW model the network predicts the word in the middle of a given sequence of words. In the work presented in this section, we generate representations of words using both models with several different vector and window sizes to obtain the settings with the highest precision. The purpose of the steps discussed next is to improve the quality of synsets produced by the translation process, in addition to generating relations among the synsets.
5.4 Removing Irrelevant Words in Synsets
We compute the cosine similarity between word vectors within each single synset in TWN, the wordnet being constructed in language T, to filter out false word members within synsets. To filter the initially constructed synsets in TWN, we pick a threshold value θ such that the selected words have cosine similarity larger than θ with each other. Next, we describe the filtering process we propose.
1. Let

synset^c_i = {word_1, word_2, word_3, word_4} (5.5)

be a candidate synset to be potentially included in TWN.

2. We compute the cosine similarity between all possible pairs of words in synset^c_i.

3. We extract the pair of words with the highest cosine similarity.

4. If this pair of words has cosine similarity larger than θ, the pair is kept in the final synset synset_i; otherwise, synset^c_i itself is discarded, since it may have been a low-quality candidate synset generated in the translation process.

5. Next, among the remaining words in synset^c_i, a word is kept if it has a connection with any word in synset_i with similarity higher than θ.
For example, let us assume that the cosine similarities between the words in synset^c_i are as shown in Table 5.1 and θ = 0.70. First, the pair with the highest cosine similarity, (word_1, word_2), is kept in the final synset_i since its cosine similarity is larger than θ. Then, word_3 is discarded since it does not have a cosine similarity larger than θ with any of the words in the current final synset_i. Finally, word_4 is kept in synset_i since it has a cosine similarity with word_1 that satisfies the threshold θ.
Pair | Cosine Similarity
(word_1, word_2) | 0.91
(word_1, word_3) | 0.22
(word_1, word_4) | 0.82
(word_2, word_3) | 0.34
(word_2, word_4) | 0.72
(word_3, word_4) | 0.12

Table 5.1. An example of cosine similarity between words in a candidate synset.
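The five filtering steps can be sketched as follows and run on the Table 5.1 example. The pair-keyed similarity dictionary is our own illustrative data layout, not the authors' implementation.

```python
from itertools import combinations

def filter_synset(words, sim, theta):
    """Filter a candidate synset. `sim` maps unordered word pairs
    (frozensets) to cosine similarity. Returns the filtered synset,
    or None when the whole candidate is discarded (step 4)."""
    pairs = sorted(combinations(words, 2),
                   key=lambda p: sim[frozenset(p)], reverse=True)
    best = pairs[0]
    if sim[frozenset(best)] <= theta:
        return None  # likely a low-quality candidate from translation
    kept = set(best)
    # step 5: keep any remaining word similar enough to a kept word
    for w in words:
        if w not in kept and any(sim[frozenset((w, k))] > theta for k in kept):
            kept.add(w)
    return kept
```

On the Table 5.1 similarities with θ = 0.70, the function keeps word_1, word_2 and word_4 and drops word_3, matching the walkthrough above.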
5.5 Validating Candidate Relations
Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in TWN. For example, let

synset_i = {word_i1, word_i2, word_i3, word_i4}, and
synset_j = {word_j1, word_j2, word_j3, word_j4}

be synsets in TWN, and let r_ij be a candidate semantic relation between synset_i and synset_j. We compute the cosine similarity between all possible pairs of words from synset_i to synset_j and obtain the maximum similarity. Then, if this value is larger than a threshold θ_r, we retain the relation r_ij; otherwise, we discard it. Pseudocode of the validation algorithm is shown in Algorithm 1.
Algorithm 1: Validating a Semantic Relation
Data: synset_i, synset_j, relation r_ij, threshold θ_r
Result: retain or discard the relation r_ij
initialization;
Similarity_max ← 0;
foreach word_i in synset_i do
    foreach word_j in synset_j do
        sim ← ComputeCosineSimilarity(word_i, word_j);
        if sim > Similarity_max then
            Similarity_max ← sim;
        end
    end
end
if Similarity_max < θ_r then
    Discard(r_ij);
end
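Algorithm 1 renders directly in Python. The vectors dictionary (word to list of floats) is an assumed data layout, and the cosine helper is a plain re-implementation rather than library code.

```python
import math

def _cos(a, b):
    # plain cosine similarity; returns 0.0 for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def validate_relation(synset_i, synset_j, vectors, theta_r):
    """Retain the candidate relation iff the maximum cross-synset
    word-pair similarity reaches the threshold theta_r."""
    best = max(_cos(vectors[wi], vectors[wj])
               for wi in synset_i for wj in synset_j)
    return best >= theta_r
```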
5.6 Selecting Thresholds
To pick the synset similarity threshold value θ and the threshold θ_r for each semantic relation we create, we compute the cosine similarity between pairs of synonymous words, semantically related words, and non-related words obtained from existing wordnets. Then, based on these data, we select the threshold values associated with higher precision and maximum coverage.
5.7 Experiments
In this section, we discuss the enhancement of the Arabic, Assamese and Vietnamese wordnets we created, using the method described in the previous sections.
5.7.1 Generating Vector Representations of Wordnets Words
For generating vector representations of the Arabic words, we use the following freely available corpora:
• Watan-2004 corpus (12 million words) (Abbas et al., 2011),
• Khaleej-2004 corpus (3 million) (Abbas and Smaili, 2005), and
• 21 million words of Wikipedia1 Arabic articles.
We process and combine the three corpora into a single plain text file.
For both Assamese and Vietnamese, we used Wikipedia articles to generate the vector representations of words. The size of the Assamese Wikipedia articles we used is 1.4 million words, while the size of the Vietnamese articles was 80 million words.
Figure 5.1: A histogram of synonyms, semantically related words, and non-related words extracted from AWN.
In order to compute the synset similarity threshold value θ and the threshold for each semantic relation θ_r, we use the freely available Arabic wordnet (AWN) (Rodríguez et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically enhanced and extended several times. We start by extracting synonymous words, semantically related words, and non-related words from AWN. The Python program that we wrote to
1https://ar.wikipedia.org
Relation | Weighted Average Similarity
Synonyms | 0.28
Hypernyms | 0.22
TopicDomains | 0.23
PartHolonyms | 0.28
InstanceHypernyms | 0.08
MemberMeronyms | 0.29

Table 5.2. The weighted average similarity between related words in AWN.
compute the cosine similarity between the words is listed in Appendix A.1. Then, we
use the histogram representation of the cosine similarity of the previous sets of words to
set the thresholds. As Figure 5.1 shows, more than 67% of the non-related words have
cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a
cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words
in AWN have cosine similarity less than 0.1. Table 5.2 shows the weighted average cosine
similarity between synonyms, hypernyms, topic-domain related, part-holonyms, instance-
hypernyms, and member-meronyms in AWN where the frequency of the similarity value is
the weight.
5.7.2 Producing Word Embeddings for Arabic
In this part of the experiment, we use the word2vec algorithm to produce vector representations of Arabic words. We test the word2vec algorithm with different window sizes to select the window size that produces the highest similarity. We generate word embeddings using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted averages of the cosine similarity between the synonyms in AWN. The highest weighted average we obtained was 0.288 with window size 3, while the weighted averages obtained with window sizes 5 and 8 were 0.283 and 0.277, respectively. Then, we compare the SG and CBOW models with different vector sizes. Table 5.3 shows the weighted average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both variations
Algorithm | Vector Size | Similarity Average
SG | 100 | 0.289
SG | 200 | 0.258
SG | 500 | 0.194
CBOW | 100 | 0.288
CBOW | 200 | 0.259
CBOW | 500 | 0.195

Table 5.3. Comparison of the weighted similarity averages obtained using different word2vec settings.
Threshold | AWN | Our Arabic WordNet
0.000 | 5,941 | 17,349
0.100 | 3,433 | 2,073
0.288 | 2,471 | 943
0.500 | 1,190 | 271
0.750 | 209 | 13

Table 5.4. Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
of word2vec, with window size 3 and vector size set to 100, 200, and 500. We notice that both versions produce almost similar results, with a slight advantage to SG at the cost of more execution time. However, for the corpus we use, smaller vector sizes produce better precision.
5.8 Evaluation & Discussion
We compute the cosine similarity between semantically related words extracted from our initial Arabic, Assamese and Vietnamese wordnets produced in the previous chapter. The language model used to calculate the cosine similarity is created using CBOW with vector size 100 and window size 3. Table 5.4 shows a comparison between the number of Arabic synsets we create and the number of synsets in AWN.
We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets decreases sharply after filtering the initial synonyms using the method described in Section 5.4. Although
Threshold Range | 0 - 0.1 | 0.1 - 0.288 | 0.288 - 1
Synonyms | 34.8% | 56.8% | 78.4%
Hypernyms | 45.2% | 57.2% | 84.4%
PartHolonym | 50.8% | 75.2% | 90.4%
Member-Meronym | 40.8% | 56.8% | 79.6%
Overall | 42.9% | 61.5% | 83.2%

Table 5.5. Precision of the Arabic wordnet we create.
Threshold Range | 0 - 0.1 | 0.1 - 0.288 | 0.288 - 1
Synonyms | 52.0% | 57.6% | 88.0%
Hypernyms | 37.6% | 49.6% | 76.0%
PartHolonym | 51.2% | 46.4% | 82.4%
Member-Meronym | 62.4% | 67.2% | 81.6%
Overall | 50.8% | 55.2% | 82.0%

Table 5.6. Precision of the Assamese wordnet we create.
our Arabic wordnet is automatically created, the number of synsets we create is 60% of the number of synsets in the manually created AWN when filtering the synsets using θ = 0.1.
We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms, and member-meronyms within three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288, and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs using a 0 to 5 scale, where 0 represents the minimum score and 5 represents the maximum score. We compute precision by taking the average score and converting it to a percentage. See Table 5.5.
Threshold Range | 0 - 0.1 | 0.1 - 0.288 | 0.288 - 1
Synonyms | 31.2% | 40.2% | 57.6%
Hypernyms | 31.8% | 39.0% | 69.4%
PartHolonym | 32.2% | 42.8% | 75.0%
Member-Meronym | 22.0% | 24.0% | 73.8%
Overall | 29.3% | 36.5% | 68.95%

Table 5.7. Precision of the Vietnamese wordnet we create.
Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.
The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms we produce is 78.4%, 84.4%, 90.4%, and 79.6%, respectively, with the threshold set to 0.288. This is higher than the precision obtained by (Lam et al., 2014b), which produces synonyms with 76.4% precision when using just PWN. Furthermore, the precision of the Assamese and Vietnamese wordnets is shown in Tables 5.6 and 5.7. As shown in Tables 5.8, 5.9 and 5.10, our results suggest that lower precision in producing synsets reduces the quality of the other created semantic relations. Our results clearly show that pairs with higher cosine similarity are more likely to be semantically related. This confirms the benefit of combining the translation method with word embeddings in the process of automatically generating new wordnets.
5.9 Summary
In this chapter, we discussed an approach for enhancing the automatically generated wordnets we create for low-resource languages. Our approach takes advantage of word embeddings to enhance the translation method for automatic wordnet creation. We present
Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.
Table 5.10. Examples of related words and their cosine similarity from our Vietnamese wordnet.
an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of words with up to 90.4% precision.
Chapter 6
SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD
EMBEDDINGS
Word embedding is a way to represent words as vectors in a multi-dimensional space such that related words are represented by vectors with similar directions. It has been shown that this model can be used to discover relations between words effectively. In this chapter, we introduce a method to represent wordnet synsets in a similar way. A wordnet synset is a group of synonymous words grouped together because they all represent the same concept. Our proposed method can be used in several NLP applications such as word-sense disambiguation and automatic wordnet construction. To test our method, we use it in the task of selecting glosses for wordnet synsets of several languages.
6.1 Creating Language Model Using Word Embedding
We start by creating word embeddings using a corpus and the word2vec software (Mikolov et al., 2013). word2vec is a two-layer feedforward neural-network learning model that produces multi-dimensional vector representations of words. There are two implementations of this learning model: the Skip-Gram (SG) implementation and the Continuous Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns the words around a given word, while in the CBOW implementation the model learns the word within a given sequence of words.
6.2 Generating Vector Representation of Wordnet Synsets
In this section, we present our method to produce vector representations of wordnet synsets. We build our method on the vectors of the synonymous words produced by the word embedding method. We believe that combining the vectors of synonymous words into one vector can produce a way to represent meaning. Next, we describe our proposed method to build the vector representation of synsets, which we call synset2vec.

Let

synset_i = {word_1, word_2, ..., word_j} be a synset in wordnet_x,
{n_1, n_2, ..., n_j} be the number of synsets each word in synset_i belongs to, and
{V_1, V_2, ..., V_j} be the corresponding vectors for {word_1, word_2, ..., word_j} in the word embedding model.
We identify two cases:
1. The first case is when a word which does not have any synonyms represents several synsets, i.e., has more than one meaning. In this case, the vector produced by the word embedding actually represents the combined meanings of the word. For example, in PWN, the word “abduction” is the only word in both synset 00775460-n, “the criminal act of capturing and carrying away by force a family member”, and synset 00333037-n, “moving of a body part away from the central axis of the body”. Hence, the vector for “abduction” actually represents both meanings.

2. The second case is when a word which has one or more synonyms has one or more meanings. In this case, the synonyms might or might not have other meanings as well. For example, the noun “spill” has four meanings in PWN and has 6 synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its synonyms in PWN.
Obviously, to generate a combined vector for a synset, we need a way to limit the effect of the other meanings that the synonyms might hold. To do so, we start by addressing the second case, where the synsets have more than one word. In this case, we normalize the vector of each word by dividing its coordinates by the number of synsets that the word
Synset Key | Gloss | Synonyms
00076884-n | a sudden drop from an upright position | {spill, tumble, fall}
00329619-n | the act of allowing a fluid to escape | {spill, spillage, release}
04277034-n | a channel that carries excess water over or around a dam or other obstruction | {spill, spillway, wasteweir}
15049594-n | liquid that is spilled | {spill}

Table 6.1. Meanings of the noun “spill” and its synonyms.
belongs to. This reduces the noise in generating the synset vector caused by the other meanings that a word can hold. We define the vector of synset_i, V_si, as follows:

V_si = (1/j) · ( V_1 · (1/n_1) + V_2 · (1/n_2) + ... + V_j · (1/n_j) ).
Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes three words: spill, tumble and fall.
Figure 6.1: An example of creating a vector for a wordnet synset that includes more than one word.
Next, we produce vectors for the synsets that share a single word, i.e., words that do not have any synonyms and have more than one meaning. In this case, for each synset, we produce the synset vector by combining the word vector with the vector of a word in a related synset, e.g., a hypernym, a hyponym, or a meronym. For example, let synset_i and synset_k be synsets that both include the same single word w, and let h_1 be a word from the hypernym of synset_i and h_2 be a word from the hypernym of synset_k. We define the vector of synset_i, V_si, as follows:

V_si = (1/2) · ( V_w · (1/n_w) + V_h1 · (1/n_h1) ).

Similarly, we define the vector of synset_k, V_sk, as follows:

V_sk = (1/2) · ( V_w · (1/n_w) + V_h2 · (1/n_h2) ).
Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”. In Appendix A.2 we list a Python implementation of the procedure.
Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
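The multi-word case of synset2vec can be sketched as below. Each member vector is scaled by 1/n_k, where n_k is the number of synsets the word belongs to, before averaging. The toy vectors and counts are illustrative inputs, not trained embeddings or the appendix code.

```python
# A sketch of synset2vec for a synset with several member words.
def synset2vec(words, vectors, synset_counts):
    """words: synset members; vectors: word -> vector;
    synset_counts: word -> number of synsets the word appears in."""
    dim = len(next(iter(vectors.values())))
    vs = [0.0] * dim
    for w in words:
        for d in range(dim):
            # down-weight ambiguous words by their synset count
            vs[d] += vectors[w][d] / synset_counts[w]
    return [x / len(words) for x in vs]
```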
6.3 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
In this section, we give one usage example of our model. We show how our proposed model can be used in the automatic selection of glosses for wordnet synsets. The automatic selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence which is usually manually attached to a synset to clarify the meaning of the synset. This short sentence can be a definition or an example sentence using one of the members of the synset. We test our method using PWN and then apply it to automatically add glosses to wordnets created in (Lam et al., 2014b).
In the following steps, we present our method to select a gloss for the synset_i we defined in Section 6.2.
• Let G = {g_1, g_2, ..., g_y} be the set of candidate glosses that include a word belonging to synset_i.

• To select the closest gloss to synset_i from G, we generate a vector for each gloss g_z ∈ G. We list a Python function for this step in Appendix A.3.

• Assume that the gloss g_z consists of the words {w_1, w_2, ..., w_d},
{m_1, m_2, ..., m_d} is the number of synsets for each word in g_z, and
{V_w1, V_w2, ..., V_wd} are the corresponding vectors for {w_1, w_2, ..., w_d}.

• We compute the vector of gloss g_z as follows:

V_gz = (1/d) · ( V_w1 · (1/m_1) + V_w2 · (1/m_2) + ... + V_wd · (1/m_d) ).
• Then, we compute the cosine similarity between the vector of each gloss g_z and V_si. We present a Python implementation for this step in Appendix A.4.

• Finally, we select the gloss with the highest cosine similarity with V_si.
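The steps above can be sketched as follows: build a gloss vector with the same 1/m_k scaling, then pick the candidate closest to the synset vector. All names and inputs are toy illustrations, not the appendix implementations.

```python
import math

def _cos(a, b):
    # plain cosine similarity; returns 0.0 for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gloss_vector(gloss_words, vectors, synset_counts):
    """Average the vectors of known gloss words, each scaled by 1/m_k."""
    known = [w for w in gloss_words if w in vectors]
    dim = len(next(iter(vectors.values())))
    v = [0.0] * dim
    for w in known:
        for d in range(dim):
            v[d] += vectors[w][d] / synset_counts.get(w, 1)
    return [x / len(known) for x in v] if known else v

def select_gloss(synset_vec, candidate_glosses, vectors, synset_counts):
    """Return the candidate gloss (a word list) with the highest cosine
    similarity to the synset vector."""
    return max(candidate_glosses,
               key=lambda g: _cos(synset_vec,
                                  gloss_vector(g, vectors, synset_counts)))
```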
For instance, as shown in Table 6.2, if we consider the word “abduction”, which belongs to two synsets and does not have any synonyms, we notice that our algorithm was able to distinguish between the two meanings and select the right gloss for both synsets.
Synset Key | Gloss | Cosine Similarity
00333037-n | the criminal act of capturing and carrying away by force a family member | 0.172
00333037-n | moving of a body part away from the central axis of the body | 0.214
00775460-n | the criminal act of capturing and carrying away by force a family member | 0.204
00775460-n | moving of a body part away from the central axis of the body | 0.189

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.4 Evaluation
In this section, we introduce two forms of evaluation. First, we apply our method
to select glosses for the PWN synsets. In this case, we directly compare our results to the
actual manually attached glosses in PWN. Then, we apply our method to attach glosses to
wordnet synsets generated by (Lam et al., 2014b). In this case, we ask human judges to
evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese.
6.4.1 Using Synset2vec to Select Glosses for PWN Synsets
In order to evaluate our synset vector representation in the task of selecting glosses for
wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of
the glosses manually added to the synsets in PWN to automatically measure the precision of
our synsets representation. The following steps describe the evaluation process of selecting
glosses for PWN synsets.
• For each synset_i in PWN, we construct a set of candidate glosses. The candidate glosses are extracted from PWN using the following method. First, the gloss attached to synset_i in PWN is added to the candidate set of glosses. Next, to generate negative glosses for synset_i, we extract words which belong to synset_i and other synsets, i.e., words that have the meaning of synset_i and one or more other meanings. This allows us to examine the ability of the algorithm to differentiate between the different meanings of synsets.
• We randomly select two types of synsets from PWN: synsets that are represented by only a single word, and synsets that include multiple synonymous words.
• We generate the synset vectors using the algorithm we described in Section 6.2.
• Next, we generate the gloss vectors using the method we described in Section 6.3.
• Then, we compute the cosine similarity between synset_i and each gloss in the candidate set.
• Finally, we select the gloss with the highest cosine similarity.
6.4.2 Using Synset2vec to Select Glosses for Arabic,Assamese and Viet-
namese Synsets
In this section, we examine the precision of our method by applying it to select
glosses from corpora to attach to the wordnets we created in the previous chapters. In this
experiment, we used the Arabic, Assamese and Vietnamese wordnets. Next, we describe
the steps of evaluating the glosses selected by our method for the synsets of the target
languages:
• For each synset_i in the target wordnet wordnet_t, we generate a set of candidate
glosses by extracting, from the corpora we described in Section 5.7, the set of
sentences that include any member of synset_i.
Synset Type     Number of Synsets   Precision
Single Member   1400                76.5%
Multi Member    600                 79.6%

Table 6.3. The precision of selecting glosses for PWN synsets
• We randomly select two types of synsets from wordnet_t: single-member synsets, i.e.
synsets that are represented by only a single word, and multi-member synsets that
include multiple synonym words.
• We generate the synset vectors using the algorithm we described in Section 6.2.
• Next, we generate vectors for each sentence in the set of candidate glosses using the
method we described in Section 6.3.
• Then, we compute the cosine similarity between synset_i and each sentence in the
candidate set.
• Next, the top 3 sentences with the highest cosine similarity with synset_i are selected.
• Finally, 3 native speakers of the target language are asked to evaluate the selected
sentences using a 5-point scale.
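The candidate-extraction and top-3 ranking steps can be sketched as follows. The corpus, the helper names and the stand-in scoring function are ours; a real run would score each candidate with the cosine similarity between the synset and sentence vectors of Sections 6.2 and 6.3.

```python
def extract_candidates(synset_members, corpus_sentences):
    """Candidate glosses: sentences containing at least one member of the synset."""
    members = set(synset_members)
    return [s for s in corpus_sentences if members & set(s.split())]

def top_k(candidates, score, k=3):
    """Sort candidates by a similarity score and keep the top k for human judges."""
    return sorted(candidates, key=score, reverse=True)[:k]

corpus = [
    "the dog barked at night",
    "a dog is a domestic animal",
    "cats dislike water",
    "the hound chased the fox",
    "the animal shelter keeps every stray dog",
]
cands = extract_candidates(["dog", "hound"], corpus)   # 4 sentences survive
# Stand-in score: counts shared cue words instead of a real cosine similarity.
selected = top_k(cands, score=lambda s: len(set(s.split()) & {"dog", "animal"}), k=3)
```

Only the top 3 ranked sentences reach the native-speaker evaluators, which keeps the human workload fixed regardless of how many sentences the corpus yields per synset.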
6.4.3 Results & Discussion
As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-
member synsets from PWN. The algorithm achieved 76.5% precision. Also, we used it to
select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in this
case.
In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese
and Vietnamese wordnets we created (100 synsets each). For each synset, we extracted
all the sentences that included any member of the synset from the corpora. The
sentences were sorted according to their cosine similarity with the synset vector, and the top
3 sentences were selected.

Table 6.4. Examples of Arabic glosses we produce in our Arabic wordnet.
As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is
81.4% when selecting the sentences with the highest cosine similarity with the synset vector.
Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8% respectively.
The overall precision of selecting glosses using our method for the Arabic synsets is
72.6%. Table 6.4 shows some examples of glosses we produce for the Arabic synsets along
with their cosine similarity values.
The precision of our method for selecting glosses for the Assamese synsets is 85.2%
when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and
top 3 selected sentences achieved 83.2% and 84.6% precision respectively. The overall
precision for Assamese glosses is 84.4%. Table 6.5 shows some examples of glosses we
produce for the Assamese synsets along with their cosine similarity values.
Table 6.5. Examples of Assamese glosses we produce in our Assamese wordnet.

The top Vietnamese glosses selected by our method have 39.4% precision. The top 2
and top 3 Vietnamese glosses selected by our method have 36.6% and 37% precision. Table
6.6 shows some examples of glosses we produce for the Vietnamese synsets along with
their cosine similarity values.
In general, the precision of recently published algorithms (Apidianaki and Von Neumann,
2013) for the task of multilingual word-sense disambiguation is around 68.7%,
meaning that our algorithm shows better performance for English, Arabic and Assamese.
However, we notice that our method performs poorly with Vietnamese. The reason
behind the poor results with Vietnamese is that white spaces do not mark word boundaries
in Vietnamese (Gordon and Grimes, 2005). That means that the meaning of most of the words
can change based on the following words. This makes the process of generating the vectors
for both the synsets and sentences extremely difficult, since the word2vec algorithm assumes
that words are separated by white spaces. The same problem appears in the process of
automatically generating bilingual dictionaries for Vietnamese (Lam et al., 2015a).

Table 6.6. Examples of Vietnamese glosses we produce in our Vietnamese wordnet.

             Precision
Wordnet      Top 1    Top 2    Top 3    Overall
Arabic       81.4%    70.4%    65.8%    72.6%
Assamese     85.2%    83.2%    84.6%    84.4%
Vietnamese   39.4%    36.6%    37.0%    37.6%

Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets

One possible solution to this problem is to replace the white spaces within single Vietnamese
words with a special non-white character. This requires the existence of a language dictionary
to distinguish the words that include white spaces within them.
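One way to realize this workaround is a greedy longest-match pass over the text, rewriting every dictionary-known multiword entry as a single underscore-joined token so that word2vec treats it as one word. The sketch below is ours, not the thesis's implementation, and the unaccented "Vietnamese" entries are toy stand-ins.

```python
def segment(text, multiword_entries, joiner="_"):
    """Greedy longest-match: rewrite known multi-syllable words so that
    word2vec sees each of them as a single space-delimited token."""
    tokens = text.split()
    # Longest entries first, so a three-syllable word wins over a two-syllable prefix.
    entries = sorted((e.split() for e in multiword_entries), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for entry in entries:
            if tokens[i:i + len(entry)] == entry:
                out.append(joiner.join(entry))   # e.g. "hoc sinh" -> "hoc_sinh"
                i += len(entry)
                break
        else:
            out.append(tokens[i])                # not part of any multiword entry
            i += 1
    return " ".join(out)

# Toy, unaccented stand-ins for Vietnamese dictionary entries (illustration only).
seg = segment("toi hoc sinh vien", ["hoc sinh", "sinh vien"])
```

Note that greedy matching resolves the overlap between "hoc sinh" and "sinh vien" by dictionary order; this is exactly why the text stresses that a language dictionary is a prerequisite, since segmentation errors propagate directly into the vectors.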
6.5 Summary
In this chapter, we presented a new method for selecting synset glosses from a corpus.
The method can be used for low-resource languages to attach glosses to wordnets
constructed automatically. Our method presents a vector representation of wordnet synsets in
a multi-dimensional space. We construct a synset vector by grouping the word embedding
vector of each synonym in the synset. Our evaluation showed that our method selects
glosses with precision up to 84.4%.
Chapter 7
LEXBANK: A MULTILINGUAL LEXICAL RESOURCE
Figure 7.1: An overview of LexBank system.
7.1 Introduction
In this chapter, we discuss the design and implementation of LexBank: a system that
provides access to the multilingual lexical resources we create in this dissertation. We aim
to give public users the ability to access and use the resources that we have created in our
project. The system provides wordnet search services for several low-resource languages in
addition to bilingual dictionary lookup services. In addition, the system receives evaluation
and feedback from users to improve the quality of the resources.
As Figure 7.1 shows, the system is divided into three layers: the Web interface, the
application layer and the database layer. The Web interface allows users to log into the system
and access the search services. The web interface also provides a control panel for
administrators to manage the system. The application layer includes all the software
required to securely execute users' requests. The database layer has two databases: the lexical
resources database and the system database. The system database stores user information
and the system settings. The design of the system makes it easy to include new language
resources and to modify the system.
7.2 Database Design
LexBank uses two databases: one for storing the system settings and one for storing
the lexical resources. We have used Microsoft SQL Server to construct the databases. The
SQL code we used to construct the databases is listed in Appendix B. Next, we describe
each database in detail.
7.2.1 The system settings database
There are two tables in the settings database: Users_Info and System_log. Next, we
describe both tables.
7.2.1.1 Users_Info
The Users_Info table contains information about the registered users. Following are the
fields contained in the Users_Info table:
• UserId: a unique short alias, selected by the user, that identifies the user in the
system.
• UserName: the full name of the user.
• UserEmail: the email address of the user.
• UserPwd: the encrypted password used by the user to access the system.
• UserPriv: a text field that determines the privileges of the user. There are two
levels of users in the system. The first level is administrator, which has the privileges
of managing users and data in the system. The second level is client, which has the
privilege of browsing the available resources.
• UserStatus: this field specifies the status of the user. The status can be Active, Inactive
or New.
7.2.1.2 System_log
The System_log table keeps records of all the users' activities in the system. This helps
us in maintenance and in keeping track of the utilization of the system. The following fields
are contained in the System_log table:
• EventId: a unique key that is used to identify the event.
• EventDesc: a text description of the event.
• EventTime: the date and time of the event.
• UserId: the identification key of the user who committed the event.
7.2.2 The lexical resources database
The lexical resources database contains the resources we produce in this thesis. For
each language supported by the system, the database maintains tables for storing the core
wordnet, the semantic relations, the wordnet glosses, the evaluation data of the semantic
relations and the evaluation data of the wordnet glosses. Next, we describe each table in
this database.
7.2.2.1 CoreWordnet
The CoreWordnet table stores the wordnet synsets we create in this thesis. The
core wordnet groups the synonym words into sets called synsets. In this table, synsets are
identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos
consists of two parts: byte offset used to locate the synset in the data file and the part of
speech of the synset. Following are the fields in the CoreWordnet table:
• offset-pos: the offset-pos of the wordnet synset which is used as an identifier for the
synset.
• Member: a word that belongs to the synset.
7.2.2.2 Sem_Relations
Whereas the synonymy relation is stored in the CoreWordnet table, other semantic
relations such as hypernymy and meronymy are stored in the Sem_Relations table. As
we described in Section 4.2, the semantic relations are directed relations. Therefore, we
must maintain the direction by specifying the side of each synset in the relation. The
Sem_Relations table contains the following fields:
• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the
relation.
• Relation: a text field that specifies the relation between the left-side and right-side
synsets.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation.
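The thesis builds these tables in Microsoft SQL Server (the SQL is in Appendix B and is not reproduced here). Purely as an illustration, an equivalent minimal schema in SQLite might look like the following; column names use underscores instead of the hyphenated offset-pos, and the right-hand offset in the sample relation is a made-up value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, enough for the sketch
conn.executescript("""
    -- Synonymy: one row per (synset, member word) pair.
    CREATE TABLE CoreWordnet (
        offset_pos TEXT NOT NULL,   -- PWN byte offset + part of speech, e.g. '08897065-n'
        Member     TEXT NOT NULL
    );
    -- Directed semantic relations between synsets.
    CREATE TABLE Sem_Relations (
        Left_offset_pos  TEXT NOT NULL,
        Relation         TEXT NOT NULL,   -- e.g. 'hypernym', 'part-holonym'
        Right_offset_pos TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO CoreWordnet VALUES (?, ?)", ("08897065-n", "Egypt"))
conn.execute("INSERT INTO Sem_Relations VALUES (?, ?, ?)",
             ("08897065-n", "instance-hypernym", "08544813-n"))
rows = conn.execute("SELECT Member FROM CoreWordnet WHERE offset_pos = ?",
                    ("08897065-n",)).fetchall()
```

Keeping one row per synset member (rather than one row per synset) means adding a synonym to a synset never requires altering the schema, which matches the design goal of easily accommodating new languages.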
7.2.2.3 WordnetGlosses
The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. Following
are the fields of the WordnetGlosses table:
• offset-pos: the offset-pos of the wordnet synset.
• Gloss: a text field that contains the gloss of the synset.
7.2.2.4 Sem_Relations_Eval_Data
The Sem_Relations_Eval_Data table contains the sample of semantic relations used
in the evaluation. This table contains the following fields:
• RelationKey: a unique identification number used to identify the semantic relation
being evaluated.
• Left_offset-pos: the offset-pos of the synset on the left side of the relation being
evaluated.
• Word1: this field specifies the word on the left side of the relation being evaluated.
• Relation: a text field that specifies the type of relation being evaluated.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation being
evaluated.
• Word2: this field specifies the word on the right side of the relation being evaluated.
• COS: the cosine distance, as measured in Section 5.4, between the left word and the
right word in the relation being evaluated.
7.2.2.5 Sem_Relations_Eval_Response
The Sem_Relations_Eval_Response table contains the evaluators' responses for the
semantic relations we produce. This table consists of the following fields:
• AnswerKey: a unique integer number that is generated automatically to identify the
response.
• RelationKey: the key of the semantic relation being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the semantic relation.
• UserId: identification key of the evaluator who evaluated the response.
7.2.2.6 WordnetGlosses_Eval_Data
The WordnetGlosses_Eval_Data table holds the sample of wordnet glosses being
evaluated by the users. The table includes the following fields:
• GlossKey: an automatically generated unique integer used to identify the gloss being
evaluated.
• offset-pos: the offset-pos of the wordnet synset.
• Word: the word used in the gloss to represent the wordnet synset.
• Sentence: the sentence selected as gloss for this wordnet synset.
• PWNGloss: the English gloss of the corresponding synset in PWN.
• CosSem: the cosine similarity between the selected sentence and the synset as mea-
sured in Section 6.3.
• GlossRank: an integer value that represents the rank of the gloss among the other
candidate glosses. The rank is assigned by the system to the gloss being evaluated
based on the CosSem value; the gloss with the highest CosSem value has a rank of 1.
7.2.2.7 WordnetGlosses_Eval_Response
Responses from the users evaluating the wordnet glosses we produced in Section
6.3 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:
• AnswerKey: a unique integer number that is generated automatically to identify the
response.
• GlossKey: the key of the gloss being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the gloss.
• UserId: identification key of the evaluator who evaluated the gloss.
7.3 Application layer
In this section, we describe the main functions provided by LexBank. In order to
maintain simplicity, we implement most of the functions of the system in one utility class
(LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix
C, consists of the following methods:
• IsUserIdAvailable(): takes a userId and returns true if it has never been used by another
user before.
• EncryptPassword(): takes a plain-text password and returns an encrypted password.
• DecryptPassword(): takes an encrypted password and returns the decrypted password.
• CreateNewUser(): takes the details of a new user and creates an account by
storing the data in the Users_Info table.
• IsAuthenticated(): takes the user identification and password and returns true if they
match the user information in the users table.
• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.
• FindSynSetLexemes(): takes an OffsetPos of a synset and returns the list of lexemes of
this synset.
• IsSynSetAvailable(): takes an OffsetPos of a synset in a specific wordnet, and returns
true if the synset is available in the specified wordnet.
• FindSynSetRelations(): takes an OffsetPos of a synset and returns all the semantically
related lexemes.
• FindGloss(): takes an OffsetPos of a synset and returns the gloss of the synset.
• ReadRelation(): takes a RelationKey and returns the details of the relation.
• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.
• EvaluateRelation(): takes a RelationKey, Score and UserId and stores them in the
evaluation table of the semantic relations.
• EvaluateGloss(): takes a GlossKey, Score and UserId and stores them in the evaluation
table of the wordnet glosses.
• LogEvent(): takes an event description and stores it in the System_log table.
• ChangeUserStatus(): takes the UserId of a user and changes his status to a specific new
status.
• RetrieveUsers(): a method that returns a list of all the users in the system and their
information.
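The thesis implements these methods in C# (Appendix C). Purely to illustrate the lookup logic, a Python analogue of FindSynSet() and FindSynSetLexemes() over a toy in-memory core wordnet might be as follows; the offsets and members below are sample values, not the real database contents.

```python
# Toy core wordnet: offset-pos -> synset members (sample values, for illustration).
core_wordnet = {
    "02084071-n": ["dog", "domestic_dog"],
    "10023039-n": ["dog", "cad"],          # a second sense of "dog"
    "02121620-n": ["cat", "true_cat"],
}

def find_synset(lexeme):
    """FindSynSet() analogue: all synsets that include the lexeme."""
    return [op for op, members in core_wordnet.items() if lexeme in members]

def find_synset_lexemes(offset_pos):
    """FindSynSetLexemes() analogue: all member lexemes of a synset."""
    return core_wordnet.get(offset_pos, [])

synsets = find_synset("dog")               # both senses of "dog"
lexemes = find_synset_lexemes(synsets[0])  # members of the first sense
```

Polysemy falls out of the data model directly: a lexeme that belongs to several synsets simply appears in several rows, so FindSynSet() naturally returns one entry per sense.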
7.4 Web Interface Design & Implementation
In this section, we describe the design of the web interface of LexBank. The web
interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2
shows the site map of the web interface. The interface is accessed by the log-in web page
(frmLogin.aspx). New users need to register to gain access to the system. Registration can
be done by filling in the registration web form (frmRegister.aspx). Once a user logs into the
system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes
links to access the services available in the system. In the following sections, we describe
each web page in the system.
Figure 7.2: LexBank web site map
7.4.1 Registration Form
New users need to register in the system using the registration form (frmRegister.aspx).
As shown in Figure 7.3, a new user needs to provide: the full name, email,
email confirmation, user identification, password and password confirmation, then press
the Register button.
Figure 7.3: The registration web form

The registration process starts when a new user submits his information through the
registration web form. Once the registration form receives the information, it checks whether all the
fields meet the requirements of the system. The requirements include a valid format for the
email address and the password. The requirements also include that the user identification
has never been used before by an existing user. If the information sent by the user passes
the validation process, the registration form calls the CreateNewUser() method from the
utility class. The CreateNewUser() method uses the EncryptPassword() method to encrypt
the password, then writes the data into the Users_Info table. The registration process is
summarized in the sequence diagram shown in Figure 7.4.
7.4.2 Log-in Form
Registered users can login to the system using the log-in web page (frmLogin.aspx)
which is shown in Figure 7.5. User with an active account needs to provide his user identi-
fication and password to start the log-in process.
Figure 7.4: Sequence diagram of the registration process

Figure 7.5: The log-in web form

Figure 7.6: Sequence diagram of the log-in process

As shown in Figure 7.6, when the log-in web form (frmLogin.aspx) receives the
userid and the password, it calls the IsAuthenticated() method from the utility class. Then,
the password is encrypted using EncryptPassword() and compared with the encrypted
password stored in the users table. If the userid and the password provided by the user
match those stored in the users table, the main menu of the web interface is shown to
the user; otherwise, an error message is shown to the user. The main menu is shown in
Figure 7.7.
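The log-in check can be sketched in Python as follows. Note this is not the thesis's C# code: LexBank stores reversibly encrypted passwords via EncryptPassword()/DecryptPassword(), whereas this sketch deliberately substitutes a salted one-way hash (PBKDF2) for the comparison step; the function and table names merely mirror those in the text.

```python
import hashlib
import hmac
import os

users_info = {}  # stand-in for the Users_Info table: UserId -> (salt, password hash)

def encrypt_password(password, salt):
    """Stand-in for EncryptPassword(): derive a salted one-way hash (not reversible)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def create_new_user(user_id, password):
    """Stand-in for CreateNewUser(): store the salt and hash for a new account."""
    salt = os.urandom(16)
    users_info[user_id] = (salt, encrypt_password(password, salt))

def is_authenticated(user_id, password):
    """IsAuthenticated() analogue: hash the supplied password and compare."""
    if user_id not in users_info:
        return False
    salt, stored = users_info[user_id]
    return hmac.compare_digest(stored, encrypt_password(password, salt))

create_new_user("feras", "s3cret")
ok = is_authenticated("feras", "s3cret")   # correct password
bad = is_authenticated("feras", "wrong")   # wrong password
```

Using hmac.compare_digest for the comparison avoids timing side channels; the one-way-hash variant shown here is the more common practice when no DecryptPassword() capability is actually needed.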
7.4.3 The Main Menu
The main menu includes links to access the services available in the system. The
services presented by the web interface are:
• Searching wordnet using lexeme, provided by the web page (frmWordnetSearch.aspx).
• Searching wordnet using OffsetPos, provided by the web page (frmSynsetDetails.aspx).
• Evaluating semantic relations between synsets, provided by the web page (frmEval-
Relations.aspx).
• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).
Figure 7.7: The main menu
• User management, provided by the web page (frmManageUsers.aspx).
7.4.4 Searching Wordnet By Lexeme Web Form
The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a
lexeme in a specific language. As shown in Figure 7.8, this web form consists of the
following components:
• A text box that allows the user to enter a lexeme.
• A drop-down menu that allows the user to select the language.
• A list box for showing the synsets of the entered lexeme.
• A list box for showing the synonyms of the entered lexeme.
• A list box for showing the related lexemes.
• A button to start the search process.
The search process, as shown in Figure 7.9, starts when the user submits a lexeme
and a language to the frmWordnetSearch.aspx web form. Then, the FindSynset() method
from the utility class is called to retrieve the synsets that include the entered lexeme and
show the result in the synsets list. Next, when the user selects a synset from the synsets
list, the frmWordnetSearch.aspx web form calls the FindSynsetLexemes() method from the
utility class to show the synonyms of the lexeme in the synonym list. It also calls the
FindSynsetRelations() method to obtain the related lexemes and show them to the user in
the related lexemes list. The user can also expand the details of a synset shown in the
synset list and the related lexemes list by double clicking on the synset OffsetPos. This
shows the frmSynsetDetails.aspx web form, which we describe next.

Figure 7.8: The Web form for searching wordnet by lexeme. The form is showing the result
of searching the Arabic lexeme (���), which means Egypt.
7.4.5 Searching Wordnet By OffsetPos Web Form
Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx web form
which is shown in Figure 7.10. This web form consists of the following components:
Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme
• A text box for entering the OffsetPos of the synset.
• A drop-down menu that allows the user to select the language.
• A text box for showing the gloss of the synset.
• A text box for showing the English gloss of the synset.
• A list box to show the synonym list of the synset.
• A list box to show the related synsets and lexemes of the entered synset.
• A button to start the search process.
In this form, the user starts the process of searching wordnet by submitting the OffsetPos
of the synset and the target language to the frmSynsetDetails.aspx web form. The
web form calls the FindGloss() method from the utility class to retrieve the gloss of the
synset. It also calls the FindSynSetLexemes() and FindSynSetRelations() methods to
obtain the synonym list and related synsets of the input synset and show them in the form.

Figure 7.10: The Web form for searching wordnet by OffsetPos. The form is showing the
result of searching the Arabic synset (08897065-n).
7.4.6 Evaluating Semantic Relations Between Synsets Web Form
The web form frmEvalRelations.aspx allows users to evaluate semantic relations
between lexemes and synsets in the system. The form shows the relation as a sentence and
asks the user to rate the correctness of the sentence using a Likert-type scale. The form
consists of the following components:
• A text box showing the relation key.
• A text box showing the relation in the form of a sentence.
Figure 7.11: Sequence diagram of the process of searching wordnet using OffsetPos.
Figure 7.12: The Web form for evaluating semantic relations between synsets in a wordnet. The form is showing an example of evaluating a hyponymy relation between the two Assamese lexemes radiotelegraph and radio.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the relation.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.13: Sequence diagram of the process of evaluating the relation between two lexemes.
The evaluation form frmEvalRelations.aspx starts the evaluation process by calling
the ReadRelation() method from the utility class to show the relation details to the user.
When the user submits the score he assigns to a relation, the evaluation form frmEvalRelations.aspx
stores the score by calling the EvaluateRelation() method from the utility class.
Then, the evaluation form reads the next relation and shows it to the user. The user can
stop the evaluation process by clicking the End Session button. The user can resume the
evaluation process at any time he wishes without re-evaluating the relations he already
evaluated.
7.4.7 Evaluating Wordnet Synsets Glosses Web Form
Figure 7.14: The Web form for evaluating wordnet synsets glosses. The form is showing an example of evaluating Arabic synset (13108841-n).
The glosses of the wordnets are evaluated using the frmEvalGloss.aspx web form. To
evaluate a synset gloss, the form attaches the English gloss of the synset, obtained from
PWN, to the selected gloss in the target language. Then, the user is asked whether the lexeme
in the selected gloss has the same meaning as the PWN gloss. This evaluation form is
composed of the following components:
• A text box showing the gloss key.
• A text box showing a lexeme from a synset, a candidate gloss written in the target
language, and the English gloss of the synset.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the candidate gloss.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.15: Sequence diagram of the process of evaluating the relation between two lexemes.
The web form frmEvalGloss.aspx starts the gloss evaluation process by calling
the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate
gloss and the English gloss of the synset being evaluated. Then, the web form uses these
data to construct a question for the user. When the user submits the score he assigns to the
candidate gloss, the evaluation form stores the score by calling the EvaluateGloss() method
from the utility class. Then, the evaluation form reads the next gloss and shows it to the user.
The user can stop the gloss evaluation process by clicking the End Session button. The user
can resume the gloss evaluation process at any time he wishes without re-evaluating the glosses
he already evaluated.
Figure 7.16: The Web form for managing users in LexBank.
7.4.8 Users Management Web Form
To allow the administrators of LexBank to manage users, we designed the
frmManageUsers.aspx web form. Access to this form is restricted to administrators. The form
lists all registered users with their information. An administrator can activate the accounts
of new users using this form, and can also deactivate any user in the list. This form can
be extended in the future by adding more functionality. As shown in Figure 7.16, this form
consists of the following components:
• ID: the UserId of the user.
• Name: the full name of the user.
• Email: the email address of the user.
• Privilege: the privilege assigned to the user. This can be administrator or client.
• Status: the current status of the user.
• Change Status: a command link to change the current status of the user. The status
of the user can be changed to Inactive or Active.
As summarized in the sequence diagram shown in Figure 7.17, an administrator
starts the user management process by trying to access the frmManageUsers.aspx web
form. The web form calls the IsAdmin() method from the utility class to verify whether the user is
authorized to access the form. If the user is not authorized, an error message is sent to
the user. Otherwise, the web form calls the RetrieveUsers() method
to obtain the list of registered users in the system. The administrator can then select a user
from the list and click the change status link to change the current status of that user. Then,
the web form calls the ChangeUserStatus() method from the utility class to store the new
status and reloads the updated users list on the screen.

Figure 7.17: Sequence diagram of the process of managing users in LexBank.
7.5 Summary
In this chapter, we described the design and implementation of LexBank, the multilingual
lexical resource we produce in this thesis. The architecture of LexBank consists
of three layers: the database layer, the application layer and the web interface layer. The
database layer consists of two databases: the system settings database and the resources database.
The application layer of the system is implemented using Microsoft C#. It provides
administrative and resource access services to the web interface. The web interface is designed
and implemented using Microsoft Visual Studio 2012. The interface includes web forms for
managing users and provides different wordnet search services in several languages. The
system can be easily updated to accommodate additional language services and languages.
Chapter 8
CONCLUSIONS

In this chapter, we summarize the main contributions of this dissertation. This dissertation
is motivated by the fact that so many languages around the world lack the computational
lexical resources that are essential in natural language processing. Our first goal
in this dissertation is to develop automatic techniques, relying on the few freely available public
resources, for constructing wordnets for low-resource languages. A wordnet is a structured
lexical ontology of words that groups words based on their meaning using sets that are
called synsets. Wordnet is a very important lexical resource that is used in many applica-
tions, such as translation, word-sense disambiguation, information retrieval and document
classification. The second goal of this dissertation is to design and implement a system that
makes the lexical resources we produced available to the public. Next, we list the main
contributions of this dissertation.
• We have developed an approach for constructing structured wordnets. This approach
was developed by extending the approach for constructing core wordnets presented
by (Lam et al., 2014b). A core wordnet consists only of synsets that group
synonym words in sets with a unique id. In a more comprehensive wordnet, these
synsets are semantically connected to represent the relations between their meanings.
Our approach produces synsets that are connected by semantic relations.
Examples of the semantic relations we produce are: synonyms, hypernyms,
topic-domain related terms, part-holonyms, instance-hypernyms
and member-meronyms.
• We presented an approach for enhancing the quality of automatically constructed
wordnets. The approach is based on the vector representation of words (word embeddings).
Word embedding is a machine learning technique that maps words to
real-number vectors in a multi-dimensional space. Our approach uses the word2vec
algorithm (Mikolov et al., 2013) to generate word representations from an existing
corpus. The word2vec algorithm is a feedforward neural network that predicts
the vector representation of words within a multi-dimensional language model. Our
approach computes the cosine similarity, using word2vec vectors, between semantically
related words in our constructed wordnets and filters out any entries which do not satisfy a
pre-selected threshold value.
• We introduced synset2vec, an algorithm for representing wordnet synsets
in a multi-dimensional space. Word embeddings provide an excellent vector representation
of words. However, word representations are affected by the fact that many
words have multiple meanings. In order to represent meanings rather than words, we
combine the vectors of synset lexemes into one vector that represents the meaning. We
believe that this vector representation can be used in many important applications.
For example, it can be used in word-sense disambiguation, machine translation
and gloss selection for wordnet synsets.
• We used our algorithm synset2vec to add glosses to our automatically constructed
synsets. Glosses are a very important part of wordnets. A gloss declares or
clarifies the meaning of a synset in a wordnet. A gloss can be a definition statement
or an example sentence that shows the usage of the synonyms of the synset. To
select a gloss from a corpus for a synset, we used synset2vec to generate vector
representations of the candidate glosses and the synset. Then we compute the cosine
similarity between each candidate gloss and the synset. Finally, we select the gloss
with the highest cosine similarity with the synset and attach it to the synset.
• We have developed LexBank, a web application that gives public
users access to our created resources. LexBank provides useful services for users who seek
linguistic assistance in a friendly manner. It also includes evaluation web forms
that are used to gather feedback from human judges. The design of LexBank is
flexible, and it can be easily expanded to accommodate additional new languages
and resources.
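The cosine-threshold filtering described in the second contribution can be sketched as follows; the threshold, the relation tuples and the stand-in similarity scores are toy values of ours, not numbers from the thesis.

```python
def filter_relations(relations, similarity, threshold=0.4):
    """Keep only candidate semantic relations whose word pair is close
    enough in the embedding space (cosine similarity >= threshold)."""
    return [(w1, rel, w2) for (w1, rel, w2) in relations
            if similarity(w1, w2) >= threshold]

# Toy cosine similarities standing in for word2vec lookups (illustration only).
sims = {("car", "vehicle"): 0.71, ("car", "banana"): 0.08}

kept = filter_relations(
    [("car", "hypernym", "vehicle"), ("car", "hypernym", "banana")],
    similarity=lambda a, b: sims[(a, b)],
)
```

The threshold trades precision against coverage: raising it prunes more noisy automatically induced relations at the cost of discarding some correct ones, which is why the thesis treats it as a pre-selected tuning value.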
Chapter 9
FUTURE WORK

In this chapter, we propose some potential future work that can be done based on this
dissertation. The general goal of the proposed future work is to enhance the quality and
extend the coverage of the lexical resources. For example, we produced our core wordnets
based on machine translation and some small dictionaries. The quality of these wordnets
is limited by the resources we used to create them. It is well known that these resources
do not guarantee high coverage and accuracy for all of the target languages. Next, we list
some of the potential future work.
9.1 Extending Bilingual Dictionaries
In this section, we provide one additional possible task that can be undertaken in
future work. We propose a new method to extend the bilingual dictionaries created in (Lam
et al., 2015b). To increase the coverage of the bilingual dictionaries, we take advantage of
the wordnets we have created in this dissertation. This section is divided into two parts.
In the first part, we describe the approach we used in (Lam et al., 2015b) to create the
bilingual dictionaries. In the second part, we describe the proposed method to extend these
bilingual dictionaries.
9.1.1 Related Work
In (Lam et al., 2015b) we created a large number of new bilingual dictionar-
ies using intermediate core wordnets and a machine translator. A dictionary, or a lexicon,
as defined by (Landau, 1984), consists of sorted 2-tuple <LexicalUnit, Definition> en-
tries. Each entry is called a LexicalEntry. The first part of a LexicalEntry is the phrase being
defined, while the second part is the definition of the phrase. The definition includes the
meaning of the LexicalUnit and usually has several Senses, where a Sense is a separate
representation of a single aspect of the meaning of a phrase. In (Lam et al., 2015b), the entries
in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, ....
The approach for creating dictionaries using intermediate wordnets and a machine
translator (IW) is described in Figure 9.1 and Algorithm 2.
Figure 9.1: The IW approach for creating a new bilingual dictionary
Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S
is a source language and D is a target language, given the dictionary Dict(S,R), where R
is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from
Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from
the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted
Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm
constructs a candidate set candidateSet for the final translations in language D by translating
all the extracted synonyms into language D using machine translation (Algorithm 3). Each
candidate in candidateSet has 2 attributes: word, which represents a translation in
language D, and rank, which counts the occurrences of this translation. The rank attribute
is used to order the candidates in descending order, where the top candidate is the best
translation. Finally, the sorted candidates are inserted into the new dictionary Dict(S,D)
(Algorithm 2, lines 8-10).
Algorithm 2: IW algorithm
Input: Dict(S,R)
Output: Dict(S,D)
1: Dict(S,D) := ∅
2: for all LexicalEntry ∈ Dict(S,R) do
3:   for all SenseR ∈ LexicalEntry do
4:     candidateSet := ∅
5:     Find all Offset-POSs of synsets containing SenseR from the R wordnet
6:     candidateSet = FindCandidateSet(Offset-POSs, D)
7:     sort all candidates in descending order based on their rank values
8:     for all candidate ∈ candidateSet do
9:       SenseD = candidate.word
10:      add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:    end for
12:  end for
13: end for
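A minimal Python sketch of the IW approach in Algorithms 2 and 3. The wordnet tables and the "machine translator" here are toy stand-ins (illustrative assumptions, not the real resources); the candidate ranks are accumulated with a `Counter` so that the most frequent translation comes first.

```python
from collections import Counter

# Toy stand-ins for the real resources (assumptions, not the actual data):
# sense in language R -> its Offset-POSs; Offset-POS -> synonym words
SENSE_TO_OFFSETS = {"hund": ["02084071-n"]}
OFFSET_TO_WORDS = {"02084071-n": ["dog", "domestic dog", "canis familiaris"]}
# A toy "machine translator" into the destination language D
MT = {"dog": "chó", "domestic dog": "chó", "canis familiaris": "chó nhà"}

def find_candidate_set(offsets, translate):
    # Algorithm 3: translate every synonym of every Offset-POS and
    # count how often each destination word appears (its rank)
    ranks = Counter()
    for off in offsets:
        for word in OFFSET_TO_WORDS.get(off, []):
            dest = translate(word)
            if dest:
                ranks[dest] += 1
    return ranks

def iw(dict_s_r, translate):
    # Algorithm 2: build Dict(S,D) from Dict(S,R) via the intermediate wordnet
    dict_s_d = []
    for lexical_unit, sense_r in dict_s_r:
        offsets = SENSE_TO_OFFSETS.get(sense_r, [])
        ranks = find_candidate_set(offsets, translate)
        for cand, _rank in ranks.most_common():  # best-ranked candidate first
            dict_s_d.append((lexical_unit, cand))
    return dict_s_d

print(iw([("hond", "hund")], MT.get))  # -> [('hond', 'chó'), ('hond', 'chó nhà')]
```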
9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
In this section, we propose a new method to extend the dictionaries we created in (Lam
et al., 2015b) using the structured wordnets that we have created in this dissertation. The
Algorithm 3: FindCandidateSet(Offset-POSs, D)
Input: Offset-POSs, D
Output: candidateSet
1: candidateSet := ∅
2: for all Offset-POS ∈ Offset-POSs do
3:   for all word in the Offset-POS extracted from the PWN and other available wordnets linked to the PWN do
4:     candidate.word = translate(word, D)
5:     candidate.rank++
6:     candidateSet += candidate
7:   end for
8: end for
9: return candidateSet
following steps, which are summarized in Figure 9.2, describe the proposed method to
extend the dictionaries.
Figure 9.2: Extending bilingual dictionaries using structured wordnets
• We start by extracting each input entry Si of the source language S in the bilingual
dictionary from S to D.
• Then, we retrieve the list of synsets of Si from the wordnet of S.
• Next, we extract the corresponding synsets from the wordnet of D.
• For each synset member Dk we extracted from the wordnet of D, we create a lexical
entry (Si, Dk).
• Besides that, for each synset we extracted from the wordnet of D, we extract the direct
hypernyms Hl and we also create a lexical entry (Si, Hl).
• Finally, we add any lexical entry we have created in the previous steps to the bilingual
dictionary from S to D if it does not already exist in the dictionary.
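The steps above can be sketched as follows. The aligned wordnet tables and the `extend_dictionary` helper are hypothetical stand-ins for the actual LexBank resources; entries are kept in a set so duplicates are dropped automatically, matching the final step.

```python
# Toy wordnets, aligned by Offset-POS (assumed structures, not the real data)
WORDNET_S = {"hond": ["02084071-n"]}                        # S word -> offsets
WORDNET_D = {"02084071-n": ["dog", "domestic dog"]}          # offset -> D members
HYPERNYMS_D = {"02084071-n": ["canine", "domestic animal"]}  # direct hypernyms in D

def extend_dictionary(dict_s_d):
    # dict_s_d: set of (S word, D word) lexical entries
    extended = set(dict_s_d)
    for s_word, offsets in WORDNET_S.items():
        for offset in offsets:
            # members of the aligned synset in the wordnet of D
            for d_word in WORDNET_D.get(offset, []):
                extended.add((s_word, d_word))
            # members of the direct hypernyms of that synset
            for h_word in HYPERNYMS_D.get(offset, []):
                extended.add((s_word, h_word))
    return extended

entries = extend_dictionary({("hond", "dog")})
print(sorted(entries))
```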
9.2 Integrating Part-of-speech Tagging into Wordnet Construction
Since our approach for automatic wordnet construction is based on translation, some
of the generated synsets include words that are in the wrong part-of-speech form. One
solution is to use a Part-Of-Speech Tagger (POS Tagger) to correct the wrong forms of the
words in the synset.
A POS Tagger is a computer program that specifies the part of speech of the words
in a text written in some language. For example, the Stanford Part-Of-Speech
Tagger (Toutanova et al., 2003), which is freely available, provides part-of-speech tagging
for Arabic, Chinese, French, Spanish and German. Also, other POS Taggers are available
for Assamese (Saharia et al., 2009) and Vietnamese (Le-Hong et al., 2010). However, since
we are dealing with low-resource languages, many languages do not have any POS Taggers
and, therefore, this approach is not applicable to them.
To correct the part of speech in the words within a synset, we propose the following
steps:
• For each synset synseti in a wordnet wordnetT, we extract the part of speech of the
synset from the Offset-POS of synseti.
• For each word wordj in synseti, we find the part of speech of wordj and compare
it with that of synseti. If the parts of speech of wordj and synseti do not
match, we convert wordj to the correct part-of-speech form and update
synseti.
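A minimal sketch of this correction step. Here `pos_of` and `to_pos` are assumed stand-ins for a real POS tagger and a morphological form converter, and the lookup tables are toy data; the real pipeline would plug in a tagger such as those cited above where one exists.

```python
SYNSET_POS = {"02084071-n": "n"}          # Offset-POS -> part of speech
CONVERSIONS = {("running", "n"): "run"}   # toy form converter ("lemmatizer")

def pos_of(word):
    # stand-in tagger: -ing forms tagged as verbs, everything else as nouns
    return "v" if word.endswith("ing") else "n"

def to_pos(word, pos):
    # convert a word to the requested part-of-speech form, if we know how
    return CONVERSIONS.get((word, pos), word)

def correct_synset(offset_pos, members):
    # bring every member of the synset into the synset's part of speech
    target = SYNSET_POS[offset_pos]
    fixed = []
    for w in members:
        fixed.append(to_pos(w, target) if pos_of(w) != target else w)
    return fixed

print(correct_synset("02084071-n", ["running", "dog"]))  # -> ['run', 'dog']
```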
9.3 Wordnet Expansion Using Word Embeddings
One possible way to automatically improve the coverage of a wordnet is by looking
for additional related words in a corpus using word embeddings. In Chapter 6 we intro-
duced synset2vec, which is a vector representation of synsets in a multi-dimensional space.
Taking advantage of synset2vec, we believe it is possible to look for previously unknown
words that are semantically related to a synset and add them to the wordnet. Next, we
present a brief description of our idea.
• Assume that we would like to expand a wordnet wordnetT of language T. First, word
embeddings for T are generated.
• Next, for each synset synseti in wordnetT, the vector ~Vi for synseti is generated using
synset2vec.
• Then, all the words whose cosine similarity with ~Vi is at least a preselected threshold
θ are extracted. From those words, only the words that do not have any
semantic relation with synseti are inserted into a candidate set Ci.
• Next, for each word wordj in Ci, a semantic relation rj is selected based on a classi-
fication approach.
• Finally, wordj is inserted into wordnetT and connected to synseti using the semantic
relation rj.
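The candidate-extraction step of this idea can be sketched as follows, with toy word vectors standing in for real embeddings; the threshold value and the data are illustrative only, and the relation-classification step is left out.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expansion_candidates(synset_vec, word_vectors, known_words, theta):
    # keep words close enough to the synset vector that are not
    # already related to the synset
    return [w for w, v in word_vectors.items()
            if w not in known_words and cosine(synset_vec, v) >= theta]

vecs = {"puppy": [1.0, 0.1], "table": [0.0, 1.0], "dog": [1.0, 0.0]}
print(expansion_candidates([1.0, 0.0], vecs, {"dog"}, 0.9))  # -> ['puppy']
```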
9.4 Producing Vector Representation for Multi-word Lexemes
One issue that appears when producing vector representations is that wordnet lexemes
can be multi-word phrases. Most of the existing tools for producing word embeddings
are single-word based. This means that they produce vectors for lexical units that are
surrounded by spaces. Therefore, when we generate a vector for a wordnet synset,
we skip multi-word lexemes. An enhanced version of our approach to generating vectors
for wordnet synsets can be achieved by including a vector representation for multi-word
lexemes. The vectors of the single words within a multi-word lexeme should be aggregated
such that they form one vector within the synset. However, one issue that arises is that
each single word within a multi-word lexeme might have several meanings when it
appears individually. Therefore, careful research is needed to determine a good solution for
this problem.
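One simple aggregation, averaging the vectors of the component words, can be sketched as follows. As noted above, this deliberately ignores the ambiguity of the individual words, so it is only a starting point, and the toy vectors are illustrative.

```python
def lexeme_vector(lexeme, word_vectors):
    # Average the vectors of the individual words of a multi-word lexeme.
    # Simplification: each word contributes equally, regardless of ambiguity.
    parts = [word_vectors[w] for w in lexeme.split() if w in word_vectors]
    if not parts:
        return None  # no component word is in the embedding vocabulary
    dim = len(parts[0])
    return [sum(v[i] for v in parts) / len(parts) for i in range(dim)]

vecs = {"hot": [1.0, 0.0], "dog": [0.0, 1.0]}
print(lexeme_vector("hot dog", vecs))  # -> [0.5, 0.5]
```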
9.5 Vector Representation for Multi-lingual Wordnets
In this dissertation, we produced vector representations for the individual wordnets.
One work that might help in problems such as wordnet expansion and machine transla-
tion is the vector representation of aggregated wordnets of several languages. Since all
of the wordnets we create in this dissertation are aligned with the PWN, synsets having the
same Offset-POS in different wordnets actually represent the same meaning. Therefore, we
believe that combining the vectors of aligned synsets from different languages will produce
a representation of the meaning across several languages. One can use this representation to
discover the closest meaning of new words that are not included in the wordnets. This
could also be used to discover a rough translation for words that are not included in a
dictionary.
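A minimal sketch of combining aligned synset vectors across languages; averaging is just one possible combination, and the per-language vector tables below are toy stand-ins for real synset2vec output.

```python
def multilingual_synset_vector(offset_pos, wordnet_vectors):
    # wordnet_vectors: language code -> {Offset-POS -> synset vector}.
    # Synsets sharing an Offset-POS are aligned with the PWN, so they are
    # assumed to express the same meaning; here we average their vectors.
    aligned = [vecs[offset_pos] for vecs in wordnet_vectors.values()
               if offset_pos in vecs]
    if not aligned:
        return None  # the synset is missing from every wordnet
    dim = len(aligned[0])
    return [sum(v[i] for v in aligned) / len(aligned) for i in range(dim)]

vectors_by_language = {
    "arb": {"02084071-n": [1.0, 0.0]},
    "vie": {"02084071-n": [0.0, 1.0]},
}
print(multilingual_synset_vector("02084071-n", vectors_by_language))  # -> [0.5, 0.5]
```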
BIBLIOGRAPHY
M. Abbas and K. Smaili. Comparison of topic identification methods for arabic language.
In International Conference on Recent Advances in Natural Language Processing-
RANLP 2005, volume 14, 2005.
M. Abbas, K. Smaïli, and D. Berkani. Evaluation of topic identification methods on arabic
corpora. JDIM, 9(5):185–192, 2011.
K. Ahn and M. Frampton. Automatic generation of translation dictionaries using inter-
mediary languages. In Proceedings of the International Workshop on Cross-Language
Knowledge Induction, pages 41–44. Association for Computational Linguistics, 2006.
P. Akaraputthiporn, K. Kosawat, and W. Aroonmanakun. A Bi-directional Translation
Approach for Building Thai Wordnet. In Asian Language Processing, 2009. IALP’09.
International Conference on, pages 97–101. IEEE, 2009.
M. Apidianaki and R. J. Von Neumann. Limsi: Cross-lingual word sense disambiguation
using translation sense clustering. In Second Joint Conference on Lexical and Computa-
tional Semantics (* SEM), volume 2, pages 178–182, 2013.
M. A. Attia. Handling Arabic morphological and syntactic ambiguity within the LFG
framework with a view to machine translation. PhD thesis, University of Manchester,
2008.
E. Barbu and V. Barbu Mititelu. Automatic building of Wordnets. Recent Advances in
Natural Language Processing IV: Selected Papers from RANLP 2005, 292:217, 2007.
K. R. Beesley. Arabic finite-state morphological analysis and generation. In Proceedings
of the 16th conference on Computational linguistics-Volume 1, pages 89–94. Association
for Computational Linguistics, 1996.
S. Bhattacharya, M. Choudhury, S. Sarkar, and A. Basu. Inflectional morphology synthesis
for bengali noun, pronoun and verb systems. Proc. of NCCPB, 8, 2005.
P. Bhattacharyya. Indowordnet. In Proc. of LREC-10, 2010.
O. Bilgin, z. Çetinoglu, and K. Oflazer. Building a wordnet for Turkish. Romanian Journal
of Information Science and Technology, 7(1-2):163–172, 2004.
L. Bloomfield. Language. New York: Holt, Rinehart and Winston, 1933.
F. Bond and K. Ogura. Combining linguistic resources to create a machine-tractable
Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127–136, 2008.
L. Borin and M. Forsberg. Swesaurus; or, the frankenstein approach to wordnet construc-
tion. In Proceedings of the Seventh Global WordNet Conference (GWC 2014), 2014.
D. Bouamor, N. Semmar, C. France, and P. Zweigenbaum. Using Wordnet and semantic
similarity for bilingual terminology mining from comparable corpora. In Proceedings of
the 6th Workshop on Building and Using Comparable Corpora, pages 16–23. Citeseer,
2013.
R. D. Brown. Automated dictionary extraction for “knowledge-free” example-based trans-
lation. In Proceedings of the Seventh International Conference on Theoretical and
Methodological Issues in Machine Translation, pages 111–118, 1997.
T. Buckwalter. Issues in arabic orthography and morphology analysis. In Proceedings of
the Workshop on Computational Approaches to Arabic Script-based Languages, pages
31–34. Association for Computational Linguistics, 2004.
T. Charoenporn, V. Sornlertlamvanich, C. Mokarat, and H. Isahara. Semi-automatic com-
pilation of Asian WordNet. In 14th Annual Meeting of the Association for Natural Lan-
guage Processing, pages 1041–1044, 2008.
D. Christodoulakis, K. Oflazer, D. Dutoit, S. Koeva, G. Totkov, K. Pala, D. Cristea, D. Tufis,
M. Grigoriadou, I. Tsakou, and others. BalkaNet: A Multilingual Semantic Network for
Balkan Languages. In Proceedings of the 1st International Wordnet Conference, Mysore,
India, 2002.
C. J. Crouch. An approach to the automatic construction of global thesauri. Information
Processing & Management, 26(5):629–640, 1990.
A. Cucchiarelli, R. Navigli, F. Neri, and P. Velardi. Automatic Generation of Glosses in the
OntoLearn System. In LREC. Citeseer, 2004.
J. R. Curran. From distributional to semantic similarity. 2004.
J. R. Curran and M. Moens. Improvements in automatic thesaurus extraction. In Pro-
ceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages
59–66. Association for Computational Linguistics, 2002a.
J. R. Curran and M. Moens. Scaling context space. In Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics, pages 231–238. Association for
Computational Linguistics, 2002b.
K. Darwish. Named entity recognition using cross-lingual resources: Arabic as an example.
In ACL (1), pages 1558–1567, 2013.
M. Diab and N. Habash. Arabic dialect processing tutorial. In Proceedings of the Hu-
man Language Technology Conference of the NAACL, Companion Volume: Tutorial Ab-
stracts, pages 5–6. Association for Computational Linguistics, 2007.
R. M. Fano and D. Hawkins. Transmission of information: A statistical theory of commu-
nications. American Journal of Physics, 29(11):793–794, 1961.
A. Farghaly and K. Shaalan. Arabic natural language processing: Challenges and solutions.
ACM Transactions on Asian Language Information Processing (TALIP), 8(4):14, 2009.
C. Fellbaum. A semantic network of English verbs. WordNet: An electronic lexical
database, 3:153–178, 1998.
C. Fellbaum. WordNet and Wordnets. In A. Barber, editor, Encyclopedia of Language and
Linguistics, pages 2–665. Elsevier, 2005.
M. A. Finlayson. Java libraries for accessing the Princeton WordNet: Comparison and
evaluation. In Proceedings of the 7th Global Wordnet Conference, pages 78–85, 2014.
J. R. Firth. A synopsis of linguistic theory, 1930–1955. 1957.
T. Gollins and M. Sanderson. Improving cross language retrieval with triangulated transla-
tion. In Proceedings of the 24th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 90–95. ACM, 2001.
R. G. Gordon and B. F. Grimes. Ethnologue: Languages of the world, volume 15. SIL
international Dallas, TX, 2005.
S. Green and C. D. Manning. Better arabic parsing: Baselines, evaluations, and analysis. In
Proceedings of the 23rd International Conference on Computational Linguistics, pages
394–402. Association for Computational Linguistics, 2010.
G. Grefenstette. Explorations in automatic thesaurus discovery, volume 278. Springer
Science & Business Media, 2012.
G. Gunawan and A. Saputra. Building synsets for Indonesian Wordnet with monolingual
lexical resources. In Asian Language Processing (IALP), 2010 International Conference
on, pages 297–300. IEEE, 2010.
N. Habash and O. Rambow. Arabic tokenization, part-of-speech tagging and morphological
disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Asso-
ciation for Computational Linguistics, pages 573–580. Association for Computational
Linguistics, 2005.
N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological analysis and
disambiguation for dialectal arabic. In HLT-NAACL, pages 426–432, 2013.
N. Y. Habash. Introduction to arabic natural language processing. Synthesis Lectures on
Human Language Technologies, 3(1):1–187, 2010.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning Bilingual Lexicons
from Monolingual Corpora. In ACL, volume 2008, pages 771–779, 2008.
Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
L. Hinkle, A. Brouillette, S. Jayakar, L. Gathings, M. Lezcano, and J. Kalita. Design and
evaluation of soft keyboards for brahmic scripts. ACM Transactions on Asian Language
Information Processing (TALIP), 12(2):6, 2013.
G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection
and correction of malapropisms. WordNet: An electronic lexical database, 305:305–332,
1998.
E. Héja. Dictionary Building based on Parallel Corpora and Word Alignment. In Proceed-
ings of the XIV Euralex International Congress, Leeuwarden, pages 6–10, 2010.
Y. Hlal. Morphological analysis of arabic speech. In Workshop Papers Kuwait/Proceedings
of Kuwait Conference on Computer Processing of the Arabic Language, pages 273–294,
1985.
V. István and Y. Shoichi. Bilingual dictionary generation for low-resourced language pairs.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Pro-
cessing: Volume 2-Volume 2, pages 862–870. Association for Computational Linguistics,
2009.
P. Jaccard. The distribution of the flora in the alpine zone. New phytologist, 11(2):37–50,
1912.
D. Jurafsky and J. H. Martin. Speech and Language Processing (2Nd Edition). Prentice-
Hall, Inc., Upper Saddle River, NJ, USA, 2009. ISBN 0131873210.
H. Kaji and M. Watanabe. Automatic Construction of Japanese WordNet. Proceedings of
LREC2006, Italy, 2006.
H. Kozima and T. Furugori. Similarity between words computed by spreading activation
on an English dictionary. In Proceedings of the sixth conference on European chapter of
the Association for Computational Linguistics, pages 232–239. Association for Compu-
tational Linguistics, 1993.
K. N. Lam. Automatically Creating MultiLingual Resources. PhD thesis, University of
Colorado, Colorado Springs, Apr. 2015.
K. N. Lam and J. Kalita. Creating Reverse Bilingual Dictionaries. In HLT-NAACL, pages
524–528. Citeseer, 2013.
K. N. Lam, F. Al Tarouti, and J. Kalita. Creating Lexical Resources for Endangered Lan-
guages. In Proceedings of the 2014 Workshop on the Use of Computational Methods
in the Study of Endangered Languages, pages 54–62, Baltimore, Maryland, USA, June
2014a. Association for Computational Linguistics.
K. N. Lam, F. A. Tarouti, and J. Kalita. Automatically constructing Wordnet synsets.
In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014),
Baltimore, USA, June, 2014b.
K. N. Lam, F. Al Tarouti, and J. Kalita. Phrase translation using a bilingual dictionary and
n-gram data: A case study from vietnamese to english. In Proceedings of NAACL-HLT,
pages 65–69, 2015a.
K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically Creating a Large Number of New
Bilingual Dictionaries. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb.
2015b.
S. I. Landau. Dictionaries. NY: Scribners, 1984.
L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for arabic informa-
tion retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th
annual international ACM SIGIR conference on Research and development in informa-
tion retrieval, pages 275–282. ACM, 2002.
P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol. An empirical study of
maximum entropy approach for part-of-speech tagging of vietnamese texts. In Traitement
Automatique des Langues Naturelles-TALN 2010, page 12, 2010.
D. Leenoi, T. Supnithi, and W. Aroonmanakun. Building a Gold Standard for Thai Word-
Net. In Proceeding of The International Conference on Asian Language Processing 2008
(IALP2008), pages 78–82, 2008.
D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th An-
nual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics-Volume 2, pages 768–774. Association for
Computational Linguistics, 1998.
K. Lindén and J. Niemi. Is it possible to create a very large wordnet in 100 days? an
evaluation. Language resources and evaluation, 48(2):191–201, 2014.
K. Lindén and L. Carlson. FinnWordNet – WordNet på finska via översättning. Lexi-
coNordica, 17(17), 2010.
N. Ljubešic and D. Fišer. Bootstrapping bilingual lexicons from comparable corpora for
closely related languages. In Text, Speech and Dialogue, pages 91–98. Springer, 2011.
M. Maziarz, M. Piasecki, E. Rudnicka, and S. Szpakowicz. Beyond the transfer-and-merge
wordnet construction: plwordnet and a comparison with wordnet. In RANLP, pages
443–452, 2013.
J. J. McCarthy. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12
(3):373–418, 1981.
T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word
representations. In HLT-NAACL, pages 746–751, 2013.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38
(11):39–41, 1995.
G. A. Miller and F. Hristea. WordNet nouns: Classes and instances. Computational Lin-
guistics, 32(1):1–3, 2006.
T. Miller and I. Gurevych. Wordnet-wikipedia-wiktionary: Construction of a three-way
alignment. In LREC, pages 2094–2100, 2014.
M. Mladenovic, J. Mitrovic, and C. Krstev. Developing and Maintaining a WordNet: Pro-
cedures and Tools. In Proceedings of the 7th Global Wordnet Conference (GWC 2014),
pages 55–62, 2014.
C. Mouton and G. de Chalendar. JAWS: Just another WordNet subset. Proc. of TALN’10,
2010.
A. S. Nagvenkar, N. R. Prabhugaonkar, V. P. Prabhu, R. N. Karmali, and J. D. Pawar. Con-
cept Space Synset Manager Tool. In Proceedings of the 7th Global Wordnet Conference,
pages 86–94, 2014.
P. Nakov and H. T. Ng. Improved statistical machine translation for resource-poor lan-
guages using related resource-rich languages. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1358–
1367. Association for Computational Linguistics, 2009.
R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic
network. In Proceedings of the 48th annual meeting of the association for computational
linguistics, pages 216–225. Association for Computational Linguistics, 2010.
L. Nerima and E. Wehrli. Generating Bilingual Dictionaries by Transitivity. In LREC,
volume 8, pages 2584–2587, 2008.
R. Noyer. Vietnamese 'morphology' and the definition of word. University of Pennsylvania
Working Papers in Linguistics, 5(2):5, 1998.
A. Oliver. Wn-toolkit: Automatic generation of wordnets following the expand model.
Proceedings of the 7th Global WordNetConference, Tartu, Estonia, 2014.
A. Oliver and S. Climent. Parallel corpora for Wordnet construction: machine translation
vs. automatic sense tagging. In Computational Linguistics and Intelligent Text Process-
ing, pages 110–121. Springer, 2012.
P. G. Otero and J. R. P. Campos. Automatic generation of bilingual dictionaries using inter-
mediary languages and comparable corpora. In Computational Linguistics and Intelligent
Text Processing, pages 473–483. Springer, 2010.
N. R. Prabhugaonkar, J. D. Pawar, and T. Plateau. Use of Sense Marking for Improving
WordNet Coverage. In Proceedings of the 7th Global Wordnet Conference, pages 95–99,
2014.
Q. Pradet, G. de Chalendar, and J. B. Desormeaux. Wonef, an improved, expanded and
evaluated automatic french translation of wordnet. Proceedings of the 7th Global Word-
NetConference, Tartu, Estonia, 2014.
J. Ramírez, M. Asahara, and Y. Matsumoto. Japanese-Spanish thesaurus construction using
English as a pivot. arXiv preprint arXiv:1303.1232, 2013.
G. Rigau, H. Rodriguez, and E. Agirre. Building accurate semantic taxonomies from
monolingual MRDs. In Proceedings of the 17th international conference on Compu-
tational linguistics-Volume 2, pages 1103–1109. Association for Computational Linguis-
tics, 1998.
H. Rodríguez, D. Farwell, J. Ferreres, M. Bertran, M. Alkhalifa, and M. A. Martí. Arabic
wordnet: Semi-automatic extensions using bayesian inference. In LREC, 2008.
B. Sagot and D. Fišer. Building a free French wordnet from multilingual resources. In
OntoLex, 2008.
N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for assamese text. In
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36. Associa-
tion for Computational Linguistics, 2009.
R. C. S. K. Sarma. Structured and logical representations of assamese text for question-
answering system. In 24th International Conference on Computational Linguistics,
page 27, 2012.
M. Saveski and I. Trajkovski. Automatic construction of wordnets by using machine trans-
lation and language modeling. In 13th Multiconference Information Society, Ljubljana,
Slovenia, 2010.
K. Shaalan, A. A. Monem, and A. Rafea. Arabic morphological generation from interlin-
gua. In Intelligent Information Processing III, pages 441–451. Springer, 2006.
U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of morphology of an indic lan-
guage from text corpus. ACM Transactions on Asian Language Information Processing
(TALIP), 7(3):9, 2008.
R. Shaw, A. Datta, D. VanderMeer, and K. Dutta. Building a scalable database-driven
reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3):
528–540, 2013.
S. Soderland, O. Etzioni, D. S. Weld, K. Reiter, M. Skinner, M. Sammer, J. Bilmes, and
others. Panlingual lexical translation via probabilistic inference. Artificial Intelligence,
174(9):619–637, 2010.
K. Tanaka and K. Umemura. Construction of a bilingual dictionary intermediated by a third
language. In Proceedings of the 15th conference on Computational linguistics-Volume 1,
pages 297–303. Association for Computational Linguistics, 1994.
L. C. Thompson. A Vietnamese reference grammar, volume 13. University of Hawaii Press,
1987.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging
with a cyclic dependency network. In Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computational Linguistics on Human Language
Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.
P. Vossen. Introduction to eurowordnet. In EuroWordNet: A multilingual database with
lexical semantic networks, pages 1–17. Springer, 1998.
Wikipedia. Wordnet — wikipedia, the free encyclopedia, 2015. URL http://en.
wikipedia.org/w/index.php?title=WordNet&oldid=656664111.
[Online; accessed 22-April-2015].
Wikipedia. Vietnamese language — wikipedia, the free encyclopedia, 2016a. URL
https://en.wikipedia.org/w/index.php?title=Vietnamese_
language&oldid=731154067. [Online; accessed 30-July-2016].
Wikipedia. Vietnamese morphology — wikipedia, the free encyclopedia, 2016b.
URL https://en.wikipedia.org/w/index.php?title=Vietnamese_
morphology&oldid=730832239. [Online; accessed 30-July-2016].
K. Yu and J. Tsujii. Extracting bilingual dictionary from comparable corpora with de-
pendency heterogeneity. In Proceedings of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of the Association for Computational
Linguistics, Companion Volume: Short Papers, pages 121–124. Association for Compu-
tational Linguistics, 2009.
O. F. Zaidan and C. Callison-Burch. Arabic dialect identification. Computational Linguis-
tics, 40(1):171–202, 2014.
Appendix A
DATA PROCESSING SOFTWARE CODE
A.1 computCosineSim.py
###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date  : Feb 4 2016
###########################
import unicodecsv as csv
import codecs
import gensim
import editdistance

word2vecmodel = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2', 'Word2',
                     'COS', 'ld'])
    with open('LexBankVieSemRelatedWords.csv', 'rb') as f_in:
        reader = csv.reader(f_in, delimiter=',', quoting=csv.QUOTE_NONE)
        firstline = True
        rownum = 0
        for row in reader:
            if firstline:
                firstline = False
            else:
                print("Compute Similarity for pairs number: {0}".format(rownum))
                SynsetID1 = row[0]
                Word1 = row[1]
                Relation = row[2]
                SynsetID2 = row[3]
                Word2 = row[4]
                try:
                    cos = round(word2vecmodel.similarity(Word1, Word2), 3)
                except Exception:
                    cos = 0.00
                ld = editdistance.eval(Word1, Word2)
                newrow = [SynsetID1, Word1, Relation, SynsetID2, Word2, cos, ld]
                writer.writerow(newrow)
            rownum = rownum + 1
A.2 GenerateVectorForSynset.py
###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date  : May 18 2016
###########################
def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []  # the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset
    for lemma in LemmasList:
        if lemma != thislemma:
            Vector = GenerateVectorForLemma(lemma)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add the word vector to the vector set

    # Find out if this synset has only one word; in this case we have to
    # find a related word and add it to the vector set
    if len(VectorList) < 2:
        # we need to find a related synset
        relatedword = FindRelatedSyn(syn)
        if relatedword != "":
            Vector = GenerateVectorForLemma(relatedword)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add the word vector to the vector set

    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)
    # compute the average
    numbofVec = len(VectorList)
    scalar = np.divide(float(1), float(numbofVec))
    FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
A.3 GenerateVectorForGloss.py
###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date  : May 18 2016
###########################
def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # the vector set for this sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        skip = False
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # Scale the vector based on the number of synsets
                if NofSyns > 1:
                    thisScalar = np.divide(float(1), float(NofSyns))
                    Vector = np.multiply(Vector, thisScalar)
                VectorList.append(Vector)
                skip = False  # we have this word in our model
            except Exception:
                skip = True
    if len(VectorList) > 0:
        for vec in VectorList:
            FinalVector = np.add(FinalVector, vec)
        numbofVec = len(VectorList)
        scalar = np.divide(float(1), float(numbofVec))
        FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
A.4 ComputeGlossSynsetSimilarity.py
############################ A program for computing similarity between synset and gloss# Author: Feras Al Tarouti# Date : May 18 2016# First Step : Open the synset-gloss files, and read the sentence# Second Step : Generate the vector for the synset# Third Step : Generate the vector for the sentence# Fourth Step : Compute the cosine similarity between the synset vector# and the sentence vector# Fivth Step : Save the result###########################
with open(InputDataFile, 'rb') as SentencesFile, open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("Computing Cosine Similarity for Row number: {0}".format(rownum))
            thisSenID = row[0]     # read the current sentence ID
            thisSynset = row[1]    # read the current synset ID
            thisSynMem = row[2]    # read the number of members for this synset
            thiswrd = row[3]       # read the word used in this sentence
            thiswrdSyns = row[4]   # read the number of synsets for this word
            thisSentence = row[5]  # read the current sentence
            # Compute a vector for this synset
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")
            # Generate a vector for this sentence
            thisSentenceVector = GenerateVectorFor(thisSentence, "")
            CosDistance = ComputeCosine(thisSynsetVector, thisSentenceVector)
            x = Decimal(CosDistance)
            if math.isnan(x):
                CosDistance = 0
            newrow = [thisSenID, CosDistance]
            writer.writerow(newrow)
        rownum = rownum + 1
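The loop above calls `ComputeCosine`, which is defined elsewhere in the appendix. As a point of reference, a minimal stand-alone sketch of the cosine similarity it computes (plain Python, assuming two equal-length numeric vectors; returns NaN for a zero vector, matching the `math.isnan` guard above):

```python
import math

def compute_cosine(v1, v2):
    # Cosine similarity: dot product over the product of the magnitudes.
    dot = sum(a * b for a, b in zip(v1, v2))
    denom = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if denom == 0:
        return float('nan')  # caller treats NaN as zero similarity
    return dot / denom
```

Parallel vectors score 1.0, orthogonal vectors 0.0, regardless of magnitude.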
Appendix B
MICROSOFT SQL SERVER TABLES
--
-- Database: `LexBank_System`
--
-- --------------------------------------------------------
-- Table structure for table `Users_Info`
--
USE [LexBank_System]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Users_Info](
    [UserId] [varchar](50) NOT NULL,
    [UserName] [varchar](100) NOT NULL,
    [UserEmail] [varchar](70) NOT NULL,
    [UserPwd] [varchar](max) NOT NULL,
    [UserPriv] [varchar](15) NOT NULL,
    [UserStatus] [varchar](15) NOT NULL,
    CONSTRAINT [PK_Users_Info] PRIMARY KEY CLUSTERED
    (
        [UserId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
        ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `System_Log`
--
USE [LexBank_System]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[System_Log](
    [EventId] [int] IDENTITY(1,1) NOT NULL,
    [EventDesc] [varchar](200) NOT NULL,
    [EventTime] [datetime] NOT NULL,
    [UserId] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Database: `LexBank_Resources`
--
-- --------------------------------------------------------
-- Table structure for table `Arabic_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
Appendix C
LEXBANK UTILITY CLASS
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString =
            WebConfigurationManager.ConnectionStrings["LexBankData"].ToString();
        public Boolean IsUserIdAvailable(string UserId)
        {
            // This function takes a user id and checks whether it is already in use
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT UserId FROM Users_Info where UserId like @UserId", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke ExecuteScalar method.
                    var firstColumn = command.ExecuteScalar();
                    if (firstColumn == null)
                    {
                        result = true;
                    }
                }
            }
            return result;
        }
        public string EncryptPassword(string PlanePassword)
        {
            string EncryptionKey = "LexBank";
            byte[] PlaneBytes = Encoding.Unicode.GetBytes(PlanePassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64,
                                 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms, PasswordEncryptor.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(PlaneBytes, 0, PlaneBytes.Length);
                        cs.Close();
                    }
                    PlanePassword = Convert.ToBase64String(ms.ToArray());
                }
            }
            return PlanePassword;
        }
        public string DecryptPassword(string EncryptedPassword)
        {
            string EncryptionKey = "LexBank";
            byte[] DecryptedBytes = Convert.FromBase64String(EncryptedPassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64,
                                 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms, PasswordEncryptor.CreateDecryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(DecryptedBytes, 0, DecryptedBytes.Length);
                        cs.Close();
                    }
                    EncryptedPassword = Encoding.Unicode.GetString(ms.ToArray());
                }
            }
            return EncryptedPassword;
        }
        public Boolean CreateNewUser(string UserId, string UserName, string UserEmail, string UserPwd)
        {
            Boolean result = false;
            string UserPriv = "client";
            string UserStatus = "New";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("INSERT INTO Users_Info VALUES(@UserId,@UserName,@UserEmail,@UserPwd,@UserPriv,@UserStatus)", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserName", UserName.Trim());
                    command.Parameters.AddWithValue("@UserEmail", UserEmail.Trim());
                    command.Parameters.AddWithValue("@UserPwd", UserPwd.Trim());
                    command.Parameters.AddWithValue("@UserPriv", UserPriv.Trim());
                    command.Parameters.AddWithValue("@UserStatus", UserStatus.Trim());
                    // Invoke ExecuteNonQuery method.
                    int c = 0;
                    try
                    {
                        c = command.ExecuteNonQuery();
                        if (c == 1)
                            result = true;
                    }
                    catch (Exception e)
                    {
                        // Insert failed; result stays false
                    }
                }
            }
            return result;
        }
        public bool IsAuthenticated(string userid, string userpassword)
        {
            bool result = false;
            SqlConnection LexBankDataConnection = new SqlConnection(LexBankConnectionString);
            SqlCommand AuthCommand = new SqlCommand("Select UserId, UserPriv, UserStatus from Users_Info where UserId=@userid and UserPwd=@userpassword", LexBankDataConnection);
            AuthCommand.Parameters.AddWithValue("@userid", userid);
            AuthCommand.Parameters.AddWithValue("@userpassword", EncryptPassword(userpassword.Trim()));
            LexBankDataConnection.Open();
            SqlDataReader reader = AuthCommand.ExecuteReader();
            while (reader.Read())
            {
                string UserStatus = reader["UserStatus"].ToString();
                if (UserStatus == "Active")
                {
                    result = true;
                    LogEvent("Login", DateTime.Now, userid.Trim());
                }
            }
            return result;
        }
        public List<string> FindSynSet(string lexeme, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + WordNet + " where Member like @lexeme", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@lexeme", lexeme.Trim());
                    // Invoke ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(0).Trim());
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }
        public List<string> FindSynSetLexemes(string OffsetPos, string WordNet)
        {
            List<string> result = new List<string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + WordNet + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(1).Trim());
                    } // end while
                } // end the second using
            } // end the first using
            return result;
        }
        public Boolean IsSynSetAvailable(string OffsetPos, string Wordnet)
        {
            // This function takes a synset ID and checks whether it is included in a wordnet
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT Offset_Pos FROM " + Wordnet.Trim() + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();

                    if (reader.Read())
                        result = true;
                }
            }
            return result;
        }
        public Dictionary<string, string> FindSynSetRelations(string OffsetPos, string WordNet, string RelationsTable)
        {
            Dictionary<string, string> result = new Dictionary<string, string>();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + RelationsTable.Trim() + " where Left_Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();

                    string Relation = "";
                    int c = 0;
                    while (reader.Read())
                    {
                        if (IsSynSetAvailable(reader.GetString(2).Trim(), WordNet))
                        {
                            Relation = reader.GetString(1).Trim() + " : " + reader.GetString(2).Trim();
                            string RelatedOffsetPos = reader.GetString(2).Trim();
                            List<string> RelatedLexemes = FindSynSetLexemes(RelatedOffsetPos, WordNet);

                            foreach (string lexeme in RelatedLexemes)
                            {
                                c++;
                                result.Add(RelatedOffsetPos + c.ToString(), Relation + "-->" + lexeme);
                            }
                        }
                    } // end while
                } // end the second using
            } // end the first using

            return result;
        }
        public string FindGloss(string OffsetPos, string GlossTable)
        {
            string result = "Gloss is not available";

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT * FROM " + GlossTable + " where Offset_Pos like @OffsetPos", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    // Invoke ExecuteReader method.
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result = reader.GetString(1).Trim();
                    } // end while
                } // end the second using
            } // end the first using

            return result;
        }
        public List<string> ReadRelation(string RelationKey, string RelationDataTable)
        {
            // This method reads a relation and returns it to be evaluated
            List<string> Result = new List<string>();

            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string Sqls = "SELECT [RelationKey], [Word1], [Relation], [Word2] FROM " + RelationDataTable + " where [RelationKey] = @RelationKey";
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                Mycommand.Parameters.AddWithValue("@RelationKey", RelationKey); // bind the @RelationKey parameter
                DataTable MyTable = new DataTable();
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);

                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }
        public List<string> ReadSynsetGloss(int GlossKey, string TableName)
        {
            // This method reads a synset gloss from the table and returns it to be evaluated
            List<string> Result = new List<string>();

            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string Sqls = "SELECT [GlossKey], [Word], [Sentence], [PWNGloss] FROM " + TableName + " where [GlossKey]=@GlossKey";
                DataTable MyTable = new DataTable();
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                Mycommand.Parameters.AddWithValue("@GlossKey", GlossKey);
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);
                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception ex)
            {
                return Result;
            }
        }
        public Boolean EvaluateRelation(int RelationKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls = "INSERT INTO " + EvaluationTable + " ([RelationKey],[Score],[UserID]) values (@RelationKey,@Score,@UserId)";
                var command = new SqlCommand(sqls, MyConnection);
                command.Parameters.AddWithValue("@RelationKey", RelationKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId.Trim());
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }
        private Boolean EvaluateGloss(int GlossKey, int Score, string UserId, string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection = new SqlConnection(LexBankConnectionString);

                string sqls2 = "INSERT INTO " + EvaluationTable + " ([GlossKey],[Score],[UserID]) values (@GlossKey,@Score,@UserId)";
                var command = new SqlCommand(sqls2, MyConnection);
                command.Parameters.AddWithValue("@GlossKey", GlossKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId);

                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception ex)
            {
                return false;
            }
        }
        public void LogEvent(string EventDesc, DateTime EventTime, string UserId)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("INSERT INTO System_Log([EventDesc], [EventTime], [UserId]) VALUES(@EventDesc, @EventTime, @UserId)", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@EventDesc", EventDesc.Trim());
                    command.Parameters.Add("@EventTime", SqlDbType.DateTime).Value = EventTime;
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    // Invoke ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }
        public void ChangeUserStatus(string UserId, string NewStatus)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("UPDATE Users_Info SET UserStatus=@UserStatus WHERE UserId=@UserId", connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserStatus", NewStatus.Trim());
                    // Invoke ExecuteNonQuery method.
                    command.ExecuteNonQuery();
                }
            }
        }
        public DataTable RetrieveUsers()
        {
            DataTable result = new DataTable();

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                // Create new SqlCommand object.
                using (SqlCommand command = new SqlCommand("SELECT [UserId], [UserName], [UserEmail], [UserPriv], [UserStatus] FROM [Users_Info]", connection))
                {
                    SqlDataAdapter dadapter = new SqlDataAdapter(command);
                    dadapter.Fill(result);
                }
            }
            return result;
        }
    }
}