LABELING TURKISH NEWS STORIES WITH CRF
Prof. Dr. Eşref Adalı
ISTANBUL TECHNICAL UNIVERSITY
COMPUTER ENGINEERING
PURPOSE of STUDY
• As the internet grows, the number of electronic text documents increases considerably.
• As the number of documents increases, information extraction grows in importance.
• This study introduces an approach to information extraction that extracts the main subject, main predicate, main location and main date of a text document and labels the document with them for use in semantic web applications.
PURPOSE of STUDY
LABEL      MEANING
SUBJECT    The most important person, place, thing, or idea in the document
PREDICATE  Actual doing or being of the main subject
LOCATION   Location of the main predicate
DATE       Date of the main predicate
PURPOSE of STUDY
• The most pronounced difference between key phrase extraction studies and this labeling study is that the labeling study extracts the most significant phrases together with their functions in the document.
• The extracted labels give the reader an idea of the main topic of the document at a glance.
A SAMPLE LABELED DOCUMENT
SCOPE of STUDY
• The documents inspected are written in Turkish.
• Documents are gathered from news distributors.
• Documents contain 50-300 words.
LABELING by ANNOTATORS
• The data set is composed of 1000 raw news stories gathered from the RSS feeds of Turkish news distributors and then labeled by annotators.
• The manually labeled documents are used in the training and test phases of the CRF model.
Manual Labeling Process
Capturing RSS feeds from news distributors
Arranging captured news in XML format
Reading of news by human annotators and manual labeling
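The capture and arrangement steps can be sketched as follows. This is only an illustration that assumes a standard RSS 2.0 feed. The feed content, the function names `extract_stories` and `to_corpus_xml`, and the output element names are hypothetical; the slides do not describe the actual tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a real news feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Başlık 1</title><description>Birinci haber metni.</description></item>
  <item><title>Başlık 2</title><description>İkinci haber metni.</description></item>
</channel></rss>"""


def extract_stories(rss_text):
    """Return (title, body) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        body = item.findtext("description", default="").strip()
        stories.append((title, body))
    return stories


def to_corpus_xml(stories):
    """Arrange captured stories in a simple XML layout (element names are assumed)."""
    corpus = ET.Element("corpus")
    for number, (title, body) in enumerate(stories, start=1):
        doc = ET.SubElement(corpus, "document", id=str(number))
        ET.SubElement(doc, "title").text = title
        ET.SubElement(doc, "text").text = body
    return ET.tostring(corpus, encoding="unicode")


stories = extract_stories(SAMPLE_RSS)
corpus_xml = to_corpus_xml(stories)
```

The resulting XML string is what a human annotator would then read and label by hand.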
FIRST STEP of STUDY
• Because Turkish is an agglutinative language, the input file is converted to a file that includes the stems, inflectional suffixes and parser results of the raw news stories.
Morphological analyzer → Morphological disambiguator → Dependency parser
Morphological Analyzer
• Each word in a raw news story is morphologically analyzed. Oflazer’s morphological analyzer is used for this purpose. For each word, the analyzer outputs one or more possible analyses.
MORPHOLOGICAL DISAMBIGUATOR
• The most probable analysis must be selected from the output of the morphological analyzer. The morphological disambiguator developed by Sak et al. is used for this. At this point the roots or stems are available.
DEPENDENCY PARSER
• The dependency parser determines the attribute (syntactic function) of each word in a sentence. A multilingual dependency parser is used for this purpose.
CONSTRUCTING THE MODEL
• At first we developed a rule-based model with the help of the features provided by the morphological analyzer, disambiguator and dependency parser.
• Because its success rates were not high enough to be usable, we developed a new model with machine learning techniques.
• In our case a label may consist of one word, but generally consists of more than one word. We can therefore treat our problem as a sequence classification problem.
CONSTRUCTING THE MODEL
• Each word in the document belongs to one of the classes subject, predicate, location or date, or to none of them.
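To make the sequence classification view concrete, the sketch below tags each word with one class and then merges runs of equally tagged words into multi-word labels. The tag names, the sample tagging, and the helper `collect_labels` are assumptions made for illustration; they are not taken from the study's data.

```python
from itertools import groupby

# The five classes named on the slide; "NONE" marks words outside any label.
LABELS = ("SUBJECT", "PREDICATE", "LOCATION", "DATE", "NONE")

# Hypothetical hand-tagged sentence: "Mustafa Kemal Atatürk Ankara'ya gitti."
tokens = ["Mustafa", "Kemal", "Atatürk", "Ankara'ya", "gitti"]
tags = ["SUBJECT", "SUBJECT", "SUBJECT", "LOCATION", "PREDICATE"]


def collect_labels(tokens, tags):
    """Merge runs of consecutive tokens sharing a tag into labeled phrases."""
    phrases = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != "NONE":
            phrases.append((tag, " ".join(tok for tok, _ in group)))
    return phrases


labeled = collect_labels(tokens, tags)
```

Here the three words tagged SUBJECT collapse into the single label "Mustafa Kemal Atatürk", which is why a per-word classifier is enough to produce multi-word labels.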
Rule based features
• Because the experimental data set of this study consists of news stories, the main subject of a text is expected to be a proper noun phrase.
• This assumption was derived by inspecting all manually annotated subject labels.
• To obtain proper noun phrases in Turkish, all words that start with a capital letter are gathered first. However, a capital letter alone does not identify a proper noun in all cases, because other words may also start with a capital letter, such as the first word of a sentence, titles, and month or day names in dates.
Rule based features
• Rule 1: If a word is the first word of a sentence and it is a proper name, it is a possible candidate for a proper noun phrase.
• Rule 2: If a word starts with a capital letter and is not the first word of a sentence, select it as a possible candidate for a proper noun phrase.
• Rule 3: If a conjunction stands between two possible candidates for proper noun phrases, select this word as well.
These rules alone are not enough to divide the selected words into proper noun phrases. For instance, in the sample Turkish sentence “Mustafa Kemal Atatürk Ankara’ya gitti.”, “Mustafa Kemal Atatürk” and “Ankara” are two different proper noun phrases. However, the rules above select “Mustafa Kemal Atatürk Ankara’ya” as a single proper noun phrase. So new boundary rules are defined.
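A minimal sketch of Rules 1 to 3 reproduces the over-merging on the sample sentence. The token layout (a dict with `word`, `proper` and `conj` keys) and the function names are hypothetical; in the study the proper-name and conjunction information would come from the morphological analysis.

```python
def tok(word, proper=False, conj=False):
    """Build a token record; 'proper' and 'conj' are assumed analyzer outputs."""
    return {"word": word, "proper": proper, "conj": conj}


def candidate_flags(sentence):
    """Apply Rules 1-3 and return one boolean per word (True = candidate)."""
    flags = []
    for i, t in enumerate(sentence):
        if i == 0:
            flags.append(t["proper"])              # Rule 1: first word must be a proper name
        else:
            flags.append(t["word"][:1].isupper())  # Rule 2: capitalized, not sentence-initial
    # Rule 3: a conjunction between two candidates is selected as well.
    for i in range(1, len(sentence) - 1):
        if sentence[i]["conj"] and flags[i - 1] and flags[i + 1]:
            flags[i] = True
    return flags


def group_candidates(sentence):
    """Join adjacent candidate words into phrases."""
    phrases, current = [], []
    for t, flag in zip(sentence, candidate_flags(sentence)):
        if flag:
            current.append(t["word"])
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


sentence = [tok("Mustafa", proper=True), tok("Kemal"), tok("Atatürk"),
            tok("Ankara'ya"), tok("gitti.")]
merged = group_candidates(sentence)
```

Because every capitalized word is adjacent to the next, the four words end up in one phrase, which is the failure the slide describes.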
Rule based features
• Boundary Rule 1: If a possible candidate for a proper noun phrase ends with a punctuation mark such as a quotation mark or comma, this word is the last word of the proper noun phrase.
• Boundary Rule 2: If a possible candidate has the suffix “P3sg”, this word is the last word of the proper noun phrase.
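The two boundary rules can be sketched as a pass over a run of adjacent candidate words. The morphological tag strings below are hand-written for illustration and are not real analyzer output, the punctuation set beyond quotation marks and commas is an assumption, and `split_phrases` is a hypothetical helper.

```python
# Punctuation treated as a phrase boundary; the slide names quotation marks
# and commas, the remaining characters are assumed.
BOUNDARY_PUNCTUATION = (",", '"', "”", "'", "’", ";", ":")


def split_phrases(candidates):
    """Split a run of adjacent candidate words into proper noun phrases.

    `candidates` is a list of (word, morph_tags) pairs, where `morph_tags` is
    the disambiguated analysis written as a '+'-separated string.
    """
    phrases, current = [], []
    for word, morph_tags in candidates:
        current.append(word)
        ends_with_punct = word.endswith(BOUNDARY_PUNCTUATION)  # Boundary Rule 1
        has_p3sg = "P3sg" in morph_tags.split("+")             # Boundary Rule 2
        if ends_with_punct or has_p3sg:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical candidate run with illustrative tags.
run = [
    ("Türkiye", "Noun+Prop+A3sg+Pnon+Nom"),
    ("Büyük", "Adj"),
    ("Millet", "Noun+A3sg+Pnon+Nom"),
    ("Meclisi", "Noun+A3sg+P3sg+Nom"),
    ("Ankara'da", "Noun+Prop+A3sg+Pnon+Loc"),
]
phrases = split_phrases(run)
```

In this run the "P3sg" tag on "Meclisi" closes the first phrase, so "Ankara'da" starts a new one.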
Other Features
• Morphological Features: Output of the morphological disambiguator.
• Syntactic Features: Output of the dependency parser.
• Structural Features:
  • The document sequence number in the data set, which identifies the document a word belongs to.
  • The sentence sequence number in the document, which distinguishes sentences.
  • The term frequency in the document.
  • The sequence number of the sentence in which the word is first observed in the document.
  • Whether the first letter of the word is a capital letter.
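The structural features listed above could be computed per word roughly as follows. This sketch counts surface forms exactly as written; the study may well count stems instead, and the function name and dictionary keys are hypothetical.

```python
from collections import Counter


def structural_features(doc_id, sentences):
    """Build one feature dict per word of a document.

    `sentences` is a list of sentences, each a list of word strings.
    """
    # Term frequency of each word form within the document.
    term_freq = Counter(w for sent in sentences for w in sent)

    # Number of the first sentence in which each word form is observed.
    first_seen = {}
    for sent_no, sent in enumerate(sentences, start=1):
        for w in sent:
            first_seen.setdefault(w, sent_no)

    rows = []
    for sent_no, sent in enumerate(sentences, start=1):
        for w in sent:
            rows.append({
                "word": w,
                "doc": doc_id,                    # document sequence number
                "sentence": sent_no,              # sentence sequence number
                "tf": term_freq[w],               # term frequency in document
                "first_sentence": first_seen[w],  # first observed sentence
                "capitalized": w[:1].isupper(),   # first letter capital or not
            })
    return rows


rows = structural_features(7, [["Ankara", "bugün", "karlı"], ["Ankara", "soğuk"]])
```

Each row would then be combined with the morphological and syntactic features of the same word before being passed to the CRF.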
TRAINING CRF SYSTEM
The manually annotated documents are used together with the features of each word. 950 news stories are used as the training set, and the CRF model is generated as shown in Figure 1.
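The slides do not name the CRF toolkit or say how the 950 documents were chosen. As one possibility, toolkits such as CRF++ read training data as whitespace-separated columns, one token per line, with the label in the last column and a blank line between sequences. The sketch below writes that layout and takes the first 950 documents for training; both choices, the feature values, and the function names are assumptions.

```python
def to_column_format(sentences):
    """Serialize labeled sentences to a token-per-line column layout.

    `sentences` is a list of sentences; each sentence is a list of
    (word, features, label) triples, where `features` is a list of strings.
    """
    lines = []
    for sentence in sentences:
        for word, features, label in sentence:
            lines.append(" ".join([word, *features, label]))
        lines.append("")  # a blank line ends the sequence
    return "\n".join(lines)


def split_train_test(documents, n_train=950):
    """Split annotated documents into training and test sets (first-N split assumed)."""
    return documents[:n_train], documents[n_train:]


# Hypothetical sentence with two illustrative feature columns per word.
sample = [[("Ankara'ya", ["Noun", "CAP"], "LOCATION"),
           ("gitti", ["Verb", "LOW"], "PREDICATE")]]
training_text = to_column_format(sample)

train_docs, test_docs = split_train_test(list(range(1000)))
```

With 1000 annotated documents this split leaves 50 documents outside the training set.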
TESTING CRF SYSTEM
To measure the success of the system, the remaining 50 manually annotated documents are used with the generated CRF model.
EVALUATION
• In this study, the main concerns are precision and recall: how many of the suggested labels are correct (precision), and how many of the manually assigned labels are found (recall).
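Precision and recall as defined here can be computed by comparing the suggested labels with the manually assigned ones. The sketch below treats each label as a (document, label type, phrase) item and requires an exact match; the matching criterion and the sample data are assumptions, not results from the study.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (document, label, phrase) items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical labels for a single document.
gold = [
    (1, "SUBJECT", "Mustafa Kemal Atatürk"),
    (1, "PREDICATE", "gitti"),
    (1, "LOCATION", "Ankara'ya"),
    (1, "DATE", "1919"),
]
predicted = [
    (1, "SUBJECT", "Mustafa Kemal Atatürk"),
    (1, "PREDICATE", "gitti"),
    (1, "LOCATION", "Samsun"),
]
precision, recall = precision_recall(predicted, gold)
```

In this made-up case two of the three suggested labels are correct and two of the four manual labels are found.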
Conclusion
Factors that affect the success rate:
• Human annotators are not 100% reliable; humans make mistakes.
• Spell checking is needed, because spelling errors also affect the results of the morphological analyzer.
• Errors of the morphological analyzer
• Errors of the morphological disambiguator
• Errors of the dependency parser
• Size and scope of the training set
THANK YOU