[Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests
Evaluating MT Systems with Second Language Proficiency Tests
Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai
ACL 2015
2015/09/24 AHCLab M1 Makoto Morishita
Abstract
• BLEU has some weaknesses when it comes to evaluating systems in real-world situations.
• In this paper, MT systems are evaluated using second-language proficiency tests (e.g., TOEIC).
• The results reveal that the context-unawareness of current MT systems severely damages human performance when solving the test problems.
Weak Points of BLEU
1. Unreliability in evaluating short translations
2. Non-interpretability of the scores beyond numerical comparison
3. Bias towards SMT systems
Weak Points of Manual Evaluation
1. It is costly.
2. It is not easy to analyze the characteristics of MT systems based solely on the evaluation results.
Solution
• Task-based evaluation of MT systems: measures human performance on a task.
• Humans perform a task, such as information extraction, on a machine-translated text.
Weak Points of Task-Based Evaluation
• It is costly.
- We have to create test materials and gather appropriate human subjects.
• This paper uses second-language proficiency tests (SLPTs), such as TOEIC, as the source of test materials.
• Human subjects solve the translated problems, and the system is evaluated by their test scores.
Second-Language Proficiency Tests (SLPTs)
• There are a lot of SLPTs in many languages.
• They are carefully designed to evaluate various aspects of language ability.
• SLPTs are designed to assess language ability, not general intelligence.
- This makes them robust against the heterogeneity of the subjects.
Materials
• We chose 40 problems at random from the National Center Test for University Admissions (センター試験).
• Each problem consisted of a short conversation between two people.
Materials
• In this paper, we use multiple-choice dialogue-completion problems.
Experiment
• The original problems were in English, and we translated them into Japanese.
• The human subjects solved the translated problems.
• The translation quality was evaluated based on the rate of correct answers given by the human subjects.
Experiment
• Evaluated 4 systems:
- G: Google Translate
- Y: Yahoo Translate
- Hs: Human translation that does not consider context
- Ho: Human translation that considers context
Participants
• 320 Japanese junior high school students
- School A: 1st year: 80, 2nd year: 80, 3rd year: 78
- School B: 1st year: 82
Extrinsic Evaluation Metric
• CAR: Correct Answer Rate
$$\mathrm{CAR}_M(p) = \frac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}$$

$$\overline{\mathrm{CAR}}_M = \frac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)$$
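As a minimal sketch, the CAR computation could look like the following Python, assuming each problem comes with a list of 0/1 correctness flags from the subjects who solved it (the data layout and names are hypothetical):

```python
# Minimal sketch of CAR_M(p) and its average over the problem set P.
# `answers` maps each problem p to the 0/1 correctness flags of the
# subjects who solved M(p); this layout is an assumption, not the paper's.

def car(correct_flags):
    """CAR_M(p): fraction of subjects who answered M(p) correctly."""
    return sum(correct_flags) / len(correct_flags)

def avg_car(answers):
    """Average CAR_M: mean of CAR_M(p) over all problems p in P."""
    return sum(car(flags) for flags in answers.values()) / len(answers)

# Toy data: 3 problems, 4 subjects each.
answers = {
    "p1": [1, 1, 0, 1],
    "p2": [0, 1, 1, 0],
    "p3": [1, 1, 1, 1],
}
print(avg_car(answers))  # (0.75 + 0.5 + 1.0) / 3 = 0.75
```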
Robustness against the Heterogeneity of the Human Subjects
Comparisons across subject groups:
• Within School A (1st year: 80, 2nd year: 80, 3rd year: 78): no difference
• School A 1st year (80) vs. School B 1st year (82): no difference
→ The participants' heterogeneity did not affect the test results.
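The slides do not name the statistical test behind "no difference"; one plausible check is a chi-squared test on the per-group correct/incorrect counts. A hedged sketch with made-up counts:

```python
# Hypothetical sketch: test whether correct-answer counts differ across
# subject groups with a chi-squared test. The slides do not specify the
# actual test, and the counts below are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: School A 1st/2nd/3rd year; columns: (# correct, # incorrect).
table = [
    [60, 20],  # 1st year, 80 subjects
    [58, 22],  # 2nd year, 80 subjects
    [55, 23],  # 3rd year, 78 subjects
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")  # large p -> no significant difference
```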
System-level Evaluation
• We could not find a significant difference between Y and Hs.
(Charts: system-level evaluation results; some pairwise comparisons are annotated "Better" or "Same")
• Refo: reference translation that does not consider context
• Refs: reference translation that considers context
Agreement
• If, by intrinsic measure M, system A's translation scores higher than system B's, and system A's translation also has a higher CAR than system B's, then the two measures agree.
• Check the agreement rate for each problem, as sketched below.
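A minimal sketch of this pairwise agreement check, assuming per-(system, problem) scores are stored in dicts; the names and the tie-skipping convention are assumptions, not the paper's:

```python
# Sketch of the agreement rate between an intrinsic measure and CAR.
# `metric` and `car` map (system, problem) to scores; this layout and
# the tie handling are assumptions made for illustration.
from itertools import combinations

def agreement_rate(metric, car, systems, problems):
    agree = total = 0
    for p in problems:
        for a, b in combinations(systems, 2):
            m_diff = metric[(a, p)] - metric[(b, p)]
            c_diff = car[(a, p)] - car[(b, p)]
            if m_diff == 0 or c_diff == 0:
                continue  # skip ties (assumed convention)
            total += 1
            if (m_diff > 0) == (c_diff > 0):
                agree += 1  # both measures prefer the same system
    return agree / total if total else 0.0

# Toy usage: two systems, two problems (hypothetical scores).
metric = {("A", "p1"): 0.8, ("B", "p1"): 0.6, ("A", "p2"): 0.5, ("B", "p2"): 0.7}
car_s  = {("A", "p1"): 0.9, ("B", "p1"): 0.7, ("A", "p2"): 0.6, ("B", "p2"): 0.4}
print(agreement_rate(metric, car_s, ["A", "B"], ["p1", "p2"]))  # 0.5
```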
Agreement Rate
• Agreement Rates between Automatic Evaluation Metrics and Human Evaluation
Agreement Rate
• Agreement Rates between Intrinsic Evaluation Metrics and Correct Answer Rate
Agreement Rate
• The human evaluation agrees with the CAR slightly better than the automatic metrics.
• But the agreement rate is still less than 0.7.
• CAR can be critically damaged by a subtle mistake.
Conclusion
• The comparison of the 4 systems shows that it is important to consider the context of individual sentences when translating dialogues.
• SLPTs can evaluate a different dimension of translation quality.
• SLPTs are robust against the heterogeneity of human subjects.
Questions & Comments