Automatic extraction and manual validation of a hierarchical English-Swedish terminology
NORDTERM 2009
Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela Gidlund**, Sanna Åsberg**
Presented by Jody Foo
* Department of Computer and Information Science, Linköping University
** Fodina Language Technology AB
Overview
- Background
- Term extraction and validation process
- Results
- Conclusions and future work
Merkel, Foo et al, NORDTERM 2009
Some history
- 2000: Patent Abstracts of Japan (PAJ) launches online machine translation initiative
- 2004: First attempts at MT @ EPO
- 2004: NLPLAB, Linköping University spin-off: Fodina Language Technology
- 2006: Patent Information Conference: results from initial machine translation projects
- 2006: EPO launches patent MT service
- 2008-2009: PRV term extraction and validation
Machine translation
- Two main approaches
  - Rule-based machine translation (RBMT), e.g. Babelfish
  - Statistical machine translation (SMT), e.g. Google Translate
- MT @ EPO
  - Rule-based MT engine: Systran
  - RBMT requires domain-specific dictionaries: patent terms
Diallo 2006
Input data
[Three bar charts of the input data: per class (A01 to H04; axis 0 to 14,000), per subclass (A01B to H04K; axis 0 to 8,000), and per section (A to H; axis 0 to 30,000).]
Overview of the term extraction and validation process
Input: SGML & OCR
1. Source data analysis and system configuration
2. Term candidate extraction
3. Term candidate filtering and initial linguistic validation
4. Manual validation by domain experts
5. Final linguistic validation
6. Publishing of validated terms
Output: OLIF
Perform necessary steps before term extraction is possible
Analysis of source material and system configuration
Example English-Swedish segment pair from the source material:
  EN: Shell thickness at the testing location
  SV: Skalets tjocklek vid provstället
Extract list of term candidates to be validated
Term candidate extraction
Client-server infrastructure
Reduce the number of term candidates to be processed by the domain experts
Term filtering and initial linguistic validation
- Filtering criteria
  - General language filtering
  - Q-value (~alignment confidence)
  - Link errors
  - Source OR target frequency > 4
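The four filtering criteria can be read as a chain of tests that each term candidate pair must pass. The sketch below is only an illustration of that idea. The field names, the order of the tests, the Q-value threshold, the general-language word list and the sample candidates are all assumptions, not taken from the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermCandidate:
    source: str        # English term candidate
    target: str        # Swedish term candidate
    source_freq: int   # occurrences of the source term
    target_freq: int   # occurrences of the target term
    q_value: float     # alignment confidence score
    link_error: bool   # flagged as a faulty source-target link


def passes_filters(cand, general_words, min_q):
    """Return True if the candidate survives all four filtering criteria."""
    # General language filtering: drop everyday vocabulary.
    if cand.source.lower() in general_words:
        return False
    # Link errors: drop pairs flagged as faulty links.
    if cand.link_error:
        return False
    # Frequency: source OR target frequency must exceed 4.
    if not (cand.source_freq > 4 or cand.target_freq > 4):
        return False
    # Q-value: require a minimum alignment confidence (threshold is assumed).
    return cand.q_value >= min_q


# Made-up sample data, for illustration only.
candidates = [
    TermCandidate("shell thickness", "skaltjocklek", 12, 9, 0.91, False),
    TermCandidate("the", "den", 5000, 4800, 0.99, False),
    TermCandidate("kiln", "ugn", 3, 2, 0.88, False),
    TermCandidate("binder", "bindemedel", 7, 1, 0.35, False),
    TermCandidate("slag", "cement", 20, 15, 0.80, True),
]
GENERAL_WORDS = {"the", "and", "of"}  # hypothetical general-language list

kept = [c for c in candidates if passes_filters(c, GENERAL_WORDS, min_q=0.5)]
print([c.source for c in kept])
```

Only the first sample pair survives here: the others fail the general-language, frequency, Q-value and link-error tests respectively.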
Example: C04B
- Total number of term candidates: 143,341
- General language entries: 18,764
- Link errors: 653
- Freq > 4 (source or target): 9,064
- Q-value filtering: keep 4,076
- Total after filtering: 3,179
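To put the C04B example in proportion, the share of candidates that reaches the domain experts can be computed directly from the first and last figures. This is plain arithmetic on the numbers above; the variable names are ours.

```python
total_candidates = 143_341   # all term candidates extracted for C04B
after_filtering = 3_179      # candidates left after all filtering steps

share_kept = after_filtering / total_candidates
print(f"kept: {share_kept:.1%}, removed: {1 - share_kept:.1%}")
```

In this example roughly 2 % of the extracted candidates remain for manual validation.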
Manual validation by domain experts
Final linguistic validation
- To be validated
  - Part-of-speech, inflection pattern, gender, number
- Recycle as much information as possible from previously validated terms
- Process terms by recycling status
  - Very reliable information
  - Less reliable information
  - No information available
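Processing terms by recycling status could look roughly like the sketch below. The slides do not say what separates "very reliable" from "less reliable" information, so the rule used here (exact match in the validated lexicon versus a match on the last word only) is purely an assumption, as are the function names and the sample data.

```python
from enum import Enum


class RecyclingStatus(Enum):
    VERY_RELIABLE = 1    # processed first
    LESS_RELIABLE = 2
    NO_INFORMATION = 3   # processed last


def recycling_status(term, validated):
    """Classify a term by how much linguistic information can be reused.

    ASSUMPTION: an exact match in the validated lexicon counts as very
    reliable; a match on the last word only counts as less reliable.
    """
    if term in validated:
        return RecyclingStatus.VERY_RELIABLE
    head = term.split()[-1]
    if head in validated:
        return RecyclingStatus.LESS_RELIABLE
    return RecyclingStatus.NO_INFORMATION


# Hypothetical lexicon of previously validated terms: term -> part of speech.
validated = {"ugn": "noun", "cement": "noun"}
new_terms = ["roterande ugn", "bindemedel", "cement"]

# Order the work queue so that the most reliable recycled information comes first.
queue = sorted(new_terms, key=lambda t: recycling_status(t, validated).value)
print(queue)
```

Sorting by status lets a validator confirm the recycled entries quickly and spend most of the time on terms with no information available.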
Publishing of validated terms
Top
  A
    A61
  C
    C03
      C03B
      C03C
    C11
    C21
      C21B
      C21C
      C21D
  E
  F
    F42
  H
    H05
      H05B
      H05C
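The hierarchy follows the structure of the classification codes themselves: a subclass such as C21B belongs to class C21, which belongs to section C, which sits under Top. A minimal sketch of walking that path is given below. The `lookup` function shows one possible use of the hierarchy (falling back to more general levels), which is an assumption about usage rather than something stated in the slides; the sample lexicons are made up.

```python
def ancestors(code):
    """Return the path from a classification code up to the root 'Top'.

    Example: subclass 'C21B' -> class 'C21' -> section 'C' -> 'Top'.
    """
    path = [code]
    if len(code) == 4:       # subclass, e.g. C21B
        path.append(code[:3])
    if len(code) >= 3:       # class (or subclass), e.g. C21
        path.append(code[0])
    path.append("Top")
    return path


def lookup(term, code, lexicons):
    """Find a term at the most specific level that has it (assumed usage)."""
    for node in ancestors(code):
        translation = lexicons.get(node, {}).get(term)
        if translation is not None:
            return node, translation
    return None


# Hypothetical per-node lexicons, for illustration only.
lexicons = {
    "C21B": {"blast furnace": "masugn"},
    "C": {"alloy": "legering"},
}

print(ancestors("C21B"))
print(lookup("alloy", "C21B", lexicons))
```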
Final numbers
- Processed 91,000 document pairs in 8 months.
- Validated term pairs: 181,260
- Expert validation: 4,000 to 6,000 term candidate pairs per working day
- Linguistic validation: 2,000 to 3,000 term pairs per working day
Section  Share of documents (%)  Accumulated share of documents (%)  Accumulated term pairs  Accumulated UNIQUE term pairs
D                           2.8                                 2.8                  17,288                          9,697
E                           2.1                                 4.9                  32,045                         16,304
F                           7.1                                12.0                  78,301                         32,512
G                          10.2                                22.2                 133,912                         53,731
H                          10.3                                32.5                 187,429                         72,721
A                          20.7                                53.2                 289,850                        110,642
B                          18.1                                71.3                 419,185                        146,665
C                          28.7                               100.0                 545,143                        181,260
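Because the table reports accumulated totals, the contribution of each section is easier to see after taking differences. The sketch below derives the new unique term pairs per section and relates them to that section's share of the documents. The figures are copied from the table; the variable names and the "per % of documents" measure are ours.

```python
# (section, share of documents in %, accumulated unique term pairs),
# in the order the sections appear in the table.
rows = [
    ("D", 2.8, 9697), ("E", 2.1, 16304), ("F", 7.1, 32512),
    ("G", 10.2, 53731), ("H", 10.3, 72721), ("A", 20.7, 110642),
    ("B", 18.1, 146665), ("C", 28.7, 181260),
]

previous = 0
new_unique = {}
for section, doc_share, accumulated in rows:
    # New unique pairs = accumulated total minus the total before this section.
    new_unique[section] = accumulated - previous
    previous = accumulated
    per_percent = new_unique[section] / doc_share
    print(f"{section}: {new_unique[section]:>6} new unique pairs, "
          f"{per_percent:,.0f} per % of documents")
```

By this measure section D yields about 3,500 new unique pairs per percent of documents and section C about 1,200, which fits the idea that later sections reuse more of the already validated terms.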
Growth of validated terms
[Line chart: number of validated term pairs (0 to 600,000) against amount of total number of documents (0 to 100 %). Two series: accumulated amount of validated term pairs and accumulated amount of validated UNIQUE term pairs. A blue diamond marks the right edge of a section, left to right: D - E - F - G - H - A - B - C.]
Conclusions and future work
- Key concepts
  - using previously validated term pairs to avoid doing the same work twice
  - using students as domain experts
  - using an efficient validation tool
- Future work
  - Improving automated filtering and reduction of term candidates
  - Automating termness detection