Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Automatic extraction and manual validation of a hierarchical English-Swedish terminology. NORDTERM 2009. Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela Gidlund**, Sanna Åsberg**. Presented by Jody Foo. * Department of Computer and Information Science, Linköping University. ** Fodina Language Technology AB.

Transcript of Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Page 1: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Automatic extraction and manual validation of a hierarchical English-Swedish terminology

NORDTERM 2009

Magnus Merkel*, Jody Foo*, Mikael Andersson**, Lars Edholm**, Mikaela Gidlund**, Sanna Åsberg**

Presented by Jody Foo

* Department of Computer and Information Science, Linköping University

** Fodina Language Technology AB

Page 2: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Overview

- Background
- Term extraction and validation process
- Results
- Conclusions and future work

Merkel, Foo et al., NORDTERM 2009

Page 3: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Some history

[Figure: timeline with an axis from 2001 to 2011]

- 2000: Patent Abstracts of Japan (PAJ) launches online machine translation initiative
- 2004: First attempts at MT @ EPO
- 2004: NLPLAB, Linköping University spin-off: Fodina Language Technology
- 2006: Patent Information Conference
  Results from initial machine translation projects
- 2006: EPO launches patent MT service
- 2008 – 2009: PRV term extraction and validation

Page 4: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Machine translation

- Two main approaches
  - Rule-based machine translation (RBMT), e.g. Babelfish
  - Statistical machine translation (SMT), e.g. Google Translate
- MT @ EPO
  - Rule-based MT engine: Systran
  - RBMT requires domain-specific dictionaries: patent terms


Page 5: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Diallo 2006


Page 6: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Diallo 2006


Page 7: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Input data

[Figure: three bar charts of the input data. Chart 1: counts per class code, A01 to H04 (y-axis 0 to 14,000). Chart 2: counts per subclass code, A01B to H04K (y-axis 0 to 8,000). Chart 3: counts per section, A to H (y-axis 0 to 30,000).]


Page 8: Automatic extraction and manual validation of a hierarchical English-Swedish terminology


Page 9: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Overview of the term extraction and validation process

Input: SGML & OCR

1. Source data analysis and system configuration
2. Term candidate extraction
3. Term candidate filtering and initial linguistic validation
4. Manual validation by domain experts
5. Final linguistic validation
6. Publishing of validated terms

Output: OLIF


Page 10: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Perform necessary steps before term extraction is possible


Page 11: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Analysis of source material and system configuration

[Figure: example sentence pair. English: "Shell thickness at the testing location". Swedish: "Skalets tjocklek vid provstället".]


Page 12: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Extract list of term candidates to be validated


Page 13: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Term candidate extraction


Page 14: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Client-server infrastructure


Page 15: Automatic extraction and manual validation of a hierarchical English-Swedish terminology


Page 16: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Reduce the number of term candidates to be processed by the domain experts


Page 17: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Term filtering and initial linguistic validation

- Filtering criteria
  - General language filtering
  - Q-value (~alignment confidence)
  - Link errors
  - Source OR target frequency > 4
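The four filtering criteria can be combined into a single keep/discard decision per term candidate. The following is a minimal sketch, not the system described in the talk. The `Candidate` fields, the general-language word list, and the Q-value threshold of 0.5 are all hypothetical. It also assumes that a candidate is kept when its source OR target frequency exceeds 4, and that a higher Q-value means a more reliable alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One aligned term candidate pair (field names are hypothetical)."""
    source: str            # English term candidate
    target: str            # Swedish term candidate
    source_freq: int       # occurrences of the source term
    target_freq: int       # occurrences of the target term
    q_value: float         # alignment confidence, assumed higher = better
    link_error: bool = False


# Hypothetical settings; the slides do not give the word list or the threshold.
GENERAL_LANGUAGE = {"the", "and", "of", "och", "den"}
MIN_Q_VALUE = 0.5
MIN_FREQ = 4


def keep(candidate: Candidate) -> bool:
    """Apply the four criteria in turn; any failed check discards the pair."""
    # 1. General language filtering
    if (candidate.source.lower() in GENERAL_LANGUAGE
            or candidate.target.lower() in GENERAL_LANGUAGE):
        return False
    # 2. Link errors
    if candidate.link_error:
        return False
    # 3. Source OR target frequency > 4
    if not (candidate.source_freq > MIN_FREQ or candidate.target_freq > MIN_FREQ):
        return False
    # 4. Q-value (assumed threshold)
    return candidate.q_value >= MIN_Q_VALUE


def filter_candidates(candidates):
    """Return the candidates that pass every criterion, in input order."""
    return [c for c in candidates if keep(c)]
```

Applying the cheap checks (word list, link-error flag) before the threshold checks is only a design convenience here; the order does not change which candidates survive.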


Page 18: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Term filtering and initial linguistic validation

Example: C04B

Total number of term candidates:  143,341
General language entries:          18,764
Link errors:                          653
Freq >4 src|trg:                    9,064
Q-value filtering: keep             4,076

Total after filtering:              3,179


Page 19: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Manual validation by domain experts


Page 20: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Overview of the term extraction and validation process


Page 21: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Final linguistic validation

- To be validated
  - Part-of-speech, inflection pattern, gender, number
- Recycle as much information as possible from previously validated terms
- Process terms by recycling status
  - Very reliable information
  - Less reliable information
  - No information available
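One way to picture the three recycling statuses is as a lookup against a store of previously validated terms. The sketch below is purely illustrative and is not the method used in the project. In particular, the rule that an exact match counts as "very reliable" and a match on the last word of a multiword term counts as "less reliable" is an assumption made for this example, as are all function and field names.

```python
from enum import Enum


class RecyclingStatus(Enum):
    VERY_RELIABLE = "very reliable information"
    LESS_RELIABLE = "less reliable information"
    NO_INFORMATION = "no information available"


def recycling_status(term, validated):
    """Return (status, recycled info) for a term.

    `validated` maps previously validated terms to their linguistic
    information (e.g. part-of-speech, inflection pattern, gender, number).
    The matching rules are assumptions, not taken from the slides.
    """
    if term in validated:
        return RecyclingStatus.VERY_RELIABLE, validated[term]
    words = term.split()
    head = words[-1] if words else ""
    if head in validated:
        # Assumed heuristic: reuse the information of the last word.
        return RecyclingStatus.LESS_RELIABLE, validated[head]
    return RecyclingStatus.NO_INFORMATION, None


def group_by_status(terms, validated):
    """Group terms by recycling status so each group can be processed separately."""
    groups = {status: [] for status in RecyclingStatus}
    for term in terms:
        status, _ = recycling_status(term, validated)
        groups[status].append(term)
    return groups
```

Grouping rather than ranking keeps the sketch neutral about the order in which the three groups are handled, which the slide does not state.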


Page 22: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Publishing of validated terms

[Figure: hierarchy of published term collections]

Top
  A
    A61
  C
    C03
      C03B
      C03C
    C11
    C21
      C21B
      C21C
      C21D
  E
  F
    F42
  H
    H05
      H05B
      H05C
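The hierarchy in this diagram follows directly from the codes themselves: the first character is the section, the first three characters are the class, and four characters give the subclass. The sketch below shows how such a tree could be derived from a flat list of codes. It is an illustration only; the function names are hypothetical, and the root name "Top" is taken from the diagram.

```python
from collections import defaultdict


def ancestors(code):
    """Return the path from the root to a code, e.g. 'C21B' -> Top, C, C21, C21B."""
    if len(code) not in (1, 3, 4):
        raise ValueError(f"unexpected code length: {code!r}")
    path = ["Top"]
    for length in (1, 3, 4):
        if len(code) >= length:
            path.append(code[:length])
    return path


def build_tree(codes):
    """Map each node to the sorted list of its direct children."""
    children = defaultdict(set)
    for code in codes:
        path = ancestors(code)
        for parent, child in zip(path, path[1:]):
            children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


if __name__ == "__main__":
    # Codes that appear in the diagram.
    codes = ["C21B", "C21C", "C21D", "C03B", "C03C",
             "H05B", "H05C", "F42", "C11", "A61", "E"]
    for parent, kids in build_tree(codes).items():
        print(parent, "->", ", ".join(kids))
```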


Page 23: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Final numbers

- Processed 91,000 document pairs in 8 months.
- Validated term pairs: 181,260
- Expert validation: 4,000 to 6,000 term candidate pairs per working day
- Linguistic validation: 2,000 to 3,000 term pairs per working day


Section  Documents (%)  Accumulated documents (%)  Accumulated term pairs  Accumulated UNIQUE term pairs
D                  2.8                        2.8                   17288                           9697
E                  2.1                        4.9                   32045                          16304
F                  7.1                       12.0                   78301                          32512
G                 10.2                       22.2                  133912                          53731
H                 10.3                       32.5                  187429                          72721
A                 20.7                       53.2                  289850                         110642
B                 18.1                       71.3                  419185                         146665
C                 28.7                      100.0                  545143                         181260

Documents (%) is each section's share of the total number of documents.
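Because the term pair columns are accumulated, the contribution of each section has to be derived by subtraction. The sketch below does that arithmetic on the figures from the table, in the order the table lists the sections. It is an added illustration, not part of the original slides, and the names are hypothetical.

```python
# (section, accumulated term pairs, accumulated UNIQUE term pairs),
# copied from the table in the order D-E-F-G-H-A-B-C.
ACCUMULATED = [
    ("D", 17288, 9697),
    ("E", 32045, 16304),
    ("F", 78301, 32512),
    ("G", 133912, 53731),
    ("H", 187429, 72721),
    ("A", 289850, 110642),
    ("B", 419185, 146665),
    ("C", 545143, 181260),
]


def per_section_increments(rows):
    """Return (section, new term pairs, new unique term pairs) per section."""
    increments = []
    prev_pairs = prev_unique = 0
    for section, pairs, unique in rows:
        increments.append((section, pairs - prev_pairs, unique - prev_unique))
        prev_pairs, prev_unique = pairs, unique
    return increments


if __name__ == "__main__":
    for section, new_pairs, new_unique in per_section_increments(ACCUMULATED):
        share = new_unique / new_pairs
        print(f"{section}: {new_pairs} pairs, {new_unique} new unique ({share:.1%})")
```

Read this way, section D adds 9,697 unique pairs out of 17,288 validated pairs, while section C adds 34,595 out of 125,958. That pattern is what one would expect if a growing share of pairs has already been validated in an earlier section.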

Page 24: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Growth of validated terms

[Figure: line chart. x-axis: amount of total number of documents (in %), 0 to 100. y-axis: number of validated term pairs, 0 to 600,000. Series: accumulated amount of validated term pairs; accumulated amount of validated UNIQUE term pairs. A blue diamond marks the right edge of a section, left to right: D - E - F - G - H - A - B - C.]

Page 25: Automatic extraction and manual validation of a hierarchical English-Swedish terminology

Conclusions and future work

- Key concepts
  - using previously validated term pairs to avoid doing the same work twice
  - using students as domain experts
  - using an efficient validation tool
- Future work
  - Improving automated filtering and reduction of term candidates
  - Automating termness detection
