Supporting e-learning with automatic glossary extraction Experiments with Portuguese Rosa Del Gaudio, António Branco RANLP, Borovets 2007

Page 1

Supporting e-learning with automatic glossary extraction

Experiments with Portuguese

Rosa Del Gaudio, António Branco
RANLP, Borovets 2007

Page 2

Presentation Plan

● LT4eL project
● ILIAS
● Corpus
● Tool
● Grammars
  ● Copula
  ● Other Verbs
  ● Punctuation
● Results
● Conclusion

Page 3

LT4eL

● Improve the retrieval and accessibility of learning objects (LOs) in learning management systems.
● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata.
● Develop new functionalities, such as a keyword extractor, a glossary candidate detector, and semantic search, tuned for the languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian).

Page 4

ILIAS

Page 5

Objective

● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages).
● Better access to information for students
● Accelerate the tutor's work

Page 6

ILIAS: Glossary Candidate Detector

Page 7

The Corpus

• 274,000 tokens
• Tutorials
• PhD theses
• Scientific papers
• 3 domains, evenly represented
  • e-learning
  • Technology for non-experts
  • Calimera

Page 8

XML format

<definingText continue="y" def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  </markedTerm>
  <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
  <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
  <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
  <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
  <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
  <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
  <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
  <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
  <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
  <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
  <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
  <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
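The annotation above can be read with any XML parser. The following is a minimal Python sketch, not part of the original slides, that pulls the marked term, the definition type, and the token tags out of a shortened copy of the example. The function name `read_definition` and the trimmed sample are illustrative assumptions; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example; most attributes are trimmed for brevity.
SAMPLE = """
<definingText def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" ctag="PNM" id="t9032">Intranet</tok>
  </markedTerm>
  <tok base="ser" ctag="V" id="t9033" msd="pi-3s">é</tok>
  <tok base="uma" ctag="UM" id="t9034" msd="fs">uma</tok>
  <tok base="rede" ctag="CN" id="t9035" msd="fs">rede</tok>
</definingText>
"""


def read_definition(xml_text):
    """Hypothetical helper: summarize one <definingText> element."""
    root = ET.fromstring(xml_text.strip())
    # The defined term is the text of the tokens inside <markedTerm>.
    term = " ".join(t.text for t in root.iterfind("markedTerm/tok"))
    # All tokens in document order, paired with their part-of-speech tag.
    tokens = [(t.text, t.get("ctag")) for t in root.iter("tok")]
    return {"type": root.get("def_type1"), "term": term, "tokens": tokens}


info = read_definition(SAMPLE)
print(info["type"], info["term"])
```

Here `def_type1` carries the definition type and `markedTerm` wraps the term being defined, as in the slide's example.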

Page 9

LxTransduce

• Input: plain text or XML
• Regular expressions
• Substitution and markup
• Output: the same file with changes
• Match tree using elements
• Quick
• Unicode friendly
• Freeware
• Easy to integrate into other tools (Java)

Page 10

Rules in lxtransduce

<rule name="Conj">
  <query match="tok[@ctag = 'CJ']"/>
</rule>

<rule name="Coor">
  <!-- Conjunctions or comma -->
  <first>
    <query match="tok[. = ',']"/>
    <ref name="Conj" mult="+"/>
  </first>
</rule>

<rule name="PARopen">
  <query match="tok[.~'^\($']"/>
</rule>

<rule name="PARcl">
  <query match="tok[.~'^\)$']"/>
</rule>

<rule name="parenthetic">
  <seq>
    <ref name="PARopen"/>
    <repeat-until name="tok">
      <ref name="PARcl"/>
    </repeat-until>
    <ref name="PARcl"/>
  </seq>
</rule>
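To make the intent of the `parenthetic` rule concrete, here is a small Python sketch, not taken from the slides, that finds the same kind of span in a plain list of token strings: an opening parenthesis, any tokens, then the first closing parenthesis. It does not use lxtransduce, and the function name `parenthetic_spans` is a hypothetical choice; nesting is ignored, which is an assumption about the rule's behavior.

```python
def parenthetic_spans(tokens):
    """Return (start, end) index pairs covering '(' ... ')' inclusive."""
    spans = []
    start = None  # index of the currently open '(' token, if any
    for i, tok in enumerate(tokens):
        if tok == "(" and start is None:
            start = i
        elif tok == ")" and start is not None:
            spans.append((start, i))
            start = None
    return spans


sentence = ["LMS", "(", "Learning", "Management", "System", ")", "é", "um", "sistema"]
print(parenthetic_spans(sentence))
```

An unmatched opening parenthesis yields no span, mirroring a rule that needs both `PARopen` and `PARcl` to succeed.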

Page 11

First development phase

● Less than 50% of the corpus
● Focus on the verb
● Precision: correct automatic / all automatic
● Recall: correct automatic / manually marked
● F2: 3 * (precision * recall) / (2 * precision + recall)

Grammar   Precision   Recall   F2
Gr 00     0.14        0.44     0.26
Gr 01     0.31        0.20     0.22
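The three measures can be written out as a short Python sketch. It follows the slide's definitions, with precision taken as correct automatic over all automatic, and the slide's F2 formula 3PR / (2P + R). The function names are illustrative, and the zero-denominator handling is an added assumption.

```python
def precision(correct_auto, all_auto):
    """Share of automatically extracted definitions that are correct."""
    return correct_auto / all_auto if all_auto else 0.0


def recall(correct_auto, manually_marked):
    """Share of manually marked definitions that were found automatically."""
    return correct_auto / manually_marked if manually_marked else 0.0


def f2(p, r):
    """F2 as given on the slide: 3 * P * R / (2 * P + R)."""
    denom = 2 * p + r
    return 3 * p * r / denom if denom else 0.0


# Precision and recall reported for the two first-phase grammars.
for name, p, r in [("Gr 00", 0.14, 0.44), ("Gr 01", 0.31, 0.20)]:
    print(name, round(f2(p, r), 4))
```

Because recall carries more weight than precision in this formula, Gr 00 scores higher than Gr 01 despite its much lower precision.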

Page 12

Second development phase

• 75% of the corpus for development
• 25% of the corpus for testing
• Specific grammar/rules for each definition type

Page 13

Copula baseline grammar

<rule name="euristic">
  <seq>
    <repeat-until name="tok">
      <ref name="SERdef" mult="+"/>
    </repeat-until>
    <ref name="SERdef" mult="+"/>
    <not><ref name="PPA"/></not>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>

Verb “to be”, third person singular or plural, present indicative

<rule name="SERdef">
  <best>
    <ref name="Ser3"/>
    <ref name="PoderSer"/>
  </best>
</rule>
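The baseline heuristic can be approximated outside lxtransduce. The Python sketch below is an illustration, not the authors' implementation: it flags a sentence when its first third-person present-indicative form of "ser" is not immediately followed by a past participle (`PPA`). Tokens are modeled as dictionaries with the `base`, `ctag`, and `msd` attributes from the corpus XML. The `PoderSer` alternative is omitted because its definition is not shown, and checking only the first occurrence is an assumption about how `repeat-until` behaves.

```python
def is_ser3(tok):
    """True for 'ser' in the third person, present indicative (msd 'pi-3...')."""
    return (
        tok.get("ctag") == "V"
        and tok.get("base") == "ser"
        and tok.get("msd", "").startswith("pi-3")
    )


def baseline_copula_match(tokens):
    """Hypothetical stand-in for the 'euristic' rule on one tokenized sentence."""
    for i, tok in enumerate(tokens):
        if is_ser3(tok):
            # Skip over one or more consecutive copula tokens (mult="+").
            j = i
            while j < len(tokens) and is_ser3(tokens[j]):
                j += 1
            # The rule fails if a past participle follows the copula.
            return j >= len(tokens) or tokens[j].get("ctag") != "PPA"
    return False


intranet = [
    {"base": "intranet", "ctag": "PNM"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "uma", "ctag": "UM", "msd": "fs"},
    {"base": "rede", "ctag": "CN", "msd": "fs"},
]
print(baseline_copula_match(intranet))
```

Excluding participles after the copula filters out passive constructions, which use "ser" but rarely introduce a definition.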

Page 14

Copula base result

• Sentence level results

• Problem with precision

Page 15

Copula Grammar

Page 16

Rules for is_type

<!-- To be, 3rd person plural and singular -->
<rule name="SERdef">
  <query match="tok[@ctag = 'V' and @base = 'ser' and (@msd[starts-with(., 'fi-3')] or @msd[starts-with(., 'pi-3')])]"/>
</rule>
....

<rule name="copula1">
  <seq>
    <ref name="SERdef"/>
    <best>
      <seq>
        <ref name="Art"/>
        <ref name="adj|adv|prep|" mult="*"/>
        <ref name="Noun" mult="+"/>
      </seq>
      ....
    </best>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>

Page 17

Comparing Results

Include the patterns that were excluded.

Try to gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.

Page 18

Other_Verbs grammar

• Collect verbs in a lexicon
• Three different categories: reflexive, active, passive
• 22 different verbs

<lex word="chamar"><cat>ref</cat></lex>
<lex word="chamar,chamado"><cat>pas</cat></lex>

<rule name="Vpas">
  <seq>
    <ref name="tok"/>
    <not><ref name="not"/></not>
    <ref name="tok" mult="?"/>
    <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat='pas'"/>
  </seq>
</rule>
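The lexicon lookup in the `Vpas` query can be sketched in Python as a dictionary from a token's `base` attribute to its verb categories. Only the two entries shown on the slide are included; the dictionary layout and the function name `is_passive_definition_verb` are illustrative assumptions, and the surrounding sequence (preceding token, negation check) is left out.

```python
# The two lexicon entries shown on the slide: base form -> categories.
LEXICON = {
    "chamar": {"ref"},          # reflexive use
    "chamar,chamado": {"pas"},  # passive use (participle)
}


def is_passive_definition_verb(tok):
    """Mirror of the query: a PPA token whose base is in the lexicon as 'pas'."""
    categories = LEXICON.get(tok.get("base"), set())
    return tok.get("ctag") == "PPA" and "pas" in categories


chamado = {"base": "chamar,chamado", "ctag": "PPA", "msd": "ms"}
print(is_passive_definition_verb(chamado))
```

Keeping the verbs in a lexicon rather than in the rules means a new definitional verb can be added without touching the grammar.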

Page 19

Results for verb_type

• Analyze each verb separately, as with is_type
• Richer syntactic patterns

Page 20

Punctuation Grammar

<rule name="punct_def">
  <seq>
    <start/>
    <ref name="CompmylexSN" mult="+"/>
    <query match="tok[.~'^:$']"/>
    <ref name="tok" mult="+"/>
    <end/>
  </seq>
</rule>

● Preliminary work
● Definitions introduced by a colon (the most frequent case)
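A simplified version of the colon pattern can be sketched in Python. It splits a tokenized sentence at the first colon and requires at least one token on each side. This is a rough approximation: the noun-phrase check performed by `CompmylexSN` is replaced by a plain non-empty test, and the function name `split_colon_definition` is hypothetical.

```python
def split_colon_definition(tokens):
    """Return (term, definition) if the sentence has the shape 'X : Y', else None."""
    if ":" not in tokens:
        return None
    i = tokens.index(":")  # position of the first colon token
    term, definition = tokens[:i], tokens[i + 1:]
    # Both sides must contain at least one token (mult="+" in the rule).
    if not term or not definition:
        return None
    return " ".join(term), " ".join(definition)


print(split_colon_definition(["Intranet", ":", "rede", "interna"]))
```

Without the noun-phrase restriction on the left side, this sketch would accept many non-definitions, which is why the grammar constrains what may precede the colon.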

Page 21

All-in-one

• Combination of the previous grammars
• The definition type is not taken into account when calculating precision and recall

Page 22

Conclusions and Future Work

• Overall results: Recall 86%, Precision 14%
• Differences among domains: the style of a document influences the results
• Improve the rules for verb_type and punc_type
• Combine with other techniques such as machine learning (ML)