Logics and Ontologies for Portuguese Understanding

Page 1: Logics and Ontologies for Portuguese Understanding

+Logics and Ontologies for Portuguese Understanding

Valeria de Paiva, Nuance Comms, CA; Visiting Prof, DI-PUC-RJ, CSF (with Alexandre Rademaker, Gerard de Melo, Claudia Freitas, Livy Real, Dario Oliveira, Suemi Higuchi …)

Page 2: Logics and Ontologies for Portuguese Understanding

+Nuance Comms, AI and NL Lab, Sunnyvale, CA

Page 3: Logics and Ontologies for Portuguese Understanding

+Ron Kaplan “Beyond the GUI: It’s Time for a Conversational User Interface”, Wired 2013

http://www.wired.com/2013/03/conversational-user-interface/

Page 4: Logics and Ontologies for Portuguese Understanding

+Siri and other personal assistants

Page 5: Logics and Ontologies for Portuguese Understanding

+The Future is Meaning…

Page 6: Logics and Ontologies for Portuguese Understanding

+Recent Past…

Page 7: Logics and Ontologies for Portuguese Understanding

+PARC’s Bridge System (1999-2008)

Page 8: Logics and Ontologies for Portuguese Understanding

+PARC’s Bridge System (1999-2008)

[Architecture diagram. Labels: Text, Sources, Question, Parsing, Grammar, Stanford Parser, F-structure, semantics, KR Mapping, Term rewriting, KR mapping rules, OpenWN-PT, SUMO-PT, KR, Assertions, Query, Inference Engines, Textual Inference logics]

Page 9: Logics and Ontologies for Portuguese Understanding

+Powerset

Acquired by Microsoft, 2008

Page 10: Logics and Ontologies for Portuguese Understanding

+and Cuil…

Page 11: Logics and Ontologies for Portuguese Understanding

+Another story

https://www.parc.com/event/934/adventures-in-searchland.html

Page 12: Logics and Ontologies for Portuguese Understanding

+Today: redoing PARC work in Portuguese…

Page 13: Logics and Ontologies for Portuguese Understanding

+Goals in 2010

Page 14: Logics and Ontologies for Portuguese Understanding

+Goals in 2010

Page 15: Logics and Ontologies for Portuguese Understanding

+Goals in 2010…

- Content analysis / large-scale intelligent information extraction, access and retrieval

- Text understanding

- Text generation

- Text simplification

- Automatic summarization

- Dialogue systems

- Question answering

- Machine Translation

- Named Entity Recognition

- Anaphora/co-reference resolution

- Reading, writing, grammar aids, etc.

Page 16: Logics and Ontologies for Portuguese Understanding

+Pipeline envisaged in 2010 Plug and Play…

Page 17: Logics and Ontologies for Portuguese Understanding

+SRI Talk: www.slideshare.net/valeria.depaiva/bridge-sri

Page 18: Logics and Ontologies for Portuguese Understanding

+PARC’s Bridge System (1999-2008)

Idea: Simplify and reproduce components in PORTUGUESE

[Architecture diagram. Labels: Text, Sources, Question, Parsing, Grammar, Stanford Parser, F-structure, semantics, KR Mapping, Term rewriting, KR mapping rules, OpenWN-PT, SUMO-PT, KR, Assertions, Query, Inference Engines, Textual Inference logics]

Page 19: Logics and Ontologies for Portuguese Understanding

+Reality Check…

- What PARC considered pre-processing is MOST of the processing…

- Got the XLE research license, but it was hard to use: it needs several lexicons that DO NOT exist for Portuguese, notably WordNet

- There are several open toolkits that can be used instead: FreeLing, OpenNLTK, StanfordNLP

- These are more usable, have more community support, and require less expertise

Page 20: Logics and Ontologies for Portuguese Understanding

+Why do we need lexical resources?

- Without them we can do nothing!

- We need dictionaries, gazetteers, thesauri, lexica of all kinds

- We want a system that UNDERSTANDS and REASONS in PORTUGUESE

- BUT to do logic and reasoning over knowledge representations derived from text, in a scalable and robust way, we have to deal with language seriously…

- WHICH lexical resources? We started with WordNet and NomLex…

Page 21: Logics and Ontologies for Portuguese Understanding

+Since 2010…

- Two large and important lexical resources: OpenWN-PT and NomLex-PT

- The group has grown and several research agendas are intertwined

- Today: an attempt to organize the more ambitious project, in the style of the 2010 one.

Page 22: Logics and Ontologies for Portuguese Understanding

+Portuguese WordNet?

- OpenWordNet-PT: our big success so far…

- Used by FreeLing as its internal dictionary for Portuguese

- Used by Onto.PT for calibration

- Most importantly, used by Bond in Open Multilingual WordNet http://compling.hss.ntu.edu.sg/omw/

- Used by other researchers (anecdotal evidence)

Page 23: Logics and Ontologies for Portuguese Understanding

+WordNet?

http://wordnetweb.princeton.edu/

Page 24: Logics and Ontologies for Portuguese Understanding

+WordNet…

- A combination of dictionary and thesaurus that is intuitive, usable, and supports automatic text analysis and artificial intelligence applications. Released under a BSD-style license, it can be downloaded and used freely.

- WordNet is the most commonly used computational lexicon of English.

- Created at Princeton University under George A. Miller, since 1985. A lexical database for English: it groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synsets.

- [Some complain that WordNet encodes sense distinctions that are too fine-grained even for humans. The granularity issue has been tackled by proposing clustering methods that automatically group together similar senses of the same word.]

Page 25: Logics and Ontologies for Portuguese Understanding

+Global WordNet Association

- Christiane Fellbaum and Piek Vossen (EuroWordNet 1996-1999, GWA since)

- The Global WordNet Association (GWA) is a free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world.

- Open Multilingual Wordnet, Francis Bond http://compling.hss.ntu.edu.sg/omw

- A simple user interface: Welcome to the Open Multi-lingual Wordnet (1.2) http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?gridmode=grid

Page 26: Logics and Ontologies for Portuguese Understanding

+Multilingual Wordnet 1.0

Page 27: Logics and Ontologies for Portuguese Understanding

+Multilingual Wordnet 1.2

data for over 150 languages, with an estimated accuracy of 94%....

Page 28: Logics and Ontologies for Portuguese Understanding

+Multilingual Wordnet 1.2

Page 29: Logics and Ontologies for Portuguese Understanding

+ OpenWordnet-PT? (aren’t all wordnets open?)

We needed a Portuguese Wordnet for our work, but none of the previous projects was openly available.

Previous work: WordNet.PT and WordNet.PT global (Lisboa), MultiWordNet.PT and Brazilian WordNet by Bento Dias.

Page 30: Logics and Ontologies for Portuguese Understanding

+Portuguese WordNets…

- WordNet.PT (P. Marrafa): since 1999, part of EuroWordNet, 19K expressions, manually curated, online consulting only. Some domains

- MWN.PT, the MultiWordnet of Portuguese (A. Branco): since 2008, part of MWN, over 17,200 manually validated concepts/synsets, not free

- WN.Br (B. Dias da Silva): since 2000, not open, not available online, REBECA 2010 only ‘wheeled vehicles’…

- SIMULTANEOUS to ours: Onto.PT, Coimbra http://ontopt.dei.uc.pt/ Hugo Gonçalo Oliveira and Paulo Gomes

Page 31: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: What?

- Leverage EuroWordNet, MultiWordNet, Global WordNet experience

- Gerard de Melo joined the project: leverage YAGO, UWN/MENTA experience…

- UWN/MENTA (de Melo/Weikum): a large-scale multilingual lexical knowledge base built using statistical methods, transforming WordNet into a massively multilingual resource (over 1 million words and several million named entities in a single large multilingual taxonomy)

- The Portuguese ‘projection’ of UWN/MENTA is the basis of the automated version of OpenWordNet-PT, publicly available.

https://github.com/arademaker/wordnet-br

Page 32: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: the method

- A two-tiered methodology: high precision for more frequent words, high recall to cover the long tail

- Translation dictionaries are used to map the English members of a synset to possible Portuguese translation candidates

- To disambiguate and choose the correct translations, feature vectors for possible translations are created by computing graph-based statistics in the graph of words, translations, and synsets.

- Monolingual wordnets and parallel corpora are used to enrich this graph. Statistical learning techniques are used to iteratively refine this information and build an output graph connecting Portuguese words to synsets.

- Wikipedia pages are then linked to relevant WordNet synsets by learning from similar graph-based features as well as gloss similarity scores.
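
The candidate-generation and disambiguation steps above can be sketched as follows. This is a toy illustration, not the UWN/MENTA implementation: the synset identifiers, the dictionary entries, and the single "support" feature are all assumptions made for the example, whereas the real system learns from many graph-based features.

```python
from collections import defaultdict

# Toy English synsets (identifiers and members are illustrative, not real WordNet data).
SYNSETS = {
    "bank-n-1": ["bank", "depository"],  # financial institution
    "bank-n-2": ["bank", "shore"],       # land beside water
}

# Toy English -> Portuguese translation dictionary (assumed for illustration).
TRANSLATIONS = {
    "bank": ["banco", "margem"],
    "depository": ["banco", "depósito"],
    "shore": ["margem", "costa"],
}


def score_candidates(synset_id):
    """Score each Portuguese candidate by the fraction of the synset's
    English members that translate to it (one simple graph statistic)."""
    members = SYNSETS[synset_id]
    support = defaultdict(int)
    for word in members:
        for candidate in set(TRANSLATIONS.get(word, [])):
            support[candidate] += 1
    return {cand: count / len(members) for cand, count in support.items()}


def select_translations(synset_id, threshold=1.0):
    """Keep candidates whose score reaches the threshold.
    A high threshold favours precision; a lower one favours recall."""
    scores = score_candidates(synset_id)
    return sorted(c for c, s in scores.items() if s >= threshold)
```

Lowering `threshold` mimics the high-recall tier: more candidates survive, at the cost of admitting translations that belong to a different sense.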

Page 33: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: some problems…

Capitalized items, plurals, duplicates, a few gender issues, missing items…

Page 34: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: what’s missing? how to correct it?

- Correcting items one by one as applications require doesn’t work. We tried it.

- Need to deal with broad phenomena

- Need context to understand synsets, both in English and in Portuguese

- So far we concentrate on the phenomenon of NOMINALIZATION (DEVERBALS): Livy Real’s thesis, Valeria’s previous work with Olya Gurevich

- Hence two resources: OpenWordNet-PT and NomLex-PT

Page 35: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: RDF Representation

- Why? For interoperability between wordnets. To be able to rely on Linked Data and Semantic Web standards such as RDF and OWL.

- Both the data model and the actual data are in the same format. There is a range of existing data processing tools, including databases (“triple stores”) with SQL-like query interfaces (SPARQL).

- Standard W3C encoding of WordNet in RDF since 2006. OpenWN-PT is modelled after and fully interoperable with Princeton WordNet.

- This means that one can easily find Portuguese equivalents for specific English word senses and vice versa. It also means OpenWN-PT is part of a large ecosystem of compatible resources, including domain identifiers and mappings to Wikipedia.
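
To illustrate why a triple representation makes cross-lingual lookup easy, here is a minimal sketch in plain Python. The identifiers and predicate names (`containsWord`, `sameAs`) are hypothetical stand-ins, not the actual OpenWN-PT or W3C vocabulary, and a real deployment would send a SPARQL query to a triple store instead.

```python
# Triples are (subject, predicate, object). All identifiers below are
# hypothetical stand-ins, not the actual OpenWN-PT or W3C vocabulary.
TRIPLES = [
    ("synset-en:dog-n-1", "containsWord", "dog"),
    ("synset-en:dog-n-1", "containsWord", "domestic dog"),
    ("synset-pt:dog-n-1", "containsWord", "cachorro"),
    ("synset-pt:dog-n-1", "containsWord", "cão"),
    ("synset-pt:dog-n-1", "sameAs", "synset-en:dog-n-1"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching a pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


def portuguese_equivalents(english_word):
    """English word -> English synsets -> aligned Portuguese synsets -> words."""
    words = set()
    for en_synset, _, _ in match((None, "containsWord", english_word)):
        for pt_synset, _, _ in match((None, "sameAs", en_synset)):
            words.update(o for _, _, o in match((pt_synset, "containsWord", None)))
    return sorted(words)
```

The lookup is a chain of three triple patterns joined on shared identifiers, which is the kind of join a SPARQL query expresses declaratively.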

Page 36: Logics and Ontologies for Portuguese Understanding

+Use Cases: FreeLing

- Word Sense Disambiguation via FreeLing 3.0, an Open Source Suite of Language Analyzers

- OpenWN-PT has been incorporated into FreeLing (Padró and Stanilovsky, 2012)

- Using FreeLing’s word sense disambiguation framework, a given Portuguese text can automatically be annotated with word senses.

- UPC, Barcelona

Page 37: Logics and Ontologies for Portuguese Understanding

+Use Cases: Sentiment Analysis

- Sentiment analysis, using tweets about soccer games

- OpenWN-PT and SentiWordNet used for comparison with the machine-learning-based sentiment analysis integrated into the IBM InfoSphere Streams (ISS) platform.

- 1 million tweets, 4 friendly matches of the Brazilian team in 2013, 7 classes of positivity

- IBM Research, BR

Page 38: Logics and Ontologies for Portuguese Understanding

+Use Cases: NomLex-PT (Livy Real)

- An extension of OpenWN-PT that aims to incorporate links connecting deverbal nouns with their corresponding verbs.

- For English, NOMLEX (Macleod et al., 1998) has provided extensive descriptions of nominalizations via extensions of an initial core.

- NOMLEX-BR: translation of the initial core, plus French Nomage

- Overall, we initially created over 2,000 entries. These have been integrated into OpenWN-PT.

- Hope to facilitate the use of nominalizations for linguistic research as well as information extraction

Example: “The destruction of the city by Alexander in 330BC…”
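
A minimal sketch of how a noun-to-verb link supports information extraction on the example above. The two-entry table and the single surface pattern are illustrative assumptions, far simpler than the NOMLEX descriptions of nominalizations.

```python
import re

# Tiny noun -> verb table in the spirit of NOMLEX; entries are illustrative.
NOMINALIZATIONS = {"destruction": "destroy", "abasement": "abase"}

# One surface pattern only: "the NOUN of OBJECT by AGENT" (an assumption for
# this sketch; real nominalizations take many more syntactic shapes).
PATTERN = re.compile(
    r"\bthe (?P<noun>\w+) of (?P<object>.+?) by (?P<agent>\w+)", re.IGNORECASE
)


def extract_event(sentence):
    """Map 'the NOUN of X by Y' to (verb, agent, object) when NOUN is a
    known deverbal noun; return None otherwise."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    verb = NOMINALIZATIONS.get(m.group("noun").lower())
    if verb is None:
        return None
    return (verb, m.group("agent"), m.group("object"))
```

With the noun linked to its verb, the nominal phrase yields the same event tuple one would extract from "Alexander destroyed the city".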

Page 39: Logics and Ontologies for Portuguese Understanding

+Use Cases: NomLex-PT (Livy Real)

- Incorporating NOMLEX-PT data into OpenWN-PT has proved useful in pinpointing some issues with the coherence and richness of OpenWN-PT.

- Example: the word "abasement" corresponds in NOMLEX to the verb "abase", so we would like a similar correspondence between the Portuguese noun "aviltamento" and the verb "aviltar" (our suggested translations). OpenWN-PT simply has two synsets, {humilhar, abaixar} and {humilhar, rebaixar}. The more common verb "humilhar" is repeated, while the uncommon "aviltar" was left out.

- Other useful kinds of relationships between parts of speech (say, the connections between adjectives and adverbs) are likely to also help improve the accuracy and richness of the resource
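
The kind of coherence check described above can be sketched as follows. The synset contents and the noun-verb pair come from the slide's example; representing them as Python sets and tuples is an assumption made for illustration.

```python
from collections import Counter

# Verb synsets and the suggested noun-verb pair from the example above,
# held in simple in-memory structures (an assumption for this sketch).
VERB_SYNSETS = [
    {"humilhar", "abaixar"},
    {"humilhar", "rebaixar"},
]
NOUN_VERB_PAIRS = [("aviltamento", "aviltar")]


def missing_verbs(pairs, synsets):
    """Return noun-verb pairs whose verb occurs in no synset."""
    covered = set().union(*synsets)
    return [(noun, verb) for noun, verb in pairs if verb not in covered]


def repeated_verbs(synsets):
    """Return verbs that occur in more than one synset, sorted."""
    counts = Counter(verb for synset in synsets for verb in synset)
    return sorted(verb for verb, n in counts.items() if n > 1)
```

A repeated verb is not necessarily an error, since polysemous verbs legitimately appear in several synsets; the output is only a list of items worth reviewing by hand.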

Page 40: Logics and Ontologies for Portuguese Understanding

+Use Cases: NomLex-PT (Claudia Freitas)

- NOMLEX-PT was created from data in lexica (NomLex, Nomage, Wiktionary, etc.). Maybe we should use corpora to check this data?

- Add all the nominalizations with the chosen suffixes that were in the corpus, but not in NomLex-PT.

- Classify these as AGENTIVE or NOT, or LEXICALIZED or NOT, or BOTH (lexicalized and not).

- Examples: procuração (LEX), declaração (BOTH)
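
A minimal sketch of the corpus check described above: collect word forms that end in a nominalizing suffix but are not yet in the lexicon. The suffix list and the one-entry lexicon are assumptions for illustration; the slide does not say which suffixes were actually chosen, and the classification step (agentive, lexicalized) remains manual.

```python
import re
import unicodedata

# Illustrative assumptions: the slide does not list the suffixes actually used.
SUFFIXES = ("ção", "mento")
# Stand-in for the nominalizations already present in NomLex-PT.
KNOWN_NOMINALIZATIONS = {"declaração"}


def candidate_nominalizations(text, known=KNOWN_NOMINALIZATIONS, suffixes=SUFFIXES):
    """Return sorted word forms that end in one of the suffixes and are not
    yet in the lexicon. No lemmatization is done, so plurals are missed."""
    # Normalize so accented letters are single code points that \w matches.
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"\w+", normalized)
    return sorted({t for t in tokens if t.endswith(suffixes) and t not in known})
```

Suffix matching over-generates (not every word in "-mento" is deverbal), so the result is a candidate list for human review rather than a set of new entries.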

Page 41: Logics and Ontologies for Portuguese Understanding

+Main Application: DHBB

- Dicionário Histórico-Biográfico Brasileiro (DHBB): building a domain-specific ontology using NLP

- Paper (next month) on Digital Humanities (Dario and Suemi, CPDOC, FGV), preliminary results

- Need serious NER (Named Entity Recognition)

- Various systems: FreeLing, Stanford, NLTK, OpenNLP…

+More Experiments: Verb Lexicon

- Restricting to the verbs, what can we say about OpenWN-PT?

- TorporES paper

- VerbOcean comparison

- What about language reasoning? Factives and Implicatives, CICLing paper 2014

- What about the real world? SUMO in Portuguese is in the plans

Page 43: Logics and Ontologies for Portuguese Understanding

+Conclusions

- Discussed the implementation and some applications of OpenWordNet-PT, an open WordNet for Brazilian Portuguese.

- Cleaning up, better coverage, and nominalization links connecting nouns and verbs (9 papers in 2014).

- The resource has been used in developing a high-throughput commercial system as well as in a cultural heritage project, and we anticipate that numerous further applications will follow.

- The data is freely available from https://github.com/arademaker/openWordnet-PT and a SPARQL endpoint at logics.emap.fgv.br:10035.

- Browsing via Open Multilingual Wordnet is fun http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi

Page 44: Logics and Ontologies for Portuguese Understanding

+OpenWN-PT: next steps?

- First finish translating the “core” synsets in the Princeton WordNet to Portuguese?

- Glosses? From Onto.PT?

- Increase the number of relations in OpenWN-PT as a way of improving adequacy and coherence?

- Adding the Portuguese terms that satisfy different relations?

- OpenVerbNet-PT?

- Since we have a first target corpus, the Brazilian Historical Biographic Dictionary, we can also calculate word frequency to prioritize expansion of OpenWN-PT, and do MORE ontology building...

- OpenBridge-PT? An umbrella term for all possible improvements?
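
The frequency-based prioritization mentioned above could look like the sketch below. It assumes the corpus is available as plain text and the wordnet as a set of lower-cased word forms; both are stand-ins, and there is no lemmatization, so inflected forms are counted separately.

```python
import re
from collections import Counter


def expansion_priorities(corpus_text, wordnet_words, top_n=10):
    """Rank corpus word forms that are missing from the wordnet by frequency.

    corpus_text: plain text of the target corpus (assumed input format).
    wordnet_words: set of lower-cased word forms already covered.
    """
    tokens = re.findall(r"\w+", corpus_text.lower())
    counts = Counter(t for t in tokens if t not in wordnet_words)
    return counts.most_common(top_n)
```

The most frequent uncovered forms are the ones whose addition would improve coverage of the target corpus the most, which gives a simple ordering for manual work.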

Page 45: Logics and Ontologies for Portuguese Understanding

+OntoLog 2014

Page 46: Logics and Ontologies for Portuguese Understanding

+Thanks!

Page 47: Logics and Ontologies for Portuguese Understanding

+References

- Revisiting a Brazilian Wordnet. Valeria de Paiva, Alexandre Rademaker (2012). Proceedings of Global Wordnet Conference, Global Wordnet Association, Matsue.

- OpenWordNet-PT: An Open Brazilian WordNet For Reasoning. Valeria de Paiva, Alexandre Rademaker, and Gerard de Melo. In Proceedings of the 24th International Conference On Computational Linguistics. http://hdl.handle.net/10438/10274.

- OpenWordNet-PT: A Project Report. Alexandre Rademaker, Valeria de Paiva, Gerard de Melo, Livy Real and Maira Gatti. Proceedings of the 7th Global Wordnet Conference, Tartu, Estonia. Global Wordnet Association, 2014.

- Embedding NomLex-BR Nominalizations Into OpenWordnet-PT. Livy Maria Real Coelho, Alexandre Rademaker, Valeria de Paiva, and Gerard de Melo. 2014. In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Page 48: Logics and Ontologies for Portuguese Understanding

+References

- Towards a Universal Wordnet by Learning from Combined Evidence. Gerard de Melo, Gerhard Weikum (2009). 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China.

- Bridges from Language to Logic: Concepts, Contexts and Ontologies. Valeria de Paiva (2010). Logical and Semantic Frameworks with Applications, LSFA'10, Natal, Brazil, 2010.

- "A Basic Logic for Textual Inference", AAAI Workshop on Inference for Textual Question Answering, 2005.

- "Textual Inference Logic: Take Two", CONTEXT 2007.

- "Precision-focused Textual Inference", Workshop on Textual Entailment and Paraphrasing, 2007.

- PARC's Bridge and Question Answering System. Proceedings of Grammar Engineering Across Frameworks, 2007.

Page 49: Logics and Ontologies for Portuguese Understanding

+Progress Report

- Checking is much easier than starting from scratch.

- But it is long and tedious work to check even the initial 5k synsets suggested by GWA (not done yet!), let alone all synsets in OpenWN-PT

- Necessary? YES! Lexical gaps of all sorts

- But the resource is being used, warts and all…

- Improving the resource: new data from Bond/Foster and some manual additions

Page 50: Logics and Ontologies for Portuguese Understanding

+Other stuff to add in?…

- Onto.PT, ES wordnet?

- Editing interfaces?

- BabelNet?

- NER issues?

- Temporal issues?

- Work with grammar?

- Work on implicatives/factives in Portuguese?

- FOIS workshop