Learning ontologies from the web for microtext processing
Boris A. Galitsky, Gábor Dobrocsi, and Josep Lluis de la Rosa
University of Girona, Spain


What can be a scalable way to automatically build taxonomies of entities to improve search relevance?

Taxonomy construction starts from seed entities and mines the web for new entities associated with them.

To form these new entities, machine learning of syntactic parse trees (syntactic generalization) is applied.

It forms commonalities between various search results for existing entities on the web.

Taxonomy and syntactic generalization are applied to relevance improvement in search and to text similarity assessment in a commercial setting; evaluation results show a substantial contribution from both sources.

Automated customer service rep.

Q: Can you reactivate my card which I am trying to use in Nepal?

A: We value you as a customer… We will cancel your card… New card will be mailed to your California address …

A child with severe form of autism

Q: Can you give your candy to my daughter who is hungry now and is about to cry?

A: No, my mom told me not to feed babies. Its wrapper is nice and blue. I need to wash my hands before I eat it … …

Entities need to make sense together

Why ontologies are needed for microtext

Human and automated agents have difficulty processing texts if the required ontologies are missing

Knowing how entities are connected would improve search results

The condition “active paddling” is ignored or misinterpreted, although Google knows that it is a valid combination (‘paddling’ can be ‘active’)

• In the above example, “white water rafting in Oregon with active paddling with kids”, ‘active’ is meaningless without ‘paddling’.

• So if the system can’t find answers with ‘active paddling’, it should try finding them with ‘paddling’, but it should never try ‘active’ without ‘paddling’.
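This relaxation rule can be sketched as a small routine. All names here are hypothetical; `search` is assumed to be a function returning a list of hits for a list of query terms:

```python
def relax_query(terms, depends_on, search):
    """Relax a query by dropping modifiers, never their heads.

    terms: query terms, e.g. ["rafting", "active", "paddling"]
    depends_on: modifiers mapped to the head they depend on,
                e.g. {"active": "paddling"}
    search: hypothetical function(list_of_terms) -> list of hits
    """
    hits = search(terms)
    if hits:
        return hits
    # First relaxation: drop a modifier, keep its head
    # ("active paddling" -> "paddling")
    for modifier in depends_on:
        if modifier in terms:
            relaxed = [t for t in terms if t != modifier]
            hits = search(relaxed)
            if hits:
                return hits
    # Last resort: drop all modifiers; never keep "active" without "paddling"
    core = [t for t in terms if t not in depends_on]
    return search(core)
```

The key design choice is that a modifier is only ever removed on its own, so the meaningless combination (modifier without head) is never issued as a query.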

Difficulty in building taxonomies

Building, tuning and managing taxonomies and ontologies is rather costly, since many manual operations are required.

A number of studies have proposed automated building of taxonomies based on linguistic resources and/or statistical machine learning (Kerschberg et al 2003, Liu & Birnbaum 2008, Kozareva et al 2009).

However, most of these approaches have not found practical applications due to:

– insufficient accuracy of resultant search,

– limited expressiveness of representations of queries of real users,

– high cost associated with manual construction of linguistic resources and their limited adjustability.

• It is based on an initial set of key entities (a seed) for a given vertical knowledge domain.

• This seed is then automatically extended by mining web documents which include a meaning of the current taxonomy node.

• This node is further extended by entities which are the results of inductive learning of commonalities between these documents.

• These commonalities are extracted using an operation of syntactic generalization, which finds the common parts of syntactic parse trees of a set of documents, obtained for the current taxonomy node.

We propose an automated taxonomy-building mechanism

Therefore, an automated or semi-automated approach is required for practical applications
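The seed-extension step described above can be sketched as follows. All names are hypothetical; web search and generalization are passed in as stand-in functions:

```python
def extend_taxonomy(path, get_search_results, find_commonalities):
    """One extension step: mine documents for the current taxonomy path
    and add common attributes as new child paths.

    path: current taxonomy path, e.g. ["tax", "deduct"]
    get_search_results: hypothetical function(path) -> list of snippets
    find_commonalities: hypothetical function(snippets) -> set of shared
        attributes (a crude stand-in for syntactic generalization)
    """
    snippets = get_search_results(path)
    common = find_commonalities(snippets)
    new_attrs = sorted(common - set(path))  # skip what is already on the path
    return [path + [attr] for attr in new_attrs]
```

As a usage sketch, a word-overlap intersection over snippets can stand in for generalization: if two snippets for "tax deduct" both contain "overlook", the path grows to tax – deduct – overlook.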

Providing multiple answers as a result of default reasoning

Facts Si, comprising the query representation (occurrences of words in a query)

Default rules, establishing the meanings of words based on the other words and the meanings that have been established

Successful & closed process: extension @S1, @S2 ,… answer 1

Successful & closed process: extension @S3, @S1 ,… answer 2

Either unsuccessful or non-closed process: No extension

Using default logic to handle ambiguity in microtext

Building extensions of default theory for each meaning

A simplified step 1 of ontology learning

Currently available: tax – deduct
1) Get search results for currently available expressions
2) Select attributes based on their linguistic occurrence (shown in yellow)
3) Find common attributes (commonalities between search results, shown in red, like ‘overlook’)
4) Extend the taxonomy path by adding the newly acquired attribute: tax – deduct – overlook

Step 2 of ontology learning (more details)

Currently available taxonomy path: tax – deduct – overlook
1) Get search results
2) Select attributes based on their linguistic occurrence (modifiers of entities from the current taxonomy path)
3) Find common expressions between search results as syntactic generalization, like ‘PRP-mortgage’
4) Extend the taxonomy path by adding the newly acquired attribute: tax – deduct – overlook – mortgage, tax – deduct – overlook – no_itemize…

Step 3 of ontology learning

Currently available taxonomy path: tax – deduct – overlook – mortgage
1) Get search results
2) Perform syntactic generalization, finding common maximal parse sub-trees excluding the current taxonomy path
3) If nothing is in common any more, this is a taxonomy leaf (stop growing the current path).
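The stopping criterion in step 3 turns path growth into a simple recursion. A minimal sketch, assuming a hypothetical `extend_step` function that performs one mining-and-generalization step and returns the extended child paths:

```python
def grow(path, extend_step, max_depth=4):
    """Grow the taxonomy depth-first; a path becomes a leaf once the
    extension step finds nothing in common any more.

    extend_step: hypothetical function(path) -> list of child paths
    """
    if max_depth == 0:
        return [path]
    children = extend_step(path)
    if not children:
        return [path]  # step 3: nothing in common -> taxonomy leaf
    leaves = []
    for child in children:
        leaves.extend(grow(child, extend_step, max_depth - 1))
    return leaves
```

The depth bound is only a safety guard for the sketch; in practice growth stops when generalization yields an empty result.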

Possible learning results (taxonomy fragment)

If a keyword is in a query, and in the closest taxonomy path, it HAS TO BE in the answer.

Query: can I deduct tax on mortgage escrow account
Closest taxonomy path: tax – deduct – overlook – mortgage – escrow_account
Then keywords/multiwords that have to be in the answer: {deduct, tax, mortgage, escrow_account}
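The answer-acceptability rule above can be sketched as follows (function and variable names are hypothetical):

```python
def must_have_keywords(query_terms, taxonomy_paths):
    """Terms that HAVE TO BE in the answer: query terms that also lie on
    the taxonomy path closest to the query (largest overlap)."""
    query = set(query_terms)
    closest = max(taxonomy_paths, key=lambda p: len(set(p) & query))
    return set(closest) & query

def acceptable(answer_terms, required):
    """An answer is acceptable only if it covers all required keywords."""
    return required <= set(answer_terms)
```

Answers missing any required keyword (e.g. lacking ‘escrow_account’ for the query above) are rejected as wrong answers.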

Wrong answers

Improving the precision of text similarity: articles, blogs, tweets, images and videos

We verify if an image belongs here based on its caption, using syntactic generalization to assess relevance

Generalizing two sentences and its application

Improvement of search relevance by checking syntactic similarity between query and sentences in search hits. Syntactic similarity is measured via generalization.

Such syntactic similarity is important when a search query contains keywords which form a phrase, domain-specific expression, or an idiom, such as “shot to shot time”, “high number of shots in a short amount of time”.

Based on syntactic similarity, search results can be re-sorted based on the obtained similarity score
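Re-sorting of search hits can be sketched as below. The similarity function here is a crude word-overlap stand-in for the syntactic-generalization score, used only to make the sketch runnable:

```python
def overlap_similarity(a, b):
    # Crude stand-in for the score of the syntactic generalization of
    # two texts: size of their shared vocabulary.
    return len(set(a.lower().split()) & set(b.lower().split()))

def rerank(query, hits, similarity=overlap_similarity):
    """Re-sort search hits by their similarity score against the query."""
    return sorted(hits, key=lambda hit: similarity(query, hit), reverse=True)
```

In the intended setting, `similarity` would instead score the generalization of the parse trees of the query and each candidate sentence.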

Based on generalization, we can distinguish meaningful (informative) and meaningless (uninformative) opinions, having collected respective datasets

Meaningful sentence to be shown as search result

Not very meaningful sentence to be shown,

even if matches the search query

Generalizing sentences & phrases

noun phrase [ [JJ-* NN-zoom NN-* ], [JJ-digital NN-camera ]]

About ZOOM and DIGITAL CAMERA

verb phrase [ [VBP-* ADJP-* NN-zoom NN-camera ], [VB-* NN-zoom IN-* NN-camera ]]

To do something with ZOOM –…- CAMERA

prepositional phrase [ [IN-* NN-camera ], [IN-for NN-* ]]

With/for/to/in CAMERA, FOR something

Obtain parse trees. Group by sub-trees for each phrase type

Extend list of phrases by paraphrasing (semantically equivalent expressions)

For every phrase type

For each pair of tree lists, perform pair-wise generalization

For a pair of trees, perform alignment

For a pair of words (nodes), generalize them

Remove more general trees (if less general exist) from the resultant list

VP [VB-use DT-the JJ-digital NN-zoom IN-of DT-this NN-camera IN-for VBG-filming NNS-insects ] +

VP [VB-get JJ-short NN-focus NN-zoom NN-lens IN-for JJ-digital NN-camera ]

= [VB-* JJ-* NN-zoom NN-* IN-for NNS-* ] score = score(NN) + score(PREP) + 3*score(<POS*>)

Meaning:“Do-something with some-kind-of ZOOM something FOR

something-else”
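Node-level generalization and its scoring can be sketched as follows. The weights are assumptions for illustration (echoing the slide's formula, which adds weights for exact noun and preposition matches plus a smaller weight for `<POS*>` wildcards), and the alignment here is a crude positional zip rather than true tree alignment:

```python
# Hypothetical per-POS weights for scoring a generalization result.
WEIGHTS = {"NN": 1.0, "IN": 0.8}
WILDCARD_WEIGHT = 0.3  # weight of a <POS*> wildcard match

def generalize_word(w1, w2):
    """Generalize two (POS, lemma) nodes."""
    (pos1, lem1), (pos2, lem2) = w1, w2
    if pos1 != pos2:
        return None              # nothing in common
    if lem1 == lem2:
        return (pos1, lem1)      # exact lexical match
    return (pos1, "*")           # same POS, different lemma -> wildcard

def generalize_phrase(p1, p2):
    """Crude positional alignment of two POS-tagged phrases, with score."""
    result, score = [], 0.0
    for w1, w2 in zip(p1, p2):
        g = generalize_word(w1, w2)
        if g is None:
            continue
        result.append(g)
        score += WILDCARD_WEIGHT if g[1] == "*" else WEIGHTS.get(g[0], 0.5)
    return result, score
```

Applied to two verb phrases sharing NN-zoom and IN-for, this yields the pattern [VB-*, NN-zoom, IN-for], analogous to the generalization shown above.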

Generalizing phrases: deriving a meaning by generalization

Generalization: from words to phrases to sentences to paragraphs

Syntactic generalization helps with microtext when ontology use is limited

Learning similarity between syntactic trees

1. Obtain a parse tree for each sentence. For each word (tree node) we have its lemma, part of speech and word form, as well as the arcs to other nodes.

2. Split sentences into sub-trees which are phrases of each type: verb, noun, prepositional and others; these sub-trees overlap. The sub-trees are coded so that information about their occurrence in the full tree is retained.

3. All sub-trees are grouped by phrase types.

4. Extend the list of phrases by adding equivalence transformations; generalize each pair of sub-trees for both sentences for each phrase type.

5. For each pair of sub-trees yield the alignment, and then generalize each node for this alignment. For the obtained set of trees (generalization results), calculate the score.

6. For each pair of sub-trees for phrases, select the set of generalizations with highest score (least general).

7. Form the sets of generalizations for each phrase type, whose elements are sets of generalizations for this type.

8. Filter the list of generalization results: for the list of generalizations for each phrase type, exclude more general elements from the lists of generalizations for the given pair of phrases.
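Step 8 can be sketched over phrase generalizations represented as lists of (POS, lemma) pairs, where "*" marks a wildcard lemma (representation assumed for illustration):

```python
def more_general(a, b):
    """True if phrase a is equal to or more general than phrase b: same
    POS sequence, and each lemma in a is a wildcard or equals b's lemma."""
    return (len(a) == len(b) and
            all(pa == pb and la in ("*", lb)
                for (pa, la), (pb, lb) in zip(a, b)))

def filter_least_general(results):
    """Drop any generalization that is strictly more general than another
    result; a less general result carries more information."""
    return [r for r in results
            if not any(r != other and more_general(r, other)
                       for other in results)]
```

For example, [NN-*] is dropped whenever [NN-zoom] is also among the results.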

Generalization of semantic role expressions

Generalization algorithm


Evaluation: media / method of text similarity assessment

Method / media                      Full-size     Abstracts    Blog       Comments   Images   Videos
                                    news articles of articles  postings
Frequencies of terms in documents   29.3%         26.1%        31.4%      32.0%      24.1%    25.2%
Syntactic generalization            17.8%         18.4%        20.8%      27.1%      20.1%    19.0%
Taxonomy-based                      45.0%         41.7%        44.9%      52.3%      44.8%    43.1%
Hybrid (taxonomy + syntactic)       13.2%         13.6%        15.5%      22.1%      18.2%    18.0%

Hybrid approach improves text similarity/relevance assessment

Ordering of search results based on generalization, taxonomy, and conventional search engine

Classification of short texts

Conclusions

Ontologies are a more sensitive way to match keywords in microtext (compared to bag-of-words and TF*IDF)

Since microtext includes abbreviations and acronyms, and we don’t ‘know’ all mappings, semantic analysis should be tolerant to the omission of some entities and still understand “what this text fragment is about”.

Since we are unable to filter out noise “statistically” like most NLP environments do, we have to rely on ontologies.

Syntactic generalization takes the bag-of-words and pattern-matching classes of approaches to the next level, allowing unknown words to be treated systematically as long as their part-of-speech information is available from context.

Related work

• Mapping to First Order Logic representations with a general prover and without using acquired rich knowledge sources

•Semantic entailment [de Salvo Braz et al 2005]

• Semantic Role Labeling: for each verb in a sentence, the goal is to identify all constituents that fill a semantic role and to determine their roles, such as Agent, Patient or Instrument [Punyakanok et al 2005].

•Generic semantic inference framework that operates directly on syntactic trees. New trees are inferred by applying entailment rules, which provide a unified representation for varying types of inferences [Bar-Haim et al 2005]

•Generic paraphrase-based approach for a specific case such as relation extraction to obtain a generic configuration for relations between objects from text [Romano et al 2006]