Hybrid semantic document enrichment using machine learning and linguistics
Stefan Geißler, SEMANTICS, Leipzig, Sept 14 2016
Expert System
• What is this?
A graph showing the distribution of large cities in the world
Y axis: size of the city (population)
X axis: the city's rank
• What is this?
A graph showing the richest people of the world
Y axis: wealth of the person
X axis: the person's rank
• What is this?
A graph showing the most frequent words from a large text corpus
Y axis: frequency of the word
X axis: the word's rank
Empirical evidence: Many types of data from physics, the social sciences, etc. follow such a distribution.
"Zipf's law":
The number of data points (cities, rich people, words) with a value higher than S (on the y axis) is proportional to 1/S.
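The relation stated above can be checked on idealized data. The following is a minimal sketch, not real city or word data: it assumes that the item at rank r has exactly the value `top_value / r`, and the function names and numbers are hypothetical choices for illustration.

```python
def zipf_values(n_items: int, top_value: float) -> list[float]:
    """Idealized Zipf data (assumption): the item at rank r has value top_value / r."""
    return [top_value / rank for rank in range(1, n_items + 1)]


def count_above(values: list[float], threshold: float) -> int:
    """Number of data points whose value is strictly higher than the threshold S."""
    return sum(1 for value in values if value > threshold)


values = zipf_values(n_items=100_000, top_value=1_000_000.0)
counts = {s: count_above(values, s) for s in (100, 1_000, 10_000)}
for s, n in counts.items():
    # If the count is proportional to 1/S, the product S * count stays roughly constant.
    print(f"S = {s:>6}: {n:>5} items above S, S * count = {s * n}")
```

On this idealized data, each tenfold increase of S reduces the number of items above S by roughly a factor of ten.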
• Distribution of categories in many categorized/tagged corpora
Y axis: frequency of the category
X axis: the category's rank
Problem #1:
How does that fit the requirement at the start of many categorization projects that a category will need a decent amount of data (>100 documents) to be trained?
Larger categories can be trained (learned automatically); smaller ones often can't.
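To get a feeling for the size of this problem, one can count how many categories clear the training threshold when category sizes follow a Zipf-like distribution. This is only a sketch under assumed numbers: the 1,000 categories and the 5,000 documents in the largest category are hypothetical, and real corpora will deviate from the ideal curve.

```python
def trainable_categories(n_categories: int, largest_size: int, min_docs: int) -> int:
    """Count categories with more than min_docs documents.

    Assumption: the category at rank r holds largest_size // r documents.
    """
    sizes = [largest_size // rank for rank in range(1, n_categories + 1)]
    return sum(1 for size in sizes if size > min_docs)


n_categories = 1_000   # hypothetical size of the category scheme
trainable = trainable_categories(n_categories, largest_size=5_000, min_docs=100)
print(f"{trainable} of {n_categories} categories have more than 100 documents")
```

Under these assumptions, only a small share of the scheme has enough documents to be learned automatically; the long tail of small categories does not.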
Problem #2:
Even for the frequent enough categories: Is a training corpus really representative?
Is "Greece" always about "debt crisis"?
Is "Ansbach" always about "terror"?
The learning method may learn unwanted associations.
• Solution?
More data? No, because:
- The graph here is scale-free
- More data is often not available or very costly
Y axis: frequency of the category
X axis: the category's rank
Solution: Let the human expert refine the automatically created model
Human document categorization:
If ("Etna" or "Vesuvius" or "Pinatubo") AND ("lava" or "eruption")
Then "Volcanism"
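A hand-written rule of this kind translates directly into a few lines of code. The sketch below is an illustration only and not the rule language presented in the talk: the function names and the simple word tokenizer are assumptions, and the volcano names use their common English spelling.

```python
import re

# Keyword groups taken from the rule above.
VOLCANOES = {"etna", "vesuvius", "pinatubo"}
ERUPTION_TERMS = {"lava", "eruption"}


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens of a document (a deliberately naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def categorize(text: str) -> list[str]:
    """Assign "Volcanism" if a volcano name AND an eruption term both occur."""
    words = tokens(text)
    if words & VOLCANOES and words & ERUPTION_TERMS:
        return ["Volcanism"]
    return []


print(categorize("Lava flows down the slopes of Etna"))
print(categorize("Etna is a popular hiking destination"))
```

The rule needs no training documents at all, which is what makes it attractive for the small categories in the long tail.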
Machine document categorization:
Refinement of the model by a human expert is seldom a subject in scientific work on document categorization.
Different classification methods are most often compared only on the basis of their (automatic) performance on an evaluation corpus
… but such refinement is often a requirement in real-world document categorization projects.
• Training corpora alone are often not enough to attain the expected levels of quality.
• Additional data is hard to find (manual preparation or curation is very costly)
• Existing corpora may not always be representative.
Our suggestion
• Use available training data to train a model
• Make the model available in a human readable formal language
• Allow the user to inspect and refine the model where needed in a dedicated development and testing environment
• A rich formal language (strings, lemmas, regexps, semantic concepts, operators …) makes it possible to express learnt associations from bag-of-words models
• … as well as detailed syntactic/semantic constraints
• … and to visualize and evaluate the result in the same application
• For the reasons explained above, the statistical learning approach may erroneously learn a rule that the words "Athens" or "Greece" alone justify assigning the document to "Banking Crisis"
• The user can refine the learnt rule, adding the further constraint that features like "Debt", "Schäuble" or "Troika" are required before the category is assigned.
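The difference between the learnt rule and its refined version can be sketched as two predicates. This is a hypothetical illustration in Python, not the formal rule language of the product: the function names, the naive tokenizer and the exact keyword sets are assumptions based on the example above.

```python
import re

# Terms the learner picked up from the training corpus (per the example above).
LOCATION_TERMS = {"athens", "greece"}
# Additional features the human expert requires before assigning the category.
CRISIS_TERMS = {"debt", "schäuble", "troika"}


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens of a document (a deliberately naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def learnt_rule(text: str) -> bool:
    """Over-general learnt rule: a location term alone triggers "Banking Crisis"."""
    return bool(tokens(text) & LOCATION_TERMS)


def refined_rule(text: str) -> bool:
    """Refined rule: a location term AND at least one crisis term are required."""
    words = tokens(text)
    return bool(words & LOCATION_TERMS) and bool(words & CRISIS_TERMS)


travel = "Greece is a popular holiday destination"
news = "The Troika met in Athens to discuss the debt"
for doc in (travel, news):
    print(f"learnt={learnt_rule(doc)} refined={refined_rule(doc)} :: {doc}")
```

The travel text is wrongly accepted by the learnt rule and rejected by the refined one, while the news text passes both.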
… Sample projects
• <US Media company>
  • Large category schema for news articles
  • Task: set up a solution that allows combining automatically created rule sets with manual refinement
• <Insurance company>
  • Categorize medical reports using the ICD category scheme
  • Go beyond the quality that can be attained by using only the manually coded training set
Conclusion
• Requirements in categorization projects in the industry are sometimes not identical to the scenarios in academic categorization benchmarks
• Available training data is sometimes limited, even in the age of big data
• Allow the seamless (one language, one development environment) application of both learnt and manually crafted rules
Expert System
Who we are
Expert System: Largest European provider of pure semantic technologies
• 7 Geographies
• 250+ team members
• Listed on the AIM exchange
• Recommended by Gartner, Forrester, IDC ...
• Experiences from hundreds of projects
• Award-winning technology: Taxonomy / Ontology Management, NLP, Information Extraction, Question Answering, Cognitive Computing
Global Positioning – Selected Clients
ENERGY, OIL & GAS
GOVERNMENT
FEDERAL AGENCIES
MEDIA & PUBLISHING
Life Sciences
FINANCE