Languages at Inxight Ian Hersey Co-Founder and SVP, Corporate Development and Strategy.
Languages at Inxight
Ian Hersey, Co-Founder and SVP, Corporate Development and Strategy
Inxight Confidential
Inxight at a Glance
  20+ years of Xerox PARC research - 70 patents
    Content & linguistic analysis (27 languages today)
    Information visualization and discovery
  Silicon Valley HQ; offices in US, Europe
  250 major customers
  Seasoned management team
  Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel
Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery
What we mean by language support
  Not pure statistics
  “Language independence” is a fallacy when it comes to text
  Whitespace parsing + algorithmic stemming is a cheap hack
    Stem-internal changes
    Compounding
    Agglutination
    Vocalization or lack thereof
    Non-breaking languages
  Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning
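
To make the "cheap hack" point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the suffix list in `naive_stem` is not any published stemmer, and `TOY_LEXICON` is a hand-made stand-in for a real morphological lexicon. It shows suffix stripping missing stem-internal changes, and whitespace splitting returning a single token for text written without word breaks.

```python
# Hypothetical toy lexicon mapping irregular surface forms to lemmas.
TOY_LEXICON = {"mice": "mouse", "sang": "sing", "geese": "goose"}

def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only; assumes words are space-delimited."""
    return text.lower().split()

def naive_stem(token: str) -> str:
    """Strip a few English suffixes with no lexicon (illustrative list only)."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lexicon_stem(token: str) -> str:
    """Consult the lexicon first, then fall back to suffix stripping."""
    return TOY_LEXICON.get(token, naive_stem(token))

if __name__ == "__main__":
    for word in ("walked", "mice", "sang"):
        print(word, naive_stem(word), lexicon_stem(word))
    # Japanese text has no spaces, so whitespace splitting yields one token.
    print(whitespace_tokens("東京都に住む"))
```

Suffix stripping handles "walked" but leaves "mice" and "sang" untouched, because the change is inside the stem; only the lexicon lookup recovers the lemma.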
Text analysis fundamentals
  Base layer
    Language and character set identification
    Document analysis
    Tokenization
    Stemming/normalization
  Contextual analysis
    Part-of-speech tagging
    “Grouping”
  Find the interesting stuff
    Named entity extraction
    Syntactic analysis (clause boundary identification, subject/object identification, etc.)
  Relate the interesting stuff; analyze meaning
    Semantic analysis (fact extraction, etc.)
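
The layering above can be pictured as a chain of stages, each adding markup that the next stage consumes. The sketch below is an assumption-laden toy, not Inxight's architecture: the `Doc` class, the stage names, and the capitalization-based "tagger" are all invented for illustration, and a real system would use a lexicon and a trained tagger instead.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical container that accumulates markup layer by layer."""
    text: str
    tokens: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    entities: list = field(default_factory=list)

def tokenize(doc: Doc) -> Doc:
    # Base layer: split into word and punctuation tokens.
    doc.tokens = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc

def tag(doc: Doc) -> Doc:
    # Contextual layer (toy rule): capitalized tokens are treated as proper nouns.
    doc.tags = ["PROPN" if t[:1].isupper() else "OTHER" for t in doc.tokens]
    return doc

def group_entities(doc: Doc) -> Doc:
    # "Find the interesting stuff": group runs of adjacent proper nouns.
    run = []
    for token, label in zip(doc.tokens, doc.tags):
        if label == "PROPN":
            run.append(token)
        elif run:
            doc.entities.append(" ".join(run))
            run = []
    if run:
        doc.entities.append(" ".join(run))
    return doc

PIPELINE = [tokenize, tag, group_entities]

def analyze(text: str) -> Doc:
    doc = Doc(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

if __name__ == "__main__":
    print(analyze("the team met Ian Hersey at Xerox PARC today.").entities)
```

The point of the structure is that each later stage reads only the markup produced below it, so a better tokenizer or tagger improves everything built on top.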
Don’t ignore statistics
  Feed linguistic markup into probabilistic processing
    Categorization (choose your algorithm)
    Search/relevance ranking
    Summarization
    Co-occurrence analysis/entity resolution
    Link analysis
    Predictive analysis/data mining
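
As one example of feeding linguistic markup into statistics, co-occurrence analysis can count how often two extracted entities appear in the same document. This is a minimal sketch, assuming the entities have already been extracted upstream; the function name `cooccurrence` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_entities: list[list[str]]) -> Counter:
    """Count unordered entity pairs that appear together in a document."""
    counts = Counter()
    for entities in docs_entities:
        # Deduplicate per document and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

if __name__ == "__main__":
    docs = [
        ["Xerox", "Inxight"],
        ["Inxight", "Xerox", "PARC"],
        ["PARC"],
    ]
    for pair, n in cooccurrence(docs).most_common():
        print(pair, n)
```

Because the counts are over entities rather than raw n-grams, multiword names and normalized forms count as single units.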
Base layer (LinguistX Platform)
  Morphological analyzer
    Lexicon + rules
    Compiled as a finite-state machine
    Resource efficient, very fast
      French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
    Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
  Corpora to produce statistical models
    Language and character set detection
    Tagged corpus to produce Hidden Markov Model for POS tagger
  Groupers
    Finite-state “chunkers” – compiled regex
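
The "compiled regex" chunker idea can be sketched by encoding each part-of-speech tag as one character and running a regular expression over the resulting string. This is an illustrative assumption about how such a grouper might look, not the LinguistX implementation; the tag names, `TAG_CODES`, and the noun-phrase pattern are hypothetical.

```python
import re

# Hypothetical one-character codes for a tiny tagset; unknown tags become "O".
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V"}

# Noun phrase: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def chunk_noun_phrases(tagged: list[tuple[str, str]]) -> list[str]:
    """Return noun-phrase chunks from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    # Each code is one character, so match offsets index directly into `tagged`.
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]

if __name__ == "__main__":
    sentence = [
        ("the", "DET"), ("fast", "ADJ"), ("tagger", "NOUN"),
        ("reads", "VERB"), ("French", "ADJ"), ("text", "NOUN"),
    ]
    print(chunk_noun_phrases(sentence))
```

Since the pattern runs over tags rather than characters, the same grouper applies to any language that shares the tagset.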
Named entity extraction (ThingFinder)
  Builds on base platform
  Requires additional resources
    Enhanced lexicon (POS tagset insufficient for high quality extraction)
    Entity-specific groupers
    Tagged corpus for accuracy testing
  Sometimes you need more
    Genre-specific document analysis
    Specialized tokenization, tagging
    Knowledge base (“Name Catalog”)
    Custom groupers
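
A knowledge base of known names can be approximated by a longest-match lookup over the token stream. The sketch below is hypothetical: `NAME_CATALOG`, its entries, and its entity types are invented here and do not reflect the format of ThingFinder's Name Catalog.

```python
# Hypothetical catalog: lowercase token tuples mapped to an entity type.
NAME_CATALOG = {
    ("xerox", "parc"): "ORGANIZATION",
    ("xerox",): "ORGANIZATION",
    ("palo", "alto"): "PLACE",
}
MAX_NAME_LEN = max(len(key) for key in NAME_CATALOG)

def find_catalog_entities(tokens: list[str]) -> list[tuple[str, str]]:
    """Scan left to right, preferring the longest catalog match at each position."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NAME_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in NAME_CATALOG:
                found.append((" ".join(tokens[i:i + n]), NAME_CATALOG[key]))
                i += n
                break
        else:
            # No catalog entry starts here; move on to the next token.
            i += 1
    return found

if __name__ == "__main__":
    print(find_catalog_entities("Xerox PARC is in Palo Alto".split()))
```

Preferring the longest match keeps "Xerox PARC" from being reported as just "Xerox".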
Statistical models
  Summarization
    Base layer + feature model (feature weights, stop words, cue phrases)
  Categorization
    Labeled training data
  …and lots of interactive tools
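
A feature model for extractive summarization can be sketched as a weighted sum of per-sentence features. The stop words, cue phrases, weights, and the three features below are all assumptions chosen for illustration; the slides do not specify which features or weights Inxight's summarizer uses.

```python
import re
from collections import Counter

# Hypothetical resources; a real feature model would be tuned per language.
STOP_WORDS = {"the", "was", "in", "and", "a", "of", "is"}
CUE_PHRASES = ("in summary", "in conclusion")
WEIGHTS = {"term_freq": 1.0, "cue_phrase": 2.0, "position": 0.5}

def content_words(sentence: str) -> list[str]:
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text: str, n: int = 1) -> list[str]:
    """Return the n highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(index: int) -> float:
        words = content_words(sentences[index])
        term_freq = sum(freq[w] for w in words) / len(words) if words else 0.0
        cue = 1.0 if any(c in sentences[index].lower() for c in CUE_PHRASES) else 0.0
        position = 1.0 if index == 0 else 0.0  # favor the opening sentence
        return (WEIGHTS["term_freq"] * term_freq
                + WEIGHTS["cue_phrase"] * cue
                + WEIGHTS["position"] * position)

    best = sorted(range(len(sentences)), key=score, reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]

if __name__ == "__main__":
    text = ("Stemming matters. The weather was nice. "
            "In summary, stemming and tagging matter.")
    print(summarize(text, n=2))
```

In a fuller system the term-frequency feature would be computed over stems from the base layer rather than raw word forms, so that "matter" and "matters" count together.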
Fact extraction
  Builds on base of linguistic markup + named entities
  Modeled on specific templates
    Rules populate the templates
  Additional linguistic resources
    Intra-document
      Document analysis/genre identification
      Subject/object identification
      Anaphora resolution
    Inter-document
      Entity resolution
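
The "rules populate the templates" idea can be illustrated with a single template and a single rule. This is a toy sketch under stated assumptions: the `AcquisitionFact` template, the trigger verbs, and the company names are hypothetical, and the rule is a plain regular expression over already-extracted organization names rather than a rule over full linguistic markup.

```python
import re
from dataclasses import dataclass

@dataclass
class AcquisitionFact:
    """Hypothetical template with two slots to fill."""
    acquirer: str
    target: str

def extract_acquisitions(text: str, organizations: set[str]) -> list[AcquisitionFact]:
    """Fill the template wherever '<ORG> acquired|bought <ORG>' occurs."""
    if not organizations:
        return []
    # Longest names first so a longer name wins over one of its prefixes.
    names = sorted(organizations, key=len, reverse=True)
    org = "|".join(re.escape(name) for name in names)
    rule = re.compile(rf"(?P<acquirer>{org}) (?:acquired|bought) (?P<target>{org})")
    return [AcquisitionFact(m["acquirer"], m["target"]) for m in rule.finditer(text)]

if __name__ == "__main__":
    text = "Acme Corp acquired Widget Inc in 2001. Globex bought Acme Corp."
    orgs = {"Acme Corp", "Widget Inc", "Globex"}
    for fact in extract_acquisitions(text, orgs):
        print(fact)
```

A sentence such as "It bought Acme Corp" would not match, which is where the anaphora resolution listed above comes in.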
Developing a new language
  Resource acquisition
    Corpora
    Lexicon
    Team
      Computational linguist familiar with tools
      Native speaker
  Resource enhancement
    Label tagged truth sets
    Build out morphological classes
    Fill lexical gaps
  Build, test and refine
  Soup to nuts: $500K to $1M for V1.0
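
The "build, test and refine" loop depends on scoring system output against the labeled truth sets. A minimal sketch of that scoring step, assuming exact-match comparison of (text, type) pairs, follows; the function name and the sample data are hypothetical, and the slides do not say which metrics Inxight uses.

```python
def precision_recall(predicted, truth) -> tuple[float, float]:
    """Exact-match precision and recall of predicted items against a truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

if __name__ == "__main__":
    # Hypothetical extractor output and hand-labeled truth for one document.
    predicted = {("Xerox", "ORG"), ("Paris", "PERSON"), ("Ian Hersey", "PERSON")}
    truth = {("Xerox", "ORG"), ("Paris", "PLACE"),
             ("Ian Hersey", "PERSON"), ("PARC", "ORG")}
    p, r = precision_recall(predicted, truth)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Tracking both numbers across builds shows whether a lexicon or grouper change helped or merely traded misses for false alarms.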
Challenge of low-density languages
  Commercial non-viability
  Lack of lexical resources and corpora
  Lack of native speakers, or even proficient speakers
  Greed
Future developments on the language frontier
  New languages
  Increased depth in existing languages
    Named entity extraction
      Added Arabic, Farsi and Chinese this year
      Enhanced English for DoD and DOJ
    Fact extraction
  Other challenges
    Name transliteration
    Translation/glossing
    Question answering