Languages at Inxight Ian Hersey Co-Founder and SVP, Corporate Development and Strategy.


Inxight Confidential


Inxight at a Glance

20+ years of Xerox PARC research; 70 patents
  Content & linguistic analysis (27 languages today)
  Information visualization and discovery
Silicon Valley HQ; offices in US, Europe
250 major customers
Seasoned management team
Solid investor backing: Vantage Point, Reed Elsevier, Deutsche Bank, Dresdner Bank, Xerox, In-Q-Tel

Inxight provides the only complete solution for organizing and accessing unstructured data to increase the speed and accuracy of information discovery



What we mean by language support

Not pure statistics
“Language independence” is a fallacy when it comes to text
Whitespace parsing + algorithmic stemming is a cheap hack
  Stem-internal changes
  Compounding
  Agglutination
  Vocalization or lack thereof
  Non-breaking languages
Phrases, terms and named entities can’t be extracted effectively by n-gram indexing or pure machine learning
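
To make the “cheap hack” point concrete, here is a minimal sketch of whitespace tokenization plus suffix stripping. The suffix list and the function names are illustrative assumptions, not Inxight code or any published stemmer. It handles regular English suffixes but misses stem-internal changes, compounds, and text written without spaces.

```python
# Hypothetical, deliberately crude suffix list (an assumption for illustration)
SUFFIXES = ("ing", "es", "ed", "s")

def naive_tokens(text):
    """Split on whitespace only."""
    return text.split()

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Regular suffixation works: both forms reduce to "walk"
print(naive_stem("walked"), naive_stem("walks"))

# Stem-internal change: "mice" and "mouse" are never related
print(naive_stem("mice"), naive_stem("mouse"))

# Compounding: a German compound stays one opaque token
print(naive_tokens("Lebensversicherungsgesellschaft"))

# Non-breaking language: a Chinese sentence comes back as a single "word"
print(len(naive_tokens("我喜欢自然语言处理")))
```

A lexicon-backed morphological analyzer addresses these cases by knowing the word forms of each language rather than guessing from surface suffixes.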



Text analysis fundamentals

Base layer
  Language and character set identification
  Document analysis
  Tokenization
  Stemming/normalization
Contextual analysis
  Part-of-speech tagging
  “Grouping”
Find the interesting stuff
  Named entity extraction
  Syntactic analysis (clause boundary identification, subject/object identification, etc.)
Relate the interesting stuff; analyze meaning
  Semantic analysis (fact extraction, etc.)



Don’t ignore statistics

Feed linguistic markup into probabilistic processing
  Categorization (choose your algorithm)
  Search/relevance ranking
  Summarization
  Co-occurrence analysis/entity resolution
  Link analysis
  Predictive analysis/data mining
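
As one illustration of feeding linguistic markup into statistics, the sketch below counts how often two extracted entities appear in the same document. The input format (one list of entity strings per document) is an assumption standing in for the output of an entity extractor; the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count document-level co-occurrence of entity pairs.

    `docs` is a list of entity lists, one per document (assumed format).
    Each unordered pair is counted at most once per document.
    """
    counts = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

docs = [
    ["Xerox", "Inxight", "PARC"],
    ["Inxight", "Xerox"],
    ["PARC", "Xerox"],
]
counts = cooccurrence(docs)
print(counts[("Inxight", "Xerox")])  # 2
print(counts[("Inxight", "PARC")])   # 1
```

The counting only makes sense once the upstream layers have normalized the entities; on raw tokens the same pair would be scattered across spelling and inflection variants.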



Base layer (LinguistX Platform)

Morphological analyzer
  Lexicon + rules
  Compiled as a finite-state machine
  Resource efficient, very fast
    French lexicon recognizes 5M words; takes up 300K on disk/RAM, and runs at over 2 GB/hr on a low-end machine
  Xerox finite-state tools tested on many languages (Inxight’s 27 + others in research)
Corpora to produce statistical models
  Language and character set detection
  Tagged corpus to produce Hidden Markov Model for POS tagger
Groupers
  Finite-state “chunkers”: compiled regex
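
The idea of a grouper as a compiled regular expression over part-of-speech tags can be sketched as follows. The toy tagset, the one-letter tag codes, and the noun-group pattern are assumptions for illustration; they are not the LinguistX tagset or grammar.

```python
import re

# Hypothetical one-letter codes for a toy tagset; unknown tags map to "O"
TAG_CODE = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V"}

# Noun group: optional determiner, any number of adjectives, one or more nouns
NOUN_GROUP = re.compile(r"D?A*N+")

def noun_groups(tagged):
    """Return noun groups from a list of (word, tag) pairs.

    Each token becomes exactly one character, so regex match offsets
    are also token offsets.
    """
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_GROUP.finditer(codes)
    ]

tagged = [
    ("the", "DET"), ("fast", "ADJ"), ("finite-state", "ADJ"), ("machine", "NOUN"),
    ("tags", "VERB"), ("French", "ADJ"), ("text", "NOUN"),
]
print(noun_groups(tagged))
# ['the fast finite-state machine', 'French text']
```

Because the pattern runs over tags rather than words, the same grouper works for any vocabulary once the tagger has done its job.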



Named entity extraction (ThingFinder)

Builds on base platform
Requires additional resources
  Enhanced lexicon (POS tagset insufficient for high quality extraction)
  Entity-specific groupers
  Tagged corpus for accuracy testing
Sometimes you need more
  Genre-specific document analysis
  Specialized tokenization, tagging
  Knowledge base (“Name Catalog”)
  Custom groupers
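
A knowledge base of known names can be applied as a longest-match lookup over tokens. The sketch below is a guess at the general technique only; the catalog contents, entity types, and function name are hypothetical and say nothing about how the actual “Name Catalog” is stored or matched.

```python
# Hypothetical catalog: lowercased token tuples mapped to entity types
NAME_CATALOG = {
    ("xerox",): "COMPANY",
    ("xerox", "parc"): "ORGANIZATION",
    ("deutsche", "bank"): "COMPANY",
}
MAX_LEN = max(len(key) for key in NAME_CATALOG)

def catalog_lookup(tokens):
    """Greedy longest-match lookup; returns (start, end, text, type) tuples."""
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in NAME_CATALOG:
                found.append((i, i + n, " ".join(tokens[i:i + n]), NAME_CATALOG[key]))
                i += n
                break
        else:
            i += 1  # no catalog entry starts at this token
    return found

tokens = "Xerox PARC research and Deutsche Bank backing".split()
print(catalog_lookup(tokens))
# [(0, 2, 'Xerox PARC', 'ORGANIZATION'), (4, 6, 'Deutsche Bank', 'COMPANY')]
```

Longest match matters here: without it, “Xerox PARC” would be reported as the shorter entry “Xerox”.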



Statistical models

Summarization
  Base layer + feature model (feature weights, stop words, cue phrases)
Categorization
  Labeled training data
…and lots of interactive tools
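
A feature model of this kind can be sketched as a weighted sentence scorer. Every value below (the stop words, cue phrases, weights, and the three features chosen) is an assumption for illustration, and the regex tokenizer stands in for the base layer.

```python
import re

# Hypothetical feature model; all values are illustrative assumptions
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to", "this", "that", "it"}
CUE_PHRASES = {"in summary": 2.0, "in conclusion": 2.0}
WEIGHTS = {"term": 1.0, "cue": 1.0, "position": 0.5}

def content_words(sentence):
    """Lowercased words minus stop words (stand-in for real tokenization)."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, n=1):
    """Return the n highest-scoring sentences in document order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = {}
    for s in sentences:
        for w in content_words(s):
            freq[w] = freq.get(w, 0) + 1

    def score(i):
        ws = content_words(sentences[i])
        term = sum(freq[w] for w in ws) / len(ws) if ws else 0.0
        cue = sum(v for p, v in CUE_PHRASES.items() if p in sentences[i].lower())
        position = 1.0 if i == 0 else 0.0  # favor the opening sentence
        return WEIGHTS["term"] * term + WEIGHTS["cue"] * cue + WEIGHTS["position"] * position

    top = sorted(range(len(sentences)), key=score, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]

text = ("Stemming maps words to stems. The weather was nice. "
        "In summary, stemming helps search.")
print(summarize(text, 1))
# ['In summary, stemming helps search.']
```

In practice the term feature would be computed over stems rather than surface words, which is where the base layer contributes.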



Fact extraction

Builds on base of linguistic markup + named entities
Modeled on specific templates
  Rules populate the templates
Additional linguistic resources
  Intra-document
    Document analysis/genre identification
    Subject/object identification
    Anaphora resolution
  Inter-document
    Entity resolution
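
The template-and-rule idea can be sketched with a single hypothetical template and one rule over entity-tagged markup. The template name, trigger verbs, markup format, and the example names are all invented for illustration; a real system would also draw on subject/object identification and anaphora resolution rather than simple adjacency.

```python
from dataclasses import dataclass

@dataclass
class JoinFact:
    """Hypothetical template: a person joining a company."""
    person: str
    company: str

TRIGGERS = {"joined", "joins"}  # assumed trigger verbs

def extract_join_facts(markup):
    """Apply one rule: PERSON + trigger verb + COMPANY fills a JoinFact.

    `markup` is a list of (text, label) pairs where label is an entity
    type or None (assumed format for upstream output).
    """
    facts = []
    for i in range(len(markup) - 2):
        (subj, subj_type), (verb, _), (obj, obj_type) = markup[i:i + 3]
        if subj_type == "PERSON" and verb.lower() in TRIGGERS and obj_type == "COMPANY":
            facts.append(JoinFact(person=subj, company=obj))
    return facts

markup = [
    ("Jane Doe", "PERSON"), ("joined", None), ("Acme Corp", "COMPANY"),
    ("in", None), ("2004", None),
]
print(extract_join_facts(markup))
# [JoinFact(person='Jane Doe', company='Acme Corp')]
```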



Developing a new language

Resource acquisition
  Corpora
  Lexicon
  Team
    Computational linguist familiar with tools
    Native speaker
Resource enhancement
  Label tagged truth sets
  Build out morphological classes
  Fill lexical gaps
Build, test and refine
Soup to nuts: $500K to $1M for V1.0



Challenge of low-density languages

Commercial non-viability
Lack of lexical resources and corpora
Lack of native speakers, or even proficient speakers
Greed



Future developments on the language frontier

New languages
Increased depth in existing languages
  Named entity extraction
    Added Arabic, Farsi and Chinese this year
    Enhanced English for DoD and DOJ
  Fact extraction
Other challenges
  Name transliteration
  Translation/glossing
  Question answering
