Cognition Semantic NLP for Search Overview-1


  • 7/30/2019 Cognition Semantic NLP for Search Overview-1

    1/20

    Cognition Technical Overview  2007 Cognition Technologies, Inc.  1 of 20

    Technical Overview of Cognition's Semantic NLP (as Applied to Search)

    Kathleen Dahlgren, Ph.D.

    Growth of the Internet, the proliferation of email as the preferred method of information
    exchange, and the creation of huge stores of digitized text have opened the gateway to a
    deluge of information that is often difficult to navigate and search. Users are awash in data,
    unable to glean relevant information.

    To address this need and pain point, Cognition Technologies, Inc. has introduced Cognition's
    Semantic NLP (Natural Language Processing). This evolutionary software uses state-of-the-art
    linguistic technology to easily and precisely find on-target information on the Internet or in
    large libraries of digitized text. Users pose queries in plain English, and Cognition's Semantic
    NLP interprets their meaning, responding with more precise results than is possible with
    traditional search technologies (e.g. pattern matching, concept search, etc.). Cognition's
    Semantic NLP produces results which are both highly relevant to the user and very complete.
    These levels of relevancy and completeness (precision and recall) are much higher than is
    possible with traditional Search technologies, no matter how the user query is worded.

    (Please refer to the glossary in the Appendix while reading this Technical Overview.)

    I. Search Technology

    Most search engines in use today are frustrating to use because they yield a large quantity of
    irrelevant information. Paradoxically, they also fail to retrieve significant amounts of relevant
    information. Current search technologies only work well when the user knows exactly how the
    information in the target documents is worded, and forms a search query with sufficiently fine
    granularity to yield a manageable amount of information. It is impossible to know how to word
    a query in advance (it would require the user to know the answer to the query as the query was
    constructed), so typically users spend a lot of time browsing irrelevant information,
    constructing ever more complex Boolean queries with only marginal success, or face
    the frustration of finding nothing at all.

    Cognition's Semantic NLP is substantially more precise and exhaustive in its ability to search a
    dataset, as indicated by commonly used Precision/Recall tests. Precision is a measure of
    retrieval accuracy calculated by dividing the total number of relevant retrievals by the number
    of all retrievals generated by the search. Recall is a measure of the extent to which relevant
    material in the total document base is found. It is calculated by dividing the number of relevant
    retrievals by the total number of potentially relevant retrievals in the document base. One
    comparative benchmark for measuring Precision and Recall is a government-sponsored
    competition known as TREC, which is intended to test Search technologies. In 2006, the best
    performer in bioinformatics had 16% precision and 26% recall. Cognition has
    performed a number of internal head-to-head comparisons of search, comparing Cognition's
    Semantic NLP with other well-known Search technologies on the same databases with the same
    queries. Document sets searchable by all the Search technologies were not available, so the
    head-to-head comparisons were performed on different document sets, but in each case
    Cognition's Semantic NLP and the competitor Search engine were searching in the identical
    documents with identical queries. Fifty queries considered likely to be entered by users were
    formed, searches performed, and result relevancy judged by members of the Cognition
    Technologies staff. Precision/Recall results were then compared.

    Recall for Cognition's Semantic NLP and other Search engines was measured by taking all the
    relevant retrievals found as a baseline of 100% and comparing the Completeness of each Search
    engine to that baseline. There may have been other retrievals missed, but none was observed.
    Note that Cognition's Semantic NLP far outperforms the competitors in both precision and
    recall.

    Search Engine               Document Base                           Precision   Recall
    Google                      globalissues.com (blog)                 12%         21%
    Cognition's Semantic NLP    globalissues.com (blog)                 91%         90%
    dtSearch                    Microsoft Anti-Trust Case emails        24%         19%
    Cognition's Semantic NLP    Microsoft Anti-Trust Case emails        96%         95%
    Autonomy                    NewYorkLife.com (corporate Web site)    1%          40%
    Cognition's Semantic NLP    NewYorkLife.com (corporate Web site)    92%         87%
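
    The Precision and Recall definitions above, together with the pooled 100% baseline used for
    Recall, can be sketched in a few lines of Python. This is only an illustration of the
    arithmetic: the engine names, document ids and relevance judgments are invented, and the
    function names are assumptions rather than part of any Cognition tool.

```python
def precision(retrieved, relevant):
    """Relevant retrievals divided by all retrievals."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Relevant retrievals divided by all known relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def pooled_baseline(results_by_engine, judged_relevant):
    """Relevant documents found by at least one engine (the 100% baseline)."""
    judged_relevant = set(judged_relevant)
    pool = set()
    for docs in results_by_engine.values():
        pool |= set(docs) & judged_relevant
    return pool


# Hypothetical retrievals for one query on a shared document set.
results = {
    "engine_a": ["d1", "d2", "d3", "d4"],
    "engine_b": ["d1", "d5"],
}
judged_relevant = {"d1", "d2", "d5", "d9"}  # d9 was found by no engine

baseline = pooled_baseline(results, judged_relevant)
scores = {
    name: (precision(docs, baseline), recall(docs, baseline))
    for name, docs in results.items()
}
```

    Because the baseline contains only relevant documents that some engine actually found, a
    relevant document missed by every engine (d9 here) does not lower anyone's Recall. This is
    the caveat the text acknowledges when it says other retrievals may have been missed.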

    II. Cognition's Semantic NLP Linguistic Technology

    Cognition's Semantic NLP searches on meanings, not patterns; therefore, its results are very
    precise. The user poses queries in plain English, and Cognition's Semantic NLP determines
    what the words in the query mean in the context of the query. If you ask "How can I buy stock
    on the market?", Cognition's Semantic NLP determines that "stock" means "share" or
    "security". It searches only on that meaning of "stock" and doesn't retrieve information about
    stocking shelves, cattle or flowers. If you ask "How can I stock the shelves of my market?", it
    retrieves information about merchandising, and doesn't retrieve information about shares in
    companies. Cognition's Semantic NLP returns information with over 90% Relevancy, reducing
    the user's need to ponder large numbers of irrelevant retrievals found with other Search
    technologies.


    Simultaneously, Cognition's Semantic NLP overcomes the problem of information underload,
    i.e. not finding anything at all because of differences in wording. It finds information regardless
    of the way a concept is worded in the target documents. If you ask "Fatal fumes in the
    workplace?", Cognition's Semantic NLP finds documents that talk about "gas", "vapor",
    "steam", etc. It is important to note that some of these words have ambiguous meanings (e.g.
    "fume" can mean a vapor or it can mean an aromatic wine), but Cognition's Semantic NLP
    doesn't retrieve irrelevant information triggered by those words used in a different meaning
    than the query. For example, when searching on "fume", it retrieves to "gas" meaning vapor,
    but not "gas" meaning gasoline. The result is that Cognition's Semantic NLP retrieves 5 to 7
    times more relevant information than other Search technologies, as measured in head-to-head
    comparisons with other Search engines, while maintaining over 90% Relevancy.

    Another source of greater Completeness is Cognition's Semantic NLP taxonomy, which enables
    Cognition's Semantic NLP to search on specific information when queried on more general
    information. As an example, if the user searches on "money", Cognition's Semantic NLP will
    find information about "dollar", "pound" and "yen", etc. Cognition's Semantic NLP taxonomy
    covers 506,000 concepts, and is thus very complete. The customer doesn't have to build a
    taxonomy from scratch, as with some technologies.

    Cognition's Semantic NLP Architecture

    The components of linguistic processing in Cognition include a reader, a phrase parser, a
    morphological component, a word meaning interpreter, a dictionary, a taxonomy and a
    meaning thesaurus. The dictionary and taxonomy are used by the phrase parser, morphology
    and word meaning interpreter. The meaning thesaurus is used to find alternate wordings
    during search.

    [Figure: architecture diagram with the components Reader, Morphology, Phrase recognizer,
    Word Meaning Interpreter, Search Engine and Indexer, and the databases Dictionary,
    Taxonomy and Meaning Thesaurus.]


    1. The reader. The reader reads the text or query, locates the words, and looks them up in
    the dictionary. This component guarantees against false hits when one word is part of another
    word, as in "part" and "party", or "loss" and "floss".
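
    A minimal sketch of why whole-word reading avoids the "loss"/"floss" false hit, assuming a
    simple letters-only tokenizer and a tiny invented dictionary. The names `read_words`,
    `look_up` and `whole_word_hits` are hypothetical and do not describe Cognition's reader.

```python
import re

# Hypothetical mini-dictionary; the real lexicon is far larger.
DICTIONARY = {"the", "party", "part", "bought", "dental", "floss", "loss"}


def read_words(text):
    """Split text into lowercase word tokens on non-letter boundaries."""
    return re.findall(r"[a-z]+", text.lower())


def look_up(text):
    """Pair every token with a flag saying whether the dictionary knows it."""
    return [(word, word in DICTIONARY) for word in read_words(text)]


def whole_word_hits(term, text):
    """Token positions where the term matches an entire word."""
    return [i for i, word in enumerate(read_words(text)) if word == term.lower()]


sentence = "The party bought dental floss."
```

    A substring test such as `"loss" in sentence` succeeds on this sentence, while
    `whole_word_hits("loss", sentence)` finds nothing, which is the behavior the reader is
    described as guaranteeing.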

    2. The morphological component. The morphological component isolates word stems from
    prefixes and suffixes, enabling Cognition's Semantic NLP to recognize many millions of word
    forms (actually, an indefinite number). Some words take various forms according to
    morphological rules. Some of these are enumerated below:

    a. Nouns with irregular plural morphology such as "mouse - mice", "tooth - teeth".
    b. Nouns with regular changes in the plural such as "baby - babies".
    c. Regular verb inflections such as "raze - razed - razing" and
       "ship - shipped - shipping".
    d. Verbs with irregular past tense forms such as "catch - caught - catching".
    e. Regular derived forms:

       Affix            Stem          Derived form
       inter            communicate   intercommunicate
       tion             communicate   communication
       inter + tion     communicate   intercommunication
       ize              actual        actualize
       re               marry         remarry
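
    The kinds of rules listed above can be illustrated with a small affix-stripping sketch. It is
    not Cognition's morphological component: the stem list, the irregular-form table and the
    rule set are assumptions that cover only the examples in the text.

```python
# Hypothetical stem list and rules covering only the examples in the text.
STEMS = {"mouse", "tooth", "baby", "raze", "ship", "catch",
         "communicate", "actual", "marry"}
IRREGULAR = {"mice": "mouse", "teeth": "tooth", "caught": "catch"}
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ing", "e"), ("ed", ""),
                ("ed", "e"), ("tion", "te"), ("ize", ""), ("s", "")]
PREFIXES = ("inter", "re")


def find_stem(form):
    """Return the dictionary stem of a word form, or None if unrecognized."""
    form = form.lower()
    if form in IRREGULAR:          # irregular forms are listed, not derived
        return IRREGULAR[form]
    if form in STEMS:
        return form
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix):
            candidate = form[: len(form) - len(suffix)] + replacement
            if candidate in STEMS:
                return candidate
            # Undo consonant doubling, as in "shipped" -> "shipp" -> "ship".
            if (len(candidate) > 2 and candidate[-1] == candidate[-2]
                    and candidate[:-1] in STEMS):
                return candidate[:-1]
    for prefix in PREFIXES:        # strip a prefix, then analyze the rest
        if form.startswith(prefix):
            stem = find_stem(form[len(prefix):])
            if stem is not None:
                return stem
    return None
```

    Because a candidate is accepted only when it is a known stem, a word such as "party" is not
    reduced to anything, and combinations such as prefix plus suffix ("intercommunication") fall
    out of applying the rules recursively instead of being listed one by one.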

    3. The Phrase Recognizer. The phrase recognizer combines words into phrases for more
    accurate interpretation. There are several different types of phrases it handles.

    a. Names. The phrase parser recognizes that certain patterns are personal names, company
    names or places, whether or not those names are represented in the lexicon. The result is
    the ability to map from one form to another, including regular short forms. Examples are:

       Mr. Savi Samdi, short form Mr. Samdi, Samdi, is a human male
       Lake Su, short form Su, is a lake
       The XYZ Corporation, short form XYZ Corp. or XYZ, is a company

    b. Dates. The phrase recognizer sees all date variations, and maps one to the other, as in
    "December 1, 1992", "12/1/92", "12-1-92", "Dec. 1, '92".

    c. Compounds. The phrase recognizer interprets lexical phrases such as "movie set", "heart
    attack", "net revenue", etc. Such phrases can consist of many words. There are currently
    over 191,000 compound phrases in the lexicon.

    d. Acronyms. The phrase recognizer maps the long form of acronyms to the short form,
    noting that "Securities and Exchange Commission" is the long form of "SEC".

    e. Idioms. The phrase recognizer pieces together idioms, including morphological variants
    of them (which are linked to other word meanings in the meaning thesaurus). Some examples
    are:

       kick the bucket - kicked the bucket (die, sense 1)
       let the cat out of the bag - letting the cat out of the bag (disclose, sense 2)
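
    Mapping date variants onto one another, as described for the phrase recognizer, amounts to
    normalizing each variant to a single canonical value. The sketch below is an assumption
    about how that could be done with regular expressions. It assumes US month/day/year order
    and that two-digit years belong to the 1900s, neither of which is stated in the text.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def normalize_date(text, century=1900):
    """Map a date phrase to a datetime.date, or None if it is not recognized."""
    text = text.strip()
    # Numeric forms such as 12/1/92 or 12-1-92 (month, day, year assumed).
    match = re.fullmatch(r"(\d{1,2})[/-](\d{1,2})[/-](\d{2}|\d{4})", text)
    if match:
        month, day, year = (int(group) for group in match.groups())
    else:
        # Spelled-out forms such as "December 1, 1992" or "Dec. 1, '92".
        match = re.fullmatch(r"([A-Za-z]+)\.?\s*(\d{1,2}),\s*'?(\d{2}|\d{4})", text)
        if not match:
            return None
        key = match.group(1)[:3].lower()
        if key not in MONTHS:
            return None
        month, day, year = MONTHS[key], int(match.group(2)), int(match.group(3))
    if year < 100:
        year += century  # assumption: two-digit years are 19xx
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None
```

    Once every variant maps to the same `date` value, a query written one way can match a
    document written another way by comparing the normalized values.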

    4. Word Meaning Interpreter. The word meaning interpreter uses context and structure to
    determine the meanings of words in context. Several databases are consulted to determine
    word meanings. For example, the word "check" in "check up" is interpreted as part of a phrase
    with "up" with a specific meaning, while "check" in "pay with a check" is interpreted as an
    individual word meaning a promissory note because of the context with "pay". The lexicon
    contains 17,000 ambiguous words.

    5. Word Sense Selection Database. This database encodes trigger information and statistical
    occurrence metrics used to contribute to contextual word sense selection. There are over 4
    million such meaning contexts.
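
    One simple way to combine trigger words with statistical occurrence information is to score
    each candidate sense by a prior plus the weights of any triggers present in the context. The
    sketch below is a hypothetical illustration of that idea. The sense names, priors and trigger
    weights are invented and are not taken from Cognition's database.

```python
# Hypothetical sense inventory: a prior (how common the sense is) plus
# context words that act as triggers for it, each with a weight.
SENSES = {
    "check": {
        "check_promissory_note": {
            "prior": 0.4,
            "triggers": {"pay": 2.0, "bank": 1.5, "cash": 1.5},
        },
        "check_review": {
            "prior": 0.6,
            "triggers": {"on": 0.5, "status": 1.5, "stocks": 1.0},
        },
    },
}


def select_sense(word, context_words):
    """Pick the sense whose prior plus matched trigger weights is highest."""
    candidates = SENSES.get(word.lower())
    if not candidates:
        return None
    context = {w.lower() for w in context_words}

    def score(item):
        _, info = item
        matched = sum(weight for trigger, weight in info["triggers"].items()
                      if trigger in context)
        return info["prior"] + matched

    return max(candidates.items(), key=score)[0]
```

    With these invented numbers, "pay with a check" selects the promissory-note sense because
    "pay" is a trigger for it, while "how do I check on my stocks" selects the review sense.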

    6. Dictionary. In the dictionary, each meaning of each word is defined, and given
    morphological, syntactic, taxonomic and semantic features. This information enables the
    software to select word meanings, recognize various forms of a given word, and parse
    sentences.

    Cognition's Semantic NLP lexicon is quite broad, including 506,000 forms, over 536,000
    concepts, and millions of word forms. It has entries for almost every common word of English,
    and tens of thousands of proper names, phrases and acronyms. In combination with the
    morphology component, it recognizes millions of word forms. It has vocabulary in many
    domains including law, health and medicine, biology, genomics, finance, terrorism, recreation,
    household, human resources, encyclopedia articles, nuclear energy, software technical notes,
    newspaper articles, government regulations, telecommunications, human factors engineering,
    and military.

    7. Taxonomy. Cognition's Semantic NLP taxonomy classifies all objects and events in an
    inheritance hierarchy. An abbreviated piece of the taxonomy is shown below:

    bovid_mammal

    antelope_mammal    bovine_mammal    ovine_mammal

    gnu  antelope  gazelle    cow  bull    domestic_sheep  ovine  Ovis poli

    sheep  lamb  ewe  Ovis poli

    There are approximately 7,000 unique nonterminal nodes and 536,000 unique leaves or
    word senses.
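
    Searching from the general to the particular can be pictured as collecting all descendants of
    a taxonomy node. The sketch below uses a small parent-to-children mapping loosely based on
    the fragment above. The exact parent/child links are an assumption, since the layout of the
    original figure is not fully recoverable, and the function names are hypothetical.

```python
# Hypothetical parent -> children links, loosely following the fragment above.
TAXONOMY = {
    "bovid_mammal": ["antelope_mammal", "bovine_mammal", "ovine_mammal"],
    "antelope_mammal": ["gnu", "antelope", "gazelle"],
    "bovine_mammal": ["cow", "bull"],
    "ovine_mammal": ["domestic_sheep"],
    "domestic_sheep": ["sheep", "lamb", "ewe"],
}


def descendants(node):
    """All nodes below `node`, in depth-first order."""
    found = []
    stack = list(reversed(TAXONOMY.get(node, [])))
    while stack:
        child = stack.pop()
        found.append(child)
        stack.extend(reversed(TAXONOMY.get(child, [])))
    return found


def is_a(specific, general):
    """True if `specific` is the same node as, or lies below, `general`."""
    return specific == general or specific in descendants(general)
```

    A query on a general node can then be expanded to every concept returned by
    `descendants`, which mirrors the "money" to "dollar", "pound" and "yen" example given
    earlier.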

    8. Meaning Thesaurus. The meaning thesaurus maps meanings to each other, forming
    meaning classes. Unlike ordinary thesauri, the mapping is from word meaning to word
    meaning (or phrase). The meanings within a grouping are judged to be loosely synonymous.
    The synonymy is "loose" in the sense that related meanings may have different parts of
    speech, but they evoke the same ideas in the mind of the reader. For example, the
    following concepts are in one thesaural grouping:

    bank9, column2, file3, line3, queue1, rank4, row1, tier1, alignment1, align1

    The digits indicate the specific senses or meanings of the words. In this example,
    bank9 means a set of similar things, column2 means a line of similar things, file3
    means a lined-up group of things, and so on.

    The word "support" illustrates that a given word in different meanings may belong to a
    number of different meaning classes or thesaural groups, as shown below:

    support1, abet1, assist1, bail3, sponsor1, benefit1, benefit2, assistance1
    support2, attest1, back7, affirm3, establish2, prove1
    support3, bear3, carry9, fortify1, shore2
    support4, helpdesk1, hotline1, service3, serve3, help2

    Phrase relations are also indicated in the concept thesaurus:

    kick the bucket, die1, expire4
    SEC, Securities and Exchange Commission

    There are currently over 76,000 concept thesaural groups.

    This database is a primary source of Cognition's Semantic NLP's unique combination of
    Relevancy and Completeness. Cognition's Semantic NLP not only knows all the different
    ways of saying things for full Completeness, it also knows which senses of the words should
    be counted as equivalents for high Relevancy. In fact, Cognition's Semantic NLP word
    meaning interpreter has 94% Relevancy. It is a commonplace in search to say that high
    recall comes at the expense of high precision. Since Cognition's Semantic NLP
    disambiguates words, and maps meaning to meaning in the thesaurus, the thesaurus
    improves recall without lowering precision.
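
    The difference between sense-level and word-level synonym expansion can be sketched with
    a subset of the "support" groups listed above. The data structure and the function names
    are assumptions made for illustration. Only the group members come from the text.

```python
# A subset of the thesaural groups shown above, keyed by sense (word + digit).
GROUPS = [
    {"support1", "abet1", "assist1", "sponsor1"},
    {"support2", "attest1", "affirm3", "prove1"},
    {"support3", "bear3", "carry9", "fortify1", "shore2"},
    {"support4", "helpdesk1", "hotline1", "service3"},
]


def equivalents(sense):
    """Senses that share a group with one specific, disambiguated sense."""
    found = set()
    for group in GROUPS:
        if sense in group:
            found |= group
    found.discard(sense)
    return found


def naive_equivalents(word):
    """What an ordinary word-level thesaurus would do: ignore sense numbers."""
    found = set()
    for group in GROUPS:
        if any(member.rstrip("0123456789") == word for member in group):
            found |= {m for m in group if m.rstrip("0123456789") != word}
    return found
```

    Expanding the disambiguated sense support2 adds only "attest", "affirm" and "prove", while
    word-level expansion of "support" also pulls in "fortify", "hotline" and the rest. That is the
    sense in which mapping meaning to meaning raises recall without the usual loss of precision.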

    9. Synographs. Synographs are alternate spellings for entries in the dictionary, such as
    "cookie" and "cooky". This database allows for recognition of alternate spellings and
    common misspellings. There are currently over 12,000 synograph entries.


    III. Lexicography Tools

    The semantic databases have been developed with over 100 person-years of lexicography work.
    A number of tools have been developed to assist lexicographers in the development of the
    semantic databases.

    1. Lexicon Tool

    Entries in the dictionary include single words, multiple-word phrases, and the nodes of the
    ontology. Each entry has one or more senses (meanings) with syntactic and semantic
    features.

    Each word sense has several components, as follows:

    a. plain English definition
    This is what is displayed to the user in the search interface if the user chooses to view the
    search concepts.

    b. ontological attachment(s)
    Every sense is attached to one or more nodes in the ontology. For example, the
    first sense of "dog" is attached to "pet_node" and "canine_mammal".

    c. syntactic features
    There are currently about 1,250 unique syntactic features. Each sense has 2 or more
    features associated with it. The syntactic features include main category features (noun,
    verb, etc.), morphological features for classifying the different forms of a word (e.g.
    "wind", "winding", "wound"), and subcategory features (such as intransitive for verbs).
    For example, the first sense of "dog" has the category feature "noun", a morphological
    feature indicating how to pluralize it, and no subcategory features.

    d. semantic features
    Each sense may have semantic features, such as "domain" features (used to prefer a
    sense in a particular domain) and selectional restrictions (for use with a parser).
    Selectional features help guide word sense disambiguation. For example, the verb
    "charge" in the meaning "indict" requires a sentient object and a crime as the oblique
    object, whereas "charge" in the meaning "electrify" requires an electrical device as object
    and a form of energy as oblique object.

    The tool runs as a server and guarantees that no word is edited by more than one
    lexicographer at a time, for integrity of the database. It also makes changes to words
    available to all lexicographers immediately after they have been saved into the database.
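
    The four components of a word sense described above can be pictured as a small record
    type. The sketch below is a hypothetical data structure, not Cognition's schema. Apart
    from "pet_node" and "canine_mammal", all node, feature and class names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    definition: str                 # a. plain English definition
    ontology_nodes: list            # b. ontological attachment(s)
    syntactic_features: list        # c. category, morphology, subcategory
    domains: list = field(default_factory=list)        # d. domain features
    restrictions: dict = field(default_factory=dict)   # d. role -> required class


LEXICON = {
    "dog": [
        Sense("a domesticated canine kept as a pet",
              ["pet_node", "canine_mammal"], ["noun", "plural_s"]),
    ],
    "charge": [
        Sense("indict", ["accuse_event"], ["verb", "transitive"],
              restrictions={"object": "sentient", "oblique": "crime"}),
        Sense("electrify", ["energize_event"], ["verb", "transitive"],
              restrictions={"object": "electrical_device", "oblique": "energy"}),
    ],
}


def senses_allowed(word, roles):
    """Senses whose selectional restrictions are met by the parsed arguments.

    `roles` maps a grammatical role to the ontological class of its filler.
    """
    return [sense for sense in LEXICON.get(word, [])
            if all(roles.get(role) == required
                   for role, required in sense.restrictions.items())]
```

    Given a parse in which the object of "charge" is sentient and the oblique object is a crime,
    only the "indict" sense survives the filter. A real system would presumably test class
    membership through the ontology and not by exact string equality.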

    2. Meaning Thesaurus Tool

    The concept thesaurus tool enables the lexicographer to look up words in the lexicon, view the
    meanings, select a meaning, and view all of the concept groups that the meaning is a member of.
    The lexicographer may add to the concept groups, delete from them, and create new ones. This
    can be done in parallel with more than one concept group at a time.

    The tool runs as a server and guarantees that no concept group is edited by more than one
    lexicographer at a time, for integrity of the database. It also makes upgrades available to all
    lexicographers immediately after they have been saved into the database.

    IV. Customer Tools

    A number of tools have been developed to enable customers to index their documents,
    customize their specific jargon such as product names, and search their documents. Cognition's
    Semantic NLP employs client-server communication for optimal ease of use and efficiency on
    large document bases. A stand-alone, non-client-server version of Cognition's Semantic NLP is
    also available. At the core of the system is the Cognition's Semantic NLP Server, which can be
    configured for indexing, searching, or both.

    1. Indexing Tool

    The Cognition Indexer GUI is the primary interface used to create and index a Cognition's
    Semantic NLP Project. A Cognition's Semantic NLP Project is simply a list of documents (in
    any of the formats Cognition's Semantic NLP handles), together with a set of parameters to
    be used when they are indexed and searched. It is straightforward and easy to use, though
    as with any software, good results will depend on training (or reference to the user's guide)
    and practice.

    The following screen shot shows the main window of the Indexer GUI.

    To get started, the user employs the Cognition Indexer GUI to create a list of the documents
    that the user wants to search. The following screen shot shows the Indexer window used for
    selecting files for indexing from a file system.


    The Indexer is then used to send an indexing request to the Cognition's Semantic NLP
    Service. The documents to be indexed can be in HTML, plain text, MS Word, WordPerfect,
    RTF, or PDF formats, and can be included from the local network or from the Web via the
    Cognition Spider.

    Following its initial creation, an index may be updated as often as needed. Updating the
    index does not entail re-indexing the entire document set, but only those documents for
    which a change is indicated. The Cognition Indexer GUI can automatically detect when
    documents have changed, and mark them for re-indexing. Documents can also be added to
    or removed from the document set by the user. In addition, a daemon may be invoked to
    automatically detect changes and update the index using the command-line version of the
    Cognition Indexer.
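
    Partial updating of this kind needs some way to tell which documents changed since the last
    indexing run. The text does not say how the Cognition Indexer detects changes. The sketch
    below assumes a stored content hash per document, and its function names are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash stored alongside each indexed document."""
    return hashlib.sha256(content).hexdigest()


def plan_update(indexed, current):
    """Decide what to (re-)index and what to drop.

    indexed: path -> fingerprint recorded at the last indexing run
    current: path -> current document bytes
    """
    to_index = sorted(path for path, data in current.items()
                      if indexed.get(path) != fingerprint(data))
    to_remove = sorted(path for path in indexed if path not in current)
    return to_index, to_remove


# Hypothetical state: one changed, one unchanged, one deleted, one new file.
indexed = {
    "a.html": fingerprint(b"old text"),
    "b.txt": fingerprint(b"same text"),
    "c.pdf": fingerprint(b"deleted later"),
}
current = {
    "a.html": b"new text",
    "b.txt": b"same text",
    "d.rtf": b"added later",
}
to_index, to_remove = plan_update(indexed, current)
```

    A daemon or scheduled job could call a function like `plan_update` periodically and pass
    only the `to_index` list to the indexer, leaving unchanged documents untouched.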

    2. Sample Scripts

    Once the index has been created, the user employs a Web browser (e.g. Internet Explorer
    or Netscape Communicator) to Search it. The Cognition's Semantic NLP package includes
    sample Web pages for this purpose. These sample pages may be used without
    modification, or they may be customized by the user to obtain the desired appearance and
    functionality. Sample scripts for ASP, Python, Java, PHP and Perl are available.

    3. Automatic Dictionary expansion

    Cognition's Semantic NLP dictionary can be expanded to include large numbers of customer
    vocabulary words automatically. Any file of terms and words used in the customer's
    business, along with the categories of the terms, can be merged automatically with
    Cognition's Semantic NLP dictionary. Recently, Cognition expanded its vocabulary in
    medicine and molecular biology by over 130,000 words using semi-automated
    techniques.

    4. Customer Dictionary expansion

    The customer may add words in ontological classes if desired. The customer may force
    Cognition's Semantic NLP into the desired meaning of a word, and the customer may force
    Cognition's Semantic NLP to consider a word a last name, or not a last name, as desired.

    V. Product Features of Cognition's Semantic NLP

    1. Relevance ranking. Cognition's Semantic NLP returns a list of retrievals containing
    documents in which the query concepts were found together in a sentence. Those with
    exact word matches to the query terms come first. The next group is documents in which
    there were exact matches to some query terms in the body of the document, but other
    query terms only matched conceptually. Which terms match exactly is indicated. The last
    group is documents in which there were conceptual matches to the query terms. Within
    each group, documents are listed according to the number of sentences in which all query
    terms were found.
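
    The three-group ordering described for relevance ranking can be expressed as a sort on a
    (group, sentence count) key. The sketch below is an illustration under stated assumptions.
    The `Hit` fields are invented, and it assumes that documents with more matching sentences
    come first within a group, which the text implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc: str
    exact_terms: int   # query terms matched by the exact word
    total_terms: int   # number of terms in the query
    sentences: int     # sentences in which all query terms were found


def group(hit):
    """0 = all terms exact, 1 = some exact and some conceptual, 2 = conceptual only."""
    if hit.exact_terms == hit.total_terms:
        return 0
    if hit.exact_terms > 0:
        return 1
    return 2


def rank(hits):
    """Order by group first, then by descending sentence count (assumed)."""
    return sorted(hits, key=lambda hit: (group(hit), -hit.sentences))


# Hypothetical hits for a two-term query.
hits = [
    Hit("c", exact_terms=0, total_terms=2, sentences=9),
    Hit("a", exact_terms=2, total_terms=2, sentences=1),
    Hit("b", exact_terms=1, total_terms=2, sentences=5),
    Hit("d", exact_terms=2, total_terms=2, sentences=3),
]
ranked = [hit.doc for hit in rank(hits)]
```

    Note that document "c" ranks last despite having the most matching sentences, because
    the group takes priority over the sentence count.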


    2. Spelling correction. In the Search interface, unrecognized words are noted and the user is
    given a list of alternative spellings to select from. The user may also leave the word as is.
    Cognition's Semantic NLP does search on words that are unrecognized if the user so desires.

    3. Specific retrievals highlighted. When the user clicks on the "Highlighted Text" link
    corresponding to a retrieved document, Cognition's Semantic NLP highlights the relevant
    section. Additional relevant retrievals within a document are indicated by a pointing hand
    figure at the end of a highlighted section. If there is no pointing hand figure at the end of a
    highlighted section, it tells the user that there are no additional relevant results. In other
    words, it behaves like a clipping service, which is most useful with larger documents.

    4. Specific words highlighted. When the user clicks on the "Highlighted Text" link
    corresponding to a retrieved document, Cognition's Semantic NLP also color-codes the
    specific words which matched the query within the relevant highlight section. As the user
    hovers the cursor over a matched word, the corresponding query term is indicated.
    Sometimes a given word may correspond to more than one query term. In this case the
    word is highlighted according to the first query term matched, but the hover-over text
    indicates all matched terms.

    5. Linguistic Boolean Search. Cognition's Semantic NLP searches can be formed using fully
    recursive Boolean expressions with AND, OR, WITH and AND NOT operators. The
    expressions connected with the Boolean operators are interpreted for meaning. See the
    help function at http://medline.cognition.com or http://wikipedia.cognition.com for
    complete instructions and examples of use for linguistic Booleans.

    6. Fuzzy Search. Cognition's Semantic NLP searches can be formed using wildcards and
    fuzzy operators (e.g., /Liebowicz/n matches names that sound like "Liebowicz", and
    i matches words that start with "dr" and end with "g", regardless of case) such that
    proper names and other words can be matched approximately. See the help function at
    http://medline.cognition.com or http://wikipedia.cognition.com for complete instructions
    and examples of use for fuzzy search.

    7. Review Tool. As part of its ASP solution, Cognition's Semantic NLP has a review tool for
    reviewing and categorizing document sets. Features include the ability to create project
    categories and classify documents into those categories, to limit searches within categories,
    to browse document sets without querying and to export archives of categorized
    documents. When combined with a relational database, the features provided allow the
    user to create and modify database tables, display document-related data from tables as
    part of query results and restrict searches based on table column values. Also included are
    scripts that allow the user to take notes and have the notes indexed and available to be
    searched in tandem with the related document set.

    8. Formats. Cognition's Semantic NLP indexes documents in HTML, XML, OCR'd text and plain
    ASCII text. Documents in Word, PowerPoint, RTF, or WordPerfect are converted to HTML
    before being indexed. Some engineering may be required for XML. Documents in PDF are
    converted to plain text before being indexed. The user may view retrievals in documents
    converted to HTML or plain text with the specific retrieved sections highlighted, but may
    also choose to view the original file without highlighting.

    9. Customer and meta tags. Cognition's Semantic NLP can search in customer-specified tags,
    if desired, with some engineering assistance from Cognition Technologies.

    10. Indexing local directories. The Cognition Indexer interface permits the user to select
    individual files or whole directories for indexing.

    11. Spider. Cognition's Indexer GUI includes an interface for creating a list of Web files to
    index. The user enters a URL from which to start, a desired depth to crawl, and parameters
    to include or exclude particular URLs.

    12. Authentication. The spider can be directed to use passwords or cookies to enter sites that
    require authentication, so that the user can index these sites.

    13. Languages. Cognition's Semantic NLP searches in any language handled by Unicode.

    14. Search and retrieval pages. Customizable search and retrieval ASP pages are provided for
    the user. The user can make the search and retrieval look any way desired.

    15. Partial updating. The user may select any number of files to re-index, rather than having to
    re-index an entire document base when individual documents are added or changed.

    16. Automatic updating. A console-level indexing command is provided so that system
    administrators can automatically update new files or changed files on a regular basis,
    initiated by the computer clock.

    17. Load balancing. The Cognition's Semantic NLP indexing interface automatically distributes
    document indexing across as many servers as the administrator selects.

    18. Brokering. Administrative tools enable the system administrator to manually control
    indexing and searching load, and membership access to document bases. The tools send
    queries to servers in response to load and allocate databases to specified servers.
    Customer-specific criteria may involve user parameters such as subscription
    membership.

    19. Categorization. The interface queries users for categories and saves whole retrieval lists or
    individual files into the categories. New categories can be created on the fly. Subsequent
    searches can be restricted to categories.

    20. User-defined ontology. Users can add ontological classes by editing a standard file. In this
    way users can define search into classes unknown to Cognition's Semantic NLP, such as
    company widget names or phrases. For example, Sony could add a category "video recorder"
    with specific video recorder names as category members. Using this new category, an end user
    could ask "video recorder" and retrieve to specific video recorder names mentioned in
    indexed documents.

    21. User control of names. Users can force preference for name or non-name interpretations
    of words like "Bush" and "Stone", which can either be names or common words.

    VI. Comparison with Other Search Engines

    The low Relevancy/Completeness performance of most Search engines is due to the use of
    pattern-matching technology. Pattern matching matches strings of letters in a query to strings
    of letters within the document set, ignoring context and meaning. If you search on "check"
    meaning a promissory note, a pattern-matching search engine only searches for instances of
    the letter pattern "check", regardless of the meaning of the word in context. Your results include
    "check" meaning "see whether", "postponement", and "hold back", as well as "promissory
    note".

    Pattern-matching technology results in information overload because of word ambiguity, so that
    the best this technology can offer is typically no better than 33% Relevancy. It is important to
    note that word ambiguity is most prominent in the most frequently used words of a language. So
    the problem is concentrated among the very words which occur most often in queries and
    documents.

    Also, most Search engines don't require that your search terms be near each other in a
    document, so if you search on "pay with a check", they will retrieve to documents in which
    "pay" is found in the last sentence, and "check" in the first sentence. Some engines don't even
    respect word boundaries, retrieving texts that contain one of your Search terms as part of
    another word. If you search on "loss", they will return documents with "gloss" and "floss".

    Pattern-matching technologies also miss relevant information. They only find material with the
    exact words of the search. If you search on "profit" meaning net revenue, they don't find
    documents containing "net revenue", or "income". If you search on "SEC", they don't find
    documents containing "Securities and Exchange Commission". Conversely, if you search on
    "Securities and Exchange Commission", they don't find documents containing "SEC". If you search
    on "vehicle", they only find that word, not types of vehicles such as car, truck and plane. This is
    why such technology has vastly inferior Completeness to linguistic technology employed by
    Cognition. Typically, pattern-matching technologies retrieve no more than 20% of the information
    Cognition's Semantic NLP retrieves.

    SomeSearch

    engines

    have

    added

    statistical

    information

    along

    with

    pattern

    matching

    search,

    but

    the

    endresultisstillthesametoomuchirrelevantinformationandtoolittlerelevantinformation.

    Verity,Yahoo.

    Theseenginesemploypatternmatchingtechnology. Theyproduceoverretrievalbecausethey

    donotdisambiguatewordsandtheylackphrases. Ifyouask"HowdoIcheckonmystocks?",

    theyretrieveto"Checkyourstockinthecattleyard". Ifyouaskabout"math",theyretrieveto

    "aftermath". CognitionsSemanticNLPhaslittleoverretrievalbecauseitdisambiguateswords

    and has over 100,000 phrases. For the query "How do I check on my stocks", Cognitions


    Semantic NLP interprets "check" to mean "review" or "update", NOT "note for payment", and
    "stocks" to mean shares in companies, NOT bovine animals. Pattern matchers under-retrieve
    due to lack of synonyms. If you ask about "pay raises", they won't find "salary increases".
    Cognition's Semantic NLP has little or no under-retrieval because of its synonyms and
    taxonomy. If you ask about "pay raises", it figures out the meanings of "pay" and "raise", and
    then retrieves "salary increases", "wage hikes", etc. Pattern matchers also under-retrieve due to
    lack of taxonomy. If you ask about "vehicles", they do not retrieve "car", "ship" or "plane".
    Cognition's Semantic NLP has taxonomy, and thus can reason from the general to the
    particular.
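
    As a rough illustration of why synonym knowledge reduces under-retrieval, the sketch below
    expands a query phrase with the members of its synonym group before matching. It is not
    Cognition's implementation; the synonym group, the function names and the sample documents
    are all hypothetical, and a real system would first have to decide which meaning of each
    word is intended.

```python
# Hypothetical phrase-level synonym groups; a real lexicon would hold many thousands.
SYNONYM_GROUPS = [
    {"pay raise", "salary increase", "wage hike"},
]

def expand_query(phrase: str) -> set:
    """Return the phrase plus every phrase that shares a synonym group with it."""
    expanded = {phrase}
    for group in SYNONYM_GROUPS:
        if phrase in group:
            expanded |= group
    return expanded

def retrieve(phrase: str, docs: list) -> list:
    """Return documents containing the phrase or any of its synonyms."""
    terms = expand_query(phrase)
    return [d for d in docs if any(t in d.lower() for t in terms)]

# Hypothetical sample documents.
docs = [
    "The union negotiated a salary increase.",
    "Workers demanded a wage hike.",
    "The stock fell sharply.",
]

exact_hits = [d for d in docs if "pay raise" in d.lower()]  # pattern matching alone
expanded_hits = retrieve("pay raise", docs)                 # with synonym expansion
```

    Exact matching finds nothing for "pay raise" in these documents, while the expanded query
    finds the first two.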

    Google

    Google also employs pattern-matching technology, but with a statistical boost. It never
    retrieves your information unless the query is worded the same way as the target document.
    However, it tracks popularity and places the most popular Web sites first. These tend to be the
    sites other users want to look at, so it seems more on target than Search engines without the
    popularity measure. Cognition Technologies conducted a comparison of Cognition's Semantic
    NLP vs. Google on the www.globalissues.org site, a world political site. There were 50 queries
    in the test.

    Examples of Google search issues that are resolved by Cognition:

    Search query                      Google                            Cognition Semantic NLP
    --------------------------------  --------------------------------  ----------------------
    treasonous behavior               0 retrievals                      2 relevant
                                      (Google doesn't know synonyms)    0 irrelevant

    casualties of natural disasters   0 relevant                        7 relevant
                                      107 irrelevant                    2 irrelevant

    tidal wave                        No reply                          8 relevant
                                      (Google doesn't know that a       0 irrelevant
                                      tsunami is a tidal wave)

    heating up the globe and          0 relevant                        1 relevant
    biodiversity                      4 irrelevant                      0 irrelevant

    turmoil in the Middle East and    No reply                          4 relevant
    economic downturns                                                  0 irrelevant


    Autonomy

    Autonomy also employs a pattern-matching search engine with some statistical enhancements
    using Bayesian inference. In testing Autonomy on the New York Life site, Cognition
    did not see the effects of the statistical reasoning. If working, such technology would
    recognize topics of documents based upon the probability of occurrence of individual words.
    These probabilities are calculated by associating a hand-assigned document topic with the
    words that occur in the document. For example, a text that has been assigned the topic
    "baseball" will have large numbers of occurrences of the words "ball", "bat", "hit", "field",
    "strike", etc. Thus upon searching for "baseball" in documents that have not been assigned
    topics by hand, documents with high numbers of those associated words will be recognized as
    being about baseball. This type of technology only increases retrieval for those topics that
    have been identified in advance as of interest to many users, and for which additional
    processing of trial documents has been performed. Thus it only works well in a small number of
    cases, and to get even these cases to work is labor intensive.
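
    The topic-recognition idea described above can be illustrated with a generic naive Bayes
    classifier. This is a textbook sketch, not Autonomy's actual algorithm; the two training
    texts, the topic labels and the function names are assumptions invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count word occurrences per hand-assigned topic."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for topic, text in labeled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the topic with the highest log-probability under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[topic] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical hand-labeled training documents.
training = [
    ("baseball", "ball bat hit field strike pitcher"),
    ("finance", "stock share price market profit loss"),
]
word_counts, doc_counts = train(training)
topic = classify("the batter hit the ball onto the field", word_counts, doc_counts)
```

    The sketch also shows the cost noted above: every topic needs hand-labeled training text
    before the classifier can recognize it.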

    In practice, as Cognition tested it, Autonomy has two problems. First, it doesn't disambiguate words
    and at the same time it adds many search terms with statistics, so its retrieval Relevancy is only
    0.5%. Secondly, it has poor Completeness because it has no paraphrasing or thesaurus. It misses 60%
    of the relevant material that Cognition's Semantic NLP finds on the same site (www.nylife.com).

    VIII. Cognition's Semantic NLP Specifications

    Deployment:

    In order to deploy Cognition, the user installs the Cognition Semantic NLP program and the
    Cognition Semantic NLP Web component for Search. After installation is complete, the
    Cognition Semantic NLP service (for Windows) or daemon (for Linux) is started automatically.
    This Cognition Semantic NLP Server functions for indexing, search or both, as set by the user
    in the server administration interface.

    The first step is to index the target documents, which can be either on the local network or on the
    Web. If the documents are local, the user selects which documents to index from the file system. If
    the documents are on the Web, the user employs the Cognition Spider to obtain a list of documents.
    Once the project has been created, the documents can be indexed.

    The second step is the Search itself. At least one Cognition Semantic NLP Server must be running
    with Search enabled. The user creates a script page (ASP and Perl are both supported) as in the
    sample provided with the program, and then visits that page in a browser (such as IE or Netscape).
    From that page, the user can search.

    Platforms (operating systems):
    Windows (32-bit) (Windows 2000, Windows XP, Windows 2003)
    Unix, Solaris, Red Hat, Linux, FreeBSD, CentOS

    RAM: 2 GB preferred, 1 GB minimum


    Processor: Minimum 3.0 GHz preferred. Multi-core, Hyper-Threaded preferred.

    Disk: SCSI/SAS drives preferred. Cognition's Semantic NLP requires 140 MB of disk space plus
    space for concept indices, which range from one to two times the size of the original texts.

    Search Interface: Any HTML browser such as Thunderbird or Internet Explorer, accessing a
    script which sends requests to the Cognition Semantic NLP Server. ASP, Perl, Python, and
    PHP4 are all supported.

    API: A C++ API to the Cognition Semantic NLP system is available. An API to the ASP
    Component is provided in the documentation that comes with the software.

    Speed: Cognition's Semantic NLP builds Concept Indices at about 1 hour per GB, depending on
    the machine, the configuration, and the text itself. It can handle an unlimited number of
    queries simultaneously using queuing technology and a sufficient number of computers.

    Vocabulary: Cognition's Semantic NLP knows most English areas of interest or domains, except
    company-specific terminology, such as product names (which can be learned).

    Technical Support: Support is available from 9 am to 6 pm Pacific time. Extended support with a
    3-hour minimum response time can be arranged. With a service contract, Cognition's Semantic
    NLP will regularly monitor performance and make system adjustments to further enhance
    Relevancy and provide optimum quality control.

    IX. Conclusion

    Cognition's Semantic NLP, when applied to Search technology, returns the most relevant and
    complete results in the industry by employing linguistic techniques and huge semantic
    databases. Pattern-matching and statistically based technologies lack the knowledge of
    language that enables Cognition's Semantic NLP to outperform them so dramatically. Because
    of its vast knowledge of English, little or no customization is required. Search functions like
    those users are accustomed to are included, such as Boolean search (with conceptual Booleans)
    and fuzzy search. It comes with an easy-to-use indexer, and sample scripts for browser
    searching. Cognition's Semantic NLP has an API for embedding it in other software platforms.
    After a short time using Cognition Semantic NLP's patented technology, users are loath to go
    back to pattern matchers.


    Appendix: Glossary

    ambiguous word: an ambiguous word has more than one meaning. "strike" is ambiguous
    because it can mean "to hit", "to ignite", "to walk out of a job", "a good pitch in baseball", or
    other meanings. An unambiguous word has only one meaning, as in "deer".

    Bayesian: In search, a technique for classifying documents that uses the statistical theorem of
    Bayes. In brief, this theorem explains what the probability of one word is, given another. So if
    you've seen "bat" in a document, the probability of seeing "ball" is higher than the probability
    of seeing "rock".

    concept thesaurus: A traditional thesaurus lists all the synonyms of words in meaningful
    groupings, e.g.: "strike hit beat" as one group, "strike walkout protest" as another group. A
    concept thesaurus lists MEANINGS of words in groups, so if the first meaning of "strike" means
    "to hit or beat", then one thesaural group is "strike1 hit3 beat2" (assuming that the third
    meaning of "hit" and second meaning of "beat" mean the same thing). If the second meaning
    of "strike" means "walkout or protest", then another, independent, thesaural group would be
    "strike2 walkout1 protest2". Thus a concept thesaurus maps meaning to meaning, while a
    traditional thesaurus maps word to word, with no way of deciding which instance of a given
    word should be considered synonymous with which other words.
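
    The difference between the two kinds of thesaurus can be sketched as two data structures.
    The sense numbers come from the example above; the variable and function names are
    hypothetical, and this is not the layout of Cognition's actual lexicon.

```python
# Traditional thesaurus: maps a word to words, mixing all of its meanings together.
WORD_THESAURUS = {
    "strike": {"hit", "beat", "walkout", "protest"},
}

# Concept thesaurus: each group holds (word, sense number) pairs that share one meaning.
CONCEPT_GROUPS = [
    {("strike", 1), ("hit", 3), ("beat", 2)},
    {("strike", 2), ("walkout", 1), ("protest", 2)},
]

def synonyms_for_sense(word: str, sense: int) -> set:
    """Return the words that share a meaning with this particular sense of the word."""
    for group in CONCEPT_GROUPS:
        if (word, sense) in group:
            return {w for (w, s) in group if (w, s) != (word, sense)}
    return set()

labor_synonyms = synonyms_for_sense("strike", 2)   # only the "walkout" meaning
all_synonyms = WORD_THESAURUS["strike"]            # every meaning at once
```

    Looking up sense 2 of "strike" yields only "walkout" and "protest", while the word-level
    lookup also returns "hit" and "beat", which belong to a different meaning.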

    compound: a word that is formed from two or more identifiable words, e.g. "blackbird",
    "cookbook"

    concept: a word meaning (see word meaning)

    computational: the property of being an action carried out by a computer.

    derived form: a word that is derived through morphological rules, such as "rerun" or
    "derivation".

    domain: a subject area of human activity which has a subvocabulary of its own, such as law or
    medicine. Also, "domain" refers to a basic part of a URL address, which breaks down into three
    pieces, "site.domain.suffix", as in "www.cognition.com".

    fine granularity query: a fine-grained query is one which is looking for detailed information.
    For example, a fine granularity query would be "What is the cure for leaf wilt on fuchsias?". In
    contrast, a coarse granularity query (or general query) would be "How can I buy a car in Los
    Angeles?"

    GUI: graphical user interface, such as the basic interface for Windows.

    idiom: a fixed distinctive expression whose meaning cannot be deduced from the combined
    meanings of its actual words, such as "kick the bucket".


    incremental indexing: the ability to add a new document index to an existing document base
    index. This ability is an advance over many available indexers, which require users to
    completely reindex an entire document base when new documents are added to it.

    index: a computational representation of the location of words or concepts in a document
    base. This can include details such as word and sentence positions, or not.

    latent semantic indexing: A statistical technique for classifying documents based on counting
    co-occurrences of words. Documents that share many words are considered to have similar
    content and are classed together.

    linguistic algorithms: rules of language applied as computer algorithms. For example, given a
    word that pluralizes like "bat", add an "s" to make it plural, or remove an "s" to find the base
    form or stem.
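
    The pluralization rule just described can be written as a very small sketch. It covers only
    words that pluralize like "bat", plus a hypothetical one-entry table of irregular forms
    ("mouse"/"mice", an example this glossary uses under "morphology"); the function names are
    assumptions, and a real system would need a full lexicon of morphological features.

```python
# Hypothetical table of irregular plurals; a real lexicon would be far larger.
IRREGULAR_PLURALS = {"mouse": "mice"}
IRREGULAR_STEMS = {plural: base for base, plural in IRREGULAR_PLURALS.items()}

def pluralize(word: str) -> str:
    """Add "s" unless the word is listed as irregular."""
    return IRREGULAR_PLURALS.get(word, word + "s")

def stem(word: str) -> str:
    """Undo the rule: map irregular plurals back, otherwise strip a final "s"."""
    if word in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[word]
    if word.endswith("s"):
        return word[:-1]
    return word
```

    The stripping rule is deliberately naive: it would wrongly shorten a base form that happens
    to end in "s", which is one reason such rules are paired with a dictionary.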

    linguistics: the study of human language including elements, structure, rules and history.

    morphological feature: a property of a word that indicates how it changes form in specific
    syntactic situations. For example, the fact that the past tense of "catch" is "caught" is
    expressed in the dictionary with a morphological feature (see morphology).

    morphology: the rules of word formation. Such rules dictate that the past tense of "cook" is
    "cooked", while the past tense of "catch" is "caught", and that the plural of "bat" is "bats",
    while the plural of "mouse" is "mice". Morphology also controls the derivation of words from
    other words, such as the change from "derive" to "derivation", or from "run" to "rerun".

    overload: retrieval of many documents that are not relevant in response to a query; poor
    relevancy; poor precision

    parsing: combining the words in a sentence into a structure which elucidates or shows the
    syntactic role of each word in the sentence. For example, in "John loves pizza", the parser
    creates a structure showing that "John" is the subject and "loves pizza" is the predicate, and
    that within the predicate, "love" is the verb, and "pizza" is the direct object of the verb.

    pattern match (or string pattern matching): In search, deciding to retrieve a document based
    on an exact match of strings in the query and document. For a query "what is the form of the
    law?", a pattern-matching search engine would match documents containing phrases like
    "conforms to the law", "..inform him that the lawn", "the laws of nature dictate that
    water forms crystals at a temperature of"

    phrase: a meaningful sequence of words. "bok choy" is a phrase, while "the and" is not.

    phrase parser: a computer program that reads in text and determines which subsequences of
    sentences are phrases


    precision/recall: This is the same thing as relevancy/completeness. Precision is the percentage
    of a set of retrievals that is relevant. If there are ten retrievals, and 9 of them are relevant to
    the query, then precision is 90%. Recall is the percentage of the relevant documents in the
    target database that is retrieved in response to a query. If there are 10 documents that exist in
    the target database that would be relevant to a query, and the software retrieves 9 relevant
    documents, then recall is 90%.
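
    The two measures can be computed directly from the set of retrieved documents and the set
    of relevant documents. The sketch below reproduces the 90% figures from the definition; the
    document identifiers and function names are made up for the example.

```python
def precision(retrieved, relevant) -> float:
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant) -> float:
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical example: 10 relevant documents exist in the target database.
relevant_docs = [f"doc{i}" for i in range(1, 11)]
# The engine returns ten documents: nine relevant ones plus one irrelevant one.
retrieved_docs = relevant_docs[:9] + ["unrelated"]

p = precision(retrieved_docs, relevant_docs)  # 9 of 10 retrievals are relevant
r = recall(retrieved_docs, relevant_docs)     # 9 of 10 relevant documents were found
```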

    reasoning from the general to the particular: reasoning in a taxonomic tree from higher to
    lower nodes. For example, reasoning in the taxonomy example under the definition of
    "taxonomy", from "vehicle" to "car", or "vehicle" to "boat" (see taxonomy).

    relevant: In search, document content that addresses or is similar to the content of a query;
    apt; on target

    relevancy/completeness: This is the same thing as precision/recall. Relevancy is the
    percentage of a set of retrievals that is relevant. If there are ten retrievals in response to a
    query, and 9 of them are relevant, then relevancy is 90%. Completeness is the percentage of
    the relevant documents in the target database that is retrieved in response to a query. If there
    are 10 documents that exist in the target database that would be relevant to a query, and the
    software retrieves 9 relevant documents, then completeness is 90%.

    semantics: In linguistics, the rules for determining meaning of words, sentences and
    discourses. The meanings of individual words are typically listed in a lexicon or dictionary in a
    computational linguistic system.

    sense: An individual meaning of an ambiguous word. "strike" meaning "to hit or beat" is one
    sense of "strike".

    spider: a computer program that searches the internet by connecting from one URL to the
    next via the links in it.

    string: a sequence or pattern of letters. A string can be a word or not. "XXX" is a string, "cat"
    is a string.

    synograph: an alternate spelling of a word, such as "center" or "centre".

    syntactic feature: a property of a word that indicates how it functions in grammar. For
    example, the fact that a word is a noun is a syntactic feature. The fact that a verb requires a
    direct object is a syntactic feature. A verb which requires an object, used without one, forms
    ungrammatical sentences, as in "The girl wants."

    underload: retrieval of very few, if any, relevant documents in response to a query; poor
    completeness; poor recall


    taxonomy: a hierarchy of ISA relationships between concepts that form a tree, with the top of
    the tree the most general concept. For example, a car is a motor vehicle in the vehicle
    taxonomy shown at the end of this glossary.

    thesaural enhancer: A type of search engine that uses standard, non-conceptual thesaural
    groups to enhance query terms. If the query contains the word "strike", the thesaural enhancer
    adds terms from all the thesaural groups "strike" is a member of. So it would search on "hit,
    beat", but also "walkout, protest", "ignite", etc. Note that many of the synonyms are
    ambiguous here. The thesaural enhancer doesn't distinguish meanings, and doesn't know
    which meaning of "hit" or "beat" is relevant.

    word meaning: an individual sense of a word. A sense is a description or mental picture of the
    objects referred to by the word sense. For example, the word "bank" can either mean "a place
    where money is stored", or it can mean "the side of a river". The meaning of the sense "where
    money is stored" is a description of a typical bank with tellers, counters, little windows to the
    tellers, a guard, a vault, etc.

    word stem: the base form of a word with no morphological rules applied. "bat" is a base form,
    "bats" is not. "run" is a base form, "ran" is not and "rerun" is not.

    Vehicle taxonomy:

    vehicle
        motor vehicle
            car
            truck
        water vehicle
            boat
            ship
        space vehicle
            rocket
            spaceship
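
    Reasoning from the general to the particular over this vehicle taxonomy can be sketched as
    a walk down the tree. The dictionary below encodes the tree as read from the figure (the
    assignment of "rocket" and "spaceship" to "space vehicle" is an assumption about the
    original layout), and the function name is hypothetical.

```python
# Parent concept -> child concepts, transcribed from the vehicle taxonomy figure.
TAXONOMY = {
    "vehicle": ["motor vehicle", "water vehicle", "space vehicle"],
    "motor vehicle": ["car", "truck"],
    "water vehicle": ["boat", "ship"],
    "space vehicle": ["rocket", "spaceship"],
}

def descendants(concept: str) -> list:
    """Return every concept below the given node, from general to particular."""
    found = []
    for child in TAXONOMY.get(concept, []):
        found.append(child)
        found.extend(descendants(child))
    return found

# A query on "vehicle" can be widened to all of its more specific concepts.
vehicle_terms = descendants("vehicle")
```

    A query on "vehicle" expanded this way would also match documents that mention only "car"
    or "boat", which a pure pattern matcher would miss.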