Download - Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

Transcript
Page 1: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

1

Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

The Lion thought itmight be as well to frighten theWizard, so he gave alarge, loud roar, whichwas so fierce and dreadful that Toto jumped awayfromhiminalarmandtippedoverthescreenthatstoodinacorner.Asitfellwith a crash they looked thatway, and thenextmomentall of themwerefilledwithwonder. For they saw, standing in just the spot the screen hadhidden,alittleoldman,withabaldheadandawrinkledface,whoseemedtobeasmuchsurprisedastheywere.TheTinWoodman,raisinghisaxe,rushedtowardthelittlemanandcriedout,"Whoareyou?""I amOz, theGreat andTerrible," said the littleman, in a trembling voice."Butdon'tstrikeme--pleasedon't--andI'lldoanythingyouwantmeto.”

L.FrankBaum.TheWonderfulWizardofOz.ProjectGutenberg(2008)TheWizardofOzcapturesthelifetrajectoryofmanytechnologies,andsearchisnoexception:thetechnologystartsinthepublicdomain,itisappropriatedandmademysterious(hiddenbehindthecurtain)bycommercialinterests,thentoutedandoverblownasmagical,eventuallythecurtainfalls,andwecanseewhathasbeenthereallalong,andworkwithitmoreintelligentlyasitis,knowingwhatit’sgoodforandknowingwhatit’snotgoodfor.Thisistrueofsearch,asitistrueofauto-classification.ModernsearchtechnologyhasitsrootsinVannevarBusch’sseminal1945paper“Aswemaythink”anditsfoundationsweredevelopedbyresearchgroupsatMIT,CornellandHarvardintothe1970s.Oneoftheconsequencesofhidingatechnologybehindthecommercialcurtainisthatdifferenttechnologiescompetewitheachotherforthesamemarket,andthereforedownplaytheirperceivedcompetitors,orattempttoduplicatethefunctionalitiesoftheirperceivedcompetitors.Searchtechnologyhascompetedwithtaxonomyworkforyears–“youdon’tneedtaxonomies,”wentthemantraofoneinfamoussearchvendor,“oursoftwarewillunderstandyourcontentforyou.”ThecurtainfellwiththeunveilingofGoogleKnowledgeGraph,whenitbecameobviousjusthowmuchGooglewasinvestinginknowledgeorganisationstructures(controlledvocabularies,taxonomies,ontologies,knowledgegraphs)tomakeitssearchsmarter.Thesuperiorperformanceofopensourcesearchengines(andtheirstrikingadoptionrates)overcommercialsearchenginesinsolvingparticulartypesofproblemis,likeToto’sleap,anothermanifestationofthedroppedcurtain.Machineclassificationofcontenthasbeenanotherexample.Itisfrequentlytoutedbyitscommercialvendorsasasingletechnologyperformingasingle“we’lltagyour

Page 2: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

2

contentforyou”functionwhereitisactuallyabundleofquitedistincttechnologies,androotedininformationretrievalresearchandsearchtechnologiesdevelopedinthe1970sand1980s.Machineclassificationvendorshavefoundthemselvescompetingwithsearch(“wecanmakeyoursearchsmarter”)andtaxonomies(“wecanderivetaxonomiesautomaticallyforyou”).ThecurtainisnowstartingtofallwithlargeorganisationslearninghowtousetheopensourcetoolkitssuchastheUniversityofSheffield’sGATEtoolset,tosolvetheirsearchanddiscoveryproblems.The“goitalone”propagandaofthetechnologyvendorsisdamagingbecauseitlocksbuyersintoasingletoolsetdesignedforcommondenominators,withprohibitivecustomisationandmaintenancecosts.The“magicblackbox”propagandablindsorganisationstotheneedtobuildcapabilitiesinarangeoftools.Andthe“wedoeverything”propagandahidesthesynergiesandcapabilitiesthatcanbeattainedbyintelligentlyusingsearch,machineclassificationandknowledgeorganisationtechnologiesasatoolkit,withdifferenttoolsetstobeusedindifferentcombinationstosolvedifferentkindsofsearchanddiscoveryproblem.Thepurposeofthisarticleistoexplaintheusesandinteractionsofthreetechnologiesthattogethersupportsearchanddiscovery,sothatorganisationscanlearnwhichtoolstoadoptinwhichcombinationtosolvedifferentproblems,andtheycanseewhatkindofinternalcapabilitiestheyneedtobuildandmaintaintocontinuesolvingthoseproblems:

• Enterprisesearch• Taxonomyandontologymanagement• Textanalyticstoperformmachineclassification.

Enterprisesearchisatechnologyfordeliveringusefulandrelevantresultstousersexploringalargecontentbase.Asatechnologyitisrelativelysimpleandnotespeciallysmart,andtodeliversuperiorresultsincomplexenvironmentsitneedstobesupportedbyothertechnologies.“Complexity”canrefertothediversityofusercommunities,andtothediversityofthecontentitself.Searchenginesbythemselvesdon’tunderstandanything.Theycrawlandindex,processqueriesandserveupuseranalytics.Tobecomesmart,theyneedhumandesignedcurationtools(taxonomyandontologytools)andtoolsforscalingthehumantaggingofcontent(textanalyticstools).Taxonomyandontologymanagementtechnologiessupportdesigning,deliveringandmaintainingknowledgestructuresandcontrolledvocabulariestoenhancethesearchanddiscoveryexperience.Thesestructuresarebasedonananalysisofbusinessanduserneedstogetherwithananalysisofthetargetcontentforsearchanddiscovery,andtheyhelptodisambiguateconcepts,capturevariationsinlanguageandmapthemassynonyms,andidentifyrelationshipsbetweenconceptssothatsearchescanbeexpandedornarrowed.Theconstraintsoftaxonomyandontologymanagementmostoftenrelate(a)tostayinguptodatewithnewandemergingneedsofdiverseusersets(whichiswheresearchanalyticscanhelp),and(b)toscalingthewaythatconceptscanbeidentifiedincontentandmetadataenriched(whichiswheretextanalyticsandmachineclassificationcanhelp).

Page 3: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

3

Textanalyticsreferstoaclusteroftechnologiesthatextractmeaningfromtextandturnitintometadata.Someoftheroottechnologies(crawlingandindexing)aresharedwithsearch.Thesetechnologiesarearesponsetotheproblemsofhumaninconsistency(peoplearenotconsistentinhowtheyapplytagstocontent)andofthescaleofcontenttobetagged(whenitwouldnotbefeasibletoapplyhuman-curatedtaggingtolargeamountsofcontentinashortperiodoftime).Differenttechnologiesareusedtosupplydifferenttypesoftaggingoperation,rangingfromidentificationofknownentitiessuchaspeople,organisationsandplaces,quantitiessuchasmoneyamounts,totheidentificationofabstractconceptsthatcharacterizewhatadocumentis“about”inwaysthatmakesensetotargetusers.Theconstraintsoftextanalyticsare(a)thatithasahardtimefiguringoutwithoutintensivehumancurationorverywellstructuredtrainingsetswhichconceptsaremostsalienttothemostpeople(whichiswheretaxonomyandontologymanagementhelps)and(b)likesearch,itrequiresconstanttuningbasedonemerginguserneeds(whichiswheresearchanalyticsonuserbehaviourscanhelp).Thesetechnologiescanbeextremelypowerfulwhenusedincombinationinsupportofthesearchanddiscoveryexperience.However,itisimportanttounderstandwhattheyarecapableof,howtheywork,andthelimitationsofwhattheycando.FigureA1showsadetailedviewofthethreepillarsofthesearchanddiscoverytechnologystack,howtheywork,andtheirkeyinteractions.Webeginatthetopofthediagram.Sincethesetechnologiesaresupposedtobedeployedintheserviceofuserneeds,decisionsabouttheirconfigurationanddesignneedtobemadeinthecontextofknownuserneeds,aswellasathoroughunderstandingofthecontenttobecovered.Notethatthisisasimplifiedconceptualframework,anditdoesnotfullyrepresentthecomplexitiesoftherelationshipsbetweenthepillars,noralltheoverlapsbetweenthem.

Page 4: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

4

Figu

reA1De

tailview

ofthese

archand

disc

overytechno

logystack

Page 5: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

5

“Content”canbestructured,semi-structuredorunstructured.Itcanbeintheformofdata,documents,webcontent,audiovisualfiles,discussions,orquestions.“People”canalsobeconsideredaformofcontentsinceknowledgeablepeoplemayalsobeausefulresultinssearchonaparticulartopic.Inacontentmanagementsystemapersonprofilecanbeusedasaproxyforthatperson,andcanbetaggedwithrelevantsubjecttagssoastosurfaceinsearchalongsideotherformsofcontent.Allthreetechnologiesinthestackdependuponathoroughsurveyofthetargetcontenttobecovered,andpreparationandstructuringofthecontentstoressothatthesearch,taxonomymanagementandtextanalyticstechnologiescanaccessthemandoperateuponthem.

Enterprise Search Let’sbeginwiththecentralpillar,theenterprisesearchstack.Therearesixmaincomponentsinthesearchstack.

CrawlingThefirstcomponentisthecrawlingmodule.Thisisdirectedtocontentsources,anditcrawlsthecontentinpreparationforindexing.Youmaywanttobeabletodirectthecrawlertospecificpartsofyourcontent.

ContentProcessingContentisthenprocessed,usingtoolsincommonwithinitialprocessingofcontentbytextanalyticstools.Atextfileiscreatedforparsingandindexing,andthecontentistokenized,meaningeachelementisgivenauniqueidentifiersothattheycanbemanipulatedandlocatedlateron.Thisisalsothepointatwhichlemmatization(resolvingvariantgrammaticalformssuchastensesandpluralstothesamegrammaticalroot)andstemming(resolvingdifferentwordendingstothesameword-stem)occurs.Youmayalsowanttoremove“stopwords”thatwillcreatemeaninglessnoise(commonoperatorssuchasif,but,and,the,a)fromtheindexing.Yourstopwordsmayneedtobetuned.Sometimesstopwordsthatarenoisyinsearchindexesprovidesemanticcluesinsomerules-drivenapplicationsoftextanalytics.

IndexingTheindexingmodulecreatestermindexesfromtheparsedtext.Eachtermislinkedtoitssourcecontent,andtokenizationprovidesthecontextinwhichthesearchtermappears.Thepre-processedindexesprovidespeedinsearch.Crawlingandpre-processingoflarge-scalecontentcollectionscanbeveryslowandoccupiessignificantprocessingcapacity.Hencemostindexingconsistsofincrementalupdatestotheindexesbynoticing,crawlingandindexingnewcontentwheneveritisadded.Itisveryimportanttoplantheconfigurationandsetupofyoursearchenginecarefullyinadvance.Anymajorchangestothesearchengineconfiguration,ortothe

Page 6: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

6

wayitexploitstaxonomyortextanalyticsmayrequireanexpensiveandslowcompletere-crawlandre-index.

Auto-categorizationSomesearchenginesalsooffersimpleauto-categorizationfunctions.Intheabsenceofsophisticatedtextanalyticstools,theseareusuallyentityrecognitionbasedonlookupsofreferencetermlists.Theselistscanbemaintainedwithinthesearchengineitself,orcanbesuppliedfromataxonomymanagementsystem.Thisisreferredtoas“knownentityextractionvialookup”.Theentitiesyouareinterestedincallingout(forexamplecompanynamesorcountrynames)arecompiledintolists,andareregisteredassignificantentitiesifthemainsearchindexescomeacrossthesetermsortheirsynonyms.Thiscanprovideaverysimplefilteringcapabilitybasedontheentitytypesyouareinterestedin.However,thisisafairlybasictechnology.Justbecauseadocumentmentions“Google”doesnotmeanthatthedocumentissubstantially“about”Google.

QueryProcessingMuchoftheutilityofasearchengineliesinthesophisticationwithwhichithandlesqueriesandqueryinterfaces.Queryinterfacescanincludetheubiquitoussearchquerybox,browsingtaxonomystructureswhereconceptsarepre-linkedtocontentthroughstoredsearches(requiresintegrationwithataxonomymanagementmodule),filteringofsearchresultsusingmetadataelementsortaxonomyfacets(againrequiresintegrationwithtaxonomy).Manysearchenginesofferauto-completeorauto-suggestofsearchqueriessuggestingcommonsearchqueriesastheusertypestheirqueryintothesearchbox.Auto-completecanexploitcommonsearchqueries,existingtaxonomyorthesaurusterms(linkedtoresultsthroughstoredsearches),and/orkeywordsincontextfrommetadatasuchastitlesofhighvaluecontent.Therelevanceofsearchresultsforcommonqueriescanbetunedattheback-endbyasearchmanagerbyalteringtherulesthatcalculatethelikelyrelevanceofagivenpieceofcontentforagivenquery.Wherethereisaknown“best”pieceofcontentforagivenquery,thiscanbepromotedtothetopoftheresultspage.Relevancetuningcanalsobeautomated,bylookingatclickbehaviorsbetweenqueryandclickingonresultsandpromotingcontentthatisfrequentlyclickedon.Relevance-tuningcanexploitwhatisknownabouttheusersandtheircontexts.Knowingthatauserissearchingfromwithinagivenorganizationalfunctionenablesrulestobewrittenforhowqueriesfromthoseusersshouldbehandled,fromzoningthesearchtoknowntargetcontentforthatfunctionalgroup,tohandlingrelevancecalculationbasedonmetadataassociatedwiththosefunctionsandusers,andmatchingitwiththeindexedcontent.Contentrankingmechanismscanalsobepulledintothemixed.

Page 7: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

7

SearchAnalyticsQueryhandlingandsearchtuningrequiresophisticationinthesearchengineitselfaswellasarelativelysophisticatedsearchmanagementcapability–i.e.thewaythatsearchisstaffed,administered,configuredandconstantlytunedinrelationtoknownuserneeds.Searchanalyticsisthemodulethatpowersthissophisticationandtuning.Searchengineshaveverypowerfulreportingcapabilities,trackinghowqueriesareconductedanduserinteractionswithresultspages,butasinallcomplexreportingcapabilities,thesmartslieinhowthereportsaredefinedandconfigured.Whatwillyoutrack,how,andhowfrequently?Howwillyoubackuphunchesgatheredfromthereports,andvalidatethemwithusers?Theinitialsearchdesignstrategybasedonusecasesandcontentanalysiswillgiveyouastartingpoint.Thiswillberefinedbytheinsightsyougainbyobservinguserbehaviorsthroughthesearchroll-outandsubsequentmaintenance.

Taxonomy and Ontology Management Likesearch,taxonomyandontologymanagementrequiresveryrobustanalysisofhowusersneedtointeractwithcontent,andanalysisofthecontentitselftounderstandhowitisdescribed,organizedandused.Taxonomydevelopmentandmanagementisstillbestconductedasaworkofskilledhumandesign,butitcanbesubstantiallyenhancedwiththeappropriatetechnologies.Therearetwobroadfunctionswithinenterprisetaxonomyandontologymanagementsystems.

TaxonomyBuildingandMaintainingEnterprisetaxonomymanagementsystemssupporttheworkofbuildingandmaintainingtaxonomiesbyprovidingworkflowsandmetadataaroundtaxonomyconcepts.Forexample,theycanhousesourcevocabulariesofvaryingauthorityandstructurethatcanbeusedtogeneratecandidatetermsforthetaxonomy.Thereareapprovalworkflowstotracktheapprovalprocess.Theremayberoles-basedpermissionsforcomplexenvironmentswheremultipletaxonomyprofessionalsareinvolvedinmaintainingthevocabularies.Sourcesandusagesoftaxonomytermscanbetrackedasattributesofthetaxonomyconcepts.Scopenotescanbeaddedtoexplicatetaxonomyterms,andtherevisionhistoryoftermsandtaxonomyfacetscanbetracked.

ReferenceStoreThesecondmajorfunctionofanenterprisetaxonomymanagementsystemistoactasareferencestoreforothersystems,supplyingtaxonomyterms,relationshipsbetweenterms,synonymsandcontrolledvocabulariestoothersystems.Itcansupply:

• controlledvocabularytermstocontentmanagementsystemstosupportmanualtaggingofcontent

• termsandrelationshipstosupportauto-categorizationviaknownentitylookup,whetherappliedwithinasearchengineoratextanalyticsengine;

Page 8: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

8

• synonymsandrelationshipstosearchsothatitcanimproverelevanceandprecisionofresults,andsupportexpansionofsearch

• taxonomyfacetstosearchtosupportfilteringofsearchresults.Becausetheyarebuilttomanipulatedefinedrelationshipsbetweenconceptsandtheirattributes,taxonomyandontologymanagementsystemscanalsoactasbrokerstootherLinkedDataorLinkedOpenDatasources,tocallauthoritytermsandrelationshipsfromelsewhere,tosupportsearchand/ortosupportenrichmentofcontentviatextanalytics.Finally,enterprisetaxonomyandontologymanagementsystemsarebuilttoactastaxonomy/ontologyhubscateringtomultiplesystemsandaudiences.Theyarecapableofinteractingdifferentlywithdifferentplatformsandaudiences.Forexample,differentversionsofthesamecoretaxonomycanbepresenteddifferentlytodifferentaudiences,dependingontheirneeds.Differentvocabulariesandtaxonomyservicescanbesuppliedtodifferentsystemsbasedontheircontentanduses.

Text Analytics Textanalyticsreferstoaquitediversefamilyofcomputer-assistedtechniquesforassigningmeaningtocontent.Differentmodulescanbeusedtoservedifferentpurposes.

ContentProcessingTherearesomebasiccontentprocessingfunctionsthattextanalyticshasincommonwithenterprisesearch:preparationoftextforparsing,tokenization,lemmatization,stemming,stopwords.

SyntacticandSemanticPre-ProcessingSeveralfunctionalapplicationsoftextanalyticsdependontwofoundationalprocessingtechniques.Thefirst,syntacticanalysis,analyzesthelanguagestructureoftextgrammatically,usingverylargeopensourcetrainingsets.Sentencesplittersbreakupsentencesintopartsofspeech,distinguishingnounsandnounphrasesfromverbsandoperators,andsoon,andsyntacticanalyzerscreatealogicalrepresentationofthesentences.Semanticanalysisattemptstoinfermeaningfromanalyzedtexts.Likesyntacticanalysisitdependsonreferencetrainingsets.Itenablescomputerstoinfermeaningfulassertionsandfactsfrombodiesoftext,andunderpinsthebasetechnologiesbehindmachinetranslation.Syntacticanalysisisbasedonknownrulesofgrammar.Semanticanalysishasamuchmorechallengingtask,whichistoinferhuman-understandablemeanings.Notallgrammaticallycorrectsentencesaremeaningfultohumans–"Colorlessgreenideassleepfuriously"isafamousexample

Page 9: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

9

presentedbyNoamChomskyofasyntacticallycorrectsentencethatissemanticallyincoherent.Syntacticandsemantictechniquesunderpintheidentificationof“entities”withincontent.Inthecontextoftextanalytics,an“entity”canbeaperson,athing,aplace,atime,atopic.Itisanydistinctconceptthatcanbeidentifiedfromtext.Identificationandextractionofentitiescanbedonemostsimplybysimplylookingupreferencelists,asinauto-categorizationofknownentitiesprovidedbysomesearchengines.However,withsyntacticandsemanticanalysistextanalyticscanalsoidentifyunknown(unpredicted)entitiesforwhichyoudon’tyethaveexamplesinyourlists.ForexampleinEnglishpropernamesarecapitalized,andpersonalnameshaveknownforms(thatalsovarybyculture).Hencecandidatesforpersonsmentionedintextcouldbepickedupbyarulethatsays“lookforoneormorecapitalizedsubjectsorobjectsofsentences,andthenlookforlemmatizedverbsfromthefollowinglist‘say’…‘reply’…etc”.Suchrulesarebasedoncommonusageasrepresentedintheirtrainingsetsandaretypicallyprovidedasstandard.Theseruleswilltypicallyneedtobetestedandrefinedbasedonlocalusage.Forexample,wefoundthatonestandardtooldidnotpickupcountryoforiginincontentfromanimmigrationandcustomsauthority,becauseitwasusingsimpleterm-phraselookup.However,customsofficersatcheckpointshabituallyusedtheconvention“[Country]-registeredvehicle”.Thehyphenationandextensionofthetermdisruptedthesimplelookupfunction.Noticethatthesolutiontothisproblemisacombinationoftechniques,likelysupplementingknownentitylookupwithsyntacticallydrivenentityextractionrules.Thepracticeofusingdifferenttechniquesincombinationischaracteristicoftextanalyticsapproaches.Withthesefoundationalcapabilities,therearemanyspecializedapplicationsoftextanalytics,notallofwhichwillbesuitabletoeverypurpose.

Auto-categorizationTheabilitytoidentifybothknownandunknownentities/conceptsisasignificantimprovementinauto-categorizationcapabilities.Lookuplistsforknownentities/conceptscanbesupportedbyataxonomymanagementsystem,andstrengthenedthroughthesupplyofsynonyms.Conversely,throughthejudicioususeofrules,textanalyticscandiscoverentities/conceptsmentionedintextthatarenotyetinthetaxonomyorcontrolledvocabularies,andcanbeconsideredascandidateterms.

TextMiningTextminingreferstoanumberoftechniquesforanalyzingtextbasedonstatisticalalgorithms.Inverybroadterms,largebodiesoftextaresubmittedtoiterative

Page 10: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

10

operationsusingalgorithmsto“sort”thetextintoclustersor“buckets”thatareinternallystatisticallysimilar,andstatisticallydissimilartotheotherclustersorbuckets.Statisticaltechniquesarealsousedtodeterminethewordsorword-phrasesthataremostlikelytouniquelyrepresenttheirbucketcomparedtootherbuckets.Thesewordstringsareoftencalled“topicmodels”.Thelikelihoodofspecifictermsoccurringinproximitytoeachothercanalsobecalculated.Textminingissometimestoutedasafully-automatedalternativetodescribingandtaggingcontent.Inpracticehowever,theclusteringtechniquesusefuzzyalgorithms,andtheend“described”stateishighlyinfluencedbywhichclusteringpathwaysaretakenatthebeginningoftheoperation.Textminingtechniquessufferfromnon-reproducibility(thesameseriesofoperationsonthesamecontentcanproducedifferentresultsondifferentoccasions)andfromnon-comprehensibility–thetopicmodelsgeneratedoftendonotmatchhowusersthemselvesdescribethecontent.Soinpractice,applicationsoftextminingareheavilytunedbytheirhumanoperators,anditisnotalwaystransparenthowagivensetofresultsareachieved.Textminingalsosuffersfromendogeneity–meaningthatthetagsthatareextractedtodescribesimilarbucketsofcontentarewhollyderivedfromthatcontentset(orwhentrainingsetsareusedforcomparison,fromthetargetcontent+thetrainingset).Endogeneityisaproblemwhenyouwanttosurveydiversecollectionsofcontentandmapmeaningsconsistentlyacrossthem.Inthatcase,ataxonomyorontologythatcrossesthosecontentsetsandaudiencesprovidesanexogenousreferencepointthatcanbrokermeaningsacrossdifferentcontentsetsanduseraudiences.However,textminingcanbeapowerfultoolusedincombinationwithsearchandtaxonomymanagement.Itsstatisticaltechniquestoclusterbasedonsimilaritycanproposenewcategorizationsandpotentiallyrelevantrelationshipsbetweentopicstothetaxonomymanager.Itsabilitytoextracttopicmodelscanbeusefulinautomatedsummarizationtechniques.

SemanticAnalysisWehavealreadydiscussedsomeofthechallengesofcomplexityfacingsemanticanalysis,notleasttheheavyinfluenceofcontextandculture(factorsexternaltothecontent)onmeaning.Inpracticaltextanalyticsapplications,semanticanalysisreliesonanumberoftechniquesandnotjustcomputationalsemanticanalysis.Forexample,oneofthemainapplicationsofsemanticanalysisistoextractassertionsorfactsfromcomplexdocumentation.Ifsemanticanalysishasaccesstoontologiesviaataxonomymanagementsystemitcanthenenrichitsinferencesbasedonknownandvalidatedrelationshipsbetweenentities.

Page 11: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

11

SentimentAnalysisSentimentanalysisisaveryspecializedapplicationoftextanalytics,oftenusedinmarketing,socialmediaanalysis,andqualitativefeedback.Itlooksatthesemanticsofpositive,negativeandneutralexpressions,inassociationwithknown(extracted)entities.Againitishighlycontextual,andsubjecttoover-simplification–forexample,sarcasticlanguagecanbesuperficiallypositive,butthelinguisticmarkersofsarcasmareoftennotveryobvious.Aswithmosttextanalyticstechniques,therulesandalgorithmsneedtobetested,validatedandconstantlysupervisedforaccuracy.

SummarizationAutomaticsummarizationoftenexploitsknowndocumentordatastructures(itknowswheretolookforkeycontent),anditmayexploitsyntacticanalysis(e.g.removingextraneous“padding”languagesuchasadjectivesandadverbs).Itmayusesemanticanalysistoextractkeyfactsfromdocuments.

How the Pieces of the Stack Work Together Let’sbringthisbacktogethertogetabetterunderstandingofhowthepillarsofthestackworktogether.FigureA2illustratesthehighlevelviewofhowthedifferentpillarsofthesearchanddiscoverytechnologystackinteractandworktogethertoproducesuperiorresultsfororganisations.Thefocusforhowthestackisdeployed,andwhichparticulartechnologycomponentsareused,alwaysspringsfromuseranalysis,andanunderstandingofthequeriesthatusersmake,againstanunderstandingofthetargetcontent,howitisstructured,howandwhyitisproduced,howitiscurrentlybeingused,andhowitcouldpotentiallybeused,ifthesearchanddiscoveryfunctionalitycouldbeimproved.Indigitalenvironments,“targetcontent”canbeanycombinationofdocuments,webcontent,data,dashboards,andpeopleprofiles.TheconfigurationoftheelementsoftheSearchandDiscoverystackspringsfromanunderstandingofuserneedsagainstanunderstandingoftargetcontent.Determininghowtousethepillarsofthestacksitswithinalargerdiagnosticanddecision-makingframework.

1. Whatisthebusinessproblem(orsetofbusinessproblems)youaretryingtosolve?Theanswertothisquestioncanoftencomethroughaknowledgemanagementdiagnosticsactivitysuchasaknowledgeaudit,combinedwithanunderstandingoftheorganisation’sbusinessdirectionandstrategy.

2. Whoarethekeyusercommunitiesinrelationtothebusinessproblem,andhowwillyouunderstandtheirworkingneedsandopportunitiesforimprovement?(Thiscanalsocomefromaknowledgeaudit).

3. Whatisthetargetcontent(theknowledgeresourcesthatyourtargetusercommunitiesneedstoworkwithmoreeffectively),whereisit,howisitstructuredandusedcurrently?(Contentanalysisandmodelingcanhelp).

Page 12: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

12

4. Whattoolsfromwithinthesearchanddiscoverystackcanhelptoaddresstheproposedsolution?(Thefocusofthispaper)

5. Whatneedstochangeonthebehavioural,processandgovernancelevelsforthenewsolutiontowork?(Forsustainableimplementationplanning,checkoutTheKnowledgeManager’sHandbook(MiltonandLambe,2016)fromaknowledgemanagementperspectiveorMaishNichani’sseriesofarticlesathttp://olasearch.com/articlesfromasearchdesignperspective).

FigureA2Interactionsbetweentaxonomymanagement,searchandtextanalytics

SearchandTaxonomySearchistheuser-facingcomponentofthestack.Itmediatescontenttousers.Taxonomymanagementsuppliesknownsalientconceptsandrelationshipstosearch,andittellssearchwhichquerytermsaresynonymsofthesametaxonomyconcepts.By“salient”wemeanthattheconceptsarefrontofmind,action-relatedconceptsintheheadsofusersastheyperforminformationandknowledgeseekingactivitiestocompleteworktasks.Taxonomiesmakesearchsmarterbytellingitwhichconceptsareimportanttousers,howtheyarerelatedtootherconcepts,whetherthroughhierarchicalorotherrelationships.Hencesearchcanexploittaxonomiestohelpuserstorefineorexpandtheirsearch,tofollowmeaningfulpathwaysandexplorecontent.Taxonomiesandontologiesenhancetherelevanceandprecisionofsearchforusers.Becausesearchisuserfacing,itgathersalotofdataaboutuserneedsandbehaviors.Frequencyanalysisofsearchqueriestellsthetaxonomistalotabouthowusersthinkabouttheircontentandtheirsearchrequirements.Clickthroughactivity

Page 13: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

13

onsearchresultspagesalsoprovidesevidenceaboutuserperceptionsofwhatmayormaynotbeuseful.Hencesearchanalyticsprovidesafeedbacklooptohelpthetaxonomymanagerconstantlyrefineandimprovethetaxonomytouserneeds.Thisisapositivefeedbackloop,wheretaxonomyprovidesrelevanceandprecisiontosearch,andsearchprovidesaconstantflowofrealtimelarge-scaleevidenceonwhatisactuallyrelevantandusefultousersatthepresenttime.Searchanalyticscanalsopickupemergingtermsandconceptsthatareimportanttousers,andproposethemascandidatetermsfortaxonomiesandtheirsupportingthesauri.Withouttaxonomymanagement,searchisrelativelydumb,whetherintermsofwhatitcanindex,howitcanhelpusersmaketheirqueries,andhowitcanhelpusersactupontheirresults.Withoutsearch,taxonomymanagementlacksaconvincingmediumtodemonstrateitsvalue,anditlacksacontinuousflowofevidencetomaintainitsrelevanceandusefulnesstousers.

SearchandTextAnalyticsTextminingcanproposenewconceptsandnewassociationsbetweencontentitemsforthesearchmanagertobuildintoqueryprocessingandrelevancytuning.Semanticanalysisandsummarizationtechniquescanbeusedtoprovide“smart”summariesofcontentitemstobepreviewedinsearchsnippetsandinsearchresultspages,makingiteasierforuserstoassesswhetherornotthecontentisrelevanttotheirneed,beforetheyopenthecontentitem.Textanalyticscanextendthewayinwhichcontentcanbepulledtogetherinspecializedsearchbasedapplications.Allthreepillarsofthestack(search,taxonomyandtextanalytics)dependonconstanttuningoftheinfrastructuretodelivergoodsearchanddiscoveryresults.Textanalyticsinparticulardependsonrulestoguidehowthecontentisprocessedandtagged.Theserulesneedmaintenanceandtuning.Thistuningisnecessarybecauseoftheconstantongoingchangesincontent,context,priorities,usersandneeds.Searchanalyticsprovidealargescale,real-timewindowintouserneedsandbehaviors,andso,aswithtaxonomymanagement,itprovidestheevidencebaseneededtokeepthetextanalyticsstacktunedtodeliveraccurateandusefulresultsforusers.Whentherearelargeandcomplexknowledgebases,withouttextanalytics,searchfindsitdifficulttopickoutsalientconceptsrelatedtocontent.Aswithtaxonomymanagement,textanalyticsdependsondeepuserandcontentknowledgetoknowhowtoconfigureandtuneitsoperations.Withoutsearch,textanalyticsfindsithardtodothisunlesstheteamispreparedtoinvestinconstant(andexpensive)userresearchandtesting.

TextAnalyticsandTaxonomyManagementTaxonomiesaredesigned,evidence-basedartifacts.Assuch,theyconsistentlylagthepaceofchange.Theyareoptimizedtoexploitknownanddocumentedconcepts.Textanalyticstechniquessuchasauto-categorizationandtextminingcansupplycandidatetermsandrelationshipstothetaxonomymanagerwithouttheneedfor

Page 14: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

14

comprehensivere-surveyingofcontentanduserneeds.Theycanpickupemergingconceptsinthecontent,justassearchanalyticspicksupandproposesemergingconceptsevidencedinsearchqueries.Sotextanalyticscanactasapowerfultaxonomyenrichmenttool.Taxonomiesandontologiescanguideandstrengthentheapplicationoftextanalyticstechniquessuchasauto-categorization(knownentityextractionthroughlookup),factextractionthroughsemanticanalysis,andsentimentanalysis(byhelpingthesentimentengineresolvedifferentsynonymstothesameentitybeingreferencedinthecontent).Whentherearelargeandcomplexknowledgebases,withouttextanalytics,taxonomytagscannoteasilyandconsistentlybeappliedtolarge,diversebodiesofcontent.Withouttaxonomymanagement,textanalyticsstrugglestodescribecontentintermsthatmakesensetokeyusersandinthecontextofuseractivities.Machinetaggingtechniquesalonedonotdescribecontentintermsthathumansintuitivelyunderstand,withoutsomeformofhuman-curatedfilter.Taxonomymanagementprovidesthis.

Implications Fortoolongorganisationshavebeenlisteningtotheboomingvoicegivingmagicalreassurancesfrombehindthecurtain,andhaveneglectedtheirowncapabilities,andthetoolsetsthatexistintheopensourcearena(thatcommercialproductsalsoexploit).Astheworldbecomesmorecomplex,solutionstosearchanddiscoveryproblemswillbecomemorediverse.Weneedtobecomebetterjourneymen,moreknowledgeableaboutthecapabilitiesdifferenttoolsetsthatareavailable,sothatwecanchoosetherighttoolsforanygivenproblem.Someofthemwillbecommercialtools,acquiredfortheirexcellenceinaspecificsetoffunctions.Someofthemwillbeopensourcetools,acquiredandtunedtosolvespecifickindsofproblem.Theageofsinglesource“doitall”commercialplatformsisover.Thispaperhasproposedthatknowledgeofthesearchanddiscoverytechnologystackisanimportantcapabilitytoacquire.Itneedstobeatleastadequatetothejobofevaluatingneedsandpossibilities,andofaskingvendorssearchingquestionsabouthowtheirtechnologyworks,whatitisgoodat,andwhatitslimitationsare.Insomecases,organisationswillwanttointernalizeadeepertechnicalcapabilityinworkingwithanumberofsetsoftools.Thereareotherimportantsearchanddiscoveryrelatedcapabilitiesthatenterprisesneedtoacquirebeyondaknowledgeofthesearchanddiscoverytechnologystack–forexample,howuserneedsandcontentcanbeanalysedtodetermineprioritiesandopportunities,howcontentcanbemanagedandenrichedtoprovideoptimaluserexperiences,andhowbusinessproblemscanbetransformedthroughdesignprocessesintoeffectivesolutionsforbusinessneeds,problemsandopportunities.

Page 15: Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

15

I would like to acknowledge the following colleagues who have provided valuable insights and advice informing this paper: Dave Clarke, Ahren Lehnert, Agnes Molnar, Maish Nichani and Tom Reamy. Patrick Lambe October 2016-January 2017