Download - Search and Discovery Technology Stack v4 - Green … › uploads › Search_and_Discovery...1 Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

1

Behind the Curtain: Understanding the Search and Discovery Technology Stack Patrick Lambe

The Lion thought itmight be as well to frighten theWizard, so he gave alarge, loud roar, whichwas so fierce and dreadful that Toto jumped awayfromhiminalarmandtippedoverthescreenthatstoodinacorner.Asitfellwith a crash they looked thatway, and thenextmomentall of themwerefilledwithwonder. For they saw, standing in just the spot the screen hadhidden,alittleoldman,withabaldheadandawrinkledface,whoseemedtobeasmuchsurprisedastheywere.TheTinWoodman,raisinghisaxe,rushedtowardthelittlemanandcriedout,"Whoareyou?""I amOz, theGreat andTerrible," said the littleman, in a trembling voice."Butdon'tstrikeme--pleasedon't--andI'lldoanythingyouwantmeto.”

L.FrankBaum.TheWonderfulWizardofOz.ProjectGutenberg(2008)TheWizardofOzcapturesthelifetrajectoryofmanytechnologies,andsearchisnoexception:thetechnologystartsinthepublicdomain,itisappropriatedandmademysterious(hiddenbehindthecurtain)bycommercialinterests,thentoutedandoverblownasmagical,eventuallythecurtainfalls,andwecanseewhathasbeenthereallalong,andworkwithitmoreintelligentlyasitis,knowingwhatit’sgoodforandknowingwhatit’snotgoodfor.Thisistrueofsearch,asitistrueofauto-classification.ModernsearchtechnologyhasitsrootsinVannevarBusch’sseminal1945paper“Aswemaythink”anditsfoundationsweredevelopedbyresearchgroupsatMIT,CornellandHarvardintothe1970s.Oneoftheconsequencesofhidingatechnologybehindthecommercialcurtainisthatdifferenttechnologiescompetewitheachotherforthesamemarket,andthereforedownplaytheirperceivedcompetitors,orattempttoduplicatethefunctionalitiesoftheirperceivedcompetitors.Searchtechnologyhascompetedwithtaxonomyworkforyears–“youdon’tneedtaxonomies,”wentthemantraofoneinfamoussearchvendor,“oursoftwarewillunderstandyourcontentforyou.”ThecurtainfellwiththeunveilingofGoogleKnowledgeGraph,whenitbecameobviousjusthowmuchGooglewasinvestinginknowledgeorganisationstructures(controlledvocabularies,taxonomies,ontologies,knowledgegraphs)tomakeitssearchsmarter.Thesuperiorperformanceofopensourcesearchengines(andtheirstrikingadoptionrates)overcommercialsearchenginesinsolvingparticulartypesofproblemis,likeToto’sleap,anothermanifestationofthedroppedcurtain.Machineclassificationofcontenthasbeenanotherexample.Itisfrequentlytoutedbyitscommercialvendorsasasingletechnologyperformingasingle“we’lltagyour

2

contentforyou”functionwhereitisactuallyabundleofquitedistincttechnologies,androotedininformationretrievalresearchandsearchtechnologiesdevelopedinthe1970sand1980s.Machineclassificationvendorshavefoundthemselvescompetingwithsearch(“wecanmakeyoursearchsmarter”)andtaxonomies(“wecanderivetaxonomiesautomaticallyforyou”).ThecurtainisnowstartingtofallwithlargeorganisationslearninghowtousetheopensourcetoolkitssuchastheUniversityofSheffield’sGATEtoolset,tosolvetheirsearchanddiscoveryproblems.The“goitalone”propagandaofthetechnologyvendorsisdamagingbecauseitlocksbuyersintoasingletoolsetdesignedforcommondenominators,withprohibitivecustomisationandmaintenancecosts.The“magicblackbox”propagandablindsorganisationstotheneedtobuildcapabilitiesinarangeoftools.Andthe“wedoeverything”propagandahidesthesynergiesandcapabilitiesthatcanbeattainedbyintelligentlyusingsearch,machineclassificationandknowledgeorganisationtechnologiesasatoolkit,withdifferenttoolsetstobeusedindifferentcombinationstosolvedifferentkindsofsearchanddiscoveryproblem.Thepurposeofthisarticleistoexplaintheusesandinteractionsofthreetechnologiesthattogethersupportsearchanddiscovery,sothatorganisationscanlearnwhichtoolstoadoptinwhichcombinationtosolvedifferentproblems,andtheycanseewhatkindofinternalcapabilitiestheyneedtobuildandmaintaintocontinuesolvingthoseproblems:

• Enterprisesearch• Taxonomyandontologymanagement• Textanalyticstoperformmachineclassification.

Enterprisesearchisatechnologyfordeliveringusefulandrelevantresultstousersexploringalargecontentbase.Asatechnologyitisrelativelysimpleandnotespeciallysmart,andtodeliversuperiorresultsincomplexenvironmentsitneedstobesupportedbyothertechnologies.“Complexity”canrefertothediversityofusercommunities,andtothediversityofthecontentitself.Searchenginesbythemselvesdon’tunderstandanything.Theycrawlandindex,processqueriesandserveupuseranalytics.Tobecomesmart,theyneedhumandesignedcurationtools(taxonomyandontologytools)andtoolsforscalingthehumantaggingofcontent(textanalyticstools).Taxonomyandontologymanagementtechnologiessupportdesigning,deliveringandmaintainingknowledgestructuresandcontrolledvocabulariestoenhancethesearchanddiscoveryexperience.Thesestructuresarebasedonananalysisofbusinessanduserneedstogetherwithananalysisofthetargetcontentforsearchanddiscovery,andtheyhelptodisambiguateconcepts,capturevariationsinlanguageandmapthemassynonyms,andidentifyrelationshipsbetweenconceptssothatsearchescanbeexpandedornarrowed.Theconstraintsoftaxonomyandontologymanagementmostoftenrelate(a)tostayinguptodatewithnewandemergingneedsofdiverseusersets(whichiswheresearchanalyticscanhelp),and(b)toscalingthewaythatconceptscanbeidentifiedincontentandmetadataenriched(whichiswheretextanalyticsandmachineclassificationcanhelp).

3

Textanalyticsreferstoaclusteroftechnologiesthatextractmeaningfromtextandturnitintometadata.Someoftheroottechnologies(crawlingandindexing)aresharedwithsearch.Thesetechnologiesarearesponsetotheproblemsofhumaninconsistency(peoplearenotconsistentinhowtheyapplytagstocontent)andofthescaleofcontenttobetagged(whenitwouldnotbefeasibletoapplyhuman-curatedtaggingtolargeamountsofcontentinashortperiodoftime).Differenttechnologiesareusedtosupplydifferenttypesoftaggingoperation,rangingfromidentificationofknownentitiessuchaspeople,organisationsandplaces,quantitiessuchasmoneyamounts,totheidentificationofabstractconceptsthatcharacterizewhatadocumentis“about”inwaysthatmakesensetotargetusers.Theconstraintsoftextanalyticsare(a)thatithasahardtimefiguringoutwithoutintensivehumancurationorverywellstructuredtrainingsetswhichconceptsaremostsalienttothemostpeople(whichiswheretaxonomyandontologymanagementhelps)and(b)likesearch,itrequiresconstanttuningbasedonemerginguserneeds(whichiswheresearchanalyticsonuserbehaviourscanhelp).Thesetechnologiescanbeextremelypowerfulwhenusedincombinationinsupportofthesearchanddiscoveryexperience.However,itisimportanttounderstandwhattheyarecapableof,howtheywork,andthelimitationsofwhattheycando.FigureA1showsadetailedviewofthethreepillarsofthesearchanddiscoverytechnologystack,howtheywork,andtheirkeyinteractions.Webeginatthetopofthediagram.Sincethesetechnologiesaresupposedtobedeployedintheserviceofuserneeds,decisionsabouttheirconfigurationanddesignneedtobemadeinthecontextofknownuserneeds,aswellasathoroughunderstandingofthecontenttobecovered.Notethatthisisasimplifiedconceptualframework,anditdoesnotfullyrepresentthecomplexitiesoftherelationshipsbetweenthepillars,noralltheoverlapsbetweenthem.

4

Figu

reA1De

tailview

ofthese

archand

disc

overytechno

logystack

5

“Content”canbestructured,semi-structuredorunstructured.Itcanbeintheformofdata,documents,webcontent,audiovisualfiles,discussions,orquestions.“People”canalsobeconsideredaformofcontentsinceknowledgeablepeoplemayalsobeausefulresultinssearchonaparticulartopic.Inacontentmanagementsystemapersonprofilecanbeusedasaproxyforthatperson,andcanbetaggedwithrelevantsubjecttagssoastosurfaceinsearchalongsideotherformsofcontent.Allthreetechnologiesinthestackdependuponathoroughsurveyofthetargetcontenttobecovered,andpreparationandstructuringofthecontentstoressothatthesearch,taxonomymanagementandtextanalyticstechnologiescanaccessthemandoperateuponthem.

Enterprise Search Let’sbeginwiththecentralpillar,theenterprisesearchstack.Therearesixmaincomponentsinthesearchstack.

CrawlingThefirstcomponentisthecrawlingmodule.Thisisdirectedtocontentsources,anditcrawlsthecontentinpreparationforindexing.Youmaywanttobeabletodirectthecrawlertospecificpartsofyourcontent.

ContentProcessingContentisthenprocessed,usingtoolsincommonwithinitialprocessingofcontentbytextanalyticstools.Atextfileiscreatedforparsingandindexing,andthecontentistokenized,meaningeachelementisgivenauniqueidentifiersothattheycanbemanipulatedandlocatedlateron.Thisisalsothepointatwhichlemmatization(resolvingvariantgrammaticalformssuchastensesandpluralstothesamegrammaticalroot)andstemming(resolvingdifferentwordendingstothesameword-stem)occurs.Youmayalsowanttoremove“stopwords”thatwillcreatemeaninglessnoise(commonoperatorssuchasif,but,and,the,a)fromtheindexing.Yourstopwordsmayneedtobetuned.Sometimesstopwordsthatarenoisyinsearchindexesprovidesemanticcluesinsomerules-drivenapplicationsoftextanalytics.

IndexingTheindexingmodulecreatestermindexesfromtheparsedtext.Eachtermislinkedtoitssourcecontent,andtokenizationprovidesthecontextinwhichthesearchtermappears.Thepre-processedindexesprovidespeedinsearch.Crawlingandpre-processingoflarge-scalecontentcollectionscanbeveryslowandoccupiessignificantprocessingcapacity.Hencemostindexingconsistsofincrementalupdatestotheindexesbynoticing,crawlingandindexingnewcontentwheneveritisadded.Itisveryimportanttoplantheconfigurationandsetupofyoursearchenginecarefullyinadvance.Anymajorchangestothesearchengineconfiguration,ortothe

6

wayitexploitstaxonomyortextanalyticsmayrequireanexpensiveandslowcompletere-crawlandre-index.

Auto-categorizationSomesearchenginesalsooffersimpleauto-categorizationfunctions.Intheabsenceofsophisticatedtextanalyticstools,theseareusuallyentityrecognitionbasedonlookupsofreferencetermlists.Theselistscanbemaintainedwithinthesearchengineitself,orcanbesuppliedfromataxonomymanagementsystem.Thisisreferredtoas“knownentityextractionvialookup”.Theentitiesyouareinterestedincallingout(forexamplecompanynamesorcountrynames)arecompiledintolists,andareregisteredassignificantentitiesifthemainsearchindexescomeacrossthesetermsortheirsynonyms.Thiscanprovideaverysimplefilteringcapabilitybasedontheentitytypesyouareinterestedin.However,thisisafairlybasictechnology.Justbecauseadocumentmentions“Google”doesnotmeanthatthedocumentissubstantially“about”Google.

QueryProcessingMuchoftheutilityofasearchengineliesinthesophisticationwithwhichithandlesqueriesandqueryinterfaces.Queryinterfacescanincludetheubiquitoussearchquerybox,browsingtaxonomystructureswhereconceptsarepre-linkedtocontentthroughstoredsearches(requiresintegrationwithataxonomymanagementmodule),filteringofsearchresultsusingmetadataelementsortaxonomyfacets(againrequiresintegrationwithtaxonomy).Manysearchenginesofferauto-completeorauto-suggestofsearchqueriessuggestingcommonsearchqueriesastheusertypestheirqueryintothesearchbox.Auto-completecanexploitcommonsearchqueries,existingtaxonomyorthesaurusterms(linkedtoresultsthroughstoredsearches),and/orkeywordsincontextfrommetadatasuchastitlesofhighvaluecontent.Therelevanceofsearchresultsforcommonqueriescanbetunedattheback-endbyasearchmanagerbyalteringtherulesthatcalculatethelikelyrelevanceofagivenpieceofcontentforagivenquery.Wherethereisaknown“best”pieceofcontentforagivenquery,thiscanbepromotedtothetopoftheresultspage.Relevancetuningcanalsobeautomated,bylookingatclickbehaviorsbetweenqueryandclickingonresultsandpromotingcontentthatisfrequentlyclickedon.Relevance-tuningcanexploitwhatisknownabouttheusersandtheircontexts.Knowingthatauserissearchingfromwithinagivenorganizationalfunctionenablesrulestobewrittenforhowqueriesfromthoseusersshouldbehandled,fromzoningthesearchtoknowntargetcontentforthatfunctionalgroup,tohandlingrelevancecalculationbasedonmetadataassociatedwiththosefunctionsandusers,andmatchingitwiththeindexedcontent.Contentrankingmechanismscanalsobepulledintothemixed.

7

SearchAnalyticsQueryhandlingandsearchtuningrequiresophisticationinthesearchengineitselfaswellasarelativelysophisticatedsearchmanagementcapability–i.e.thewaythatsearchisstaffed,administered,configuredandconstantlytunedinrelationtoknownuserneeds.Searchanalyticsisthemodulethatpowersthissophisticationandtuning.Searchengineshaveverypowerfulreportingcapabilities,trackinghowqueriesareconductedanduserinteractionswithresultspages,butasinallcomplexreportingcapabilities,thesmartslieinhowthereportsaredefinedandconfigured.Whatwillyoutrack,how,andhowfrequently?Howwillyoubackuphunchesgatheredfromthereports,andvalidatethemwithusers?Theinitialsearchdesignstrategybasedonusecasesandcontentanalysiswillgiveyouastartingpoint.Thiswillberefinedbytheinsightsyougainbyobservinguserbehaviorsthroughthesearchroll-outandsubsequentmaintenance.

Taxonomy and Ontology Management Likesearch,taxonomyandontologymanagementrequiresveryrobustanalysisofhowusersneedtointeractwithcontent,andanalysisofthecontentitselftounderstandhowitisdescribed,organizedandused.Taxonomydevelopmentandmanagementisstillbestconductedasaworkofskilledhumandesign,butitcanbesubstantiallyenhancedwiththeappropriatetechnologies.Therearetwobroadfunctionswithinenterprisetaxonomyandontologymanagementsystems.

TaxonomyBuildingandMaintainingEnterprisetaxonomymanagementsystemssupporttheworkofbuildingandmaintainingtaxonomiesbyprovidingworkflowsandmetadataaroundtaxonomyconcepts.Forexample,theycanhousesourcevocabulariesofvaryingauthorityandstructurethatcanbeusedtogeneratecandidatetermsforthetaxonomy.Thereareapprovalworkflowstotracktheapprovalprocess.Theremayberoles-basedpermissionsforcomplexenvironmentswheremultipletaxonomyprofessionalsareinvolvedinmaintainingthevocabularies.Sourcesandusagesoftaxonomytermscanbetrackedasattributesofthetaxonomyconcepts.Scopenotescanbeaddedtoexplicatetaxonomyterms,andtherevisionhistoryoftermsandtaxonomyfacetscanbetracked.

ReferenceStoreThesecondmajorfunctionofanenterprisetaxonomymanagementsystemistoactasareferencestoreforothersystems,supplyingtaxonomyterms,relationshipsbetweenterms,synonymsandcontrolledvocabulariestoothersystems.Itcansupply:

• controlledvocabularytermstocontentmanagementsystemstosupportmanualtaggingofcontent

• termsandrelationshipstosupportauto-categorizationviaknownentitylookup,whetherappliedwithinasearchengineoratextanalyticsengine;

8

• synonymsandrelationshipstosearchsothatitcanimproverelevanceandprecisionofresults,andsupportexpansionofsearch

• taxonomyfacetstosearchtosupportfilteringofsearchresults.Becausetheyarebuilttomanipulatedefinedrelationshipsbetweenconceptsandtheirattributes,taxonomyandontologymanagementsystemscanalsoactasbrokerstootherLinkedDataorLinkedOpenDatasources,tocallauthoritytermsandrelationshipsfromelsewhere,tosupportsearchand/ortosupportenrichmentofcontentviatextanalytics.Finally,enterprisetaxonomyandontologymanagementsystemsarebuilttoactastaxonomy/ontologyhubscateringtomultiplesystemsandaudiences.Theyarecapableofinteractingdifferentlywithdifferentplatformsandaudiences.Forexample,differentversionsofthesamecoretaxonomycanbepresenteddifferentlytodifferentaudiences,dependingontheirneeds.Differentvocabulariesandtaxonomyservicescanbesuppliedtodifferentsystemsbasedontheircontentanduses.

Text Analytics Textanalyticsreferstoaquitediversefamilyofcomputer-assistedtechniquesforassigningmeaningtocontent.Differentmodulescanbeusedtoservedifferentpurposes.

ContentProcessingTherearesomebasiccontentprocessingfunctionsthattextanalyticshasincommonwithenterprisesearch:preparationoftextforparsing,tokenization,lemmatization,stemming,stopwords.

SyntacticandSemanticPre-ProcessingSeveralfunctionalapplicationsoftextanalyticsdependontwofoundationalprocessingtechniques.Thefirst,syntacticanalysis,analyzesthelanguagestructureoftextgrammatically,usingverylargeopensourcetrainingsets.Sentencesplittersbreakupsentencesintopartsofspeech,distinguishingnounsandnounphrasesfromverbsandoperators,andsoon,andsyntacticanalyzerscreatealogicalrepresentationofthesentences.Semanticanalysisattemptstoinfermeaningfromanalyzedtexts.Likesyntacticanalysisitdependsonreferencetrainingsets.Itenablescomputerstoinfermeaningfulassertionsandfactsfrombodiesoftext,andunderpinsthebasetechnologiesbehindmachinetranslation.Syntacticanalysisisbasedonknownrulesofgrammar.Semanticanalysishasamuchmorechallengingtask,whichistoinferhuman-understandablemeanings.Notallgrammaticallycorrectsentencesaremeaningfultohumans–"Colorlessgreenideassleepfuriously"isafamousexample

9

presentedbyNoamChomskyofasyntacticallycorrectsentencethatissemanticallyincoherent.Syntacticandsemantictechniquesunderpintheidentificationof“entities”withincontent.Inthecontextoftextanalytics,an“entity”canbeaperson,athing,aplace,atime,atopic.Itisanydistinctconceptthatcanbeidentifiedfromtext.Identificationandextractionofentitiescanbedonemostsimplybysimplylookingupreferencelists,asinauto-categorizationofknownentitiesprovidedbysomesearchengines.However,withsyntacticandsemanticanalysistextanalyticscanalsoidentifyunknown(unpredicted)entitiesforwhichyoudon’tyethaveexamplesinyourlists.ForexampleinEnglishpropernamesarecapitalized,andpersonalnameshaveknownforms(thatalsovarybyculture).Hencecandidatesforpersonsmentionedintextcouldbepickedupbyarulethatsays“lookforoneormorecapitalizedsubjectsorobjectsofsentences,andthenlookforlemmatizedverbsfromthefollowinglist‘say’…‘reply’…etc”.Suchrulesarebasedoncommonusageasrepresentedintheirtrainingsetsandaretypicallyprovidedasstandard.Theseruleswilltypicallyneedtobetestedandrefinedbasedonlocalusage.Forexample,wefoundthatonestandardtooldidnotpickupcountryoforiginincontentfromanimmigrationandcustomsauthority,becauseitwasusingsimpleterm-phraselookup.However,customsofficersatcheckpointshabituallyusedtheconvention“[Country]-registeredvehicle”.Thehyphenationandextensionofthetermdisruptedthesimplelookupfunction.Noticethatthesolutiontothisproblemisacombinationoftechniques,likelysupplementingknownentitylookupwithsyntacticallydrivenentityextractionrules.Thepracticeofusingdifferenttechniquesincombinationischaracteristicoftextanalyticsapproaches.Withthesefoundationalcapabilities,therearemanyspecializedapplicationsoftextanalytics,notallofwhichwillbesuitabletoeverypurpose.

Auto-categorizationTheabilitytoidentifybothknownandunknownentities/conceptsisasignificantimprovementinauto-categorizationcapabilities.Lookuplistsforknownentities/conceptscanbesupportedbyataxonomymanagementsystem,andstrengthenedthroughthesupplyofsynonyms.Conversely,throughthejudicioususeofrules,textanalyticscandiscoverentities/conceptsmentionedintextthatarenotyetinthetaxonomyorcontrolledvocabularies,andcanbeconsideredascandidateterms.

TextMiningTextminingreferstoanumberoftechniquesforanalyzingtextbasedonstatisticalalgorithms.Inverybroadterms,largebodiesoftextaresubmittedtoiterative

10

operationsusingalgorithmsto“sort”thetextintoclustersor“buckets”thatareinternallystatisticallysimilar,andstatisticallydissimilartotheotherclustersorbuckets.Statisticaltechniquesarealsousedtodeterminethewordsorword-phrasesthataremostlikelytouniquelyrepresenttheirbucketcomparedtootherbuckets.Thesewordstringsareoftencalled“topicmodels”.Thelikelihoodofspecifictermsoccurringinproximitytoeachothercanalsobecalculated.Textminingissometimestoutedasafully-automatedalternativetodescribingandtaggingcontent.Inpracticehowever,theclusteringtechniquesusefuzzyalgorithms,andtheend“described”stateishighlyinfluencedbywhichclusteringpathwaysaretakenatthebeginningoftheoperation.Textminingtechniquessufferfromnon-reproducibility(thesameseriesofoperationsonthesamecontentcanproducedifferentresultsondifferentoccasions)andfromnon-comprehensibility–thetopicmodelsgeneratedoftendonotmatchhowusersthemselvesdescribethecontent.Soinpractice,applicationsoftextminingareheavilytunedbytheirhumanoperators,anditisnotalwaystransparenthowagivensetofresultsareachieved.Textminingalsosuffersfromendogeneity–meaningthatthetagsthatareextractedtodescribesimilarbucketsofcontentarewhollyderivedfromthatcontentset(orwhentrainingsetsareusedforcomparison,fromthetargetcontent+thetrainingset).Endogeneityisaproblemwhenyouwanttosurveydiversecollectionsofcontentandmapmeaningsconsistentlyacrossthem.Inthatcase,ataxonomyorontologythatcrossesthosecontentsetsandaudiencesprovidesanexogenousreferencepointthatcanbrokermeaningsacrossdifferentcontentsetsanduseraudiences.However,textminingcanbeapowerfultoolusedincombinationwithsearchandtaxonomymanagement.Itsstatisticaltechniquestoclusterbasedonsimilaritycanproposenewcategorizationsandpotentiallyrelevantrelationshipsbetweentopicstothetaxonomymanager.Itsabilitytoextracttopicmodelscanbeusefulinautomatedsummarizationtechniques.

SemanticAnalysisWehavealreadydiscussedsomeofthechallengesofcomplexityfacingsemanticanalysis,notleasttheheavyinfluenceofcontextandculture(factorsexternaltothecontent)onmeaning.Inpracticaltextanalyticsapplications,semanticanalysisreliesonanumberoftechniquesandnotjustcomputationalsemanticanalysis.Forexample,oneofthemainapplicationsofsemanticanalysisistoextractassertionsorfactsfromcomplexdocumentation.Ifsemanticanalysishasaccesstoontologiesviaataxonomymanagementsystemitcanthenenrichitsinferencesbasedonknownandvalidatedrelationshipsbetweenentities.

11

SentimentAnalysisSentimentanalysisisaveryspecializedapplicationoftextanalytics,oftenusedinmarketing,socialmediaanalysis,andqualitativefeedback.Itlooksatthesemanticsofpositive,negativeandneutralexpressions,inassociationwithknown(extracted)entities.Againitishighlycontextual,andsubjecttoover-simplification–forexample,sarcasticlanguagecanbesuperficiallypositive,butthelinguisticmarkersofsarcasmareoftennotveryobvious.Aswithmosttextanalyticstechniques,therulesandalgorithmsneedtobetested,validatedandconstantlysupervisedforaccuracy.

SummarizationAutomaticsummarizationoftenexploitsknowndocumentordatastructures(itknowswheretolookforkeycontent),anditmayexploitsyntacticanalysis(e.g.removingextraneous“padding”languagesuchasadjectivesandadverbs).Itmayusesemanticanalysistoextractkeyfactsfromdocuments.

How the Pieces of the Stack Work Together Let’sbringthisbacktogethertogetabetterunderstandingofhowthepillarsofthestackworktogether.FigureA2illustratesthehighlevelviewofhowthedifferentpillarsofthesearchanddiscoverytechnologystackinteractandworktogethertoproducesuperiorresultsfororganisations.Thefocusforhowthestackisdeployed,andwhichparticulartechnologycomponentsareused,alwaysspringsfromuseranalysis,andanunderstandingofthequeriesthatusersmake,againstanunderstandingofthetargetcontent,howitisstructured,howandwhyitisproduced,howitiscurrentlybeingused,andhowitcouldpotentiallybeused,ifthesearchanddiscoveryfunctionalitycouldbeimproved.Indigitalenvironments,“targetcontent”canbeanycombinationofdocuments,webcontent,data,dashboards,andpeopleprofiles.TheconfigurationoftheelementsoftheSearchandDiscoverystackspringsfromanunderstandingofuserneedsagainstanunderstandingoftargetcontent.Determininghowtousethepillarsofthestacksitswithinalargerdiagnosticanddecision-makingframework.

1. Whatisthebusinessproblem(orsetofbusinessproblems)youaretryingtosolve?Theanswertothisquestioncanoftencomethroughaknowledgemanagementdiagnosticsactivitysuchasaknowledgeaudit,combinedwithanunderstandingoftheorganisation’sbusinessdirectionandstrategy.

2. Whoarethekeyusercommunitiesinrelationtothebusinessproblem,andhowwillyouunderstandtheirworkingneedsandopportunitiesforimprovement?(Thiscanalsocomefromaknowledgeaudit).

3. Whatisthetargetcontent(theknowledgeresourcesthatyourtargetusercommunitiesneedstoworkwithmoreeffectively),whereisit,howisitstructuredandusedcurrently?(Contentanalysisandmodelingcanhelp).

12

4. Whattoolsfromwithinthesearchanddiscoverystackcanhelptoaddresstheproposedsolution?(Thefocusofthispaper)

5. Whatneedstochangeonthebehavioural,processandgovernancelevelsforthenewsolutiontowork?(Forsustainableimplementationplanning,checkoutTheKnowledgeManager’sHandbook(MiltonandLambe,2016)fromaknowledgemanagementperspectiveorMaishNichani’sseriesofarticlesathttp://olasearch.com/articlesfromasearchdesignperspective).

FigureA2Interactionsbetweentaxonomymanagement,searchandtextanalytics

SearchandTaxonomySearchistheuser-facingcomponentofthestack.Itmediatescontenttousers.Taxonomymanagementsuppliesknownsalientconceptsandrelationshipstosearch,andittellssearchwhichquerytermsaresynonymsofthesametaxonomyconcepts.By“salient”wemeanthattheconceptsarefrontofmind,action-relatedconceptsintheheadsofusersastheyperforminformationandknowledgeseekingactivitiestocompleteworktasks.Taxonomiesmakesearchsmarterbytellingitwhichconceptsareimportanttousers,howtheyarerelatedtootherconcepts,whetherthroughhierarchicalorotherrelationships.Hencesearchcanexploittaxonomiestohelpuserstorefineorexpandtheirsearch,tofollowmeaningfulpathwaysandexplorecontent.Taxonomiesandontologiesenhancetherelevanceandprecisionofsearchforusers.Becausesearchisuserfacing,itgathersalotofdataaboutuserneedsandbehaviors.Frequencyanalysisofsearchqueriestellsthetaxonomistalotabouthowusersthinkabouttheircontentandtheirsearchrequirements.Clickthroughactivity

13

onsearchresultspagesalsoprovidesevidenceaboutuserperceptionsofwhatmayormaynotbeuseful.Hencesearchanalyticsprovidesafeedbacklooptohelpthetaxonomymanagerconstantlyrefineandimprovethetaxonomytouserneeds.Thisisapositivefeedbackloop,wheretaxonomyprovidesrelevanceandprecisiontosearch,andsearchprovidesaconstantflowofrealtimelarge-scaleevidenceonwhatisactuallyrelevantandusefultousersatthepresenttime.Searchanalyticscanalsopickupemergingtermsandconceptsthatareimportanttousers,andproposethemascandidatetermsfortaxonomiesandtheirsupportingthesauri.Withouttaxonomymanagement,searchisrelativelydumb,whetherintermsofwhatitcanindex,howitcanhelpusersmaketheirqueries,andhowitcanhelpusersactupontheirresults.Withoutsearch,taxonomymanagementlacksaconvincingmediumtodemonstrateitsvalue,anditlacksacontinuousflowofevidencetomaintainitsrelevanceandusefulnesstousers.

SearchandTextAnalyticsTextminingcanproposenewconceptsandnewassociationsbetweencontentitemsforthesearchmanagertobuildintoqueryprocessingandrelevancytuning.Semanticanalysisandsummarizationtechniquescanbeusedtoprovide“smart”summariesofcontentitemstobepreviewedinsearchsnippetsandinsearchresultspages,makingiteasierforuserstoassesswhetherornotthecontentisrelevanttotheirneed,beforetheyopenthecontentitem.Textanalyticscanextendthewayinwhichcontentcanbepulledtogetherinspecializedsearchbasedapplications.Allthreepillarsofthestack(search,taxonomyandtextanalytics)dependonconstanttuningoftheinfrastructuretodelivergoodsearchanddiscoveryresults.Textanalyticsinparticulardependsonrulestoguidehowthecontentisprocessedandtagged.Theserulesneedmaintenanceandtuning.Thistuningisnecessarybecauseoftheconstantongoingchangesincontent,context,priorities,usersandneeds.Searchanalyticsprovidealargescale,real-timewindowintouserneedsandbehaviors,andso,aswithtaxonomymanagement,itprovidestheevidencebaseneededtokeepthetextanalyticsstacktunedtodeliveraccurateandusefulresultsforusers.Whentherearelargeandcomplexknowledgebases,withouttextanalytics,searchfindsitdifficulttopickoutsalientconceptsrelatedtocontent.Aswithtaxonomymanagement,textanalyticsdependsondeepuserandcontentknowledgetoknowhowtoconfigureandtuneitsoperations.Withoutsearch,textanalyticsfindsithardtodothisunlesstheteamispreparedtoinvestinconstant(andexpensive)userresearchandtesting.

TextAnalyticsandTaxonomyManagementTaxonomiesaredesigned,evidence-basedartifacts.Assuch,theyconsistentlylagthepaceofchange.Theyareoptimizedtoexploitknownanddocumentedconcepts.Textanalyticstechniquessuchasauto-categorizationandtextminingcansupplycandidatetermsandrelationshipstothetaxonomymanagerwithouttheneedfor

14

comprehensivere-surveyingofcontentanduserneeds.Theycanpickupemergingconceptsinthecontent,justassearchanalyticspicksupandproposesemergingconceptsevidencedinsearchqueries.Sotextanalyticscanactasapowerfultaxonomyenrichmenttool.Taxonomiesandontologiescanguideandstrengthentheapplicationoftextanalyticstechniquessuchasauto-categorization(knownentityextractionthroughlookup),factextractionthroughsemanticanalysis,andsentimentanalysis(byhelpingthesentimentengineresolvedifferentsynonymstothesameentitybeingreferencedinthecontent).Whentherearelargeandcomplexknowledgebases,withouttextanalytics,taxonomytagscannoteasilyandconsistentlybeappliedtolarge,diversebodiesofcontent.Withouttaxonomymanagement,textanalyticsstrugglestodescribecontentintermsthatmakesensetokeyusersandinthecontextofuseractivities.Machinetaggingtechniquesalonedonotdescribecontentintermsthathumansintuitivelyunderstand,withoutsomeformofhuman-curatedfilter.Taxonomymanagementprovidesthis.

Implications Fortoolongorganisationshavebeenlisteningtotheboomingvoicegivingmagicalreassurancesfrombehindthecurtain,andhaveneglectedtheirowncapabilities,andthetoolsetsthatexistintheopensourcearena(thatcommercialproductsalsoexploit).Astheworldbecomesmorecomplex,solutionstosearchanddiscoveryproblemswillbecomemorediverse.Weneedtobecomebetterjourneymen,moreknowledgeableaboutthecapabilitiesdifferenttoolsetsthatareavailable,sothatwecanchoosetherighttoolsforanygivenproblem.Someofthemwillbecommercialtools,acquiredfortheirexcellenceinaspecificsetoffunctions.Someofthemwillbeopensourcetools,acquiredandtunedtosolvespecifickindsofproblem.Theageofsinglesource“doitall”commercialplatformsisover.Thispaperhasproposedthatknowledgeofthesearchanddiscoverytechnologystackisanimportantcapabilitytoacquire.Itneedstobeatleastadequatetothejobofevaluatingneedsandpossibilities,andofaskingvendorssearchingquestionsabouthowtheirtechnologyworks,whatitisgoodat,andwhatitslimitationsare.Insomecases,organisationswillwanttointernalizeadeepertechnicalcapabilityinworkingwithanumberofsetsoftools.Thereareotherimportantsearchanddiscoveryrelatedcapabilitiesthatenterprisesneedtoacquirebeyondaknowledgeofthesearchanddiscoverytechnologystack–forexample,howuserneedsandcontentcanbeanalysedtodetermineprioritiesandopportunities,howcontentcanbemanagedandenrichedtoprovideoptimaluserexperiences,andhowbusinessproblemscanbetransformedthroughdesignprocessesintoeffectivesolutionsforbusinessneeds,problemsandopportunities.

15

I would like to acknowledge the following colleagues who have provided valuable insights and advice informing this paper: Dave Clarke, Ahren Lehnert, Agnes Molnar, Maish Nichani and Tom Reamy. Patrick Lambe October 2016-January 2017