e-Discovery Team at TREC 2015 Total Recall Track
Ralph C. Losey∗, National e-Discovery Counsel, Jackson Lewis P.C., e-DiscoveryTeam.com, [email protected]
Jim Sullivan and Tony Reichenberger, Sr. Discovery Services Consultants, Kroll Ontrack, Inc., eDiscovery.com, [email protected]
ABSTRACT
The 2015 TREC Total Recall Track provided instant relevance feedback in thirty prejudged topics searching three different datasets. The e-Discovery Team of three attorneys specializing in legal search participated in all thirty topics using Kroll Ontrack’s search and review software, eDiscovery.com Review (EDR). They employed a hybrid approach to continuous active learning that uses both manual and automatic searches. A variety of manual search methods were used to find training documents, including high probability ranked documents and keywords, an ad hoc process the Team calls multimodal. In the one topic (109) requiring legal analysis the Team’s approach was significantly more effective than all other participants, including the fully automated approaches that otherwise attained comparable scores. In all topics the Team’s hybrid multimodal method consistently attained the highest F1 values at the time of Reasonable Call, equivalent to a stop point. In all topics the Team’s multimodal human-machine approach also found relevant documents more quickly and with greater precision than the fully automated or other methods.
Categories and Subject Descriptors: H.3.3 Information Search and Retrieval: Search process, relevance feedback, supervised learning, best practices.
Keywords: Hybrid Multimodal; AI-enhanced review; predictive coding; predictive coding 3.0; electronic discovery; e-discovery; legal search; active machine learning; continuous active learning; CAL; Computer-assisted review; CAR; Technology-assisted review; TAR; relevant irrelevant training ratios.
1. INTRODUCTION
The e-Discovery Team participated in all thirty Total Recall Track topics in the Athome group where both manual and automatic methods were permitted. The Team is composed of three practicing attorneys who specialize in legal search. They used Kroll Ontrack’s search and review software, eDiscovery.com Review (“EDR”), employing what they call a hybrid multimodal method.1 They attained high recall and precision in most of the thirty topics. The few exceptions appear to derive from the fact that the attorneys are accustomed to self-defining the ground truth, and, in some topics, their opinions on relevance differed significantly from those of the TREC assessors. In later topics the attorney Team learned to turn off their own judgments and rely primarily on their software’s automated processes, which generally led to improved scores better matching the TREC relevance assessments. The Team’s manual efforts, as measured by time expended and number of documents manually reviewed, were very low by legal search standards.
∗ The views expressed herein are solely those of the author, Ralph Losey, and should not be attributed to his firm or its clients.
The fully automatic methods employed by the Sandbox group participants in the Total Recall Track attained comparably high recall and precision in most topics. The Team’s hybrid multimodal method did, however, consistently attain the highest F1 values at the time of Reasonable Call, equivalent to a training stop point, which is very important to legal search. One of the thirty topics, Topic 109 (Scarlet Letter Law), required a small amount of legal knowledge and analysis to understand relevance (most of the others required none). On this topic our legal team, as you would expect, attained significantly better results than the fully automated methods that contained no base legal knowledge.
The e-Discovery Team’s hybrid multimodal method is a type of continuous active learning text retrieval system that employs supervised machine learning and a variety of manual search methods.2,3
The Team attained very high recall and precision rates in most, but not all, of the thirty Total Recall topics. The Team’s F1 scores at the time of Reasonable Call ranged from a perfect score of 100% in one topic (3484), to 91% to 99% in eight topics, and 82% to 87% in five others. Although, of course, not directly comparable, these scores are far higher than any previously recorded in the six years of TREC Legal Track (2006-2011) or any other study of legal search. One reason for this may be that the thirty topics in the 2015 Total Recall track presented relatively simple information needs by legal search standards, with one exception (Topic 109, Scarlet Letter Law). Another may be improved software and the Team’s improved hybrid multimodal method that includes continuous active learning.
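Since the F1 value at the Reasonable Call is the headline measure here, a short sketch of how it is computed may help. F1 is the harmonic mean of Recall and Precision. The function name `f1_score` is our own illustration, not part of EDR or the TREC tooling; the Topic 106 inputs are the Recall and Precision figures reported for that topic later in this paper.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, both given as fractions in [0, 1]."""
    if recall + precision == 0:
        return 0.0  # avoid dividing by zero when nothing relevant was found
    return 2 * recall * precision / (recall + precision)


# Topic 106 at the Reasonable Call: 98.47% Recall, 97.22% Precision.
topic_106_f1 = f1_score(recall=0.9847, precision=0.9722)
print(f"Topic 106 F1: {topic_106_f1:.2%}")

# Topic 3484: 100% Recall with 100% Precision yields a perfect F1.
print(f"Topic 3484 F1: {f1_score(1.0, 1.0):.2%}")
```

Because the harmonic mean is pulled toward the lower of its two inputs, a high F1 requires both high Recall and high Precision at the stop point, which is why it is a demanding measure for legal search.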
The e-Discovery Team was able to find the target relevant documents in all thirty topics with relatively little human effort and almost no legal analysis. Only Topic 109 required legal knowledge and analysis, with four others (101, 105, 106, 107) requiring some small measure of analysis.
A total of 16,576,798 documents were classified in thirty topics. Of these documents 70,414 were predetermined by TREC assessors to be relevant. The e-Discovery Team found these relevant documents by manual review of only 32,916 documents. The other 37,498 relevant documents were found with no human review of these documents.
1.1 Total Recall Track Description - Athome and Sandbox.
The Total Recall track offered 30 different pre-judged topics for search in two different divisions, Athome and Sandbox. Our Team only participated in the Athome experiments. In the Athome experiments the data was loaded onto the participants’ own computers. There were no restrictions on the types of searches that could be performed. The setup allowed the e-Discovery Team to use a slightly modified version of our standard Hybrid Multimodal method, which, as mentioned, employs both ad hoc manual review and machine learning.
The Sandbox participants were only permitted to use fully automated systems and the data remained on TREC administrator computers. They searched the same three datasets as Athome, plus two more not included in the Athome division due to confidentiality restrictions. The Sandbox participants were prohibited from any manual review of documents or ad hoc search adjustments.4 Even after the submissions ended, the Sandbox participants reported at the Conference that they never looked at any documents, even the unrestricted Athome shared datasets. They never made any effort to determine where their software made errors in predicting relevance, or for any other reasons. To these participants, all of whom were academic institutions, the ground truth itself was of no relevance.
Three different datasets were searched in both the Athome and Sandbox events, with the same ten topics in each. Even though the data searched and topics overlapped in the two divisions, none of the participants in one division participated in the other division. This is unfortunate because it makes direct comparisons problematic, if not impossible, especially as to the software systems used. It is hoped that some participants will participate in both events in future Total Recall tracks.
The e-Discovery Team participated in all thirty of the Athome topics. We were the only manual participant to do so, with all others completing ten or fewer topics. The lack of participation by others in the Athome group also makes meaningful comparisons very difficult or impossible, but we note that the e-Discovery Team’s scores were consistently higher than those of any other Athome participants.
AtHome participants were asked to track and report their manual efforts. The e-Discovery Team did this by recording the number of documents that were human reviewed and classified. Virtually all documents human reviewed were also classified, although not all documents classified were used for active training of the software classifier. Moreover 53% of the relevant documents used for training were never human reviewed. We also tracked effort by number of attorney hours worked as is traditional in legal services.
The Team used Kroll Ontrack’s software, known as eDiscovery.com Review, or EDR, which includes active machine learning features, a/k/a predictive coding in legal search. EDR employs a proprietary probabilistic type of logistic regression algorithm for document classification and ranking.
The AtHome participants used their own computer systems and software for search, and then submitted documents to the TREC administrator that they considered relevant. TREC set up a “jig” whereby instant feedback was provided to a participant as to whether each document submitted as relevant was in fact previously judged to have been relevant by TREC assessors. When a participant determined that a reasonable effort had been made to find all relevant documents required, which is important in legal search and represents a stopping point for further machine training and document review, they would notify TREC of this supposition and “Call Reasonable.” Continued submissions were made after that point so that all documents were classified as either relevant or irrelevant. The goal as we understood it was to submit as many relevant documents as possible before the Reasonable Call, and thereafter to have all false negatives appear in submissions as soon after the Reasonable Call as possible.
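The submit-and-feedback cycle described above can be sketched as a toy simulation. This is only an illustration under loud assumptions: the additive token-weight scorer stands in for a real classifier and is not EDR’s proprietary algorithm, the function name `run_feedback_loop` and the six-document corpus are invented for the example, and the real jig was a TREC service rather than an in-memory set.

```python
from collections import Counter


def run_feedback_loop(docs, truth, seed_terms, batch_size=2):
    """Submit documents in ranked batches, learning from instant feedback.

    docs: dict mapping document id -> set of tokens (hypothetical toy corpus)
    truth: set of ids prejudged relevant; only the simulated judge reads it
    seed_terms: keywords for the first, manual-style search
    Returns the submission order and the recall reached after each batch.
    """
    weights = Counter({term: 1.0 for term in seed_terms})
    remaining = set(docs)
    submitted, found, recall_curve = [], 0, []

    while remaining:
        # Rank unsubmitted documents by summed token weight (ties by id).
        ranked = sorted(
            remaining,
            key=lambda d: (-sum(weights[t] for t in docs[d]), d),
        )
        for doc_id in ranked[:batch_size]:
            relevant = doc_id in truth  # instant yes/no feedback
            found += relevant
            # Reward tokens from relevant documents, penalize the others.
            for token in docs[doc_id]:
                weights[token] += 1.0 if relevant else -0.5
            submitted.append(doc_id)
            remaining.discard(doc_id)
        recall_curve.append(found / len(truth))
    return submitted, recall_curve


# Toy corpus: "c" is relevant but shares no token with the seed keyword.
docs = {
    "a": {"nickname", "statute"},
    "b": {"nickname", "billnumber"},
    "c": {"billnumber", "senate"},
    "d": {"budget", "school"},
    "e": {"senate", "budget"},
    "f": {"manatee", "boat"},
}
truth = {"a", "b", "c"}

submitted, recall_curve = run_feedback_loop(docs, truth, ["nickname"])
print(submitted)
print(recall_curve)
```

In this toy run the seed keyword alone cannot reach document "c"; the loop finds it only after feedback on "b" raises the weight of a token the two documents share. A participant would “Call Reasonable” at whatever batch they judged sufficient, without knowing the true recall at that moment.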
Most of the thirty topics presented only simple, single-issue information needs suitable for single-facet classification. Further, only a few of the topics required any legal analysis for relevance identification. These two factors, plus the omission of metadata, were, we think, a disadvantage to the e-Discovery Team of lawyers. Conversely, it appears that these same factors made it simpler for the academic Sandbox participants to perform well in most topics using fully automated methods. It should also be noted that although our lawyer Team was practiced and skilled in complex information needs requiring extensive legal analysis, and had long experience with projects using SME defined ground truths, none had any prior experience using machine learning for the types of searches presented in the 2015 Recall Track.
The one exception, where legal analysis by an SME proved beneficial, was Topic 109, Scarlet Letter Law. It required some legal knowledge, albeit very rudimentary, to begin locating relevant documents. The keywords alone, “Scarlet Letter Law,” would only find relevant documents with this word combination and similar text patterns. These words were just the nickname of the proposed and eventually enacted Florida Statute. Any attorney would know that to find relevant information they would not only have to search the name, but they would also have to search the various house and senate bill numbers for this law. These numbers would not often appear in the same document as the nickname, and since the machine did not know to search for these numbers, it did not realize the significance. Eventually the automated machine learning did see the connection, after many relevance feedback submissions. These submissions and instant feedback of relevant, or not, would, of course, not happen in real legal search.
1.2 Governor Bush Email
The first set of Athome Topics searched a corpus of 290,099 emails of Florida Governor Jeb Bush. Most of the metadata of these emails and associated attachments and images had been stripped and converted to pure text files. This increased the difficulty of the Team’s search, which normally includes a mixture of metadata specific searches.
A significant percentage of the Bush emails were form type lobbying emails from constituents, which repeated the same language with little or no variance. The unusually high prevalence of near-duplicate emails made search of many of the Bush topics easier than is typical in legal search.
The ten Bush email topics searched, and their names, which were the only guidance on relevance provided to either the Athome or Sandbox participants, are shown below.
Topic 100 School and Preschool Funding
Topic 101 Judicial Selection
Topic 102 Capital Punishment
Topic 103 Manatee Protection
Topic 104 New Medical Schools
Topic 105 Affirmative Action
Topic 106 Terri Schiavo
Topic 107 Tort Reform
Topic 108 Manatee County
Topic 109 Scarlet Letter Law
E-Discovery Team leader, Ralph Losey, a lifelong Florida native, personally searched each of these ten Topics. In about half of the topics his personal knowledge of the issues was helpful, but in several others it was detrimental. He had definite preconceptions of what emails he thought should be relevant and these sometimes differed significantly from those of the TREC assessors. In all of the Bush Topics Losey was at least somewhat assisted by a single “contract review attorney.”5 The contract attorneys in most of these ten Topics did a majority of the document review under Losey’s very close supervision, but had only limited involvement in initial keyword searches, and no involvement in predictive coding searches or related decisions.
All participants in the 2015 Recall Track were required to complete all ten of the Bush Email Topics. Completion of the other twenty Topics in the two other data collections was optional. Several participants started review of the Bush Topics, but did not finish, and thus were not permitted to submit a report or attend the TREC Conference. Only one other Athome participant, Catalyst, completed all ten Bush Topics. No other Athome participants even attempted the other twenty topics, and thus comparisons with the e-Discovery Team’s results are limited to the fully automatic participants.
1.3 Black Hat World Forums.
The second set of Athome Topics searched a corpus of 465,149 posts taken from Black Hat World Forums. Again, almost all metadata of these posts and associated images had been stripped and converted to pure text files. The ten topics searched, and their names, which again were the only guidance initially provided on relevance, are shown below.
Topic 2052 Paying for Amazon Book Reviews
Topic 2108 CAPTCHA Services
Topic 2129 Facebook Accounts
Topic 2130 Surely Bitcoins can be Used
Topic 2134 PayPal Accounts
Topic 2158 Using TOR for Anonymous Internet Browsing
Topic 2225 Rootkits
Topic 2322 Web Scraping
Topic 2333 Article Spinner Spinning
Topic 2461 Offshore Host Sites
The Team members again had expertise issues with some of these arcane topics that they happened to be familiar with. Their knowledge would sometimes prove detrimental. Again, as the review continued, the Team members learned to suspend their own knowledge and ground truth judgments and instead rely entirely on the automated ranking searches, much like the fully automated participants always necessarily did.
1.4 Local News Articles.
The third set of Athome Topics searched a corpus of 902,434 online Local News Articles, again in text only format. The ten topics searched, and their names, which again were the only guidance provided on relevance aside from the instant feedback, are shown below.
Topic 3089 Pickton Murders
Topic 3133 Pacific Gateway
Topic 3226 Traffic Enforcement Cameras
Topic 3290 Rooster Turkey Chicken Nuisance
Topic 3357 Occupy Vancouver
Topic 3378 Rob McKenna Gubernatorial Candidate
Topic 3423 Rob Ford Cut the Waist
Topic 3431 Kingston Mills Lock Murders
Topic 3481 Fracking
Topic 3484 Paul and Cathy Lee Martin
The Team found the News Articles less difficult to work with than our typical legal search of corporate ESI. Still, the same kind of ground truth validity and consistency issues were noted in some of the news topics, but to a lesser degree than in the other two datasets.
1.5 E-Discovery Team’s Three Research Questions.
Our first and primary question was: What Recall, Precision and Effort levels would the e-Discovery Team attain in TREC test conditions over all 30 Topics using the Team’s Predictive Coding 3.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery.com Review (EDR)? Our secondary question was: How will the Team’s results using its semi-automated, supervised learning method compare with other Recall Track participants using semi-automated supervised or fully automated unsupervised learning methods? Our last question was: What are the ideal ratios, if any, for relevant and irrelevant training examples to maximize effectiveness of active machine learning with EDR?
2. RELATED WORK
It is generally accepted in the legal search community that the use of predictive coding type search algorithms can improve the search and review of documents in legal proceedings.6 The use of predictive coding has also been approved, and even encouraged, by various courts around the world, including numerous courts in the U.S.7
Although there is agreement on use of predictive coding, there is controversy and disagreement as to the most effective methods of use.8 There are, for instance, proponents for a variety of different methods to find training documents for predictive coding. Some advocate for the use of chance selection alone, others for the use of top ranked documents alone, others for a combination of top ranked and mid-level ranked documents where classification is unsure, and still others, including Losey, call for the use of a combination of all three of these selection processes and more.9 The latest respectful disagreement is between Losey’s e-Discovery Team, and the Administrators of the Total Recall Track, Grossman and Cormack, concerning the advisability of: 1) keeping attorney search experts in the loop, the hybrid approach, as opposed to the fully automated approach; and 2) using a variety of search methods, the multimodal approach, as opposed to reliance on high ranking documents alone for machine training.10
Some attorneys, predictive coding software vendors, and, apparently, Grossman and Cormack, advocate for the use of predictive coding search methods alone, and forego other search methods when they do so, such as keyword search, concept searches, similarity searches and linear review. E-Discovery Team members reject that approach and instead advocate for a hybrid multimodal approach that they call Predictive Coding 3.0, further described below. It uses all methods. As discussed in Endnote 2, we reject the notion of inherent lawyer bias that underlies some experts’ fully automated approaches, including, but to a lesser degree, Grossman and Cormack. We instead seek to augment and enhance attorney search experts, not automate and replace them. We do, however, favor certain safeguards against the propagation of errors, intentional or inadvertent, and advocate within the legal community for continuous active training of lawyers in search techniques and ethics.
Our participation in the 2015 TREC Total Recall Track, the research questions we posed, and the experiments we performed, were not in any manner designed or intended to attempt to resolve this current methodology dispute with the Administrators of this Track. In fact, it was only at the 2015 Conference that we fully understood the extent of these differences. Although Grossman and Cormack did individually participate in this Track, as well as administer it, and so too did other groups from Cormack’s university, they did not participate in the manual Athome division that we did. To our knowledge the Total Recall track was not designed to address this newly emerging disagreement in preferred methodologies, nor advance any one particular methodology. Still, we would concede that, subject to normal caveats, some indirect lessons can be derived on this issue from the Total Recall Track results.
3. HYBRID MULTIMODAL APPROACH
The e-Discovery Team approach includes all types of search methods, with primary reliance placed on predictive coding and the use of high-ranked documents for continuous active training. In that way it is similar to the approach used by Grossman and Cormack,11 but differs in that the Team uses a multimodal selection of search methods to locate suitable training documents, including high ranking documents, some mid-level ranked uncertain documents, and all other search methods, including keyword search, similarity search, concept search and even occasional use of linear review and random searches. The various types of searches usually included in the Team’s multimodal approach are shown in the search pyramid, below.
The standard eight-step workflow used by the Team in legal search projects is shown in the diagram below. A step-by-step description of the workflow can be found in e-Discovery Team writings.12 The application of this methodology can be seen in the Team’s description of their work in each of the thirty Topics that is included in the Appendix. Our usual steps One, Three and Seven had to be omitted or severely constrained to meet the TREC experiment format.
Standard steps Three and Seven of the workflow were omitted to meet the time requirements of completing every review project in 1.5 days. Skipping these steps allowed us to complete 30 review projects in 45 days in the Team’s spare time, but had a detrimental impact.
Our usual first step, ESI Discovery Communications, is where our information needs are established. This had to be omitted to fit the format of the Recall Track Athome experiments. The only communication under the TREC protocol was a very short, often just two-word description of relevance, plus instant feedback in the form of yes or no responses as to whether particular documents submitted were relevant. In the e-Discovery Team’s typical workflow discovery communications typically involve: 1) detailed requests for information contained in court documents such as subpoenas or Requests For Production; 2) input from a qualified SME, who is typically a legal expert with deep knowledge of the factual issues in the case and how the presiding judge in the legal proceeding will likely rule on borderline relevant issues; and, 3) dialogues with the client, witnesses, and with the party requesting the production of documents to clarify the search target.
The Team never receives a request for production with just two or three word descriptions as encountered in the TREC experiments. When the Team receives vague requests, which is common, the Team seeks clarification in discussions (Step One). In practice if there is disagreement as to relevance between the parties, which is also common, the presiding judge is asked to make relevance rulings. Again, none of this was possible in the TREC experiments.
All of our usual practices in Step One had to be adjusted to the submissions format of the 30 Athome Topics. The most profound impact of these adjustments was that the attorneys on the Team often lacked a clear understanding as to the intended scope of relevance and the rationale behind the automated TREC relevance rulings on particular documents. These protocol changes had the impact of minimizing the importance of the SME role in the active machine learning process. Instead, this role was often shifted almost entirely to the analytics of the EDR software. The software analytics could often see patterns, and correctly predict relevance, that the human attorney reviewers could not (often, but not always, because the human reviewers disagreed with the TREC assessors’ human judgment of ground truth in several topics, and otherwise could not follow or see any logic to the documents returned as relevant).
This minimization of the importance of the SME role is not common in legal search where attorney reviewers always have some sort of understanding of relevance. The role of the SME in the Team’s decades of experience in legal search has always been important to help ensure high quality, trustworthy results. Contrary to the unfortunate popular belief among lay persons going back to the time of Shakespeare,13 the vast majority of legal professionals maintain very high standards of ethics and trustworthiness. In spite of the alleged negative influences of the centuries old adversarial tradition of the common law, attorneys are dedicated to uncovering the truth, the whole truth, and nothing but the truth, regardless of the particular case impact. Any notion of inherent bias by attorneys is misplaced. It is, after all, attorneys who control the discovery process and define relevance, and attorneys, not robots or scientists, who make the production of relevant documents to the other side.14
Scientific research is better served when driven by reason and objective measurements, not prejudices and assumptions about an entire profession and our common law system of justice, based as it is on an adversarial truth seeking process. The e-Discovery Team will continue to look for ways to improve quality control, and guard against inadvertent errors, which always exist in any human endeavor, and identify intentional errors, which rarely exist in legal search, but, we concede, may sometimes take place. For that reason we will explore greater reliance on automated process in our future research and other quality control techniques.15 We will not, however, abandon a hybrid approach where a human remains, if not in control, then at least as an active partner, out of any subjective prejudices against lawyers. We also refuse to accept the unproven assumption that our adversarial system is inherently suspect, encourages bias, and otherwise requires that humans be removed from e-discovery and replaced by robots. Conversely, we do not naively assume lawyers are automatically superior to machines. We have long advocated against the current legal standard of only using manual review of every document. The Team’s hybrid approach aims for a proportional balance.
4. EXPERIMENTS AND DISCUSSIONS
The e-Discovery Team sought to answer the three previously listed Research Questions in its experiments at the 2015 TREC Total Recall Track.
4.1 First and Primary Research Question.
What Recall, Precision and Effort levels will the e-Discovery Team attain in TREC test conditions over all 30 Topics using the Team’s Predictive Coding 3.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery.com Review (EDR)?
We primarily measured effort by the number of documents that were actually human-reviewed and coded relevant or irrelevant. The Team human-reviewed only 32,916 documents to classify 16,576,798 documents. As an additional measure of effort, we estimated our total time spent on all Topics. The Team spent 45 days doing all of the work, with an estimated average of 8 hours per day total expended by the Team. (All Team members carried on their normal employment activities on only a somewhat reduced basis during the 45 days of the review, and TREC work was also reduced on most weekends.) The estimated total hours spent by Team members for both analysis and review is thus approximately 360 hours.
It is typical in legal search to try to measure the efficiency of a document review by the number of documents classified in an hour. For instance, a typical contract review attorney can classify an average of 50 documents per hour. Here using Predictive Coding 3.0 our Team classified 16,576,798 documents in 360 hours. That is an average speed of 46,047 files per hour.
In legal search it is also typical, indeed mandatory, to measure the costs of review and bill clients accordingly. If we here assume a high attorney hourly rate of $500 per hour, then the total cost of the review of all 30 Topics would be $180,000. That is a cost of about $0.01 per document. In a traditional legal review, where a lawyer reviews one document at a time, the cost would be far higher. Even if you assume a low attorney rate of $50 per hour, and review speed of 50 files per hour, the total cost to review would be $16,576,798. That is a cost of $1.00 per document, which is actually low by legal search standards.16
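The arithmetic behind these speed and cost comparisons can be checked with a few lines. This is only a worked restatement of the figures quoted in this section (documents classified, days, hours per day, hourly rates, and linear review speed); the variable names are our own.

```python
TOTAL_DOCS = 16_576_798      # documents classified across all 30 Topics
TEAM_HOURS = 45 * 8          # 45 days at an estimated 8 hours per day
HIGH_RATE = 500              # assumed high attorney rate, dollars per hour
LINEAR_SPEED = 50            # documents per hour in traditional linear review
LOW_RATE = 50                # assumed low attorney rate, dollars per hour

# Hybrid multimodal review as performed by the Team.
team_speed = TOTAL_DOCS / TEAM_HOURS          # documents classified per hour
team_cost = TEAM_HOURS * HIGH_RATE            # total dollars
team_cost_per_doc = team_cost / TOTAL_DOCS    # dollars per document

# Traditional linear review of every document, for comparison.
linear_hours = TOTAL_DOCS / LINEAR_SPEED
linear_cost = linear_hours * LOW_RATE
linear_cost_per_doc = linear_cost / TOTAL_DOCS

print(f"Team: {TEAM_HOURS} hours, {team_speed:,.0f} docs/hour, "
      f"${team_cost:,} total, ${team_cost_per_doc:.4f} per document")
print(f"Linear: {linear_hours:,.0f} hours, ${linear_cost:,.0f} total, "
      f"${linear_cost_per_doc:.2f} per document")
```

Under these assumptions the per-document cost of the hybrid review comes out at roughly one cent, against one dollar for linear review even at the low hourly rate.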
Analysis of project duration is also very important in legal search. Instead of the 360 hours expended by our Team using Predictive Coding 3.0, traditional linear review would have taken 331,536 hours (16,576,798/50). In other words, what we did in 45 days, taking 360 hours, would have taken a team of two lawyers using traditional methods over 45 years.
Complete details and descriptions of the ad hoc methods employed in all thirty topics are included in the Appendix.
4.2 Research Question No. 2.
How will the Team’s results using its semi-automated, supervised learning method compare with other Recall Track participants using semi-automated supervised learning methods?
Unfortunately no other Athome participants completed all thirty topics and only one completed all ten Bush email topics. The lack of participation by others in the Athome group makes meaningful comparisons very difficult or impossible, but we note that the e-Discovery Team’s scores were consistently higher than those of any other Athome participants.
The Sandbox participants’ work included the same three datasets as AtHome, but none of them also participated in the Athome division. This is unfortunate because it makes direct comparisons problematic, if not impossible, especially as to the software systems used. Still, with some caveats, a few limited comparisons are possible between the two divisions because the same topics and datasets were searched.
4.3 Research Question No. 3.
What are the ideal ratios, if any, for relevant and irrelevant training examples to maximize effectiveness of active machine learning with EDR?
The Team experimented with various positive and negative training ratios using the predictive coding training features of their software. Most of these experiments were post hoc, but some were carried out during the initial TREC submissions. In some of the thirty topics our review work would have been concluded earlier but for these side experiments.
5. RESULTS
5.1 Research Question No. 1.
The TREC measured results demonstrated high levels of Recall and Precision with relatively little human review efforts using the e-Discovery Team’s methods and EDR. The three-man attorney Team was able to review and classify 16,576,798 documents in 45 days under difficult TREC test conditions. They attained total Recall of all relevant documents in all 30 Topics by human review of only 32,916 documents. They did so with two-man attorney teams in the 10 Bush Email Topics, and one-attorney teams in the 20 other Topics. In Topic 3484, which searched a collection of 902,434 News Articles, the Team attained both 100% Recall and 100% Precision. On many other Topics the Team attained near perfection scores. In total, very high scores were recorded in 18 of the 30 topics with good results obtained in all, especially when considering the low human efforts involved in the supervised learning. Moreover, the Team’s F1 scores at the time of Reasonable Call ranged from a perfect score of 100% in Topic 3484, to 91% to 99% in eight topics, and 82% to 87% in five others.
Considering the limited human effort put into the reviews, and the speed of the reviews, we consider the results in all Topics to be excellent. As shown by the comparisons with traditional review discussed above, these results are far superior to the typical linear legal document review done by law firm attorneys and contract review attorneys.
The efforts by number of documents human reviewed in all thirty topics are shown in Figure 1 below. As you can see, the Team reviewed 32,916 documents to attain total recall of the 70,414 documents predetermined by TREC as relevant in all 30 Topics from out of a total of 16,576,798 documents. The average number of documents reviewed to attain total Recall in each topic was 1,097. The figure ranged from a low of 19 documents reviewed in Topic 2134 (PayPal), which had 252 relevant documents, to a high of 7,203 in Topic 103 (Manatee Protection), which had 5,725 relevant documents.
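The summary effort figures in the paragraph above follow directly from the totals it quotes. A small sketch, using only those quoted numbers (the variable names are ours), makes the derivation explicit:

```python
total_docs = 16_576_798   # documents classified in all 30 Topics
total_relevant = 70_414   # documents predetermined by TREC as relevant
total_reviewed = 32_916   # documents human reviewed by the Team
topic_count = 30

# Average human review effort per topic.
avg_reviewed_per_topic = total_reviewed / topic_count

# Share of the whole collection that was ever read by a human.
reviewed_share = total_reviewed / total_docs

# Review effort relative to the relevant pool, for the two extreme topics.
paypal_ratio = 19 / 252          # Topic 2134: lowest review count
manatee_ratio = 7_203 / 5_725    # Topic 103: highest review count

print(f"Average reviewed per topic: {avg_reviewed_per_topic:,.0f}")
print(f"Share of collection human reviewed: {reviewed_share:.2%}")
print(f"Reviewed per relevant document, Topic 2134: {paypal_ratio:.2f}")
print(f"Reviewed per relevant document, Topic 103: {manatee_ratio:.2f}")
```

The two ratios show how widely effort varied: Topic 2134 needed far fewer reviewed documents than it had relevant ones, while Topic 103 needed more.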
Effort (Docs reviewed) by RECALL SCORES
Topic | Need | Total Documents | Total Relevant | 70% | 80% | 90% | 95% | 97.5% | 100%
Topic 100 | School and Preschool Funding | 290,099 | 4,542 | 651 | 651 | 651 | 651 | 651 | 651
Topic 101 | Judicial Selection | 290,099 | 5,834 | 6,841 | 6,895 | 6,895 | 6,895 | 6,895 | 6,896
Topic 102 | Capital Punishment | 290,099 | 1,624 | 1,493 | 1,493 | 1,493 | 1,493 | 1,493 | 1,493
Topic 103 | Manatee Protection | 290,099 | 5,725 | 7,203 | 7,203 | 7,203 | 7,203 | 7,203 | 7,203
Topic 104 | New Medical Schools | 290,099 | 227 | 1,091 | 1,091 | 1,091 | 1,091 | 1,091 | 1,091
Topic 105 | Affirmative Action | 290,099 | 3,635 | 582 | 582 | 582 | 674 | 674 | 674
Topic 106 | Terri Schiavo | 290,099 | 17,135 | 831 | 1,987 | 1,995 | 2,005 | 2,025 | 2,226
Topic 107 | Tort Reform | 290,099 | 2,369 | 877 | 1,142 | 1,164 | 1,164 | 1,164 | 1,164
Topic 108 | Manatee County | 290,099 | 2,375 | 696 | 696 | 696 | 696 | 696 | 696
Topic 109 | Scarlet Letter Law | 290,099 | 506 | 491 | 496 | 639 | 753 | 753 | 753
Topic 2052 | Paying for Amazon Book Reviews | 465,147 | 265 | 1,842 | 1,960 | 2,213 | 2,325 | 2,325 | 2,325
Topic 2108 | CAPTCHA Services | 465,147 | 656 | 2,101 | 2,101 | 2,101 | 2,101 | 2,101 | 2,101
Topic 2129 | Facebook Accounts | 465,147 | 589 | 94 | 94 | 94 | 94 | 94 | 94
Topic 2130 | Surely Bitcoins can be Used | 465,147 | 2,299 | 283 | 283 | 285 | 285 | 285 | 285
Topic 2134 | Paypal Accounts | 465,147 | 252 | 19 | 19 | 19 | 19 | 19 | 19
Topic 2158 | Using TOR for Anonymous Internet Browsing | 465,147 | 1,261 | 1,332 | 1,332 | 1,332 | 1,332 | 1,332 | 1,335
Topic 2225 | Rootkits | 465,147 | 182 | 183 | 186 | 205 | 214 | 219 | 225
Topic 2322 | Web Scraping | 465,147 | 10,145 | 194 | 195 | 195 | 195 | 195 | 195
Topic 2333 | Article Spinner Spinning | 465,147 | 4,805 | 190 | 228 | 228 | 228 | 228 | 228
Topic 2461 | Offshore Host Sites | 465,147 | 179 | 32 | 32 | 32 | 32 | 32 | 32
Topic 3089 | Pickton Murders | 902,434 | 255 | 472 | 516 | 779 | 834 | 834 | 836
Topic 3133 | Pacific Gateway | 902,434 | 113 | 49 | 49 | 49 | 49 | 49 | 49
Topic 3226 | Traffic Enforcement Cameras | 902,434 | 2,094 | 18 | 18 | 18 | 78 | 81 | 81
Topic 3290 | Rooster Turkey Chicken Nuisance | 902,434 | 26 | 137 | 191 | 306 | 306 | 310 | 310
Topic 3357 | Occupy Vancouver | 902,434 | 629 | 751 | 751 | 920 | 920 | 920 | 920
Topic 3378 | Rob McKenna Gubernatorial Candidate | 902,434 | 66 | 79 | 161 | 200 | 200 | 200 | 200
Topic 3423 | Rob Ford Cut the Waist | 902,434 | 76 | 92 | 92 | 92 | 92 | 92 | 92
Topic 3431 | Kingston Mills Lock Murders | 902,434 | 1,111 | 272 | 272 | 272 | 272 | 272 | 302
Topic 3481 | Fracking | 902,434 | 1,966 | 31 | 236 | 367 | 367 | 367 | 367
Topic 3484 | Paul and Cathy Lee Martin | 902,434 | 23 | 22 | 22 | 22 | 22 | 73 | 73
TOTALS | | 16,576,800 | 70,964 | 28,949 | 30,974 | 32,138 | 32,590 | 32,673 | 32,916
Figure 1
The Team’s attainment of high levels of Recall and Precision in multiple projects confirms the hypothesis that EDR software and the Team’s Predictive Coding 3.0 hybrid multimodal methods are effective in most projects at attaining high levels of Recall and Precision with minimal human efforts.
The below charts summarize for each of the three datasets the Precision results obtained in each topic at 70% or higher Recall levels. Precision is shown on the left and Recall levels attained by submissions are shown on the bottom. A different colored line shows each Topic. Although Precision was not the focus of the efforts in the Team’s Recall Track participation, instead the focus was on Recall and effort, still the measurements of Precision across the Recall levels provide valuable insights into the overall work. Figure 2 below shows the results of the 10 Topics in the Jeb Bush Email collection of 290,099 emails. Figure 3 shows the results of the 10 Topics in the Black Hat World Forum collection of 465,149 posts, and Figure 4 shows the results of the News Articles collection of 902,434 articles.
Figure 1
A quick exam of the results of the Bush Email Topics shows that four of the ten Topics had significantly less Precision in attaining 80% or higher Recall than the others. They are: Topic 104 New Medical Schools, shown in purple; Topic 100 School and Preschool Funding, shown in blue; Topic 102 Capital Punishment, shown in green; and, Topic 108 Manatee County. Topic 108 was probably the most error-filled of all of the Topic standards, and this may explain part of the outlier results for that topic and others in this low performing group. Investigation of the outliers showed that the primary cause of these results was disagreement between the Team’s lead attorney for the Bush email, a lifelong Florida resident who is used to serving as the SME defining ground truth, and the TREC assessors’ relevance determinations. Also, these ten Bush topics were carried out at the beginning of the project, before the Team adopted the counter-strategy of greater reliance on machine ranking to mitigate the impact of the personal judgment disagreements.
Figure 2
Analysis of the results of the ten Topics in Black Hat World also indicated that the relevance disagreements accounted for most of the discrepancies.
It appears that errors and inconsistencies in the TREC standard judging explain most of the Precision differences among the Topics, especially the Topics in the Black Hat World dataset. In several of these Topics the Team often had difficulty detecting any logical pattern to the relevance scope. They instead, as mentioned, had to rely almost entirely on the EDR relevance predictions. Only the Team software in some of these Topics could detect any connectivity and pattern to the TREC relevance standards.
The results on the local News dataset of 902,434 articles (Figure 4 below) again show significant divergences in Precision, although less than the differences seen in the Bush Email or Black Hat World datasets. Analysis of the results of the ten News Articles Topics again shows considerable disagreement on relevance judgments in some topics. Inherent difficulty of the various issues in the Topics may also explain some of the differences. The size of the relevance pool also has a direct bearing on the Precision.
Figure 3
The following results are highlights of the Team’s top 18 topics where at least seventy-five percent of the target documents (Recall 75%+) were found with a Precision rate of 80% or higher. The Top-18 Projects of the Team are ranked by us, somewhat arbitrarily, as follows, starting with a previously unheard of perfect score.
1. In Topic 3484 (Paul & Kathy Martin), the e-Discovery Team (Jim Sullivan) attained a perfect score of 100% Precision and 100% Recall. All 23 of the target documents were found in the first 23 documents submitted. Sullivan then called Reasonable after the 23rd relevant document was submitted and so played the perfect game. He predicted that the remaining 902,411 articles in the News collection would be irrelevant. Sullivan was right. The effort expended for perfection was his personal review of 73 news reports out of the total collection of 902,434. 100% Recall with 100% Precision in a large search project was previously thought impossible by most text retrieval experts.
2. In Topic 3431 (Kingston Mills Murders), 100% Recall was attained by the Team (Tony Reichenberger) with 82.3% Precision. He attained 97.5% Recall with a Precision of 98.9%, and 95% Recall with 99% Precision. The effort expended to reach 100% Recall was his personal review of 332 news reports out of the total collection of 902,434.
3. In Topic 106 (Terri Schiavo), which had the highest prevalence of any topic (5.9%), 98.47% Recall was attained by the Team (Ralph Losey) with 97.22% Precision. At that time, after submitting 2,025 documents, he called reasonable. The F1 measure then attained was 97.84%. The effort, or number of documents reviewed and coded by Losey to attain this result, was 2,025 Bush emails, out of the total collection of 290,099, and total relevant of 17,135. A contract review attorney, whose standard billing rate is one-tenth that of Losey’s, assisted in the review effort. Losey also attained 99.7% Recall in this Topic with a Precision of 70%.
4. In Topic 2158 (Using TOR), the Team (Jim Sullivan) attained 97.5% Recall of the target while maintaining a Precision of 95%. He attained 95% Recall with a Precision of 98.4%, and 90% Recall with 99% Precision. The effort expended to reach 97.5% Recall was his personal review of 1,332 Black Hat Forum posts, out of the total collection of 465,149.
5. In Topic 103 (Manatee Protection), which had the third highest Prevalence of 1.97%, the Team (Ralph Losey) attained 97.5% Recall with a Precision of 90.6%, 95% Recall with a Precision of 98.8%, and 90% Recall with 99.3% Precision. The effort expended to reach 97.5% Recall was his personal review of 7,203 Bush emails, out of the total collection of 290,099. Again he was assisted by a contract review attorney. The high review count here is due to the fact this is one of two projects where the Predictive Coding 3.0 second step of random sampling was included. This is also the first project undertaken.
6. In Topic 109 (Scarlet Letter Law), the Team (Ralph Losey) attained 97.5% Recall with 84.4% Precision, 95% Recall with 95.4% Precision, and 90% Recall with 96% Precision. The effort expended to reach 97.5% Recall was his personal review of 753 Bush emails, again out of the total collection of 290,099. One contract review attorney assisted.
7. In Topic 3378 (Rob McKenna), the Team (Tony Reichenberger) attained 100% Recall after the submission of only 192 documents and review of only 200 documents. This was a low prevalence Topic with only 66 relevant out of the total collection of 902,434. For these reasons the Precision was 34.31%, even though only 192 documents were submitted to attain 100% Recall.
The Team results exceeded expectations, where our Recall goal was 90%, in many additional Topics:
8. In Topic 3481 (Fracking), the Team (Jim Sullivan) attained 95% Recall with 95.2% Precision by reviewing only 367 news articles.
9. In Topic 105 (Affirmative Action), the Team (Ralph Losey) attained 90% Recall with 99.7% Precision by reviewing only 582 emails (one contract review attorney assisted).
10. In Topic 3089 (Pickton Murders), the Team (Joe White) attained 90% Recall with 97.9% Precision by reviewing only 779 articles. A 99.61% Recall level was attained with 54.98% Precision, again with review of only 799 articles.
11. In Topic 3226 (Traffic Cameras), the Team (Jim Sullivan) attained 90% Recall with 95.9% Precision by his personal review of only 18 forum posts.
12. In Topic 101 (Judicial Selection), which had the second highest Prevalence rate of 2%, the Team (Ralph Losey) attained 90% Recall with 87.8% Precision by reviewing 6,895 emails (one contract review attorney assisted).
13. In Topic 3357 (Occupy Vancouver), the Team (Tony Reichenberger) attained 90% Recall with 82.4% Precision by reviewing only 920 news articles.
14. In Topic 107 (Tort Reform), the Team (Ralph Losey) attained 90% Recall with 80.9% Precision by reviewing only 1,164 emails (one contract review attorney assisted).
Four additional Topics also did quite well, and attained Recall levels over 75% with high Precision rates:
15. In Topic 2225 (Rootkits) the Team (Ralph Losey) attained 80% Recall with 88% Precision by reviewing only 186 forum posts.
16. In Topic 2333 (Article Spinner) the Team (Ralph Losey) attained 80% Recall with 79% Precision by reviewing only 228 forum posts.
17. In Topic 2052 (Paying for Book Reviews) the Team (Jim Sullivan) attained 80% Recall with 73.4% Precision by reviewing 1,960 forum posts.
18. In Topic 3133 (Pacific Gateway) the Team (Ralph Losey) attained 76.99% Recall with 89.69% Precision by reviewing only 49 News Articles.
Figure 5 below shows the recall and precision of these top 18 projects.
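As an illustrative aside, the F1 measure quoted in the list above is the harmonic mean of Precision and Recall. The short sketch below is not part of the Team's EDR software or of the TREC tooling; the function name `f1_score` is our own label for the illustration. It reproduces the F1 value reported for Topic 106 from the Precision and Recall stated in item 3.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must be fractions between 0 and 1")
    if precision + recall == 0.0:
        return 0.0  # conventional value when both inputs are zero
    return 2.0 * precision * recall / (precision + recall)


# Topic 106 at the Reasonable Call, using the figures quoted in item 3.
topic_106_f1 = f1_score(precision=0.9722, recall=0.9847)
print(f"Topic 106 F1: {topic_106_f1:.2%}")  # prints "Topic 106 F1: 97.84%"
```

Because the harmonic mean is pulled toward the lower of its two inputs, a topic such as 3378 (100% Recall, low Precision) scores a far lower F1 than its Recall alone would suggest.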
Figure 5
The Team’s lower performance in the other 12 projects was, according to our analysis, primarily caused by the fact that the attorney Team members are accustomed to self-defining the ground truth, and their opinions on relevance differed significantly from the TREC assessors. In later topics the attorney Team learned to turn off their own judgments and rely primarily on their software’s automated processes, at which point their scores improved. In all topics the machine learning of the Team’s EDR software was able to find documents that TREC would consider relevant, even where the human team members could see no connection. But in some topics the human searchers would be completely bewildered by the zigzag relevance scope shown by TREC’s response to submissions. The attorneys would not see any kind of logical connecting pattern to some of the documents that TREC determined to be relevant. Sometimes the attorneys only saw wrong answers and inconsistencies. Even though the attorneys could not see any pattern, they learned that their EDR software could often still find the patterns and correctly predict which documents TREC would label relevant. When this happened they would in effect turn all submission decisions over to EDR and only submit the highest-ranking documents. The cut-off point of ranking for submissions, be it top 5% or top 100 documents, or some other scheme, was still determined by the human in charge. That is part of the Team’s hybrid design.
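The human-chosen cut-off described above can be sketched in a few lines. This is only a hypothetical illustration: EDR's ranking interface is proprietary and is not described in this paper, so the function `select_for_submission`, its parameters, and the document identifiers are all assumptions made for the example. The sketch assumes each document already carries a machine-assigned relevance score and that the reviewer picks either a fixed count or a fraction of the ranked list.

```python
import math
from typing import Dict, List, Optional


def select_for_submission(
    scores: Dict[str, float],
    top_n: Optional[int] = None,
    top_fraction: Optional[float] = None,
) -> List[str]:
    """Return the highest-scoring document ids up to a human-chosen cut-off.

    Exactly one of top_n (a document count) or top_fraction (a share of the
    ranked list, e.g. 0.05 for the top 5%) must be supplied.
    """
    if (top_n is None) == (top_fraction is None):
        raise ValueError("give exactly one of top_n or top_fraction")
    # Rank document ids from highest to lowest predicted relevance.
    ranked = sorted(scores, key=scores.get, reverse=True)
    if top_fraction is not None:
        if not 0.0 < top_fraction <= 1.0:
            raise ValueError("top_fraction must be in (0, 1]")
        top_n = math.ceil(len(ranked) * top_fraction)
    if top_n < 0:
        raise ValueError("top_n must not be negative")
    return ranked[:top_n]


# Hypothetical scores for four documents.
example_scores = {"doc-a": 0.97, "doc-b": 0.12, "doc-c": 0.88, "doc-d": 0.55}
print(select_for_submission(example_scores, top_n=2))
print(select_for_submission(example_scores, top_fraction=0.25))
```

The point of the design is that the scoring is automated while the cut-off remains a human decision, which is the division of labor the paragraph above calls hybrid.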
There are probably other explanations for the bottom twelve scoring topics aside from questionable TREC assessor adjudications, including: the data itself; the difficulty of the issues addressed in the Topic; relative performance of human reviewers; and, the impact of the omission of Steps Three and Seven from the Team’s standard workflow to meet the 45 day time limitation, and the radical change to Step One. See: Concept Drift and Consistency: Two Keys to Document Review Quality, e-Discovery Team (Jan. 20, 2016). Not all of the Team’s inconsistencies were caused by differences of opinion on TREC relevance adjudications, only some. We appreciate the difficulty of creating interesting topics for such a diverse group of participants, most of whom used fully automated CAL approaches. We understand the inherent difficulties in setting a ground truth for prejudged relevance where the traditional TREC pooling methods could not be used.[17] In spite of our criticisms here, we overall have high praise and thanks for the TREC administrators’ tireless efforts and agree with the majority of the assessments they made under difficult, time constrained conditions.
Regardless of these issues and metric inconsistencies, the Team’s manual efforts, as measured by time expended and number of documents manually reviewed, were consistently very low in all topics. More than half of the relevant documents found were not manually reviewed. Instead, the Team was routinely able to delegate relevance coding to the EDR software, either by choice and convenience, or sometimes, as discussed, by necessity in the topics where the ground truth of relevance was unknown and incomprehensible to the attorneys. This result should shatter once and for all the already weakened legal search myth that all documents must be manually reviewed for relevance.
Although not directly comparable due to different test conditions, different searches, etc., the e-Discovery Team’s scores were far higher than any previously recorded in the six years of TREC Legal Track (2006-2011)[18] or any other study of legal search.[19] The results of Blair and Maron and TREC from 2007 to 2011 are summarized below in Figure 6 with F1 scores.
Figure 6
This is not a listing of the average score per year; such scores would be far, far lower. Rather this shows the very best effort attained by any participant in that year in any topic. These are the highest scores from each TREC year. Note how they compare with the Team’s high scores in 2015, Figure 7.
Figure 7
One reason for this significant jump in high scores may be that many of the thirty topics in the 2015 Total Recall Track presented relatively simple information needs by legal search standards, with one major exception, Topic 109 – Scarlet Letter Law. It required some legal knowledge and analysis. There were also four other minor exceptions – Topics 101, 105, 106, 107 – that required some measure of legal analysis. Another explanation may be improved software and the Team’s hybrid multimodal method that includes continuous active learning. The latter is strongly suggested because the results in Topic 109, as well as Topics 101, 105, 106 and 107, are close to typical legal search type projects and the Team’s results in these topics were all consistently high: Topic 109 (Scarlet Letter Law) - 95% F1 at Reasonable Call; Topic 101 (Judicial Selection) - 87% F1 at Reasonable Call; Topic 105 (Affirmative Action) - 95% F1 at Reasonable Call; Topic 106 (Terri Schiavo) - 98% F1 at Reasonable Call; Topic 107 (Tort Reform) - 84% F1 at Reasonable Call. This is shown in Figure 8 below.
Figure 8
5.2 Research Question No. 2.
The Team attained very high recall and precision rates in most, but not all, of the thirty Total Recall topics. The Team’s F1 scores at the time of Reasonable Call ranged from a perfect score of 100% in one topic (3484), to 91% to 99% in eight topics, and 82%-87% in five others.
Although, of course, not directly comparable, these scores are far higher than any previously recorded in the six years of TREC Legal Track (2006-2011) or any other study of legal search. One reason for this may be that the thirty topics in the 2015 Total Recall track presented relatively simple information needs by legal search standards, with one exception (Topic 109 – Scarlet Letter Law). Another may be improved software and the Team’s hybrid multimodal method that includes continuous active learning.
Since most of the thirty topics presented only simple, single-issue information needs suitable for single-facet classification, they had somewhat limited value for purposes of legal search experimentation. Further, only a few of the topics required any legal analysis for relevance identification. This again limited the use of these experiments for purposes of legal search research. These two factors, plus the omission of metadata, were a disadvantage to the e-Discovery Team of lawyers who are practiced in more complex information needs requiring extensive legal analysis and SME defined ground truths. Further, their methods and EDR software are designed to utilize full metadata derived from native files. Conversely, it appears that these same factors made it simpler for the Sandbox participants to perform well in most topics.
The one exception was Topic 109, Scarlet Letter Law, which, as mentioned, was the only topic requiring legal analysis and some very rudimentary knowledge to begin locating relevant documents. The keywords alone – “Scarlet Letter Law” – would only find relevant documents with this word combination and similar text patterns. These words were just the nickname of the proposed and eventually enacted Florida Statute. Any attorney would know that to find relevant information they would not only have to search the name, they would have to search the various house and senate bill numbers for this law. These numbers would not often appear in the same document as the nickname, and since the machine did not know to search for these numbers, it did not realize the significance. Eventually the automated machine learning saw the connection, after many relevance feedback submissions. These submissions would, of course, not happen in real legal search, and even if they did, this imprecision would equate to substantial additional human reviews and thus expense.
Somewhat surprisingly to us, the fully automatic methods employed by the Sandbox participants attained recall and precision scores comparable to that of the e-Discovery Team in most of the topics. Moreover, there were few differences between the various fully automated approaches. Still, the highest F1 values at the time of Reasonable Call were attained by the e-Discovery Team in twenty of the thirty topics, and the second or third best F1 scores in four others. This is shown in Figure 9 below. The Team F1 rankings for each topic are shown in the third column.
Figure 9
F1                         Topic100 Rank   Topic101 Rank   Topic102 Rank   Topic103 Rank   Topic104 Rank   Topic105 Rank   Topic106 Rank   Topic107 Rank   Topic108 Rank   Topic109 Rank
eDiscoveryTeam               68.96%    2     82.45%    4     69.88%    1     90.69%    1     73.53%    1     95.07%    1     97.38%    1     84.40%    1     47.03%    5     95.58%    1
NINJA                        22.74%    8     79.17%    5     56.38%    5     83.79%    3     57.40%    4     77.24%    2     88.90%    5     50.89%    8     13.43%   11     48.79%    2
UvA.ILPS-baseline            73.55%    1     86.36%    1     56.38%    4     89.94%    2     10.27%   10     64.13%    5     95.87%    2     77.26%    4     64.47%    1     28.88%    3
UvA.ILPS-baseline2           45.56%    5     71.04%    7     42.42%    8     77.24%    6      2.42%   11     43.27%    7     84.67%    6     47.81%    9     35.13%    8     26.90%    6
WaterlooClarke-UWPAH1        11.95%    9      9.98%   11     32.16%   11     10.46%   11     68.51%    2     15.99%   10      3.61%   10     22.96%   11     21.61%    9      0.73%    8
WaterlooClarke-UWPAH2        10.37%   10      9.98%   10     32.16%   10     10.46%   10     65.93%    3     15.99%   11      3.54%   11     23.11%   10     21.54%   10      0.73%    9
WaterlooCormack-Knee100      45.02%    6     67.65%    9     42.32%    9     71.10%    9     28.49%    7     34.08%    8     77.03%    9     53.92%    7     42.65%    7      0.94%    7
WaterlooCormack-Knee1000     41.82%    7     67.67%    8     45.21%    7     71.11%    8     31.06%    5     33.90%    9     77.03%    8     57.79%    5     42.65%    6     27.17%    5
WaterlooCormack-stop2399     68.21%    3     72.02%    6     51.74%    6     75.55%    7     14.34%    9     58.92%    6     81.60%    7     57.77%    6     58.96%    2     27.17%    4
Webis-baseline               66.96%    4     83.87%    3     68.36%    2     82.42%    5     27.95%    8     64.91%    4     94.89%    4     79.24%    3     58.76%    3      0.00%   11
Webis-keyphrase               0.14%   11     85.21%    2     67.71%    3     83.15%    4     31.04%    6     65.13%    3     94.90%    3     79.24%    2     58.34%    4      0.33%   10

F1                        Topic2052 Rank  Topic2108 Rank  Topic2129 Rank  Topic2130 Rank  Topic2134 Rank  Topic2158 Rank  Topic2225 Rank  Topic2322 Rank  Topic2333 Rank  Topic2461 Rank
eDiscoveryTeam               45.21%    1     53.99%    1     26.10%    6     64.31%    1     12.23%    6     95.61%    1     84.90%    1     72.60%    3     73.23%    1     16.68%    7
NINJA                        58.13%    2     53.66%    2     49.22%    2     52.18%    2     39.70%    2     76.26%    2     39.43%    4     24.83%    9     62.65%    6     24.48%    5
UvA.ILPS-baseline            10.74%    3     22.74%    9     21.88%    7     41.12%    4      8.08%    7     42.02%    7      7.20%    9     73.20%    2     69.80%    2      7.33%    9
UvA.ILPS-baseline2           10.37%    4     22.45%   10     19.23%    8     30.88%    5      6.96%    8     22.47%    9      6.45%   10     48.11%    6     46.02%    9      6.53%   10
WaterlooClarke-UWPAH1        78.54%    5     52.20%    3     56.89%    1     13.42%    8     63.18%    1     40.08%    8     61.45%    2      5.85%   10     12.22%   10     49.90%    1
WaterlooCormack-Knee100      41.43%    6     33.89%    5     28.52%    5     19.49%    6     18.45%    3     16.15%   10     41.33%    3     47.39%    7     47.33%    7     43.87%    2
WaterlooCormack-Knee1000     38.10%    7     34.00%    4     30.91%    4     19.45%    7     18.45%    4     60.57%    5     27.02%    5     44.11%    8     47.30%    8     21.65%    6
WaterlooCormack-stop2399     16.94%    8     31.35%    7     31.01%    3     46.56%    3     15.51%    5     45.06%    6     11.84%    8     75.86%    1     68.87%    3     11.72%    8
Webis-baseline               13.24%    9     32.65%    6      7.73%   10      0.00%   10      2.21%   10     61.11%    4     18.36%    6     67.40%    5     68.07%    4     43.56%    3
Webis-keyphrase              10.53%   10     30.56%    8      8.29%    9      0.00%    9      2.21%    9     62.14%    3     12.97%    7     67.72%    4     68.04%    5     31.95%    4

F1                        Topic3089 Rank  Topic3133 Rank  Topic3226 Rank  Topic3290 Rank  Topic3357 Rank  Topic3378 Rank  Topic3423 Rank  Topic3431 Rank  Topic3481 Rank  Topic3484 Rank
eDiscoveryTeam               93.28%    1     82.46%    1     55.39%    4     37.70%    2     86.70%    2     68.21%    1     58.12%    1     99.24%    1     95.48%    1    100.00%    1
NINJA                        86.84%    2     67.97%    2     22.75%    9     38.98%    1     89.95%    1     67.88%    2     57.85%    2     74.67%    4     71.59%    2    100.00%    1
UvA.ILPS-baseline             5.47%    9      2.47%    9     37.25%    5      0.57%    9     12.75%    9      1.39%    9      1.26%    9     21.90%    7     35.00%    7      0.51%    9
UvA.ILPS-baseline2            5.35%   10      2.39%   10     34.75%    6      0.39%   10     11.82%   10      1.38%   10      0.74%   10     21.74%    8     29.19%    9      0.51%   10
WaterlooClarke-UWPAH1        76.14%    3     50.45%    3     24.73%    7     11.90%    5     62.65%    3     32.58%    4     18.65%    5     44.29%    6     26.87%   10     12.99%    6
WaterlooCormack-Knee100      57.66%    4     49.02%    4     64.61%    2     26.09%    3     55.57%    4     57.87%    3     30.70%    3     93.34%    3     53.62%    5     34.07%    4
WaterlooCormack-Knee1000     37.35%    5     18.38%    6     68.61%    1      4.59%    7     48.23%    5     11.26%    7      6.77%    7     93.77%    2     61.55%    4      4.07%    7
WaterlooCormack-stop2399     16.41%    7      8.43%    7     56.65%    3      2.01%    8     32.80%    6      5.01%    8      3.56%    8     44.78%    5     53.56%    6      1.78%    8
Webis-baseline               14.77%    8     47.06%    5     24.51%    8     19.31%    4     18.84%    7     27.37%    5     28.16%    4     19.71%    9     65.54%    3     34.59%    3
Webis-keyphrase              19.10%    6      6.40%    8     18.29%   10     10.22%    6     17.98%    8     18.23%    6     16.04%    6     19.19%   10     32.89%    8     30.08%    5

In Topic 109, Scarlet Letter Law, where some legal knowledge and analysis was required to understand relevance, the Team attained significantly better results - 96% F1 - at the time of Reasonable Call than did the automatic runs. In the Sandbox automatic runs the F1 values at the time of Reasonable Call ranged from 0% to 29%. Moreover, at the 1R point in Topic 109, the e-Discovery Team had attained over 95% recall, whereas all of the automated methods were still less than 1% recall. This is shown in the chart below, Figure 10.
Figure 10
The Team’s multimodal human machine approach also consistently found more relevant documents at the start of a search, and did so with greater precision than the fully automated approaches. Further, the hybrid man-machine approach was consistently more effective at determining a stop point, referred to by the Recall Track as a “Reasonable Call.” An example of this is shown in Figure 11 for Topic 109. The dark green line represents the Reasonable Call point, recall is shown on the vertical axis, and the horizontal axis is the number of documents submitted.
Figure 11
Another way to evaluate the performance of the multi-modal approach is to consider how precise the coding suggestions were during the course of review. This would indicate an efficient review, which is critical in legal search to cost savings. As to the Athome 109 topic, Figure 12 below contrasts precision percentage on the Y-axis with recall percentage on the X-axis. Precision does not begin to drop until approximately 95% Recall. Note that the green line representing percent of the database submitted barely moves off the baseline. Figure 13 shows the actual document counts reviewed and submitted in order to obtain the various precision thresholds.
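For readers who wish to reproduce this kind of precision-by-recall reading from their own runs, the following minimal sketch shows one way to do it. It is an assumption-laden illustration, not the method used by the Total Recall administrators to draw the figures: the function name `precision_at_recall` and the input format (the ordered list of relevance judgments returned for each submitted document, plus the known total of relevant documents) are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_at_recall(
    judgments: Sequence[bool],
    total_relevant: int,
    target_recall: float,
) -> Optional[Tuple[int, float]]:
    """Find the first point in a submission sequence where target_recall is met.

    judgments[i] is True when the (i+1)-th submitted document was judged
    relevant. Returns (documents_submitted, precision) at that point, or
    None if the target recall is never reached.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    found = 0
    for submitted, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            found += 1
        if found / total_relevant >= target_recall:
            return submitted, found / submitted
    return None


# Toy run: 8 submissions against a topic with 4 relevant documents in total.
run = [True, True, False, True, False, False, True, False]
for level in (0.50, 0.75, 1.00):
    print(level, precision_at_recall(run, total_relevant=4, target_recall=level))
```

In the toy run, precision falls as the recall target rises, which is the usual shape; the curve described above for Topic 109 is notable because the drop does not begin until approximately 95% Recall.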
Figure 12
Figure 13
For further comparison Figure 14 below (prepared by the Total Recall administrators) plots the average Athome3 precision by recall results. The e-Discovery Team results (barely visible on top) follow a curve very similar to the Athome 109 topic. The Team’s results outperformed the automated runs for most of the duration of the process, demonstrating a consistent efficiency in results. While various automated runs experienced comparable results in the Athome1 and Athome2 sets, the consistently high level of the multimodal approach corroborates a consistent efficient process across all data sets.
Figure 14
5.3 Research Question No. 3.
The Team’s experiments with different positive negative training ratios showed that training using a 50/50 ratio of relevant to irrelevant documents performed consistently better than any other ratios. This result is believed to be specific to the proprietary type of logistic regression algorithm used in Kroll Ontrack’s EDR. It may not have applications beyond this software, or even other more complex projects. Our work on this question continues.
6. CONCLUSIONS
The results in Topic 109 and other topics indicate that hybrid man-machine learning by skilled attorneys is, at the current time, significantly more effective at meeting complex legal search needs than fully automated approaches. This seems obvious, but more experiments on this issue are needed before this can be accurately quantified. The surprising success of the Sandbox participants using fully automated search, even though limited to non-legal topics and situations with only simple information needs, suggests that greater reliance on automated methods could be placed in legal search where the cases and needs are simple. The relatively low effort involved in automated learning, and thus low expense, is compelling, especially in view of the proportionality analysis required by law under the December 2015 Amendments to the Federal Rules of Civil Procedure. The Team has begun and will continue post hoc analysis and experiments using various hybrid methods that adjust the balance between man and machine.
We are experimenting with methods that place greater reliance on machine learning in all topics, including, but not limited to, topics with lesser complexity and information needs. We will also further investigate the use of both fully automated methods, and hybrid methods, in legal search quality control, fraud detection, and in the prediction of future wrongful conduct.[20]
The 2015 TREC Total Recall Track results also suggest that even when information needs are simple and require no complex analysis or background knowledge, as was true of most of the topics, a hybrid method outperforms fully automated methods in two ways: one, at finding relevant documents quickly and with high precision; and two, at making better stop decisions. These two considerations are very important in legal search where attorneys must find a proportional balance between recall and effort/expense. The results in all topics, even the simple ones, thus caution against over-reliance at this time on machine learning alone without proper expert supervision.
7. ACKNOWLEDGMENTS
The e-Discovery Team would like to thank Kroll Ontrack, Inc. and Jackson Lewis P.C. for their generous support of this project. We would also like to thank the many employees at Kroll Ontrack who pitched in behind the scenes, often late at night and on weekends, to help make this happen.
8. REFERENCES (Endnotes)
[1] Losey, R., Predictive Coding 3.0, part two (e-Discovery Team, 10/18/15); also see Predictive Coding Articles by Ralph Losey, (collection of over 50 articles by Ralph Losey further describing the hybrid multimodal approach).
[2] The e-Discovery Team’s hybrid multimodal approach is similar to the method promoted by the Total Recall Track administrators, Maura Grossman and Gordon Cormack, in that they both use continuous active learning (CAL) in legal search as part of a technology-assisted review (TAR). It is, however, fundamentally different from Grossman and Cormack’s current methods in two ways.
First, our approach relies upon and encourages participation of skilled reviewers in the search process, the hybrid approach, whereas the Grossman and Cormack approach seeks to eliminate the role of the skilled user, namely trained attorneys. The rationale for their automation goal is the unsubstantiated claim that the adversarial context of legal search makes attorneys untrustworthy. They claim that inherent user bias means fully automated approaches are the only reliable methods of legal search. Grossman & Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, CoRR abs/1504.06868 at pg. 1 (2015) (“In eDiscovery, the review is typically conducted in an adversarial context, which may offer the reviewer limited incentive to conduct the best possible search.”) Obviously the Team disputes this assumption and conclusion. We do not endorse the view of the inherent bias and untrustworthiness of attorneys. In Ralph Losey’s experience as a practicing attorney since 1980 such bias is the rare exception, not the norm, and should not be the basis of a legal search strategy. The better solution to this minor issue of trustworthiness is educational, to train more attorneys in search and in professional ethics. Since our core assumptions on process and attorney honesty are fundamentally different, so too are our methods and goal. Our aim is augmentation of skilled attorneys to perform legal search, not automation, not replacement.
Second, our Team uses a variety of search methods, a multimodal approach, whereas the Grossman and Cormack approach relies solely upon the use of high-ranking documents to train a classifier. This is consistent with their aim to fully automate and eliminate attorneys from the legal search process, again based on the premise we dispute of attorney bias. In their words: “For the reasons stated above, it may be desirable to limit discretionary choices in the selection of search tools, tuning parameters, and search strategy.” Id. We disagree and seek to empower attorneys with a variety of search tools, including the one search method that they endorse of reliance on high-ranking documents. Also see the discussion and citations in Endnote 19.
[3] In these respects the e-Discovery Team follows the teachings of Gary Marchionini, Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill, who explained in Information Seeking in Electronic Environments (Cambridge 1995) that information seeking expertise is a critical skill for successful search. Professor Marchionini argues, and we agree, that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” We also follow the teachings of UCLA Professor Marcia J. Bates who has advocated for a multimodal approach to search since 1989. Bates, Marcia J., The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 in Quora:
“An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.”
Also see: White & Roth, Exploratory Search: Beyond the Query-Response Paradigm (Morgan & Claypool, 2009).
[4] The Total Recall Track fully automated method follows the Track Administrator’s preferred methodology of fully automated monomodal search (high ranking only) and their recently announced goal to eliminate attorney review in favor of full automation. Grossman & Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, supra at pg. 1 (2015):
“Our goal is to fully automate these choices, so that the only input required from the reviewer is, at the outset, a short query, topic description, or single relevant document, followed by an assessment of relevance for each document, as it is retrieved.”
They call the method “Autonomous TAR.” Id. at pg. 6. The protocols of the fully automated division of the Total Recall Track were apparently designed in part by Cormack and Grossman to test this premise, and the results they attained as participants in this division, along with all of the other fully automated participants from Universities around the world, are very impressive. Still, the e-Discovery Team, who did not participate in the 2015 automated division, notes that many of the protocols in this experiment are based on fictions and conditions not found in the real world of legal search, where the Team’s methods were developed. The differences include, but are not limited to: the existence of an omnipotent SME that instantly provides perfectly correct judgmental feedback as to relevance of all documents selected by the automated processes as probable relevant; simple, single-facet issues; relatively simple datasets stripped of most native metadata; and, perhaps most importantly, issues requiring little or no legal analysis or background legal knowledge. Note, in post hoc runs the e-Discovery Team ran a few fully automated runs on Kroll Ontrack systems and EDR. We used the same high ranking only Autonomous TAR training method and obtained the same results as all of the other fully automated division participants.
[5] “Contract review attorney,” or simply “contract attorney,” is a term now in common parlance in the legal profession to refer to licensed attorneys who do document review on a project-by-project basis. Their pay under a project contract is usually by the hour and is at a far lower rate than attorneys in a law firm, typically only $50 to $75 per hour. Their only responsibility is to review documents under the direct supervision of law firm attorneys who have much higher billing rates.
[6] Predictive Coding is defined by The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (January 2013) (Grossman-Cormack Glossary) as: “An industry-specific term generally used to describe a Technology Assisted Review process involving the use of a Machine Learning Algorithm to distinguish Relevant from Non-Relevant Documents, based on Subject Matter Expert(s) Coding of a Training Set of Documents.” A Technology Assisted Review process is defined as: “A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection. … TAR processes generally incorporate Statistical Models and/or Sampling techniques to guide the process and to measure overall system effectiveness.” Also see: Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3, Article 11 (2011).
[7] Da Silva Moore v. Publicis Groupe 868 F. Supp. 2d 137 (SDNY 2012) and numerous cases later citing to and following this landmark decision by Judge Andrew Peck, including Judge Peck’s own more recent Rio Tinto v. Vale, 2015 WL 872294 (March 2, 2015, SDNY).
[8] Grossman & Cormack, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014; Grossman & Cormack, Comments on “The Implications of Rule 26(g) on the Use of Technology-Assisted Review”, 7 Federal Courts Law Review 286 (2014); Herbert Roitblat, series of five OrcaTec blog posts (1, 2, 3, 4, 5), May-August 2014; Herbert Roitblat, Daubert, Rule 26(g) and the eDiscovery Turkey, OrcaTec blog, August 11th, 2014; Hickman & Schieneman, The Implications of Rule 26(g) on the Use of Technology-Assisted Review, 7 FED. CTS. L. REV. 239 (2013); Losey, R. Predictive Coding 3.0, part one (e-Discovery Team 10/11/15).
[9] Id.; Webber, Random vs active selection of training examples in e-discovery (Evaluating e-Discovery blog, 7/14/14).
[10] See Endnote [2]. This disagreement is within a general framework of agreement on the superiority of computer assisted methods over traditional linear review, joint criticism of random selection methods and control sets in legal review, and agreement on the use of continuous active learning, as opposed to one and done, identified by Losey as Predictive Coding Version 1.0. Predictive Coding 3.0, part one (e-Discovery Team 10/11/15).
[11] Grossman & Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, CoRR abs/1504.06868 (2015); Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review, SIGIR’15, August 09-13, 2015, Santiago, Chile. (2015).
[12] Losey, R., Predictive Coding 3.0, part two (e-Discovery Team, 10/18/15).
[13] Shakespeare, W., Henry VI, Pt II, Act 4, Scene 2, 71-78 (“The first thing we do, let's kill all the lawyers.”). This famous anti-lawyer line was spoken by “Dick the butcher,” a traitor hoping to start a revolution and prop up his friend as an autocratic ruler.
[14] Losey, R., Predictive Coding 3.0, part one (2015 e-Discovery Team), see the subsection therein, Predictive Coding 1.0 and the First Patents, discussing common prejudice against lawyers by academics and IT that drove the ill-advised imposition of secret control sets in the first versions of predictive coding software. The new drive by Cormack and Grossman to fully automate legal search and eliminate SMEs and attorney search expertise from legal search seems based, at least in part, on the same false premises. Also see Losey, R., Mancia v. Mayflower Begins a Pilgrimage to the New World of Cooperation, 10 Sedona Conf. J. 377 (2009 Supp.); Losey, R., Lawyers Behaving Badly, 60 Mercer L. Rev. 983 (Spring 2009).
[15] See Zero Error Numerics for a partial list of quality control and quality assurance methods endorsed by the e-Discovery Team, found at ZeroErrorNumerics.com (ZEN Document Review). Also see: Concept Drift and Consistency: Two Keys to Document Review Quality, e-Discovery Team (Jan. 20, 2016).
[16] The cost of traditional linear document review is often far higher than $1.00 per file in practice. In 2007 the U.S. Department of Justice spent $9.09 per document for review in the Fannie Mae case, even though it used contract lawyers for the review work. In re Fannie Mae Securities Litig., 552 F.3d 814, 817 (D.C. Cir. 2009) ($6,000,000/660,000 emails). At about the same time Verizon paid $6.09 per document for a massive second review project that enjoyed large economies of scale and, again, utilized contract review lawyers. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010 ($14,000,000 to review 2.3 million documents in four months).
[17] E.M. Voorhees, Variations in relevance judgments and the measurement of retrieval effectiveness, Information Processing & Management, 36(5):697-716, 2000 (on pooling); Oard, Baron, Hedin, Lewis, Tomlinson, Evaluation of Information Retrieval for E-Discovery, Journal Artificial Intelligence and Law, Vol. 18 Issue 4, December 2010 Pgs. 347-386.
[18] Autonomy and Reliability, supra at pgs. 2-3 (“This paper offers a historical review of research efforts to achieve high recall...” The paper also estimates the Blair Maron precision score of 20% and lists the top scores (without attribution) in most TREC years); Hedin, Tomlinson, Baron, and Oard, Overview of the TREC 2009 Legal Track (TREC 2009); Cormack, Grossman, Hedin, and Oard, Overview of the TREC 2010 Legal Track (TREC 2010); Grossman, Cormack, Hedin, and Oard, Overview of the TREC 2011 Legal Track (TREC 2011); Evaluation of Information Retrieval for E-Discovery, supra at pgs. 24-27. The top TREC results cited for the six years of Legal track are in the 60% to 70% F1 range with a couple of results in the low 80% F1 range. The Recommind participation in the last TREC Legal Track 2011, and their subsequent prohibited marketing advertisements claiming to “win,” which led to their lifetime ban from TREC, only attained a Recall of 62.3% in one topic (403). Overview of the TREC 2011 Legal Track (TREC 2011) supra. Contrast all of the prior TREC results with the e-Discovery Team results in 18 topics in the 80% to 100% F1 range, with numerous topics in the mid to high 90% F1 range. Of course, these different TREC events had varying experiments and test conditions and so direct comparisons between TREC studies are never valid, but general comparisons are instructive and frequently made in the cited literature.
[19] See the report on the Electronic Discovery Institute (EDI) Oracle legal search experiments involving the largest number of legal search participants to date where a member of the e-Discovery Team attained high scores. Bay, M., EDI-Oracle Study: Humans Are Still Essential in E-Discovery: Phase I of the study shows that older lawyers still have e-discovery chops and you don’t want to turn EDD over to robots (11/20/13, LTN). Monica Bay, the Editor of Law Technology News, summarizes the conclusion of EDI from the study that: “Conclusion: Software is only as good as its operators. Human contribution is the most significant element.” Patrick Oot, co-founder of the Electronic Discovery Institute, presented the findings of Phase II of the Oracle Predictive Coding Survey at ILTACON Day 3, as reported in The Relativity Blog, 9/2/15: “[W]hen it comes to what some vendors call Continuous Active Learning, Oot indicated the debate was somewhat of a red herring, adding, “Continuous Active Learning is just a buzzword.” Oot summed up his thoughts by stressing the human component of technology-assisted review. Noting that the best performing technology in the Oracle study was the one used by a senior attorney, Oot said, “A good artist with a good brush is best.” Unfortunately the final results of the EDI Oracle study have not yet been published and, as participants in that study, we are currently constrained from any detailed reporting.
[20] See PreSuit.com, where the e-Discovery Team’s proposal is outlined to monitor the IT systems of large organizations with advanced analytics and other search methods to predict and avoid future illegal conduct. This man-machine hybrid type of early warning system includes safeguards to protect both individual privacy rights and confidential corporate information.
APPENDIX
E-Discovery Team 89-Page Narrative Report of all 30 Topics
This Appendix Narrative Report describes the search of all thirty Total Recall topics in TREC 2015 using the e-Discovery Team’s Hybrid Multimodal method. The report follows the chronological order in which the searches were conducted. The first project, Topic 103 Manatee Protection, started on July 14, 2015. The last, Topic 3089 Pickton Murders, concluded on August 28, 2015. At the beginning of each Topic the results are reported for that Topic. Each report has the same form and discloses metrics at two points: (1) when the Reasonable call was made; and (2) when 97.5% Recall was attained. They are summarized in a variation of a standard Confusion Matrix, a/k/a Contingency Table.¹ The Confusion Matrix itself is highlighted in blue. It is followed by a list of the key values attained: Recall, Precision, F1 Measure, Accuracy, Error, Elusion and Fallout.
Work on multiple topics was conducted at the same time. Sullivan, who worked on eight topics, Reichenberger, who worked on four, and White, who did one, each worked on a single topic at a time. They did, however, work concurrently with Losey and each other. Losey, who worked on seventeen topics, and had the assistance of a contract review attorney on the ten Bush Email Topics, typically worked on multiple topics at the same time. All Topics were a Team effort, but the attorneys identified as running each Topic controlled the review work for that Topic. Consultation was common, especially at first.
Topic 103 Manatee Protection
Confusion Matrix - Topic 103
Total Documents: 290,099
Total Relevant: 5,725
Total Prevalence: 1.97%
¹ Grossman & Cormack Glossary, supra FN 1 at pg. 6. The Confusion Matrix is also referred to as a Contingency Table.
                @ Reas. Call    @ 97.5% Recall
True Positives         4,780             5,582
True Negatives       284,348           283,793
False Positives           26               581
False Negatives          945               143
Recall                83.49%            97.50%
Precision             99.46%            90.57%
F1 Measure            90.78%            93.91%
Accuracy              99.67%            99.75%
Error                  0.33%             0.25%
Elusion                0.33%             0.05%
Fallout                0.01%             0.20%
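The seven values listed under each Confusion Matrix follow directly from its four cells. The sketch below is a minimal illustration, assuming the standard definitions of these measures (the report does not spell out its formulas); the class and method names are hypothetical. It reproduces the Topic 103 figures at the Reasonable call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Four cells of a binary relevance confusion matrix (hypothetical helper)."""
    tp: int  # relevant documents submitted
    tn: int  # irrelevant documents not submitted
    fp: int  # irrelevant documents submitted
    fn: int  # relevant documents not submitted

    def metrics(self) -> dict:
        # Assumes every denominator is non-zero, which holds for the tables here.
        total = self.tp + self.tn + self.fp + self.fn
        recall = self.tp / (self.tp + self.fn)
        precision = self.tp / (self.tp + self.fp)
        return {
            "recall": recall,
            "precision": precision,
            "f1": 2 * precision * recall / (precision + recall),
            "accuracy": (self.tp + self.tn) / total,
            "error": (self.fp + self.fn) / total,
            "elusion": self.fn / (self.fn + self.tn),  # relevant share of the discard pile
            "fallout": self.fp / (self.fp + self.tn),  # irrelevant documents wrongly submitted
        }


# Topic 103 at the Reasonable call, cell values taken from the table above.
topic_103_at_call = ConfusionMatrix(tp=4780, tn=284348, fp=26, fn=945)
for name, value in topic_103_at_call.metrics().items():
    print(f"{name:>9}: {value:.2%}")
```

Swapping in the cells from any other topic's table gives that topic's values in the same way.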
The e-Discovery Team’s TREC Total Recall project commenced on July 14, 2015 with work on Topic 103 Manatee Protection. This topic was run by Losey. He did not complete work until July 22, 2015. Although it may seem fast to see a review of 290,099 documents completed by one attorney in only eight days (with no breaks), there was more time spent on this topic than any of the others. But a significant amount of this time was spent on general set-up, procedures, contract reviewer training, project orientation, and communication protocols. Completion of this Topic was also delayed due to the availability of the contract review attorney, Anne Bottolene, who assisted Losey for the first part of the work on Topic 103, and due to some initial software configuration setup issues.
The Team found this Topic challenging for a variety of reasons, including the fact that the Bush collection of 290,099 emails had been stripped of its original metadata, images, and attachments. Further, we found some inconsistencies in judging this topic, although not many. Overall we found Topic 103 had one of the best gold standards of the ten Bush Email Topics.
Ralph Losey is a native Floridian and has been a Florida attorney for 35 years. He was somewhat knowledgeable about all of the Bush Email issues, certainly far more so than the average person, but he did not consider himself a bona fide subject matter expert (SME) on any of them. Losey’s knowledge of and interest in Manatee Protection issues was, however, higher than for the other Bush Topics. For that reason it was chosen as the first topic. Losey’s assistant, Bottolene, had lived in Florida for several years and also had some background with the Manatee Protection issue. They generally considered their familiarity with the issue to be an asset in the search of Topic 103. The same cannot be said of other Bush Email Topics.
The project commenced after initial orientation on July 14, 2015 with Losey beginning Step Two, Multimodal Search Reviews. Bottolene was assigned Step Three, Random Baseline. Due to various scheduling and implementation issues Bottolene did not complete her review of the sample until late afternoon on July 20, 2015. She reviewed a random sample of 1,534 Bush emails and coded each as either relevant or irrelevant. This was one of only two Topics wherein Step Three was followed and a full random sample was taken. It proved very helpful.
Based on the sample prevalence we predicted a spot projection for Topic 103 of 5,175 relevant documents (95% confidence level with a margin of error of +/-2.5%). In fact, the total relevant documents in Topic 103 proved to be 5,725, well within the 2.5% margin of error. Based on the length of time needed for random sample review, and our desire to complete all thirty topics in 45 days, we decided to skip this step for ensuing reviews. (Topic 101 Judicial Selection was started shortly after Topic 103, and also included Step Three Random Baseline.) As mentioned, we also skipped most of the procedures in Step 7, “Zero Error Numerics,” concerning quality control in this and all 30 Topics.
After Bottolene completed the random sample review on July 20th she assisted Losey on July 21st and 22nd in his work on Step Five Multimodal Search Review. At that time submission to TREC had already begun and the Team was evaluating the confirmed relevant and irrelevant documents from TREC. A total of 24 document submissions were made to TREC in this Topic: four on July 20th, one on July 21st, and the remaining nineteen on July 22, 2015. In between most of these submissions the Team conducted Steps Four, Five and Six of its standard workflow. These are the predictive coding steps that iterate. In Step Four the software, Mr. EDR, analyzes the documents designated for training in Step Two in the seed set, and in Step Five thereafter. Mr. EDR then ranks the whole data set according to probable relevance and irrelevance. In Step Five the attorneys search for more documents to use to train Mr. EDR. It is essentially the same as Step Two, except now the attorneys can add probability and rank based searches to their multimodal toolkit. That is the Team’s full search pyramid, shown right. The methods are used ad hoc according to what the attorney reviewer considers a promising method to find additional relevant documents, based in part on the latest EDR rankings and TREC submission returns. Once new documents are found that are likely to be relevant, they are then designated in Step Six for Training. Not all documents are so designated. Again this is at the discretion of the attorneys as to what documents they think would best serve to train in the ongoing active learning process.
In Topic 103 the use of predictive coding rank based searches was severely constrained. This was due to initial configuration setup errors, where input parameters for the learning engine were set incorrectly. These setup errors were detected and corrected by July 22, 2015, and thereafter Mr. EDR was of great assistance. Still, as a result of the delays and early errors, this Topic relied much more heavily than any other on keyword searches and human linear reviews. Similarity searches were also used extensively. Basically the predictive coding assistance in this Topic did not begin until the 14th submission. Losey called Reasonable after the 15th submission.
In the TREC experiments most, but not all, of the documents returned as relevant or irrelevant by TREC were included in training (Step Six). In that way their ranking impact was evaluated (Step Four) before the next submission. Training also included various irrelevant documents that were not TREC adjudicated, but were thought to be obviously irrelevant. Experiments were made as to the impact of varying the number of irrelevant documents in the hope that some
ideal range or ratio could be determined to maximize Mr. EDR efficiency. These experiments are still underway. Our conclusions as of late December 2015 are stated in the body of this report.
After a total of 15 submissions that presented 4,806 documents to TREC for adjudication, Losey called Reasonable and stopped work on July 22, 2015, a week after the Topic started. Thereafter an additional 9 submissions were made to TREC to submit the remaining 285,293 emails (98.34% of the 290,099 total). There was Training in between most of the remaining seven submissions based on the TREC adjudications, but no further human input. The first two post-call submissions were critical to the Team’s excellent performance on this Topic.
Losey called Reasonable at the point he thought that a reasonable human effort had been made to find relevant documents. Losey and his assistant Bottolene had personally reviewed and coded as relevant or irrelevant 7,203 documents. (Additional documents had been coded without review.) In fact, by the time Losey had submitted 2,309 documents to TREC for adjudication (the 14th submission) he had completed all individual document review (7,203 documents), and had completed all searches other than predictive coding ranking searches where document content is not reviewed. At that time (after the 14th submission) he essentially turned the process over to Mr. EDR, who had by then just recovered from an earlier technical illness and had not been functional before.
At the time Losey called Reasonable he had submitted a total of 4,806 documents. Of those, 4,780 had been adjudicated as relevant. This was an incredible Precision rate of 99.46%. This was the most Precise production that Losey thinks he has ever made. He also thought that he may have attained as high as a 90% Recall but, in fact, the later submissions showed that at the time Reasonable was called he had attained a Recall of 83.5%. This is still considered a high Recall level in legal search, and the combined F1 measure of 90.8% is, in legal search, like any other, a very outstanding effort.
The next submissions after Reasonable was called were always the documents that were highest ranked by Mr. EDR, which is why we call this an automated function. As we understand the game setup by TREC for the Recall Track, the actual scoring is not impacted by the Reasonable call. The scoring continues for all submissions until all documents have been returned. The Reasonable call is merely an indication of efforts. The same goes for the 70% and 80% recall calls, when and if they are made before the Reasonable effort call, except they are of even less interest. These calls were not supposed to have an impact on scoring.
In the first two submissions after the call in Topic 103, the 16th and 17th submissions, Mr. EDR identified and highly ranked 661 additional relevant documents, bringing the total relevant found to 5,467 out of the total 5,725. We were thereby able to attain in that submission a Recall of 90% with Precision of 99.33%, a Recall of 95% with Precision of 98.8%, and 97.5% Recall with a Precision of 90.57%! As far as Losey knows, these statistics represent his personal best efforts, especially considering that he did so with very little reliance on predictive ranking.
What makes this 97.5% Recall, 90.6% Precision all the more remarkable for legal search is that it was accomplished by only one expert attorney assisted by one contract review attorney. The measured effort to attain these high levels was remarkably low, especially considering that a significant amount of time in Topic 103 was spent reviewing the baseline sample (Step Three). Together the two attorneys only reviewed 7,203 documents out of the total corpus of 290,099
emails (2.5%). In legal search it is common for attorney review teams to consist of dozens or even hundreds of attorneys. Moreover, even when predictive coding is used, a far higher percent of the corpus is typically reviewed than 2.5%, and Recall levels of 97.5% are unheard of, much less precision in excess of 90%.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call. (Please note that the graph is not to scale, as the graph is based on individual submissions. We thought this a better depiction than proportionally showing progress because in most cases a proportional graph would be a line virtually straight up from the start and flat going over.)
The next chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Manatee Protection topic, by the time 97.5% Recall had been attained only 2.12% of the corpus, 6,163 documents, had been submitted for adjudication. This is a triumph for the search pyramid foundation, especially keyword search, that supports AI training. The last portion of the graph thus represents the submission of the remaining 97.88% or 283,936 documents.
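The "documents submitted by the time a Recall threshold was attained" figure can be read off a ranked submission order. The sketch below is only an illustration: the function name and the toy ranking are hypothetical, and it assumes documents are counted one at a time in the order submitted. The last lines check the Topic 103 percentages quoted above.

```python
def docs_to_reach_recall(ranked_labels, total_relevant, target):
    """Return how many documents must be submitted, in ranked order,
    before recall first reaches `target` (None if it never does)."""
    found = 0
    for submitted, is_relevant in enumerate(ranked_labels, start=1):
        found += bool(is_relevant)
        if found / total_relevant >= target:
            return submitted
    return None


# Hypothetical toy ranking: True means the document was adjudicated relevant.
toy_ranking = [True, True, False, True, False, False, True]
print(docs_to_reach_recall(toy_ranking, total_relevant=4, target=0.75))  # 4

# Figures reported for Topic 103 at 97.5% Recall.
submitted, corpus = 6_163, 290_099
print(f"{submitted / corpus:.2%} of the corpus submitted")
print(f"{corpus - submitted:,} documents remaining")
```

Precision at the same point is simply the relevant count divided by the returned document count.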
The chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 2108 CAPTCHA Services
Confusion Matrix - Topic 2108
Total Documents: 465,147
Total Relevant: 656
Total Prevalence: 0.14%
                @ Reas. Call    @ 97.5% Recall
True Positives           580               640
True Negatives       463,566           458,906
False Positives          925             5,585
False Negatives           76                16
Recall                88.41%            97.56%
Precision             38.54%            10.28%
F1 Measure            53.68%            18.60%
Accuracy              99.78%            98.80%
Error                  0.22%             1.20%
Elusion                0.02%             0.00%
Fallout                0.20%             1.20%
Topic 2108 was run by Losey without any assistance of a review lawyer. The work to search the 465,147 Black Hat World Forum posts started on July 16, 2015, but did not conclude until August 1, 2015. The reason for the delay in completion is that the Team encountered difficulties in understanding the initial TREC adjudications of their first submissions. Neither Losey, nor the other attorney Team members consulted, could understand the relevance pattern behind TREC’s initial submission responses. Due to the initial EDR configuration error, predictive coding was not available to assist at first in ascertaining the relevance scope. After several days of struggling with this project, Losey put this Topic on hold until July 29th, at which time he returned to the Topic to finish.
As a general comment the Team found all of the Black Hat World Forum posts challenging to search, more difficult than a typical search of corporate ESI. That is in part because almost all metadata of these posts, and all associated imagery, had been stripped by TREC and the ESI converted to text files. Also the language and issues (all non-legal) in the Black Hat World Forums were obscure. Even though our attorney searchers were all familiar with forums and had knowledge of most of the technologies and sometimes illegal, nearly always unethical, marketing practices discussed in Black Hat World, they still found the slang-filled posts difficult to review and analyze. The challenges were compounded by significant inconsistencies, and apparent illogic, of the TREC judging in many of these topics. Still, the Team was able to overcome these challenges and, after we learned not to try to understand any relevance rules, we overall did quite well in review of the ten Black Hat World Forum Topics.
Based on the elusive (to humans) relevance standard, we found that these topics required greater reliance on Mr. EDR than the Bush Emails and News Articles. Even though we continued to use a multimodal approach in Forum topics, our emphasis was on the AI features of ranking and probability. The Team readily admits that its own human intelligence, without the considerable AI enhancements of Mr. EDR, was not up to the task of matching TREC relevance calls for the Forum Topics. But with the help of predictive coding (Mr. EDR) we overcame the difficulties and attained relatively high recall levels.
On July 31, 2015, after making 22 document submissions to TREC providing a total of 1,505 documents, Losey had found a total of 580 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 2,101 documents. In fact, Losey had stopped document review after the 21st submission. His 22nd submission was entirely based on document rankings without review. After the 22nd TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 88.41% had been attained. There were seven additional submissions to TREC after the Reasonable call point. In the next, 23rd submission, 95% Recall was attained after submitting only 2,130 additional documents.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call. (Please note that this graph, and all others like it, are not to scale, as the graphs are based on individual submissions. We thought this a better depiction than proportionally showing progress because in most cases a proportional graph would be a line virtually straight up from the start and flat going over.)
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the CAPTCHA Services topic, by the time 97.5% Recall had been attained only 1.34% of the corpus, 6,225 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 98.66% or 458,922 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 101 Judicial Selection
Confusion Matrix - Topic 101
Total Documents: 290,099
Total Relevant: 5,834
Total Prevalence: 2.01%
                @ Reas. Call    @ 97.5% Recall
True Positives         5,026             5,688
True Negatives       283,608           281,901
False Positives          657             2,364
False Negatives          808               146
Recall                86.15%            97.50%
Precision             88.44%            70.64%
F1 Measure            87.28%            81.93%
Accuracy              99.49%            99.13%
Error                  0.51%             0.87%
Elusion                0.28%             0.05%
Fallout                0.23%             0.83%
Topic 101 was run by Losey with the assistance of a review attorney, David Jensen. The work to search the 290,099 Bush Emails started on July 16, 2015 and concluded on July 26, 2015. The project commenced with Losey beginning Step Two, Multimodal Search Reviews, and Jensen assigned Step Three, Random Baseline. Jensen finished the random sample review the next day and began assisting Losey in Step Two and, after submissions began, in the echo Step Five, multimodal. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. Jensen focused on keyword searches and also made suggestions of documents to submit. Final decisions on submissions were always made by Losey on all Topics.
Due to the same initial configuration setup errors already mentioned, the AI features did not work until near the end of this Topic. Losey instead relied heavily on Keyword, linear, and a new type of Similarity search the Team invented out of necessity during TREC events. It is anticipated that the new similarity search feature will be included in future Mr. EDR releases.
Review of the random sample of 1,534 Bush emails found 30 that were relevant. That suggested a prevalence of 1.96% and a spot projection of 5,673 documents. The actual relevant count of 5,834 and prevalence of 2.01% were very close to the projection. Note this is the second and last Topic in which a full Step Three random sample was implemented.
On July 25, 2015, after making 15 document submissions to TREC providing a total of 5,683 documents, Losey had found a total of 5,026 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 6,895 documents. In fact, Losey had stopped document review after the 14th submission, as his 15th submission was entirely based on document rankings without review. After the 15th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 86.15% had been attained with a Precision of 88.44%. There were an additional 8 submissions to TREC after the Reasonable call point. The next, the 16th, was a submission of 652 documents, 345 of which were relevant. 95% Recall with 82.7% Precision was attained after submitting only 6,705 documents (1,022 after the Reasonable call). 97.5% Recall with 70.6% Precision was attained after submitting only 8,052 documents (2,369 after the Reasonable call).
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Judicial Selection topic, by the time 97.5% Recall had been attained only 2.78% of the corpus, 8,052 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 97.22% or 282,047 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 108 Manatee County
Confusion Matrix - Topic 108
Total Documents: 290,099
Total TREC Relevant: 2,375
Total TREC Prevalence: 0.82%
Using TREC relevant calls
                @ Reas. Call    @ 97.5% Recall
True Positives           734             2,316
True Negatives       287,712            26,197
False Positives           12           261,527
False Negatives        1,641                59
Recall                30.91%            97.52%
Precision             98.39%             0.88%
F1 Measure            47.04%             1.74%
Accuracy              99.43%             9.83%
Error                  0.57%            90.17%
Elusion                0.57%             0.22%
Fallout                0.00%            90.90%
Topic 108 was run by Losey with the assistance of a review attorney, Bottolene. The work to search the 290,099 Bush Emails also started on July 16, 2015 and concluded on July 24, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was done by Losey with assistance at first from Bottolene. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. All final submittal decisions were made by Losey.
Observations on the Errors of Relevance Judgments in This and Other Topics
This was the most frustrating of all of the TREC Recall Topics for the Team to work on because the judgments on relevance contained more obvious errors and inconsistencies than any other. This Topic was Manatee County, as opposed to Topic 103, which was Manatee Protection, which of course referred to the endangered mammal. Unfortunately, as a lifelong Florida attorney, Losey has substantial independent knowledge of Manatee County and manatees. Bottolene had also been a Florida resident for several years and an attorney. Their direct personal knowledge of Florida proved to be a significant disadvantage in this Topic (and, to a lesser extent, in other Topics, especially ones that contained obvious errors in relevance) because TREC adjudications were not tied to actual facts and reality (obviously no one at TREC was a Florida SME) and were otherwise surprising.
For instance in Topic 108, even though the subject was the County of Manatee, a political entity, sometimes, but not always, an email with mere mention of the mammal manatee would be considered relevant, even though there was no mention of location or the county. Also, many references to Manatee Park were considered relevant by TREC, even though that park is, as any Floridian would know, especially Losey who lives in Central Florida, not located in Manatee County and otherwise has no connection to the county. Also, almost all email addresses that had manatee in the name were called relevant by TREC, even if the email had nothing to do with the County of Manatee. There may well be some pattern to the so-called gold standard used in this Topic, but if so, it was not logical and not known to Bottolene or Losey. It appeared to these Floridians, after the fact, to be lack of expertise on the part of TREC. Other team members who reviewed these adjudications later agreed.
One example we were later able to figure out: a well-known Florida law firm (Holland & Knight) has a home office in Bradenton, Florida, and the attorneys there would often write to the governor. As part of post-hoc analysis we saw that almost all of these emails were considered relevant by TREC assessors to this topic simply because the office city was in their standard signature line address, even though the content of the emails has nothing to do with Manatee County.
Since Losey is used to directing legal search as an SME, or direct SME surrogate, his usual approach to legal search involves using his knowledge and understanding to differentiate relevant from irrelevant. As mentioned, in legal search understanding of relevance is critical; in fact, it is a legal duty and responsibility of the attorney searchers. Thus his position as an actual Florida SME served as a disadvantage in many of the Bush email Topics, including this one.
The Team later encountered other Topics with inconsistencies and mistakes like Topic 108. In such cases we eventually learned to step out of the process and stop trying to understand or look for a rational basis for the TREC relevance calls. We would put aside our traditional SME role, which is otherwise the firmly established norm in legal search. Instead, when we found
ourselves in this situation (and this happened in a little less than half of the Topics), we would basically turn the search and submission decisions over to Mr. EDR. In those situations we did not even try to see any pattern or consistency to the adjudications. When we adopted this approach in later topics we did quite well, in spite of defects we saw in the TREC gold standards.
This suggests that TREC’s selection of relevant documents in some of the Topics suffered from over-delegation to computer selection without adequate SME based quality controls. It is unknown what software was used by TREC to create the relevant gold standard document set, but like any predictive coding software today, it obviously can be led astray without adequate human supervision and quality control safeguards. This is why the e-Discovery Team adopts a hybrid approach, computer and human, including SMEs, and why in normal circumstances Step Seven for quality control is so important under their Predictive Coding 3.0 method.
Topic 108 Description
On July 23, 2015, after making 10 document submissions to TREC providing a total of 746 documents, Losey had found a total of 734 relevant documents (Precision of 98.4%). The effort, or number of documents reviewed and coded by Losey to attain this result, was 696 documents. After the 10th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 31% had been attained. The decision to call Reasonable proved to be a big mistake because the TREC adjudications were not limited to Manatee County relevance as the Team had assumed. As mentioned, the error was based upon the Team’s construction of relevance in a much narrower manner than TREC. The divergence was not known because the Team did not do enough exploration of irrational constructions and so did not detect the, to our mind, outlier nature of TREC’s approach to this Topic.
The Team should have been less precise (its submissions had a Precision of 98.4%), and should have presented more documents for submission, even though the Team did not personally consider them to be relevant. It should have better tested its relevance concept. But as mentioned, as an SME Losey was used to setting the scope of relevance, and as lawyers, the entire Team was used to rational adjudications of relevance along lines that make sense to them.
This was an early topic for us in the process and we had not yet learned to mistrust our own assessments.
There were 6 additional submissions to TREC after the Reasonable call point. In retrospect, this was also an error. The Team should have made multiple smaller submissions after they started to discover the outlier nature of the TREC adjudications, with training between each submission where Mr. EDR could take over in an automated fashion. This was another game-type lesson learned the hard way on this Topic, which proved to be the Team’s worst performance. Even in the worst case, with multiple mistakes, the Team still managed to attain 78% Recall with review of only 696 documents, and submission of only 60,817 of the total 290,099 documents.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the Reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Manatee County topic, by the time 97.5% Recall had been attained 90.95% of the corpus, 263,843 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 9.05% or 26,256 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
Correction of the Gold Standard Relevance Set in Topic 108
Since the Team is considering use of the Bush email set in further testing, training and research, they wanted to try to correct the many deficiencies they saw in TREC’s determination of the gold standard for this Topic. They also wanted to better understand why the score on this Topic was so out of range from their other scores. With this in mind they re-reviewed the TREC
adjudications and set up a three-attorney peer review of all errors spotted in the relevancy determinations. A conservative approach was taken and deference was given to the TREC adjudications where a rational, consistent basis could be found. Losey’s personal, narrow view of what should be relevant was not followed if there was a reason seen to follow TREC’s adjudications. (Note, the Team and others in the field of Legal Search have observed over many projects that SMEs typically take a narrower view of relevance than non-SMEs who, by definition, do not understand the subject as well.) Losey accepted all adverse rulings against his own positions as part of this process. Also note that suggestions to revise TREC adjudications came from all three Team members, not just Losey, and were all subject to multiple reviews and objections.
After the re-review and re-adjudication process was completed, 1,264 documents adjudicated as relevant by TREC were changed to irrelevant. Further, 3 documents adjudicated as irrelevant by TREC were changed to relevant. Below are the corrected metrics of the Team’s review under the improved adjudications.
Confusion Matrix (Adjusted) - Topic 108
Total Documents: 290,099
Total Adjusted Relevant: 1,114 (was 2,375) (1,264 changed to Irrelevant, 3 changed to Relevant)
Total Adjusted Prevalence: 0.38% (was 0.82%)
Using adjusted relevant calls
                @ Reas. Call    @ 97.5% Recall
True Positives           736             1,087
True Negatives       288,975           131,844
False Positives           10           157,141
False Negatives          378                27
Recall                66.07%            97.58%
Precision             98.66%             0.69%
F1 Measure            79.14%             1.36%
Accuracy              99.87%            45.82%
Error                  0.13%            54.18%
Elusion                0.13%             0.02%
Fallout                0.00%            54.38%
After the 10th TREC submission, when Losey decided to call Reasonable, he had found a total of 736 relevant documents (an increase of 2 documents) under the adjusted gold standard. This was a Recall of 66.07% and Precision of 98.66% under the adjusted standard. The F1 measure was 79.14%. Note that these metrics are much more in line with the other 29 projects, although the adjusted 66% Recall is still the Team’s second-lowest Recall score at the Reasonable call point. Under the corrected standard the Team attained 94.43% Recall with review of only 696 documents, and submission of only 60,817 of the total 290,099 documents. A graph mapping Recall attained against the number of documents submitted is shown below with both the original TREC standard (blue) and the Team adjusted standard (red).
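The adjusted Reasonable-call column can be re-derived from a handful of counts stated in this section. The sketch below is a worked check only: the variable names are hypothetical, and it assumes the 746 documents submitted at the call are unchanged by the re-adjudication.

```python
# Counts stated in the report for Topic 108.
corpus = 290_099
trec_relevant = 2_375
changed_to_irrelevant = 1_264
changed_to_relevant = 3
submitted_at_call = 746          # documents submitted when Reasonable was called
true_positives_adjusted = 736    # relevant among them under the adjusted standard

adjusted_relevant = trec_relevant - changed_to_irrelevant + changed_to_relevant

# Remaining confusion-matrix cells follow from the totals.
false_positives = submitted_at_call - true_positives_adjusted
false_negatives = adjusted_relevant - true_positives_adjusted
true_negatives = corpus - submitted_at_call - false_negatives

recall = true_positives_adjusted / adjusted_relevant
precision = true_positives_adjusted / submitted_at_call
prevalence = adjusted_relevant / corpus

print(f"adjusted relevant: {adjusted_relevant:,}")
print(f"FP={false_positives}, FN={false_negatives}, TN={true_negatives:,}")
print(f"recall={recall:.2%}, precision={precision:.2%}, prevalence={prevalence:.2%}")
```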
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds under the adjusted standard.
______________________________________
Topic 2052 Paying for Amazon Book Reviews
Confusion Matrix - Topic 2052
Total Documents: 465,147
Total Relevant: 265
Total Prevalence: 0.06%
                @ Reas. Call    @ 97.5% Recall
True Positives           257               259
True Negatives       464,364           464,165
False Positives          518               717
False Negatives            8                 6
Recall                96.98%            97.74%
Precision             33.16%            26.54%
F1 Measure            49.42%            41.74%
Accuracy              99.89%            99.84%
Error                  0.11%             0.16%
Elusion                0.00%             0.00%
Fallout                0.11%             0.15%
Topic 2052 was run by Sullivan, who started on July 20, 2015, and concluded July 22, 2015. This was Sullivan’s first Topic. For that reason he spent more time than in his later reviews in trying to understand the dataset and processes. Sullivan has a background in computers and programming. He has substantial experience in forums, enough to understand the unique characteristics present in forum communications. While he considers himself far more knowledgeable than the average person, he has no experience with the unethical world of Blackhat Forums and does not consider himself to be a bona fide subject matter expert (SME) on any of them. All forum topics presented a unique challenge of identifying variations of terms and understanding use of slang. While this proved to be easy to overcome, it certainly played a vital role in the process in a way not necessary in the News topics, where spelling errors were largely non-existent.
On the first day, Sullivan started with Step Three, Random Baseline, and reviewed a random sample of 1,534 documents. This was used both as a method to estimate prevalence and a means of gaining better understanding of the dataset for this and future topics in AtHome2. This random sample yielded 1 relevant document. Based on the sample prevalence we predicted 303 relevant documents existed in the dataset (95% confidence level with 2.5% margin of error). We would later discover the dataset contained 265 relevant documents, which is well within the margin of error. Given the amount of time necessary to complete this random sample, and the little value gained, Step Three was omitted from all subsequent topics reviewed by Sullivan.
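The spot projection from a Step Three random sample is a simple scaling of the sample prevalence to the corpus. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the margin-of-error formula is the usual worst-case normal approximation (p = 0.5, z = 1.96, no finite population correction), which the report does not itself specify. It reproduces the 303-document projection for Topic 2052 and, using the sample counts reported for Topic 101, the 5,673-document projection.

```python
import math


def spot_projection(sample_relevant, sample_size, corpus_size):
    """Scale the sample prevalence up to the whole corpus."""
    prevalence = sample_relevant / sample_size
    return prevalence, round(prevalence * corpus_size)


def worst_case_margin(sample_size, z=1.96):
    """Worst-case (p = 0.5) margin of error at roughly 95% confidence.
    Assumption: normal approximation without finite population correction."""
    return z * math.sqrt(0.25 / sample_size)


# Topic 2052: 1 relevant document in a sample of 1,534 from 465,147 posts.
prev_2052, projected_2052 = spot_projection(1, 1_534, 465_147)
print(f"Topic 2052: prevalence {prev_2052:.3%}, projection {projected_2052}")

# Topic 101: 30 relevant documents in a sample of 1,534 from 290,099 emails.
prev_101, projected_101 = spot_projection(30, 1_534, 290_099)
print(f"Topic 101: prevalence {prev_101:.2%}, projection {projected_101:,}")

print(f"margin of error for n=1,534: +/-{worst_case_margin(1_534):.2%}")
```

With only one relevant document in the sample, the normal approximation is a rough guide at best, which is consistent with the Team's remark that the sample added little value on low-prevalence topics.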
Day two was spent running keyword searches to find documents for seeding into the predictive coding algorithm and submitting documents to get a better understanding of the TREC standard for relevance. At the end of day two, 273 documents had been submitted, with 204 being returned as relevant. This provided an adequate seed set to begin relying more heavily on predictive coding.
On day three, Sullivan developed a strategy on which he relied heavily in future topics. Rather than relying on Mr. EDR alone and reviewing the documents that were given high scores by the machine, he used the multi-modal approach to prioritize documents for review. Starting with all variations of “Amazon” w/5 “Review,” he worked down, reviewing and categorizing the highest scoring documents first. When he hit a point where few relevant documents were being found, he iteratively expanded the scope of his review universe. He moved to all variations of “Amazon” w/10 “Review,” then “Amazon” w/25 “Review,” and “Amazon” AND “Review.” He expanded into “Amazon” and (“Review” or “Book” or “Feedback” or “Purchase”) and eventually to any document containing a variation of “Amazon.”
As previously mentioned, the unique characteristics of the forums required more creative searches than necessary in other datasets. Using the Concept Searching tool as a guide, it was determined that almost all reasonable variations of “Amazon” could be found using the following search: (“amazon*” OR “@mazon” OR “@maz0n” OR “azmon*” OR “azmn*” OR “amzn*”). This method proved effective in eliminating issues of missed documents due to slang or misspelling.
Using this method, Sullivan was able to identify 257 of the 265 relevant documents at the time he called Reasonable effort. 2,325 total documents had been reviewed, including the 1,534 documents in the initial random sample. After calling Reasonable effort, Sullivan continued by submitting all documents that contained any variation of the term “Amazon” in descending order of priority score. 100% recall was obtained through this method. All remaining documents were then submitted in descending priority order, with no more relevant documents being returned.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, the slightly darker line the 80% Recall call, and the dark green line the Reasonable Recall call.
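The widening keyword tiers described for Topic 2052 can be approximated outside a review platform. The sketch below is not EDR's search syntax or implementation: it is a rough Python stand-in, assuming that "w/N" means the two terms occur within N word positions of each other and that a trailing "*" matches any run of word characters. All function and constant names are hypothetical.

```python
import re

# Variants from the report: amazon*, @mazon, @maz0n, azmon*, azmn*, amzn*
AMAZON = re.compile(r"amazon\w*|@mazon|@maz0n|azmon\w*|azmn\w*|amzn\w*")
REVIEW = re.compile(r"review\w*")
TOKEN = re.compile(r"[@\w]+")

# Tiers in the order the search universe was widened; None stands for plain AND.
TIERS = [("w/5", 5), ("w/10", 10), ("w/25", 25), ("AND", None)]


def positions(tokens, pattern):
    """Indexes of tokens that match the pattern in full."""
    return [i for i, tok in enumerate(tokens) if pattern.fullmatch(tok)]


def amazon_near_review(text, window=None):
    """True if an Amazon variant and a review term co-occur within `window` words."""
    tokens = TOKEN.findall(text.lower())
    amazon_hits = positions(tokens, AMAZON)
    review_hits = positions(tokens, REVIEW)
    if window is None:
        return bool(amazon_hits) and bool(review_hits)
    return any(abs(a - r) <= window for a in amazon_hits for r in review_hits)


def narrowest_tier(text):
    """Label of the first (narrowest) tier the text satisfies, or None."""
    for label, window in TIERS:
        if amazon_near_review(text, window):
            return label
    return None


print(narrowest_tier("Need 50 @maz0n book reviews, paying well"))
print(narrowest_tier("amzn account for sale " + "x " * 7 + "review"))
print(narrowest_tier("amazing review of a rootkit"))
```

Documents falling in the narrowest tier would be reviewed first, in descending order of their predictive coding score, before moving on to the wider tiers.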
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Paying for Amazon Book Reviews topic, by the time 97.5% Recall had been attained only 0.21% of the corpus, 976 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.79% or 464,171 documents.
Thelastchartbelowrepresentstheamountofeffortintermsofdocumentsreviewedtoattain100%recallusingthemultimodalhybridmodeloftrainingEDR.
______________________________________
Topic 2225 Rootkits
Confusion Matrix - Topic 2225
Total Documents: 465,147   Total Relevant: 182   Total Prevalence: 0.04%
Topic 2225 was run by Losey, who started the search of 465,147 Black Hat Forum posts on July 21, 2015 and concluded on August 18, 2015. Losey put aside work on this Topic several times while he gave priority to the Jeb Bush Email Topics. The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training.
On August, 2015, after making 12 submissions to TREC, and training after almost every submission, Losey had provided a total of 201 documents to TREC and confirmed a total of 163 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 205 documents. After the 12th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 89.56% had been attained with a Precision of 81%. There were 23 additional submissions to TREC after the Reasonable call point. A 90% Recall was attained after submitting only 212 documents. A 95% Recall was attained after submitting 891 documents, and 97.5% Recall was attained after 3,188 documents. Total Recall was attained after submitting 12,109 documents out of the corpus total of 465,147.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives             163              178
True Negatives         464,927          461,955
False Positives             38            3,010
False Negatives             19                4
Recall                  89.56%           97.80%
Precision               81.09%            5.58%
F1 Measure              85.11%           10.56%
Accuracy                99.99%           99.35%
Error                    0.01%            0.65%
Elusion                  0.00%            0.00%
Fallout                  0.01%            0.65%
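For readers who want to check the derived rows of these confusion matrices, the sketch below recomputes them from the four raw counts. The formulas are the standard definitions, with elusion taken as FN / (FN + TN) and fallout as FP / (FP + TN); the function name `confusion_metrics` is hypothetical. Applied to the Topic 2225 Reasonable Call column, these definitions agree with the published percentages to within rounding in the last digit.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the summary rows of a confusion matrix from its four counts."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)        # share of relevant documents found
    precision = tp / (tp + fp)     # share of submitted documents that were relevant
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "elusion": fn / (fn + tn),  # relevant share of the documents not submitted
        "fallout": fp / (fp + tn),  # share of irrelevant documents that were submitted
    }


if __name__ == "__main__":
    # Counts from the Topic 2225 column "@ Reas. Call".
    metrics = confusion_metrics(tp=163, tn=464_927, fp=38, fn=19)
    for name, value in metrics.items():
        print(f"{name:<10}{value:8.2%}")
```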
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Rootkits topic, by the time 97.5% Recall had been attained only 0.69% of the corpus, 3,188 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.31% or 461,959 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
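The recall-threshold figures quoted for each topic (documents submitted by the time 70%, 90%, 97.5% or 100% Recall was reached) can be derived from the ordered list of adjudications. The following is a minimal sketch, assuming the submissions are available as a sequence of 0/1 relevance labels in submission order; the function name `docs_to_reach` and the toy labels are hypothetical and are not taken from the Team's data.

```python
def docs_to_reach(labels, total_relevant, thresholds=(0.70, 0.90, 0.975, 1.0)):
    """Map each recall threshold to the number of documents that had been
    submitted when that recall was first reached.

    labels: iterable of 1 (judged relevant) or 0 (judged not relevant),
            in the order the documents were submitted.
    total_relevant: number of relevant documents in the whole corpus.
    """
    reached = {}
    found = 0
    for submitted, is_relevant in enumerate(labels, start=1):
        found += is_relevant
        for t in thresholds:
            if t not in reached and found / total_relevant >= t:
                reached[t] = submitted
    return reached


if __name__ == "__main__":
    # Toy run: 4 relevant documents in the corpus, found at ranks 1, 2, 4 and 7.
    toy = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    print(docs_to_reach(toy, total_relevant=4, thresholds=(0.5, 0.75, 1.0)))
```

Dividing each returned count by the corpus size gives the "percent of documents submitted" curve plotted in the charts.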
______________________________________
Topic 102 Capital Punishment
Confusion Matrix - Topic 102 Capital Punishment
Total Documents: 290,099   Total Relevant: 1,624   Total Prevalence: 0.56%
                  @ Reas. Call   @ 97.5% Recall
True Positives             941            1,583
True Negatives         288,345           17,048
False Positives            130          271,427
False Negatives            683               41
Recall                  57.94%           97.50%
Precision               87.86%            0.58%
F1 Measure              69.83%            1.15%
Accuracy                99.72%            6.42%
Error                    0.28%           93.58%
Elusion                  0.24%            0.24%
Fallout                  0.05%           94.09%
Topic 102 was run by Losey with the assistance of a review attorney, Jensen. The work to search the 290,099 Bush Emails started on July 26, 2015 and concluded on July 29, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was handled with the assistance, at first, of Jensen. Losey performed all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On July 28, 2015, after making 20 submissions to TREC, and training after almost every submission, Losey had provided a total of 1,071 documents to TREC and confirmed a total of 941 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 1,493 documents. After the 20th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 57.94% had been attained with a Precision of 87.86%, so his call proved to be early. There were only 3 additional submissions to TREC after the Reasonable call point, which we later learned was a mistake. We learned later that higher Recall and overall TREC scoring come from multiple, smaller submissions, with training after each. This is another Topic in which we found many of the TREC judgments inconsistent and incomprehensible. Still, even with these problems and errors, a Recall of 70% was attained after a total of only 7,785 documents had been submitted out of 290,099, and only 1,493 documents had been reviewed.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Capital Punishment topic, by the time 97.5% Recall had been attained 94.11% of the corpus, 273,010 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 5.89% or 17,089 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 106 Terri Schiavo
Confusion Matrix - Topic 106
Total Documents: 290,099   Total Relevant: 17,135   Total Prevalence: 5.91%
Topic 106 was run by Losey with the assistance of a review attorney, Bottolene. The work to search the 290,099 Bush Emails started on July 27, 2015 and concluded on August 2, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was handled with the assistance at first of Bottolene. Losey performed all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. Again, all final decisions on submittal were made by Losey. This review process went longer than others because this proved to be the highest prevalence Topic (5.91%).
On August 2, 2015, after making 25 submissions, with training after most of these, Losey had submitted a total of 17,354 documents. A total of 16,872 of these submissions were confirmed relevant by TREC, for a Precision rate of 97.22%. The effort, or number of documents reviewed and coded by Losey to attain this result, was 2,025 documents. After the 25th TREC submission, Losey decided to call Reasonable. It was later determined that an incredible Recall of 98.47% had been attained. The F1 measure was 97.84%. That is the Team’s best result on any of the Bush Email Topics. Further, Losey believes this may be a personal best for Recall and F1 scores. There were 7 additional submissions to TREC after the Reasonable call point. In the 29th submission, 99.7% Recall was attained after submitting only 7,060 additional documents. The Precision was 70%.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives          16,872           16,707
True Negatives         272,482          272,551
False Positives            482              413
False Negatives            263              428
Recall                  98.47%           97.50%
Precision               97.22%           97.59%
F1 Measure              97.84%           97.54%
Accuracy                99.74%           99.71%
Error                    0.26%            0.29%
Elusion                  0.10%            0.16%
Fallout                  0.18%            0.15%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Terri Schiavo topic, by the time 97.5% Recall had been attained only 5.90% of the corpus, 17,120 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 94.10% or 272,979 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 105 Affirmative Action
Confusion Matrix - Topic 105
Total Documents: 290,099   Total Relevant: 3,635   Total Prevalence: 1.25%
                  @ Reas. Call   @ 97.5% Recall
True Positives           3,353            3,544
True Negatives         286,399          281,585
False Positives             65            4,879
False Negatives            282               91
Recall                  92.24%           97.50%
Precision               98.10%           42.08%
F1 Measure              95.08%           58.78%
Accuracy                99.88%           98.29%
Error                    0.12%            1.71%
Elusion                  0.10%            0.03%
Fallout                  0.02%            1.70%
Topic 105 was run by Losey with the assistance of a review attorney, Jensen. The work to search the 290,099 Bush Emails started on July 29, 2015 and concluded on July 31, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was performed with the assistance at first of Jensen. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On July 30, 2015, after making 23 document submissions to TREC providing a total of 3,418 documents, Losey had found a total of 3,353 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 674 documents. After the 23rd TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 92.24% had been attained, with Precision of 98.1%, and F1 of 95.08%. There were 7 additional submissions to TREC after the Reasonable call point. In the 27th submission, after submitting only 3,427 additional documents (total 6,845), 95% Recall was attained. This was attained after submission of only 2.36% of the total documents.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Affirmative Action topic, by the time 97.5% Recall had been attained only 2.90% of the corpus, 8,423 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 97.10% or 281,676 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 3357 Occupy Vancouver
Confusion Matrix - Topic 3357
Total Documents: 902,434   Total Relevant: 629   Total Prevalence: 0.07%
Topic 3357 was run by Reichenberger. The work to search the 902,434 News Articles database started on July 29, 2015, and was completed on July 30, 2015. The initial submissions on the first day were to test the outlines of the category. The initial search of “Occupy” AND “Vancouver” identified a series of protests in Vancouver about economic income inequality. Documents were selected based on a variety of content, including “Occupy” movements in other cities, riots/protests that took place in the same area (but not same time) as the Occupy Vancouver protests, and generic stories about “Occupy” protests that reference protests in Vancouver but do not specifically name them as “Occupy Vancouver.” Various sources were also tested, such as Letters to the Editor, stories sourced in other cities and so forth. Results helped formulate an anticipated rule on relevance. After training EDR and receiving priority scores, relevant documents on subsequent submissions were confirmed by these rules and their priority scores. In fact, of the five irrelevant documents found in the last 2 submissions on July 29th, three scored over 97% and contained substantial and direct references to Occupy Vancouver; these may be TREC coding errors.
A modified Step Three, Random Sample of 1,000 documents, was taken after Step Two was complete. The first 500 contained 50 “training” documents to focus on, while the second 500 documents contained 250. All documents hitting on “Occupy” OR “Vancouver” OR “Ashlie Gough” (a student who died at the protests) OR “Robson Square” (location of the protests) were reviewed, while all others were mass trained as irrelevant. The last TREC submission on July 29th was from the 1,000 random documents. Of the 1,000 documents, 33 were identified as relevant, confirmed by submission.
                  @ Reas. Call   @ 97.5% Recall
True Positives             576              613
True Negatives         901,680          900,834
False Positives            125              971
False Negatives             53               16
Recall                  91.57%           97.46%
Precision               82.17%           38.70%
F1 Measure              86.62%           55.40%
Accuracy                99.98%           99.89%
Error                    0.02%            0.11%
Elusion                  0.01%            0.00%
Fallout                  0.01%            0.11%
On the second day, the 30th, documents containing search terms and escalated as relevant were reviewed and submitted in priority order. In the first submission of the day, 123 were submitted as relevant and 118 came back as confirmed relevant. Of the five irrelevant in that set, four were documents that had the exact same relevant text as documents TREC previously confirmed as relevant. This is another example of the kind of “gold standard” inconsistencies the Team encountered in most of the Topics. In the next set of submissions, documents escalated as relevant by Mr. EDR included stories sourced in the Vancouver paper on Occupy movements elsewhere, and sports stories with the word “occupy” in the article (e.g. “Another Vancouver player occupied the penalty box”). Once those documents were removed as irrelevant, all others were submitted and confirmed as relevant on submission. Some additional “gray area” documents were submitted (e.g. “Occupy Christmas” which was an offshoot of the protests, or campaign questions posed to candidates about the Occupy Vancouver protests). As the Mr. EDR ranking scores decreased, the precision dropped.
Prior to the final submissions, all documents with “Occupy” and “Vancouver” with relevance probability scores over 0.1% had either been submitted or reviewed, and all documents with scores over 75% without those terms had also been reviewed. After the final Reasonable call was made, the remaining documents were submitted in the following groups in descending priority order: 1) all documents currently coded as irrelevant by the human reviewer not yet submitted (2,212 documents, of which 45 were found to be relevant); 2) anything remaining with “Occup!” AND “Vancouver” (493 documents, all with scores below 0.1%, of which 8 were found to be relevant); and then 3) all else (no relevant documents found in this set).
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable call.
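The three-group submission order used after the Reasonable call can be expressed as a two-part sort key: group first, then descending priority score within the group. The sketch below is illustrative only; the `Doc` record and its field names are hypothetical stand-ins for whatever coding and scoring fields the review platform exposes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    score: float            # model relevance probability, 0.0 to 1.0
    coded_irrelevant: bool  # reviewer coded it irrelevant, not yet submitted
    has_terms: bool         # hits the "Occup!" AND "Vancouver" search


def group(doc):
    """Group 0: human-coded irrelevant; group 1: remaining term hits; group 2: all else."""
    if doc.coded_irrelevant:
        return 0
    if doc.has_terms:
        return 1
    return 2


def submission_order(docs):
    """Sort by group, then by descending priority score within each group."""
    return sorted(docs, key=lambda d: (group(d), -d.score))


if __name__ == "__main__":
    remaining = [
        Doc("a", 0.0005, False, True),
        Doc("b", 0.40, True, True),
        Doc("c", 0.02, False, False),
        Doc("d", 0.65, True, False),
    ]
    print([d.doc_id for d in submission_order(remaining)])
```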
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Occupy Vancouver topic, by the time 97.5% Recall had been attained only 0.18% of the corpus, 1,584 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.82% or 900,850 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 2158 Using TOR for Anonymous Browsing on the Internet
Confusion Matrix - Topic 2158
Total Documents: 465,149   Total Relevant: 1,261   Total Prevalence: 0.27%
                  @ Reas. Call   @ 97.5% Recall
True Positives           1,243            1,230
True Negatives         463,793          463,824
False Positives             95               64
False Negatives             18               31
Recall                  98.57%           97.54%
Precision               92.90%           95.05%
F1 Measure              95.65%           96.28%
Accuracy                99.98%           99.98%
Error                    0.02%            0.02%
Elusion                  0.00%            0.01%
Fallout                  0.02%            0.01%
Topic 2158 was run by Sullivan, who also started on July 29, 2015. He finished his review of 465,149 forum posts in BlackHatWorld on July 31, 2015. Sullivan’s computer background proved to be helpful in another uncommon forum topic. He considers himself more knowledgeable on this topic than the average person, but does not consider himself to be a subject matter expert on TOR.
Day 1 of this topic started with concept searching to find other keywords relating to TOR and anonymous browsing. Many previously unknown terms came to light, such as vpn, tor browser, proxy, and ip. This process of using concept searching at the beginning of every topic became standard process for all remaining reviews done by Sullivan. The results of this exercise were used in future keyword searches as well as database-wide keyword highlighting.
Next, Sullivan started manually reviewing some of the hits on terms he felt would be most likely to yield responsive documents. He started with 102 documents that hit on “TOR” and “anonym*” and moved on to hits on “TOR Browser,” then “TOR” and “Prox*.” It was not difficult to find a relatively high quantity of relevant documents. 108 relevant documents and 100 irrelevant documents were trained for predictive coding when the first learning session was run. After the first learning session completed, Sullivan manually reviewed the highest scoring documents that contained the term “TOR” and found almost all to be relevant. At the conclusion of the first day, 214 documents had been submitted to TREC, with all 214 being returned as relevant.
Day 2 consisted of many iterations of learning sessions and evaluating search results. Similar to how Sullivan reviewed Topic 2052, he started with a narrow list of keyword searches and broadened the terms iteratively. For each set, he reviewed the documents with the highest predictive coding scores. Starting the day with “TOR” and “prox*,” he moved to “Try TOR,” “Try using TOR,” and “Use TOR.” Eventually he moved to all documents that contained “TOR” or “T0R.” Every document he determined to be relevant was submitted to TREC.
At the end of the exercise, Sullivan had submitted 1,339 documents, with 1,244 being returned as relevant and 95 being returned as not relevant according to the TREC standard. At this point he called his shot at Reasonable Recall.
Day 3 started with the submission of all remaining documents that contained the term “TOR” as a method to catch any documents potentially missed. No additional relevant documents were returned.
All remaining documents in the database were submitted in order of descending predictive coding score. 14 more relevant documents were returned. Evaluation of these documents led to finding spectacular errors in the TREC standard. All 14 contained “*tor*” in some context, but none had any even marginal links to the current topic. A majority of the missed documents contained the term “hostigator.com.” Evaluation of these 14 documents resulted in a determination that all 14 were caused by an error in the TREC classification system.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Using TOR for Anonymous Internet Browsing topic, by the time 97.5% Recall had been attained only 0.28% of the corpus, 1,294 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.72% or 463,855 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 104 New Medical Schools
Confusion Matrix - Topic 104
Total Documents: 290,099   Total Relevant: 227   Total Prevalence: 0.08%
Topic 104 was run by Losey with the assistance of a review attorney, Jensen. The work to search the 290,099 Bush Emails started on July 31, 2015 and concluded on August 4, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was performed with the assistance at first of Jensen. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On August 3, 2015, after making 8 document submissions to TREC providing a total of 199 documents, Losey had found a total of 157 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 1,091 documents. After the 8th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 69.16% had been attained, with Precision of 78.89%, and F1 of 73.71%. He made the call decision a little prematurely on this Topic. In the next submission of only 20 documents, Losey brought the Recall level up to 71.37% with Precision of 73.97%. In the next submission of 781 documents he brought the Recall level to 77.97%. There were a total of 7 additional submissions to TREC after the Reasonable call point. After submitting a total of 1,611 documents, which is only 0.56% of the total documents, and reviewing only 1,091 documents, an 80% Recall was attained.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives             157              222
True Negatives         289,830           51,763
False Positives             42          238,109
False Negatives             70                5
Recall                  69.16%           97.80%
Precision               78.89%            0.09%
F1 Measure              73.71%            0.19%
Accuracy                99.96%           17.92%
Error                    0.04%           82.08%
Elusion                  0.02%            0.01%
Fallout                  0.01%           82.14%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the New Medical Schools topic, by the time 97.5% Recall had been attained 82.16% of the corpus, 238,331 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 17.84% or 51,768 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 109 Scarlet Letter Law
Confusion Matrix - Topic 109 Scarlet Letter Law
Total Documents: 290,099   Total Relevant: 506   Total Prevalence: 0.17%
                  @ Reas. Call   @ 97.5% Recall
True Positives             485              494
True Negatives         289,568          289,502
False Positives             25               91
False Negatives             21               12
Recall                  95.85%           97.63%
Precision               95.10%           84.44%
F1 Measure              95.47%           90.56%
Accuracy                99.98%           99.96%
Error                    0.02%            0.04%
Elusion                  0.01%            0.00%
Fallout                  0.01%            0.03%
Topic 109 was run by Losey with the assistance of a review attorney, Bottolene. The work to search the 290,099 Bush Emails started on August 3, 2015 and concluded on August 11, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was performed with the assistance at first of Bottolene. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On August 11, 2015, after making 26 submissions to TREC providing a total of 510 documents, Losey had found a total of 485 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 953 documents. After the 26th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 95.85% had been attained, with Precision of 95.1%. There were 14 additional submissions to TREC after the Reasonable call point. In the next submission after the call, of only 121 documents, a Recall of 98.62% was attained. Recall of 100% was attained three submissions later after submitting only 1,074 documents, 0.37% of the total, and review of only 953 documents.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Scarlet Letter Law topic, by the time 97.5% Recall had been attained only 0.20% of the corpus, 585 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.80% or 289,514 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 100 School and Preschool Funding
Confusion Matrix - Topic 100
Total Documents: 290,097   Total Relevant: 4,542   Total Prevalence: 1.57%
Topic 100 was run by Losey with the limited assistance of a review attorney, Jensen. The work to search the 290,099 Bush Emails started on August 4, 2015 and concluded on August 8, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was performed with some assistance at first of Jensen. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made a couple of suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On August 6, 2015, after making 44 submissions to TREC providing a total of 2,537 documents, Losey had found a total of 2,441 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 651 documents. After the 44th TREC submission, Losey decided to call Reasonable. This proved to be a premature call. It was later determined that a Recall of 53.74% had been attained, with Precision of 96.22%, and F1 of 68.96%. There were 19 additional submissions to TREC after the Reasonable call point. After submitting a total of 7,541 documents, which is only 2.6% of the total documents, and reviewing only 651 documents, a 70% Recall level was attained. A Recall of 80% was attained after submitting 6.28% of the total documents, and Recall of 90% after submitting 7.92%.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives           2,441            4,429
True Negatives         285,459          199,460
False Positives             96           86,095
False Negatives          2,101              113
Recall                  53.74%           97.51%
Precision               96.22%            4.89%
F1 Measure              68.96%            9.32%
Accuracy                99.24%           70.28%
Error                    0.76%           29.72%
Elusion                  0.73%            0.06%
Fallout                  0.03%           30.15%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the School and Preschool Funding topic, by the time 97.5% Recall had been attained 31.20% of the corpus, 90,524 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 68.80% or 199,573 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 107 Tort Reform
Confusion Matrix - Topic 107
Total Documents: 290,099   Total Relevant: 2,369   Total Prevalence: 0.82%
                  @ Reas. Call   @ 97.5% Recall
True Positives           1,950            2,310
True Negatives         287,421          284,197
False Positives            309            3,533
False Negatives            419               59
Recall                  82.31%           97.51%
Precision               86.32%           39.53%
F1 Measure              84.27%           56.26%
Accuracy                99.75%           98.76%
Error                    0.25%            1.24%
Elusion                  0.15%            0.02%
Fallout                  0.11%            1.23%
Topic 107 was run by Losey with the limited assistance of a review attorney, Jensen. The work to search the 290,099 Bush Emails started on August 5, 2015 and concluded on August 15, 2015. The project commenced with Losey and his assistant beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal, was performed with some assistance at first of Jensen. Losey handled all of the AI related searches in Step Five, including the probability and ranking related searches. His assistant focused on keyword searches and also made a couple of suggestions of documents to submit. Again, all final decisions on submittal were made by Losey.
On August 14, 2015, after making 48 submissions to TREC providing a total of 2,259 documents, Losey had found a total of 1,950 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 1,164 documents. After the 48th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 82.31% had been attained, with Precision of 86.32%, and F1 of 84.27%. There were 31 additional submissions to TREC after the Reasonable call point. After submitting a total of 2,648 documents, which is only 0.91% of the total documents, and reviewing only 1,164 documents, a 90% Recall level was attained with 80.55% Precision. Recall of 95% was attained after submitting 3,963 documents, 1.37% of total. Recall of 98% was attained after submitting 5,843 documents, 2.01% of total.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Tort Reform topic, by the time 97.5% Recall had been attained only 2.01% of the corpus, 5,843 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 97.99% or 284,256 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 3481 Fracking
Confusion Matrix - Topic 3481 Fracking
Total Documents: 902,434   Total Relevant: 1,966   Total Prevalence: 0.22%
                  @ Reas. Call   @ 97.5% Recall
True Positives           1,893            1,917
True Negatives         900,284          899,841
False Positives            184              627
False Negatives             73               49
Recall                  96.29%           97.51%
Precision               91.14%           75.35%
F1 Measure              93.64%           85.01%
Accuracy                99.97%           99.93%
Error                    0.03%            0.07%
Elusion                  0.01%            0.01%
Fallout                  0.02%            0.07%
Topic 3481 was run by Sullivan, who started on August 4, 2015. He finished his review of 902,434 News Articles on August 7, 2015 after 7 total hours of effort.
Sullivan had no background or knowledge of fracking prior to this exercise. While expert knowledge was not necessary, there were a few instances where some additional knowledge of the topic would have been helpful.
Sullivan had previously tackled topics in the forums dataset, but this was his first topic in the News dataset. He found that the lack of spelling issues and the overall consistency in the documents provided a much easier set of data to review. Much less manual review was necessary with the news topics.
On the first day, Sullivan used concept searching to identify similar topics, per his standard process. He created a list of most likely relevant keywords and used the list for searching and keyword highlighting. Both search and keyword highlighting lists were modified through the course of the review as new information was obtained.
Sullivan decided to go with a different approach to this topic. Rather than performing a manual review of documents to begin, he decided to submit as relevant, without review, any document that contained over 5 instances of the term “fracking.” 286 documents met this standard, and all were returned as relevant when submitted to TREC.
While the data used for this exercise did not contain any metadata, Sullivan determined that any text that appeared in the first 2 lines of the document could be considered the document’s title. He found 61 documents that contained “fracking” in the title and an additional instance of fracking elsewhere in the document. Of these, 60 were returned as relevant, with 1 returned as not relevant. Further evaluation determined the not relevant document was an error in the TREC standard. Next, 9 documents were found which contained “hydrofracking” in the title. All 9 were returned as relevant. He then continued with slight variations until submitting all documents that contained 2 or more hits on the term “fracking.” After 1 hour and manual review of 29 documents, 746 documents had been submitted with 745 being returned as relevant.
Sullivan continued manually reviewing the documents with a single hit on fracking to sort out the false positives. After reviewing a couple of sets of documents, he initiated his first predictive coding learning session for this topic.
At the start of Day 2, Sullivan believed he had found nearly all relevant documents for this topic. However, after reviewing documents with high predictive coding scores, he quickly realized that “fracturing” was another key term he hadn’t previously considered. The use of predictive coding helped him quickly find an additional 400 relevant documents that would have been lost if using keyword searching alone.
Reasonable Recall was called after submitting 2,077 documents, with 1,893 returned as relevant. The remaining documents were submitted in order of descending predictive coding scores, and 73 more relevant documents were returned. An evaluation of the returned documents revealed many errors in the TREC standard, as well as a fair number of relevant documents that were not properly captured due to Sullivan’s lack of knowledge of fracking and related mining terms.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
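The term-frequency and title rules Sullivan used to submit documents without review can be sketched as follows. This is a hedged illustration, not the Team's tooling: the function names are hypothetical, the match is a case-insensitive whole-word match on “fracking” (so “hydrofracking” does not count), and the title is taken to be the first two lines of the text, as the narrative describes.

```python
import re

FRACKING = re.compile(r"\bfracking\b", re.IGNORECASE)


def count_hits(text):
    """Number of whole-word occurrences of 'fracking' in the text."""
    return len(FRACKING.findall(text))


def split_title(text, title_lines=2):
    """Treat the first `title_lines` lines as the title, the rest as the body."""
    lines = text.splitlines()
    return "\n".join(lines[:title_lines]), "\n".join(lines[title_lines:])


def frequency_rule(text, more_than=5):
    """Rule 1: submit without review if the term appears over `more_than` times."""
    return count_hits(text) > more_than


def title_rule(text):
    """Rule 2: the term appears in the title and at least once more in the body."""
    title, body = split_title(text)
    return bool(FRACKING.search(title)) and bool(FRACKING.search(body))


if __name__ == "__main__":
    article = (
        "Fracking debate heats up\n"
        "By Staff Reporter\n"
        "Residents say fracking threatens local wells."
    )
    print(frequency_rule(article), title_rule(article))
```

Documents that fail both rules would fall through to manual review or to ranking by the predictive coding score, which is where the additional “fracturing” documents were found.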
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Fracking topic, by the time 97.5% Recall had been attained only 0.27% of the corpus, 2,439 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.73% or 899,995 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic3431KingstonMillsLockMurdersConfusionMatrix-Topic3431TotalDocuments:902,434TotalRelevant:1,111 TotalPrevalence:0.12%
Topic3431wasrunbyReichenberger.Theworktosearchthe902,434NewsArticlesdatabasestartedonAugust4,2015,andwascompletedonAugust5,2015.Theinitialsubmissionsonthefirstdayweretotesttheoutlinesofthecategory.Theinitialsearchof“Kingston”AND“murder”identifiedasensationalizedmurderstoryaboutamanwiththelastname“Shafia”murderinghisdaughtersinan“honorkilling.”Documentscontainingtheinformationinvariousforms(headline,text,“clickbait”linkreferenceatendofarticle)weresubmitted.Resultshelpedformulateananticipatedruleonrelevance.AftertrainingMr.EDRandreceivingrelevancepriorityscores,asearchonthespecificvictimnamesor“Shafia”weresortedbyprioritizationorder.Samplesof10documentsabove90%,10between80-90%,10between60-80%,10between25-60%and10below25%showedthatdocumentsabove60%wereverylikelyrelevant.Infact,documentsscoringover90%allhadmultiplenamehitsandwerespecificallyonpoint;documentsinthemiddlerangeswereusuallyindirectlyrelated(e.g.about“honorkilling,”ordomesticabuse,ormoreofacasualreferencetotheKingstonMillsmurders);andthosedocumentsbelow5%werealmostalwaysirrelevant.Asatest,thesecondsubmissioncontainedalldocumentswithascoreover90%,alongwithsamplesofseveraldocumentsatvariousscoresgreaterthan50%,cuttingthesubmissionoffat200documentseven.Withonly111documentsreviewedeyesontothispoint,Reichenbergerhada98.5%precisionon205documentssubmitted.Ofthe205documentssubmittedtothispoint,theonly3irrelevantdocumentsallhadthesametrait:“Shafia”appearedintheheaderbuttherewasnoreferencetoitinthetext.Similardocumentsweremasscodedasirrelevantgoingforward.Likewise,peoplewithnamessimilartothevictimswerefoundinthe40-60%probabilityrangebutwere“falsepositive”documents.TheseincludedanAPphotographer,thePresidentofGambia,andprotestersinYemenwithfirst
                  @ Reas. Call   @ 97.5% Recall
True Positives           1,107            1,084
True Negatives         901,309          901,311
False Positives             14               12
False Negatives              4               27
Recall                  99.64%           97.57%
Precision               98.75%           98.91%
F1 Measure              99.19%           98.23%
Accuracy               100.00%          100.00%
Error                    0.00%            0.00%
Elusion                  0.00%            0.00%
Fallout                  0.00%            0.00%
names the same as one of the victims. Searches were done on those specific names and the results were mass-tagged as irrelevant. After a machine learning session, the scores adjusted, dropping those false positive names to the bottom. At this point, a sampling of key term hits showed everything with scores over 20% was relevant, and everything below 1% was irrelevant. Everything in between was a low quality reference to the murders, with some irrelevant documents mixed in. As such, the next submission was for everything with a key term and over a 25% relevance score (456 documents), of which 449 were found relevant. The 7 documents found irrelevant were misclicks by Reichenberger (human error). In one case a document was primarily about a different murder, but later in the article there was relevant discussion of the target murder. Mr. EDR picked this up, but it was apparently missed by TREC’s relevance scope adjudications. The 70% Recall call was then made, having reviewed only 209 documents. It turned out that Recall was actually 58.6% with Precision at 98.5%.
The next submission consisted largely of documents containing a single line of “clickbait” link text found by TREC to be relevant. Other documents considered were documents with key terms that had scores raised above 20% following the machine learning session from the previous set, and documents with scores above 50% with no key terms. While documents with key terms were largely found to be relevant, most of the documents without the terms were found to be irrelevant. In fact, documents scoring above 70% were often tangential to the issues in the murder (domestic violence mostly) but not relevant, while those at 50-70% had no semblance of relevance at all, and were being escalated based on coincidental “clickbait” text advertisement lines at the end of the article. Another 459 documents were submitted, of which 456 were found relevant. The three irrelevant documents were all at the low end of the scores within the submission and were only passing references to the case. At this point the 80% recall call was made. Recall was actually at 99.64% with a precision at 99.34%. Only 272 documents were reviewed eyes-on to this point, and 1,120 relevant documents had been found. All documents with scores over 70% had been reviewed or submitted, and all those with key terms and scores over 20% had been reviewed or submitted.
Following the subsequent machine learning session, 30 documents were escalated to consider. One borderline document was considered potentially relevant and submitted, and was returned as irrelevant, while the rest were all marked irrelevant. The Reasonable call was made.
After the Reasonable call was made, documents were submitted in the following groups in descending priority score order: 1) three potentially relevant documents found while pending results of the previous submission (one was found to be relevant); 2) all documents reviewed eyes-on and anticipated to be irrelevant, but not yet submitted (199 documents, of which two were relevant, and the only relevant text within these two documents was contained in a document previously submitted to TREC and returned as irrelevant); 3) anything mass-coded as irrelevant (this resulted in one relevant document, in which there does not appear to be any relevant material and which may be yet another TREC coding error); and 4) anything remaining (all irrelevant).
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the Reasonable call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Kingston Mills Lock Murders topic, by the time 97.5% Recall had been attained only 0.12% of the corpus, 1,096 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.88% or 901,338 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the Multimodal Hybrid model of training Mr. EDR.
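The score-band sampling described for Topic 3431 (ten documents from each of five probability ranges) can be sketched as below. This is only an illustration under stated assumptions: the function name `sample_by_band`, the exact band boundaries, the fixed random seed, and the representation of scores as a dictionary are all hypothetical and are not taken from EDR.

```python
import random

# Bands follow the ranges quoted in the text; how the boundaries are
# closed (inclusive vs exclusive) is an assumption made for this sketch.
BANDS = [
    (">90%", 0.90, float("inf")),
    ("80-90%", 0.80, 0.90),
    ("60-80%", 0.60, 0.80),
    ("25-60%", 0.25, 0.60),
    ("<25%", 0.00, 0.25),
]


def sample_by_band(scores, per_band=10, seed=0):
    """Draw up to `per_band` document ids at random from each score band.

    `scores` maps a document id to a relevance probability in [0, 1].
    """
    rng = random.Random(seed)
    samples = {}
    for label, low, high in BANDS:
        in_band = sorted(d for d, s in scores.items() if low <= s < high)
        samples[label] = rng.sample(in_band, min(per_band, len(in_band)))
    return samples
```

A reviewer would then read each sample and note the band below which documents stop being relevant, which is how the 60% and 5% observations in the narrative were reached.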
______________________________________
Topic 2130 Surely Bitcoins Can Be Used
Confusion Matrix - Topic 2130
Total Documents: 465,147
Total Relevant: 2,299
Total Prevalence: 0.49%
                  @ Reas. Call   @ 97.5% Recall
True Positives           1,961            2,242
True Negatives         461,007          448,083
False Positives          1,841           14,765
False Negatives            338               57
Recall                  85.30%           97.52%
Precision               51.58%           13.18%
F1 Measure              64.29%           23.23%
Accuracy                99.53%           96.81%
Error                    0.47%            3.19%
Elusion                  0.07%            0.01%
Fallout                  0.40%            3.19%
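The rate rows in each confusion matrix follow from the four counts. The sketch below recomputes them for the Topic 2130 Reasonable call column. It assumes the standard definitions (Elusion as false negatives over all unsubmitted documents, Fallout as false positives over all irrelevant documents); this section of the paper does not state its formulas, and the class name `Confusion` is hypothetical. Values computed from the raw counts can differ from the published table in the last digit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # relevant documents submitted
    tn: int  # irrelevant documents not submitted
    fp: int  # irrelevant documents submitted
    fn: int  # relevant documents not submitted

    def metrics(self) -> dict:
        total = self.tp + self.tn + self.fp + self.fn
        recall = self.tp / (self.tp + self.fn)
        precision = self.tp / (self.tp + self.fp)
        return {
            "recall": recall,
            "precision": precision,
            "f1": 2 * precision * recall / (precision + recall),
            "accuracy": (self.tp + self.tn) / total,
            "error": (self.fp + self.fn) / total,
            "elusion": self.fn / (self.fn + self.tn),  # assumed definition
            "fallout": self.fp / (self.fp + self.tn),  # assumed definition
        }


# Counts from the "@ Reas. Call" column for Topic 2130.
m = Confusion(tp=1961, tn=461007, fp=1841, fn=338).metrics()
for name, value in m.items():
    print(f"{name:>9}: {value:.2%}")
```

The same class can be applied to any other topic's table by substituting its four counts.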
Topic 2130 was run by Reichenberger. The work to search the 465,147 documents in the BlackHatWorld Forums database started on August 7, 2015 and was completed August 13, 2015. The initial submissions were to test the outlines of the category. The first submission was nine documents with varying discussions about Bitcoin (e.g. bitcoin exchanges, whether bitcoin was accepted, bitcoin mining, etc). All nine came back as irrelevant. A second submission of nine returned five relevant documents but no noticeable commonality among them except that “accept bitcoin” was relevant and “accept bitcoins” was not. The next 25 documents submitted also followed this trend, with singular “accept bitcoin” being relevant, and those in the plural being irrelevant. All documents with “accept w/3 bitcoin” were submitted in the following two submission sets; however, having that text was not indicative of relevance, as some still came back irrelevant. Likewise, documents containing a variation of bitcoin (“BTC”) were submitted (15 relevant, 5 irrelevant, no consistent thread).
After a machine learning session, the submitted documents were revisited and it appeared that using bitcoin for legal activity, or someone vouching for a forum user, tended to be relevant, while illegal or immoral activity was irrelevant. For the next submission, the 60 highest scoring documents were submitted and anticipated as relevant or irrelevant based on the purpose of the transaction. While not perfect, this largely correlated with the results (10 expected relevant, end result was 13). The next submission contained all documents with a 90% or higher probable relevance score and containing the term “vouch*”. Of the 122 documents, 94 were relevant.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Surely Bitcoins Can Be Used topic, by the time 97.5% Recall had been attained only 3.66% of the corpus, 17,007 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 96.34% or 448,140 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the Multimodal Hybrid model of training Mr. EDR.
______________________________________
Topic 3089 Pickton Murders
Confusion Matrix - Topic 3089
Total Documents: 902,434
Total Relevant: 255
Total Prevalence: 0.03%
Topic 3089 was run by Joe White. Work on this topic commenced on August 5, 2015 and concluded on August 28, 2015. Approximately 24 hours were spent on this topic, including a few hours up front researching the subject matter. This served as a proxy for the e-Discovery Team Hybrid Multimodal Model, Step 1, ESI Discovery Communications. Completion of this Topic was drawn out due to time conflicts including vacation.
The collection of 902,434 News Articles was generally easier to search than the Bush Emails or BlackHatWorld Forum posts, though the news articles contained many links, footers and subject matters that were shared with other news stories, creating the appearance of similarity. As would be expected with news articles, misspelled words and names seemed nonexistent, which was helpful. White did, however, find a few gold-standard inconsistencies in this topic.
White began Step Two, multimodal search, by creating several keyword lists based on his judgment and notes from the initial topic research. This research included events, names, locations, and other information related to the case. The keyword list goals were: (a) to create a seed set to begin finding the potentially relevant documents and to begin training Mr. EDR; (b) to guesstimate how large the relevant document set would be, a kind of rough substitute for Step Three Sample; and (c) to highlight relevant terms in the software to facilitate more effective review and training. (Note: all reviewers so highlighted certain keywords as a matter of course to speed up and improve review.)
When the initial keywords brought back only just over 220 documents, White, while still cognizant of the limitations of keyword search, believed this meant a relatively small potential dataset existed. This afforded him the ability to perform a linear review of all of the keyword hits, but also meant that precision would be easily harmed by false positives. For that reason White knew that care would be needed in ascertaining true relevance. A normal Step 3,
                  @ Reas. Call   @ 97.5% Recall
True Positives             236              249
True Negatives         902,164          901,971
False Positives             15              208
False Negatives             19                6
Recall                  92.55%           97.65%
Precision               94.02%           54.49%
F1 Measure              93.28%           69.94%
Accuracy               100.00%           99.98%
Error                    0.00%            0.02%
Elusion                  0.00%            0.00%
Fallout                  0.00%            0.02%
initial Random Baseline sample, was omitted given the likely low prevalence and general time constraints for the work.
Based on the initial judgmental sample reviews in Step Two, White submitted initial sets of documents to TREC to establish relevance boundaries and begin whittling down the set of relevant candidate documents. A minor loss of precision was anticipated on certain documents in exchange for knowledge that would guide subsequent submissions. Each time documents were determined to be relevant, White updated the training and predictive ranking, to facilitate priority-driven review that augmented the judgmental sampling work (see steps Four, Five and Six: AI Predictive Ranking, Multimodal Search Review & Hybrid Active Training). He also utilized conceptual search (predominantly Find Similar, via LSI) to branch off particularly interesting or novel documents to learn more. Although White, like all of the reviewers, did use concept search and similarity search, he found that the predictive coding rankings (using a more robust technology) proved to be more effective overall. All reviewers had the same experience.
During the initial part of the submission process, White trained on all documents deemed relevant or irrelevant by TREC. This helped create additional separation in the model and rankings. In one instance he left one obvious TREC mistake trained as relevant (a duplicate of another document that had been adjudicated relevant) in order to ensure he would find any others like it.
During the predictive analysis and training, White found it was most helpful to review certain sets of documents from the bottom up, to analyze the least-likely candidates in cases where relevance seemed clear. In other sets of documents, where relevance seemed less certain, White reviewed from the top down. After additional analysis was completed and 99 documents had been submitted to TREC, White predicted there would be 200–250 relevant documents in total. (In the end, he would learn there were 255 total relevant documents in this topic, so the early prediction turned out to be quite close.)
White also used random sampling in one instance, to train a set of 100 documents that seemed clearly irrelevant. These documents assisted Mr. EDR in separating irrelevant docs from relevant ones at a point early in the process when only relevant documents had been trained. This was part of the Team’s experimentation with the ideal ratios of irrelevant to relevant documents in training models.
As is almost always the case with an iterative training process, as the training and learning commenced, additional relevant subject areas came to light. While almost all of these areas were somewhat apparent from the start, fascinating and subtle nuances emerged. News stories on the case took little turns and spawned entirely new areas of relevance unto themselves. White thought the biggest challenge with these documents was not so much whether they existed or how to locate them, but whether TREC would see them as relevant or not. He found that it helped to track each pocket of relevance as a separate subject area, to utilize keywords for each subject area to create small seed sets, and to then utilize the predictive rankings within each subject area to dive deeper and ensure that each was adequately explored.
White made a total of 56 document submissions to TREC in this topic: 6 submissions between Aug. 6th and 12th, encompassing 184 documents; 22 submissions between Aug. 21st and 27th, encompassing 284 documents; and the remaining 28 submissions on Aug. 28th, encompassing 901,966 documents. In between most of these submissions he conducted iterative steps Four, Five and Six of the standard workflow, utilizing predictive ranking, search, and training.
After 218 documents had been submitted and additional priority-ranked documents and top keyword sets had been evaluated, White called 70%. There was still a fair quantity of suspected borderline documents in hand, but his intuition was that he had probably surpassed 70% by a fair margin and so needed to call the shot. Actual Recall at this point turned out to be 83.53%.
White then studied closely the suspected borderline documents before he decided to submit them. He was attempting to determine the scope of relevance for these subject areas. After locating what he believed to be the full extent of the subject, and having found 23 more relevant documents, he called the 80% shot. White believed he was even farther along than 80%, given the ranked results he was seeing. As it turned out the actual Recall at this point was 92.55%.
After submitting 8 more documents that he thought might be considered relevant, but were close questions and probably would not, White called Reasonable. This was with 251 total documents submitted, 236 of them relevant, and only 779 documents reviewed. Actual Recall at this point was still 92.55%.
Having called Reasonable and finding nothing new that looked relevant, White turned to his pool of remaining documents that looked irrelevant, to allow the predictive ranking to help him begin submitting them. Indeed, Mr. EDR helped see things he could not, and soon found 18 additional documents that contained an oblique reference to a subject related to the case. While these documents seemed just as oblique as others that were deemed irrelevant, the fact that the predictive rankings caught them quickly was reassuring. After an additional round of training and predictive ranking turned up no additional documents, the submissions continued. Finally, at the 2,000th document submitted, a “relevant” document was discovered that completed the 255-doc set. This document appeared to be a clear mistake, as it was only a reference to an unrelated London, UK murder. After that, all remaining documents submitted were confirmed as irrelevant.
On August 28, 2015, after making 19 submissions to TREC providing a total of 251 documents, White had found a total of 236 relevant documents. The effort, or number of documents reviewed and coded by White to attain this result, was 834 documents. After the 18th TREC submission, White decided to call Reasonable. It was later determined that a Recall of 92.55% had been attained, with Precision of 94.02%. There were 37 additional submissions to TREC after the Reasonable call point. After submitting a total of 462 documents, which is only 0.05% of the total 902,434 documents, and reviewing only 834 documents, a 99.61% Recall level was attained with 54.98% Precision. 100% Recall with 12.75% Precision was attained after submission of 2,000 documents, which is 0.22% of the total.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Pickton Murders topic, by the time 97.5% Recall had been attained only 0.05% of the corpus, 457 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.95% or 901,977 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 2461 Offshore Host Sites
Confusion Matrix - Topic 2461 Offshore Host Sites
Total Documents: 465,147
Total Relevant: 179
Total Prevalence: 0.04%
Topic 2461 was run by Sullivan who started on August 14, 2015.
                  @ Reas. Call   @ 97.5% Recall
True Positives             175              175
True Negatives         463,225          463,408
False Positives          1,743            1,560
False Negatives              4                4
Recall                  97.77%           97.77%
Precision                9.12%           10.09%
F1 Measure              16.68%           18.29%
Accuracy                99.62%           99.66%
Error                    0.38%            0.34%
Elusion                  0.00%            0.00%
Fallout                  0.37%            0.34%
He finished his review of the 465,147 BlackHatWorld forum posts on Aug. 15, 2015 after 5.0 total hours of effort.
Sullivan’s background and knowledge in host sites was expected to be helpful in this topic, but in reality it worked against him. While he does not consider himself to be a subject matter expert on this topic, he has a solid level of knowledge of host sites. This proved difficult, because he thought he knew what documents should be considered relevant, but the TREC gold standard disagreed with most of his determinations.
Per his standard process, Sullivan started with concept searching to identify popular keywords to use for highlighting and future searches. This generated a long list of terms relating to different hosting sites and VPNs. Sullivan continued with the next step of finding some documents to seed predictive coding and to get an understanding of the TREC line for relevance. He found 8 documents that hit on “offshore host* site*” and contained clearly relevant content by his definition. TREC determined all 8 to be not relevant. He then found 5 documents that relate to specific offshore hosting sites, such as hostingpanama and anonhoster. TREC returned 1 relevant and 4 not relevant. He continued to try different variations of terms relating to hosting in specific countries, and documents with different types of content, and could not find any logic to the TREC relevance standard.
Frustrated, he initiated a learning session and took a break. Upon returning, he decided to try a test submission of the 29 top scoring documents that contained the text “offshore” w/2 “host” without looking at any of the documents. To his surprise, 26 of the documents were returned by TREC as relevant. In a review of the documents, he saw no difference between the content of the TREC relevant documents and the documents he found and submitted that were returned as not relevant. The only general correlation he was able to identify is that the TREC standard appeared to favor smaller sized documents with a higher proportion of content dedicated to offshore host sites. A document with a single line discussing offshore host sites was more likely to be relevant than a document with 50 lines and 10 references.
Being unable to determine any reasonable connection between content and relevance, Sullivan had no choice but to continue riding Mr. EDR’s suggestions for documents to submit. This process consisted of many iterations of learning sessions and searching. Similar to how Sullivan reviewed Topics 2052 and 3481, he started with a narrow list of keyword searches and broadened the terms iteratively. For each set, he submitted the documents with the highest predictive coding scores. Starting with “offshore” w/2 “host*,” he moved to “offshore” and “host,” “offshore” and “web,” and “offshore” and “vpn.” Eventually he moved to all documents that contained “offshore” or “hosting.”
The difference between this process and what was used in prior reviews is that Sullivan did not actually look at any of the documents. As he found his judgment to be out of line with the TREC standard, documents were submitted without review. Results of a search would be taken and the top documents would be submitted. If most were determined to be relevant, lower sets of documents from the result would be submitted until a low number of relevant documents was returned. He would then move on to the next search and repeat. After exhausting all of the key terms, Sullivan submitted all remaining documents in descending priority order.
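The submit-without-review loop described above can be sketched as follows. This is a simplified illustration, not the Team's tooling: the function name `submit_by_rank`, the batch size, and the yield threshold below which a search is abandoned are hypothetical, and `judge` stands in for the TREC feedback service. The learning sessions that re-scored documents between searches are omitted for brevity.

```python
def submit_by_rank(searches, score, judge, batch_size=50, min_yield=0.5):
    """Submit search hits in descending score order, one batch at a time.

    searches: list of hit lists (document ids), ordered narrow to broad.
    score:    dict mapping document id to predictive ranking score.
    judge:    callable taking a list of ids, returning a list of booleans.
    A search is abandoned once a batch's relevant share drops below
    `min_yield`; the next, broader search is then tried.
    """
    submitted = set()
    relevant = []
    for hits in searches:
        ranked = sorted(
            (d for d in hits if d not in submitted),
            key=lambda d: score[d],
            reverse=True,
        )
        for start in range(0, len(ranked), batch_size):
            batch = ranked[start:start + batch_size]
            verdicts = judge(batch)
            submitted.update(batch)
            relevant.extend(d for d, ok in zip(batch, verdicts) if ok)
            if sum(verdicts) / len(batch) < min_yield:
                break  # yield too low, move on to the next search
    return submitted, relevant
```

Everything left unsubmitted when the searches are exhausted would then be sent in descending score order, as the narrative describes.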
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Offshore Host Sites topic, by the time 97.5% Recall had been attained only 0.37% of the corpus, 1,735 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.63% or 463,412 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic 3290 Rooster Turkey Chicken Nuisance
Confusion Matrix - Topic 3290
Total Documents: 902,434
Total Relevant: 26
Total Prevalence: 0.00%
Topic 3290 was run by Losey alone, who started on August 15, 2015 and concluded on August 23, 2015. The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training.
On August 22, 2015, after making 14 submissions to TREC, and training after almost every submission, Losey had provided a total of 95 documents to TREC and confirmed a total of 23 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 306 documents. After the 14th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 88.46% was attained by submission of only 95 documents, which is 0.01% of the total 902,434 documents. This was accomplished by review of only 0.03% of the total collection.
There were 23 additional submissions to TREC after the Reasonable call point. In the next submission after the Reasonable call, the 15th, the Recall level rose to 96.15%. Recall of 100% was attained after submission of only 1.93% of the collection. A 90% Recall was attained after submitting only 129 documents. A 95% Recall was attained after submitting 1,923 documents, and 97.5% Recall attained after 3,188 documents. Total Recall was attained after submitting 17,414 documents out of the corpus total of 902,434 (1.93%).
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives              23               26
True Negatives         902,336          885,020
False Positives             72           17,388
False Negatives              3                0
Recall                  88.46%          100.00%
Precision               24.21%            0.15%
F1 Measure              38.02%            0.30%
Accuracy                99.99%           98.07%
Error                    0.01%            1.93%
Elusion                  0.00%            0.00%
Fallout                  0.01%            1.93%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Rooster Turkey Chicken Nuisance topic, by the time 97.5% Recall had been attained only 1.93% of the corpus, 17,414 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 98.07% or 885,020 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 2333 Article Spinner Spinning
Confusion Matrix - Topic 2333
Total Documents: 465,147
Total Relevant: 4,805
Total Prevalence: 1.03%
                  @ Reas. Call   @ 97.5% Recall
True Positives           4,201            4,685
True Negatives         457,877          450,329
False Positives          2,465           10,013
False Negatives            604              120
Recall                  87.43%           97.50%
Precision               63.02%           31.88%
F1 Measure              73.24%           48.04%
Accuracy                99.34%           97.82%
Error                    0.66%            2.18%
Elusion                  0.13%            0.03%
Fallout                  0.54%            2.18%
Topic 2333 was run by Losey who also started on August 19, 2015. He finished his review of 465,149 forum posts in BlackHatWorld on August 23, 2015. The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training.
On August 21, 2015, after making 23 submissions to TREC, and training after almost every submission, Losey had provided a total of 6,666 documents to TREC and confirmed a total of 4,201 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 228 documents. After the 23rd TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 87.43% was attained by submission of only 6,666 documents, which is 1.43% of the total 465,147 documents. This was accomplished by personal review of only 228 documents, 0.05% of the total collection.
There were 32 additional submissions to TREC after the Reasonable call point. Recall of 90% was attained after submitting 7,091 documents, and 95% Recall after 10,931. Recall of 97.5% was reached after submitting 14,698 documents, which was only 3.16% of the total collection of 465,147 BlackHatWorld Forum posts. Again, this was accomplished by personal review of only 228 documents, 0.05% of the total collection. In all topics we always stopped individual document review after the Reasonable call and relied on Mr. Robot’s automatic processes, wherein the documents were submitted in order of highest ranking.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Article Spinner Spinning topic, by the time 97.5% Recall had been attained only 3.16% of the corpus, 14,698 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 96.84% or 450,449 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 2129 Facebook Accounts
Confusion Matrix - Topic 2129
Total Documents: 465,147
Total Relevant: 589
Total Prevalence: 0.13%
Topic 2129 was run by Sullivan who started on August 21, 2015. He finished his review of 465,149 forum posts in BlackHatWorld on August 22, 2015.
While he counts himself among Facebook’s 1.5 billion active users, Sullivan does not consider himself more knowledgeable on this topic than the average person.
Day 1 on this topic started like all Sullivan topics, with concept searching to find keywords relating to Facebook accounts for searching and highlighting. Specifically, variations of Facebook spelling and slang were investigated to ensure all common variants were identified. Many previously unexpected variations of facebook were identified, such as fbook. All variations were added to the highlighting list and documented for future searches.
Sullivan spent 2.5 hours on Day 1 trying to define relevance according to the TREC standard. He started with 8 documents that contained clear references to facebook accounts, and only 1 of the documents was returned as relevant according to the TREC standard. He continued by isolating documents that contained “Facebook account*” in the title as well as a number of common variants. At the end of the day, Sullivan was no closer to cracking the Facebook puzzle and was barely able to exceed 50% precision even though he was only submitting documents that were certain to be relevant by any objective standard.
Facing what appeared to be a dead end, Sullivan started Day 2 by relying on the priority scores generated by Mr. EDR, and started to see much better results. While Sullivan was unable to identify which documents would be returned as responsive by TREC, Mr. EDR seemed to be able to find the pattern. As such, he stopped looking at the documents, and just started submitting all documents that had a high priority score and contained the term Facebook or any known
                  @ Reas. Call   @ 97.5% Recall
True Positives             580              575
True Negatives         461,284          462,644
False Positives          3,274            1,914
False Negatives              9               14
Recall                  98.47%           97.62%
Precision               15.05%           23.10%
F1 Measure              26.11%           37.36%
Accuracy                99.29%           99.59%
Error                    0.71%            0.41%
Elusion                  0.00%            0.00%
Fallout                  0.70%            0.41%
variation, with learning sessions being run periodically to update the scores based on new learning. Once those documents were exhausted, all remaining documents were submitted in descending priority score order. He spent 2.75 hours submitting and evaluating the results, for a total of 5.25 hours spent on this topic.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Facebook Accounts topic, by the time 97.5% Recall had been attained only 0.54% of the corpus, 2,489 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.46% or 462,658 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic 3378 Rob McKenna Gubernatorial Candidate
Confusion Matrix - Topic 3378
Total Documents: 902,434
Total Relevant: 66
Total Prevalence: 0.01%
                  @ Reas. Call   @ 97.5% Recall
True Positives              59               65
True Negatives         902,321          902,264
False Positives             47              104
False Negatives              7                1
Recall                  89.39%           98.48%
Precision               55.66%           38.46%
F1 Measure              68.60%           55.32%
Accuracy                99.99%           99.99%
Error                    0.01%            0.01%
Elusion                  0.00%            0.00%
Fallout                  0.01%            0.01%
Topic 3378 was run by Reichenberger. The work to search the 902,434 News Articles database started on August 22, 2015, and was completed on August 23, 2015. The initial submissions on the first day were to test the outlines of the relevance scope. It was ascertained in the first two submissions that documents relating to McKenna as a candidate were relevant, and those related to his job as Attorney General were irrelevant. Borderline documents were those associated with his Attorney General job that could be pretext to a political campaign (e.g. filing a suit related to Obamacare implementation).
The third submission was made with the next 65 documents based on prioritization without looking at the content; the results largely confirmed the anticipated parameters (43 relevant, 22 irrelevant, with the borderline documents skewing to the irrelevant). The 70% call was made following the return of results. After looking at what was being promoted by prioritization and containing “McKenna,” the next 13 documents were submitted. Most of these appeared to be borderline, and only 4 were adjudicated relevant by TREC. The 80% recall call was made at that point. One more set of 14 documents was submitted and only 3 came back responsive. The decision was then made to call Reasonable, and thereafter the final submissions were made.
The post-call submissions were made in the following groups in descending priority score order: 1) all documents reviewed that were currently anticipated to be irrelevant, but had not been submitted (129 documents, of which 7 were relevant); 2) anything remaining with “McKenna” (695 documents, all irrelevant); and then 3) all else (all irrelevant).
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying Recall thresholds. On the Rob McKenna Gubernatorial Candidate topic, by the time 97.5% Recall had been attained only 0.02% of the corpus, 169 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.98% or 902,265 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the Multimodal Hybrid model of training Mr. EDR.
______________________________________
Topic 2322 Web Scraping
Confusion Matrix - Topic 2322 Web Scraping
Total Documents: 456,147
Total Relevant: 10,145
Total Prevalence: 2.22%
Topic 2322 was run by Losey who also started on August 22, 2015. He finished his review of 465,149 forum posts in BlackHatWorld on August 25, 2015. The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training.
On August 25, 2015, after making 24 submissions to TREC, and training after almost every submission, Losey had provided a total of 12,799 documents to TREC and confirmed a total of 8,060 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 195 documents. After the 24th TREC submission, Losey decided to call Reasonable. It was later determined that a Recall of 79.45% was attained by submission of only 12,799 documents, which is 2.8% of the total documents. This was accomplished by review of only 0.04% of the total collection.
There were 21 additional submissions to TREC after the Reasonable call point. In the next submission after the Reasonable call, the 25th, 1,000 documents were submitted and they all came back relevant. Obviously an error in gamesmanship had been made and the call was made a little too early. After that 25th submission, the Recall level rose to 89.31% and the Precision increased to 65.66%. A 90% Recall was attained after submitting 14,477 documents. A 95% Recall was attained after submitting 16,983 documents, and 97.5% Recall attained after 19,821 documents were submitted, which was only 4.35% of the total collection of 456,147 BlackHatWorld Forum posts.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                  @ Reas. Call   @ 97.5% Recall
True Positives           8,060            9,892
True Negatives         441,263          436,073
False Positives          4,739            9,929
False Negatives          2,085              253
Recall                  79.45%           97.51%
Precision               62.97%           49.91%
F1 Measure              70.26%           66.02%
Accuracy                98.50%           97.77%
Error                    1.50%            2.23%
Elusion                  0.47%            0.06%
Fallout                  1.06%            2.23%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Web Scraping topic, by the time 97.5% Recall had been attained only 4.35% of the corpus, 19,821 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 95.65% or 436,326 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 3484 Paul and Cathy Lee Martin
Confusion Matrix - Topic 3484
Total Documents: 902,434
Total Relevant: 23
Total Prevalence: 0.00%
                  @ Reas. Call   @ 97.5% Recall
True Positives              23               23
True Negatives         902,411          902,411
False Positives              0                0
False Negatives              0                0
Recall                 100.00%          100.00%
Precision              100.00%          100.00%
F1 Measure             100.00%          100.00%
Accuracy               100.00%          100.00%
Error                    0.00%            0.00%
Elusion                  0.00%            0.00%
Fallout                  0.00%            0.00%
This Topic was run by Sullivan who started on August 24, 2015. He completed his review of 902,434 documents on August 25, 2015. The entire Team observed his final submissions and cheered on his perfect handling of this search project.
This topic was completely unknown to Sullivan prior to this exercise. His only knowledge came from a quick Google search on the topic. Sullivan started late on Day 1 and began with a simple search using the following keywords: ((martin w/3 paul) AND cathy) OR ((martin w/3 cathy) AND paul). This search returned 26 documents. A quick review of the documents yielded 22 clearly relevant documents and 1 marginally relevant. Sullivan submitted the 22 relevant documents, which were all returned as relevant by TREC, and quit for the night after 15 minutes of effort.
On Day 2, Sullivan went back to his standard process of using concept searching to find relevant keywords for highlighting and searches. As with all topics in dataset 3, spelling errors were non-existent, which removed the requirement of broad searching to account for slang or spelling issues. Broad searches were run using all relevant keywords and the results were sampled. Next, predictive coding scores were used to identify additional potentially relevant documents. A large number of false positives were encountered when it was discovered a popular hockey player and Prime Minister shared the same names as the parties. These were quickly identified and excluded from the potentially relevant set. After 90 minutes of work, Sullivan conceded that he was unable to find any additional relevant documents. In reviewing the single marginally relevant document found on Day 1, it was determined this document was very likely to be relevant, so it was submitted to TREC and was in fact returned relevant. At this point, Sullivan called reasonable recall and submitted all remaining documents in descending order of priority score.
After all documents were submitted, it was discovered that Sullivan in fact had attained 100% recall and 100% precision at the point the reasonable call was made. Additionally, 95.7% recall was attained, with 100% precision, after only 15 minutes. In all, he was able to achieve a perfect game with only 1.75 hours committed to this topic!
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
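The opening keyword search in this topic combines a proximity operator (w/3) with Boolean AND and OR. The following is a minimal sketch of how such a query could be evaluated, not a description of how EDR implements it. It assumes that w/3 means the two terms occur within three word positions of each other in either order, and that words are lowercase alphanumeric runs; the helper names `tokens`, `within` and `matches` are hypothetical.

```python
import re

def tokens(text):
    """Split text into lowercase alphanumeric word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(toks, a, b, n):
    """True if terms a and b occur within n word positions of each other, in either order."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def matches(text):
    """Evaluate ((martin w/3 paul) AND cathy) OR ((martin w/3 cathy) AND paul)."""
    toks = tokens(text)
    present = set(toks)
    return (within(toks, "martin", "paul", 3) and "cathy" in present) or (
        within(toks, "martin", "cathy", 3) and "paul" in present
    )

# Illustrative, made-up snippets (not documents from the TREC collection).
print(matches("Paul Martin and his wife Cathy Lee Martin"))
print(matches("Prime Minister Paul Martin spoke today"))
```

Requiring both given names alongside the surname is what keeps the query narrow: under these assumptions, a document that mentions only a Paul Martin, such as the second snippet, does not match.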
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Paul and Cathy Lee Martin topic, by the time 97.5% Recall had been attained only 0.00% of the corpus, 23 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 100.00% or 902,411 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic 2134 Paypal Accounts
Confusion Matrix - Topic 2134   Total Documents: 465,147   Total Relevant: 252   Total Prevalence: 0.05%
Topic 2134 was run by Sullivan who started on August 26, 2015. He finished his review of 465,147 forum posts in BlackHatWorld on August 26, 2015.
                   @Reas. Call   @97.5% Recall
True Positives             241             246
True Negatives         461,447         443,136
False Positives          3,448          21,759
False Negatives             11               6
Recall                  95.63%          97.62%
Precision                6.53%           1.12%
F1 Measure              12.23%           2.21%
Accuracy                99.26%          95.32%
Error                    0.74%           4.68%
Elusion                  0.00%           0.00%
Fallout                  0.74%           4.68%
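The derived rows of the confusion matrix follow arithmetically from the four counts. The sketch below recomputes them for the Reasonable Call column of Topic 2134. The formulas for Elusion (false negatives over all documents not submitted as relevant) and Fallout (false positives over all truly irrelevant documents) are the usual definitions and are assumed here, since this section does not spell them out; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the table's derived measures as fractions (multiply by 100 for percent)."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "elusion": fn / (fn + tn),   # assumed definition: FN / (FN + TN)
        "fallout": fp / (fp + tn),   # assumed definition: FP / (FP + TN)
    }

# Counts taken from the @Reas. Call column for Topic 2134.
at_call = confusion_metrics(tp=241, tn=461_447, fp=3_448, fn=11)
for name, value in at_call.items():
    print(f"{name:9s} {value:7.2%}")
```

Under these definitions the recomputed values agree with the reported column to two decimal places, and the four counts sum to the stated 465,147 documents.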
As a regular PayPal user for about 10 years, Sullivan has a high level of knowledge regarding this topic. This advanced knowledge proved to be a burden on this topic because his understanding of what should be relevant did not match the TREC gold standard. He was able to overcome this burden by relying on a variety of advanced methods rather than using his own judgment in review of the documents.
Sullivan started this topic with his usual process of running concept searches to find similar and related keyword terms for highlighting and future searching. As with all forum topics, he spent some time identifying common variants based on misspelling or slang. All variations were added to the database for highlighting.
While using a number of methods to identify documents he felt were clearly relevant, Sullivan quickly realized he was unable to make any logic of the TREC relevance standard. Documents with similar or identical content were seemingly arbitrarily designated as relevant or not relevant. Rather than spend considerable time evaluating the documents himself, as was done in Topic 2129 Facebook Accounts, he went straight to Mr. EDR for help.
Similar to the method developed in Topic 2129, Sullivan relied heavily on the predictive coding and did very little review of any documents. He would iteratively submit the highest scoring documents to TREC for analysis, and train the documents with the relevancy determination returned. In addition to using a continuous active learning approach, he started using the "Find Similar" feature much more to find documents that contained similar characteristics to documents already determined to be relevant. He started with documents that contained a variation of PayPal in the subject line, then moved to documents that contained the term anywhere in the text. Using this multimodal method he was able to work his way through the entire dataset with almost no actual review of the documents. In all, Sullivan was able to complete the review for this topic in less than 4 hours.
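The iterative loop described above (submit the highest scoring documents, receive the relevance determination, retrain, repeat) can be sketched as follows. This is only an illustration of the general continuous active learning pattern. The scoring model is a deliberately crude term-count stand-in, not EDR's predictive coding, and the names `train`, `score`, `continuous_active_learning` and `oracle` (which plays the role of TREC's instant feedback) are all hypothetical.

```python
from collections import Counter

def train(labeled):
    """Toy ranker: each term gains 1 per relevant document and loses 1 per irrelevant one."""
    weights = Counter()
    for text, is_relevant in labeled:
        for term in set(text.lower().split()):
            weights[term] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    """Sum the learned weights of the distinct terms in a document."""
    return sum(weights[t] for t in set(text.lower().split()))

def continuous_active_learning(corpus, oracle, seed_ids, batch_size=2):
    """Submit a batch, record the feedback, retrain, and queue the next top-ranked batch."""
    labeled = {}      # doc id -> relevance returned by the oracle
    submitted = []    # doc ids in the order they were submitted
    queue = list(seed_ids)
    while queue:
        for doc_id in queue:
            labeled[doc_id] = oracle(doc_id)
            submitted.append(doc_id)
        weights = train([(corpus[i], rel) for i, rel in labeled.items()])
        remaining = [i for i in corpus if i not in labeled]
        remaining.sort(key=lambda i: score(weights, corpus[i]), reverse=True)
        queue = remaining[:batch_size]
    return submitted, labeled

# Made-up toy corpus; documents 1, 2 and 4 are treated as relevant.
corpus = {
    1: "paypal account limited",
    2: "paypal account verified",
    3: "seo backlinks guide",
    4: "buy paypal account",
    5: "youtube views cheap",
    6: "forum rules",
}
relevant = {1, 2, 4}
submitted, labeled = continuous_active_learning(corpus, lambda i: i in relevant, seed_ids=[1])
print(submitted)
```

The loop runs until every document has been submitted, mirroring the Track's requirement to keep submitting in ranked order after the reasonable call. In a real review the seed set would come from keyword, concept or "Find Similar" searches rather than a single hand-picked document.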
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Paypal Accounts topic, by the time 97.5% Recall had been attained only 4.73% of the corpus, 22,005 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 95.27% or 443,142 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic 3423 Rob Ford Cut the Waist
Confusion Matrix - Topic 3423   Total Documents: 902,434   Total Relevant: 76   Total Prevalence: 0.01%
Topic 3423 was run by Losey who also started on August 26, 2015. He finished his review of 902,434 News Articles on August 27, 2015.
The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training. On August 26, 2015, after making 11 submissions to TREC, and training after almost every submission, Losey had provided a total of 40 documents to TREC and confirmed a total of 34 relevant documents. The effort, or number of documents reviewed and coded by Losey to attain this result, was 92 documents.
After the 11th TREC submission, Losey decided to call Reasonable. This proved to be a premature call. It was later determined that a Recall of 44.74% was attained. In the 17 automatic submissions that followed, Recall of 76.32% was attained with 84.06% Precision. The 76.32% Recall was attained after submitting only 106 documents, which is 0.01% of the total of 902,434. There were 17 submissions to TREC after the Reasonable call point. Total 100% Recall was attained after submitting only 35,193 documents, which is 3.9% of the total.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% Recall call, and the dark green line the Reasonable Recall call.
                   @Reas. Call   @97.5% Recall
True Positives              34              75
True Negatives         902,352         867,337
False Positives              6          35,021
False Negatives             42               1
Recall                  44.74%          98.68%
Precision               85.00%           0.21%
F1 Measure              58.62%           0.43%
Accuracy                99.99%          96.12%
Error                    0.01%           3.88%
Elusion                  0.00%           0.00%
Fallout                  0.00%           3.88%
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Rob Ford Cut the Waist topic, by the time 97.5% Recall had been attained only 3.89% of the corpus, 35,096 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 96.11% or 867,338 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% Recall using the multimodal hybrid model of search and training of Mr. EDR.
______________________________________
Topic 3133 Pacific Gateway
Confusion Matrix - Topic 3133   Total Documents: 902,434   Total Relevant: 113   Total Prevalence: 0.01%
Topic 3133 was run by Losey who also started on August 27, 2015. He finished his review of 902,434 News Articles on August 28, 2015.
                   @Reas. Call   @97.5% Recall
True Positives              87             111
True Negatives         902,311         799,986
False Positives             10         102,335
False Negatives             26               2
Recall                  76.99%          98.23%
Precision               89.69%           0.11%
F1 Measure              82.86%           0.22%
Accuracy               100.00%          88.66%
Error                    0.00%          11.34%
Elusion                  0.00%           0.00%
Fallout                  0.00%          11.34%
The project commenced as usual with Losey beginning Step Two, Multimodal Search Reviews. Step Three, Random Baseline, was omitted. After submissions began, the echo Step Five, multimodal search began, including predictive coding features, with iterated training. On August 28, 2015, after making 7 submissions to TREC, and training after almost every submission, Losey had provided a total of 97 documents to TREC and confirmed a total of 87 relevant documents. The effort, or number of documents individually reviewed and coded by Losey to attain this result, was 49 documents.
After the 7th TREC submission, Losey decided to call Reasonable. That call proved to be a little premature. It was later determined that a Recall of 76.99% was attained with Precision of 89.69%. In the 6th automatic submission after the call, a Recall of 94.69% was attained after submitting only 693 documents total, which is 0.07% of the total of 902,434. There were 24 submissions to TREC after the Reasonable call point. Total 100% Recall was attained after submitting 103,189 documents, which is 11.43% of the total.
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Pacific Gateway topic, by the time 97.5% Recall had been attained only 11.35% of the corpus, 102,446 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 88.65% or 799,988 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.
______________________________________
Topic 3226 Traffic Enforcement Cameras
Confusion Matrix - Topic 3226   Total Documents: 902,434   Total Relevant: 2,094   Total Prevalence: 0.23%
Topic 3226 was run by Sullivan who also started on August 27, 2015. He finished his review of 902,434 News Articles on August 28, 2015.
Sullivan has some prior experience as a criminal defense attorney, with experience with traffic laws, but he has no prior experience with traffic enforcement cameras, which were not in use at the time he was practicing. As usual, Sullivan started his investigation with his standard process of using keyword and concept searches to formulate a list of related keywords for highlighting and future searching. For this exercise, nothing extraordinary was discovered, but he was able to generate a good list of terms relating to traffic cameras, red light cameras, and traffic tickets.
Day 1 was a short day and started with submitting the results of the most popular keyword searches with minimal review. After 30 minutes of work, 76 documents were submitted with 50 being returned as relevant.
Using the documents identified on Day 1, Sullivan was able to start utilizing the predictive coding to supplement his searches on Day 2. He was able to progressively make his way through the review set using a combination of predictive coding scores and keyword hits. He used this multimodal approach to submit large sets of documents with minimal, if any, manual review. He believed he had found all relevant documents after submitting only 5,347 total documents with 2,061 relevant. After submitting all of the remaining documents in descending order by predictive coding priority score, it was discovered he only missed 33 of the relevant documents in the dataset after submitting 0.6% of the documents! Because he minimized the amount of manual review on this topic, he was able to complete this topic after 3.0 hours on Day 2, for a total of 3.5 hours on this topic.
                   @Reas. Call   @97.5% Recall
True Positives           2,061           2,042
True Negatives         897,054         899,807
False Positives          3,286             533
False Negatives             33              52
Recall                  98.42%          97.52%
Precision               38.54%          79.30%
F1 Measure              55.39%          87.47%
Accuracy                99.63%          99.94%
Error                    0.37%           0.06%
Elusion                  0.00%           0.01%
Fallout                  0.36%           0.06%
A graph mapping how the review was conducted appears below, with the light green line signifying the anticipated 70% recall call, and the dark green line the reasonable recall call.
The following chart shows Precision (left and blue line), F1 (red) and percent of documents submitted (green) as tracked across varying recall thresholds. On the Traffic Enforcement Cameras topic, by the time 97.5% Recall had been attained only 0.29% of the corpus, 2,575 documents, had been submitted for adjudication. The last portion of the graph thus represents the submission of the remaining 99.71% or 899,859 documents.
The last chart below represents the amount of effort in terms of documents reviewed to attain 100% recall using the multi-modal hybrid model of training EDR.