Intelligent Event Focused Crawling

Mohamed Magdy Gharib Farag

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in Computer Science and Applications

Edward A. Fox, Co-Chair
Riham H. Mansour, Co-Chair

Weiguo Fan
Scotland C. Leman
Padmini Srinivasan

August 11, 2016
Blacksburg, VA

Keywords: Focused Crawling, Event Modeling, Web Archives, Digital Libraries, Source Importance, Social Media, Seed Generation

Copyright 2016 Mohamed M. G. Farag

ABSTRACT

There is need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend the traditional focused (topical) crawling techniques in two directions, modeling and representing: events and webpage source importance.

We developed an event model that can capture key event information (topical, spatial, and temporal). We incorporated that model into the focused crawler algorithm. For the focused crawler to leverage the event model in predicting a webpage’s relevance, we developed a function that measures the similarity between two event representations, based on textual content.

Although the textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance by the ratio of number of relevant webpages to non-relevant webpages found during crawling a website. We combined the textual content information and source importance into a single relevance score.

For the focused crawler to work well, it needs a diverse set of high quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We proposed an automated approach for curating seed URLs using social media content. We leveraged the richness of social media content about events to extract URLs that can be used as seed URLs for further focused crawling.

We evaluated our system through four series of experiments, using recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first experiment series our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better results in precision, recall, and F1-score. In the second series, using harvest ratio to measure ability to collect relevant webpages, our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Acknowledgments

First I would like to thank my advisors Dr. Edward A. Fox and Dr. Riham Mansour for all the continuous help, support, encouragement, patience, motivation, and guidance that they gave me through my Ph.D. journey. I could not have imagined having better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank the rest of my thesis committee - Prof. Padmini Srinivasan, Prof. Patrick Weiguo Fan, and Prof. Scotland Leman - for their insightful comments and encouragement, but also for the hard questions which motivated me to widen my research from a variety of perspectives.

My sincere thanks also go to Seth Peery, Luke Ward, Shane Coleman, and Andi Ogier, who provided me an opportunity to join their team as an intern, and who helped me learn new technologies, extend my skills, and apply my research experience to practical problems.

I thank my fellow labmates: Seungwon Yang, Sunshin Lee, Venkat Srinivasan, Tarek Kanan, Sung Hee Park, Monica Akbar, Kiran Chitturi, Prashant Chandrasekar, Eric Fouh, and all other lab members I forgot to mention, for the stimulating discussions, for the sleepless nights we were working together before deadlines, and for all the fun we have had in the last five years.

I would like to thank my wife Samar ElSaadawy. No words or thanks would suffice or give her what she deserves. She suffered a lot for me. She supported me in all the different stages of my Ph.D. without complaining. Without her, I wouldn’t have done anything. Thank you my hero Samar ElSaadawy.

Last but not least, I would like to thank my family - my parents and my brother and sisters - for supporting me spiritually throughout my Ph.D. and my life in general.

Thanks go to NSF for support, especially through grants IIS-1619028, IIS-1619371, IIS-1319578, DUE-1141209, IIS-0916733, and IIS-0736055. Thanks also go to Virginia Tech’s Digital Library Research Laboratory and Department of Computer Science.

Table of Contents

ABSTRACT
Acknowledgments
Table of Contents
List of Figures
List of Tables
1 Introduction
    1.1 Motivation
    1.2 Hypotheses
    1.3 Research Questions
    1.4 5S
    1.5 Thesis Organization
2 Related Work
    2.1 Web Crawling
    2.2 The IDEAL project
    2.3 Web Archiving and Archive-It service
    2.4 Focused Crawling
        2.4.1 Machine Learning
        2.4.2 Semantic Similarity
        2.4.3 Content and Link Analysis
    2.5 Event Modeling
    2.6 Social Media and Focused Crawling Seed Selection
    2.7 Evaluating topical/focused crawlers
3 Focused Crawling
    3.1 Topic Representation
    3.2 Crawler Architecture
    3.3 Large Scale Design Considerations
4 Event Focused Crawler
    4.1 Event Model and Representation
        4.1.1 Event Modeling
    4.2 Event Processing
        4.2.1 Event Model-based Webpage Scoring
        4.2.2 Calculating The Weights
        4.2.3 Event Model-based URL Scoring
5 Experimental Setup
    5.1 Datasets
    5.2 Experiments
    5.3 Evaluation Metrics
6 Results
    6.1 Event Model-based vs. Topic-Only Classification
        6.1.1 Classifying URLs and webpages about California shooting
    6.2 Event Model-based vs. Topic-only Focused Crawler
        6.2.1 California Shooting
        6.2.2 Brussels Attack
        6.2.3 Oregon shooting
        6.2.4 Egypt airplane crash
        6.2.5 Panama papers
        6.2.6 Orlando shooting
        6.2.7 Paris attacks
        6.2.8 Ecuador earthquake
7 Webpage Source Importance and Social media-based Seed Selection
    7.1 Webpage Source Importance
    7.2 Seed URLs for crawling
    7.3 Semi-automated Social Media-based Seed URL Generation
        7.3.1 Selecting Seed URLs
        7.3.2 Seeds URL Domain/Source Importance
8 Conclusion and Future Work
    8.1 Contributions
    8.2 Future Work
References

List of Figures

Figure 1 Overview of IDEAL system and role of event focused crawling
Figure 2 Workflow for creating Web archives from social media (Twitter)
Figure 3 Seed URLs manual curation using Archive-It service
Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box
Figure 5 Baseline focused crawler algorithm
Figure 6 Steps of building event model from seed webpages
Figure 7 The steps for calculating the score of a webpage
Figure 8 An example webpage with a relevant URL anchor text highlighted
Figure 9 The steps for calculating the score of a URL
Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment
Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages
Figure 12 California shooting URLs evaluation at different threshold values
Figure 13 California shooting webpages evaluation at different threshold values
Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting
Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)
Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack
Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)
Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)
Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egypt airplane crash (10K webpages)
Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)
Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)
Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)
Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)
Figure 24 Effect of source importance on event focused crawling for Brussels attack event
Figure 25 Effect of source importance on event focused crawling for California shooting event
Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event
Figure 27 Effect of source importance on event focused crawling for Orlando shooting event
Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets
Figure 29 Brussels attack tweets language distribution
Figure 30 Brussels attack tweets creation date distribution
Figure 31 Brussels attack tweets with URLs distribution
Figure 32 Brussels attack seed URLs domains distribution

List of Tables

Table 1 Example events of different event types
Table 2 List of events used in the large-scale crawling experiments
Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score
Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event
Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event
Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event
Table 7 California shooting event model
Table 8 Brussels attack event model
Table 9 Different types of seed URLs
Table 10 Brussels attack tweet collection statistics
Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack
Table 12 Number of different domains in the output collections of crawling experiments for Brussels attack
Table 13 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Oregon shooting
Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting
Table 15 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for California shooting
Table 16 Number of different domains in the output collections of crawling experiments for California shooting
Table 17 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Orlando shooting
Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting

1 Introduction

1.1 Motivation

There is need for an integrated event focused crawling system [1-3] to collect Web data about key events. Events lead to our most poignant memories. We remember birthdays, graduations, holidays, weddings, and other events that mark stages of our life, as well as the lives of family and friends. As a society we remember assassinations, natural disasters, man-made disasters, political uprisings, terrorist attacks, and wars -- as well as elections, heroic acts, sporting events, and other events that shape community, national, and international opinions. Web and Twitter content describes many of these societal events. In part, Web 2.0 [4] is a highly responsive sensor of important occurrences in the real world, since people from across the globe meet virtually and share related observations and stories online. We can leverage this stream of data, for automatic collection of events, to trigger event archiving, and later to enable event related services that support communities.

Permanent storage and access to big data collections of event related digital information, including webpages, tweets, images, videos, and sounds, could lead to an important international asset. Regarding that asset, there is need for digital libraries (DLs) providing immediate and effective access, and archives with historical collections that aid science and education, as well as studies related to economic, military, or political advantage.

When something notable occurs, many users try to locate the most up-to-date information about that event. Later, researchers, scholars, students, and others seek information about similar events, sometimes for cross-event comparisons or trend analyses. Yet, there is little systematic collecting and archiving anywhere of information about events, except when national or state events are captured as part of government related Web archives. This is the need addressed by the Integrated Digital Event Archive and Library (IDEAL) project [5].

Though the Internet Archive [6] supports some event-oriented archiving, coverage is limited. Many important events are ignored, while others are only captured in part. Some groups collect data only until the last victim is rescued, while others start late and miss early posts. Further, tools for capture are complex, and few archivists master their features, so achieving high recall is expensive. There are few mechanisms to filter out noise in collections. Access to the resulting archives is awkward and inefficient due to the fact that much of the content captured is non-relevant [7]. We argue that manual curation of seed URLs is not scalable and not fully effective for archiving events that have high impact.

Thus, improved technology is needed. The IDEAL project is developing a digital library/archive supporting automatic event tracking, crawling, and archiving, as well as effective access (in the sense of aiding in the finding and utilization of relevant high quality information). Figure 1 shows an overview of the IDEAL project. By taking input from tweets, news, webpages, (micro)blogs, and queries, our system will collect and archive event related digital objects, and provide a broad range of helpful services.

Figure 1 Overview of IDEAL system and role of event focused crawling

This dissertation focuses on the data front end of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. The IDEAL project has around 11 TB of webpage archives (WARC files) and over 1.2 billion tweets across hundreds of different events [8]. Early on, the webpage archives were collected using the Internet Archive’s [6] Archive-It service [9], which uses the Heritrix [10] tool for archiving webpages. Originally, the IDEAL project manually prepared a list of URLs for events, and fed it to the Archive-It service for crawling. The problem with this weakly curated approach is that we produced collections with low precision (i.e., with few relevant and many non-relevant webpages); Heritrix is a general Web crawler and doesn’t analyze the textual content of the webpages before downloading them. To overcome this problem, the IDEAL team shifted to another approach: we extracted URLs from tweet archives that were built about events, and downloaded only the corresponding webpages. The resulting collections have high precision (most of the webpages are relevant) but low recall (not all of the relevant webpages are found).

A focused crawler would help solve each of the previous problems by crawling (to increase recall) the WWW starting from the URLs extracted from the tweet archives but then following only the relevant webpages (to enhance precision) in order to find and collect as much relevant information as possible. However, in the past, focused crawling was mainly applied to topical crawling, i.e., collecting webpages about a certain topic or domain. Accordingly, since we are working with events, we extended the approach previously used by the IDEAL team and adapted/changed the traditional focused crawler approach to accommodate our needs.
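To make the idea of "following only the relevant webpages" concrete, the following is a minimal sketch of a best-first topical crawler. It is an illustration only, not the crawler developed in this dissertation: the in-memory `TOY_WEB` graph, the keyword-overlap `relevance` function, the threshold, and the rule that a link inherits its parent page's score are all assumptions made for the example. The last line computes harvest ratio in its usual sense, the fraction of downloaded pages judged relevant.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("shooting in california", ["a", "b"]),
    "a": ("california shooting victims", ["c"]),
    "b": ("celebrity gossip", ["d"]),
    "c": ("san bernardino shooting update", []),
    "d": ("more gossip", []),
}


def relevance(text, keywords):
    """Toy score: fraction of the keywords that occur in the page text."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)


def focused_crawl(seeds, keywords, threshold=0.5, max_pages=10):
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    downloaded, relevant = [], []
    while frontier and len(downloaded) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        downloaded.append(url)
        score = relevance(text, keywords)
        if score < threshold:
            continue  # non-relevant page: its links are not followed
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of its parent page.
                heapq.heappush(frontier, (-score, link))
    return downloaded, relevant


downloaded, relevant = focused_crawl(["seed"], ["shooting", "california"])
harvest_ratio = len(relevant) / len(downloaded)
```

In this toy run the page `d` is never downloaded, because it is reachable only through the non-relevant page `b`; that pruning is what distinguishes a focused crawler from a general crawler such as Heritrix.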

Table 1 Example events of different event types

Event Types          Example Events
Bombing              Boston bombing
Building Collapse    East Harlem building collapse
Community            Flint water crisis, Love wins (same sex marriage), World cup
Earthquake           Ecuador, Japan
Fire                 California wildfire, Brazil nightclub fire
Flood                Texas Floods
Hurricane            Joaquin, Sandy, Katrina
Plane Crash          Egyptair, Russian, Germanwings
Political Conflict   Brexit, Turkey coup, Greece Bailout referendum
Protests/Riots       Ferguson, Egyptian revolution
Scandal              Panama Papers, Sepp Blatter
Shooting             Oregon college shooting, California shooting, Orlando club shooting
Terrorist Attack     Paris, Brussels, Nice truck attack
Train Derailment     Amtrak 188, Quebec

There are several definitions of an event that differ according to discipline (see Chapter 4 for more details). In this dissertation we are interested in unusual real world events that capture much attention, leading to a high volume of content generated on the WWW about the event. Thus we consider different types of events like: airplane crashes, building collapses, earthquakes, floods, fires, hurricanes, political crises, scandals, shootings, terrorist attacks, and train crashes/derailments. All these types of events share the same characteristics: something happens at a specific physical location during a specific period of time. Example events for each event type are shown in Table 1. This is not an exhaustive list of events and their types, but rather a representative list of what we have been working on in this dissertation. Our research should generalize to any event with the same characteristics (i.e., something unusual happens at a specific place on a specific date).

We proposed five major changes to traditional topical focused crawling:

1. Implementing an event model and representation,
2. Incorporating the event model information extracted from seed webpages’ content into focused crawling,
3. Developing a webpage source importance model,
4. Incorporating the webpage source importance model into focused crawling,
5. Automating the process of seed URLs selection.

Our proposed approach intelligently combines the different aspects of an event to heuristically estimate the relevance to the event of a URL and/or a webpage. Our intelligent event focused crawler distinguishes on-topic URLs/webpages vs. off-topic (like with the baseline topic-only approach), and also distinguishes on-event URLs/webpages vs. off-event URLs/webpages on which the baseline topic-only approach fails. For example, consider the California shooting event. Both approaches (our intelligent event focused crawler and the baseline topical focused crawler) successfully identify on-topic URLs/webpages (i.e., URLs/webpages about a shooting). However, our event focused crawler also intelligently distinguishes between URLs/webpages that are on-event (e.g., relevant to the California shooting) and URLs/webpages that are off-event (but still on-topic, i.e., about a shooting).

We conducted four series of experiments to evaluate our system using a set of recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attacks, and Oregon shooting. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model outperformed the topic-only approaches; it showed better results in precision, recall, and F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search); it showed better results in harvest ratio. The third experiment evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance but from a smaller set of sources. The fourth experiment provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Our contributions from this research are:

1- A model and representation for capturing the different aspects of events in webpages (topic, location, and date);
2- An extended focused crawler approach that uses our event model to represent content and to estimate the relevance of webpages;
3- An automated approach for social media-based seed URL selection;
4- A method to estimate the value of webpages based on their source importance;
5- An extended focused crawler approach that integrates, for each webpage, the textual content information based on our event model, along with the webpage source importance, into a single relevance score.
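As a rough illustration of why an event representation (topic, location, and date) can separate on-event from off-event pages where a topic-only representation cannot, consider the sketch below. It is not the similarity function defined later in the dissertation: the `EventRep` class, the use of Jaccard overlap for each aspect, the weights, and the example keyword, place, and date values are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRep:
    """Hypothetical event representation with three aspects."""
    topic: frozenset      # topic keywords
    locations: frozenset  # place names
    dates: frozenset      # date strings


def overlap(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def event_similarity(x, y, weights=(0.5, 0.25, 0.25)):
    # Assumed weighting of topic, location, and date evidence.
    w_topic, w_loc, w_date = weights
    return (w_topic * overlap(x.topic, y.topic)
            + w_loc * overlap(x.locations, y.locations)
            + w_date * overlap(x.dates, y.dates))


# Illustrative values only; they are not taken from the dissertation's data.
target = EventRep(frozenset({"shooting", "gunman"}),
                  frozenset({"california"}),
                  frozenset({"2015-12-02"}))
on_event_page = EventRep(frozenset({"shooting", "gunman"}),
                         frozenset({"california"}),
                         frozenset({"2015-12-02"}))
off_event_page = EventRep(frozenset({"shooting", "gunman"}),
                          frozenset({"oregon"}),
                          frozenset({"2015-10-01"}))

# A topic-only comparison cannot tell the two pages apart.
topic_only_off_event = overlap(target.topic, off_event_page.topic)
```

Under these assumptions the off-event page matches the target perfectly on topic alone, yet scores only 0.5 once location and date are taken into account, while the on-event page scores 1.0.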

The following subsections introduce the hypotheses tested through this dissertation, the research questions investigated, the 5S approach used to guide our work, and the organization of the following chapters.

1.2 Hypotheses

This research considers the front end of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. We test two hypotheses about focused crawling, the first of which has three parts:

H1: Using several sources of evidence of webpage relevance, including content-based features, will help identify more of just the relevant webpages.

H1.1: Using temporal and spatial information as part of modeling and representing events leads to better descriptions of events than the topic/keyword-based approach.

H1.2: Incorporating event information extracted from a webpage’s content into focused crawling will increase effectiveness, helping identify more of the relevant webpages while maintaining high levels of precision.

H1.3: A webpage belonging to an important website (e.g., www.cnn.com) should lead to a larger number of relevant webpages linked to it. The source website (e.g., www.cnn.com) is acting like a hub which leads to relevant webpages.

H2: Integrating event information and webpage source importance will improve relevance prediction power, ensure balance in types of content crawled, and reduce bias.
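The abstract describes source importance as the ratio of relevant to non-relevant webpages found while crawling a website, combined with the content score into a single relevance score. The sketch below shows one way such bookkeeping could look. The class and function names, the guard against a zero denominator, the squashing of the ratio into [0, 1), the linear combination, and the `alpha` weight are assumptions for illustration; the dissertation's actual combination is defined in a later chapter.

```python
from urllib.parse import urlparse


class SourceImportance:
    """Per-website counts of relevant and non-relevant pages seen so far."""

    def __init__(self):
        self.counts = {}  # domain -> [relevant, non_relevant]

    def record(self, url, is_relevant):
        domain = urlparse(url).netloc
        pair = self.counts.setdefault(domain, [0, 0])
        pair[0 if is_relevant else 1] += 1

    def ratio(self, url):
        relevant, non_relevant = self.counts.get(urlparse(url).netloc, (0, 0))
        # Assumption: treat a zero non-relevant count as 1 to avoid dividing by zero.
        return relevant / max(non_relevant, 1)


def combined_score(content_score, ratio, alpha=0.5):
    # Assumption: squash the unbounded ratio into [0, 1) before mixing it
    # linearly with the content-based score.
    squashed = ratio / (1.0 + ratio)
    return alpha * content_score + (1 - alpha) * squashed


# example.com-style hosts below are placeholders, not sites from the experiments.
tracker = SourceImportance()
for relevant_flag in (True, True, True, False):
    tracker.record("http://news.example/story", relevant_flag)
tracker.record("http://blog.example/post", False)

news_ratio = tracker.ratio("http://news.example/other-story")
blog_ratio = tracker.ratio("http://blog.example/other-post")
score = combined_score(0.6, news_ratio)
```

With these placeholder counts, an unseen URL on the first host inherits a ratio of 3.0 from its website, so its combined score is higher than its content score alone would give, which is the hub effect that H1.3 describes.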

1.3 Research Questions

This dissertation addresses five research questions, listed below, that map, as shown in parentheticals, to the hypotheses given above.

R1: How to model and represent an event? (H1.1)
R2: How to compare two event representations? (H1.1)
R3: What is the effect of introducing event handling on the performance of focused crawling? (H1.2)
R4: How to model and represent webpage source importance? (H1.3)
R5: How to integrate handling of events and information sources? (H2)

1.4 5S

Our system is designed taking into consideration the 5S framework [11-14] for developing digital libraries. We describe here the different dimensions of the 5S framework and how they are applied in our system. We are using the 5S (societies, scenarios, spaces, structures, streams) digital library framework for two main reasons. First, 5S provides a checklist and design guidelines that help in the unfolding of our research. Second, the main goal of our system is to build an event digital library, which is a key part of the IDEAL project, which provides services (by implementing scenarios) for a variety of users (societies). We demonstrate how focused crawling adds content (digital objects) to the IDEAL event digital library. The key digital objects are tweets and webpages, that each can be viewed as a combination of stream and structure. Another of the main objects in our focused crawler is the event object, a combination of stream and structure and space. The 5S model allows us to treat events as first class objects (i.e., event objects can be created, stored, sorted on different attributes, searched, and visualized - based on attributes like location). A main design consideration of our focused crawler is the ability to receive an event object as input, and then start crawling the WWW for webpages that are relevant to that event object, thereby implementing a variant of the Web crawling scenario. More discussion of 5S and its connection to this dissertation research follows.

Societies: The system can serve different stakeholders like: historians, those in the general public, researchers, analysts, responders, event participants, and decision makers. In addition, there are software agents; see more below under Services.

Scenarios: In addition to activities of software (see more below under Services), the following will be supported:

• A User can start building a collection using the event focused crawler. The user should provide a set of URLs, collect the output of a search engine, or sample from social media content that is provided as input to our system.

• A User can visualize a certain event or group of events. Several visualization schemes are provided, including maps and timelines.

• A User can browse the system content according to facets or other mechanisms, considering: event categories, content type, entities, and archives.

• A User can search all system content types.

• A User can search a certain event collection or several collections.

Spaces: Objects in our system can be viewed in several spaces:

• Webpage documents are represented using the vector space model [15, 16].
• Events are represented in an event space model (topic, location, and date).
• Two- or three-dimensional interfaces aid user interaction.
• Probability spaces aid with characterizing interdependencies and drawing inferences.

Structures: Organization of objects in our system can be performed according to:

• Event category (earthquake, flood, hurricane, shooting, plane crash, etc.), which can fit within a taxonomy, ontology, or other type of structure;

• Data stream type/schema (for text, image, audio, video, event record, archive); or

• Entity (location, date, person name, organization name).

Representations include (metadata) records in databases, graphs, trees, etc.

Streams: These include: webpage texts, Web archives, crawl logs (data recording how the focused crawler created the collection), images, videos, audio files, event summaries or descriptions, and other similar entities associated with events. We considered events as first-class objects and created characterizations for each event. Using event related streams, you can search for an event, browse events, visualize events, etc.

Services (not one of the 5 Ss, but related to Societies and Scenarios, and helpful in describing a DL): Our system will provide services including:

• Creating or transforming collections/archives;
• Analyzing collections (extracting entities, providing summaries, classifying, clustering);
• Searching all kinds of data streams;
• Browsing according to event categories, data stream types, and/or entities; and
• Visualizing content.

1.5 Thesis Organization
This dissertation is organized as follows. Chapter 2 reviews related work, including different approaches for focused crawling. Chapter 3 discusses the architecture of the baseline topical/focused crawler. In Chapter 4, we propose our new event model and representation, and explain how it is integrated with the focused crawling approach. Chapter 5 covers the design of experiments performed, while Chapter 6 presents the evaluation of our event model-based focused crawler and of the baseline focused crawler. Chapter 7 explains the webpage source importance model and its effect on our focused crawler. Finally, Chapter 8 concludes and discusses issues for future research.


2 Related Work
We discuss the different fields related to our work in a top-down manner. First we discuss the bigger general topic of Web crawling. Then we discuss the IDEAL project and Web archiving techniques. Then we discuss focused crawler techniques. Most of the work done in traditional topical focused crawling falls into one of three categories: machine learning, semantic similarity, or content and link analysis. We discuss the major work done in these three categories in the next subsections. Along the way, we also touch on publications related to event modeling, social media integration with focused crawling, seed selection, and finally focused crawling evaluation.

2.1 Web Crawling
Web crawlers [17] are software programs that traverse the WWW following the links on the webpages. A crawler models the WWW as a graph where nodes are webpages and edges between nodes are the hyperlinks that manifest in the webpages. A crawler starts from a set of webpages, called the seed set, and follows the links on those webpages. It downloads the corresponding webpages, extracts the links in them, and then repeats the whole cycle. A crawler keeps two data structures that facilitate the crawling process [18], the URL queue (frontier) and the visited URLs list. The crawler uses the frontier to keep the URLs that are extracted from webpages but not visited yet, and the visited URLs list to keep track of the URLs that were visited so it won’t visit them again (or to control the frequency of visiting them again). Web crawlers have been used by search engines to collect as many webpages as possible. The search engine parses the collected webpages, extracts the text, and builds a search index [18]. The index is the main element used to support the searching service.
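The crawl cycle described above (seed set, frontier, visited URLs list) can be illustrated with a minimal sketch. It is not the crawler used in this work: the in-memory TOY_WEB graph and the fetch_links callback are hypothetical stand-ins for downloading a webpage and extracting its links, and the FIFO frontier gives plain breadth-first order.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, fetch_links, page_limit):
    """Breadth-first crawl using a frontier queue and a record of seen URLs."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(seeds)                 # URLs extracted but not yet visited
    seen = set(seeds)                       # every URL ever queued or visited
    visited = []                            # URLs in the order they were visited
    while frontier and len(visited) < page_limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:            # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a"], toy_fetch_links, page_limit=10)
print(order)
```

A real crawler would replace toy_fetch_links with an HTTP fetch plus HTML link extraction, and would normalize URLs before the membership test.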

2.2 The IDEAL project
The IDEAL project [5] team has developed over 1000 tweet archives about general topics and/or events, along with over 66 Web archives of man-made and natural disasters, the latter using the Internet Archive’s general crawler [10].


Figure 2 Workflow for creating Web archives from social media (Twitter)

Event archiving is different from domain/site-based or topic-based archiving. Domain/site-based archiving involves archiving a specific domain/website with all or some of the underlying subdomains/structure. Topic-based archiving covers a given number of webpages related to a user-defined topic. The IDEAL project team has identified and employed three approaches for archiving webpages about events:

1. Manual curation by domain experts, librarians/archivists, and government agencies (high quality, time consuming). See https://archive-it.org/explore?show=Collections&fc=meta_Subject:Spontaneousevents

2. Social media-based (crowdsourcing) curation by extracting, retrieving, and archiving URLs from tweet collections about an event (low quality, time saving). See http://www.eventsarchive.org/?q=node/42

3. Crawling the Web using a focused crawling approach tailored to events (with acceptable quality and time).

The IDEAL project team has used the first approach and created around 66 Web collections about different kinds of events [19]. They manually curated seed URLs and fed them into the Archive-It service for crawling (see Figure 3). They have applied different settings of the Archive-It configuration parameters according to the importance of the event about which they are collecting, and the type of URLs they curated. The two main configuration parameters are the frequency of crawling and the scope of crawling. The first one controls the frequency by which the crawler should revisit and re-crawl the webpage, while the second parameter controls whether the crawler should follow the outgoing links in the webpage.

The IDEAL team automated the process of curating seed URLs [20] by collecting tweets about the events of interest and then extracting URLs from tweets and feeding them to the Archive-It service for crawling. Social media in general, and Twitter in particular, provide a very rich source for user-generated content, which contains a large number of URLs. The IDEAL project team has created around 1000 tweet collections about general topics and specific events [21]. The tweet collections suffer from noisy content like porn, job ads, marketing, etc. The IDEAL project team has applied several filtering methods to ensure the resulting tweet collections contain relevant content only. Figure 2 shows the workflow for curating seed URLs from social media sources (e.g., Twitter). The resulting seed URLs are archived using the Heritrix [10] tool and then the resulting Web archives are indexed by a search engine for providing access, searching, and browsing services for users.

[Figure 2 diagram labels: Event; Keyword/Hashtag; Collect Tweets; Tweet Collection; Extract URLs; Shortened URLs; Expand; Original URLs; Fetch; Webpages; Archive; WARC; Index; SOLR; Access (Browse, Wayback, Search); phases: Collect, Archive/Organize/Analyze]

The last approach aims to maintain a balance between producing high quality event collections and reducing the time/resources needed for collection building. The IDEAL project team has developed tools for semi-automatically collecting, curating, and archiving webpage collections, leveraging methods for event modeling and focused crawling. The event modeling covers especially identifying and representing events considering their What, Where, and When aspects. The IDEAL project’s work on focused crawling could be of benefit for Web archiving by:

1. Helping prepare lists of URLs to be archived (i.e., a focused crawler recommending a seed list);

2. Helping extend a collection automatically (using existing collections for machine learning type training of a focused crawler to find similar new webpages); and

3. Analyzing and summarizing the produced event collections by using the developed event model.

2.3 Web Archiving and Archive-It service
Closely related work has been done in the emerging and promising field of Web archiving and digital libraries (WADL) [22-24]. The most related aspect of Web archiving is the selection of webpages to be archived and how to collect these webpages, a process sometimes known as Web curation, typically governed by a selection policy. General Web crawlers are the dominant method for Web archiving. However, several new techniques have emerged which do not depend on crawling technology, but rather depend on the transactional behavior [23] of the WWW (HTTP protocol) to drive archiving of webpages.

Many webpage archives are created by curators using the Internet Archive’s service called Archive-It [9], which helps with harvesting, building, and preserving collections of digital content. The service takes URLs as input from a user. Figure 3 shows the interface for users to manually enter the URLs from which the crawler will start crawling. These URLs are used by Archive-It to crawl the Web, guided by manual configuration details (scoping of the domain of webpages to crawl, types of files to crawl, following robots.txt protocol, etc.), and the resulting webpages are captured and stored in WARC [25] files.

The Archive-It service has provided an automated way for different kinds of users to archive and save important webpages in which they have interest. However, the methodology used in the crawler technology behind the Archive-It service is oriented toward archiving general websites (like government, state, university libraries, and federal websites) where the whole content of the website and the frequent changes/updates of the website are the main scope of the archiving process. For spontaneous events this approach is not well suited. Most of the event-related content involves only specific webpages within a website, and those webpages are not frequently changed/updated. Therefore, using the Archive-It service may result in an archive with most of the content not related to the event of interest. In [7] the authors analyzed Web archives about school shootings. Their results show that representative Web archives are noisy, with 2%-40% of webpages reflecting relevant content.


Figure 3 Seed URLs manual curation using Archive-It service

2.4 Focused Crawling

2.4.1 Machine Learning
Machine learning based focused crawler approaches apply text classification algorithms [26-28] to learn a model from training data. The focused crawler then uses the model to estimate the relevance of unvisited webpages. Use of the model enhances the performance of the classifier by incorporating domain specific knowledge and online relevance feedback. Our approach likewise can be considered as involving a classification task; we require training data for calculating the weights of the different aspects of the event and we are using the webpage text for building the event model.

Rennie and Barto [29] used reinforcement learning for solving the focused crawling problem. They modeled the focused crawling problem as a Markov decision process with webpages as states, URLs as actions, and on-topic webpages as the rewards. Another reinforcement learning algorithm, temporal difference learning, was used in [30]. The authors used a state value function to estimate the importance of webpages in leading to future relevant webpages. In our approach, we use webpage source importance to estimate the value of webpages for linking to other relevant webpages.

Another work [31] used the reinforcement learning framework proposed in [29] and enhanced its performance by applying incremental online learning. For each new URL, they estimate its corresponding class and use its features to update the class features and its corresponding q-value. Then they retrain the supervised learning algorithm based on the new training data (old training data and the new URLs seen). This approach eliminates the data bias that appears in the test data, where unseen URLs may appear from new domains that were not found in the training data. In our approach we address bias through using webpage source importance both in selecting seed URLs and also during crawling.

Infospiders is a topical crawler based on adaptive online agents [32] that use genetic programming and reinforcement learning approaches to estimate the relevance of a webpage. In our approach we go beyond just using topics, and use event modeling for estimating the relevance of a webpage.

In [33], a focused crawler was developed for collecting webpages that contain semantic information (semantic annotations expressed as structured data embedded in the HTML of the webpage). The authors proposed a new methodology for crawling the Web that utilizes an online classifier and a reinforcement learning bandit algorithm for selection. The online classifier learns to detect the relevant webpages during crawling, eliminating the need for training a model before crawling. The bandit algorithm provides a framework for ordering and selecting the next URL to visit during crawling. The URLs in the crawler frontier are grouped by their Web host/domain and each host/domain is scored according to an importance measure. The crawler chooses the host/domain with the highest score and then chooses from that host/domain the URL with the highest score.

2.4.2 Semantic Similarity
Semantic similarity-based techniques use ontologies [34, 35] for describing the domain of interest. The domain ontology can be built manually by domain experts or automatically, using concept extraction algorithms. Once the ontology is built, it can be used for estimating the relevance of unvisited webpages by comparing the concepts extracted from the target webpage with the concepts that exist in the ontology. The performance of semantic focused crawling depends on how well the ontology describes and covers the domain of interest. Many of our events are disaster-related events. Although we are not using domain knowledge or event ontologies, our event model could be easily extended to make use of disaster domain knowledge for finding disaster-specific keywords. Our event model also could be easily mapped to portions of a disaster-related event ontology [12, 36], but further research would be needed to see if such a mapping would yield improvement. Further, using this approach in a system that is aimed to handle any type of important event would require considerable knowledge engineering work.

Related to semantics is sentiment. Sentiment analysis has been integrated into focused crawling in two ways: sentiment-focused crawling [37] and focused crawling leveraging sentiment information [38]. In both works, crawlers made use of the sentiment information to build the target model and to guide, focus, and direct the crawler through the Web graph by estimating the relevance of unvisited URLs. Sentiment oriented crawling is considered a type of topical crawling, where the target is to crawl webpages that have a given sentiment. Sentiment classes could be as simple as positive vs. negative, or complex, based on a specific domain. In our work, events have more complex structure than sentiment. We can leverage sentiment information about events, but we have to find the relevant webpages about events first and then analyze the sentiment information in those pages.

2.4.3 Content and Link Analysis
Text and link analysis algorithms combine text analysis schemes (e.g., Vector Space Model (VSM) [15, 16]) and link analysis algorithms to estimate the relevance and importance of webpages [39-41]. Link analysis approaches introduce the concept of popular webpages. Popularity is measured based on the link structure of the WWW. This led to the introduction of the concepts of Hub and Authority webpages [42]. Hub webpages have links to many authority webpages, while authority webpages are linked to by many hub webpages. Among the link analysis algorithms, PageRank [43, 44] is the most used. Alternatively, context graphs are used to represent the context of a webpage using neighborhood webpages that are most similar to it.

Another line of research incorporates the genre of webpages into focused crawling [45]. The genre of the webpage defines the type of the webpage (e.g., forum, tutorial, news, blog, course-syllabus, etc.). The focused crawler uses two sets of keywords, one for determining the genre of the webpage and the other set for determining the topic. The two sets of keywords (genre and topic) are manually determined by experts and then used by the focused crawler for estimating the relevance of the webpage.
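To make the hub and authority concepts concrete, the following is a minimal sketch of a HITS-style power iteration, assuming a small hypothetical link graph. It is given only as an illustration of the mutual reinforcement between hubs and authorities; it is not part of the crawler described in this dissertation, and the page names are invented.

```python
def hits(links, iterations=50):
    """HITS-style iteration.

    links maps each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries with unit Euclidean norm.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph the page linked from all three hubs receives the highest authority score, and the hubs that point to both authorities score higher than the one that points to only one.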


Related to webpage source importance, Pant et al. [46, 47] describe a new Web characteristic: status locality on the Web. A webpage’s status measures the importance of the webpage with respect to its popularity, and is approximated by the number of links pointing to it. Pant et al. developed an algorithm for estimating the status of a webpage based on local characteristics of the webpage and also demonstrated that the status property has some of the same characteristics as the topical property. Our approach also tries to predict the importance of a webpage by examining the importance of the domain to which the webpage belongs. The importance in our case is viewed with respect to the domain of the event of interest. For example, in an earthquake event, some websites may be more dominant, and thus more important, than other websites, based on the number of relevant webpages found for that domain.

Chen [48] developed a hybrid approach for focused crawling using genetic programming for exploiting different features in a webpage’s text, and metadata search for exploring different sources on the WWW. Chen applied the genetic programming approach for combining different relevance signals from the webpage text. He also used metadata search for gathering several seed URLs for the crawler to start from, thus expanding the crawler’s coverage of the WWW. In order to overcome the bias that can be found in one search engine, he used multiple search engines and combined their results. Our approach also aims to improve recall in crawling, but uses other types of evidence (webpage source importance) to do so.

2.5 Event Modeling
Event modeling recently has gained popularity in different fields, like topic detection and tracking (TDT) [49], animal disease outbreak detection [50], networked multimedia events [51], and document similarity [52]. In TDT, an event is described as a topic that happens at a certain time, in a specific location, and with a particular set of participants. In multimedia applications, an event is defined as a tuple of aspects: informational, spatial, temporal, structural, causal, and experiential. The informational aspect includes event ID, event type, and any other attributes that serve as identification of the event. Spatial and temporal aspects represent the location and time properties, respectively. The structural aspect includes the sub-events belonging to the current event. The causal aspect includes the events causing the current event. Finally, the experiential aspect includes all media resources related to the current event.

We incorporated event modeling into our crawler, to build an event-aware focused crawler [3]. We came up with an event model that integrates ideas from TDT [49] (we built our event model using the same definition: something that happened at a certain place on a specific date) and work done in [50] (they defined an event as a combination of domain-related keywords, location entities and date entities, and disease name entities that appear on the sentence level). We use topic, location, and date information, but aggregate at the collection level.

A probabilistic event model has been developed in [53] where an event is modeled as a latent variable that generates webpages, i.e., the observations. A webpage about an event is modeled as a tuple of three parts (topic, entities, and date). Topic and entities are modeled as variables with multinomial distribution over words in webpage content, and date is modeled as a normal distribution over the publishing date of the webpage. According to this model, using a time analysis of publishing dates of webpages, an event is represented by a peak in the number of webpages published around a certain date. A peak could represent one event or multiple events that happened on the same/close dates. The topic and entities parts of the model help in discriminating between events sharing the same peak.

In [54, 55], the LDA topic modeling framework is used to model events. An event is represented by a mixture of topics over a set of webpages. The topics considered are a background topic, a topic for each event (extracted from the set of webpages belonging to that event), and a topic for each document. The reason for dividing topics in this manner is to capture and separate the language used for general purposes (common background words), the language used for each event specifically, and finally the language used for each webpage specifically.

Events have been analyzed in social media (like Twitter) using tailored methods for extracting event-related information from the tweets [56-59]. Events are modeled as a set of words with specific structure like “subject verb object”. The purpose of that research is not collecting information related to an event but rather, given a set of short texts (tweets), extracting event-related information. This type of work models events at the fine-grain level, where it looks for specific structure in sentences that might represent an event. In this dissertation, events are modeled at a coarse-grain level using the combination of words at the webpage level.

In [60] events are modeled as the co-occurrence of spatial and temporal tokens in one sentence. The authors provided a hierarchy for modeling spatial and temporal tokens. For the spatial hierarchy they used country, state, city, and street, while for the temporal hierarchy they used year, month, and day. Using these hierarchies, the authors developed a similarity function for estimating the similarity between documents using event model instances extracted from the documents.

In [61] the authors analyzed the characteristics of information sources during news events. They analyzed three types of information sources: media outlets, social media, and query logs. They analyzed three events: San Bruno pipe explosion, New York storm, and Alaska 3-way elections. They analyzed these events because all three shared the same characteristics: 1) location is centralized (i.e., no multiple locations), 2) time span is limited (i.e., so they attract attention and interest for a short period of time). Their definition of events is similar to our definition: something happened at a certain place and specific time.

Thus, there are several ways for defining an event depending on the context and the application in which the event is used. Our event model is similar to the probabilistic event model [53], which incorporates the topic, date, and entities aspects. We do identify a date for an event. We also recognize location entities. However, other types of entities found are included in the topic part as normal keywords. Our event model can be easily extended to add other types of entities (Persons, Organizations, etc.).

Combining several types of evidence is an important task in focused crawling. Initial work has used content and link based information. Multimedia information also was used [62], where text and images were analyzed for estimating webpage relevance. There a Bayesian network was used for integrating evidence from different sources of information (text and images). Likewise, we combine the webpage source importance with the webpage relevance score, in particular by multiplying both together to produce a final score.

2.6 Social Media and Focused Crawling Seed Selection
In [63], the authors combined the focused crawler technique with social media to improve the freshness of the crawl. A focused crawler is limited by the set of seed URLs it starts from. Social media produces a huge amount of user generated content (e.g., tweets) that may contain URLs. Since social media content is produced live, the URLs contained therein would be fresh and possibly more recent than the URLs visited otherwise by the focused crawler. Injecting URLs from social media into the focused crawler’s URLs queue should increase the freshness of the Web collection produced. The authors crawled webpages about two events (Ebola and the Ukraine conflict) and used a keyword-based model to represent the two events. We used social media content (i.e., Twitter) as a source of seed URLs. We collected tweets about events and, after the collection process finished, we extracted the URLs and filtered them to get relevant URLs.

In [64], the authors examined the topical quality of existing Web archives about events. They built a framework that assesses whether the seed URLs used in building the Web archive are on-topic or off-topic across the different times it was crawled. The authors used the vector space model to represent the documents and applied several similarity measures to calculate the similarity scores. They evaluated their method using different threshold values to find the value that yields the best performance. We used the cosine similarity for scoring webpages and URLs about events with different threshold values.

We have used social media content (i.e., Twitter) to extract and select seed URLs for focused crawling. Unlike the work in [58], we extracted the URLs from the tweet collections and then started crawling.

2.7 Evaluating topical/focused crawlers
Several methods have been developed for evaluating focused crawlers [65, 66]. In [67] a general evaluation framework has been developed where any focused/topical crawler can be assessed according to the evaluation framework, independently from the domain topic. The authors used three methods for evaluating different focused crawlers. The first one is using a classifier to classify the resulting webpages as relevant or non-relevant (on-topic or off-topic). The second method is using a retrieval system where the collected webpages are indexed in the system and specific queries are run against the collection. Different crawlers are evaluated based on the number of relevant webpages retrieved for each query. The third method is using average similarity scores. Different focused crawler methods are evaluated based on the average similarity scores at different stages of the crawl. Since none of these three techniques fits well with our approach, we use alternative evaluation schemes in the work that follows.

The above mentioned works cover much of the background for our research. While there have been a variety of related studies, our investigation is unique, and improves upon the methods we have uncovered to date.


3 Focused Crawling
3.1 Topic Representation
One of the inputs to a focused crawler is a set of URLs; together these can be used to describe the event/topic of interest. We refer to them as model URLs; they can be the same as or different from the seed URLs. In this dissertation, the model URLs are the same as the seed URLs for small-scale experiments, while in large-scale experiments (where the seed URLs list is very large) the model URLs are a smaller set. We use a separate model URLs list because, as we will describe later in Chapter 5, in the large-scale experiments the number of seed URLs is very large. Using all of those to build a model would be time consuming, and could delay the start of a crawl. Accordingly, we use a smaller number of URLs as model URLs to build the event model. These are selected on the basis of providing high quality textual content about the event/topic. The focused crawler uses this set of URLs to build its event/topic model, and then uses the model to estimate the relevance [65, 68-73] of the URLs and webpages it encounters during crawling. The remaining seed URLs are added to the queue for the crawl, helping ensure breadth of coverage and reducing bias. From now on, we use the term seed URLs for both seed URLs and model URLs for simplicity.

We consider two ways to represent an event. In the rest of this chapter, we consider the first, traditional, baseline approach, where an event is treated like a topic [74], characterized by a set of keywords. In Chapter 4, we describe our new approach, where an event is described with a richer model. We chose a best-first focused crawler as our baseline method because it has proven to be the state-of-the-art method in topical focused crawling [65, 75]. The baseline best-first focused crawler uses the Vector Space Model (VSM) [76] approach to build its event/topic model:

1. Using the model URLs, download corresponding webpages and extract text from those webpages. Each webpage is tokenized to a set of words, stopwords removed, and words stemmed and then converted to a vector. Here the vector represents the unique terms in the webpages and their frequencies (how many times they appear in the webpage).

2. The crawler then builds a vocabulary index using the webpage vectors. The vocabulary index maps the set of unique words in all the webpages to a list of the word frequencies in the webpages.

3. Using the vocabulary index, the crawler calculates a weight for each word by summing all its frequencies in the webpages in which it appeared. This corresponds to the word collection frequency as opposed to the word Term Frequency (TF). We haven’t used Inverse Document Frequency (IDF) as we are using the seed webpages only, rather than a large general corpus.

4. The crawler selects the top k words with highest weights as a model for the event/topic of interest. The weights of the words are calculated by using the log of the frequencies, to eliminate the risk of long documents dominating short documents.
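The four steps above can be sketched as follows. This is a simplified illustration rather than the dissertation's implementation: the stopword list is a tiny placeholder, stemming is omitted, the sample page texts are invented, and the damping form 1 + log(tf), applied per page before summing, is one assumed reading of "using the log of the frequencies".

```python
import math
import re
from collections import defaultdict

# Placeholder stopword list (an assumption; a real list would be much larger).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords (stemming omitted)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def page_vector(text):
    """Step 1: map each unique term of one page to its frequency."""
    vec = defaultdict(int)
    for word in tokenize(text):
        vec[word] += 1
    return dict(vec)


def build_topic_model(page_texts, k):
    """Steps 2-4: vocabulary index, word weights, and top-k selection."""
    # Step 2: vocabulary index, word -> list of per-page frequencies.
    index = defaultdict(list)
    for text in page_texts:
        for word, tf in page_vector(text).items():
            index[word].append(tf)
    # Step 3, with the log damping mentioned in step 4: sum the damped
    # frequencies over the pages in which the word appears (assumed form).
    weights = {w: sum(1.0 + math.log(tf) for tf in tfs) for w, tfs in index.items()}
    # Step 4: keep the k highest-weighted words; ties are broken alphabetically.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)


# Invented seed page texts for illustration.
pages = [
    "Earthquake in Ecuador: the earthquake damaged buildings",
    "Ecuador earthquake rescue teams search buildings",
]
model = build_topic_model(pages, k=3)
print(model)
```

The resulting dictionary plays the role of the event/topic vector: its keys define the vocabulary, and its values are the term weights.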

The baseline crawler uses the event/topic model to represent the event/topic of interest and also to model each webpage it visits during crawling. So the event/topic vector has a slot for each term found in the vocabulary (or feature space) that arises from the webpages. More specifically, after getting the URL with the highest score from the queue, the crawler downloads the webpage, extracts the text, tokenizes the text into tokens (words), removes stopwords, applies stemming, does frequency analysis, and converts the text into a vector of words with their frequencies. The final webpage vector representation will be constructed using the words in the vocabulary built from the model webpages and their corresponding frequencies in the webpages. The word frequency in the webpage is called the term frequency (TF) in the information retrieval literature [16] and is part of the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [15, 16]. As mentioned in step 3 of the procedure above, we have not used the Inverse Document Frequency (IDF) because it usually is calculated for a general corpus of webpages where relevant and non-relevant webpages exist. The focused crawler uses only presumed relevant webpages from model (or seed) URLs, and thus including the IDF value might lead to removing relevant keywords.

Later, the crawler estimates the relevance of a webpage by calculating the cosine similarity between the event/topic vector and the webpage vector. Also, the crawler estimates the scores of all the URLs in that webpage. For each URL, the crawler combines the URL tokens and anchor text, converts them to a vector of words with their frequencies, and calculates the cosine similarity of the resulting URL vector to the event/topic vector. Then the URL is inserted into the queue with the estimated score. The vocabulary (keywords or features) which the crawler uses to represent the webpage vectors and the extracted URL vectors is built and extracted from the set of webpages corresponding to the seed URLs. The webpage vector is used to estimate the relevance of the webpage and produce a relevance score. The URL score is calculated as an average of the URL vector score (calculated as cosine similarity between event/topic vector and URL vector) and the score of the webpage in which the URL appeared. So the crawler is making use of three types of textual information: webpage text, URL anchor text, and URL address tokens. Using webpage text and URL information was shown to be more efficient than using webpage text only [75]. Figure 4 shows the architecture of the baseline best-first focused crawler, with the topic representation and relevance estimation processes highlighted in dashed boxes.
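The relevance estimation described above can be illustrated with a short sketch, assuming sparse dictionary vectors restricted to the topic vocabulary. The function names, the example topic weights, and the example URL are hypothetical; stopword removal and stemming are left out for brevity.

```python
import math
import re


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def to_vector(text, vocabulary):
    """Term frequencies of the text, keeping only words in the topic vocabulary."""
    vec = {}
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in vocabulary:
            vec[token] = vec.get(token, 0) + 1
    return vec


def url_score(url, anchor_text, parent_page_score, topic_vector):
    """Average of the URL-text similarity and the parent webpage's score."""
    url_vec = to_vector(url + " " + anchor_text, topic_vector)
    return (cosine(url_vec, topic_vector) + parent_page_score) / 2.0


# Hypothetical topic vector and inputs.
topic = {"earthquake": 2.0, "ecuador": 1.0}
page_score = cosine(to_vector("Ecuador earthquake: earthquake relief", topic), topic)
link_score = url_score(
    "http://news.example.org/ecuador-earthquake", "latest updates", page_score, topic
)
print(round(page_score, 3), round(link_score, 3))
```

Averaging with the parent score means that a link with uninformative address and anchor text still inherits some priority from a highly relevant page.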

Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box.

3.2 Crawler Architecture
A general Web crawler [17, 18, 77-82] consists of a webpage fetcher (downloader) for retrieving webpage contents, a URLs queue (frontier) for storing unvisited URLs, and a webpage processor for extracting text and URLs out of a webpage’s HTML. Crawlers model the WWW as a graph G(V, E) where nodes (V) are webpages and edges (E) are links between webpages. So, two webpages (nodes) will have an edge between them if one webpage has a link pointing to the other webpage.

Similar to general Web crawling, a focused crawler has a webpage fetcher, URLs queue, and webpage processor. In addition, a focused crawler has a topic or domain-specific model, and a module for estimating the relevance of URLs and webpages. Typically, a focused crawler takes as input: 1) the desired number of pages to collect, and 2) seed URLs to start crawling from. It outputs the set of webpages found [26, 41, 66, 75, 83].


One of the important aspects of (focused) crawlers is the ordering of the URLs in the queue, which specifies the order of visiting the nodes of the graph. In the focused crawler literature [27], best-first search is the most commonly used technique and is considered the state-of-the-art focused crawler, taking into consideration the estimated relevance of the URLs/webpages during crawling.

A focused crawler starts from a seed URL. It downloads the corresponding webpage and extracts the text of that webpage. The focused crawler then estimates the relevance of the webpage textual content with regard to the topic/event of interest. In the next step, there are two design options. One option is that the focused crawler decides whether the webpage is relevant or not by comparing its estimated score to a pre-defined threshold. If the webpage is considered relevant, then the focused crawler extracts the embedded URLs from the webpage and inserts them into the queue. The other option is that the focused crawler extracts all embedded URLs from the webpage and then inserts those into the queue, not being constrained by the webpage score. The second option takes into consideration the tunneling phenomenon in crawling, where a non-relevant webpage links to relevant webpages, either directly or through several steps.

When inserting the extracted URLs into the queue, the focused crawler has to make another decision. One option is to insert all extracted URLs, along with the estimated score of the webpage from which they were extracted. Another option is to estimate the relevance of each URL based on the tokens in both the URL’s address and anchor text, and insert the URL and its resulting estimated relevance score into the correct position in the priority queue. We adopt a hybrid approach where we use the average of a URL’s score and the score of the parent webpage from which the URL was extracted [75]. Next, the focused crawler pulls from its queue the URL with highest score, and repeats the process. Figure 5 shows a focused crawler algorithm that handles tunneling (i.e., extracts the URLs from the webpage regardless of score): estimating the score of each URL and inserting it into the queue with its estimated score. We consider this approach as the foundation for the baseline for evaluation comparisons.


Figure 5 Baseline focused crawler algorithm

3.3 Large Scale Design Considerations
Ideally, the focused crawler should score all the URLs it extracts from a webpage and insert them into its frontier based on their scores. At small scale, the frontier size is manageable; at large scale, however, the frontier can grow very fast and slow down the focused crawler, due to memory constraints.

Algorithm: Baseline Focused Crawler
Input: seed URLs, pagesLimit, pageScoreThreshold, urlScoreThreshold

insert seed URLs in priorityQueue

## Topic Representation
topicVector = build topic representation from seed pages

## Crawling
while pagesCount < pagesLimit and priorityQueue is not empty:
    URL = pop(priorityQueue)
    append URL to visited list
    page = download(URL)

    ## Preprocessing (Vector Space Model)
    pageVector = process(page text)

    ## Relevance Estimation (Cosine Similarity)
    pageScore = calculateScore(pageVector, topicVector)

    pagesCount += 1
    if pageScore >= pageScoreThreshold:
        relevantPagesCount += 1
        save page to event-related collection

    ## URLs are extracted regardless of pageScore (tunneling)
    page.getURLs()
    for link in page.outgoingURLs:
        URL = link.address
        validate(URL)
        if URL not in visited list and URL not in priorityQueue:
            ## Preprocessing (Vector Space Model)
            urlVector = process(URL text)
            ## Relevance Estimation (Cosine Similarity)
            urlScore = calculateScore(urlVector, topicVector)
            if urlScore >= urlScoreThreshold:
                push(URL, priorityQueue)


The frontier is constructed using a priority queue, often implemented using a max heap [84, 85]. The max heap data structure provides pop and push operations like a normal queue, and maintains the max heap property, i.e., that the element with the maximum value is always on top of the heap. Thus, the pop operation will always return the element with maximum value. Reading the maximum element is O(1), but both removing it (pop) and inserting a new element (push) must restore the max heap property, which costs O(log n), where n is the number of elements in the heap. The push operation is repeated for every URL extracted, while the pop operation is repeated for every URL visited. Therefore, the focused crawler would spend a considerable amount of time in maintaining the max heap property, especially when the heap size grows. Thus, during large-scale crawling experiments, we limit the number of URLs in the max heap by inserting only URLs with a score bigger than a given threshold and discarding URLs with a relevance score lower than that threshold; we set the threshold to 0.1 empirically.

Another component that consumes memory in large-scale situations is the visited URLs list. This list keeps the URLs that the focused crawler has visited (and whose corresponding webpages it has fetched) so that the focused crawler doesn't visit them again. For every URL extracted or popped from the frontier, the focused crawler checks if the URL exists in the visited URLs list. Using a normal list, this operation will be time consuming when the list size is very large. For large-scale experiments, we implemented the visited URLs list using a hash table, where the hash key is the URL address. Checking if a key exists in a hash table takes constant time on average, whereas searching a list takes time proportional to its length.

The above-mentioned approach characterizes the baseline focused crawler used in evaluation studies reported below. The event focused crawler uses the same general approach, but varies due to its use of an event model (and, later, webpage source importance). Thus, comparisons allow determining the effects of the event model and webpage source importance.
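The frontier and visited-URL bookkeeping described above can be sketched as follows. This is a minimal illustration, not the crawler's actual implementation: the class name `Frontier` and its methods are hypothetical, Python's `heapq` is a min-heap so scores are negated to obtain max-heap behavior, and URLs scoring below the threshold are discarded.

```python
import heapq


class Frontier:
    """Max-priority frontier built on heapq (a min-heap), plus hash-based lookups.

    Hypothetical sketch: names and structure are assumptions for illustration.
    """

    def __init__(self, url_score_threshold=0.1):
        self.threshold = url_score_threshold
        self._heap = []          # entries are (-score, url), so the best URL sorts first
        self._queued = set()     # URLs currently waiting in the heap
        self.visited = set()     # URLs already popped for fetching (hash table)

    def push(self, url, score):
        """Queue url unless it scores below the threshold or was seen before."""
        if score < self.threshold or url in self.visited or url in self._queued:
            return False
        heapq.heappush(self._heap, (-score, url))   # O(log n)
        self._queued.add(url)
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)  # O(log n)
        self._queued.discard(url)
        self.visited.add(url)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

Membership tests against the two sets take constant time on average, which is the property the hash table provides over a plain list.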


4 Event Focused Crawler

4.1 Event Model and Representation

4.1.1 Event Modeling

In the focused crawler literature [74], events generally are considered as topics and are represented with a list of keywords. Although this approach might work well for some events, it works less well for other kinds of events. For example, representing events as a list of keywords would work in cases where the topic part of the event is most dominant and important, while the location and time parts aren't important or don't play a significant role in the event. The outbreak of Ebola is a good example of an event where the topic (spreading of Ebola) is the most important aspect. The location and date are part of the details, but are not that important, i.e., the topic part is largely sufficient to clearly describe the event. On the other hand, shooting events, for example, can't be described with the topic part only. Since there are many shooting events in different places and at different times, we need the location and date parts to clearly describe a particular shooting event.

In this dissertation, we are focusing on unusual real-world news events which cause or create an impulse in people's interest and the media coverage about the event [61]. Good examples of this type of event are natural disasters, elections, shootings, terrorist attacks, and incidents (pipe explosion, crane or building collapse, etc.). This type of event is characterized by two main things: 1) they have a centralized physical location and 2) they have a limited time period in which their effect appears (impulse in media coverage and people's interest). Even for elections, we are concerned not with the general aspect of the election but with a specific incident that captures people's interest or causes an impulse in media coverage. For example, the Alaska elections on November 2, 2010 captured popular attention because of a 3-way race which an independent candidate won [61, 86].

This is in contrast to many other event definitions in different fields. In the computational linguistics field, an event can be defined as "a situation that occurs", while in the multimedia field it is defined as a change in the state of an object in a video or photo stream. In the information extraction field, the focus is on extracting real-world local events like concerts, theater performances, birthday parties, etc.; information comes from textual content and yields records of events in which people have interest [87].


Event Model: Before considering complex event models, a simple scheme should be tested first. Thus, we define an event as something (e.g., a disaster), which happened in a certain place, and at a certain time. Thus, an event E is a tuple <T, L, D>. The three parts reflect what, where, and when. Thus, T is the topic of the event, L is its location, and D is its date. These are explained below in more detail.

Topic: Using a set of seed URLs, we create an event vocabulary (a set of unique keywords that appear frequently in the webpages associated with those seeds). We represent an event with a reference vector created by taking the top keywords from the event vocabulary.

Date: The event date is given by a user or is extracted automatically from the set of seed webpages. The event date represents the starting date when the event first occurred. The event also could have an ending date, or even a set of periods in time.

Location: A small set of location entities is likely to appear frequently in most of the seed webpages, representing places related to the event. These location entities are extracted (as described next) from seed webpages' texts; we perform a frequency analysis to help find the most important location entities mentioned in the seed webpages.

For example, we model the shooting that happened in San Bernardino, California on December 2, 2015 as follows:
    Topic: shooting, shooter, …
    Location: San Bernardino, California
    Date: 12/02/2015

Similarly, we model the attack that happened in Brussels, Belgium on March 22, 2016 as follows:
    Topic: terror, attack, explosion, …
    Location: Brussels, Belgium
    Date: 3/22/2016

Our event model (combining topic, location, and date) can default to the topic-only model in the case of Ebola by ignoring the date and location parts (by setting the weights of the location and date parts to zero). We note that this would be the case also regarding the Zika virus disease outbreak. We can add a part in our system so that, if the event type is disease outbreak (manually entered by the user), the system automatically defaults to the topic-only model. Alternatively, other defaults, like giving a small weight to the location and/or the date part, could be instituted. Thus, our system is flexible, and can be used in cases where it is difficult to determine, for a given event, which model is more efficient. But if the event is a news/world event which is physically localized (has a clear center) and temporally limited (with an impulse in the number of articles published in a relatively short period), then we believe our event model (combining topic, location, and date) should perform more efficiently than the topic-only model. Thus, if a user is interested in the outbreak of Zika/Ebola disease in a certain place and at a certain time, then our event model (combining topic, location, and date) should perform better than the topic-only model.

Figure 6 shows the steps of building an event model from seed webpages. We start the process of building an event model by downloading the webpages corresponding to the seed URLs. We then extract dates from the seed URLs and the seed webpages. To do this, we first try to extract the publication date from the seed URLs using a pre-defined regular expression. If that fails, we extract the publication date by parsing a pre-defined set of tags from the HTML of the webpages.

Figure 6 Steps of building event model from seed webpages

For the date extraction, we have used a library¹ for extracting the publishing date of a webpage using heuristics. The first step is to extract the publishing date from the URL using regular expressions, if applicable. For example, the URL http://www.cnn.com/2016/07/10/us/black-lives-matter-protests/index.html has a publishing date of July 10, 2016. If the URL doesn't contain date information, then the next step is to look for specific tags in the head portion of the corresponding webpage's HTML. An example tag that contains the publishing date looks like:

¹ https://github.com/Webhose/article-date-extractor


<meta name="pubdate" content="2015-11-26T07:11:02Z">

This appears in the head tag of the HTML content of a webpage. There are multiple meta tags that might contain the publishing date; hence the library has an extensive list of possible meta tags that are frequently used in different websites. The final step is to check in the body of the webpage, if no publishing date is found in the head tag. As before, a list of frequently used body tags is used to guide finding the publishing date. An example of such a tag containing the publishing date is:

<p class="pubdate">Sept 3, 2011</p>

If there are multiple dates found in the webpage, the library returns the first one only. The library tries to extract the publishing date from the URL, then from the head tag of the HTML, and then from the body tag of the HTML. The order is important because it follows the expected accuracy of the extracted date: a date extracted from the URL is expected to be more accurate than one from the head, which in turn is expected to be more accurate than one from the body.

An extra step that could be done is to use natural language processing techniques to extract named entities (dates) from the textual content of the webpage. Using the extracted date entities, we can figure out the publishing date of the webpage. However, we have not used this approach, because with the library we used we managed to extract publishing dates from most of the webpages and because of the overhead of calling and using the named entity recognizer.

For the event model locations vector, we segment the text of the webpages of the seed URLs into sentences and apply the Stanford Named Entity Recognizer (SNER)² on each sentence to extract location entities. We then perform frequency analysis on the extracted location entities and construct the locations vector. It includes the unique locations extracted, along with their frequency of occurrence in all sentences in all seed webpages (i.e., the weight of each location is the cumulative frequency in all seed webpages). The resulting locations vector will include the locations frequently mentioned in the set of webpages corresponding to the seed URLs, which should be the location of the event of interest, assuming the seed webpages are relevant and of high quality (with regard to containing enough information about the different event aspects, namely topic, location, and date). The SNER could extract location entities not related to the event from some of the seed webpages, as a webpage may include references to multiple locations. This should not affect the model, however, as the frequency of those location entities should be very small (since they typically appear in few of the seed webpages). On the other hand, if the event occurs in multiple locations, a suitable list of locations should be found through the above-mentioned processing of seed webpages (i.e., the seed webpages should cover the different locations of the event and not be concentrated on one location only).

For the topic vector, we perform the same processing as for the baseline VSM. We tokenize the text of the webpages of the seed URLs into words, remove stopwords, stem words, perform frequency analysis, and construct the topic vector as the set of unique terms along with their frequency of occurrence.

² http://nlp.stanford.edu/software/CRF-NER.shtml

4.2 Event Processing

4.2.1 Event Model-based Webpage Scoring

In this section, we show how the event focused crawler uses the event model to calculate a score for each webpage it visits and for the URLs extracted from that webpage. Focused crawlers assign each downloaded webpage a score, which estimates the relevance of the webpage. In the case of events, event aspects are considered during the relevance estimation process. Thus we score the relevance of the webpage with respect to each aspect of the event, and then combine that information to compute a final score. According to our event model, there are three attributes which together fully describe an event. A webpage can have some or all of the attributes of an event. A webpage is considered relevant (i.e., talks about the target event) if it satisfies the following conditions:

• It has a non-empty subset of the keywords that represent the topic attribute of the target event (i.e., is topically relevant).

• Its publication date is close to the event date.

• It has a non-empty subset of the keywords of the location attribute of the target event (i.e., the location entities extracted from the webpage are similar to event location entities).

A webpage that satisfies these conditions should be considered relevant and will be added to the output collection. The event focused crawler first takes the following steps with regard to a webpage:

1. Extract the text of the webpage.

2. Extract the publication date of the webpage.

3. Extract location entities from the text of the webpage using Named Entity Recognition (NER).

We have developed a function to measure the similarity between the target event model and the webpage model. The similarity function produces a score that estimates the relevance of the webpage to the target event. Given a target event model and a webpage event model: e1 = (T1, L1, D1) and e2 = (T2, L2, D2), where T1 is the event topic reference vector, L1 is the list of location entities extracted from seed webpages using NER, D1 is the event date, T2 is the bag-of-words vector representation of the webpage text, L2 is the list of location entities extracted from the webpage text using NER, D2 is the publication date of the webpage, e1 is the target event model, and e2 is the webpage event model. The similarity function sim(e1, e2) is defined as:

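$$\mathrm{sim}(e_1, e_2) = a \times \mathrm{score}(T_1, T_2) + b \times \mathrm{score}(L_1, L_2) + c \times \mathrm{score}(D_1, D_2) \tag{1}$$

where

$$\mathrm{score}(T_1, T_2) = \frac{\sum_{t \in T_1 \cap T_2} w(t_1) \times w(t_2)}{\lVert T_1 \rVert \times \lVert T_2 \rVert} \tag{2}$$

i.e., the cosine similarity between the T1 and T2 vectors, and w(t_i) is the weight of term t in document i, and

$$\mathrm{score}(L_1, L_2) = \frac{\sum_{l \in L_1 \cap L_2} w(l_1) \times w(l_2)}{\lVert L_1 \rVert \times \lVert L_2 \rVert} \tag{3}$$

i.e., the cosine similarity between the L1 and L2 vectors, and w(l_i) is the weight of location l in document i, and

$$\mathrm{score}(D_1, D_2) = 1 - \frac{\lvert D_1 - D_2 \rvert}{\mathit{num\_days}} \tag{4}$$

where |D1 − D2| is the difference between the two dates in days and num_days is the number of days in a year. This parameter can be configured according to the event characteristics. If the two dates are more than num_days apart, the score will be zero. The final score of the webpage is calculated by using a weighted average of the scores of the topic, location, and date vectors, where constants a, b, and c are the weights of the topic, location, and date scores, respectively. These add to one: a + b + c = 1.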
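To make the similarity function concrete, the following is a minimal sketch of equations (1) to (4). It assumes that topic and location vectors are stored as dictionaries mapping a term or location to its weight, that an event is a plain (topic, location, date) tuple, and that a missing date scores zero; the function names are hypothetical and chosen for illustration only.

```python
import math
from datetime import date


def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts (eqs. 2 and 3)."""
    dot = sum(w * v2[k] for k, w in v1.items() if k in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def date_score(d1, d2, num_days=365):
    """Date closeness (eq. 4), clipped at zero when the dates are too far apart."""
    if d1 is None or d2 is None:
        return 0.0  # assumption: a page without a known date gets no date credit
    return max(0.0, 1.0 - abs((d1 - d2).days) / num_days)


def event_similarity(e1, e2, a, b, c, num_days=365):
    """Weighted combination of topic, location, and date scores (eq. 1)."""
    t1, l1, d1 = e1
    t2, l2, d2 = e2
    return (a * cosine(t1, t2)
            + b * cosine(l1, l2)
            + c * date_score(d1, d2, num_days))
```

A webpage model identical to the target event model scores 1 whenever a + b + c = 1, and a page sharing no terms or locations and dated more than num_days away scores 0.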

4.2.2 Calculating The Weights

The weights a, b, and c could be set manually by an expert who would take into consideration the type of the event (shooting, hurricane, bombing, earthquake, etc.) and the characteristics of the event (time duration and location area, e.g., specific location and point in time for a "sharp" event, versus multiple locations and long time periods for complex events). To automatically calculate the weights, we use each aspect of the event model (topic, location, and date) separately to score a sample of labeled webpages. We evaluate each aspect's performance against different threshold values, and we choose the threshold value that produces the best classification performance according to a given evaluation metric. In particular, we used the F1-score as the classification evaluation metric. The F1-score is an information retrieval metric that calculates the harmonic mean of the precision and recall [88]. We also used the F1-score to assign the weight of each aspect of the event model (topic, location, and date), which indicates the importance of that aspect in calculating the final score. We calculate the weight as the ratio of the aspect's F1-score to the sum of the F1-scores of all aspects. Figure 7 shows the steps for calculating the score of a webpage.

For example, assume we have 100 webpages, 50 relevant and 50 non-relevant, relative to a specific event (e.g., Orlando shooting). Assume also that we have the target event model (which could be extracted from another set of relevant webpages or entered manually by the user). For each of the 100 webpages, we extract the topic vector, locations vector, and publication date. Then we calculate three scores (topic score, location score, and date score) for each webpage using equations 2, 3, and 4, respectively. After this process we end up with a matrix of 100 rows (webpages) and 3 columns (topic score, location score, and date score). Next, we use each of the scores (topic, location, and date) separately to predict a label (relevant or non-relevant) for each webpage. We produce the label by comparing the score to a threshold value (call it K); we generate "relevant" if the score is larger than the threshold and "non-relevant" if it is smaller.

Figure 7 The steps for calculating the score of a webpage

After this process we end up with a matrix of 100 rows (webpages) and 3 columns (label based on topic score, label based on location score, and label based on date score). Then we evaluate the effectiveness of each aspect of the event (topic, location, and date) by comparing the actual labels and predicted labels for each of the three aspects (topic, location, and date). We use the F1-score as the metric for evaluation. Here we end up with 3 F1-scores (one for each of the topic, location, and date) for the threshold value K. We repeat the previous process for different values (n values) of the threshold parameter. Then we end up with a matrix of n rows (different values of the threshold) and 3 columns (topic, location, and date). Finally, we choose the max F1-score for each aspect (topic, location, and date). The weight of each aspect will be the ratio of its F1-score to the sum of the three aspects' F1-scores. In this manner, the weight of each aspect corresponds to how much it contributes to the overall performance.

The weights of each aspect of the event model (topic, location, and date) are learnt before the crawling time by applying the previous procedure on a given set of URLs and webpages that are labeled as relevant or non-relevant. The weights learnt are used during crawling and are not modified.
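The weight-learning procedure can be sketched as below. This is an illustrative sketch under stated assumptions rather than the dissertation's code: per-aspect scores are given as lists aligned with a list of boolean ground-truth labels, a page is labeled relevant when its score is larger than the threshold, and the function names are hypothetical.

```python
def f1_score(actual, predicted):
    """F1 (harmonic mean of precision and recall) for boolean label lists."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1(scores, actual, thresholds):
    """Best F1 obtained by thresholding one aspect's scores at each candidate K."""
    return max(f1_score(actual, [s > k for s in scores]) for k in thresholds)


def aspect_weights(aspect_scores, actual, thresholds):
    """Weight of each aspect = its best F1 divided by the sum of all best F1s."""
    best = {name: best_f1(scores, actual, thresholds)
            for name, scores in aspect_scores.items()}
    total = sum(best.values())
    if total == 0:
        # Assumption: fall back to equal weights if no aspect separates the classes.
        return {name: 1.0 / len(best) for name in best}
    return {name: f / total for name, f in best.items()}
```

Because each weight is a share of the same total, the resulting weights always add to one, as equation (1) requires.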

4.2.3 Event Model-based URL Scoring

A similar procedure is implemented for estimating a score for each URL extracted from the webpage. A URL is converted into tokens by removing non-alphabetic characters (like '/', '#', '?') and also removing URL-specific keywords (like 'http', 'com', 'www'). URL tokens are combined with tokens from associated anchor texts. The resulting tokens are then converted to a bag-of-words based vector representation. We extract the location entities from the URL anchor text using SNER and extract the publication date from the URL using regular expressions³ (if applicable).

Figure 8 An example webpage with a relevant URL anchor text highlighted

Figure 8 shows an example webpage about the Brussels attack event with a relevant URL highlighted. The anchor text of the URL is: "Paris and Brussels terror suspect to face charges in France". The address that the URL points to is: https://www.theguardian.com/world/2016/jun/09/mohamed-abrini-paris-brussels-terror-suspect-france-man-in-the-hat. We can see that the URL's address contains the publication date of the corresponding webpage (June 9, 2016). The URL vector after tokenization, stopword removal, and stemming would be: ['theguardian', 'world', '2016', 'jun', 'mohamed', 'abrini', 'paris', 'brussels', 'terror', 'suspect', 'france', 'man', 'hat', 'charges']. We note here that SNER will capture the location entities from the URL's anchor text only, not the URL address, because the SNER tokenizes its input into sentences and tries to extract entities from these sentences; this can be done for the anchor text (which includes meaningful text) but not for the URL (since URL tokens don't form meaningful sentences).

If the seed URLs are from one domain (e.g., www.theguardian.com), this may affect the quality of information extracted to build our event model. The coverage from a single domain could be biased or limited in scope, while that is less likely if there are multiple domains. The seed URLs should be from different domains, which makes it more likely that all required information is present in the event model and reduces bias toward a particular domain website. As a summary, Figure 9 shows the steps for calculating the score of a URL.

³ https://github.com/Webhose/article-date-extractor

Figure 9 The steps for calculating the score of a URL
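As an illustration of the URL processing steps, the sketch below tokenizes a URL together with its anchor text and tries to read a publication date from the URL path. It is a simplified stand-in, not the date-extraction library cited above: the regular expression, the small URL stopword list, and the function names are assumptions, digits are kept as tokens (as in the example vector), and stopword removal and stemming are omitted.

```python
import re
from datetime import date

# Hypothetical list of URL-specific keywords to drop.
URL_STOPWORDS = {"http", "https", "www", "com", "html", "index"}
MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}
# Matches path segments such as /2016/07/10/ or /2016/jun/09/.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2}|[a-z]{3})/(\d{1,2})(?=/|$)")


def url_tokens(url, anchor_text=""):
    """Split the URL and its anchor text on non-alphanumerics, dropping URL keywords."""
    text = (url + " " + anchor_text).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text)
            if t and t not in URL_STOPWORDS]


def url_date(url):
    """Return the date embedded in the URL path, or None if there is none."""
    m = DATE_IN_URL.search(url.lower())
    if not m:
        return None
    year, month, day = m.groups()
    month_num = int(month) if month.isdigit() else MONTHS.get(month)
    if month_num is None:
        return None
    try:
        return date(int(year), month_num, int(day))
    except ValueError:
        return None  # e.g. /2016/13/40/ is not a valid calendar date
```

When `url_date` returns None, the crawler would have no date evidence from the URL itself and would have to rely on the topic and location parts of the URL score.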


In this chapter, we have addressed hypothesis 1.1 and research questions 1 and 2, namely "How to model and represent an event?" and "How to compare two event representations?". We explained our event model and representation, and how the event focused crawler can use it to estimate the relevance of the URLs and webpages it visits.


5 Experimental Setup

In this chapter we describe the datasets used for our experiments, the different experiments performed, and the evaluation metrics. We performed four series of experiments. The goal of the first series (see results in Section 6.1) was to validate that our event model can effectively classify webpages as relevant or non-relevant to the event of interest. We compared our approach against the baseline, i.e., using the traditional vector space model (VSM) topic-only approach. In the second series of experiments (see results in Section 6.2), we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages it visits and consequently guide the crawling process to webpages relevant to the event of interest. In the third experiment (see results in Section 7.1), we evaluated the effectiveness of our proposed webpage source importance model for collecting more relevant webpages. In the fourth experiment (see results in Section 7.2), we evaluated the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

5.1 Datasets

For the first series of experiments, about classification, we devised two datasets: 1) a set of relevant webpages for the training/learning model phase, and 2) a set of relevant and non-relevant webpages for the testing (classification) phase. First, we need a set of relevant webpages that the two models (our event-based model and the topic-only model) will use to learn/build their model in the training phase. Accordingly, we manually curated a set of 38 URLs and fetched their corresponding webpages. For the classification phase, there was no existing dataset (labeled relevant and non-relevant samples) about the California shooting event. We decided to build our own ground truth dataset of 1000 URLs and webpages. We could have manually labeled a set of 1000 URLs and webpages, but (to save time and effort) we used a keyword-based crawler to fetch 1000 webpages using the set of 38 URLs (used in the training phase) as seeds. We used the two words "California" and "shooting" as keywords for the crawler. After the crawler finished crawling, we manually labeled the resulting webpages into two classes (relevant and non-relevant). There were 725 webpages labeled as relevant and 275 labeled as non-relevant. We followed the procedure described in Section 4.2.2 for calculating the weights and performing the evaluation. The manually labeled set of URLs and webpages is given as input to both models, and the resulting calculated scores and a given threshold parameter are used to produce the predicted labels. The predicted labels are then compared to the labels determined manually, to produce the evaluation results.

After completing the first series of experiments, for the focused crawling experiments, we considered a set of recent events. Table 2 shows the list of events used in our crawling experiments. For each event we summarize the type of the event, the location and date of the event, and finally how many URLs were extracted from the corresponding tweet collection and were used as seed URLs for crawling. The number of seed URLs varies across the events because they were extracted from the event's corresponding tweet collections. The number of desired webpages was set to 50,000 for events with fewer than 10,000 seed URLs and 100,000 for events with more than 10,000 seed URLs, except for the Egyptair plane crash event, where the number of desired webpages was set to 10,000 only, and the Paris attack event, where the number of desired webpages was set to 500,000. For the classification experiments, however, it seemed sufficient to just consider the California shooting.

Table 2 List of events used in the large-scale crawling experiments

| Event | Type | Location | Date | # of Seed URLs | # of desired webpages |
|---|---|---|---|---|---|
| California Shooting | Shooting | San Bernardino, California, USA | December 2, 2015 | 4,161 | 50,000 |
| Brussels Attack | Terrorist Attack | Brussels, Belgium | March 22, 2016 | 4,691 | 50,000 |
| Oregon Shooting | Shooting | Roseburg, Oregon, USA | October 1, 2015 | 22,354 | 100,000 |
| Egyptair Plane Crash | Plane Crash | Mediterranean Sea, Alexandria, Egypt | May 19, 2016 | 1,211 | 10,000 |
| Panama Papers Leak | Document Leak | Panama | April 3, 2016 | 18,260 | 100,000 |
| Orlando Shooting | Shooting | Orlando, Florida, USA | June 12, 2016 | 1,988 | 50,000 |
| Paris Attack | Terrorist Attack | Paris, France | November 13, 2015 | 88,835 | 500,000 |
| Ecuador Earthquake | Earthquake | Ecuador | April 16, 2016 | 11,348 | 100,000 |


Since the IDEAL project is working with a large amount of event-related data, and since we wanted our results to be assessed in the context of such types of data, we had ample opportunity to utilize data from events like those mentioned above.

5.2 Experiments

The goal of the first series was to validate that our event model can effectively classify webpages with regard to relevance to the event of interest. We compared the performance, for the task of classification, of the event model vs. the topic-only model. We used the manually curated seed URLs and the static dataset of 1000 webpages about the California shooting for the evaluation. Both models use cosine similarity as a scoring function, and produce a score for their input text (URL or webpage) that estimates the relevance of the input to the given event. We use each of the models as a classifier, i.e., by comparing the cosine similarity score to a given threshold parameter. If the cosine similarity score is bigger than the threshold, then the output label is relevant; otherwise it is non-relevant.

Both models (event-based and topic-only) are trained using a set of URLs (positive samples only, as they don't require negative samples for training). The topic-only model uses the set of URLs to build a topic reference vector, while the event-based model uses the set of URLs to build the event model (topic, location, and date). The next step is to use the models to classify URLs and webpages. We use the two models built (event-based model and topic-only model) to classify the manually labeled URLs and webpages (i.e., our gold standard test set) to determine their predicted labels. The last step is to compare the predicted labels (from both the event-based model and the topic-only model) to the actual (manually produced) labels and evaluate the performance of each of the models. We evaluated the performance by varying two parameters:

1. k, the number of keywords used in constructing the topic vector in our event model and the topic reference vector for VSM, and

2. threshold, the value of the threshold used for converting the scores to labels (relevant if the score is larger than the threshold, otherwise non-relevant).

We also ran the experiments with several variations of our event model. We then ran the same experiment with the two pairs of feature types: a) combination of the topic and location only, and b) combination of the topic and date only. Figure 10 shows the design of the experiments for evaluating the effectiveness of classification using the event model.


Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment

In the second series of experiments, we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages to be visited, and consequently guide the focused crawling process to webpages relevant to the event of interest. In these experiments we used a set of recent events. In the literature, the most used dataset for evaluating topical focused crawlers is the DMOZ dataset. But since we are evaluating focused crawlers for events, the DMOZ dataset is not suitable in our case.

In the crawling experiments, we need a set of relevant URLs for each event that will be used as seeds for starting the crawling. We manually curated the set of seed URLs for two events (38 for California shooting and 23 for Brussels attack). These two sets of manually curated URLs are used for running small-scale crawls only. For large-scale crawls, we need to start from a larger set of seed URLs, which is very difficult to build manually. For the rest of the events, we extracted the set of seed URLs from a corresponding collection of tweets. The tweet collections were collected using the Twitter streaming API. Twitter is a rich source of URLs, as most of the tweets posted link to webpages that contain more detailed information. The number of URLs extracted depends on how big the event is (high impact events attract more people and therefore more tweets are posted about the event).


Figure 11 summarizes the steps for calculating the harvest ratio after the crawling process finishes. The resulting webpages of each crawl, their calculated scores (based on each model), and a given threshold are used to produce the predicted labels and the corresponding harvest ratio.

Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages

5.3 Evaluation Metrics

We used the precision, recall, and F1-score metrics to evaluate the classification performance of our event model versus the baseline topic-only approach in the first series of experiments. The precision, recall, and F1-score are calculated using the confusion matrix, which contains the number of true positive, true negative, false negative, and false positive samples (samples here refer to webpages). The true positive samples are the ones that were predicted relevant and were actually relevant, while the true negative samples are the ones that were predicted non-relevant and were actually non-relevant. The false positive samples are the ones that were predicted relevant and were actually non-relevant, while the false negative samples are the ones that were predicted non-relevant and were actually relevant. The precision is defined as the percentage of retrieved samples that are relevant. It is calculated as the ratio of true positive to the sum of the true and false positive samples. The recall is defined as the percentage of relevant samples that are retrieved. It is calculated as the ratio of the true positive to the sum of the true positive and false negative samples. The F1-score measure is the harmonic mean of the precision and recall measures. To get a perfect precision we usually must have low recall and vice versa (perfect recall comes with low precision), so F1-score is a way to combine both measures (precision and recall). In other words, if you get a high value for precision, that doesn't ensure you have good performance, as you might have a low value for recall. But if you have a high value for the F1-score, this means you have a high value for both precision and recall.

For evaluating the performance of the focused crawlers in the second, third, and fourth series of experiments (i.e., the ability to collect more relevant webpages), we used the harvest ratio metric [27-29]. The harvest ratio is the percentage of crawled webpages that are relevant. The harvest ratio measures the ability of the crawler to find and collect more relevant content than non-relevant content. If the crawler visits many non-relevant webpages in order to find relevant ones, then this means the crawler is using an inefficient method. A highly efficient relevance estimation method would direct the crawler correctly and ensure it focuses on the relevant part of the Web only, thus producing a higher harvest ratio. It is worth noting here that the critical point for relevance judgment is the ability to estimate relevance of both URLs and webpages. Estimating the relevance of webpages is easier than for URLs due to the fact that webpages contain richer textual content than URLs. A poor relevance estimation method would give high scores for non-relevant URLs, which leads to non-relevant webpages and thus low harvest ratio.
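The metrics defined above follow directly from the confusion matrix counts. The sketch below is a plain illustration of those definitions; the function names are hypothetical, labels are booleans (True for relevant), and a crawled page is counted as relevant for the harvest ratio when its score is larger than the threshold.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for two aligned lists of boolean labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and a:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(actual, predicted):
    """Precision, recall, and their harmonic mean (F1-score)."""
    tp, fp, _tn, fn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def harvest_ratio(page_scores, threshold):
    """Fraction of crawled pages whose relevance score exceeds the threshold."""
    if not page_scores:
        return 0.0
    return sum(1 for s in page_scores if s > threshold) / len(page_scores)
```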


6 Results

6.1 Event Model-based vs. Topic-Only Classification

In this section, about the first series of experiments, we show the results of classifying the 1000 webpages about the California shooting using the topic-only vector space model versus three variants of our event model, namely topic + location, topic + date, and topic + location + date (our event model). Using the 38 seed webpages about the California shooting event, we created a vocabulary of 1365 keywords that appeared on 5 or more webpages. To extract the most representative keywords (features) from the vocabulary, we sorted the vocabulary keywords based on their cumulative normalized occurrences in all of the seed webpages. We chose the top k keywords from the sorted vocabulary. Each seed webpage is then represented as a vector of the top k keywords and their frequency of occurrence in the webpage. We created the topic reference vector as the centroid vector of all seed webpage vectors.
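The construction of the topic reference vector can be sketched as follows. This is one possible reading of the steps above, not the dissertation's code: each seed webpage is assumed to be already tokenized (stopwords removed, words stemmed), "normalized occurrences" is assumed to mean a term's count divided by the page's total token count, and the function name is hypothetical.

```python
from collections import Counter


def topic_reference_vector(pages_tokens, k, min_pages=5):
    """Build a centroid vector over the top-k vocabulary keywords.

    pages_tokens: list of token lists, one per seed webpage (preprocessed).
    """
    page_counts = [Counter(tokens) for tokens in pages_tokens]

    # Vocabulary: keywords that appear on at least min_pages seed webpages.
    doc_freq = Counter(term for counts in page_counts for term in counts)
    vocabulary = {t for t, df in doc_freq.items() if df >= min_pages}

    # Rank keywords by cumulative normalized occurrence across all seed pages.
    cumulative = Counter()
    for counts in page_counts:
        total = sum(counts.values())
        for term, n in counts.items():
            if term in vocabulary:
                cumulative[term] += n / total
    ranked = sorted(cumulative.items(), key=lambda kv: (-kv[1], kv[0]))
    top_k = [term for term, _ in ranked[:k]]

    # Centroid: mean raw frequency of each top-k keyword over the seed pages.
    n_pages = len(page_counts)
    return {t: sum(c[t] for c in page_counts) / n_pages for t in top_k}
```

With min_pages=5 and the 38 seed webpages, the vocabulary step corresponds to the 1365-keyword vocabulary described above; k is then the tunable size of the topic vector.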

6.1.1 Classifying URLs and webpages about California shooting

The 1000 URLs/webpages dataset about the California shooting consists of URL addresses and their anchor texts, as well as the corresponding webpages. We ran experiments to classify URLs and webpages, separately. The URLs and webpages were manually labeled as relevant vs. non-relevant. We refer to this dataset as the labeled dataset. For the topic-only model and our event model, we varied the parameter k (the number of keywords in the topic vector) from 5 to 1365 words with increment of 10 and the threshold parameter from 0 to 1 with increment of 0.05.

Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score

| Model | URLs: K | URLs: Threshold | Webpages: K | Webpages: Threshold |
|---|---|---|---|---|
| Topic-only (baseline) | 1310 | 0.25 | 10 | 0.45 |
| Topic, Location, and Date (Event model) | 1310 | 0.15 | 10 | 0.4 |

As can be seen in Table 3, for the topic-only approach, in the case of URLs, the values of the parameters k (the number of keywords of the topic vector) and threshold that gave the best F1 score on the labeled dataset were 1310 and 0.25, respectively. In the case of webpages, they were 10 and 0.45, respectively. The weights of topic, date, and location parts were 0.36, 0.22, and 0.42, respectively. The weights were calculated using equations 1-4 as described in Section 4.2.2.

For the event model approach, in the case of URLs, the values of the parameters k (number of keywords of the topic vector) and threshold that gave the best F1 score on the labeled data were 1310 and 0.15, respectively. In the case of webpages, they were 10 and 0.4, respectively. The weights of topic, date, and location parts were 0.3, 0.355, and 0.345, respectively. Table 3 summarizes the parameter values for all settings.

To examine the effect of the date and location separately, we ran our evaluation using topic+location and topic+date. For topic+location, the best threshold value was 0.2 and the weights of topic and location parts were 0.64 and 0.36, respectively. For topic+date, the best threshold value was 0.2 and the weights of topic and date parts were 0.47 and 0.53, respectively. Tables 4 and 5 show the precision, recall, and F1 score for the four experimental settings (topic-only, topic+location, topic+date, and our event model, with topic, location, and date) using the best values for the parameters for both the URL and webpage classification tasks (as shown in Table 3).

Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event

                        Precision   Recall   F1-score
Topic                   0.728       0.723    0.725
Topic+Date              0.852       0.855    0.853
Topic+Location          0.764       0.73     0.74
Topic+Location+Date     0.863       0.867    0.862
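To make the role of the part weights and the threshold concrete, the sketch below combines per-part similarity scores into one relevance score. It assumes a weighted linear combination; the actual formulation is given by equations 1-4 in Section 4.2.2, which are not reproduced in this section. The names `EVENT_WEIGHTS`, `event_score`, and `is_relevant` are hypothetical, and the weights and threshold are the event model values reported above.

```python
# Weights reported above for the event model (topic, date, location).
EVENT_WEIGHTS = {"topic": 0.3, "date": 0.355, "location": 0.345}


def event_score(similarities, weights=EVENT_WEIGHTS):
    """Weighted sum of per-part similarities, each assumed to lie in [0, 1].

    Assumption: a linear combination; a missing part contributes 0.
    """
    return sum(w * similarities.get(part, 0.0) for part, w in weights.items())


def is_relevant(similarities, threshold=0.4, weights=EVENT_WEIGHTS):
    """Label a URL or webpage relevant when its combined score reaches the threshold."""
    return event_score(similarities, weights) >= threshold
```

Under these assumptions, a page that matches only the topic part (similarity 1.0) scores 0.3 and falls below the 0.4 threshold, while a page with partial topic similarity and a matching date can pass it.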

Achieving higher F1-score means better classification performance (i.e., better ability to identify and differentiate between relevant and non-relevant webpages). The results show that adding date and/or location information to the topic enhances the performance. Our event model (combining topic, location, and date) achieves the best performance (highest F1-score). The topic-only model performed worst (lowest F1-score). Our examination of the data confirmed that the topic-only model did not differentiate well between webpages talking about different shooting events and our event (California shooting), as all are topically related (shooting). On the other hand, the topic+date model performed better than topic-only, because it managed to use the publishing time of the webpages to filter out webpages talking about shooting events that happened before the California shooting event. The topic+location model performed better than the topic-only model because it filtered out webpages talking about shooting events that happened at other locations than California.

Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event

                        Precision   Recall   F1-score
Topic                   0.738       0.734    0.736
Topic+Date              0.842       0.846    0.843
Topic+Location          0.856       0.859    0.857
Topic+Location+Date     0.88        0.884    0.881

We also examined the performance of our event model (combining topic, location, and date) versus the topic-only approach across the different values of the threshold variable and the best value for the k parameter. We plotted the precision-recall curves for the different values of the threshold parameter. Figure 12 and Figure 13 show the curves for the four different settings. The figures confirm the results described above: adding location and/or date information enhances the performance of classification. It is also shown that the effect of adding date information is much stronger than adding location information, in the case of URLs. We investigated this behavior and found that most of the URLs in our labeled data include date information that can be extracted easily. There was less location information in the URLs compared to date information. Further, some of the location information was not in a standard format as expected by SNER (which assumes location information exists as part of a valid sentence; see Section 4.2.3).

We have used the manually curated seed URLs for learning our event model and the baseline topic-only model. Both models have two parameters: K (number of features, i.e., words, in the topic vector) and threshold (value for determining the labels: relevant or non-relevant). We tuned the values of the two parameters and reported the best values of those parameters (see Table 3) and the performance of the two models (using the best value of the two parameters; see Tables 4 and 5) on the manually labeled training dataset. Finally, we tested the performance of the two models on the manually labeled test dataset (since the two models haven't seen the webpages in this dataset) to see how well the two models will generalize to unseen webpages.

Table 6 shows the performance of the baseline topic-only model, our event model (topic+location+date), and two variants of our event model (topic+location and topic+date). Our event model outperforms the baseline topic-only model by achieving an F1-score of 0.894 compared to 0.688 for the baseline topic-only model. Also, adding the date or location information achieves better performance than the baseline topic-only model. Adding the location information is more effective than adding the date information. This can be attributed to the richness of location information in the webpages compared to the existence of publication date in webpages.

Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event

                        Precision   Recall   F1-score
Topic                   0.695       0.706    0.688
Topic+Date              0.779       0.783    0.776
Topic+Location          0.858       0.86     0.858
Topic+Location+Date     0.899       0.896    0.894

Figure 12 California shooting URLs evaluation at different threshold values

Figure 13 California shooting webpages evaluation at different threshold values

6.2 Event Model-based vs. Topic-only Focused Crawler

In this section, about the second series of experiments, we report the effect of using the event model with the focused crawler.

6.2.1 California Shooting

In this experiment we used the 38 URLs manually curated (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 7. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date.

We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 14 shows the performance of the two crawlers in the small-scale setting (only 1000 webpages are crawled) during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during and at the end of the crawling process than the baseline topic-only focused crawler. Our event model-based focused crawler achieved a harvest ratio of approximately 0.85 while the baseline topic-only focused crawler achieved a harvest ratio of approximately 0.68.

Table 7 California shooting event model

Part        Keyword           Weight
Topic       shoot             0.93
            san               0.513
            bernardino        0.465
            said              0.357
            wa                0.323
            2015              0.321
            peopl             0.31
            california        0.305
            polic             0.258
            suspect           0.177
Location    San Bernardino    1
            California        0.51
            Calif.            0.44
Date        2015-12-02
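The harvest ratio plotted at each stage of the crawl can be computed as in the following sketch. It assumes the crawl is available as a list of per-page relevance judgments in crawl order; the function name `harvest_ratio_curve` and the `step` parameter are hypothetical.

```python
def harvest_ratio_curve(relevance_flags, step=100):
    """Return [(pages_crawled, harvest_ratio)] at every `step` crawled pages.

    `relevance_flags` is a list of booleans in crawl order (True = relevant).
    The final point is always included, even if it is not a multiple of `step`.
    """
    curve = []
    relevant_so_far = 0
    total = len(relevance_flags)
    for n, flag in enumerate(relevance_flags, start=1):
        if flag:
            relevant_so_far += 1
        if n % step == 0 or n == total:
            # Harvest ratio: fraction of pages crawled so far that are relevant.
            curve.append((n, relevant_so_far / n))
    return curve
```

With `step=100` and a 1000-page crawl this yields the ten points (first 100, 200, 300, … pages) of the kind shown in the small-scale figures.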

Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting
[Plot: x-axis, total number of webpages crawled (0 to 1000); y-axis, percentage of crawled webpages that are relevant (harvest ratio, 0 to 1); series: baseline focused crawler (topic only) and event model-based focused crawler (topic + location + date).]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) for 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 15 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only achieved a harvest ratio of 0.3 (average).

Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)

6.2.2 Brussels Attack

In this experiment we used the 23 URLs manually curated (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 8. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date.

We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 16 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during the crawling process than the topic-only baseline focused crawler in the small-scale setting (1000 webpages only).

Table 8 Brussels attack event model

Part        Keyword             Weight
Topic       brussel             0.881
            attack              0.541
            airport             0.539
            explos              0.381
            wa                  0.31
            peopl               0.273
            station             0.254
            belgium             0.242
            metro               0.197
            terror              0.159
Location    Brussels            1
            Belgium             0.37
            Brussels Airport    0.174
            Zaventem            0.174
            Paris               0.123
Date        2016-03-22

Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack
[Plot: x-axis, total number of crawled webpages (0 to 1000); y-axis, percentage of crawled webpages that are relevant (harvest ratio, 0 to 1); series: baseline focused crawler (topic only) and event model-based focused crawler (topic + location + date).]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 17 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only achieved a harvest ratio of 0.5.

Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)

In the previous two experiments (California shooting and Brussels attack), we performed small-scale and large-scale crawls. The reason for the small-scale experiments was to prove the point that the event model efficiently guided the focused crawler to focus on the relevant part of the Web, and the crawler as a result retrieved more relevant webpages than the topic-only baseline focused crawler. The purpose of the large-scale experiments is to show that the performance of our event model-based focused crawler remains better than the topic-only baseline focused crawler even for large numbers of webpages, i.e., is scalable.

In the remaining experiments (other events), we ran only the large-scale crawling experiments. We already showed the effectiveness of our approach in the small-scale experiments and we need to validate that the same performance persists at large scale. We didn't manually prepare sets of seed URLs for the remaining events; we extracted them from the tweet collections for each event.

6.2.3 Oregon shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~22K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only achieved a harvest ratio of 0.25 (average). Figure 18 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)

In this experiment, we had a higher number of seed URLs than in the previous two experiments (22K compared to 4K). We did not control the number of seed URLs. We extracted and filtered the seed URLs from the event's tweet collection. The number of seed URLs depends on the size of the corresponding tweet collection, which depends on the impact/coverage and size of the event (big or small). Events with big impact will lead to large tweet collections and therefore more URLs extracted. The definition of impact here in our context is related to coverage. The Oregon shooting event's tweet collection had more tweets than the previous two events and therefore there were more URLs extracted than for the previous two events. Having more starting URLs means we have more access/pointers to the relevant part of the Web graph, so we ran the experiments for 100K webpages rather than 50K (as in the first two experiments).

6.2.4 Egypt airplane crash

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 10K webpages. The two crawlers started from a set of ~1100 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.5 while the baseline topic-only achieved a harvest ratio of 0.4. Figure 19 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egypt airplane crash (10K webpages)

An event of big impact could have a small tweet collection (as is the case for this event) because there was not enough coverage for the event on Twitter, or the event didn't attract much attention from Twitter users. Another possible reason is that other trending topics drew attention away from that event.

6.2.5 Panama papers

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~18K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 while the baseline topic-only achieved a harvest ratio of 0.4. Figure 20 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)

6.2.6 Orlando shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~2000 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only achieved a harvest ratio of 0.4. Figure 21 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)

6.2.7 Paris attacks

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 500K webpages. The two crawlers started from a set of ~88K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.25 while the baseline topic-only achieved a harvest ratio of 0.18. Figure 22 shows the performance of the two crawlers during the different stages of the crawling process.

The Paris attack event was a big event with huge impact locally in France and internationally. We kept collecting tweets since the start of the event and for several days after the event, which is not the case for the other events. Usually the tweet collecting process stops on the same day or one/two days after the event. The stopping point depends on the time Twitter users stop posting about the event, which typically peaks on the day of the event and decreases after that. We note here that this experiment (and all our experiments) started from the English seed URLs only (the same applies during the crawling process; we are working on the English language only). We excluded all tweets not in English. Even when limiting to English-only tweets, we still had around 88K seed URLs to start from.

The performance of the two crawlers degraded at the end of the crawl as expected, because (as known in the topical crawler literature [17, 26, 28, 40, 41, 46, 47, 65, 66, 74, 75, 81, 83, 89, 90]) as we get far from the seed URLs, we find fewer relevant webpages. The relevant content is concentrated around the seed URLs, so the further away we go from the seed URLs, the greater the chance that we hit non-relevant content.

Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)

6.2.8 Ecuador earthquake

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~11K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.75 while the baseline topic-only achieved a harvest ratio of 0.4. Figure 23 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)

In this chapter we addressed research question 3, which is the effect of using our event model on the performance of the focused crawler. We showed that using the event model with focused crawling leads to achieving a higher harvest ratio, thus collecting more of the relevant webpages than the traditional focused crawler. The better performance was shown in small and large-scale crawls.

7 Webpage Source Importance and Social media-based Seed Selection

So far we have exploited content-based features for finding relevant webpages. In this section, we explore two other parameters that affect the ability of the focused crawler to find more relevant webpages: webpage source importance and seed URLs.

7.1 Webpage Source Importance

A webpage source is the website the webpage belongs to; for example, the webpage with the URL http://www.cnn.com/2014/04/18/world/asia/malaysia-airlines-plane/index.html belongs to the source website http://www.cnn.com. The importance of a webpage source will be determined based on the number of relevant webpages that belong to the webpage source. This heuristic exhibits two characteristics that should ensure a good estimate of the webpage source importance:

1) Follows the likelihood properties of the data,
2) Changes dynamically with new data observed during crawling.
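As a small illustration of the source definition above, the following sketch maps a webpage URL to its source website using Python's standard `urllib.parse`. The function name `webpage_source` is hypothetical, and treating scheme plus host as the source is an assumption drawn from the CNN example.

```python
from urllib.parse import urlsplit


def webpage_source(url):
    """Return the source website (scheme + host) that a webpage URL belongs to."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


example = "http://www.cnn.com/2014/04/18/world/asia/malaysia-airlines-plane/index.html"
print(webpage_source(example))  # http://www.cnn.com
```

A crawler could key its per-source counters of relevant and non-relevant webpages on this value.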

The first characteristic ensures that the measure we are using is realistic and describes the actual data being collected. The second characteristic shows how the measure adapts to changes in the data being observed, and also follows changes in the content being published on the WWW. There are several reasons for choosing our method of estimating source importance and not considering PageRank and Hub and Authority [42, 43, 77, 78] methods (as an example of source popularity measures):

1- Dynamic vs. static (fixed): PageRank, hub and authority, and out-degree measures are all static or fixed measures. They need to be calculated offline (i.e., require the whole dataset or part of it to calculate the values and then are used after that during crawling). This is similar to online and offline learning methods. Offline methods use the training data to build the model and then use it. Online methods don't use/require training data; they learn the model and use it online/during crawling. So, the model is updated during crawling.

2- PageRank and hub and authority methods are time consuming and computationally intensive, so it will be time consuming to adapt them for an online version.

3- PageRank and hub and authority are methods for measuring popularity and quality of webpages. We can imagine that using them for estimating the importance of a source with respect to a specific event is like using general Web crawlers for crawling webpages about a specific event. We expect that such popularity measures are too general to be considered a measure for source importance. For example, an unpopular website (source) could be very relevant to an event, due to its content or having links to other relevant webpages about the event. Also using topic-oriented PageRank and hub and authority methods is like using topical crawlers for crawling about events, which is not efficient; that is a key point of this dissertation.

So we need a dynamic, simply calculated, and event-specific method for estimating the likelihood of finding a new relevant webpage from a source by using the likelihood of the source in the currently crawled webpages or the discovered but not yet visited URLs in the frontier. This is similar to a graph-based algorithm which learns from different paths in the graph whether a certain path will lead to relevant webpages even if we encounter non-relevant ones in the middle. Thus, if a webpage is not relevant but its source has a high probability of having event-relevant webpages, then there is a high probability that the current low-score (non-relevant) webpage will link to a relevant webpage. This approach acts like a greedy algorithm where the crawler will crawl more from the source with the highest importance. The crawler will keep crawling relevant webpages from the most important source until it no longer finds relevant webpages, and then switches to another important source.

We could experiment with several candidate methods for estimating source importance, using the information about currently crawled webpages, like the ones in [1], namely (the following list is taken from the paper in [33] with changes):

a. Negative Absolute Bad function, where the score of a source is the negative number of already crawled non-relevant webpages;

b. Best Score function, where the score of a source is the maximal score of one of the discovered but not yet visited URLs that belongs to the source;

c. Success Rate function, where the score of a source is the ratio between the number of relevant webpages crawled and the non-relevant; the ratio is initialized with prior parameters α and β which we set to 1: score(source) = (#relevant(source) + α) / (#non-relevant(source) + β);

d. Thompson Sampling function, where the score of a source is a random number, drawn from a beta-distribution with prior parameters α and β; in this case we take as the score the random value: score(source) = Beta(#relevant(source) + α, #non-relevant(source) + β); we initialized the priors α and β with 1;

e. Absolute Good * Best Score function, where the score of a source is the product of the absolute number of already crawled relevant webpages and the best score function described in (b);

f. Thompson Sampling * Best Score function, where the score of a source is the product of the Thompson sampling function (d) and the best score function (b); and

g. Success Rate * Best Score function, where the score of a source is the product of the success rate function (c) and the best score function (b).
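Three of the candidate functions listed above, (a), (c), and (d), depend only on per-source counts and can be sketched directly from their definitions. The function names are hypothetical; `random.betavariate` from the Python standard library stands in for the Beta draw in (d).

```python
import random


def negative_absolute_bad(n_non_relevant):
    """(a) Negative number of already crawled non-relevant webpages."""
    return -n_non_relevant


def success_rate(n_relevant, n_non_relevant, alpha=1.0, beta=1.0):
    """(c) Ratio of relevant to non-relevant webpages, smoothed by priors alpha and beta."""
    return (n_relevant + alpha) / (n_non_relevant + beta)


def thompson_sampling(n_relevant, n_non_relevant, alpha=1.0, beta=1.0, rng=random):
    """(d) Random score drawn from Beta(#relevant + alpha, #non-relevant + beta)."""
    return rng.betavariate(n_relevant + alpha, n_non_relevant + beta)
```

With the priors set to 1, an unseen source gets a success rate of 1.0, so every source starts with the same score; the product variants (e) to (g) would multiply these values by the best frontier score of the source.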

The results shown in [33] indicate that the success rate function (c) is the best scoring function with regard to crawling the largest number of relevant webpages (i.e., the highest harvest ratio). Thus we chose to use the success rate function as our method for estimating webpage source importance.

We ran an experiment with the event model-based focused crawler and source importance. We combined the webpage source importance score with the event model-based relevance score to produce the final score of URLs. One possible method of combination is multiplying both scores together, so URLs with a high webpage importance score and a high relevance score will get a higher final score. We note here that the webpage importance score is calculated during the crawling process and is not a fixed value, but rather a dynamic value that changes during the crawling process. At the beginning of the crawl, all sources have the same initial importance score. When a new webpage is retrieved and found relevant, then the corresponding webpage source importance score is updated. In this way the more we find relevant webpages from a source, the more the importance score of this source increases.

Figure 24 shows the performance of event model-based focused crawling with and without source importance for the Brussels attack event. We notice from Figure 24 that both crawlers achieve almost the same performance, with the crawler with source importance struggling in the first half because of the dynamic value of the source importance. We further examined the results of the two crawlers and found that the collection produced from the crawler with source importance had only 21 unique websites (webpage sources) while the crawler with no source importance had 81 unique websites. The crawler with source importance succeeded in collecting the same number of relevant webpages as the crawler with no source importance, but from far fewer websites. We ran the experiment also on 3 more events: California shooting (see Figure 25), Ecuador earthquake (see Figure 26), and Orlando shooting (see Figure 27).
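The dynamic source importance score and its combination with the relevance score by multiplication can be sketched as follows. This is an illustrative sketch, not the crawler's implementation: the class `SourceImportance` and the function `final_url_score` are hypothetical names, and updating the counts for non-relevant webpages as well as relevant ones is an assumption that follows from the success rate definition.

```python
from collections import defaultdict


class SourceImportance:
    """Tracks per-source counts and scores each source with the success rate function."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta
        self.relevant = defaultdict(int)
        self.non_relevant = defaultdict(int)

    def update(self, source, is_relevant):
        """Record the relevance judgment of a newly crawled webpage from `source`."""
        if is_relevant:
            self.relevant[source] += 1
        else:
            self.non_relevant[source] += 1

    def score(self, source):
        """Success rate; every unseen source starts at alpha / beta."""
        return (self.relevant[source] + self.alpha) / (self.non_relevant[source] + self.beta)


def final_url_score(relevance_score, source, importance):
    """Final frontier score of a URL: relevance multiplied by its source's importance."""
    return relevance_score * importance.score(source)
```

Because scores change as pages are crawled, a frontier using this sketch would need to rescore queued URLs (or recompute scores when popping) rather than fix a URL's priority at insertion time.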

Figure 24 Effect of source importance on event focused crawling for Brussels attack event
[Plot: x-axis, total number of crawled webpages (0 to 1000); y-axis, percentage of crawled webpages that are relevant (harvest ratio); series: event focused crawler, and event focused crawler with source importance.]

Figure 25 Effect of source importance on event focused crawling for California shooting event
[Plot: x-axis, number of pages crawled (0 to 10000); y-axis, percentage of crawled pages that are relevant; series: event focused crawler, and event focused crawler with source importance.]

Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event
[Plot: x-axis, number of webpages crawled (0 to 10000); y-axis, percentage of crawled webpages that are relevant; series: event focused crawler, and event focused crawler with source importance.]

Figure 27 Effect of source importance on event focused crawling for Orlando shooting event
[Plot: x-axis, number of webpages crawled (0 to 10000); y-axis, percentage of crawled webpages that are relevant; series: event focused crawler, and event focused crawler with source importance.]

We can see from the results in Figures 24-27 that adding webpage source importance to event model-based focused crawling enhances the performance and achieves a better harvest ratio. The event model-based focused crawler with source importance struggles in the beginning of the crawl and then manages to enhance the performance at the end of the crawl. The reason for the bad performance in the beginning is that the focused crawler is trying to find the good sources; once it settles on good ones it exploits their importance to reach more relevant webpages.

In this section we covered research questions 4 and 5, which address how to model webpage source importance and how to integrate it with event model-based focused crawlers. We showed that using webpage source importance helps the focused crawler collect relevant webpages while focusing on important sources. Regarding hypothesis 2: first, we have shown that if we know the bias of different websites according to some criteria, we can include other websites in order to reduce bias. Second, our experimental results show that integrating event information and source importance leads to improved estimates of relevance. Additional demonstrations of these capabilities are given in the following subsections.

7.2 Seed URLs for crawling

The factors that affect any crawling experiments are the number of desired webpages to be crawled and the number of seed URLs fed to the focused crawler. Successfully collecting the number of desired webpages depends on the quality and the number of seed URLs we start from. We should start from URLs that will link/lead to the largest number of relevant webpages. There are different types of seed URLs. We classify the seed URLs with regard to relevance and linking as follows: relevant and linking to relevant URLs, relevant and not linking, non-relevant and linking, non-relevant and not linking. We don't want the last type (totally useless URLs). All other types are either good in their own right, or are pointing to other good URLs. Table 9 summarizes the different types of seed URLs.

We determine whether a URL links to other relevant URLs or not by downloading the corresponding webpage and extracting the links from the webpage content. We estimate the relevance of a URL by classifying the URL tokens as relevant or non-relevant to an event. URL tokens are the set of tokens extracted from the URL address and the URL anchor text that appears on the parent webpage.

Table 9 Different types of seed URLs

                Linking to relevant URLs    Not linking to relevant URLs
Relevant        Hub webpage                 Dead end, authority webpage
Non-Relevant    Tunneling                   Reject; ignore that path
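The four seed types in Table 9 and the notion of URL tokens can be sketched as below. Both functions are hypothetical illustrations: the relevance and linking judgments are assumed to be supplied by the classifiers described above, and splitting on non-alphanumeric characters is an assumed tokenization, not necessarily the one used in this work.

```python
import re


def url_tokens(url, anchor_text=""):
    """Lowercased tokens from the URL address and its anchor text.

    Assumption: tokens are maximal runs of letters and digits.
    """
    text = f"{url} {anchor_text}".lower()
    return [token for token in re.split(r"[^a-z0-9]+", text) if token]


def seed_type(is_relevant, links_to_relevant):
    """Map the two judgments about a seed URL to its type in Table 9."""
    if is_relevant and links_to_relevant:
        return "hub"
    if is_relevant:
        return "authority (dead end)"
    if links_to_relevant:
        return "tunneling"
    return "reject"
```

Only the last type would be discarded; a tunneling URL is kept because, although not relevant itself, it leads to relevant webpages.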

7.3 Semi-automated Social Media-based Seed URL Generation

Social media has proven to be an important and rich asset for collecting webpages about events. Ensuring full Web archive coverage of an event is not an easy task, for several reasons. First, events differ in impact and importance. Big events tend to last for a long time, impact multiple places, and even spark a range of debates about diverse topics. Second, to build a Web collection that fully covers an event requires sampling an unbiased set of webpages from the WWW (which is huge, heterogeneous, and dynamically changing). The size of the WWW makes it difficult to collect, curate, and sample an unbiased set of webpages using manual techniques. Fortunately, focused crawlers have been proven effective [2, 26, 46, 66, 74, 75, 89] in automating and accelerating the process of collecting webpages, starting from a set of seed URLs. However, the ability of the focused crawler to find relevant and diverse webpages depends on the quality (content quality and linking structure quality) and the broad coverage (seed URLs from different webpage sources) of the seed URLs.

We have been researching building webpage/tweet collections about events. We identified three main approaches: 1) the Internet Archive's Archive-It service for collecting and archiving webpages, 2) a pair of archiving tools for collecting tweets, and 3) event model-based focused crawling of webpages. We proposed a hybrid approach for building unbiased collections of webpages with high coverage using seed URLs generated from social media content (tweets), together with event model-based focused crawlers. The tweet collection processes [1, 2, 74, 91] ensure a large sample of seed URLs with broad and heterogeneous genres of webpages (horizontal/exploring aspect) while the event model-based focused crawler ensures high quality and relevant webpages (vertical/exploiting aspect).

7.3.1 Selecting Seed URLs

We apply the following steps for selecting/curating the set of seed URLs:

• Group long URLs by sources/domains/hosts
• Count the number of long URLs per source (source importance)
• Sort sources (descending) according to number of URLs in each source
• Pick top K sources and then choose one URL from each of the K sources
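The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `select_seeds` is a hypothetical name, the source of a URL is taken to be its host as parsed by `urllib.parse.urlsplit` (so URLs must include a scheme), and taking the first URL seen for each source is an arbitrary choice, since the steps do not say which URL to pick.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def select_seeds(long_urls, k):
    """Pick one seed URL from each of the k sources with the most URLs."""
    # Step 1: group long URLs by source (host).
    by_source = defaultdict(list)
    for url in long_urls:
        by_source[urlsplit(url).netloc].append(url)
    # Steps 2-3: count URLs per source and sort sources in descending order.
    # sorted() is stable, so ties keep first-seen order.
    ranked = sorted(by_source.items(), key=lambda item: len(item[1]), reverse=True)
    # Step 4: take the top k sources and one URL from each (assumption: the first seen).
    return [urls[0] for _, urls in ranked[:k]]
```

For example, with three URLs from one host, two from a second, and one from a third, `k=2` returns one URL from each of the first two hosts.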

Choosing K unique sources ensures diversity of the seed URLs, and choosing the top K according to the source importance measure ensures broad coverage and high quality. Although tweet collections about events are a very rich source of seed URLs, they contain a lot of noise (porn, job marketing, other spam, and varied other types of non-relevant tweets or URLs). One important step required before using URLs extracted from tweets is an importance analysis of each URL/webpage source, e.g., considering the domain name of the URL.

The set of seed URLs should be normalized. Consider the list below. URL1 is the normalized version of URL2. Both URLs point to the same webpage.

• URL1 = www.cnn.com
• URL2 = www.cnn.com?utm_source=feedburner&utm_medium=twitter

The URLs mentioned in tweets are not all relevant. We need to filter them. We used a keyword-based filtering method, which includes only URLs that have at least one of a pre-defined set of keywords. The set of keywords is created manually for each event. Such a filtering process is close in accuracy to classification, but much faster.
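URL normalization and keyword-based filtering, as described above, might look like the following sketch. The function names are hypothetical. Dropping only query parameters whose names start with `utm_` (plus any fragment) is an assumption generalized from the URL1/URL2 example, and matching keywords as case-insensitive substrings of the URL is likewise an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Remove tracking parameters (names starting with 'utm_') and any fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def has_keyword(url, keywords):
    """Keep a URL only if it contains at least one of the event keywords."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

A seed pool could then be built as the set of `normalize_url(u)` for every extracted URL `u` that passes `has_keyword`, which also de-duplicates URLs that differ only in tracking parameters.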

7.3.2 Seeds URL Domain/Source Importance

We define the webpage source importance as the probability of finding more relevant webpages when starting with a seed URL from that source. We assume that the topical/event locality property holds, where webpages about an event link to other webpages about the same event. Also we assume that URLs/webpages from the same source are connected (i.e., if you start from one of them you can reach the others). We estimate the webpage source importance by calculating the number of URLs from the same source extracted from the tweet collection. Figure 28 shows the workflow for extracting URLs from tweet collections. We apply our methods on the extracted URLs to calculate the source importance.

Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets

We ran an experiment about the Brussels attack event. We ran our event focused crawler to collect 1000 webpages starting from different sets of seed URLs. The sets of seed URLs tested in our experiments differ in two aspects: the number (K) of URLs and the uniqueness of the URLs with respect to their websites. We selected the seed URLs from a pool of URLs extracted from a set of tweets collected using the Twitter streaming API. Table 10 summarizes the statistics for the Brussels attack tweet collection. Figure 29 shows the language distribution of the tweets. Most of the tweets are in the English language, which is the language we worked on in our research. Figures 30 and 31 show the number of tweets (without and with URLs), and their distribution across time. Most of the tweets were posted on the first day of the event; their number decreases with time. Figure 32 shows the distribution of the sources according to our source importance measure. We used the harvest ratio measure to evaluate the output of the focused crawlers.

Table 10 Brussels attack tweet collection statistics

Category                                                   Number
All tweets                                                 2,227,706
Tweets in English (lang=en)                                1,838,276
Tweet creation date distribution: 3/22/2016                1,253,152
Tweets with URLs                                           937,009
Tweets with URL creation date distribution: 3/22/2016      462,154
Unique short URLs extracted (lang=en)                      113,402
Unique long URLs                                           85,991 (twitter.com = 38,168)
Unique domains/sources                                     8,082 (2980 >= 2, 596 >= 10)
De-duplicated URLs                                         74,698
URLs with keywords "brussels, attack"                      16,187

Figure29Brusselsattacktweetslanguagedistribution

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

1,600,000

1,800,000

2,000,000

en

und fr tr es nl

de ar el it th hi

ru in pl

ja svpt

daht tl fa fi

uk et

no csro ur lv hu sl

cyko iw zhbg sr

eu lt

mr is ta ne

gu

bn ka vi

te si

pa

am knhy

ml

ckb ps

or

NumberAofATweets

Language

LanguageADistribution

69

Figure30Brusselsattacktweetscreationdatedistribution

Figure31BrusselsattacktweetswithURLsdistribution

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

3/22/16 3/23/16 3/24/16 3/25/16 3/26/16 3/27/16 3/28/16

Number2of2Tweets

Tweet2Creation2 Date

Tweets2Date2Distribution

0

50,000

100,000

150,000

200,000

250,000

300,000

350,000

400,000

450,000

500,000

3/22/16 3/23/16 3/24/16 3/25/16 3/26/16 3/27/16 3/28/16

Number2of2Tweets2with2URLs

Tweets2Creation2 Date

Tweets2w/2URLs2Date2Distribution

70

Figure32BrusselsattackseedURLsdomainsdistribution

Table11summarizestheresultsofthedifferentsettingsregardingtheseedURLsfortheBrusselsattackevent.Thefirstrowisforthecaseofhavingkuniquedomains.WechooseoneURLfromeachdomain.Thesecondrowisforthecaseofhavingk/2uniquedomains;wechoose2URLsfromeachdomain.ThedomainsfromwhichwechoosetheURLsaresortedaccordingtohowmanyURLsbelongtothem.SothetopkdomainsaretheonesthathavethelargestnumbersofURLsbelongingtothem.ThecolumnsrepresentthedifferentvaluesforK(thedesirednumberofURLswewanttoselectasseeds).Astheresultsshows,asweincreasethenumberofseeds,thefocusedcrawlerfindsmorerelevantwebpages(leadingtoahighharvestratio).Also,distributingtheseedURLsacrossseveralwebsitesincreasestheabilityofthefocusedcrawlertofindmorerelevantwebpages.Table12showsthewebpagesourcedistributionintheresultingcrawledwebpages.Astheresultsshow,usingURLswithmoreuniquesourcesproducescollectionswithamorediversesetofsources,andthisincreasesbyincreasingthenumberofseeds.Forexample,startingfrom10seedURLsfromuniquesourcesledtoacollectionof1000webpagesfrom68uniquesourcesincontrasttotheoriginal31uniquesources.

(Figure 32 chart "Domains Distribution"; x-axis: Domains; y-axis: Number of URLs)

Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   0.685    0.752    0.817
Top K/2 frequent websites with multiple URLs from same source    0.645    0.763    0.775

Table 12 Number of different domains in the output collections of crawling experiments for Brussels attack

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   68       74       117
Top K frequent websites with multiple URLs from same source      31       58       78
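The two quantities reported in these tables can be computed from a crawl log along the following lines. This is a hedged sketch: harvest ratio is taken in its usual sense (the fraction of crawled webpages judged relevant), while the log format, the function names, and the decision to count domains over all crawled pages rather than only relevant ones are assumptions made for illustration.

```python
from urllib.parse import urlparse


def harvest_ratio(crawled):
    """crawled: list of (url, is_relevant) pairs for every page fetched."""
    if not crawled:
        return 0.0
    relevant = sum(1 for _, is_relevant in crawled if is_relevant)
    return relevant / len(crawled)


def unique_domains(crawled):
    """Set of distinct domains (hosts) in the output collection."""
    return {urlparse(url).netloc.lower() for url, _ in crawled}
```

For a 1000-page crawl, `harvest_ratio` returns the kind of value shown in Table 11, and `len(unique_domains(...))` the kind of count shown in Table 12.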

Tables 13 and 14 show the results for the Oregon shooting event, Tables 15 and 16 show the results for the California shooting event, and Tables 17 and 18 show the results for the Orlando shooting event. For the Oregon shooting event, we see the same effect as for the Brussels attack event: having seed URLs from unique domains (rather than multiple URLs from the same domain) increases the ability of the focused crawler to find more relevant webpages. However, in the case K=100, there was no performance improvement. Both methods achieved the same harvest ratio (0.741), but the method using URLs from unique domains produced a collection with 135 unique domains, while the method using multiple URLs from the same domain produced a collection with only 96 unique domains. Another problem with the Oregon shooting collection is that we ran our experiments 10 months after the event happened. Most of the seeds and webpages about the event no longer exist (they return a 404 error as the HTTP response). This affects the performance of both crawlers; adding more seeds doesn't help in finding relevant webpages.

Table 13 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Oregon shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   0.717    0.744    0.741
Top K frequent websites with multiple URLs from same source      0.715    0.669    0.741


Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   65       93       135
Top K frequent websites with multiple URLs from same source      59       84       96

We also see the same behavior in the California shooting event, as shown in Tables 15 and 16. Crawling with URLs from unique domains achieves better performance than crawling with multiple URLs from the same domain. Increasing the number of seed URLs increases the performance of the focused crawler. Also, in the case K=100, both methods achieve similar harvest ratios (0.7818 and 0.7548; see Table 15), but the method using URLs from unique domains produced a collection with 129 unique domains, while the method using multiple URLs from the same domain produced a collection with only 101 unique domains. A better way is to choose seed URLs of type 4 (i.e., Hub URLs) (see Table 9), where the URL points to a webpage of relevant content and the webpage contains URLs to other relevant webpages. Since we are automating the process of selecting the seed URLs from the pool of URLs extracted from social media (i.e., Twitter), we think we achieved a reasonable performance with minimal or no manual work. This contrasts with the situation when seed URLs are curated manually.

Table 15 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for California shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   0.7      0.7604   0.7818
Top K frequent websites with multiple URLs from same source      0.66     0.7218   0.7548

Table 16 Number of different domains in the output collections of crawling experiments for California shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   60       97       129
Top K frequent websites with multiple URLs from same source      54       64       101

We examined the seed URLs produced in the case k=100 in both the Oregon and California shooting events. We noticed that although we selected the seed URLs from important domains, the webpages these URLs link to are of the authority/dead-end type, where the content of the webpages is relevant but they do not link to other relevant webpages. Thus adding more of these URLs didn't help the focused crawler find or reach more relevant webpages.

Finally, the Orlando shooting event leads to the same behavior as in the previous events. Adding more URLs from unique domains helps the focused crawler find more relevant webpages, as is shown in Table 17. In the case k=100, adding more unique domains achieved almost the same performance as the method of adding more URLs from the same domain (as for the previously mentioned events). A reasonable justification for that behavior is that when we pick URLs from the top 100 domains, the domain distribution gets wider, and we include domains that contribute only a low number of URLs (the tail of the distribution), compared to the most frequent domains at the top of the list. These domains do not link to more relevant URLs and thus don't help the focused crawler reach more relevant parts of the WWW. We verified that by examining the last 10 domains in the top 100 domains in the Orlando shooting event. We examined the number of URLs from these last 10 domains in the crawled webpages produced at the end of the crawl. We found that 8 out of the 10 domains had only 1 URL in the resulting collection of webpages.

An optimum selection of seed URLs is a trade-off between adding more unique domains and having type 4 seed URLs, which are highly relevant on their own and also link to other relevant URLs. We used the number of URLs from a domain to estimate the probability that a URL from that domain will link to other relevant URLs. A better way (but rather computationally expensive) is to build the Web graph out of the seed URLs and select only the ones that have higher out-degree or that optimize the coverage and diversity trade-off.
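One way to picture the graph-based alternative is a greedy selection over the out-links of candidate seeds. The sketch below is only an illustration under stated assumptions: the out-links of each candidate are assumed to have been fetched already, coverage is approximated by the number of new out-links a seed contributes, and diversity by a per-domain cap. The function name, the `max_per_domain` parameter, and the greedy strategy itself are hypothetical and were not evaluated in this work.

```python
from urllib.parse import urlparse


def pick_seeds_by_coverage(outlinks, k, max_per_domain=1):
    """outlinks: dict mapping candidate seed URL -> set of URLs it links to."""
    covered = set()      # out-links already reachable from chosen seeds
    per_domain = {}      # number of chosen seeds per domain
    chosen = []
    candidates = set(outlinks)
    while candidates and len(chosen) < k:
        # Take the candidate that adds the most not-yet-covered out-links.
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda u: len(outlinks[u] - covered))
        candidates.remove(best)
        domain = urlparse(best).netloc.lower()
        if per_domain.get(domain, 0) >= max_per_domain:
            continue  # diversity cap reached for this domain
        chosen.append(best)
        covered |= outlinks[best]
        per_domain[domain] = per_domain.get(domain, 0) + 1
    return chosen
```

Under this scheme a dead-end page (no out-links) is only selected after every hub that still adds new links, which matches the observation that authority/dead-end seeds contribute little to reaching further relevant webpages.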

Table 17 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Orlando shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   0.6      0.709    0.722
Top K frequent websites with multiple URLs from same source      0.5      0.6856   0.7158

Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting

Seed selection method                                            K=10     K=50     K=100
Top K frequent unique websites                                   157      181      220
Top K frequent websites with multiple URLs from same source      13       157      181


8 Conclusion and Future Work

We proposed a model and representation for events, and showed how to represent an event using our model. We calculated the weights of the three attributes of our event model by jointly optimizing two parameters (the number of keywords and the threshold value) to yield the best F1-score on a manually labeled (relevant and non-relevant) dataset of URLs and webpages about the California shooting. The results showed that the event model, with these weights employed, can effectively classify URLs and webpages as to their relevance to the event of interest.

We incorporated our event model into focused crawling and showed that our event model-based focused crawler built an event-related Web collection more effectively than the state-of-the-art best-first topic-only focused crawler on two different events: California shooting and Brussels attack. The results for small-scale experiments (collecting 1000 webpages from 38 and 23 seed URLs, respectively) showed that our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages about the two events (i.e., achieving a higher harvest ratio). We ran experiments for large-scale crawling (ranging from 50K to 500K webpages) on 7 different events: California shooting, Brussels attack, Oregon shooting, Egypt airplane crash, Panama papers, Paris attack, and Ecuador earthquake. We leveraged social media (i.e., Twitter) to extract and select seed URLs for crawling. Our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages.

We proposed and incorporated webpage source importance into our focused crawler. The results showed that using webpage source importance led to an event-related collection of equivalent quality, relative to the baseline, but required fewer sources. Finally, we showed the effect of the seed URLs on the quality of the resulting webpage collections. We demonstrated use of the source importance measure to curate and select high quality seed URLs from URLs extracted from social media content (tweets).


We have covered the five research questions and the related hypotheses. Research questions 1 and 2 (and hypothesis 1.1) were covered in Chapter 4: building the event model and representation and using it with the focused crawler. We covered research question 3 (and hypothesis 1.2) in Chapter 6, evaluating the effectiveness of our event model in focused crawling. Finally, we covered research questions 4 and 5 (and hypotheses 1.3 and 2) in Chapter 7, modeling webpage source importance and integrating it with the event model-based focused crawler.

8.1 Contributions

Our contributions in this dissertation research are:

1. Designing an event model that captures the information needed for representing events (with disaster events as a case study).

2. Developing an event-aware focused crawler that uses the event model for the target event and for webpage representation, as well as for developing a new similarity function that helps in webpage relevance estimation.

3. Designing a webpage source importance model and incorporating it into a focused crawler system.

4. Developing a new methodology for semi-automated seed URL generation from social media content.

5. Building an event digital library of event-related objects (text, metadata records, archives, and entities).

8.2 Future Work

Our event model has captured three attributes for an event (topic, location, and date). We plan to extend our event model by extracting and adding organizations and participants; that information will represent the 'Who' part in the 'Who did What, Where and When' event model. This will enrich our event model and consequently should increase the event model-based focused crawler's power to estimate and retrieve more relevant webpages. Further, with regard to focused crawling for large events, we are integrating our tweet collection efforts, which have already resulted in over 1.2 billion tweets spread across about 1000 collections, with follow-up focused crawling that starts with seeds that come from the URLs found in those tweets.

On the application side, we also plan to use our event model to analyze and summarize a collection of webpages; this can work for any collection about a particular event (e.g., prepared through manual curation, or using our event focused crawler [92]). Using our event model, we will generate a list of indicative sentences, and extract entities to represent and summarize an event. There are multiple algorithms and software implementations for text summarization, but we believe this concept of corpus/event summarization is new and worth investigating. Our preliminary study of such summarization suggests that results will have high quality and utility [92].

Further, we plan to run more experiments on different kinds of events and to test other heuristics for combining webpage source importance with event model-based relevance scores. Finally, we will build a knowledge base of sources for each type of event. The knowledge base will include a list of URLs, extracted from social media content about different types of events. The list of extracted URLs will be used to build a list of pairs: sources and their importance score (how many relevant URLs are from a source). This list could be used to compute priors for the source importance model during crawling.
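The planned list of (source, importance score) pairs could be derived roughly as follows. This is a speculative sketch of future work, not an existing component: the function names are hypothetical, and normalizing the counts into shares is just one assumed way to turn importance scores into priors.

```python
from collections import Counter
from urllib.parse import urlparse


def source_importance(relevant_urls):
    """Importance score per source: how many relevant URLs come from it."""
    return Counter(urlparse(url).netloc.lower() for url in relevant_urls)


def source_priors(relevant_urls):
    """Assumed prior for each source: its share of all relevant URLs."""
    counts = source_importance(relevant_urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}
```

A crawler could look up a newly discovered URL's domain in the returned dictionary before any page from that domain has been fetched, and fall back to a default value for domains not in the knowledge base.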


References

1. Farag, M.M.G. and E.A. Fox, Building and archiving event web collections: A focused crawler approach, in Bulletin of IEEE Technical Committee on Digital Libraries. 2015. p. 1-2.
2. Farag, M. and E.A. Fox, Focused Crawling For Events. International Journal of Digital Libraries, Special Issue of Web Archiving - In review, 2016.
3. Magdy, M. and E.A. Fox, Intelligent Event Focused Crawling, in The 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM) - Poster. 2014: University Park, Pennsylvania, USA.
4. O'Reilly, T., What is Web 2.0: Design patterns and business models for the next generation of software. Communications & strategies, 2007. 1(1): p. 17.
5. IDEAL. Integrated Digital Event Archive and Library. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/.
6. Internet_Archive. Internet Archive, A digital library of free content and wayback machine. 2016 [cited 2016 April 26]; Available from: https://archive.org/.
7. Farag, M., P. Nakate, and E.A. Fox, Big Data Processing of School Shooting Archives, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 271-272.
8. IDEAL_Collections. IDEAL Web Collections and Tweet Archives. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/eventstable.
9. Archive-It. Web archiving services for libraries and archives. 2016 [cited 2016 April 26]; Available from: https://archive-it.org/.
10. Mohr, G., et al. Introduction to Heritrix, an Archival Quality Web Crawler. in Proceedings of the 4th International Web Archiving Workshop (IWAW'04). 2004.
11. Fox, E.A. and J.P. Leidig, Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. 2014: Morgan & Claypool Publishers.
12. Fox, E.A. and R.d.S. Torres, Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. 2014: Morgan & Claypool Publishers.
13. Shen, R., M.A. Goncalves, and E.A. Fox, Key Issues Regarding Digital Libraries: Evaluation and Integration. 2013: Morgan & Claypool Publishers.
14. Fox, E.A., M.A. Goncalves, and R. Shen, Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. 2012: Morgan & Claypool Publishers.
15. Salton, G. and C. Buckley, Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988. 24(5): p. 513-523.
16. Salton, G. and M.J. McGill, Introduction to Modern Information Retrieval. 1986: McGraw-Hill, Inc.
17. Pant, G., P. Srinivasan, and F. Menczer, Crawling the web, in Web Dynamics. 2004, Springer. p. 153-177.
18. Manning, C.D., et al., Introduction to Information Retrieval. 2008: Cambridge University Press. 496.

19. Archive-It. Archive-It Collections, Spontaneous Events. 2016 [cited 2016 July]; Available from: https://archive-it.org/explore?show=Collections&fc=meta_Subject%3ASpontaneousevents.
20. Yang, S., et al. A study of automation from seed URL generation to focused web archive development: the CTRnet context. in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. 2012. ACM.
21. IDEAL. IDEAL Tweet Collections. 2016 [cited 2016 August 24]; Available from: http://hadoop.dlib.vt.edu/.
22. Fox, E.A. and M.M. Farag, Report on the workshop on web archiving and digital libraries (WADL 2013). SIGIR Forum, 2013. 47(2): p. 128-133.
23. Fox, E.A., Z. Xie, and M. Klein, WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 293-294.
24. Fox, E.A. and Z. Xie, Web Archiving and Digital Libraries (WADL), in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015, ACM: Knoxville, Tennessee, USA. p. 303-303.
25. WARC. Information and documentation -- WARC file format - ISO 28500:2009. 2016 [cited 2016 August 24]; Available from: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717.
26. Chakrabarti, S., M. Van den Berg, and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 1999. 31(11): p. 1623-1640.
27. Batsakis, S., E.G. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data & Knowledge Engineering, 2009. 68(10): p. 1001-1013.
28. Pant, G. and P. Srinivasan, Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems (TOIS), 2005. 23(4): p. 430-462.
29. Rennie, J. and A. McCallum. Efficient web spidering with reinforcement learning. in Proceedings of the International Conference on Machine Learning. 1999. Citeseer.
30. Grigoriadis, A. and G. Paliouras, Focused crawling using temporal difference-learning, in Methods and Applications of Artificial Intelligence. 2004, Springer. p. 142-153.
31. Singh, N., et al. Large Scale URL-based Classification Using Online Incremental Learning. in Machine Learning and Applications (ICMLA), 2012 11th International Conference on. 2012. IEEE.
32. Menczer, F. and A.E. Monge, Scalable web search by adaptive online agents: An InfoSpiders case study, in Intelligent Information Agents. 1999, Springer. p. 323-347.
33. Meusel, R., P. Mika, and R. Blanco, Focused Crawling for Structured Data, in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 2014, ACM: Shanghai, China. p. 1039-1048.

34. Ehrig, M. and A. Maedche. Ontology-focused crawling of Web documents. in Proceedings of the 2003 ACM symposium on Applied computing. 2003. ACM.
35. Dong, H., F.K. Hussain, and E. Chang. A survey in semantic web technologies-inspired focused crawlers. in Digital Information Management, 2008. ICDIM 2008. Third International Conference on. 2008. IEEE.
36. Yang, S., et al., CTRnet DL for disaster information services, in Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries. 2011, ACM: Ottawa, Ontario, Canada. p. 437-438.
37. Vural, A.G., B.B. Cambazoglu, and P. Karagoz, Sentiment-focused web crawling. ACM Transactions on the Web (TWEB), 2014. 8(4): p. 22.
38. Fu, T., et al., Sentimental spidering: leveraging opinion information in focused crawlers. ACM Transactions on Information Systems (TOIS), 2012. 30(4): p. 24.
39. Almpanidis, G., C. Kotropoulos, and I. Pitas, Combining text and link analysis for focused crawling - An application for vertical search engines. Information Systems, 2007. 32(6): p. 886-908.
40. Diligenti, M., et al. Focused Crawling Using Context Graphs. in VLDB. 2000.
41. Pant, G. and P. Srinivasan, Link contexts in classifier-guided topical crawlers. Knowledge and Data Engineering, IEEE Transactions on, 2006. 18(1): p. 107-122.
42. Kleinberg, J.M., et al. The web as a graph: measurements, models, and methods. in International Computing and Combinatorics Conference. 1999. Springer.
43. Brin, S. and L. Page, Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 2012. 56(18): p. 3825-3833.
44. Page, L., et al., The PageRank citation ranking: bringing order to the web. 1999: Technical Report. Stanford InfoLab.
45. De Assis, G.T., et al. Exploiting genre in focused crawling. in String Processing and Information Retrieval. 2007. Springer.
46. Pant, G. and P. Srinivasan, Predicting web page status. Information Systems Research, 2010. 21(2): p. 345-364.
47. Pant, G. and P. Srinivasan, Status Locality on the Web: Implications for Building Focused Collections. Information Systems Research, 2013. 24(3): p. 802-821.
48. Chen, Y., A novel hybrid focused crawling algorithm to build domain-specific collections. 2007, Virginia Polytechnic Institute and State University.
49. Allan, J., Introduction to topic detection and tracking, in Topic detection and tracking. 2002, Springer. p. 1-16.
50. Volkova, S., et al., Animal disease event recognition and classification. Using Web Data in the Medical Domain, 2010: p. 54.
51. Westermann, U. and R. Jain, Toward a common event model for multimedia applications. IEEE MultiMedia, 2007. 14(1): p. 19-29.
52. Strötgen, J., M. Gertz, and C. Junghans. An event-centric model for multilingual document similarity. in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. ACM.
53. Li, Z., et al., A probabilistic model for retrospective news event detection, in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. 2005, ACM: Salvador, Brazil. p. 106-113.

54. Ha-Thuc, V., et al. News event modeling and tracking in the social web with ontological guidance. in Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on. 2010. IEEE.
55. Ha-Thuc, V., et al. A relevance-based topic model for news event tracking. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 2009. ACM.
56. Becker, H., M. Naaman, and L. Gravano, Beyond trending topics: Real-world event identification on Twitter, in Fifth International AAAI Conference on Weblogs and Social Media. 2011: Barcelona, Spain.
57. Parikh, R. and K. Karlapalem, ET: events from tweets, in Proceedings of the 22nd International Conference on World Wide Web. 2013, ACM: Rio de Janeiro, Brazil. p. 613-620.
58. Ritter, A., et al., Open domain event extraction from Twitter, in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012, ACM: Beijing, China. p. 1104-1112.
59. Ritter, A., et al., Weakly Supervised Extraction of Computer Security Events from Twitter, in Proceedings of the 24th International Conference on World Wide Web. 2015, ACM: Florence, Italy. p. 896-905.
60. Strötgen, J., M. Gertz, and C. Junghans, An event-centric model for multilingual document similarity, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 953-962.
61. Yom-Tov, E. and F. Diaz, Location and timeliness of information sources during news events, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 1105-1106.
62. Lakka, C., et al., A Bayesian network modeling approach for cross media analysis. Signal Processing: Image Communication, 2011. 26(3): p. 175-193.
63. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015. Knoxville, Tennessee, USA.
64. AlNoamany, Y., M.C. Weigle, and M.L. Nelson, Detecting off-topic pages in web archives. Research and Advanced Technology for Digital Libraries, Springer, 2015: p. 225-237.
65. Menczer, F., et al. Evaluating topic-driven web crawlers. in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 2001. ACM.
66. Menczer, F., G. Pant, and P. Srinivasan, Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology (TOIT), 2004. 4(4): p. 378-419.

67. Srinivasan, P., F. Menczer, and G. Pant, A general evaluation framework for topical crawlers. Information Retrieval, 2005. 8(3): p. 417-447.
68. Borlund, P., The concept of relevance in IR. Journal of the American Society for Information Science and Technology, 2003. 54(10): p. 913-925.
69. Hjørland, B., The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 2010. 61(2): p. 217-237.
70. Schamber, L., Relevance and Information Behavior. Annual review of information science and technology (ARIST), 1994. 29: p. 3-48.
71. Saracevic, T., Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 2007. 58(13): p. 2126-2144.
72. Mizzaro, S., Relevance: The whole history. JASIS, 1997. 48(9): p. 810-832.
73. Voorhees, E.M., Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management, 2000. 36(5): p. 697-716.
74. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2015. ACM.
75. Batsakis, S., E.G.M. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data Knowl. Eng., 2009. 68(10): p. 1001-1013.
76. Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975. 18(11): p. 613-620.
77. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems, 1998. 30(1-7): p. 107-117.
78. Cho, J., H. Garcia-Molina, and L. Page, Efficient Crawling Through URL Ordering, in Seventh International World-Wide Web Conference (WWW 1998). 1998: Brisbane, Australia.
79. Heydon, A. and M. Najork, Mercator: A scalable, extensible web crawler. World Wide Web, 1999. 2(4): p. 219-229.
80. Kobayashi, M. and K. Takeda, Information retrieval on the web. ACM Comput. Surv., 2000. 32(2): p. 144-173.
81. Chakrabarti, S., Mining the Web: Discovering Knowledge from HyperText Data. 2002: Science and Technology Books. 350.
82. Castillo, C., Effective web crawling. SIGIR Forum, 2005. 39(1): p. 55-56.
83. Chakrabarti, S., K. Punera, and M. Subramanyam, Accelerated focused crawling through online relevance feedback, in Proceedings of the 11th international conference on World Wide Web. 2002, ACM: Honolulu, Hawaii, USA. p. 148-159.
84. Atkinson, M.D., et al., Min-max heaps and generalized priority queues. Communications of the ACM, 1986. 29(10): p. 996-1000.
85. Min-max_heap. Min-max heap. 2016 [cited 2016 August 24]; Available from: https://en.wikipedia.org/wiki/Min-max_heap.
86. Yom-Tov, E. and F. Diaz, Out of sight, not out of mind: on the effect of social and physical detachment on information need, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 385-394.
87. Foley, J., M. Bendersky, and V. Josifovski, Learning to Extract Local Events from the Web, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015, ACM: Santiago, Chile. p. 423-432.
88. Baeza-Yates, R. and B. Ribeiro-Neto, Modern information retrieval. Vol. 463. 1999: ACM press, New York.
89. Aggarwal, C.C., F. Al-Garawi, and P.S. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. in Proceedings of the 10th international conference on World Wide Web. 2001. ACM.
90. Liu, H., J. Janssen, and E. Milios, Using HMM to learn user browsing patterns for focused web crawling. Data & Knowledge Engineering, 2006. 59(2): p. 270-291.
91. Farag, M. and E.A. Fox, Which webpage should we crawl first? Social media-based webpage source importance guidance, in Workshop on Web Archiving and Digital Libraries (WADL 2016) - Joint Conference on Digital Libraries (JCDL 2016). 2016: Newark, NJ, USA.
92. Farag, M. and E.A. Fox, Web archive content analysis. 2015: Presented at International Internet Preservation Consortium General Assembly IIPC 2015, California, USA.