
Intelligent Event Focused Crawling

Mohamed Magdy Gharib Farag

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in Computer Science and Applications

Edward A. Fox, Co-Chair
Riham H. Mansour, Co-Chair

Weiguo Fan
Scotland C. Leman
Padmini Srinivasan

August 11, 2016
Blacksburg, VA

Keywords: Focused Crawling, Event Modeling, Web Archives, Digital Libraries, Source Importance, Social Media, Seed Generation


ABSTRACT

There is need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend the traditional focused (topical) crawling techniques in two directions, modeling and representing: events and webpage source importance.

We developed an event model that can capture key event information (topical, spatial, and temporal). We incorporated that model into the focused crawler algorithm. For the focused crawler to leverage the event model in predicting a webpage’s relevance, we developed a function that measures the similarity between two event representations, based on textual content.

Although the textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance by the ratio of number of relevant webpages to non-relevant webpages found during crawling a website. We combined the textual content information and source importance into a single relevance score.

For the focused crawler to work well, it needs a diverse set of high quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We proposed an automated approach for curating seed URLs using social media content. We leveraged the richness of social media content about events to extract URLs that can be used as seed URLs for further focused crawling.

We evaluated our system through four series of experiments, using recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first experiment series our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better results in precision, recall, and F1-score. In the second series, using harvest ratio to measure ability to collect relevant webpages, our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Acknowledgments

First I would like to thank my advisors Dr. Edward A. Fox and Dr. Riham Mansour for all the continuous help, support, encouragement, patience, motivation, and guidance that they gave me through my Ph.D. journey. I could not have imagined having better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank the rest of my thesis committee - Prof. Padmini Srinivasan, Prof. Patrick Weiguo Fan, and Prof. Scotland Leman - for their insightful comments and encouragement, but also for the hard questions which motivated me to widen my research from a variety of perspectives.

My sincere thanks also go to Seth Peery, Luke Ward, Shane Coleman, and Andi Ogier, who provided me an opportunity to join their team as an intern, and who helped me learn new technologies, extend my skills, and apply my research experience to practical problems.

I thank my fellow labmates: Seungwon Yang, Sunshin Lee, Venkat Srinivasan, Tarek Kanan, Sung Hee Park, Monica Akbar, Kiran Chitturi, Prashant Chandrasekar, Eric Fouh, and all other lab members I forgot to mention, for the stimulating discussions, for the sleepless nights we were working together before deadlines, and for all the fun we have had in the last five years.

I would like to thank my wife Samar ElSaadawy. No words or thanks would suffice or give her what she deserves. She suffered a lot for me. She supported me in all the different stages of my Ph.D. without complaining. Without her, I wouldn’t have done anything. Thank you my hero Samar ElSaadawy.

Last but not least, I would like to thank my family - my parents and my brother and sisters - for supporting me spiritually throughout my Ph.D. and my life in general.

Thanks go to NSF for support, especially through grants IIS-1619028, IIS-1619371, IIS-1319578, DUE-1141209, IIS-0916733, and IIS-0736055. Thanks also go to Virginia Tech’s Digital Library Research Laboratory and Department of Computer Science.

Table of Contents

ABSTRACT
Acknowledgments
Table of Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Hypotheses
  1.3 Research Questions
  1.4 5S
  1.5 Thesis Organization
2 Related Work
  2.1 Web Crawling
  2.2 The IDEAL project
  2.3 Web Archiving and Archive-It service
  2.4 Focused Crawling
    2.4.1 Machine Learning
    2.4.2 Semantic Similarity
    2.4.3 Content and Link Analysis
  2.5 Event Modeling
  2.6 Social Media and Focused Crawling Seed Selection
  2.7 Evaluating topical/focused crawlers
3 Focused Crawling
  3.1 Topic Representation
  3.2 Crawler Architecture
  3.3 Large Scale Design Considerations
4 Event Focused Crawler
  4.1 Event Model and Representation
    4.1.1 Event Modeling
  4.2 Event Processing
    4.2.1 Event Model-based Webpage Scoring
    4.2.2 Calculating The Weights
    4.2.3 Event Model-based URL Scoring
5 Experimental Setup
  5.1 Datasets
  5.2 Experiments
  5.3 Evaluation Metrics
6 Results
  6.1 Event Model-based vs. Topic-Only Classification
    6.1.1 Classifying URLs and webpages about California shooting
  6.2 Event Model-based vs. Topic-only Focused Crawler
    6.2.1 California Shooting
    6.2.2 Brussels Attack
    6.2.3 Oregon shooting
    6.2.4 Egyptair plane crash
    6.2.5 Panama papers
    6.2.6 Orlando shooting
    6.2.7 Paris attacks
    6.2.8 Ecuador earthquake
7 Webpage Source Importance and Social media-based Seed Selection
  7.1 Webpage Source Importance
  7.2 Seed URLs for crawling
  7.3 Semi-automated Social Media-based Seed URL Generation
    7.3.1 Selecting Seed URLs
    7.3.2 Seeds URL Domain/Source Importance
8 Conclusion and Future Work
  8.1 Contributions
  8.2 Future Work
References

List of Figures

Figure 1 Overview of IDEAL system and role of event focused crawling
Figure 2 Workflow for creating Web archives from social media (Twitter)
Figure 3 Seed URLs manual curation using Archive-It service
Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box
Figure 5 Baseline focused crawler algorithm
Figure 6 Steps of building event model from seed webpages
Figure 7 The steps for calculating the score of a webpage
Figure 8 An example webpage with a relevant URL anchor text highlighted
Figure 9 The steps for calculating the score of a URL
Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment
Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages
Figure 12 California shooting URLs evaluation at different threshold values
Figure 13 California shooting webpages evaluation at different threshold values
Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting
Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)
Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack
Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)
Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)
Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egyptair plane crash (10K webpages)
Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)
Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)
Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)
Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)
Figure 24 Effect of source importance on event focused crawling for Brussels attack event
Figure 25 Effect of source importance on event focused crawling for California shooting event
Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event
Figure 27 Effect of source importance on event focused crawling for Orlando shooting event
Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets
Figure 29 Brussels attack tweets language distribution
Figure 30 Brussels attack tweets creation date distribution
Figure 31 Brussels attack tweets with URLs distribution
Figure 32 Brussels attack seed URLs domains distribution

List of Tables

Table 1 Example events of different event types
Table 2 List of events used in the large-scale crawling experiments
Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score
Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event
Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event
Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event
Table 7 California shooting event model
Table 8 Brussels attack event model
Table 9 Different types of seed URLs
Table 10 Brussels attack tweet collection statistics
Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack
Table 12 Number of different domains in the output collections of crawling experiments for Brussels attack
Table 13 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Oregon shooting
Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting
Table 15 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for California shooting
Table 16 Number of different domains in the output collections of crawling experiments for California shooting
Table 17 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Orlando shooting
Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting

1 Introduction

1.1 Motivation

There is need for an integrated event focused crawling system [1-3] to collect Web data about key events. Events lead to our most poignant memories. We remember birthdays, graduations, holidays, weddings, and other events that mark stages of our life, as well as the lives of family and friends. As a society we remember assassinations, natural disasters, man-made disasters, political uprisings, terrorist attacks, and wars -- as well as elections, heroic acts, sporting events, and other events that shape community, national, and international opinions. Web and Twitter content describes many of these societal events. In part, Web 2.0 [4] is a highly responsive sensor of important occurrences in the real world, since people from across the globe meet virtually and share related observations and stories online. We can leverage this stream of data, for automatic collection of events, to trigger event archiving, and later to enable event related services that support communities.

Permanent storage and access to big data collections of event related digital information, including webpages, tweets, images, videos, and sounds, could lead to an important international asset. Regarding that asset, there is need for digital libraries (DLs) providing immediate and effective access, and archives with historical collections that aid science and education, as well as studies related to economic, military, or political advantage.

When something notable occurs, many users try to locate the most up-to-date information about that event. Later, researchers, scholars, students, and others seek information about similar events, sometimes for cross-event comparisons or trend analyses. Yet, there is little systematic collecting and archiving anywhere of information about events, except when national or state events are captured as part of government related Web archives. This is the need addressed by the Integrated Digital Event Archive and Library (IDEAL) project [5].

Though the Internet Archive [6] supports some event-oriented archiving, coverage is limited. Many important events are ignored, while others are only captured in part. Some groups collect data only until the last victim is rescued, while others start late and miss early posts. Further, tools for capture are complex, and few archivists master their features, so achieving high recall is expensive. There are few mechanisms to filter out noise in collections. Access to the resulting archives is awkward and inefficient due to the fact that much of the content captured is non-relevant [7]. We argue that manual curation of seed URLs is not scalable and not fully effective for archiving events that have high impact.

Thus, improved technology is needed. The IDEAL project is developing a digital library/archive supporting automatic event tracking, crawling, and archiving, as well as effective access (in the sense of aiding in the finding and utilization of relevant high quality information). Figure 1 shows an overview of the IDEAL project. By taking input from tweets, news, webpages, (micro)blogs, and queries, our system will collect and archive event related digital objects, and provide a broad range of helpful services.

Figure 1 Overview of IDEAL system and role of event focused crawling

This dissertation focuses on the data front end of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. The IDEAL project has around 11 TB of webpage archives (WARC files) and over 1.2 billion tweets across hundreds of different events [8]. Early on, the webpage archives were collected using the Internet Archive’s [6] Archive-It service [9], which uses the Heritrix [10] tool for archiving webpages. Originally, the IDEAL project manually prepared a list of URLs for events, and fed it to the Archive-It service for crawling. The problem with this weakly curated approach is that we produced collections with low precision (i.e., with few relevant and many non-relevant webpages); Heritrix is a general Web crawler and doesn’t analyze the textual content of the webpages before downloading them. To overcome this problem, the IDEAL team shifted to another approach: we extracted URLs from tweet archives that were built about events, and downloaded only the corresponding webpages. The resulting collections have high precision (most of the webpages are relevant) but low recall (not all of the relevant webpages are found).

A focused crawler would help solve each of the previous problems by crawling (to increase recall) the WWW starting from the URLs extracted from the tweet archives but then following only the relevant webpages (to enhance precision) in order to find and collect as much relevant information as possible. However, in the past, focused crawling was mainly applied to topical crawling, i.e., collecting webpages about a certain topic or domain. Accordingly, since we are working with events, we extended the approach previously used by the IDEAL team and adapted/changed the traditional focused crawler approach to accommodate our needs.
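To make the idea of "following only the relevant webpages" concrete, the following is a minimal sketch of a best-first focused crawl over a toy in-memory link graph. It is not the crawler developed in this dissertation: the names TOY_WEB, keyword_score, and focused_crawl, the keyword-overlap relevance function, and the 0.5 threshold are all assumptions made for illustration, and a real crawler would fetch pages over HTTP rather than from a dictionary.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy "web": URL -> (page text, outgoing links). A real crawler
# would download pages instead of reading them from this dictionary.
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "a": ("shooting in california", ["b", "c"]),
    "b": ("california shooting victims", ["d"]),
    "c": ("celebrity gossip", ["e"]),
    "d": ("police report on the shooting", []),
    "e": ("more gossip", []),
}


def keyword_score(text: str, keywords: Iterable[str]) -> float:
    """Fraction of the keywords that occur in the text (placeholder relevance)."""
    words = set(text.lower().split())
    keys = list(keywords)
    return sum(k in words for k in keys) / len(keys) if keys else 0.0


def focused_crawl(seeds, fetch, score, threshold=0.5, max_pages=100):
    """Best-first crawl: pop the highest-priority URL, expand only relevant pages."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited += 1
        relevance = score(text)
        if relevance < threshold:
            continue  # non-relevant page: neither keep it nor follow its links
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Unvisited links inherit the parent's relevance as their priority.
                heapq.heappush(frontier, (-relevance, link))
    return collected


pages = focused_crawl(
    seeds=["a"],
    fetch=lambda u: TOY_WEB[u],
    score=lambda t: keyword_score(t, ["shooting"]),
)
print(pages)  # ['a', 'b', 'd']
```

In this toy run the off-topic page "c" is fetched but discarded, so its outlink "e" is never visited; this is the mechanism by which a focused crawler trades a little recall for higher precision.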

Table 1 Example events of different event types

Event Types         Example Events
Bombing             Boston bombing
Building Collapse   East Harlem building collapse
Community           Flint water crisis, Love wins (same sex marriage), World cup
Earthquake          Ecuador, Japan
Fire                California wildfire, Brazil nightclub fire
Flood               Texas Floods
Hurricane           Joaquin, Sandy, Katrina
Plane Crash         Egyptair, Russian, Germanwings
Political Conflict  Brexit, Turkey coup, Greece Bailout referendum
Protests/Riots      Ferguson, Egyptian revolution
Scandal             Panama Papers, Sepp Blatter
Shooting            Oregon college shooting, California shooting, Orlando club shooting
Terrorist Attack    Paris, Brussels, Nice truck attack
Train Derailment    Amtrak 188, Quebec

There are several definitions of an event that differ according to discipline (see Chapter 4 for more details). In this dissertation we are interested in unusual real world events that capture much attention, leading to a high volume of content generated on the WWW about the event. Thus we consider different types of events like: airplane crashes, building collapses, earthquakes, floods, fires, hurricanes, political crises, scandals, shootings, terrorist attacks, and train crashes/derailments. All these types of events share the same characteristics: something happens at a specific physical location during a specific period of time. Example events for each event type are shown in Table 1. This is not an exhaustive list of events and their types, but rather a representative list of what we have been working on in this dissertation. Our research should generalize to any event with the same characteristics (i.e., something unusual happens at a specific place on a specific date).

We proposed five major changes to traditional topical focused crawling:

1. Implementing an event model and representation,
2. Incorporating the event model information extracted from seed webpages’ content into focused crawling,
3. Developing a webpage source importance model,
4. Incorporating the webpage source importance model into focused crawling,
5. Automating the process of seed URLs selection.

Our proposed approach intelligently combines the different aspects of an event to heuristically estimate the relevance to the event of a URL and/or a webpage. Our intelligent event focused crawler distinguishes on-topic URLs/webpages from off-topic ones (as the baseline topic-only approach does), and also distinguishes on-event URLs/webpages from off-event URLs/webpages, a distinction on which the baseline topic-only approach fails. For example, consider the California shooting event. Both approaches (our intelligent event focused crawler and the baseline topical focused crawler) successfully identify on-topic URLs/webpages (i.e., URLs/webpages about a shooting). However, our event focused crawler also intelligently distinguishes between URLs/webpages that are on-event (e.g., relevant to the California shooting) and URLs/webpages that are off-event (but still on-topic, i.e., about a shooting).

We conducted four series of experiments to evaluate our system using a set of recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attacks, and Oregon shooting. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model outperformed the topic-only approaches; it showed better results in precision, recall, and F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search); it showed better results in harvest ratio. The third experiment series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance but from a smaller set of sources. The fourth experiment series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Our contributions from this research are:

1- A model and representation for capturing the different aspects of events in webpages (topic, location, and date);
2- An extended focused crawler approach that uses our event model to represent content and to estimate the relevance of webpages;
3- An automated approach for social media-based seed URL selection;
4- A method to estimate the value of webpages based on their source importance;
5- An extended focused crawler approach that integrates, for each webpage, the textual content information based on our event model, along with the webpage source importance, into a single relevance score.
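As an illustration of how a (topic, location, date) event representation can separate on-event pages from pages that are merely on-topic, the following is a minimal sketch. It is an assumption-laden toy, not the similarity function developed in Chapter 4: the Event class, the Jaccard overlap, the linear date decay over a 7-day window, the weights 0.5/0.3/0.2, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class Event:
    topic: FrozenSet[str]      # topical keywords
    locations: FrozenSet[str]  # place names
    when: date                 # event date


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def date_closeness(d1: date, d2: date, window_days: int = 7) -> float:
    """1.0 for the same day, falling linearly to 0.0 at window_days apart."""
    gap = abs((d1 - d2).days)
    return max(0.0, 1.0 - gap / window_days)


def event_similarity(a: Event, b: Event,
                     w_topic: float = 0.5,
                     w_loc: float = 0.3,
                     w_date: float = 0.2) -> float:
    """Weighted sum of the three aspect similarities (weights are illustrative)."""
    return (w_topic * jaccard(a.topic, b.topic)
            + w_loc * jaccard(a.locations, b.locations)
            + w_date * date_closeness(a.when, b.when))


# Illustrative values only: a target event and two candidate page representations.
target = Event(frozenset({"shooting", "victims"}), frozenset({"california"}), date(2015, 12, 2))
on_event = Event(frozenset({"shooting", "victims"}), frozenset({"california"}), date(2015, 12, 3))
off_event = Event(frozenset({"shooting", "victims"}), frozenset({"oregon"}), date(2015, 10, 1))

print(event_similarity(target, on_event) > event_similarity(target, off_event))  # True
```

Both candidates match the target perfectly on topic, so a topic-only score cannot tell them apart; only the location and date terms push the on-event candidate above the off-event one.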

The following subsections introduce the hypotheses tested through this dissertation, the research questions investigated, the 5S approach used to guide our work, and the organization of the following chapters.

1.2 Hypotheses

This research considers the front end of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. We test two hypotheses about focused crawling, the first of which has three parts:

H1: Using several sources of evidence of webpage relevance, including content-based features, will help identify more of just the relevant webpages.

H1.1: Using temporal and spatial information as part of modeling and representing events leads to better descriptions of events than the topic/keyword-based approach.

H1.2: Incorporating event information extracted from a webpage’s content into focused crawling will increase effectiveness, helping identify more of the relevant webpages while maintaining high levels of precision.

H1.3: A webpage belonging to an important website (e.g., www.cnn.com) should lead to a larger number of relevant webpages linked to it. The source website (e.g., www.cnn.com) is acting like a hub which leads to relevant webpages.

H2: Integrating event information and webpage source importance will improve relevance prediction power, ensure balance in types of content crawled, and reduce bias.
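Hypotheses H1.3 and H2 can be illustrated with a small sketch of source importance, which the abstract describes as the ratio of relevant to non-relevant webpages found while crawling a website, combined with the content score into a single relevance score. The ratio itself comes from the text; everything else below is an assumption made for illustration: the SourceImportance class, the add-one smoothing that avoids division by zero, the squashing of the ratio into [0, 1), the linear mixing weight alpha, and the example.org URL.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SourceImportance:
    """Tracks, per website, how many crawled pages were relevant or not."""

    def __init__(self) -> None:
        self.relevant = defaultdict(int)
        self.non_relevant = defaultdict(int)

    @staticmethod
    def domain(url: str) -> str:
        return urlparse(url).netloc.lower()

    def record(self, url: str, is_relevant: bool) -> None:
        counts = self.relevant if is_relevant else self.non_relevant
        counts[self.domain(url)] += 1

    def ratio(self, url: str) -> float:
        """Relevant / non-relevant for the URL's site, with add-one smoothing (assumption)."""
        d = self.domain(url)
        return (self.relevant[d] + 1) / (self.non_relevant[d] + 1)


def combined_score(content_score: float, ratio: float, alpha: float = 0.7) -> float:
    """Mix content relevance with source importance; the linear form is an assumption."""
    importance = ratio / (1.0 + ratio)  # squash the unbounded ratio into [0, 1)
    return alpha * content_score + (1.0 - alpha) * importance


tracker = SourceImportance()
for url, rel in [
    ("http://www.cnn.com/a", True),
    ("http://www.cnn.com/b", True),
    ("http://www.cnn.com/c", False),
    ("http://example.org/x", False),
]:
    tracker.record(url, rel)

hub_ratio = tracker.ratio("http://www.cnn.com/new")    # (2 + 1) / (1 + 1) = 1.5
other_ratio = tracker.ratio("http://example.org/new")  # (0 + 1) / (1 + 1) = 0.5
print(combined_score(0.6, hub_ratio) > combined_score(0.6, other_ratio))  # True
```

With identical content scores, the not-yet-crawled URL on the site that has already yielded more relevant pages receives the higher combined score, which is the hub behavior that H1.3 anticipates.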

1.3 Research Questions

This dissertation addresses five research questions, listed below, that map, as shown in parentheticals, to the hypotheses given above.

R1: How to model and represent an event? (H1.1)
R2: How to compare two event representations? (H1.1)
R3: What is the effect of introducing event handling on the performance of focused crawling? (H1.2)
R4: How to model and represent webpage source importance? (H1.3)
R5: How to integrate handling of events and information sources? (H2)

1.4 5S

Our system is designed taking into consideration the 5S framework [11-14] for developing digital libraries. We describe here the different dimensions of the 5S framework and how they are applied in our system. We are using the 5S (societies, scenarios, spaces, structures, streams) digital library framework for two main reasons. First, 5S provides a checklist and design guidelines that help in the unfolding of our research. Second, the main goal of our system is to build an event digital library, which is a key part of the IDEAL project, which provides services (by implementing scenarios) for a variety of users (societies).

We demonstrate how focused crawling adds content (digital objects) to the IDEAL event digital library. The key digital objects are tweets and webpages, that each can be viewed as a combination of stream and structure. Another of the main objects in our focused crawler is the event object, a combination of stream and structure and space. The 5S model allows us to treat events as first class objects (i.e., event objects can be created, stored, sorted on different attributes, searched, and visualized, based on attributes like location). A main design consideration of our focused crawler is the ability to receive an event object as input, and then start crawling the WWW for webpages that are relevant to that event object, thereby implementing a variant of the Web crawling scenario. More discussion of 5S and its connection to this dissertation research follows.

Societies: The system can serve different stakeholders like: historians, those in the general public, researchers, analysts, responders, event participants, and decision makers. In addition, there are software agents; see more below under Services.

Scenarios: In addition to activities of software (see more below under Services), the following will be supported:

• A user can start building a collection using the event focused crawler. The user should provide a set of URLs, collect the output of a search engine, or sample from social media content that is provided as input to our system.

• A user can visualize a certain event or group of events. Several visualization schemes are provided, including maps and timelines.

• A user can browse the system content according to facets or other mechanisms, considering: event categories, content type, entities, and archives.

• A user can search all system content types.

• A user can search a certain event collection or several collections.

Spaces: Objects in our system can be viewed in several spaces:

• Webpage documents are represented using the vector space model [15, 16].

• Events are represented in an event space model (topic, location, and date).

• Two- or three-dimensional interfaces aid user interaction.

• Probability spaces aid with characterizing interdependencies and drawing inferences.

Structures: Organization of objects in our system can be performed according to:

• Event category (earthquake, flood, hurricane, shooting, plane crash, etc.), which can fit within a taxonomy, ontology, or other type of structure;

• Data stream type/schema (for text, image, audio, video, event record, archive); or

• Entity (location, date, person name, organization name).

Representations include (metadata) records in databases, graphs, trees, etc.

Streams: These include: webpage texts, Web archives, crawl logs (data recording how the focused crawler created the collection), images, videos, audio files, event summaries or descriptions, and other similar entities associated with events. We considered events as first-class objects and created characterizations for each event. Using event-related streams, you can search for an event, browse events, visualize events, etc.

Services (not one of the 5 Ss, but related to Societies and Scenarios, and helpful in describing a DL): Our system will provide services including:

• Creating or transforming collections/archives;

• Analyzing collections (extracting entities, providing summaries, classifying, clustering);

• Searching all kinds of data streams;

• Browsing according to event categories, data stream types, and/or entities; and

• Visualizing content.

1.5 Thesis Organization
This dissertation is organized as follows. Chapter 2 reviews related work, including different approaches for focused crawling. Chapter 3 discusses the architecture of the baseline topical/focused crawler. In Chapter 4, we propose our new event model and representation, and explain how it is integrated with the focused crawling approach. Chapter 5 covers the design of experiments performed, while Chapter 6 presents the evaluation of our event model-based focused crawler and of the baseline focused crawler. Chapter 7 explains the webpage source importance model and its effect on our focused crawler. Finally, Chapter 8 concludes and discusses issues for future research.

2 Related Work
We discuss the different fields related to our work in a top-down manner. First we discuss the bigger general topic of Web crawling. Then we discuss the IDEAL project and Web archiving techniques. Then we discuss focused crawler techniques. Most of the work done in traditional topical focused crawling falls into one of three categories: machine learning, semantic similarity, or content and link analysis. We discuss the major work done in these three categories in the next subsections. Along the way, we also touch on publications related to event modeling, social media integration with focused crawling, seed selection, and finally focused crawling evaluation.

2.1 Web Crawling
Web crawlers [17] are software programs that traverse the WWW following the links on the webpages. A crawler models the WWW as a graph where nodes are webpages and edges between nodes are the hyperlinks that manifest in the webpages. A crawler starts from a set of webpages, called the seed set, and follows the links on those webpages. It downloads the corresponding webpages, extracts the links in them, and then repeats the whole cycle. A crawler keeps two data structures that facilitate the crawling process [18], the URL queue (frontier) and the visited URLs list. The crawler uses the frontier to keep the URLs that are extracted from webpages but not visited yet, and the visited URLs list to keep track of the URLs that were visited so it won’t visit them again (or to control the frequency of visiting them again).

Web crawlers have been used by search engines to collect as many webpages as possible. The search engine parses the collected webpages, extracts the text, and builds a search index [18]. The index is the main element used to support the searching service.
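The interplay of the frontier and the visited URLs list can be illustrated with a minimal sketch. It assumes a toy in-memory link graph instead of real HTTP fetching; the names TOY_WEB and crawl are hypothetical and are not part of any system described in this chapter.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seeds, fetch_links, limit):
    """Visit up to `limit` pages breadth-first and return them in visit order."""
    frontier = deque(seeds)  # URLs extracted but not visited yet
    visited = set()          # URLs already visited, so they are not revisited
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # "Download" the page and extract its links, then repeat the cycle.
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order


visit_order = crawl(["a"], lambda u: TOY_WEB.get(u, []), limit=10)
```

A general crawler of this kind visits pages in the order they were discovered; the focused crawlers discussed later replace the plain queue with a priority queue ordered by estimated relevance.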

2.2 The IDEAL project
The IDEAL project [5] team has developed over 1000 tweet archives about general topics and/or events, along with over 66 Web archives of man-made and natural disasters, the latter using the Internet Archive’s general crawler [10].

Figure 2 Workflow for creating Web archives from social media (Twitter)

Event archiving is different from domain/site-based or topic-based archiving. The first involves archiving a specific domain/website with all or some of the underlying subdomains/structure. The second covers a given number of webpages related to a user-defined topic. The IDEAL project team has identified and employed three approaches for archiving webpages about events:

1. Manual curation by domain experts, librarians/archivists, and government agencies (high quality – time consuming). See https://archive-it.org/explore?show=Collections&fc=meta_Subject:Spontaneousevents

2. Social media-based (crowdsourcing) curation by extracting, retrieving, and archiving URLs from tweet collections about an event (low quality – time saving). See http://www.eventsarchive.org/?q=node/42

3. Crawling the Web using a focused crawling approach tailored to events (with acceptable quality and time).

The IDEAL project team has used the first approach and created around 66 Web collections about different kinds of events [19]. They manually curated seed URLs and fed them into the Archive-It service for crawling (see Figure 3). They have applied different settings of the Archive-It configuration parameters according to the importance of the event about which they are collecting, and the type of URLs they curated. The two main configuration parameters are the frequency of crawling and the scope of crawling. The first one controls the frequency by which the crawler should revisit and re-crawl the webpage, while the second parameter controls whether the crawler should follow the outgoing links in the webpage.

[Figure 2 labels: Keyword/Hashtag; Collect Tweets; Tweet Collection; Extract URLs; Shortened URLs; Expand; Original URLs; Fetch; Webpages; Archive; WARC; Index; SOLR; Browse; Wayback; Search; Access. Phases: Collect; Archive/Organize/Analyze.]

The IDEAL team automated the process of curating seed URLs [20] by collecting tweets about the events of interest and then extracting URLs from tweets and feeding them to the Archive-It service for crawling. Social media in general, and Twitter in particular, provide a very rich source for user-generated content, which contains a large number of URLs. The IDEAL project team has created around 1000 tweet collections about general topics and specific events [21]. The tweet collections suffer from noisy content like porn, job ads, marketing, etc. The IDEAL project team has applied several filtering methods to ensure the resulting tweet collections contain relevant content only. Figure 2 shows the workflow for curating seed URLs from social media sources (e.g., Twitter). The resulting seed URLs are archived using the Heritrix [10] tool and then the resulting Web archives are indexed by a search engine for providing access, searching, and browsing services for users.

The last approach aims to maintain a balance between producing high quality event collections and reducing the time/resources needed for collection building. The IDEAL project team has developed tools for semi-automatically collecting, curating, and archiving webpage collections, leveraging methods for event modeling and focused crawling. The event modeling covers especially identifying and representing events considering their What, Where, and When aspects. The IDEAL project’s work on focused crawling could be of benefit for Web archiving by:

1. Helping prepare lists of URLs to be archived (i.e., a focused crawler recommending a seed list);

2. Helping extend a collection automatically (using existing collections for machine learning type training of a focused crawler to find similar new webpages); and

3. Analyzing and summarizing the produced event collections by using the developed event model.

2.3 Web Archiving and Archive-It service
Closely related work has been done in the emerging and promising field of Web archiving and digital libraries (WADL) [22-24]. The most related aspect of Web archiving is the selection of webpages to be archived and how to collect these webpages, a process sometimes known as Web curation, typically governed by a selection policy. General Web crawlers are the dominant method for Web archiving. However, several new techniques have emerged which do not depend on crawling technology, but rather depend on the transactional behavior [23] of the WWW (HTTP protocol) to drive archiving of webpages.

Many webpage archives are created by curators using the Internet Archive’s service called Archive-It [9], which helps with harvesting, building, and preserving collections of digital content. The service takes URLs as input from a user. Figure 3 shows the interface for users to manually enter the URLs from which the crawler will start crawling. These URLs are used by Archive-It to crawl the Web, guided by manual configuration details (scoping of the domain of webpages to crawl, types of files to crawl, following robots.txt protocol, etc.), and the resulting webpages are captured and stored in WARC [25] files.

The Archive-It service has provided an automated way for different kinds of users to archive and save important webpages in which they have interest. However, the methodology used in the crawler technology behind the Archive-It service is oriented toward archiving general websites (like government, state, university libraries, and federal websites) where the whole content of the website and the frequent changes/updates of the website are the main scope of the archiving process. For spontaneous events this approach is not well suited. Most of the event-related content involves only specific webpages within a website, and those webpages are not frequently changed/updated. Therefore, using the Archive-It service may result in an archive with most of the content not related to the event of interest. In [7] the authors analyzed Web archives about school shootings. Their results show that representative Web archives are noisy, with 2%-40% of webpages reflecting relevant content.

Figure 3 Seed URLs manual curation using Archive-It service

2.4 Focused Crawling

2.4.1 Machine Learning
Machine learning based focused crawler approaches apply text classification algorithms [26-28] to learn a model from training data. The focused crawler then uses the model to estimate the relevance of unvisited webpages. Use of the model enhances the performance of the classifier by incorporating domain specific knowledge and online relevance feedback. Our approach likewise can be considered as involving a classification task; we require training data for calculating the weights of the different aspects of the event and we are using the webpage text for building the event model.

Rennie and Barto [29] used reinforcement learning for solving the focused crawling problem. They modeled the focused crawling problem as a Markov decision process with webpages as states, URLs as actions, and on-topic webpages as the rewards. Another reinforcement learning algorithm, temporal difference learning, was used in [30]. They used a state value function to estimate the importance of webpages to lead to future relevant webpages. In our approach, we use webpage source importance to estimate the value of webpages for linking to other relevant webpages.

Another work [31] used the reinforcement learning framework proposed in [29] and enhanced its performance by applying incremental online learning. For each new URL, they estimate its corresponding class and use its features to update the class features and its corresponding q-value. Then they retrain the supervised learning algorithm based on the new training data (old training data and the new URLs seen). This approach eliminates the data bias that appears in the test data, where unseen URLs may appear from new domains that were not found in the training data. In our approach we address bias through using webpage source importance both in selecting seed URLs and also during crawling.

Infospiders is a topical crawler based on adaptive online agents [32] that use genetic programming and reinforcement learning approaches to estimate the relevance of a webpage. In our approach we go beyond just using topics, and use event modeling for estimating the relevance of a webpage.

In [33], a focused crawler was developed for collecting webpages that contain semantic information (semantic annotations expressed as structured data embedded in the HTML of the webpage). The authors proposed a new methodology for crawling the Web that utilizes an online classifier and a reinforcement learning bandit algorithm for selection. The online classifier learns to detect the relevant webpages during crawling, eliminating the need for training a model before crawling. The bandit algorithm provides a framework for ordering and selecting the next URL to visit during crawling. The URLs in the crawler frontier are grouped by their Web host/domain and each host/domain is scored according to an importance measure. The crawler chooses the host/domain with the highest score and then chooses from that host/domain the URL with the highest score.
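The two-level selection just described (first the best host, then the best URL within that host) can be sketched as follows. This is only a greedy illustration under our own assumptions: an actual bandit algorithm would also balance exploration against exploitation and update the host scores during the crawl, and the function name pick_next_url and the example scores are hypothetical rather than taken from [33].

```python
from urllib.parse import urlparse


def pick_next_url(frontier, host_score):
    """Pick the best URL of the best host.

    frontier:   dict mapping URL -> estimated URL score
    host_score: dict mapping host name -> importance score (assumed given)
    """
    # Group the frontier URLs by their Web host.
    by_host = {}
    for url, score in frontier.items():
        by_host.setdefault(urlparse(url).netloc, []).append((score, url))
    # Choose the host with the highest importance score ...
    best_host = max(by_host, key=lambda h: host_score.get(h, 0.0))
    # ... and, within that host, the URL with the highest score.
    return max(by_host[best_host])[1]


frontier = {
    "http://a.example/1": 0.9,
    "http://b.example/1": 0.4,
    "http://b.example/2": 0.7,
}
host_score = {"a.example": 0.2, "b.example": 0.8}
next_url = pick_next_url(frontier, host_score)
```

Note that the URL with the globally highest score (on a.example) is not selected here, because its host is scored lower; this is the main behavioral difference from a single global priority queue.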

2.4.2 Semantic Similarity
Semantic similarity-based techniques use ontologies [34, 35] for describing the domain of interest. The domain ontology can be built manually by domain experts or automatically, using concept extraction algorithms. Once the ontology is built, it can be used for estimating the relevance of unvisited webpages by comparing the concepts extracted from the target webpage with the concepts that exist in the ontology. The performance of semantic focused crawling depends on how well the ontology describes and covers the domain of interest. Many of our events are disaster-related events. Although we are not using domain knowledge or event ontologies, our event model could be easily extended to make use of disaster domain knowledge for finding disaster-specific keywords. Our event model also could be easily mapped to portions of a disaster-related event ontology [12, 36], but further research would be needed to see if such would yield improvement. Further, using this approach in a system that is aimed to handle any type of important event would require considerable knowledge engineering work.

Related to semantics is sentiment. Sentiment analysis has been integrated into focused crawling in two ways: sentiment-focused crawling [37] and focused crawling leveraging sentiment information [38]. In both works, crawlers made use of the sentiment information to build the target model and to guide, focus, and direct the crawler through the Web graph by estimating the relevance of unvisited URLs. Sentiment oriented crawling is considered a type of topical crawling, where the target is to crawl webpages that have a given sentiment. Sentiment classes could be as simple as positive vs. negative, or complex based on a specific domain. In our work, events have more complex structure than sentiment. We can leverage sentiment information about events but we have to find the relevant webpages about events first and then analyze the sentiment information in those pages.

2.4.3 Content and Link Analysis
Text and link analysis algorithms combine text analysis schemes (e.g., Vector Space Model (VSM) [15, 16]) and link analysis algorithms to estimate the relevance and importance of webpages [39-41]. Link analysis approaches introduce the concept of popular webpages. Popularity is measured based on the link structure of the WWW. This led to the introduction of the concepts of Hub and Authority webpages [42]. Hub webpages have links to many authority webpages, while authority webpages are linked to by many hub webpages. Among the link analysis algorithms, PageRank [43, 44] is the most used. Alternatively, context graphs are used to represent the context of a webpage using neighborhood webpages that are most similar to it.

Another line of research incorporates the genre of webpages into focused crawling [45]. The genre of the webpage defines the type of the webpage (e.g., forum, tutorial, news, blog, course-syllabus, etc.). The focused crawler uses two sets of keywords, one for determining the genre of the webpage and the other set for determining the topic. The two sets of keywords (genre and topic) are manually determined by experts and then used by the focused crawler for estimating the relevance of the webpage.
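The mutually reinforcing hub and authority notions mentioned above can be illustrated with a short power-iteration sketch in the style of the HITS algorithm. The four-page link graph and the function name hits are assumptions made for illustration only; this is not the implementation used in any of the cited works.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so that the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1 and h2 act as hubs, x and y as authorities.
toy_links = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub_scores, auth_scores = hits(toy_links)
```

In this toy graph, x is pointed to by both hubs and therefore receives the highest authority score, while h1 links to both authorities and receives the highest hub score.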

Related to webpage source importance, Pant et al. [46, 47] describe a new Web characteristic: status locality on the Web. A webpage’s status measures the importance of the webpage with respect to its popularity, and is approximated by the number of links pointing to it. Pant et al. developed an algorithm for estimating the status of a webpage based on local characteristics of the webpage and also demonstrated that the status property has some of the same characteristics as the topical property. Our approach also tries to predict the importance of a webpage by examining the importance of the domain to which the webpage belongs. The importance in our case is viewed with respect to the domain of the event of interest. For example, in an earthquake event, some websites may be more dominant, and thus more important, than other websites, based on the number of relevant webpages found for that domain.

Chen [48] developed a hybrid approach for focused crawling using genetic programming for exploiting different features in a webpage’s text, and metadata search for exploring different sources on the WWW. Chen applied the genetic programming approach for combining different relevance signals from the webpage text. He also used metadata search for gathering several seed URLs for the crawler to start from, thus expanding the crawler’s coverage of the WWW. In order to overcome the bias that can be found in one search engine, he used multiple search engines and combined their results. Our approach also aims to improve recall in crawling, but uses other types of evidence (webpage source importance) to do so.

2.5 Event Modeling
Event modeling recently has gained popularity in different fields, like topic detection and tracking (TDT) [49], animal disease outbreak detection [50], networked multimedia events [51], and document similarity [52]. In TDT, an event is described as a topic that happens at a certain time, in a specific location, and with a particular set of participants. In multimedia applications, an event is defined as a tuple of aspects: informational, spatial, temporal, structural, causal, and experiential. The informational aspect includes event ID, event type, and any other attributes that serve as identification of the event. Spatial and temporal aspects represent the location and time properties, respectively. The structural aspect includes the sub-events belonging to the current event. The causal aspect includes the events causing the current event. Finally, the experiential aspect includes all media resources related to the current event.

We incorporated event modeling into our crawler, to build an event-aware focused crawler [3]. We came up with an event model that integrates ideas from TDT [49] (we built our event model using the same definition: something that happened at a certain place on a specific date) and work done in [50] (they defined an event as a combination of domain-related keywords, location entities and date entities, and disease name entities that appear on the sentence level). We use topic, location, and date information, but aggregate at the collection level.

A probabilistic event model has been developed in [53] where an event is modeled as a latent variable that generates webpages, i.e., the observations. A webpage about an event is modeled as a tuple of three parts (topic, entities, and date). Topic and entities are modeled as variables with multinomial distribution over words in webpage content, and date is modeled as a normal distribution over the publishing date of the webpage. According to this model, using a time analysis of publishing dates of webpages, an event is represented by a peak in the number of webpages published around a certain date. A peak could represent one event or multiple events that happened on the same/close dates. The topic and entities parts of the model help in discriminating between events sharing the same peak.

In [54, 55], the LDA topic modeling framework is used to model events. An event is represented by a mixture of topics over a set of webpages. The topics considered are a background topic, a topic for each event (extracted from the set of webpages belonging to that event), and a topic for each document. The reason for dividing topics in this manner is to capture and separate the language used for general purposes (frequent background words), the language used for each event specifically, and finally the language used for each webpage specifically.

Events have been analyzed in social media (like Twitter) using tailored methods for extracting event-related information from the tweets [56-59]. Events are modeled as a set of words with specific structure like “subject verb object”. The purpose of that research is not to collect information related to an event, but rather, given a set of short texts (tweets), to extract event-related information. This type of work models events at the fine-grain level, where it looks for specific structure in sentences that might represent an event. In this dissertation, events are modeled at a coarse-grain level using the combination of words at the webpage level.

In [60] events are modeled as the co-occurrence of spatial and temporal tokens in one sentence. The authors provided a hierarchy for modeling spatial and temporal tokens. For the spatial hierarchy they used country, state, city, and street, while for the temporal hierarchy they used year, month, and day. Using these hierarchies, the authors developed a similarity function for estimating the similarity between documents using event model instances extracted from the documents.

In [61] the authors analyzed the characteristics of information sources during news events. They analyzed three types of information sources: media outlets, social media, and query logs. They analyzed three events: San Bruno pipe explosion, New York storm, and Alaska 3-way elections. They analyzed these events because all three shared the same characteristics: 1) location is centralized (i.e., no multiple locations), 2) time span is limited (i.e., so they attract attention and interest for a short period of time). Their definition of events is similar to our definition: something that happened at a certain place and specific time.

Thus, there are several ways for defining an event depending on the context and the application in which the event is used. Our event model is similar to the probabilistic event model [53], which incorporates the topic, date, and entities aspects. We do identify a date for an event. We also recognize location entities. However, other types of entities found are included in the topic part as normal keywords. Our event model can be easily extended to add other types of entities (Persons, Organizations, etc.).

Combining several types of evidence is an important task in focused crawling. Initial work has used content and link based information. Multimedia information also was used [62], where text and images were analyzed for estimating webpage relevance. There a Bayesian network was used for integrating evidence from different sources of information (text and images). Likewise, we combine the webpage source importance with the webpage relevance score, in particular by multiplying both together to produce a final score.

2.6 Social Media and Focused Crawling Seed Selection
In [63], the authors combined the focused crawler technique with social media to improve the freshness of the crawl. A focused crawler is limited by the set of seed URLs it starts from. Social media produces a huge amount of user generated content (e.g., tweets) that may contain URLs. Since social media content is produced live, the URLs contained therein would be fresh and possibly more recent than the URLs visited otherwise by the focused crawler. Injecting URLs from social media into the focused crawler’s URL queue should increase the freshness of the Web collection produced. The authors crawled about two events (Ebola and the Ukraine conflict) and used a keyword-based model to represent the two events. We used social media content (i.e., Twitter) as a source of seed URLs. We collected tweets about events and, after the collection process finished, we extracted the URLs and filtered them to get relevant URLs.

In [64], the authors examined the topical quality of existing Web archives about events. They built a framework that assesses whether the seed URLs used in building the Web archive are on-topic or off-topic across the different times it was crawled. The authors used the vector space model to represent the documents and applied several similarity measures to calculate the similarity scores. They evaluated their method using different threshold values to find the value that yields the best performance. We used the cosine similarity for scoring webpages and URLs about events with different threshold values.

We have used social media content (i.e., Twitter) to extract and select seed URLs for focused crawling. Unlike the work in [58], we extracted the URLs from the tweet collections and then started crawling.

2.7 Evaluating topical/focused crawlers
Several methods have been developed for evaluating focused crawlers [65, 66]. In [67] a general evaluation framework has been developed where any focused/topical crawler can be assessed according to the evaluation framework, independently from the domain topic. The authors used three methods for evaluating different focused crawlers. The first one is using a classifier to classify the resulting webpages as relevant or non-relevant (on-topic or off-topic). The second method is using a retrieval system where the collected webpages are indexed in the system and specific queries are run against the collection. Different crawlers are evaluated based on the number of relevant webpages retrieved for each query. The third method is using average similarity scores. Different focused crawler methods are evaluated based on the average similarity scores at different stages of the crawl. Since none of these three techniques fits well with our approach, we use alternative evaluation schemes in the work that follows.

The above mentioned works cover much of the background for our research. While there have been a variety of related studies, our investigation is unique, and improves upon the methods we have uncovered to date.

3 Focused Crawling
3.1 Topic Representation
One of the inputs to a focused crawler is a set of URLs; together these can be used to describe the event/topic of interest. We refer to them as model URLs; they can be the same as or different from the seed URLs. In this dissertation, the model URLs are the same as the seed URLs for small-scale experiments, while in large-scale experiments (where the seed URL list is very large) we use a separate, smaller list of model URLs. As we will describe later, in Chapter 5, in the large-scale experiments the number of seed URLs is very large. Using all of those to build a model would be time consuming, and could delay the start of a crawl. Accordingly, we use a smaller number of URLs as model URLs to build the event model. These are selected on the basis of providing high quality textual content about the event/topic. The focused crawler uses this set of URLs to build its event/topic model, and then uses the model to estimate the relevance [65, 68-73] of the URLs and webpages it encounters during crawling. The remaining seed URLs are added to the queue for the crawl, helping ensure breadth of coverage and reducing bias. From now on, we use the term seed URLs for both seed URLs and model URLs for simplicity.

We consider two ways to represent an event. In the rest of this chapter, we consider the first, traditional, baseline approach, where an event is treated like a topic [74], characterized by a set of keywords. In Chapter 4, we describe our new approach, where an event is described with a richer model. We chose a best-first focused crawler as our baseline method because it has proven to be the state-of-the-art method in topical focused crawling [65, 75]. The baseline best-first focused crawler uses the Vector Space Model (VSM) [76] approach to build its event/topic model:

1. Using the model URLs, download corresponding webpages and extract text from those webpages. Each webpage is tokenized to a set of words, stopwords removed, and words stemmed and then converted to a vector. Here the vector represents the unique terms in the webpages and their frequencies (how many times they appear in the webpage).

2. The crawler then builds a vocabulary index using the webpage vectors. The vocabulary index maps the set of unique words in all the webpages to a list of the word frequencies in the webpages.

3. Using the vocabulary index, the crawler calculates a weight for each word by summing all its frequencies in the webpages in which it appeared. This corresponds to the word collection frequency as opposed to the word Term Frequency (TF). We haven’t used Inverse Document Frequency (IDF) as we are using the seed webpages only, rather than a large general corpus.

4. The crawler selects the top k words with highest weights as a model for the event/topic of interest. The weights of the words are calculated by using the log of the frequencies, to eliminate the risk of long documents dominating short documents.
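The four steps above can be sketched compactly as follows. This is an illustrative simplification under stated assumptions, not the dissertation's implementation: downloading and HTML text extraction are replaced by plain strings, stemming is omitted, the stopword list is a tiny placeholder, and the weight 1 + log(frequency) is one possible reading of "the log of the frequencies". The names STOPWORDS, tokenize, and build_topic_vector are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]


def build_topic_vector(page_texts, k):
    """Return the top-k words with log-scaled collection-frequency weights."""
    collection_freq = Counter()
    for text in page_texts:
        # Steps 1-3: per-page term counts, summed over all model pages.
        collection_freq.update(tokenize(text))
    # Step 4: keep the k most frequent words and dampen their weights.
    return {word: 1.0 + math.log(freq)
            for word, freq in collection_freq.most_common(k)}


model_pages = [
    "The earthquake struck Ecuador",
    "Earthquake in Ecuador: earthquake damage",
]
topic_vector = build_topic_vector(model_pages, k=2)
```

With these two toy pages, "earthquake" occurs three times and "ecuador" twice across the collection, so they form the two-word topic model, with "earthquake" weighted higher.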

The baseline crawler uses the event/topic model to represent the event/topic of interest and also to model each webpage it visits during crawling. So the event/topic vector has a slot for each term found in the vocabulary (or feature space) that arises from the webpages. More specifically, after getting a URL with highest score from the queue, the crawler downloads the webpage, extracts the text, tokenizes the text into tokens (words), removes stopwords, applies stemming, does frequency analysis, and converts the text into a vector of words with their frequencies. The final webpage vector representation will be constructed using the words in the vocabulary built from the model webpages and their corresponding frequencies in the webpages. The word frequency in the webpage is called the term frequency (TF) in the information retrieval literature [16] and is part of the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme [15, 16]. As mentioned in the procedure above regarding step number 3, we have not used the Inverse Document Frequency (IDF) because it usually is calculated for a general corpus of webpages where relevant and non-relevant webpages exist. The focused crawler uses only presumed relevant webpages from model (or seed) URLs and thus also including the IDF value might lead to removing relevant keywords.

Later, the crawler estimates the relevance of a webpage by calculating the cosine similarity between the event/topic vector and the webpage vector. Also, the crawler estimates the scores of all the URLs in that webpage. For each URL, the crawler combines the URL tokens and anchor text, converts them to a vector of words with their frequencies, and calculates the cosine similarity of the resulting URL vector to the event/topic vector. Then the URL is inserted into the queue with the estimated score.

The vocabulary (keywords or features) which the crawler uses to represent the webpage vectors and the extracted URL vectors is built and extracted from the set of webpages corresponding to the seed URLs. The webpage vector is used to estimate the relevance of the webpage and produce a relevance score. The URL score is calculated as an average of the URL vector score (calculated as cosine similarity between event/topic vector and URL vector) and the score of the webpage in which the URL appeared. So the crawler is making use of three types of textual information: webpage text, URL anchor text, and URL address tokens. Using webpage text and URL information was proved to be more efficient than using webpage text only [75]. Figure 4 shows the architecture of the baseline best-first focused crawler, with the topic representation and relevance estimation processes highlighted in dashed boxes.
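The relevance estimation described above can be sketched as follows: a page or URL is projected onto the topic vocabulary, compared to the topic vector by cosine similarity, and the URL score is averaged with the score of its parent page. Tokenization is simplified (no stopword removal or stemming), and the function names and example values are our own assumptions.

```python
import math
import re
from collections import Counter


def to_vector(text, vocabulary):
    """Term-frequency vector restricted to words in the topic vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in vocabulary)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def url_score(url, anchor_text, parent_page_score, topic_vector):
    """Average of the URL's own similarity and its parent page's score."""
    url_vector = to_vector(url + " " + anchor_text, topic_vector)
    return (cosine(url_vector, topic_vector) + parent_page_score) / 2.0


# Hypothetical topic vector and inputs, for illustration only.
topic = {"earthquake": 2.0, "ecuador": 1.0}
page_vec = to_vector("Ecuador earthquake: earthquake relief", topic)
page_score = cosine(page_vec, topic)
off_topic = url_score("http://news.example/sports", "football results",
                      0.6, topic)
```

In this example the page vector is proportional to the topic vector, so its cosine similarity is 1.0, whereas the off-topic URL shares no vocabulary terms and receives only half of its parent page's score.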

Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box.

3.2 Crawler Architecture
A general Web crawler [17, 18, 77-82] consists of a webpage fetcher (downloader) for retrieving webpage contents, a URL queue (frontier) for storing unvisited URLs, and a webpage processor for extracting text and URLs out of a webpage’s HTML. Crawlers model the WWW as a graph G(V, E) where nodes (V) are webpages and edges (E) are links between webpages. So, two webpages (nodes) will have an edge between them if one webpage has a link pointing to the other webpage.

Similar to general Web crawling, a focused crawler has a webpage fetcher, URL queue, and webpage processor. In addition, a focused crawler has a topic or domain-specific model, and a module for estimating the relevance of URLs and webpages. Typically, a focused crawler takes as input: 1) the desired number of pages to collect, and 2) seed URLs to start crawling from. It outputs the set of webpages found [26, 41, 66, 75, 83].

One of the important aspects of (focused) crawlers is the ordering of the URLs in the queue, which specifies the order of visiting the nodes of the graph. In the focused crawler literature [27], best-first search is the most commonly used technique and is considered the state-of-the-art focused crawler, taking into consideration the estimated relevance of the URLs/webpages during crawling.

A focused crawler starts from a seed URL. It downloads the corresponding webpage and extracts the text of that webpage. The focused crawler then estimates the relevance of the webpage textual content with regard to the topic/event of interest. In the next step, there are two design options. One option is that the focused crawler decides whether the webpage is relevant or not by comparing its estimated score to a pre-defined threshold. If the webpage is considered relevant, then the focused crawler extracts the embedded URLs from the webpage and inserts them into the queue. The other option is that the focused crawler extracts all embedded URLs from the webpage and then inserts those into the queue, not being constrained by the webpage score. The second option takes into consideration the tunneling phenomenon in crawling, where a non-relevant webpage links to relevant webpages, either directly or through several steps.

When inserting the extracted URLs into the queue, the focused crawler has to make another decision. One option is to insert all extracted URLs, along with the estimated score of the webpage from which they were extracted. Another option is to estimate the relevance of each URL based on the tokens in both the URL’s address and anchor text, and insert the URL and its resulting estimated relevance score into the correct position in the priority queue. We adopt a hybrid approach where we use the average of a URL’s score and the score of the parent webpage from which the URL was extracted [75]. Next, the focused crawler pulls from its queue the URL with highest score, and repeats the process. Figure 5 shows a focused crawler algorithm that handles tunneling (i.e., extracts the URLs from the webpage regardless of score): estimating the score of each URL and inserting it into the queue with its estimated score. We consider this approach as the foundation for the baseline for evaluation comparisons.
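The best-first loop with tunneling and the hybrid URL score described above can be sketched on a toy in-memory Web. The scoring functions are passed in as parameters, so that any relevance estimator can be plugged in; the keyword-fraction scorer, the TOY_PAGES graph, and all function names below are assumptions for illustration, not the baseline crawler's actual code.

```python
import heapq


def best_first_crawl(seeds, fetch, score_page, score_url,
                     pages_limit, page_threshold):
    """Best-first crawl that extracts links from every page (tunneling)."""
    # heapq is a min-heap, so scores are negated to pop the maximum first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)  # every URL ever queued, including visited ones
    visited, relevant = [], []
    while frontier and len(visited) < pages_limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        text, links = fetch(url)
        page_score = score_page(text)
        if page_score >= page_threshold:
            relevant.append(url)
        # Links are followed regardless of the page score (tunneling).
        for link, anchor in links:
            if link in queued:
                continue
            # Hybrid score: average of URL score and parent page score.
            combined = (score_url(link, anchor) + page_score) / 2.0
            heapq.heappush(frontier, (-combined, link))
            queued.add(link)
    return visited, relevant


def keyword_score(text):
    """Toy relevance: fraction of words equal to 'earthquake'."""
    words = text.split()
    return sum(w == "earthquake" for w in words) / len(words) if words else 0.0


# Hypothetical pages: URL -> (text, [(outgoing URL, anchor text), ...]).
TOY_PAGES = {
    "seed": ("earthquake report",
             [("hub", "site map"), ("quake1", "earthquake damage")]),
    "hub": ("site map", [("quake2", "earthquake relief")]),
    "quake1": ("earthquake damage", []),
    "quake2": ("earthquake relief", []),
}

visited, relevant = best_first_crawl(
    ["seed"], lambda u: TOY_PAGES[u], keyword_score,
    lambda link, anchor: keyword_score(anchor),
    pages_limit=10, page_threshold=0.5)
```

Here the non-relevant "hub" page is still expanded, which is what allows the crawler to reach "quake2"; under the first design option (expanding only relevant pages) that page would never be found.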

Figure 5 Baseline focused crawler algorithm

3.3 Large Scale Design Considerations
Ideally, the focused crawler should score all the URLs it extracts from a webpage and insert them into its frontier based on their scores. When the crawl is small scale, the frontier size is manageable; at large scale, however, the frontier size could grow very fast and slow down the performance of the focused crawler, due to memory constraints.

Algorithm Baseline Focused Crawler
Input: Seed URLs, pagesLimit, pageScoreThreshold, urlScoreThreshold
Insert seed URLs in priority queue
## Topic Representation
topicVector = Build topic representation from seed pages
## Crawling
while pagesCount < pagesLimit and priorityQueue is not empty:
    URL = pop (priorityQueue)
    append URL to visited list
    page = download (URL)
    ## Preprocessing (Vector Space Model)
    pageVector = process (page text)
    ## Relevance Estimation (Cosine Similarity)
    pageScore = calculateScore (pageVector, topicVector)
    pagesCount += 1
    if (pageScore >= pageScoreThreshold):
        page.getURLs ()
        relevantPagesCount += 1
        save page to event-related collection
    for link in page.outgoingURLs:
        URL = link.address
        validate (URL)
        if URL not in visited list and URL not in priorityQueue:
            ## Preprocessing (Vector Space Model)
            urlVector = process (URL text)
            ## Relevance Estimation (Cosine Similarity)
            urlScore = calculateScore (urlVector, topicVector)
            if urlScore >= urlScoreThreshold:
                push (URL, priorityQueue)

The frontier is constructed using a priority queue, often implemented using a max heap [84, 85]. The max heap data structure provides pop and push operations like a normal queue, and maintains the max heap property, i.e., that the element with the maximum value is always on top of the heap. Thus, the pop operation will always return the element with maximum value. Reading the maximum element is O(1), but removing it (the pop operation) is O(log n), where n is the number of elements in the heap, because the heap must be restructured to restore the max heap property. The push operation is also O(log n), since it has to move the new element to its correct position to maintain the same property. The push operation is repeated for every URL extracted while the pop operation is repeated for every URL visited. Therefore, the focused crawler would spend a considerable amount of time in maintaining the max heap property, especially when the heap size grows. Thus, during large-scale crawling experiments, we limit the number of URLs in the max heap by inserting only URLs whose score reaches a given threshold and discarding URLs with a relevance score lower than that threshold; we set the threshold to 0.1 empirically.

Another component that consumes memory in large-scale situations is the visited URLs list. This list keeps the URLs that the focused crawler has visited (and whose corresponding webpages it has fetched) so that the focused crawler doesn’t visit them again. For every URL extracted or popped from the frontier, the focused crawler checks if the URL exists in the visited URLs list. Using a normal list, this operation will be time consuming when the list size is very large. For large-scale experiments, we implemented the visited URLs list using a hash table, where the hash key is the URL address. Checking if a key exists in a hash table takes constant time on average, which is much faster than a linear search through a list.

The above mentioned approach characterizes the baseline focused crawler used in evaluation studies reported below. The event focused crawler uses the same general approach, but varies due to its use of an event model (and, later, webpage source importance). Thus, comparisons allow determining the effects of the event model and webpage source importance.
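The frontier and visited-URL bookkeeping described above can be sketched with Python's standard library. This is a minimal illustration, not the dissertation's code: the class name `Frontier` and its methods are assumptions. Python's `heapq` is a min-heap, so scores are negated to obtain max-heap behavior; `heappush` and `heappop` are both O(log n), and the `set` membership checks are O(1) on average.

```python
import heapq


class Frontier:
    """Max-priority URL frontier with a score cutoff and hash-based visited set."""

    def __init__(self, url_score_threshold: float = 0.1):
        self._heap = []        # entries are (-score, url); heapq is a min-heap
        self._queued = set()   # URLs currently waiting in the heap
        self._visited = set()  # URLs already popped (i.e., fetched)
        self._threshold = url_score_threshold

    def push(self, url: str, score: float) -> bool:
        """Queue the URL unless it is low-scoring, already queued, or visited."""
        if score < self._threshold or url in self._visited or url in self._queued:
            return False
        heapq.heappush(self._heap, (-score, url))
        self._queued.add(url)
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        self._queued.discard(url)
        self._visited.add(url)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)
```

Dropping low-scoring URLs at push time bounds the heap size, which is the same trade-off the text makes with its empirically chosen 0.1 cutoff.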

4 Event Focused Crawler

4.1 Event Model and Representation

4.1.1 Event Modeling

In the focused crawler literature [74], events generally are considered as topics and are represented with a list of keywords. Although this approach might work well for some events, it works less well for other kinds of events. For example, representing events as a list of keywords would work in cases where the topic part of the event is most dominant and important, while the location and time parts aren’t important or don’t play a significant role in the event. The outbreak of Ebola is a good example of an event where the topic (spreading of Ebola) is the most important aspect. The location and date are part of the details, but are not that important, i.e., the topic part is largely sufficient to clearly describe the event. On the other hand, shooting events, for example, can’t be described with the topic part only. Since there are many shooting events in different places and at different times, we need the location and date parts to clearly describe a particular shooting event.

In this dissertation, we are focusing on unusual real-world news events which cause or create an impulse in people’s interest and in the media coverage about the event [61]. Good examples of this type of event are natural disasters, elections, shootings, terrorist attacks, and incidents (pipe explosion, crane or building collapse, etc.). This type of event is characterized by two main things: 1) it has a centralized physical location and 2) it has a limited time period in which its effect appears (impulse in media coverage and people’s interest). Even for elections, we are concerned not with the general aspect of the election but with a specific incident that captures people’s interest or causes an impulse in media coverage. For example, the Alaska elections on November 2, 2010 captured popular attention because of a 3-way race which an independent candidate won [61, 86].

This is in contrast to many other event definitions in different fields. In the computational linguistics field, an event can be defined as “a situation that occurs” while in the multimedia field it is defined as a change in the state of an object in a video or photo stream. In the information extraction field, the focus is on extracting real-world local events like concerts, theater performances, birthday parties, etc.; information comes from textual content and yields records of events in which people have interest [87].

Event Model: Before considering complex event models, a simple scheme should be tested first. Thus, we define an event as something (e.g., a disaster), which happened in a certain place, and at a certain time. Thus, an event E is a tuple <T, L, D>. The three parts reflect what, where, and when. Thus, T is the topic of the event, L is its location, and D is its date. These are explained below in more detail.

Topic: Using a set of seed URLs, we create an event vocabulary (a set of unique keywords that appear frequently in the webpages associated with those seeds). We represent an event with a reference vector created by taking the top keywords from the event vocabulary.

Date: The event date is given by a user or is extracted automatically from the set of seed webpages. The event date represents the starting date when the event first occurred. The event also could have an ending date, or even a set of periods in time.

Location: A small set of location entities is likely to appear frequently in most of the seed webpages, representing places related to the event. These location entities are extracted (as described next) from seed webpages’ texts; we perform a frequency analysis to help find the most important location entities mentioned in the seed webpages.

For example, we model the shooting that happened in San Bernardino, California on December 2, 2015 as follows:

    Topic: shooting, shooter, …
    Location: San Bernardino, California
    Date: 12/02/2015

Similarly, we model the attack that happened in Brussels, Belgium on March 22, 2016 as follows:

    Topic: terror, attack, explosion, …
    Location: Brussels, Belgium
    Date: 3/22/2016

Our event model (combining topic, location, and date) can default to the topic-only model in the case of Ebola by ignoring the date and location parts (by setting the weights of the location and date parts to zero). We note that this would be the case also regarding the Zika virus disease outbreak. We can add a part to our system so that if the event type is disease outbreak (manually entered by the user), the system automatically defaults to the topic-only model. Alternatively, other defaults, like giving a small weight to the location and/or the date part, could be instituted. Thus, our system is flexible, and can be used in cases where it is difficult to determine, for a given event, which model is more effective. But if the event is a news/world event which is physically localized (has a clear center) and temporally limited (with an impulse in number of articles published in a relatively short period), then we believe our event model (combining topic, location, and date) should perform better than the topic-only model. Thus, if a user is interested in the outbreak of Zika/Ebola disease in a certain place and at a certain time, then our event model (combining topic, location, and date) should perform better than the topic-only model.

Figure 6 shows the steps of building an event model from seed webpages. We start the process of building an event model by downloading the webpages corresponding to the seed URLs. We then extract dates from the seed URLs and the seed webpages. To do this, we first try to extract the publication date from the seed URLs using a pre-defined regular expression. If that fails, we extract the publication date by parsing a pre-defined set of tags from the HTML of the webpages.
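As an illustration of the <T, L, D> tuple and of the topic-only fallback described above, the event model could be held in a small data structure such as the one below. This is a sketch under assumptions: the class name `EventModel`, the helper `as_topic_only`, the equal default weights, and the placeholder keyword weights are all introduced here for illustration and do not come from the dissertation.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, Tuple


@dataclass(frozen=True)
class EventModel:
    topic: Dict[str, float]       # keyword -> weight (the T part)
    locations: Dict[str, float]   # location entity -> weight (the L part)
    event_date: date              # starting date of the event (the D part)
    # Weights of the topic, location, and date scores; assumed to sum to 1.
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)


def as_topic_only(model: EventModel) -> EventModel:
    """Fall back to the topic-only model by zeroing the location and date weights."""
    return replace(model, weights=(1.0, 0.0, 0.0))


# The San Bernardino example from the text; the keyword weights are placeholders.
san_bernardino = EventModel(
    topic={"shooting": 1.0, "shooter": 1.0},
    locations={"San Bernardino": 1.0, "California": 1.0},
    event_date=date(2015, 12, 2),
)
```

Keeping the weights inside the model makes the disease-outbreak default a one-line change (`as_topic_only(model)`) rather than a separate code path.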

Figure 6 Steps of building event model from seed webpages

For the date extraction, we have used a library (https://github.com/Webhose/article-date-extractor) for extracting the publishing date of a webpage using heuristics. The first step is to extract the publishing date from the URL using regular expressions, if applicable. For example, the URL http://www.cnn.com/2016/07/10/us/black-lives-matter-protests/index.html has a publishing date of July 10, 2016. If the URL doesn’t contain date information, then the next step is to look for specific tags in the header portion of the corresponding webpage’s HTML. An example tag that contains the publishing date looks like:

<meta name="pubdate" content="2015-11-26T07:11:02Z">

This appears in the head tag of the HTML content of a webpage. There are multiple meta tags that might contain the publishing date; hence the library has an extensive list of possible meta tags that are frequently used in different websites. The final step is to check in the body of the webpage, if no publishing date is found in the head tag. As before, a list of frequently used body tags is used to guide finding the publishing date. An example of such a tag containing the publishing date is:

<p class="pubdate">Sept 3, 2011</p>

If there are multiple dates found in the webpage, the library returns the first one only. The library tries to extract the publishing date from the URL, then from the head tag of the HTML, and then from the body tag of the HTML. The order matters because it follows the expected accuracy of the extracted date: a date extracted from the URL is expected to be more accurate than one from the head, which in turn is expected to be more accurate than one from the body. An extra step that could be done is to use natural language processing techniques to extract named entities (dates) from the textual content of the webpage. Using the extracted date entities, we could figure out the publishing date of the webpage. However, we have not used this approach, because with the library we used we managed to extract publishing dates from most of the webpages, and because of the overhead of calling and using the named entity recognizer.

For the event model locations vector, we segment the text of the webpages of the seed URLs into sentences and apply the Stanford Named Entity Recognizer (SNER, http://nlp.stanford.edu/software/CRF-NER.shtml) on each sentence to extract location entities. We then perform frequency analysis on the extracted location entities and construct the locations vector. It includes the unique locations extracted, along with their frequency of occurrence in all sentences in all seed webpages (i.e., the weight of each location is the cumulative frequency in all seed webpages). The resulting locations vector will include the locations frequently mentioned in the set of webpages corresponding to the seed URLs, which should be the location of the event of interest, assuming the seed webpages are relevant and of high quality (with regard to containing enough information about the different event aspects, namely topic, location, and date). The SNER could extract location entities not related to the event from some of the seed webpages, as a webpage may include references to multiple locations. This should not affect the model, however, as the frequency of those location entities should be very small (since they typically appear in few of the seed webpages). On the other hand, if the event occurs in multiple locations, a suitable list of locations should be found through the above mentioned processing of seed webpages (i.e., the seed webpages should cover the different locations of the event and not be concentrated on one location only).

For the topic vector, we perform the same processing as for the baseline VSM. We tokenize the text of the webpages of the seed URLs into words, remove stop words, stem words, perform frequency analysis, and construct the topic vector as the set of unique terms along with their frequency of occurrence.
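Two of the steps above lend themselves to a short sketch: pulling a date out of a URL path and building a frequency-weighted locations vector. The code below is illustrative only. The regular expression is an assumed pattern covering `/YYYY/MM/DD/` paths such as the CNN URL quoted earlier; it is not the pattern used by the date extraction library, and it would miss URLs that spell out month names. The function `locations_vector` assumes the location entities have already been extracted per page by a named entity recognizer.

```python
import re
from collections import Counter
from datetime import date
from itertools import chain
from typing import Iterable, List, Optional

# Assumed pattern: /YYYY/MM/DD/ path segments.
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that do not form a calendar date
        return None


def locations_vector(entities_per_page: Iterable[List[str]]) -> Counter:
    """Cumulative frequency of each location entity over all seed pages."""
    return Counter(chain.from_iterable(entities_per_page))
```

For instance, `locations_vector([["San Bernardino", "California"], ["San Bernardino"]])` gives "San Bernardino" a weight of 2, so incidental locations mentioned on a single page stay near the bottom of the vector.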

4.2 Event Processing

4.2.1 Event Model-based Webpage Scoring

In this section, we show how the event focused crawler uses the event model to calculate a score for each webpage it visits and for the URLs extracted from that webpage. Focused crawlers assign each downloaded webpage a score, which estimates the relevance of the webpage. In the case of events, event aspects are considered during the relevance estimation process. Thus we score the relevance of the webpage with respect to each aspect of the event, and then combine that information to compute a final score. According to our event model, there are three attributes which together fully describe an event. A webpage can have some or all of the attributes of an event. A webpage is considered relevant (i.e., talks about the target event) if it satisfies the following conditions:

• It has a non-empty subset of the keywords that represent the topic attribute of the target event (i.e., is topically relevant).
• Its publication date is close to the event date.
• It has a non-empty subset of the keywords of the location attribute of the target event (i.e., the location entities extracted from the webpage are similar to event location entities).

A webpage that satisfies these conditions should be considered relevant and will be added to the output collection. The event focused crawler first takes the following steps with regard to a webpage:

1. Extract the text of the webpage.
2. Extract the publication date of the webpage.
3. Extract location entities from the text of the webpage using Named Entity Recognition (NER).

We have developed a function to measure the similarity between the target event model and the webpage model. The similarity function produces a score that estimates the relevance of the webpage to the target event. Given a target event model and a webpage event model: e1 = (T1, L1, D1) and e2 = (T2, L2, D2), where T1 is the event topic reference vector, L1 is the list of location entities extracted from seed webpages using NER, D1 is the event date, T2 is the bag-of-words vector representation of the webpage text, L2 is the list of location entities extracted from the webpage text using NER, D2 is the publication date of the webpage, e1 is the target event model, and e2 is the webpage event model. The similarity function sim(e1, e2) is defined as:

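$$\mathrm{sim}(e_1, e_2) = a \times \mathrm{score}(T_1, T_2) + b \times \mathrm{score}(L_1, L_2) + c \times \mathrm{score}(D_1, D_2) \tag{1}$$

$$\mathrm{score}(T_1, T_2) = \frac{\sum_{t \in T_1 \cap T_2} w(t_1) \times w(t_2)}{|T_1| \times |T_2|} \tag{2}$$

i.e., the cosine similarity between the T1 and T2 vectors, where w(t_i) is the weight of term t in document i, and

$$\mathrm{score}(L_1, L_2) = \frac{\sum_{l \in L_1 \cap L_2} w(l_1) \times w(l_2)}{|L_1| \times |L_2|} \tag{3}$$

i.e., the cosine similarity between the L1 and L2 vectors, where w(l_i) is the weight of location l in document i, and

$$\mathrm{score}(D_1, D_2) = 1 - \frac{|D_1 - D_2|}{\mathit{num\_days}} \tag{4}$$

where the date difference is measured in days and num_days is the number of days in a year. This parameter can be configured according to the event characteristics. If the two dates are more than the value of this parameter apart, the score will be zero. The final score of the webpage is calculated by using a weighted average of the scores of the topic, location, and date vectors, where constants a, b, and c are the weights of the topic, location, and date scores, respectively. These add to one: a + b + c = 1.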
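Equations (1) to (4) can be sketched in Python as follows. This is an illustrative reading of the formulas rather than the dissertation's code: vectors are assumed to be dictionaries mapping a term (or location) to its weight, each event is assumed to be a (topic, locations, date) tuple, and the function names are introduced here. The `max(0.0, ...)` clamp implements the statement that the date score is zero when the dates are more than num_days apart.

```python
import math
from datetime import date
from typing import Dict, Tuple

Vector = Dict[str, float]
Event = Tuple[Vector, Vector, date]  # (topic vector, locations vector, date)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (equations 2 and 3)."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def date_score(d1: date, d2: date, num_days: int = 365) -> float:
    """Equation 4, clamped at zero for dates more than num_days apart."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / num_days)


def event_similarity(e1: Event, e2: Event, a: float, b: float, c: float) -> float:
    """Equation 1: weighted sum of topic, location, and date scores."""
    (t1, l1, d1), (t2, l2, d2) = e1, e2
    return a * cosine(t1, t2) + b * cosine(l1, l2) + c * date_score(d1, d2)
```

Setting `b` and `c` to zero reduces `event_similarity` to the topic-only cosine score, which matches the fallback described for disease-outbreak events.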

4.2.2 Calculating The Weights

The weights a, b, and c could be set manually by an expert who would take into consideration the type of the event (shooting, hurricane, bombing, earthquake, etc.) and the characteristics of the event (time duration and location area, e.g., specific location and point in time for a “sharp” event, versus multiple locations and long time periods for complex events). To automatically calculate the weights, we use each aspect of the event model (topic, location, and date) separately to score a sample of labeled webpages. We evaluate each aspect’s performance against different threshold values, and we choose the threshold value that produces the best classification performance according to a given evaluation metric. In particular, we used the F1-score as the classification evaluation metric. F1-score is an information retrieval metric that calculates the harmonic mean of the precision and recall [88]. We also used the F1-score to assign the weight of each aspect of the event model (topic, location, and date), which indicates the importance of that aspect in calculating the final score. We calculate the weight as the ratio of the aspect’s F1-score to the sum of the F1-scores of all aspects. Figure 7 shows the steps for calculating the score of a webpage.

For example, assume we have 100 webpages, 50 relevant and 50 non-relevant, relative to a specific event (e.g., Orlando shooting). Assume also that we have the target event model (which could be extracted from another set of relevant webpages or entered manually by the user). For each of the 100 webpages, we extract the topic vector, locations vector, and publication date. Then we calculate three scores (topic score, location score, and date score) for each webpage using equations 2, 3, and 4, respectively. After this process we end up with a matrix of 100 rows (webpages) and 3 columns (topic score, location score, and date score). Next, we use each of the scores (topic, location, and date) separately to predict a label (relevant or non-relevant) for each webpage. We produce the label by comparing the score to a threshold value (call it K); we generate “relevant” if the score is larger than the threshold and “non-relevant” otherwise.

Figure 7 The steps for calculating the score of a webpage

After this process we end up with a matrix of 100 rows (webpages) and 3 columns (label based on topic score, label based on location score, and label based on date score). Then we evaluate the effectiveness of each aspect of the event (topic, location, and date) by comparing the actual labels and predicted labels for each of the three aspects. We use the F1-score as the metric for evaluation. Here we end up with 3 F1-scores (one for each of the topic, location, and date) for the threshold value K. We repeat the previous process for different values (n values) of the threshold parameter. Then we end up with a matrix of n rows (different values of the threshold) and 3 columns (topic, location, and date). Finally, we choose the max F1-score for each aspect (topic, location, and date). The weight of each aspect will be the ratio of its F1-score to the sum of the three aspects’ F1-scores. In this manner, the weight of each aspect corresponds to how much it contributes to the overall performance.

The weights of each aspect of the event model (topic, location, and date) are learnt before crawling time by applying the previous procedure on a given set of URLs and webpages that are labeled as relevant or non-relevant. The weights learnt are used during crawling and are not modified.
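The weight-learning procedure above can be sketched as follows. The function names and the data layout (one list of scores per aspect, plus a list of true labels) are assumptions made for illustration. The fallback to equal weights when every aspect has an F1-score of zero is also an assumption, since the text does not cover that case.

```python
from typing import Dict, List, Sequence


def f1_score(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1(scores: Sequence[float], labels: Sequence[bool],
            thresholds: Sequence[float]) -> float:
    """Best F1 over all thresholds; a page is 'relevant' if score > threshold."""
    return max(f1_score(labels, [s > k for s in scores]) for k in thresholds)


def aspect_weights(aspect_scores: Dict[str, List[float]], labels: Sequence[bool],
                   thresholds: Sequence[float]) -> Dict[str, float]:
    """Weight of each aspect = its best F1 divided by the sum of all best F1s."""
    best = {name: best_f1(scores, labels, thresholds)
            for name, scores in aspect_scores.items()}
    total = sum(best.values())
    if total == 0:  # assumed fallback: equal weights
        return {name: 1 / len(best) for name in best}
    return {name: value / total for name, value in best.items()}
```

With four labeled pages, a topic score that separates the classes perfectly ends up with a larger weight than a location score that does not, and the weights always sum to one.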

4.2.3 Event Model-based URL Scoring

A similar procedure is implemented for estimating a score for each URL extracted from the webpage. A URL is converted into tokens by removing non-alphabetic characters (like ‘/’, ‘#’, ‘?’) and also removing URL-specific keywords (like ‘http’, ‘com’, ‘www’). URL tokens are combined with tokens from associated anchor texts. The resulting tokens are then converted to a bag-of-words based vector representation. We extract the location entities from the URL anchor text using SNER and extract the publication date from the URL using regular expressions (via the date extraction library, https://github.com/Webhose/article-date-extractor), if applicable.

Figure 8 An example webpage with a relevant URL anchor text highlighted

Figure 8 shows an example webpage about the Brussels attack event with a relevant URL highlighted. The anchor text of the URL is: “Paris and Brussels terror suspect to face charges in France”. The address that the URL points to is: https://www.theguardian.com/world/2016/jun/09/mohamed-abrini-paris-brussels-terror-suspect-france-man-in-the-hat. We can see that the URL’s address contains the publication date of the corresponding webpage (June 9, 2016). The URL vector after tokenization, stop word removal, and stemming would be: [‘theguardian’, ‘world’, ‘2016’, ‘jun’, ‘mohamed’, ‘abrini’, ‘paris’, ‘brussels’, ‘terror’, ‘suspect’, ‘france’, ‘man’, ‘hat’, ‘charges’]. We note here that SNER will capture the location entities from the URL’s anchor text only, not the URL address, because SNER tokenizes its input into sentences and tries to extract entities from these sentences; this can be done for the anchor text (which includes meaningful text) but not for the URL (since URL tokens don’t form meaningful sentences).

If the seed URLs are all from one domain (e.g., www.theguardian.com), this may affect the quality of information extracted to build our event model. The coverage from a single domain could be biased or limited in scope, while that is less likely if there are multiple domains. The seed URLs should therefore be from different domains, which makes it more likely that all required information is present in the event model and reduces bias toward a particular website. As a summary, Figure 9 shows the steps for calculating the score of a URL.
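The tokenization step for a URL and its anchor text might look like the sketch below. It is an assumption-laden illustration: the stop list `URL_STOPWORDS` and the function name `url_tokens` are invented here, digits are kept because the example vector in the text retains ‘2016’, and general stop word removal and stemming are left out, so the output will not match the example vector exactly.

```python
import re
from typing import List

# Assumed list of URL-specific keywords to drop; the text names http, com, www.
URL_STOPWORDS = {"http", "https", "www", "com", "org", "net", "html", "htm"}


def url_tokens(url: str, anchor_text: str = "") -> List[str]:
    """Lowercased tokens from a URL address combined with its anchor text."""
    raw = re.split(r"[^a-z0-9]+", f"{url} {anchor_text}".lower())
    return [token for token in raw if token and token not in URL_STOPWORDS]
```

The resulting token list can be counted into a bag-of-words vector and scored against the topic vector in the same way as webpage text.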

Figure 9 The steps for calculating the score of a URL

In this chapter, we have addressed hypothesis 1.1 and research questions 1 and 2, namely “How to model and represent an event?” and “How to compare two event representations?”. We explained our event model and representation, and how the event focused crawler can use it to estimate the relevance of the URLs and webpages it visits.

5 Experimental Setup

In this chapter we describe the datasets used for our experiments, the different experiments performed, and the evaluation metrics. We performed four series of experiments. The goal of the first series (see results in Section 6.1) was to validate that our event model can effectively classify webpages as relevant or non-relevant to the event of interest. We compared our approach against the baseline, i.e., using the traditional vector space model (VSM) topic-only approach. In the second series of experiments (see results in Section 6.2), we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages it visits and consequently guide the crawling process to webpages relevant to the event of interest. In the third experiment (see results in Section 7.1), we evaluated the effectiveness of our proposed webpage source importance model for collecting more relevant webpages. In the fourth experiment (see results in Section 7.2), we evaluated the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

5.1 Datasets

For the first series of experiments, about classification, we devised two datasets: 1) a set of relevant webpages for the training/learning model phase, and 2) a set of relevant and non-relevant webpages for the testing (classification) phase. First, we need a set of relevant webpages that the two models (our event-based model and the topic-only model) will use to learn/build their model in the training phase. Accordingly, we manually curated a set of 38 URLs and fetched their corresponding webpages. For the classification phase, there was no existing dataset (labeled relevant and non-relevant samples) about the California shooting event. We decided to build our own ground truth dataset of 1000 URLs and webpages. We could have assembled the set of 1000 URLs and webpages by hand, but (to save time and effort) we used a keyword-based crawler to fetch 1000 webpages using the set of 38 URLs (used in the training phase) as seeds. We used the two words “California” and “shooting” as keywords for the crawler. After the crawler finished crawling, we manually labeled the resulting webpages into two classes (relevant and non-relevant). There were 725 webpages labeled as relevant and 275 labeled as non-relevant. We followed the procedure described in Section 4.2.2 for calculating the weights and performing the evaluation. The manually labeled set of URLs and webpages is given as input to both models, and the resulting calculated scores and a given threshold parameter are used to produce the predicted labels. The predicted labels are then compared to the labels determined manually, to produce the evaluation results.

After completing the first series of experiments, for the focused crawling experiments, we considered a set of recent events. Table 2 shows the list of events used in our crawling experiments. For each event we summarize the type of the event, the location and date of the event, and finally how many URLs were extracted from the corresponding tweet collection and were used as seed URLs for crawling. The number of seed URLs varies across the events because they were extracted from the event’s corresponding tweet collections. The number of desired webpages was set to 50,000 for events with fewer than 10,000 seed URLs and 100,000 for events with more than 10,000 seed URLs, except for the Egyptair plane crash event, where the number of desired webpages was set to 10,000 only, and the Paris attack event, where the number of desired webpages was set to 500,000. For the classification experiments, however, it seemed sufficient to just consider the California shooting.

Table 2 List of events used in the large-scale crawling experiments

| Event | Type | Location | Date | # of Seed URLs | # of desired webpages |
|---|---|---|---|---|---|
| California Shooting | Shooting | San Bernardino, California, USA | December 2, 2015 | 4,161 | 50,000 |
| Brussels Attack | Terrorist Attack | Brussels, Belgium | March 22, 2016 | 4,691 | 50,000 |
| Oregon Shooting | Shooting | Roseburg, Oregon, USA | October 1, 2015 | 22,354 | 100,000 |
| Egyptair Plane Crash | Plane Crash | Mediterranean Sea, Alexandria, Egypt | May 19, 2016 | 1,211 | 10,000 |
| Panama Papers Leak | Document Leak | Panama | April 3, 2016 | 18,260 | 100,000 |
| Orlando Shooting | Shooting | Orlando, Florida, USA | June 12, 2016 | 1,988 | 50,000 |
| Paris Attack | Terrorist Attack | Paris, France | November 13, 2015 | 88,835 | 500,000 |
| Ecuador Earthquake | Earthquake | Ecuador | April 16, 2016 | 11,348 | 100,000 |

Since the IDEAL project is working with a large amount of event-related data, and since we wanted our results to be assessed in the context of such types of data, we had ample opportunity to utilize data from events like those mentioned above.

5.2 Experiments

The goal of the first series was to validate that our event model can effectively classify webpages with regard to relevance to the event of interest. We compared the performance, for the task of classification, of the event model vs. the topic-only model. We used the manually curated seed URLs and the static dataset of 1000 webpages about the California shooting for the evaluation. Both models use cosine similarity as a scoring function, and produce a score for their input text (URL or webpage) that estimates the relevance of the input to the given event. We use each of the models as a classifier, i.e., by comparing the cosine similarity score to a given threshold parameter. If the cosine similarity score is bigger than the threshold, then the output label is relevant; otherwise it is non-relevant.

Both models (event-based and topic-only) are trained using a set of URLs (positive samples only, as they don’t require negative samples for training). The topic-only model uses the set of URLs to build a topic reference vector, while the event-based model uses the set of URLs to build the event model (topic, location, and date). The next step is to use the models to classify URLs and webpages. We use the two models built (event-based model and topic-only model) to classify the manually labeled URLs and webpages (i.e., our gold standard test set) to determine their predicted labels. The last step is to compare the predicted labels (from both the event-based model and the topic-only model) to the actual (manually produced) labels and evaluate the performance of each of the models. We evaluated the performance by varying two parameters:

1. k, the number of keywords used in constructing the topic vector in our event model and the topic reference vector for VSM, and
2. threshold, the value of the threshold used for converting the scores to labels (relevant if the score is larger than the threshold, otherwise non-relevant).

We also ran the experiments with several variations of our event model. We ran the same experiment with two pairs of feature types: a) combination of the topic and location only, and b) combination of the topic and date only. Figure 10 shows the design of the experiments for evaluating the effectiveness of classification using the event model.

Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment

In the second series of experiments, we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages to be visited, and consequently guide the focused crawling process to webpages relevant to the event of interest. In these experiments we used a set of recent events. In the literature, the most used dataset for evaluating topical focused crawlers is the DMOZ dataset. But since we are evaluating focused crawlers for events, the DMOZ dataset is not suitable in our case. In the crawling experiments, we need a set of relevant URLs for each event that will be used as seeds for starting the crawling. We manually curated the set of seed URLs for two events (38 for California shooting and 23 for Brussels attack). These two sets of manually curated URLs are used for running small-scale crawls only. For large-scale crawls, we need to start from a larger set of seed URLs, which is very difficult to build manually. For the rest of the events, we extracted the set of seed URLs from a corresponding collection of tweets. The tweet collections were collected using the Twitter streaming API. Twitter is a rich source of URLs, as most of the tweets posted link to webpages that contain more detailed information. The number of URLs extracted depends on how big the event is (high impact events attract more people and therefore more tweets are posted about the event).

Figure 11 summarizes the steps for calculating the harvest ratio after the crawling process finishes. The resulting webpages of each crawl, their calculated scores (based on each model), and a given threshold are used to produce the predicted labels and the corresponding harvest ratio.

Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages

5.3 Evaluation Metrics

We used the precision, recall, and F1-score metrics to evaluate the classification performance of our event model versus the baseline topic-only approach in the first series of experiments. The precision, recall, and F1-score are calculated using the confusion matrix, which contains the number of true positive, true negative, false negative, and false positive samples (samples here refer to webpages). The true positive samples are the ones that were predicted relevant and were actually relevant, while the true negative samples are the ones that were predicted non-relevant and were actually non-relevant. The false positive samples are the ones that were predicted relevant and were actually non-relevant, while the false negative samples are the ones that were predicted non-relevant and were actually relevant. The precision is defined as the percentage of retrieved samples that are relevant. It is calculated as the ratio of true positive samples to the sum of the true and false positive samples. The recall is defined as the percentage of relevant samples that are retrieved. It is calculated as the ratio of the true positive samples to the sum of the true positive and false negative samples. The F1-score measure is the harmonic mean of the precision and recall measures. To get a perfect precision we usually must have low recall and vice versa (perfect recall comes with low precision), so F1-score is a way to combine both measures (precision and recall). In other words, if you get a high value for precision, that doesn’t ensure you have good performance, as you might have a low value for recall. But if you have a high value for the F1-score, this means you have a high value for both precision and recall.

For evaluating the performance of the focused crawlers in the second, third, and fourth series of experiments (i.e., the ability to collect more relevant webpages), we used the harvest ratio metric [27-29]. The harvest ratio is the percentage of crawled webpages that are relevant. The harvest ratio measures the ability of the crawler to find and collect more relevant content than non-relevant content. If the crawler visits many non-relevant webpages in order to find relevant ones, then this means the crawler is using an inefficient method. A highly efficient relevance estimation method would direct the crawler correctly and ensure it focuses on the relevant part of the Web only, thus producing a higher harvest ratio. It is worth noting here that the critical point for relevance judgment is the ability to estimate relevance of both URLs and webpages. Estimating the relevance of webpages is easier than for URLs due to the fact that webpages contain richer textual content than URLs. A poor relevance estimation method would give high scores to non-relevant URLs, which leads to non-relevant webpages and thus a low harvest ratio.
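The metrics defined above translate directly into code. The following sketch uses function names chosen here for illustration; the conventions for empty denominators (returning 0.0) are assumptions, since the text does not discuss those edge cases.

```python
from typing import Sequence, Tuple


def confusion(actual: Sequence[bool], predicted: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) counts, where True means 'relevant'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn


def precision_recall_f1(actual: Sequence[bool],
                        predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (F1)."""
    tp, fp, fn, _ = confusion(actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def harvest_ratio(relevant_pages: int, crawled_pages: int) -> float:
    """Fraction of crawled pages that are relevant."""
    return relevant_pages / crawled_pages if crawled_pages else 0.0
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier with precision 1.0 and recall 0.1 gets an F1-score of about 0.18, which is why a high F1-score implies that both measures are high.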

6 Results

6.1 Event Model-based vs. Topic-Only Classification

In this section, about the first series of experiments, we show the results of classifying the 1000 webpages about the California shooting using the topic-only vector space model versus three variants of our event model, namely topic+location, topic+date, and topic+location+date (our event model). Using the 38 seed webpages about the California shooting event, we created a vocabulary of 1365 keywords that appeared on 5 or more webpages. To extract the most representative keywords (features) from the vocabulary, we sorted the vocabulary keywords based on their cumulative normalized occurrences in all of the seed webpages. We chose the top k keywords from the sorted vocabulary. Each seed webpage is then represented as a vector of the top k keywords and their frequency of occurrence in the webpage. We created the topic reference vector as the centroid vector of all seed webpage vectors.
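The construction of the topic reference vector can be sketched as below. Several details are assumptions made for illustration: pages are given as already tokenized lists of words, "normalized occurrences" is read as a keyword's count divided by the page length, ties are broken alphabetically, and the function names are invented here.

```python
from collections import Counter
from typing import Dict, List, Set


def build_vocabulary(pages: List[List[str]], min_pages: int = 5) -> Set[str]:
    """Keywords that appear on at least min_pages of the seed pages."""
    page_counts = Counter(token for page in pages for token in set(page))
    return {token for token, n in page_counts.items() if n >= min_pages}


def top_k_keywords(pages: List[List[str]], vocabulary: Set[str], k: int) -> List[str]:
    """Rank vocabulary keywords by cumulative length-normalized frequency."""
    cumulative: Counter = Counter()
    for page in pages:
        if not page:
            continue
        for token, n in Counter(page).items():
            if token in vocabulary:
                cumulative[token] += n / len(page)
    ranked = sorted(cumulative.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]


def centroid(pages: List[List[str]], keywords: List[str]) -> Dict[str, float]:
    """Mean per-page frequency of each keyword: the topic reference vector."""
    if not pages:
        return {keyword: 0.0 for keyword in keywords}
    counts = [Counter(page) for page in pages]
    return {keyword: sum(c[keyword] for c in counts) / len(pages)
            for keyword in keywords}
```

Restricting the vocabulary to keywords seen on several seed pages filters out terms that are frequent on one page only, which would otherwise skew the centroid toward a single source.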

6.1.1 Classifying URLs and webpages about California shooting

The 1000 URLs/webpages dataset about the California shooting consists of URL addresses and their anchor texts, as well as the corresponding webpages. We ran experiments to classify URLs and webpages separately. The URLs and webpages were manually labeled as relevant vs. non-relevant. We refer to this dataset as the labeled dataset. For the topic-only model and our event model, we varied the parameter k (the number of keywords in the topic vector) from 5 to 1365 words with increment of 10 and the threshold parameter from 0 to 1 with increment of 0.05.

Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score

| Model | URLs: K | URLs: Threshold | Webpages: K | Webpages: Threshold |
|---|---|---|---|---|
| Topic-only (baseline) | 1310 | 0.25 | 10 | 0.45 |
| Topic, Location, and Date (Event model) | 1310 | 0.15 | 10 | 0.4 |

As can be seen in Table 3, for the topic-only approach, in the case of URLs, the values of the parameters k (the number of keywords of the topic vector) and threshold that gave the best F1 score on the labeled dataset were 1310 and 0.25, respectively. In the case of webpages, they were 10 and 0.45, respectively. The weights of the topic, date, and location parts were 0.36, 0.22, and 0.42, respectively. The weights were calculated using equations 1-4 as described in Section 4.2.2. For the event model approach, in the case of URLs, the values of the parameters k (number of keywords of the topic vector) and threshold that gave the best F1 score on the labeled data were 1310 and 0.15, respectively. In the case of webpages, they were 10 and 0.4, respectively. The weights of the topic, date, and location parts were 0.3, 0.355, and 0.345, respectively. Table 3 summarizes the parameter values for all settings.

To examine the effect of the date and location separately, we ran our evaluation using topic+location and topic+date. For topic+location, the best threshold value was 0.2 and the weights of the topic and location parts were 0.64 and 0.36, respectively. For topic+date, the best threshold value was 0.2 and the weights of the topic and date parts were 0.47 and 0.53, respectively. Tables 4 and 5 show the precision, recall, and F1 score for the four experimental settings (topic-only, topic+location, topic+date, and our event model, with topic, location, and date) using the best values for the parameters for both the URL and webpage classification tasks (as shown in Table 3).

Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event

                          Precision  Recall  F1-score
Topic                     0.728      0.723   0.725
Topic + Date              0.852      0.855   0.853
Topic + Location          0.764      0.73    0.74
Topic + Location + Date   0.863      0.867   0.862
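To make the role of the part weights and the threshold concrete, the sketch below combines per-part similarities into one relevance score. The exact similarity function is defined in Section 4.2.2 and is not reproduced here, so the weighted sum, the names `EVENT_WEIGHTS`, `event_score`, and `is_relevant`, and the assumption that each part similarity lies in [0, 1] are all illustrative choices. The weights and the 0.4 threshold are the event-model values reported above for webpages.

```python
from typing import Dict

# Weights reported above for the event model (topic, date, location).
EVENT_WEIGHTS: Dict[str, float] = {"topic": 0.3, "date": 0.355, "location": 0.345}


def event_score(similarities: Dict[str, float], weights: Dict[str, float] = EVENT_WEIGHTS) -> float:
    """Weighted sum of per-part similarities (ASSUMED form of the combination).

    A part missing from `similarities` contributes zero.
    """
    return sum(weights[part] * similarities.get(part, 0.0) for part in weights)


def is_relevant(similarities: Dict[str, float], threshold: float = 0.4,
                weights: Dict[str, float] = EVENT_WEIGHTS) -> bool:
    """Label an item relevant when its combined score reaches the threshold."""
    return event_score(similarities, weights) >= threshold
```

Under these assumptions, a webpage that matches the event date exactly but only partly matches the topic can still clear the threshold, which is consistent with the observation that date information helps separate the California shooting from other shooting events.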

Achieving a higher F1-score means better classification performance (i.e., better ability to identify and differentiate between relevant and non-relevant webpages). The results show that adding date and/or location information to the topic enhances the performance. Our event model (combining topic, location, and date) achieves the best performance (highest F1-score). The topic-only model performed worst (lowest F1-score). Our examination of the data confirmed that the topic-only model did not differentiate well between webpages talking about different shooting events and our event (California shooting), as all are topically related (shooting). On the other hand, the topic+date model performed better than topic-only, because it managed to use the publishing time of the webpages to filter out webpages talking about shooting events that happened before the California shooting event. The topic+location model performed better than the topic-only model because it filtered out webpages talking about shooting events that happened at locations other than California.

Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event

We also examined the performance of our event model (combining topic, location, and date) versus the topic-only approach across the different values of the threshold variable and the best value for the k parameter. We plotted the precision-recall curves for the different values of the threshold parameter. Figure 12 and Figure 13 show the curves for the four different settings. The figures confirm the results described above: adding location and/or date information enhances the performance of classification. They also show that the effect of adding date information is much stronger than that of adding location information, in the case of URLs. We investigated this behavior and found that most of the URLs in our labeled data include date information that can be extracted easily. There was less location information in the URLs compared to date information. Further, some of the location information was not in a standard format as expected by SNER (which assumes location information exists as part of a valid sentence; see Section 4.2.3).

We have used the manually curated seed URLs for learning our event model and the baseline topic-only model. Both models have two parameters: K (number of features, i.e., words, in the topic vector) and threshold (value for determining the labels: relevant or non-relevant). We tuned the values of the two parameters and reported the best values of those parameters (see Table 3) and the performance of the two models (using the best value of the two parameters; see Tables 4 and 5) on the manually labeled training dataset. Finally, we tested the performance of the two models on the manually labeled test dataset (since the two models haven’t seen the webpages in this dataset) to see how well the two models will generalize to unseen webpages. Table 6 shows the performance of the baseline topic-only model, our event model (topic+location+date), and two variants of our event model (topic+location and topic+date). Our event model outperforms the baseline topic-only model by achieving an F1-score of 0.894 compared to 0.688 for the baseline topic-only model. Also, adding the date or location information achieves better performance than the baseline topic-only model. Adding the location information is more effective than adding the date information. This can be attributed to the richness of location information in the webpages compared to the existence of publication date in webpages.

Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event

Figure 12 California shooting URLs evaluation at different threshold values

Figure 13 California shooting webpages evaluation at different threshold values

6.2 Event Model-based vs. Topic-only Focused Crawler

In this section, about the second series of experiments, we report the effect of using the event model with the focused crawler.

6.2.1 California Shooting

In this experiment we used the 38 URLs manually curated (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 7. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date. We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 14 shows the performance of the two crawlers in the small-scale setting (only 1000 webpages are crawled) during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during and at the end of the crawling process than the baseline topic-only focused crawler. Our event model-based focused crawler achieved a harvest ratio of approximately 0.85, while the baseline topic-only focused crawler achieved a harvest ratio of approximately 0.68.
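The harvest ratio curve described above can be computed from the crawl log as in the following sketch. The function name `harvest_ratio_curve` and the input format (one relevance flag per crawled webpage, in crawl order) are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def harvest_ratio_curve(relevance_flags: Iterable[bool], step: int = 100) -> List[Tuple[int, float]]:
    """Return (pages crawled, harvest ratio) after every `step` crawled pages.

    The harvest ratio at a stage is the fraction of all pages crawled so far
    that were judged relevant.
    """
    curve = []
    relevant_so_far = 0
    for crawled, flag in enumerate(relevance_flags, start=1):
        relevant_so_far += bool(flag)
        if crawled % step == 0:
            curve.append((crawled, relevant_so_far / crawled))
    return curve
```

With `step=100` and a 1000-page crawl this produces the ten points plotted per crawler in the small-scale experiments.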

Table 7 California shooting event model

Keywords         Weight
shoot            0.93
san              0.513
bernardino       0.465
said             0.357
wa               0.323
2015             0.321
peopl            0.31
california       0.305
polic            0.258
suspect          0.177

Location
San Bernardino   1
California       0.51
Calif.           0.44

Date
2015-12-02

Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting

[Plot: Performance Evaluation of Event model-based Focused Crawler vs Baseline Focused Crawler for California Shooting Event; x-axis: Total number of webpages crawled (0 to 1000); y-axis: Percentage of crawled webpages that are relevant (Harvest Ratio); series: Baseline Focused Crawler (Topic Only), Event Model-based Focused Crawler (Topic + Loc + Date)]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) for 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 15 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only crawler achieved a harvest ratio of 0.3 (average).

Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)

6.2.2 Brussels Attack

In this experiment we used the 23 URLs manually curated (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 8. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date. We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 16 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during the crawling process than the topic-only baseline focused crawler in the small-scale setting (1000 webpages only).

Table 8 Brussels attack event model

Keywords           Weight
brussel            0.881
attack             0.541
airport            0.539
explos             0.381
wa                 0.31
peopl              0.273
station            0.254
belgium            0.242
metro              0.197
terror             0.159

Location
Brussels           1
Belgium            0.37
Brussels Airport   0.174
Zaventem           0.174
Paris              0.123

Date
2016-03-22

Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack

[Plot: Performance Evaluation of Event model-based Focused Crawler vs Baseline Focused Crawler for Brussels Attack Event; x-axis: Total number of crawled webpages (0 to 1000); y-axis: Percentage of crawled webpages that are relevant (Harvest Ratio); series: Baseline Focused Crawler (Topic Only), Event model-based Focused Crawler (Topic + Loc + Date)]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 17 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only crawler achieved a harvest ratio of 0.5.

Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)

In the previous two experiments (California shooting and Brussels attack), we performed small-scale and large-scale crawls. The reason for the small-scale experiments was to show that the event model efficiently guided the focused crawler to focus on the relevant part of the Web, and that the crawler as a result retrieved more relevant webpages than the topic-only baseline focused crawler. The purpose of the large-scale experiments is to show that the performance of our event model-based focused crawler remains better than the topic-only baseline focused crawler even for large numbers of webpages, i.e., is scalable. In the remaining experiments (other events), we ran only the large-scale crawling experiments. We already showed the effectiveness of our approach in the small-scale experiments and we need to validate that the same performance persists at large scale. We didn’t manually prepare sets of seed URLs for the remaining events; we extracted them from the tweet collections for each event.

6.2.3 Oregon shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~22K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only crawler achieved a harvest ratio of 0.25 (average). Figure 18 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)

In this experiment, we had a higher number of seed URLs than in the previous two experiments (22K compared to 4K). We did not control the number of seed URLs. We extracted and filtered the seed URLs from the event’s tweet collection. The number of seed URLs depends on the size of the corresponding tweet collection, which depends on the impact/coverage and size of the event (big or small). Events with big impact will lead to large tweet collections and therefore more URLs extracted. The definition of impact here in our context is related to coverage. The Oregon shooting event’s tweet collection had more tweets than those of the previous two events, and therefore more URLs were extracted than for the previous two events. Having more starting URLs means we have more access/pointers to the relevant part of the Web graph, so we ran the experiments for 100K webpages rather than 50K (as in the first two experiments).

6.2.4 Egypt airplane crash

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 10K webpages. The two crawlers started from a set of ~1100 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.5 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 19 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egypt airplane crash (10K webpages)

An event of big impact could have a small tweet collection (as is the case for this event) because there was not enough coverage of the event on Twitter, or because the event didn’t attract much attention from Twitter users. Another possible reason is that other trending topics drew attention away from that event.

6.2.5 Panama papers

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~18K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 20 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)

6.2.6 Orlando shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~2000 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 21 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)

6.2.7 Paris attacks

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 500K webpages. The two crawlers started from a set of ~88K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.25 while the baseline topic-only crawler achieved a harvest ratio of 0.18. Figure 22 shows the performance of the two crawlers during the different stages of the crawling process.

The Paris attack event was a big event with huge impact locally in France and internationally. We kept collecting tweets from the start of the event and for several days after the event, which is not the case for the other events. Usually the tweet collecting process stops on the same day or one/two days after the event. The stopping point depends on the time Twitter users stop posting about the event; posting typically peaks on the day of the event and decreases after that. We note here that this experiment (and all our experiments) started from the English seed URLs only (the same applies during the crawling process; we are working on the English language only). We excluded all tweets not in English. Even when limiting to English-only tweets, we still had around 88K seed URLs to start from. The performance of the two crawlers degraded at the end of the crawl as expected, because (as known in the topical crawler literature [17, 26, 28, 40, 41, 46, 47, 65, 66, 74, 75, 81, 83, 89, 90]) as we get far from the seed URLs, we find fewer relevant webpages. The relevant content is concentrated around the seed URLs, so the further away we go from the seed URLs, the greater the chance that we hit non-relevant content.

Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)

6.2.8 Ecuador earthquake

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~11K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.75 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 23 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)

In this chapter we addressed research question 3, which is the effect of using our event model on the performance of the focused crawler. We showed that using the event model with focused crawling leads to achieving a higher harvest ratio, thus collecting more of the relevant webpages than the traditional focused crawler. The better performance was shown in small and large-scale crawls.

7 Webpage Source Importance and Social media-based Seed Selection

So far we have exploited content-based features for finding relevant webpages. In this section, we explore two other parameters that affect the ability of the focused crawler to find more relevant webpages: webpage source importance and seed URLs.

7.1 Webpage Source Importance

A webpage source is the website the webpage belongs to; for example, the webpage with the URL http://www.cnn.com/2014/04/18/world/asia/malaysia-airlines-plane/index.html belongs to the source website http://www.cnn.com. The importance of a webpage source will be determined based on the number of relevant webpages that belong to the webpage source. This heuristic exhibits two characteristics that should ensure a good estimate of the webpage source importance:

1) Follows the likelihood properties of the data,
2) Changes dynamically with new data observed during crawling.

The first characteristic ensures that the measure we are using is realistic and describes the actual data being collected. The second characteristic shows how the measure adapts to changes in the data being observed, and also follows changes in the content being published on the WWW. There are several reasons for choosing our method of estimating source importance and not considering PageRank and Hub and Authority [42, 43, 77, 78] methods (as an example of source popularity measures):

1- Dynamic vs. static (fixed): PageRank, hub and authority, and out-degree measures are all static or fixed measures. They need to be calculated offline (i.e., they require the whole dataset or part of it to calculate the values, which are then used during crawling). This is similar to online and offline learning methods. Offline methods use the training data to build the model and then use it. Online methods don’t use/require training data; they learn the model and use it online/during crawling. So, the model is updated during crawling.

2- PageRank and hub and authority methods are time consuming and computationally intensive, so it will be time consuming to adapt them for an online version.

3- PageRank and hub and authority are methods for measuring popularity and quality of webpages. We can imagine that using them for estimating the importance of a source with respect to a specific event is like using general Web crawlers for crawling webpages about a specific event. We expect that such popularity measures are too general to be considered a measure for source importance. For example, an unpopular website (source) could be very relevant to an event, due to its content or having links to other relevant webpages about the event. Also, using topic-oriented PageRank and hub and authority methods is like using topical crawlers for crawling about events, which is not efficient; that is a key point of this dissertation.

So we need a dynamic, simply calculated, and event-specific method for estimating the likelihood of finding a new relevant webpage from a source by using the likelihood of the source in the currently crawled webpages or the discovered but not yet visited URLs in the frontier. This is similar to a graph-based algorithm which learns from different paths in the graph whether a certain path will lead to relevant webpages even if we encounter non-relevant ones in the middle. Thus, if a webpage is not relevant but its source has a high probability of having event-relevant webpages, then there is a high probability that the current low-score (non-relevant) webpage will link to a relevant webpage. This approach acts like a greedy algorithm where the crawler will crawl more from the source with the highest importance. The crawler will keep crawling relevant webpages from the most important source until it no longer finds relevant webpages, and then switches to another important source. We could experiment with several candidate methods for estimating source importance, using the information about currently crawled webpages, like the ones in [1], namely (the following list is taken from the paper in [33] with changes):

a. Negative Absolute Bad function, where the score of a source is the negative number of already crawled non-relevant webpages;

b. Best Score function, where the score of a source is the maximal score of one of the discovered but not yet visited URLs that belongs to the source;

c. Success Rate function, where the score of a source is the ratio between the number of relevant webpages crawled and the non-relevant; the ratio is initialized with prior parameters α and β which we set to 1: score(source) = (#relevant(source) + α) / (#non-relevant(source) + β);

d. Thompson Sampling function, where the score of a source is a random number, drawn from a beta-distribution with prior parameters α and β; in this case we take as the score the random value: score(source) = Beta(#relevant(source) + α, #non-relevant(source) + β); we initialized the priors α and β with 1;

e. Absolute Good * Best Score function, where the score of a source is the product of the absolute number of already crawled relevant webpages and the best score function described in (b);

f. Thompson Sampling * Best Score function, where the score of a source is the product of the Thompson sampling function (d) and the best score function (b); and

g. Success Rate * Best Score function, where the score of a source is the product of the success rate function (c) and the best score function (b).
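Functions (c) and (d) from the list above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the counts are passed in explicitly, and Python's standard `random.betavariate` stands in for the beta-distribution draw.

```python
import random


def success_rate(relevant: int, non_relevant: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Function (c): (#relevant + alpha) / (#non-relevant + beta).

    With alpha = beta = 1 a source that has not been crawled yet scores 1.0.
    """
    return (relevant + alpha) / (non_relevant + beta)


def thompson_sample(relevant: int, non_relevant: int, alpha: float = 1.0, beta: float = 1.0,
                    rng=random) -> float:
    """Function (d): one draw from Beta(#relevant + alpha, #non-relevant + beta).

    `rng` may be the `random` module or a seeded `random.Random` instance.
    """
    return rng.betavariate(relevant + alpha, non_relevant + beta)
```

Note that the success rate is unbounded above (a source with many relevant and no non-relevant webpages scores high), whereas a Thompson sample always lies in [0, 1] and varies from call to call.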

The results shown in [33] indicate that the success rate function (c) is the best scoring function with regard to crawling the largest number of relevant webpages (i.e., the highest harvest ratio). Thus we chose to use the success rate function as our method for estimating webpage source importance.

We ran an experiment with the event model-based focused crawler and source importance. We combined the webpage source importance score with the event model-based relevance score to produce the final score of URLs. One possible method of combination is multiplying both scores together, so URLs with high webpage importance score and high relevance scores will get a higher final score. We note here that the webpage importance score is calculated during the crawling process and is not a fixed value, but rather a dynamic value that changes during the crawling process. At the beginning of the crawl, all sources have the same initial importance score. When a new webpage is retrieved and found relevant, the corresponding webpage source importance score is updated. In this way, the more relevant webpages we find from a source, the more the importance score of this source increases.

Figure 24 shows the performance of event model-based focused crawling with and without source importance for the Brussels attack event. We notice from Figure 24 that both crawlers achieve almost the same performance, with the crawler with source importance struggling in the first half because of the dynamic value of the source importance. We further examined the results of the two crawlers and found that the collection produced by the crawler with source importance had only 21 unique websites (webpage sources) while the crawler with no source importance had 81 unique websites. The crawler with source importance succeeded in collecting the same number of relevant webpages as the crawler with no source importance, but from far fewer websites. We ran the experiment also on 3 more events: California shooting (see Figure 25), Ecuador earthquake (see Figure 26), and Orlando shooting (see Figure 27).
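The combination of relevance score and dynamically updated source importance described above can be sketched as a small priority-queue frontier. This is a simplified illustration under stated assumptions: the class name `SourceAwareFrontier` is hypothetical, the source of a URL is taken to be its host name, both relevant and non-relevant observations update the counts, and a URL's final score is fixed when it is enqueued rather than re-scored after later updates.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


class SourceAwareFrontier:
    """Frontier ordered by relevance score * source importance (success rate)."""

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        self.alpha, self.beta = alpha, beta
        self.relevant = defaultdict(int)      # relevant pages seen per source
        self.non_relevant = defaultdict(int)  # non-relevant pages seen per source
        self._heap = []

    @staticmethod
    def source_of(url: str) -> str:
        # Assumes absolute URLs with a scheme, e.g. "http://www.cnn.com/...".
        return urlsplit(url).netloc.lower()

    def importance(self, url: str) -> float:
        s = self.source_of(url)
        return (self.relevant[s] + self.alpha) / (self.non_relevant[s] + self.beta)

    def record(self, url: str, is_relevant: bool) -> None:
        """Update the source counts after a page was fetched and classified."""
        counts = self.relevant if is_relevant else self.non_relevant
        counts[self.source_of(url)] += 1

    def add(self, url: str, relevance: float) -> None:
        """Enqueue a URL with final score = relevance * current source importance."""
        final_score = relevance * self.importance(url)
        heapq.heappush(self._heap, (-final_score, url))  # max-heap via negation

    def pop(self) -> str:
        """Return the URL with the highest final score."""
        return heapq.heappop(self._heap)[1]
```

Because all sources start with the same importance (1.0 when α = β = 1), early ordering is driven by the relevance score alone, which is consistent with the slow start observed for the crawler with source importance.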

Figure 24 Effect of source importance on event focused crawling for Brussels attack

[Plot: Effect of Source Importance on Event Focused Crawling; x-axis: Total number of crawled webpages (0 to 1000); y-axis: Percentage of crawled webpages that are relevant (Harvest Ratio); series: Event Focused Crawler with Source Importance, Event Focused Crawler]

Figure 25 Effect of source importance on event focused crawling for California shooting event

[Plot: Effect of Source Importance on event focused crawling for California shooting event; x-axis: Number of pages crawled (0 to 10000); y-axis: Percentage of crawled pages that are relevant; series: Event Focused Crawler, Event Focused Crawler with Source Importance]

Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event

[Plot: Effect of source importance on event focused crawling for Ecuador earthquake event; x-axis: Number of webpages crawled (0 to 10000); y-axis: Percentage of crawled webpages that are relevant; series: Event Focused Crawler, Event Focused Crawler with Source Importance]

Figure 27 Effect of source importance on event focused crawling for Orlando shooting event

[Plot: Effect of source importance on event focused crawling for Orlando shooting event; x-axis: Number of webpages crawled (0 to 10000); y-axis: Percentage of crawled webpages that are relevant; series: Event Focused Crawler, Event Focused Crawler with Source Importance]

We can see from the results in Figures 24-27 that adding webpage source importance to event model-based focused crawling enhances the performance and achieves a better harvest ratio. The event model-based focused crawler with source importance struggles in the beginning of the crawl and then manages to enhance the performance at the end of the crawl. The reason for the bad performance in the beginning is that the focused crawler is trying to find the good sources; once it settles on good ones it exploits their importance to reach more relevant webpages.

In this section we covered research questions 4 and 5, which address how to model webpage source importance and how to integrate it with event model-based focused crawlers. We showed that using webpage source importance helps the focused crawler collect relevant webpages while focusing on important sources. Regarding hypothesis 2: first, we have shown that if we know the bias of different websites according to some criteria, we can include other websites in order to reduce bias. Second, our experimental results show that integrating event information and source importance leads to improved estimates of relevance. Additional demonstrations of these capabilities are given in the following subsections.

7.2 Seed URLs for crawling

The factors that affect any crawling experiment are the number of desired webpages to be crawled and the number of seed URLs fed to the focused crawler. Successfully collecting the number of desired webpages depends on the quality and the number of seed URLs we start from. We should start from URLs that will link/lead to the largest number of relevant webpages. There are different types of seed URLs. We classify the seed URLs with regard to relevance and linking as follows: relevant and linking to relevant URLs, relevant and not linking, non-relevant and linking, non-relevant and not linking. We don’t want the last type (totally useless URLs). All other types are either good in their own right, or are pointing to other good URLs. Table 9 summarizes the different types of seed URLs. We determine whether a URL links to other relevant URLs or not by downloading the corresponding webpage and extracting the links from the webpage content. We estimate the relevance of a URL by classifying the URL tokens as relevant or non-relevant to an event. URL tokens are the set of tokens extracted from the URL address and the URL anchor text that appears on the parent webpage.

Table 9 Different types of seed URLs

               Linking to relevant URLs   Not linking to relevant URLs
Relevant       Hub webpage                Dead end, authority webpage
Non-Relevant   Tunneling                  Reject; ignore that path
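The two-by-two classification in Table 9 amounts to a simple lookup on two boolean judgments. In the sketch below the function name `seed_type` and the returned label strings are illustrative; how the two judgments are obtained (URL token classification and link extraction) is outside the sketch.

```python
def seed_type(relevant: bool, links_to_relevant: bool) -> str:
    """Map the two judgments about a seed URL to its type in Table 9."""
    if relevant and links_to_relevant:
        return "hub"
    if relevant:
        return "dead end / authority"
    if links_to_relevant:
        return "tunneling"
    return "reject"
```

Only URLs for which this returns "reject" would be dropped from the seed pool.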

7.3 Semi-automated Social Media-based Seed URL Generation

Social media has proven to be an important and rich asset for collecting webpages about events. Ensuring full Web archive coverage of an event is not an easy task, for several reasons. First, events differ in impact and importance. Big events tend to last for a long time, impact multiple places, and even spark a range of debates about diverse topics. Second, to build a Web collection that fully covers an event requires sampling an unbiased set of webpages from the WWW (which is huge, heterogeneous, and dynamically changing). The size of the WWW makes it difficult to collect, curate, and sample an unbiased set of webpages using manual techniques. Fortunately, focused crawlers have been proven effective [2, 26, 46, 66, 74, 75, 89] in automating and accelerating the process of collecting webpages, starting from a set of seed URLs. However, the ability of the focused crawler to find relevant and diverse webpages depends on the quality (content quality and linking structure quality) and the broad coverage (seed URLs from different webpage sources) of the seed URLs.

We have been researching building webpage/tweet collections about events. We identified three main approaches: 1) the Internet Archive’s Archive-It service for collecting and archiving webpages, 2) a pair of archiving tools for collecting tweets, and 3) event model-based focused crawling of webpages. We proposed a hybrid approach for building unbiased collections of webpages with high coverage using seed URLs generated from social media content (tweets), together with event model-based focused crawlers. The tweet collection processes [1, 2, 74, 91] ensure a large sample of seed URLs with broad and heterogeneous genres of webpages (horizontal/exploring aspect) while the event model-based focused crawler ensures high quality and relevant webpages (vertical/exploiting aspect).

7.3.1 Selecting Seed URLs

We apply the following steps for selecting/curating the set of seed URLs:

• Group long URLs by sources/domains/hosts
• Count the number of long URLs per source (source importance)
• Sort sources (descending) according to number of URLs in each source
• Pick top K sources and then choose one URL from each of the K sources
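The four steps above can be sketched as follows. The function name `select_seeds` and the `per_source` parameter are illustrative, the source of a URL is assumed to be its host name, and picking the first URL seen for each source is an arbitrary choice of this sketch (the text does not say which URL is chosen).

```python
from collections import defaultdict
from typing import Iterable, List
from urllib.parse import urlsplit


def select_seeds(urls: Iterable[str], k: int, per_source: int = 1) -> List[str]:
    """Pick seeds from the k sources that contribute the most URLs.

    Assumes absolute URLs with a scheme (e.g. "http://..."), so that
    urlsplit() can identify the host.
    """
    # Step 1: group long URLs by source (host).
    by_source = defaultdict(list)
    for url in urls:
        by_source[urlsplit(url).netloc.lower()].append(url)
    # Steps 2-3: count URLs per source and sort sources in descending order.
    ranked = sorted(by_source.items(), key=lambda item: len(item[1]), reverse=True)
    # Step 4: take the top k sources and choose URL(s) from each.
    seeds = []
    for _source, members in ranked[:k]:
        seeds.extend(members[:per_source])
    return seeds
```

Calling `select_seeds(urls, k)` corresponds to choosing one URL from each of the top K sources; `select_seeds(urls, k // 2, per_source=2)` gives a variant that draws two URLs from each of K/2 sources.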

Choosing K unique sources ensures diversity of the seed URLs, and choosing the top K according to the source importance measure ensures broad coverage and high quality. Although tweet collections about events are a very rich source of seed URLs, they contain a lot of noise (porn, job marketing, other spam, and varied other types of non-relevant tweets or URLs). One important step required before using URLs extracted from tweets is an importance analysis of each URL/webpage source, e.g., considering the domain name of the URL. The set of seed URLs should be normalized. Consider the list below. URL1 is the normalized version of URL2. Both URLs point to the same webpage.

• URL1 = www.cnn.com
• URL2 = www.cnn.com?utm_source=feedburner&utm_medium=twitter
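The normalization illustrated by URL1 and URL2 can be sketched with the standard `urllib.parse` module. The rule used here (drop query parameters whose names start with `utm_`, drop the fragment, lowercase scheme and host) is an assumption chosen to reproduce the example; the function name `normalize_url` is hypothetical, and the sketch expects URLs that include a scheme.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Remove tracking parameters (utm_*) and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), ""))
```

After normalization, URLs that differ only in tracking parameters collapse to the same string, so de-duplication becomes a simple set operation.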

The URLs mentioned in tweets are not all relevant. We need to filter them. We used a keyword-based filtering method, which includes only URLs that have at least one of a pre-defined set of keywords. The set of keywords is created manually for each event. Such a filtering process is close in accuracy to classification, but much faster.
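A minimal sketch of such a keyword filter is shown below. The function name `filter_by_keywords` is hypothetical, and case-insensitive substring matching against the URL string is an assumption about how a keyword is considered present.

```python
from typing import Iterable, List


def filter_by_keywords(urls: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Keep only URLs that contain at least one of the event keywords."""
    lowered = [keyword.lower() for keyword in keywords]
    return [url for url in urls if any(keyword in url.lower() for keyword in lowered)]
```

For the Brussels attack event, the manually created keyword set would be passed as `["brussels", "attack"]`.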

7.3.2 Seed URL Domain/Source Importance

We define the webpage source importance as the probability of finding more relevant webpages when starting with a seed URL from that source. We assume that the topical/event locality property holds, where webpages about an event link to other webpages about the same event. We also assume that URLs/webpages from the same source are connected (i.e., if you start from one of them you can reach the others). We estimate the webpage source importance by calculating the number of URLs from the same source extracted from the tweet collection. Figure 28 shows the workflow for extracting URLs from tweet collections. We apply our methods on the extracted URLs to calculate the source importance.

Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets

We ran an experiment about the Brussels attack event. We ran our event focused crawler to collect 1000 webpages starting from different sets of seed URLs. The sets of seed URLs tested in our experiments differ in two aspects: the number (K) of URLs and the uniqueness of the URLs with respect to their websites. We selected the seed URLs from a pool of URLs extracted from a set of tweets collected using the Twitter streaming API. Table 10 summarizes the statistics for the Brussels attack tweet collection. Figure 29 shows the language distribution of the tweets. Most of the tweets are in the English language, which is the language we worked on in our research. Figures 30 and 31 show the number of tweets (without and with URLs), and their distribution across time. Most of the tweets were posted on the first day of the event; their number decreases with time. Figure 32 shows the distribution of the sources according to our source importance measure. We used the harvest ratio measure to evaluate the output of the focused crawlers.

Table 10 Brussels attack tweet collection statistics

Category                                                 Number
All tweets                                               2,227,706
Tweets in English (lang=en)                              1,838,276
Tweet creation date distribution: 3/22/2016              1,253,152
Tweets with URLs                                         937,009
Tweets with URL creation date distribution: 3/22/2016    462,154
Unique short URLs extracted (lang=en)                    113,402
Unique long URLs                                         85,991 (twitter.com=38,168)
Unique domains/sources                                   8,082 (2980>=2,596>=10)
De-duplicated URLs                                       74,698
URLs with keywords “brussels, attack”                    16,187

Figure 29 Brussels attack tweets language distribution

[Plot: Language Distribution; x-axis: Language; y-axis: Number of Tweets]

Figure 30 Brussels attack tweets creation date distribution

[Plot: Tweets Date Distribution; x-axis: Tweet Creation Date (3/22/16 to 3/28/16); y-axis: Number of Tweets]

Figure 31 Brussels attack tweets with URLs distribution

[Plot: Tweets w/ URLs Date Distribution; x-axis: Tweets Creation Date (3/22/16 to 3/28/16); y-axis: Number of Tweets with URLs]

Figure 32 Brussels attack seed URLs domains distribution

Table 11 summarizes the results of the different settings regarding the seed URLs for the Brussels attack event. The first row is for the case of having K unique domains. We choose one URL from each domain. The second row is for the case of having K/2 unique domains; we choose 2 URLs from each domain. The domains from which we choose the URLs are sorted according to how many URLs belong to them. So the top K domains are the ones that have the largest numbers of URLs belonging to them. The columns represent the different values for K (the desired number of URLs we want to select as seeds). As the results show, as we increase the number of seeds, the focused crawler finds more relevant webpages (leading to a higher harvest ratio). Also, distributing the seed URLs across several websites increases the ability of the focused crawler to find more relevant webpages. Table 12 shows the webpage source distribution in the resulting crawled webpages. As the results show, using URLs with more unique sources produces collections with a more diverse set of sources, and this increases by increasing the number of seeds. For example, starting from 10 seed URLs from unique sources led to a collection of 1000 webpages from 68 unique sources in contrast to the original 31 unique sources.

[Plot: Domains Distribution; x-axis: Domains; y-axis: Number of URLs]

Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack

                                                                  K=10    K=50    K=100
Top K Frequent unique websites                                    0.685   0.752   0.817
Top K/2 Frequent websites with multiple URLs from same source     0.645   0.763   0.775

Table12NumberofdifferentdomainsintheoutputcollectionsofcrawlingexperimentsforBrusselsattack

K=10 K=50 K=100TopKfrequentuniquewebsites 68 74 117

TopKfrequentwebsiteswithmultipleURLsfromsamesource

31 58 78
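The two quantities reported in Tables 11 and 12 can be computed from a finished crawl along the following lines. This is a sketch under assumptions: harvest ratio is taken in its usual sense (the fraction of crawled pages judged relevant), the relevance judgments are assumed to be available as booleans, and the function names are illustrative.

```python
from urllib.parse import urlparse


def harvest_ratio(relevance_flags):
    """Fraction of crawled pages judged relevant (0.0 for an empty crawl)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


def count_unique_domains(urls):
    """Number of distinct hosts among the crawled URLs (host = assumed 'domain')."""
    return len({urlparse(u).netloc.lower() for u in urls})
```

For a 1000-page crawl, `harvest_ratio` would be applied to the 1000 relevance judgments and `count_unique_domains` to the 1000 crawled URLs.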

Tables 13 and 14 show the results for the Oregon shooting event, Tables 15 and 16 show the results for the California shooting event, and Tables 17 and 18 show the results for the Orlando shooting event. For the Oregon shooting event, we see the same effect as for the Brussels attack event: having seed URLs from unique domains (rather than having multiple URLs from the same domain) increases the ability of the focused crawler to find more relevant webpages. However, in the case K=100, there was no performance improvement. Both methods achieved the same harvest ratio (0.741), but the method using URLs from unique domains produced a collection with 135 unique domains, while the method using multiple URLs from the same domain produced a collection with only 96 unique domains. Another problem with the Oregon shooting collection is that we ran our experiments 10 months after the event happened. Most of the seeds and webpages about the event no longer exist (they return a 404 error as the HTTP response). This affects the performance of both crawlers; adding more seeds doesn't help in finding relevant webpages.

Table 13 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Oregon shooting

Method                                                           K=10    K=50    K=100
Top K frequent unique websites                                   0.717   0.744   0.741
Top K frequent websites with multiple URLs from same source      0.715   0.669   0.741

Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting

Method                                                           K=10    K=50    K=100
Top K frequent websites with multiple URLs from same source      59      84      96

We also see the same behavior in the California shooting event, as shown in Tables 15 and 16. Crawling with URLs from unique domains achieves better performance than crawling with multiple URLs from the same domain. Increasing the number of seed URLs increases the performance of the focused crawler. Also, in the case K=100, both methods achieve similar harvest ratios (0.7818 and 0.7548 in Table 15), but the method using URLs from unique domains produced a collection with 129 unique domains, while the method using multiple URLs from the same domain produced a collection with only 101 unique domains. A better way is to choose seed URLs of type 4 (i.e., hub URLs; see Table 9), where the URL points to a webpage of relevant content and the webpage contains URLs to other relevant webpages. Since we are automating the process of selecting the seed URLs from the pool of URLs extracted from social media (i.e., Twitter), we think we achieved a reasonable performance with minimal or no manual work. This contrasts with the situation when seed URLs are curated with manual work.

Table 15 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for California shooting

Method                                                           K=10    K=50    K=100
Top K frequent unique websites                                   0.7     0.7604  0.7818
Top K frequent websites with multiple URLs from same source      0.66    0.7218  0.7548

Table 16 Number of different domains in the output collections of crawling experiments for California shooting

Method                                                           K=10    K=50    K=100
Top K frequent websites with multiple URLs from same source      54      64      101

We examined the seed URLs produced in the case K=100 in both the Oregon and California shooting events. We noticed that although we selected the seed URLs from important domains, the webpages these URLs link to are of the authority/dead-end type: the content of the webpages is relevant, but they do not link to other relevant webpages. Thus adding more of these URLs didn't help the focused crawler find or reach more relevant webpages.

Finally, the Orlando shooting event leads to the same behavior as in the previous events. Adding more URLs from unique domains helps the focused crawler find more relevant webpages, as is shown in Table 17. In the case K=100, adding more unique domains achieved almost the same performance as the method of adding more URLs from the same domain (as for the previously mentioned events). A reasonable justification for that behavior is that when we pick URLs from the top 100 domains, the domain distribution gets wider, and we include domains with a low number of URLs (the tail of the distribution), compared to the most frequent domains at the top of the list. These domains do not link to more relevant URLs and thus don't help the focused crawler reach more relevant parts of the WWW. We verified that by examining the last 10 domains in the top 100 domains in the Orlando shooting event. We examined the number of URLs from the last 10 domains in the crawled webpages produced at the end of the crawl. We found that 8 out of the 10 domains had only 1 URL in the resulting collection of webpages.

An optimum selection of seed URLs is a trade-off between adding more unique domains and having seed URLs of type 4, which are highly relevant on their own and also link to other relevant URLs. We used the number of URLs from a domain to estimate the probability that a URL from that domain will link to other relevant URLs. A better way (but rather computationally expensive) is to build the Web graph out of the seed URLs and select only the ones that have higher out-degree or that optimize the coverage and diversity trade-off.
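One way to read the graph-based alternative is as a greedy coverage problem over the out-links of the candidate seeds. The sketch below is an assumption-laden illustration rather than a method from this work: it presumes the out-links of each candidate have already been fetched into a dictionary, uses the URL host as the domain, and breaks ties in favor of domains not yet used as a simple stand-in for diversity.

```python
from urllib.parse import urlparse


def greedy_seed_selection(out_links, k):
    """Greedily pick up to k seeds that together cover the most out-links.

    out_links: dict mapping each candidate seed URL to the set of URLs it
    links to (hypothetical input, built from a prior fetch of the seeds).
    Ties on coverage gain are broken in favor of an unused domain.
    """
    selected, covered, used_domains = [], set(), set()
    candidates = dict(out_links)
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda u: (
                len(candidates[u] - covered),             # new out-links reached
                urlparse(u).netloc not in used_domains,   # prefer a new domain
            ),
        )
        selected.append(best)
        covered |= candidates.pop(best)
        used_domains.add(urlparse(best).netloc)
    return selected
```

Each step scans all remaining candidates, so the cost grows with the number of candidates times k, on top of the cost of fetching every candidate's out-links, which is the expensive part noted above.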

Table 17 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Orlando shooting

Method                                                           K=10    K=50    K=100
Top K frequent unique websites                                   0.6     0.709   0.722
Top K frequent websites with multiple URLs from same source      0.5     0.6856  0.7158

Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting

13 157 181

8 Conclusion and Future Work

We proposed a model and representation for events. We showed how to represent an event using our model. We calculated the weights of the three attributes of our event model by jointly optimizing two parameters (the number of keywords and the threshold value) to yield the best F1-score on a manually labeled (relevant and non-relevant) dataset of URLs and webpages about the California shooting. The results showed that the event model, with these weights employed, can effectively classify URLs and webpages as to their relevance to the event of interest.

We incorporated our event model into focused crawling and showed that our event model-based focused crawler built an event-related Web collection more effectively than the state-of-the-art best-first topic-only focused crawler on two different events: California shooting and Brussels attack. The results for small-scale experiments (collecting 1000 webpages from 38 and 23 seed URLs, respectively) showed that our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages about the two events (i.e., achieving a higher harvest ratio). We ran experiments for large-scale crawling (ranging from 50K to 500K webpages) on 7 different events: California shooting, Brussels attack, Oregon shooting, Egypt airplane crash, Panama papers, Paris attack, and Ecuador earthquake. We leveraged social media (i.e., Twitter) to extract and select seed URLs for crawling. Our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages.

We proposed and incorporated webpages' source importance into our focused crawler. The results showed that using webpage source importance led to an equivalent quality event-related collection, relative to the baseline, but required fewer sources. Finally, we showed the effect of the seed URLs on the quality of the resulting webpage collections. We demonstrated use of the source importance measure to curate and select high quality seed URLs from URLs extracted from social media content (tweets).
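The joint optimization of the number of keywords and the threshold value mentioned in this chapter can be pictured as a grid search that maximizes F1 on labeled data. The following is only a sketch: `score_fn`, the toy keyword list, and the toy documents are hypothetical stand-ins, and the real event model similarity is not reproduced here.

```python
from itertools import product


def f1_score(labels, predictions):
    """F1 for boolean labels and predictions (0.0 when there are no true positives)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(docs, labels, score_fn, keyword_counts, thresholds):
    """Return (best_f1, n_keywords, threshold) over the parameter grid.

    score_fn(doc, n_keywords) is an assumed relevance scorer; a document is
    predicted relevant when its score is at least the threshold.
    """
    best = (-1.0, None, None)
    for n, t in product(keyword_counts, thresholds):
        predictions = [score_fn(doc, n) >= t for doc in docs]
        f1 = f1_score(labels, predictions)
        if f1 > best[0]:  # keep the first setting that reaches the best F1
            best = (f1, n, t)
    return best


# Toy data (hypothetical): documents as word sets, scored by keyword overlap.
KEYWORDS = ["shooting", "california", "police"]


def toy_score(doc, n_keywords):
    top = KEYWORDS[:n_keywords]
    return sum(word in doc for word in top) / len(top)


DOCS = [
    {"shooting", "california"},
    {"shooting", "movie"},
    {"weather"},
    {"california", "police", "shooting"},
]
LABELS = [True, False, False, True]
```

With this toy data, `grid_search(DOCS, LABELS, toy_score, [1, 2, 3], [0.5, 1.0])` tries six parameter pairs and keeps the first one that reaches the highest F1.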

We have covered the five research questions and the related hypotheses. Research questions 1 and 2 (and hypothesis 1.1) were covered in Chapter 4: building the event model and representation and using it with the focused crawler. We covered research question 3 (and hypothesis 1.2) in Chapter 6, evaluating the effectiveness of our event model in focused crawling. Finally, we covered research questions 4 and 5 (and hypotheses 1.3 and 2) in Chapter 7, modeling webpage source importance and integrating it with the event model-based focused crawler.

8.1 Contributions

Our contributions in this dissertation research are:

1. Designing an event model that captures the information needed for representing events (with disaster events as a case study).

2. Developing an event-aware focused crawler that uses the event model for the target event and for webpage representation, as well as for developing a new similarity function that helps in webpage relevance estimation.

3. Designing and incorporating a webpage source importance model into a focused crawler system.

4. Developing a new methodology for semi-automated seed URL generation from social media content.

5. Building an event digital library of event-related objects (text, metadata records, archives, and entities).

8.2 Future Work

Our event model has captured three attributes for an event (topic, location, and date). We plan to extend our event model by extracting and adding organizations and participants; that information will represent the ‘Who’ part in the ‘Who did What, Where and When’ event model. This will enrich our event model and consequently should increase the event model-based focused crawler's power to estimate and retrieve more relevant webpages. Further, with regard to focused crawling for large events, we are integrating our tweet collection efforts, which already have resulted in over 1.2 billion tweets spread across about 1000 collections, with follow-up focused crawling that starts with seeds that come from the URLs found in those tweets.

On the application side, we also plan to use our event model to analyze and summarize a collection of webpages; this can work for any collection about a particular event (e.g., prepared through manual curation, or using our event focused crawler [92]). Using our event model, we will generate a list of indicative sentences, and extract entities to represent and summarize an event. There are multiple algorithms and software implementations for text summarization, but we believe this concept of corpus/event summarization is new and worth investigation. Our preliminary study of such summarization suggests that results will have high quality and utility [92].

Further, we plan to run more experiments on different kinds of events and to test other heuristics for combining webpage source importance with event model-based relevance scores. Finally, we will build a knowledge base of sources for each type of event. The knowledge base will include a list of URLs, extracted from social media content about different types of events. The list of extracted URLs will be used to build a list of pairs: sources and their importance score (how many relevant URLs are from a source). This list could be used to compute priors for the source importance model during crawling.
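The proposed list of (source, importance score) pairs and the priors derived from it could look roughly like the sketch below. The importance score follows the description above (the number of relevant URLs from a source); treating the URL host as the source and normalizing the counts so they sum to 1 are assumptions made for illustration, since the form of the prior is not specified.

```python
from collections import Counter
from urllib.parse import urlparse


def source_importance(relevant_urls):
    """Map each source (assumed: the URL host) to its count of relevant URLs."""
    return Counter(urlparse(u).netloc.lower() for u in relevant_urls)


def source_priors(importance):
    """Normalize importance counts so they sum to 1 (one possible prior form)."""
    total = sum(importance.values())
    return {src: n / total for src, n in importance.items()} if total else {}
```

A crawler could look up a newly discovered URL's host in the returned dictionary and fall back to a small default value for sources that were never seen in the social media data.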

References

1. Farag, M.M.G. and E.A. Fox, Building and archiving event web collections: A focused crawler approach, in Bulletin of IEEE Technical Committee on Digital Libraries. 2015. p. 1-2.

2. Farag, M. and E.A. Fox, Focused Crawling For Events. International Journal of Digital Libraries, Special Issue of Web Archiving - In review, 2016.

3. Magdy, M. and E.A. Fox, Intelligent Event Focused Crawling, in The 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM) - Poster. 2014: University Park, Pennsylvania, USA.

4. O'Reilly, T., What is Web 2.0: Design patterns and business models for the next generation of software. Communications & strategies, 2007. 1(1): p. 17.

5. IDEAL. Integrated Digital Event Archive and Library. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/.

6. Internet_Archive. Internet Archive, A digital library of free content and wayback machine. 2016 [cited 2016 April 26]; Available from: https://archive.org/.

7. Farag, M., P. Nakate, and E.A. Fox, Big Data Processing of School Shooting Archives, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 271-272.

8. IDEAL_Collections. IDEAL Web Collections and Tweet Archives. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/eventstable.

9. Archive-It. Web archiving services for libraries and archives. 2016 [cited 2016 April 26]; Available from: https://archive-it.org/.

10. Mohr, G., et al. Introduction to Heritrix, an Archival Quality Web Crawler. in Proceedings of the 4th International Web Archiving Workshop (IWAW’04). 2004.

11. Fox, E.A. and J.P. Leidig, Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. 2014: Morgan & Claypool Publishers.

12. Fox, E.A. and R.d.S. Torres, Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. 2014: Morgan & Claypool Publishers.

13. Shen, R., M.A. Goncalves, and E.A. Fox, Key Issues Regarding Digital Libraries: Evaluation and Integration. 2013: Morgan & Claypool Publishers.

14. Fox, E.A., M.A. Goncalves, and R. Shen, Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. 2012: Morgan & Claypool Publishers.

15. Salton, G. and C. Buckley, Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988. 24(5): p. 513-523.

16. Salton, G. and M.J. McGill, Introduction to Modern Information Retrieval. 1986: McGraw-Hill, Inc.

17. Pant, G., P. Srinivasan, and F. Menczer, Crawling the web, in Web Dynamics. 2004, Springer. p. 153-177.

18. Manning, C.D., et al., Introduction to Information Retrieval. 2008: Cambridge University Press. 496.

19. Archive-It. Archive-It Collections, Spontaneous Events. 2016 [cited 2016 July]; Available from: https://archive-it.org/explore?show=Collections&fc=meta_Subject%3ASpontaneousevents.

20. Yang, S., et al. A study of automation from seed URL generation to focused web archive development: the CTRnet context. in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. 2012. ACM.

21. IDEAL. IDEAL Tweet Collections. 2016 [cited 2016 August 24]; Available from: http://hadoop.dlib.vt.edu/.

22. Fox, E.A. and M.M. Farag, Report on the workshop on web archiving and digital libraries (WADL 2013). SIGIR Forum, 2013. 47(2): p. 128-133.

23. Fox, E.A., Z. Xie, and M. Klein, WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 293-294.

24. Fox, E.A. and Z. Xie, Web Archiving and Digital Libraries (WADL), in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015, ACM: Knoxville, Tennessee, USA. p. 303-303.

25. WARC. Information and documentation -- WARC file format - ISO 28500:2009. 2016 [cited 2016 August 24]; Available from: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717.

26. Chakrabarti, S., M. Van den Berg, and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 1999. 31(11): p. 1623-1640.

27. Batsakis, S., E.G. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data & Knowledge Engineering, 2009. 68(10): p. 1001-1013.

28. Pant, G. and P. Srinivasan, Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems (TOIS), 2005. 23(4): p. 430-462.

29. Rennie, J. and A. McCallum. Efficient web spidering with reinforcement learning. in Proceedings of the International Conference on Machine Learning. 1999. Citeseer.

30. Grigoriadis, A. and G. Paliouras, Focused crawling using temporal difference-learning, in Methods and Applications of Artificial Intelligence. 2004, Springer. p. 142-153.

31. Singh, N., et al. Large Scale URL-based Classification Using Online Incremental Learning. in Machine Learning and Applications (ICMLA), 2012 11th International Conference on. 2012. IEEE.

32. Menczer, F. and A.E. Monge, Scalable web search by adaptive online agents: An InfoSpiders case study, in Intelligent Information Agents. 1999, Springer. p. 323-347.

33. Meusel, R., P. Mika, and R. Blanco, Focused Crawling for Structured Data, in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 2014, ACM: Shanghai, China. p. 1039-1048.

34. Ehrig, M. and A. Maedche. Ontology-focused crawling of Web documents. in Proceedings of the 2003 ACM symposium on Applied computing. 2003. ACM.

35. Dong, H., F.K. Hussain, and E. Chang. A survey in semantic web technologies-inspired focused crawlers. in Digital Information Management, 2008. ICDIM 2008. Third International Conference on. 2008. IEEE.

36. Yang, S., et al., CTRnet DL for disaster information services, in Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries. 2011, ACM: Ottawa, Ontario, Canada. p. 437-438.

37. Vural, A.G., B.B. Cambazoglu, and P. Karagoz, Sentiment-focused web crawling. ACM Transactions on the Web (TWEB), 2014. 8(4): p. 22.

38. Fu, T., et al., Sentimental spidering: leveraging opinion information in focused crawlers. ACM Transactions on Information Systems (TOIS), 2012. 30(4): p. 24.

39. Almpanidis, G., C. Kotropoulos, and I. Pitas, Combining text and link analysis for focused crawling: An application for vertical search engines. Information Systems, 2007. 32(6): p. 886-908.

40. Diligenti, M., et al. Focused Crawling Using Context Graphs. in VLDB. 2000.

41. Pant, G. and P. Srinivasan, Link contexts in classifier-guided topical crawlers. Knowledge and Data Engineering, IEEE Transactions on, 2006. 18(1): p. 107-122.

42. Kleinberg, J.M., et al. The web as a graph: measurements, models, and methods. in International Computing and Combinatorics Conference. 1999. Springer.

43. Brin, S. and L. Page, Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 2012. 56(18): p. 3825-3833.

44. Page, L., et al., The PageRank citation ranking: bringing order to the web. 1999: Technical Report. Stanford InfoLab.

45. De Assis, G.T., et al. Exploiting genre in focused crawling. in String Processing and Information Retrieval. 2007. Springer.

46. Pant, G. and P. Srinivasan, Predicting web page status. Information Systems Research, 2010. 21(2): p. 345-364.

47. Pant, G. and P. Srinivasan, Status Locality on the Web: Implications for Building Focused Collections. Information Systems Research, 2013. 24(3): p. 802-821.

48. Chen, Y., A novel hybrid focused crawling algorithm to build domain-specific collections. 2007, Virginia Polytechnic Institute and State University.

49. Allan, J., Introduction to topic detection and tracking, in Topic detection and tracking. 2002, Springer. p. 1-16.

50. Volkova, S., et al., Animal disease event recognition and classification. Using Web Data in the Medical Domain, 2010: p. 54.

51. Westermann, U. and R. Jain, Toward a common event model for multimedia applications. IEEE MultiMedia, 2007. 14(1): p. 19-29.

52. Strötgen, J., M. Gertz, and C. Junghans. An event-centric model for multilingual document similarity. in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. ACM.

53. Li, Z., et al., A probabilistic model for retrospective news event detection, in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. 2005, ACM: Salvador, Brazil. p. 106-113.

54. Ha-Thuc, V., et al. News event modeling and tracking in the social web with ontological guidance. in Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on. 2010. IEEE.

55. Ha-Thuc, V., et al. A relevance-based topic model for news event tracking. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 2009. ACM.

56. Becker, H., M. Naaman, and L. Gravano, Beyond trending topics: Real-world event identification on Twitter, in Fifth International AAAI Conference on Weblogs and Social Media. 2011: Barcelona, Spain.

57. Parikh, R. and K. Karlapalem, ET: events from tweets, in Proceedings of the 22nd International Conference on World Wide Web. 2013, ACM: Rio de Janeiro, Brazil. p. 613-620.

58. Ritter, A., et al., Open domain event extraction from Twitter, in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012, ACM: Beijing, China. p. 1104-1112.

59. Ritter, A., et al., Weakly Supervised Extraction of Computer Security Events from Twitter, in Proceedings of the 24th International Conference on World Wide Web. 2015, ACM: Florence, Italy. p. 896-905.

60. Strötgen, J., M. Gertz, and C. Junghans, An event-centric model for multilingual document similarity, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 953-962.

61. Yom-Tov, E. and F. Diaz, Location and timeliness of information sources during news events, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 1105-1106.

62. Lakka, C., et al., A Bayesian network modeling approach for cross media analysis. Signal Processing: Image Communication, 2011. 26(3): p. 175-193.

63. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015. Knoxville, Tennessee, USA.

64. AlNoamany, Y., M.C. Weigle, and M.L. Nelson, Detecting off-topic pages in web archives. Research and Advanced Technology for Digital Libraries, Springer, 2015: p. 225-237.

65. Menczer, F., et al. Evaluating topic-driven web crawlers. in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 2001. ACM.

66. Menczer, F., G. Pant, and P. Srinivasan, Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology (TOIT), 2004. 4(4): p. 378-419.

67. Srinivasan, P., F. Menczer, and G. Pant, A general evaluation framework for topical crawlers. Information Retrieval, 2005. 8(3): p. 417-447.

68. Borlund, P., The concept of relevance in IR. Journal of the American Society for Information Science and Technology, 2003. 54(10): p. 913-925.

69. Hjørland, B., The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 2010. 61(2): p. 217-237.

70. Schamber, L., Relevance and Information Behavior. Annual review of information science and technology (ARIST), 1994. 29: p. 3-48.

71. Saracevic, T., Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 2007. 58(13): p. 2126-2144.

72. Mizzaro, S., Relevance: The whole history. JASIS, 1997. 48(9): p. 810-832.

73. Voorhees, E.M., Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management, 2000. 36(5): p. 697-716.

74. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2015. ACM.

75. Batsakis, S., E.G.M. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data Knowl. Eng., 2009. 68(10): p. 1001-1013.

76. Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975. 18(11): p. 613-620.

77. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems, 1998. 30(1-7): p. 107-117.

78. Cho, J., H. Garcia-Molina, and L. Page, Efficient Crawling Through URL Ordering, in Seventh International World-Wide Web Conference (WWW 1998). 1998: Brisbane, Australia.

79. Heydon, A. and M. Najork, Mercator: A scalable, extensible web crawler. World Wide Web, 1999. 2(4): p. 219-229.

80. Kobayashi, M. and K. Takeda, Information retrieval on the web. ACM Comput. Surv., 2000. 32(2): p. 144-173.

81. Chakrabarti, S., Mining the Web: Discovering Knowledge from HyperText Data. 2002: Science and Technology Books. 350.

82. Castillo, C., Effective web crawling. SIGIR Forum, 2005. 39(1): p. 55-56.

83. Chakrabarti, S., K. Punera, and M. Subramanyam, Accelerated focused crawling through online relevance feedback, in Proceedings of the 11th international conference on World Wide Web. 2002, ACM: Honolulu, Hawaii, USA. p. 148-159.

84. Atkinson, M.D., et al., Min-max heaps and generalized priority queues. Communications of the ACM, 1986. 29(10): p. 996-1000.

85. Min-max_heap. Min-max heap. 2016 [cited 2016 August 24]; Available from: https://en.wikipedia.org/wiki/Min-max_heap.

86. Yom-Tov, E. and F. Diaz, Out of sight, not out of mind: on the effect of social and physical detachment on information need, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 385-394.

87. Foley, J., M. Bendersky, and V. Josifovski, Learning to Extract Local Events from the Web, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015, ACM: Santiago, Chile. p. 423-432.

88. Baeza-Yates, R. and B. Ribeiro-Neto, Modern information retrieval. Vol. 463. 1999: ACM press, New York.

89. Aggarwal, C.C., F. Al-Garawi, and P.S. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. in Proceedings of the 10th international conference on World Wide Web. 2001. ACM.

90. Liu, H., J. Janssen, and E. Milios, Using HMM to learn user browsing patterns for focused web crawling. Data & Knowledge Engineering, 2006. 59(2): p. 270-291.

91. Farag, M. and E.A. Fox, Which webpage should we crawl first? Social media-based webpage source importance guidance, in Workshop on Web Archiving and Digital Libraries (WADL 2016) - Joint Conference on Digital Libraries (JCDL 2016). 2016: Newark, NJ, USA.

92. Farag, M. and E.A. Fox, Web archive content analysis. 2015: Presented at International Internet Preservation Consortium General Assembly IIPC 2015, California, USA.
