Intelligent Event Focused Crawling

Mohamed Magdy Gharib Farag

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy
in
Computer Science and Applications

Edward A. Fox, Co-Chair
Riham H. Mansour, Co-Chair
Weiguo Fan
Scotland C. Leman
Padmini Srinivasan

August 11, 2016
Blacksburg, VA

Keywords: Focused Crawling, Event Modeling, Web Archives, Digital Libraries, Source Importance, Social Media, Seed Generation

Copyright 2016 Mohamed M. G. Farag



ABSTRACT

There is need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend the traditional focused (topical) crawling techniques in two directions, modeling and representing: events and webpage source importance.

We developed an event model that can capture key event information (topical, spatial, and temporal). We incorporated that model into the focused crawler algorithm. For the focused crawler to leverage the event model in predicting a webpage’s relevance, we developed a function that measures the similarity between two event representations, based on textual content.

Although the textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance by the ratio of number of relevant webpages to non-relevant webpages found during crawling a website. We combined the textual content information and source importance into a single relevance score.

For the focused crawler to work well, it needs a diverse set of high quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We proposed an automated approach for curating seed URLs using social media content. We leveraged the richness of social media content about events to extract URLs that can be used as seed URLs for further focused crawling.

We evaluated our system through four series of experiments, using recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first experiment series our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better results in precision, recall, and F1-score. In the second series, using harvest ratio to measure ability to collect relevant webpages, our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.


Acknowledgments

First I would like to thank my advisors Dr. Edward A. Fox and Dr. Riham Mansour for all the continuous help, support, encouragement, patience, motivation, and guidance that they gave me through my Ph.D. journey. I could not have imagined having better advisors and mentors for my Ph.D. study.

Besides my advisors, I would like to thank the rest of my thesis committee - Prof. Padmini Srinivasan, Prof. Patrick Weiguo Fan, and Prof. Scotland Leman - for their insightful comments and encouragement, but also for the hard questions which motivated me to widen my research from a variety of perspectives.

My sincere thanks also goes to Seth Peery, Luke Ward, Shane Coleman, and Andi Ogier, who provided me an opportunity to join their team as intern, and who helped me learn new technologies, extend my skills, and apply my research experience to practical problems.

I thank my fellow labmates: Seungwon Yang, Sunshin Lee, Venkat Srinivasan, Tarek Kanan, Sung Hee Park, Monica Akbar, Kiran Chitturi, Prashant Chandrasekar, Eric Fouh, and all other lab members I forgot to mention, for the stimulating discussions, for the sleepless nights we were working together before deadlines, and for all the fun we have had in the last five years.

I would like to thank my wife Samar ElSaadawy. No words or thanks would suffice or give her what she deserves. She suffered a lot for me. She supported me in all the different stages of my Ph.D. without complaining. Without her, I wouldn’t have done anything. Thank you my hero Samar ElSaadawy.

Last but not least, I would like to thank my family - my parents and my brother and sisters - for supporting me spiritually throughout my Ph.D. and my life in general.

Thanks go to NSF for support, especially through grants IIS-1619028, IIS-1619371, IIS-1319578, DUE-1141209, IIS-0916733, and IIS-0736055. Thanks also go to Virginia Tech’s Digital Library Research Laboratory and Department of Computer Science.

Table of Contents

ABSTRACT
Acknowledgments
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Hypotheses
  1.3 Research Questions
  1.4 5S
  1.5 Thesis Organization
2 Related Work
  2.1 Web Crawling
  2.2 The IDEAL project
  2.3 Web Archiving and Archive-It service
  2.4 Focused Crawling
    2.4.1 Machine Learning
    2.4.2 Semantic Similarity
    2.4.3 Content and Link Analysis
  2.5 Event Modeling
  2.6 Social Media and Focused Crawling Seed Selection
  2.7 Evaluating topical/focused crawlers
3 Focused Crawling
  3.1 Topic Representation
  3.2 Crawler Architecture
  3.3 Large Scale Design Considerations
4 Event Focused Crawler
  4.1 Event Model and Representation
    4.1.1 Event Modeling
  4.2 Event Processing
    4.2.1 Event Model-based Webpage Scoring
    4.2.2 Calculating The Weights
    4.2.3 Event Model-based URL Scoring
5 Experimental Setup
  5.1 Datasets
  5.2 Experiments
  5.3 Evaluation Metrics
6 Results
  6.1 Event Model-based vs. Topic-Only Classification
    6.1.1 Classifying URLs and webpages about California shooting
  6.2 Event Model-based vs. Topic-only Focused Crawler
    6.2.1 California Shooting
    6.2.2 Brussels Attack
    6.2.3 Oregon shooting
    6.2.4 Egypt airplane crash
    6.2.5 Panama papers
    6.2.6 Orlando shooting
    6.2.7 Paris attacks
    6.2.8 Ecuador earthquake
7 Webpage Source Importance and Social media-based Seed Selection
  7.1 Webpage Source Importance
  7.2 Seed URLs for crawling
  7.3 Semi-automated Social Media-based Seed URL Generation
    7.3.1 Selecting Seed URLs
    7.3.2 Seeds URL Domain/Source Importance
8 Conclusion and Future Work
  8.1 Contributions
  8.2 Future Work
References

List of Figures

Figure 1 Overview of IDEAL system and role of event focused crawling
Figure 2 Workflow for creating Web archives from social media (Twitter)
Figure 3 Seed URLs manual curation using Archive-It service
Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box
Figure 5 Baseline focused crawler algorithm
Figure 6 Steps of building event model from seed webpages
Figure 7 The steps for calculating the score of a webpage
Figure 8 An example webpage with a relevant URL anchor text highlighted
Figure 9 The steps for calculating the score of a URL
Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment
Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages
Figure 12 California shooting URLs evaluation at different threshold values
Figure 13 California shooting webpages evaluation at different threshold values
Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting
Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)
Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack
Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)
Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)
Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egypt airplane crash (10K webpages)
Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)
Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)
Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)
Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)
Figure 24 Effect of source importance on event focused crawling for Brussels attack event
Figure 25 Effect of source importance on event focused crawling for California shooting event
Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event
Figure 27 Effect of source importance on event focused crawling for Orlando shooting event
Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets
Figure 29 Brussels attack tweets language distribution
Figure 30 Brussels attack tweets creation date distribution
Figure 31 Brussels attack tweets with URLs distribution
Figure 32 Brussels attack seed URLs domains distribution

List of Tables

Table 1 Example events of different event types
Table 2 List of events used in the large-scale crawling experiments
Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score
Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event
Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event
Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event
Table 7 California shooting event model
Table 8 Brussels attack event model
Table 9 Different types of seed URLs
Table 10 Brussels attack tweet collection statistics
Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack
Table 12 Number of different domains in the output collections of crawling experiments for Brussels attack
Table 13 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Oregon shooting
Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting
Table 15 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for California shooting
Table 16 Number of different domains in the output collections of crawling experiments for California shooting
Table 17 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Orlando shooting
Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting


1 Introduction

1.1 Motivation

There is need for an integrated event focused crawling system [1-3] to collect Web data about key events. Events lead to our most poignant memories. We remember birthdays, graduations, holidays, weddings, and other events that mark stages of our life, as well as the lives of family and friends. As a society we remember assassinations, natural disasters, man-made disasters, political uprisings, terrorist attacks, and wars, as well as elections, heroic acts, sporting events, and other events that shape community, national, and international opinions. Web and Twitter content describes many of these societal events. In part, Web 2.0 [4] is a highly responsive sensor of important occurrences in the real world, since people from across the globe meet virtually and share related observations and stories online. We can leverage this stream of data, for automatic collection of events, to trigger event archiving, and later to enable event related services that support communities.

Permanent storage and access to big data collections of event related digital information, including webpages, tweets, images, videos, and sounds, could lead to an important international asset. Regarding that asset, there is need for digital libraries (DLs) providing immediate and effective access, and archives with historical collections that aid science and education, as well as studies related to economic, military, or political advantage.

When something notable occurs, many users try to locate the most up-to-date information about that event. Later, researchers, scholars, students, and others seek information about similar events, sometimes for cross-event comparisons or trend analyses. Yet, there is little systematic collecting and archiving anywhere of information about events, except when national or state events are captured as part of government related Web archives. This is the need addressed by the Integrated Digital Event Archive and Library (IDEAL) project [5].

Though the Internet Archive [6] supports some event-oriented archiving, coverage is limited. Many important events are ignored, while others are only captured in part. Some groups collect data only until the last victim is rescued, while others start late and miss early posts. Further, tools for capture are complex, and few archivists master their features, so achieving high recall is expensive. There are few mechanisms to filter out noise in collections. Access to the resulting archives is awkward and inefficient due to the fact that much of the content captured is non-relevant [7]. We argue that manual curation of seed URLs is not scalable and not fully effective for archiving events that have high impact. Thus, improved technology is needed.

The IDEAL project is developing a digital library/archive supporting automatic event tracking, crawling, and archiving, as well as effective access (in the sense of aiding in the finding and utilization of relevant high quality information). Figure 1 shows an overview of the IDEAL project. By taking input from tweets, news, webpages, (micro)blogs, and queries, our system will collect and archive event related digital objects, and provide a broad range of helpful services.

Figure 1 Overview of IDEAL system and role of event focused crawling

This dissertation focuses on the data frontend of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. The IDEAL project has around 11 TB of webpage archives (WARC files) and over 1.2 billion tweets across hundreds of different events [8]. Early on, the webpage archives were collected using the Internet Archive’s [6] Archive-It service [9], which uses the Heritrix [10] tool for archiving webpages. Originally, the IDEAL project manually prepared a list of URLs for events, and fed it to the Archive-It service for crawling. The problem with this weakly curated approach is that we produced collections with low precision (i.e., with few relevant and many non-relevant webpages); Heritrix is a general Web crawler and doesn’t analyze the textual content of the webpages before downloading them. To overcome this problem, the IDEAL team shifted to another approach: we extracted URLs from tweet archives that were built about events, and downloaded only the corresponding webpages. The resulting collections have high precision (most of the webpages are relevant) but low recall (not all of the relevant webpages are found).

A focused crawler would help solve each of the previous problems by crawling (to increase recall) the WWW starting from the URLs extracted from the tweet archives but then following only the relevant webpages (to enhance precision) in order to find and collect as much relevant information as possible. However, in the past, focused crawling was mainly applied to topical crawling, i.e., collecting webpages about a certain topic or domain. Accordingly, since we are working with events, we extended the approach previously used by the IDEAL team and adapted/changed the traditional focused crawler approach to accommodate our needs.
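To make the crawling strategy described above concrete, the following is a minimal sketch of a best-first focused crawl over a toy in-memory link graph. It is an illustration under stated assumptions, not the crawler developed in this dissertation: the `TOY_WEB` data, the keyword-overlap `relevance` function, the 0.5 threshold, and the rule that a link inherits its parent page's score are all hypothetical choices made for the example. The sketch also reports the harvest ratio, taken here as the fraction of downloaded pages judged relevant.

```python
import heapq

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("shooting in california police report", ["a", "b"]),
    "a": ("california shooting victims named", ["c"]),
    "b": ("celebrity gossip and fashion", ["d"]),
    "c": ("police update on california shooting", []),
    "d": ("more fashion news", []),
}


def relevance(text, keywords):
    """Fraction of the target keywords present in the text (stand-in scorer)."""
    words = set(text.split())
    return len(words & keywords) / len(keywords)


def best_first_crawl(seeds, keywords, threshold=0.5, max_pages=10):
    """Download pages in priority order; follow links of relevant pages only."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected, downloaded = [], 0
    while frontier and downloaded < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]  # stands in for an HTTP fetch
        downloaded += 1
        score = relevance(text, keywords)
        if score < threshold:
            continue  # non-relevant page: its links are not followed
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: an unvisited link inherits its parent's score.
                heapq.heappush(frontier, (-score, link))
    harvest_ratio = len(collected) / downloaded if downloaded else 0.0
    return collected, harvest_ratio
```

With the keywords `{"california", "shooting"}` and the single seed, this sketch collects `seed`, `a`, and `c`, downloads but rejects `b`, and never fetches `d`, since `d` is reachable only through a non-relevant page.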

Table 1 Example events of different event types

Event Types          Example Events
Bombing              Boston bombing
Building Collapse    East Harlem building collapse
Community            Flint water crisis, Love wins (same sex marriage), World cup
Earthquake           Ecuador, Japan
Fire                 California wildfire, Brazil nightclub fire
Flood                Texas Floods
Hurricane            Joaquin, Sandy, Katrina
Plane Crash          Egyptair, Russian, Germanwings
Political Conflict   Brexit, Turkey coup, Greece Bailout referendum
Protests/Riots       Ferguson, Egyptian revolution
Scandal              Panama Papers, Sepp Blatter
Shooting             Oregon college shooting, California shooting, Orlando club shooting
Terrorist Attack     Paris, Brussels, Nice truck attack
Train Derailment     Amtrak 188, Quebec


There are several definitions of an event that differ according to discipline (see Chapter 4 for more details). In this dissertation we are interested in unusual real world events that capture much attention, leading to a high volume of content generated on the WWW about the event. Thus we consider different types of events like: airplane crashes, building collapses, earthquakes, floods, fires, hurricanes, political crises, scandals, shootings, terrorist attacks, and train crashes/derailments. All these types of events share the same characteristics: something happens at a specific physical location during a specific period of time. Example events for each event type are shown in Table 1. This is not an exclusive list of events and their types, but rather a representative list of what we have been working on in this dissertation. Our research should generalize to any event with the same characteristics (i.e., something unusual happens at a specific place on a specific date).

We proposed five major changes to traditional topical focused crawling:

1. Implementing an event model and representation,
2. Incorporating the event model information extracted from seed webpages’ content into focused crawling,
3. Developing a webpage source importance model,
4. Incorporating the webpage source importance model into focused crawling,
5. Automating the process of seed URLs selection.
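The fifth change, automating seed URL selection, can be sketched as follows. This is a minimal illustration, assuming that seeds are chosen by how often a URL appears across tweets; that selection rule, the function names, and the example tweets are hypothetical, and the sketch omits the expansion of shortened URLs that a real tweet collection would need.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical tweets, made up for this example.
EXAMPLE_TWEETS = [
    "Breaking: http://news.example.com/attack details emerging",
    "Live updates http://news.example.com/attack.",
    "See http://blog.example.org/post1 and http://news.example.com/attack",
    "no link here",
]


def extract_urls(tweets):
    """Return every URL found in the tweet texts, in order of appearance."""
    urls = []
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            # Drop punctuation that often trails a URL at the end of a sentence.
            urls.append(url.rstrip(".,;:!?)"))
    return urls


def select_seeds(tweets, k=2):
    """Assumed selection rule: keep the k most frequently shared URLs."""
    counts = Counter(extract_urls(tweets))
    return [url for url, _ in counts.most_common(k)]


def domain_counts(urls):
    """Count how many extracted URLs come from each website (domain)."""
    return Counter(urlparse(url).netloc for url in urls)
```

On `EXAMPLE_TWEETS`, the most shared URL is `http://news.example.com/attack`, so it is the first seed selected; `domain_counts` gives a rough view of which sources dominate the candidate seeds.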

Our proposed approach intelligently combines the different aspects of an event to heuristically estimate the relevance to the event of a URL and/or a webpage. Our intelligent event focused crawler distinguishes on-topic URLs/webpages vs. off-topic (like with the baseline topic-only approach), and also distinguishes on-event URLs/webpages vs. off-event URLs/webpages on which the baseline topic-only approach fails. For example, consider the California shooting event. Both approaches (our intelligent event focused crawler and the baseline topical focused crawler) successfully identify on-topic URLs/webpages (i.e., URLs/webpages about a shooting). However, our event focused crawler also intelligently distinguishes between URLs/webpages that are on-event (e.g., relevant to the California shooting) and URLs/webpages that are off-event (but still on-topic, i.e., about a shooting).

We conducted four series of experiments to evaluate our system using a set of recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attacks, and Oregon shooting. The first experiment series evaluated the effectiveness of our proposed event model representation when assessing the relevance of webpages. Our event model outperformed the topic-only approaches; it showed better results in precision, recall, and F1-score. The second experiment series evaluated the effectiveness of the event model-based focused crawler for collecting relevant webpages from the WWW. Our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search); it showed better results in harvest ratio. The third experiment evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance but from a smaller set of sources. The fourth experiment provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Our contributions from this research are:

1- A model and representation for capturing the different aspects of events in webpages (topic, location, and date);
2- An extended focused crawler approach that uses our event model to represent content and to estimate the relevance of webpages;
3- An automated approach for social media-based seed URL selection;
4- A method to estimate the value of webpages based on their source importance;
5- An extended focused crawler approach that integrates, for each webpage, the textual content information based on our event model, along with the webpage source importance, into a single relevance score.

The following subsections introduce the hypotheses tested through this dissertation, the research questions investigated, the 5S approach used to guide our work, and the organization of the following chapters.

1.2 Hypotheses

This research considers the front end of the IDEAL project, i.e., collecting and archiving data using a new type of focused crawler. We test two hypotheses about focused crawling, the first of which has three parts:

H1: Using several sources of evidence of webpage relevance, including content-based features, will help identify more of just the relevant webpages.

H1.1: Using temporal and spatial information as part of modeling and representing events leads to better descriptions of events than the topic/keyword-based approach.

H1.2: Incorporating event information extracted from a webpage’s content into focused crawling will increase effectiveness, helping identify more of the relevant webpages while maintaining high levels of precision.

H1.3: A webpage belonging to an important website (e.g., www.cnn.com) should lead to a larger number of relevant webpages linked to it. The source website (e.g., www.cnn.com) acts like a hub that leads to relevant webpages.

H2: Integrating event information and webpage source importance will improve relevance prediction power, ensure balance in types of content crawled, and reduce bias.

1.3 Research Questions

This dissertation addresses five research questions, listed below, that map, as shown in parentheticals, to the hypotheses given above.

R1: How to model and represent an event? (H1.1)
R2: How to compare two event representations? (H1.1)
R3: What is the effect of introducing event handling on the performance of focused crawling? (H1.2)
R4: How to model and represent webpage source importance? (H1.3)
R5: How to integrate handling of events and information sources? (H2)

1.4 5S

Our system is designed taking into consideration the 5S framework [11-14] for developing digital libraries. We describe here the different dimensions of the 5S framework and how they are applied in our system. We are using the 5S (societies, scenarios, spaces, structures, streams) digital library framework for two main reasons. First, 5S provides a checklist and design guidelines that help in the unfolding of our research. Second, the main goal of our system is to build an event digital library, a key part of the IDEAL project, which provides services (by implementing scenarios) for a variety of users (societies). We demonstrate how focused crawling adds content (digital objects) to the IDEAL event digital library. The key digital objects are tweets and webpages, each of which can be viewed as a combination of stream and structure. Another of the main objects in our focused crawler is the event object, a combination of stream and
structure and space. The 5S model allows us to treat events as first class objects (i.e., event objects can be created, stored, sorted on different attributes, searched, and visualized based on attributes like location). A main design consideration of our focused crawler is the ability to receive an event object as input, and then start crawling the WWW for webpages that are relevant to that event object, thereby implementing a variant of the Web crawling scenario. More discussion of 5S and its connection to this dissertation research follows.

Societies: The system can serve different stakeholders such as historians, members of the general public, researchers, analysts, responders, event participants, and decision makers. In addition, there are software agents; see more below under Services.

Scenarios: In addition to activities of software (see more below under Services), the following will be supported:

• A user can start building a collection using the event focused crawler. The user should provide a set of URLs, collect the output of a search engine, or sample from social media content that is provided as input to our system.

• A user can visualize a certain event or group of events. Several visualization schemes are provided, including maps and timelines.

• A user can browse the system content according to facets or other mechanisms, considering: event categories, content type, entities, and archives.

• A user can search all system content types.

• A user can search a certain event collection or several collections.

Spaces: Objects in our system can be viewed in several spaces:

• Webpage documents are represented using the vector space model [15, 16].

• Events are represented in an event space model (topic, location, and date).

• Two- or three-dimensional interfaces aid user interaction.

• Probability spaces aid with characterizing interdependencies and drawing inferences.

Structures: Organization of objects in our system can be performed according to:

• Event category (earthquake, flood, hurricane, shooting, plane crash, etc.), which can fit within a taxonomy, ontology, or other type of structure;

• Data stream type/schema (for text, image, audio, video, event record, archive); or

• Entity (location, date, person name, organization name).

Representations include (metadata) records in databases, graphs, trees, etc.

Streams: These include webpage texts, Web archives, crawl logs (data recording how the focused crawler created the collection), images, videos, audio files, event summaries or descriptions, and other similar entities associated with events. We considered events as first-class objects and created characterizations for each event. Using event-related streams, users can search for an event, browse events, visualize events, etc.

Services (not one of the 5Ss, but related to Societies and Scenarios, and helpful in describing a DL): Our system will provide services including:

• Creating or transforming collections/archives;

• Analyzing collections (extracting entities, providing summaries, classifying, clustering);

• Searching all kinds of data streams;

• Browsing according to event categories, data stream types, and/or entities; and

• Visualizing content.

1.5 Thesis Organization

This dissertation is organized as follows. Chapter 2 reviews related work, including different approaches for focused crawling. Chapter 3 discusses the architecture of the baseline topical/focused crawler. In Chapter 4, we propose our new event model and representation, and explain how it is integrated with the focused crawling approach. Chapter 5 covers the design of experiments performed, while Chapter 6 presents the evaluation of our event model-based focused crawler and of the baseline focused crawler. Chapter 7 explains the webpage source importance model and its effect on our focused crawler. Finally, Chapter 8 concludes and discusses issues for future research.

2 Related Work

We discuss the different fields related to our work in a top-down manner. First we discuss the bigger general topic of Web crawling. Then we discuss the IDEAL project and Web archiving techniques. Then we discuss focused crawler techniques. Most of the work done in traditional topical focused crawling falls into one of three categories: machine learning, semantic similarity, or content and link analysis. We discuss the major work done in these three categories in the next subsections. Along the way, we also touch on publications related to event modeling, social media integration with focused crawling, seed selection, and finally focused crawling evaluation.

2.1 Web Crawling

Web crawlers [17] are software programs that traverse the WWW following the links on the webpages. A crawler models the WWW as a graph where nodes are webpages and edges between nodes are the hyperlinks that manifest in the webpages. A crawler starts from a set of webpages, called the seed set, and follows the links on those webpages. It downloads the corresponding webpages, extracts the links in them, and then repeats the whole cycle. A crawler keeps two data structures that facilitate the crawling process [18], the URL queue (frontier) and the visited URLs list. The crawler uses the frontier to keep the URLs that are extracted from webpages but not visited yet, and the visited URLs list to keep track of the URLs that were visited so it won’t visit them again (or to control the frequency of visiting them again).

Web crawlers have been used by search engines to collect as many webpages as possible. The search engine parses the collected webpages, extracts the text, and builds a search index [18]. The index is the main element used to support the searching service.
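The crawl cycle described in this subsection (a frontier of unvisited URLs plus a visited URLs list) can be illustrated with a minimal sketch. It is not the crawler used in this dissertation: the function name `crawl`, the `fetch_links` callback, and the toy in-memory link graph are assumptions introduced only for illustration, and no network access is performed.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Generic (non-focused) crawl loop.

    fetch_links(url) is a hypothetical callback that returns the out-links
    of the page at `url`; in a real crawler it would download and parse HTML.
    """
    frontier = deque(seed_urls)  # URLs extracted but not yet visited
    visited = set()              # URLs already downloaded
    order = []                   # order in which pages were visited
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue             # never visit the same URL twice
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order


# Toy link graph standing in for the Web (illustrative data only).
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
crawl_order = crawl(["a"], lambda url: toy_web.get(url, []))
```

A focused crawler replaces the first-in, first-out frontier with a priority queue ordered by estimated relevance, which is the change discussed in the sections that follow.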

2.2 The IDEAL project

The IDEAL project [5] team has developed over 1000 tweet archives about general topics and/or events, along with over 66 Web archives of man-made and natural disasters, the latter using the Internet Archive’s general crawler [10].

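Figure 2 Workflow for creating Web archives from social media (Twitter)
[Diagram labels: Event; Keyword/Hashtag; Collect Tweets; Tweet Collection; Extract URLs; Shortened URLs; Expand; Original URLs; Fetch; Webpages; Archive; WARC; Index; SOLR; Access (Browse, Wayback, Search).]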

Event archiving is different from domain/site-based or topic-based archiving. Domain/site-based archiving involves archiving a specific domain/website with all or some of the underlying subdomains/structure. Topic-based archiving covers a given number of webpages related to a user-defined topic. The IDEAL project team has identified and employed three approaches for archiving webpages about events:

1. Manual curation by domain experts, librarians/archivists, and government agencies (high quality, time consuming). See https://archive-it.org/explore?show=Collections&fc=meta_Subject:Spontaneousevents

2. Social media-based (crowdsourcing) curation by extracting, retrieving, and archiving URLs from tweet collections about an event (low quality, time saving). See http://www.eventsarchive.org/?q=node/42

3. Crawling the Web using a focused crawling approach tailored to events (with acceptable quality and time).

The IDEAL project team has used the first approach and created around 66 Web collections about different kinds of events [19]. They manually curated seed URLs and fed them into the Archive-It service for crawling (see Figure 3). They have applied different settings of the Archive-It configuration parameters according to the importance of the event about which they are collecting, and the type of URLs they curated. The two main configuration parameters are the frequency of crawling
and the scope of crawling. The first one controls the frequency by which the crawler should revisit and re-crawl the webpage, while the second parameter controls whether the crawler should follow the outgoing links in the webpage.

The IDEAL team automated the process of curating seed URLs [20] by collecting tweets about the events of interest and then extracting URLs from tweets and feeding them to the Archive-It service for crawling. Social media in general, and Twitter in particular, provide a very rich source for user-generated content, which contains a large number of URLs. The IDEAL project team has created around 1000 tweet collections about general topics and specific events [21]. The tweet collections suffer from noisy content like porn, job ads, marketing, etc. The IDEAL project team has applied several filtering methods to ensure the resulting tweet collections contain relevant content only. Figure 2 shows the workflow for curating seed URLs from social media sources (e.g., Twitter). The resulting seed URLs are archived using the Heritrix [10] tool and then the resulting Web archives are indexed by a search engine for providing access, searching, and browsing services for users.

The last approach aims to maintain a balance between producing high quality event collections and reducing the time/resources needed for collection building. The IDEAL project team has developed tools for semi-automatically collecting, curating, and archiving webpage collections, leveraging methods for event modeling and focused crawling. The event modeling covers especially identifying and representing events considering their What, Where, and When aspects. The IDEAL project’s work on focused crawling could be of benefit for Web archiving by:

1. Helping prepare lists of URLs to be archived (i.e., a focused crawler recommending a seed list);

2. Helping extend a collection automatically (using existing collections for machine learning type training of a focused crawler to find similar new webpages); and

3. Analyzing and summarizing the produced event collections by using the developed event model.

2.3 Web Archiving and Archive-It service

Closely related work has been done in the emerging and promising field of Web archiving and digital libraries (WADL) [22-24]. The most related aspect of Web archiving is the selection of webpages to be archived and how to collect these webpages, a process sometimes known as Web curation, typically governed by a
selection policy. General Web crawlers are the dominant method for Web archiving. However, several new techniques have emerged which do not depend on crawling technology, but rather depend on the transactional behavior [23] of the WWW (HTTP protocol) to drive archiving of webpages.

Many webpage archives are created by curators using the Internet Archive’s service called Archive-It [9], which helps with harvesting, building, and preserving collections of digital content. The service takes URLs as input from a user. Figure 3 shows the interface for users to manually enter the URLs from which the crawler will start crawling. These URLs are used by Archive-It to crawl the Web, guided by manual configuration details (scoping of the domain of webpages to crawl, types of files to crawl, following robots.txt protocol, etc.), and the resulting webpages are captured and stored in WARC [25] files.

The Archive-It service has provided an automated way for different kinds of users to archive and save important webpages in which they have interest. However, the methodology used in the crawler technology behind the Archive-It service is oriented toward archiving general websites (like government, state, university libraries, and federal websites) where the whole content of the website and the frequent changes/updates of the website are the main scope of the archiving process. For spontaneous events this approach is not well suited. Most of the event-related content involves only specific webpages within a website, and those webpages are not frequently changed/updated. Therefore, using the Archive-It service may result in an archive with most of the content not related to the event of interest. In [7] the authors analyzed Web archives about school shootings. Their results show that representative Web archives are noisy, with 2%-40% of webpages reflecting relevant content.

Figure 3 Seed URLs manual curation using Archive-It service

2.4 Focused Crawling

2.4.1 Machine Learning

Machine learning based focused crawler approaches apply text classification algorithms [26-28] to learn a model from training data. The focused crawler then uses the model to estimate the relevance of unvisited webpages. Use of the model enhances the performance of the classifier by incorporating domain specific knowledge and online relevance feedback. Our approach likewise can be considered as involving a classification task; we require training data for calculating the weights of the different aspects of the event and we are using the webpage text for building the event model.

Rennie and Barto [29] used reinforcement learning for solving the focused crawling problem. They modeled the focused crawling problem as a Markov decision process with webpages as states, URLs as actions, and on-topic webpages as the rewards. Another reinforcement learning algorithm, temporal difference learning, was used
in [30]. They used a state value function to estimate the importance of webpages to lead to future relevant webpages. In our approach, we use webpage source importance to estimate the value of webpages for linking to other relevant webpages.

Another work [31] used the reinforcement learning framework proposed in [29] and enhanced its performance by applying incremental online learning. For each new URL, they estimate its corresponding class and use its features to update the class features and its corresponding q-value. Then they retrain the supervised learning algorithm based on the new training data (old training data and the new URLs seen). This approach eliminates the data bias that appears in the test data, where unseen URLs may appear from new domains that were not found in the training data. In our approach we address bias through using webpage source importance both in selecting seed URLs and also during crawling.

InfoSpiders is a topical crawler based on adaptive online agents [32] that use genetic programming and reinforcement learning approaches to estimate the relevance of a webpage. In our approach we go beyond just using topics, and use event modeling for estimating the relevance of a webpage.

In [33], a focused crawler was developed for collecting webpages that contain semantic information (semantic annotations expressed as structured data embedded in the HTML of the webpage). The authors proposed a new methodology for crawling the Web that utilizes an online classifier and a reinforcement learning bandit algorithm for selection. The online classifier learns to detect the relevant webpages during crawling, eliminating the need for training a model before crawling. The bandit algorithm provides a framework for ordering and selecting the next URL to visit during crawling. The URLs in the crawler frontier are grouped by their Web host/domain and each host/domain is scored according to an importance measure. The crawler chooses the host/domain with the highest score and then chooses from that host/domain the URL with the highest score.
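The two-level selection attributed to [33] above (first the best host/domain, then the best URL within it) can be sketched as follows. This is only an illustration of the greedy selection step under assumed inputs: the function name `select_next_url`, the dictionary layout, and the example scores are hypothetical, and a real bandit algorithm would also include an exploration component that is omitted here.

```python
from urllib.parse import urlparse


def select_next_url(frontier, host_scores):
    """Pick the next URL: best-scoring host first, then best URL on that host.

    frontier: dict mapping URL -> URL score (assumed layout).
    host_scores: dict mapping host name -> importance score (assumed layout).
    """
    by_host = {}
    for url, score in frontier.items():
        host = urlparse(url).netloc          # group URLs by Web host
        by_host.setdefault(host, []).append((score, url))
    if not by_host:
        return None
    # Hosts without a recorded score default to 0.0 (an assumption).
    best_host = max(by_host, key=lambda h: host_scores.get(h, 0.0))
    _, best_url = max(by_host[best_host])    # highest URL score on that host
    return best_url
```

For example, with `frontier = {"http://a.example/1": 0.2, "http://a.example/2": 0.9, "http://b.example/1": 0.95}` and `host_scores = {"a.example": 0.8, "b.example": 0.3}`, the function returns `"http://a.example/2"`: the single highest-scoring URL is skipped because its host ranks lower.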

2.4.2 Semantic Similarity

Semantic similarity-based techniques use ontologies [34, 35] for describing the domain of interest. The domain ontology can be built manually by domain experts or automatically, using concept extraction algorithms. Once the ontology is built, it can be used for estimating the relevance of unvisited webpages by comparing the concepts extracted from the target webpage with the concepts that exist in the ontology. The performance of semantic focused crawling depends on how well the ontology describes and covers the domain of interest. Many of our events are
disaster-related events. Although we are not using domain knowledge or event ontologies, our event model could be easily extended to make use of disaster domain knowledge for finding disaster-specific keywords. Our event model also could be easily mapped to portions of a disaster-related event ontology [12, 36], but further research would be needed to see if such would yield improvement. Further, using this approach in a system that is aimed to handle any type of important event would require considerable knowledge engineering work.

Related to semantics is sentiment. Sentiment analysis has been integrated into focused crawling in two ways: sentiment-focused crawling [37] and focused crawling leveraging sentiment information [38]. In both works, crawlers made use of the sentiment information to build the target model and to guide, focus, and direct the crawler through the Web graph by estimating the relevance of unvisited URLs. Sentiment oriented crawling is considered a type of topical crawling, where the target is to crawl webpages that have a given sentiment. Sentiment classes could be as simple as positive vs. negative, or complex based on a specific domain. In our work, events have more complex structure than sentiment. We can leverage sentiment information about events but we have to find the relevant webpages about events first and then analyze the sentiment information in those pages.

2.4.3 Content and Link Analysis

Text and link analysis algorithms combine text analysis schemes (e.g., Vector Space Model (VSM) [15, 16]) and link analysis algorithms to estimate the relevance and importance of webpages [39-41]. Link analysis approaches introduce the concept of popular webpages. Popularity is measured based on the link structure of the WWW. This led to the introduction of the concepts of Hub and Authority webpages [42]. Hub webpages have links to many authority webpages, while authority webpages are linked to by many hub webpages. Among the link analysis algorithms, PageRank [43, 44] is the most used. Alternatively, context graphs are used to represent the context of a webpage using neighborhood webpages that are most similar to it.

Another line of research incorporates the genre of webpages into focused crawling [45]. The genre of the webpage defines the type of the webpage (e.g., forum, tutorial, news, blog, course-syllabus, etc.). The focused crawler uses two sets of keywords, one for determining the genre of the webpage and the other set for determining the topic. The two sets of keywords (genre and topic) are manually determined by experts and then used by the focused crawler for estimating the relevance of the webpage.
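The mutually reinforcing hub and authority notions mentioned earlier in this subsection can be made concrete with a short sketch of the standard iterative computation. The function name `hits`, the fixed iteration count, and the dictionary-based graph format are assumptions chosen for illustration; this is not part of the crawler developed in this dissertation.

```python
def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph given as page -> list of out-links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```

On a toy graph in which two pages each link to the same two targets, the linking pages receive all of the hub score and the targets receive all of the authority score.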

Related to webpage source importance, Pant et al. [46, 47] describe a new Web characteristic: status locality on the Web. A webpage’s status measures the importance of the webpage with respect to its popularity, and is approximated by the number of links pointing to it. Pant et al. developed an algorithm for estimating the status of a webpage based on local characteristics of the webpage and also demonstrated that the status property has some of the same characteristics as the topical property. Our approach also tries to predict the importance of a webpage by examining the importance of the domain to which the webpage belongs. The importance in our case is viewed with respect to the domain of the event of interest. For example, in an earthquake event, some websites may be more dominant, and thus more important, than other websites, based on the number of relevant webpages found for that domain.

Chen [48] developed a hybrid approach for focused crawling using genetic programming for exploiting different features in a webpage’s text, and metadata search for exploring different sources on the WWW. Chen applied the genetic programming approach for combining different relevance signals from the webpage text. He also used metadata search for gathering several seed URLs for the crawler to start from, thus expanding the crawler’s coverage of the WWW. In order to overcome the bias that can be found in one search engine, he used multiple search engines and combined their results. Our approach also aims to improve recall in crawling, but uses other types of evidence (webpage source importance) to do so.

2.5 Event Modeling

Event modeling recently has gained popularity in different fields, like topic detection and tracking (TDT) [49], animal disease outbreak detection [50], networked multimedia events [51], and document similarity [52]. In TDT, an event is described as a topic that happens at a certain time, in a specific location, and with a particular set of participants. In multimedia applications, an event is defined as a tuple of aspects: informational, spatial, temporal, structural, causal, and experiential. The informational aspect includes event ID, event type, and any other attributes that serve as identification of the event. Spatial and temporal aspects represent the location and time properties, respectively. The structural aspect includes the sub-events belonging to the current event. The causal aspect includes the events causing the current event. Finally, the experiential aspect includes all media resources related to the current event.

We incorporated event modeling into our crawler, to build an event-aware focused crawler [3]. We came up with an event model that integrates ideas from TDT [49]
(we built our event model using the same definition: something that happened at a certain place on a specific date) and work done in [50] (they defined an event as a combination of domain-related keywords, location entities and date entities, and disease name entities that appear on the sentence level). We use topic, location, and date information, but aggregate at the collection level.

A probabilistic event model has been developed in [53] where an event is modeled as a latent variable that generates webpages, i.e., the observations. A webpage about an event is modeled as a tuple of three parts (topic, entities, and date). Topic and entities are modeled as variables with multinomial distribution over words in webpage content, and date is modeled as a normal distribution over the publishing date of the webpage. According to this model, using a time analysis of publishing dates of webpages, an event is represented by a peak in the number of webpages published around a certain date. A peak could represent one event or multiple events that happened on the same/close dates. The topic and entities parts of the model help in discriminating between events sharing the same peak.

In [54, 55], the LDA topic modeling framework is used to model events. An event is represented by a mixture of topics over a set of webpages. The topics considered are background topic, topic for each event (extracted from the set of webpages belonging to that event), and topic for each document. The reason for dividing topics in this manner is to capture and separate the language used for general purposes (common background words), the language used for each event specifically, and finally the language used for each webpage specifically.

Events have been analyzed in social media (like Twitter) using tailored methods for extracting event-related information from the tweets [56-59]. Events are modeled as a set of words with specific structure like “subject verb object”. The purpose of that research is not collecting information related to an event, but rather, given a set of short texts (tweets), how to extract event-related information. This type of work models events at the fine-grain level, where it looks for specific structure in sentences that might represent an event. In this dissertation, events are modeled at a coarse-grain level using the combination of words at the webpage level.

In [60] events are modeled as the co-occurrence of spatial and temporal tokens in one sentence. The authors provided a hierarchy for modeling spatial and temporal tokens. For spatial hierarchy they used country, state, city, and street, while for temporal hierarchy they used year, month, and day. Using these hierarchies, the
authors developed a similarity function for estimating the similarity between documents using event model instances extracted from the documents.

In [61] the authors analyzed the characteristics of information sources during news events. They analyzed three types of information sources: media outlets, social media, and query logs. They analyzed three events: San Bruno pipe explosion, New York storm, and Alaska 3-way elections. They analyzed these events because all three shared the same characteristics: 1) location is centralized (i.e., no multiple locations), 2) time span is limited (i.e., so they attract attention and interest for a short period of time). Their definition of events is similar to our definition: something happened at a certain place and specific time.

Thus, there are several ways for defining an event depending on the context and the application in which the event is used. Our event model is similar to the probabilistic event model [53], which incorporates the topic, date, and entities aspects. We do identify a date for an event. We also recognize location entities. However, other types of entities found are included in the topic part as normal keywords. Our event model can be easily extended to add other types of entities (Persons, Organizations, etc.).

Combining several types of evidence is an important task in focused crawling. Initial work has used content and link based information. Multimedia information also was used [62], where text and images were analyzed for estimating webpage relevance. There a Bayesian network was used for integrating evidence from different sources of information (text and images). Likewise, we combine the webpage source importance with the webpage relevance score, in particular by multiplying both together to produce a final score.
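The multiplicative combination mentioned above can be sketched as follows, together with the ratio-based source importance estimate stated in the abstract (relevant versus non-relevant webpages found while crawling a website). The class name `SourceImportance`, the per-host bookkeeping, and the add-one smoothing are assumptions made so the sketch runs on sparse counts; the exact formulation used in this dissertation is given in Chapter 7 and may differ.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SourceImportance:
    """Tracks, per host, how many crawled pages were relevant or non-relevant."""

    def __init__(self):
        self.relevant = defaultdict(int)
        self.non_relevant = defaultdict(int)

    def update(self, url, is_relevant):
        host = urlparse(url).netloc
        if is_relevant:
            self.relevant[host] += 1
        else:
            self.non_relevant[host] += 1

    def score(self, url):
        host = urlparse(url).netloc
        # Ratio of relevant to non-relevant pages seen on this host.
        # Add-one smoothing is an assumption; it avoids division by zero
        # and gives unseen hosts a neutral score of 1.0.
        return (self.relevant[host] + 1) / (self.non_relevant[host] + 1)


def final_score(relevance, importance):
    """Combine content relevance and source importance by multiplication."""
    return relevance * importance
```

With three relevant pages and one non-relevant page recorded for a host, the smoothed importance is 4 / 2 = 2.0, so a page from that host with content relevance 0.5 receives a final score of 1.0.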

2.6 Social Media and Focused Crawling Seed Selection

In [63], the authors combined the focused crawler technique with social media to improve the freshness of the crawl. A focused crawler is limited by the set of seed URLs it starts from. Social media produces a huge amount of user generated content (e.g., tweets) that may contain URLs. Since social media content is produced live, the URLs contained therein would be fresh and possibly more recent than the URLs visited otherwise by the focused crawler. Injecting URLs from social media into the focused crawler’s URL queue should increase the freshness of the Web collection produced. The authors crawled webpages about two events, Ebola and the Ukraine conflict, and used a keyword-based model to represent the two events.

We used social media content (i.e., Twitter) as a source of seed URLs. We collected tweets about events and, after the collection process finished, we extracted the URLs and filtered them to get
relevant URLs.

In [64], the authors examined the topical quality of existing Web archives about events. They built a framework that assesses whether the seed URLs used in building the Web archive are on-topic or off-topic across the different times it was crawled. The authors used the vector space model to represent the documents and applied several similarity measures to calculate the similarity scores. They evaluated their method using different threshold values to find the value that yields the best performance. We used the cosine similarity for scoring webpages and URLs about events with different threshold values.

We have used social media content (i.e., Twitter) to extract and select seed URLs for focused crawling. Unlike the work in [58], we extracted the URLs from the tweet collections and then started crawling.
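The step of extracting candidate seed URLs from a tweet collection can be sketched as below. The regular expression, the function name `extract_seed_candidates`, and ranking by the number of tweets that mention a URL are assumptions chosen for illustration; they are one possible selection method, not necessarily the one evaluated in this dissertation. Expanding shortened URLs requires network access and is left out.

```python
import re
from collections import Counter

# Simple pattern for http/https URLs in tweet text (an illustrative assumption).
URL_PATTERN = re.compile(r"https?://\S+")


def extract_seed_candidates(tweets, min_count=1):
    """Return URLs found in tweets, most frequently mentioned first."""
    counts = Counter()
    for text in tweets:
        # Strip trailing punctuation, then count each URL once per tweet.
        urls = {u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text)}
        counts.update(urls)
    return [url for url, count in counts.most_common() if count >= min_count]
```

Raising `min_count` keeps only URLs shared in several tweets, a crude filter against the noisy content noted in Section 2.2.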

2.7 Evaluating topical/focused crawlers

Several methods have been developed for evaluating focused crawlers [65, 66]. In [67] a general evaluation framework has been developed where any focused/topical crawler can be assessed according to the evaluation framework, independently from the domain topic. The authors used three methods for evaluating different focused crawlers. The first one is using a classifier to classify the resulting webpages as relevant or non-relevant (on-topic or off-topic). The second method is using a retrieval system where the collected webpages are indexed in the system and specific queries are run against the collection. Different crawlers are evaluated based on the number of relevant webpages retrieved for each query. The third method is using average similarity scores. Different focused crawler methods are evaluated based on the average similarity scores at different stages of the crawl. Since none of these three techniques fits well with our approach, we use alternative evaluation schemes in the work that follows.

The above mentioned works cover much of the background for our research. While there have been a variety of related studies, our investigation is unique, and improves upon the methods we have uncovered to-date.

3 Focused Crawling

3.1 Topic Representation

One of the inputs to a focused crawler is a set of URLs; together these can be used to describe the event/topic of interest. We refer to them as model URLs; they can be the same as or different from the seed URLs. In this dissertation the model URLs are the same as the seed URLs for small-scale experiments, while in large-scale experiments, where the seed URL list is very large, the model URLs are a smaller set. We used a separate model URL list because, as we will describe later in Chapter 5, in the large-scale experiments the number of seed URLs is very large. Using all of those to build a model would be time consuming, and could delay the start of a crawl. Accordingly, we use a smaller number of URLs as model URLs to build the event model. These are selected on the basis of providing high quality textual content about the event/topic. The focused crawler uses this set of URLs to build its event/topic model, and then uses the model to estimate the relevance [65, 68-73] of the URLs and webpages it encounters during crawling. The remaining seed URLs are added to the queue for the crawl, helping ensure breadth of coverage and reducing bias. From now on, we use the term seed URLs for both seed URLs and model URLs for simplicity.

We consider two ways to represent an event. In the rest of this chapter, we consider the first, traditional, baseline approach, where an event is treated like a topic [74], characterized by a set of keywords. In Chapter 4, we describe our new approach, where an event is described with a richer model. We chose a best-first focused crawler as our baseline method because it has proven to be the state-of-the-art method in topical focused crawling [65, 75]. The baseline best-first focused crawler uses the Vector Space Model (VSM) [76] approach to build its event/topic model:

1. Using the model URLs, download the corresponding webpages and extract the text from those webpages. Each webpage is tokenized into a set of words, stopwords are removed, and the remaining words are stemmed and then converted to a vector. Here the vector represents the unique terms in the webpage and their frequencies (how many times they appear in the webpage).

2. The crawler then builds a vocabulary index using the webpage vectors. The vocabulary index maps each unique word found in the webpages to a list of its frequencies in those webpages.

3. Using the vocabulary index, the crawler calculates a weight for each word by summing all its frequencies in the webpages in which it appeared. This corresponds to the word collection frequency as opposed to the word Term Frequency (TF). We haven't used Inverse Document Frequency (IDF) as we are using the seed webpages only, rather than a large general corpus.

4. The crawler selects the top k words with highest weights as a model for the event/topic of interest. The weights of the words are calculated by using the log of the frequencies, to eliminate the risk of long documents dominating short documents.
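The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the dissertation's implementation: the stopword list is a tiny placeholder, stemming is omitted, the function names are hypothetical, and the exact form of the log weighting (here 1 + ln of the collection frequency) is an assumption, since the text only states that the log of the frequencies is used.

```python
import math
import re
from collections import Counter

# Placeholder stopword list (assumption); a real crawler would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "at"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords (stemming omitted)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_topic_model(page_texts, k):
    """Return the top-k words mapped to log-scaled collection-frequency weights."""
    collection_freq = Counter()
    for text in page_texts:
        # Steps 1-3: per-page term frequencies are summed across all seed pages.
        collection_freq.update(tokenize(text))
    # Step 4: keep the k most frequent words; dampen their weights with a log.
    return {word: 1.0 + math.log(freq)
            for word, freq in collection_freq.most_common(k)}
```

For instance, calling `build_topic_model(seed_texts, 10)` on the text of the seed webpages would yield a ten-term reference vector of the kind the baseline crawler compares pages against.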

The baseline crawler uses the event/topic model to represent the event/topic of interest and also to model each webpage it visits during crawling. So the event/topic vector has a slot for each term found in the vocabulary (or feature space) that arises from the webpages. More specifically, after getting the URL with the highest score from the queue, the crawler downloads the webpage, extracts the text, tokenizes the text into tokens (words), removes stopwords, applies stemming, does frequency analysis, and converts the text into a vector of words with their frequencies. The final webpage vector representation is constructed using the words in the vocabulary built from the model webpages and their corresponding frequencies in the webpage. The word frequency in the webpage is called the term frequency (TF) in the information retrieval literature [16] and is part of the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme [15, 16]. As mentioned in step 3 of the procedure above, we have not used the Inverse Document Frequency (IDF) because it usually is calculated for a general corpus of webpages where relevant and non-relevant webpages exist. The focused crawler uses only presumed relevant webpages from model (or seed) URLs, and thus including the IDF value might lead to removing relevant keywords.

Later, the crawler estimates the relevance of a webpage by calculating the cosine similarity between the event/topic vector and the webpage vector. Also, the crawler estimates the scores of all the URLs in that webpage. For each URL, the crawler combines the URL tokens and anchor text, converts them to a vector of words with their frequencies, and calculates the cosine similarity of the resulting URL vector to the event/topic vector. Then the URL is inserted into the queue with the estimated score. The vocabulary (keywords or features) which the crawler uses to represent the webpage vectors and the extracted URL vectors is built and extracted from the set of webpages corresponding to the seed URLs. The webpage vector is used to estimate the relevance of the webpage and produce a relevance score. The URL score is calculated as an average of the URL vector score (calculated as cosine similarity between event/topic vector and URL vector) and the score of the webpage in which the URL appeared. So the crawler is making use of three types of textual information: webpage text, URL anchor text, and URL address tokens. Using webpage text and URL information has been shown to be more efficient than using webpage text only [75]. Figure 4 shows the architecture of the baseline best-first focused crawler, with the topic representation and relevance estimation processes highlighted in dashed boxes.

Figure 4 Architecture of baseline focused crawler with topic representation in the lower box, crawling in the upper box, and processing and relevance estimation in the middle box.
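The relevance estimation described above can be illustrated with a short Python sketch. It assumes vectors are stored as sparse term-to-weight dictionaries; the function names are hypothetical and the code is only one plausible way to realize the cosine similarity and the averaged URL score.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term -> weight dictionaries."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector has no similarity to anything
    return dot / (norm_a * norm_b)

def url_score(url_vec, parent_page_score, topic_vec):
    """Average of the URL's own cosine score and the score of its parent page."""
    return (cosine_similarity(url_vec, topic_vec) + parent_page_score) / 2.0
```

Averaging lets a URL with uninformative tokens still inherit some priority from a highly relevant parent page, while a URL with strongly relevant anchor text can rise even when it appears on a weaker page.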

3.2 Crawler Architecture

A general Web crawler [17, 18, 77-82] consists of a webpage fetcher (downloader) for retrieving webpage contents, a URL queue (frontier) for storing unvisited URLs, and a webpage processor for extracting text and URLs out of a webpage's HTML. Crawlers model the WWW as a graph G(V, E) where nodes (V) are webpages and edges (E) are links between webpages. So, two webpages (nodes) will have an edge between them if one webpage has a link pointing to the other webpage. Similar to a general Web crawler, a focused crawler has a webpage fetcher, URL queue, and webpage processor. In addition, a focused crawler has a topic or domain-specific model, and a module for estimating the relevance of URLs and webpages. Typically, a focused crawler takes as input: 1) the desired number of pages to collect, and 2) seed URLs to start crawling from. It outputs the set of webpages found [26, 41, 66, 75, 83].


One of the important aspects of (focused) crawlers is the ordering of the URLs in the queue, which specifies the order of visiting the nodes of the graph. In the focused crawler literature [27], best-first search is the most commonly used technique and is considered the state-of-the-art focused crawler, taking into consideration the estimated relevance of the URLs/webpages during crawling.

A focused crawler starts from a seed URL. It downloads the corresponding webpage and extracts the text of that webpage. The focused crawler then estimates the relevance of the webpage textual content with regard to the topic/event of interest. In the next step, there are two design options. One option is that the focused crawler decides whether the webpage is relevant or not by comparing its estimated score to a pre-defined threshold. If the webpage is considered relevant, then the focused crawler extracts the embedded URLs from the webpage and inserts them into the queue. The other option is that the focused crawler extracts all embedded URLs from the webpage and then inserts those into the queue, not being constrained by the webpage score. The second option takes into consideration the tunneling phenomenon in crawling, where a non-relevant webpage links to relevant webpages, either directly or through several steps.

When inserting the extracted URLs into the queue, the focused crawler has to make another decision. One option is to insert all extracted URLs, along with the estimated score of the webpage from which they were extracted. Another option is to estimate the relevance of each URL based on the tokens in both the URL's address and anchor text, and insert the URL and its resulting estimated relevance score into the correct position in the priority queue. We adopt a hybrid approach where we use the average of a URL's score and the score of the parent webpage from which the URL was extracted [75]. Next, the focused crawler pulls from its queue the URL with highest score, and repeats the process.

Figure 5 shows a focused crawler algorithm that handles tunneling (i.e., extracts the URLs from the webpage regardless of score): estimating the score of each URL and inserting it into the queue with its estimated score. We consider this approach as the foundation for the baseline for evaluation comparisons.


Algorithm: Baseline Focused Crawler
Input: Seed URLs, pagesLimit, pageScoreThreshold, urlScoreThreshold

Insert seed URLs in priority queue
## Topic Representation
topicVector = Build topic representation from seed pages
## Crawling
while pagesCount < pagesLimit and priorityQueue is not empty:
    URL = pop(priorityQueue)
    append URL to visited list
    page = download(URL)
    ## Preprocessing (Vector Space Model)
    pageVector = process(page text)
    ## Relevance Estimation (Cosine Similarity)
    pageScore = calculateScore(pageVector, topicVector)
    pagesCount += 1
    if (pageScore >= pageScoreThreshold):
        page.getURLs()
        relevantPagesCount += 1
        save page to event-related collection
    for link in page.outgoingURLs:
        URL = link.address
        validate(URL)
        if URL not in visited list and URL not in priorityQueue:
            ## Preprocessing (Vector Space Model)
            urlVector = process(URL text)
            ## Relevance Estimation (Cosine Similarity)
            urlScore = calculateScore(urlVector, topicVector)
            if urlScore >= urlScoreThreshold:
                push(URL, priorityQueue)

Figure 5 Baseline focused crawler algorithm

3.3 Large Scale Design Considerations

Ideally, the focused crawler should score all the URLs it extracts from a webpage and insert them into its frontier based on their scores. In small-scale crawls the frontier size is manageable; in large-scale crawls, however, the frontier can grow very fast and slow down the focused crawler, due to memory constraints.


The frontier is constructed using a priority queue, often implemented using a max heap [84, 85]. The max heap data structure provides pop and push operations like a normal queue, and maintains the max heap property, i.e., that the element with the maximum value is always on top of the heap. Thus, the pop operation will always return the element with maximum value. Reading the top element is O(1), but removing it requires restoring the heap property, so the pop operation is O(log n), where n is the number of elements in the heap. The push operation has to insert the element in its correct position to maintain the property that the maximum element is always on the top of the heap, and is also O(log n). The push operation is repeated for every URL extracted while the pop operation is repeated for every URL visited. Therefore, the focused crawler would spend a considerable amount of time in maintaining the max heap property, especially when the heap size grows. Thus, during large-scale crawling experiments, we limit the number of URLs in the max heap by inserting only URLs with a score above a given threshold and discarding URLs with a relevance score below it; we set the threshold to 0.1 empirically.

Another component that consumes memory in large-scale situations is the visited URLs list. This list keeps the URLs that the focused crawler has visited (and fetched their corresponding webpages) so that the focused crawler doesn't visit them again. For every URL extracted or popped from the frontier, the focused crawler checks if the URL exists in the visited URLs list. Using a normal list, this operation will be time consuming when the list size is very large. For large-scale experiments, we implemented the visited URLs list using a hash table, where the hash key is the URL address. Checking if a key exists in a hash table is much faster than searching a list.

The above mentioned approach characterizes the baseline focused crawler used in evaluation studies reported below. The event focused crawler uses the same general approach, but varies due to its use of an event model (and, later, webpage source importance). Thus, comparisons allow determining the effects of the event model and webpage source importance.
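The two design choices above (a thresholded max-heap frontier and a hash-based visited set) can be sketched in Python as follows. This is an illustrative sketch, not the dissertation's code: the class and attribute names are hypothetical, Python's `heapq` is a min-heap so scores are negated to obtain max-heap behavior, and treating a score exactly equal to the threshold as acceptable is an assumption.

```python
import heapq

class Frontier:
    """Max-priority URL queue built on heapq (a min-heap) by negating scores."""

    def __init__(self, min_score=0.1):
        self.min_score = min_score  # URLs scoring below this are discarded
        self._heap = []             # entries are (-score, url)
        self._queued = set()        # URLs currently waiting in the heap
        self.visited = set()        # URLs already popped; O(1) average lookup

    def push(self, url, score):
        """Insert url unless it is low-scoring, already queued, or visited."""
        if score < self.min_score or url in self._queued or url in self.visited:
            return False
        heapq.heappush(self._heap, (-score, url))  # O(log n)
        self._queued.add(url)
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)  # O(log n)
        self._queued.discard(url)
        self.visited.add(url)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

Keeping the queued and visited URLs in sets makes the duplicate checks constant time on average, which is the same motivation the text gives for replacing the visited list with a hash table.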


4 Event Focused Crawler

4.1 Event Model and Representation

4.1.1 Event Modeling

In the focused crawler literature [74], events generally are considered as topics and are represented with a list of keywords. Although this approach might work well for some events, it works less well for other kinds of events. For example, representing events as a list of keywords would work in cases where the topic part of the event is most dominant and important, while the location and time parts aren't important or don't play a significant role in the event. The outbreak of Ebola is a good example of an event where the topic (spreading of Ebola) is the most important aspect. The location and date are part of the details, but are not that important, i.e., the topic part is largely sufficient to clearly describe the event. On the other hand, shooting events, for example, can't be described with the topic part only. Since there are many shooting events in different places and at different times, we need the location and date parts to clearly describe a particular shooting event.

In this dissertation, we are focusing on unusual real-world news events which cause or create an impulse in people's interest and the media coverage about the event [61]. Good examples of this type of event are natural disasters, elections, shootings, terrorist attacks, and incidents (pipe explosion, crane or building collapse, etc.). This type of event is characterized by two main things: 1) it has a centralized physical location and 2) it has a limited time period in which its effect appears (impulse in media coverage and people's interest). Even for elections, we are concerned not with the general aspect of the election but with a specific incident that captures people's interest or causes an impulse in media coverage. For example, the Alaska elections on November 2, 2010 captured popular attention because of a 3-way race which an independent candidate won [61, 86].

This is in contrast to many other event definitions in different fields. In the computational linguistics field, an event can be defined as "a situation that occurs" while in the multimedia field it is defined as a change in the state of an object in a video or photo stream. In the information extraction field, the focus is on extracting real-world local events like concerts, theater performances, birthday parties, etc.; information comes from textual content and yields records of events in which people have interest [87].


Event Model: Before considering complex event models, a simple scheme should be tested first. Thus, we define an event as something (e.g., a disaster), which happened in a certain place, and at a certain time. Thus, an event E is a tuple <T, L, D>. The three parts reflect what, where, and when. Thus, T is the topic of the event, L is its location, and D is its date. These are explained below in more detail.

Topic: Using a set of seed URLs, we create an event vocabulary (a set of unique keywords that appear frequently in the webpages associated with those seeds). We represent an event with a reference vector created by taking the top keywords from the event vocabulary.

Date: The event date is given by a user or is extracted automatically from the set of seed webpages. The event date represents the starting date when the event first occurred. The event also could have an ending date, or even a set of periods in time.

Location: A small set of location entities is likely to appear frequently in most of the seed webpages, representing places related to the event. These location entities are extracted (as described next) from the seed webpages' texts; we perform a frequency analysis to help find the most important location entities mentioned in the seed webpages.

For example, we model the shooting that happened in San Bernardino, California on December 2, 2015 as follows:
    Topic: shooting, shooter, …
    Location: San Bernardino, California
    Date: 12/02/2015

Similarly, we model the attack that happened in Brussels, Belgium on March 22, 2016 as follows:
    Topic: terror, attack, explosion, …
    Location: Brussels, Belgium
    Date: 3/22/2016

Our event model (combining topic, location, and date) can default to the topic-only model in the case of Ebola by ignoring the date and location parts (by setting the weights of the location and date parts to zero). We note that this would be the case also regarding the Zika virus disease outbreak. We can add a part in our system so that if the event type is disease outbreak (manually entered by the user), the system automatically defaults to the topic-only model. Alternatively, other defaults, like giving a small weight to the location and/or the date part, could be instituted. Thus, our system is flexible, and can be used in cases where it is difficult to determine, for a given event, which model is more efficient. But if the event is a news/world event which is physically localized (has a clear center) and temporally limited (with an impulse in number of articles published in a relatively short period) then we believe our event model (combining topic, location, and date) should perform more efficiently than the topic-only model. Thus, if a user is interested in the outbreak of Zika/Ebola disease in a certain place and at a certain time, then our event model (combining topic, location, and date) should perform better than the topic-only model.

Figure 6 shows the steps of building an event model from seed webpages. We start the process of building an event model by downloading the webpages corresponding to the seed URLs. We then extract dates from the seed URLs and the seed webpages. To do this, we first try to extract the publication date from the seed URLs using a pre-defined regular expression. If that fails, we extract the publication date by parsing a pre-defined set of tags from the HTML of the webpages.

Figure 6 Steps of building event model from seed webpages

For the date extraction, we have used a library¹ for extracting the publishing date of a webpage using heuristics. The first step is to extract the publishing date from the URL using regular expressions, if applicable. For example, the URL http://www.cnn.com/2016/07/10/us/black-lives-matter-protests/index.html has a publishing date of July 10, 2016. If the URL doesn't contain date information, then the next step is to look for specific tags in the header portion of the corresponding webpage's HTML. An example tag that contains the publishing date looks like:

<meta name="pubdate" content="2015-11-26T07:11:02Z">

This appears in the head tag of the HTML content of a webpage. There are multiple meta tags that might contain the publishing date; hence the library has an extensive list of possible meta tags that are frequently used in different websites. The final step is to check in the body of the webpage, if no publishing date is found in the head tag. As before, a list of frequently used body tags is used to guide finding the publishing date. An example of such a tag containing the publishing date is:

<p class="pubdate">Sept 3, 2011</p>

If there are multiple dates found in the webpage, the library returns the first one only. The library tries to extract the publishing date from the URL, then from the head tag of the HTML, and then from the body tag of the HTML. The order is important because it follows the expected accuracy of the extracted date: a date extracted from the URL is expected to be more accurate than one from the head, which in turn is expected to be more accurate than one from the body. An extra step that could be done is to use natural language processing techniques to extract named entities (dates) from the textual content of the webpage. Using the extracted date entities, we can figure out the publishing date of the webpage. However, we have not used this approach, because with the library we used we managed to extract publishing dates from most of the webpages, and because of the overhead of calling and using the named entity recognizer.

For the event model locations vector, we segment the text of the webpages of the seed URLs into sentences and apply the Stanford Named Entity Recognizer (SNER)² on each sentence to extract location entities. We then perform frequency analysis on the extracted location entities and construct the locations vector. It includes the unique locations extracted, along with their frequency of occurrence in all sentences in all seed webpages (i.e., the weight of each location is the cumulative frequency in all seed webpages). The resulting locations vector will include the locations frequently mentioned in the set of webpages corresponding to the seed URLs, which should be the location of the event of interest, assuming the seed webpages are relevant and of high quality (with regard to containing enough information about the different event aspects, namely topic, location, and date). The SNER could extract location entities not related to the event from some of the seed webpages, as a webpage may include references to multiple locations. This should not affect the model, however, as the frequency of those location entities should be very small (since they typically appear in few of the seed webpages). On the other hand, if the event occurs in multiple locations, a suitable list of locations should be found through the above mentioned processing of seed webpages (i.e., the seed webpages should cover the different locations of the event and not be concentrated on one location only).

For the topic vector, we perform the same processing as for the baseline VSM. We tokenize the text of the webpages of the seed URLs into words, remove stopwords, stem words, perform frequency analysis, and construct the topic vector as the set of unique terms along with their frequency of occurrence.

¹ https://github.com/Webhose/article-date-extractor
² http://nlp.stanford.edu/software/CRF-NER.shtml
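As a rough illustration of two of the steps above, the following Python sketch extracts a date from a URL and builds a locations vector by frequency analysis. The regular expression is a hypothetical pattern written for this example (it only handles numeric /YYYY/MM/DD/ path segments) and is not the pattern used by the cited library; the location entities are assumed to have been produced already by a named entity recognizer, which is not invoked here.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical pattern (assumption): numeric /YYYY/MM/DD/ path segments only.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")

def date_from_url(url):
    """Return the date embedded in a URL path, or None if none is found."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a valid calendar date

def locations_vector(entities_per_page):
    """Cumulative frequency of each location entity over all seed webpages."""
    counts = Counter()
    for entities in entities_per_page:
        counts.update(entities)
    return dict(counts)
```

A URL whose month is spelled out (for example /2016/jun/09/) would not match this simple pattern; handling such variants, and falling back to the head and body tags, is where a dedicated date-extraction library earns its keep.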

4.2 Event Processing

4.2.1 Event Model-based Webpage Scoring

In this section, we show how the event focused crawler uses the event model to calculate a score for each webpage it visits and for the URLs extracted from that webpage. Focused crawlers assign each downloaded webpage a score, which estimates the relevance of the webpage. In the case of events, event aspects are considered during the relevance estimation process. Thus we score the relevance of the webpage with respect to each aspect of the event, and then combine that information to compute a final score. According to our event model, there are three attributes which together fully describe an event. A webpage can have some or all of the attributes of an event. A webpage is considered relevant (i.e., talks about the target event) if it satisfies the following conditions:

• It has a non-empty subset of the keywords that represent the topic attribute of the target event (i.e., is topically relevant).

• Its publication date is close to the event date.

• It has a non-empty subset of the keywords of the location attribute of the target event (i.e., the location entities extracted from the webpage are similar to event location entities).

A webpage that satisfies these conditions should be considered relevant and will be added to the output collection. The event focused crawler first takes the following steps with regard to a webpage:

1. Extract the text of the webpage.

2. Extract the publication date of the webpage.

3. Extract location entities from the text of the webpage using Named Entity Recognition (NER).

We have developed a function to measure the similarity between the target event model and the webpage model. The similarity function produces a score that estimates the relevance of the webpage to the target event. Let e1 = (T1, L1, D1) be the target event model and e2 = (T2, L2, D2) be the webpage event model, where T1 is the event topic reference vector, L1 is the list of location entities extracted from seed webpages using NER, D1 is the event date, T2 is the bag-of-words vector representation of the webpage text, L2 is the list of location entities extracted from the webpage text using NER, and D2 is the publication date of the webpage. The similarity function sim(e1, e2) is defined as:

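\[ \mathrm{sim}(e_1, e_2) = a \times \mathrm{score}(T_1, T_2) + b \times \mathrm{score}(L_1, L_2) + c \times \mathrm{score}(D_1, D_2) \qquad (1) \]

where

\[ \mathrm{score}(T_1, T_2) = \frac{\sum_{t \in T_1 \cap T_2} w(t_1) \times w(t_2)}{\lVert T_1 \rVert \times \lVert T_2 \rVert} \qquad (2) \]

i.e., the cosine similarity between the T1 and T2 vectors, where $w(t_i)$ is the weight of term t in document i, and

\[ \mathrm{score}(L_1, L_2) = \frac{\sum_{l \in L_1 \cap L_2} w(l_1) \times w(l_2)}{\lVert L_1 \rVert \times \lVert L_2 \rVert} \qquad (3) \]

i.e., the cosine similarity between the L1 and L2 vectors, where $w(l_i)$ is the weight of location l in document i, and

\[ \mathrm{score}(D_1, D_2) = 1 - \frac{\lvert D_1 - D_2 \rvert}{\mathit{num\_days}} \qquad (4) \]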


where num_days is the number of days in a year. This parameter can be configured according to the event characteristics. If the two dates are more than num_days apart, the score is set to zero. The final score of the webpage is calculated as a weighted average of the topic, location, and date scores, where the constants a, b, and c are the weights of the topic, location, and date scores, respectively. These add to one: a + b + c = 1.
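The similarity function defined above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `EventModel` class and all function names are hypothetical, topic and location vectors are stored as dictionaries of weights, the date difference is taken in absolute value, and the date score is clamped at zero as the text describes.

```python
import math
from dataclasses import dataclass
from datetime import date

@dataclass
class EventModel:
    topic: dict      # term -> weight (T)
    locations: dict  # location entity -> weight (L)
    day: date        # event date or webpage publication date (D)

def cosine(u, v):
    """Cosine similarity of two sparse weight dictionaries."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def date_score(d1, d2, num_days=365):
    """1 - |d1 - d2| / num_days, clamped to zero for distant dates."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / num_days)

def sim(e1, e2, a, b, c, num_days=365):
    """Weighted combination of topic, location, and date scores."""
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights a, b, and c must sum to one")
    return (a * cosine(e1.topic, e2.topic)
            + b * cosine(e1.locations, e2.locations)
            + c * date_score(e1.day, e2.day, num_days))
```

Setting b and c to zero (with a = 1) reduces this function to the topic-only baseline, which is how the event model can default to the topic-only model for events such as disease outbreaks.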

4.2.2 Calculating the Weights

The weights a, b, and c could be set manually by an expert who would take into consideration the type of the event (shooting, hurricane, bombing, earthquake, etc.) and the characteristics of the event (time duration and location area, e.g., specific location and point in time for a "sharp" event, versus multiple locations and long time periods for complex events). To automatically calculate the weights, we use each aspect of the event model (topic, location, and date) separately to score a sample of labeled webpages. We evaluate each aspect's performance against different threshold values, and we choose the threshold value that produces the best classification performance according to a given evaluation metric. In particular, we used the F1-score as the classification evaluation metric. The F1-score is an information retrieval metric that calculates the harmonic mean of the precision and recall [88]. We also used the F1-score to assign the weight of each aspect of the event model (topic, location, and date), which indicates the importance of that aspect in calculating the final score. We calculate the weight as the ratio of the aspect's F1-score to the sum of the F1-scores of all aspects. Figure 7 shows the steps for calculating the score of a webpage.

For example, assume we have 100 webpages, 50 relevant and 50 non-relevant, relative to a specific event (e.g., Orlando shooting). Assume also that we have the target event model (which could be extracted from another set of relevant webpages or entered manually by the user). For each of the 100 webpages, we extract the topic vector, locations vector, and publication date. Then we calculate three scores (topic score, location score, and date score) for each webpage using Equations 2, 3, and 4, respectively. After this process we end up with a matrix of 100 rows (webpages) and 3 columns (topic score, location score, and date score). Next, we use each of the scores (topic, location, and date) separately to predict a label (relevant or non-relevant) for each webpage. We produce the label by comparing the score to a threshold value (call it K); we generate "relevant" if the score is larger than the threshold and "non-relevant" otherwise.

Figure 7 The steps for calculating the score of a webpage

After this process we end up with a matrix of 100 rows (webpages) and 3 columns (label based on topic score, label based on location score, and label based on date score). Then we evaluate the effectiveness of each aspect of the event (topic, location, and date) by comparing the actual labels and predicted labels for each of the three aspects. We use the F1-score as the metric for evaluation. Here we end up with 3 F1-scores (one for each of the topic, location, and date) for the threshold value K. We repeat the previous process for different values (n values) of the threshold parameter. Then we end up with a matrix of n rows (different values of the threshold) and 3 columns (topic, location, and date). Finally, we choose the max F1-score for each aspect (topic, location, and date). The weight of each aspect will be the ratio of its F1-score to the sum of the three aspects' F1-scores. In this manner, the weight of each aspect corresponds to how much it contributes to the overall performance.

The weights of each aspect of the event model (topic, location, and date) are learnt before the crawling time by applying the previous procedure on a given set of URLs and webpages that are labeled as relevant or non-relevant. The weights learnt are used during crawling and are not modified.
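The weight-learning procedure above can be sketched in Python as follows. The function names and the data layout (a dictionary from aspect name to a list of per-webpage scores) are assumptions made for illustration, and the fallback to equal weights when every aspect has an F1-score of zero is an addition not discussed in the text.

```python
def f1_score(actual, predicted):
    """Harmonic mean of precision and recall for boolean label lists."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1(scores, labels, thresholds):
    """Best F1 over thresholds, predicting relevant when score > threshold."""
    return max(f1_score(labels, [s > k for s in scores]) for k in thresholds)

def learn_weights(aspect_scores, labels, thresholds):
    """Weight of each aspect = its best F1 divided by the sum of best F1s."""
    best = {name: best_f1(scores, labels, thresholds)
            for name, scores in aspect_scores.items()}
    total = sum(best.values())
    if total == 0:
        # Assumed fallback: no aspect is informative, so weight them equally.
        return {name: 1.0 / len(best) for name in best}
    return {name: value / total for name, value in best.items()}
```

Because the weights are normalized by the sum of the F1-scores, they always add to one, matching the constraint a + b + c = 1 on the similarity function.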

4.2.3 Event Model-based URL Scoring

A similar procedure is implemented for estimating a score for each URL extracted from the webpage. A URL is converted into tokens by removing non-alphabetic characters (like '/', '#', '?') and also removing URL-specific keywords (like 'http', 'com', 'www'). URL tokens are combined with tokens from associated anchor texts. The resulting tokens are then converted to a bag-of-words based vector representation. We extract the location entities from the URL anchor text using SNER and extract the publication date from the URL using regular expressions³ (if applicable).

Figure 8 An example webpage with a relevant URL anchor text highlighted

Figure 8 shows an example webpage about the Brussels attack event with a relevant URL highlighted. The anchor text of the URL is: "Paris and Brussels terror suspect to face charges in France". The address that the URL points to is: https://www.theguardian.com/world/2016/jun/09/mohamed-abrini-paris-brussels-terror-suspect-france-man-in-the-hat. We can see that the URL's address contains the publication date of the corresponding webpage (June 9, 2016). The URL vector after tokenization, stopword removal, and stemming would be: ['theguardian', 'world', '2016', 'jun', 'mohamed', 'abrini', 'paris', 'brussels', 'terror', 'suspect', 'france', 'man', 'hat', 'charges']. We note here that SNER will capture the location entities from the URL's anchor text only, not the URL address, because the SNER tokenizes its input into sentences and tries to extract entities from these sentences; this can be done for the anchor text (which includes meaningful text) but not for the URL (since URL tokens don't form meaningful sentences).

If the seed URLs are from one domain (e.g., www.theguardian.com), this may affect the quality of information extracted to build our event model. The coverage from a single domain could be biased or limited in scope, while that is less likely if there are multiple domains. The seed URLs should be from different domains, which helps ensure that all required information is present in the event model and thus that there is no bias toward a particular domain website. As a summary, Figure 9 shows the steps for calculating the score of a URL.

³ https://github.com/Webhose/article-date-extractor

Figure 9 The steps for calculating the score of a URL
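The URL tokenization step of this procedure can be illustrated with a short Python sketch. The keyword list and function name are hypothetical; digits are kept as tokens because the worked example in the text retains '2016', and stopword removal and stemming are omitted for brevity.

```python
import re

# Hypothetical list of URL-specific keywords to drop (assumption).
URL_KEYWORDS = {"http", "https", "www", "com", "org", "net", "html", "htm"}

def url_tokens(url, anchor_text=""):
    """Split a URL and its anchor text into lowercase alphanumeric tokens."""
    combined = (url + " " + anchor_text).lower()
    tokens = re.findall(r"[a-z0-9]+", combined)
    return [t for t in tokens if t not in URL_KEYWORDS]
```

The resulting token list can then be turned into a frequency vector and compared with the event topic vector in the same way as a webpage's text.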


In this chapter, we have addressed hypothesis 1.1 and research questions 1 and 2, namely "How to model and represent an event?" and "How to compare two event representations?". We explained our event model and representation, and how the event focused crawler can use it to estimate the relevance of the URLs and webpages it visits.


5 Experimental Setup

In this chapter we describe the datasets used for our experiments, the different experiments performed, and the evaluation metrics. We performed four series of experiments. The goal of the first series (see results in Section 6.1) was to validate that our event model can effectively classify webpages as relevant or non-relevant to the event of interest. We compared our approach against the baseline, i.e., using the traditional vector space model (VSM) topic-only approach. In the second series of experiments (see results in Section 6.2), we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages it visits and consequently guide the crawling process to webpages relevant to the event of interest. In the third experiment (see results in Section 7.1), we evaluated the effectiveness of our proposed webpage source importance model for collecting more relevant webpages. In the fourth experiment (see results in Section 7.2), we evaluated the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

5.1 Datasets

For the first series of experiments, about classification, we devised two datasets: 1) a set of relevant webpages for the training/learning model phase, and 2) a set of relevant and non-relevant webpages for the testing (classification) phase. First, we need a set of relevant webpages that the two models (our event-based model and the topic-only model) will use to learn/build their model in the training phase. Accordingly, we manually curated a set of 38 URLs and fetched their corresponding webpages. For the classification phase, there was no existing dataset (labeled relevant and non-relevant samples) about the California shooting event. We decided to build our own ground truth dataset of 1000 URLs and webpages. We could have manually selected and labeled a set of 1000 URLs and webpages, but (to save time and effort) we used a keyword-based crawler to fetch 1000 webpages using the set of 38 URLs (used in the training phase) as seeds. We used the two words "California" and "shooting" as keywords for the crawler. After the crawler finished crawling, we manually labeled the resulting webpages into two classes (relevant and non-relevant). There were 725 webpages labeled as relevant and 275 labeled as non-relevant. We followed the procedure described in Section 4.2.2 for calculating the weights and performing the evaluation. The manually labeled set of URLs and webpages is given as input to both models, and the resulting calculated scores and the given threshold parameter are used to produce the predicted labels. The predicted labels are then compared to the labels determined manually, to produce the evaluation results.

After completing the first series of experiments, for the focused crawling experiments, we considered a set of recent events. Table 2 shows the list of events used in our crawling experiments. For each event we summarize the type of the event, the location and date of the event, and finally how many URLs were extracted from the corresponding tweet collection and were used as seed URLs for crawling. The number of seed URLs varies across the events because they were extracted from the events' corresponding tweet collections. The number of desired webpages was set to 50,000 for events with fewer than 10,000 seed URLs and 100,000 for events with more than 10,000 seed URLs, except for the Egyptair plane crash event, where the number of desired webpages was set to 10,000 only, and the Paris attack event, where the number of desired webpages was set to 500,000. For the classification experiments, however, it seemed sufficient to just consider the California shooting.

Table 2 List of events used in the large-scale crawling experiments

| Event | Type | Location | Date | # of Seed URLs | # of desired webpages |
|---|---|---|---|---|---|
| California Shooting | Shooting | San Bernardino, California, USA | December 2, 2015 | 4,161 | 50,000 |
| Brussels Attack | Terrorist Attack | Brussels, Belgium | March 22, 2016 | 4,691 | 50,000 |
| Oregon Shooting | Shooting | Roseburg, Oregon, USA | October 1, 2015 | 22,354 | 100,000 |
| Egyptair Plane Crash | Plane Crash | Mediterranean Sea, Alexandria, Egypt | May 19, 2016 | 1,211 | 10,000 |
| Panama Papers Leak | Document Leak | Panama | April 3, 2016 | 18,260 | 100,000 |
| Orlando Shooting | Shooting | Orlando, Florida, USA | June 12, 2016 | 1,988 | 50,000 |
| Paris Attack | Terrorist Attack | Paris, France | November 13, 2015 | 88,835 | 500,000 |
| Ecuador Earthquake | Earthquake | Ecuador | April 16, 2016 | 11,348 | 100,000 |


Since the IDEAL project is working with a large amount of event-related data, and since we wanted our results to be assessed in the context of such types of data, we had ample opportunity to utilize data from events like those mentioned above.

5.2 Experiments

The goal of the first series was to validate that our event model can effectively classify webpages with regard to relevance to the event of interest. We compared the performance, for the task of classification, of the event model vs. the topic-only model. We used the manually curated seed URLs and the static dataset of 1000 webpages about the California shooting for the evaluation. Both models use cosine similarity as a scoring function, and produce a score for their input text (URL or webpage) that estimates the relevance of the input to the given event. We use each of the models as a classifier, i.e., by comparing the cosine similarity score to a given threshold parameter. If the cosine similarity score is larger than the threshold, then the output label is relevant; otherwise it is non-relevant.

Both models (event-based and topic-only) are trained using a set of URLs (positive samples only, as they don’t require negative samples for training). The topic-only model uses the set of URLs to build a topic reference vector, while the event-based model uses the set of URLs to build the event model (topic, location, and date). The next step is to use the models to classify URLs and webpages. We use the two models built (event-based model and topic-only model) to classify the manually labeled URLs and webpages (i.e., our gold standard test set) to determine their predicted labels. The last step is to compare the predicted labels (from both the event-based model and the topic-only model) to the actual (manually produced) labels and evaluate the performance of each of the models. We evaluated the performance by varying two parameters:

1. k, the number of keywords used in constructing the topic vector in our event model and the topic reference vector for VSM, and

2. threshold, the value of the threshold used for converting the scores to labels (relevant if the score is larger than the threshold, otherwise non-relevant).

We also ran the experiments with several variations of our event model. We then ran the same experiment with the two pairs of feature types: a) combination of the topic and location only, and b) combination of the topic and date only. Figure 10 shows the design of the experiments for evaluating the effectiveness of classification using the event model.
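The thresholded cosine-similarity classifier and the threshold sweep described above can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: the sparse dictionary representation and the function names (`cosine_similarity`, `f1_score`, `best_threshold`) are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed representation)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def f1_score(predicted, actual):
    """F1-score of boolean predictions against boolean gold labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, thresholds):
    """Return the threshold whose labels (relevant if score > threshold) give the best F1."""
    return max(
        thresholds,
        key=lambda th: f1_score([s > th for s in scores], labels),
    )
```

In such a sweep, `scores` would hold the cosine similarity of each labeled URL or webpage to the reference vector, and the same loop would be repeated for each candidate value of k.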


Figure 10 Design of our evaluation method of the effectiveness of the event model for relevance estimation; the two boxes with an asterisk indicate the two parameters optimized in the experiment

In the second series of experiments, we aimed to validate that the event model can effectively estimate the scores of the URLs and webpages to be visited, and consequently guide the focused crawling process to webpages relevant to the event of interest. In these experiments we used a set of recent events. In the literature, the most used dataset for evaluating topical focused crawlers is the DMOZ dataset. But since we are evaluating focused crawlers for events, the DMOZ dataset is not suitable in our case. In the crawling experiments, we need a set of relevant URLs for each event that will be used as seeds for starting the crawling. We manually curated the set of seed URLs for two events (38 for California shooting and 23 for Brussels attack). These two sets of manually curated URLs are used for running small-scale crawls only. For large-scale crawls, we need to start from a larger set of seed URLs, which is very difficult to build manually. For these crawls, we extracted the set of seed URLs from each event’s corresponding collection of tweets. The tweet collections were collected using the Twitter streaming API. Twitter is a rich source of URLs, as most of the tweets posted link to webpages that contain more detailed information. The number of URLs extracted depends on how big the event is (high impact events attract more people and therefore more tweets are posted about the event).
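As a rough illustration of pulling candidate seed URLs out of a tweet collection, the Python sketch below scans tweet texts with a regular expression and orders the URLs by how often they were posted. The text above does not specify the extraction or filtering steps, so the pattern, the punctuation trimming, and the frequency ordering are all assumptions of this example; resolving shortened links is left out.

```python
import re
from collections import Counter

# Assumed pattern: any http(s) URL up to the next whitespace or quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_seed_urls(tweets):
    """Return the distinct URLs found in the tweet texts, most frequent first."""
    counts = Counter()
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            # Trim punctuation that commonly trails a URL at the end of a sentence.
            counts[url.rstrip(".,;:!?)")] += 1
    return [url for url, _ in counts.most_common()]
```

For example, `extract_seed_urls(["Update: http://example.com/a.", "See http://example.com/a"])` yields a single seed, since both tweets point to the same URL.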


Figure 11 summarizes the steps for calculating the harvest ratio after the crawling process finishes. The resulting webpages of each crawl, their calculated scores (based on each model), and a given threshold are used to produce the predicted labels and the corresponding harvest ratio.

Figure 11 The design of the experiment for evaluating the effectiveness of event model with focused crawler to retrieve more relevant webpages

5.3 Evaluation Metrics

We used the precision, recall, and F1-score metrics to evaluate the classification performance of our event model versus the baseline topic-only approach in the first series of experiments. The precision, recall, and F1-score are calculated using the confusion matrix, which contains the number of true positive, true negative, false negative, and false positive samples (samples here refer to webpages). The true positive samples are the ones that were predicted relevant and were actually relevant, while the true negative samples are the ones that were predicted non-relevant and were actually non-relevant. The false positive samples are the ones that were predicted relevant and were actually non-relevant, while the false negative samples are the ones that were predicted non-relevant and were actually relevant.

The precision is defined as the percentage of retrieved samples that are relevant. It is calculated as the ratio of the true positive samples to the sum of the true and false positive samples. The recall is defined as the percentage of relevant samples that are retrieved. It is calculated as the ratio of the true positive samples to the sum of the true positive and false negative samples. The F1-score measure is the harmonic mean of the precision and recall measures. Pushing precision toward its maximum usually lowers recall, and vice versa, so the F1-score is a way to combine both measures (precision and recall). In other words, a high value for precision doesn’t ensure good performance, as the value for recall might be low. But a high value for the F1-score means that both precision and recall are high.

For evaluating the performance of the focused crawlers in the second, third, and fourth series of experiments (i.e., the ability to collect more relevant webpages), we used the harvest ratio metric [27-29]. The harvest ratio is the percentage of crawled webpages that are relevant. The harvest ratio measures the ability of the crawler to find and collect more relevant content than non-relevant content. If the crawler visits many non-relevant webpages in order to find relevant ones, then this means the crawler is using an inefficient method. A highly efficient relevance estimation method would direct the crawler correctly and ensure it focuses on the relevant part of the Web only, thus producing a higher harvest ratio. It is worth noting here that the critical point for relevance judgment is the ability to estimate relevance of both URLs and webpages. Estimating the relevance of webpages is easier than for URLs due to the fact that webpages contain richer textual content than URLs. A poor relevance estimation method would give high scores for non-relevant URLs, which leads to non-relevant webpages and thus low harvest ratio.
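For reference, the metrics described in this section can be written compactly as below, where TP, FP, and FN denote the numbers of true positive, false positive, and false negative webpages. The symbols are introduced here only for illustration, and the F1-score is given in its standard form, the harmonic mean of precision and recall.

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN}, \\[4pt]
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{Harvest ratio} &= \frac{\#\,\text{relevant webpages crawled}}{\#\,\text{webpages crawled}}.
\end{align*}
```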


6 Results

6.1 Event Model-based vs. Topic-Only Classification

In this section, about the first series of experiments, we show the results of classifying the 1000 webpages about the California shooting using the topic-only vector space model versus three variants of our event model, namely topic + location, topic + date, and topic + location + date (our event model). Using the 38 seed webpages about the California shooting event, we created a vocabulary of 1365 keywords that appeared on 5 or more webpages. To extract the most representative keywords (features) from the vocabulary, we sorted the vocabulary keywords based on their cumulative normalized occurrences in all of the seed webpages. We chose the top k keywords from the sorted vocabulary. Each seed webpage is then represented as a vector of the top k keywords and their frequency of occurrence in the webpage. We created the topic reference vector as the centroid vector of all seed webpage vectors.
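The construction of the topic reference vector can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used in the experiments: pages are assumed to be already tokenized (and stemmed), "normalized occurrences" is taken here to mean term frequency divided by page length, and the function name `build_topic_vector` is hypothetical.

```python
from collections import Counter


def build_topic_vector(pages, k, min_pages=5):
    """Build a centroid topic vector from tokenized seed pages.

    pages: list of token lists, one per seed webpage (assumed preprocessed).
    k: number of top keywords to keep.
    min_pages: a keyword must appear on at least this many pages.
    """
    counts = [Counter(tokens) for tokens in pages]

    # Vocabulary: keywords that appear on min_pages or more seed webpages.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    vocab = [t for t, df in doc_freq.items() if df >= min_pages]

    # Assumption: normalize each page's term frequency by the page length,
    # then accumulate over all seed pages.
    cumulative = {
        t: sum(c[t] / len(p) for c, p in zip(counts, pages) if p)
        for t in vocab
    }
    top_k = sorted(vocab, key=lambda t: (-cumulative[t], t))[:k]

    # Centroid: mean frequency of each top-k keyword over all seed pages.
    n = len(pages)
    return {t: sum(c[t] for c in counts) / n for t in top_k}
```

With the settings reported above, this would be called with the 38 seed webpages, `min_pages=5`, and each candidate value of k.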

6.1.1 Classifying URLs and webpages about California shooting

The 1000 URLs/webpages dataset about the California shooting consists of URL addresses and their anchor texts, as well as the corresponding webpages. We ran experiments to classify URLs and webpages, separately. The URLs and webpages were manually labeled as relevant vs. non-relevant. We refer to this dataset as the labeled dataset. For the topic-only model and our event model, we varied the parameter k (the number of keywords in the topic vector) from 5 to 1365 words with an increment of 10 and the threshold parameter from 0 to 1 with an increment of 0.05.

Table 3 Values of the parameters that produced the best F1-score. K is the size of the topic vector and threshold is the cutoff value for determining relevant or non-relevant labels based on the score

| Model | URLs: K | URLs: Threshold | Webpages: K | Webpages: Threshold |
|---|---|---|---|---|
| Topic-only (baseline) | 1310 | 0.25 | 10 | 0.45 |
| Topic, Location, and Date (Event model) | 1310 | 0.15 | 10 | 0.4 |

As can be seen in Table 3, for the topic-only approach, in the case of URLs, the values of the parameters k (the number of keywords of topic vector) and threshold that gave the best F1 score on the labeled dataset were 1310 and 0.25, respectively. In the case of webpages, they were 10 and 0.45, respectively. The weights of topic, date, and location parts were 0.36, 0.22, and 0.42, respectively. The weights were calculated using equations 1-4 as described in Section 4.2.2. For the event model approach, in the case of URLs, the values of the parameters k (number of keywords of topic vector) and threshold that gave the best F1 score on the labeled data were 1310 and 0.15, respectively. In the case of webpages, they were 10 and 0.4, respectively. The weights of topic, date, and location parts were 0.3, 0.355, and 0.345, respectively. Table 3 summarizes the parameter values for all settings.

To examine the effect of the date and location separately, we ran our evaluation using topic + location and topic + date. For topic + location, the best threshold value was 0.2 and the weights of topic and location parts were 0.64 and 0.36, respectively. For topic + date, the best threshold value was 0.2 and the weights of topic and date parts were 0.47 and 0.53, respectively. Tables 4 and 5 show the precision, recall, and F1 score for the four experimental settings (topic-only, topic + location, topic + date, and our event model, with topic, location, and date) using the best values for the parameters for both the URL and webpage classification tasks (as shown in Table 3).

Table 4 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING URLs dataset for California shooting event

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| Topic | 0.728 | 0.723 | 0.725 |
| Topic + Date | 0.852 | 0.855 | 0.853 |
| Topic + Location | 0.764 | 0.73 | 0.74 |
| Topic + Location + Date | 0.863 | 0.867 | 0.862 |

Achieving higher F1-score means better classification performance (i.e., better ability to identify and differentiate between relevant and non-relevant webpages). The results show that adding date and/or location information to the topic enhances the performance. Our event model (combining topic, location, and date) achieves the best performance (highest F1-score). The topic-only model performed worst (lowest F1-score). Our examination of the data confirmed that the topic-only model did not differentiate well between webpages talking about different shooting events and our event (California shooting), as all are topically related (shooting). On the other hand, the topic + date model performed better than topic-only, because it managed to use the publishing time of the webpages to filter out webpages talking about shooting events that happened before the California shooting event. The topic + location model performed better than the topic-only model because it filtered out webpages talking about shooting events that happened at other locations than California.

Table 5 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TRAINING webpages dataset for California shooting event

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| Topic | 0.738 | 0.734 | 0.736 |
| Topic + Date | 0.842 | 0.846 | 0.843 |
| Topic + Location | 0.856 | 0.859 | 0.857 |
| Topic + Location + Date | 0.88 | 0.884 | 0.881 |

We also examined the performance of our event model (combining topic, location, and date) versus the topic-only approach across the different values of the threshold variable and the best value for the k parameter. We plotted the precision-recall curves for the different values of the threshold parameter. Figure 12 and Figure 13 show the curves for the four different settings. The figures confirm the results described above: adding location and/or date information enhances the performance of classification. It is also shown that the effect of adding date information is much stronger than adding location information, in the case of URLs. We investigated this behavior and found that most of the URLs in our labeled data include date information that can be extracted easily. There was less location information in the URLs compared to date information. Further, some of the location information was not in a standard format as expected by SNER (which assumes location information exists as part of a valid sentence; see Section 4.2.3).

We have used the manually curated seed URLs for learning our event model and the baseline topic-only model. Both models have two parameters: K (number of features, i.e., words, in the topic vector) and threshold (value for determining the labels: relevant or non-relevant). We tuned the values of the two parameters and reported the best values of those parameters (see Table 3) and the performance of the two models (using the best value of the two parameters; see Tables 4 and 5) on the manually labeled training dataset. Finally, we tested the performance of the two models on the manually labeled test dataset (since the two models haven’t seen the webpages in this dataset) to see how well the two models will generalize to unseen webpages. Table 6 shows the performance of the baseline topic-only model, our event model (topic + location + date), and two variants of our event model (topic + location and topic + date). Our event model outperforms the baseline topic-only model by achieving an F1-score of 0.894 compared to 0.688 for the baseline topic-only model. Also, adding the date or location information achieves better performance than the baseline topic-only model. Adding the location information is more effective than adding the date information. This can be attributed to the richness of location information in the webpages compared to the existence of publication date in webpages.

Table 6 Precision, Recall, and F1 score for the four combinations of Topic, Location, and Date evaluated on the manually labeled TEST webpages dataset for California shooting event

| Model | Precision | Recall | F1-score |
|---|---|---|---|
| Topic | 0.695 | 0.706 | 0.688 |
| Topic + Date | 0.779 | 0.783 | 0.776 |
| Topic + Location | 0.858 | 0.86 | 0.858 |
| Topic + Location + Date | 0.899 | 0.896 | 0.894 |

Figure 12 California shooting URLs evaluation at different threshold values


Figure 13 California shooting webpages evaluation at different threshold values


6.2 Event Model-based vs. Topic-only Focused Crawler

In this section, about the second series of experiments, we report the effect of using the event model with the focused crawler.

6.2.1 California Shooting

In this experiment we used the 38 manually curated URLs (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 7. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date. We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 14 shows the performance of the two crawlers in the small-scale setting (only 1000 webpages are crawled) during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during and at the end of the crawling process than the baseline topic-only focused crawler. Our event model-based focused crawler achieved a harvest ratio of approximately 0.85 while the baseline topic-only focused crawler achieved a harvest ratio of approximately 0.68.
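The harvest ratio at successive stages of a crawl (the first 100, 200, 300, … webpages) can be computed with a short Python sketch such as the one below. It assumes the crawl log is available as a list of relevance labels in crawl order; the function name and the `step` parameter are hypothetical and used only for illustration.

```python
def harvest_ratio_at_checkpoints(relevance_flags, step=100):
    """Cumulative harvest ratio after every `step` crawled webpages.

    relevance_flags: booleans in crawl order (True if the page is relevant).
    Returns a list of (pages_crawled, harvest_ratio) pairs.
    """
    ratios = []
    relevant = 0
    for i, flag in enumerate(relevance_flags, start=1):
        relevant += bool(flag)
        if i % step == 0:
            ratios.append((i, relevant / i))
    return ratios
```

Plotting the second element of each pair against the first gives a curve of the kind used to compare the two crawlers.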

Table 7 California shooting event model

| Part | Keywords (weight) |
|---|---|
| Topic | shoot (0.93), san (0.513), bernardino (0.465), said (0.357), wa (0.323), 2015 (0.321), peopl (0.31), california (0.305), polic (0.258), suspect (0.177) |
| Location | San Bernardino (1), California (0.51), Calif. (0.44) |
| Date | 2015-12-02 |


Figure 14 Performance evaluation of event model-based vs. topic-only focused crawlers for California shooting

[Plot for Figure 14: Performance Evaluation of Event model-based Focused Crawler vs Baseline Focused Crawler for California Shooting Event. x-axis: total number of webpages crawled (0 to 1000); y-axis: percentage of crawled webpages that are relevant (harvest ratio, 0 to 1); series: Baseline Focused Crawler (Topic Only) and Event Model-based Focused Crawler (Topic + Loc + Date).]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) for 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 15 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only crawler achieved a harvest ratio of 0.3 (average).


Figure 15 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for California shooting (50K webpages)

6.2.2 Brussels Attack

In this experiment we used the 23 manually curated URLs (see Section 5.2) as seeds for the two focused crawlers (our event model-based and the topic-only baseline). The event model built from the seeds is illustrated in Table 8. The first row in the table gives the topic vector keywords and their normalized cumulative term frequencies in all of the seed webpages. The same is done for the location and date. We ran the two focused crawlers to collect 1000 webpages. We plot the percentage of crawled webpages that are relevant (harvest ratio) at different stages of the crawl, i.e., for the first 100, 200, 300, … crawled webpages. Figure 16 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler collected more relevant webpages during the crawling process than the topic-only baseline focused crawler in the small-scale setting (1000 webpages only).


Table 8 Brussels attack event model

| Part | Keywords (weight) |
|---|---|
| Topic | brussel (0.881), attack (0.541), airport (0.539), explos (0.381), wa (0.31), peopl (0.273), station (0.254), belgium (0.242), metro (0.197), terror (0.159) |
| Location | Brussels (1), Belgium (0.37), Brussels Airport (0.174), Zaventem (0.174), Paris (0.123) |
| Date | 2016-03-22 |

Figure 16 Performance evaluation of event model-based focused crawler for Brussels attack

[Plot for Figure 16: Performance Evaluation of Event model-based Focused Crawler vs Baseline Focused Crawler for Brussels Attack Event. x-axis: total number of crawled webpages (0 to 1000); y-axis: percentage of crawled webpages that are relevant (harvest ratio, 0 to 1); series: Baseline Focused Crawler (Topic Only) and Event model-based Focused Crawler (Topic + Loc + Date).]

In the large-scale setting, we ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~4000 seed URLs. Figure 17 shows the performance of the two crawlers during the different stages of the crawling process. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only crawler achieved a harvest ratio of 0.5.

Figure 17 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Brussels attack (50K webpages)

In the previous two experiments (California shooting and Brussels attack), we performed small-scale and large-scale crawls. The reason for the small-scale experiments was to show that the event model efficiently guided the focused crawler to focus on the relevant part of the Web, and the crawler as a result retrieved more relevant webpages than the topic-only baseline focused crawler. The purpose of the large-scale experiments is to show that the performance of our event model-based focused crawler remains better than the topic-only baseline focused crawler even for large numbers of webpages, i.e., is scalable. In the remaining experiments (other events), we ran only the large-scale crawling experiments. We already showed the effectiveness of our approach in the small-scale experiments and we need to validate that the same performance persists at large scale. We didn’t manually prepare sets of seed URLs for the remaining events; we extracted them from the tweet collections for each event.

6.2.3 Oregon shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~22K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 (average) while the baseline topic-only crawler achieved a harvest ratio of 0.25 (average). Figure 18 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 18 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Oregon shooting (100K webpages)

In this experiment, we had a higher number of seed URLs than in the previous two experiments (22K compared to 4K). We did not control the number of seed URLs. We extracted and filtered the seed URLs from the event’s tweet collection. The number of seed URLs depends on the size of the corresponding tweet collection, which depends on the impact/coverage and size of the event (big or small). Events with big impact will lead to large tweet collections and therefore more URLs extracted. The definition of impact here in our context is related to coverage. The Oregon shooting event’s tweet collection had more tweets than the previous two events and therefore there were more URLs extracted than for the previous two events. Having more starting URLs means we have more access/pointers to the relevant part of the Web graph, so we ran the experiments for 100K webpages rather than 50K (like in the first two experiments).

6.2.4 Egyptair plane crash

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 10K webpages. The two crawlers started from a set of ~1100 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.5 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 19 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 19 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Egyptair plane crash (10K webpages)

An event of big impact could have a small tweet collection (as is the case for this event) because there was not enough coverage of the event on Twitter, or the event didn’t attract much attention from Twitter users. Another possible reason is that other trending topics drew attention away from that event.


6.2.5 Panama papers

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~18K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.6 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 20 shows the performance of the two crawlers during the different stages of the crawling process.

Figure 20 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Panama Papers (100K webpages)

6.2.6 Orlando shooting

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 50K webpages. The two crawlers started from a set of ~2000 seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.7 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 21 shows the performance of the two crawlers during the different stages of the crawling process.


Figure 21 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Orlando shooting (50K webpages)

6.2.7 Paris attacks

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 500K webpages. The two crawlers started from a set of ~88K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.25 while the baseline topic-only crawler achieved a harvest ratio of 0.18. Figure 22 shows the performance of the two crawlers during the different stages of the crawling process.

The Paris attack event was a big event with huge impact locally in France and internationally. We kept collecting tweets from the start of the event and for several days after the event, which is not the case for the other events. Usually the tweet collecting process stops on the same day or one/two days after the event. The stopping point depends on when Twitter users stop posting about the event; posting volume typically peaks on the day of the event and decreases after that. We note here that this experiment (and all our experiments) started from the English seed URLs only (the same applies during the crawling process; we are working on the English language only). We excluded all tweets not in English. Even when limiting to English-only tweets, we still had around 88K seed URLs to start from. The performance of the two crawlers degraded at the end of the crawl as expected, because (as known in the topical crawler literature [17, 26, 28, 40, 41, 46, 47, 65, 66, 74, 75, 81, 83, 89, 90]) as we get far from the seed URLs, we find fewer relevant webpages. The relevant content is concentrated around the seed URLs, so the further away we go from the seed URLs, the greater the chance that we hit non-relevant content.

Figure 22 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Paris attack (500K webpages)

6.2.8 Ecuador earthquake

We ran the two crawlers (baseline topic-only crawler, and our event model-based crawler) to collect 100K webpages. The two crawlers started from a set of ~11K seed URLs. Our event model-based focused crawler outperformed the baseline topic-only focused crawler and achieved a higher harvest ratio during and at the end of the crawling process. Our event model-based crawler achieved a harvest ratio of 0.75 while the baseline topic-only crawler achieved a harvest ratio of 0.4. Figure 23 shows the performance of the two crawlers during the different stages of the crawling process.


Figure 23 Large scale performance evaluation of event model-based vs. topic-only focused crawlers for Ecuador earthquake (100K webpages)

In this chapter we addressed research question 3, which is the effect of using our event model on the performance of the focused crawler. We showed that using the event model with focused crawling leads to achieving a higher harvest ratio, thus collecting more of the relevant webpages than the traditional focused crawler. The better performance was shown in small and large-scale crawls.


7 Webpage Source Importance and Social media-based Seed Selection

So far we have exploited content-based features for finding relevant webpages. In this section, we explore two other parameters that affect the ability of the focused crawler to find more relevant webpages: webpage source importance and seed URLs.

7.1 Webpage Source Importance

A webpage source is the website the webpage belongs to; for example, the webpage with the URL http://www.cnn.com/2014/04/18/world/asia/malaysia-airlines-plane/index.html belongs to the source website http://www.cnn.com. The importance of a webpage source will be determined based on the number of relevant webpages that belong to the webpage source. This heuristic exhibits two characteristics that should ensure a good estimate of the webpage source importance:

1) Follows the likelihood properties of the data,
2) Changes dynamically with new data observed during crawling.

The first characteristic ensures that the measure we are using is realistic and describes the actual data being collected. The second characteristic shows how the measure adapts to changes in the data being observed, and also follows changes in the content being published on the WWW. There are several reasons for choosing our method of estimating source importance and not considering PageRank and Hub and Authority [42, 43, 77, 78] methods (as an example of source popularity measures):

1- Dynamic vs. static (fixed): PageRank, hub and authority, and out-degree measures are all static or fixed measures. They need to be calculated offline (i.e., they require the whole dataset or part of it to calculate the values, which are then used during crawling). This is similar to online and offline learning methods. Offline methods use the training data to build the model and then use it. Online methods don’t use/require training data; they learn the model and use it online/during crawling. So, the model is updated during crawling.

2- PageRank and hub and authority methods are time consuming and computationally intensive, so it will be time consuming to adapt them for an online version.

3- PageRank and hub and authority are methods for measuring popularity and quality of webpages. We can imagine that using them for estimating the importance of a source with respect to a specific event is like using general Web crawlers for crawling webpages about a specific event. We expect that such popularity measures are too general to be considered a measure for source importance. For example, an unpopular website (source) could be very relevant to an event, due to its content or having links to other relevant webpages about the event. Also using topic-oriented PageRank and hub and authority methods is like using topical crawlers for crawling about events, which is not efficient; that is a key point of this dissertation.

So we need a dynamic, simply calculated, and event-specific method for estimating the likelihood of finding a new relevant webpage from a source by using the likelihood of the source in the currently crawled webpages or the discovered but not yet visited URLs in the frontier. This is similar to a graph-based algorithm which learns from different paths in the graph whether a certain path will lead to relevant webpages even if we encounter non-relevant ones in the middle. Thus, if a webpage is not relevant but its source has high probability of having event-relevant webpages, then there is a high probability that the current low-score (non-relevant) webpage will link to a relevant webpage. This approach acts like a greedy algorithm where the crawler will crawl more from the source with the highest importance. The crawler will keep crawling relevant webpages from the most important source until it no longer finds relevant webpages, and then switches to another important source.

We could experiment with several candidate methods for estimating source importance, using the information about currently crawled webpages, like the ones in [1], namely (the following list is taken from the paper in [33] with changes):

a. Negative Absolute Bad function, where the score of a source is the negative number of already crawled non-relevant webpages;

b. Best Score function, where the score of a source is the maximal score of one of the discovered but not yet visited URLs that belongs to the source;

c. Success Rate function, where the score of a source is the ratio between the number of relevant webpages crawled and the non-relevant; the ratio is initialized with prior parameters α and β which we set to 1: score(source) = (#relevant(source) + α) / (#non-relevant(source) + β);

d. Thompson Sampling function, where the score of a source is a random number, drawn from a beta-distribution with prior parameters α and β; in this case we take as the score the random value: score(source) = Beta(#relevant(source) + α, #non-relevant(source) + β); we initialized the priors α and β with 1;

e. Absolute Good * Best Score function, where the score of a source is the product of the absolute number of already crawled relevant webpages and the best score function described in (b);

f. Thompson Sampling * Best Score function, where the score of a source is the product of the Thompson sampling function (d) and the best score function (b); and

g. Success Rate * Best Score function, where the score of a source is the product of the success rate function (c) and the best score function (b).
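As an illustration of functions (a), (c), and (d) in the list above, the following is a minimal Python sketch, not the implementation used in the experiments. The function names and the choice to pass per-source counts as plain integers are assumptions made for this example; the priors α and β default to 1 as stated in the text.

```python
import random

ALPHA = 1.0  # prior added to the relevant count (set to 1 in the text)
BETA = 1.0   # prior added to the non-relevant count (set to 1 in the text)


def negative_absolute_bad(non_relevant):
    """(a) Negative number of already crawled non-relevant webpages."""
    return -non_relevant


def success_rate(relevant, non_relevant, alpha=ALPHA, beta=BETA):
    """(c) (#relevant + alpha) / (#non-relevant + beta)."""
    return (relevant + alpha) / (non_relevant + beta)


def thompson_sampling(relevant, non_relevant, rng, alpha=ALPHA, beta=BETA):
    """(d) Random draw from Beta(#relevant + alpha, #non-relevant + beta)."""
    return rng.betavariate(relevant + alpha, non_relevant + beta)


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical fixed seed, only for repeatability
    # A source with 3 relevant and 1 non-relevant crawled webpages.
    print(negative_absolute_bad(1))
    print(success_rate(3, 1))
    print(thompson_sampling(3, 1, rng))
```

With both priors at 1, a source that has not been crawled yet scores 1 under the success rate function, so all sources start equal and diverge only as evidence accumulates.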

The results shown in [33] indicate that the success rate function (c) is the best scoring function with regard to crawling the largest number of relevant webpages (i.e., the highest harvest ratio). Thus we chose to use the success rate function as our method for estimating webpage source importance.

We ran an experiment with the event model-based focused crawler and source importance. We combined the webpage source importance score with the event model-based relevance score to produce the final score of URLs. One possible method of combination is multiplying both scores together, so URLs with high webpage importance score and high relevance scores will get a higher final score. We note here that the webpage importance score is calculated during the crawling process and is not a fixed value, but rather a dynamic value that changes during the crawling process. At the beginning of the crawl, all sources have the same initial importance score. When a new webpage is retrieved and found relevant, then the corresponding webpage source importance score is updated. In this way the more we find relevant webpages from a source, the more the importance score of this source increases.

Figure 24 shows the performance of event model-based focused crawling with and without source importance for the Brussels attack event. We notice from Figure 24 that both crawlers achieve almost the same performance, with the crawler with source importance struggling in the first half because of the dynamic value of the source importance. We further examined the results of the two crawlers and found that the collection produced from the crawler with source importance had only 21 unique websites (webpage sources) while the crawler with no source importance had 81 unique websites. The crawler with source importance succeeded in collecting the same number of relevant webpages as the crawler with no source importance, but from far fewer websites. We ran the experiment also on 3 more events: California shooting (see Figure 25), Ecuador earthquake (see Figure 26), and Orlando shooting (see Figure 27).
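The combination by multiplication and the dynamic update described above could be sketched as follows. This is a hedged illustration only: the class name `SourceImportance`, the use of the URL host as the source, and the choice to also count non-relevant webpages on update (as the success rate function requires) are assumptions of this example.

```python
from urllib.parse import urlparse


class SourceImportance:
    """Tracks per-source counts and returns the success rate score."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha      # prior for relevant webpages
        self.beta = beta        # prior for non-relevant webpages
        self.relevant = {}      # source -> number of relevant webpages
        self.non_relevant = {}  # source -> number of non-relevant webpages

    @staticmethod
    def source_of(url):
        # Assumption: the source is the host part of the URL.
        return urlparse(url).netloc.lower()

    def update(self, url, is_relevant):
        """Called after a webpage has been retrieved and judged."""
        counts = self.relevant if is_relevant else self.non_relevant
        source = self.source_of(url)
        counts[source] = counts.get(source, 0) + 1

    def score(self, url):
        source = self.source_of(url)
        good = self.relevant.get(source, 0)
        bad = self.non_relevant.get(source, 0)
        return (good + self.alpha) / (bad + self.beta)


def final_score(relevance, url, tracker):
    """Product of the relevance score and the source importance score."""
    return relevance * tracker.score(url)


if __name__ == "__main__":
    tracker = SourceImportance()
    tracker.update("http://www.cnn.com/a", True)
    tracker.update("http://www.cnn.com/b", True)
    print(final_score(0.5, "http://www.cnn.com/c", tracker))
```

Because every unseen source starts at the same score, the ranking at the start of a crawl is driven by the relevance score alone; the source term only begins to reorder the frontier once some webpages have been judged.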

Figure 24 Effect of source importance on event focused crawling for Brussels attack event
[Plot: harvest ratio (percentage of crawled webpages that are relevant, 0 to 1) versus total number of crawled webpages (0 to 1000); series: Event Focused Crawler with Source Importance, Event Focused Crawler.]

Figure 25 Effect of source importance on event focused crawling for California shooting event
[Plot: percentage of crawled pages that are relevant (0 to 1) versus number of pages crawled (0 to 10000); series: Event Focused Crawler, Event Focused Crawler with Source Importance.]


Figure 26 Effect of source importance on event focused crawling for Ecuador earthquake event
[Plot: percentage of crawled webpages that are relevant (axis 0 to 1.2) versus number of webpages crawled (0 to 10000); series: Event Focused Crawler, Event Focused Crawler with Source Importance.]

Figure 27 Effect of source importance on event focused crawling for Orlando shooting event
[Plot: percentage of crawled webpages that are relevant (0 to 1) versus number of webpages crawled (0 to 10000); series: Event Focused Crawler, Event Focused Crawler with Source Importance.]


We can see from the results in Figures 24-27 that adding webpage source importance to event model-based focused crawling enhances the performance and achieves a better harvest ratio. The event model-based focused crawler with source importance struggles in the beginning of the crawl and then manages to enhance the performance at the end of the crawl. The reason for the bad performance in the beginning is that the focused crawler is trying to find the good sources; once it settles on good ones it exploits their importance to reach more relevant webpages.

In this section we covered research questions 4 and 5, which address how to model webpage source importance and how to integrate it with event model-based focused crawlers. We showed that using webpage source importance helps the focused crawler collect relevant webpages while focusing on important sources. Regarding hypothesis 2, first, we have shown that if we know the bias of different websites according to some criteria, we can include other websites in order to reduce bias. Second, our experimental results show that integrating event information and source importance leads to improved estimates of relevance. Additional demonstrations of these capabilities are given in the following subsections.

7.2 Seed URLs for crawling

The factors that affect any crawling experiments are the number of desired webpages to be crawled and the number of seed URLs fed to the focused crawler. Successfully collecting the number of desired webpages depends on the quality and the number of seed URLs we start from. We should start from URLs that will link/lead to the largest number of relevant webpages. There are different types of seed URLs. We classify the seed URLs with regard to relevance and linking as follows: relevant and linking to relevant URLs, relevant and not linking, non-relevant and linking, non-relevant and not linking. We don’t want the last type (totally useless URLs). All other types are either good in their own right, or are pointing to other good URLs. Table 9 summarizes the different types of seed URLs. We determine whether a URL links to other relevant URLs or not by downloading the corresponding webpage and extracting the links from the webpage content. We estimate the relevance of a URL by classifying the URL tokens as relevant or non-relevant to an event. URL tokens are the set of tokens extracted from the URL address and the URL anchor text that appears on the parent webpage.


Table 9 Different types of seed URLs

| | Linking to relevant URLs | Not linking to relevant URLs |
|---|---|---|
| Relevant | Hub webpage | Dead end, authority webpage |
| Non-Relevant | Tunneling | Reject; ignore that path |
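The four cells of Table 9 and the notion of URL tokens can be sketched as below. This is an illustrative assumption-laden example: the tokenization rule (lowercasing and splitting on non-alphanumeric characters) and the function names are hypothetical, and the classifier that decides relevance from the tokens is not shown.

```python
import re
from urllib.parse import urlparse


def url_tokens(url, anchor_text=""):
    """Tokens from the URL address plus the anchor text on the parent page."""
    parsed = urlparse(url)
    raw = " ".join([parsed.netloc, parsed.path, anchor_text]).lower()
    # Assumption: any run of non-alphanumeric characters separates tokens.
    return [tok for tok in re.split(r"[^a-z0-9]+", raw) if tok]


def seed_type(is_relevant, links_to_relevant):
    """Map the two properties of a seed URL to its cell in Table 9."""
    if is_relevant and links_to_relevant:
        return "hub"
    if is_relevant:
        return "dead end / authority"
    if links_to_relevant:
        return "tunneling"
    return "reject"


if __name__ == "__main__":
    url = ("http://www.cnn.com/2014/04/18/world/asia/"
           "malaysia-airlines-plane/index.html")
    print(url_tokens(url, "Malaysia plane"))
    print(seed_type(True, False))
```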

7.3 Semi-automated Social Media-based Seed URL Generation

Social media has proven to be an important and rich asset for collecting webpages about events. Ensuring full Web archive coverage of an event is not an easy task, for several reasons. First, events differ in impact and importance. Big events tend to last for a long time, impact multiple places, and even spark a range of debates about diverse topics. Second, to build a Web collection that fully covers an event requires sampling an unbiased set of webpages from the WWW (which is huge, heterogeneous, and dynamically changing). The size of the WWW makes it difficult to collect, curate, and sample an unbiased set of webpages using manual techniques. Fortunately, focused crawlers have been proven effective [2, 26, 46, 66, 74, 75, 89] in automating and accelerating the process of collecting webpages, starting from a set of seed URLs. However, the ability of the focused crawler to find relevant and diverse webpages depends on the quality (content quality and linking structure quality) and the broad coverage (seed URLs from different webpage sources) of the seed URLs.

We have been researching building webpage/tweet collections about events. We identified three main approaches: 1) the Internet Archive’s Archive-It service for collecting and archiving webpages, 2) a pair of archiving tools for collecting tweets, and 3) event model-based focused crawling of webpages. We proposed a hybrid approach for building unbiased collections of webpages with high coverage using seed URLs generated from social media content (tweets), together with event model-based focused crawlers. The tweet collection processes [1, 2, 74, 91] ensure a large sample of seed URLs with broad and heterogeneous genres of webpages (horizontal/exploring aspect) while the event model-based focused crawler ensures high quality and relevant webpages (vertical/exploiting aspect).


7.3.1 Selecting Seed URLs

We apply the following steps for selecting/curating the set of seed URLs:

• Group long URLs by sources/domains/hosts
• Count the number of long URLs per source (source importance)
• Sort sources (descending) according to number of URLs in each source
• Pick top K sources and then choose one URL from each of the K sources
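The four steps above can be sketched in a few lines of Python. The function name `select_seeds`, the use of the URL host as the source, and the rule of taking the first URL seen for each source are assumptions of this example; the text does not say which URL is chosen within a source.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_seeds(long_urls, k):
    """Pick one seed URL from each of the top-k most frequent sources."""
    # Step 1: group long URLs by source (here, the host name).
    by_source = defaultdict(list)
    for url in long_urls:
        by_source[urlparse(url).netloc.lower()].append(url)

    # Steps 2 and 3: count URLs per source and sort in descending order.
    ranked = sorted(by_source.items(),
                    key=lambda item: len(item[1]),
                    reverse=True)

    # Step 4: take the top k sources and one URL from each
    # (assumption: the first URL seen for that source).
    return [urls[0] for _, urls in ranked[:k]]


if __name__ == "__main__":
    pool = [
        "http://a.com/1", "http://a.com/2", "http://a.com/3",
        "http://b.org/x", "http://b.org/y",
        "http://c.net/z",
    ]
    print(select_seeds(pool, 2))
```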

Choosing K unique sources ensures diversity of the seed URLs, and choosing the top K according to the source importance measure ensures broad coverage and high quality. Although tweet collections about events are a very rich source of seed URLs, they contain a lot of noise (porn, job marketing, other spam, and varied other types of non-relevant tweets or URLs). One important step required before using URLs extracted from tweets is an importance analysis of each URL/webpage source, e.g., considering the domain name of the URL. The set of seed URLs should be normalized. Consider the list below. URL1 is the normalized version of URL2. Both URLs point to the same webpage.

• URL1 = www.cnn.com
• URL2 = www.cnn.com?utm_source=feedburner&utm_medium=twitter

The URLs mentioned in tweets are not all relevant. We need to filter them. We used a keyword-based filtering method, which includes only URLs that have at least one of a pre-defined set of keywords. The set of keywords is created manually for each event. Such a filtering process is close in accuracy to classification, but much faster.
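A minimal sketch of the normalization and keyword filtering steps follows. It rests on assumptions not stated in the text: normalization is taken to mean dropping `utm_*` tracking parameters and fragments, URLs without a scheme are given `http://`, and keywords are matched against the URL string itself.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode


def normalize_url(url):
    """Drop utm_* tracking parameters and the fragment; lowercase the host."""
    # Assumption: scheme-less URLs such as "www.cnn.com" are HTTP.
    parsed = urlparse(url if "://" in url else "http://" + url)
    kept = [(key, value)
            for key, value in parse_qsl(parsed.query, keep_blank_values=True)
            if not key.lower().startswith("utm_")]
    return urlunparse((parsed.scheme, parsed.netloc.lower(), parsed.path,
                       parsed.params, urlencode(kept), ""))


def has_keyword(url, keywords):
    """True if at least one keyword occurs in the URL string."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


if __name__ == "__main__":
    url1 = "www.cnn.com"
    url2 = "www.cnn.com?utm_source=feedburner&utm_medium=twitter"
    print(normalize_url(url1) == normalize_url(url2))
    print(has_keyword("http://example.com/brussels-news",
                      ["brussels", "attack"]))
```

Normalizing before grouping by source matters because otherwise the same webpage shared with different tracking parameters would be counted as several distinct URLs.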

7.3.2 Seed URL Domain/Source Importance

We define the webpage source importance as the probability of finding more relevant webpages when starting with a seed URL from that source. We assume that the topical/event locality property holds, where webpages about an event link to other webpages about the same event. Also we assume that URLs/webpages from the same source are connected (i.e., if you start from one of them you can reach the others). We estimate the webpage source importance by calculating the number of URLs from the same source extracted from the tweet collection. Figure 28 shows the workflow for extracting URLs from tweet collections. We apply our methods on the extracted URLs to calculate the source importance.


Figure 28 Workflow for extracting, expanding, and selecting URLs from tweets

We ran an experiment about the Brussels attack event. We ran our event focused crawler to collect 1000 webpages starting from different sets of seed URLs. The sets of seed URLs tested in our experiments differ in two aspects: the number (K) of URLs and the uniqueness of the URLs with respect to their websites. We selected the seed URLs from a pool of URLs extracted from a set of tweets collected using the Twitter streaming API. Table 10 summarizes the statistics for the Brussels attack tweet collection. Figure 29 shows the language distribution of the tweets. Most of the tweets are in the English language, which is the language we worked on in our research. Figures 30 and 31 show the number of tweets (without and with URLs), and their distribution across time. Most of the tweets were posted on the first day of the event; their number decreases with time. Figure 32 shows the distribution of the sources according to our source importance measure. We used the harvest ratio measure to evaluate the output of the focused crawlers.
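For reference, the harvest ratio used here is the fraction of crawled webpages that are relevant. The sketch below is a hedged illustration; the function names and the representation of a crawl as a sequence of relevance judgments in crawl order are assumptions of this example.

```python
def harvest_ratio(relevance_flags):
    """Fraction of crawled webpages judged relevant (0.0 for an empty crawl)."""
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


def harvest_curve(relevance_flags):
    """Running harvest ratio after each crawled webpage, in crawl order."""
    curve = []
    relevant_so_far = 0
    for crawled, flag in enumerate(relevance_flags, start=1):
        if flag:
            relevant_so_far += 1
        curve.append(relevant_so_far / crawled)
    return curve


if __name__ == "__main__":
    judgments = [True, False, True, True]
    print(harvest_ratio(judgments))
    print(harvest_curve(judgments))
```

The running version corresponds to the kind of curve plotted against the number of crawled webpages in Figures 24-27.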


Table 10 Brussels attack tweet collection statistics

| Category | Number |
|---|---|
| All tweets | 2,227,706 |
| Tweets in English (lang=en) | 1,838,276 |
| Tweet creation date distribution: 3/22/2016 | 1,253,152 |
| Tweets with URLs | 937,009 |
| Tweets with URL creation date distribution: 3/22/2016 | 462,154 |
| Unique short URLs extracted (lang=en) | 113,402 |
| Unique long URLs | 85,991 (twitter.com = 38,168) |
| Unique domains/sources | 8,082 (2980 >= 2, 596 >= 10) |
| De-duplicated URLs | 74,698 |
| URLs with keywords “brussels, attack” | 16,187 |

Figure 29 Brussels attack tweets language distribution
[Plot titled "Language Distribution": number of tweets (0 to 2,000,000) per language code, starting with en, und, fr, tr, es, nl, de, ar.]


Figure 30 Brussels attack tweets creation date distribution
[Plot titled "Tweets Date Distribution": number of tweets (0 to 1,200,000) versus tweet creation date (3/22/16 to 3/28/16).]

Figure 31 Brussels attack tweets with URLs distribution
[Plot titled "Tweets w/ URLs Date Distribution": number of tweets with URLs (0 to 500,000) versus tweet creation date (3/22/16 to 3/28/16).]


Figure 32 Brussels attack seed URLs domains distribution

Table 11 summarizes the results of the different settings regarding the seed URLs for the Brussels attack event. The first row is for the case of having K unique domains. We choose one URL from each domain. The second row is for the case of having K/2 unique domains; we choose 2 URLs from each domain. The domains from which we choose the URLs are sorted according to how many URLs belong to them. So the top K domains are the ones that have the largest numbers of URLs belonging to them. The columns represent the different values for K (the desired number of URLs we want to select as seeds). As the results show, as we increase the number of seeds, the focused crawler finds more relevant webpages (leading to a higher harvest ratio). Also, distributing the seed URLs across several websites increases the ability of the focused crawler to find more relevant webpages. Table 12 shows the webpage source distribution in the resulting crawled webpages. As the results show, using URLs with more unique sources produces collections with a more diverse set of sources, and this increases by increasing the number of seeds. For example, starting from 10 seed URLs from unique sources led to a collection of 1000 webpages from 68 unique sources, in contrast to 31 unique sources when multiple seed URLs came from the same source.

[Figure 32 plot titled "Domains Distribution": number of URLs (0 to 500) per domain.]


Table 11 Harvest ratio for event focused crawler using two methods of seed selection with different numbers of seeds for Brussels attack

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K Frequent unique websites | 0.685 | 0.752 | 0.817 |
| Top K/2 Frequent websites with multiple URLs from same source | 0.645 | 0.763 | 0.775 |

Table 12 Number of different domains in the output collections of crawling experiments for Brussels attack

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K frequent unique websites | 68 | 74 | 117 |
| Top K frequent websites with multiple URLs from same source | 31 | 58 | 78 |

Tables 13 and 14 show the results for the Oregon shooting event, Tables 15 and 16 show the results for the California shooting event, and Tables 17 and 18 show the results for the Orlando shooting event. For the Oregon shooting event, we see the same effect as for the Brussels attack event: having seed URLs from unique domains (rather than having multiple URLs from the same domain) increases the ability of the focused crawler to find more relevant webpages. However, in the case K=100, there was no performance improvement. Both methods achieved the same harvest ratio (0.741), but the method using URLs from unique domains produced a collection with 135 unique domains while the method using multiple URLs from the same domain produced a collection with only 96 unique domains. Another problem with the Oregon shooting collection is that we ran our experiments 10 months after the event happened. Most of the seeds and webpages about the event no longer exist (give 404 error as HTTP Response). This affects the performance of both crawlers; adding more seeds doesn’t help in finding relevant webpages.

Table 13 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Oregon shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K Frequent unique websites | 0.717 | 0.744 | 0.741 |
| Top K Frequent websites with multiple URLs from same source | 0.715 | 0.669 | 0.741 |


Table 14 Number of different domains in the output collections of crawling experiments for Oregon shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K frequent unique websites | 65 | 93 | 135 |
| Top K frequent websites with multiple URLs from same source | 59 | 84 | 96 |

We also see the same behavior in the California shooting event as shown in Tables 15 and 16. Crawling with URLs from unique domains achieves better performance than crawling with multiple URLs from the same domain. Increasing the number of seed URLs increases the performance of the focused crawler. Also, in the case K=100, both methods achieve a harvest ratio of around 0.9, but the method using URLs from unique domains produced a collection with 129 unique domains while the method using multiple URLs from the same domain produced a collection with only 101 unique domains. A better way is to choose seed URLs from type 4 (i.e., Hub URLs) (see Table 9), where the URL points to a webpage of relevant content and the webpage contains URLs to other relevant webpages. Since we are automating the process of selecting the seed URLs from the pool of URLs extracted from social media (i.e., Twitter), we think we achieved a reasonable performance with minimal or no manual work. This contrasts with the situation when seed URLs are curated with manual work.

Table 15 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for California shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K Frequent unique websites | 0.7 | 0.7604 | 0.7818 |
| Top K Frequent websites with multiple URLs from same source | 0.66 | 0.7218 | 0.7548 |

Table 16 Number of different domains in the output collections of crawling experiments for California shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K frequent unique websites | 60 | 97 | 129 |
| Top K frequent websites with multiple URLs from same source | 54 | 64 | 101 |

We examined the seed URLs produced in the case K=100 in both the Oregon and California shooting events. We noticed that although we selected the seed URLs from important domains, the webpages these URLs link to are of the authority/dead end type, where the content of the webpages is relevant but they are not linking to other relevant webpages. Thus adding more of these URLs didn’t help the focused crawler find or reach more relevant webpages.

Finally, the Orlando shooting event leads to the same behavior as in the previous events. Adding more URLs from unique domains helps the focused crawler find more relevant webpages, as is shown in Table 17. In the case K=100, adding more unique domains achieved almost the same performance as the method of adding more URLs from the same domain (like for the previously mentioned events). A reasonable justification for that behavior is that when we pick URLs from the top 100 domains, the domain distribution gets wider, and we include domains with a low number of URLs from them (the tail of the distribution), when compared to the most frequent domains at the top of the list. These domains do not link to more relevant URLs and thus don’t help the focused crawler reach more relevant parts of the WWW. We verified that by examining the last 10 domains in the top 100 domains in the Orlando shooting event. We examined the number of URLs that are from the last 10 domains in the crawled webpages produced at the end of the crawl. We found that 8 out of the 10 domains had 1 URL only in the resulting collection of webpages.

An optimum selection of seed URLs is a trade-off between adding more unique domains and having seed URLs from type 4 seed URLs, which are highly relevant on their own, and also link to other relevant URLs. We used the number of URLs from a domain to estimate the probability that a URL from that domain will link to other relevant URLs. A better way (but rather computationally expensive) is to build the Web graph out of the seed URLs and select only the ones that have higher out-degree or that optimize the coverage and diversity trade-off.
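The out-degree alternative mentioned above was not implemented in this work; the following is only a rough sketch of what such a selection might look like. The input format (a mapping from each candidate seed URL to the list of URLs found on its webpage) and the function name are hypothetical, and the cost of downloading each candidate to obtain its outlinks is exactly the expense the text refers to.

```python
def rank_by_out_degree(outlinks, k):
    """Return the k candidate seeds with the most distinct outgoing links."""
    # Count distinct targets so repeated links on one page are not inflated.
    degrees = {url: len(set(targets)) for url, targets in outlinks.items()}
    ranked = sorted(degrees, key=lambda url: degrees[url], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical candidate seeds and the links extracted from each page.
    graph = {
        "http://a.com/story": ["http://x.com/1", "http://y.com/2",
                               "http://z.com/3"],
        "http://b.org/post": ["http://x.com/1"],
        "http://c.net/live": ["http://x.com/1", "http://y.com/2"],
    }
    print(rank_by_out_degree(graph, 2))
```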

Table 17 Harvest ratio for event focused crawler using two methods of seeds selection with different numbers of seeds for Orlando shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K frequent unique websites | 0.6 | 0.709 | 0.722 |
| Top K frequent websites with multiple URLs from same source | 0.5 | 0.6856 | 0.7158 |

Table 18 Number of different domains in the output collections of crawling experiments for Orlando shooting

| Seed selection method | K=10 | K=50 | K=100 |
|---|---|---|---|
| Top K frequent unique websites | 157 | 181 | 220 |
| Top K frequent websites with multiple URLs from same source | 13 | 157 | 181 |


8 Conclusion and Future Work

We proposed a model and representation for events. We showed how to represent an event using our model. We calculated the weights of the three attributes of our event model by jointly optimizing two parameters (the number of keywords and the threshold value) to yield the best F1-score evaluation metric on a manually labeled (relevant and non-relevant) dataset of URLs and webpages about the California shooting. The results showed that the event model, with these weights employed, can effectively classify URLs and webpages as to their relevance to the event of interest.

We incorporated our event model into focused crawling and showed that our event model-based focused crawler built an event-related Web collection more effectively than the state-of-the-art best-first topic-only focused crawler on two different events: California shooting and Brussels attack. The results for small-scale experiments (collecting 1000 webpages from 38 and 23 seed URLs, respectively) showed that our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages about the two events (i.e., achieving higher harvest ratio).

We ran experiments for large-scale crawling (ranging from 50K–500K webpages) on 7 different events: California shooting, Brussels attack, Oregon shooting, Egypt airplane crash, Panama papers, Paris attack, and Ecuador earthquake. We leveraged social media (i.e., Twitter) to extract and select seed URLs for crawling. Our event model-based focused crawler outperformed the topic-only focused crawler by collecting more relevant webpages.

We proposed and incorporated webpages’ source importance into our focused crawler. The results showed that using webpage source importance led to an equivalent quality event related collection, relative to the baseline, but required fewer sources. Finally, we showed the effect of the seed URLs on the quality of the resulting webpage collections. We demonstrated use of the source importance measure to curate and select high quality seed URLs from URLs extracted from social media content (tweets).


We have covered the five research questions and the related hypotheses. Research questions 1 and 2 (and hypothesis 1.1) were covered in Chapter 4: building the event model and representation and using it with the focused crawler. We covered research question 3 (and hypothesis 1.2) in Chapter 6, evaluating the effectiveness of our event model in focused crawling. Finally, we covered research questions 4 and 5 (and hypotheses 1.3 and 2) in Chapter 7, modeling webpage source importance and integrating it with the event model-based focused crawler.

8.1 Contributions

Our contributions in this dissertation research are:

1. Designing an event model that captures the information needed for representing events (with disaster events as a case study).

2. Developing an event-aware focused crawler that uses the event model for the target event and for webpage representation, as well as for developing a new similarity function that helps in webpage relevance estimation.

3. Designing and incorporating a webpage source importance model into a focused crawler system.

4. Developing a new methodology for semi-automated seed URL generation from social media content.

5. Building an event digital library of event-related objects (text, metadata records, archives, and entities).

8.2 Future Work

Our event model has captured three attributes for an event (topic, location, and date). We plan to extend our event model by extracting and adding organizations and participants; that information will represent the ‘Who’ part in the ‘Who did What, Where and When’ event model. This will enrich our event model and consequently should increase the event model-based focused crawler’s power to estimate and retrieve more relevant webpages. Further, with regard to focused crawling for large events, we are integrating our tweet collection efforts, which already have resulted in over 1.2 billion tweets spread across about 1000 collections, with follow-up focused crawling that starts with seeds that come from the URLs found in those tweets.

On the application side, we also plan to use our event model to analyze and summarize a collection of webpages; this can work for any collection about a particular event (e.g., prepared through manual curation, or using our event focused crawler [92]). Using our event model, we will generate a list of indicative sentences, and extract entities to represent and summarize an event. There are multiple algorithms and software implementations for text summarization, but we believe this concept of corpus/event summarization is new and worth investigation. Our preliminary study of such summarization suggests that results will have high quality and utility [92].

Further, we plan to run more experiments on different kinds of events and to test other heuristics for combining webpage source importance with event model-based relevance scores. Finally, we will build a knowledge base of sources, for each type of event. The knowledge base will include a list of URLs, extracted from social media content about different types of events. The list of extracted URLs will be used to build a list of pairs: sources and their importance score (how many relevant URLs are from a source). This list could be used to compute priors for the source importance model during crawling.


References

1. Farag, M.M.G. and E.A. Fox, Building and archiving event web collections: A focused crawler approach, in Bulletin of IEEE Technical Committee on Digital Libraries. 2015. p. 1-2.
2. Farag, M. and E.A. Fox, Focused Crawling For Events. International Journal of Digital Libraries, Special Issue of Web Archiving - In review, 2016.
3. Magdy, M. and E.A. Fox, Intelligent Event Focused Crawling, in The 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM) - Poster. 2014: University Park, Pennsylvania, USA.
4. O'Reilly, T., What is Web 2.0: Design patterns and business models for the next generation of software. Communications & strategies, 2007. 1(1): p. 17.
5. IDEAL. Integrated Digital Event Archive and Library. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/.
6. Internet_Archive. Internet Archive, A digital library of free content and wayback machine. 2016 [cited 2016 April 26]; Available from: https://archive.org/.
7. Farag, M., P. Nakate, and E.A. Fox, Big Data Processing of School Shooting Archives, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 271-272.
8. IDEAL_Collections. IDEAL Web Collections and Tweet Archives. 2016 [cited 2016 April 26]; Available from: http://eventsarchive.org/eventstable.
9. Archive-It. Web archiving services for libraries and archives. 2016 [cited 2016 April 26]; Available from: https://archive-it.org/.
10. G. Mohr, et al. Introduction to Heritrix, an Archival Quality Web Crawler. in Proceedings of the 4th International Web Archiving Workshop (IWAW’04). 2004.
11. Fox, E.A. and J.P. Leidig, Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. 2014: Morgan & Claypool Publishers.
12. Fox, E.A. and R.d.S. Torres, Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. 2014: Morgan & Claypool Publishers.
13. Shen, R., M.A. Goncalves, and E.A. Fox, Key Issues Regarding Digital Libraries: Evaluation and Integration. 2013: Morgan & Claypool Publishers.
14. Fox, E.A., M.A. Goncalves, and R. Shen, Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. 2012: Morgan & Claypool Publishers.
15. Salton, G. and C. Buckley, Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988. 24(5): p. 513-523.
16. Salton, G. and M.J. McGill, Introduction to Modern Information Retrieval. 1986: McGraw-Hill, Inc.
17. Pant, G., P. Srinivasan, and F. Menczer, Crawling the web, in Web Dynamics. 2004, Springer. p. 153-177.
18. Manning, C.D., et al., Introduction to Information Retrieval. 2008: Cambridge University Press. 496.


19. Archive-It. Archive-It Collections, Spontaneous Events. 2016 [cited 2016 July]; Available from: https://archive-it.org/explore?show=Collections&fc=meta_Subject%3ASpontaneousevents.

20. Yang, S., et al. A study of automation from seed URL generation to focused web archive development: the CTRnet context. in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries. 2012. ACM.

21. IDEAL. IDEAL Tweet Collections. 2016 [cited 2016 August 24]; Available from: http://hadoop.dlib.vt.edu/.

22. Fox, E.A. and M.M. Farag, Report on the workshop on web archiving and digital libraries (WADL 2013). SIGIR Forum, 2013. 47(2): p. 128-133.

23. Fox, E.A., Z. Xie, and M. Klein, WADL 2016: Third International Workshop on Web Archiving and Digital Libraries, in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2016, ACM: Newark, New Jersey, USA. p. 293-294.

24. Fox, E.A. and Z. Xie, Web Archiving and Digital Libraries (WADL), in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015, ACM: Knoxville, Tennessee, USA. p. 303-303.

25. WARC. Information and documentation -- WARC file format - ISO 28500:2009. 2016 [cited 2016 August 24]; Available from: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717.

26. Chakrabarti, S., M. Van den Berg, and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 1999. 31(11): p. 1623-1640.

27. Batsakis, S., E.G. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data & Knowledge Engineering, 2009. 68(10): p. 1001-1013.

28. Pant, G. and P. Srinivasan, Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems (TOIS), 2005. 23(4): p. 430-462.

29. Rennie, J. and A. McCallum. Efficient web spidering with reinforcement learning. in Proceedings of the International Conference on Machine Learning. 1999. Citeseer.

30. Grigoriadis, A. and G. Paliouras, Focused crawling using temporal difference-learning, in Methods and Applications of Artificial Intelligence. 2004, Springer. p. 142-153.

31. Singh, N., et al. Large Scale URL-based Classification Using Online Incremental Learning. in Machine Learning and Applications (ICMLA), 2012 11th International Conference on. 2012. IEEE.

32. Menczer, F. and A.E. Monge, Scalable web search by adaptive online agents: An InfoSpiders case study, in Intelligent Information Agents. 1999, Springer. p. 323-347.

33. Meusel, R., P. Mika, and R. Blanco, Focused Crawling for Structured Data, in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. 2014, ACM: Shanghai, China. p. 1039-1048.

34. Ehrig, M. and A. Maedche. Ontology-focused crawling of Web documents. in Proceedings of the 2003 ACM symposium on Applied computing. 2003. ACM.

35. Dong, H., F.K. Hussain, and E. Chang. A survey in semantic web technologies-inspired focused crawlers. in Digital Information Management, 2008. ICDIM 2008. Third International Conference on. 2008. IEEE.

36. Yang, S., et al., CTRnet DL for disaster information services, in Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries. 2011, ACM: Ottawa, Ontario, Canada. p. 437-438.

37. Vural, A.G., B.B. Cambazoglu, and P. Karagoz, Sentiment-focused web crawling. ACM Transactions on the Web (TWEB), 2014. 8(4): p. 22.

38. Fu, T., et al., Sentimental spidering: leveraging opinion information in focused crawlers. ACM Transactions on Information Systems (TOIS), 2012. 30(4): p. 24.

39. Almpanidis, G., C. Kotropoulos, and I. Pitas, Combining text and link analysis for focused crawling—An application for vertical search engines. Information Systems, 2007. 32(6): p. 886-908.

40. Diligenti, M., et al. Focused Crawling Using Context Graphs. in VLDB. 2000.

41. Pant, G. and P. Srinivasan, Link contexts in classifier-guided topical crawlers. Knowledge and Data Engineering, IEEE Transactions on, 2006. 18(1): p. 107-122.

42. Kleinberg, J.M., et al. The web as a graph: measurements, models, and methods. in International Computing and Combinatorics Conference. 1999. Springer.

43. Brin, S. and L. Page, Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 2012. 56(18): p. 3825-3833.

44. Page, L., et al., The PageRank citation ranking: bringing order to the web. 1999: Technical Report. Stanford InfoLab.

45. De Assis, G.T., et al. Exploiting genre in focused crawling. in String Processing and Information Retrieval. 2007. Springer.

46. Pant, G. and P. Srinivasan, Predicting web page status. Information Systems Research, 2010. 21(2): p. 345-364.

47. Pant, G. and P. Srinivasan, Status Locality on the Web: Implications for Building Focused Collections. Information Systems Research, 2013. 24(3): p. 802-821.

48. Chen, Y., A novel hybrid focused crawling algorithm to build domain-specific collections. 2007, Virginia Polytechnic Institute and State University.

49. Allan, J., Introduction to topic detection and tracking, in Topic detection and tracking. 2002, Springer. p. 1-16.

50. Volkova, S., et al., Animal disease event recognition and classification. Using Web Data in the Medical Domain, 2010: p. 54.

51. Westermann, U. and R. Jain, Toward a common event model for multimedia applications. IEEE MultiMedia, 2007. 14(1): p. 19-29.

52. Strötgen, J., M. Gertz, and C. Junghans. An event-centric model for multilingual document similarity. in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. ACM.

53. Li, Z., et al., A probabilistic model for retrospective news event detection, in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. 2005, ACM: Salvador, Brazil. p. 106-113.

54. Ha-Thuc, V., et al. News event modeling and tracking in the social web with ontological guidance. in Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on. 2010. IEEE.

55. Ha-Thuc, V., et al. A relevance-based topic model for news event tracking. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 2009. ACM.

56. Becker, H., M. Naaman, and L. Gravano, Beyond trending topics: Real-world event identification on Twitter, in Fifth International AAAI Conference on Weblogs and Social Media. 2011: Barcelona, Spain.

57. Parikh, R. and K. Karlapalem, ET: events from tweets, in Proceedings of the 22nd International Conference on World Wide Web. 2013, ACM: Rio de Janeiro, Brazil. p. 613-620.

58. Ritter, A., et al., Open domain event extraction from Twitter, in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012, ACM: Beijing, China. p. 1104-1112.

59. Ritter, A., et al., Weakly Supervised Extraction of Computer Security Events from Twitter, in Proceedings of the 24th International Conference on World Wide Web. 2015, ACM: Florence, Italy. p. 896-905.

60. Strötgen, J., M. Gertz, and C. Junghans, An event-centric model for multilingual document similarity, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 953-962.

61. Yom-Tov, E. and F. Diaz, Location and timeliness of information sources during news events, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 1105-1106.

62. Lakka, C., et al., A Bayesian network modeling approach for cross media analysis. Signal Processing: Image Communication, 2011. 26(3): p. 175-193.

63. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015. Knoxville, Tennessee, USA.

64. AlNoamany, Y., M.C. Weigle, and M.L. Nelson, Detecting off-topic pages in web archives. Research and Advanced Technology for Digital Libraries, Springer, 2015: p. 225-237.

65. Menczer, F., et al. Evaluating topic-driven web crawlers. in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. 2001. ACM.

66. Menczer, F., G. Pant, and P. Srinivasan, Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology (TOIT), 2004. 4(4): p. 378-419.

67. Srinivasan, P., F. Menczer, and G. Pant, A general evaluation framework for topical crawlers. Information Retrieval, 2005. 8(3): p. 417-447.

68. Borlund, P., The concept of relevance in IR. Journal of the American Society for Information Science and Technology, 2003. 54(10): p. 913-925.

69. Hjørland, B., The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 2010. 61(2): p. 217-237.

70. Schamber, L., Relevance and Information Behavior. Annual review of information science and technology (ARIST), 1994. 29: p. 3-48.

71. Saracevic, T., Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 2007. 58(13): p. 2126-2144.

72. Mizzaro, S., Relevance: The whole history. JASIS, 1997. 48(9): p. 810-832.

73. Voorhees, E.M., Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 2000. 36(5): p. 697-716.

74. Gossen, G., E. Demidova, and T. Risse. iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling. in Proceedings of the 15th ACM/IEEE-CS on Joint Conference on Digital Libraries. 2015. ACM.

75. Batsakis, S., E.G.M. Petrakis, and E. Milios, Improving the performance of focused web crawlers. Data Knowl. Eng., 2009. 68(10): p. 1001-1013.

76. Salton, G., A. Wong, and C.S. Yang, A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975. 18(11): p. 613-620.

77. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems, 1998. 30(1-7): p. 107-117.

78. Cho, J., H. Garcia-Molina, and L. Page, Efficient Crawling Through URL Ordering, in Seventh International World-Wide Web Conference (WWW 1998). 1998: Brisbane, Australia.

79. Heydon, A. and M. Najork, Mercator: A scalable, extensible web crawler. World Wide Web, 1999. 2(4): p. 219-229.

80. Kobayashi, M. and K. Takeda, Information retrieval on the web. ACM Comput. Surv., 2000. 32(2): p. 144-173.

81. Chakrabarti, S., Mining the Web: Discovering Knowledge from HyperText Data. 2002: Science and Technology Books. 350.

82. Castillo, C., Effective web crawling. SIGIR Forum, 2005. 39(1): p. 55-56.

83. Chakrabarti, S., K. Punera, and M. Subramanyam, Accelerated focused crawling through online relevance feedback, in Proceedings of the 11th international conference on World Wide Web. 2002, ACM: Honolulu, Hawaii, USA. p. 148-159.

84. Atkinson, M.D., et al., Min-max heaps and generalized priority queues. Communications of the ACM, 1986. 29(10): p. 996-1000.

85. Min-max_heap. Min-max heap. 2016 [cited 2016 August 24]; Available from: https://en.wikipedia.org/wiki/Min-max_heap.

86. Yom-Tov, E. and F. Diaz, Out of sight, not out of mind: on the effect of social and physical detachment on information need, in Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011, ACM: Beijing, China. p. 385-394.

87. Foley, J., M. Bendersky, and V. Josifovski, Learning to Extract Local Events from the Web, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2015, ACM: Santiago, Chile. p. 423-432.

88. Baeza-Yates, R. and B. Ribeiro-Neto, Modern information retrieval. Vol. 463. 1999: ACM press, New York.

89. Aggarwal, C.C., F. Al-Garawi, and P.S. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. in Proceedings of the 10th international conference on World Wide Web. 2001. ACM.

90. Liu, H., J. Janssen, and E. Milios, Using HMM to learn user browsing patterns for focused web crawling. Data & Knowledge Engineering, 2006. 59(2): p. 270-291.

91. Farag, M. and E.A. Fox, Which webpage should we crawl first? Social media-based webpage source importance guidance, in Workshop on Web Archiving and Digital Libraries (WADL 2016) - Joint Conference on Digital Libraries (JCDL 2016). 2016: Newark, NJ, USA.

92. Farag, M. and E.A. Fox, Web archive content analysis. 2015: Presented at International Internet Preservation Consortium General Assembly IIPC 2015, California, USA.