The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang...

31
The Billion Object Pla1orm (BOP): A Real-:me, Big Data, Spa:o-Temporal Explora:on Pla1orm Harvard ABCD-GIS Benjamin Lewis, Merce Crosas, David Smiley, Devika Kakkar, Ariel Nunez

Transcript of The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang...

Page 1: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

TheBillionObjectPla1orm(BOP):AReal-:me,BigData,

Spa:o-TemporalExplora:onPla1orm

HarvardABCD-GIS

BenjaminLewis,MerceCrosas,DavidSmiley,DevikaKakkar,ArielNunez

Page 2: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Outline•  Introduc:on•  Architecture•  Harves:ng/Archiving•  Sen:mentEnrichment•  ApacheKaSa•  SolrforGeo-enrichment•  Solr&TimeSharding•  BOPWeb-Service•  ClientUI•  Deployment/Opera:ons•  DockerandKontena

Page 3: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

BOPRequirementsSummary

•  Mostrecent~billiongeo-tweets•  Real:mesearch(<5seclatency)•  Sub-secondqueries–  Includingheatmaps!

•  Onthecheap:~6commodityservers

Provideaproof-of-conceptpla1ormdesignedtolowerthebarrierforresearcherswhoneedtoaccessbigstreamingspa:o-temporaldatasets.

Page 4: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

BOPasanExampleofaNewKindofDatasetAvailableinDataverse

Page 5: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

StreamingData:HarvestandArchive

Page 6: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Ini:alfocusonGeo-tweets(couldbeanystreamingdataset)

•  1-2%oftweetshaveGPScoordinatesfromtheuser’sdevice,rangesfrom1to6millionperday

•  TheCGAhasbeenharves:nggeo-tweetssince2012andhasaninformalarchiveofabout8billionobjects

•  ResearcherRyanQiWangalsoharves:ngduringthisperiod.Histweetswereloadedfirst.CGAtweetswillbemergedlater.

•  Collaborators:•  HarvardDataverseTeam•  BostonAreaResearchIni:a:ve

Page 7: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

LogicalHigh-LevelArchitecture

KaSa(archive)

SolrHarves:ng Enrichment

DataflowsviaApacheKa)a HTTP

WebService

BrowserUI

Docker,Kontena,OpenStackHos:ng:MassOpenCloud

BOP

Page 8: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

ApacheKaSa•  KaSa:ascalablemessage/queuepla1orm•  SeenewKaSaStreams&KaSaConnectAPIs•  Noback-pressure;canbeachallenge•  Non-obvioususe:– Forstorage;:mepar::oning

•  Lotsofbenefitsyetseriouslimita:ons

Page 9: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Real-TimeHarves:ng

Streamtweetsusingpredefinedusersandcoordinatesextent

KaSaTopic

ConnecttoTwijer’sStreamingAPI

Ifthetweetis

Geotagged

Page 10: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Enrichment

Geo:QuerySolrviaspa:alpointquery;ajachrelatedmetadatatotweet

KaSaTopic Enrich KaSa

Topic

TwijerSen:mentClassifier

Geo:Solrwithregionalpolygons&metadata

Page 11: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Sen:mentAnalysis•  Classifier:SupportVectorMachine(SVM)withLinearKernel•  SourcecodeinPython•  Usesscikit-learn,numpy,scipy,NLTK•  Twoclassesofsen:ment:Posi:ve(1),Nega:ve(0)•  TrainingCorpus:Sen:ment140,Polaritydatasetv2.0,Universityof

Michigan•  Preprocessing:Lowercase,URLs,@user,#tags,trimming,repea:ng

characters,emo:cons•  Stemming:Porterstemmer•  Precision,Recall,F1score:0.82(82%)•  Processingspeed:20ms/tweet(noemo:con),5ms/tweet(emo:con)

Page 12: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Sen:mentAnalysisPhase1:Training

Phase2:Predic5on

Loadtheclassifier

Foreachtweet

Parse Preprocess Stem Predict

Traintheclassifier

Saveaspickle

Page 13: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

SolrforGeoEnrichment“ReverseGeocoding”

•  Tweets(docs)canhaveageolat/lon•  EnrichtweetwithCountry,State/Province,…– Gazejeerlookup(point-in-polygon)

DataSet Features Rawsize Index5me Indexsize

Admin2 46,311 824MB 510min 892MB

USStates 74,002 747MB 4.9min 840MB

MassachusejsCensusBlocks 154,621 152MB 5.9min 507MB

Page 14: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

FastPoint-in-PolygonTricksIndex/Config•  Op:mizeto1segment•  RptWithGeometry

Spa:alField–  precisionModel=

"floating_single"–  autoIndex="true"

•  <cachename="perSegSpatialFieldCache_WKT"…

Search•  EmbedSolr(in-process)•  UsedocValues,notstored

–  fl=block:field(GEOID10)Querylikethis:•  q={!fieldcache=false

f=WKT}Intersects(POINT($lon$lat))

Sub-Millisecond!

Page 15: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

ApacheSolr•  Search/analy:csserver,basedonLucene•  Customadd-ons:– Timeshardedrou:ng(index+query)– LatLonPointSpa:alField–inSolr6.5

•  Faster/leanersearch&sortforpointdata– HeatmapSpa:alField–inSolr6.6TBD

•  Faster/leanerheatmapsatscale

Page 16: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Time“Sharding”Solrhasnobuilt-in:mebasedsharding.ASolrcustom“URP”wasdevelopedtoroutetweetstotherightby-monthshard.Itautocreatesanddeletesshards.ASolrcustom“SearchHandler”wasdevelopedtodecidewhichsubsetofshardstosearchbasedoncustomparameterssentbytheweb-service.Generallyusefulforothers.Needmoreworkforcontribu:ontoSolritself.

Page 17: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

TheBOPWeb-Service•  HTTP/RESTAPI–  Keyowrdsearch–  Face:ng

•  Heatmaps–  CSVexport

•  WhynotSolrdirect?–  DefineasupportedAPI–  Easeofuseforclients–  Security

Tech:•  Swagger•  Dropwizard•  Kotlinlang(onJVM)

Page 18: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

ClientUI•  BrowsersideUIwithnoservercomponent•  Itusesthefollowingtechnologies:– AngularJS– OpenLayers3– npm(dependencies,scriptminifica:on,development)

Page 19: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

UIadaptstolaptop

Page 20: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

UIadaptstotablets

Page 21: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

UIadaptstophones

Page 22: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Temporalfiltering

Page 23: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Temporalface:ng(histogram)

Page 24: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Spa:alfiltering

Page 25: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Spa:alface:ng(heatmap)

Page 26: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Textface:ng(tagcloud)

Page 27: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Nearbytweets

Page 28: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Deployment/Opera:ons•  MassOpenCloud“MOC”– OpenStackbasedcloud(mimicsAmazonEC2)

•  CoreOS•  Kontena&Docker•  Admin/Opstools:–  KaSaManager(Yahoo!)–  Solr’sadminUI

Stats:•  12nodes(machines)

•  5toSolr•  3toKaSa•  3toenrichment,…

•  217GBRAM•  3500GBdisk•  17services(soywarepieces)

•  133containers

Page 29: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Docker•  Easytofind/try/use

soyware–  Noinstalla:on–  Simplifiedconfigura:on(envvariables)

–  Commonlogging–  Isolated

•  Idealfor:–  Con:nuousInt.servers–  Tryingnewsoyware–  Produc:onadvantagestoo

•  but“new”

Page 30: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

DockerinProduc:on•  Weuse“Kontena”•  Commonlogging,machine/procstats,security–  VPNtosecurenetwork;accesseverythingaslocal

•  Nolongerneedtocareabout:– Ansible,Chef,Puppet,etc.–  Securityatnetworkorproxy;notservicespecific

•  Challenges:state&big-data

Page 31: The Billion Object Plaorm (BOP): A Real-:me, Big Data ...€¦ · • Dropwizard • Kotlin lang (on JVM) Client UI • Browser side UI with no server component • It uses the following

Thankyou

[email protected]

DevikaKakkar

[email protected]