MLconf Yael Elmatad
Getting Cozy With Raw Data (A Cautionary Tale)
Yael Elmatad, Data Scientist, Tapad
@y_s_e
The Ad Tech Space
The goal of Ad Tech is to show advertisements to consumers on the internet and to ensure that the right ad gets shown to the right person.
Many components:
  Publishers who have ad space on their pages
  "Sell" side platforms which aggregate publishers and facilitate selling of ad space
  "Buy" side platforms (like Tapad) which bid on that space to show current ad campaigns
  Advertisers who entrust demand side platforms to place their content appropriately
Why Cross-Device?
Device Proliferation: 5.7 internet connected devices per household
Screen Switching: Digital Natives switch screens 27 times every non-working hour
Purchasing Across Devices: 40% of shoppers consult 3 or more channels before purchase
Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012
Tapad Connects Consumers' Devices
To address these issues, Tapad built The Device Graph.
The Device Graph seeks to connect devices within a household for targeting across multiple screens.
Our edges are inferred based on a variety of techniques including co-location, partnerships with other companies, and obfuscated login data (where no personally identifiable data is ever observed).
Tapad Statistics
Over 2 billion nodes (devices) in The Device Graph.
Representing about 100 million households and approximately 250 million individuals.
75% of connected devices are connected to 3 or more devices.
38% of devices are computers; 36% are smartphones and tablets.
The (Original, Household) Device Graph
No scores between edges. No way to separate individuals.
[Figure: household graph with device nodes labeled iPad, computer, and Kindle.]
What We Wanted
Edge thickness indicates confidence of link between devices
Colors indicate community-detection-based device clustering
The (household) Device Graph naturally restricts our search space
We never seek to identify individuals - only to group devices used by the same individual
Graph can be traversed at varying thresholds (scale vs accuracy)
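Traversing the graph at varying thresholds can be illustrated with a minimal sketch. It assumes each edge carries a confidence score between 0 and 1. The tuple format, the device names, the scores, and the use of connected components as the grouping step are all hypothetical; they stand in for whatever community detection is actually used.

```python
from collections import defaultdict

def cluster_devices(edges, threshold):
    """Group devices joined by edges scoring at or above threshold.

    edges: iterable of (device_a, device_b, score) tuples (hypothetical format).
    Returns a list of sets, one per connected component.
    """
    neighbors = defaultdict(set)
    for a, b, score in edges:
        # Touch both endpoints so isolated devices still form a cluster.
        neighbors[a]
        neighbors[b]
        if score >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk over the edges that survived the threshold.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Made-up household; the scores are illustrative only.
household = [
    ("ipad", "computer", 0.9),
    ("computer", "kindle", 0.4),
]
loose = cluster_devices(household, threshold=0.3)   # favors scale
strict = cluster_devices(household, threshold=0.8)  # favors accuracy
```

With the low threshold all three devices fall into one group; with the high threshold the Kindle splits off on its own.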
Scoring Edges
We needed a way to put weights on edges.
First Attempt: Use Segment data
Provided by first or third-parties
Tries to put devices into inferred buckets. ex:
  Dog lover
  Comic book enthusiast
  Male
Pros/Cons of Segment Data
Pros:
  Relatively extensive coverage
  Simple to read/human intelligible
  Finite
Cons:
  Don't know how the segments are determined (black box)
  Different providers may not have the same methods
  The longer a device has been in our graph the more audiences it will accumulate (snowballs)
Plan of Attack
1. Used the segments as features to create feature vectors.
2. Compared several methods:
   Simple dot product (baseline)
   Probabilistic approaches that use segment co-occurrence
   Machine learning approaches that use truth data and existing graph structure as proxy data.
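As a rough illustration of step 1 and the dot product baseline, here is a minimal sketch. It assumes binary features (a device either carries a segment or it does not); the segment names and devices are made up, and the actual feature encoding is not described in the slides.

```python
def segment_vector(segments, vocabulary):
    """Binary feature vector: 1 if the device carries the segment, else 0."""
    return [1 if name in segments else 0 for name in vocabulary]

def dot_product(u, v):
    """Baseline similarity: number of segments two devices share."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical segment vocabulary and devices.
vocabulary = ["dog_lover", "comic_book_enthusiast", "male"]
device_a = segment_vector({"dog_lover", "male"}, vocabulary)
device_b = segment_vector({"dog_lover", "comic_book_enthusiast", "male"}, vocabulary)
device_c = segment_vector({"comic_book_enthusiast"}, vocabulary)

score_ab = dot_product(device_a, device_b)  # two shared segments
score_ac = dot_product(device_a, device_c)  # no shared segments
```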
What Do We Mean By Proxy Data?
Assumption: Two nodes connected in The (household) Device Graph are more likely to be similar to each other than two unconnected nodes.
Measuring Performance
To compare methods we compute the Win Rate.
1. Select pair of devices connected in graph; compute score between them (true_score).
2. Select random device unconnected to original devices; compute a score with one of original devices (false_score).
    if true_score > false_score:
        win_value = 1.0
    elif true_score < false_score:
        win_value = 0.0
    else:  # ties
        win_value = 0.5
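The two steps can be sketched end to end as follows. This is an illustration under assumptions, not the production procedure: the function names, the uniform sampling of edges and devices, and the reading of "unconnected" as "not linked to either device of the pair" are all choices made here.

```python
import random

def win_value(true_score, false_score):
    if true_score > false_score:
        return 1.0
    if true_score < false_score:
        return 0.0
    return 0.5  # ties

def win_rate(edges, devices, score, trials, rng):
    """Average win_value over sampled (connected pair, unconnected device) trials.

    edges: list of (device_a, device_b) pairs connected in the graph.
    devices: list of all device ids.
    score: function (device_x, device_y) -> number.
    """
    connected = set()
    for a, b in edges:
        connected.add((a, b))
        connected.add((b, a))

    total = 0.0
    for _ in range(trials):
        a, b = rng.choice(edges)
        # Draw a device not connected to either endpoint of the true edge.
        # (Assumes such a device exists; otherwise this loop never ends.)
        while True:
            c = rng.choice(devices)
            if c not in (a, b) and (a, c) not in connected and (b, c) not in connected:
                break
        total += win_value(score(a, b), score(a, c))
    return total / trials

# Toy data: connected devices share features, unconnected ones do not.
features = {
    "d1": {"x", "y"}, "d2": {"x", "y"},
    "d3": {"z"}, "d4": {"z"},
}
edges = [("d1", "d2"), ("d3", "d4")]

def overlap(p, q):
    return len(features[p] & features[q])

rate = win_rate(edges, list(features), overlap, trials=100, rng=random.Random(0))
```

On this toy graph every connected pair outscores every unconnected pair, so the average comes out at 1.0; real data would land somewhere between 0.5 and the optimum.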
Performance Expectations
A random algorithm should achieve an average win_value of about 0.5.
We expect an optimal algorithm to achieve an average win_value of about 0.75, which is 50% better than random.
Why? Because census data suggests around 2 adults per household. Therefore, we expect about half of our household edges to be highly correlated (similar) while the remainder should be statistically uncorrelated (dissimilar).
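The arithmetic behind the 0.75 figure can be spelled out in a few lines. The sketch assumes that an optimal scorer always wins on the similar (same person) edges and does no better than a coin flip on the dissimilar ones; that is one reading of the argument above, and the function and parameter names are hypothetical.

```python
def expected_win_value(fraction_similar, win_if_similar=1.0, win_if_dissimilar=0.5):
    """Mix the two kinds of household edge by how often each occurs."""
    return (fraction_similar * win_if_similar
            + (1.0 - fraction_similar) * win_if_dissimilar)

# Half the edges similar, half dissimilar: 0.5 * 1.0 + 0.5 * 0.5
optimal = expected_win_value(0.5)
```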
Well, how do segment data perform?
In a word: poorly.
Our attempts eked in just above the random line around an average win_value = 0.55.
At most, 10% better than random!
So what happened?
Segment data are riddled with:
  randomness & noise
  hidden bias
An example of randomness & noise: 1 out of 4 devices which "self identified as mom" are also tagged as "male".
(Either we're really really progressive, or something has gone horribly wrong.)
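A sanity check of this kind is easy to sketch. The devices and segment names below are made up and arranged to reproduce the 1-in-4 figure; the helper name is hypothetical.

```python
def contradiction_rate(device_segments, segment, conflicting):
    """Fraction of devices carrying `segment` that also carry `conflicting`."""
    tagged = [segs for segs in device_segments.values() if segment in segs]
    if not tagged:
        return 0.0
    return sum(1 for segs in tagged if conflicting in segs) / len(tagged)

# Made-up devices: four are tagged "mom", and one of those is also "male".
device_segments = {
    "d1": {"mom", "female"},
    "d2": {"mom"},
    "d3": {"mom", "dog_lover"},
    "d4": {"mom", "male"},
    "d5": {"male", "comic_book_enthusiast"},
}
rate = contradiction_rate(device_segments, "mom", "male")
```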
So Much Bias!
Platform Bias: Certain segments are platform specific. (For example: "used a specific mail client on Android")
Source Bias: We don't always have overlap between different first and third parties we work with and the overlap is not uncorrelated.
Temporal Bias: Long-lived devices tend to accumulate segments (snowballs!).
Audience Value Bias: Certain segments are worth more to advertisers so they appear more often than expected. (Example: people intending to purchase automobiles.)
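To make the temporal bias concrete, the sketch below compares a raw dot product with a length-normalized (cosine) score on made-up devices. Cosine normalization is only one hypothetical way to account for snowballing; nothing in the talk says it was the correction used.

```python
import math

def overlap(a, b):
    """Dot product of two binary segment vectors, written as set overlap."""
    return len(a & b)

def cosine(a, b):
    """Overlap normalized by segment counts, so large sets are not favored."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical long-lived devices: ten segments each, three shared by chance.
old_1 = {f"s{i}" for i in range(10)}      # s0..s9
old_2 = {f"s{i}" for i in range(7, 17)}   # s7..s16
# Hypothetical new devices: two segments each, both shared.
new_1 = {"s0", "s1"}
new_2 = {"s0", "s1"}

raw_old, raw_new = overlap(old_1, old_2), overlap(new_1, new_2)
cos_old, cos_new = cosine(old_1, old_2), cosine(new_1, new_2)
```

The raw score ranks the two old devices above the two new ones (3 versus 2) purely because they have accumulated more segments, while the normalized score reverses the order.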
Platform Bias
[Three figure slides; images not preserved in the transcript.]
Source Bias
[Figure slide; image not preserved in the transcript.]
Next Steps
Either: Account for these biases explicitly and try to correct them. (see: engineering.tapad.com)
Or: Test different algorithms.
Or: Abandon the effort and look elsewhere for different data.
We opted for the last one.
Browsing Data
In the end, we opted to use our in-house browsing data.
Browsing data are data we obtain when examining available ad space. Each piece of data gives us an obfuscated ID and the url on which the device is browsing.
Initially avoided due to sparsity: While we saw about 20 pieces of audience data on average on a device, we were in some cases limited to a single unique url per device because this data is harder to come by than black box segment data.
Plan of Attack
(Preprocessing: remove the fraudulent urls associated with botnets.)
Just as before, create a feature vector but now the features are the legitimate unique domains (tapad.com, mlconf.com, etc...).
Compare several methods:
  The feature vector dot product (baseline)
  Matrix-based approaches which use probabilistic correlations based on url co-occurrence on nodes
  Clustering-based approaches which reduce dimensionality by first clustering highly correlated urls
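The slides do not spell out the matrix-based method, so the following is only a guess at its general shape. It assumes urls include a scheme, treats the blocklist as a stand-in for the botnet preprocessing, and scores a pair of devices by summing estimated conditional probabilities P(b | a) over their domains. All function names, the blocklist, and the sample logs are hypothetical.

```python
from collections import Counter
from itertools import combinations
from urllib.parse import urlparse

def domain_features(urls, blocked_domains=frozenset()):
    """Unique domains a device was seen on, minus a hypothetical blocklist."""
    domains = set()
    for url in urls:
        host = urlparse(url).hostname  # None when the url has no scheme
        if host is None:
            continue
        if host.startswith("www."):
            host = host[4:]
        if host not in blocked_domains:
            domains.add(host)
    return domains

def cooccurrence_probabilities(device_domains):
    """Estimate P(b | a): share of devices with domain a that also have b."""
    seen = Counter()
    together = Counter()
    for domains in device_domains:
        seen.update(domains)
        for a, b in combinations(sorted(domains), 2):
            together[(a, b)] += 1
            together[(b, a)] += 1
    return {pair: count / seen[pair[0]] for pair, count in together.items()}

def cooccurrence_score(domains_x, domains_y, probabilities):
    """Sum P(b | a) over cross-device domain pairs; identical domains count as 1."""
    return sum(
        1.0 if a == b else probabilities.get((a, b), 0.0)
        for a in domains_x
        for b in domains_y
    )

# Made-up browsing logs keyed by obfuscated device id.
logs = {
    "d1": ["http://tapad.com/a", "http://www.mlconf.com/"],
    "d2": ["http://mlconf.com/talks"],
    "d3": ["http://tapad.com/b", "http://mlconf.com/", "http://botnet.example/x"],
}
features = {
    device: domain_features(urls, blocked_domains={"botnet.example"})
    for device, urls in logs.items()
}
probs = cooccurrence_probabilities(features.values())
baseline = len(features["d1"] & features["d2"])  # plain dot product
score = cooccurrence_score(features["d1"], features["d2"], probs)
```

Here the baseline only sees the one shared domain, while the co-occurrence score also credits tapad.com on one device against mlconf.com on the other, because the two domains tend to appear together. That is the kind of signal that can help when a device has only one or two urls.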
Performance
Much better!
Simple dot product (baseline) already performs about 18% better than random.
Both the matrix-based and clustering-based approaches perform up to 40% better than random.
This is in the range of how we expect an optimal algorithm to perform - despite data sparsity!
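For comparison on a single scale, the percentages can be converted back into average win_value figures. The sketch assumes "X% better than random" is measured relative to the 0.5 baseline, which matches how the talk pairs 0.75 with 50% and 0.55 with 10%; the function name and labels are hypothetical.

```python
def win_value_from_lift(lift, random_baseline=0.5):
    """Average win_value implied by a relative improvement over random."""
    return random_baseline * (1.0 + lift)

# Relative improvements as quoted in the talk.
reported = {
    "segment data (best attempt)": 0.10,
    "browsing data, dot product": 0.18,
    "browsing data, matrix/clustering": 0.40,
    "expected optimum": 0.50,
}
implied = {
    name: round(win_value_from_lift(lift), 2)
    for name, lift in reported.items()
}
```

Under that reading, the browsing-data methods reach roughly 0.59 and up to 0.70, against an expected ceiling of 0.75.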
Moral
Don't assume because pieces of data are nicely tied in a bow and plentiful that they are the right data to use.
Question your data, not only your algorithms.
The best pieces of data may be scarce and raw because they are often less fraught with hidden biases and unnecessary processing.
Learn more about Tapad
Read our blog: http://engineering.tapad.com
Follow us on twitter: @tapad, @tapadeng
Follow us on Instagram: @tapadinc (includes a picture of yours truly in a headstand.)
Contact me: [email protected], @y_s_e