MLconf Yael Elmatad

Getting Cozy With Raw Data (A Cautionary Tale) Yael Elmatad Data Scientist, Tapad @y_s_e


Transcript of MLconf Yael Elmatad

Page 1: MLconf Yael Elmatad

Getting Cozy With Raw Data (A Cautionary Tale)

Yael Elmatad, Data Scientist, Tapad

@y_s_e

Page 2: MLconf Yael Elmatad

The Ad Tech Space

The goal of Ad Tech is to show advertisements to consumers on the internet and to ensure that the right ad gets shown to the right person.

Many components:
- Publishers who have ad space on their pages
- "Sell" side platforms which aggregate publishers and facilitate selling of ad space
- "Buy" side platforms (like Tapad) which bid on that space to show current ad campaigns
- Advertisers who entrust demand side platforms to place their content appropriately

Page 3: MLconf Yael Elmatad

Why Cross-Device?

Device Proliferation: 5.7 internet connected devices per household

Screen Switching: Digital Natives switch screens 27 times every non-working hour

Purchasing Across Devices: 40% of shoppers consult 3 or more channels before purchase

Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012

Page 4: MLconf Yael Elmatad

Tapad Connects Consumers' Devices

To address these issues, Tapad built The Device Graph.

The Device Graph seeks to connect devices within a household for targeting across multiple screens.

Our edges are inferred based on a variety of techniques including co-location, partnerships with other companies, and obfuscated login data (where no personally identifiable data is ever observed).

Page 5: MLconf Yael Elmatad

Tapad Statistics

Over 2 billion nodes (devices) in The Device Graph.

Representing about 100 million households and approximately 250 million individuals.

75% of connected devices are connected to 3 or more devices.

38% of devices are computers; 36% are smartphones and tablets.

Page 6: MLconf Yael Elmatad

The (Original, Household) Device Graph

No scores between edges. No way to separate individuals.

[Figure: household device graph with nodes labeled iPad, computer, and Kindle]

Page 7: MLconf Yael Elmatad

What We Wanted

- Edge thickness indicates confidence of link between devices
- Colors indicate community detection based device clustering
- The (household) Device Graph naturally restricts our search space
- We never seek to identify individuals - only to group devices used by the same individual
- Graph can be traversed at varying thresholds (scale vs accuracy)
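Traversing a scored graph at varying thresholds can be sketched roughly as follows. This is only an illustration, not Tapad's implementation: the `(device_a, device_b, score)` edge format, the device names, and the scores are made up, and plain connected components stand in for the community detection mentioned above.

```python
from collections import defaultdict

def clusters_at_threshold(scored_edges, threshold):
    """Group devices using only edges whose score is >= threshold.

    scored_edges: iterable of (device_a, device_b, score) tuples (hypothetical format).
    Connected components are used as a simple stand-in for community detection.
    """
    adjacency = defaultdict(set)
    for a, b, score in scored_edges:
        # Touch both endpoints so devices with no surviving edge
        # still show up as singleton clusters.
        adjacency[a]
        adjacency[b]
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Made-up scores for a single household.
edges = [("ipad", "computer", 0.9), ("computer", "kindle", 0.3)]
loose = clusters_at_threshold(edges, 0.2)   # favors scale: one cluster of three
strict = clusters_at_threshold(edges, 0.8)  # favors accuracy: kindle splits off
```

A low threshold keeps more edges and therefore reaches more devices per cluster; a high threshold keeps only the confident links.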

Page 8: MLconf Yael Elmatad

Scoring Edges

We needed a way to put weights on edges.

First Attempt: Use Segment data
- Provided by first or third-parties
- Tries to put devices into inferred buckets. ex:
  Dog lover
  Comic book enthusiast
  Male

Page 9: MLconf Yael Elmatad

Pros/Cons of Segment Data

Pros:
- Relatively extensive coverage
- Simple to read/human intelligible
- Finite

Cons:
- Don't know how the segments are determined (black box)
- Different providers may not have the same methods
- The longer a device has been in our graph the more audiences it will accumulate (snowballs)

Page 10: MLconf Yael Elmatad

Plan of Attack

1. Used the segments as features to create feature vectors.

2. Compared several methods:
- Simple dot product (baseline)
- Probabilistic approaches that use segment co-occurrence
- Machine learning approaches that use truth data and existing graph structure as proxy data.
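As a rough sketch of the baseline, assume each device's feature vector is binary (a segment is either present or absent). The dot product of two such vectors is then just the number of segments the devices share. The function name and the example segments below are illustrative, not taken from Tapad's code.

```python
def segment_dot_product(segments_a, segments_b):
    """Dot product of two binary segment vectors.

    With 0/1 features this equals the number of segments both devices carry.
    """
    return len(set(segments_a) & set(segments_b))

# Made-up devices and segments.
device_a = {"dog lover", "comic book enthusiast", "male"}
device_b = {"dog lover", "male", "auto intender"}
device_c = {"gardener"}

score_ab = segment_dot_product(device_a, device_b)  # two shared segments
score_ac = segment_dot_product(device_a, device_c)  # nothing in common
```

Note that an unnormalized dot product can only grow as a device picks up more segments, so a device that has accumulated many segments will tend to score higher against everything.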

Page 11: MLconf Yael Elmatad

What Do We Mean By Proxy Data?

Assumption: Two nodes connected in The (household) Device Graph are more likely to be similar to each other than two unconnected nodes.

Page 12: MLconf Yael Elmatad

Measuring Performance

To compare methods we compute the Win Rate.

1. Select pair of devices connected in graph; compute score between them (true_score).

2. Select random device unconnected to original devices; compute a score with one of original devices (false_score).

    if true_score > false_score:
        win_value = 1.0
    elif true_score < false_score:
        win_value = 0.0
    else:  # ties
        win_value = 0.5
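The two steps above can be put together into a small averaging loop. The following is a minimal sketch under stated assumptions: the argument names, the `neighbors` lookup, the number of trials, and the toy data are all hypothetical, and the slides do not say how the random device or the "one of original devices" was chosen, so uniform random choices are used here.

```python
import random

def win_rate(connected_pairs, all_devices, neighbors, score, trials=1000, seed=0):
    """Average win_value over randomly sampled connected pairs.

    connected_pairs: list of (device, device) edges in the graph.
    neighbors: dict mapping device -> set of devices it is connected to.
    score: function (device, device) -> number. All names are illustrative.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a, b = rng.choice(connected_pairs)
        true_score = score(a, b)
        # Pick a random device connected to neither of the original devices.
        excluded = {a, b} | neighbors[a] | neighbors[b]
        candidates = [d for d in all_devices if d not in excluded]
        c = rng.choice(candidates)
        anchor = rng.choice((a, b))  # score against one of the original devices
        false_score = score(anchor, c)
        if true_score > false_score:
            total += 1.0
        elif true_score < false_score:
            total += 0.0
        else:  # ties
            total += 0.5
    return total / trials

# Toy graph: two households with disjoint features (made-up data).
features = {"a1": {"x", "y"}, "a2": {"x", "y"}, "b1": {"z"}, "b2": {"z"}}
pairs = [("a1", "a2"), ("b1", "b2")]
neighbors = {"a1": {"a2"}, "a2": {"a1"}, "b1": {"b2"}, "b2": {"b1"}}
devices = sorted(features)

def overlap(p, q):
    return len(features[p] & features[q])

perfect = win_rate(pairs, devices, neighbors, overlap, trials=200)
uninformative = win_rate(pairs, devices, neighbors, lambda p, q: 0.0, trials=200)
```

On this toy graph the overlap score wins every comparison, while a constant score ties every comparison and lands exactly on the random line.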

Page 13: MLconf Yael Elmatad

Performance Expectations

A random algorithm should achieve an average win_value of about 0.5.

We expect an optimal algorithm to achieve an average win_value of about 0.75 -- 50% better than random.

Why? Because census data suggests around 2 adults per household. Therefore, we expect about half of our household edges to be highly correlated (similar) while the remainder should be statistically uncorrelated (dissimilar).
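One way to read the 0.75 figure, as a back-of-the-envelope sketch: assume a "similar" edge always beats the random comparison (win_value 1.0) and a "dissimilar" edge does no better than a coin flip (win_value 0.5). Those two per-edge values are an interpretation of the slide, not something it states explicitly.

```python
def expected_win_value(fraction_similar, win_if_similar=1.0, win_if_dissimilar=0.5):
    """Expected average win_value for a mix of similar and dissimilar edges."""
    return (fraction_similar * win_if_similar
            + (1 - fraction_similar) * win_if_dissimilar)

optimal = expected_win_value(0.5)      # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
relative_gain = (optimal - 0.5) / 0.5  # 0.5, i.e. 50% better than random
```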

Page 14: MLconf Yael Elmatad

Well, how do segment data perform?

In a word: poorly.

Our attempts eked in just above the random line around an average win_value = 0.55.

At most, 10% better than random!

Page 15: MLconf Yael Elmatad

So what happened?

Segment data are riddled with:
- randomness & noise
- hidden bias

An example of randomness & noise: 1 out of 4 devices which "self identified as mom" are also tagged as "male".

(Either we're really really progressive, or something has gone horribly wrong.)
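A sanity check of this kind is easy to run on any segment data set: take the devices carrying one tag and measure how many also carry a tag that should rarely accompany it. The sketch below uses made-up devices arranged to mirror the 1-in-4 figure; the function name and the data layout are hypothetical.

```python
def co_tag_rate(device_segments, segment, other_segment):
    """Fraction of devices carrying `segment` that also carry `other_segment`."""
    tagged = [segs for segs in device_segments.values() if segment in segs]
    if not tagged:
        return 0.0
    return sum(other_segment in segs for segs in tagged) / len(tagged)

# Made-up devices: four are tagged "mom", one of those is also tagged "male".
devices = {
    "d1": {"mom", "male"},
    "d2": {"mom"},
    "d3": {"mom", "dog lover"},
    "d4": {"mom"},
    "d5": {"male"},
}
rate = co_tag_rate(devices, "mom", "male")
```

A high rate for a pair of tags that should seldom co-occur is a hint that the segments are noisier than their human-readable names suggest.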

Page 16: MLconf Yael Elmatad

So Much Bias!

- Platform Bias: Certain segments are platform specific. (For example: "used a specific mail client on Android")
- Source Bias: We don't always have overlap between different first and third parties we work with and the overlap is not uncorrelated.
- Temporal Bias: Long-lived devices tend to accumulate segments (snowballs!).
- Audience Value Bias: Certain segments are worth more to advertisers so they appear more often than expected. (Example: people intending to purchase automobiles.)

Page 17: MLconf Yael Elmatad

Platform Bias

Page 18: MLconf Yael Elmatad

Platform Bias

Page 19: MLconf Yael Elmatad

Platform Bias

Page 20: MLconf Yael Elmatad

Source Bias

Page 21: MLconf Yael Elmatad

Next Steps

Either: Account for these biases explicitly and try to correct them. (see: engineering.tapad.com)

Or: Test different algorithms.

Or: Abandon the effort and look elsewhere for different data.

We opted for the last one.

Page 22: MLconf Yael Elmatad

Browsing Data

In the end, we opted to use our in-house browsing data.

Browsing data are data we obtain when examining available ad space. Each piece of data gives us an obfuscated ID and the url on which the device is browsing.

Initially avoided due to sparsity: While we saw about 20 pieces of audience data on average on a device, we were in some cases limited to a single unique url per device because this data is harder to come by than black box segment data.

Page 23: MLconf Yael Elmatad

Plan of Attack

(Preprocessing: remove the fraudulent urls associated with botnets.)

Just as before, create a feature vector but now the features are the legitimate unique domains (tapad.com, mlconf.com, etc...).

Compare several methods:
- The feature vector dot product (baseline)
- Matrix-based approaches which use probabilistic correlations based on url co-occurrence on nodes
- Clustering-based approaches which reduce dimensionality by first clustering highly correlated urls
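The feature construction and the baseline can be sketched as below, together with the raw domain co-occurrence counts that a matrix-based approach could start from. Everything here is an assumption made for illustration: the blocklist stands in for whatever fraud filtering was actually used, stripping a leading "www." is a simplification of domain extraction, and the slides do not describe how co-occurrence was turned into a score, so only the counting step is shown.

```python
from collections import Counter
from itertools import combinations
from urllib.parse import urlparse

def domains(urls, blocklist=frozenset()):
    """Unique domains seen on a device, minus a (hypothetical) fraud blocklist."""
    found = set()
    for url in urls:
        host = urlparse(url).hostname
        if host is None:  # e.g. a URL with no scheme
            continue
        if host.startswith("www."):
            host = host[len("www."):]
        if host not in blocklist:
            found.add(host)
    return found

def dot_product(domains_a, domains_b):
    """Baseline: dot product of binary domain vectors = number of shared domains."""
    return len(domains_a & domains_b)

def cooccurrence_counts(device_domains):
    """Count on how many devices each unordered pair of domains appears together."""
    counts = Counter()
    for doms in device_domains:
        for pair in combinations(sorted(doms), 2):
            counts[pair] += 1
    return counts

# Made-up browsing data; "botnet.example" is a placeholder for a fraudulent domain.
a = domains(
    ["http://tapad.com/about", "https://www.mlconf.com/", "http://botnet.example/click"],
    blocklist={"botnet.example"},
)
b = domains(["https://mlconf.com/speakers"])

baseline = dot_product(a, b)
counts = cooccurrence_counts([a, b])
```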

Page 24: MLconf Yael Elmatad

Performance

Much better!

Simple dot product (baseline) already performs about 18% better than random.

Both the matrix-based and clustering-based approaches perform up to 40% better than random.

This is in the range of how we expect an optimal algorithm to perform - despite data sparsity!
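To put these percentages on the same scale as win_value, assume "X% better than random" means relative to the random baseline of 0.5, which is consistent with the earlier pairings of 0.55 with 10% and 0.75 with 50%. The helper name is illustrative.

```python
def win_value_from_gain(percent_better_than_random, random_win_value=0.5):
    """Convert 'X% better than random' into an average win_value."""
    return random_win_value * (1 + percent_better_than_random / 100)

results = {
    "segment data": 10,
    "browsing dot product": 18,
    "matrix/clustering (up to)": 40,
    "expected optimum": 50,
}
for label, gain in results.items():
    print(f"{label}: win_value ~ {win_value_from_gain(gain):.2f}")
```

Under that reading, the browsing-data baseline sits near 0.59 and the best approaches near 0.70, against an expected ceiling of about 0.75.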

Page 25: MLconf Yael Elmatad

Moral

Don't assume because pieces of data are nicely tied in a bow and plentiful that they are the right data to use.

Question your data, not only your algorithms.

The best pieces of data may be scarce and raw because they are often less fraught with hidden biases and unnecessary processing.

Page 26: MLconf Yael Elmatad

Learn more about Tapad

Read our blog: http://engineering.tapad.com

Follow us on twitter: @tapad, @tapadeng

Follow us on Instagram: @tapadinc (includes a picture of yours truly in a headstand.)

Contact me: [email protected], @y_s_e