MLconf Yael Elmatad

Post on 15-Jan-2015


Getting Cozy With Raw Data (A Cautionary Tale)

Yael Elmatad, Data Scientist, Tapad

@y_s_e

The Ad Tech Space

The goal of Ad Tech is to show advertisements to consumers on the internet and to ensure that the right ad gets shown to the right person.

Many components:
Publishers who have ad space on their pages
"Sell" side platforms which aggregate publishers and facilitate selling of ad space
"Buy" side platforms (like Tapad) which bid on that space to show current ad campaigns
Advertisers who entrust demand side platforms to place their content appropriately

Why Cross-Device?

Device Proliferation: 5.7 internet-connected devices per household

Screen Switching: Digital Natives switch screens 27 times every non-working hour

Purchasing Across Devices: 40% of shoppers consult 3 or more channels before purchase

Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012

Tapad Connects Consumers' Devices

To address these issues, Tapad built The Device Graph.

The Device Graph seeks to connect devices within a household for targeting across multiple screens.

Our edges are inferred based on a variety of techniques including co-location, partnerships with other companies, and obfuscated login data (where no personally identifiable data is ever observed).

Tapad Statistics

Over 2 billion nodes (devices) in The Device Graph.

Representing about 100 million households and approximately 250 million individuals.

75% of connected devices are connected to 3 or more devices.

38% of devices are computers; 36% are smartphones and tablets.

The (Original, Household) Device Graph

No scores between edges. No way to separate individuals.

[Figure: household graph whose nodes are devices such as an iPad, a computer, and a Kindle]

What We Wanted

Edge thickness indicates confidence of link between devices
Colors indicate community-detection-based device clustering
The (household) Device Graph naturally restricts our search space
We never seek to identify individuals, only to group devices used by the same individual
Graph can be traversed at varying thresholds (scale vs accuracy)

Scoring Edges

We needed a way to put weights on edges.

First Attempt: Use Segment data
Provided by first or third parties
Tries to put devices into inferred buckets, ex:
Dog lover
Comic book enthusiast
Male

Pros/Cons of Segment Data

Pros:
Relatively extensive coverage
Simple to read / human intelligible
Finite

Cons:
Don't know how the segments are determined (black box)
Different providers may not have the same methods
The longer a device has been in our graph, the more audiences it will accumulate (snowballs)

Plan of Attack

1. Used the segments as features to create feature vectors.

2. Compared several methods:
   Simple dot product (baseline)
   Probabilistic approaches that use segment co-occurrence
   Machine learning approaches that use truth data and existing graph structure as proxy data.
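A minimal sketch of the dot-product baseline, assuming each device's segments are treated as a binary (0/1) feature vector. Under that assumption the dot product reduces to the number of segments two devices share. The device names and the dot_product_score helper are hypothetical; the talk does not show the actual implementation.

```python
def dot_product_score(segments_a, segments_b):
    """Dot product of two binary segment vectors.

    With 0/1 features, the dot product equals the number of
    segments that both devices carry.
    """
    return len(set(segments_a) & set(segments_b))


# Hypothetical devices and segment labels, for illustration only.
device_segments = {
    "tablet": {"dog lover", "comic book enthusiast"},
    "phone": {"dog lover", "comic book enthusiast", "male"},
    "laptop": {"male"},
}

print(dot_product_score(device_segments["tablet"], device_segments["phone"]))
print(dot_product_score(device_segments["tablet"], device_segments["laptop"]))
```

A raw count like this favors devices that carry many segments, which is one reason it serves only as a baseline.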

What Do We Mean By Proxy Data?

Assumption: Two nodes connected in The (household) Device Graph are more likely to be similar to each other than two unconnected nodes.

Measuring Performance

To compare methods we compute the Win Rate.

1. Select pair of devices connected in graph; compute score between them (true_score).

2. Select random device unconnected to original devices; compute a score with one of original devices (false_score).

    if true_score > false_score:
        win_value = 1.0
    elif true_score < false_score:
        win_value = 0.0
    else:  # ties
        win_value = 0.5
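The two steps above can be sketched as a small sampling loop. This is an illustration under assumptions, not Tapad's code: the graph is a plain list of edges, score is any pairwise similarity function, and the names average_win_value, n_trials, and seed are hypothetical.

```python
import random


def win_value(true_score, false_score):
    """Win value for one comparison, with ties counted as half a win."""
    if true_score > false_score:
        return 1.0
    if true_score < false_score:
        return 0.0
    return 0.5


def average_win_value(edges, devices, score, n_trials=1000, seed=0):
    """Estimate the average win_value of a scoring function.

    edges   -- list of (device, device) pairs connected in the graph
    devices -- list of all device ids
    score   -- function (device, device) -> number
    """
    rng = random.Random(seed)
    connected = {frozenset(edge) for edge in edges}
    total, counted = 0.0, 0
    for _ in range(n_trials):
        a, b = rng.choice(edges)
        # Devices that are connected to neither endpoint of the sampled edge.
        candidates = [
            d for d in devices
            if d not in (a, b)
            and frozenset((a, d)) not in connected
            and frozenset((b, d)) not in connected
        ]
        if not candidates:
            continue  # no valid unconnected device for this edge
        stranger = rng.choice(candidates)
        anchor = rng.choice((a, b))
        total += win_value(score(a, b), score(anchor, stranger))
        counted += 1
    return total / counted if counted else float("nan")


# Toy graph: two households whose devices share one feature each.
features = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"y"}}
toy_edges = [("a", "b"), ("c", "d")]
overlap = lambda u, v: len(features[u] & features[v])
print(average_win_value(toy_edges, list(features), overlap, n_trials=100))
```

Scanning every device for each trial is fine for a toy graph; at the scale of billions of nodes a sampled candidate would need to be drawn and rejected instead.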

Performance Expectations

A random algorithm should achieve an average win_value of about 0.5.

We expect an optimal algorithm to achieve an average win_value of about 0.75, which is 50% better than random.

Why? Because census data suggests around 2 adults per household. Therefore, we expect about half of our household edges to be highly correlated (similar) while the remainder should be statistically uncorrelated (dissimilar).
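The 0.75 figure can be written out as a short expectation. This is a sketch under two simplifying assumptions that the slide implies but does not state: a highly correlated edge always beats the random pairing (win_value of 1), and an uncorrelated edge behaves like a random pairing (win_value of 0.5 on average). The symbol p, the fraction of household edges that join devices of the same individual, is introduced here for illustration.

```latex
\[
\mathbb{E}[\text{win\_value}]
  \approx p \cdot 1 + (1 - p) \cdot \tfrac{1}{2},
\qquad
p = \tfrac{1}{2}
\;\Longrightarrow\;
\mathbb{E}[\text{win\_value}] \approx \tfrac{1}{2} + \tfrac{1}{4} = 0.75,
\qquad
\frac{0.75}{0.5} = 1.5 .
\]
```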

Well, how do segment data perform?

In a word: poorly.

Our attempts eked in just above the random line, around an average win_value = 0.55.

At most, 10% better than random!

So what happened?

Segment data are riddled with:
randomness & noise
hidden bias

An example of randomness & noise: 1 out of 4 devices which "self identified as mom" are also tagged as "male".

(Either we're really really progressive, or something has gone horribly wrong.)

So Much Bias!

Platform Bias: Certain segments are platform specific. (For example: "used a specific mail client on Android")

Source Bias: We don't always have overlap between different first and third parties we work with, and the overlap is not uncorrelated.

Temporal Bias: Long-lived devices tend to accumulate segments (snowballs!).

Audience Value Bias: Certain segments are worth more to advertisers so they appear more often than expected. (Example: people intending to purchase automobiles.)

Platform Bias

Source Bias

Next Steps

Either: Account for these biases explicitly and try to correct them. (see: engineering.tapad.com)

Or: Test different algorithms.

Or: Abandon the effort and look elsewhere for different data.

We opted for the last one.

Browsing Data

In the end, we opted to use our in-house browsing data.

Browsing data are data we obtain when examining available ad space. Each piece of data gives us an obfuscated ID and the URL on which the device is browsing.

Initially avoided due to sparsity: While we saw about 20 pieces of audience data on average on a device, we were in some cases limited to a single unique URL per device because this data is harder to come by than black box segment data.

Plan of Attack

(Preprocessing: remove the fraudulent URLs associated with botnets.)

Just as before, create a feature vector, but now the features are the legitimate unique domains (tapad.com, mlconf.com, etc.).

Compare several methods:
The feature vector dot product (baseline)
Matrix-based approaches which use probabilistic correlations based on URL co-occurrence on nodes
Clustering-based approaches which reduce dimensionality by first clustering highly correlated URLs
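A minimal sketch of the preprocessing and feature-building step, assuming browsing events arrive as (obfuscated ID, URL) pairs and that fraudulent domains are already known as a blocklist. The domain_features helper, the blocklist, and the sample events are hypothetical; how Tapad detected botnet traffic is not described in the talk. The sketch covers only the dot-product baseline, not the matrix-based or clustering-based methods.

```python
from urllib.parse import urlparse


def domain_features(events, blocked_domains=frozenset()):
    """Map each device ID to the set of unique domains it was seen on.

    events          -- iterable of (device_id, url) pairs
    blocked_domains -- domains to drop (assumed known in advance)
    """
    features = {}
    for device_id, url in events:
        domain = urlparse(url).hostname  # None when the URL has no scheme
        if domain is None:
            continue
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in blocked_domains:
            continue
        features.setdefault(device_id, set()).add(domain)
    return features


# Hypothetical events, for illustration only.
events = [
    ("id1", "http://www.tapad.com/about"),
    ("id1", "http://mlconf.com/"),
    ("id2", "http://mlconf.com/events"),
    ("id2", "http://bot-traffic.example/click"),
]
features = domain_features(events, blocked_domains={"bot-traffic.example"})

# Baseline score: dot product of binary domain vectors = shared domain count.
print(len(features["id1"] & features["id2"]))
```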

Performance

Much better!

Simple dot product (baseline) already performs about 18% better than random.

Both the matrix-based and clustering-based approaches perform up to 40% better than random.

This is in the range of how we expect an optimal algorithm to perform, despite data sparsity!
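To put these percentages on the same scale as the earlier win_value figures, the sketch below assumes that "X% better than random" means a relative improvement over the random baseline of 0.5. That reading matches the two cases where the talk gives both numbers (0.55 alongside 10%, and 0.75 alongside 50%), but the converted values for 18% and 40% are derived here and are not stated in the slides. The function name is hypothetical.

```python
RANDOM_WIN_VALUE = 0.5  # average win_value of a random algorithm


def win_value_from_lift(percent_better_than_random):
    """Convert a relative lift over random into an average win_value."""
    return RANDOM_WIN_VALUE * (1 + percent_better_than_random / 100)


for label, lift in [
    ("segment data (at most)", 10),
    ("browsing data, dot product", 18),
    ("browsing data, matrix/clustering (up to)", 40),
    ("expected optimum", 50),
]:
    print(f"{label}: {win_value_from_lift(lift):.2f}")
```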

Moral

Don't assume that because pieces of data are nicely tied in a bow and plentiful, they are the right data to use.

Question your data, not only your algorithms.

The best pieces of data may be scarce and raw because they are often less fraught with hidden biases and unnecessary processing.

Learn more about Tapad

Read our blog: http://engineering.tapad.com

Follow us on Twitter: @tapad, @tapadeng

Follow us on Instagram: @tapadinc (includes a picture of yours truly in a headstand.)

Contact me: yael@tapad.com, @y_s_e