A Mathematical Perspective on Data Science - large (86.7 MB)

Post on 24-Apr-2022

4 views 0 download

Transcript of A Mathematical Perspective on Data Science - large (86.7 MB)

Copyright©2016SplunkInc.

AMathematicalPerspectiveonDataScience

Dr.TomLaGattaStaffSalesEngineer(previouslyStaffDataScientist)

AboutMe• MathPhDfromUniversityofArizona

– ”Geodesics ofRandomRiemannianMetrics”w/Janek Wehr– Probability+DifferentialGeometry+FunctionalAnalysis

• PostdocatCourantInstitute@NYU– FinishedGeodesicspaper,published inCommunications inMath.Physics– CollaboratedwithPoliticalScientists onheterogeneousvotingbehavior

• Was:StaffDataScientistatSplunk– HelpedcustomerswithadvancedusecasesinBusinessAnalytics, Internetof

Things,MachineLearning,DataScience

• Now:StaffSalesEngineeratSplunk– Helpingbigbigcustomerssolvebigbigbusinessproblems

Abstract● Aswithallthings,theprocessofanalyzingdataadmitsamathematicaldescription.Asa

mathematician-turned-data-scientist,Iwilldescribemyapproachtoproblemsolving,andattempttolooselyformalize"stakeholders","usecases","data"and"deliverables"inmathematicallanguagefortheenjoymentofthismostlyacademicaudience.Inparticular,Iwilldescribehowquerylanguagesareinherentlyfunctional,actingasfunctionaltransformationsofDataintoData,whichobeytheusualfunctionalcomposition law.Theprocessofanalyzingdataresultsinaniterativesequenceofqueries,convergingtoafinalquerywhichissatisfactorytotheusecase.Thesequeriesarethenorganizedintodeliverables,whichcanbe"dashboards" (webpageswithvisualizations)or"dataproducts"(withscheduledjobs&analysesrunninginthebackground).Whenthisprocessisdoneright,itresultsintheextractionof"value"forstakeholders,whichcanbemeasuredtangiblyintermsofrevenue,costsorriskmetrics.Sometimesthishasafancynamelike"datascience",butmoreoftenthannot, isjustthenormaloperationalworkofagooddata-savvyIT,Security,TechorBusinessdepartmentinmodernenterprisesandgovernmentalagencies.Therewillbenoproofs,but Iwouldbeveryinterestedtodiscussrigorousapproachestosocialorganization&problemsolvingafterthetalk.

Agenda• BasicDefinitions:

– Data,Stakeholders,UseCases,Deliverables

• Querylanguagesasfunctionalprogramming– Everyqueryisamapf:Data->Data– Exampleproblemsolving

• PuttingItAllTogether:DoingDataScience– Emphasizeactionable insights– Tieittogethertodeliver”value”tostakeholders

Copyright©2016SplunkInc.

BasicDefinitions

WhatisData?• ”Data”isanyinformationalartifactofreal-worldphenomena• A”metric”or”KPI”isanyaggregatefunctionoflow-leveldata• Examples:

– Semi-structuredtimestamped events/metrics– Structuredrelationaldata(rows&columns)– Graphdata(nodes&edges)– ”Unstructured”data(images,video,text)

• Howtomodeldata:– Events:markedpointprocesses&timeseries(Skorokhod space)– Relationalschema:categories(seeDavidSpivak’swork)– Otherdata:dependsontheusecase,mightneednewdatastructuresto

representit(incl.vectors,graphs,etc.)

WhatisaStakeholder?• A”stakeholder”isaperson,groupororganizationwhoisinvestedintheoutcomeofaninitiative.

• Example:Acompanybuyssoftware– StakeholderorgsareIT&theBusiness– Individualstakeholders includeIndividual

Contributors,Managers,Directors&Executives.

• ITstakeholdershaveperformancemetrics(num.outages,meantimetoresolution,etc)

• Businessstakeholdershavedifferentmetrics(revenue,cost,risk)• Customerstakeholdersdownstreamalsohavevalue/impactmetrics

WhatisaStakeholder?(cont.)• Howtomodelstakeholders?

– FollowGameTheoryforinspiration(butdon’tworryabout”equilibrium”)– Createanindexset Iwithallstakeholders.Variousactions,outcomes&

objectiveswillhavesubscripts i basedonstakeholders– E.g.,personi choosesactionai,t attimet– Icanbehierarchical(Personi containedinorgA,soparent(i)=A)– ”Value”modeledbyobjectivefunctions(Ui,n =objective#nforpersoni)– Istheaction”pivotal”fortheoutcome?(ie 𝔼[U|do(a)]>𝔼[U|do(nota)]?)

• Keeptrackofstakeholdersdata:– Mightbehigh-level(email,Powerpoints)– contextiskey– Mightbeindatabases(transactions,customerrecords,ticketsdata)– Mightbegranulareventsdata(webclickstream,logs,mobile,wiredata)

WhatisaUseCase?• A”usecase”consistsofabusinessproblem,astrategytoalleviatetheproblem,metricstoevaluatetheoutcome,datatopowerasolution,andstakeholderswhoareinvolvedinitsdevelopment.

• UseCase:ProblemForecasting.– Companyhascostlynetwork/systemoutages– IThiresaDataScientisttohelpbuildsolution.– Dataincludes Infrastructure(CPU,Memory),

Operations(OutageReports),Applogs,etc– Metric:costofoutages≈

numoutages*timetoresolution*costoflabor– StakeholdersincludeIT&Business, andimpactedcustomers

WhatisaDeliverable?• A”deliverable”isathingproducedtosolveausecase

– Caninclude”dashboards”:informationalwebpagesbuiltfromdataqueries– Or”workflows”:notableeventsdeliberatedtooperationsanalysts– Orfull-fledged”dataproduct”:applicationwhich*does*stuffautomatically

• Deliverable:ProblemForecastingSystem– Goal:forecastproblemsbeforetheystart,

deliver”proximaterootcause”toITtoinvestigate– Data:CPU,Memory,Latency,ServiceTickets– Buildmachine learningmodeltocorrelate

InfrastructuredatawithServiceimpact– Applymodeltoincomingevents,create

”predicted(Risk_Score)”– Surfacehigh-riskeventstoITOperations

Copyright©2016SplunkInc.

QueriesasFunctionalProgramming

QueryLanguages• Querylanguagesprovideaformulaicapproachtoworkingwithdata• A”query”isastringwhichtellswheretogetthedata,whattodowiththedata,andwheretoputthedata(incl visualizationorDB)

• Mathematically,everyqueryisaFUNCTIONf:Data->Data• Queriescanbecomposed(with|symbol),andanalysisisiterative• Example1:SuccessfulPurchaseActionsfromWebLogs

sourcetype=access_combined action=purchase status=200

• Example2:PlotPurchaseValueasMetricTimeseriessourcetype=access_combined action=purchase status=200| timechart partial=f span=5m sum(price) as value

1:SuccessfulPurchaseActionsfromWebLogs

13

2:PlotPurchaseValueasMetricTimeseries

14

Wait!ProblemIdentified!Whyarepurchasesdecreasing?

3:InvestigateDatabaseErrors

15

4:PlotDatabaseErrors

16

5:CorrelateDBproblemswithpurchasevalue

17

6:SaveasDeliverable,SendtoStakeholders

18

7:Movetowardproactivemonitoringstance

19

Copyright©2016SplunkInc.

PuttingItAllTogether:DoingDataScience

EmphasizeActionable Insights• Avoideye-candyvisualizations

– “Laserbeam”threatdashboardslookcoolbutareuseless

• “Howdoesthishelpmesolvemyproblem?”• Guidetheviewertodrilldown&actquickly

Confusingviz:notactionable Goodviz:actionable

DoingDataScience• DataScientistsresisteasycharacterization.Abitof:

– Statistician– SoftwareEngineer– Business Analyst– Spockonthebridge

• Manyscalesofaction:– Getyourhandsdirtywhenneeded– Butstepbackandseethebigpicture– Emphasize”actionable insights”throughoutthewholeorg– Alsohavepoliticalcredibilitytosay,”Thisisabaddecision,don’tdoit”

• DataScientistsguideorganizationsthroughbuildingdataproductstosolvebigproblemsanddelivervalue