A Mathematical Perspective on Data Science - large (86.7 MB)

22
Copyright © 2016 Splunk Inc. A Mathematical Perspective on Data Science Dr. Tom LaGatta Staff Sales Engineer (previously Staff Data Scientist)

Transcript of A Mathematical Perspective on Data Science - large (86.7 MB)

Page 1: A Mathematical Perspective on Data Science - large (86.7 MB)

Copyright©2016SplunkInc.

AMathematicalPerspectiveonDataScience

Dr.TomLaGattaStaffSalesEngineer(previouslyStaffDataScientist)

Page 2: A Mathematical Perspective on Data Science - large (86.7 MB)

AboutMe• MathPhDfromUniversityofArizona

– ”Geodesics ofRandomRiemannianMetrics”w/Janek Wehr– Probability+DifferentialGeometry+FunctionalAnalysis

• PostdocatCourantInstitute@NYU– FinishedGeodesicspaper,published inCommunications inMath.Physics– CollaboratedwithPoliticalScientists onheterogeneousvotingbehavior

• Was:StaffDataScientistatSplunk– HelpedcustomerswithadvancedusecasesinBusinessAnalytics, Internetof

Things,MachineLearning,DataScience

• Now:StaffSalesEngineeratSplunk– Helpingbigbigcustomerssolvebigbigbusinessproblems

Page 3: A Mathematical Perspective on Data Science - large (86.7 MB)

Abstract● Aswithallthings,theprocessofanalyzingdataadmitsamathematicaldescription.Asa

mathematician-turned-data-scientist,Iwilldescribemyapproachtoproblemsolving,andattempttolooselyformalize"stakeholders","usecases","data"and"deliverables"inmathematicallanguagefortheenjoymentofthismostlyacademicaudience.Inparticular,Iwilldescribehowquerylanguagesareinherentlyfunctional,actingasfunctionaltransformationsofDataintoData,whichobeytheusualfunctionalcomposition law.Theprocessofanalyzingdataresultsinaniterativesequenceofqueries,convergingtoafinalquerywhichissatisfactorytotheusecase.Thesequeriesarethenorganizedintodeliverables,whichcanbe"dashboards" (webpageswithvisualizations)or"dataproducts"(withscheduledjobs&analysesrunninginthebackground).Whenthisprocessisdoneright,itresultsintheextractionof"value"forstakeholders,whichcanbemeasuredtangiblyintermsofrevenue,costsorriskmetrics.Sometimesthishasafancynamelike"datascience",butmoreoftenthannot, isjustthenormaloperationalworkofagooddata-savvyIT,Security,TechorBusinessdepartmentinmodernenterprisesandgovernmentalagencies.Therewillbenoproofs,but Iwouldbeveryinterestedtodiscussrigorousapproachestosocialorganization&problemsolvingafterthetalk.

Page 4: A Mathematical Perspective on Data Science - large (86.7 MB)

Agenda• BasicDefinitions:

– Data,Stakeholders,UseCases,Deliverables

• Querylanguagesasfunctionalprogramming– Everyqueryisamapf:Data->Data– Exampleproblemsolving

• PuttingItAllTogether:DoingDataScience– Emphasizeactionable insights– Tieittogethertodeliver”value”tostakeholders

Page 5: A Mathematical Perspective on Data Science - large (86.7 MB)

Copyright©2016SplunkInc.

BasicDefinitions

Page 6: A Mathematical Perspective on Data Science - large (86.7 MB)

WhatisData?• ”Data”isanyinformationalartifactofreal-worldphenomena• A”metric”or”KPI”isanyaggregatefunctionoflow-leveldata• Examples:

– Semi-structuredtimestamped events/metrics– Structuredrelationaldata(rows&columns)– Graphdata(nodes&edges)– ”Unstructured”data(images,video,text)

• Howtomodeldata:– Events:markedpointprocesses&timeseries(Skorokhod space)– Relationalschema:categories(seeDavidSpivak’swork)– Otherdata:dependsontheusecase,mightneednewdatastructuresto

representit(incl.vectors,graphs,etc.)

Page 7: A Mathematical Perspective on Data Science - large (86.7 MB)

WhatisaStakeholder?• A”stakeholder”isaperson,groupororganizationwhoisinvestedintheoutcomeofaninitiative.

• Example:Acompanybuyssoftware– StakeholderorgsareIT&theBusiness– Individualstakeholders includeIndividual

Contributors,Managers,Directors&Executives.

• ITstakeholdershaveperformancemetrics(num.outages,meantimetoresolution,etc)

• Businessstakeholdershavedifferentmetrics(revenue,cost,risk)• Customerstakeholdersdownstreamalsohavevalue/impactmetrics

Page 8: A Mathematical Perspective on Data Science - large (86.7 MB)

WhatisaStakeholder?(cont.)• Howtomodelstakeholders?

– FollowGameTheoryforinspiration(butdon’tworryabout”equilibrium”)– Createanindexset Iwithallstakeholders.Variousactions,outcomes&

objectiveswillhavesubscripts i basedonstakeholders– E.g.,personi choosesactionai,t attimet– Icanbehierarchical(Personi containedinorgA,soparent(i)=A)– ”Value”modeledbyobjectivefunctions(Ui,n =objective#nforpersoni)– Istheaction”pivotal”fortheoutcome?(ie 𝔼[U|do(a)]>𝔼[U|do(nota)]?)

• Keeptrackofstakeholdersdata:– Mightbehigh-level(email,Powerpoints)– contextiskey– Mightbeindatabases(transactions,customerrecords,ticketsdata)– Mightbegranulareventsdata(webclickstream,logs,mobile,wiredata)

Page 9: A Mathematical Perspective on Data Science - large (86.7 MB)

WhatisaUseCase?• A”usecase”consistsofabusinessproblem,astrategytoalleviatetheproblem,metricstoevaluatetheoutcome,datatopowerasolution,andstakeholderswhoareinvolvedinitsdevelopment.

• UseCase:ProblemForecasting.– Companyhascostlynetwork/systemoutages– IThiresaDataScientisttohelpbuildsolution.– Dataincludes Infrastructure(CPU,Memory),

Operations(OutageReports),Applogs,etc– Metric:costofoutages≈

numoutages*timetoresolution*costoflabor– StakeholdersincludeIT&Business, andimpactedcustomers

Page 10: A Mathematical Perspective on Data Science - large (86.7 MB)

WhatisaDeliverable?• A”deliverable”isathingproducedtosolveausecase

– Caninclude”dashboards”:informationalwebpagesbuiltfromdataqueries– Or”workflows”:notableeventsdeliberatedtooperationsanalysts– Orfull-fledged”dataproduct”:applicationwhich*does*stuffautomatically

• Deliverable:ProblemForecastingSystem– Goal:forecastproblemsbeforetheystart,

deliver”proximaterootcause”toITtoinvestigate– Data:CPU,Memory,Latency,ServiceTickets– Buildmachine learningmodeltocorrelate

InfrastructuredatawithServiceimpact– Applymodeltoincomingevents,create

”predicted(Risk_Score)”– Surfacehigh-riskeventstoITOperations

Page 11: A Mathematical Perspective on Data Science - large (86.7 MB)

Copyright©2016SplunkInc.

QueriesasFunctionalProgramming

Page 12: A Mathematical Perspective on Data Science - large (86.7 MB)

QueryLanguages• Querylanguagesprovideaformulaicapproachtoworkingwithdata• A”query”isastringwhichtellswheretogetthedata,whattodowiththedata,andwheretoputthedata(incl visualizationorDB)

• Mathematically,everyqueryisaFUNCTIONf:Data->Data• Queriescanbecomposed(with|symbol),andanalysisisiterative• Example1:SuccessfulPurchaseActionsfromWebLogs

sourcetype=access_combined action=purchase status=200

• Example2:PlotPurchaseValueasMetricTimeseriessourcetype=access_combined action=purchase status=200| timechart partial=f span=5m sum(price) as value

Page 13: A Mathematical Perspective on Data Science - large (86.7 MB)

1:SuccessfulPurchaseActionsfromWebLogs

13

Page 14: A Mathematical Perspective on Data Science - large (86.7 MB)

2:PlotPurchaseValueasMetricTimeseries

14

Wait!ProblemIdentified!Whyarepurchasesdecreasing?

Page 15: A Mathematical Perspective on Data Science - large (86.7 MB)

3:InvestigateDatabaseErrors

15

Page 16: A Mathematical Perspective on Data Science - large (86.7 MB)

4:PlotDatabaseErrors

16

Page 17: A Mathematical Perspective on Data Science - large (86.7 MB)

5:CorrelateDBproblemswithpurchasevalue

17

Page 18: A Mathematical Perspective on Data Science - large (86.7 MB)

6:SaveasDeliverable,SendtoStakeholders

18

Page 19: A Mathematical Perspective on Data Science - large (86.7 MB)

7:Movetowardproactivemonitoringstance

19

Page 20: A Mathematical Perspective on Data Science - large (86.7 MB)

Copyright©2016SplunkInc.

PuttingItAllTogether:DoingDataScience

Page 21: A Mathematical Perspective on Data Science - large (86.7 MB)

EmphasizeActionable Insights• Avoideye-candyvisualizations

– “Laserbeam”threatdashboardslookcoolbutareuseless

• “Howdoesthishelpmesolvemyproblem?”• Guidetheviewertodrilldown&actquickly

Confusingviz:notactionable Goodviz:actionable

Page 22: A Mathematical Perspective on Data Science - large (86.7 MB)

DoingDataScience• DataScientistsresisteasycharacterization.Abitof:

– Statistician– SoftwareEngineer– Business Analyst– Spockonthebridge

• Manyscalesofaction:– Getyourhandsdirtywhenneeded– Butstepbackandseethebigpicture– Emphasize”actionable insights”throughoutthewholeorg– Alsohavepoliticalcredibilitytosay,”Thisisabaddecision,don’tdoit”

• DataScientistsguideorganizationsthroughbuildingdataproductstosolvebigproblemsanddelivervalue