Data X About Me · 01.12.2016 · Data X CS Tools in Data-X Common Tools • Python, Numpy, SciPy...
Data X
About Me:
Ikhlaq Sidhu
Chief Scientist & Founding Director, Sutardja Center for Entrepreneurship & Technology
IEOR Emerging Area Professor Award, UC Berkeley
Introduction
Data-X: A Course on Data, Signals, and Systems

What is Data-X?
• A Course and Lab
• Customer Driven
• Applied Project
• Industry Perspective
CS Tools, Math Models, Real Life Problems

CS Tools in Data-X
Common Tools
• Python, NumPy, SciPy
• Pandas
• TensorFlow, Sklearn
• SQL
• NLP / NLTK
• Matplotlib
• Tableau
Working with Data: Collect, Combine, Store, Use & Compute, Analyze, Visualize

Quant Models: Signals, Systems, Networks in Data-X
Quantitative Tool Box
• Markov Processes
• Bayesian Decisions
• LTI Systems: Fourier, Filters
• Prediction: Linear, Max. Likelihood, Regression, Correlations
• Control Models
• Stochastic, ML Classification, K-Means, KNN
• Deep Learning
• Network Models, Graphs, Paths

Real Life Application
Application Areas
• Industry 4.0
• Smart Cities
• Data and Health
• Finance and Fintech
• Transportation
• Networks and Communication
• Retail
• Security

What is this class
Make the Tools; Use the Tools (Optimally)
Architect the System; Why and how you build
Most CS; Sutardja Center; This IEOR Course

What we cover and what we won't
Yes
• Tools for Data and Math
• Open source toolsets
• Use of ML and Natural Language tools
• Cookbook Applications for common system
• A real life integrated project
• Operational scalable code for any field or application
• Enough to be dangerous
• A system's viewpoint
No
• Very Large Data Sets
• Hadoop
• Spark Pipelines
• BDAS Stack
• Ability to write an ML or Natural language framework
• Detailed data science
• Sparse data techniques
• A purely CS view
• A purely statistician's view
• In-depth Mathematics

How the Data-X Course Works:
Diagram labels: Propose Low Tech Solution (1); Brainstorm Challenge and Validate (4); Demo or Die (1); Execute * Iterate, BMoE Reflections, Agile Sprint (8); Insightful Story Solution
CS Tool Industry Lectures – Video Flip
Team: Tech Lead, Product Lead + 2-4 Experts

Basic Tools to Get Started
• Available with Anaconda Environment (available for free):
– Python, we will use version 2.7, pre-requisite to class
– NumPy, array processing for numbers, strings, records, and objects
– Pandas, powerful data structures and data analysis tools
– SciPy, Scientific Library for Python
– Matplotlib, Python 2D plotting library
– IPython, Productive Interactive Computing
• Environment includes:
– Jupyter – interactive web-based Python
– Spyder – code development environment with editor
• Data-X: This class will help you combine math and data concepts
• Not Data-X: This class is not a big data class.

Model Examples for things that normally require human judgment
Scoring Wine
Wine quality = 12.145 + 0.00117 × (winter rainfall) + 0.0614 × (average growing season temperature) − 0.00386 × (harvest rainfall)
Oren Ashenfelter, Princeton. Now used by Christie's Auction House
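The wine equation on this slide can be evaluated directly. The following is a minimal sketch that uses the coefficients exactly as printed; the function name and the sample inputs are hypothetical, and the slide does not state the units, so the numbers below only exercise the arithmetic.

```python
def wine_quality(winter_rainfall, avg_growing_temp, harvest_rainfall):
    """Score a vintage with the linear formula quoted on the slide."""
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * avg_growing_temp
            - 0.00386 * harvest_rainfall)


# Hypothetical inputs, chosen only to exercise the formula.
score = wine_quality(winter_rainfall=600, avg_growing_temp=17.0, harvest_rainfall=100)
print(round(score, 4))
```

The signs carry the story: more winter rain and a warmer growing season raise the score, while rain at harvest lowers it.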
Competitive Advantage in Sports
Moneyball: How to measure and predict baseball performance
Oakland Athletics baseball team and its general manager Billy Beane
A: Watch and talk with hundreds of players
B: Runs created = (hits + walks) × Total Bases / (At Bats + Walks)
Now: Basketball, Football, and soon every other sport
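Option B, the runs-created formula, is a one-line computation. This sketch follows the formula as written on the slide; the function name, the zero-denominator guard, and the season line are assumptions added for illustration and do not describe a real player.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Runs created = (hits + walks) x total bases / (at bats + walks)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # nothing to estimate without any at bats or walks
    return (hits + walks) * float(total_bases) / opportunities


# Hypothetical season line, not a real player.
rc = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(round(rc, 2))
```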

Harrah's Casino: Knowing your customer
• Service provider of Gambling and Casinos
• Entry Card
• Pain points
• Intervention
Reference: Super Crunchers

What has been happening
1995 | 2005 | 2015 | 2025
Context: Internet, Web | Social Nets, Recommend | Higher Accuracy, Larger training | Control + AI, Self learning
E-Commerce | Ad Driven | Fin/Quant, Sharing Economy | ?

An ML High Level Framework
In Real Life:
• Objects
• Events / Experiments
• People / Customers
• Products
• Stocks
• …
Features, but also loss of information
In Sample: Person 1, Person 2, Person 3, ..., Person N
Out of Sample
- Characteristics
- Patterns
- Models
- Predictions
- Similarities
- Differences
- Distance
Some data has observed results

An ML High Level Framework
CS: Table
Math: Matrix X, with N rows (one per person) and m columns (one per feature: age, salary, ..)
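The table-versus-matrix idea on this slide can be sketched in a few lines. The records, names, and values below are hypothetical; plain Python lists are used so the sketch runs without extra packages, although NumPy or Pandas from the course toolset would hold the same structure.

```python
# Hypothetical records; the names and values are made up for illustration.
people = [
    {"name": "Person 1", "age": 34, "salary": 72000},
    {"name": "Person 2", "age": 51, "salary": 98000},
    {"name": "Person 3", "age": 27, "salary": 55000},
]
features = ["age", "salary"]  # the m columns

# CS view: a table (list of records). Math view: an N x m matrix X.
X = [[float(person[f]) for f in features] for person in people]

n_rows, m_cols = len(X), len(features)
print(n_rows, m_cols)
print(X[0])
```

Dropping the name column is an example of the loss of information the framework mentions: only the chosen features survive into X.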

A Fundamental Idea: From Table to N-Dimensional Space
X =
Element  F1   F2   F3
A        4    2    2
B        4.5  1.5  3
C        3    3    5
D        1    2    2
E        3    1.5  5
F        3.5  3.5  1
..       ..   ..   ..
[Figure: users A–H plotted by their ratings of Movie 1 and Movie 2, each axis running from 1 to 5]
Which user is closest to user A?
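One way to answer the question on this slide is to treat each table row as a point and measure straight-line distance. The sketch below assumes Euclidean distance, which the slide does not specify; the data are the six rows of the slide's table, and the helper name is hypothetical.

```python
import math

# Rows of X from the slide's table (elements A to F, features F1 to F3).
X = {
    "A": (4.0, 2.0, 2.0),
    "B": (4.5, 1.5, 3.0),
    "C": (3.0, 3.0, 5.0),
    "D": (1.0, 2.0, 2.0),
    "E": (3.0, 1.5, 5.0),
    "F": (3.5, 3.5, 1.0),
}


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance from A to every other element, then pick the smallest.
distances = {name: euclidean(X["A"], row) for name, row in X.items() if name != "A"}
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 3))
```

Under this metric B is the nearest row to A. A different distance function, or rescaled features, could change the answer.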

Clustering by Measuring Distance (Unsupervised)
[Figure: users A–H plotted by their ratings of Movie 1 and Movie 2, each axis running from 1 to 5]
Distance functions
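Clustering by distance can be illustrated with a bare-bones k-means loop. The slide does not name an algorithm, so k-means, the choice of two clusters, and the starting centroids (rows A and C) are all assumptions; the points are the six rows of the feature table shown earlier in the deck.

```python
import math

points = {
    "A": (4.0, 2.0, 2.0), "B": (4.5, 1.5, 3.0), "C": (3.0, 3.0, 5.0),
    "D": (1.0, 2.0, 2.0), "E": (3.0, 1.5, 5.0), "F": (3.5, 3.5, 1.0),
}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(rows):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def k_means(points, centroids, max_iter=100):
    centroids = list(centroids)
    assignment = {}
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        new_assignment = {
            name: min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = mean(members)
    return assignment, centroids


assignment, centroids = k_means(points, [points["A"], points["C"]])
clusters = {}
for name, idx in sorted(assignment.items()):
    clusters.setdefault(idx, []).append(name)
print(clusters)
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups come only from the distances.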

Clustering to Classification
[Figure: points A–H plotted against Feature 1 and Feature 2, each axis running from 1 to 5]
Actually: 70K -> 200K titles (dimensions), 10M plus users (points)
• Target customers?
• Pictures of Cats and Dogs
• Speech recognition
• Recognize Letters: A, B, C..
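Once points carry group labels, a new point can be classified by looking at its nearest labeled neighbors. The sketch below uses k-nearest neighbors (KNN appears in the deck's tool box) with Euclidean distance; the feature rows come from the deck's table, while the group labels, the query point, and k = 3 are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Feature rows from the deck's table; the labels are hypothetical.
labeled = {
    "A": ((4.0, 2.0, 2.0), "group_1"),
    "B": ((4.5, 1.5, 3.0), "group_1"),
    "C": ((3.0, 3.0, 5.0), "group_2"),
    "D": ((1.0, 2.0, 2.0), "group_1"),
    "E": ((3.0, 1.5, 5.0), "group_2"),
    "F": ((3.5, 3.5, 1.0), "group_1"),
}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(query, labeled, k=3):
    """Return the majority label among the k labeled points nearest to query."""
    nearest = sorted(labeled.values(), key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


new_point = (3.0, 2.0, 4.5)  # hypothetical unseen customer
predicted = knn_classify(new_point, labeled)
print(predicted)
```

The three nearest neighbors of the query are E, C, and B, so the vote is two to one for group_2.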

A Fundamental Idea: From Table to Score
X =
Cust  F1   F2   F3
A     4    2    2
B     4.5  1.5  3
C     3    3    5
D     1    2    2
E     3    1.5  5
F     3.5  3.5  1
..    ..   ..   ..
F(X) maps X to Y
Y =
Cust  CreditScore
A     552
B     381
C     760
D     330
E     452
F     678
..    ..

Machine Learning: Learning from Data
Input Data = Matrix X
Customer 1: [Name, income, x, y, .. Features .. z]
Customer 2: [Name, income, x, y, .. Features .. z]
Customer N: [Name, income, x, y, .. Features .. z]
Output Data = Column Vector Y
Customer 1: [20]
Customer 2: [60]
Customer N: [05]
(Purchases/year, repaid loan, …)
Target: What is F(X) = Y? A formula that we don't know
Sample data (training): (x1, y1), (x2, y2), …, (xm, ym). We have this
H: Hypothesis Set: all possible algorithms or formulas
Algorithm A from H
Find G(x) which is approx. F(x)
a) Supervised ML – as shown
b) Unsupervised – no labeled training data
c) Reinforcement learning – done by simulation
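The supervised case can be made concrete with the simplest possible hypothesis set. In this sketch H is assumed to be the set of straight lines G(x) = a·x + b, the algorithm is ordinary least squares, and the unknown target F is a hypothetical stand-in used only to generate training pairs; none of these choices come from the slide.

```python
def fit_line(samples):
    """Least-squares fit of G(x) = a * x + b to (x, y) training pairs."""
    n = float(len(samples))
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def F(x):
    """Hypothetical target; the learner only ever sees its samples."""
    return 2.0 * x + 1.0


training = [(x, F(x)) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
a, b = fit_line(training)


def G(x):
    """Learned approximation of F."""
    return a * x + b


print(a, b, G(10.0))
```

Because the samples here are noise-free and F happens to lie inside H, G recovers F exactly. With real data G would only approximate F, and its quality would have to be checked on out-of-sample points.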

The Key is multi-layer learning algorithms such as Deep Convolutional Neural Networks!
Neural net results are close to human results

Top 8 Business Models Using Data
1. Knowing your customer, better targeting and relationship. E.g. Target, Disney, Netflix
2. Improving physical product or service with complementary information. E.g. UPS, FedEx
3. Data-driven reliability or security. E.g. GE, BMW, Siemens
4. Information Brokers, Arbitrage, and Trading Opportunities. E.g. Investment funds
5. Improving the customer journey / experience. E.g. Harrah's
6. Functional Applications: HR/Hiring, Operations, etc. E.g. Walmart, Baseball, Sports
7. Efficiency or better performance per dollar cost. E.g. General IT, SAP, etc.
8. Risk Management, regulation, and compliance. E.g. Compliance360

The Data-X System View
Possible Inputs Code Blocks: Web or Poll, Download, Crawl, Stream, Social Net, …
Algorithm Options w/ Tables/Matrix: Prediction / Classification, Natural Language, State Space, Feature Extraction
Compute, including test, train, split
Pandas: Short Term Storage
Long Term Storage: SQL and File Formats (JSON, CSV, Excel)
Possible Output Code Blocks: Web, Control, Decision, Chatbot, …
Feedback from External System (World)
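The "test, train, split" step in the compute block can be sketched as a shuffle followed by a cut. This is a hand-written stand-in, not a library routine: the function name `split_rows`, the 20% test fraction, and the fixed seed are assumptions, and the integers stand in for table rows.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten table rows
train, test = split_rows(rows)
print(len(train), len(test))
```

Models are fitted on the training rows only; the held-out test rows play the role of the out-of-sample data in the framework slides.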

In this class, we will learn ways to:
* Collect the data about objects
* Combine data sources when needed
* Use tables and databases to store
* Practice making good “features”
* Learn to Analyze; Compute, Classify, Predict
* Visualize some results
Use cookbook applications to get you started on your own applied project in a group.
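The combine-and-store steps can be sketched with Python's built-in `sqlite3` module. The rows below are the feature table and credit-score table from the "From Table to Score" slide in this deck; the table names, column names, and the use of an in-memory database are assumptions made for illustration.

```python
import sqlite3

# Two data sources about the same customers, taken from the deck's tables.
features = [("A", 4.0, 2.0, 2.0), ("B", 4.5, 1.5, 3.0), ("C", 3.0, 3.0, 5.0),
            ("D", 1.0, 2.0, 2.0), ("E", 3.0, 1.5, 5.0), ("F", 3.5, 3.5, 1.0)]
scores = [("A", 552), ("B", 381), ("C", 760), ("D", 330), ("E", 452), ("F", 678)]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE features (cust TEXT PRIMARY KEY, f1 REAL, f2 REAL, f3 REAL)")
conn.execute("CREATE TABLE scores (cust TEXT PRIMARY KEY, credit_score INTEGER)")
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", features)
conn.executemany("INSERT INTO scores VALUES (?, ?)", scores)

# Combine the two sources on the shared customer key.
combined = conn.execute(
    "SELECT f.cust, f.f1, f.f2, f.f3, s.credit_score "
    "FROM features AS f JOIN scores AS s ON s.cust = f.cust "
    "ORDER BY f.cust"
).fetchall()
conn.close()

for row in combined:
    print(row)
```

Each combined row pairs a customer's features (a row of X) with the observed result (an entry of Y), which is the shape a supervised learner needs.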