CS639: Data Management for Data Science
Lecture 1: Intro to Data Science and Course Overview
Theodoros Rekatsinas
Big science is data driven.
Increasingly many companies see themselves as data driven.
Even more “traditional” companies…
The world is increasingly driven by data…
This class teaches the basics of how to use & manage data to obtain useful insights.
Today’s Lecture
1. Motivation, Admin, & Setup
2. Introduction to Data Science
1. Motivation, Admin & Setup
Section 1
What you will learn about in this section
1. Motivation for studying Data Science
2. Administrative structure
3. Course logistics
Data Analysis has always been around
• R.A. Fisher: “Correlation does not imply causation”
• Peter Luhn: A pioneer in hash coding and full text processing. Coined the term “business intelligence”
• John Tukey: Introduced the box-plot. Also introduced the term “bit” (later used by Shannon)
• Tom Mitchell: The time of data-driven AI
• Jim Gray: Essays on scientific discovery based on data-intensive science. Father of ACID (requirements for reliable transaction processing)
• Peter Norvig: Known for AI programming (at Google)
Data Analysis is popular
Why should you study data science?
• Mercenary - make more $$$:
  • Startups need data science talent right away = low employee #
  • Massive industry…
• Intellectual:
  • Science: data poor to data rich
    • No idea how to handle the data!
  • Fundamental ideas to/from all of CS:
    • Systems, theory, AI, logic, stats, analysis…
What this course is (and is not)
• Discuss the fundamentals of data management for data science workflows
  • How to represent and store data
  • How to extract and prepare data for analysis
  • How to analyze data and how to visualize and communicate insights
• You will learn how to design data science pipelines
• This is not a databases, systems, or machine learning class. We will touch on many topics covered in these classes but we will not go into details.
Who we are…
Instructor (me): Theo Rekatsinas
• Faculty in the Computer Sciences and part of the UW-Database Group
• Research: data integration and cleaning, statistical analytics, and machine learning.
• [email protected]
• Office hours: MWF after class @ CS 4361
Frank Zou, Huawei Wang
Communication w/ Course Staff
• Piazza: https://piazza.com/wisc/spring2019/cs639
• Class mailing list: [email protected]
• Office hours: Listed on the website
• Also by appointment!
The goal is to get you to answer each other’s questions so you can benefit and learn from each other.
Course Website: https://thodrek.github.io/cs639_spring19/
Course Github: https://github.com/thodrek/cs639_spring19
Lectures
• Lecture slides cover essential material
  • This is your best reference.
  • We will have pointers as needed
  • Recommended textbooks listed on website
• Try to cover same thing in many ways: Lecture, lecture slides, homework, exams (no shock)
• Attendance makes your life easier…
Graded Elements
• Six Programming assignments (45%)
  • Focus on different aspects of data science
• Midterm (20%)
• Final exam (35%)
Dates are posted on the website!!!
What is expected from you
• Attend lectures
  • If you don’t, it’s at your own peril
• Be active and think critically
  • Ask questions, post comments on forums
• Do programming projects
  • Start early and be honest
• Study for tests and exams
Programming Assignments
• Six programming assignments
  • Python plus Jupyter notebooks
• These are individual assignments
  • Submission via Canvas
• ~1 week per programming assignment
  • Ask questions, post comments on forums
  • Start early!
• You have late days. The policy is described on the website
Programming setup for class
1. For all assignments we will use the provided Virtual Machine.
  • Ubuntu + necessary Python libraries
  • Link provided on the website
  • We will not provide support for any other platform (you can still use your own machine)
2. To deploy and run the provided VM:
  1. Download and install VirtualBox: https://www.virtualbox.org/wiki/Downloads
  2. Download the Class VM from: https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=0
  3. Import the VM by following the instructions here: https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox
  4. Run the VM and log in with the following credentials:
    1. Username: CS639_DS_USER
    2. Password: cs639_ds_user
3. Come to office hours if you need help with installation!
Please help out your peers by posting issues/solutions on Piazza!
2. Introduction to Data Science
Section 2
What you will learn about in this section
1. What is Data Science?
2. Data Science workflows
3. What should a Data Scientist know?
4. Overview of lecture coverage
Data Science is an emerging field
• https://www.oreilly.com/ideas/what-is-data-science
Data Science products
One definition of data science
Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data.
Source: Techopedia
Data science is not databases
             | Databases | Data Science
Data Value   | “Precious” | “Cheap”
Data Volume  | Modest | Massive
Examples     | Bank records, Personnel records, Census, Medical records | Online clicks, GPS logs, Tweets, Building sensor readings
Priorities   | Consistency, Error recovery, Auditability | Speed, Availability, Query richness
Structured   | Strongly (Schema) | Weakly or none (Text)
Properties   | Transactions, ACID* | CAP* theorem (2/3), eventual consistency
Realizations | SQL | NoSQL: Riak, Memcached, MongoDB, CouchDB, HBase, Cassandra, …
*ACID = Atomicity, Consistency, Isolation and Durability
*CAP = Consistency, Availability, Partition Tolerance
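To make the "Transactions, ACID" row concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The `accounts` table, the names, and the `transfer` helper are hypothetical and chosen only for illustration; they are not part of the course material.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: either both updates apply or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Abort mid-transaction; the debit above is undone by the rollback.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False

# Hypothetical example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no partial debit behind, which is the "Error recovery" priority from the Databases column in practice.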
Data science is not databases
Databases | Data Science
Querying the past | Querying the future
Business intelligence (BI) is the transformation of raw data into meaningful and useful information for business analysis purposes. BI can handle enormous amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities.
Source: Wikipedia
Data science workflow
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
[Figure: data science workflow. Labels: Digging Around in Data; Hypothesize, Model; Large Scale Exploitation; Evaluate, Interpret; Clean, prep]
What is hard about Data Science
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline integrity, etc.)
• Using statistical tests correctly
• Prototype → Production transitions
• Data pipeline complexity (who do you ask?)
What are Data Scientists really doing?
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Lectures: 1st part – Data Storage
1. Overview: What is data science
  • Lectures 1-3
2. Foundations: Relational data models & SQL
  • Lectures 4-7
  • How to manipulate data with SQL, a declarative language
    • reduced expressive power but the system can do more for you
  • Query optimization
3. MapReduce and NoSQL systems: MapReduce, Key-Value Stores, Graph DBs
  • Lectures 8-14
  • Dealing with massive amounts of data and non-relational data
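As a rough illustration of what "declarative" means for SQL, the sketch below computes the same per-department average twice: once with an explicit loop that spells out how to do it, and once with a SQL query that only states what is wanted and leaves the execution plan to the system. The `students` table and its rows are made-up example data, and sqlite3 is used only because it ships with Python.

```python
import sqlite3

# Hypothetical example data: (name, department, GPA).
rows = [
    ("alice", "CS", 3.9),
    ("bob", "CS", 3.1),
    ("carol", "Math", 3.6),
    ("dave", "Math", 3.8),
]

# Imperative version: we decide how to group, accumulate, and divide.
totals = {}
for _, dept, gpa in rows:
    running_sum, count = totals.get(dept, (0.0, 0))
    totals[dept] = (running_sum + gpa, count + 1)
imperative = {dept: round(s / n, 2) for dept, (s, n) in totals.items()}

# Declarative version: we state the result we want; the engine picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, dept TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", rows)
query = "SELECT dept, AVG(gpa) FROM students GROUP BY dept"
declarative = {dept: round(avg, 2) for dept, avg in conn.execute(query)}
```

Because the query does not fix an evaluation order, the system is free to optimize it, which is the trade-off the slide describes as reduced expressive power in exchange for the system doing more for you.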
Lectures: 2nd part – Predictive analytics
4. Statistical Reasoning: Inference, Sampling, Bayesian Methods
  • Lectures 15-17
  • How to reason about patterns in data
5. Machine Learning: Decision Trees, Evaluation of ML models, Ensembles
  • Lectures 18-22
  • Overview of different ML paradigms
6. Optimization: How to train ML models (efficiently)?
  • Lecture 23
  • Loss Functions, Optimization via Gradient Descent
  • Stochastic Gradient Descent (SGD), Parallel SGD
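A minimal sketch of the optimization topic, assuming a one-dimensional linear model with a mean squared error loss. The data, learning rates, and epoch counts are illustrative choices, not values from the course. The same function performs full-batch gradient descent when the batch is the whole dataset and SGD when the batch size is 1.

```python
import random

def mse_loss(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def minibatch_gd(data, lr, epochs, batch_size, seed=0):
    """Fit w and b by gradient steps on batches of the given size."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a fresh random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Illustrative noiseless data from y = 2x + 1 with x in [0, 1].
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]

# Full-batch gradient descent: one step per pass over the data.
w_gd, b_gd = minibatch_gd(data, lr=0.5, epochs=500, batch_size=len(data))
# Stochastic gradient descent: one step per example.
w_sgd, b_sgd = minibatch_gd(data, lr=0.1, epochs=200, batch_size=1)
```

On this toy problem both variants recover a slope near 2 and an intercept near 1. SGD takes many cheap steps per pass instead of one expensive one, which is what makes it attractive when the dataset is large.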
Lectures: 3rd part – Data Integration
7. Information Extraction: Named entity recognition and relation extraction
  • Lecture 24
  • How to identify entities of interest in unstructured data?
  • How to find relationships between them?
8. Data Integration: Combine information from different data sources
  • Lecture 25
  • How to find if a real-world entity is mentioned in different sources?
  • How to align data from different sources?
9. Data Cleaning: Remove errors and noise from data
  • Lecture 26
  • How can we detect errors in data to be used for analytics?
  • How can we fix these errors automatically?
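One simple way to approach the question of whether two sources mention the same real-world entity is to normalize the names and compare their token sets. The sketch below uses Jaccard similarity with an arbitrary threshold. This is an illustrative baseline under those assumptions, not the method taught in the course, and the two source lists are made-up examples.

```python
import re

def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Jaccard similarity of the token sets of two names (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def match_entities(source_a, source_b, threshold=0.5):
    """Return all cross-source pairs whose similarity reaches the threshold."""
    return [(a, b) for a in source_a for b in source_b if jaccard(a, b) >= threshold]

# Hypothetical records from two sources with different formatting conventions.
source_a = ["Jim Gray", "Peter Norvig", "John Tukey"]
source_b = ["Gray, Jim", "NORVIG Peter", "Tom Mitchell"]
matches = match_entities(source_a, source_b)
```

Token-set matching ignores word order, case, and punctuation, but it will miss abbreviations and misspellings, which is part of why entity resolution and data cleaning get their own lectures.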
Lectures: 4th part – Communicating Insights
10. Data Visualization: Creating data charts that convey interesting findings
  • Lectures 27-29
  • How to convey insights most effectively?
  • How to explore raw data?
11. Data Privacy: Sharing sensitive information
  • Lectures 30-32
  • How can we share sensitive data?
  • How can we perform analytics on sensitive data?