Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros...

42
CS639: Data Management for Data Science Lecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1

Transcript of Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros...

Page 1: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

CS639:DataManagementfor

DataScienceLecture1:IntrotoDataScienceandCourseOverview

TheodorosRekatsinas1

Page 2: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

2

Page 3: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

3

Bigscienceisdatadriven.

Page 4: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

4

Increasinglymanycompaniesseethemselvesasdatadriven.

Page 5: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

5

Evenmore“traditional”companies…

Page 6: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Theworldisincreasinglydrivenbydata…

6

Thisclassteachesthebasicsofhowtouse&managedatatoobtainusefulinsights.

Page 7: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Today’sLecture

1. Motivation,Admin,&Setup

2. IntroductiontoDataScience

7

Page 8: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

1.Motivation,Admin&Setup

8

Section1

Page 9: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Whatyouwilllearnaboutinthissection

1. MotivationforstudyingDataScience

2. Administrativestructure

3. Courselogistics

9

Section1

Page 10: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

10

Section1

DataAnalysishasalwaysbeenaround

R.A.Fisher “Correlationdoesnotimplycausation”

PeterLuhn ApioneerinhashcodingandfulltextprocessingCoinedtheterm“businessintelligence”

Page 11: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

11

Section1

DataAnalysishasalwaysbeenaround

Introducedthebox-plotAlsointroducedtheterm“bit”(laterusedbyShannon)

Thetimeofdata-drivenAI

JohnTukey

TomMitchell

Page 12: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

12

Section1

DataAnalysishasalwaysbeenaround

EssaysonScientificDiscoverybasedondata-intensivescienceFatherofACID

(requirementsforreliabletransactionprocessing)

KnownforAIprogramming(atGoogle)

ByJimGray

PeterNorvig

Page 13: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

13

Section1

DataAnalysisispopular

Page 14: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

14

Section1

Whyshouldyoustudydatascience?

• Mercenary-makemore$$$:• Startupsneeddatasciencetalentrightaway=lowemployee#• Massiveindustry…

• Intellectual:• Science:datapoortodatarich

• Noideahowtohandlethedata!• Fundamentalideasto/fromallofCS:

• Systems,theory,AI,logic,stats,analysis….

Page 15: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

15

Section1

Whatthiscourseis(andisnot)

• Discussthefundamentalsdatamanagementfordatascienceworkflows• Howtorepresentandstoredata• Howtoextractandpreparedataforanalysis• Howtoanalyzedataandhowtovisualizeandcommunicateinsights

• Youwilllearnhowtodesigndatasciencepipelines

• Thisisnotadatabases,systems,ormachinelearningclass.Wewilltouchmanytopicscoveredintheseclassesbutwewillnotgointodetails.

Page 16: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Whoweare…

Instructor(me)TheoRekatsinas• FacultyintheComputerSciencesandpartoftheUW-DatabaseGroup• Research:dataintegrationandcleaning,statisticalanalytics,andmachinelearning.• [email protected]• Officehours:MWFafterclass@CS4361

16

Section1

Page 17: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

17

Section1

FrankZou HuaweiWang

Page 18: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Communicationw/CourseStaff

• Piazzahttps://piazza.com/wisc/spring2019/cs639

• Classmailinglist:[email protected]

• Officehours:Listedonthewebsite

• AlsoByappointment!

18

Section1

Thegoalistogetyoutoanswereach

other’squestionssoyoucanbenefitandlearnfromeachother.

Page 19: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

CourseWebsite:https://thodrek.github.io/cs639_spring19/

19

Section1

CourseGithub:https://github.com/thodrek/cs639_spring19

Page 20: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

20

Lectures

• Lectureslidescoveressentialmaterial• Thisisyourbestreference.• Wewillhavepointersasneeded• Recommendedtextbookslistedonwebsite

• Trytocoversamethinginmanyways:Lecture,lectureslides,homework,exams(noshock)• Attendancemakesyourlifeeasier…

Section1

Page 21: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

GradedElements

• SixProgrammingassignments(45%)• Focusondifferentaspectsofdatascience

• Midterm(20%)

• Finalexam(35%)

21

Datesarepostedonthewebsite!!!

Section1

Page 22: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Whatisexpectedfromyou

• Attendlectures• Ifyoudon’t,it’satyourownperil

• Beactiveandthinkcritically• Askquestions,postcommentsonforums

• Doprogrammingprojects• Startearlyandbehonest

• Studyfortestsandexams

22

Section1

Page 23: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

ProgrammingAssignments

• Sixprogrammingassignments• PythonplusJupiternotebooks

• Theseareindividualassignments• SubmissionviaCanvas

• ~1weekperprogrammingassignments• Askquestions,postcommentsonforums• Startearly!

• Youhavelatedays.Thepolicyisdescribedonthewebsite

23

Section1

Page 24: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Programmingsetupforclass

24

1. ForallassignmentswewillusetheprovidedVirtualMachine.• Ubuntu+necessarypythonlibraries• Linkprovidedonthewebsite• Wewillnotprovidesupportforanyotherplatform(youcanstilluseyourownmachine)

2. TodeployandruntheprovidedVM:1. DownloadandInstallVirtualBox:https://www.virtualbox.org/wiki/Downloads2. DownloadtheClassVMfrom:

https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=03. ImporttheVMbyfollowingtheinstructionshere:

https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox4. RuntheVMandloginwiththefollowingcredentials:

1. Username:CS639_DS_USER2. Password:cs639_ds_user

3. Cometoofficehoursifyouneedhelpwithinstallation!

Pleasehelpoutyourpeersbypostingissues/solutionsonPiazza!

Section1

Page 25: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

2.IntroductiontoDataScience

25

Section2

Page 26: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Whatyouwilllearnaboutinthissection

1. WhatisDataScience?

2. DataScienceworkflows

3. WhatshouldaDataScientistknow?

4. Overviewoflecturecoverage

26

Section2

Page 27: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

DataScienceisanemergingfield

• https://www.oreilly.com/ideas/what-is-data-science

27

Section2

Page 28: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

DataScienceproducts

28

Section2

Page 29: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Onedefinitionofdatascience

29

Section2

Datascienceisabroadfieldthatreferstothecollectiveprocesses,theories,concepts,toolsandtechnologies thatenablethereview,analysisandextractionofvaluableknowledgeandinformationfromrawdata.

Source:Techopedia

Page 30: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Datascienceisnotdatabases

30

Section2

Databases DataScienceDataValue “Precious” “Cheap”

DataVolume Modest Massive

Examples Bank records,Personnelrecords,Census,Medicalrecords

Onlineclicks,GPSlogs,Tweets,Building sensorreadings

Priorities Consistency,Error recovery,Auditability

Speed,Availability,Queryrichness

Structured Strongly(Schema) Weaklyornone(Text)

Properties Transactions, ACID* CAP*theorem (2/3),eventualconsistency

Realizations SQL NoSQL:Riak,Memcached,MongoDB,CouchDB,Hbase,Cassandra,…

ACID=Atomicity,Consistency,IsolationandDurability CAP=Consistency,Availability,PartitionTolerance

Page 31: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Datascienceisnotdatabases

31

Section2

Databases DataScience

Queryingthepast Queryingthefuture

Businessintelligence (BI)isthetransformationofrawdataintomeaningfulandusefulinformationforbusinessanalysispurposes.BIcanhandleenormous

amountsofunstructureddatatohelpidentify,developandotherwisecreatenewstrategicbusinessopportunities- Wikipedia

Page 32: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Datascienceworkflow

32

Section2

https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Page 33: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Datascienceworkflow

33

Section2

Page 34: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Datascienceworkflow

34

Section2

DiggingAroundinData

HypothesizeModel

LargeScaleExploitation

EvaluateInterpret

Clean,prep

Page 35: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

WhatishardaboutDataScience

35

Section2

• Overcomingassumptions• Makingad-hocexplanationsofdatapatterns• Overgeneralizing• Communication• Notcheckingenough(validatemodels,datapipelineintegrity,etc.)• Usingstatisticaltestscorrectly• Prototypeà Production transitions• Datapipelinecomplexity (whodoyouask?)

Page 36: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

WhatishardaboutDataScience

36

Section2

Page 37: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

WhatishardaboutDataScience

37

Section2

Page 38: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

WhatareDataScientistsreallydoing?

38

Section2

https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

Page 39: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Lectures:1st part– DataStorage

1. Overview:Whatisdatascience• Lectures1-3

2. Foundations:Relationaldatamodels&SQL• Lectures4-7• HowtomanipulatedatawithSQL,adeclarativelanguage

• reducedexpressivepowerbutthesystemcandomoreforyou• Queryoptimization

3. MapReduceandNoSQLsystems:MapReduce,KeyValue Stores,GraphDBs• Lectures8-14• Dealingwithmassiveamountsofdataandnon-relationaldata

39

Section2

Page 40: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Lectures:2nd part– Predictiveanalytics

4.StatisticalReasoning:Inference,SamplingBayesianMethods• Lectures15-17• Howtoreasonaboutpatternsindata

5.MachineLearning:DecisionTrees,EvaluationofMLmodels,Ensembles• Lectures18-22• OverviewofdifferentMLparadigms

6.Optimization:HowtotrainMLmodels(efficiently)?• Lecture23• LossFunctions,OptimizationviaGradientDescent• StochasticGradientDescent(SGD),ParallelSGD

40

Section2

Page 41: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Lectures:3rd part– DataIntegration

7.InformationExtraction:Namedentityrecognitionandrelationextraction• Lecture24• Howtoidentifyentitiesofinterestinunstructureddata?• Howtofindrelationshipsbetweenthem?

8.DataIntegration:Combineinformationfromdifferentdatasources• Lecture25• Howtofindifareal-worldentityismentionedindifferentsources?• Howtoaligndatafromdifferentsources?

9.DataCleaning:Removeerrorsandnoisefromdata• Lecture26• Howcanwedetecterrorsindatatobeusedforanalytics?• Howcanwefixtheseerrorsautomatically?

41

Section2

Page 42: Lecture 1 Intro - GitHub PagesLecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 3 Big science is data driven. 4 Increasingly many companies see themselves

Lectures:4th part– CommunicatingInsights

10.DataVisualization:Creatingdatachartsthatconveyinterestingfindings• Lectures27-29• Howtoconveyinsightsmosteffectively?• Howtoexplorerawdata?

11.DataPrivacy:Sharingsensitiveinformation• Lecture30-32• Howcanwesharesensitivedata?• Howcanweperformanalyticsonsensitivedata?

42

Section2