CS145: Intro to Databases
Lecture 1: Course Overview
The world is increasingly driven by data…
This class teaches the basics of how to use & manage data.
Key Questions We Will Answer
• How can we collect and store large amounts of data?
  • By building tools and data structures to efficiently index and serve data
• How can we efficiently query data?
  • By compiling high-level declarative queries into efficient low-level plans
• How can we safely update data?
  • By managing concurrent access to state as it is read and written
• How do different database systems manage design trade-offs?
  • e.g., at scale, in a distributed environment?
When you’ll use this material
• Building almost any software application
  • e.g., mobile, cloud, consumer, enterprise, analytics, machine learning
  • Corollary: every application you use uses a database
  • Bonus: every program consumes data (even if only the program text!)
• Performing data analytics
  • Business intelligence, data science, predictive modeling
  • (Even if you’re using Pandas, you’re using relational algebra!)
• Building data-intensive tools and applications
  • Many core concepts power everything from deep learning frameworks to self-driving cars
Today’s Lecture
1. Introduction, admin & setup
  • ACTIVITY: Jupyter “Hello World!”
2. Overview of the relational data model
  • ACTIVITY: SQL in Jupyter
3. Overview of DBMS topics: Key concepts & challenges

1. Introduction, admin & setup
What you will learn about in this section
1. Motivation for studying DBs
2. Administrative structure
3. Course logistics
4. Overview of lecture coverage
5. ACTIVITY: Jupyter “Hello World!”
Big Data Landscape… Infrastructure is Changing
http://www.bigdatalandscape.com/
New tech. Same Principles.
Why should you study databases?
• Mercenary - make more $$$:
  • Startups need DB talent right away = low employee #
  • Massive industry…
• Intellectual:
  • Science: data poor to data rich
    • No idea how to handle the data!
  • Fundamental ideas to/from all of CS:
    • Systems, theory, AI, logic, stats, analysis…
Many great computer systems ideas started in DB.
What this course is (and is not)
• Discuss fundamentals of data management
  • How to design databases, query databases, build applications with them.
  • How to debug them when they go wrong!
  • Not how to be a DBA or how to tune Oracle 12g.
• We’ll cover how database management systems work
• And some (but not all of) the principles of how to build them
  • see 245, 345, and 346.
Who we are…
Instructor (me): Peter Bailis
• Faculty in the InfoLab
• Second year at Stanford, first time teaching CS145!
• Research: tools + systems for large-scale data analytics
• Office hours: T/Th 4:30-5:30, Gates 410
Course Assistants (CAs)
Dev Bhargava, William Chen, Soroosh Hemmati, Woncheol Jeong, Lingtong Sun, Stephanie Tang, Amelia Vu
Tara, Head CA
CS145.stanford.edu
Communication w/ Course Staff
• Piazza
• Office hours
• By appointment!
OHs are listed on the course website!
Piazza
The goal is to get you to answer each other’s questions so you can benefit and learn from each other.
Important!
Students with documented disabilities should send in their accommodation letter from O.A.E. (Office of Accessible Education) by the end of this week to Tara Balakrishnan (Head CA) & cc me.
Course Website: cs145.stanford.edu
Lectures
• Lecture slides cover essential material
  • This is your best reference.
  • We are trying to get away from the book, but do have pointers
• Try to cover the same thing in many ways: lecture, lecture notes, homework, exams (no shock)
  • Attendance makes your life easier…
Attendance
• I dislike mandatory attendance… but in the past we noticed…
  • People who did not attend did worse :(
  • People who did not attend used more course resources :(
  • People who did not attend were less happy with the course :(
• Last year: mandatory attendance
• This year: voluntary (to start!); we reserve the right to change
Graded Elements
• Problem Sets (25%)
• Programming project (25%)
• Midterm (20%)
• Final exam (30%)
Assignments are typically due Tuesday before class; you typically get 2 weeks to complete them.
Un-Graded Elements
• Readings provided to help you!
  • Only items in lecture, homework, or project are fair game.
• Activities are again mainly to help / be fun!
  • Will occur during class - not graded, but count as part of lecture material (fair game as well)
• Jupyter Notebooks provided
  • These are optional but hopefully helpful.
  • Redesigned so that you can ‘interactively replay’ parts of lecture
What is expected from you
• Attend lectures
  • If you don’t, it’s at your own peril
• Be active and think critically
  • Ask questions, post comments on forums
• Do programming and homework projects
  • Start early and be honest
• Study for tests and exams
Lectures: 1st half - from a user’s perspective
1. Foundations: Relational data models & SQL
  • Lectures 2-3
  • How to manipulate data with SQL, a declarative language
    • reduced expressive power but the system can do more for you
2. Database Design: Design theory and constraints
  • Lectures 4-6
  • Designing relational schema to keep your data from getting corrupted
3. Transactions: Syntax & supporting systems
  • Lectures 7-8
  • A programmer’s abstraction for data consistency
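To make the "declarative language" point in item 1 concrete, here is a minimal sketch that answers the same question twice: once with a hand-written loop, once with SQL. It assumes Python's built-in `sqlite3` module as a stand-in DBMS and borrows the Students rows shown later in this lecture; the 3.5 GPA cutoff is an arbitrary illustration.

```python
import sqlite3

# Sample rows taken from the Students relation used later in the lecture.
students = [("101", "Bob", 3.2), ("123", "Mary", 3.8)]

# Imperative version: we spell out HOW to scan and filter the data.
high_gpa_loop = []
for sid, name, gpa in students:
    if gpa > 3.5:  # arbitrary cutoff, chosen only for illustration
        high_gpa_loop.append(name)

# Declarative version: we state WHAT we want; the DBMS chooses the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", students)
high_gpa_sql = [
    row[0] for row in conn.execute("SELECT name FROM Students WHERE gpa > 3.5")
]

print(high_gpa_loop, high_gpa_sql)
```

The SQL query says nothing about scan order or indexes, which is what leaves the system free to "do more for you".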
Lectures: 2nd half - understanding how it works
4. Introduction to database systems
  • Lectures 12-16
  • Indexing
  • External Memory Algorithms (IO model) for sorting, joins, etc.
  • Basics of query optimization (Cost Estimates)
  • Relational algebra
5. Specialized and New Data Processing Systems
  • Lectures 17-19
  • Key-Value Stores
  • Hadoop and its 10 year anniversary
  • SparkSQL. The re-rise of SQL
  • Next-gen analytics systems & current intersections with ML & AI
Lectures: A note about format of notes
These are asides / notes (still need to know these in general!)
Main point of slide / key takeaway at bottom
Definitions in blue with concept being defined bold & underlined
Warnings - pay attention here!
Take note!!
Jupyter Notebook “Hello World”
• Jupyter notebooks are interactive shells which save output in a nice notebook format
  • They also can display markdown, LaTeX, HTML, js…
• You’ll use these for
  • in-class activities
  • interactive lecture supplements / recaps
  • homeworks, projects, etc. - if helpful!
FYI: Jupyter notebooks are also called IPython notebooks, but they handle other languages too.
Note: you do need to know or learn Python for this course!
Jupyter Notebook Setup
As a general policy in upper-level CS courses, Windows is not officially supported. However, we are making a best-effort attempt to provide some solutions here!
1. HIGHLY RECOMMENDED. Install on your laptop via the instructions on the next slide / Piazza
2. Other options: running via one of the alternative methods:
  1. Ubuntu VM.
  2. Corn
3. Come to our Installation Office Hours after this class and tomorrow!
Please help out your peers by posting issues / solutions on Piazza!
Jupyter Notebook Setup
CAs will be coming around to help with setup & installation
https://github.com/stanford-futuredata/cs145-2017/blob/master/jupyter_install.md
Activity-1-1.ipynb
2. Overview of the relational data model
What you will learn about in this section
1. Definition of DBMS
2. Data models & the relational data model
3. Schemas & data independence
4. ACTIVITY: Jupyter + SQL
What is a DBMS?
• A database is a large, integrated collection of data
• It models a real-world enterprise
  • Entities (e.g., Students, Courses)
  • Relationships (e.g., Alice is enrolled in 145)
A Database Management System (DBMS) is a piece of software designed to store and manage databases
A Motivating, Running Example
• Consider building a course management system (CMS):
  • Entities: Students, Courses, Professors
  • Relationships: Who takes what, Who teaches what
Data models
• A data model is a collection of concepts for describing data
  • The relational model of data is the most widely used model today
  • Main Concept: the relation - essentially, a table
• A schema is a description of a particular collection of data, using the given data model
  • E.g. every relation in a relational data model has a schema describing types, etc.
“Relational databases form the bedrock of western civilization”
- Bruce Lindsay, IBM Research
Modeling the CMS
• Logical Schema
  • Students(sid: string, name: string, gpa: float)
  • Courses(cid: string, cname: string, credits: int)
  • Enrolled(sid: string, cid: string, grade: string)
Relations:
Students
sid   name   gpa
101   Bob    3.2
123   Mary   3.8
Courses
cid   cname   credits
564   564-2   4
308   417     2
Enrolled
sid   cid   grade
123   564   A
Corresponding keys: Enrolled.sid refers to Students.sid (123), and Enrolled.cid refers to Courses.cid (564).
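The logical schema and sample rows for the CMS can be tried out directly. The sketch below assumes Python's built-in `sqlite3` module as the DBMS and maps the slide's `string`/`float`/`int` types to SQLite's `TEXT`/`REAL`/`INTEGER`; the join at the end follows the corresponding keys to answer "who takes what".

```python
import sqlite3

# In-memory database; table and column names follow the logical schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL);
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
    -- REFERENCES documents the corresponding keys (SQLite only enforces
    -- them when PRAGMA foreign_keys is switched on).
    CREATE TABLE Enrolled (
        sid   TEXT REFERENCES Students(sid),
        cid   TEXT REFERENCES Courses(cid),
        grade TEXT
    );
""")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("101", "Bob", 3.2), ("123", "Mary", 3.8)])
conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)",
                 [("564", "564-2", 4), ("308", "417", 2)])
conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", ("123", "564", "A"))

# Join Enrolled to the other two relations on the corresponding keys.
rows = conn.execute("""
    SELECT s.name, c.cname, e.grade
    FROM Enrolled e
    JOIN Students s ON s.sid = e.sid
    JOIN Courses  c ON c.cid = e.cid
""").fetchall()
print(rows)
```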
Other Schemata…
• Physical Schema: describes data layout (what administrators see)
  • Relations as unordered files
  • Some data in sorted order (index)
• Logical Schema: Previous slide
• External Schema: Views (what applications see)
  • Course_info(cid: string, enrollment: integer)
  • Derived from other tables
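An external schema such as `Course_info` can be sketched as a SQL view. The example assumes Python's built-in `sqlite3` module; the view definition (a count of enrollments per course) is one plausible way to derive `enrollment`, and two of the three Enrolled rows are hypothetical, added only so the counts differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
    -- First row is from the lecture; the other two are made up.
    INSERT INTO Enrolled VALUES
        ('123', '564', 'A'),
        ('101', '564', 'B'),
        ('101', '308', 'A');

    -- External schema: derived from Enrolled, stores no data of its own.
    CREATE VIEW Course_info AS
        SELECT cid, COUNT(*) AS enrollment
        FROM Enrolled
        GROUP BY cid;
""")

# An application can query the view exactly like a table.
course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
print(course_info)
```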
Data independence
Concept: Applications do not need to worry about how the data is structured and stored
Logical data independence: protection from changes in the logical structure of the data
  • I.e. should not need to ask: can we add a new entity or attribute without rewriting the application?
Physical data independence: protection from physical layout changes
  • I.e. should not need to ask: which disks are the data stored on? Is the data indexed?
One of the most important reasons to use a DBMS
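Both kinds of data independence can be seen in a small experiment: the same application query keeps working, unchanged, after a physical change (adding an index) and after a logical change (adding an attribute). This is a sketch assuming Python's built-in `sqlite3` module; the index name `idx_students_sid` and the `email` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("101", "Bob", 3.2), ("123", "Mary", 3.8)])

# The application's query; it never changes below.
query = "SELECT name FROM Students WHERE sid = ?"
before = conn.execute(query, ("123",)).fetchall()

# Physical change: add an index (hypothetical name). Only the layout changes.
conn.execute("CREATE INDEX idx_students_sid ON Students(sid)")
after_index = conn.execute(query, ("123",)).fetchall()

# Logical change: add a new attribute (hypothetical column).
conn.execute("ALTER TABLE Students ADD COLUMN email TEXT")
after_new_column = conn.execute(query, ("123",)).fetchall()

print(before, after_index, after_new_column)
```

The index may change how fast the answer arrives, but not the answer or the query text.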
Activity-1-2.ipynb
3. Overview of DBMS topics: Key concepts & challenges
What you will learn about in this section
1. Transactions
2. Concurrency & locking
3. Atomicity & logging
4. Summary
Challenges with Many Users
• Suppose that our CMS application serves 1000’s of users or more - what are some challenges?
  • Security: Different users, different roles
    • We won’t look at this too much in this course, but it is extremely important
  • Performance: Need to provide concurrent access
    • Disk/SSD access is slow; the DBMS hides the latency by doing more CPU work concurrently
  • Consistency: Concurrency can lead to update problems
    • The DBMS allows users to write programs as if they were the only user
Transactions
• A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
Atomicity: An action either completes entirely or not at all
Transfer $3k from a10 to a20:
1. Debit $3k from a10
2. Credit $3k to a20
Before:
Acct   Balance
a10    20,000
a20    15,000
After:
Acct   Balance
a10    17,000
a20    18,000
Written naively, in which states is atomicity preserved?
• Crash before 1,
• After 1 but before 2,
• After 2.
The DB always preserves atomicity!
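The transfer can be replayed as a real transaction. This sketch assumes Python's built-in `sqlite3` module; the `crash_after_debit` flag is a hypothetical stand-in for a failure between step 1 and step 2, simulated here by raising an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (acct TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                 [("a10", 20000), ("a20", 15000)])
conn.commit()

def transfer(conn, src, dst, amount, crash_after_debit=False):
    """Move `amount` from src to dst as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE acct = ?",
                (amount, src))
            if crash_after_debit:
                raise RuntimeError("simulated crash between debit and credit")
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE acct = ?",
                (amount, dst))
    except RuntimeError:
        pass  # the debit has already been rolled back at this point

def balances(conn):
    return dict(conn.execute("SELECT acct, balance FROM Accounts"))

transfer(conn, "a10", "a20", 3000, crash_after_debit=True)
after_crash = balances(conn)      # the debit alone never becomes visible

transfer(conn, "a10", "a20", 3000)
after_success = balances(conn)    # both steps applied together

print(after_crash, after_success)
```

Without the transaction, the "after 1 but before 2" case would leave $3k missing from a10 and never credited to a20.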
Transactions
• A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
  • If a user cancels a TXN, it should be as if nothing happened!
• Transactions leave the DB in a consistent state
  • Users may write integrity constraints, e.g., ‘each course is assigned to exactly one room’
Atomicity: An action either completes entirely or not at all
Consistency: An action results in a state which conforms to all integrity constraints
However, note that the DBMS does not understand the real meaning of the constraints - the consistency burden is still on the user!
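One way to express "each course is assigned to exactly one room" is with a primary key plus a NOT NULL column. The sketch below assumes Python's built-in `sqlite3` module; the table name `CourseRooms` and the room names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CourseRooms (
        cid  TEXT PRIMARY KEY,  -- one row per course, so at most one room
        room TEXT NOT NULL      -- and that row must actually name a room
    )
""")
with conn:
    conn.execute("INSERT INTO CourseRooms VALUES ('564', 'Room A')")

# Two updates that would each break the constraint.
rejected = []
for cid, room in [("564", "Room B"),  # a second room for course 564
                  ("308", None)]:     # a course with no room at all
    try:
        with conn:
            conn.execute("INSERT INTO CourseRooms VALUES (?, ?)", (cid, room))
    except sqlite3.IntegrityError:
        rejected.append((cid, room))

rows = conn.execute("SELECT cid, room FROM CourseRooms").fetchall()
print(rejected, rows)
```

The DBMS rejects both inserts, but it has no idea whether "Room A" exists or is large enough; that part of consistency remains the user's job.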
Challenge: Scheduling Concurrent Transactions
• The DBMS ensures that the execution of {T1,…,Tn} is equivalent to some serial execution
• One way to accomplish this: Locking
  • Before reading or writing, a transaction requests a lock from the DBMS and holds it until the end
• Key Idea: If Ti wants to write to an item x and Tj wants to read x, then Ti, Tj conflict. Solution via locking:
  • only one winner gets the lock
  • loser is blocked (waits) until winner finishes
A set of TXNs is isolated if their effect is as if all were executed serially
What if Ti and Tj need X and Y, and Ti asks for X before Tj, and Tj asks for Y before Ti? -> Deadlock! One is aborted…
All concurrency issues handled by the DBMS…
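The winner/loser behavior of locking can be imitated with ordinary threads. This is a toy sketch, not how a DBMS lock manager is built: a Python dict stands in for the database, a single `threading.Lock` stands in for the lock on item x, and each call to the hypothetical `txn_add` is one transaction that holds the lock until it finishes. Deadlock handling is not modeled, since there is only one lock.

```python
import threading

# Toy "database" with a single item x, and the lock that guards it.
db = {"x": 0}
lock_x = threading.Lock()

def txn_add(amount):
    """One transaction: read x, then write x, holding the lock throughout."""
    with lock_x:                  # the loser blocks here until the winner is done
        value = db["x"]           # read
        db["x"] = value + amount  # write
    # lock released only once the transaction has finished

# Run 100 concurrent transactions that all conflict on x.
threads = [threading.Thread(target=txn_add, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_x = db["x"]
print(final_x)
```

Because every transaction runs entirely inside the lock, the result equals that of some serial order of the 100 transactions.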
Ensuring Atomicity & Durability
• DBMS ensures atomicity even if a TXN crashes!
• One way to accomplish this: Write-ahead logging (WAL)
• Key Idea: Keep a log of all the writes done.
  • After a crash, the partially executed TXNs are undone using the log
Write-ahead Logging (WAL): Before any action is finalized, a corresponding log entry is forced to disk
We assume that the log is on “stable” storage
All atomicity issues also handled by the DBMS…
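The key idea behind write-ahead logging can be sketched in a few lines. This is a toy undo log, not a real recovery protocol: a Python list stands in for the log on "stable" storage, a dict stands in for the data, and the function names `write`, `commit`, and `recover` are hypothetical.

```python
# `db` stands in for the data pages, `log` for the log on stable storage.
db = {"a10": 20000, "a20": 15000}
log = []

def write(txn, key, new_value):
    # Write-ahead rule: record the old value in the log BEFORE changing the data.
    log.append(("WRITE", txn, key, db[key]))
    db[key] = new_value

def commit(txn):
    log.append(("COMMIT", txn))

def recover():
    """Undo, newest first, every write whose transaction never committed."""
    committed = {entry[1] for entry in log if entry[0] == "COMMIT"}
    for entry in reversed(log):
        if entry[0] == "WRITE" and entry[1] not in committed:
            _, _, key, old_value = entry
            db[key] = old_value

# T1 debits a10, then the system "crashes" before the credit and the commit.
write("T1", "a10", db["a10"] - 3000)
state_at_crash = dict(db)

recover()
state_after_recovery = dict(db)
print(state_at_crash, state_after_recovery)
```

Because the old balance reached the log before the data changed, recovery can always restore it, which is why the order "log first, data second" matters.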
A Well-Designed DBMS makes many people happy!
• End users and DBMS vendors
  • Reduces cost and makes money
• DB application programmers
  • Can handle more users, faster, for cheaper, and with better reliability / security guarantees!
• Database administrators (DBA)
  • Easier time of designing logical/physical schema, handling security/authorization, tuning, crash recovery, and more…
Must still understand DB internals
Summary of DBMS
• DBMS are used to maintain, query, and manage large datasets.
  • Provide concurrency, recovery from crashes, quick application development, integrity, and security
• Key abstractions give data independence
• DBMS R&D is one of the broadest, most exciting fields in CS. Fact!