DATA - Harvard SEASdaslab.seas.harvard.edu/classes/cs265/files/... · Architecture of a Database...

Post on 22-May-2020

5 views 0 download

Transcript of DATA - Harvard SEASdaslab.seas.harvard.edu/classes/cs265/files/... · Architecture of a Database...

Stratos Idreos

DATA

INDEX——HOW——

TO STORE ——DATA——

DATA

INDEX

data structure decisions define the algorithms that access data

ALGORITHMS

DATA

INDEX

[7,4,2,6,1,3,9,10,5,8]

ALGORITHMSunordered

DATA

INDEX

[7,4,2,6,1,3,9,10,5,8]

ALGORITHMSunordered

DATA

INDEX

[7,4,2,6,1,3,9,10,5,8]

ALGORITHMS[1,2,3,4,5,6,7,8,9,10]

unordered

ordered

DATA

INDEX

ALGORITHMS

DATA

INDEX

ALGORITHMS

DATA SYSTEMS

DATA STRUCTURES

DEFINE PERFORMANCE

2018

spee

d COMPUTE

DATA MOVEMENT

2018

spee

d COMPUTE

DATA MOVEMENT

register = this room

disk = Pluto memory = nearby city

Jim Gray, Turing Award 1998

caches = this city

Read Update

Memory

no perfect structure

amplification

Read Update

Memory

Mem

ory

Read

Upda

teno perfect structure

amplification

Read Update

Memory

Mem

ory

Read

Upda

te

differential approximate

pointtree

no perfect structure

amplification

Read Update

Memory

Mem

ory

Read

Upda

te

differential approximate

pointtree

no perfect structure

amplification

Array

Linked-List

Skip-List

Trie

Hash-Table

Sorted Array

B-tree

How do I make my data system run x times as fast? (sql,nosql,bigdata, …)

How do I make my data system run x times as fast?

How do I minimize my bill in the cloud?

(sql,nosql,bigdata, …)

How do I make my data system run x times as fast?

How do I minimize my bill in the cloud?

(sql,nosql,bigdata, …)

How do I extend the lifetime of my hardware?

How do I make my data system run x times as fast?

How do I minimize my bill in the cloud?

How to accelerate statistics computation for data science/ML?

(sql,nosql,bigdata, …)

How do I extend the lifetime of my hardware?

How do I make my data system run x times as fast?

How do I minimize my bill in the cloud?

How do I train my neural network x times faster?

How to accelerate statistics computation for data science/ML?

(sql,nosql,bigdata, …)

How do I extend the lifetime of my hardware?

NEW APPLICATIONS

NEW APPLICATIONS

existing systems need to change too

NEW APPLICATIONS

existing systems need to change too

WORKLOAD HARDWARE

ADAPT

NEW APPLICATIONS

existing systems need to change too

WORKLOAD HARDWARE

ADAPT

IMPROVE WITHIN A BUDGET

WHAT WILL BREAK MY SYSTEM?

REASON

more data

new applications

new h/w

continuous need

for newstorage solutions

fundamental of storage learning outcome

software engineering data-driven startup researchdata structures, SQL, NoSQL, Big Data, Neural Networks, Statistics, Data Science

interaction: in and out of class

There is no such thing as a

wrong question/answer!!!!

Recent Research Papers

review and slides should focus on

what is the problem why is it important

why is it hard why existing solutions do not work

what is the core intuition for the solution solution step by step

does the paper prove its claims exact setup of analysis/experiments are there any gaps in the logic/proof

possible next steps

* follow a few citations to gain more background

Each student: 2 reviews per week/1 presentation

Recent Research Papers

review and slides should focus on

what is the problem why is it important

why is it hard why existing solutions do not work

what is the core intuition for the solution solution step by step

does the paper prove its claims exact setup of analysis/experiments are there any gaps in the logic/proof

possible next steps

* follow a few citations to gain more background

Each student: 2 reviews per week/1 presentation

learn to judge constructively

learn to present

learn to prepare slides

systems project research project

semester project: due in the end of semester + a midway check in (early March,10%)

systems project

individual projectNoSQL, in c/c++

research project

semester project: due in the end of semester + a midway check in (early March,10%)

systems project

individual projectNoSQL, in c/c++

research project

groups of threeNoSQL, Neural Networks

Periodic Table of Data Structures

only open to cs165 students (unless proven otherwise)

semester project: due in the end of semester + a midway check in (early March,10%)

systems project

individual projectNoSQL, in c/c++

research project

groups of threeNoSQL, Neural Networks

Periodic Table of Data Structures

only open to cs165 students (unless proven otherwise)

semester project: due in the end of semester + a midway check in (early March,10%)

ACM Special Interest Group In Data Management (SIGMOD)Undergrad Research Competition

first prize in 2016, 2017, 2018Adaptive Denormalization Evolving Trees Splaying LSM-Trees

ACM Special Interest Group In Data Management (SIGMOD)Undergrad Research Competition

first prize in 2016, 2017, 2018Adaptive Denormalization Evolving Trees Splaying LSM-Trees

Design continuums at CIDR 2019, LSM/BTree hybrids in SIGMOD finals

piazza forum

all announcements & discussions (link on class website - check out usage guidelines)

classes are recorded (links on class website)

NO LAPTOP/PHONE POLICYclass is based on participation!

Project: 40% Midway Check-in:10% Discussion: 20% Presentation: 15% Reviews: 15%

Check out: syllabus, preparation readings, project 0, systems project, online sections

http://daslab.seas.harvard.edu/classes/cs265/

Get familiar with the very basics of traditional database architectures:Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

Get familiar with very basics of modern database architectures:The Design and Implementation of Modern Column-store Database Systems.  By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013

Get familiar with the very basics of modern large scale systems:Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013

subarna wilson wasay

Teaching Fellows:

Off class discussions are key! question on readings, ideas, help with code/analysis

questions on logistics?

BASICS of storageIntro to RESEACH topics

Discussion phase/presentation as of week 3

Next few classes:

ask/answer questions, keep notes/ideas

registers

on chip cache

on board cache

memory

disk

CPU

memory wall

chea

per

fast

er

SRAM

DRAM

~1ns

~10ns

~100ns

cache miss: looking for something which is not in the cache

memory miss: looking for something which is not in memory

time

speed cpu

mem

registers

on chip cache

on board cache

memory

disk

CPU

memory wall

chea

per

fast

er

SRAM

DRAM

~1ns

~10ns

~100ns

cache miss: looking for something which is not in the cache

memory miss: looking for something which is not in memory

time

speed cpu

mem

Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations award

disk100Kx Pluto

2 years

memory100x New York1.5 hours

on board cache10x this building

10 min

on chip cache2x this room

1 min

registers my head~0

need to only read x… but have to read all of page 1

page1 page2 page3

data value x

registers

on chip cache

on board cachememory

disk

CPU

data

mov

e

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

5 10 6 4 12

(size=120 bytes)

2 8 9 7 6 7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan

5 10 6 4 12(size=120 bytes)

2 8 9 7 6 7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan

5 10 6 4 12(size=120 bytes)

2 8 9 7 6

4

7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan

40 bytes

5 10 6 4 12(size=120 bytes)

2 8 9 7 6

4

7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan scan

40 bytes

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4

7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan scan

40 bytes

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan scan

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

80 bytes

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

80 bytes

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

scan

80 bytes

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

scan

3

80 bytes

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

scan

3

120 bytes

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

5 10 6 4 12

(size=120 bytes)

2 8 9 7 6 7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle

5 10 6 4 12(size=120 bytes)

2 8 9 7 6 7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle

5 10 6 4 12(size=120 bytes)

2 8 9 7 6

4

7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle

40 bytes

5 10 6 4 12(size=120 bytes)

2 8 9 7 6

4

7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle oracle

40 bytes

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4

7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle oracle

40 bytes

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

an oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle oracle

5 10 6 4 12(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

80 bytesan oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 2

7 11 3 9 6

80 bytesan oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

oracle

80 bytesan oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

oracle

3

80 bytesan oracle gives us the positions

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

(size=120 bytes) 2 8 9 7 6 4 27 11 3 9 6

oracle

3

120 bytesan oracle gives us the positions

when does it make sense to have an oraclehow can we minimize the cost

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 65 10 6 4 12 2 8 9 7 6 7 11 3 9 6

e.g., query x<5

CPU DATA MOVEMENT

MEMORY REQUIREMENT

SPACE REQUIREMENT

(ENERGY)

ROBUSTNESS

CPU DATA MOVEMENT

MEMORY REQUIREMENT

SPACE REQUIREMENT

(ENERGY)

SQL, NoSQL, Graph, Neural Nets, Statistics —> Data to Processing

ROBUSTNESS

CPU DATA MOVEMENT

MEMORY REQUIREMENT

SPACE REQUIREMENT

(ENERGY)

SQL, NoSQL, Graph, Neural Nets, Statistics —> Data to Processing

TIME —— CLOUD COSTS

ROBUSTNESS

Stratos Idreos