Data Management in the Cloud (Lecture 2) …datalab.cs.pdx.edu/education/clouddbms-spr2013/notes/CloudDb-2013 …

SCALABLE DATA STORES
Data Management in the Cloud (Lecture 2)

“I’ve failed over and over and over again in my life. And that is why I succeed.” – Michael Jordan

Overview

• New systems have emerged to address requirements of data management in the cloud
  – so-called “NoSQL” data stores
  – scalable SQL databases
• Horizontal Scaling
  – shared nothing
  – replicating and partitioning data over thousands of servers
  – distribute “simple operation” workload over thousands of servers
• Simple Operations
  – key lookups
  – reads and writes of one or a small number of records
  – no complex queries or joins
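To make "replicating and partitioning data" concrete, here is a minimal sketch of hash partitioning with replication. The function names, the MD5-based hash, the five in-memory "servers", and the replication factor of 3 are all illustrative assumptions, not part of the lecture or of any particular system.

```python
import hashlib

def partition_for(key, num_servers):
    """Map a key to a server index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

def replicas_for(key, num_servers, replication_factor=3):
    """The primary server plus the next servers in ring order hold copies."""
    primary = partition_for(key, num_servers)
    count = min(replication_factor, num_servers)
    return [(primary + i) % num_servers for i in range(count)]

# Five dictionaries stand in for five shared-nothing servers.
servers = [dict() for _ in range(5)]

def put(key, value):
    # A write goes to every replica of the key's partition.
    for idx in replicas_for(key, len(servers)):
        servers[idx][key] = value

def get(key):
    # A simple read only needs to contact the primary server.
    return servers[partition_for(key, len(servers))].get(key)

put("user:42", {"name": "fred"})
print(get("user:42"))
```

Because each simple operation touches one key, it is served by a small, fixed set of servers no matter how many servers exist, which is what allows this kind of workload to scale horizontally.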

Defining “NoSQL”

• No agreed-upon definition
  – “not only SQL”
  – “not relational”
  – …
• Six key features
  1. ability to horizontally scale simple operation throughput over many servers
  2. ability to replicate and distribute (partition) data over many servers
  3. simple call level interface or protocol (in contrast to a SQL binding)
  4. weaker concurrency model than ACID transactions of most relational (SQL) database systems
  5. efficient use of distributed indexes and RAM for data storage
  6. ability to dynamically add new attributes to data records

Based on: “Scalable SQL and NoSQL Data Stores” by R. Cattell, 2010

Data Models

• Terminology
  – tuple: row in a relational table, where attribute names and types are defined by a schema, and values must be scalar
  – document: supports both scalar values and nested documents, and the attributes are dynamically defined for each document
  – column family: groups key/value pairs (columns) into families to partition and replicate them; one column family is similar to a document as new (nested, list-valued) attributes can be added
  – object: analogous to objects in programming languages, but without procedural methods
• Relational
  – data is stored in relations (tables) of tuples (rows) of scalar values
  – queries expressed over arbitrary (combinations of) attributes
  – indexes defined over arbitrary (combinations of) attributes

Based on: “Scalable SQL and NoSQL Data Stores” by R. Cattell, 2010

Key/Value Data Model

• Interface
  – put(key, value)
  – get(key): value
• Data storage
  – values (data) are stored based on programmer-defined keys
  – system is agnostic as to the structure (semantics) of the value
• Queries are expressed in terms of keys
• Indexes are defined over keys
  – some systems support secondary indexes over (part of) the value

k1   v1
k2   v2
k3   v3
…
kn   vn
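The put/get interface above can be sketched in a few lines. The class name `KeyValueStore` and the choice of JSON-encoded bytes as the value are assumptions made for illustration; the point is that the store never looks inside the value.

```python
import json

class KeyValueStore:
    """In-memory stand-in for a key/value store (hypothetical, single node)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is an opaque blob: the store does not interpret it.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

store = KeyValueStore()
# The application, not the store, decides how the value is encoded.
store.put("k1", json.dumps({"name": "fred"}).encode("utf-8"))

raw = store.get("k1")
print(json.loads(raw)["name"])
```

Any query by something other than the key (for example "all records where name is fred") would have to be implemented by the application, unless the system offers a secondary index.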

Document Data Model

• Interface
  – set(key, document)
  – get(key): document
  – set(key, name, value)
  – get(key, name): value
• Data storage
  – documents (data) are stored based on programmer-defined keys
  – system is aware of the (arbitrary) document structure
  – support for lists, pointers and nested documents
• Queries expressed in terms of key (or attribute, if index exists)
• Support for key-based indexes and secondary indexes

k1   “name”:“fred”
k2   “name”:“mary”; “age”:“25”
k3   “name”:“oak st”
…
kn   “name”:“john”; “address”:“k3”
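A minimal sketch of the document interface follows. Python has no method overloading, so the attribute-level variants of set and get are given the hypothetical names `set_field` and `get_field`; the `find` method stands in for an attribute query and is implemented as a plain scan rather than a real secondary index. The assignment of “oak st” to key k3 is an assumption about the slide's figure.

```python
class DocumentStore:
    """In-memory stand-in for a document store (hypothetical names)."""

    def __init__(self):
        self._docs = {}

    def set(self, key, document):
        self._docs[key] = dict(document)

    def get(self, key):
        return self._docs.get(key)

    def set_field(self, key, name, value):
        # Attributes are defined per document, so any name is accepted.
        self._docs.setdefault(key, {})[name] = value

    def get_field(self, key, name):
        return self._docs.get(key, {}).get(name)

    def find(self, name, value):
        # Attribute query as a full scan; a real system would use an index.
        return sorted(k for k, d in self._docs.items() if d.get(name) == value)

docs = DocumentStore()
docs.set("k1", {"name": "fred"})
docs.set("k2", {"name": "mary", "age": "25"})
docs.set("k3", {"name": "oak st"})
docs.set("kn", {"name": "john"})
docs.set_field("kn", "address", "k3")   # a "pointer" is just another key

address = docs.get(docs.get_field("kn", "address"))
print(address["name"])
```

Unlike the key/value model, the store can see the attributes of each document, which is what makes attribute-based queries and secondary indexes possible.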

Column Family Data Model

• Interface
  – define(family)
  – insert(family, key, columns)
  – get(family, key): columns
• Data storage
  – <name, value, timestamp> triples (so-called columns) are stored based on a column family and key; a column family is similar to a document
  – system is aware of (arbitrary) structure of column family
  – system uses column family information to replicate and distribute data
• Queries are expressed based on key and column family
• Secondary indexes per column family are typically supported

[Figure: keys k1, k2, k3, …, kn with columns grouped into two column families labeled “Private” and “Public”; column values shown include “name”:“fred”, “name”:“mary”, “name”:“john”, “name”:“oak st”, “title”:“Mr”, “age”:“25”]
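The define/insert/get interface can be sketched as follows. The class name, the last-write-wins handling of timestamps, and the assignment of particular columns to the “Public” and “Private” families are assumptions for illustration only.

```python
import time

class ColumnFamilyStore:
    """In-memory stand-in for a column family store (hypothetical)."""

    def __init__(self):
        self._families = {}

    def define(self, family):
        self._families.setdefault(family, {})

    def insert(self, family, key, columns, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        row = self._families[family].setdefault(key, {})
        for name, value in columns.items():
            current = row.get(name)
            # Keep the column with the newest timestamp (assumed policy).
            if current is None or ts >= current[1]:
                row[name] = (value, ts)

    def get(self, family, key):
        row = self._families[family].get(key, {})
        return {name: value for name, (value, _ts) in row.items()}

cf = ColumnFamilyStore()
cf.define("Public")
cf.define("Private")
cf.insert("Public", "k2", {"name": "mary"}, timestamp=2)
cf.insert("Private", "k2", {"age": "25"}, timestamp=2)
cf.insert("Public", "k2", {"name": "stale"}, timestamp=1)  # older write loses

print(cf.get("Public", "k2"), cf.get("Private", "k2"))
```

Keeping each family in its own structure mirrors the idea that the system can place and replicate families independently of one another.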

Graph Data Model

• Interface
  – create: id
  – get(id)
  – connect(id1, id2): id
  – addAttribute(id, name, value)
  – getAttribute(id, name): value
• Data storage
  – data is stored in terms of nodes and (typed) edges
  – both nodes and edges can have (arbitrary) attributes
• Queries are expressed based on system ids (if no indexes exist)
• Secondary indexes for nodes and edges are supported
  – retrieve nodes by attributes and edges by type, start and/or end node, and/or attributes

[Figure: nodes n1, n2, n3 with attributes “name”:“fred”, “name”:“mary”;“age”:“25”, and “name”:“oak st”; two LIKES edges; edge attribute “weight”:“-1”]
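A minimal sketch of the graph interface is shown below. The `edge_type` parameter of `connect` and the `edges_from` helper are assumptions added to illustrate typed edges and a lookup by start node and type; the slide's interface does not spell these out.

```python
import itertools

class GraphStore:
    """In-memory stand-in for a graph store (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)   # system-generated ids
        self._attrs = {}                 # id -> attributes (nodes and edges)
        self._edges = {}                 # edge id -> (start, end, type)

    def create(self):
        node_id = next(self._ids)
        self._attrs[node_id] = {}
        return node_id

    def get(self, id_):
        return dict(self._attrs[id_])

    def connect(self, id1, id2, edge_type=None):
        edge_id = next(self._ids)
        self._edges[edge_id] = (id1, id2, edge_type)
        self._attrs[edge_id] = {}
        return edge_id

    def add_attribute(self, id_, name, value):
        self._attrs[id_][name] = value

    def get_attribute(self, id_, name):
        return self._attrs[id_].get(name)

    def edges_from(self, node_id, edge_type=None):
        # Scan standing in for an index on start node and edge type.
        return [eid for eid, (src, _dst, typ) in self._edges.items()
                if src == node_id and (edge_type is None or typ == edge_type)]

g = GraphStore()
fred, mary = g.create(), g.create()
g.add_attribute(fred, "name", "fred")
g.add_attribute(mary, "name", "mary")
likes = g.connect(fred, mary, "LIKES")
g.add_attribute(likes, "weight", "-1")
print(g.get(fred), g.edges_from(fred, "LIKES"))
```

Nodes and edges share one id space here so that attributes can be attached to either through the same two calls.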

Array Data Model

• Nested multi-dimensional arrays
  – cells can be tuples or other arrays
  – can have non-integer dimensions
• Additional “History” dimension on updatable arrays
• Ragged arrays allow each row or column to have a different length
• Supports multiple flavors of “null”
  – array cells can be “EMPTY”
  – user-definable treatment of special values

SciDB DDL

CREATE ARRAY Test_Array
  < A: integer NULLS,
    B: double,
    C: USER_DEFINED_TYPE >
  [ I=0:99999,1000,10,
    J=0:99999,1000,10 ]
PARTITION OVER ( Node1, Node2, Node3 )
USING block_cyclic();

• Attribute names: A, B, C
• Index names: I, J
• Chunk size: 1000
• Overlap: 10
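The chunk size and overlap in the declaration above can be illustrated with a little arithmetic. This sketch assumes that a chunk covers `chunk_size` consecutive index values per dimension and that "overlap" means each chunk additionally stores cells within that distance of its boundary; the function names are hypothetical and this is not SciDB code.

```python
def chunk_of(i, j, chunk_size=1000, low=0):
    """Chunk coordinates of cell (i, j) for the dimensions I and J."""
    return ((i - low) // chunk_size, (j - low) // chunk_size)

def chunks_per_dimension(low=0, high=99999, chunk_size=1000):
    """Number of chunks along one dimension."""
    return (high - low) // chunk_size + 1

def stored_range(chunk_index, chunk_size=1000, overlap=10, low=0, high=99999):
    """Index range held by a chunk along one dimension, overlap included."""
    start = max(low, low + chunk_index * chunk_size - overlap)
    end = min(high, low + (chunk_index + 1) * chunk_size - 1 + overlap)
    return (start, end)

print(chunk_of(2500, 99999))       # which chunk holds this cell
print(chunks_per_dimension())      # chunks along I (and along J)
print(stored_range(2))             # cells chunk 2 stores along one dimension
```

Under these assumptions, an operation that needs a cell's near neighbors (within 10 index positions) can run on a single chunk without fetching data from an adjacent one.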

Object Data Model

• Interface
  – set(object)
  – get(query): object
• Data storage
  – typed programming language objects (plus referenced objects) stored
  – attribute can be collection-valued
  – database is aware of the type (schema) of objects
• Objects are retrieved using queries or by traversal from “roots”
• Specialized indexes can be expressed based on schema

[Figure: two Person objects (“mary”, 25) and (“fred”, 27) connected by LIKES references, and an Address object (“oak st”) referenced via LIVES_AT]
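The object model can be sketched with typed classes and a store that traverses references from its roots. The class and field names, the use of a Python predicate as the "query", returning a list of matches, and the choice of which person LIVES_AT the address are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)          # identity comparison: safe with cyclic references
class Address:
    street: str

@dataclass(eq=False)
class Person:
    name: str
    age: int
    lives_at: Optional[Address] = None
    likes: list = field(default_factory=list)   # collection-valued attribute

class ObjectStore:
    """In-memory stand-in for an object database (hypothetical)."""

    def __init__(self):
        self._roots = []

    def set(self, obj):
        # Storing an object implicitly stores everything it references.
        self._roots.append(obj)

    def get(self, query):
        """Return all objects reachable from the roots that satisfy query."""
        seen, stack, result = set(), list(self._roots), []
        while stack:
            obj = stack.pop()
            if id(obj) in seen:
                continue
            seen.add(id(obj))
            if query(obj):
                result.append(obj)
            if isinstance(obj, Person):
                stack.extend(obj.likes)
                if obj.lives_at is not None:
                    stack.append(obj.lives_at)
        return result

mary, fred = Person("mary", 25), Person("fred", 27)
fred.likes.append(mary)
mary.likes.append(fred)
fred.lives_at = Address("oak st")

store = ObjectStore()
store.set(fred)
young = store.get(lambda o: isinstance(o, Person) and o.age < 26)
print([p.name for p in young])
```

Only `fred` is stored explicitly; `mary` and the address are found by following references, which corresponds to retrieval by traversal from roots.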

APPLICATION SCENARIO
Data Management in the Cloud

An Application Scenario

• As an example application scenario, we will use graph data management and processing throughout the course
• Graph data applications
  – social networking
  – Semantic Web (i.e. RDF graphs)
  – data provenance
  – Web site ranking (e.g. PageRank)
  – …
• No (mature) graph databases exist
  – graph data stores are available (Neo4j, OrientDB, …)
• Use existing (mature) non-graph database
  – graph data model must be mapped to data model of database
  – algorithms must be specified in database language
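As one illustration of what "mapping the graph data model to the data model of the database" can mean, the sketch below stores a graph as adjacency lists in a key/value model. The key naming scheme (`node:<id>:props`, `node:<id>:out`), the JSON encoding, and the helper names are assumptions; this is only one of several possible mappings.

```python
import json

kv = {}   # a dictionary stands in for a key/value store

def put(key, value):
    kv[key] = value

def get(key):
    return kv.get(key)

def store_node(node_id, properties, out_neighbors):
    # Two entries per node: its properties and its outgoing adjacency list.
    put(f"node:{node_id}:props", json.dumps(properties))
    put(f"node:{node_id}:out", json.dumps(sorted(out_neighbors)))

def node_properties(node_id):
    raw = get(f"node:{node_id}:props")
    return json.loads(raw) if raw is not None else None

def neighbors(node_id):
    raw = get(f"node:{node_id}:out")
    return json.loads(raw) if raw is not None else []

store_node("n1", {"name": "fred"}, ["n2"])
store_node("n2", {"name": "mary", "age": "25"}, ["n1", "n3"])
store_node("n3", {"name": "oak st"}, [])

print(node_properties("n1"), neighbors("n2"))
```

With this mapping, both a property lookup and a neighbor lookup are single key reads, while any multi-hop algorithm has to be written in application code as a sequence of such reads.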

Graph Data

Graph Processing

• Classical graph algorithms
  – shortest path
  – bridges
  – transitive closure
• “Web 2.0”
  – friend of a friend
  – who follows whom?
  – who might know whom?
• Social network analysis
  – degree centrality
  – closeness centrality
  – betweenness centrality
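The three classical algorithms named above can be sketched on an in-memory adjacency-list graph. The function names and example graphs are assumptions, and the bridge finder deliberately uses the naive definition (remove an edge and test whether its endpoints are still connected) rather than a linear-time algorithm.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search; returns a shortest path in an unweighted graph."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def reachable(graph, source):
    """Nodes reachable from source by following at least one edge."""
    seen = set()
    queue = deque(graph.get(source, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, ()))
    return seen

def transitive_closure(graph):
    return {node: reachable(graph, node) for node in graph}

def bridges(undirected_edges):
    """Edges whose removal disconnects their two endpoints (naive check)."""
    result = []
    for edge in undirected_edges:
        adj = {}
        for u, v in undirected_edges:
            if (u, v) != edge:
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        if edge[1] not in reachable(adj, edge[0]):
            result.append(edge)
    return result

directed = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
print(shortest_path(directed, "d", "c"))
print(transitive_closure(directed)["a"])
print(bridges([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]))
```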

Classical Graph Algorithms

Social Network Analysis

• degree centrality: the number of links a node has
• closeness centrality: distance defined by length of shortest paths, nodes with smaller sum of shortest paths are “closer”
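The two definitions above translate directly into code for an undirected, unweighted graph. This sketch assumes a connected graph stored as adjacency lists and reports closeness as the inverse of the summed shortest-path lengths, which is one common convention; the function names are hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def degree_centrality(graph):
    """Number of links per node."""
    return {node: len(neigh) for node, neigh in graph.items()}

def closeness_centrality(graph):
    """Inverse of the sum of shortest-path lengths (assumes connectivity)."""
    result = {}
    for node in graph:
        total = sum(bfs_distances(graph, node).values())
        result[node] = 1.0 / total if total > 0 else 0.0
    return result

# Path graph a - b - c - d
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degree_centrality(path))
print(closeness_centrality(path))
```

In the path graph the two middle nodes have both the highest degree and the smallest sum of distances, so they score highest on both measures.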

Social Network Analysis

• betweenness centrality: “the number of times a node acts as a bridge along the shortest path between two other nodes”
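Betweenness can be computed by counting shortest paths between every pair of nodes. The sketch below assumes an undirected, unweighted graph and uses the common fractional convention: when a pair has several shortest paths, each intermediate node is credited with the fraction of those paths that pass through it. It favors clarity over speed and is not the faster algorithm normally used in practice.

```python
from collections import deque

def bfs_counts(graph, source):
    """Distances and numbers of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]
    return dist, sigma

def betweenness_centrality(graph):
    info = {node: bfs_counts(graph, node) for node in graph}
    nodes = list(graph)
    score = {node: 0.0 for node in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:              # each unordered pair once
            if t not in dist_s:
                continue
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
print(betweenness_centrality(path))
print(betweenness_centrality(cycle))
```

In the path graph only the two inner nodes ever act as bridges; in the four-node cycle every opposite pair has two equally short routes, so the credit is split evenly.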

Centrality – In Pictures

A – Degree Centrality, B – Closeness Centrality, C – Betweenness Centrality

Source: Wikipedia: http://en.wikipedia.org/wiki/Centrality

Usage Profiles

• Database updates and consistency
  – data changes frequently, results need to be accurate
  – data changes infrequently, results need to be accurate
  – query results may not reflect the latest state of the database
• Different types of queries
  – point-based, bound: a node, a node and its neighbors, friends of a friend
  – point-based, unbound: a node and all its reachable nodes, shortest path between two nodes
  – graph-based, per node: centralities
  – graph-based, all: bridges, transitive closure
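The difference between bound and unbound point-based queries shows up in how many adjacency lookups each one needs. The sketch below counts lookups against a hypothetical `CountingGraph` wrapper; the six-node chain and all names are assumptions for illustration.

```python
class CountingGraph:
    """Adjacency lists plus a counter of neighbor lookups (hypothetical)."""

    def __init__(self, adjacency):
        self._adj = adjacency
        self.lookups = 0

    def neighbors(self, node):
        self.lookups += 1
        return self._adj.get(node, [])

def friends_of_a_friend(g, node):
    """Bound query: never looks further than two hops from the start node."""
    direct = set(g.neighbors(node))
    result = set()
    for friend in direct:
        result.update(g.neighbors(friend))
    return result - direct - {node}

def all_reachable(g, node):
    """Unbound query: may end up visiting the whole graph."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for nxt in g.neighbors(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen - {node}

# Chain n0 - n1 - n2 - n3 - n4 - n5
chain = {f"n{i}": [f"n{j}" for j in (i - 1, i + 1) if 0 <= j < 6]
         for i in range(6)}

bound = CountingGraph(chain)
fof = friends_of_a_friend(bound, "n0")
unbound = CountingGraph(chain)
reach = all_reachable(unbound, "n0")
print(fof, bound.lookups)
print(sorted(reach), unbound.lookups)
```

The cost of the bound query depends only on the neighborhood of the start node, whereas the cost of the unbound query grows with the size of the reachable part of the graph, which matters once the data is partitioned over many servers.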

Course Project

• The course will be accompanied by a project that is based on the scenario of graph data management and processing
  – Task 1: Study question that will take you through a “dry run” of mapping the graph data model to a NoSQL data model and make you think about how to answer some simple queries.
  – Task 2: Groups of 4-5 students will pick a NoSQL system and compile a systems profile, based on papers and documentation. These profiles will be presented in class.
  – Task 3: Groups of 4-5 students will design a graph management and processing system based on the previously chosen NoSQL system. This time for real!
  – Task 4: Groups of 4-5 students implement a prototype of the graph data management and processing system.

Task 1: Application Design “Dry Run”

• The goal of this project is to complete the following tasks
  – Pick a NoSQL data model and map the graph data model into that model
  – In words or pseudo-code, describe how you would do the queries below
    • find a node's properties based on its identifier
    • find the neighbors of a given node
    • find the "friends of a friend" of a given node
    • transitive closure
    • bridges
  – Discuss how different usage profiles presented in the lecture (Slide #51) affect the processing of these queries
• Deliverable is a written report
• Students will conduct this part of the project individually

Task 2: Systems Profile

• Horizontally scalable data management systems
  – Riak (http://wiki.basho.com/)
  – Project Voldemort (http://project-voldemort.com/)
  – CouchDB (http://couchdb.apache.org/)
  – SimpleDB (http://www.amazon.com/simpledb/)
  – HBase (http://hbase.apache.org/)
  – Cassandra (http://cassandra.apache.org/)
  – OrientDB (http://www.orientechnologies.com/)
• Groups of 4-5 students collaborate on a systems profile
  – groups will be formed on April 15
  – decide in advance with whom you would like to work on what

Task 2: Systems Profile

• Data Model
  – Precise description of the data model, especially in terms of differences from the "standard" models presented in the lecture
  – Detailed summary of the basic data manipulation API, i.e. features to create, retrieve, update and delete data items.
• Query Support
  – Supported query types, i.e. point, range, navigation, and/or arbitrary?
  – What is the query language of the system? Is it declarative, functional, algebraic and/or imperative?
  – Are queries automatically optimized?
• Indexes
  – What index structures are available?
  – What can be indexed? What can be a key? What can be a value?
  – How are indexes managed, i.e. manually or automatically?

Task 2: Systems Profile

• Storage
  – Disk or file storage
  – In-memory (RAM)
  – Flash or SSD
  – Traditional database
  – Cloud Storage (GFS, HDFS, S3)
• Transactions and Concurrency Control
  – Does the system support transactions?
  – How are transactions implemented, i.e. locks, OCC, MVCC, etc.?
  – What consistency guarantees are given?

Task 2: Systems Profile

• Scalability and Replication
  – What types of replication are supported, i.e. synchronous or asynchronous?
  – ...
• Platform/Deployment
  – What cloud infrastructures are supported?
  – What deployment scenarios are supported, i.e. embedded, client/server, multi-core CPU, cloud, etc.?
  – Language bindings?
  – Communication protocols, i.e. JSON, REST, etc.?

Task 3: Application Design

• Design the example graph data management in the previously profiled system
  – similar to Task 1, but more technical as it is based on a concrete system
  – consider only the "friends of a friend", transitive closure and bridges queries
  – insights for other queries optional, but highly welcome and appreciated
• Deliverable is a ten-minute presentation in class
  – November 15
  – discuss final design and motivate the design choices made w.r.t. the requirements of the application and the capabilities of the system
  – give details on the mapping of data structures, planned indexes, and query implementation strategies
• Same groups of 4-5 students will continue to collaborate

Task 4: Prototype Implementation

• Implement a small prototype based on previous design
  – data model (with data loading capabilities)
  – three queries mentioned before
• The goal of this part of the project is to realize a small application and experience its performance in practice
• Alternative Option: If you feel you lack implementation experience to complete this task, you may contribute to "benchmarking" the systems built by your peers
• Deliverable is the source code developed by the students
• The same groups of 4-5 students continue to collaborate
• The team with the best solution in terms of design and performance will win a prize!

References

• M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, M. Zaharia: Above the Clouds: A Berkeley View of Cloud Computing. Tech. Rep. No. UCB/EECS-2009-28, 2009.
• D. J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1), pp. 3-12, 2009.
• R. Agrawal, A. Ailamaki, P. A. Bernstein, E. A. Brewer, M. J. Carey, S. Chaudhuri, A. Doan, D. Florescu, M. J. Franklin, H. Garcia-Molina, J. Gehrke, L. Gruenwald, L. M. Haas, A. Y. Halevy, J. M. Hellerstein, Y. E. Ioannidis, H. F. Korth, D. Kossmann, S. Madden, R. Magoulas, B. Chin Ooi, T. O’Reilly, R. Ramakrishnan, S. Sarawagi, M. Stonebraker, A. S. Szalay, G. Weikum: The Claremont Report on Database Research. 2008.
• R. Cattell: Scalable SQL and NoSQL Data Stores. SIGMOD Rec. 39(4), pp. 12-27, 2010.