Analysis of Cloud Data Management Systems Student:Miro Szydlowski Supervisor: Prof. Mehmet Orgun...
Analysis of Cloud Data Management Systems
Student: Miro Szydlowski
Supervisor: Prof. Mehmet Orgun
Date: 11.11.11
INTRODUCTION
Timeline, 1970 to 2011: Relational Database Management Systems, Distributed Databases, NoSQL, Cloud Data Stores, ?
1/22
Presentation Plan
• Origins of Database Management Systems
  • Rise to power
  • ACID qualities
• Problems and Solutions
  • Consequences of being popular
  • Partitioning, Replication, Load Balancing
  • Distributed Database Management Systems
• Challenges of Connected World
• Cloud Computing
  • Definition, Type
  • Place of DBMS in Cloud
• Cloud Data Management Systems
  • CAP, BASE, NoSQL and a few other concepts
  • NoSQL by implementation type
  • Example: Amazon SimpleDB
• Which one to choose?
2/22
Database Management Systems
“…set of software programs that control the organisation, storage, management and data retrieval”
Database Models:
• Hierarchical
• Network
• Relational
• Object-relational
3/22
Origins of Relational Database Management Systems
• 1970: E. F. Codd publishes the relational model (IBM Research Laboratory, San Jose, California)
• Over the following 20 years it became not only accepted and essential, but was considered the only solution for enterprise data storage
• Why?
  • Data normalisation
  • Metadata reuse
  • User Views <-> Community View <-> Storage
  • SQL!
  • Guarantees data integrity - ACID
4/22
ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Provides consistent state of the database
• …but at a cost
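Atomicity can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `account` table, the names and the amounts are assumptions made up for this example, not part of the presentation. A transfer credits one account and debits another; if the debit violates a constraint, the credit is rolled back as well.

```python
import sqlite3

# Hypothetical two-account ledger; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account"))


first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # debit would go negative: rolled back
final = balances(conn)
```

The second transfer fails on the debit, so the credit that was already applied inside the same transaction is undone and the balances stay as they were after the first transfer.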
5/22
Problems and Solutions
Very successful solution, but the businesses were growing…
• Data volume
• Data warehousing, business intelligence
• Mergers and acquisitions
• WWW
New Solutions:
• Partitioning
  • Hardware
  • Horizontal
  • Vertical
• Replication
  • Multi-master
  • Master-Slave
• Load Balancing
• …and finally
  • Distributed Database Management Systems
…but the challenges kept coming…
6/22
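The difference between horizontal and vertical partitioning can be sketched on a toy table held in memory. The rows, field names and helper functions below are assumptions for illustration only; real systems partition at the storage layer.

```python
# Toy "customer" table as a list of rows; field names are illustrative.
rows = [
    {"id": 1, "name": "Ann", "country": "AU", "bio": "bio of Ann"},
    {"id": 2, "name": "Bo", "country": "PL", "bio": "bio of Bo"},
    {"id": 3, "name": "Cy", "country": "AU", "bio": "bio of Cy"},
]


def horizontal_partition(rows, key):
    """Split by rows: each part has all columns but only some of the rows."""
    parts = {}
    for row in rows:
        parts.setdefault(row[key], []).append(row)
    return parts


def vertical_partition(rows, pk, column_groups):
    """Split by columns: each part has all rows but only some columns.

    Every part keeps the primary key so the rows can be joined back together.
    """
    return [
        [{pk: row[pk], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]


by_country = horizontal_partition(rows, "country")
core, extra = vertical_partition(rows, "id", [["name", "country"], ["bio"]])
```

Here `by_country` holds one list of full rows per country, while `core` and `extra` hold every row but with different column subsets.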
Challenges of the Connected World
• Search Engines
• Mobile Devices
• Business-To-Business (Web Services)
• Stream Processing
• Data Warehousing
• Directory Services
Current example: 2011 Twitter statistics:
• 1 billion Tweets per week
• 140 million Tweets per day on average
• 177 million Tweets sent on March 11, 2011
• Current record: 6,939 TPS (Tweets per second), set 4 seconds after midnight in Japan on New Year’s Day
New solutions needed ASAP
7/22
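To put the Twitter figures above on a common scale, the short calculation below converts them to per-day and per-second rates. It uses only the numbers quoted on the slide; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_week = 1_000_000_000
tweets_per_day = 140_000_000
peak_tps = 6_939

# One billion per week is about 142.9 million per day,
# consistent with the quoted daily average of 140 million.
weekly_as_daily = tweets_per_week / 7

# The daily average corresponds to about 1,620 Tweets per second.
average_tps = tweets_per_day / SECONDS_PER_DAY

# The record burst is roughly 4.3 times the average rate.
peak_to_average = peak_tps / average_tps
```

The gap between the average and the peak rate is the reason capacity has to be planned for bursts rather than for the mean load.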
What is Cloud Computing?
• Lots of definitions, one of them below:
“…a pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption” (James Staten)
• Automation
• Virtualization
• Scalability
• Pay-as-you-go pricing model
8/22
Cloud Computing Types
Diagram panels: By Deployment Type, By Service Type
Stack layers, top to bottom:
• Application Software
• Infrastructure Software
• Operating System
• Virtualisation Layer
• Server Hardware
• Network, Firewalls
• Data Centre Infrastructure
Service types shown alongside the stack:
• Infrastructure as a Service
• Platform as a Service
• Software as a Service
Cloud Data Management Systems? IaaS or PaaS
9/22
Dark Cloud
Beginning of the 21st century: open critique of relational database management systems:
• Too complex for an average user
• Can’t cope with data volumes
• Relational mapping is overkill
• One size doesn’t fit all: we want to prioritize some features
• Why do we need to build the ORM?
• Distributed RDBMSs are fake!
• Scalability!
Why don’t we re-engineer and rebuild instead of constantly ‘patching’ RDBMS?
10/22
CAP and BASE
Eric Brewer made a statement at an ACM symposium in 2000:
It is impossible to provide all three qualities of a “shared-data system” at once:
• Consistency
• Availability
• Partition Tolerance
…so pick any two!
Since we can’t guarantee ACID, let’s BASE our systems on another principle:
• Basically Available
• Soft State
• Eventually Consistent
These two ideas changed the approach to database design…
…and gave birth to the ‘NoSQL’ movement
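The "eventually consistent" part of BASE can be shown with a toy model. The classes below are assumptions made for this sketch, not a real replication protocol: a write is acknowledged by one replica at once and delivered to the others later, so a read from another replica may be stale until the updates are propagated.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes go to replica 0 and are queued for the remaining replicas."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) not yet delivered

    def put(self, key, value):
        self.replicas[0].data[key] = value  # acknowledged immediately
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica=0):
        return self.replicas[replica].data.get(key)  # may return stale data

    def sync(self):
        """Deliver all queued updates in order; replicas converge afterwards."""
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("x", 1)
stale = store.get("x", replica=2)  # update not delivered yet
store.sync()
fresh = store.get("x", replica=2)  # replicas have converged
```

The store stays available for reads on every replica the whole time; what is given up is the guarantee that all replicas agree at every moment.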
11/22
A few new concepts
Hash-based partitioning
• a certain property of each entity is used to calculate a hash value, which determines which database server stores the entity
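A minimal sketch of hash-based partitioning, assuming four hypothetical servers and a string key as the hashed property. A stable hash from hashlib is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib

SERVERS = ["db0", "db1", "db2", "db3"]  # hypothetical server names


def server_for(key, servers=SERVERS):
    """Pick the server for an entity from a stable hash of its key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]


# The same key always maps to the same server.
placement = {key: server_for(key) for key in ("user:1", "user:2", "user:3")}
```

With plain modulo placement, changing the number of servers moves most keys to a different server; consistent hashing is one common way to reduce that reshuffling.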
‘Shared nothing’ architecture
• cluster of independent machines that communicate over a high-speed network
Sharding
• splitting up a database across multiple machines
MapReduce
• not a database system, but a programming framework
• every job sent is divided into two parts: a ‘Map’ and a ‘Reduce’
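The two parts of a MapReduce job can be sketched with the usual word-count example. This is a single-process illustration with made-up function names; a real framework runs the map and reduce calls in parallel on many machines and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["a b a", "b c"])
```

Because each map call sees one document and each reduce call sees one key, both phases can be distributed without the calls needing to share state.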
12/22
NoSQL Movement
• Their main objection: unnecessary complexity of relational databases
• Motto: “select the right tool for the job”
• “Tool in the box” approach
• Principles of NoSQL data stores:
  • Built for performance
  • Built for real scalability
  • Built for high availability
  • Typically use a very specific data access pattern
  • Either schemaless or implementing very simple schemas
  • Weak consistency guarantees
  • Declarative query language (such as SQL) replaced with simple APIs
13/22
NoSQL Databases by Implementation Type
• Key/Value Stores
• BigTable
• Document-based
• Columnar
(also graph, object-oriented, distributed object stores and dozens of others…)
14/22
Key/Value Stores
• Data is stored as a key/value pair
• Basic APIs: Put/Get/Remove
• Scalability: Sharding or Replicating data items
• Advantages: Performance and scalability
• Best For: High-performance systems that deal with one type of object
• Examples: HBase, SimpleDB, Cassandra
• Potential Issues: Data integrity has to be supported by the application; supports only one type of query
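The Put/Get/Remove interface can be sketched as a thin wrapper around a dictionary. The class and method names are assumptions for illustration and do not correspond to the API of any product listed above.

```python
class KeyValueStore:
    """In-memory sketch of a key/value store with three basic operations."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def remove(self, key):
        """Delete the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", {"user": "ann", "cart": ["book"]})
session = store.get("session:42")
removed = store.remove("session:42")
missing = store.get("session:42")
```

Lookup by key is the only query the interface offers; finding every session for a given user would need a scan or a second key maintained by the application, which is the integrity burden noted above.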
15/22
‘BigTable’ Databases
• Named after Google’s ‘BigTable’ implementation
• Each row can have a different set of columns
• A row can have thousands of columns
• Records can have multiple fields
• Records are indexed by [row-key, column-key, timestamp]
• Usually sharded
• Advantages: Highly optimized for write operations, highly scalable, (quoted) extremely even performance
• Examples: Google Analytics, Google Docs, Microsoft Azure Tables
• Potential Issues: Lack of text search, very difficult to import and export data (query times out after 30 sec)
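The [row-key, column-key, timestamp] index can be sketched as a sparse map of versioned cells. The `SparseTable` class is an assumption for illustration, not Google's implementation: each row may use different columns, and a read returns the newest version at or before a given timestamp.

```python
class SparseTable:
    """Cells addressed by (row_key, column_key), each holding timestamped versions."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> list of (timestamp, value)

    def put(self, row_key, column_key, value, timestamp):
        versions = self._cells.setdefault((row_key, column_key), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])  # oldest first

    def get(self, row_key, column_key, timestamp=None):
        """Return the latest value, or the latest one not newer than `timestamp`."""
        versions = self._cells.get((row_key, column_key), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None

    def columns(self, row_key):
        return sorted(col for (row, col) in self._cells if row == row_key)


table = SparseTable()
table.put("page:home", "title", "Home", timestamp=1)
table.put("page:home", "title", "Welcome", timestamp=5)
table.put("page:about", "author", "ann", timestamp=2)

latest = table.get("page:home", "title")
earlier = table.get("page:home", "title", timestamp=3)
```

The two rows have different column sets, and nothing is stored for the columns a row does not use.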
16/22
Document Databases
• Completely schemaless
• All document data is stored in the document itself
• Document usually encoded in JSON, BSON, XML
• Scalability: good, implementing asynchronous replication
• Advantages: client application can store data in its final form; support custom views
• Examples: CouchDB, MongoDB, Terrastore
• Best For: wikis, blogs, document management systems
• Potential Issues: They actually don’t outperform RDBMS, not well supported
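A minimal sketch of the document model, using JSON text in a dictionary as a stand-in for a document database. The helper names are hypothetical and are not the API of CouchDB or MongoDB. Two documents with different fields live in the same collection, and a "view" is simply a filter over the decoded documents.

```python
import json

collection = {}  # document id -> JSON text


def save(doc_id, document):
    """Store the document as self-contained JSON text."""
    collection[doc_id] = json.dumps(document)


def load(doc_id):
    return json.loads(collection[doc_id])


def view(predicate):
    """Return the ids of all documents for which `predicate` is true."""
    return sorted(doc_id for doc_id in collection if predicate(load(doc_id)))


# No shared schema: each document carries its own fields.
save("post-1", {"type": "blog", "title": "Hello", "tags": ["intro"]})
save("page-7", {"type": "wiki", "title": "CAP", "revisions": 3})

wiki_pages = view(lambda doc: doc.get("type") == "wiki")
tagged = view(lambda doc: "tags" in doc)
```

Since there is no schema to enforce, the view code has to tolerate missing fields, which is why `doc.get` and the `in` check are used.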
17/22
Columnar Databases
• ‘Between’ SQL and NoSQL: can use SQL syntax, but use wide columns
• Each column is stored separately, at a different disk location
• Scalability and Performance: both good because rows and columns can be split across multiple nodes (rows by sharding, columns by column groups)
• Advantages: great when you need data aggregation
• Examples: Vertica, HBase
• Best At: Data warehousing, data mining
• Potential Issues: Not great at handling complex relationships; better than RDBMS only when row size is big and not many columns of a single row are required
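The trade-off described above can be sketched by storing the same toy data row by row and column by column. The data and function names are made up for this example, and every row is assumed to have the same fields.

```python
rows = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EU", "amount": 50},
]


def to_columns(rows):
    """Turn a list of rows into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def fetch_row(columns, index):
    """Rebuild one full row; this has to touch every column."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columns(rows)

# Aggregation: the row layout reads every full row,
# the column layout reads only the 'amount' values.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

second_order = fetch_row(columns, 1)
```

Aggregating one column is cheap in the column layout, while reassembling a whole row is the expensive operation, which matches the "Potential Issues" bullet.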
18/22
Example: Amazon SimpleDB
Data Store Type: Entity-Attribute-Value
Data Model: Document Store/Big Table
Cloud Type: Platform as a Service
• The data model is based on domains, items, attributes and values:
  • Domains are currently limited to 10 GB each, and each account is limited to 100 domains
  • Domains are collections of items that are described by attribute-value pairs
• Doesn’t have the concept of schema: everything is a string
• Designed for reads rather than writes
• Updates done to central database ONLY and distributed to ‘slaves’
• Client interface: SOAP and REST
• Availability: multiple geographically distributed copies of each data item
• Scalability: Great
• Pay-as-you-go model: Clients are charged by data storage, data transfer and machine utilization
• Potential Issues: eventual consistency, no data types or constraints
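One consequence of "everything is a string" can be shown with a plain-Python model of a domain. The functions below are hypothetical and are not the SimpleDB API. If comparisons are made on the stored strings, they are lexicographic, so numeric values need a fixed-width encoding such as zero padding to compare as expected.

```python
domain = {}  # item name -> {attribute name: string value}


def store_item(item_name, attributes):
    """Store attribute-value pairs; every value is kept as a string."""
    item = domain.setdefault(item_name, {})
    for name, value in attributes.items():
        item[name] = str(value)


def find_items(attribute, predicate):
    """Return names of items whose attribute value satisfies `predicate`."""
    return sorted(
        name
        for name, attrs in domain.items()
        if attribute in attrs and predicate(attrs[attribute])
    )


store_item("item1", {"price": 9})
store_item("item2", {"price": 10})

# String comparison: "10" sorts before "5", so item2 is missed.
naive = find_items("price", lambda value: value > "5")

# Zero-padded strings compare in the same order as the numbers.
store_item("item1", {"price": f"{9:05d}"})
store_item("item2", {"price": f"{10:05d}"})
padded = find_items("price", lambda value: value > f"{5:05d}")
```

Without data types or constraints in the store, this kind of encoding rule has to be agreed on and enforced by the client application.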
19/22
Summary – RDBMS or NoSQL?
• If you have a low-volume, medium-complexity suite of applications, don’t change it: this is what RDBMSs are good at
• If your data is normalized and you rely on joins, don’t move to schemaless NoSQL
• If you’re looking for an off-the-shelf system and don’t want to get involved in customized development, choose RDBMS
• If your problem can’t be resolved using RDBMS [e.g. you have serious scalability issues] and you’re determined to fix it at any cost, go ‘NoSQL’
• If you have access to sufficient quantities of sufficiently smart people, choose NoSQL
It depends…
20/22
Summary – RDBMS or NoSQL?
‘choose the right tool for the job’
21/22
Questions?
22/22