NoSQL Architecture Overview
OVER 400 CUSTOMERS TRUST THEIR DATABASES TO RDX
RDX Insights Series Presentation – Introduction to NoSQL Architectures
Chris Foot, VP DB Technologies, RDX
March 23, 2017
Video recording of this presentation can be found on RDX’s YouTube Channel: https://lnkd.in/g96cbUV
www. .com
NoSQL Product Offering Analysis
NoSQL Competitors
Persistent DB
Document
• Pairs a key with a complex data structure called a document
• Records not required to have uniform structure
• MongoDB, CouchDB, DynamoDB, Couchbase, MarkLogic
Wide-Column
• Record can have billions of columns
• Tables are collections of columns, rather than rows
• Column names and record keys are not fixed
• Cassandra, Bigtable, HBase, Accumulo
Key-Value
• All items are stored as indexed key-value pairs
• Redis, Riak, Memcached, Oracle NoSQL, DynamoDB
Graph
• Stores nodes (data elements) with relationships
• Interconnected, strong relationships
• Neo4j, DataStax Cassandra, Titan, ArangoDB
IN-MEMORY DB
In-Memory
• Operations performed in memory
• Lightning fast read/write
• Often use Key-Value or Wide-Column as data store
• Redis, Memcached, Oracle TimesTen, SAP HANA
RDBMS and NoSQL will Merge
NoSQL vendors’ desire to increase market share will drive them to compete directly with relational product manufacturers
Vendors will add RDBMS-like functionality that allows their product to be more widely adopted. Those that don’t will quickly lose market share to those that do
The larger relational vendors will attempt to co-opt any NoSQL technology that challenges their dominant role in the industry
As they identify offerings as tangible threats, their strategy will be to ensure that the technologies used by those vendors become a component of, not a replacement for, their traditional database products
Relational DBMS
NoSQL DBMS
General Purpose DBMS
Unstructured Data Examples
NoSQL Adoption Drivers - Modern Applications
• Single View
• Sensor Data
• Biometrics
• Radiology
• Videos, Images
• Weather Data
• Catalogs
• Content Management
• Geospatial
• Social Data
• IDC: Unstructured data is growing at the rate of 62% per year
• IDC: By 2022, 93% of all data in the digital universe will be unstructured
• Gartner: Data volume is set to grow 800% over the next 5 years and 80% of it will reside as unstructured data
NoSQL architectures leverage horizontal scalability to cost effectively handle large volumes of data and/or users
NoSQL Adoption Drivers – Horizontal Scaling
Horizontal
Vertical
Relational and NoSQL Parallel Adoption Drivers
Hierarchical and Network Databases – IMS and CODASYL/Network
Logical and physical layers entirely dependent upon each other. Both data storage and data navigation were rigidly defined. Programs were required to follow the prebuilt paths to navigate through the stored data
Early Releases of DB2
• Flexibility
• Separate logical and physical layers - schema
• Set vs row processing
• Ease of use
• SQL language was intuitive
• Poor performance
• Crude locking, transaction management and limited features
Early Releases of Oracle
• Flexibility
• Easy to use
• Lower Total Cost of Ownership (support, product costs)
• Low cost commodity hardware (as in it didn’t need a mainframe)
• Crude locking, transaction management and limited features
Early Releases of NoSQL
• Flexibility
• Easy to use
• Lower Total Cost of Ownership (support, product costs)
• Faster application development
• Architected to scale horizontally for availability and performance
• Crude locking, transaction management and limited features
“Niche implementations, crude technically, will never become popular, no features - no future”.
Pretty much… “Your career is going to be toast.”
ACID vs BASE
ACID (Relational)
• Atomicity: All operations in a single transaction succeed or fail as a group. No partial operations
• Consistency: The database is never in an inconsistent state
• Isolation: Transactions do not interfere with one another. Contentious data access is handled by the database to make the transactions appear to run sequentially
• Durability: Transactions are permanent in the presence of failures
BASE (NoSQL Distributed Tradeoff)
• Basic Availability: The system is able to tolerate a partial failure (loss of a single node, for example)
• Soft State: The state of the system is in flux and may change over time because of eventual consistency (next bullet)
• Eventual Consistency: As data is added to the system, changes are gradually replicated across all nodes. Data may be inconsistent in the short term but will eventually become consistent
The application is given a greater responsibility for data management in systems that don’t follow ACID
Leads to complex application code when strong consistency is needed across replicated nodes
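The atomicity bullet above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational engine; the accounts table, the column names and the transfer function are hypothetical and are not part of the presentation.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Hypothetical two-account ledger held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

ok = transfer(conn, "a", "b", 60)       # both updates are applied
failed = transfer(conn, "a", "b", 60)   # would overdraw "a", so the debit is rolled back
```

After the failed transfer the debit that was already executed is undone, so no partial operation is visible. In a BASE system this kind of all-or-nothing guarantee across nodes is typically left to the application.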
CAP Theorem: Distributed Systems – Pick C or A
[Diagram: triangle with corners Consistency (C), Availability (A) and Partition Tolerance (P)]
• Consistency: All clients see the same data
• Availability: All clients can read and write
• Partition Tolerance: System continues to work during network partitions
• CP: MongoDB, Redis, BigTable, HBase, MemcacheDB
• CA: Oracle, SQL Server, MySQL…
• AP: Cassandra, Riak, CouchDB, DynamoDB
CAP Theorem
[Diagram: two groups of nodes that normally synchronize data, separated by a network partition]
• AVAILABILITY: Both sides of the partition allow updates. System is available, but data is inconsistent due to lack of synchronization (INCONSISTENT)
• CONSISTENCY: One side allows updates, the other prevents updates. Data is in sync because only one node allows updates. The system is unavailable to one group of users (UNAVAILABLE)
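The two outcomes on this slide can be sketched with a toy two-node model. Everything here is an illustrative assumption: the class name, the "left" and "right" node names, and the choice of "left" as the side that keeps serving requests in CP mode.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request during a partition."""

class TwoNodeCluster:
    """Two nodes that stay in sync until the network partitions.

    mode="AP": both nodes keep serving requests during a partition.
    mode="CP": only 'left' serves requests during a partition.
    """

    def __init__(self, mode):
        self.mode = mode
        self.partitioned = False
        self.data = {"left": {}, "right": {}}

    def _check(self, node):
        if self.partitioned and self.mode == "CP" and node != "left":
            raise Unavailable(f"{node} is cut off and refuses requests")

    def write(self, node, key, value):
        self._check(node)
        self.data[node][key] = value
        if not self.partitioned:
            for replica in self.data.values():  # synchronize while the network is healthy
                replica[key] = value

    def read(self, node, key):
        self._check(node)
        return self.data[node].get(key)

def run(mode):
    """Write on both sides of a partition and report what each side returns."""
    cluster = TwoNodeCluster(mode)
    cluster.write("left", "x", 1)   # replicated to both nodes
    cluster.partitioned = True      # the network splits
    cluster.write("left", "x", 2)
    try:
        cluster.write("right", "x", 3)
        right = cluster.read("right", "x")
    except Unavailable:
        right = "UNAVAILABLE"
    return cluster.read("left", "x"), right

ap_result = run("AP")   # both sides answer, with different values
cp_result = run("CP")   # one side answers, the other refuses
```

In AP mode the two groups of users read different values for the same key; in CP mode every answer that is served agrees, at the cost of refusing one group of users.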
Why Did RDX Choose MongoDB?
Business Drivers
• Industry analyst evaluations
• Customer use cases and recommendations
• Largest commercial investment in any database vendor
• Popularity
  • 10 million+ software downloads
  • 1,000 partners
  • 2,000 customers
  • 1/3 of the Fortune 100
• Robust training available
• Strong open source community
• Excellent partnership support
Technical Drivers
• Wide scope of potential application
• Low TCO
• Combines capabilities of relational databases with next generation NoSQL technologies
• Schemaless, flexible data model
• Nonstructured data support
• Easily accommodates large data volumes
• Rich query capability
• Strong, tunable consistency model
• Elastic, horizontal scalability
• Easily configurable system resiliency
• Vendor provided database support
Craigslist, New York Times, Verizon, Viacom, AstraZeneca, MTV, Google, Genentech, Adobe, GAP, Cisco, MetLife, Facebook, Expedia, eBay, Edmunds, Washington Post, Aol, ADP, Forbes, Intuit, The Weather Channel, Carfax…
MongoDB Features
• Multiple storage engines
  • WiredTiger
  • InMemory
  • Encrypted
  • Third-Party
  • MMAPv1
• Indexing
  • Enforce uniqueness on user defined and Object ID fields
  • Partial – Only indexed if they meet filter expression
  • Sparse – Only indexed if field is populated
  • Compound – Multiple column index
  • Multikey – Indexes on arrays
  • TTL – Allow documents to be purged based on time
  • Text Search
  • Hash – Creates random values
• Easily ingests large, nonstructured data elements
  • Decomposes large video files, images into smaller components and rebuilds them using pointers during retrieval
• Document validation rules enforce data validity
  • Enforce checks on document structure, data types, data ranges and the presence of mandatory fields
  • DBAs can apply data governance standards, while developers maintain the benefits of a flexible schema
• Automatic failover with no application redirects to new primary required
• Driver support for all common programming languages
• Data compression
• Tunable consistency model
• BI Connector allows MongoDB to act as data source for SQL based BI analytics platforms
• LDAP, Kerberos, Windows AD, x.509 authentication
• DML, DCL, DDL audit logging
• FIPS compliant and data encryption
Rigid vs Dynamic Schemas
Relational Tables and Rows (Predescribed)
• Schema design performed before application is developed
• Schema must be built before inserting data
• Enforces data structure – rows cannot deviate from the predefined schema
• Schema design based on storage
• Schema alterations require database and application changes to be coordinated
• Normalization process is critical
MongoDB Collections and Documents (Self-Describing)
• No schema required before inserting data
• Schema is created as each document is inserted
• Documents in a collection can have different schemas (sets of fields)
• Schema design based on application usage
• Schemas can evolve iteratively during application life-cycle
• Higher dependency on application layer for data integrity
• Normalization not as important
Flexible Schemas
Insurance Policy Document Collection
AUTO LIFE HOME EQUIPMENT CYBER
Collections do *not* enforce document structure. You do not predefine document schemas. The schema is defined during initial document insertion. Data types are selected by MongoDB based on data being inserted
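A minimal sketch of the idea, using plain Python dicts as stand-ins for documents in one policy collection. The field names and values are invented for illustration, and no MongoDB driver is involved.

```python
# One "collection" holding documents with different sets of fields.
policies = [
    {"type": "AUTO", "policy_id": 1, "vin": "VIN-0001", "drivers": ["Ann", "Bo"]},
    {"type": "LIFE", "policy_id": 2, "beneficiary": "Cy", "term_years": 20},
    {"type": "HOME", "policy_id": 3, "address": {"city": "Pittsburgh", "state": "PA"}},
]

def schema_of(doc):
    """Return the document's self-described schema: field name -> type name."""
    return {field: type(value).__name__ for field, value in doc.items()}

# Each document carries its own schema; only some fields are common to all.
schemas = [schema_of(doc) for doc in policies]
shared_fields = set.intersection(*(set(doc) for doc in policies))
```

Here the types are inferred from the inserted values, and only type and policy_id appear in every document. Any rule beyond that (for example, that every AUTO policy has a vin) has to be enforced by the application or by validation rules.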
Agile Development Features
FASTER BETTER LEANER
• Schemaless architecture
• Flexible data model = easy schema changes
• Drivers for all major programming languages
• Ability to store all types of data
• Flexible JSON document format
• Rich content using GridFS
• Simple system provisioning
• Scale vertically and horizontally
• Pluggable storage engines
• Easy replication setup
Automatic Sharding
Automatic Data Distribution - Sharded Cluster
[Diagram: Shard 1, Shard 2 … Shard N. Each logical shard is a set of replicas on three physical servers: one primary and two secondaries]
Horizontally Scalable
Cluster metadata includes data location, shards, # of chunks…
Replica Sets
MULTI DATACENTER CLUSTER
[Diagram: one collection spread over three replica sets across sites, with a BI Connector and three config servers]
• Primaries (Priority 1, Votes 1): Primary 1 Display, Primary 2 Display, Primary 3 Display
• Site 2 – Display and Batch (Priority 1, Votes 1): Sec 1.1 Display, Sec 2.1 Batch, Sec 3.1 Batch
• Site 3 – Batch and DR (Priority 0, Votes 1): Sec 1.2 Batch, Sec 2.2 Batch, Sec 3.2 Delayed
Global Data Distribution
Read Global/Write Local
Primary
Secondary
Secondary
Videos and Images – Unstructured Data
• Store files larger than 16MB, e.g. video, images
  • Load chunks without reading entire file into memory
• Atomically sync files with their metadata
• Shard and distribute around the cluster
[Diagram: the driver's GridFS API splits doc.jpg into a metadata document in fs.files and data chunks in fs.chunks]
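The split into fs.files and fs.chunks can be sketched without a database. The two lists below stand in for the two collections; the helper names and the 4-byte chunk size are assumptions chosen to keep the demo small (real GridFS chunks are far larger).

```python
def put_file(fs_files, fs_chunks, filename, data, chunk_size=4):
    """Store `data` as one metadata document plus ordered chunk documents."""
    file_id = len(fs_files) + 1  # stand-in for a generated _id
    fs_files.append({"_id": file_id, "filename": filename,
                     "length": len(data), "chunkSize": chunk_size})
    for n, start in enumerate(range(0, len(data), chunk_size)):
        fs_chunks.append({"files_id": file_id, "n": n,
                          "data": data[start:start + chunk_size]})
    return file_id

def read_chunk(fs_chunks, file_id, n):
    """Fetch a single chunk without touching the rest of the file."""
    return next(c["data"] for c in fs_chunks
                if c["files_id"] == file_id and c["n"] == n)

def get_file(fs_chunks, file_id):
    """Rebuild the whole file by concatenating its chunks in order."""
    ordered = sorted((c for c in fs_chunks if c["files_id"] == file_id),
                     key=lambda c: c["n"])
    return b"".join(c["data"] for c in ordered)

fs_files, fs_chunks = [], []
payload = b"0123456789"
doc_id = put_file(fs_files, fs_chunks, "doc.jpg", payload)
```

Because each chunk is its own small document keyed by (files_id, n), a reader can stream or seek within a large file one chunk at a time, and the chunks can be distributed across shards like any other documents.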
Cassandra
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the log-structured storage engine from Google's BigTable.
Apple, Sony, Walmart, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, Weather Channel, CERN, Constant Contact, Macy’s, Expedia
• Fault Tolerant
• Data Durability
• Data Center Aware
• High Performance
• Decentralized
• Horizontal Scalability
• Elastic Architecture
BIG Data, High # Concurrent Users
• Apple - 75,000 nodes storing over 10 PB of data
• Netflix - 2,500 nodes, 420 TB, over 1 trillion requests per day
• Chinese search engine Easou - 270 nodes, 300 TB, over 800 million requests per day
• eBay - 100 nodes, 250 TB
Datastax/Cassandra Features
• Multi-model storage
  • Key Value NoSQL
  • Tabular NoSQL
  • JSON/Document NoSQL
  • Graph
• Very high “linear” scalability
• Automatic data distribution amongst nodes
• Multi-data center replication
• CQL Access Language
  • SQL “like” language
• Tunable consistency model
• Strong node fault detection and recovery
• Writes to Memtables in RAM
• Materialized views
• Advanced replication allows multiple clusters to be synchronized
• OpsCenter – browser based administration and monitoring toolset
• Driver support for all common programming languages
• In-Memory option allows parts (or all) of database to reside in RAM
• Tiered storage
• Interface to Spark (in-memory)
  • Data stream processing
  • Access to Spark SQL (more robust than CQL)
• Security
  • End to end encryption
  • AD, LDAP, Kerberos support
Cassandra Cluster
Cassandra/DataStax
[Diagram: West Coast Datacenter and East Coast Datacenter, each with Nodes 1 to 4. Node 1 is labeled Primary, Nodes 2 and 3 are labeled Copy of 1, and REPLICATION arrows link the nodes and the two datacenters]
Cassandra/DataStax
• Keyspace - A keyspace is a logical container for data tables and indexes. It can be compared to an Oracle Schema or a SQL Server database. Keyspaces define how the data is replicated amongst the nodes
• Table - A collection of columns fetched by a row. Columns are ordered by name
• Column - Supports different data types and consists of a name, value and timestamp
• Primary Key - Uniquely identifies a row occurrence in a Cassandra table
• Partition Key - The partition key identifies which node in the cluster will store the row. It is responsible for data distribution across the nodes
• Clustering Key - Orders rows based on the column’s value
• Data Center - A collection of related nodes in a Cassandra Cluster
• Snitch - Determines which datacenters and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into datacenters and racks
• Partitioner - A hashing algorithm that generates a hash value token from the partition key. The token is the value used to distribute the data across the various nodes in the cluster. The partitioner’s goal is to assign equal portions of data to each node. Each node in a Cassandra cluster becomes responsible for storing a range of hash values
• Gossip - A peer-to-peer communications mechanism that identifies and shares node information (state and location) to all nodes in the Cassandra cluster
Cassandra/DataStax Decentralized Storage
Partitioners are hashing algorithms that generate tokens from partition keys
Each node in a Cassandra cluster is responsible for a range of tokens (hash keys)
First column of primary key becomes partition key
Can use multiple columns as primary key, partition key
Also able to cluster columns to order data
PRIMARY KEY (emp_id)
PRIMARY KEY (emp_id, dept_id)
PRIMARY KEY (emp_id, dept_id)) WITH CLUSTERING ORDER BY (dept_loc)
Partitioner
TOKEN   RANGE
0       0-25
26      26-50
51      51-75
76      76-100
All nodes can accept reads and writes
Distributes data amongst nodes
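The token ranges on this slide can be turned into a small lookup sketch. The node names are hypothetical, and the MD5-modulo hash is only a stand-in for a real partitioner; it maps keys onto the slide's 0-100 token space for illustration.

```python
import bisect
import hashlib

# Ring from the slide: each node owns tokens from its start token up to the next start.
RING = [(0, "node1"), (26, "node2"), (51, "node3"), (76, "node4")]  # hypothetical node names
STARTS = [start for start, _ in RING]

def token_for(partition_key):
    """Hash a partition key onto the 0-100 token space (illustrative hash, not Cassandra's)."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest, "big") % 101

def node_for_token(token):
    """Find the node whose token range contains `token`."""
    return RING[bisect.bisect_right(STARTS, token) - 1][1]

def node_for(partition_key):
    """Route a row to a node using only its partition key."""
    return node_for_token(token_for(partition_key))
```

Any node can run this calculation, which is why any node can accept a read or write and forward it to the owner: the placement depends only on the partition key and the ring, not on a central directory.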
Cassandra/DataStax Tunable Consistency
Read and Write consistency levels are different from row replication settings.
Replication factor will affect how many copies are eventually written vs tunable consistency for fast client response
[Diagram: trade-off between Consistency and Latency]
Read Consistency
Level | Description
ALL | Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
QUORUM | Returns the record after a quorum of replicas from all datacenters has responded.
LOCAL_QUORUM | Returns the record after a quorum of replicas in the same datacenter as the coordinator has responded. Avoids latency of inter-datacenter communication.
ONE | Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
TWO | Returns the most recent data from two of the closest replicas.
THREE | Returns the most recent data from three of the closest replicas.
LOCAL_ONE | Returns a response from the closest replica in the local datacenter.
SERIAL | Allows reading the current (and possibly uncommitted) state of data without proposing a new addition or update. If a SERIAL read finds an uncommitted transaction in progress, it will commit the transaction as part of the read.
LOCAL_SERIAL | Same as SERIAL, but confined to the datacenter. Similar to LOCAL_QUORUM.
Write Consistency
Level | Description
ALL | A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
EACH_QUORUM | Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in each datacenter.
QUORUM | A write must be written to the commit log and memtable on a quorum of replica nodes across all datacenters.
LOCAL_QUORUM | Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter as the coordinator. Avoids latency of inter-datacenter communication.
ONE | A write must be written to the commit log and memtable of at least one replica node.
TWO | A write must be written to the commit log and memtable of at least two replica nodes.
THREE | A write must be written to the commit log and memtable of at least three replica nodes.
LOCAL_ONE | A write must be sent to, and successfully acknowledged by, at least one replica node in the local datacenter.
ANY | A write must be written to at least one node. If all replica nodes for the given partition key are down, the write can still succeed after a hinted handoff has been written. If all replica nodes are down at write time, an ANY write is not readable until the replica nodes for that partition have recovered.
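How the levels interact with the replication factor can be sketched with a little arithmetic. This assumes a single datacenter and covers only ALL, QUORUM, ONE, TWO and THREE; the overlap rule used (read replicas + write replicas > replication factor) is a common rule of thumb for distributed stores and is not stated on the slide.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

FIXED_LEVELS = {"ONE": 1, "TWO": 2, "THREE": 3}

def replicas_required(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        return quorum(replication_factor)
    return FIXED_LEVELS[level]

def read_overlaps_write(read_level, write_level, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    needed = (replicas_required(read_level, replication_factor)
              + replicas_required(write_level, replication_factor))
    return needed > replication_factor
```

With a replication factor of 3, QUORUM reads plus QUORUM writes touch 2 + 2 replicas and so always overlap, while ONE plus ONE does not: that pairing favors latency and relies on eventual consistency.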
Relational vs Cassandra NoSQL – Data Modeling
In relational systems, administrators model the data
In Cassandra, administrators design schemas that are based on query patterns
Cassandra/DataStax Modeling
Cassandra – YOU DESIGN SCHEMAS BASED ON QUERY PATTERNS THEN DATA RELATIONSHIPS
Maximization of Denormalization
Cassandra/DataStax recommendation = 1 table per query
You are prebuilding answers to unique requests for data!
Overcome data duplication by leveraging extremely fast write performance
• Determine queries accessing data FIRST, then design the data models
• No concept of foreign keys
• No concept of join operations
• Prepare data for fast reads by writing pre-built result sets
• Attempt to minimize reads from multiple partitions
• Cassandra prefers INSERTs over UPDATEs and DELETEs
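The "one table per query" approach in the list above can be sketched with two in-memory structures standing in for two tables. The table names, columns and sample rows are hypothetical.

```python
from collections import defaultdict

# One structure per query, each keyed by the value that query filters on.
employees_by_id = {}                     # query: "fetch one employee by id"
employees_by_dept = defaultdict(list)    # query: "list all employees in a department"

def add_employee(emp_id, name, dept_id):
    """Write the same row to every query table; the duplication is deliberate."""
    row = {"emp_id": emp_id, "name": name, "dept_id": dept_id}
    employees_by_id[emp_id] = row
    employees_by_dept[dept_id].append(row)

add_employee(1, "Ann", "ENG")
add_employee(2, "Bo", "ENG")
add_employee(3, "Cy", "OPS")

# Each read is a single-key lookup with no join and no scan across partitions.
eng_names = [row["name"] for row in employees_by_dept["ENG"]]
```

The cost is an extra write per table on every insert, which is the trade the slide describes: spend cheap writes to prebuild the result set so that each read touches one partition.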
Redis
• In-Memory, Key-Value Database
  • Dumping to disk is configurable
  • Database handles swapping
  • All data can live in memory but key caching is required
    • 1 Million Keys = 160 MB
    • 10 Million Keys = 1.6 GB
• ATOMIC Operations
• Master-slave replication
  • Scalability
  • Redundancy
  • Slaves
    • Can’t respond to queries during initial sync
    • Automatically reconnect and resync after outage
• Journal file
  • Every write is logged
  • Commands replayed when server is started
  • Configurable – Can choose between 2 settings
    • Eventually consistent - “Speed”
    • Immediately consistent - “Safety”
Tumblr, Uber, Coinbase, Flickr, Hulu, Craigslist, Alibaba, Digg
www. .com
Redis Features
• Not a replacement for relational databases but can be used as their “front end”
• Lightning fast read and write access
• Single threaded architecture – does not exploit multiple CPU/Cores
• Does not support unit-of-work roll back
• Optimistic locking – data contention (race) will cause transaction failure
• Redis Clusters
  • Not able to guarantee strong consistency amongst nodes
  • Able to add/remove nodes in a Redis cluster
• Partitioning allows data to be split and stored in multiple Redis instances. Each instance contains a subset of keys
  • Range partitioning
  • Hash partitioning
• Can be used as a data store or a pure cache
• When used as a Cache, can be configured as a LRU (gets rid of old data to make way for new)
• Sensor data
• Redis RDB persistence and backups
  • Redis snapshots at specified time intervals = a full database backup
  • Move RDB files to other storage
  • Write operations in memory can be logged to Append Only Files (AOF)
  • appendfsync parameter allows administrator to configure log writes
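The LRU cache behavior mentioned in the list above (discard old data to make room for new) can be sketched in a few lines. This is an exact LRU written for illustration with a hypothetical class name; it shows the policy only and is not a description of how Redis implements eviction internally.

```python
from collections import OrderedDict

class LRUCache:
    """Keep at most `max_keys` entries, evicting the least recently used one."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)   # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_keys:
            self._items.popitem(last=False)   # drop the least recently used key

cache = LRUCache(max_keys=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # "a" becomes the most recently used key
cache.set("c", 3)    # cache is full, so "b" is evicted
```

When a store is used as a pure cache in front of a relational database, a miss (get returning None here) is the signal to read from the database and set the key again.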
Neo4j
Walmart, Ebay, Cisco, Adobe, CrunchBase, Pitney Bowes, CareerBuilder, TomTom, ConocoPhillips, National Geographic, Century Link, Glassdoor, Zephyr Health, Gamesys, Telenor
• Highly scalable, native graph database
• Enterprise and community editions
• Store, manage, analyze, and use data within the context of connections, like the circles and lines drawn on whiteboards
• More than 1 Million downloads
• Understanding data relationships is also key to understanding dependencies, uncovering cascading impacts, and predicting behavior
• Access language allows you to traverse relationships in a much simpler and easier-to-understand way than relational SQL
  • SQL – Dozens of lines
  • Cypher – Couple of lines
Neo4j Features
• Provides graphical browser utility to better visualize relationships
• Import data from different sources using rules
• Cypher is another SQL “like” language
• Properties are key-value pairs
  • Nodes with properties (node is data, not server)
  • Named relationships with properties
  • Key – string
  • Value – individual data types or array
• Path – connecting relationships, which you traverse using an API
• Schemaless
• Easily able to store unstructured data
• Easily able to store large volumes of data
• Full support for ACID Transactions
• Full indexing capabilities
• Constraint capabilities
  • Unique
  • Exists (like a Foreign Key with no parent delete rules)
Find Sushi Restaurants in New York that my friends like
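The example query on this slide can be sketched as a traversal over named relationships. This is plain Python rather than Cypher, and every node, relationship type and restaurant name below is invented for illustration.

```python
# Relationships stored as (start node, relationship type, end node) triples.
RELATIONSHIPS = [
    ("me", "FRIEND", "ana"),
    ("me", "FRIEND", "raj"),
    ("ana", "LIKES", "Sushi Ko"),
    ("raj", "LIKES", "Pasta Bar"),
    ("raj", "LIKES", "Tuna House"),
    ("Sushi Ko", "SERVES", "sushi"),
    ("Sushi Ko", "LOCATED_IN", "New York"),
    ("Tuna House", "SERVES", "sushi"),
    ("Tuna House", "LOCATED_IN", "Boston"),
    ("Pasta Bar", "SERVES", "pasta"),
    ("Pasta Bar", "LOCATED_IN", "New York"),
]

def out(node, rel_type):
    """Follow one relationship type outward from a node."""
    return {end for start, rel, end in RELATIONSHIPS
            if start == node and rel == rel_type}

def sushi_in_new_york_liked_by_friends(person):
    """Walk person -> friends -> liked places, keeping sushi places in New York."""
    results = set()
    for friend in out(person, "FRIEND"):
        for place in out(friend, "LIKES"):
            if "sushi" in out(place, "SERVES") and "New York" in out(place, "LOCATED_IN"):
                results.add(place)
    return sorted(results)

matches = sushi_in_new_york_liked_by_friends("me")
```

The query is expressed as a path through relationships rather than as joins between tables, which is the property a graph access language is built around.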
Neo4j Graph Examples
Master Data Management
Graph Based Search
Recommendations
NoSQL vs Relational
Relational DBMS
Strengths
• ACID
• Transaction management
• Sophisticated locking and latching
• Power of the SQL Language – Two-phase commits, foreign key constraints, joins, subqueries, integrated aggregations, complex business rule enforcement
• Product maturity
• Robust utilities
• Vendor support
• Most vendors have robust cloud strategies
• Strong third-party software provider adoption (applications, tools and utilities)
Weaknesses
• Product purchase/support costs
• Scalability can be complex and expensive
• Data normalization can impact performance
• Schemas are not flexible
• Not all data fits neatly into rows and columns
• Geographic distribution can be complex
NoSQL vs Relational
NoSQL DBMS
Strengths
• Dynamic schema flexibility
• Faster development times
• Total cost of ownership
• Easily stores semi, non and fully structured data
• Horizontal and vertical scalability
• Geographic replication and data distribution
• Easier to achieve high performance accessing large volumes of data
• Custom tailor environment to data storage and processing needs
• Cost effective clustering
Weaknesses
• Crude transaction management and locking mechanisms (BASE vs ACID)
• Limited cloud offerings
• Vendor support (or lack thereof)
• Data is often denormalized leading to duplicate updates
• Weak access languages
• No inherent data integrity enforcement mechanisms
NoSQL vs Relational
Relational DBMS | NoSQL DBMS
Transactions – COMPLEX | Transactions – SIMPLE
Data – STRUCTURED AND STATIC | Data – FULL/SEMI/NON STRUCTURED DYNAMIC
Data Velocity – MODERATE TO HIGH | Data Velocity – HIGH to ASTRONOMICAL
Data Locations – FEWER THE BETTER | Data Locations – MANY LOCATIONS
Data Volumes – MAINTAIN BY PURGING | Data Volumes – RETAIN FOREVER
Data Availability – CLUSTER, LOG SHIPPING | Data Availability – INHERENT ARCHITECTURE
Data Performance – FOCUS ON READS | Data Performance – FOCUS ON READS/WRITES
Questions and Additional Information
Next Month’s Presentation – Evaluating and Selecting Cloud Database Management Systems
The RDX Report: Is NoSQL the Natural Progression of DB Technology?, Cloud’s Hidden Impact on IT Support, SQL Server 2016 Licensing Best Practices, The Rise of Corporate Ransomware
LinkedIn: Selecting Cloud DBMS, NoSQL Architectures, Database Security Series, Improving Customer Service
20 YEARS OF SERVICE DELIVERY EXPERIENCE