NoSQL database
• Called "Not Only SQL", NoSQL is a current approach to large-scale, distributed data management and database design.
• Its name easily leads to the misunderstanding that NoSQL means "not SQL".
• On the contrary, NoSQL does not avoid SQL.
• Some NoSQL systems are entirely non-relational.
• Some NoSQL systems simply avoid selected relational functionality, such as fixed table schemas and join operations.
• Some analytic platforms, such as SQLstream and Cloudera Impala, still use SQL in their database systems, because SQL is a reliable, simple query language with high performance in streaming Big Data real-time analytics.
• The mainstream Big Data platforms adopt NoSQL to break and transcend the rigidity of normalized RDBMS schemas.
NoSQL Database for Unstructured or Non-relational Data
• Data storage and management are separated into two independent parts; this is contrary to relational databases.
i) In the storage part, also called key-value storage, NoSQL focuses on the scalability of data storage with high performance.
ii) In the management part, NoSQL provides only a low-level access mechanism.
• Data management tasks can thus be implemented in the application layer, rather than having data management logic spread across SQL or DB-specific stored-procedure languages.
• NoSQL systems are very flexible for data modeling.
• NoSQL systems make it easy to update application deployments.
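The storage/management split above can be sketched in a few lines of Python. This is a toy illustration, not any real NoSQL system: a plain dict stands in for the key-value storage part, and the invented `save_user` function shows data-management logic (validation) living in the application layer.

```python
# Minimal sketch of the NoSQL split: a low-level key-value storage part,
# with data-management logic kept in the application layer.

class KeyValueStore:
    """Storage part: only low-level get/put, no schema, no validation."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

def save_user(store, user_id, user):
    """Management part, in the application layer: the application,
    not the database, enforces integrity (here: users need a name)."""
    if "name" not in user:
        raise ValueError("user must have a name")
    store.put(f"user:{user_id}", user)

store = KeyValueStore()
save_user(store, 251, {"name": "Bill Gates"})
print(store.get("user:251"))  # {'name': 'Bill Gates'}
```

The store itself stays simple and scalable; any validation or integrity rule is the application's responsibility.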
HBase NoSQL Database System Architecture (Apache Hadoop)
HBase is one of the most widely used NoSQL databases.
• An important property of most NoSQL databases is that they are commonly schema-free.
• The biggest advantage of schema-free databases is that applications can quickly modify the structure of their data without rewriting tables.
• They offer greater flexibility when heterogeneously structured data is stored.
• In the data management layer, the integrity and validity of the data are enforced.
• The most popular NoSQL database is Apache Cassandra.
• Cassandra was originally a Facebook proprietary database; it was released as open source in 2008.
• Other NoSQL implementations include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort.
• Companies that use NoSQL include Twitter, LinkedIn, and Netflix.
Reliable, Scalable and Maintainable Applications
• A data-intensive application is built from standard building blocks that provide commonly needed functionality.
• For example, many applications need to:
• Store data so that they, or another application, can find it again later (databases)
• Remember the result of an expensive operation, to speed up reads (caches)
• Allow users to search data by keyword or filter it in various ways (search indexes)
• Send a message to another process, to be handled asynchronously (stream processing)
• Periodically crunch a large amount of accumulated data (batch processing)
• Traditional data systems are such a successful abstraction that we use them all the time without thinking too much.
• When building an application, most engineers wouldn't dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
But reality is not that simple…
• There are many database systems with different characteristics, because different applications have different requirements.
• There are various approaches to caching, several ways of building search indexes, and so on.
• We still need to figure out which tools and which approaches are the most appropriate for the task at hand.
• It can be hard to combine several tools when you need to do something that a single tool cannot do alone.
An Example
• If you have an application-managed caching layer (using memcached or similar),
• or a full-text search server separate from your main database (such as Elasticsearch or Solr),
• it is normally the application code's responsibility to keep those caches and indexes in sync with the main database.
An Architecture for a Data System that Combines Several Components
Data Systems
• When we combine several tools in order to provide a service, the service's interface or API usually hides those implementation details from clients.
• We have now essentially created a new, special-purpose data system from smaller, general-purpose components.
• Our composite data system may provide certain guarantees, e.g. that the cache will be correctly invalidated or updated on writes, so that outside clients see consistent results.
• We are then not only application developers, but also data system designers.
Reliability, Scalability, Maintainability
Reliability
• The system should continue to work correctly (performing the correct function at the desired performance) even in the face of adversity (hardware or software faults, and even human error).
Scalability
• As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability
• Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases); they should all be able to work on it productively.
Describing Load
• Load can be described with a few numbers which we call load parameters.
• The best choice of parameters depends on the architecture of your system:
• requests per second to a web server,
• the ratio of reads to writes in a database,
• the number of simultaneously active users in a chat room,
• the hit rate on a cache, or something else.
• Sometimes the average case is what matters for you; sometimes your bottleneck is dominated by a small number of extreme cases.
Twitter as an example
using data published in November 2012
Two of Twitter’s main operations are:
Post tweet
A user can publish a new message to their followers (4.6 k requests/sec on average, over 12 k requests/sec at peak).
Home timeline
A user can view tweets recently published by the people they follow (300 k requests/sec)
• Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy.
However:
• Twitter's scaling challenge is not primarily due to tweet volume,
• but due to fan-out*:
• each user follows many people, and each user is followed by many people.
*Fan-out: a term borrowed from electronic engineering, where it describes the number of logic-gate inputs attached to another gate's output. Here it means the number of home timelines each tweet must be delivered to.
Two Different Approaches for Tweet Implementation
1. Posting a tweet simply inserts the new tweet into a global collection of tweets.
• When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time).
In a relational database, this would be a query along the lines of:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
Simple relational schema for implementing a Twitter home timeline
• For this version of Twitter, the systems struggled to keep up with the load of home timeline queries.
Therefore:
• the company switched to the following solution.
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user.
• When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.
• The request to read the home timeline is then cheap,
because
• the result has been computed ahead of time.
• This works better than the previous solution:
• the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads,
• so it's preferable to do more work at write time and less at read time.
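The write path of approach 2 (fan-out on write) can be sketched as follows. This is a toy in-memory model, not Twitter's actual implementation: follower lists and per-user timeline caches are plain dicts, and all names are invented for illustration.

```python
# Toy model of fan-out on write: posting a tweet pushes it into the
# home-timeline cache of every follower, so reads become cheap lookups.

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # author -> followers
timelines = {}  # user -> cached home timeline (list of tweets)

def post_tweet(author, text):
    # Expensive work happens at write time: one insert per follower.
    for follower in followers.get(author, []):
        timelines.setdefault(follower, []).append((author, text))

def home_timeline(user):
    # Read time is a single cheap cache lookup.
    return timelines.get(user, [])

post_tweet("alice", "hello")
post_tweet("bob", "hi carol")
print(home_timeline("carol"))  # [('alice', 'hello'), ('bob', 'hi carol')]
```

Each `post_tweet` call does work proportional to the author's follower count, which is exactly the fan-out cost discussed below.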
Twitter’s data pipeline for delivering tweets to followers, with load
parameters as of November 2012
• Posting a tweet now requires a lot of extra work.
• On average, a tweet is delivered to about 75 followers,
• so 4.6 k tweets per second become 345 k writes per second to the home timeline caches.
• This average hides the fact that the number of followers per user varies wildly; some users have over 30 million followers.
• This means that a single tweet may result in over 30 million writes to home timelines!
• Doing this in a timely manner (Twitter tries to deliver tweets to followers within 5 seconds) is a significant challenge.
Twitter's Implementation
Twitter is moving to a hybrid of both approaches:
• Most users' tweets continue to be fanned out to home timelines at the time when they are posted.
• A small number of users with a very large number of followers are excluded from this fan-out.
• When the home timeline is read, the tweets of those followed users are fetched separately and merged with the home timeline, as in the first approach.
• This hybrid approach is able to deliver consistently good performance.
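The read path of the hybrid approach can be sketched like this. Again a toy model with invented names: "celebrities" are the high-follower users excluded from fan-out, and tweets are (timestamp, author, text) tuples so the merge can sort by time as in the first approach.

```python
# Toy model of the hybrid read path: the precomputed timeline cache is
# merged, at read time, with recent tweets from followed celebrities.

celebrities = {"katy"}                                 # excluded from fan-out
timelines = {"carol": [(1, "alice", "hello")]}         # precomputed caches
celebrity_tweets = {"katy": [(2, "katy", "new single!")]}
follows = {"carol": ["alice", "katy"]}

def home_timeline(user):
    merged = list(timelines.get(user, []))
    # Fetch celebrity tweets separately and merge them at read time.
    for author in follows.get(user, []):
        if author in celebrities:
            merged.extend(celebrity_tweets.get(author, []))
    return sorted(merged)  # sort by timestamp, as in the first approach

print(home_timeline("carol"))
# [(1, 'alice', 'hello'), (2, 'katy', 'new single!')]
```

Only the few celebrity authors pay the read-time merge cost; everyone else still gets the cheap precomputed cache.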
Describing Performance
• Once we have described the load on our system, we can investigate what happens when the load increases.
We can look at it in two ways:
• When we increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of the system affected?
• When we increase a load parameter, how much do we need to increase the resources if we want to keep performance unchanged?
Both questions require performance numbers.
• In a batch-processing system such as Hadoop, we usually care about throughput:
• the number of records we can process per second, or
• the total time it takes to run a job on a dataset of a certain size.
• In online systems, the response time of a service is usually more important:
• the time between a client sending a request and receiving a response.
Latency and Response time
• Latency and response time are often used synonymously, but they are not the same.
• The response time is what the client sees:
• besides the actual time to process the request (the service time),
• it includes network delays and queueing delays.
• Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
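Response times are usually summarized with percentiles rather than averages, because a few slow outliers can dominate the mean. A minimal sketch (nearest-rank percentiles; the sample response times are invented for illustration):

```python
import math

# Nearest-rank percentile over a list of response times.
# High percentiles (p95, p99) expose the slow "tail" that a mean hides.

def percentile(samples, p):
    """Return the p-th percentile (nearest-rank) of samples."""
    ranked = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

# Hypothetical response times in milliseconds; one request was very slow.
samples = [30, 32, 35, 40, 42, 45, 50, 55, 90, 800]

print(percentile(samples, 50))      # 42  (median: a typical request)
print(percentile(samples, 95))      # 800 (the slow tail)
print(sum(samples) / len(samples))  # 121.9 (mean, dragged up by the outlier)
```

The median says half of the requests finished within 42 ms, while the mean of 121.9 ms is dominated by the single 800 ms outlier; this is why percentiles describe response time better.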
Data Models and Query Languages
• When we want to store data structures, we express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.
• The engineers who built our database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network.
• This representation may allow the data to be queried, searched, manipulated, and processed in various ways.
Relational Model vs. Document Model
• In the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance.
• The term NoSQL is unfortunate, since it doesn't actually refer to any particular technology:
• it was intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, non-relational databases in 2009.
• The term NoSQL struck a nerve, and quickly spread through the web startup community and beyond.
• A number of interesting database systems are now associated with the #NoSQL hashtag,
• and it has been retroactively reinterpreted as "Not Only SQL".
The Adoption of NoSQL databases
• A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput.
• A widespread preference for free and open source software over commercial database products.
• Specialized query operations that are not well supported by the relational model.
• Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.
Representation of a LinkedIn profile as a JSON document:
{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
One-to-Many Relationships with Tree Structure
The company name is not just a string, but a link to a company entity.
Many-to-Many Relationships
• The data within each dotted rectangle can be grouped into one document.
• The references to organizations, schools, and other users need to be represented as references, and require joins when queried.
Schema Flexibility in the Document Model
• Document databases are sometimes called schemaless.
• The code that reads the data usually assumes some kind of structure:
• there is an implicit schema, but it is not enforced by the database.
• With schema-on-read, the structure of the data is implicit, and only interpreted when the data is read.
• With schema-on-write, the traditional approach of relational databases, the schema is explicit and the database ensures all data conforms to it.
Schema-on-Read vs. Schema-on-Write
• Schema-on-read is similar to dynamic (run-time) type checking in programming languages.
• Schema-on-write is similar to static (compile-time) type checking.
• Just as the advocates of static and dynamic type checking have big debates about their relative merits, enforcement of schemas in databases is a contentious topic.
Document Database
• We can start writing new documents with the new fields.
• The code in the application handles the case when old documents are read.
For example:
if (user && user.name && !user.first_name) {
  // Documents written before Dec 8, 2013 don't have first_name
  user.first_name = user.name.split(" ")[0];
}
Statically Typed Database Schema
• We can perform a migration along these lines:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
• Schema changes have a bad reputation of being slow and requiring downtime.
• This reputation is not entirely deserved:
• most relational database systems execute the ALTER TABLE statement in a few milliseconds.