NoSQL database
• Called "Not Only SQL", NoSQL is a current approach to large-scale, distributed data management and database design.
• Its name easily leads to the misunderstanding that NoSQL means "not SQL".
• On the contrary, NoSQL does not avoid SQL.
• Some NoSQL systems are entirely non-relational.
• Some NoSQL systems simply avoid selected relational functionality, such as fixed table schemas and join operations.
• Some analytic platforms, such as SQLstream and Cloudera Impala, still use SQL in their database systems, because SQL is a reliable, simple query language with high performance in streaming Big Data real-time analytics.
• The mainstream Big Data platforms adopt NoSQL to break and transcend the rigidity of normalized RDBMS schemas.
NoSQL Database for Unstructured or Non-relational Data
• Data storage and management are separated into two independent parts; this is contrary to relational databases.
i) In the storage part, also called key-value storage, NoSQL focuses on the scalability of data storage with high performance.
ii) In the management part, NoSQL provides only a low-level access mechanism.
• Data management tasks can thus be implemented in the application layer, rather than having data management logic spread across SQL or DB-specific stored-procedure languages.
• NoSQL systems are very flexible for data modeling.
• NoSQL systems make it easy to update application deployments.
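The storage/management split above can be sketched in a few lines of Python. This is a toy illustration, not any real NoSQL system: a plain dict stands in for the key-value storage part, and the invented `save_user` function shows data-management logic (validation) living in the application layer.

```python
# Minimal sketch of the NoSQL split: a low-level key-value storage part,
# with data-management logic kept in the application layer.

class KeyValueStore:
    """Storage part: only low-level get/put, no schema, no validation."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

def save_user(store, user_id, user):
    """Management part, in the application layer: the application,
    not the database, enforces integrity (here: users need a name)."""
    if "name" not in user:
        raise ValueError("user must have a name")
    store.put(f"user:{user_id}", user)

store = KeyValueStore()
save_user(store, 251, {"name": "Bill Gates"})
print(store.get("user:251"))  # {'name': 'Bill Gates'}
```

The store itself stays simple and scalable; any validation or integrity rule is the application's responsibility.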
HBase NoSQL Database System Architecture (Apache Hadoop)
HBase is one of the most widely used NoSQL databases.
• An important property of most NoSQL databases is that they are commonly schema-free.
• The biggest advantage of schema-free databases is that applications can quickly modify the structure of their data without rewriting tables.
• They offer greater flexibility when heterogeneously structured data is stored.
• In the data management layer, the integrity and validity of the data are enforced.
• The most popular NoSQL database is Apache Cassandra.
• Cassandra was originally a Facebook proprietary database; it was released as open source in 2008.
• Other NoSQL implementations include SimpleDB, Google BigTable, Apache Hadoop, MapReduce, MemcacheDB, and Voldemort.
• Companies that use NoSQL include Twitter, LinkedIn, and Netflix.
Reliable, Scalable and Maintainable Applications
• A data-intensive application is built from standard building blocks that provide commonly needed functionality.
• For example, many applications need to:
• Store data so that they, or another application, can find it again later (databases)
• Remember the result of an expensive operation, to speed up reads (caches)
• Allow users to search data by keyword or filter it in various ways (search indexes)
• Send a message to another process, to be handled asynchronously (stream processing)
• Periodically crunch a large amount of accumulated data (batch processing)
• Traditional data systems are such a successful abstraction that we use them all the time without thinking too much.
• When building an application, most engineers wouldn't dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
But reality is not that simple…
• There are many database systems with different characteristics, because different applications have different requirements.
• There are various approaches to caching, several ways of building search indexes, and so on.
• We still need to figure out which tools and which approaches are the most appropriate for the task at hand.
• It can be hard to combine several tools when you need to do something that a single tool cannot do alone.
An Example
• If you have an application-managed caching layer (using memcached or similar),
• or a full-text search server separate from your main database (such as Elasticsearch or Solr),
• it is normally the application code's responsibility to keep those caches and indexes in sync with the main database.
An Architecture for a Data System that Combines Several Components
Data Systems
• When we combine several tools in order to provide a service, the service's interface or API usually hides those implementation details from clients.
• We have now essentially created a new, special-purpose data system from smaller, general-purpose components.
• Our composite data system may provide certain guarantees, e.g. that the cache will be correctly invalidated or updated on writes, so that outside clients see consistent results.
• We are then not only application developers, but also data system designers.
Reliability, Scalability, Maintainability
Reliability
• The system should continue to work correctly (performing the correct function at the desired performance) even in the face of adversity (hardware or software faults, and even human error).
Scalability
• As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability
• Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases); they should all be able to work on it productively.
Describing Load
• Load can be described with a few numbers which we call load parameters.
• The best choice of parameters depends on the architecture of your system:
• requests per second to a web server,
• the ratio of reads to writes in a database,
• the number of simultaneously active users in a chat room,
• the hit rate on a cache, or something else.
• Sometimes the average case is what matters for you; sometimes your bottleneck is dominated by a small number of extreme cases.
Twitter as an example
using data published in November 2012
Two of Twitter’s main operations are:
Post tweet
A user can publish a new message to their followers (4.6 k requests/sec on average, over 12 k requests/sec at peak).
Home timeline
A user can view tweets recently published by the people they follow (300 k requests/sec)
• Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy.
However:
• Twitter's scaling challenge is not primarily due to tweet volume,
• but due to fan-out*:
• each user follows many people, and each user is followed by many people.
*Fan-out: a term borrowed from electronic engineering, where it describes the number of logic-gate inputs attached to another gate's output. Here it means the number of home timelines each tweet must be delivered to.
Two Different Approaches for Tweet Implementation
1. Posting a tweet simply inserts the new tweet into a global collection of tweets.
• When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time).
In a relational database, this would be a query along the lines of:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
Simple relational schema for implementing a Twitter home timeline
• For this version of Twitter, the systems struggled to keep up with the load of home timeline queries.
Therefore:
• the company switched to the following solution.
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user.
• When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.
• The request to read the home timeline is then cheap,
because
• the result has been computed ahead of time.
• This works better than the previous solution:
• the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads,
• so it's preferable to do more work at write time and less at read time.
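The write path of approach 2 (fan-out on write) can be sketched as follows. This is a toy in-memory model, not Twitter's actual implementation: follower lists and per-user timeline caches are plain dicts, and all names are invented for illustration.

```python
# Toy model of fan-out on write: posting a tweet pushes it into the
# home-timeline cache of every follower, so reads become cheap lookups.

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # author -> followers
timelines = {}  # user -> cached home timeline (list of tweets)

def post_tweet(author, text):
    # Expensive work happens at write time: one insert per follower.
    for follower in followers.get(author, []):
        timelines.setdefault(follower, []).append((author, text))

def home_timeline(user):
    # Read time is a single cheap cache lookup.
    return timelines.get(user, [])

post_tweet("alice", "hello")
post_tweet("bob", "hi carol")
print(home_timeline("carol"))  # [('alice', 'hello'), ('bob', 'hi carol')]
```

Each `post_tweet` call does work proportional to the author's follower count, which is exactly the fan-out cost discussed below.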
Twitter’s data pipeline for delivering tweets to followers, with load
parameters as of November 2012
• Posting a tweet now requires a lot of extra work.
• On average, a tweet is delivered to about 75 followers,
• so 4.6 k tweets per second become 345 k writes per second to the home timeline caches.
• This average hides the fact that the number of followers per user varies wildly; some users have over 30 million followers.
• This means that a single tweet may result in over 30 million writes to home timelines!
• Doing this in a timely manner (Twitter tries to deliver tweets to followers within 5 seconds) is a significant challenge.
Twitter's Implementation
Twitter is moving to a hybrid of both approaches:
• Most users' tweets continue to be fanned out to home timelines at the time when they are posted.
• A small number of users with a very large number of followers are excluded from this fan-out.
• When the home timeline is read, the tweets of those followed users are fetched separately and merged with the home timeline, as in the first approach.
• This hybrid approach is able to deliver consistently good performance.
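The read path of the hybrid approach can be sketched like this. Again a toy model with invented names: "celebrities" are the high-follower users excluded from fan-out, and tweets are (timestamp, author, text) tuples so the merge can sort by time as in the first approach.

```python
# Toy model of the hybrid read path: the precomputed timeline cache is
# merged, at read time, with recent tweets from followed celebrities.

celebrities = {"katy"}                                 # excluded from fan-out
timelines = {"carol": [(1, "alice", "hello")]}         # precomputed caches
celebrity_tweets = {"katy": [(2, "katy", "new single!")]}
follows = {"carol": ["alice", "katy"]}

def home_timeline(user):
    merged = list(timelines.get(user, []))
    # Fetch celebrity tweets separately and merge them at read time.
    for author in follows.get(user, []):
        if author in celebrities:
            merged.extend(celebrity_tweets.get(author, []))
    return sorted(merged)  # sort by timestamp, as in the first approach

print(home_timeline("carol"))
# [(1, 'alice', 'hello'), (2, 'katy', 'new single!')]
```

Only the few celebrity authors pay the read-time merge cost; everyone else still gets the cheap precomputed cache.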
Describing Performance
• Once we have described the load on our system, we can investigate what happens when the load increases.
We can look at it in two ways:
• When we increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of the system affected?
• When we increase a load parameter, how much do we need to increase the resources if we want to keep performance unchanged?
Both questions require performance numbers.
• In a batch-processing system such as Hadoop, we usually care about throughput:
• the number of records we can process per second, or
• the total time it takes to run a job on a dataset of a certain size.
• In online systems, the response time of a service is usually more important:
• the time between a client sending a request and receiving a response.
Latency and Response time
• Latency and response time are often used synonymously, but they are not the same.
• The response time is what the client sees:
• besides the actual time to process the request (the service time),
• it includes network delays and queueing delays.
• Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
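Response times are usually summarized with percentiles rather than averages, because a few slow outliers can dominate the mean. A minimal sketch (nearest-rank percentiles; the sample response times are invented for illustration):

```python
import math

# Nearest-rank percentile over a list of response times.
# High percentiles (p95, p99) expose the slow "tail" that a mean hides.

def percentile(samples, p):
    """Return the p-th percentile (nearest-rank) of samples."""
    ranked = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

# Hypothetical response times in milliseconds; one request was very slow.
samples = [30, 32, 35, 40, 42, 45, 50, 55, 90, 800]

print(percentile(samples, 50))      # 42  (median: a typical request)
print(percentile(samples, 95))      # 800 (the slow tail)
print(sum(samples) / len(samples))  # 121.9 (mean, dragged up by the outlier)
```

The median says half of the requests finished within 42 ms, while the mean of 121.9 ms is dominated by the single 800 ms outlier; this is why percentiles describe response time better.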
Data Models and Query Languages
• When we want to store data structures, we express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.
• The engineers who built our database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network.
• This representation may allow the data to be queried, searched, manipulated, and processed in various ways.
Relational Model vs. Document Model
• In the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance.
• The term NoSQL is unfortunate, since it doesn't actually refer to any particular technology:
• it was intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, non-relational databases in 2009.
• The term NoSQL struck a nerve, and quickly spread through the web startup community and beyond.
• A number of interesting database systems are now associated with the #NoSQL hashtag,
• and it has been retroactively reinterpreted as "Not Only SQL".
The Adoption of NoSQL databases
• A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput.
• A widespread preference for free and open source software over commercial database products.
• Specialized query operations that are not well supported by the relational model.
• Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.
Representation of a LinkedIn profile as a JSON document:
{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}
One-to-Many Relationships with Tree Structure
The company name is not just a string, but a link to a company entity.
Many-to-Many Relationships
• The data within each dotted rectangle can be grouped into one document.
• The references to organizations, schools, and other users need to be represented as references, and require joins when queried.
Schema Flexibility in the Document Model
• Document databases are sometimes called schemaless.
• The code that reads the data usually assumes some kind of structure:
• there is an implicit schema, but it is not enforced by the database.
• With schema-on-read, the structure of the data is implicit, and only interpreted when the data is read.
• With schema-on-write, the traditional approach of relational databases, the schema is explicit and the database ensures all data conforms to it.
Schema-on-Read vs. Schema-on-Write
• Schema-on-read is similar to dynamic (run-time) type checking in programming languages.
• Schema-on-write is similar to static (compile-time) type checking.
• Just as the advocates of static and dynamic type checking have big debates about their relative merits, enforcement of schemas in databases is a contentious topic.
Document Database
• We can start writing new documents with the new fields.
• The code in the application handles the case when old documents are read.
For example:
if (user && user.name && !user.first_name) {
  // Documents written before Dec 8, 2013 don't have first_name
  user.first_name = user.name.split(" ")[0];
}
Statically Typed Database Schema
• We can perform a migration along these lines:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
• Schema changes have a bad reputation of being slow and requiring downtime.
• This reputation is not entirely deserved:
• most relational database systems execute the ALTER TABLE statement in a few milliseconds.