Apache HBASEcis.csuohio.edu/~sschung/cis612/LectureNotes_HBaseArchitecture.p… · Hadoop/HDFS...

Apache HBASE

CIS 612

Sunnie Chung

H-Base

� Distributed Column-Oriented database on top of Hadoop/HDFS

� Provides low-latency access to single rows from billions of records

� Column oriented:� OLAP

� Best for aggregation� High compression rate: Few distinct values

� Do not have a Schema or Data type

� Built for Wide tables : Millions of columns Billions of rows

� Denormalized data� Master-Slave architecture

HBase SystemOverview

H-Base Architecture

HMaster Server

� Like Name Node in HDFS

� Manages and Monitors HBase Cluster

Operations

� Assign Region to Region Servers

� Handling Load-Balancing and Splitting

Region Server

� Like Data Node in HDFS

� Highly Scalable

� Handle Read/Write Requests

� Direct Communication with Clients

Internal Architecture

� Tables Regions

� Store

� MemStore

� FileStore Blocks

� Column Families

� HBase is composed of three main components in a master slave type of architecture.

� Region servers serve data for reads and writes.

� Region assignment, DDL (create, delete tables) operations are handled by the HBase Master process.

� Zookeeper, which is part of HDFS, maintains a live cluster state.

Apache HBase Architecture

Contd…

HBase consists of:� Set of tables

� Each table with column families and rows

� Row key acts as a Primary key in HBase.� Any access to HBase tables uses this Primary Key

� Each column qualifier present in HBase denotes attribute corresponding to the object which resides in the cell.

Apache HBase storage structure

HBase HFile and Indexing

� Fault tolerant

� Replication across the data center

� Atomic and strongly consistent row-

level operations

� High availability through automatic

failover

� Automatic sharding and load

balancing of tables

Characteristics of HBase

� Fast

� Near real time lookups

� In-memory caching via block cache and bloom

filters

� Server side processing via filters and co-

processors

� Adobe

� Airbnb uses HBase as part of its Airstream real-time stream computation framework

� Facebook uses HBase for its messaging platform.

� Flurry� Imgur uses HBase to power its notifications system

� Netflix

� Rocket Fuel� Spotify uses HBase as base for Hadoop and machine

learning jobs.

� Sears� Yahoo!

Applications

Apache ZooKeeper

ZooKeeper

� Coordination� Race Condition

� Dead-locks

� Partial Failure� Inconsistency

� What is ZooKeeper?� Distributed coordination service for distributed

applications

� Like a Centralized Repository

� Challenges for Distributed Applications

� ZooKeeper Goals� Serialization

� Atomicity

� Reliability

� Simple API

ZooKeeper Architecture

Introduction to Zookeeper

� Zookeeper: A software service for a distributed environment

that coordinates and configures different machines in a

centralized way.

� A change is not considered successful until it has been

written to a quorum

� A leader is elected within the ensemble for conflicts

� In HBase, ZooKeeper coordinates and shares state between

the Masters and RegionServers.

� Tagline: Enables highly reliable distributed coordination

� Always Odd number of nodes.

� Leader is elected by voting.

� Leader and Follower can get connected to

Clients and Perform Read Operations

� Write Operation is done only by the Leader.

� Observer nodes to address scaling problems

ZooKeeper Architecture

ZooKeeper Data Model

� Z Nodes:

� Similar to Directory in File system

� Container for data and other nodes

� Stores Statistical information and User data up to

� Used to store and share configuration information

between applications

ZooKeeper Data Model

Z Node Types

� Persistent Nodes

� Ephemeral Nodes

� Sequential Nodes

� Watch : Event system for client notification

Projects & Tools on Hadoop

� HBase

� Hive

� Pig

� Jaql

� ZooKeeper

� AVRO

� UIMA

� Sqoop

References

[1] "Apache Hadoop", http://hadoop.apache.org/Hadoop/

[2] “Apache Hive”, http://hive.apache.org/hive

[3] “Apache HBase”, https://hbase.apache.org/hbase

[4] “Apache ZooKeeper”, http://zookeeper.apache.org/zookeeper

[5] Jason Venner, "Pro Hadoop", Apress Books, 2009

[6] "Hadoop Wiki", http://wiki.apache.org/hadoop/

[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, Xiao Qin, " Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", 19th International Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010

[8]Dhruba Borthakur, The Hadoop Distributed File System: Architecture

and Design, The Apache Software Foundation 2007.

[9] "Apache Hadoop",

http://en.wikipedia.org/wiki/Apache_Hadoop

[10] "Hadoop Overview",

http://www.revelytix.com/?q=content/hadoop-overview

[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert

Chansler, The Hadoop Distributed File System, Yahoo!,

Sunnyvale, California USA, Published in: Mass Storage

Systems and Technologies (MSST), 2010 IEEE 26th

Symposium.

References

[12] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal,

Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah,

Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia,

Benjamin Reed, Eric Baldeschwieler, Apache Hadoop YARN: Yet Another

Resource Negotiator, ACM Symposium on Cloud Computing 2013, Santa

Clara, California.

[13] Raja Appuswamy, Christos Gkantsidis, Dushyanth Narayanan, Orion

Hodson, and Antony Rowstron, Scale-up vs Scale-out for Hadoop: Time to

rethink?, Microsoft Research, ACM Symposium on Cloud Computing 2013,

Santa Clara, California.

References

Apache HBASEcis.csuohio.edu/~sschung/cis612/LectureNotes_HBaseArchitecture.p… · Hadoop/HDFS...

Documents

Transcript of Apache HBASEcis.csuohio.edu/~sschung/cis612/LectureNotes_HBaseArchitecture.p… · Hadoop/HDFS...

stall Cloudera Manager Server - eecs.csuohio.edueecs.csuohio.edu/~sschung/cis612/Install_Cloudera2018.pdfCloudera recommends the use of parcels for installation over packages, because

How to execute a MapReduce program (WordCount.java) in …cis.csuohio.edu/~sschung/cis612/HowtoExecuteMapReduceinEclipse.pdf2018-92-05 INFO client.RMProxy: connecting to Resourceanager

Chapter 21 Introduction to Transaction Processing …cis.csuohio.edu/~sschung/cis612/Elmasri_6e_Ch21_Transaction.pdfInterleaved processing: Concurrent execution of ... the log is periodically

Elmasri 6e Ch04 Updated - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/Elmasri_6e_Ch04_Updated_0712.pdfTitle: Microsoft PowerPoint - Elmasri_6e_Ch04_Updated.ppt Author:

CIS 612 SUNNIE HUNG - cis.csuohio.educis.csuohio.edu/~sschung/cis612/LectureNotes_Hive_Final.pdf · Supports Select, Project, Join, Aggregate, Supports Union all and Sub-queries in

MongoDB/Cassandra - csuohio.educis.csuohio.edu/~sschung/cis612/Lecture_Notes_MongoDB_Cassandr… · MongoDB/Cassandra SUNNIE CHUNG CIS 612. MongoDB ... MongoDB’shigh availability

Lecture Notes VoltDBcis.csuohio.edu/~sschung/cis612/Lecture_Notes_VoltDB_1.pdfVoltDB VoltDBis an ACID-compliant relational database management system, which uses memory as storage

An Overview of Apache Spark - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/LectureNotesSparkFinal_Combined_1.pdf• Spark takes your Transformations, and creates a graph

stall Cloudera Manager Server - csuohio.educis.csuohio.edu/~sschung/cis612/Install_Cloudera2018.pdfcloudera MANAGER Specify hosts for your CDH cluster installation. 1032 PM Support

Sunnie Chung - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/DataAnalyticsCloudIntro...• Big Data Processing Systems on Financial Data Sunnie Chung Cleveland State University

FOLLOWERS MUTUAL FRIENDScis.csuohio.edu/~sschung/cis612/TwitterPresentationAdamRyan.pdf · Develop a system to find “mutual friends followers” between users of Facebook Twitter

Part!#1:!Install Hadoopcis.csuohio.edu/~sschung/cis612/Lab_MRHadoop_Um.pdfYassmeen’Abu’Hasson’-’2555788’ ExtraLab2’ ’ ’ Step2:!Installing’Hadoop’ Step!3:’Configure’Hadoop’

Introduction of SQL Programming - csuohio.educis.csuohio.edu/~sschung/cis612/LectureNoteStoredProcedure... · Embedded SQL A language to which SQL queries are embedded is referred

Advanced Partitioning Techniques for Massively Distributed …cis.csuohio.edu/~sschung/cis612/ChrisAdvanced... · 2018-04-16 · Advanced Partitioning Techniques for Massively Distributed

HBase - eecs.csuohio.edueecs.csuohio.edu/~sschung/cis612/LectureNotes_HBaseFinal.pdf · HBase Architecture HBase is a distributed database, designed to run on a cluster of servers.

CIS612 LectureNotes DatawarehouseWOJoinIndexcis.csuohio.edu/~sschung/cis612/CIS612_Lecture... · October 31, 2012 Data Mining: Concepts and Techniques 1 Data Warehouse Concepts and

MongoDB Guide - Cleveland State Universityeecs.csuohio.edu/~sschung/cis612/MongoDBInstallationGuide.pdf · MongoDB Guide CIS 612 SS Chung 1. Installation 1) installing MongoDb Sudo

HTML ( Hypertext MarkUP Languageeecs.csuohio.edu/~sschung/cis612/LectureNotes_webtechCMU.pdf4 HTML (Hypertext Markup Language) Common features – Tables – Frame – Form – Image

MapReduce - csuohio.edueecs.csuohio.edu/~sschung/cis612/LectureNotes_MapReduceFinal_1.pdf · MapReduce Challenges in ... MapReduce Example 1: Map Function for Word Count Map Function

Spark Tutorial with Set Up and Basic File Processingcis.csuohio.edu/~sschung/cis612/CIS612_SparkBasicProcessingTutorialHeideloff.pdf9) Show schema of review100B 10) Register the SchemaRDDs