Advance Hive, NoSQL Database (HBase) - Module 7


Page 1: Advance Hive, NoSQL Database (HBase) - Module 7

Advance Hive, NoSQL Database (HBase)

Page 2: Advance Hive, NoSQL Database (HBase) - Module 7

HiveQL: Data Manipulation

Loading Data into Managed Tables

• Hive traditionally has no row-level insert, update, or delete operations (transactional tables, added in Hive 0.14, are the exception).
• Data can be loaded into tables only through "bulk" load operations.

LOAD DATA LOCAL INPATH '/usr/hive/warehouse/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

Page 3: Advance Hive, NoSQL Database (HBase) - Module 7

HiveQL: Data Manipulation

Inserting Data into Tables from Queries

• The INSERT statement loads data into a table from a query.
• With OVERWRITE, any previous contents of the partition (or of the whole table, if it is not partitioned) are replaced.

Page 4: Advance Hive, NoSQL Database (HBase) - Module 7

HiveQL: Data Manipulation

Dynamic Partition Inserts

• With the dynamic partition feature, Hive can infer the partitions to create based on query parameters.
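To make the idea of inferring partitions concrete, here is a minimal in-memory sketch in Java. It is not Hive code: the class `DynamicPartitionSketch`, the `Employee` type, and the sample rows are hypothetical and exist only for illustration. The sketch assumes the partition columns are `country` and `state` (as in the LOAD DATA example) and relies on Hive's `column=value` naming for partition directories.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DynamicPartitionSketch {
    /** One query result row: a payload column followed by the partition columns. */
    public static final class Employee {
        final String name;
        final String country;
        final String state;

        public Employee(String name, String country, String state) {
            this.name = name;
            this.country = country;
            this.state = state;
        }
    }

    /** Groups rows by partition values, keyed by a Hive-style partition directory name. */
    public static Map<String, List<String>> partition(List<Employee> rows) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (Employee e : rows) {
            // The partition is derived from the row's own values, not named in advance.
            String dir = "country=" + e.country + "/state=" + e.state;
            partitions.computeIfAbsent(dir, k -> new ArrayList<>()).add(e.name);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Employee> rows = new ArrayList<>();
        rows.add(new Employee("Ann", "US", "CA"));
        rows.add(new Employee("Bob", "US", "OR"));
        rows.add(new Employee("Cid", "US", "CA"));
        System.out.println(partition(rows));
    }
}
```

Each distinct combination of partition values produces one partition, which is why a single dynamic insert can create many partitions at once.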

Page 5: Advance Hive, NoSQL Database (HBase) - Module 7

HiveQL: Data Manipulation

Creating Tables and Loading Them in One Query

Exporting Data

Page 6: Advance Hive, NoSQL Database (HBase) - Module 7

User Defined Functions

• Hive can use User Defined Functions (UDFs) written in Java to perform computations that would otherwise be difficult (or impossible) to perform using the built-in Hive functions and SQL commands.

• To invoke a UDF from within a Hive script, you must:
1. Register the JAR file that contains the UDF class, and
2. Define an alias for the function using the CREATE TEMPORARY FUNCTION command.

Page 7: Advance Hive, NoSQL Database (HBase) - Module 7

Example UDF

Page 8: Advance Hive, NoSQL Database (HBase) - Module 7

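import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFZodiacSign extends UDF {
    private SimpleDateFormat df;

    public UDFZodiacSign() {
        df = new SimpleDateFormat("MM-dd-yyyy");
    }

    public String evaluate(Date bday) {
        // getMonth() is zero-based, so add 1; getDate() is the day of the month
        // (getDay() would return the day of the week).
        return this.evaluate(bday.getMonth() + 1, bday.getDate());
    }

    public String evaluate(String bday) {
        Date date = null;
        try {
            date = df.parse(bday);
        } catch (Exception ex) {
            // An unparseable date yields NULL in the query result.
            return null;
        }
        return this.evaluate(date.getMonth() + 1, date.getDate());
    }

    public String evaluate(Integer month, Integer day) {
        // Only January and February are handled here; other months return null.
        if (month == 1) {
            if (day < 20) {
                return "Capricorn";
            } else {
                return "Aquarius";
            }
        }
        if (month == 2) {
            if (day < 19) {
                return "Aquarius";
            } else {
                return "Pisces";
            }
        }
        return null;
    }
}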
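A date-based UDF like this one depends on reading the month and the day of the month correctly from `java.util.Date`. The following standalone sketch (the class `DateFieldsSketch` and its helper methods are hypothetical, not part of the slide's UDF) shows the difference between the relevant accessors: `getMonth()` is zero-based, `getDate()` is the day of the month, and `getDay()` is the day of the week. It assumes the same "MM-dd-yyyy" format used above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFieldsSketch {
    private static Date parse(String text) {
        try {
            return new SimpleDateFormat("MM-dd-yyyy").parse(text);
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Not a MM-dd-yyyy date: " + text, ex);
        }
    }

    /** Returns {month 1-12, day of month} for a date string in MM-dd-yyyy form. */
    @SuppressWarnings("deprecation")
    public static int[] monthAndDay(String text) {
        Date date = parse(text);
        // getMonth() is zero-based, so add 1; getDate() is the day of the month.
        return new int[] { date.getMonth() + 1, date.getDate() };
    }

    /** Returns what getDay() reports: the day of the week, with 0 meaning Sunday. */
    @SuppressWarnings("deprecation")
    public static int dayOfWeek(String text) {
        return parse(text).getDay();
    }

    public static void main(String[] args) {
        int[] md = monthAndDay("01-15-1990");
        System.out.println("month=" + md[0] + " dayOfMonth=" + md[1]);
        System.out.println("getDay()=" + dayOfWeek("01-15-1990"));
    }
}
```

Passing the `getDay()` value to a zodiac lookup would compare a weekday number (0 to 6) against cut-off days such as 19 or 20, so every birthday in a month would map to the same sign.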

Page 9: Advance Hive, NoSQL Database (HBase) - Module 7

Custom Map/Reduce in Hive

Page 10: Advance Hive, NoSQL Database (HBase) - Module 7

HBase: Introduction to HBase

• HBase is a distributed column-oriented data store built on top of HDFS.

• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing.

• Data is logically organized into tables, rows, and columns.

Page 11: Advance Hive, NoSQL Database (HBase) - Module 7

HBase vs. HDFS

• HDFS is good for batch processing (scans over big files)
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates

• HBase is designed to efficiently address the above points
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)

Page 12: Advance Hive, NoSQL Database (HBase) - Module 7

Tables, Rows, Column Family

• Table: HBase organizes data into tables. Table names are strings composed of characters that are safe for use in a file system path.

• Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[ ] (byte array).

• Column Family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column family names are strings composed of characters that are safe for use in a file system path.

Page 13: Advance Hive, NoSQL Database (HBase) - Module 7

Column, Cell, Timestamp

• Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance, and they need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[ ].

• Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also do not have a data type and are always treated as a byte[ ].

• Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used. If a timestamp is not specified for a read, the latest version is returned. The number of cell value versions retained by HBase is configured for each column family. The default number of cell versions was three in older releases; since HBase 0.96 it is one.
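The versioning rules for a cell can be sketched with a small in-memory model. This is an illustration only, not the HBase client API: the class `VersionedCellSketch` is hypothetical, values are held as `String` rather than byte[ ] for readability, and old versions are dropped immediately on write, which is a simplification (HBase itself removes excess versions later, during compaction).

```java
import java.util.Collections;
import java.util.TreeMap;

public class VersionedCellSketch {
    private final int maxVersions;
    // Sorted with the newest timestamp first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Collections.reverseOrder());

    public VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Writes a value with an explicit timestamp and trims versions beyond the limit. */
    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry is the oldest version
        }
    }

    /** Writes a value without a timestamp: the current time is used. */
    public void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    /** Reads without a timestamp: the latest version is returned. */
    public String get() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Reads the version written at exactly this timestamp, or null if none is retained. */
    public String get(long timestamp) {
        return versions.get(timestamp);
    }

    public int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(3);
        for (long ts = 1; ts <= 4; ts++) {
            cell.put(ts, "v" + ts);
        }
        System.out.println("latest=" + cell.get() + " retained=" + cell.versionCount());
    }
}
```

With a limit of three versions, the fourth write pushes out the oldest one, so a read at timestamp 1 no longer finds a value.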

Page 14: Advance Hive, NoSQL Database (HBase) - Module 7

Pictorial Representation

Page 15: Advance Hive, NoSQL Database (HBase) - Module 7

Representation as a Multi Dimensional Map

SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>
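The nested-map view can be written out as runnable Java using only `java.util`. This is a sketch of the logical model, not of how HBase stores data: the class `TableAsMapSketch` and the sample row keys, family, and qualifier are hypothetical, and `String` stands in for byte[ ] keys and values (for ASCII keys the sort order is the same).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TableAsMapSketch {
    // row key -> column family -> column qualifier -> timestamp -> value
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    /** Stores one versioned value, creating the intermediate maps on demand. */
    public void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    /** Returns the latest version of a cell, or null if the cell does not exist. */
    public String get(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families =
                rows.get(row);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> qualifiers = families.get(family);
        if (qualifiers == null) return null;
        NavigableMap<Long, String> versions = qualifiers.get(qualifier);
        if (versions == null || versions.isEmpty()) return null;
        return versions.lastEntry().getValue(); // highest timestamp wins
    }

    /** Row keys in sorted order, as a scan over the table would visit them. */
    public List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        TableAsMapSketch table = new TableAsMapSketch();
        table.put("row2", "info", "name", 1L, "Bob");
        table.put("row1", "info", "name", 1L, "Al");
        table.put("row1", "info", "name", 2L, "Alice");
        System.out.println(table.rowKeys());
        System.out.println(table.get("row1", "info", "name"));
    }
}
```

Because the outermost map is sorted, rows come back in row-key order regardless of insertion order, and a row only holds the qualifiers that were actually written to it.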

Page 16: Advance Hive, NoSQL Database (HBase) - Module 7

HBase Table as Key-Value Store

Page 17: Advance Hive, NoSQL Database (HBase) - Module 7

HBase Architecture

Page 18: Advance Hive, NoSQL Database (HBase) - Module 7

Client API: Administrative API

Page 19: Advance Hive, NoSQL Database (HBase) - Module 7

Client API: CRUD Operations put()

Page 20: Advance Hive, NoSQL Database (HBase) - Module 7

Client API: CRUD Operations get()

Page 21: Advance Hive, NoSQL Database (HBase) - Module 7

Client API: CRUD Operations delete()

Page 22: Advance Hive, NoSQL Database (HBase) - Module 7

HBase Clients

• Java Client
  • Useful when the interacting application is written in Java.

• REST and Thrift
  • HBase ships with REST and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.

Page 23: Advance Hive, NoSQL Database (HBase) - Module 7

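HBase MapReduce Integration

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleRowCounter extends Configured implements Tool {

    static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result> {
        public static enum Counters { ROWS }

        @Override
        public void map(ImmutableBytesWritable row, Result value, Context context) {
            // Count each row through a job counter; nothing is written as output.
            context.getCounter(Counters.ROWS).increment(1);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: SimpleRowCounter <tablename>");
            return -1;
        }
        String tableName = args[0];
        Scan scan = new Scan();
        // Return only the first cell of each row, which is enough for counting.
        scan.setFilter(new FirstKeyOnlyFilter());

        Job job = Job.getInstance(getConf(), getClass().getSimpleName());
        job.setJarByClass(getClass());
        TableMapReduceUtil.initTableMapperJob(tableName, scan,
                RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
        // Map-only job: no reducers and no output files.
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(HBaseConfiguration.create(),
                new SimpleRowCounter(), args);
        System.exit(exitCode);
    }
}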