
Execution Environments for Distributed Computing

Apache Hive

EEDC 34330

Master in Computer Architecture, Networks and Systems - CANS

Homework number: 3

Group number: EEDC-1

Group members:

Hugo Pérez

Sergio Mendoza

Carlos Fenoy

Outline

● Introduction
● Hive Database
    ○ Data Model
    ○ Query Language
● Hive Architecture
● Conclusions


Introduction

Origins at Facebook

● Facebook generates 500,000,000 logs per day

● Facebook shares a billion pieces of content daily

● Facebook stores a vast amount of data


Introduction

What's the problem?

● 250 million photos per day
● 2.7 billion likes and comments per day
● 2 billion total registered users
● 100 billion friendships
● ...

TOO MUCH DATA!!


Introduction

What is Apache Hive?

● Hive is a data warehouse infrastructure

and what is a Data Warehouse (DW)?

● a DW is a database for reporting and analysis


Introduction

How does Apache Hive work?

● Hive is built on top of Hadoop

● Hive stores data in HDFS

● Hive compiles SQL queries into MapReduce jobs and runs those jobs on the cluster
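
To make the last point concrete, here is a minimal Python sketch of how a query such as SELECT school, COUNT(1) FROM profiles GROUP BY school could be evaluated as a map phase, a shuffle, and a reduce phase. It runs in a single process. The table contents and all function names are illustrative assumptions, and the code is not what Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of a "profiles" table: (userid, school).
PROFILES = [(1, "UPC"), (2, "UPC"), (3, "MIT"), (4, "UPC"), (5, "MIT")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _userid, school in rows:
        yield school, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, which implements COUNT(1)."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases the way a single map-reduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(PROFILES))
```

On a real cluster the map and reduce functions run on many machines in parallel and the shuffle moves data across the network; only the division of work is the same.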

Introduction

How does Apache Hive work?

[Diagram: HiveQL query]

Introduction

How does a simple web app work?

[Diagram: MySQL query]


Data Model

Hive structures data into well-understood database concepts such as tables, columns, and rows.


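HiveQL

Hive defines a simple SQL-like query language called HiveQL (QL).

- Supports DDL and DML statements.

- Users can embed custom map-reduce scripts in queries.

- Supports user-defined functions (UDF), user-defined aggregate functions (UDAF), and user-defined table functions (UDTF).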


HiveQL Extract

    REDUCE subq2.school, subq2.meme, subq2.cnt
           USING 'top10.py' AS (school, meme, cnt)
    FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
          FROM (MAP b.school, a.status
                USING 'meme-extractor.py' AS (school, meme)
                FROM status_updates a JOIN profiles b
                     ON (a.userid = b.userid)
               ) subq1
          GROUP BY subq1.school, subq1.meme
          DISTRIBUTE BY school, meme
          SORT BY school, meme, cnt DESC
         ) subq2;
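
The slides do not show the scripts named in the query. Hive passes rows to such scripts as tab-separated lines on standard input and reads tab-separated lines back from standard output. Under that assumption, the core of a script like top10.py might look like the sketch below. The function name top_n, its parameter n, and the sample rows are hypothetical, and the real script may work differently.

```python
import heapq
from collections import defaultdict


def top_n(lines, n=10):
    """Yield the n highest-count "school<TAB>meme<TAB>cnt" rows per school."""
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        # nlargest returns the rows with the highest counts first.
        for cnt, meme in heapq.nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"


# Hypothetical input rows, in the shape produced by the inner query.
SAMPLE = ["UPC\tcat\t3\n", "UPC\tdog\t5\n", "MIT\tcat\t2\n"]

if __name__ == "__main__":
    for row in top_n(SAMPLE, n=1):
        print(row)
```

In an actual streaming script the function would be fed sys.stdin instead of SAMPLE, and each yielded row would be printed to standard output.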


Architecture

● External Interfaces: Hive provides user interfaces such as the command line (CLI) and a web UI, and application programming interfaces (APIs) such as JDBC and ODBC.

● Thrift Server exposes a very simple client API to execute HiveQL statements.

● Metastore is the system catalog. All other components of Hive interact with the metastore.


● Driver manages the life cycle of a HiveQL statement during compilation, optimization, and execution.

● Compiler translates statements into a plan that consists of a DAG of map-reduce jobs.

● The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order.
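
Topological order means that a job is submitted only after every job it depends on. A minimal sketch using Python's standard graphlib module (Python 3.9 or later) is shown below. The job names and the submit callback are hypothetical, and the code illustrates the ordering only, not Hive's driver.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
PLAN = {
    "join_status_profiles": set(),
    "count_memes": {"join_status_profiles"},
    "top10_per_school": {"count_memes"},
}


def submission_order(plan):
    """Return job names so that every job follows all of its dependencies."""
    return list(TopologicalSorter(plan).static_order())


def run_plan(plan, submit):
    """Pass each job to the submit callable in dependency order."""
    for job in submission_order(plan):
        submit(job)


if __name__ == "__main__":
    run_plan(PLAN, print)
```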


Metastore

The metastore is the system catalog which contains metadata about the tables stored in Hive.

● Database: a namespace for tables.

● Table: the metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information.

● Partition: each partition can have its own columns, SerDe, and storage information.
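
The three metastore objects can be pictured as nested records. The sketch below models them with Python dataclasses. Every class name, field name, path, and SerDe name in it is an illustrative assumption and does not reflect the real metastore schema.

```python
from dataclasses import dataclass, field


@dataclass
class StorageInfo:
    location: str  # for example, a directory in HDFS
    serde: str     # name of the serializer/deserializer (SerDe)


@dataclass
class Partition:
    values: dict          # partition column -> value
    storage: StorageInfo  # a partition may override the table's storage
    columns: list = field(default_factory=list)  # optional own columns


@dataclass
class Table:
    name: str
    owner: str
    columns: list         # (column name, type) pairs
    storage: StorageInfo
    partitions: list = field(default_factory=list)


@dataclass
class Database:
    name: str
    tables: dict = field(default_factory=dict)  # namespace: name -> Table

    def add_table(self, table):
        self.tables[table.name] = table


# Hypothetical catalog content.
db = Database("default")
status = Table(
    name="status_updates",
    owner="hugo",
    columns=[("userid", "int"), ("status", "string")],
    storage=StorageInfo("/warehouse/status_updates", "ExampleSerDe"),
)
status.partitions.append(
    Partition({"ds": "2011-04-01"},
              StorageInfo("/warehouse/status_updates/ds=2011-04-01", "ExampleSerDe"))
)
db.add_table(status)
```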


Query Compiler

● Parser transforms a query string to a parse tree representation.

● Semantic Analyzer transforms the parse tree to a block-based internal query representation.

● Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators

● Optimizer performs multiple passes over the logical plan and rewrites it in several ways

● Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs
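
The last two stages can be illustrated with a toy example. In the Python sketch below, the logical plan is a simple chain of operators rather than a tree, the optimizer has a single rewrite pass that merges adjacent filters, and the physical plan generator starts a new job whenever a second shuffle would be needed. The operator names, the choice of shuffle operators, and the splitting rule are simplifying assumptions, not Hive's actual rules.

```python
# Operators assumed to need a shuffle between the map and reduce sides.
SHUFFLE_OPS = {"GroupBy", "Join"}

# Hypothetical logical plan as a chain of (operator, argument) pairs.
LOGICAL_PLAN = [
    ("Scan", "status_updates"),
    ("Filter", "userid > 0"),
    ("Filter", "length(status) > 0"),
    ("Join", "profiles"),
    ("GroupBy", "school"),
    ("Select", "school, cnt"),
]


def merge_filters(plan):
    """Rewrite pass: collapse adjacent Filter operators into one."""
    out = []
    for op, arg in plan:
        if op == "Filter" and out and out[-1][0] == "Filter":
            out[-1] = ("Filter", f"({out[-1][1]}) AND ({arg})")
        else:
            out.append((op, arg))
    return out


def optimize(plan, passes=(merge_filters,)):
    """Apply each rewrite pass in turn to the logical plan."""
    for rewrite in passes:
        plan = rewrite(plan)
    return plan


def to_jobs(plan):
    """Split the operator chain into jobs with at most one shuffle each."""
    jobs, current, has_shuffle = [], [], False
    for op, arg in plan:
        if op in SHUFFLE_OPS and has_shuffle:
            jobs.append(current)
            current, has_shuffle = [], False
        current.append((op, arg))
        has_shuffle = has_shuffle or op in SHUFFLE_OPS
    if current:
        jobs.append(current)
    return jobs


if __name__ == "__main__":
    for number, job in enumerate(to_jobs(optimize(LOGICAL_PLAN)), start=1):
        print(number, [op for op, _ in job])
```

With this input the two filters become one, and the chain is cut between the join and the group-by, which gives two jobs.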


Conclusions

● Hive provides a way to perform business intelligence on very large data sets on top of the mature Hadoop map-reduce platform.

● The SQL-like HiveQL shortens the learning curve compared with writing low-level map-reduce programs.


Questions?

Links:

http://i.stanford.edu/~ragho/hive-icde2010.pdf
http://www.vldb.org/pvldb/2/vldb09-938.pdf
http://hive.apache.org/
https://cwiki.apache.org/Hive/languagemanual-transform.html
http://biggdata.blogspot.com/2011/04/refreshing-trendingtopics-website-data.html
http://code.google.com/p/hive-mrc/wiki/AboutHiveCore