EEDC 34330 Execution - Jordi Torres is professor at UPC ... · PDF fileExecution Environments...
Execution Environments for Distributed Computing
Apache Hive
EEDC 34330
Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-1
Group members:
Hugo Pérez – [email protected]
Sergio Mendoza – [email protected]
Fenoy – [email protected]
Outline
● Introduction
● Hive Database
○ Data Model
○ Query Language
● Hive Architecture
● Conclusions
Introduction
Origins at Facebook...
● Facebook generates 500,000,000 logs per day
● Facebook users share a billion pieces of content daily
● Facebook stores a vast amount of data
Introduction
What's the problem?
● 250 million photos per day
● 2.7 billion likes and comments per day
● 2 billion total registered users
● 100 billion friendships
● ...
TOO MUCH DATA!!
Introduction
What is Apache Hive?
● Hive is a data warehouse infrastructure
And what is a data warehouse (DW)?
● A DW is a database used for reporting and analysis
Introduction
How does Apache Hive work?
● Hive is built on top of Hadoop
● Hive stores data in HDFS
● Hive compiles SQL queries into MapReduce jobs and runs those jobs on the cluster
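As a rough illustration of the last point, the sketch below imitates in plain Python what a query such as SELECT school, COUNT(1) FROM profiles GROUP BY school could turn into once it is split into map, shuffle and reduce steps. It does not use Hive or Hadoop; the sample rows and the function names are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["school"], 1


def shuffle(pairs):
    """Shuffle: group all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data standing in for a table stored in HDFS.
profiles = [
    {"userid": 1, "school": "UPC"},
    {"userid": 2, "school": "UPC"},
    {"userid": 3, "school": "MIT"},
]

counts = reduce_phase(shuffle(map_phase(profiles)))
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them; here everything runs in one process to keep the idea visible.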
Introduction
How does Apache Hive work?
HiveQL query
Introduction
How does a simple web app work?
MySQL query
Data Model
Hive structures data into well-understood database concepts such as tables, columns, and rows.
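HiveQL
Hive defines a simple SQL-like query language, called HiveQL (QL).
- Supports DDL and DML.
- Users can embed custom map-reduce scripts.
- Supports user-defined functions (UDF), user-defined aggregate functions (UDAF) and user-defined table-generating functions (UDTF).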
HiveQL Extract
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
      USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2;
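The slides do not show the contents of top10.py. The sketch below is only a guess at what such a reduce script might look like, assuming that Hive hands rows to the script as tab-separated lines and that the script should keep the highest-count memes for each school. The function name top_n_per_school is hypothetical, and the function buffers all rows in memory, which a production script would likely avoid.

```python
import heapq
from collections import defaultdict


def top_n_per_school(lines, n=10):
    """Return the n highest-count rows for each school.

    Each input line is one tab-separated row: school, meme, cnt.
    Output rows use the same tab-separated layout.
    """
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))

    out = []
    for school in sorted(by_school):
        # nlargest orders by count first, then by meme text on ties.
        for cnt, meme in heapq.nlargest(n, by_school[school]):
            out.append(f"{school}\t{meme}\t{cnt}")
    return out


# Hypothetical input rows, as they might arrive on standard input.
sample = ["UPC\tcat\t5\n", "UPC\tdog\t9\n", "MIT\tfox\t1\n"]
for row in top_n_per_school(sample, n=1):
    print(row)
```

Used as an actual streaming script, the function would be called with sys.stdin as its input and each returned row would be printed to standard output.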
Architecture
● External Interfaces: Hive provides both user interfaces, such as the command line (CLI) and a web UI, and application programming interfaces (APIs), such as JDBC and ODBC
● The Thrift Server exposes a very simple client API for executing HiveQL statements
● The Metastore is the system catalog. All other components of Hive interact with the metastore.
Architecture
● The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution
● The Compiler translates statements into a plan that consists of a DAG of map-reduce jobs
● The Driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order
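Submitting jobs in topological order means that no job starts before the jobs it depends on. The sketch below shows the idea with Python's standard graphlib module (Python 3.9 or later). The job names and the submit callback are hypothetical and are not taken from Hive's code.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
plan = {
    "join_status_profiles": set(),
    "count_memes": {"join_status_profiles"},
    "top10_per_school": {"count_memes"},
}


def submit_in_order(dag, submit):
    """Call submit(job) for every job, dependencies first."""
    order = []
    for job in TopologicalSorter(dag).static_order():
        submit(job)
        order.append(job)
    return order


# A stand-in for the execution engine: it only records the job name.
submitted = submit_in_order(plan, submit=lambda job: None)
print(submitted)
```

A real driver would also wait for each job to finish and handle failures; the sketch only covers the ordering.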
Metastore
The metastore is the system catalog which contains metadata about the tables stored in Hive.
● Database - is a namespace for tables.
● Table - Metadata for a table contains the list of columns and their types, owner, storage and SerDe information
● Partition - Each partition can have its own columns, SerDe and storage information
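One way to picture these three kinds of metadata is as nested records. The sketch below models them with Python dataclasses. It is not the metastore's real schema: every class name, field name, path and SerDe name here is an assumption chosen to mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class StorageInfo:
    location: str  # where the data files live
    serde: str     # name of the serializer/deserializer


@dataclass
class Partition:
    values: Dict[str, str]                  # partition key -> value
    storage: StorageInfo                    # may differ from the table's
    columns: Optional[List[Column]] = None  # None means "same as the table"


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Column]
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str  # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


# Hypothetical catalog content.
db = Database(name="default")
table = Table(
    name="status_updates",
    owner="analyst",
    columns=[("userid", "int"), ("status", "string")],
    storage=StorageInfo("/warehouse/status_updates", "ExampleSerDe"),
)
table.partitions.append(
    Partition(
        values={"ds": "2012-01-01"},
        storage=StorageInfo("/warehouse/status_updates/ds=2012-01-01", "ExampleSerDe"),
    )
)
db.tables[table.name] = table
print(db.tables["status_updates"].partitions[0].storage.location)
```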
Query Compiler
● Parser transforms a query string to a parse tree representation.
● Semantic Analyzer transforms the parse tree to a block-based internal query representation.
● Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators
● Optimizer performs multiple passes over the logical plan and rewrites it in several ways
● Physical Plan Generator converts the logical plan into a physical plan, consisting of a DAG of map-reduce jobs
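To make the idea of a rewrite pass over a tree of logical operators more concrete, the sketch below builds a toy plan and applies one pass that merges two stacked filters into one. The Op class, the merge_filters rule and the operator names are invented for this illustration; the slides do not say which rewrites Hive's optimizer actually performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Op:
    """A node in a toy logical plan: an operator plus its child operators."""
    name: str
    arg: str = ""
    children: List["Op"] = field(default_factory=list)


def merge_filters(op: Op) -> Op:
    """One rewrite pass: collapse Filter(Filter(x)) into a single Filter."""
    children = [merge_filters(child) for child in op.children]
    if op.name == "Filter" and len(children) == 1 and children[0].name == "Filter":
        inner = children[0]
        return Op("Filter", f"({op.arg}) AND ({inner.arg})", inner.children)
    return Op(op.name, op.arg, children)


def optimize(plan: Op, passes: List[Callable[[Op], Op]]) -> Op:
    """Apply each rewrite pass to the plan in turn."""
    for rewrite in passes:
        plan = rewrite(plan)
    return plan


# Toy plan: two filters stacked on top of a table scan.
plan = Op("Filter", "a > 1", [Op("Filter", "b < 2", [Op("Scan", "t")])])
optimized = optimize(plan, [merge_filters])
print(optimized)
```

Each pass takes a plan and returns a new plan, so further rewrites can be added to the list without changing the others.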
Conclusions
● Hive provides a way to perform business intelligence on very large data sets on top of the mature Hadoop map-reduce platform.
● The SQL-like HiveQL shortens the learning curve compared with writing low-level map-reduce programs.
Questions?
Links:
http://i.stanford.edu/~ragho/hive-icde2010.pdf
http://www.vldb.org/pvldb/2/vldb09-938.pdf
http://hive.apache.org/
https://cwiki.apache.org/Hive/languagemanual-transform.html
http://biggdata.blogspot.com/2011/04/refreshing-trendingtopics-website-data.html
http://code.google.com/p/hive-mrc/wiki/AboutHiveCore