HCatalog Table Management for Hadoop
Alan F. Gates
@alanfgates
Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Tech lead for Data team at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly
© Hortonworks Inc. 2012
Users: Data Sharing is Hard
Photo Credit: totalAldo via Flickr
This is programmer Bob; he uses Pig to crunch data.
This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.
Joe: Bob, I need today’s data.
Bob: Ok.
Joe: Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command.
Bob: Dude, we need HCatalog.
Pig Example
raw = load '/rawevents/20100819/data' using MyLoader()
    as (ts:long, user:chararray, url:chararray);
botless = filter raw by NotABot(user);
…
store output into '/processedevents/20100819/data';
Consumers of processedevents must be manually informed by the producer that data is available, or must poll HDFS (this is bad).
raw = load 'rawevents' using HCatLoader();
botless = filter raw by date == '20100819' and NotABot(user);
…
store output into 'processedevents'
    using HCatStorer('date=20100819');
Consumers of processedevents will be notified by HCatalog via JMS that data is available; they can then start their jobs.
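The contrast between polling and notification can be illustrated with a toy in-memory model. This is only a sketch: `ToyCatalog`, `subscribe`, and `add_partition` are hypothetical names invented for illustration, and a plain Python callback stands in for the JMS message that the slide mentions. It is not the HCatalog API.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a table catalog that announces new partitions."""

    def __init__(self):
        self._partitions = defaultdict(set)   # table name -> partition specs
        self._listeners = defaultdict(list)   # table name -> callbacks

    def subscribe(self, table, callback):
        """Register a consumer that wants to hear about new partitions."""
        self._listeners[table].append(callback)

    def add_partition(self, table, spec):
        """Record the partition, then notify every subscriber of that table."""
        self._partitions[table].add(spec)
        for callback in self._listeners[table]:
            callback(table, spec)


# A downstream job "starts" by appending to this list (a stand-in for real work).
started = []

catalog = ToyCatalog()
catalog.subscribe("processedevents",
                  lambda table, spec: started.append((table, spec)))

# The producer publishes today's partition; the consumer never polls.
catalog.add_partition("processedevents", "date=20100819")
print(started)
```

The consumer registers once and reacts when the producer publishes, which is the behavior the second Pig script relies on.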
Developers:
Each tool requires its own Translator
Diagram (one translator per tool and data format):
Tool         RCFile                  Custom Format
Pig          Hive Columnar Loader    Custom Loader
Hive         Columnar SerDe          Custom SerDe
MapReduce    RCFile InputFormat      Custom InputFormat
Developers:
Each tool requires its own Translator
Diagram (HCatalog as the shared layer):
Tools:      Pig (via HCatLoader), Hive, MapReduce (via HCatInputFormat)
Shared:     HCatalog
SerDes:     Columnar SerDe, Custom SerDe
Formats:    RCFile, Custom Format
Ops: Can I Ever Change the Data Format?
• Data format can be changed (e.g. CSV to JSON)
– no need to reformat existing data
– new data will be written in the new format
– HCatalog can read across the format changes
– User programs will not see the difference
• Data location can be changed without affecting user programs
• New columns can be added via alter table add column
– no need to reformat existing data
– fields not present in old data will be returned as null when reading old data
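The null-filling behavior for added columns can be sketched in a few lines of Python. This is an illustrative model only: `read_with_schema`, the sample rows, and the column names are assumptions made for the example, and dictionaries stand in for stored records. It does not show how HCatalog implements the read path.

```python
def read_with_schema(record, schema):
    """Project a stored record onto the current table schema.

    Columns that the record predates are returned as None (null),
    so old data never has to be rewritten.
    """
    return {column: record.get(column) for column in schema}


# Hypothetical rows: the first was written before the 'url' column was added.
old_row = {"ts": 1282176000, "user": "alice"}
new_row = {"ts": 1282262400, "user": "bob", "url": "/index.html"}

# Current schema after "alter table add column".
schema = ["ts", "user", "url"]

rows = [read_with_schema(row, schema) for row in (old_row, new_row)]
for row in rows:
    print(row)
```

Both rows come back with the same three columns, so a user program sees one consistent schema regardless of when the data was written.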
Relationship to Hive
• If you are a Hive user, you can transition to HCatalog with no metadata modifications (true starting with version 0.4)
– Uses Hive’s SerDes for data translation (starting in 0.4)
– Uses Hive SQL for data definition (DDL)
• Provides a different security implementation; security is delegated to the underlying storage
– Currently integrated with HDFS
– Integration with HBase in progress
– In the near future you will be able to choose the Hive authorization model if you wish
Current Status
• Release 0.4 (expected out next month)
– Hive/Pig/MapReduce integration
– Support for any data format with a SerDe (Text, Sequence, RCFile, JSON SerDes included)
– Notification via JMS
– Basic HBase integration
Current Development
• Improving HBase integration
– new HBase security features
– repeatable read for HBase tables
• REST API for HCatalog
– Databases, tables, partitions are objects
– PUT to create or modify objects
– GET to retrieve information about objects
– DELETE to drop objects
– Submitted documents and results encoded in JSON
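The shape of such requests can be sketched with Python's standard library. Everything specific here is an assumption for illustration: the host `hcat.example.com`, the `/v1/database/.../table/.../partition/...` path layout, and the JSON document fields are hypothetical, since the slide describes an API still under development. The requests are only constructed, never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; the slide does not give a URL.
BASE = "http://hcat.example.com/v1"


def object_url(database, table=None, partition=None):
    """Build a URL for a database, table, or partition object (made-up layout)."""
    parts = ["database", database]
    if table:
        parts += ["table", table]
    if partition:
        parts += ["partition", partition]
    return BASE + "/" + "/".join(parts)


def build_request(method, url, document=None):
    """Create (but do not send) a request whose body, if any, is JSON."""
    data = json.dumps(document).encode("utf-8") if document is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(url, data=data, method=method, headers=headers)


# PUT creates or modifies, GET retrieves, DELETE drops.
create = build_request("PUT", object_url("default", "processedevents"),
                       {"columns": [{"name": "ts", "type": "bigint"}]})
describe = build_request("GET", object_url("default", "processedevents"))
drop = build_request("DELETE",
                     object_url("default", "processedevents", "date=20100819"))

for request in (create, describe, drop):
    print(request.get_method(), request.full_url)
```

Treating databases, tables, and partitions as addressable objects lets one URL scheme cover all three, with the HTTP verb selecting the operation.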
Future Directions
Storing Semi-/Unstructured Data
Table Users:
Name   Zip
Alice  93201
Bob    76331
select name, zip
from users;
File Users:
{"name":"alice","zip":"93201"}
{"name":"bob","zip":"76331"}
{"name":"cindy"}
{"zip":"87890"}
A = load 'Users' as
    (name:chararray, zip:chararray);
B = foreach A generate name, zip;
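Projecting `name` and `zip` out of the JSON file above can be sketched in Python. The records are the four lines from the slide; the `project` helper is a hypothetical name, and returning `None` for an absent field is an assumption (the slide does not say how missing fields are handled).

```python
import json

# The four records from the "File Users" example.
LINES = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]


def project(lines, columns):
    """Yield one tuple per record, with None for any column the record lacks."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(column) for column in columns)


rows = list(project(LINES, ("name", "zip")))
for row in rows:
    print(row)
```

Every record yields a row of the same width, so the semi-structured file can be queried like the table next to it.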
Storing Semi-/Unstructured Data
Table Users:
Name   Zip
Alice  93201
Bob    76331
select name, zip
from users;
File Users:
{"name":"alice","zip":"93201"}
{"name":"bob","zip":"76331"}
{"name":"cindy"}
{"zip":"87890"}
A = load 'Users';
B = foreach A generate name, zip;
Data Lifecycle Management
Replication
Cleaning
Compaction
Archiving
Partitions in Different Storage
HBase
• Data can be streamed in
• Available for read almost instantly
HDFS
• Must load in batches
• Scan times 10x HBase
Latest: stores data in HBase
Historical: stores data in HDFS
Partitions in Different Storage
HBase
• Data can be streamed in
• Available for read almost instantly
HDFS
• Must load in batches
• Scan times 10x HBase
All: stores data in HBase and HDFS
Federation with MPP Data Stores
Diagram components: MPP Store, HCatalog, HiveStorageHandler, MPPTable, HadoopTable, Web Services
select * from HadoopTable;
select * from MPPTable;
Storing More Metadata
• HCatalog currently stores metadata in an RDBMS
– Most people use MySQL
• Table level statistics can currently be stored
• Would like to store partition statistics
– But this could overwhelm the RDBMS
• Would like to store user-generated metadata for partitions
– But this could overwhelm the RDBMS
• Could we store this data in HBase instead?
• Could we store all the metadata in HBase instead?
Thank You
Other Resources
• Next Webcast: Extending Hadoop beyond MapReduce
– March 7
– 10am Pacific/1pm Eastern
– http://hortonworks.com/webinars/
• Hadoop Summit
– June 13-14
– San Jose, California
– Call for Papers Deadline extended
– Hadoopsummit.org
• Hadoop Training and Certification
– Developing Solutions Using Apache Hadoop
– Administering Apache Hadoop
– http://hortonworks.com/training/