My life as a beekeeper
@89clouds
Who am I?
Pedro Figueiredo ([email protected])
Hadoop et al
Social (Facebook) games, media (TV, publishing)
Elastic MapReduce, Cloudera
NoSQL, as in “Not a SQL guy”
The problem with Hive
It looks like SQL
No, seriously:
SELECT CONCAT(vishi, vislo),
       SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
FROM omniture
WHERE year(hittime) = 2011
  AND month(hittime) = 8
  AND is_search = 'Y'
GROUP BY CONCAT(vishi, vislo);
“It’s just like Oracle!”
Analysts will be very happy
At least until they join with that 30 billion-record table
Pro tip: explain MapReduce and then MAPJOIN
set hive.mapjoin.smalltable.filesize=xxx;
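A minimal sketch of the tip above, with hypothetical table and column names: raise the small-table threshold so Hive will consider the lookup table for a map-side join, then hint it in the query.

```sql
-- Allow tables up to ~64 MB to be used as the small side of a map join.
set hive.mapjoin.smalltable.filesize=67108864;

-- `events` (huge) and `countries` (tiny lookup) are made-up names.
SELECT /*+ MAPJOIN(c) */ e.user_id, c.country_name
FROM events e
JOIN countries c ON (e.country_code = c.code);
```

The map join ships the small table to every mapper, so the 30-billion-record side never has to be shuffled for the join.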
Your first interview question
“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
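A hedged sketch of the answer (names and paths are made up): a managed table's data is owned by Hive and deleted on DROP TABLE, while an external table's DROP only removes the metadata and leaves the files where they are.

```sql
-- Managed: data lives in the Hive warehouse; DROP TABLE logs deletes the files.
CREATE TABLE logs (line STRING);

-- External: Hive only records metadata; DROP TABLE logs_ext leaves /data/logs intact.
CREATE EXTERNAL TABLE logs_ext (line STRING)
LOCATION '/data/logs';
```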
Dynamic partitions
Partitions are the poor person’s indexes
Unstructured data is full of surprises:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
Plan your partitions ahead
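With those settings in place, a dynamic-partition load looks something like this (table and column names are hypothetical); Hive creates one partition per distinct value of the partition column, which it takes from the last column of the SELECT:

```sql
-- One partition per distinct hit_date value in the input.
INSERT OVERWRITE TABLE hits PARTITION (ds)
SELECT url, user_id, hit_date AS ds
FROM raw_hits;
```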
Multi-vitamins
You can minimise input scans by using multi-table INSERTs:
FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
Persistence, do you speak it?
External Hive metastore
Avoid the pain of cluster setup
Use an RDS metastore if on AWS, an RDBMS otherwise.
10 GB will get you a long way; the metastore is tiny.
Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.
WHERE foo RLIKE '(a|b|c)' will hurt
WHERE foo='a' OR foo='b' OR foo='c' is much cheaper
Generate these statements programmatically if need be; it will pay off.
Avro
Serialisation framework (think Thrift/Protocol Buffers).
Avro container files are SequenceFile-like, splittable.
Built-in support for Snappy compression.
If using the LinkedIn SerDe, the table creation syntax changes.
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable';
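Once the Avro-backed table exists, it behaves like any other partitioned external table; a hypothetical example of registering a partition directory and querying it:

```sql
-- Paths and the partition value are made up for illustration.
ALTER TABLE mytable ADD PARTITION (ds='2011-08-01')
LOCATION '/data/mytable/ds=2011-08-01';

SELECT COUNT(*) FROM mytable WHERE ds='2011-08-01';
```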
MAKE! MONEY! FAST!
Use spot instances in EMR
Usually stick around until America wakes up
Brilliant for worker nodes
Bag of tricks
set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
To be or not to be
“Consider a traditional RDBMS”
At what size should we do this?
Hive is not an end, it’s the means
Data on HDFS/S3 is simply available, not “available to Hive”
Hive isn’t suitable for near real time
Hive != MapReduce
Don’t use Hive instead of Native/Streaming
“I know, I’ll just stream this bit through a shell script!”
In my opinion, Hive excels at analysis and aggregation, so use it for that
Thank you
Fred Easey (@poppa_f)
Peter Hanlon