Sergey Kovalev (Altoros): Practical Steps to Improve Apache Hive Performance

Page 1

Practical Steps to Improve Hive Query Performance

Sergey Kovalev

Software Engineer at Altoros

Page 2

How Hive works

Page 3

1. Use partitions whenever possible

/folder1/video_data/file1

id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013

/folder1/video_data/2012/file1

1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012

/folder1/video_data/2013/file1

3, title3, channelId3, description3, 2013
4, title4, channelId4, description4, 2013

SELECT * FROM video WHERE uploadYear = '2013-04-08';

Page 4

1. Use partitions whenever possible

create table video (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) PARTITIONED BY (uploadYear date)
STORED AS ORC;

insert into table video PARTITION (uploadYear) select * from video_external;
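The insert above relies on dynamic partitioning, which Hive normally rejects in its default strict mode unless at least one partition value is given statically. A minimal sketch of the session settings that are usually needed follows. It assumes that video_external (the staging table named on the slide) exposes the columns listed, which the slides do not confirm. Hive takes the partition value from the last column of the SELECT, so listing the columns explicitly is safer than select *.

```sql
-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
-- 'nonstrict' permits an insert in which every partition column is dynamic.
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Assumed column layout of video_external; the partition column comes last.
INSERT INTO TABLE video PARTITION (uploadYear)
SELECT id, title, description, viewCount, uploadYear
FROM video_external;

-- List the partitions that the insert produced.
SHOW PARTITIONS video;
```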

Page 5

2. Use bucketing

create table video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING
) CLUSTERED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

create table channel (
  id STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC;

SELECT v.title FROM video v JOIN channel ch ON v.channelId = ch.id WHERE ch.viewCount>1000
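For the bucketed layout to help this join, both tables have to be written through Hive so that rows are hashed into the declared number of buckets. The sketch below assumes Hive 1.x or earlier, where bucketing is enforced only on request (Hive 2.0 and later always enforce it and no longer accept the property). The staging tables video_external and channel_external, and their column layouts, are assumptions made for illustration.

```sql
-- Needed on Hive 0.x/1.x so that inserts honor the declared bucket count.
SET hive.enforce.bucketing = true;

-- video_external and channel_external are assumed staging tables.
INSERT OVERWRITE TABLE video
SELECT id, channelId, title, description
FROM video_external;

INSERT OVERWRITE TABLE channel
SELECT id, title, description, viewCount
FROM channel_external;

-- Read a single bucket of the video table, for example to spot-check the data.
SELECT * FROM video TABLESAMPLE (BUCKET 1 OUT OF 2 ON channelId) v;
```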

Page 6

2. Use bucketing

/folder1/video_data/file1

id, title, channelId, description, uploadYear
1, title1, channelId1, description1, 2012
2, title2, channelId2, description2, 2012
3, title3, channelId3, description3, 2012
4, title4, channelId4, description4, 2012
5, title5, channelId5, description5, 2013
6, title6, channelId6, description6, 2013
7, title7, channelId7, description7, 2013
8, title8, channelId8, description8, 2013

/folder1/video_data/file1

2, title2, channelId2, description2, 2012
4, title4, channelId4, description4, 2012
6, title6, channelId6, description6, 2013
8, title8, channelId8, description8, 2013

/folder1/video_data/file2

1, title1, channelId1, description1, 2012
3, title3, channelId3, description3, 2012
5, title5, channelId5, description5, 2013
7, title7, channelId7, description7, 2013

Page 7

2. Use bucketing

/folder1/channel_data/file1

id, title, description, viewCount
channelId1, title1, description1, viewCount1
channelId2, title2, description2, viewCount2
channelId3, title3, description3, viewCount3
channelId4, title4, description4, viewCount4
channelId5, title5, description5, viewCount5
channelId6, title6, description6, viewCount6
channelId7, title7, description7, viewCount7
channelId8, title8, description8, viewCount8

/folder1/channel_data/file1

channelId2, title2, description2, viewCount2
channelId4, title4, description4, viewCount4
channelId6, title6, description6, viewCount6
channelId8, title8, description8, viewCount8

/folder1/channel_data/file2

channelId1, title1, description1, viewCount1
channelId3, title3, description3, viewCount3
channelId5, title5, description5, viewCount5
channelId7, title7, description7, viewCount7

Page 8

3. Partitions + bucketing

create table video (
  id STRING,
  channelId STRING,
  title STRING,
  description STRING,
  viewCount BIGINT
) PARTITIONED BY (uploadYear date)
CLUSTERED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

Page 9

3. Partitions + bucketing

/folder1/video_data/file1

id, title, channelId, viewCount, uploadYear
1, title1, channelId1, viewCount1, 2012
2, title2, channelId2, viewCount2, 2012
3, title3, channelId3, viewCount3, 2012
4, title4, channelId4, viewCount4, 2012
5, title5, channelId5, viewCount5, 2013
6, title6, channelId6, viewCount6, 2013
7, title7, channelId7, viewCount7, 2013
8, title8, channelId8, viewCount8, 2013

/folder1/video_data/2012/file1

2, title2, description2, viewCount2, 2012
4, title4, description4, viewCount4, 2012

/folder1/video_data/2012/file2

1, title1, description1, viewCount1, 2012
3, title3, description3, viewCount3, 2012

/folder1/video_data/2013/file1

6, title6, description6, viewCount6, 2013
8, title8, description8, viewCount8, 2013

/folder1/video_data/2013/file2

5, title5, description5, viewCount5, 2013
7, title7, description7, viewCount7, 2013

Page 10

4. Use joins optimization

Shuffle join/Common join:

Page 11

4. Use joins optimization

Map-side join:
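In a map-side join the smaller table is loaded into memory on every mapper, so the join completes without a shuffle or reduce phase. A minimal sketch of letting Hive choose this strategy automatically follows, reusing the video and channel tables from the bucketing slides. The size threshold shown is the commonly documented default and should be treated as an assumption for any particular cluster.

```sql
-- Let Hive convert a common join into a map join when one side is small.
SET hive.auto.convert.join = true;
-- Size limit in bytes below which a table counts as "small" (assumed value).
SET hive.mapjoin.smalltable.filesize = 25000000;

SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```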

Page 12

4. Use joins optimization

Sort-merge-bucket (SMB) join:
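A sort-merge-bucket join requires both tables to be bucketed and sorted on the join key with compatible bucket counts, so that matching buckets can be merged directly. The sketch below uses hypothetical table names (video_smb, channel_smb) to avoid clashing with the earlier definitions. The settings are the ones commonly listed for enabling SMB joins and may differ between Hive versions.

```sql
-- Both sides are bucketed AND sorted on the join key, with equal bucket counts.
CREATE TABLE video_smb (
  id STRING,
  channelId STRING,
  title STRING
) CLUSTERED BY (channelId) SORTED BY (channelId) INTO 2 BUCKETS
STORED AS ORC;

CREATE TABLE channel_smb (
  id STRING,
  title STRING,
  viewCount BIGINT
) CLUSTERED BY (id) SORTED BY (id) INTO 2 BUCKETS
STORED AS ORC;

-- Ask Hive to use the bucketed, sorted layout when planning the join.
SET hive.auto.convert.sortmerge.join = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT v.title
FROM video_smb v
JOIN channel_smb ch ON v.channelId = ch.id;
```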

Page 13

5. Choose the right input format

Row Data

Column Store
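Columnar formats such as ORC, which the earlier slides already use, let Hive read only the columns a query touches. As a hedged illustration, a text-backed table could be converted with a CREATE TABLE AS SELECT statement. The names video_text and video_orc are hypothetical, and SNAPPY is only one of the codecs ORC supports.

```sql
-- Copy a row-oriented table (assumed name: video_text) into ORC.
CREATE TABLE video_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS
SELECT * FROM video_text;

-- Only the title and viewCount columns need to be read for this query.
SELECT title, viewCount
FROM video_orc
WHERE viewCount > 1000;
```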

Page 14

6. Other optimization

Avoid highly normalized table structures

Compress map/reduce output

For map output compression, execute set mapred.compress.map.output = true.

For job output compression, execute set mapred.output.compress = true.

Use parallel execution

SET hive.exec.parallel=true;
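The two mapred.* properties above are Hadoop-level names. Hive also has its own switches for compressing intermediate and final output, and a setting for how many independent stages may run at once. The sketch below shows them together; the thread count is an assumed value and not taken from the slides.

```sql
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate = true;
-- Compress the final output written by the query.
SET hive.exec.compress.output = true;

-- Run independent stages of a query concurrently.
SET hive.exec.parallel = true;
-- Upper bound on concurrently running stages (assumed value).
SET hive.exec.parallel.thread.number = 8;
```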

Page 15

7. Use the 'explain' keyword to improve the query execution plan

EXPLAIN query...
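As a concrete illustration, EXPLAIN can be put in front of the join query from the bucketing slides. The variants below are standard Hive syntax. What the plan actually contains (map join or common join, which partitions are read) depends on the table layout and session settings, so no output is shown here.

```sql
-- Show the stages and operators Hive plans to run for the query.
EXPLAIN
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- EXTENDED adds more detail to the same plan.
EXPLAIN EXTENDED
SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;

-- DEPENDENCY lists the tables and partitions the query would read,
-- which is a quick way to confirm that partition pruning applies.
EXPLAIN DEPENDENCY
SELECT * FROM video WHERE uploadYear = '2013-04-08';
```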

Page 16

7. Use the 'explain' keyword to improve the query execution plan

Page 17

8. Stinger Initiative

Use cost-based optimization

Use vectorization

Transactions with ACID semantics
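The first two items can be switched on per session. A minimal sketch follows, assuming a Hive version that ships the cost-based optimizer and vectorized execution (both arrived with the Stinger work) and using the non-partitioned channel table from the bucketing slides. The optimizer is only as good as the statistics it has, hence the ANALYZE statements.

```sql
-- Cost-based optimization relies on table and column statistics.
SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;
SET hive.stats.fetch.column.stats = true;

-- Gather table-level statistics, then column-level statistics.
ANALYZE TABLE channel COMPUTE STATISTICS;
ANALYZE TABLE channel COMPUTE STATISTICS FOR COLUMNS;

-- Vectorized execution processes rows in batches instead of one at a time;
-- it is typically used with ORC tables.
SET hive.vectorized.execution.enabled = true;
```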

Page 18

8. Hive on Tez
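Switching a session from MapReduce to Tez is a single setting, assuming Tez is installed and configured on the cluster. The query is the join from the bucketing slides, reused for illustration only.

```sql
-- Run subsequent queries in this session on Tez instead of MapReduce.
SET hive.execution.engine = tez;

SELECT v.title
FROM video v
JOIN channel ch ON v.channelId = ch.id
WHERE ch.viewCount > 1000;
```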

Page 19

8. Sub-Second Queries with Hive LLAP

New approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process)

Page 20

Questions?