Impala cookbook 01-2017 - Cloudera Blog
© Cloudera, Inc. All rights reserved.
The Impala Cookbook
From Cloudera’s Impala Team
Updated Jan. 2017
*Currently an Apache Incubator project
Topic Outline
• Part 1 – The Basics
  • Physical and Schema Design
  • Memory Usage in Impala
• Part 2 – The Practical Issues
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking Impala
  • Multi-tenancy Best Practices
  • Query Tuning Basics
• Part 3 – Outside Impala
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet
Physical and Schema Design - Outline
• Schema design best practices
  • Data types
  • Partition design
  • Complex types
  • Common questions
• Physical design
  • File format – when to use what?
  • Block size (option)
Physical and Schema Design - Data Types
• Use Numeric Types (not strings)
  • Avoid string types when possible
  • Strings => higher memory consumption, more storage, slower to compute (80% slower compared to numeric)
  • E.g. date string “20161031” or unix time “1479459272”, switch to bigint
• Decimal vs Float/Double
  • Decimal is easier to reason about
  • Currently can't use Decimal as partition key or in UDFs
• Use String only for:
  • HBase row key – string is the recommended type!
  • Timestamp – ok to use string, but consider using numeric as well!
• Prefer String over Char/Varchar (except for SAS)
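The string-to-numeric switch recommended above can be sketched in Python, for example in an ETL step that prepares data before loading. This is a minimal illustration and not part of the original deck: the function names `date_string_to_int` and `unix_time_to_date_int` are hypothetical, and UTC is assumed when deriving a date from a Unix timestamp.

```python
from datetime import datetime, timezone


def date_string_to_int(date_str: str) -> int:
    """Turn a 'YYYYMMDD' string such as '20161031' into the integer 20161031."""
    # strptime raises ValueError on malformed input, so bad dates fail early.
    datetime.strptime(date_str, "%Y%m%d")
    return int(date_str)


def unix_time_to_date_int(unix_seconds: int) -> int:
    """Derive an integer YYYYMMDD key from a Unix timestamp (UTC is an assumption)."""
    day = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return int(day.strftime("%Y%m%d"))


# The two sample values from the slide, stored as numbers instead of strings.
print(date_string_to_int("20161031"))
print(unix_time_to_date_int(int("1479459272")))
```

Both results fit comfortably in a bigint column, which is the type the slide suggests.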
Partition Design: 3 Simple Rules
1. Identify access patterns from the use case or existing SQL.
2. Estimate the total number of partitions (better be < 100k!).
3. (Optional) If needed, reduce the number of partitions.
Partition Design – Identify Access Patterns
• The table columns that are commonly used in the WHERE clause are candidates for partition keys.
  • Date is almost always a common access pattern and the most common partition key.
• Can have MULTIPLE PARTITION KEYS! Examples:
  • Select col_1 from store_sales where sold_date between '2014-01-31' and '2016-02-23';
  • Select count(revenue) from store_sales where store_group_id in (1,5,10);
  • Partition keys => sold_date, store_group_id
Partition Design – Estimate the # partitions
• Estimate the number of distinct values (NDV) for each partition key (for the required storage duration). Example:
  • If partitioned by date and need to store for 1 year, then NDV for date partition key is 365.
  • num store_group will grow over time, but it will never exceed 52 (one for each state).
• Total number of partitions = NDV for part key 1 * NDV for part key 2 * … * NDV for partition key N. Example:
  • Total number of partitions = 365 (for date part) * 52 (for store_group) ~= 19k
• Make sure # partitions <= 100k!
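The partition estimate above is a plain product of per-key NDVs, so it is easy to script as a sanity check. The sketch below is illustrative only; the function name `estimate_partitions` and the constant `MAX_PARTITIONS` are hypothetical names, while the NDV values and the 100k ceiling come from the slide.

```python
from math import prod

MAX_PARTITIONS = 100_000  # ceiling recommended on the slide


def estimate_partitions(ndv_per_key: dict) -> int:
    """Multiply the number of distinct values (NDV) of every partition key."""
    return prod(ndv_per_key.values())


# Slide example: one year of daily partitions times at most 52 store groups.
total = estimate_partitions({"sold_date": 365, "store_group_id": 52})
print(total, "ok" if total <= MAX_PARTITIONS else "too many partitions")
```

Keeping ten years of daily partitions with the same 52 store groups would give 189,800 partitions, which is over the ceiling and a signal to apply the reduction techniques described next.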
Partition Design – Too Many Partitions?
• Remove some “unimportant” partition keys.
  • If a partition key isn’t as routinely used or it doesn’t impact the SLA, remove it!
• Create partition “buckets” and specify bucket id in downstream select queries.
  • Use month rather than date.
  • Create artificial store_group to group individual stores.
  • Technique: use prefix or hash
    • Hash(store_id) % store_group size => hash it to store_group
    • Substring(store_id, 0, 2) => use the first 2 digits as artificial store_group
  • e.g. select … from tbl where store_id = 5 and store_group = store_id % 50; // Assume 50 store groups
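The two bucketing techniques above (modulo and prefix) can be sketched as follows. This is an illustration under stated assumptions, not code from the deck: the function names are hypothetical, store ids are assumed to be integers for the modulo variant and digit strings for the prefix variant, and 50 store groups is the figure used in the slide's example query.

```python
def store_group_by_mod(store_id: int, num_groups: int = 50) -> int:
    """Bucket an integer store_id into one of num_groups artificial store groups."""
    return store_id % num_groups


def store_group_by_prefix(store_id: str, prefix_len: int = 2) -> str:
    """Use the first prefix_len digits of a string store_id as the store group."""
    return store_id[:prefix_len]


def bucketed_predicate(store_id: int, num_groups: int = 50) -> str:
    """Build a WHERE fragment that names the bucket so partition pruning can apply."""
    group = store_group_by_mod(store_id, num_groups)
    return f"store_id = {store_id} and store_group = {group}"


print(bucketed_predicate(5))
print(store_group_by_prefix("98765"))
```

The point of `bucketed_predicate` is the one the slide makes: downstream queries must mention the bucket id explicitly, otherwise the artificial partition key does not help.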
Nested Types: Known Issues
• The maximum size of a single row fed into a join build cannot exceed 8MB (buffered-tuple-stream block size). For complex types this means:
  • Size of all materialized collections (post projection and filtering) cannot exceed 8MB for certain plan shapes (expected to be rare).
• IMPALA-2603: Incorrect plan is generated for inline view referencing complex types.
  • Inline view has several relative table refs that refer to different ancestor query blocks within the same nesting hierarchy.
Schema Design – Common Issues
• Number of columns - 2k max
  • Not a hard limit; Impala and Parquet can handle even more, but…
  • It slows down Hive Metastore metadata update and retrieval
  • It leads to big column stats metadata, especially for incremental stats
• Timestamp/Date
  • Use timestamp for date;
  • Date as partition column: use string or int (20150413 as an integer!)
• BLOB/CLOB – use string
• String size - no definitive upper bound but 1MB seems ok
  • Larger-sized string can crash Impala!
  • Use it sparingly - the whole 1MB string will be shipped everywhere
Physical Design – File Format
• Parquet/Snappy
  • The long-term storage format
  • Always good for reading!
  • Write is very slow (reportedly 10x slower than Avro).
• Snappy vs Gzip
  • Snappy is usually a better tradeoff between compression ratio and CPU.
  • But, run your own benchmark to confirm!
• For write-once-read-once tmp ETL tables, consider text instead because:
  • Writing is faster.
  • Impala can perform the write.
Physical Design – Block Size
• Number of blocks defines the degree of parallelism:
  • True for both MapReduce and Impala
  • Each block is processed by a single CPU core
  • To leverage all CPU cores across the cluster, num blocks >= num core
• Larger block size:
  • Better compression and IO throughput, but fewer blocks, could reduce parallelism
• Smaller block size:
  • More parallelism, but could reduce IO throughput
  • Can cause metadata bloat and create bottlenecks on HDFS NameNode (NN RPC overhead 40K-50K/s), Hive Metastore, Impala catalog service
Physical Design – Block Size
• For Apache Parquet, ~256MB is good and no need to go above 1GB.
  • Don’t go below 64MB except when you need more parallelism!
• (Advanced) If you really want to confirm the block size, use the following equation:
  • Block Size <= p * t * c / s
    • p – disk scan rate at 100MB/sec
    • t – desired response time of the query in sec
    • c – concurrency
    • s - % of column selected
• Regularly compact tables to keep the number of files per partition under control and improve scan and compression efficiency
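The block-size equation above can be evaluated with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, and `s` is interpreted as a fraction between 0 and 1 (the slide only says "% of column selected"), so 50% of the columns is passed as 0.5.

```python
def block_size_upper_bound_mb(scan_rate_mb_per_sec: float,
                              response_time_sec: float,
                              concurrency: float,
                              column_fraction: float) -> float:
    """Evaluate Block Size <= p * t * c / s and return the bound in MB."""
    if not 0 < column_fraction <= 1:
        # Assumption: s is a fraction of the columns, not a 0-100 percentage.
        raise ValueError("column_fraction must be in (0, 1]")
    return scan_rate_mb_per_sec * response_time_sec * concurrency / column_fraction


# p = 100 MB/sec (slide value), t = 2 sec, c = 1, s = half of the columns.
print(block_size_upper_bound_mb(100, 2, 1, 0.5))
```

Whatever the equation yields, the slide's practical guidance still applies: roughly 256MB for Parquet, and stay between 64MB and 1GB.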
Memory Usage – The Basics
• Memory is used by:
  • Hash join – RHS tables after decompression, filtering and projection
  • Group by – proportional to the # groups
  • Parquet writer buffer – 256MB per partition
  • IO buffer (shared across queries)
• Memory held and reused by later queries
  • Impala releases memory from time to time in 1.4 and later
Catalog memory usage
• Metadata cache heap memory usage can be calculated by
  • num of tables * 5KB + num of partitions * 2KB + num of files * 750B + num of file blocks * 300B + sum(incremental col stats per table)
  • Incremental stats
    • For each table, num columns * num partitions * 400B
• Catalog-Update topic size should be no more than 2GB
  • http://<statestore host>:25010/topics
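The catalog heap formula above lends itself to a small calculator. The sketch below is illustrative: the function names and the per-table dictionary layout are hypothetical, and "KB" is assumed to mean 1024 bytes. The per-object sizes (5KB, 2KB, 750B, 300B, 400B) are the ones given on the slide.

```python
KB = 1024  # assumption: the slide's "KB" is read as 1024 bytes


def incremental_stats_bytes(num_columns: int, num_partitions: int) -> int:
    """Per-table incremental stats: num columns * num partitions * 400B."""
    return num_columns * num_partitions * 400


def catalog_heap_bytes(tables: list) -> int:
    """Sum the slide's per-table, per-partition, per-file and per-block costs."""
    total = 0
    for t in tables:
        total += 5 * KB                    # the table itself
        total += t["partitions"] * 2 * KB  # partitions
        total += t["files"] * 750          # files
        total += t["blocks"] * 300         # file blocks
        if t.get("incremental_stats", False):
            total += incremental_stats_bytes(t["columns"], t["partitions"])
    return total


# Hypothetical table: 10 columns, 100 partitions, 1000 files, 2000 blocks.
example = [{"columns": 10, "partitions": 100, "files": 1000,
            "blocks": 2000, "incremental_stats": True}]
print(catalog_heap_bytes(example))
```

For this made-up table, incremental stats account for 400,000 of the roughly 1.96 million bytes, which illustrates why wide, heavily partitioned tables with incremental stats dominate catalog memory.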
Memory Usage – Estimating Memory Usage
• Use Explain Plan
  • Requires statistics! Memory estimate without stats is meaningless.
  • Reports per-host memory requirement for this cluster size.
  • Re-run if you’ve re-sized the cluster!
Memory Usage – Estimating Memory Usage (Cont’d)
• EXPLAIN’s memory estimate issues
  • Can be way off – much higher or much lower.
  • group by’s estimate can be particularly off – when there’s a large number of group by columns.
  • Mem estimate = NDV of group by column 1 * NDV of group by column 2 * … NDV of group by column n
• Ignore EXPLAIN’s estimate if it’s too high!
• Do your own estimate for group by
  • GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num node
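The hand estimate for GROUP BY memory above can be written out as a short Python function. This is a sketch of the slide's formula only; the function name and the sample inputs (one million groups, 16-byte rows, 10 nodes) are made up for illustration.

```python
def group_by_mem_bytes(num_groups: int, row_size_bytes: int, num_nodes: int) -> float:
    """GROUP BY mem usage = groups * row size + (groups * row size) / num nodes."""
    full = num_groups * row_size_bytes
    return full + full / num_nodes


# Hypothetical query: 1,000,000 groups, 16 bytes per row, 10-node cluster.
estimate = group_by_mem_bytes(1_000_000, 16, 10)
print(f"{estimate / 1024 / 1024:.1f} MiB per node")
```

Note that the first term does not shrink as nodes are added, so the per-node requirement is driven mainly by the number of groups and the row size rather than by cluster size.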
Memory Usage – Finding Actual Memory Usage
• Search for “Per Node Peak Memory Usage” in the profile.
  • This is accurate. Use it for production capacity planning.
Memory Usage – Finding Actual Memory Usage (Cont’d)
• For complex queries, how do I know which part of my query is using too much memory or causing an Out-Of-Memory error (i.e. hitting the mem-limit)?
  • Look at the “Peak Mem” in the ExecSummary from the query profile!
Memory Usage – Hitting Mem-limit
• Top causes (in order) of hitting mem-limit even when running a single query:
  1. Lack of statistics
  2. Lots of joins within a single query
  3. Big-table joining big-table
  4. Gigantic group by
Memory Usage – Hitting Mem-limit (Cont’d)
• Lack of stats
  • Wrong join order, wrong join strategy, wrong insert strategy
  • Explain Plan tells you that!
• Fix: Compute Stats <table>
  • For a huge table where compute stats takes too long to finish, you can manually set table/column stats
Memory Usage – Hitting Mem-limit (Cont’d)
• Lots of joins within a single query
  • select … from fact, dim1, dim2, dim3, … dimN where …
  • Each dim table can fit in memory, but not all of them together
  • As of CDH 5.4/Impala 2.2,
    • Impala might choose the wrong plan – BROADCAST
    • Impala sometimes requires 256MB as the minimal requirement per join!
• FIX 1: use shuffle hint
  • Select … from fact join [shuffle] dim1 on … join [shuffle] dim2 …
• FIX 2: pre-join the dim tables (if possible). Fewer joins => better perf!
Memory Usage - Hitting Mem-limit (Cont’d)
• Big-table join big-table
  • Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
  • CDH 5.4/Impala 2.0 and later will do this (via disk-based join).
  • (Advanced) For a simple query, you can try this advanced workaround – per-partition join
    • Requires the partition key be part of the join key
    Select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
    union all
    Select … from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)
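The per-partition join workaround above means writing one branch per batch of partition keys, which gets tedious by hand. A minimal Python sketch that generates the UNION ALL text is shown below. It is an illustration only: the function names are hypothetical, the select list is passed in by the caller because the slide elides it, partition keys are assumed to be integers (no quoting is done), and the generated SQL simply mirrors the shape of the slide's query.

```python
def chunked(keys: list, batch_size: int) -> list:
    """Split the partition keys into consecutive batches of batch_size."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def per_partition_join_sql(select_list: str, table_a: str, table_b: str,
                           part_key: str, keys: list, batch_size: int) -> str:
    """Build one join branch per batch of partition keys, glued with UNION ALL."""
    branches = []
    for batch in chunked(keys, batch_size):
        in_list = ", ".join(str(k) for k in batch)  # assumes integer keys
        branches.append(
            f"select {select_list} from {table_a} a join {table_b} b "
            f"where a.{part_key} = b.{part_key} and a.{part_key} in ({in_list})"
        )
    return "\nunion all\n".join(branches)


sql = per_partition_join_sql("a.id, b.val", "BigTbl_A", "BigTbl_B",
                             "part_key", [1, 2, 3, 4, 5, 6], batch_size=3)
print(sql)
```

Smaller batches lower the memory needed per branch at the cost of more branches in the statement, so the batch size is a tuning knob rather than a fixed value.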
Memory Usage – Disk-based Join/Agg
• Disk-based join/agg should be your last resort to deal with hitting mem-limit.
• Rely on disk-based join/agg if there is only one join/agg operator in the query. For example:
  • Good: select a.*, b.* from a, b where a.id = b.id
  • Good: select a.id, a.timestamp, count(*) from a group by a.id, a.timestamp
  • OK: select large_tbl.id, count(*) from large_tbl join tiny_tbl on (id) group by id
  • Bad: select t1.id, count(*) from large_tbl_1 t1, large_tbl_2 t2 where t1.id = t2.id group by t1.id
  • Bad: select a.*, b.*, c.* from a, b, c where a.id = b.id and b.col1 = c.col2;
• Always set the per-query mem-limit (2GB min) when using disk-based join/agg!
Memory Usage - Additional Notes
• Use explain plan for estimate; use profile for accurate measurement
• Data skew can cause uneven memory/CPU usage
• Review previous common issues on out-of-memory
• Even with disk-based join in Impala 2.0 and later, you'll want to review the Query Tuning steps to speed up queries and use memory more efficiently.
Hardware Recommendations
• 128GB (assigned to Impala) or more for best price/performance
• Spindles vs SSD
  • Spindles are more cost effective
  • Most workload is CPU bound; SSD won’t make a difference at all
• 10Gb network
Cluster Sizing – Objective and Keys
• Objective:
  • The recommended cluster size should run the desired workload within a given SLA throughout the projected lifespan of the cluster.
• Keys:
  • Workload - defines the functional requirement
  • SLA – defines the performance requirement
  • Projected lifespan – how will things change over time?
Cluster Sizing - SLA
• Query Throughput
  • How many queries should this cluster process per second?
  • This is the more meaningful measurement of “computing power”
• Query Response Time
  • How fast do you need the query to run?
  • Typically, single query response time isn’t too meaningful because there are always multiple queries running concurrently!
  • Some use cases require very fast response time, e.g. powering a web UI.
• Will more people be running queries over time? This means higher query throughput!
Cluster Sizing - Workload
• From the workload, you’ll want to know:
  • How much memory do you need?
  • How much processing power do you need? (i.e. how complex is the workload?)
  • How much IO bandwidth do you need?
• The bigger the cluster, the more total memory, CPU, and disk-IO bandwidth you have.
  • But usually, the network bandwidth is fixed.
Cluster Sizing – Memory Requirements
• How much memory do you need?
  • Any huge group by expression?
    • Per Node Mem >= number of distinct group * row size + (number of distinct group * row size) / num node
    • Number of distinct group: hard to guess; just get a rough ballpark.
    • Row size: # columns involved in the query * column width
    • Column width for int 4 byte, bigint 8 byte, etc. For string columns, take some rough guess.
    • Increasing the cluster size won’t help much to reduce the per-node mem requirement.
Cluster Sizing – Connection Requirements
• The larger the cluster, the more intra connections needed between impalads. With high concurrency and/or very complex queries, this could cause a connection storm and query failures.
  • Impala caches established connections and re-uses them. If the workload is executed again, the existing connection pool will be leveraged to satisfy each connection request.
• On a Kerberos secured cluster, each impalad needs to authenticate with every other impalad. Requires: N * N KDC tickets. This could overwhelm a single KDC server.
  • If the cluster has more than 200 nodes, consider using more KDC servers to balance load.
Cluster Sizing – Workload Complexity (Cont’d)
• (Advanced) If you’re ready to dive deep into workload analysis…
  • Typically, you can assume the following processing rate (3 columns, parquet format with snappy):
    • Scan node 8~10m rows per sec per core
    • Join node (single join predicate) ~10m rows per sec per core
    • Agg node (single agg) ~5m rows per sec per core
    • Sort node - ~17MB per sec per core
    • Parquet writer 1~5MB per sec per core
  • From a query in the workload, based on the # join/agg you can estimate # rows passing through the operator node then estimate the effect of partition pruning.
  • Using the above processing rate, you can estimate how much cpu time it took to process a query.
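Using the processing rates above, a back-of-the-envelope CPU-time estimate can be scripted. The sketch below is illustrative and rests on assumptions: the names are hypothetical, the scan rate takes the conservative end of the slide's 8~10m range and is read as rows per second per core, the row counts are invented, and the wall-clock figure assumes perfect parallelism across cores, which real queries will not reach.

```python
# Rows per second per core, taken from the slide (scan uses the low end of 8~10m).
ROWS_PER_SEC_PER_CORE = {
    "scan": 8_000_000,
    "join": 10_000_000,
    "agg": 5_000_000,
}


def cpu_seconds(rows_by_operator: dict) -> float:
    """Total core-seconds: rows through each operator divided by that operator's rate."""
    return sum(rows / ROWS_PER_SEC_PER_CORE[op]
               for op, rows in rows_by_operator.items())


# Hypothetical query: 1B rows scanned after pruning, 100M rows joined and aggregated.
total = cpu_seconds({"scan": 1_000_000_000, "join": 100_000_000, "agg": 100_000_000})
cores = 100  # hypothetical cluster-wide core count
print(f"{total:.0f} core-seconds, about {total / cores:.2f} s if perfectly parallel")
```

Dividing the target workload's core-seconds per second by the cores per node gives a first guess at node count, to be checked against a real benchmark.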
Cluster Sizing – Summary
• Cluster sizing depends on SLA and workload. You need to know both!
• Memory requirement for doing big join/agg in memory:
  • Total Cluster mem >= the 2nd largest big table in the join after decompression, filtering, and projection
  • Per Node Mem >= number of distinct group * row size + (number of distinct group * row size) / num node
Benchmarking – Why Run One?
• Understand how Impala performs, how it scales, how it compares to the current system
  • Measures query response time as well as query throughput
• Understand how Impala utilizes resources
  • Measure CPU, disk, network and memory
Benchmarking Impala – Preparing the Workload
• Should be relevant to (or satisfy) the business requirement.
  • Don’t run select * from big_tbl limit 10 – it’s meaningless.
• Should not be dictated on the query form.
  • You should be prepared to change the query/schema to deliver a meaningful benchmark.
  • Tune the schema/query!
• Stay close to the query that you’re going to run in production.
  • If the result has to be written to disk, then write to disk and DO NOT send results back to the client.
  • Don’t stream all the data to the client (i.e. data extraction).
• Use a fast client: You’re benchmarking the server, not the client, so don’t make the client a bottleneck.
Benchmarking – Avoiding Traps
• It’s easier to start with a smaller data set and simpler query. Trying to run a complex query on a huge data set on a small cluster is not effective.
• A data set that’s too small can’t utilize the whole cluster. Have at least one block per disk.
• Disable Admission Control when you’re running a benchmark!
Benchmarking – Preparing the Hardware
• Should be as similar to go-live hardware as possible
• Recommended: at least 10 nodes with 128GB each
• CAUTION: If the cluster is too small (e.g. 3 nodes), it’s very hard to see the effect of scalability and identify potential bottlenecks
Benchmarking – Measuring Single-Query Response Time
• Use impala-shell (simple, easy to use) with the -B option. This disables pretty formatting so the client won’t be the bottleneck.
  • impala-shell -B -q "<your query>"
• To simulate the effect of buffer cache, run it a few times to warm the buffer cache before measuring the result.
• To simulate the effect without buffer cache, clear the buffer cache by running: echo 1 > /proc/sys/vm/drop_caches
Benchmarking – Measuring Query Throughput
• Benchmark using JDBC (or use JMeter! Make sure you set NativeQuery=1)
  • Input parameter: list of hosts to connect to, workload queries, duration of the benchmark, and number of concurrent connections.
  • Each connection will pick a host to connect to and keep running a query for the specified duration.
  • Report QPS.
• Just keep increasing the number of connections until QPS doesn’t increase – that will be the QPS of the system.
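The throughput procedure above can be sketched as a small driver-agnostic harness. The slide recommends JDBC or JMeter; this Python version is only an illustration of the loop structure. All names are hypothetical, and `run_query(host, query)` is a placeholder callable that the reader would back with a real client connection. Doubling the connection count and the 5% minimum-gain threshold are assumptions, since the slide only says to keep increasing connections until QPS stops rising.

```python
import itertools
import threading
import time


def measure_qps(hosts, queries, duration_sec, num_connections, run_query):
    """Run queries from num_connections threads for duration_sec; return queries/sec."""
    deadline = time.monotonic() + duration_sec
    counts = [0] * num_connections

    def worker(idx):
        host = hosts[idx % len(hosts)]          # each connection picks one host
        for query in itertools.cycle(queries):  # keep running queries until time is up
            if time.monotonic() >= deadline:
                break
            run_query(host, query)
            counts[idx] += 1

    start = time.monotonic()
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_connections)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / (time.monotonic() - start)


def find_saturation(measure, max_connections, min_gain=0.05):
    """Double the connection count until QPS stops improving by at least min_gain."""
    best_n, best_qps = 0, 0.0
    n = 1
    while n <= max_connections:
        qps = measure(n)
        if qps <= best_qps * (1 + min_gain):
            break
        best_n, best_qps = n, qps
        n *= 2
    return best_n, best_qps
```

A typical use would be `find_saturation(lambda n: measure_qps(hosts, queries, 60, n, run_query), max_connections=256)`, with the returned QPS taken as the system's throughput.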
Benchmarking – General Performance Notes
• Gather query profiles and system utilization
  • You can get all these from CM or /var/log/impalad/profiles!
• Performance vs Hive - Impala will ALWAYS be faster after proper tuning. If not, something is wrong with the benchmark.
  • Impala vs Hive-on-Tez/Spark SQL benchmark: https://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
  • See the open TPC-DS toolkit to run your own! https://github.com/cloudera/impala-tpcds-kit
Multi-tenancy Best Practices – Admission Control
• Recommended: Use static partitioning to give Impala a fixed amount of memory, and use Admission Control
  • See the “Memory Usage” section for determining memory usage!
• The best public resource is this new Impala RM ‘how to’ guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_howto_rm.html
Multi-tenancy Best Practices – Preventing a "Runaway" Query
• "Runaway" query = a user accidentally submitted a "wrong" query that consumes a significant amount of memory
• Limit the amount of memory used by an individual query with a per-query mem-limit:
  • Set it from Impala shell (on a per-query basis): set mem_limit=<per query limit>
  • Set a default per-query mem limit: -default_query_options='mem_limit=<per query limit>'
• On Impala 2.6 and later, you can set the default per-query mem limit at the pool level
Multi-tenancy Best Practices – Admission Control
• How to approach Admission Control config:
  • Workload dependent – memory bound or not?
  • "Memory bound" means that you've used up the memory allocated to Impala before hitting the limit on CPU, disk, or network.
  • Memory bound – use mem-limit
  • Non-memory bound – use number of concurrent queries
Multi-tenancy Best Practices – Admission Control (Cont’d)
• Goals of the memory limit setting:
  • Prevents each group of users from overcommitting system memory
  • Prevents queries from hitting the mem-limit
  • (Secondary) Simulates priority by allocating more memory to important groups
Multi-tenancy Best Practices – Admission Control (Cont’d)
• Step 1: Identify a sample workload from each user "group", such as HR, Analyst, etc.
• Step 2*: For each query in the workload, identify the accurate memory usage by running the query. This is the memory requirement for the query.
• Step 3: Minimal memory requirement for each group = max(mem requirement from the query set) * 1.2 (safety factor). Also, set the per-query mem-limit to this number.
• Step 4: You can divide the memory based on % too, but each group should have at least the min mem derived from Step 3.
• NOTE: sum(mem assigned to all groups) can be greater than the total memory available. This is OK.
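The Step 3 arithmetic can be sketched in a few lines of Python. This is only an illustration: the group names and the measured per-query peak memory figures are hypothetical placeholders, and only the 1.2 safety factor comes from the steps above.

```python
# Hypothetical peak memory (GB) measured by running each sample query (Step 2).
measured_peak_gb = {
    "hr": [2.0, 3.5, 1.2],
    "analyst": [10.0, 7.5],
}

SAFETY_FACTOR = 1.2  # safety factor from Step 3


def min_group_memory_gb(peaks):
    """Step 3: max(per-query memory requirement) * safety factor."""
    return max(peaks) * SAFETY_FACTOR


# Minimum pool memory per group; the same number is used as the per-query mem-limit.
group_min_gb = {group: min_group_memory_gb(p) for group, p in measured_peak_gb.items()}

for group, gb in group_min_gb.items():
    print(f"{group}: pool minimum and per-query mem-limit = {gb:.1f} GB")
```

Per the NOTE above, the sum of these per-group values may exceed the memory given to Impala.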
Multi-tenancy Best Practices – Admission Control (Cont’d)
• More on step 2
  • If the memory estimate from the explain plan is inaccurate:
    • FIX: Use a per-query limit to override it, but that will require you to submit the query through the shell.
    • FIX: Adjust the pool mem-limit accordingly; if it's over the estimate, give it a higher mem-limit and vice versa.
Multi-tenancy Best Practices – Admission Control (Cont’d)
• Limiting the number of concurrent queries
  • Goal:
    • Avoid oversubscription to CPU, disk, network because this can lead to longer response time (without improving throughput).
  • Works best with a homogeneous workload
  • With a heterogeneous workload, you can still apply the same approach, but the result won't be as optimal.
Query Tuning Basics - Overview
• Given that your query runs, know how to make it faster and consume fewer resources.
  • Always Compute Stats.
  • Examine the logic of the query.
  • Validate it with Explain Plan.
  • Runtime Filter
• Runtime Analysis
  • Use Query Profile to identify bottlenecks and skew
  • Codegen
Query Tuning Basics – More on Compute Stats
• Compute Stats is very CPU-intensive, but on Impala 1.4 and later it is much faster than in previous versions.
  • Speed reference: ~40M cells per sec per node + HMS update time (1 sec per 100 partitions). ~7 million cells per sec per node if codegen is disabled (if the table contains TIMESTAMP or CHAR columns, which don't support codegen yet).
  • Total number of cells of a table = num rows * num cols
• Only need to recompute stats after significant changes of data (30% or more)
• Compute Stats on tables, not views
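As a rough worked example of the speed reference above, the following Python sketch estimates Compute Stats run time. The table dimensions and cluster size are hypothetical, and it assumes the work spreads evenly across nodes, which the slide does not state explicitly.

```python
def compute_stats_seconds(num_rows, num_cols, num_partitions, num_nodes, codegen=True):
    """Estimate Compute Stats time from the slide's speed reference."""
    cells = num_rows * num_cols                        # total cells = rows * cols
    cells_per_sec_per_node = 40e6 if codegen else 7e6  # ~40M with codegen, ~7M without
    scan_seconds = cells / (cells_per_sec_per_node * num_nodes)  # assumes even spread
    hms_seconds = num_partitions / 100                 # HMS update: 1 sec per 100 partitions
    return scan_seconds + hms_seconds


# Hypothetical table: 1B rows, 40 columns, 1000 partitions, on a 10-node cluster.
estimate = compute_stats_seconds(1_000_000_000, 40, 1000, 10)
print(f"estimated Compute Stats time: {estimate:.0f} s")
```

Passing codegen=False applies the slower ~7M cells per second rate for tables with TIMESTAMP or CHAR columns.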
Query Tuning Basics – Incremental Stats Maintenance
• Compute Stats is slow and, through 2.0, Impala does not support Incremental Stats
• Column stats (number of distinct values, min, max) can be updated by Compute Stats infrequently (when 30% or more of the data has changed)
• When adding a new partition, run a count(*) query on the partition and update the partition row count stats manually via ALTER TABLE
• You can set column stats manually via ALTER TABLE as well
Query Tuning Basics – Incremental Stats Maintenance
• Impala 2.1 or later supports "Compute Incremental Stats", but use it only if the following conditions are met:
  • For all the tables that are using incremental stats, Σ(num columns * num partitions) < 650000.
  • The size of the cluster is less than 50 nodes.
  • For each table, num columns * num partitions * 400B < 200MB
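The three conditions above can be checked mechanically. The sketch below is a hypothetical helper, not an Impala tool: the table names and sizes are made up, and it reads 200MB as 200 * 10^6 bytes, which is an assumption.

```python
MAX_TOTAL_COL_PARTITIONS = 650_000        # sum over all incremental-stats tables
MAX_NODES = 50                            # cluster must be smaller than this
BYTES_PER_COL_PARTITION = 400             # 400B per (column, partition) pair
MAX_BYTES_PER_TABLE = 200 * 1000 * 1000   # 200MB, read as decimal MB (assumption)


def incremental_stats_ok(tables, num_nodes):
    """tables maps a table name to (num_columns, num_partitions)."""
    total = sum(cols * parts for cols, parts in tables.values())
    if total >= MAX_TOTAL_COL_PARTITIONS:
        return False
    if num_nodes >= MAX_NODES:
        return False
    return all(
        cols * parts * BYTES_PER_COL_PARTITION < MAX_BYTES_PER_TABLE
        for cols, parts in tables.values()
    )


# Hypothetical tables: both fit within all three limits on a 20-node cluster.
small_tables = {"sales": (50, 2000), "events": (100, 3000)}
# A single wide table passes the global sum but exceeds the per-table limit.
wide_table = {"wide": (200, 3000)}

print(incremental_stats_ok(small_tables, num_nodes=20))
print(incremental_stats_ok(wide_table, num_nodes=20))
```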
Query Tuning Basics – Examining Query Logic
• Sometimes, the query can have redundant joins, distinct, group by, order by (very common during migration). Remove them!
• For common join patterns, consider pre-joining the tables.
• Use the proper join. Example:
select fact.col, max(dim2.col)
from fact, dim1, dim2
where fact.key = dim1.key and fact.key2 = dim2.key
group by fact.col
• The join on dim1 should be a semi-join!
Query Tuning Basics – Validating Explain Plan
• Key points:
  • Validate join order and join strategy.
  • Validate partition pruning or HBase row key lookup.
• Even with stats, at times the join order/strategy might go wrong (particularly with layers of views). You can force the join order by using the STRAIGHT_JOIN hint.
Query Tuning Basics – Validating Join Order & Strategy
• Join Order
  • RHS is smaller than LHS
• Join Strategy - BROADCAST
  • RHS must fit in memory!
Query Tuning Basics – Common Join Performance Issues
• Hash Table
  • Ideally, the join key should be evenly distributed; only a few rows share the same join key from the RHS.
  • Is it a true foreign key join or more like a range join?
• Wrong join order – RHS is bigger than the LHS table in the plan
• Too many LHS rows
[Figure: hash table with a long hash chain]
Query Tuning Basics – Finding Bottlenecks
• Use ExecSummary from Query Profile to identify bottlenecks
Query Tuning Basics – Finding Skew
• Use ExecSummary from Query Profile to identify skew
  • Max Time is significantly more than Avg Time => Skew!
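The Max Time versus Avg Time rule can be applied to copied ExecSummary rows with a small script. In this sketch the operator names and timings are invented, and the 2x ratio used for "significantly more" is an assumed threshold rather than anything the slide specifies.

```python
# (operator, avg_time_seconds, max_time_seconds): hypothetical ExecSummary rows.
exec_summary = [
    ("05:HASH JOIN", 2.0, 2.4),
    ("02:SCAN HDFS", 1.5, 9.0),
    ("09:AGGREGATE", 0.8, 0.9),
]

SKEW_RATIO = 2.0  # assumed cut-off for "significantly more"; tune for your workload


def skewed_operators(rows, ratio=SKEW_RATIO):
    """Return operators whose Max Time is at least `ratio` times their Avg Time."""
    return [op for op, avg, mx in rows if avg > 0 and mx / avg >= ratio]


suspects = skewed_operators(exec_summary)
print(suspects)
```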
Query Tuning Basics – Finding Skew (Cont’d)
• In addition to the profile, always measure CPU, memory, disk IO and network IO across the cluster.
  • An uneven distribution of the load means skew!
• Cloudera Manager's charts can do that but only report at a minute interval
  • Use Colmux if your workload is short.
Query Tuning Basics – Typical Speed
• In a typical query, we observed the following processing rates:
  • scan node 8~10M rows per sec per core
  • join node ~10M rows per sec per core
  • agg node ~5M rows per sec per core
  • sort node ~17MB per sec per core
  • Row materialization in coord should be tiny
  • Parquet writer 1~5MB per sec per core
• If your processing rate is much lower than that, it's worth a deeper look
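To compare an observed operator against these reference rates, divide the rows processed by elapsed seconds and by the number of cores. The sketch below does that for join and agg nodes; the workload numbers are hypothetical, and the 0.5 slack used to decide "much lower" is an assumption.

```python
# Reference rates from the slide, in rows per second per core.
REFERENCE_RATES = {"join": 10e6, "agg": 5e6}


def rows_per_sec_per_core(rows, seconds, cores):
    return rows / (seconds * cores)


def worth_a_deeper_look(node_type, rows, seconds, cores, slack=0.5):
    """True if the observed rate falls below `slack` times the reference (slack is assumed)."""
    return rows_per_sec_per_core(rows, seconds, cores) < REFERENCE_RATES[node_type] * slack


# Hypothetical join: 400M rows in 20 s on 16 cores = 1.25M rows/sec/core.
slow_join = worth_a_deeper_look("join", 400_000_000, 20, 16)
# Hypothetical agg: 1.6B rows in 20 s on 16 cores = 5M rows/sec/core.
slow_agg = worth_a_deeper_look("agg", 1_600_000_000, 20, 16)
print(slow_join, slow_agg)
```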
Query Tuning Basics – Improving Scan Node Performance
• HDFS scan time – check how much data is read; always do as little disk read as possible; review partition strategy.
• Column materialization time – only select necessary columns! Materializing 1k columns is a lot of work.
• Complex predicates – string and regex predicates are costly; avoid them.
• Order of predicates – Impala will try to order predicates by selectivity if stats are available. If stats are not available, you should order predicates by selectivity in your query.
Query Tuning Basics – Aggregate Performance Tuning
• Needed when many rows are going into the aggregate
  • Reduce expression complexity in group by
• Complex UDA
  • No codegen for UDA yet. Try to rewrite the UDA as a built-in aggregate
• (Usually, not a big issue)
Query Tuning Basics – Exchange Performance Issues
• Too much data across the network:
  • Check the query on data size reduction.
  • Check join order and join strategy; wrong order/strategy can have a serious effect on the network!
  • For agg, check the number of groups – affects memory too!
  • Remove unused columns.
• Keep in mind that the network is at most 10 Gb/s.
• Cross-rack network slowness
• Query profile is usually not useful. Use CM or other system monitoring tools.
Query Tuning Basics – Storage Skew
• HDFS Block Placement Skew
  • HDFS balances disk usage evenly across the whole cluster only. An individual table (or partition)'s data could be clustered in a handful of nodes.
  • If this happens, you'll see that some nodes are busier (disk read and usually CPU as well) than the others.
  • This is inherent to HDFS: more pronounced if the query data volume is tiny when compared to the total storage capacity.
  • By running a mixed workload that accesses data from a bigger set of tables, this type of HDFS block placement skew usually evens out.
Query Tuning Basics – Data Skew
• Partitioned Join Node Performance Skew
  • Join key data skew.
  • Each join key is re-shuffled and processed by one node.
  • If a single join key value accounts for a huge chunk of the data, then the node that processes that join key will become the bottleneck.
• Possible workaround: use a broadcast join, but it uses more memory
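A toy model makes the bottleneck visible. The Python sketch below assigns each join key to one node with a simple modulo over Python's built-in hash; this stands in for the shuffle and is not Impala's actual hash function. The key distribution is hypothetical.

```python
from collections import Counter


def rows_per_node(join_keys, num_nodes):
    """Count rows landing on each node when every key value goes to exactly one node."""
    load = Counter()
    for key in join_keys:
        load[hash(key) % num_nodes] += 1  # toy partitioning, not Impala's hash
    return [load[n] for n in range(num_nodes)]


# Hypothetical data: key 7 accounts for 900 of 1000 rows; the rest are spread evenly.
skewed_keys = [7] * 900 + list(range(100))
loads = rows_per_node(skewed_keys, num_nodes=4)
print(loads)  # one node receives almost all of the rows
```

A broadcast join avoids this shuffle on the join key, at the cost of holding the whole RHS in memory on every node.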
Query Tuning Basics – Expression Evaluation
• Impala evaluates expressions lazily (i.e. only when the value is needed)
• Impala deals with inline views by substituting the select-list expressions from the inline view in the parent query block.
• Expensive expressions inside the inline views might be executed multiple times
• Example
select concat(col1, col1, col1)
from (select regexp_extract(ori_col, '…..') as col1 from tbl) x
• The regexp_extract will be evaluated 3 times by the coordinator!
Query Tuning Basics – Expression Evaluation (Cont’d)
• To avoid redundant expression evaluation, we need to materialize the value.
• These operators materialize the expression: Order by, Union [all], Group by
• Example:
select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl) x order by col1;
select concat(col1, col1, col1) from (select regexp_extract(ori_col, '…..') as col1 from tbl union all select null from one_row_tbl where false) x;
• Takeaway: Watch out for expensive expressions inside an (inline) view!
• Takeaway: Due to lazy evaluation, it can affect all the exec nodes as well as the coordinator!
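The difference between substituting an expression and materializing it can be mimicked outside Impala. The following Python analogy is not how Impala is implemented: it only counts how often a stand-in "expensive" function runs when it is inlined three times versus computed once and reused.

```python
calls = {"count": 0}


def expensive(value):
    """Stand-in for a costly expression such as a regex extraction."""
    calls["count"] += 1
    return value.upper()


def inlined(row):
    # Mimics concat(col1, col1, col1) after col1 is replaced by its defining expression.
    return expensive(row) + expensive(row) + expensive(row)


def materialized(row):
    # Mimics a materialization point: the value is computed once, then reused.
    col1 = expensive(row)
    return col1 + col1 + col1


def count_calls(fn, rows):
    calls["count"] = 0
    results = [fn(r) for r in rows]
    return calls["count"], results


rows = ["a", "b"]
inlined_calls, inlined_out = count_calls(inlined, rows)
materialized_calls, materialized_out = count_calls(materialized, rows)
print(inlined_calls, materialized_calls)
```

Both versions return the same rows; only the number of evaluations differs.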
Query Tuning Basics – Client side Performance Issues
• Avoid large data extracts.
  • It's usually not a good idea to dump lots of data out using JDBC/ODBC.
• For impala-shell, use the -B option to fetch lots of data.
[Figure: query timeline showing total query time and idle time, when the client isn't fetching]
Query Tuning Basics – Excessive Planning Time?
• Usually due to metadata loading when running the query for the first time after a restart (or after INVALIDATE METADATA)
  • Try running the query again. The time should be short.
[Figure: query timeline highlighting planning time]
Query Tuning Basics – Codegen
• In CDH 5.9 and before, codegen has a minimum cost (~500ms). For a simple short query (<2s), most of the time could be spent on codegen preparation and optimization. Disabling codegen could give more throughput.
  • This cost is reduced to 10-20% in CDH 5.10+
• You can set a query option to disable codegen
  • SET DISABLE_CODEGEN=1
• Some data types are not supported (CHAR, TIMESTAMP)
Query Tuning Basics – Runtime filter and dynamic partition pruning (CDH 5.7/Impala 2.5 or higher)
• Improve query performance for very selective equi-joins by reducing IO, hash join work, and network traffic.
  • If the join is on a partition column, Impala can use runtime filters to dynamically prune partitions on the probe side and skip whole files. Not only can it improve performance, it can also reduce query resource usage (both memory and CPU).
  • If the join is on non-partition columns, Impala can generate bloom filters. This can greatly reduce the scan output of the probe side and the intermediate result.
Check how selective the join is
Large hash join output (~2.8B); the output of the upstream one is much smaller (32M, reduced to only ~1%). This indicates that a runtime filter can be helpful.
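How it works
[Figure: query plan with numbered callouts 1 to 4 showing where runtime filters are produced and applied]
• RF002 <- d.d_date_sk: Filter coming from Exch Node 20, applied to Scan Node 02; reduces scan output from 2.8B to 32M and all the upstream hash joins as well.
• RF001 <- (d_month_seq): Filter coming from Scan Node 05, applied to Scan Node 03; reduces scan output from 73049 to 31.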
Why doesn't it work sometimes?
The filter doesn't take effect because the scan was short and finished before the filter arrived.
HDFS_SCAN_NODE (id=3):
  Filter 1 (1.00 MB):
    - Rows processed: 16.38K (16383)
    - Rows rejected: 0 (0)
    - Rows total: 16.38K (16384)
How to tune it?
• Increase RUNTIME_FILTER_WAIT_TIME_MS to 5000ms to let Scan Node 03 wait longer for the filter.
HDFS_SCAN_NODE (id=3)
  Filter 1 (1.00 MB)
    - InactiveTotalTime: 0
    - Rows processed: 73049
    - Rows rejected: 73018
    - Rows total: 73049
    - TotalTime: 0
• If the cluster is relatively busy, consider increasing the wait time too so that complicated queries do not miss opportunities for optimization.
Which file formats are supported?
• Partition filter
  • All file formats, as long as it's a partitioned table and there are mappings for partition columns. E.g.
    • select count(*) from t1, t2 where t1.partition_col = t2.col1 and t2.col2 = "1234"
• Row filter
  • Only applied for the Parquet format
• Parquet can take the most advantage of runtime filters.
Runtime filter Gotcha
• Runtime filters require extra memory on each host and extra work on the coordinator. Example: on a 100-node cluster, one 16MB filter
  • Memory cost: 16MB * 100 = 1.6GB for the query
  • Network cost (only for GLOBAL mode): 16MB * 100 * 2 = 3.2GB network traffic on the coordinator, and CPU time to merge and publish filters.
  • More filters, higher cost.
  • Negative impact: Coordinator becomes the bottleneck
• Resolution: reduce the # of filters by setting MAX_NUM_RUNTIME_FILTERS to a lower number, or disable row filters by setting DISABLE_ROW_RUNTIME_FILTERING=1.
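The cost arithmetic in the example above generalizes to other cluster and filter sizes. This Python sketch reproduces it; the linear scaling with the number of filters is an assumption drawn from "more filters, higher cost", and the function name is hypothetical.

```python
def runtime_filter_cost_mb(filter_mb, num_nodes, num_filters=1, global_mode=True):
    """Return (memory_mb, coordinator_network_mb) for a query's runtime filters."""
    # Each node holds a copy of every filter.
    memory_mb = filter_mb * num_nodes * num_filters
    # GLOBAL mode only: filters flow into the coordinator and back out (the factor of 2).
    network_mb = filter_mb * num_nodes * 2 * num_filters if global_mode else 0
    return memory_mb, network_mb


# The slide's example: one 16MB filter on a 100-node cluster.
memory_mb, network_mb = runtime_filter_cost_mb(16, 100)
print(f"memory: {memory_mb / 1000:.1f} GB, coordinator network: {network_mb / 1000:.1f} GB")
```

Lowering MAX_NUM_RUNTIME_FILTERS corresponds to reducing num_filters in this model.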
Interaction with Sentry, Hive, and Parquet
• Setup: Cloudera Manager 5.x does a good job; verify the dependency parents, such as Hive Metastore, HDFS.
  • Stability in HMS might affect Impala; check HMS health.
• File-based and Apache Sentry security
  • Even with Sentry, Impala needs to read/write all dir/files. No impersonation.
Summary
• Approach cluster sizing systematically - SLA and workload
• Benchmark running technique and measurement – QPS
• Use Admission Control for multi-tenancy
• Tune your queries - identify and attack bottlenecks
• Cloudera Manager 5.0+ provides a tool for verifying whether many best practices have been followed
Other Resources
• Impala User Guide: http://www.cloudera.com/documentation/enterprise/latest/topics/impala.html
• Impala website (roadmap, repository, books, more) & Twitter: https://impala.apache.org, @ApacheImpala
• Community resources:
  • Mailing list: [email protected]
  • Forum: http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/bd-p/Impala
Thank you
https://impala.apache.org
@ApacheImpala