Post on 17-Mar-2018
BuildingMulti-PetabyteDataWarehouseswithClickHouse
AlexanderZaitsev
LifeSteet,Altinity
PerconaLiveDublin,2017
Altinity
WhoamI
• GraduatedMoscowStateUniversityin1999
• Softwareengineersince1997
• Developeddistributedsystemssince2002
• Focusedonhighperformanceanalyticssince2007
• DirectorofEngineeringinLifeStreet
• Co-founderofAltinity
• AdTechcompany(adexchange,adserver,RTB,DMPetc.)
since2006
• 10,000,000,000+events/day
• 2K/event
• 3monthsretention(90-120days)
10B*2K*[90-120]=[1.8-2.4]PB
• Tried/used/evaluated:
– MySQL(TokuDB,ShardQuery)
– InfiniDB
– MonetDB
– InfoBrightEE– Paraccel(nowRedShift)
– Oracle
– Greenplum
– SnowflakeDB
– Vertica
ClickHouse
Flashback:ClickHouseat08/2016
• 1-2monthsinOpenSource
• InternalYandexproduct–nootherinstallations
• Nosupport,roadmap,communicatedplans
• 3officialdevs
• Anumberofvisiblelimitations(andmanyinvisible)
• Storiesofotherdoomedopen-sourcedDBs
Developproductionsystemwith“that”?
ClickHouseis/wasmissing:
• Transactions
• Constraints
• Consistency
• UPDATE/DELETE
• NULLs(addedfewmonthsago)
• Milliseconds
• Implicittypeconversions
• StandardSQLsupport
• Partitioningbyanycolumn(dateonly)
• Enterpriseoperationtools
SQLdevelopersreaction:
Butwetriedandsucceeded
Beforeyougo:
ü Confirmyourusecase
ü Checkbenchmarks
ü Runyourown
ü Considerlimitations,notfeatures
ü MakeaPOC
Migrationproblem:basicthingsdonotfit
MainChallenges
• Efficientschema
– UseClickHousebests
– Workaroundlimitations
• Reliabledataingestion
• Shardingandreplication
• Clientinterfaces
LifeStreetUseCase
• Publisher/Advertiserperformance
• Campaign/Creativeperformanceprediction
• Realtimealgorithmicbidding
• DMP
LifeStreetRequirements
• Load10Brows/day,500dimensions/row
• Ad-hocreportson3monthsofdata
• Lowdataandquerylatency
• HighAvailability
Multi-DimensionalAnalysis
N-dimensionalcube
M-dimensionalprojection
slice
OLAPquery:aggregation+filter+groupby
Rangefilter
Queryresult
Disclaimer:averageslie
Typicalschema:“star”
• Facts• Dimensions• Metrics• Projections
StarSchemaApproach
De-normalized:dimensionsinafacttable
Normalized:dimensionkeysinafacttableseparatedimensiontables
Singletable,simple Multipletables
Simplequeries,nojoins Morecomplexquerieswithjoins
Datacannotbechanged Dataindimensiontablescanbechanged
Sub-efficientstorage Efficientstorage
Sub-efficientqueries Moreefficientqueries
Normalizedschema:traditionalapproach-joins
• LimitedsupportinClickHouse(1level,
cascadesub-selectsformultiple)
• Dimensiontablesarenotupdatable
Dictionaries-ClickHousedimensionsapproach
• Lookupservice:key->value
• Supportsdifferentexternalsources(files,
databasesetc.)
• Refreshable
Dictionaries.ExampleSELECT country_name, sum(imps) FROM T ANY INNER JOIN dim_geo USING (geo_key) GROUP BY country_name; vs SELECT dictGetString(‘dim_geo’, ‘country_name’, geo_key) country_name, sum(imps) FROM T GROUP BY country_name;
Dictionaries.Configuration<dictionary>
<name></name>
<source>…</source>
<lifetime>...</lifetime>
<layout>…</layout>
<structure>
<id>...</id>
<attribute>...</attribute>
<attribute>...</attribute>
...
</structure>
</dictionary>
Dictionaries.Sources• file
• mysqltable
• clickhousetable
• odbcdatasource
• executablescript
• httpservice
Dictionaries.Layouts
• flat
• hashed
• cache
• complex_key_hashed
• range_hashed
Dictionaries.range_hashed
• ‘EffectiveDated’queries
<layout>
<range_hashed/>
</layout>
<structure>
<id>
<name>id</name>
</id>
<range_min>
<name>start_date</name>
</range_min>
<range_max>
<name>end_date</name>
</range_max>
dictGetFloat32('srv_ad_serving_costs','ad_imps_cpm',toUInt64(0),event_day)
Dictionaries.Updatevalues• Bytimer(default)
• AutomaticforMySQLMyISAM
• Using‘invalidate_query’
• Manuallytouchingconfigfile
• Warning:Ndict*Mnodes=N*MDB
connections
Dictionaries.Restrictions
• ‘Normal’keysareonlyUInt64
• Noondemandupdate(addedinSep2017
1.1.54289)
• Everyclusternodehasitsowncopy
• XMLconfig(DDLwouldbebetter)
Dictionariesvs.Tables
+NoJOINs
+Updatable
+Alwaysinmemoryforflat/hash(faster)
- Notapartoftheschema
- Somewhatinconvenientsyntax
Tables
• Engines
• Sharding/Distribution
• Replication
Engine=?
• Inmemory:
– Memory
– Buffer
– Join
– Set
• Ondisk:
– Log,TinyLog
– MergeTreefamily
• Interface:
– Distributed– Merge
– Dictionary• Specialpurpose:
– View
– MaterializedView
– Null
Mergetree• Whatis‘merge’
• PKsortingandindex
• Datepartitioning
• Queryperformance
Block1 Block2
Mergedblock
PKindex
Seedetailsat:https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324
MergeTreefamily
ReplicatedReplacingCollapsingSummingAggergatingGraphite
MergeTree+ +
DataLoad
• Multipleformatsaresupported,includingCSV,TSV,
JSONs,nativebinary
• Errorhandling• SimpleTransformations
• Loadlocally(better)ordistributed(possible)
• Temptableshelp
• Replicatedtableshelpwithde-dup
ThepowerofMaterializedViews
• MVisatable,i.e.engine,replicationetc.
• Updatedsynchronously
• Summing/AggregatingMergeTree–consistentaggregation
• Altersareproblematic
DataLoadDiagram
Temptables(local)
Facttables(shard)
SummingMergeTree(shard)
SummingMergeTree(shard)
LogFiles
INSERT
MV MV
INSERT Buffertables(local)
Realtimeproducers
INSERT
Bufferflush
MySQL
Dictionaries
CLICKHOUSENODE
Updatesanddeletes
• Dictionariesarerefreshable
• ReplacingandCollapsingmergetrees
– eventuallyupdates
– SELECT…FINAL
• Partitions
ShardingandReplication• ShardingandDistribution=>Performance
– FacttablesandMVs–distributedovermultipleshards
– Dimensiontablesanddicts–replicatedateverynode
(localjoinsandfilters)
• Replication=>Reliability– 2-3replicaspershard
– CrossDC
DistributedQuerySELECTfooFROMdistributed_tableGROUPbycol1
Server1,2or3
SELECTfooFROMlocal_tableGROUPBYcol1
• Server1
SELECTfooFROMlocal_tableGROUPBYcol1
• Server2
SELECTfooFROMlocal_tableGROUPBYcol1
• Server3
Replication• Pertabletopologyconfiguration:
– Dimensiontables–replicatetoanynode– Facttables–replicatetomirrorreplica
• Zookepertocommunicatethestate
– State:whatblocks/partstoreplicate
• Asynchronous=>fasterandreliableenough
• Synchronous=>slower
• Isolatequerytoreplica
• Replicationqueues
SQL• SupportsbasicSQLsyntax
• Non-standardJOINsimplementation:
– 1levelonly
– ANYvsALL
– onlyUSING
• Aliasingeverywhere
• Arrayandnesteddatatypes,lambda-expressions,ARRAYJOIN
• GLOBALIN,GLOBALJOIN
• Approximatequeries
• Someanalyticalfunctions
HardwareandDeployment
• LoadisCPUintensive=>morecores
• Queryisdiskintensive=>fasterdisks
• 10-12SATARAID10– SAS/SSD=>x2performanceforx2priceforx0.5capacity
• 10TB/serverseemsoptimal
• Zookeper–keepinonDCforfastquorum
• RemoteDCworkbad(e.g.EastanWestcoastinUS)
MainChallengesRevisited
• Designefficientschema
– UseClickHousebests
– Workaroundlimitations
• Designshardingandreplication
• Reliabledataingestion
• Clientinterfaces
Migrationprojecttimelines• August2016:POC
• October2016:firsttestruns
• December2016:productionscaledataload:
– 10-50Bevents/day,20TBdata/day– 12x2serverswith12x4TBRAID10
• March2017:ClientAPIready,startingmigration
– 30+clienttypes,20req/squeryload
• May2017:extensionto20x3servers
• June2017:migrationcompleted!
– 2-2.5PBuncompresseddata
Fewexamples
:)selectcount(*)fromdw.ad8_fact_eventwhereaccess_day=today()-1;SELECTcount(*)FROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)┌────count()─┐│7585106796│└────────────┘1rowsinset.Elapsed:0.503sec.Processed12.78billionrows,25.57GB(25.41billionrows/s.,50.82GB/s.)
:)selectdictGetString('dim_country','country_code',toUInt64(country_key))country_code,count(*)cntfromdw.ad8_fact_eventwhereaccess_day=today()-1groupbycountry_codeorderbycntdesclimit5;SELECTdictGetString('dim_country','country_code',toUInt64(country_key))AScountry_code,count(*)AScntFROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)GROUPBYcountry_codeORDERBYcntDESCLIMIT5┌─country_code─┬────────cnt─┐│US│2159011287││MX│448561730││FR│433144172││GB│352344184││DE│336479374│└──────────────┴────────────┘5rowsinset.Elapsed:2.478sec.Processed12.78billionrows,55.91GB(5.16billionrows/s.,22.57GB/s.)
:)SELECTdictGetString('dim_country','country_code',toUInt64(country_key))AScountry_code,sum(cnt)AScntFROM(SELECTcountry_key,count(*)AScntFROMdw.ad8_fact_eventWHEREaccess_day=(today()-1)GROUPBYcountry_keyORDERBYcntDESCLIMIT5)GROUPBYcountry_codeORDERBYcntDESC┌─country_code─┬────────cnt─┐│US│2159011287││MX│448561730││FR│433144172││GB│352344184││DE│336479374│└──────────────┴────────────┘5rowsinset.Elapsed:1.471sec.Processed12.80billionrows,55.94GB(8.70billionrows/s.,38.02GB/s.)
:)SELECTcountDistinct(name)ASnum_cols,formatReadableSize(sum(data_compressed_bytes)ASc)AScomp,formatReadableSize(sum(data_uncompressed_bytes)ASr)ASraw,c/rAScomp_ratioFROMlf.columnsWHEREtable='ad8_fact_event_shard'┌─num_cols─┬─comp───────┬─raw──────┬──────────comp_ratio─┐│308│325.98TiB│4.71PiB│0.06757640834769944│└──────────┴────────────┴──────────┴─────────────────────┘1rowsinset.Elapsed:0.289sec.Processed281.46thousandrows,33.92MB(973.22thousandrows/s.,117.28MB/s.)
ClickHouseatfall2017
• 1+yearOpenSource
• 100+prodinstallsworldwide
• Publicchangelogs,roadmap,andplans
• 5+2Yandexdevs,fewcommunitycontributors
• Activecommunity,blogs,casestudies
• Alotoffeaturesaddedbycommunityrequests
• SupportbyAltinity
Sonowitismucheasier
ClickHouseandMySQL
• MySQLiswidespreadbutweakforanalytics
– TokuDB,InfiniDBsomewhathelp
• ClickHouseisbestinanalytics
Howtocombine?
Imagine
MySQLflexibilityatClickHousespeed?
Dreams….
ClickHousewithMySQL• ProxySQLtoaccessClickHouse
dataviaMySQLprotocol(moreat
thenextsession)
• BinlogsintegrationtoloadMySQL
datainClickHouseinrealtime(in
progress)
MySQL CH
ProxySQL
binlogconsumer
ClickHouseinsteadofMySQL
• Weblogsanalytics
• Monitoringdatacollectionandanalysis
– Percona’sPMM
– InfinidatInfiniMetrics
• Othertimeseriesapps
• ..andmore!
Questions?
Contactme:
alexander.zaitsev@lifestreet.comalz@altinity.comskype:alex.zaitsev
Altinity