MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin

BIG(GER) FASTER DATA Andrew Hood – Managing Director Cameron Gray – Data Engineer

Page 1:

BIG(GER) FASTER DATA

Andrew Hood – Managing Director
Cameron Gray – Data Engineer

Page 2:

3 May 2023 © Lynchpin Analytics Limited, All Rights Reserved 2

Things That Are Not New

Parallel Processing

Distributed Computing

Columnar Databases

Moore’s Law

Kryder’s Law

Page 3:

Things That Are New(ish)

Cheap as Chips Cloud Computing

Mature(ish) Open Source Technologies

Standard Platforms (e.g. Redshift)

Attitudes to External Data Hosting

Time to Implementation

Page 4:

Big Data: Structured Rant

1. I really don’t care how big anyone’s data is.
2. I do care about how long something takes.
3. I do care about how much something costs.
4. Faster+Cheaper could completely change what approach (tools/techniques) I might select for a given analytical task.

Page 5:

Use Case 1: RDBMS

Historical Transactional Data
Load into Relational Database (e.g. PostgreSQL)
SQL Query & Transformation

Historical Transactional Data
Load into Hadoop + HIVE
SQL Query & Transformation
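The first flow above (load, then query and transform with SQL) can be sketched in a few lines. This is a minimal illustration only: Python's built-in `sqlite3` module stands in for PostgreSQL, and the table name, columns and sample rows are all hypothetical. A broadly similar `GROUP BY` query could be written in Hive's SQL dialect, though the exact syntax may differ.

```python
import sqlite3

# Hypothetical sample rows: (order_id, customer, order_date, amount)
SAMPLE_TRANSACTIONS = [
    (1, "alice", "2014-01-05", 20.0),
    (2, "bob", "2014-01-17", 35.5),
    (3, "alice", "2014-02-02", 12.5),
    (4, "carol", "2014-02-20", 50.0),
]


def load_transactions(rows):
    """Load rows into an in-memory relational database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def monthly_revenue(conn):
    """SQL query and transformation: total revenue per calendar month."""
    query = (
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS revenue "
        "FROM transactions GROUP BY month ORDER BY month"
    )
    return conn.execute(query).fetchall()
```

Calling `monthly_revenue(load_transactions(SAMPLE_TRANSACTIONS))` returns one `(month, revenue)` tuple per month in the sample data.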

Page 6:

Use Case 2: Crap Analytics Tool

Adobe Clickstream Log / Google BigQuery Export
Hadoop / Redshift / Impala
Tableau

Page 7:

Data Processing Scenarios

Use a relational database
• Using raw clickstream/log files/other data sources
• Powerful querying capabilities (SQL)
• Integrates well with other tools
• Can handle large data sets (>1 million rows easily)
• Likely requires dedicated server and administration skillset

Page 8:

Data Processing Scenarios

Data readily available within analytics tool
• Limited by analytics tool capabilities
• Limited by aggregation and pre-processing definitions
• Limited by sampling (based on date range, breakdowns, number of rows)
• Limited by visualisation options (e.g. charting options)

Page 9:

Data Processing Scenarios

Export data into external tool
• Microsoft Excel, Tableau, R…
• More control over analysis, reporting, visualisation
• Still limited by the underlying data set
• Tool limitations (e.g. 1 million rows in Excel)
• Limited by PC resources

Page 10:

When do RDBMSs stop working efficiently?

• Sheer volume of data to process leads to problems
  – Limited by database server hardware
• Database can’t keep up with amount of data being inserted
• Queries have increasingly long processing times
  – Pre-computing queries also takes longer…
• Change in reporting requirements means reprocessing large amounts of historical data

Page 11:

What are the solutions?

• Limit requirement definitions (i.e. say “not possible”… boo!)

• Invest in very expensive server hardware
  – Gets very expensive, with diminishing returns
  – Single server means single point of failure
  – Having a backup/failover means needing another very expensive server!

• Use multiple servers working together?

Page 12:

Horizontal Scaling

• The ability to spread data and processing across multiple servers in a cluster
• As demand increases, just add more and more servers to the cluster
• Cluster provides built-in redundancy: robust to failures of individual servers
• Use technologies that scale effectively across the cluster
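The idea of spreading data and processing across a cluster can be sketched on a single machine. In this illustrative example, threads stand in for servers, a stable hash decides which "node" owns each key, and the partial results are merged at the end. All function names are hypothetical, and the sketch does not attempt to model redundancy or failover.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def assign_node(key, num_nodes):
    """Pick a node with a stable hash, so the same key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(rows, num_nodes):
    """Spread (key, value) rows across num_nodes partitions."""
    partitions = [[] for _ in range(num_nodes)]
    for key, value in rows:
        partitions[assign_node(key, num_nodes)].append((key, value))
    return partitions


def partial_totals(rows):
    """Work done independently on one node: sum values per key."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals


def cluster_totals(rows, num_nodes):
    """Partition the rows, process each partition concurrently, then merge the partial results."""
    parts = partition(rows, num_nodes)
    # Threads simulate separate servers here; a real cluster would ship each partition to a machine.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(partial_totals, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)
```

The result is the same whatever value of `num_nodes` is used; only the way the work is divided changes, which is the property that lets a cluster grow by adding servers.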

Page 13:

Technologies

Lots of others!

Page 14:

Hadoop

Distributed File Storage

Cluster Management

Distributed Processing

Client Applications

Page 15:

How does it all fit together?

Raw Data

Import

Process, Aggregate, Compute Views

Export
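The four stages above can be mimicked end to end on a tiny scale. The sketch below is an assumption-laden illustration rather than a description of the speakers' pipeline: the raw data is a made-up CSV string, the "view" is a simple page-view count, and every function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data: one row per page view.
RAW = "visitor,page\nv1,/home\nv2,/home\nv1,/cart\n"


def import_rows(raw_text):
    """Import: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def compute_view(rows):
    """Process and aggregate: count page views per page, sorted by page."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["page"]] += 1
    return sorted(counts.items())


def export_view(view):
    """Export: write the aggregated view back out as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "views"])
    writer.writerows(view)
    return out.getvalue()
```

Chaining the three functions, `export_view(compute_view(import_rows(RAW)))`, yields a small CSV that a reporting tool could pick up.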

Page 16:

Some experimental results

[Figure: two bar charts of processing time (s) comparing the standard method with Hadoop on 1, 2, 3, 4 and 5 nodes. Chart titles: Test 1 – filter test (5GB); Test 2 – aggregation test (5GB). Time axes run from 0 to 250 s and from 0 to 600 s.]
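A comparable filter test and aggregation test can be timed locally before moving to a cluster. The harness below is a rough sketch with synthetic data; the row layout, the threshold and the function names are assumptions, and it is not the benchmark used for the results above.

```python
import random
import time
from collections import defaultdict


def make_rows(n, seed=0):
    """Generate n synthetic (category, value) rows; the data is made up for illustration."""
    rng = random.Random(seed)
    return [(rng.choice("ABCDE"), rng.randint(1, 100)) for _ in range(n)]


def filter_test(rows):
    """Filter test: keep rows whose value exceeds a threshold."""
    return [row for row in rows if row[1] > 50]


def aggregation_test(rows):
    """Aggregation test: sum values per category."""
    totals = defaultdict(int)
    for category, value in rows:
        totals[category] += value
    return dict(totals)


def time_it(func, rows, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(rows)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

For example, `time_it(aggregation_test, make_rows(1_000_000))` gives a single-machine baseline to compare against the same job on one or more Hadoop nodes.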

Page 17:

Different Tools have Different Requirements

• Some tools such as Impala process as much data as possible in memory
  – Requires lots of RAM
• Some tools such as Hive process data mostly on disk
  – Requires high disk I/O – either fast disks/SSDs or as many disks as possible

Page 18:

Next steps to try out yourself!

• You can try out processing with Hadoop using a cloud service like Amazon Web Services

• Set up an account, create a few nodes, install Hadoop

• Upload some test data – the larger the better

• Try running some complex data processing on the data to get an idea of the performance
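If no suitable data set is to hand, test data can be generated. The following sketch writes synthetic clickstream-style rows to a CSV file; the column names, page list and value ranges are invented for illustration, and `num_rows` can be raised until the file is as large as needed.

```python
import csv
import random


def write_test_data(path, num_rows, seed=42):
    """Write num_rows of synthetic clickstream-style rows to a CSV file and return its path."""
    rng = random.Random(seed)
    pages = ["/home", "/search", "/product", "/cart", "/checkout"]  # hypothetical pages
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["visitor_id", "page", "duration_seconds"])
        for _ in range(num_rows):
            writer.writerow(
                [rng.randint(1, 100000), rng.choice(pages), rng.randint(1, 600)]
            )
    return path
```

The resulting file can then be uploaded to the cluster and used for filter or aggregation jobs.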

Page 19:

Things to remember

• Do testing before investing in new hardware / infrastructure
  – Test all tools you are interested in using with various amounts of RAM, CPU cores and I/O performance.

• Sheer number of tools in Hadoop ecosystem – worth planning out what you need

Page 20:

Thank you

www.lynchpin.com