MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
Transcript of MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
BIG(GER) FASTER DATA
Andrew Hood – Managing Director
Cameron Gray – Data Engineer
3 May 2023 © Lynchpin Analytics Limited, All Rights Reserved
Things That Are Not New
Parallel Processing
Distributed Computing
Columnar Databases
Moore’s Law
Kryder’s Law
Things That Are New(ish)
Cheap as Chips Cloud Computing
Mature(ish) Open Source Technologies
Standard Platforms (e.g. Redshift)
Attitudes to External Data Hosting
Time to Implementation
Big Data: Structured Rant
1. I really don’t care how big anyone’s data is.
2. I do care about how long something takes.
3. I do care about how much something costs.
4. Faster + cheaper could completely change what approach (tools/techniques) I might select for a given analytical task.
Use Case 1: RDBMS
Historical Transactional Data → Load into Relational Database (e.g. PostgreSQL) → SQL Query & Transformation
Historical Transactional Data → Load into Hadoop + Hive → SQL Query & Transformation
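The relational path above (load historical transactional data, then query and transform with SQL) can be sketched with Python's built-in `sqlite3` module standing in for a server-based RDBMS such as PostgreSQL; the table name, columns and sample rows are illustrative assumptions, not part of the original deck.

```python
import sqlite3

# In-memory database stands in for a server-based RDBMS (e.g. PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT,
        revenue    REAL
    )
""")

# A few sample rows stand in for the historical transactional load step.
rows = [
    (1, "2023-01-15", 120.00),
    (2, "2023-01-20", 80.50),
    (3, "2023-02-03", 200.00),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# SQL query & transformation step: aggregate revenue by month.
monthly = conn.execute("""
    SELECT substr(order_date, 1, 7) AS month,
           SUM(revenue) AS total_revenue
    FROM transactions
    GROUP BY month
    ORDER BY month
""").fetchall()

print(monthly)  # [('2023-01', 200.5), ('2023-02', 200.0)]
```

On the Hadoop + Hive path the `GROUP BY` query would look essentially the same; the difference is where and how the work is executed, not the SQL itself.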
Use Case 2: Crap Analytics Tool
Adobe Clickstream Log / Google BigQuery Export → Hadoop / Redshift / Impala → Tableau
Data Processing Scenarios
Use a relational database
• Using raw clickstream/log files/other data sources
• Powerful querying capabilities (SQL)
• Integrates well with other tools
• Can handle large data sets (>1 million rows easily)
• Likely requires dedicated server and administration skillset
Data Processing Scenarios
Data readily available within the analytics tool
• Limited by analytics tool capabilities
• Limited by aggregation and pre-processing definitions
• Limited by sampling (based on date range, breakdowns, number of rows)
• Limited by visualisation options (e.g. charting options)
Data Processing Scenarios
Export data into an external tool
• Microsoft Excel, Tableau, R…
• More control over analysis, reporting, visualisation
• Still limited by the underlying data set
• Tool limitations (e.g. 1 million rows in Excel)
• Limited by PC resources
When do RDBMSs stop working efficiently?
• Sheer volume of data to process leads to problems
– Limited by database server hardware
• Database can’t keep up with amount of data being inserted
• Queries have increasingly long processing times
– Pre-computing queries also takes longer…
• Change in reporting requirements means reprocessing large amounts of historical data
What are the solutions?
• Limit requirement definitions (i.e. say “not possible”… boo!)
• Invest in very expensive server hardware
– Gets very expensive, with diminishing returns
– Single server means single point of failure
– Having a backup/failover means needing another very expensive server!
• Use multiple servers working together?
Horizontal Scaling
• The ability to spread data and processing across multiple servers in a cluster
• As demand increases, just add more servers to the cluster
• Cluster provides built-in redundancy: robust to failures of individual servers
• Use technologies that scale effectively across the cluster
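The core idea of spreading data across a cluster can be sketched with simple hash partitioning: each record key is deterministically assigned to one of N nodes. This is a minimal illustration, not how any particular system works; note that with plain modulo hashing most keys move when the node count changes, which is why real systems use schemes such as consistent hashing.

```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    """Deterministically assign a record key to one of n_nodes servers."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

# Hypothetical visitor IDs standing in for clickstream record keys.
keys = [f"visitor-{i}" for i in range(10)]

# With 3 nodes, each key maps to exactly one server in the cluster.
placement_3 = {k: node_for(k, 3) for k in keys}

# Scaling out: the same function with 5 nodes spreads the load further.
placement_5 = {k: node_for(k, 5) for k in keys}

print(placement_3)
print(placement_5)
```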
Technologies
Lots of others!
Hadoop
Distributed File Storage
Cluster Management
Distributed Processing
Client Applications
How does it all fit together?
Raw Data → Import → Process, Aggregate, Compute Views → Export
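The import → process/aggregate → export flow can be sketched end to end in plain Python; the log line format (`timestamp,page`) and the pageview aggregation are illustrative assumptions.

```python
from collections import Counter

# Raw data: lines of a simplified clickstream log (assumed format: timestamp,page).
raw_lines = [
    "2023-05-01T09:00:00,/home",
    "2023-05-01T09:01:00,/products",
    "2023-05-01T09:02:00,/home",
]

# Import: parse each line into fields.
records = [line.split(",") for line in raw_lines]

# Process / aggregate / compute views: pageviews per page.
pageviews = Counter(page for _, page in records)

# Export: emit the aggregated view for a reporting tool.
export = sorted(pageviews.items())
print(export)  # [('/home', 2), ('/products', 1)]
```

In a cluster setup each stage runs distributed (e.g. files on HDFS, aggregation in Hive or Impala), but the logical shape of the pipeline is the same.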
Some experimental results
(Bar charts compared elapsed processing time for a standard single-server method against Hadoop clusters of 1–5 nodes.)
Test 1 – filter test (5GB): time in seconds, axis 0–250
Test 2 – aggregation test (5GB): time in seconds, axis 0–600
Different Tools have Different Requirements
• Some tools, such as Impala, process as much data as possible in memory
– Requires lots of RAM
• Some tools, such as Hive, process data mostly on disk
– Requires high disk I/O: either fast disks/SSDs or as many disks as possible
Next steps to try out yourself!
• You can try out processing with Hadoop using a cloud service like Amazon Web Services
• Set up an account, create a few nodes, install Hadoop
• Upload some test data – the larger the better
• Try running some complex data processing on the data to get an idea of the performance
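One low-friction way to try that processing is Hadoop Streaming, which runs ordinary scripts as the mapper and reducer (on a cluster they would be passed to the `hadoop-streaming` jar via `-mapper` and `-reducer`). Below is a minimal sketch, simulated locally: the log format and the pageview-count job are illustrative assumptions.

```python
from itertools import groupby

def mapper(lines):
    """Mapper: emit 'page<TAB>1' for each log line (assumed format: timestamp,page)."""
    for line in lines:
        _, page = line.strip().split(",")
        yield f"{page}\t1"

def reducer(sorted_pairs):
    """Reducer: sum the counts for each key, mirroring Hadoop's shuffle-and-reduce."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for page, group in groupby(split, key=lambda kv: kv[0]):
        yield page, sum(int(v) for _, v in group)

# Local simulation of the Streaming contract: map -> sort (shuffle) -> reduce.
log = ["t1,/home", "t2,/products", "t3,/home"]
result = dict(reducer(sorted(mapper(log))))
print(result)  # {'/home': 2, '/products': 1}
```

On a real cluster the `sorted()` step is Hadoop's shuffle phase, and the same two scripts scale across however many nodes you provisioned.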
Things to remember
• Do testing before investing in new hardware / infrastructure
– Test all tools you are interested in using with various amounts of RAM, CPU cores and I/O performance
• Sheer number of tools in Hadoop ecosystem – worth planning out what you need
Thank you
www.lynchpin.com