MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
Transcript of MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
BIG(GER) FASTER DATA
Andrew Hood – Managing Director
Cameron Gray – Data Engineer
3 May 2023 © Lynchpin Analytics Limited, All Rights Reserved
Things That Are Not New
Parallel Processing
Distributed Computing
Columnar Databases
Moore’s Law
Kryder’s Law
Things That Are New(ish)
Cheap as Chips Cloud Computing
Mature(ish) Open Source Technologies
Standard Platforms (e.g. Redshift)
Attitudes to External Data Hosting
Time to Implementation
Big Data: Structured Rant
1. I really don’t care how big anyone’s data is.
2. I do care about how long something takes.
3. I do care about how much something costs.
4. Faster + cheaper could completely change what approach (tools/techniques) I might select for a given analytical task.
Use Case 1: RDBMS
Historical Transactional Data → Load into Relational Database (e.g. PostgreSQL) → SQL Query & Transformation
Historical Transactional Data → Load into Hadoop + Hive → SQL Query & Transformation
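The relational path above (load historical transactional data, then query and transform with SQL) can be sketched with Python's built-in `sqlite3` module standing in for a server-based RDBMS such as PostgreSQL; the table name, columns and sample rows are illustrative assumptions, not part of the original deck.

```python
import sqlite3

# In-memory database stands in for a server-based RDBMS (e.g. PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT,
        revenue    REAL
    )
""")

# A few sample rows stand in for the historical transactional load step.
rows = [
    (1, "2023-01-15", 120.00),
    (2, "2023-01-20", 80.50),
    (3, "2023-02-03", 200.00),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# SQL query & transformation step: aggregate revenue by month.
monthly = conn.execute("""
    SELECT substr(order_date, 1, 7) AS month,
           SUM(revenue) AS total_revenue
    FROM transactions
    GROUP BY month
    ORDER BY month
""").fetchall()

print(monthly)  # [('2023-01', 200.5), ('2023-02', 200.0)]
```

On the Hadoop + Hive path the `GROUP BY` query would look essentially the same; the difference is where and how the work is executed, not the SQL itself.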
Use Case 2: Crap Analytics Tool
Adobe Clickstream Log / Google BigQuery Export → Hadoop / Redshift / Impala → Tableau
Data Processing Scenarios
Use a relational database
• Using raw clickstream/log files/other data sources
• Powerful querying capabilities (SQL)
• Integrates well with other tools
• Can handle large data sets (>1 million rows easily)
• Likely requires dedicated server and administration skillset
Data Processing Scenarios
Data readily available within the analytics tool
• Limited by analytics tool capabilities
• Limited by aggregation and pre-processing definitions
• Limited by sampling (based on date range, breakdowns, number of rows)
• Limited by visualisation options (e.g. charting options)
Data Processing Scenarios
Export data into an external tool
• Microsoft Excel, Tableau, R…
• More control over analysis, reporting, visualisation
• Still limited by the underlying data set
• Tool limitations (e.g. 1 million rows in Excel)
• Limited by PC resources
When do RDBMSs stop working efficiently?
• Sheer volume of data to process leads to problems
– Limited by database server hardware
• Database can’t keep up with amount of data being inserted
• Queries have increasingly long processing times
– Pre-computing queries also takes longer…
• Change in reporting requirements means reprocessing large amounts of historical data
What are the solutions?
• Limit requirement definitions (i.e. say “not possible”… boo!)
• Invest in very expensive server hardware
– Gets very expensive, with diminishing returns
– Single server means single point of failure
– Having a backup/failover means needing another very expensive server!
• Use multiple servers working together?
Horizontal Scaling
• The ability to spread data and processing across multiple servers in a cluster
• As demand increases, just add more servers to the cluster
• Cluster provides built-in redundancy: robust to failures of individual servers
• Use technologies that scale effectively across the cluster
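The core idea of spreading data across a cluster can be sketched with simple hash partitioning: each record key is deterministically assigned to one of N nodes. This is a minimal illustration, not how any particular system works; note that with plain modulo hashing most keys move when the node count changes, which is why real systems use schemes such as consistent hashing.

```python
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    """Deterministically assign a record key to one of n_nodes servers."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

# Hypothetical visitor IDs standing in for clickstream record keys.
keys = [f"visitor-{i}" for i in range(10)]

# With 3 nodes, each key maps to exactly one server in the cluster.
placement_3 = {k: node_for(k, 3) for k in keys}

# Scaling out: the same function with 5 nodes spreads the load further.
placement_5 = {k: node_for(k, 5) for k in keys}

print(placement_3)
print(placement_5)
```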
Technologies
Lots of others!
Hadoop
Distributed File Storage
Cluster Management
Distributed Processing
Client Applications
How does it all fit together?
Raw Data → Import → Process, Aggregate, Compute Views → Export
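The import → process/aggregate → export flow can be sketched end to end in plain Python; the log line format (`timestamp,page`) and the pageview aggregation are illustrative assumptions.

```python
from collections import Counter

# Raw data: lines of a simplified clickstream log (assumed format: timestamp,page).
raw_lines = [
    "2023-05-01T09:00:00,/home",
    "2023-05-01T09:01:00,/products",
    "2023-05-01T09:02:00,/home",
]

# Import: parse each line into fields.
records = [line.split(",") for line in raw_lines]

# Process / aggregate / compute views: pageviews per page.
pageviews = Counter(page for _, page in records)

# Export: emit the aggregated view for a reporting tool.
export = sorted(pageviews.items())
print(export)  # [('/home', 2), ('/products', 1)]
```

In a cluster setup each stage runs distributed (e.g. files on HDFS, aggregation in Hive or Impala), but the logical shape of the pipeline is the same.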
Some experimental results
(Bar charts compared elapsed processing time for a standard single-server method against Hadoop clusters of 1–5 nodes.)
Test 1 – filter test (5GB): time in seconds, axis 0–250
Test 2 – aggregation test (5GB): time in seconds, axis 0–600
Different Tools have Different Requirements
• Some tools, such as Impala, process as much data as possible in memory
– Requires lots of RAM
• Some tools, such as Hive, process data mostly on disk
– Requires high disk I/O: either fast disks/SSDs or as many disks as possible
Next steps to try out yourself!
• You can try out processing with Hadoop using a cloud service like Amazon Web Services
• Set up an account, create a few nodes, install Hadoop
• Upload some test data – the larger the better
• Try running some complex data processing on the data to get an idea of the performance
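One low-friction way to try that processing is Hadoop Streaming, which runs ordinary scripts as the mapper and reducer (on a cluster they would be passed to the `hadoop-streaming` jar via `-mapper` and `-reducer`). Below is a minimal sketch, simulated locally: the log format and the pageview-count job are illustrative assumptions.

```python
from itertools import groupby

def mapper(lines):
    """Mapper: emit 'page<TAB>1' for each log line (assumed format: timestamp,page)."""
    for line in lines:
        _, page = line.strip().split(",")
        yield f"{page}\t1"

def reducer(sorted_pairs):
    """Reducer: sum the counts for each key, mirroring Hadoop's shuffle-and-reduce."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for page, group in groupby(split, key=lambda kv: kv[0]):
        yield page, sum(int(v) for _, v in group)

# Local simulation of the Streaming contract: map -> sort (shuffle) -> reduce.
log = ["t1,/home", "t2,/products", "t3,/home"]
result = dict(reducer(sorted(mapper(log))))
print(result)  # {'/home': 2, '/products': 1}
```

On a real cluster the `sorted()` step is Hadoop's shuffle phase, and the same two scripts scale across however many nodes you provisioned.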
Things to remember
• Do testing before investing in new hardware / infrastructure
– Test all tools you are interested in using with various amounts of RAM, CPU cores and I/O performance
• Sheer number of tools in Hadoop ecosystem – worth planning out what you need
Thank you
www.lynchpin.com