Final deck
steve-watt -
![Page 1: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/1.jpg)
Big Data for Everyone
Twitter: #bd4e
![Page 2: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/2.jpg)
Introduction to Big Data
Steve Watt Hadoop Strategy
@wattsteve #bd4e
![Page 3: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/3.jpg)
What is “Big Data”?
“Every two days we create as much information as we did from the dawn of civilization up until 2003” – Eric Schmidt, Google
Current state of affairs:
- Explosion of user-generated content
- Storage is really cheap so we can store what we want
- Traditional data stores have reached critical mass
Issues:
- Enterprise Amnesia
- Traditional architectures become brittle and slow when tasked with trying to process data at petabyte scale
- How do we process unstructured data?
![Page 4: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/4.jpg)
How were these issues addressed?
2003-2004 – Google publishes seminal whitepapers on the Google File System (2003) and Map/Reduce (2004), the latter a new programming paradigm to process data at Internet Scale
The whitepapers describe the use of Massive Parallelism to allow a system to scale horizontally, achieving linear performance improvements
This approach is well suited to a cloud model, whereby additional instances can be commissioned or decommissioned to have an immediate effect on performance.
The approaches described in the Google white papers were incorporated into the open source Apache Hadoop project.
![Page 5: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/5.jpg)
What is Apache Hadoop?
It is a cluster technology with a single master and multiple slaves, designed for commodity hardware
It consists of two runtimes, the Hadoop distributed file system (HDFS) and Map/Reduce
As data is copied onto the HDFS, it is split into blocks and each block is replicated to other machines (nodes) to provide redundancy
Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality).
Hadoop may execute or re-execute any task of a job on any node in the cluster.
Node failures are automatically handled by the framework.
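The Map/Reduce flow described above can be sketched in a single Ruby process. This is only an illustration of the programming model, not Hadoop's API: the method names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions, and each input "split" merely stands in for a block of data that a real cluster would process on the node storing it.

```ruby
# Single-process sketch of map -> shuffle -> reduce (word count).
# All method names are hypothetical; nothing here calls Hadoop.

# Map: turn one input line into [key, value] pairs.
def map_phase(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Shuffle: group values by key, as the framework does between the two phases.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
end

# Reduce: collapse all values for one key into a single result.
def reduce_phase(key, values)
  [key, values.sum]
end

# Run the whole job over a list of splits (each split is a list of lines).
def run_job(splits)
  mapped = splits.flat_map { |split| split.flat_map { |line| map_phase(line) } }
  shuffle(mapped).map { |key, values| reduce_phase(key, values) }.to_h
end

splits = [
  ["big data for everyone"],
  ["big clusters", "data everywhere"]
]
puts run_job(splits).inspect
```

Because `map_phase` depends only on its own line and `reduce_phase` only on one key's values, either step could be re-run on another machine after a failure without affecting the result.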
![Page 6: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/6.jpg)
The Big Data Ecosystem
Provisioning: ClusterChef / Apache Whirr / EC2
Commodity Hardware
Hadoop
Import/Export Tooling: Nutch / SQOOP / Flume
Scripting: Pig / WuKong / Cascading
DBA: Hive / Karmasphere
Non-Programmer: BigSheets / DataMeer
NoSQL: Cassandra / HBase
Offline Systems (Analytics)
Online Systems (OLTP @ Scale)
Human Consumption
Visualizations
![Page 7: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/7.jpg)
Offline customer scenario
Eric Sammer Solution Architect
@esammer #bd4e
![Page 8: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/8.jpg)
Use Case: Product Recommendations
“We can provide a better experience (and make more money) if we provide meaningful product recommendations.”
We need data:
- What products did a user buy?
- What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?
- What are the attributes of the user? (e.g. income, gender, friends)
- What are our margins on products, inventory, upcoming promotions?
![Page 9: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/9.jpg)
Problems
That’s a lot of data! (2 months of activity + all purchase data + all user data)
- Activity: ~20GB per day x ~60 days = 1.2TB
- User Data: ~2GB
- Purchase Data: ~5GB
- Misc: Inventory, product costs, promotion schedules
Distilling data to aggregates would reduce fidelity.
Easy to see how looking at more data could improve recommendations.
How do we keep this information current?
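The sizing on this slide can be checked with a few lines of Ruby. The figures come from the slide; the method name `total_input_gb` and the use of decimal units (1 TB = 1000 GB) are assumptions, and the unsized "Misc" data is left out.

```ruby
# Rough input size for the recommendation job, in GB.
# Figures are the slide's estimates; "Misc" data is not sized there, so it is omitted.
def total_input_gb(activity_gb_per_day:, days:, user_gb:, purchase_gb:)
  activity_gb_per_day * days + user_gb + purchase_gb
end

total = total_input_gb(activity_gb_per_day: 20, days: 60, user_gb: 2, purchase_gb: 5)
puts "Activity: #{20 * 60} GB"
puts "Total:    #{total} GB (about #{(total / 1000.0).round(2)} TB)"
```

Activity logs dominate the total, which is why keeping them at full fidelity rather than as aggregates drives the storage and processing requirement.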
![Page 10: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/10.jpg)
The Answer
Calculate all qualifying products once a day for each user and store them for quick display
Use Hadoop to process data in parallel on hundreds of machines
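One way to picture the once-a-day batch is sketched below. The deck does not say how qualifying products are chosen, so the scoring here (counting how often products co-occur in purchase histories) is purely an assumption for illustration, as is the method name `daily_recommendations`. On a cluster, the counting and scoring steps would be spread across machines; here they run in one process.

```ruby
# Hypothetical daily batch: recommend products that co-occur with what a user owns.
# purchases is an array of [user_id, product_id] pairs.
def daily_recommendations(purchases, top_n: 3)
  # Group distinct purchases by user.
  baskets = Hash.new { |h, user| h[user] = [] }
  purchases.each do |user, product|
    baskets[user] << product unless baskets[user].include?(product)
  end

  # Count how often each ordered pair of products appears in the same history.
  co_counts = Hash.new { |h, product| h[product] = Hash.new(0) }
  baskets.each_value do |products|
    products.permutation(2) { |a, b| co_counts[a][b] += 1 }
  end

  # Score products each user does not own yet and keep the top N per user.
  baskets.each_with_object({}) do |(user, owned), recs|
    scores = Hash.new(0)
    owned.each do |product|
      co_counts[product].each { |other, n| scores[other] += n unless owned.include?(other) }
    end
    recs[user] = scores.sort_by { |product, n| [-n, product] }.first(top_n).map(&:first)
  end
end

purchases = [%w[u1 a], %w[u1 b], %w[u2 a], %w[u2 b], %w[u2 c], %w[u3 a]]
puts daily_recommendations(purchases).inspect
```

The returned hash plays the role of the precomputed store: at display time the site only looks up a user's list rather than recomputing it.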
![Page 11: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/11.jpg)
![Page 12: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/12.jpg)
Online customer scenario
Matt Pfeil CEO
@mattz62 #bd4e
![Page 13: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/13.jpg)
04/11/23 13
What is Apache Cassandra?
![Page 14: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/14.jpg)
Use Case: Managing Email
“My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can’t keep up. I need to scale horizontally and get incredibly fast real time performance.”
The Problem
How do we achieve scalability, redundancy, high performance?
How do we store billions of files on commodity hardware?
How do we increase capacity by simply adding machines? (No SANs!)
How do we make it FAST?
![Page 15: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/15.jpg)
Requirements
Storage for Email
- Billions of emails (<100KB avg)
- 2M users, 100 MB of storage each = 190 TB
- Growth of 50% every 6 months
- Durable
Requirements for Storage System
- No Master/Single Point of Failure
- Linear scalability + redundancy
- Multiple Active Data Centers
- Many reads, many writes
- Millisecond response times
- Commodity hardware
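The storage figure and growth rate can be worked through as follows. This sketch assumes binary units (which is how 2M x 100 MB comes out at roughly 190 TB) and a single copy of the data; replication overhead is not quantified on the slide and is not modeled. The method names are hypothetical.

```ruby
# Raw capacity in TiB for a user base, using binary units (1 TiB = 1,048,576 MiB).
def capacity_tib(users:, mib_per_user:)
  users * mib_per_user / (1024.0 * 1024.0)
end

# Capacity after a number of 6-month periods, growing 50% per period by default.
def projected_tib(start_tib, periods, growth: 0.5)
  start_tib * (1 + growth)**periods
end

start = capacity_tib(users: 2_000_000, mib_per_user: 100)
puts format("Today:           %.1f TiB", start)
(1..4).each do |period|
  puts format("After %2d months: %.1f TiB", period * 6, projected_tib(start, period))
end
```

Compounding at 50% per half-year roughly quintuples the requirement in two years, which is why adding capacity by simply adding machines matters here.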
![Page 16: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/16.jpg)
Solution
800 TB of Storage
~1.75 Million reads or writes/sec (No Cache!)
130 Machines
Read/Write at both Data Centers
No “Master” Data Center
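A masterless layout can be illustrated with a small consistent-hashing ring: every node owns a position on the ring, and any node can work out which peers hold a key without asking a coordinator. This is a simplified sketch, not Cassandra's actual implementation; the method names, node names, MD5 as the hash, and the replica count of 3 are all assumptions.

```ruby
require "digest"

# Map any string to a position on the ring (an integer derived from MD5).
def ring_position(value)
  Digest::MD5.hexdigest(value).to_i(16)
end

# Build the ring: each node gets one position, and no node is special.
def build_ring(nodes)
  nodes.map { |node| [ring_position(node), node] }.sort
end

# Walk clockwise from the key's position and take the next `replicas` nodes.
def replicas_for(ring, key, replicas: 3)
  position = ring_position(key)
  start = ring.index { |token, _node| token >= position } || 0
  Array.new([replicas, ring.size].min) { |i| ring[(start + i) % ring.size].last }
end

ring = build_ring(%w[node-a node-b node-c node-d])
puts replicas_for(ring, "user42/inbox").inspect
```

Adding a machine inserts one more position on the ring, so only the keys adjacent to it move, which is the property that lets capacity grow by adding nodes.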
![Page 17: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/17.jpg)
Where to next? The Adjacent Possible
Flip Kromer CTO
@mrflip #bd4e
![Page 18: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/18.jpg)
Something about Anything
![Page 19: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/19.jpg)
Everything about Something
![Page 20: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/20.jpg)
Bigger than One Computer
![Page 21: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/21.jpg)
Bigger than Frontal Lobe
![Page 22: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/22.jpg)
Bigger than Excel
![Page 23: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/23.jpg)
what’s coming to help
![Page 24: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/24.jpg)
myth of the “data base”
Hold your data
Ask questions
![Page 25: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/25.jpg)
Managing & Shipping
Hadoop FTW
Cassandra, HBase, ElasticSearch, ...
Integration is still too hard
Dev Ops
Reliable Decoupling: Flume, Graphite
![Page 26: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/26.jpg)
Data flutters by label
Elephants make sturdy piles {GROUP}
Number becomes thought process_group
Hadoop
![Page 27: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/27.jpg)
class TwStP < Streamer
  def process line
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)
Twitter Parser in a Tweet
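For readers without Wukong installed, the same idea can be sketched in plain Ruby using only the standard `json` library: parse each line as JSON and emit its values ordered by key name. The method names `flatten_record` and `to_tsv` are hypothetical, and the tab-separated output is an assumption based on the usual Hadoop Streaming convention rather than something the slide states.

```ruby
require "json"

# Parse one line of JSON and return its values ordered by key name.
# Unparseable lines give an empty list, mirroring the `rescue {}` on the slide.
def flatten_record(line)
  record = JSON.parse(line)
  record = {} unless record.is_a?(Hash)
  record.values_at(*record.keys.sort)
rescue JSON::ParserError
  []
end

# Turn a list of JSON lines into tab-separated rows.
def to_tsv(lines)
  lines.map { |line| flatten_record(line).join("\t") }
end

sample = ['{"text":"hello #bd4e","id":1,"user":"mrflip"}', "not json"]
puts to_tsv(sample)
```

Sorting the keys gives every row the same column order, so downstream steps can treat the output as a flat table.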
![Page 28: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/28.jpg)
pure functionality
![Page 29: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/29.jpg)
pure functionality
![Page 30: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/30.jpg)
Cassandra
HBase
ElasticSearch
MySQL
Redis
TokyoTyrant
SimpleDB
MongoDB
sqlite
whisper (graphite)
file system
S3
Data Stores in Production
![Page 31: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/31.jpg)
Cassandra
HBase
ElasticSearch
MySQL
Redis
TokyoTyrant
SimpleDB
MongoDB
sqlite
whisper (graphite)
file system
S3
Dev Ops: Rethink Hard
![Page 32: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/32.jpg)
Still Blind
Visual Grammar to see it: NYTimes, Stamen, Ben Fry
Interactive tools: Tableau, Spotfire, bloom.io, d3.js, Gephi
![Page 33: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/33.jpg)
Human-Scale Tools
Data-as-a-Service: Infochimps, SimpleGeo, DrawnToScale
Business Intelligence: Familiar Paradigm, New Scale
BigSheets, Datameer
![Page 34: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/34.jpg)
Panel Discussion
Stu Hood Software Engineer
@stuhood #bd4e
![Page 35: Final deck](https://reader036.fdocuments.in/reader036/viewer/2022062319/554a3750b4c905863d8b45fb/html5/thumbnails/35.jpg)
Thanks for coming!
Stu Hood @stuhood
Flip Kromer @mrflip
Matt Pfeil @mattz62
Eric Sammer @esammer
Steve Watt @wattsteve