A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne '14)

Transcript of A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne '14)

  • NoSQL Matter 2014 - A real-time (Lambda) Architecture using Hadoop & Storm - #nosql14. A real-time Lambda Architecture using Hadoop & Storm, NoSQL Matters Cologne 2014, by Nathan Bijnens
  • Speaker: Nathan Bijnens, Big Data Engineer @ Virdata, @nathan_gs
  • Computing Trends. Past: computation (CPUs) expensive; disk storage expensive; coordination easy (latches don't often hit); DRAM expensive. Current: computation cheap (many-core computers); disk storage cheap (cheap commodity disks); coordination hard (latches stall a lot, etc.); DRAM / SSD getting cheap. Source: Immutability Changes Everything - Pat Helland, RICON2012
  • Credits: Nathan Marz. Ex-BackType & Twitter; startup in stealth mode. Creator of Storm, Cascalog, and ElephantDB. Coined the term Lambda Architecture. manning.com/marz
  • A Data System
  • Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
  • Data is more than Information: Eventually you will reach the most raw form of information. This is the information you hold true, simply because it exists. Let's call this data, very similar to an event.
  • Events: Before. Events used to manipulate the master data.
  • Events: After. Today, events are the master data.
  • Data System: Let's store everything.
  • Data System: Data is immutable.
  • Data System: Data is time based.
  • Capturing change, traditionally: INSERT INTO contact (name, city) VALUES ('Nathan', 'Antwerp'); UPDATE contact SET city = 'Cologne' WHERE name = 'Nathan';
  • Capturing change, in a Data System: INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Antwerp', '2008-10-11 20:00Z'); INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Cologne', '2014-04-29 10:00Z');
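A minimal sketch of the "capturing change" idea from the two slides above, using Python's built-in sqlite3 module as a stand-in store. The `contact` table and its columns come from the slides; the helper `city_at` and the in-memory database are assumptions made for illustration. Nothing is ever updated: each change is appended as a new timestamped fact, and the state at any point in time is derived by a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory store
conn.execute("CREATE TABLE contact (name TEXT, city TEXT, timestamp TEXT)")

# Append-only: every change is a new timestamped fact, no UPDATE is issued.
facts = [
    ("Nathan", "Antwerp", "2008-10-11 20:00Z"),
    ("Nathan", "Cologne", "2014-04-29 10:00Z"),
]
conn.executemany(
    "INSERT INTO contact (name, city, timestamp) VALUES (?, ?, ?)", facts
)


def city_at(conn, name, as_of):
    """Hypothetical helper: city recorded for `name` at or before `as_of`."""
    # The timestamps share one fixed format, so string order equals time order.
    row = conn.execute(
        "SELECT city FROM contact WHERE name = ? AND timestamp <= ? "
        "ORDER BY timestamp DESC LIMIT 1",
        (name, as_of),
    ).fetchone()
    return row[0] if row else None


current_city = city_at(conn, "Nathan", "2014-12-31 00:00Z")
earlier_city = city_at(conn, "Nathan", "2010-01-01 00:00Z")
history_size = conn.execute("SELECT COUNT(*) FROM contact").fetchone()[0]
```

Because the older fact is still present, the same store answers both "where does Nathan live now" and "where did Nathan live in 2010", which the UPDATE-based version cannot do.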
  • Query: The data you query is often transformed, aggregated, ... It is rarely used in its original form.
  • Query: Query = function(all data)
  • Query: Number of people living in each city

    Person  City     Timestamp
    Nathan  Antwerp  2008-10-11
    John    Cologne  2010-01-23
    Dirk    Antwerp  2012-09-12
    Nathan  Cologne  2014-04-29

    City     Count
    Antwerp  1
    Cologne  2
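The "Query = function(all data)" idea can be sketched as a pure function over the facts from the slide. The function name `people_per_city` and the tuple layout are assumptions for illustration; the rule that a person counts toward the city in their most recent record is inferred from the slide's result.

```python
from collections import Counter

# (person, city, timestamp) facts from the slide; ISO dates sort as strings.
ALL_DATA = [
    ("Nathan", "Antwerp", "2008-10-11"),
    ("John", "Cologne", "2010-01-23"),
    ("Dirk", "Antwerp", "2012-09-12"),
    ("Nathan", "Cologne", "2014-04-29"),
]


def people_per_city(all_data):
    """Count people per city, using each person's most recent record."""
    latest = {}  # person -> (timestamp, city)
    for person, city, ts in all_data:
        if person not in latest or ts > latest[person][0]:
            latest[person] = (ts, city)
    return dict(Counter(city for _, city in latest.values()))


view = people_per_city(ALL_DATA)
```

Nathan's 2014 record supersedes the 2008 one, so the function yields Antwerp 1 and Cologne 2, matching the slide.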
  • Query (diagram): All Data → Precomputed View → Query
  • Layered Architecture: Batch Layer, Speed Layer, Serving Layer
  • Layered Architecture (diagram): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
  • Batch Layer
  • Batch Layer (diagram): Incoming Data, Hadoop, ElephantDB
  • Batch Layer: Unrestrained computation. The batch layer can calculate anything, given enough time...
  • Batch Layer: No need to de-normalize. The batch layer stores the data normalized; the generated views are often, if not always, denormalized.
  • Batch Layer: Horizontally scalable.
  • Batch Layer: High latency. Let's for now pretend the update latency doesn't matter.
  • Batch Layer: Functional computation, based on immutable inputs, is idempotent.
  • Batch Layer: Stores a master copy of the data set, append only.
  • Batch Layer
  • Batch: view generation (diagram): Master Dataset; one MapReduce job each for View #1, View #2, and View #3
  • MapReduce: 1. Take a large data set and divide it into subsets. 2. Perform the same function on all subsets. 3. Combine the output from all subsets. (Diagram: DoWork() runs on each subset in the map step; the reduce step produces the Output.)
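The three steps on the MapReduce slide can be sketched in a single Python process. This is not Hadoop code: `split`, `do_work`, and `combine` are hypothetical names chosen to mirror the slide, and the city-count job is only an assumed example workload.

```python
from collections import Counter
from functools import reduce


def split(records, n_subsets):
    """Step 1: divide the data set into n_subsets roughly equal subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def do_work(subset):
    """Step 2 (map): the same function applied to every subset."""
    return Counter(city for _, city in subset)


def combine(left, right):
    """Step 3 (reduce): merge two partial results into one."""
    return left + right


records = [("Nathan", "Cologne"), ("John", "Cologne"), ("Dirk", "Antwerp")]
partials = [do_work(subset) for subset in split(records, 2)]
output = reduce(combine, partials, Counter())
```

Each call to `do_work` depends only on its own subset, which is what lets a real cluster run the map step on many machines in parallel before combining.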
  • MapReduce
  • Serialization & Schema: Catch errors as quickly as they happen. Validate on write vs. on read.
  • Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
  • Serialization & Schema: Use a format with a schema: Thrift, Avro, Protocol Buffers. Could be combined with Parquet. Added bonus: it's faster and uses less space.
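A minimal sketch of "validate on write", using a frozen standard-library dataclass as a stand-in for a real schema format such as Thrift, Avro, or Protocol Buffers. The `ContactFact` schema, the `append_fact` helper, and the in-memory `master_dataset` list are all assumptions for illustration; the point is only that a malformed record is rejected before it reaches the master data set.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: a stored fact cannot be modified afterwards
class ContactFact:
    """Hypothetical schema for one fact; not taken from the talk."""
    name: str
    city: str
    timestamp: datetime

    def __post_init__(self):
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        if not isinstance(self.city, str) or not self.city:
            raise ValueError("city must be a non-empty string")
        if not isinstance(self.timestamp, datetime):
            raise ValueError("timestamp must be a datetime")


master_dataset = []  # assumption: stands in for the append-only store


def append_fact(name, city, timestamp):
    """Validate on write: construction raises before anything is stored."""
    fact = ContactFact(name, city, timestamp)
    master_dataset.append(fact)
    return fact


append_fact("Nathan", "Cologne", datetime(2014, 4, 29, 10, 0))
try:
    append_fact("Nathan", "Cologne", "29/04/2014")  # wrong type for timestamp
    rejected = False
except ValueError:
    rejected = True
```

With validation on read instead, the bad record would sit in the data set until some later batch job failed on it, far from the code that produced it.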
  • Batch View Database: Read-only database. No random writes required.
  • Batch View Database: Every iteration produces the views from scratch.
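A rough sketch of a batch view that is rebuilt from scratch on every iteration and exposed read-only. The `build_view` function and the sample records are assumptions for illustration; `types.MappingProxyType` is used only to mimic a database that accepts no random writes.

```python
from collections import Counter
from types import MappingProxyType


def build_view(master_dataset):
    """Recompute the whole view from the master data; no incremental state."""
    counts = Counter(city for _, city in master_dataset)
    return MappingProxyType(dict(counts))  # read-only wrapper around the result


master = [("John", "Cologne"), ("Dirk", "Antwerp")]
view_v1 = build_view(master)

master.append(("Nathan", "Cologne"))  # new data arrives, append only
view_v2 = build_view(master)          # next iteration replaces the old view

try:
    view_v2["Cologne"] = 0            # random writes are not possible
    writable = True
except TypeError:
    writable = False
```

Since each view is a pure function of the master data, rerunning `build_view` on the same input gives the same view, so a broken view can always be discarded and regenerated.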
  • Batch View Databases: Pure Lambda databases: ElephantDB, SploutSQL. Databases with a batch load & read-only views: Voldemort. Other databases that could be used: ElasticSearch/Solr (generate the Lucene indexes using MapReduce), Cassandra (generate SSTables), ...
  • Batch Layer: Eventually consistent, without the associated complexities.