Real-time Big Data Processing with Storm

Real-time Big Data Processing with Storm: Using Twitter Streaming as Example, Liang-Chi Hsieh, Hadoop in Taiwan 2013

description

The slides "Real-time Big Data Processing with Storm: Using Twitter Streaming as Example" for the presentation at Hadoop in Taiwan 2013.

Transcript of Real-time Big Data Processing with Storm

Page 1: Real-time Big Data Processing with Storm

Real-time Big Data Processing with Storm: Using Twitter Streaming as Example

Liang-Chi Hsieh

Hadoop in Taiwan 2013

Page 2: Real-time Big Data Processing with Storm

In Today’s Talk

• Introduce stream computation in Big Data

• Introduce current stream computation platforms

• Storm

• Architecture & concepts

• Use case: analysis of Twitter streaming data

Page 3: Real-time Big Data Processing with Storm

Recap, the Four V’s of Big Data

• To help us talk about ‘big data’, it is common to break it down into four dimensions

• Volume: Scale of Data

• Velocity: Analysis of Streaming Data

• Variety: Different Forms of Data

• Veracity: Uncertainty of Data

http://dashburst.com/infographic/big-data-volume-variety-velocity/

Page 4: Real-time Big Data Processing with Storm

• Velocity: Data in motion

• Requires realtime response to process and analyze continuous data streams

http://www.intergen.co.nz/Global/Images/BlogImages/2013/Defining-big-data.png

Page 5: Real-time Big Data Processing with Storm

Streaming Data

• Data coming from:

• Logs

• Sensors

• Stock trades

• Personal devices

• Network connections

• etc...

Page 6: Real-time Big Data Processing with Storm

Batch Data Processing Architecture

[Diagram: Data Flow, Data Store (Hadoop), Batch Run, Batch View, Query]

• Views generated in batch may be out of date

• Batch workflow is too slow

Page 7: Real-time Big Data Processing with Storm

Data Processing Architecture: Batch and Realtime

[Diagram: Data Flow, Data Store (Hadoop), Batch Run, Realtime Processing, Batch View, Realtime View, Query]

• Generate realtime views of data by using stream computation
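One way to read the diagram is that a query consults both views and combines them. The sketch below is a minimal illustration of that idea in plain Java. The class name MergedViewQuery and the assumption that the realtime view only covers data that arrived after the last batch run are not from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Storm: answers a query by
// combining a precomputed batch view with a realtime view.
final class MergedViewQuery {
    // Assumption: batchView covers data up to the last batch run and
    // realtimeView covers only data that arrived after that run.
    static long count(Map<String, Long> batchView,
                      Map<String, Long> realtimeView,
                      String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("taipei", 1200L);
        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("taipei", 35L);
        // Combined count for a key that appears in both views.
        System.out.println(count(batchView, realtimeView, "taipei"));
    }
}
```

Under this assumption the batch view stays authoritative for old data, and the realtime view only has to cover the gap since the last batch run.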

Page 8: Real-time Big Data Processing with Storm

Current Stream Computation Platforms

• S4

• Storm

• Spark Streaming

• MillWheel

Page 9: Real-time Big Data Processing with Storm

S4

• General-purpose, distributed, scalable, fault-tolerant, pluggable platform for processing data streams

• Initially released by Yahoo!

• Apache Incubator project since September 2011

• Written in Java

[Diagram: Adapter, PEs & Streams]

Page 10: Real-time Big Data Processing with Storm

Storm

• Distributed and fault-tolerant realtime computation

• Provides a set of general primitives for doing realtime computation

http://storm-project.net/

Page 11: Real-time Big Data Processing with Storm

Spark Streaming

• (Near) real-time processing of stream data

• New programming model

• Discretized streams (D-Streams)

• Built on Resilient Distributed Datasets (RDDs)

• Based on Spark

• Integrated with Spark batch and interactive computation modes

Page 12: Real-time Big Data Processing with Storm

Spark Streaming

• D-Streams

• Treat a streaming computation as a series of deterministic batch computations on small time intervals

• Latencies can be as low as a second, supported by the fast execution engine Spark

val ssc = new StreamingContext(sparkUrl, "Tutorial", Seconds(1), sparkHome, Seq(jarFile))

val tweets = ssc.twitterStream(twitterUsername, twitterPassword)

val statuses = tweets.map(status => status.getText())

statuses.print()

[Diagram: Twitter Streaming Data divided into D-Streams (RDDs): batch@t, batch@t+1, batch@t+2]
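The D-Stream idea of cutting a stream into small time intervals and running a batch computation on each can be sketched without Spark. The following plain-Java example is only an illustration. MicroBatcher and its methods are hypothetical names, and real D-Streams are built on RDDs rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: this is not Spark's API. It shows the idea of cutting a
// stream into small time intervals and running a batch computation on each.
final class MicroBatcher {
    // Assigns each event timestamp (milliseconds) to the batch
    // floor(timestamp / intervalMs), keeping batches in time order.
    static Map<Long, List<Long>> discretize(List<Long> timestamps, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long batchIndex = Math.floorDiv(ts, intervalMs);
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // A deterministic per-batch computation: the number of events in each batch.
    static Map<Long, Integer> countPerBatch(Map<Long, List<Long>> batches) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, List<Long>> e : batches.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```

With a 1000 ms interval, events at 100, 900, 1100 and 2500 ms fall into batches 0, 0, 1 and 2, so the per-batch counts are 2, 1 and 1.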

Page 13: Real-time Big Data Processing with Storm

MillWheel

• Google’s computation framework for low-latency stream data-processing applications

• Application logic is written as individual nodes in a directed computation graph

• Fault tolerance

• Exactly-once delivery guarantees

• Low watermarks are used to prevent logical inconsistencies caused by out-of-order data delivery
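To make the low-watermark bullet concrete, here is a minimal plain-Java sketch, assuming that the low watermark is a lower bound on the timestamps of records that may still arrive. It is not MillWheel's API. WatermarkWindowCounter and the fixed-size windows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch, not MillWheel's API.
final class WatermarkWindowCounter {
    private final long windowSize;
    // Window start time -> number of records seen so far, ordered by start time.
    private final TreeMap<Long, Integer> open = new TreeMap<>();

    WatermarkWindowCounter(long windowSize) { this.windowSize = windowSize; }

    // Records may arrive in any order; each is counted in its own window.
    void add(long timestamp) {
        long start = Math.floorDiv(timestamp, windowSize) * windowSize;
        open.merge(start, 1, Integer::sum);
    }

    // The low watermark is a lower bound on the timestamps of records that
    // can still arrive. Windows ending at or before it are final.
    // Returns {windowStart, count} pairs for the windows that were closed.
    List<long[]> advanceLowWatermark(long lowWatermark) {
        List<long[]> closed = new ArrayList<>();
        while (!open.isEmpty() && open.firstKey() + windowSize <= lowWatermark) {
            Map.Entry<Long, Integer> e = open.pollFirstEntry();
            closed.add(new long[] {e.getKey(), e.getValue()});
        }
        return closed;
    }
}
```

If records with timestamps 12, 3 and 7 arrive in that order and the window size is 10, the window starting at 0 is reported with a count of 2 only once the low watermark reaches 10, so the late records 3 and 7 are not lost.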

Page 14: Real-time Big Data Processing with Storm

Storm: Distributed and Fault-Tolerant Realtime Computation

• Guaranteed data processing

• Every tuple will be fully processed

• Exactly-once? Using Trident

• Horizontal scalability

• Fault-tolerance

• Easy to deploy and operate

• One click deploy on EC2

Page 15: Real-time Big Data Processing with Storm

Storm Architecture

• A Storm cluster is similar to a Hadoop cluster

• Topologies vs. MapReduce jobs

• Running a topology:

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

• Killing a topology:

storm kill {topology name}

Page 16: Real-time Big Data Processing with Storm

Storm Architecture

• Two kinds of nodes

• Master node runs a daemon called Nimbus

• Each worker node runs a daemon called Supervisor

• Each worker process executes a subset of a topology

https://github.com/nathanmarz/storm/wiki/images/storm-cluster.png

Page 17: Real-time Big Data Processing with Storm

Topologies

• A topology is a graph of computation

• Each node contains processing logic

• Links between nodes represent the data flows between those processing units

• Topology definitions are Thrift structs and Nimbus is a Thrift service

• You can create and submit topologies using any programming language

Page 18: Real-time Big Data Processing with Storm

Topologies: Concepts

• Stream: unbounded sequence of tuples

• Primitives

• Spouts

• Bolts

• Interfaces can be implemented to run your logic

https://github.com/nathanmarz/storm/wiki/images/topology.png

Page 19: Real-time Big Data Processing with Storm

Data Model

• Storm uses tuples as its data model

• A named list of values

• A field in a tuple can be an object of any type

• Storm supports all the primitive types, strings, and byte arrays

• To use a custom type, implement a corresponding serializer
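The "named list of values" idea can be shown with a small stand-alone class. This is a hypothetical sketch for illustration only. NamedTuple is not Storm's Tuple type, which offers many more accessors and is created by the framework.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class for illustration; Storm's own Tuple type is richer.
final class NamedTuple {
    private final List<String> fields;
    private final List<Object> values;

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by position in the list.
    Object getValue(int index) { return values.get(index); }

    // Look a value up by its field name.
    Object getValueByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) throw new IllegalArgumentException("unknown field: " + field);
        return values.get(index);
    }

    public static void main(String[] args) {
        NamedTuple t = new NamedTuple(
            Arrays.asList("word", "count"),
            Arrays.<Object>asList("storm", 3));
        System.out.println(t.getValueByField("word") + " " + t.getValue(1));
    }
}
```

Because values are held as Object, a field can carry any type, which is why a serializer is needed for custom types that have to cross process boundaries.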

Page 20: Real-time Big Data Processing with Storm

Stream Grouping

• Define how streams are distributed to downstream tasks

• Shuffle grouping: randomly distributed

• Fields grouping: partitioned by specified fields

• All grouping: replicated to all tasks

• Global grouping: the entire stream goes to the task with the lowest id

https://github.com/nathanmarz/storm/wiki/images/topology-tasks.png
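The four groupings differ only in how a target task is chosen for each tuple. The sketch below illustrates that choice in plain Java, assuming tasks are numbered from 0 to numTasks-1 and that fields grouping hashes the selected field values. GroupingSketch is a hypothetical name and this is not Storm's internal implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative task selection for the four groupings; not Storm's internals.
// Tasks are numbered 0 .. numTasks-1.
final class GroupingSketch {
    // Shuffle grouping: pick a task at random.
    static int shuffle(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: equal field values always map to the same task.
    static int fields(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // All grouping: every task receives a copy.
    static List<Integer> all(int numTasks) {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) tasks.add(i);
        return tasks;
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    static int global(int numTasks) {
        return 0;
    }
}
```

The property that matters for fields grouping is determinism: two tuples with the same grouping values are always sent to the same task, so per-key state can live in a single place.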

Page 21: Real-time Big Data Processing with Storm

Simple Topology

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
        .shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
        .shuffleGrouping("exclaim1");

[Diagram: "words" (TestWordSpout) → shuffleGrouping → "exclaim1" (ExclamationBolt) → shuffleGrouping → "exclaim2" (ExclamationBolt)]

Shuffle grouping: tuples are randomly distributed to the bolt's tasks

Page 22: Real-time Big Data Processing with Storm

Submit Topology

Local mode:

Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();

Distributed mode:

Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);
StormSubmitter.submitTopology("mytopology", conf, topology);

Page 23: Real-time Big Data Processing with Storm

Guaranteeing Message Processing

• Every tuple will be fully processed

• Tuple tree

Fully processed: all messages in the tree must be processed.
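Storm's documentation describes acker tasks that follow each tuple tree with a single 64-bit value built by XORing tuple ids. The sketch below shows that idea in isolation. TupleTreeTracker and its method names are hypothetical, and it assumes tuple ids are random nonzero 64-bit numbers.

```java
// Illustrative sketch of tracking one tuple tree with a single 64-bit value.
// Assumption: tuple ids are random nonzero 64-bit numbers, so distinct ids
// are very unlikely to cancel each other out by accident.
final class TupleTreeTracker {
    private long ackVal;

    // Called once with the id of the root tuple emitted by the spout.
    TupleTreeTracker(long rootId) { this.ackVal = rootId; }

    // Called when a new tuple is anchored into the tree.
    void anchored(long tupleId) { ackVal ^= tupleId; }

    // Called when a tuple in the tree has been acked.
    void acked(long tupleId) { ackVal ^= tupleId; }

    // Every id has been XORed in twice (anchor and ack) once the tree is done.
    boolean fullyProcessed() { return ackVal == 0L; }
}
```

Since x XOR x is 0, the value returns to zero exactly when every id that entered the tree has also been acked, so the memory needed per tree does not grow with the size of the tree.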

Page 24: Real-time Big Data Processing with Storm

Storm Reliability API

• A bolt that splits a tuple containing a sentence into tuples of words

public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        // Anchor each word tuple to the input tuple
        _collector.emit(tuple, new Values(word));
    }
    // Mark the input tuple as handled by this bolt
    _collector.ack(tuple);
}

“Anchoring” creates a new link in the tuple tree.

Calling “ack” (or “fail”) marks the tuple as complete (or failed).

Page 25: Real-time Big Data Processing with Storm

Storm on YARN

• Enables Storm clusters to be deployed on Hadoop YARN

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn_architecture.gif

Page 26: Real-time Big Data Processing with Storm

Use Case: Analysis of Twitter Streaming Data

• Suppose we want to program a simple visualization for Twitter streaming data

• Tweet visualization on map: heatmap

• Since there are too many tweets at the same time, we would like to group tweets by their geo-locations

Page 27: Real-time Big Data Processing with Storm

Heatmap: Tweet Visualization on Map

• Graphical representation of tweet data

• Clear visualization of the intensity of tweet count by geo-locations

• Static or dynamic

Page 28: Real-time Big Data Processing with Storm

Batch Approach: Hadoop

• Generating static tweet heatmap

• Continuous data collecting

• Batch data processing using Hadoop Java programs, Hive or Pig

[Diagram: Twitter, Storage, Batch Processing by Hadoop]

Page 29: Real-time Big Data Processing with Storm

Simple Geo-location-based Tweet Grouping

• Goal

• To group geographically near tweets together

• Using Hive

Page 30: Real-time Big Data Processing with Storm

Data Store & Data Loading

• Simple data schema

CREATE EXTERNAL TABLE tweets (
  id_str STRING,
  geo STRUCT<
    type:STRING,
    coordinates:ARRAY<DOUBLE>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hduser/tweets';

• Loading data in Hive

load data local inpath '/mnt/tweets_2013_3.json' overwrite into table tweets;

Page 31: Real-time Big Data Processing with Storm

Hive Query

• Applying a Hive query to the collected tweet data

insert overwrite local directory '/tmp/tweets_coords.txt'
  select avg(geo.coordinates[0]),
         avg(geo.coordinates[1]),
         count(*) as tweet_count
  from tweets
  group by floor(geo.coordinates[0] * 100000),
           floor(geo.coordinates[1] * 100000)
  sort by tweet_count desc;
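The grouping in the query puts two tweets in the same group when both scaled coordinates have the same floor, then averages the coordinates and counts the tweets in each group. The plain-Java sketch below mirrors that logic for a list already in memory. GeoCellGrouping and Cell are hypothetical names, and the sketch sorts the whole result, whereas Hive's SORT BY orders rows within each reducer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java mirror of the Hive query's grouping; names are hypothetical.
final class GeoCellGrouping {
    // One output row: average of each coordinate and the tweet count of the cell.
    static final class Cell {
        double sum0, sum1;
        long count;
        double avg0() { return sum0 / count; }
        double avg1() { return sum1 / count; }
    }

    // Groups coordinate pairs by floor(coordinate * scale) on both axes and
    // returns the cells ordered by tweet count, largest first.
    static List<Cell> group(List<double[]> coordinates, double scale) {
        Map<String, Cell> cells = new HashMap<>();
        for (double[] c : coordinates) {
            String key = (long) Math.floor(c[0] * scale) + ":" + (long) Math.floor(c[1] * scale);
            Cell cell = cells.computeIfAbsent(key, k -> new Cell());
            cell.sum0 += c[0];
            cell.sum1 += c[1];
            cell.count++;
        }
        List<Cell> rows = new ArrayList<>(cells.values());
        rows.sort((a, b) -> Long.compare(b.count, a.count));
        return rows;
    }
}
```

The scale factor sets the cell size: with 100000, as in the query, a cell is 0.00001 degrees wide on each axis, and a smaller factor produces coarser groups.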

Page 32: Real-time Big Data Processing with Storm

Static Tweet Heatmap

• Heatmap visualization of a subset of the tweets collected in January 2013

Page 33: Real-time Big Data Processing with Storm

Streaming Approach: Storm

• Generate realtime Twitter usage heatmap view

• Higher-level Storm programming using DSLs

• Scala DSL here

Bolt DSL:

class ExclamationBolt extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = {
    t emit (t.getString(0) + "!!!")
    t ack
  }
}

Spout DSL:

class MySpout extends StormSpout(outputFields = List("word", "author")) {
  def nextTuple = {}
}

Page 34: Real-time Big Data Processing with Storm

Stream Computation Design

Tweets

Defined Time Slot

Calculate some statistics, e.g. average geo-locations, for each group

Group geographically near tweets

Perform prediction tasks such as classification, sentiment analysis

Send/Store results
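The core of this design is state that accumulates per group while tweets arrive and is handed over when the time slot ends. The plain-Java sketch below illustrates that pattern without Storm. TimeSlotAggregator, onTweet and onTick are hypothetical names, and counting tweets per cell stands in for whatever statistics are actually computed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: keeps per-cell tweet counts for the current time slot
// and hands them over when the slot ends. Names are hypothetical.
final class TimeSlotAggregator {
    private final double scale;
    private Map<String, Integer> current = new HashMap<>();

    TimeSlotAggregator(double scale) { this.scale = scale; }

    // Called for every tweet: add it to the group of its grid cell.
    void onTweet(double lat, double lng) {
        String cell = (long) Math.floor(lat * scale) + ":" + (long) Math.floor(lng * scale);
        current.merge(cell, 1, Integer::sum);
    }

    // Called when the time slot ends: return the slot's counts and start a new slot.
    Map<String, Integer> onTick() {
        Map<String, Integer> finished = current;
        current = new HashMap<>();
        return finished;
    }
}
```

In a topology, onTweet would correspond to handling a tuple from the tweet stream and onTick to handling a tuple from a clock stream. The map returned by onTick is what would be classified and then sent or stored.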

Page 35: Real-time Big Data Processing with Storm

Create Topology

val builder = new TopologyBuilder

builder.setSpout("tweetstream", new TweetStreamSpout, 1)
builder.setSpout("clock", new ClockSpout)
builder.setBolt("geogrouping", new GeoGrouping, 12)
  .fieldsGrouping("tweetstream", new Fields("geo_lat", "geo_lng"))
  .allGrouping("clock")

• Two Spouts

• One produces the tweet stream

• One generates the time intervals needed to update tweet statistics

• Only one Bolt; the tweet stream is fields-grouped by lat, lng

Page 36: Real-time Big Data Processing with Storm

Tweet Spout & Clock Spout

class TweetStreamSpout extends StormSpout(outputFields = List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def nextTuple = {
    ...
    emit (math.floor(lat * 10000), math.floor(lng * 10000), lat, lng, txt)
    ...
  }
}

class ClockSpout extends StormSpout(outputFields = List("timestamp")) {
  def nextTuple {
    Thread sleep 1000 * 1
    emit (System.currentTimeMillis / 1000)
  }
}

Page 37: Real-time Big Data Processing with Storm

GeoGrouping Bolt

class GeoGrouping extends StormBolt(List("geo_lat", "geo_lng", "lat", "lng", "txt")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(clockTime: Long) =>
      // Calculate statistics for each group of tweets
      // Perform classification tasks
      // Send/Store results
    case Seq(geo_lat: Double, geo_lng: Double, lat: Double, lng: Double, txt: String) =>
      // Group tweets by geo-locations
  }
}

Page 38: Real-time Big Data Processing with Storm

Demo
