Introduction to Apache Beam

JB Onofré - Talend

Who am I ?

● Talend

○ Software Architect

○ Apache team

● Apache

○ Member of the Apache Software Foundation

○ Champion/Mentor/PPMC/PMC/Committer for ~20 projects (Beam, Falcon, Lens, Brooklyn, Slider, Karaf, Camel, ActiveMQ, ACE, Archiva, Aries, ServiceMix, Syncope, jClouds, Unomi, Guacamole, BatchEE, Sirona, Incubator, …)

What is Apache Beam?

1. Agnostic (unified batch + stream) Beam programming model

2. Dataflow Java SDK (soon Python, DSLs)

3. Runners for Dataflow

a. Apache Flink (thanks to data Artisans)

b. Apache Spark (thanks to Cloudera)

c. Google Cloud Dataflow (fast, no-ops)

d. Local (in-process) runner for testing

e. OSGi/Karaf

Why Apache Beam?

1. Portable - You can use the same code with different runners (abstraction) and backends on premise, in the cloud, or locally

2. Unified - Same unified model for batch and stream processing

3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.

4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel

Beam Programming Model

Data processing pipeline (executed via a Beam runner)

[Figure: Input → PTransform/IO → PTransform → PTransform → Output]


1. Pipelines - data processing job as a directed graph of steps

2. PCollection - the data inside a pipeline

3. Transform - a step in the pipeline (takes PCollections as input and produces PCollections as output)

a. Core transforms - common transformation provided (ParDo, GroupByKey, …)

b. Composite transforms - combine multiple transforms

c. IO transforms - endpoints of a pipeline that create PCollections (consumer/root) or use PCollections to “write” data outside of the pipeline (producer)
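To make the model concrete, here is a minimal plain-Java sketch with no Beam dependency. The names PipelineModelSketch, Transform, perElement, and composite are hypothetical and are not Beam classes; unmodifiable lists stand in for PCollections, and chaining two steps stands in for a composite transform.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration only: none of these types are Beam classes.
class PipelineModelSketch {

    // A "transform" here is just a function from one list to another.
    interface Transform<I, O> extends Function<List<I>, List<O>> {}

    // Element-wise step, loosely analogous to a ParDo-style core transform.
    // The output list is unmodifiable, mirroring the immutability of a PCollection.
    static <I, O> Transform<I, O> perElement(Function<I, O> fn) {
        return in -> Collections.unmodifiableList(
                in.stream().map(fn).collect(Collectors.toList()));
    }

    // Composite transform: two steps chained and exposed as a single step.
    static <A, B, C> Transform<A, C> composite(Transform<A, B> first, Transform<B, C> second) {
        return in -> second.apply(first.apply(in));
    }

    public static void main(String[] args) {
        Transform<String, String> trim = perElement(String::trim);
        Transform<String, Integer> length = perElement(String::length);
        Transform<String, Integer> trimmedLength = composite(trim, length);

        List<String> input = Arrays.asList(" beam ", "flink", " spark");
        System.out.println(trimmedLength.apply(input)); // prints [4, 5, 5]
    }
}
```

In a real pipeline the runner decides how and where each step executes; this sketch simply runs the steps in order in one process.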

Beam Programming Model - PCollection

1. A PCollection is immutable, does not support random access to elements, and belongs to a pipeline

2. Each element in a PCollection has a timestamp (set by the IO Source)

3. A Coder supports different data types

4. A PCollection is bounded (batch) or unbounded (streaming), depending on the IO Source

5. Unbounded PCollections are grouped with windowing (thanks to the timestamp)

a. Fixed time window

b. Sliding time window

c. Session window

d. Global window (for bounded PCollections)

e. Can deal with time skew and data lag (late data) with triggers (time-based with watermark, data-based with counting, composite)
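As a rough illustration of how a timestamp maps to fixed and sliding windows, here is a plain-Java sketch. It is not the Beam windowing API; the class and method names are hypothetical, and it assumes non-negative millisecond timestamps with windows aligned to multiples of the window size (fixed) or of the period (sliding).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of window assignment; not the Beam windowing API.
class WindowAssignmentSketch {

    // Start of the fixed window [start, start + size) that contains the timestamp.
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Starts of all sliding windows [start, start + size) that contain the timestamp.
    // A new window starts every periodMillis, so one element can fall into several windows.
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - (timestampMillis % periodMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 125_000L; // 2 min 5 s after the epoch
        System.out.println(fixedWindowStart(ts, 60_000L));             // prints 120000
        System.out.println(slidingWindowStarts(ts, 60_000L, 30_000L)); // prints [120000, 90000]
    }
}
```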

Beam Programming Model - IO

1. IO Sources (read data as PCollections) and Sinks (write PCollections)

2. Support Bounded and/or Unbounded PCollections

3. Provided IO - File, BigQuery, BigTable, Avro, and more coming (Kafka, JMS, …)

4. Custom IO - extensible IO API to create custom sources & sinks

5. Should deal with timestamp, watermark, deduplication, parallelism (depending on the needs)
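To give a feel for what "watermark" means for a source, here is a plain-Java sketch of one simple heuristic. This is an assumption made for illustration, not how any particular Beam IO or runner computes its watermark: the watermark is taken as the largest event time seen so far minus a fixed allowed skew, and an element is called late when its event time is behind that watermark. The class name WatermarkSketch and its methods are hypothetical.

```java
// Plain-Java illustration of a heuristic watermark; not Beam code.
class WatermarkSketch {
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    WatermarkSketch(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    // Current watermark: no element older than this is expected any more.
    long watermark() {
        return maxEventTimeMillis == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMillis - allowedSkewMillis;
    }

    // Returns true if the element is behind the current watermark (late data),
    // then advances the watermark if the element is the newest seen so far.
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < watermark();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(5_000L);
        System.out.println(w.observe(10_000L)); // prints false
        System.out.println(w.observe(20_000L)); // prints false
        System.out.println(w.observe(12_000L)); // prints true
    }
}
```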

Apache Beam SDKs

1. API for Beam Programming Model (design pipelines, transforms, …)

2. Current SDKs

a. Java - First SDK and primary focus for refactoring and improvement

b. Python - Dataflow SDK preview for batch processing; will be migrated to Apache Beam once the Java SDK has been stabilized (and APIs/interfaces redefined)

3. Coming (possible) SDKs/languages - Scala, Go, Ruby, etc.

4. DSLs - domain specific languages on top of the SDKs (Java fluent DSL on top of Java SDK, …)

Java SDK

public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

  p.apply(TextIO.Read.from("/path/to..."))   // Read input.
   .apply(new CountWords())                  // Do some processing.
   .apply(TextIO.Write.to("/path/to..."));   // Write output.

  // Run the pipeline.
  p.run();
}


// Simplified notation: the trigger names are shorthand, not exact SDK identifiers.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(SessionWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

The Apache Beam Model (by way of the Dataflow model) includes many primitives and features which are powerful but hard to express in other models and languages.
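Session windows are one such primitive. The plain-Java sketch below (hypothetical names, no Beam dependency) shows the underlying idea for a single key: timestamps are grouped into one session as long as consecutive events are less than a gap apart. It assumes all data is available up front, so it says nothing about triggers or late data, which is exactly the part the model adds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of session grouping for one key; not the Beam API.
class SessionWindowSketch {

    // Groups timestamps into sessions. A new session starts when the gap to the
    // previous timestamp is at least gapMillis.
    static List<List<Long>> sessions(long[] timestampsMillis, long gapMillis) {
        long[] sorted = timestampsMillis.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        for (long ts : sorted) {
            if (current == null || ts - current.get(current.size() - 1) >= gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(ts);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {0L, 60_000L, 300_000L, 30_000L};
        // Two-minute gap, as in the snippet above.
        System.out.println(sessions(events, 120_000L)); // prints [[0, 30000, 60000], [300000]]
    }
}
```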

Runners and Backends

● Runners “translate” the code to a target backend (the runner itself doesn’t provide the backend)

● Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark

● As a result, pipelines can run on premises (for example, on your local Flink cluster) or in a public cloud (for example, using Google Cloud Dataproc or Amazon EMR)

● Apache Beam is focused on treating runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability

Beam Runners

● Google Cloud Dataflow

● Apache Flink*

● Apache Spark*

● Other runners* (local, OSGi, …)

[*] With varying levels of fidelity. The Apache Beam site (http://beam.incubator.apache.org) will have more details soon.

Use Cases

Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets

Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on

Stream can focus on handling real-time processing on a record-by-record basis

Real use cases

● Mobile gaming data processing, both batch and stream processing (https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/)

● Real-time event processing from IoT devices


Use Case - Gaming

● A game stores its results in a CSV file:

○ Player,team,score,timestamp

● Two pipelines:

○ UserScore (batch): sums the scores for each user

○ HourlyScore (batch): similar to UserScore, but with a fixed one-hour window; it sums the scores per team in each window
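Before looking at the Beam code, it may help to see what the two pipelines compute. The plain-Java sketch below is an illustration under assumptions, not the example's implementation: the class and method names are hypothetical, timestamps are taken to be epoch milliseconds, every line is assumed to be well formed, and the hourly result is keyed by a made-up "team@windowStartMillis" string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the two aggregations; not Beam code.
class GameScoreSketch {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // UserScore: total score per user over the whole (bounded) input.
    static Map<String, Integer> userScores(List<String> csvLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            sums.merge(f[0].trim(), Integer.parseInt(f[2].trim()), Integer::sum);
        }
        return sums;
    }

    // HourlyScore: total score per team within each fixed one-hour window.
    static Map<String, Integer> hourlyTeamScores(List<String> csvLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            long ts = Long.parseLong(f[3].trim());
            long windowStart = ts - (ts % HOUR_MILLIS);
            sums.merge(f[1].trim() + "@" + windowStart, Integer.parseInt(f[2].trim()), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "alice,red,10,1000",
                "bob,red,5,2000",
                "alice,red,7,3600000",
                "carol,blue,3,3700000");
        System.out.println(userScores(lines));       // prints {alice=17, bob=5, carol=3}
        System.out.println(hourlyTeamScores(lines)); // prints {blue@3600000=3, red@0=15, red@3600000=7}
    }
}
```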

Use Case - Gaming - UserScore - Pipeline

Pipeline pipeline = Pipeline.create(options);

// Read events from a text file and parse them.
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Extract and sum username/score pairs from the event data.
    .apply("ExtractUserScore", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureBigQueryWrite()));

// Run the batch pipeline.
pipeline.run();

Use Case - Gaming - UserScore - Avro Coder

@DefaultCoder(AvroCoder.class)
static class GameActionInfo {
  @Nullable String user;
  @Nullable String team;
  @Nullable Integer score;
  @Nullable Long timestamp;

  public GameActionInfo(String user, String team, Integer score, Long timestamp) {
    …
  }
  …
}

Use Case - Gaming - UserScore - Parse Event Fn

static class ParseEventFn extends DoFn<String, GameActionInfo> {
  // Log and count parse errors.
  private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
  private final Aggregator<Long, Long> numParseErrors =
      createAggregator("ParseErrors", new Sum.SumLongFn());

  @Override
  public void processElement(ProcessContext c) {
    String[] components = c.element().split(",");
    try {
      String user = components[0].trim();
      String team = components[1].trim();
      Integer score = Integer.parseInt(components[2].trim());
      Long timestamp = Long.parseLong(components[3].trim());
      GameActionInfo gInfo = new GameActionInfo(user, team, score, timestamp);
      c.output(gInfo);
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      numParseErrors.addValue(1L);
      LOG.info("Parse error on " + c.element() + ", " + e.getMessage());
    }
  }
}

Use Case - Gaming - UserScore - Sum Score Tr

public static class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {

  private final String field;

  ExtractAndSumScore(String field) {
    this.field = field;
  }

  @Override
  public PCollection<KV<String, Integer>> apply(
      PCollection<GameActionInfo> gameInfo) {
    return gameInfo
        .apply(MapElements
            .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
            .withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
        .apply(Sum.<String>integersPerKey());
  }
}

Use Case - Gaming - HourlyScore - Pipeline

pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Filter with byPredicate to ignore some data.
    .apply("FilterStartTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() > startMinTimestamp.getMillis()))
    .apply("FilterEndTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() < stopMinTimestamp.getMillis()))
    // Attach event timestamps, then use a fixed-time window.
    .apply("AddEventTimestamps",
        WithTimestamps.of((GameActionInfo i) -> new Instant(i.getTimestamp())))
    .apply(Window.named("FixedWindowsTeam")
        .<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60))))
    // Extract and sum teamname/score pairs from the event data.
    .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
    // Write the result.
    .apply("WriteTeamScoreSums",
        new WriteWindowedToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureWindowedTableWrite()));

pipeline.run();

Roadmap

● 02/01/2016 - Enter Apache Incubator

● 02/25/2016 - 1st commit to ASF repository

● Early 2016 - Design for use cases, begin refactoring

● Mid 2016 - Slight chaos

● Late 2016 - Multiple runners execute Beam pipelines

● End 2016 - Cloud Dataflow should run Beam pipelines

More information and get involved!

1: Read about Apache Beam

Apache Beam website - http://beam.incubator.apache.org

2: See what the Apache Beam team is doing

Apache Beam JIRA - https://issues.apache.org/jira/browse/BEAM

Apache Beam mailing lists - http://beam.incubator.apache.org/mailing_lists/

3: Contribute!

Apache Beam git repo - https://github.com/apache/incubator-beam
