Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data...

21
Processing Data of Any Size with Apache Beam 1 / 21 Copyright © 2016 Smoking Hand LLC. All rights Reserved. Version: bc7f1cf

Transcript of Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data...

Page 1: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

ProcessingDataofAnySizewithApacheBeam

1/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 2: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Mentoring,training,andhigh-levelconsultingcompanyfocusedonBigData,NoSQLandTheCloud

Foundedin2008WehelpmakecompaniessuccessfulwithBigDataprojects

OngoingteammentoringUsecaseevaluationManagementtrainingTechnicaltrainingArchitecturereviewsLiveandemailprogrammingsupport

Gotohttp://www.bigdatainstitute.ioformoreinformation

AboutBigDataInstitute

2/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 3: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Yourexperienceasadeveloper,analystoradministrator

Whichlanguagesyouuse

ExperiencewithHadoop,BigDataorNoSQL

Expectationsfromthisclass

AboutYou

3/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 4: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Chapter1

IntroducingApacheBeam

4/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 5: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

WhatIsBeam?WhyUseBeam?UsingBeam

IntroducingApacheBeam

5/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 6: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

ApacheBeamisaunifiedmodelforprocessingdata

WasoriginallycreatedatGoogleLaterdonatedtotheApacheFoundationasApacheBeamNowanApachetoplevelproject

BeamcodeiswrittentoitsAPICodeisexecutedondifferentrunnersNotdirectlytiedtoaframeworkorrunner

Allinteractionsaredonethroughpipelines

ApacheBeam

6/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 7: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Pipeline

The DoFNs take in the inputprocesses it, and emit the results

Source

The Source reads the input onerecord or row at a time

DoFN DoFN Sink

The Source saves the output of theDoFN to the targeted path

All work is encapsulated in aPipeline

BeamPipelinesDiagram

7/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 8: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Juan

Fatima

Mark

14:00 14:30 15:00 15:30 16:00

Data is broken intosessions based on acriteria for a timeoutbetween actions.

Data can be calculatedin fixed windows wherethe time doesn't change.

Data can be calculatedin sliding windows wherethe time is fixed butadvances.

BeamWindowing

8/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 9: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

WhatIsBeam?WhyUseBeam?UsingBeam

IntroducingApacheBeam

9/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 10: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Learningframework-specificAPIseverytimeanewframeworkcomesoutorcompletelychangestheirexistingAPIdoesn’tcreate

value

TooManyAPIs

10/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 11: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Hadoop Cluster

Real-time data is published toKafka

Spark Streaming, Storm, orKafka Consumers process in

real-time

DataSource

DataSource

DataSource

RDBMS

Real-timeProcessingKafka Cluster

BI Analytics

Batch data is saved to HDFS

DataSource

DataSource

DataSource

MapReduce, Hive, Pig, Crunch,and Spark process data stored

in HDFS

Real-time data is archived toHDFS for analytics and offline

processing

GeneralArchitectureDiagram

11/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 12: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

OneAPItorulethemallOneAPItolearnMovebetweenframeworks

ThemostunifiedbatchandstreamAPII’veused

UnifiedAPItotheecosystem

Riskmitigationofframeworks

Multiplelanguages

WhyI'mExcitedAboutBeam

12/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 13: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Beamisn'ttiedtoaspecificframework

ApacheSparkusesthespark-submit

ApacheFlinkcanbesubmittedwiththeMavenrunner

GoogleCloudDataflowcanbesubmittedwiththeMavenrunner

TheDirectRunnercanbestartedwiththeMavenrunner

RunningBeam

13/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 14: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

BeamContributions

14/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 15: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

WhatIsBeam?WhyUseBeam?UsingBeam

IntroducingApacheBeam

15/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 16: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

IcannotteachhimTheboyhasnopatience

PCollection<String>etl=lines.apply(MapElements.via((Stringline)->line.toUpperCase()).withOutputType(TypeDescriptors.strings()));

ICANNOTTEACHHIMTHEBOYHASNOPATIENCE

MapElements

16/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 17: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.

PCollection<String>linecount=lines.apply(Regex.matches("I.*\\."));

Icannotteachhim.Theboyhasnopatience.

RegularexpressionscanbeusedtoparseKVs

Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.

PCollection<KV<String,String>>twoSentences=lines.apply(Regex.findKV("(.*)\\.(.*)",1,2));

<Icannotteachhim,Theboyhasnopatience>

RegexTransform

17/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 18: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Icannotteachhim.Theboyhasnopatience.Hewilllearnpatience.

PCollection<String>pats=lines.apply(ParDo.of(newPatLinesFN()));

staticclassPatLinesFNextendsDoFn<String,String>{@ProcessElementpublicvoidprocessElement(DoFn<String,String>.ProcessContextcontext)throwsException{String[]pieces=context.element().split("");

for(Stringpiece:pieces){if(piece.startsWith("pat")){context.output(piece);}}}}

patience.patience.

ExampleCustomDoFN

18/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 19: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

importorg.apache.beam.sdk.Pipeline;importorg.apache.beam.sdk.io.TextIO;importorg.apache.beam.sdk.options.PipelineOptions;importorg.apache.beam.sdk.options.PipelineOptionsFactory;importorg.apache.beam.sdk.transforms.Count;importorg.apache.beam.sdk.transforms.Regex;importorg.apache.beam.sdk.transforms.ToString;

publicclassPicoWordCount{publicstaticvoidmain(String[]args){PipelineOptionsoptions=PipelineOptionsFactory.create();Pipelinep=Pipeline.create(options);

p.apply(TextIO.Read.from("playing_cards.tsv")).apply(Regex.split("\\W+")).apply(Count.perElement()).apply(ToString.elements()).apply(TextIO.Write.to("output/stringcounts"));

p.run();}}

PlayingCardAlgorithm

19/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 20: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

WhatareotherpeopledoingwithBeam?http://tiny.jesse-anderson.com/beaminterview

WhereissomesampleBeamcode?http://tiny.jesse-anderson.com/beamtutorial

MainBeamsitehttps://beam.apache.org/

Convincingyourbosshttp://tiny.jesse-anderson.com/beam1http://tiny.jesse-anderson.com/beam2

NextSteps

20/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf

Page 21: Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data Was originally created at Google ... Real-time data is published to Kafka Spark

Current:Instructor,ThoughtLeader,MonkeyTamer

Previously:CurriculumDeveloperandInstructor@ClouderaSeniorSoftwareEngineer@Intuit

Covered,ConferencesandPublishedIn:GigaOM,ArsTecnica,PragmaticProgrammers,Strata,OSCON,WallStreetJournal,CNN,BBC,NPR

SeeMeOn:http://www.jesse-anderson.com@jessetandersonhttp://tiny.bdi.io/linkedinhttp://tiny.bdi.io/youtube

AboutMe

21/21Copyright©2016SmokingHandLLC.AllrightsReserved.Version:bc7f1cf