How Apache Beam Will Change Big Data - QConSP… · Apache Beam is a unified model for processing...

How Apache Beam Will Change Big Data

Copyright © 2016 Smoking Hand LLC. All Rights Reserved. Version: 85872ec


Mentoring, training, and high-level consulting company focused on Big Data, NoSQL and The Cloud

Founded in 2008
We help make companies successful with Big Data projects

Ongoing team mentoring
Use case evaluation
Management training
Technical training
Architecture reviews
Live and email programming support

Go to http://www.bigdatainstitute.io for more information

About Big Data Institute


Your experience as a developer, analyst or administrator

Which languages you use

Experience with Hadoop, Big Data or NoSQL

Expectations from this class

About You


Chapter 1

Introducing Apache Beam


What Is Beam?
Why Use Beam?
Using Beam

Introducing Apache Beam


Apache Beam is a unified model for processing data

Was originally created at Google
Later donated to the Apache Software Foundation as Apache Beam
Now an Apache top-level project

Beam code is written to its API
Code is executed on different runners
Not directly tied to a framework or runner

All interactions are done through pipelines

Apache Beam


[Diagram: Pipeline containing Source -> DoFn -> DoFn -> Sink]

The Source reads the input one record or row at a time.

The DoFns take in the input, process it, and emit the results.

The Sink saves the output of the DoFns to the targeted path.

All work is encapsulated in a Pipeline.

Beam Pipelines Diagram
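The flow in the diagram can be imitated in plain Java to make the roles concrete. This is only a minimal sketch, not the Beam API: `MiniPipeline`, `Step`, and `applyStep` are hypothetical names invented here, in-memory lists stand in for the source and the sink, and the two steps (split into words, then upper-case) are arbitrary examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class MiniPipeline {
    /** Hypothetical stand-in for a DoFn: sees one element, may emit zero or more outputs. */
    interface Step<I, O> {
        void process(I element, Consumer<O> out);
    }

    /** Runs every input element through one step and collects whatever the step emits. */
    static <I, O> List<O> applyStep(List<I> input, Step<I, O> step) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            step.process(element, output::add);
        }
        return output;
    }

    /** Source -> step -> step -> sink, with lists standing in for the source and sink. */
    static List<String> run(List<String> source) {
        // First step: one line in, many words out.
        List<String> words = applyStep(source, (String line, Consumer<String> out) -> {
            for (String word : line.split(" ")) {
                out.accept(word);
            }
        });
        // Second step: one word in, one upper-cased word out.
        return applyStep(words, (String word, Consumer<String> out) -> out.accept(word.toUpperCase()));
    }

    public static void main(String[] args) {
        // Prints [I, CANNOT, TEACH, HIM, THE, BOY, HAS, NO, PATIENCE]
        System.out.println(run(Arrays.asList("I cannot teach him", "The boy has no patience")));
    }
}
```

The point of the shape is that each step only ever sees one element at a time, which is what lets a runner decide how to distribute the work.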


[Diagram: rows for Juan, Fatima, and Mark on a time axis marked 14:00, 14:30, 15:00, 15:30, 16:00]

Data is broken into sessions based on a timeout between actions.

Data can be calculated in fixed windows, where the time span of each window doesn't change.

Data can be calculated in sliding windows, where the window length is fixed but the window advances.

Beam Windowing
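The three windowing strategies come down to simple arithmetic on timestamps. The sketch below is plain Java rather than Beam's windowing API; `WindowSketch` and its methods are hypothetical names, timestamps are expressed as minutes since midnight, and a session is assumed to end once the gap between two actions reaches the timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowSketch {
    /** Fixed windows: every timestamp belongs to exactly one window; returns that window's start. */
    static long fixedWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Sliding windows: returns the start of every window of the given size, advancing by period, that contains ts. */
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, period);
        // Walk backwards while the window [start, start + size) still covers ts.
        for (long start = lastStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    /** Sessions: groups sorted timestamps; a gap of at least the timeout starts a new session. */
    static List<List<Long>> sessions(List<Long> sortedTs, long timeout) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= timeout) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long ts = 14 * 60 + 10; // 14:10 expressed as minutes since midnight (850)
        System.out.println(fixedWindowStart(ts, 30));        // 840, i.e. the 14:00-14:30 window
        System.out.println(slidingWindowStarts(ts, 60, 30)); // [840, 810], i.e. 14:00-15:00 and 13:30-14:30
        System.out.println(sessions(Arrays.asList(840L, 845L, 850L, 900L, 905L), 30)); // [[840, 845, 850], [900, 905]]
    }
}
```

Note that a timestamp lands in exactly one fixed window but in several sliding windows, while session boundaries depend entirely on the data.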


What Is Beam?
Why Use Beam?
Using Beam

Introducing Apache Beam


Learning framework-specific APIs every time a new framework comes out or completely changes its existing API doesn't create value

Too Many APIs


[Diagram labels: Data Source (x6), RDBMS, Kafka Cluster, Real-time Processing, Hadoop Cluster, BI Analytics]

Real-time data is published to Kafka

Spark Streaming, Storm, or Kafka Consumers process in real-time

Batch data is saved to HDFS

MapReduce, Hive, Pig, Crunch, and Spark process data stored in HDFS

Real-time data is archived to HDFS for analytics and offline processing

General Architecture Diagram


One API to rule them all
One API to learn
Move between frameworks

The most unified batch and stream API I've used

Unified API to the ecosystem

Risk mitigation of frameworks

Why I'm Excited About Beam


Beam isn't tied to a specific framework

Apache Spark jobs are submitted with spark-submit

Apache Flink jobs can be submitted with the Maven runner

Google Cloud Dataflow jobs can be submitted with the Maven runner

The Direct Runner can be started with the Maven runner

Running Beam


Beam Contributions


What Is Beam?
Why Use Beam?
Using Beam

Introducing Apache Beam


Input:
I cannot teach him
The boy has no patience

    PCollection<String> etl = lines.apply(
        MapElements.via((String line) -> line.toUpperCase())
            .withOutputType(TypeDescriptors.strings()));

Output:
I CANNOT TEACH HIM
THE BOY HAS NO PATIENCE

MapElements


Input:
I cannot teach him. The boy has no patience. He will learn patience.

    PCollection<String> linecount = lines.apply(Regex.matches("I.*\\."));

Output:
I cannot teach him. The boy has no patience.

Regular expressions can be used to parse KVs

Input:
I cannot teach him. The boy has no patience. He will learn patience.

    PCollection<KV<String, String>> twoSentences = lines.apply(Regex.findKV("(.*)\\.(.*)", 1, 2));

Output:
<I cannot teach him, The boy has no patience>

Regex Transform
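The two patterns on this slide can be tried outside a pipeline with plain `java.util.regex`. This is a sketch under assumptions: `RegexSketch` and its helper methods are hypothetical, and it assumes the Beam transforms follow the usual whole-input `matches()` and first-occurrence `find()` semantics of `java.util.regex.Matcher`. The sample strings are shortened from the slide's text.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    /** Returns the line if the whole line matches the pattern, otherwise null. */
    static String wholeLineMatch(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (m.matches()) {
            return m.group(0);
        }
        return null;
    }

    /** Finds the first occurrence and returns two capture groups as key and value, or null if nothing is found. */
    static Map.Entry<String, String> findKeyValue(String regex, String line, int keyGroup, int valueGroup) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (m.find()) {
            return new SimpleEntry<String, String>(m.group(keyGroup), m.group(valueGroup));
        }
        return null;
    }

    public static void main(String[] args) {
        // A line that starts with "I" and ends with a period matches as a whole.
        System.out.println(wholeLineMatch("I.*\\.", "I cannot teach him."));      // I cannot teach him.
        // A line that does not start with "I" is dropped.
        System.out.println(wholeLineMatch("I.*\\.", "He will learn patience.")); // null
        // Group 1 becomes the key and group 2 the value.
        System.out.println(findKeyValue("(.*)\\.(.*)", "I cannot teach him.The boy has no patience", 1, 2));
    }
}
```

Because `(.*)` is greedy, the key runs up to the last period in the line, so a line containing several periods splits at the final one rather than the first.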


Input:
I cannot teach him. The boy has no patience. He will learn patience.

    PCollection<String> pats = lines.apply(ParDo.of(new PatLinesFN()));

    static class PatLinesFN extends DoFn<String, String> {
      @ProcessElement
      public void processElement(DoFn<String, String>.ProcessContext context) throws Exception {
        // Split the line into words on single spaces.
        String[] pieces = context.element().split(" ");

        // Emit only the words that start with "pat".
        for (String piece : pieces) {
          if (piece.startsWith("pat")) {
            context.output(piece);
          }
        }
      }
    }

Output:
patience.
patience.

Example Custom DoFn


    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Regex;
    import org.apache.beam.sdk.transforms.ToString;

    public class PicoWordCount {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        // Read lines, split them into words, count each word, and write the counts as text.
        p.apply(TextIO.Read.from("playing_cards.tsv"))
            .apply(Regex.split("\\W+"))
            .apply(Count.perElement())
            .apply(ToString.elements())
            .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
      }
    }

Playing Card Algorithm
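To see what the split and count steps compute without setting up a runner, the same logic can be written against in-memory lines. This is a plain-Java sketch, not Beam code: `WordCountSketch` is a hypothetical name, the sample rows are made up, and skipping empty tokens is an assumption about how the split step treats them.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    /** Splits each line on runs of non-word characters and counts how often each token occurs. */
    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys keep the printed result stable
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) { // assumption: empty tokens are not counted
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up tab-separated rows, standing in for the contents of playing_cards.tsv.
        System.out.println(countWords(Arrays.asList("Spade\t2", "Heart\t2"))); // {2=2, Heart=1, Spade=1}
    }
}
```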


What are other people doing with Beam?
http://tiny.jesse-anderson.com/beaminterview

Where is some sample Beam code?
http://tiny.jesse-anderson.com/beamtutorial

Main Beam site
https://beam.apache.org/

Convincing your boss
http://tiny.jesse-anderson.com/beam1
http://tiny.jesse-anderson.com/beam2

Next Steps


Current: Instructor, Thought Leader, Monkey Tamer

Previously:
Curriculum Developer and Instructor @ Cloudera
Senior Software Engineer @ Intuit

Covered, Conferences and Published In: GigaOM, Ars Technica, Pragmatic Programmers, Strata, OSCON, Wall Street Journal, CNN, BBC, NPR

See Me On:
http://www.jesse-anderson.com
@jessetanderson
http://tiny.bdi.io/linkedin
http://tiny.bdi.io/youtube

About Me
