Processing Data of Any Size with Apache Beam · Apache Beam is a unified model for processing data...

Processing Data of Any Size with Apache Beam

Copyright © 2016 Smoking Hand LLC. All rights Reserved. Version: bc7f1cf

Mentoring, training, and high-level consulting company focused on Big Data, NoSQL and The Cloud

Founded in 2008

We help make companies successful with Big Data projects

Ongoing team mentoring
Use case evaluation
Management training
Technical training
Architecture reviews
Live and email programming support

Go to http://www.bigdatainstitute.io for more information

About Big Data Institute



Your experience as a developer, analyst or administrator

Which languages you use

Experience with Hadoop, Big Data or NoSQL

Expectations from this class

About You


Chapter 1

Introducing Apache Beam


What Is Beam?
Why Use Beam?
Using Beam



Apache Beam is a unified model for processing data

Was originally created at Google
Later donated to the Apache Software Foundation as Apache Beam
Now an Apache top-level project

Beam code is written to its API
Code is executed on different runners
Not directly tied to a framework or runner

All interactions are done through pipelines

Apache Beam


Pipeline: Source -> DoFn -> DoFn -> Sink

The Source reads the input one record or row at a time

The DoFns take in the input, process it, and emit the results

The Sink saves the output of the DoFn to the targeted path

All work is encapsulated in a Pipeline

Beam Pipelines Diagram
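The flow in the diagram can be modeled in a few lines of plain Java. This is only a sketch of the idea, not Beam code: the `PipelineSketch` class, the `Step` interface, and the `run` method are hypothetical names made up for illustration, and a real Beam pipeline would use the SDK's `Pipeline`, `DoFn`, and I/O classes instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of the diagram; none of these types are Beam classes.
final class PipelineSketch {

    // Stand-in for a DoFn: receives one element and may emit zero or more results.
    interface Step<I, O> {
        void process(I element, Consumer<O> out);
    }

    // Reads the source one record at a time, pushes each record through both
    // steps, and collects whatever reaches the end in the sink.
    static <A, B, C> List<C> run(Iterable<A> source, Step<A, B> first, Step<B, C> second) {
        List<C> sink = new ArrayList<>();
        for (A record : source) {
            first.process(record, b -> second.process(b, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> source = List.of("I cannot teach him", "The boy has no patience");

        // First step: one line in, many words out.
        Step<String, String> splitWords = (line, out) -> {
            for (String word : line.split(" ")) {
                out.accept(word);
            }
        };

        // Second step: drop short words, upper-case the rest.
        Step<String, String> keepLong = (word, out) -> {
            if (word.length() > 5) {
                out.accept(word.toUpperCase());
            }
        };

        System.out.println(run(source, splitWords, keepLong)); // prints [CANNOT, PATIENCE]
    }
}
```

Each step can emit zero, one, or many results per input, which is why the model uses an output callback rather than a simple return value.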


Diagram: events for three users (Juan, Fatima, Mark) on a timeline from 14:00 to 16:00

Data can be broken into sessions, where a session ends after a timeout between actions.

Data can be calculated in fixed windows, where each window covers a set span of time that doesn't change.

Data can be calculated in sliding windows, where the window length is fixed but the window advances.

Beam Windowing
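The three windowing styles come down to simple arithmetic on event timestamps. The sketch below models that arithmetic in plain Java; it is not the Beam windowing API, and the `WindowSketch` class, its method names, and the sample timestamps are assumptions made for illustration. Timestamps are minutes since midnight so they line up with the 14:00 to 16:00 axis in the diagram.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of window assignment; not the Beam API.
final class WindowSketch {

    // Fixed windows: every timestamp falls in exactly one window of length `size`.
    static long fixedWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    // Sliding windows: each window has length `size` and a new one starts every
    // `period`, so one timestamp can belong to several overlapping windows.
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, period);
        for (long start = lastStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sessions: sorted timestamps for one user are split whenever the pause
    // between two consecutive events is at least `gap`.
    static List<List<Long>> sessions(List<Long> sortedTs, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= gap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long ts = 14 * 60 + 40; // an event at 14:40, i.e. minute 880

        // 30-minute fixed windows: 14:40 lands in the window starting at 14:30.
        System.out.println(fixedWindowStart(ts, 30)); // prints 870

        // 60-minute windows that advance every 30 minutes: 14:40 is in two of them.
        System.out.println(slidingWindowStarts(ts, 60, 30)); // prints [870, 840]

        // Events at 14:00, 14:05, 15:00, 15:05 with a 30-minute timeout: two sessions.
        System.out.println(sessions(List.of(840L, 845L, 900L, 905L), 30)); // prints [[840, 845], [900, 905]]
    }
}
```

Fixed and sliding windows depend only on the clock, so they can be computed per event. Sessions depend on the other events for the same user, which is why the session method needs the whole sorted list.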


Learning framework-specific APIs every time a new framework comes out or completely changes its existing API doesn't create value

Too Many APIs


Diagram labels: Data Sources, Kafka Cluster (Real-time Processing), Hadoop Cluster, RDBMS, BI Analytics

Real-time data is published to Kafka

Spark Streaming, Storm, or Kafka Consumers process in real-time

Batch data is saved to HDFS

MapReduce, Hive, Pig, Crunch, and Spark process data stored in HDFS

Real-time data is archived to HDFS for analytics and offline processing

General Architecture Diagram


One API to rule them all
One API to learn
Move between frameworks

The most unified batch and stream API I've used

Unified API to the ecosystem

Risk mitigation of frameworks

Multiple languages

Why I'm Excited About Beam


Beam isn't tied to a specific framework

Apache Spark uses spark-submit

Apache Flink pipelines can be submitted with the Maven runner

Google Cloud Dataflow pipelines can be submitted with the Maven runner

The Direct Runner can be started with the Maven runner

Running Beam


Beam Contributions


I cannot teach him
The boy has no patience

    PCollection<String> etl = lines.apply(
        MapElements.via((String line) -> line.toUpperCase())
            .withOutputType(TypeDescriptors.strings()));

I CANNOT TEACH HIM
THE BOY HAS NO PATIENCE

MapElements


I cannot teach him. The boy has no patience.
He will learn patience.

    PCollection<String> linecount = lines.apply(Regex.matches("I.*\\."));

I cannot teach him. The boy has no patience.

Regular expressions can be used to parse KVs

    PCollection<KV<String, String>> twoSentences =
        lines.apply(Regex.findKV("(.*)\\.(.*)", 1, 2));

<I cannot teach him, The boy has no patience>

Regex Transform
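To see why those two patterns select what they do, the same regular expressions can be tried with `java.util.regex` alone. The sketch below assumes the Beam transforms apply a whole-line match and a first-occurrence find to each element; the `RegexSketch` class and its helper methods are hypothetical names for illustration and are not part of Beam.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// java.util.regex only; assumes the Beam transforms apply these Matcher calls per element.
final class RegexSketch {

    // Whole-line match: returns the line if the entire line matches, otherwise null.
    static String matchWhole(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.matches() ? m.group(0) : null;
    }

    // First find: returns {key, value} taken from two capture groups, or null if nothing is found.
    static String[] findKeyValue(String regex, String line, int keyGroup, int valueGroup) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? new String[] {m.group(keyGroup), m.group(valueGroup)} : null;
    }

    public static void main(String[] args) {
        // The first line starts with "I" and ends with ".", so the whole line matches.
        System.out.println(matchWhole("I.*\\.", "I cannot teach him. The boy has no patience."));

        // The second line does not start with "I", so nothing is returned.
        System.out.println(matchWhole("I.*\\.", "He will learn patience.")); // prints null

        // Group 1 is the text before the period, group 2 the text after it.
        String[] kv = findKeyValue("(.*)\\.(.*)", "I cannot teach him. The boy has no patience", 1, 2);
        System.out.println("<" + kv[0] + "," + kv[1] + ">"); // prints <I cannot teach him, The boy has no patience>
    }
}
```

Because `(.*)` is greedy, group 1 runs up to the last period on the line. If the input still carries its trailing period, group 1 captures both sentences and group 2 is empty, so the split shown here depends on the line having only one period.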



    PCollection<String> pats = lines.apply(ParDo.of(new PatLinesFN()));

    static class PatLinesFN extends DoFn<String, String> {
      @ProcessElement
      public void processElement(DoFn<String, String>.ProcessContext context) throws Exception {
        // Split the line into words on single spaces.
        String[] pieces = context.element().split(" ");

        // Emit only the words that start with "pat".
        for (String piece : pieces) {
          if (piece.startsWith("pat")) {
            context.output(piece);
          }
        }
      }
    }

patience.
patience.

Example Custom DoFn


    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Regex;
    import org.apache.beam.sdk.transforms.ToString;

    public class PicoWordCount {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.Read.from("playing_cards.tsv"))
            .apply(Regex.split("\\W+"))
            .apply(Count.perElement())
            .apply(ToString.elements())
            .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
      }
    }

Playing Card Algorithm


What are other people doing with Beam?
http://tiny.jesse-anderson.com/beaminterview

Where is some sample Beam code?
http://tiny.jesse-anderson.com/beamtutorial

Main Beam site
https://beam.apache.org/

Convincing your boss
http://tiny.jesse-anderson.com/beam1
http://tiny.jesse-anderson.com/beam2

Next Steps



Current: Instructor, Thought Leader, Monkey Tamer

Previously: Curriculum Developer and Instructor @ Cloudera, Senior Software Engineer @ Intuit

Covered, Conferences and Published In: GigaOM, Ars Technica, Pragmatic Programmers, Strata, OSCON, Wall Street Journal, CNN, BBC, NPR

See Me On: http://www.jesse-anderson.com, @jessetanderson, http://tiny.bdi.io/linkedin, http://tiny.bdi.io/youtube

About Me


