Big Data with Java - konferenciak.advalorem.hu/uploads/files/Oracle Java...
Big Data with Java
Marton Elek
March 2017
© Hortonworks Inc. 2011–2017. All Rights Reserved
Hortonworks Data Platform
• Collection of fully open source Apache projects
Hadoop at Scale
• Yahoo – 34000 nodes, 478 PB
• eBay – 10000 nodes, 150 PB
• LinkedIn – 5000 nodes
• Twitter – 3500 nodes, 30 to 50 PB
• Spotify – 700 nodes, 15 PB of data
• Facebook – Thousands
Apache Hadoop
Collection of multiple subprojects:
• HDFS
  – Distributed file system
• YARN
  – Distributed processing framework and cluster management
• MAPREDUCE
  – MapReduce framework for writing calculations in a distributed environment
Apache HDFS – Hadoop Distributed File System
• Very large scale distributed file system
  – 10K nodes, tens of millions of files and petabytes of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them automatically
• Optimized for batch processing
  – Data locations are exposed so that computations can move to where the data resides
• Data coherency
  – Write-once, read-many-times access pattern
  – Appending is supported for existing files
• Files are broken up into chunks called 'blocks'
  – Blocks are distributed over nodes
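The block and replication bullets above can be made concrete with a little arithmetic. The sketch below is plain Java, not the HDFS API. The 128 MB block size, the replication factor of 3 and the round-robin placement are assumptions chosen for illustration (real HDFS placement is rack-aware), and `HdfsBlockSketch`, `splitIntoBlocks` and `placeReplicas` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: how a file is cut into fixed-size blocks and spread over nodes. */
class HdfsBlockSketch {

    /** Returns the size in bytes of each block for a file of the given length. */
    static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The last block may be shorter than the block size.
            blocks.add(Math.min(blockSize, fileSize - offset));
        }
        return blocks;
    }

    /** Picks distinct node indexes for one block, round-robin (a hypothetical policy). */
    static List<Integer> placeReplicas(int blockIndex, int nodeCount, int replication) {
        List<Integer> nodes = new ArrayList<>();
        int copies = Math.min(replication, nodeCount); // cannot have more copies than nodes
        for (int r = 0; r < copies; r++) {
            nodes.add((blockIndex + r) % nodeCount);
        }
        return nodes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed values: 300 MB file, 128 MB blocks, 5 nodes, 3 replicas per block.
        List<Long> blocks = splitIntoBlocks(300 * mb, 128 * mb);
        for (int i = 0; i < blocks.size(); i++) {
            System.out.println("block " + i + ": " + blocks.get(i) / mb
                    + " MB on nodes " + placeReplicas(i, 5, 3));
        }
    }
}
```

Under these assumptions a 300 MB file yields three blocks of 128, 128 and 44 MB, and losing any single node still leaves two copies of every block.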
HDFS Architecture (Master-Slave)
HDFS: Key Services
• NameNode
  – Master service
  – Manages the file system namespace
  – Single service across the cluster (HA can be enabled)
  – Regulates access to files by clients
  – Maps a file name to a set of blocks
  – Maps a block to the DataNode where it resides
  – Replication engine for blocks
• DataNode
  – Slave service. Runs on slave nodes
  – Block server
  – Manages block read/write for HDFS, stores data in the local file system
  – Periodically sends a report of all existing blocks to the NameNode
  – Pings the NameNode for instructions
  – If the heartbeat fails, the DataNode is removed from the cluster and replicated blocks take over
• Standby NameNode
  – Merges the NameNode's file system image and edit logs
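The two NameNode mappings and the heartbeat rule can be modelled in a few lines. This is a minimal in-memory sketch, not Hadoop code: `NameNodeSketch` and its methods are hypothetical names, and the target replication passed to `removeDataNode` is an assumed parameter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative in-memory model of NameNode bookkeeping; not Hadoop code. */
class NameNodeSketch {
    // File name -> ordered list of block ids.
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    // Block id -> DataNodes currently holding a replica.
    private final Map<String, Set<String>> blockToNodes = new HashMap<>();

    void addFile(String file, List<String> blockIds) {
        fileToBlocks.put(file, new ArrayList<>(blockIds));
    }

    /** A DataNode's periodic block report: every block it stores. */
    void blockReport(String dataNode, List<String> blockIds) {
        for (String id : blockIds) {
            blockToNodes.computeIfAbsent(id, k -> new HashSet<>()).add(dataNode);
        }
    }

    /** Heartbeats stopped: forget the node and return blocks now below the target. */
    List<String> removeDataNode(String dataNode, int targetReplication) {
        List<String> underReplicated = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockToNodes.entrySet()) {
            boolean hadReplica = e.getValue().remove(dataNode);
            if (hadReplica && e.getValue().size() < targetReplication) {
                underReplicated.add(e.getKey());
            }
        }
        return underReplicated;
    }

    /** Resolves a file to the nodes that can serve each of its blocks, in block order. */
    List<Set<String>> locate(String file) {
        List<Set<String>> result = new ArrayList<>();
        for (String id : fileToBlocks.getOrDefault(file, new ArrayList<>())) {
            result.add(blockToNodes.getOrDefault(id, new HashSet<>()));
        }
        return result;
    }
}
```

The blocks returned by `removeDataNode` are the ones a replication engine would need to copy to another node.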
Cluster Topology
[Figure: cluster topology diagram. Labels: HDFS Client; Master Services (NameNode, ResourceManager, HBase Master, etc.); Slave Services (DataNode, NodeManager, RegionServer); three racks containing NameNode, Secondary NameNode, other master services and six DataNodes.]
YARN
• How to execute any job on multiple machines?
  – Cluster management
  – Distributed processing framework
  – Goal: execute an application on multiple machines
    • Manage available resources (CPU, memory)
    • Use different scheduling algorithms (CapacityScheduler, FairScheduler)
• Components
  – Resource manager (1, 2… instances):
    • Manages the application requests, schedules applications, …
  – Node manager (∞ instances):
    • Executes the scheduled application
http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html
Transition from Hadoop 1 to Hadoop 2
HADOOP 1.0: Single Use System (Batch Apps)
  – MapReduce (cluster resource management & data processing)
  – HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
  – MapReduce (data processing), Others (data processing)
  – YARN (cluster resource management)
  – HDFS2 (redundant, reliable storage)
YARN Architecture
• Cluster Operating System
• Enables Generic Data Processing Tasks with 'Containers'
• Big Compute (Metal Detectors) for Big Data (Hay Stack)
• Resource Manager
  – Global resource scheduler
• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers & resource monitoring
• Application Master
  – Per-application master that manages application scheduling and task execution
  – E.g. MapReduce Application Master
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
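To illustrate the container as the unit of allocation, here is a toy model of a resource manager granting containers out of per-node capacity. It is a sketch under stated assumptions, not the YARN API: the first-fit policy is a stand-in for the real schedulers, only memory and vcores are tracked, and `YarnAllocationSketch`, `Node`, `Container` and `allocate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a resource manager handing out containers; not the YARN API. */
class YarnAllocationSketch {

    /** Free capacity tracked for one node manager. */
    static class Node {
        final String name;
        int freeMemoryMb;
        int freeVcores;

        Node(String name, int memoryMb, int vcores) {
            this.name = name;
            this.freeMemoryMb = memoryMb;
            this.freeVcores = vcores;
        }
    }

    /** A granted slice of one node's resources. */
    static class Container {
        final String node;
        final int memoryMb;
        final int vcores;

        Container(String node, int memoryMb, int vcores) {
            this.node = node;
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    /** First-fit: grant the request on the first node with enough memory and cores. */
    static Container allocate(List<Node> nodes, int memoryMb, int vcores) {
        for (Node n : nodes) {
            if (n.freeMemoryMb >= memoryMb && n.freeVcores >= vcores) {
                n.freeMemoryMb -= memoryMb;
                n.freeVcores -= vcores;
                return new Container(n.name, memoryMb, vcores);
            }
        }
        return null; // nothing fits; a real scheduler would queue the request
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("node-1", 4096, 4));
        nodes.add(new Node("node-2", 8192, 8));
        Container c = allocate(nodes, 3072, 2);
        System.out.println("granted " + c.memoryMb + " MB / " + c.vcores + " vcores on " + c.node);
    }
}
```

An application master would request several such containers and run one task in each.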
YARN: what is it good for?
• Compute for Data Processing
• Compute for Embarrassingly Parallel Problems
  – Problems with tiny datasets and/or that don't depend on one another
  – e.g. Exhaustive Search, Trade Simulations, Climate Models, Genetic Algorithms
• Beyond MapReduce
  – Enables Multi Workload Compute Applications on a Single Shared Infrastructure
  – Stream Processing, NoSQL, Search, In-Memory, Graphs, etc.
  – ANYTHING YOU CAN START FROM CLI!
• Slider & Code Reuse
  – Run existing applications on YARN: HBase on YARN, Storm on YARN
  – Reuse existing Java code in Containers, making serial applications parallel
Hadoop MapReduce
• "Hadoop MapReduce is a software framework for easily writing applications which
  – process vast amounts of data (multi-terabyte data-sets)
  – in-parallel
  – on large clusters (thousands of nodes) of commodity hardware
  – in a reliable, fault-tolerant manner."
• Connection: MapReduce jobs are
  – Scheduled on YARN
  – Using data from HDFS
• A MapReduce job usually
  – splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
  – The framework sorts the outputs of the maps,
  – which are then input to the reduce tasks.
  – Typically both the input and the output of the job are stored in a file-system. (HDFS input/output format)
  – The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
MapReduce example – word count
• Raw data
  – Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley….
• Map
  – Lorem: 1
  – Ipsum: 1
  – is: 1
  – simply:
• Shuffle
  – Lorem: [1]
  – is: [1,1,1]
• Reduce
  – Lorem: 1
  – is: 3
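import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: emits (word, 1) for every whitespace-separated token of an input line.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text(); // reused for every token to avoid allocations

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}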
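import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: sums the counts that the shuffle grouped under one word.
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}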
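The map, shuffle and reduce steps of the word count can also be traced in a single process, which makes the intermediate data visible. The sketch below uses only the Java standard library and does not involve Hadoop. `WordCountPhasesSketch` and its three methods are hypothetical names, and the in-memory `TreeMap` is an assumed stand-in for the framework's distributed sort and grouping.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process trace of map, shuffle and reduce for word count; no Hadoop involved. */
class WordCountPhasesSketch {

    /** Map: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group the emitted values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Lorem Ipsum is simply dummy text";
        System.out.println(reduce(shuffle(map(text))));
    }
}
```

On a cluster the same three steps run on different machines: map tasks read HDFS blocks, the framework moves and sorts the pairs, and reduce tasks write the totals back to HDFS.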
Apache Spark
• "Apache Spark is a fast and general engine for large-scale data processing."
• Same abstraction for streaming/batch processing (+ machine learning, graph processing)
• Multi-language support:
  – Scala
  – Python
  – R
• Functional and SQL interfaces
• Supports multiple execution engines
  – YARN
  – Standalone Spark cluster
• In-memory cache between the stages
Spark examples
Spark UI
Apache Kafka
• "Distributed streaming platform"
• Publish/subscribe to streams of records
• Store streams in a fault-tolerant way
• Kafka Connect: API to easily create applications that stream to/from Kafka
• Kafka Streams: API to do stream processing
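The publish/subscribe model over a stored stream can be pictured as an append-only log that each subscriber reads at its own position. The following is a toy, single-process sketch and not the Kafka client API: `StreamLogSketch`, `publish` and `poll` are hypothetical names, and partitioning, replication and fault tolerance are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only log with per-subscriber offsets; not the Kafka client API. */
class StreamLogSketch {
    // Records are kept after being read, so every subscriber can see the whole stream.
    private final List<String> records = new ArrayList<>();
    // Subscriber name -> index of the next record it has not read yet.
    private final Map<String, Integer> offsets = new HashMap<>();

    /** Publish: append a record and return its offset in the log. */
    int publish(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Subscribe side: return every record the named consumer has not yet seen. */
    List<String> poll(String consumer) {
        int from = offsets.getOrDefault(consumer, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        offsets.put(consumer, records.size()); // advance this consumer only
        return batch;
    }

    public static void main(String[] args) {
        StreamLogSketch log = new StreamLogSketch();
        log.publish("click:home");
        log.publish("click:cart");
        System.out.println("analytics reads " + log.poll("analytics"));
        System.out.println("billing reads " + log.poll("billing"));
    }
}
```

Because reading does not remove records, a subscriber that joins late still receives everything published so far.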
Hortonworks Data Platform: What's more
• What's more?
• Key-value store for fast key-based access
  – HBase, Phoenix (SQL interface)
• Security and governance
  – Knox, Atlas, Ranger
• Management
  – Ambari, Cloudbreak
• Streaming
  – Storm, Flume, (Spark, Kafka)
Thank You