Intro to hadoop tutorial
of 90
/90
-
Author
markgrover -
Category
Engineering
-
view
158 -
download
5
Embed Size (px)
Transcript of Intro to hadoop tutorial
- 1. 20102015Cloudera,Inc.AllRightsReserved Introduc=ontoApacheHadoop anditsEcosystem MarkGrover|BudapestDataForum June5th,2015 @mark_grover github.com/markgrover/hadoop-intro-fast Copyright2010-2014 Cloudera,Inc. Allrightsreserved.
- 2. 20102015Cloudera,Inc.AllRightsReserved Facili=es Ques=ons Schedule(=mesareapproximate) ScheduleandLogis=cs Time Event 9:0010:20 Tutorial 10:2010:30 Break 10:3012:00 Tutorial
- 3. 20102015Cloudera,Inc.AllRightsReserved AboutthePresenta=on Whatsahead FundamentalConcepts HDFS:TheHadoopDistributedFileSystem DataProcessingwithMapReduce TheHadoopEcosystem HadoopClusters:Past,Present,andFuture Conclusion+Q&A
- 4. 20102015Cloudera,Inc.AllRightsReserved FundamentalConcepts WhytheWorldNeedsHadoop
- 5. 20102015Cloudera,Inc.AllRightsReserved Volume Everyday Morethan1.5billionsharesaretradedontheNYSE Facebookstores2.7billioncommentsandLikes Everyminute Foursquarehandlesmorethan2,000check-ins TransUnionmakesnearly70,000updatestocreditles Everysecond Banksprocessmorethan10,000creditcardtransac=ons
- 6. 20102015Cloudera,Inc.AllRightsReserved Wearegenera=ngdatafasterthanever Processesareincreasinglyautomated Peopleareincreasinglyinterac=ngonline Systemsareincreasinglyinterconnected Velocity
- 7. 20102015Cloudera,Inc.AllRightsReserved Variety Wereproducingavarietyofdata,including Audio Video Images Logles Webpages Productra=ngs Socialnetworkconnec=ons Notallofthismapscleanlytotherela=onalmodel
- 8. 20102015Cloudera,Inc.AllRightsReserved BigDataCanMeanBigOpportunity Onetweetisananecdote Butamilliontweetsmaysignalimportanttrends Onepersonsproductreviewisanopinion Butamillionreviewsmightuncoveradesignaw Onepersonsdiagnosisisanisolatedcase Butamillionmedicalrecordscouldleadtoacure
- 9. 20102015Cloudera,Inc.AllRightsReserved WeNeedaSystemthatScales Toomuchdatafortradi=onaltools Twokeyproblems Howtoreliablystorethisdataatareasonablecost Howtoweprocessallthedatawevestored
- 10. 20102015Cloudera,Inc.AllRightsReserved WhatisApacheHadoop? Scalabledatastorageandprocessing Distributedandfault-tolerant Runsonstandardhardware Twomaincomponents Storage:HadoopDistributedFileSystem(HDFS) Processing:MapReduce Hadoopclustersarecomposedofcomputerscallednodes Clustersrangefromasinglenodeuptoseveralthousandnodes
- 11. 20102015Cloudera,Inc.AllRightsReserved HowDidApacheHadoopOriginate? HeavilyinuencedbyGooglesarchitecture Notably,theGoogleFilesystemandMapReducepapers OtherWebcompaniesquicklysawthebenets Earlyadop=onbyYahoo,Facebookandothers 2002 2003 2004 2005 2006 Google publishes MapReduce paper Nutch rewritten for MapReduce Hadoop becomes Lucene subproject Nutch spun off from Lucene Google publishes GFS paper
- 12. 20102015Cloudera,Inc.AllRightsReserved HowAreOrganiza=onsUsingHadoop? Letslookatafewcommonuses
- 13. 20102015Cloudera,Inc.AllRightsReserved HadoopUseCase:LogSessioniza=on February 12, 2014 10.174.57.241 - - [17/Feb/2014:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/nd/dualcore" "WebTV 1.2" "U=129" 1) Search for 'Widget' 2) Widget Results 3) View Details for Widget X Recent Activity for John Smith February 17, 2014 5) Track Order 6) Click 'Contact Us' Link 7) Submit Complaint 4) Order Widget X ... 10.174.57.241 - - [17/Feb/2014:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:59:47 -0500] "GET /conrm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129" 10.218.46.19 - - [17/Feb/2014:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)" 10.32.51.237 - - [17/Feb/2014:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)" 10.157.96.181 - - [17/Feb/2014:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622" ... Web Server Log Data Clickstream Data for User Sessions Process Logs with Hadoop
- 14. 20102015Cloudera,Inc.AllRightsReserved HadoopUseCase:CustomerAnaly=cs Example Inc. Public Web Site (February 9 - 15) Category Unique Visitors Page Views Bounce Rate Conversion RateAverage Time on Page Television 1,967,345 8,439,206 23% 51%17 seconds Smartphone 982,384 3,185,749 47% 41%23 seconds MP3 Player 671,820 2,174,913 61% 12%42 seconds Stereo 472,418 1,627,843 74% 19%26 seconds Monitor 327,018 1,241,837 56% 17%37 seconds Tablet 217,328 816,545 48% 28%53 seconds Printer 127,124 535,261 27% 64%34 seconds
- 15. 20102015Cloudera,Inc.AllRightsReserved HadoopUseCase:Sen=mentAnalysis 00 01 02 03 04 05 06 07 08 09 10 Negative Neutral Positive References to Product in Social Media (Hourly) 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000 Hour 11 12 13 14 15 16 17 18 19 20 21 22 23
- 16. 20102015Cloudera,Inc.AllRightsReserved HadoopUseCase:ProductRecommenda=ons Audio Adapter - Stereo Products Recommended for You $4.99 (341 ratings) Over-the-Ear Headphones $29.99 (1,672 ratings) 9-volt Alkaline Battery $1.79 (847 ratings)
- 17. 20102015Cloudera,Inc.AllRightsReserved GraceHopper,earlyadvocateofdistributedcompu=ng Inpioneerdaystheyusedoxenforheavy pulling,andwhenoneoxcouldntbudgealog, wedidnttrytogrowalargerox
- 18. 20102015Cloudera,Inc.AllRightsReserved ComparingHadooptoOtherSystems Monolithicsystemsdontscale Modernhigh-performancecompu=ngsystemsaredistributed Theyspreadcomputa=onsacrossmanymachinesinparallel Widely-usedusedforscien=capplica=ons LetsexaminehowatypicalHPCsystemworks
- 19. 20102015Cloudera,Inc.AllRightsReserved ArchitectureofaTypicalHPCSystem Storage System Compute Nodes Fast Network
- 20. 20102015Cloudera,Inc.AllRightsReserved ArchitectureofaTypicalHPCSystem Storage System Compute Nodes Step 1: Copy input data Fast Network
- 21. 20102015Cloudera,Inc.AllRightsReserved ArchitectureofaTypicalHPCSystem Storage System Compute Nodes Step 2: Process the data Fast Network
- 22. 20102015Cloudera,Inc.AllRightsReserved ArchitectureofaTypicalHPCSystem Storage System Compute Nodes Step 3: Copy output data Fast Network
- 23. 20102015Cloudera,Inc.AllRightsReserved YouDontJustNeedSpeed Theproblemisthatwehavewaymoredatathancode $ du -ks code/ 1,087 $ du ks data/ 854,632,947,314
- 24. 20102015Cloudera,Inc.AllRightsReserved YouNeedSpeedAtScale Storage System Compute Nodes Bottleneck
- 25. 20102015Cloudera,Inc.AllRightsReserved HadoopDesignFundamental:DataLocality ThisisahallmarkofHadoopsdesign Dontbringthedatatothecomputa=on Bringthecomputa=ontothedata Hadoopusesthesamemachinesforstorageandprocessing Signicantlyreducesneedtotransferdataacrossnetwork
- 26. 20102015Cloudera,Inc.AllRightsReserved OtherHadoopDesignFundamentals Machinefailureisunavoidableembraceit Buildreliabilityintothesystem Moreisusuallybeqerthanfaster Throughputmaqersmorethanlatency
- 27. 20102015Cloudera,Inc.AllRightsReserved TheHadoopDistributedFilesystem HDFS
- 28. 20102015Cloudera,Inc.AllRightsReserved HDFS:HadoopDistributedFileSystem InspiredbytheGoogleFileSystem Reliable,low-coststorageformassiveamountsofdata SimilartoaUNIXlesysteminsomeways Hierarchical UNIX-stylepaths(e.g.,/sales/alice.txt) UNIX-styleleownershipandpermissions
- 29. 20102015Cloudera,Inc.AllRightsReserved HDFS:HadoopDistributedFileSystem Therearealsosomemajordevia=onsfromUNIXlesystems Highly-op=mizedforprocessingdatawithMapReduce Designedforsequen=alaccesstolargeles Cannotmodifylecontentoncewriqen Itsactuallyauser-spaceJavaprocess AccessedusingspecialcommandsorAPIs Noconceptofacurrentworkingdirectory
- 30. 20102015Cloudera,Inc.AllRightsReserved HDFSBlocks FilesaddedtoHDFSaresplitintoxed-sizeblocks Blocksizeiscongurable,butdefaultsto64megabytes Block #1: First 64 MB Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. Block #4: Remaining 33 MB Block #2: Next 64 MB Block #3: Next 64 MB 225 MB File
- 31. 20102015Cloudera,Inc.AllRightsReserved WhyDoesHDFSUseSuchLargeBlocks? Current location of disk head Where the data you need is stored
- 32. 20102015Cloudera,Inc.AllRightsReserved HDFSReplica=on Eachblockisthenreplicatedacrossmul=plenodes Replica=onfactorisalsocongurable,butdefaultstothree Benetsofreplica=on Availability:dataisntlostwhenanodefails Reliability:HDFScomparesreplicasandxesdatacorrup=on Performance:allowsfordatalocality Letsseeanexampleofreplica=on
- 33. 20102015Cloudera,Inc.AllRightsReserved HDFSReplica=on(contd) Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. A B C Block #1 Block #2 Block #3 Block #4 E D
- 34. 20102015Cloudera,Inc.AllRightsReserved HDFSReplica=on(contd) C E Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. D B ABlock #1 Block #2 Block #3 Block #4 ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona
- 35. 20102015Cloudera,Inc.AllRightsReserved HDFSReplica=on(contd) Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona un mollit anim id est o laborum ame elita tu a magna omnibus et. Block #1 Block #2 Block #3 Block #4 A E B irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea D C
- 36. 20102015Cloudera,Inc.AllRightsReserved HDFSReplica=on(contd) Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea Block #1 Block #2 Block #3 Block #4 B C E A D un mollit anim id est o laborum ame elita tu a magna omnibus et.
- 37. 20102015Cloudera,Inc.AllRightsReserved AccessingHDFSviatheCommandLine UserstypicallyaccessHDFSviathehadoop fscommand Ac=onsspeciedwithsubcommands(prexedwithaminussign) MostaresimilartocorrespondingUNIXcommands $ hadoop fs -ls /user/tomwheeler $ hadoop fs -cat /customers.csv $ hadoop fs -rm /webdata/access.log $ hadoop fs -mkdir /reports/marketing
- 38. 20102015Cloudera,Inc.AllRightsReserved CopyingLocalDataToandFromHDFS RememberthatHDFSisdis=nctfromyourlocallesystem hadoop fs putcopieslocallestoHDFS hadoop fs getfetchesalocalcopyofalefromHDFS $ hadoop fs -put sales.txt /reports Hadoop Cluster Client Machine $ hadoop fs -get /reports/sales.txt
- 39. 20102015Cloudera,Inc.AllRightsReserved HDFSDaemons TherearetwodaemonprocessesinHDFS NameNode(master) Exactlyoneac=veNameNodepercluster Managesnamespaceandmetadata DataNode(slave) Manypercluster Performsblockstorageandretrieval
- 40. 20102015Cloudera,Inc.AllRightsReserved HowFileReadsWork client C D A B E Step 1: Get block locations from the NameNode
- 41. 20102015Cloudera,Inc.AllRightsReserved HowFileReadsWork(cont'd) client C D A B E Step 2: Read those blocks directly from the DataNodes
- 42. 20102015Cloudera,Inc.AllRightsReserved HDFSDemo Iwillnowdemonstratethefollowing 1. Howtolistthecontentsofadirectory 2. HowtocreateadirectoryinHDFS 3. HowtocopyalocalletoHDFS 4. HowtodisplaythecontentsofaleinHDFS 5. HowtoremovealefromHDFS TODO:provideVMand instruc=onsfordemo
- 43. 20102015Cloudera,Inc.AllRightsReserved AScalableDataProcessingFramework DataProcessingwithMapReduce
- 44. 20102015Cloudera,Inc.AllRightsReserved WhatisMapReduce? MapReduceisaprogrammingmodel Itsawayofprocessingdata YoucanimplementMapReduceinanylanguage
- 45. 20102015Cloudera,Inc.AllRightsReserved UnderstandingMapandReduce Yousupplytwofunc=onstoprocessdata:MapandReduce Map:typicallyusedtotransform,parse,orlterdata Reduce:typicallyusedtosummarizeresults TheMapfunc=onalwaysrunsrst TheReducefunc=onrunsaverwards,butisop=onal Eachpieceissimple,butcanbepowerfulwhencombined
- 46. 20102015Cloudera,Inc.AllRightsReserved MapReduceBenets Scalability Hadoopdividestheprocessingjobintoindividualtasks Tasksexecuteinparallel(independently)acrosscluster Simplicity Processesonerecordata=me Easeofuse Hadoopprovidesjobschedulingandotherinfrastructure Farsimplerfordevelopersthantypicaldistributedcompu=ng
- 47. 20102015Cloudera,Inc.AllRightsReserved MapReduceinHadoop MapReduceprocessinginHadoopisbatch-oriented AMapReducejobisbrokendownintosmallertasks Tasksrunconcurrently Eachprocessesasmallamountofoverallinput MapReducecodeforHadoopisusuallywriqeninJava ThisusesHadoopsAPIdirectly YoucandobasicMapReduceinotherlanguages UsingtheHadoopStreamingwrapperprogram SomeadvancedfeaturesrequireJavacode
- 48. 20102015Cloudera,Inc.AllRightsReserved MapReduceExampleinPython ThefollowingexampleusesPython ViaHadoopStreaming Itprocessesloglesandsummarizeseventsbytype Illexplainboththedataowandthecode
- 49. 20102015Cloudera,Inc.AllRightsReserved JobInput Heresthejobinput Eachmaptaskgetsachunkofthisdatatoprocess TypicallycorrespondstoasingleblockinHDFS 2013-06-29 22:16:49.391 CDT INFO "This can wait" 2013-06-29 22:16:52.143 CDT INFO "Blah blah blah" 2013-06-29 22:16:54.276 CDT WARN "This seems bad" 2013-06-29 22:16:57.471 CDT INFO "More blather" 2013-06-29 22:17:01.290 CDT WARN "Not looking good" 2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant" 2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
- 50. 20102015Cloudera,Inc.AllRightsReserved #!/usr/bin/env python import sys levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'] for line in sys.stdin: fields = line.split() level = fields[3].upper() if level in levels: print "%st1" % level 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PythonCodeforMapFunc=on Ifitmatchesaknownlevel,print it,atabseparator,andtheliteral value1(sincethelevelcanonly occuronceperline) Readrecordsfromstandardinput. Usewhitespacetosplitintoelds. Denelistofknownloglevels Extractleveleldandconvertto uppercaseforconsistency.
- 51. 20102015Cloudera,Inc.AllRightsReserved OutputofMapFunc=on Themapfunc=onproduceskey/valuepairsasoutput INFO 1 INFO 1 WARN 1 INFO 1 WARN 1 INFO 1 ERROR 1
- 52. 20102015Cloudera,Inc.AllRightsReserved TheShueandSort Hadoopautoma9callymerges,sorts,andgroupsmapoutput Theresultispassedasinputtothereducefunc=on Moreonthislater INFO 1 INFO 1 WARN 1 INFO 1 WARN 1 INFO 1 ERROR 1 ERROR 1 INFO 1 INFO 1 INFO 1 INFO 1 WARN 1 WARN 1 ShueandSort MapOutput ReduceInput
- 53. 20102015Cloudera,Inc.AllRightsReserved InputtoReduceFunc=on Reducefunc=onreceivesakeyandallvaluesforthatkey Keysarealwayspassedtoreducersinsortedorder Althoughnotobvioushere,valuesareunordered ERROR 1 INFO 1 INFO 1 INFO 1 INFO 1 WARN 1 WARN 1
- 54. 20102015Cloudera,Inc.AllRightsReserved PythonCodeforReduceFunc=on #!/usr/bin/env python import sys previous_key = None sum = 0 for line in sys.stdin: key, value = line.split() if key == previous_key: sum = sum + int(value) # continued on next slide 1 2 3 4 5 6 7 8 9 10 11 12 13 Ini=alizeloopvariables Extractthekeyandvalue passedviastandardinput Ifkeyunchanged, incrementthecount
- 55. 20102015Cloudera,Inc.AllRightsReserved PythonCodeforReduceFunc=on # continued from previous slide else: if previous_key: print '%st%i' % (previous_key, sum) previous_key = key sum = 1 print '%st%i' % (previous_key, sum) 14 15 16 17 18 19 20 21 22 Printdataforthenal key Ifkeychanged, printdataforoldlevel Starttrackingdatafor thenewrecord
- 56. 20102015Cloudera,Inc.AllRightsReserved OutputofReduceFunc=on Itsoutputisasumforeachlevel ERROR 1 INFO 4 WARN 2
- 57. 20102015Cloudera,Inc.AllRightsReserved RecapofDataFlow ERROR 1 INFO 4 WARN 2 INFO 1 INFO 1 WARN 1 INFO 1 WARN 1 INFO 1 ERROR 1 ERROR 1 INFO 1 INFO 1 INFO 1 INFO 1 WARN 1 WARN 1 Mapinput Mapoutput Reduceinput Reduceoutput 2013-06-29 22:16:49.391 CDT INFO "This can wait" 2013-06-29 22:16:52.143 CDT INFO "Blah blah blah" 2013-06-29 22:16:54.276 CDT WARN "This seems bad" 2013-06-29 22:16:57.471 CDT INFO "More blather" 2013-06-29 22:17:01.290 CDT WARN "Not looking good" 2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant" 2013-06-29 22:17:05.362 CDT ERROR "Out of memory!" Shue andsort
- 58. 20102015Cloudera,Inc.AllRightsReserved HowtoRunaHadoopStreamingJob Illdemonstratethisnow TODO:provideVMand instruc=onsfordemo
- 59. 20102015Cloudera,Inc.AllRightsReserved MapReduceDaemons TherearetwodaemonprocessesinMapReduce JobTracker(master) Exactlyoneac=veJobTrackerpercluster Acceptsjobsfromclient Schedulesandmonitorstasksonslavenodes Reassignstasksincaseoffailure TaskTracker(slave) Manypercluster Performstheshueandsort Executesmapandreducetasks
- 60. 20102015Cloudera,Inc.AllRightsReserved OpenSourceToolsthatComplementHadoop TheHadoopEcosystem
- 61. 20102015Cloudera,Inc.AllRightsReserved TheHadoopEcosystem "CoreHadoop"consistsofHDFSandMapReduce Thesearethekernelofamuchbroaderplazorm Hadoophasmanyrelatedprojects SomehelpyouintegrateHadoopwithothersystems Othershelpyouanalyzeyourdata ThesearenotconsideredcoreHadoop Rather,theyrepartoftheHadoopecosystem ManyarealsoopensourceApacheprojects
- 62. 20102015Cloudera,Inc.AllRightsReserved ApacheSqoop SqoopexchangesdatabetweenRDBMSandHadoop Canimporten=reDB,asingletable,oratablesubsetintoHDFS DoesthisveryecientlyviaaMap-onlyMapReducejob CanalsoexportdatafromHDFSbacktothedatabase Database Hadoop Cluster
- 63. 20102015Cloudera,Inc.AllRightsReserved ApacheFlume FlumeimportsdataintoHDFSasitisbeinggeneratedbyvarioussources Hadoop Cluster Program Output UNIX syslog Log Files Custom Sources And many more...
- 64. 20102015Cloudera,Inc.AllRightsReserved ApachePig Pigoershigh-leveldataprocessingonHadoop Analterna=vetowri=nglow-levelMapReducecode PigturnsthisintoMapReducejobsthatrunonHadoop people = LOAD '/data/customers' AS (cust_id, name); orders = LOAD '/data/orders' AS (ord_id, cust_id, cost); groups = GROUP orders BY cust_id; totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t; result = JOIN totals BY group, people BY cust_id; DUMP result;
- 65. 20102015Cloudera,Inc.AllRightsReserved ApacheHive Hiveisanotherabstrac=onontopofMapReduce LikePig,italsoreducesdevelopment=me HiveusesaSQL-likelanguagecalledHiveQL SELECT customers.cust_id, SUM(cost) AS total FROM customers JOIN orders ON customers.cust_id = orders.cust_id GROUP BY customers.cust_id ORDER BY total DESC;
- 66. 20102015Cloudera,Inc.AllRightsReserved ApacheMahout Mahoutisascalablemachinelearninglibrary Supportforseveralcategoriesofproblems Classica=on Clustering Collabora=veltering Frequentitemsetmining ManyalgorithmsimplementedinMapReduce CanparallelizeinexpensivelywithHadoop
- 67. 20102015Cloudera,Inc.AllRightsReserved ApacheHBase HBaseisaNoSQLdatabasebuiltontopofHadoop Canstoremassiveamountsofdata Gigabytes,terabytes,andevenpetabytesofdatainatable Tablescanhavemanythousandsofcolumns Scalestoprovideveryhighwritethroughput Hundredsofthousandsofinsertspersecond
- 68. 20102015Cloudera,Inc.AllRightsReserved ClouderaImpala MassivelyparallelSQLenginewhichrunsonaHadoopcluster InspiredbyGooglesDremelproject CanquerydatastoredinHDFSorHBasetables Highperformance Typically>10=mesfasterthanPigorHive Querysyntaxvirtuallyiden=caltoHive/SQL Impalais100%opensource(Apache-licensed)
- 69. 20102015Cloudera,Inc.AllRightsReserved QueryingDatawithHiveandImpala Illdemonstratethisnow TODO:provideVMand instruc=onsfordemo
- 70. 20102015Cloudera,Inc.AllRightsReserved VisualOverviewofaCompleteWorkow Import Transaction Data from RDBMSSessionize Web Log Data with Pig Analyst uses Impala for business intelligence Sentiment Analysis on Social Media with Hive Hadoop Cluster with Impala Generate Nightly Reports using Pig, Hive, or Impala Build product recommendations for Web site
- 71. 20102015Cloudera,Inc.AllRightsReserved HadoopDistribu=ons ConceptuallysimilartoaLinuxdistribu=on Thesedistribu=onsincludeastableversionofHadoop PlusecosystemtoolslikeFlume,Sqoop,Pig,Hive,Impala,etc. Benetsofusingadistribu=on Integra=ontes=nghelpsensurealltoolsworktogether Easyinstalla=onandupdates Compa=bilitycer=ca=onfromhardwarevendors Commercialsupport ApacheBigtopupstreamdistribu=onformanycommercial distribu=ons
- 72. 20102015Cloudera,Inc.AllRightsReserved ClouderasDistribu=on(CDH) ClouderasDistribu=onincludingApacheHadoop(CDH) Themostwidelyuseddistribu=onofHadoop Stable,provenandsupportedenvironment CombinesHadoopwithmanyimportantecosystemtools IncludingallthoseIvemen=oned Howmuchdoesitcost? Itscompletelyfree Apachelicensedits100%opensourcetoo
- 73. 20102015Cloudera,Inc.AllRightsReserved Past,Present,andFuture HadoopClusters
- 74. 20102015Cloudera,Inc.AllRightsReserved HadoopClusterOverview Aclusterismadeupofnodes Anodeissimplya(typicallyrackmount)server Theremaybeafeworafewthousandnodes Mostareslavenodes,butafewaremasternodes Everynodeisresponsibleforbothstorageandprocessing Nodesareconnectedtogetherbynetworkswitches SlavenodesdonotuseRAID Blocksplingandreplica=onisbuiltintoHDFS Nearlyallproduc=onclustersrunLinux
- 75. 20102015Cloudera,Inc.AllRightsReserved AnatomyofaSmallHadoopCluster Typicallyconsistsof industry-standard rackmountedservers JobTrackerandNameNode mightrunonsameserver TaskTrackerandDataNode arealwaysco-locatedon eachslavenodefordata locality Slave Nodes Master Node JobTracker NameNode TaskTracker DataNode Network switch
- 76. 20102015Cloudera,Inc.AllRightsReserved AnatomyofaLargeCluster "core" network switch connected to each top-of-rack switch 421 3 1. Master (active NameNode) 2. Master (standby NameNode) 3. Master (active JobTracker) 4. Master (standby JobTracker)
- 77. 20102015Cloudera,Inc.AllRightsReserved Build,Buy,orUsetheCloud? ThereareseveralwaystorunHadoopatscale Buildyourowncluster Buyapre-conguredclusterfromhardwarevendor RunHadoopinthecloud Privatecloud:virtualizedhardwareinyourdatacenter Publiccloud:onaservicelikeAmazonEC2 Letscoverthepros,cons,andconcerns
- 78. 20102015Cloudera,Inc.AllRightsReserved BuildversusBuy Prosforbuildingyourown Selectwhichevercomponentsyoulike Canreusecomponentsyoualreadyhave Avoidrelianceonasinglevendor Prosforbuyingpre-conguredcluster Vendortestsallthecomponentstogether(cer=ca=on) Avoidsblametheothervendorduringsupportcalls Mayactuallybelessexpensiveduetoeconomiesofscale
- 79. 20102015Cloudera,Inc.AllRightsReserved HostintheCloud? Goodchoicewhenneedforclusterissporadic Andamountofdatastored/transferredisrela=velylow YourOwnCluster HostedintheCloud HardwareCost X StangCost X Power/HVAC X StorageCost X BandwidthCost X Performance X
- 80. 20102015Cloudera,Inc.AllRightsReserved Hadoop'sCurrentDataProcessingArchitecture Hadoop'sfacili=esfordata processingarebasedonthe MapReduceframework Workswell,buttherearetwo importantlimita=ons Onlyoneac=veJobTracker percluster(scalability) MapReduceisnotanideal tforallprocessingneeds (exibility) JobTracker TaskTracker
- 81. 20102015Cloudera,Inc.AllRightsReserved YARN:NextGenera=onProcessingArchitecture YARNgeneralizesschedulingandresourcealloca=on Improvesscalability Namesandrolesofdaemonshavechanged Manyresponsibili=espreviouslyassociatedwithJobTrackerare nowdelegatedtoslavenodes Improvesexibility AllowsprocessingframeworksotherthanMapReduce Example:ApacheGiraph(graphprocessingframework) Maintainsbackwardscompa=bility MapReduceisalsosupportedbyYARN Requiresnochangetoexis=ngMapReducejobs
- 82. 20102015Cloudera,Inc.AllRightsReserved HadoopFileFormats Hadoopsupportsanumberofinput/outputformats,including Free-formtext Delimitedtext Severalspecializedformatsforecientstorage Sequenceles Avro RCFile Parquet Alsopossibletoaddsupportforcustomformats
- 83. 20102015Cloudera,Inc.AllRightsReserved TypicalDatasetExample Mostoftheseformatsstoredataasrowsofelds Eachrowcontainsalleldsforasinglerecord Addi=onallescontainaddi=onalrecordsinthesameformat 2014-02-11 22:16:49 Alice Cable 19.23 2014-02-11 22:17:52 Bob DVD 28.78 2014-02-11 22:17:54 Alice Keyboard 36.99 2014-02-12 22:16:57 Alice Adapter 19.23 2014-02-12 22:17:01 Bob Cable 28.78 2014-02-12 22:17:03 Alice Mouse 36.99 2014-02-12 22:17:05 Chuck Antenna 24.99 File #1 File #2 date time buyer item price
- 84. 20102015Cloudera,Inc.AllRightsReserved TheParquetFileFormat Parquetisanewhigh-performanceleformat OriginallydevelopedbyengineersfromClouderaandTwiqer Opensource,withanac=vedevelopercommunity Canstoreeachcolumninitsownle Allowsformuchbeqercompressionduetosimilarvalues ReducesI/Owhenonlyasubsetofcolumnsareneeded 2014-02-11 2014-02-11 2014-02-11 2014-02-12 2014-02-12 2014-02-12 2014-02-12 22:16:49 22:16:52 22:16:54 22:16:57 22:17:01 22:17:03 22:17:05 Alice Bob Alice Alice Bob Alice Chuck 19.23 28.78 36.99 19.23 28.78 36.99 24.99 File #1 (date) File #2 (time) File #3 (buyer) File #5 (price) Cable DVD Keyboard Adapter Cable Mouse Antenna File #4 (item)
- 85. 20102015Cloudera,Inc.AllRightsReserved Conclusion
- 86. 20102015Cloudera,Inc.AllRightsReserved KeyPoints Weregenera=ngmassivevolumesofdata Thisdatacanbeextremelyvaluable Companiescannowanalyzewhattheypreviouslydiscarded Hadoopsupportslarge-scaledatastorageandprocessing HeavilyinuencedbyGoogle'sarchitecture Alreadyinproduc=onbythousandsoforganiza=ons HDFSisHadoop'sstoragelayer MapReduceisHadoop'sprocessingframework ManyecosystemprojectscomplementHadoop SomehelpyoutointegrateHadoopwithexis=ngsystems Othershelpyouanalyzethedatayouvestored
- 87. 20102015Cloudera,Inc.AllRightsReserved HighlyRecommendedBooks Author:TomWhite ISBN:1-449-31152-0 Author:EricSammer ISBN:1-449-32705-2
- 88. 20102015Cloudera,Inc.AllRightsReserved Mybook @hadooparchbook hadooparchitecturebook.com
- 89. 20102015Cloudera,Inc.AllRightsReserved Helpscompaniesprotfromtheirdata FoundedbyexpertsfromFacebook,Google,Oracle,andYahoo Weoerproductsandservicesforlarge-scaledataanalysis Sovware(CDHdistribu=onandClouderaManager) Consul=ngandsupportservices Trainingandcer=ca=on Ac=vedevelopersofopensourceBigDatasovware StaincludescommiqerstoeverysingleprojectIllcovertoday AboutCloudera
- 90. 20102015Cloudera,Inc.AllRightsReserved Ques=ons? Thankyouforaqending! Illbehappytoansweranyaddi=onalques=onsnow Wanttolearnevenmore? Clouderatraining:developers,analysts,sysadmins,andmore Oeredinmorethan50ci=esworldwide,andonlinetoo! Seehqp://university.cloudera.com/formoreinfo Demoandslidesatgithub.com/markgrover/hadoop-intro-fast Twiqer:mark_grover Surveypage:=ny.cloudera.com/mark