Copyright © 2010-2014 Cloudera, Inc. All rights reserved. Not to be reproduced without prior written consent.
Cloudera Custom Training Hands-On Exercises
General Notes
Hands-On Exercise: Data Ingest With Hadoop Tools
Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue
Hands-On Exercise: Data Management
Hands-On Exercise: Relational Analysis
Hands-On Exercise: Working with Impala
Hands-On Exercise: Analyzing Text and Complex Data With Hive
Hands-On Exercise: Data Transformation with Hive
Hands-On Exercise: View the Spark Documentation
Hands-On Exercise: Use the Spark Shell
Hands-On Exercise: Use RDDs to Transform a Dataset
Hands-On Exercise: Process Data Files with Spark
Hands-On Exercise: Use Pair RDDs to Join Two Datasets
Hands-On Exercise: Write and Run a Spark Application
Hands-On Exercise: Configure a Spark Application
Hands-On Exercise: View Jobs and Stages in the Spark Application UI
Hands-On Exercise: Persist an RDD
Hands-On Exercise: Implement an Iterative Algorithm
Hands-On Exercise: Use Broadcast Variables
Hands-On Exercise: Use Accumulators
Hands-On Exercise: Use Spark SQL for ETL
Appendix A: Enabling iPython Notebook
Data Model Reference
Regular Expression Reference
General Notes

Cloudera's training courses use a virtual machine (VM) with a recent version of CDH already installed and configured for you. The VM runs in pseudo-distributed mode, a configuration that enables a Hadoop cluster to run on a single machine.
Points to Note While Working in the VM
1. The VM is set to automatically log in as the user training. If you log out, you can log back in as the user training with the password training. The root password is also training, though you can prefix any command with sudo to run it as root.
2. Exercises often contain steps with commands that look like this:
$ hdfs dfs -put accounting_reports_taxyear_2013 \
/user/training/tax_analysis/
The $ symbol represents the command prompt. Do not include this character when copying and pasting commands into your terminal window. Also, the backslash (\) signifies that the command continues on the next line. You may either enter the code as shown (on two lines), or omit the backslash and type the command on a single line.
Some commands are to be executed in the Python or Scala Spark shells; those are color coded and shown with pyspark> (blue) or scala> (red) prompts, respectively. Linux command steps that apply to only one language or the other are also color coded, but still preceded with the $ prompt.
3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:
$ gedit myfile.txt &
Class-Specific VM Customization
Your VM is used in several of Cloudera's training classes. This particular class does not require some of the services that start by default, while other services that do not start by default are required for this class. Before starting the course exercises, run the course setup script:
$ ~/scripts/analyst/training_setup_da.sh
You may safely ignore any messages about services that have already been started or shut down. You only need to run this script once.
Points to Note During the Exercises
Sample Solutions
If you need a hint or want to check your work, the sample_solution subdirectory within each exercise directory contains complete code samples.
Catch-up Script
If you are unable to complete an exercise, we have provided a script to catch you up automatically. Each exercise has instructions for running the catch-up script.
$ADIR Environment Variable
$ADIR is a shortcut that points to the /home/training/training_materials/analyst directory, which contains the code and data you will use in the exercises.
Fewer Step-by-Step Instructions as You Work Through These Exercises
As the exercises progress, and you gain more familiarity with the tools you're using, we provide fewer step-by-step instructions. You should feel free to ask your instructor for assistance at any time, or to consult with your fellow students.
Bonus Exercises
Many of the exercises contain one or more optional "bonus" sections. We encourage you to work through these if time remains after you finish the main exercise and would like an additional challenge to practice what you have learned.
Hands-On Exercise: Data Ingest With Hadoop Tools

In this exercise you will practice using the Hadoop command line utility to interact with Hadoop's Distributed Filesystem (HDFS) and use Sqoop to import tables from a relational database to HDFS.
To begin, you must launch the Data Analyst VM.
Be sure you have run the setup script as described in the General Notes section above. If you have not run it yet, do so now:
$ ~/scripts/analyst/training_setup_da.sh
Step 1: Exploring HDFS using the Hue File Browser
1. Start the Firefox Web browser on your VM by clicking on the icon in the system toolbar:
2. In Firefox, click on the Hue bookmark in the bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key.)
3. After a few seconds, you should see Hue's home screen. The first time you log in, you will be prompted to create a new username and password. Enter training in both the username and password fields, and then click the "Sign In" button.
4. Whenever you log in to Hue, a Tips popup will appear. To stop it appearing in the future, check the Do not show this dialog again option before dismissing the popup.
5. Click File Browser in the Hue toolbar. Your HDFS home directory (/user/training) displays. (Since your user ID on the cluster is training, your home directory in HDFS is /user/training.) The directory contains no files or directories yet.
6. Create a temporary sub-directory: select the +New menu and click Directory.
7. Enter directory name test and click the Create button. Your home directory now contains a directory called test.
8. Click on test to view the contents of that directory; currently it contains no files or subdirectories.
9. Upload a file to the directory by selecting Upload → Files.
10. Click Select Files to bring up a file browser. By default, the /home/training/Desktop folder displays. Click the home directory button (training), then navigate to the course data directory: training_materials/analyst/data.
11. Choose any of the data files in that directory and click the Open button.
12. The file you selected will be loaded into the current HDFS directory. Click the file name to see the file's contents. Because HDFS is designed to store very large files, Hue will not display the entire file, just the first page of data. You can click the arrow buttons or use the scrollbar to see more of the data.
13. Return to the test directory by clicking View file location in the left-hand panel.
14. Above the list of files in your current directory is the full path of the directory you are currently displaying. You can click on any directory in the path, or on the first slash (/) to go to the top level (root) directory. Click training to return to your home directory.
15. Delete the temporary test directory you created, including the file in it, by selecting the checkbox next to the directory name, then clicking the Move to trash button. (Confirm that you want to delete by clicking Yes.)
Step 2: Exploring HDFS using the command line
16. You can use the hdfs dfs command to interact with HDFS from the command line. Close or minimize Firefox, then open a terminal window by clicking the icon in the system toolbar:
17. In the terminal window, enter:
$ hdfs dfs
This displays a help message describing all subcommands associated with hdfs dfs.
18. Run the following command:
$ hdfs dfs -ls /
This lists the contents of the HDFS root directory. One of the directories listed is /user. Each user on the cluster has a 'home' directory below /user corresponding to his or her user ID.
19. If you do not specify a path, hdfs dfs assumes you are referring to your home directory:
$ hdfs dfs -ls
20. Note the /dualcore directory. Most of your work in this course will be in that directory. Try creating a temporary subdirectory in /dualcore:
$ hdfs dfs -mkdir /dualcore/test1
21. Next, add a Web server log file to this new directory in HDFS:
$ hdfs dfs -put $ADIR/data/access.log /dualcore/test1/
Overwriting Files in Hadoop
Unlike the UNIX shell, Hadoop won’t overwrite files and directories. This feature helps
protect users from accidentally replacing data that may have taken hours to produce. If
you need to replace a file or directory in HDFS, you must first remove the existing one.
Please keep this in mind in case you make a mistake and need to repeat a step during
the Hands-On Exercises.
To remove a file:
$ hdfs dfs -rm /dualcore/example.txt
To remove a directory and all its files and subdirectories (recursively):
$ hdfs dfs -rm -r /dualcore/example/
22. Verify the last step by listing the contents of the /dualcore/test1 directory. You should observe that the access.log file is present and occupies 106,339,468 bytes of space in HDFS:
$ hdfs dfs -ls /dualcore/test1
23. Remove the temporary directory and its contents:
$ hdfs dfs -rm -r /dualcore/test1
Step 3: Importing Database Tables into HDFS with Sqoop
Dualcore stores information about its employees, customers, products, and orders in a MySQL database. In the next few steps, you will examine this database before using Sqoop to import its tables into HDFS.
24. In a terminal window, log in to MySQL and select the dualcore database:
$ mysql --user=training --password=training dualcore
25. Next, list the available tables in the dualcore database (mysql> represents the MySQL client prompt and is not part of the command):
mysql> SHOW TABLES;
26. Review the structure of the employees table and examine a few of its records:
mysql> DESCRIBE employees;
mysql> SELECT emp_id, fname, lname, state, salary FROM employees LIMIT 10;
27. Exit MySQL by typing quit, and then hitting the Enter key:
mysql> quit
Data Model Reference
For your convenience, you will find a reference section depicting the structure for the
tables you will use in the exercises at the end of this Exercise Manual.
28. Next, run the following command, which imports the employees table into the /dualcore directory created earlier, using tab characters to separate each field:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees
Hiding Passwords
Typing the database password on the command line is a potential security risk since
others may see it. An alternative to using the --password argument is to use -P and let
Sqoop prompt you for the password, which is then not visible when you type it.
Sqoop Code Generation
After running the sqoop import command above, you may notice a new file named
employees.java in your local directory. This is an artifact of Sqoop's code generation
and is really only of interest to Java developers, so you can ignore it.
29. Revise the previous command and import the customers table into HDFS. (A sketch of the revised command follows step 31.)
30. Revise the previous command and import the products table into HDFS.
31. Revise the previous command and import the orders table into HDFS.
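The revision only changes the table name. For example, to import the customers table (a sketch; the products and orders imports differ only in the --table value):

$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table customers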
32. Next, import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id
This is the end of the Exercise
Hands-On Exercise: Running Queries from the Shell, Scripts, and Hue
Exercise directory: $ADIR/exercises/queries
In this exercise you will practice using the Hue query editor and the Impala and Hive shells to execute simple queries. These exercises use the tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest With Hadoop Tools" exercise.
IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:
$ ~/scripts/analyst/catchup.sh
Step #1: Explore the customers table using Hue
One way to run Impala and Hive queries is through your Web browser using Hue's Query Editors. This is especially convenient if you use more than one computer – or if you use a device (such as a tablet) that isn't capable of running the Impala or Beeline shells itself – because it does not require any software other than a browser.
1. Start the Firefox Web browser if it isn't running, then click on the Hue bookmark in the Firefox bookmark toolbar (or type http://localhost:8888/home into the address bar and then hit the [Enter] key.)
2. After a few seconds, you should see Hue's home screen. If you don't currently have an active session, you will first be prompted to log in. Enter training in both the username and password fields, and then click the Sign In button.
3. Select the Query Editors menu in the Hue toolbar. Note that there are query editors for both Impala and Hive (as well as other tools such as Pig.) The interface is very similar for both Hive and Impala. For these exercises, select the Impala query editor.
4. This is the first time we have run Impala since we imported the data using Sqoop. Tell Impala to reload the HDFS metadata for the table by entering the following command in the query area, then clicking Execute.
INVALIDATE METADATA
5. Make sure the default database is selected in the database list on the left side of the page.
6. Below the selected database is a list of the tables in that database. Select the customers table to view the columns in the table.
7. Click the Preview Sample Data icon next to the table name to view sample data from the table. When you are done, click the OK button to close the window.
Step #2: Run a Query Using Hue
Dualcore ran a contest in which customers posted videos of interesting ways to use their new tablets. A $5,000 prize will be awarded to the customer whose video received the highest rating.
However, the registration data was lost due to an RDBMS crash, and the only information we have is from the videos. The winning customer introduced herself only as "Bridget from Kansas City" in her video.
You will need to run a query that identifies the winner's record in our customer database so that we can send her the $5,000 prize.
8. All you know about the winner is that her name is Bridget and she lives in Kansas City. In the Impala Query Editor, enter a query in the text area to find the winning customer. Use the LIKE operator to do a wildcard search for names such as "Bridget", "Bridgette" or "Bridgitte". Remember to filter on the customer's city. (One possible query is sketched below.)
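A sketch of one possibility (the column names name and city are assumptions; check them against the Data Model Reference at the end of this manual):

SELECT *
FROM customers            -- column names assumed; see Data Model Reference
WHERE name LIKE 'Bridg%'
AND city = 'Kansas City';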
9. After entering the query, click the Execute button.
While the query is executing, the Log tab displays ongoing log output from the query. When the query is complete, the Results tab opens, displaying the results of the query.
Question: Which customer did your query identify as the winner of the $5,000 prize?
Step #3: Run a Query from the Impala Shell
Run a top-N query to identify the three most expensive products that Dualcore currently offers.
10. Start a terminal window if you don't currently have one running.
11. On the Linux command line in the terminal window, start the Impala shell:
$ impala-shell
Impala displays the URL of the Impala server in the shell command prompt, e.g.:
[localhost.localdomain:21000] >
12. At the prompt, review the schema of the products table by entering
DESCRIBE products;
Remember that SQL commands in the shell must be terminated by a semicolon (;), unlike in the Hue query editor.
13. Show a sample of 10 records from the products table:
SELECT * FROM products LIMIT 10;
14. Execute a query that displays the three most expensive products. Hint: Use ORDER BY. (One possibility is sketched below.)
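A sketch of one possible query:

SELECT * FROM products ORDER BY price DESC LIMIT 3;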
15. When you are done, exit the Impala shell:
exit;
Step #4: Run a Script in the Impala Shell
The rules for the contest described earlier require that the winner bought the advertised tablet from Dualcore between May 1, 2013 and May 31, 2013. Before we can authorize our accounting department to pay the $5,000 prize, you must ensure that Bridget is eligible.
Since this query involves joining data from several tables, and we have not yet covered JOIN, you've been provided with a script in the exercise directory.
16. Change to the directory for this hands-on exercise:
$ cd $ADIR/exercises/queries
17. Review the code for the query:
$ cat verify_tablet_order.sql
18. Execute the script using the shell's -f option:
$ impala-shell -f verify_tablet_order.sql
Question: Did Bridget order the advertised tablet in May?
Step #5: Run a Query Using Beeline
19. At the Linux command line in a terminal window, start Beeline:
$ beeline -u jdbc:hive2://localhost:10000
Beeline displays the URL of the Hive server in the shell command prompt, e.g.:
0: jdbc:hive2://localhost:10000>
20. Execute a query to find all the Gigabux brand products whose price is less than 1000 ($10). (A sketch of one possibility follows.)
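One possible query (a sketch; recall that prices are stored in cents):

SELECT * FROM products WHERE brand = 'Gigabux' AND price < 1000;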
21. Exit the Beeline shell by entering
!exit
This is the end of the Exercise
Hands-On Exercise: Data Management

Exercise directory: $ADIR/exercises/data_mgmt

In this exercise you will practice using several common techniques for creating and populating tables.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:
$ ~/scripts/analyst/catchup.sh
Step #1: Review Existing Tables using the Metastore Manager
1. In Firefox, visit the Hue home page, and then choose Data Browsers → Metastore Tables in the Hue toolbar.
2. Make sure the default database is selected.
3. Select the customers table to display the table browser and review the list of columns.
4. Select the Sample tab to view the first hundred rows of data.
Step #2: Create and Load a New Table using the Metastore Manager
Create and then load a table with product ratings data.
5. Before creating the table, review the files containing the product ratings data. The files are in /home/training/training_materials/analyst/data. You can use the head command in a terminal window to see the first few lines:
$ head $ADIR/data/ratings_2012.txt
$ head $ADIR/data/ratings_2013.txt
6. Copy the data files to the /dualcore directory in HDFS. You may use either the Hue File Browser, or the hdfs command in the terminal window:
$ hdfs dfs -put $ADIR/data/ratings_2012.txt /dualcore/
$ hdfs dfs -put $ADIR/data/ratings_2013.txt /dualcore/
7. Return to the Metastore Manager in Hue. Select the default database to view the table browser.
8. Click on Create a new table manually to start the table definition wizard.
9. The first wizard step is to specify the table's name (required) and a description (optional). Enter table name ratings, then click Next.
10. In the next step you can choose whether the table will be stored as a regular text file or use a custom Serializer/Deserializer, or SerDe. SerDes will be covered later in the course. For now, select Delimited, then click Next.
11. The next step allows you to change the default delimiters. For a simple table, only the field terminator is relevant; collection and map delimiters are used for complex data in Hive, and will be covered later in the course. Select Tab (\t) for the field terminator, then click Next.
12. In the next step, choose a file format. File formats will be covered later in the course. For now, select TextFile, then click Next.
13. In the next step, you can choose whether to store the file in the default data warehouse directory or a different location. Make sure the Use default location box is checked, then click Next.
14. The next step in the wizard lets you add columns. The first column of the ratings table is the timestamp of the time that the rating was posted. Enter column name posted and choose column type timestamp.
15. You can add additional columns by clicking the Add a column button. Repeat the steps above to enter a column name and type for all the columns for the ratings table:

Field Name    Field Type
posted        timestamp
cust_id       int
prod_id       int
rating        tinyint
message       string
16. When you have added all the columns, scroll down and click Create table. This will start a job to define the table in the Metastore, and create the warehouse directory in HDFS to store the data.
17. When the job is complete, the new table will appear in the table browser.
18. Optional: Use the Hue File Browser or the hdfs command to view the /user/hive/warehouse directory to confirm creation of the ratings subdirectory.
19. Now that the table is created, you can load data from a file. One way to do this is in Hue. Click Import Table under Actions.
20. In the Import data dialog box, enter or browse to the HDFS location of the 2012 product ratings data file: /dualcore/ratings_2012.txt. Then click Submit. (You will load the 2013 ratings in a moment.)
21. Next, verify that the data was loaded by selecting the Sample tab in the table browser for the ratings table.
22. Try querying the data in the table. In Hue, switch to the Impala Query Editor.
23. Initially the new table will not appear. You must first reload Impala's metadata cache by entering and executing the command below. (Impala metadata caching will be covered in depth later in the course.)
INVALIDATE METADATA;
24. If the table does not appear in the table list on the left, click the Reload button. (This refreshes the page, not the metadata itself.)
25. Try executing a query, such as counting the number of ratings:
SELECT COUNT(*) FROM ratings;
The total number of records should be 464.
26. Another way to load data into a table is using the LOAD DATA command. Load the 2013 ratings data:
LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE
ratings;
27. The LOAD DATA INPATH command moves the file to the table's directory. Using the Hue File Browser or hdfs command, verify that the file is no longer present in the original directory:
$ hdfs dfs -ls /dualcore/ratings_2013.txt
28. Optional: Verify that the 2013 data is shown alongside the 2012 data in the table's warehouse directory.
29. Finally, count the records in the ratings table to ensure that all 21,997 are available:
SELECT COUNT(*) FROM ratings;
Step #3: Create an External Table Using CREATE TABLE
You imported data from the employees table in MySQL into HDFS in an earlier exercise. Now we want to be able to query this data. Since the data already exists in HDFS, this is a good opportunity to use an external table.
In the last exercise you practiced creating a table using the Metastore Manager; this time, use an Impala SQL statement. You may use either the Impala shell, or the Impala Query Editor in Hue.
30. Write and execute a CREATE TABLE statement to create an external table for the tab-delimited records in HDFS at /dualcore/employees. The data format is shown below. (A sketch of one possible statement follows the field list.)

Field Name    Field Type
emp_id        STRING
fname         STRING
lname         STRING
address       STRING
city          STRING
state         STRING
zipcode       STRING
job_title     STRING
email         STRING
active        STRING
salary        INT
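One possible statement (a sketch matching the tab-delimited data already in HDFS):

CREATE EXTERNAL TABLE employees (
  emp_id STRING,
  fname STRING,
  lname STRING,
  address STRING,
  city STRING,
  state STRING,
  zipcode STRING,
  job_title STRING,
  email STRING,
  active STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dualcore/employees';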
31. Run the following query to verify that you have created the table correctly.
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;
It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.
Bonus Exercise #1: Use Sqoop’s Hive Import Option to Create a Table
If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise.
You used Sqoop in an earlier exercise to import data from MySQL into HDFS. Sqoop can also create a Hive/Impala table with the same fields as the source table in addition to importing the records, which saves you from having to write a CREATE TABLE statement.
32. In a terminal window, execute the following command to import the suppliers table from MySQL as a new managed table:
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import
33. It is always a good idea to validate data after adding it. Execute the following query to count the number of suppliers in Texas. You may use either the Impala shell or the Hue Impala Query Editor. Remember to invalidate the metadata cache so that Impala can find the new table.
INVALIDATE METADATA;
SELECT COUNT(*) FROM suppliers WHERE state='TX';
The query should show that nine records match.
Bonus Exercise #2: Alter a Table
If you have successfully finished the main exercise and still have time, feel free to continue with this bonus exercise. You can compare your work against the files found in the bonus_02/sample_solution/ subdirectory.
In this exercise you will modify the suppliers table you imported using Sqoop in the previous exercise. You may complete these exercises using either the Impala shell or the Impala query editor in Hue.
34. Use ALTER TABLE to rename the company column to name. (A sketch of both statements follows step 36.)
35. Use the DESCRIBE command on the suppliers table to verify the change.
36. Use ALTER TABLE to rename the entire table to vendors.
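A sketch of the two statements:

ALTER TABLE suppliers CHANGE company name STRING;
ALTER TABLE suppliers RENAME TO vendors;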
37. Although the ALTER TABLE command often requires that we make a corresponding change to the data in HDFS, renaming a table or column does not. You can verify this by running a query on the table using the new names, e.g.:
SELECT supp_id, name FROM vendors LIMIT 10;
This is the end of the Exercise
Hands-On Exercise: Relational Analysis

Exercise directory: $ADIR/exercises/relational_analysis

In this exercise you will write queries to analyze data in tables that have been populated with data you imported to HDFS using Sqoop in the "Data Ingest" exercise.
IMPORTANT: In order to prepare the data for this exercise, you must run the following command before continuing:
$ ~/scripts/analyst/catchup.sh
Several analysis questions are described below and you will need to write the SQL code to answer them. You can use whichever tool you prefer – Impala or Hive – using whichever method you like best, including shell, script, or the Hue Query Editor, to run your queries.
Step #1: Calculate Top N Products
• Which top three products has Dualcore sold more of than any other? Hint: Remember that if you use a GROUP BY clause, you must group by all fields listed in the SELECT clause that are not part of an aggregate function. (A sketch of one approach follows.)
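One possible approach (a sketch; it counts sales per product in order_details):

SELECT prod_id, COUNT(prod_id) AS sold
FROM order_details
GROUP BY prod_id
ORDER BY sold DESC
LIMIT 3;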
Step #2: Calculate Order Total
• Which orders had the highest total? (One possible query is sketched below.)
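A sketch of one possibility (each order's total is the sum of the prices of its products):

SELECT d.order_id, SUM(price) AS total
FROM order_details d
JOIN products p ON (d.prod_id = p.prod_id)
GROUP BY d.order_id
ORDER BY total DESC
LIMIT 10;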
Step #3: Calculate Revenue and Profit
• Write a query to show Dualcore's revenue (total price of products sold) and profit (price minus cost) by date. (A sketch follows the hint.)
o Hint: The order_date column in the order_details table is of type TIMESTAMP. Use TO_DATE to get just the date portion of the value.
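One possible shape for this query (a sketch that takes the hint above at face value; adjust the joins to match the Data Model Reference if your schema differs):

SELECT TO_DATE(order_date) AS order_dt,
       SUM(price) AS revenue,       -- total selling price per day
       SUM(price - cost) AS profit  -- price minus wholesale cost per day
FROM order_details d
JOIN products p ON (d.prod_id = p.prod_id)
GROUP BY TO_DATE(order_date)
ORDER BY order_dt;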
There are several ways you could write these queries. One possible solution for each is in the sample_solution/ directory.
Bonus Exercise #1: Rank Daily Profits by Month
If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.
• Write a query to show how each day's profit ranks compared to other days within the same year and month. (A sketch follows the hint.)
o Hint: Use the previous exercise's solution as a sub-query; find the ROW_NUMBER of the results within each year and month.
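A sketch of one way to structure this, using the daily profit query above as the inner query:

SELECT YEAR(order_dt) AS yr, MONTH(order_dt) AS mon, profit,
       ROW_NUMBER() OVER (PARTITION BY YEAR(order_dt), MONTH(order_dt)
                          ORDER BY profit DESC) AS rank
FROM (SELECT TO_DATE(order_date) AS order_dt,
             SUM(price - cost) AS profit
      FROM order_details d
      JOIN products p ON (d.prod_id = p.prod_id)
      GROUP BY TO_DATE(order_date)) daily;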
There are several ways you could write this query. One possible solution is in the bonus_01/sample_solution/ directory.
This is the end of the Exercise
Hands-On Exercise: Working with Impala
In this exercise you will explore the query execution plan for various types of queries in Impala.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:
$ ~/scripts/analyst/catchup.sh
Step #1: Review Query Execution Plans
1. Review the execution plan for the following query. You may use either the Impala Query Editor in Hue or the Impala shell command line tool. (One way to do this from the shell is sketched below.)
SELECT * FROM products;
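In the Impala shell, for example, you can view the plan by prefixing the query with EXPLAIN:

EXPLAIN SELECT * FROM products;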
2. Note that the query explanation includes a warning that table and column stats are not available for the products table. Compute the stats by executing
COMPUTE STATS products;
3. Now view the query plan again, this time without the warning.
4. The previous query was a very simple query against a single table. Try reviewing the query plan of a more complex query. The following query returns the top 3 products sold. Before EXPLAINing the query, compute stats on the tables to be queried.
SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;
Questions: How many stages are there in this query? What are the estimated per-host memory requirements for this query? What is the total size of all partitions to be scanned?
5. The tables in the queries above all have only a single partition. Try reviewing the query plan for a partitioned table. Recall that in the "Data Storage and Performance" exercise, you created an ads table partitioned on the network column. Compare the query plans for the following two queries. The first calculates the total cost of clicked ads for each ad campaign; the second does the same, but for all ads on one of the ad networks.
SELECT campaign_id, SUM(cpc)
FROM ads
WHERE was_clicked=1
GROUP BY campaign_id
ORDER BY campaign_id;
SELECT campaign_id, SUM(cpc)
FROM ads
WHERE network=1
GROUP BY campaign_id
ORDER BY campaign_id;
Questions: What are the estimated per-host memory requirements for the two queries? What explains the difference?
Bonus Exercise #1: Review the Query Summary
If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.
This exercise must be completed in the Impala shell command line tool, because it uses features not yet available in Hue. Refer to the "Running Queries from the Shell, Scripts, and Hue" exercise for how to use the shell if needed.
6. Try executing one of the queries you examined above, e.g.
SELECT brand, name, COUNT(p.prod_id) AS sold
FROM products p
JOIN order_details d
ON (p.prod_id = d.prod_id)
GROUP BY brand, name, p.prod_id
ORDER BY sold DESC
LIMIT 3;
7. After the query completes, execute the SUMMARY command:
SUMMARY;
8. Questions: Which stage took the longest average time to complete? Which took the most memory?
This is the end of the Exercise
Hands-On Exercise: Analyzing Text and Complex Data With Hive
Exercise directory: $ADIR/exercises/complex_data

In this exercise, you will:
• Use Hive's ability to store complex data to work with data from a customer loyalty program
• Use a RegexSerDe to load web log data into Hive
• Use Hive's text processing features to analyze customers' comments and product ratings, uncover problems, and propose potential solutions.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:
$ ~/scripts/analyst/catchup.sh
Step #1: Create, Load and Query a Table with Complex Data
Dualcore recently started a loyalty program to reward our best customers. A colleague has already provided us with a sample of the data that contains information about customers who have signed up for the program, including their phone numbers (as a map), a list of past order IDs (as an array), and a struct that summarizes the minimum, maximum, average, and total value of past orders. You will create the table, populate it with the provided data, and then run a few queries to practice referencing these types of fields.
You may use either the Beeline shell or Hue's Hive Query Editor to complete these exercises.
1. Create a table with the following characteristics (a sketch of the statement follows the specification):
Name: loyalty_program
Type: EXTERNAL
Columns:

Field Name    Field Type
cust_id       STRING
fname         STRING
lname         STRING
email         STRING
level         STRING
phone         MAP<STRING,STRING>
order_ids     ARRAY<INT>
order_value   STRUCT<min:INT, max:INT, avg:INT, total:INT>

Field Terminator: | (vertical bar)
Collection item terminator: , (comma)
Map Key Terminator: : (colon)
Location: /dualcore/loyalty_program
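One possible statement (a sketch matching the specification above):

CREATE EXTERNAL TABLE loyalty_program (
  cust_id STRING,
  fname STRING,
  lname STRING,
  email STRING,
  level STRING,
  phone MAP<STRING,STRING>,
  order_ids ARRAY<INT>,
  order_value STRUCT<min:INT, max:INT, avg:INT, total:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION '/dualcore/loyalty_program';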
2. Examine the data in $ADIR/data/loyalty_data.txt to see how it corresponds to the fields in the table.
3. Load the data file by placing it into the HDFS data warehouse directory for the new table. You can use either the Hue File Browser, or the hdfs command:
$ hdfs dfs -put $ADIR/data/loyalty_data.txt \
/dualcore/loyalty_program/
4. Run a query to select the HOME phone number (hint: map keys are case-sensitive) for customer ID 1200866. You should see 408-555-4914 as the result.
5. Select the third element from the order_ids array for customer ID 1200866 (hint: elements are indexed from zero). The query should return 5278505.
6. Select the total attribute from the order_value struct for customer ID 1200866. The query should return 401874. (Sketches of all three queries follow.)
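Sketches of the three queries, showing HiveQL's map, array, and struct access syntax:

SELECT phone['HOME'] FROM loyalty_program WHERE cust_id = '1200866';
SELECT order_ids[2] FROM loyalty_program WHERE cust_id = '1200866';
SELECT order_value.total FROM loyalty_program WHERE cust_id = '1200866';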
Step #2: Create and Populate the Web Logs Table
Many interesting analyses can be done on data from the usage of a web site. The first step is to load the semi-structured data in the web log files into a Hive table. Typical log file formats are not delimited, so you will need to use the RegexSerDe and specify a pattern Hive can use to parse lines into individual fields you can then query.
7. Examine the create_web_logs.hql script to get an idea of how it uses a RegexSerDe to parse lines in the log file (an example log line is shown in the comment at the top of the file). When you have examined the script, run it to create the table. You can paste the code into the Hive Query Editor, or use HCatalog:
$ hcat -f $ADIR/exercises/complex_data/create_web_logs.hql
8. Populate the table by adding the log file to the table's directory in HDFS:
$ hdfs dfs -put $ADIR/data/access.log /dualcore/web_logs/
9. Verify that the data is loaded correctly by running this query to show the top three items users searched for on our Web site:
SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;
You should see that it returns tablet (303), ram (153) and wifi (148).
Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.
Bonus Exercise #1: Analyze Numeric Product Ratings
If you have successfully finished the earlier steps and still have time, feel free to continue with this optional bonus exercise.
Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore.
However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.
Before delving into text processing, you'll begin by analyzing the numeric ratings customers have assigned to various products. In the next bonus exercise, you will use these results in doing text analysis.
10. Review the ratings table structure using the Hive Query Editor or using the DESCRIBE command in the Beeline shell.
11. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;
12. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10. (A sketch of the rewritten query follows.)
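The rewrite only needs to change the sort order (a sketch):

SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating ASC
LIMIT 1;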
Bonus Exercise #2: Analyze Rating Comments
We observed earlier that customers are very dissatisfied with one of the products we sell. Although numeric ratings can help identify which product that is, they don't tell us why customers don't like the product. Although we could simply read through all the comments associated with that product to learn this information, that approach doesn't scale. Next, you will use Hive's text processing support to analyze the comments.
13. The following query normalizes all comments on that product to lowercase, breaks them into individual words using the SENTENCES function, and passes those to the NGRAMS function to find the five most common bigrams (two-word combinations). Run the query:
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5))
AS bigrams
FROM ratings
WHERE prod_id = 1274673;
14. Most of these words are too common to provide much insight, though the word "expensive" does stand out in the list. Modify the previous query to find the five most common trigrams (three-word combinations), and then run that query in Hive. (A sketch follows.)
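Only the n-gram size changes (a sketch):

SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 3, 5))
AS trigrams
FROM ratings
WHERE prod_id = 1274673;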
15. Among the patterns you see in the results is the phrase "ten times more." This might be related to the complaints that the product is too expensive. Now that you've identified a specific phrase, look at a few comments that contain it by running this query:
SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;
You should see three comments that say, "Why does the red one cost ten times more than the others?"
16. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words ("red") in that comment was also found in the list of trigrams from the earlier query. Write and execute a query that will find all distinct comments containing the word "red" that are associated with product ID 1274673. (One possibility is sketched below.)
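A sketch of one possible query:

SELECT DISTINCT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%red%';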
17. The previous step should have displayed two comments:
• "What is so special about red?"
• "Why does the red one cost ten times more than the others?"
The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table. (A sketch follows.)
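One possible query (a sketch):

SELECT * FROM products WHERE prod_id = 1274673;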
18. Your query should have shown that the product was a "16 GB USB Flash Drive (Red)" from the "Orion" brand. Next, run this query to identify similar products:
SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';
The query results show that we have three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.
Based on the cost and price columns, it appears that doing text processing on the product ratings has helped us uncover a pricing error.
This is the end of the Exercise
Hands-On Exercise: Data Transformation with Hive
Exercise directory: $ADIR/exercises/transform

In this exercise you will explore the data from Dualcore's Web server that you loaded in the "Analyzing Text and Complex Data" exercise. Queries on that data will reveal that many customers abandon their shopping carts before completing the checkout process. You will create several additional tables, using data from a TRANSFORM script and a supplied UDF, which you will use later to analyze how Dualcore could turn this problem into an opportunity.
IMPORTANT: This exercise builds on previous ones. If you were unable to complete any previous exercise or think you may have made a mistake, run the following command to prepare for this exercise before continuing:
$ ~/scripts/analyst/catchup.sh
Step #1: Analyze Customer Checkouts
As on many Web sites, Dualcore's customers add products to their shopping carts and then follow a "checkout" process to complete their purchase. We want to figure out if customers who start the checkout process are completing it. Since each part of the four-step checkout process can be identified by its URL in the logs, we can use a regular expression to identify them:

Step  Request URL                         Description
1     /cart/checkout/step1-viewcart       View list of items added to cart
2     /cart/checkout/step2-shippingcost   Notify customer of shipping cost
3     /cart/checkout/step3-payment        Gather payment information
4     /cart/checkout/step4-receipt        Show receipt for completed order
Note: Because the web_logs table uses a RegexSerDe, which is a feature not supported by Impala, this step must be completed in Hive. You may use either the Beeline shell or the Hive Query Editor in Hue.
1. Run the following query in Hive to show the number of requests for each step of the checkout process:
SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;
The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.
2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:
CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;
3. Run this query to show the number of people who abandoned their cart after each step:
SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;
You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.
4. Optional: Because the new checkout_sessions table does not use a SerDe, it can be queried in Impala. Try running the same query as in the previous step in Impala. What happens?
Step #2: Use TRANSFORM for IP Geolocation
Based on what you've just seen, it seems likely that customers abandon their carts due to high shipping costs. The shipping cost is based on the customer's location and the weight of the items they've ordered. Although this information isn't in the database (since the order wasn't completed), we can gather enough data from the logs to estimate it.
We don't have the customer's address, but we can use a process known as "IP geolocation" to map the computer's IP address in the log file to an approximate physical location. Since this isn't a built-in capability of Hive, you'll use a provided Python script to TRANSFORM the ip_address field from the checkout_sessions table to a ZIP code, as part of a HiveQL statement that creates a new table called cart_zipcodes.
Regarding TRANSFORM and UDF Examples in this Exercise
During this exercise, you will use a Python script for IP geolocation and a UDF to
calculate shipping costs. Both are implemented merely as a simulation – compatible with
the fictitious data we use in class and intended to work even when Internet access is
unavailable. The focus of these exercises is on how to use external scripts and UDFs,
rather than how the code for the examples works internally.
5. Examine the create_cart_zipcodes.hql script and observe the following:
a. It creates a new table called cart_zipcodes based on a select statement.
b. That select statement transforms the ip_address, cookie, and steps_completed fields from the checkout_sessions table using a Python script.
c. The new table contains the ZIP code instead of an IP address, plus the other two fields from the original table.
6. Examine the ipgeolocator.py script and observe the following:
a. Records are read from Hive on standard input.
b. The script splits them into individual fields using a tab delimiter.
c. The ip_addr field is converted to zipcode, but the cookie and steps_completed fields are passed through unmodified.
d. The three fields in each output record are delimited with tabs and printed to standard output.
7. Copy the Python file to HDFS so that the HiveServer can access it. You may use the Hue File Browser or the hdfs command:
$ hdfs dfs -put $ADIR/exercises/transform/ipgeolocator.py \
/dualcore/
8. Run the script to create the cart_zipcodes table. You can either paste the code into the Hive Query Editor, or use Beeline in a terminal window:
$ beeline -u jdbc:hive2://localhost:10000 \
-f $ADIR/exercises/transform/create_cart_zipcodes.hql
Step #3: Extract List of Products Added to Each Cart
As described earlier, estimating the shipping cost also requires a list of items in the customer's cart. You can identify products added to the cart since the request URL looks like this (only the product ID changes from one record to the next): /cart/additem?productid=1234567
9. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement (a sketch follows the list):
a. The prod_id field should contain only the seven-digit product ID (hint: use the REGEXP_EXTRACT function)
b. Use a WHERE clause with REGEXP using the same regular expression as above, so that you only include records where customers are adding items to the cart.
c. If you need a hint on how to write the statement, look at the create_cart_items.hql file in the exercise's sample_solution directory.
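A sketch of one possible statement (the exact regular expression shown is one possibility; compare with create_cart_items.hql):

CREATE TABLE cart_items AS
SELECT cookie,
REGEXP_EXTRACT(request,
'/cart/additem\\?productid=(\\d{7})', 1) AS prod_id
FROM web_logs
WHERE request REGEXP '/cart/additem\\?productid=\\d{7}';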
10. Verify the contents of the new table by running this query:
SELECT COUNT(DISTINCT cookie) FROM cart_items
WHERE prod_id=1273905;
If this doesn't return 47, then compare your statement to the create_cart_items.hql file, make the necessary corrections, and then re-run your statement (after dropping the cart_items table).
Step #4: Create Tables to Join Web Logs with Product Data
You now have tables representing the ZIP codes and products associated with checkout sessions, but you'll need to join these with the products table to get the weight of these items before you can estimate shipping costs. In order to do some more analysis later, we'll also include total selling price and total wholesale cost in addition to the total shipping weight for all items in the cart.
11. Run the following HiveQL to create a table called cart_orders with the information:
CREATE TABLE cart_orders AS
SELECT z.cookie, steps_completed, zipcode,
SUM(shipping_wt) AS total_weight,
SUM(price) AS total_price,
SUM(cost) AS total_cost
FROM cart_zipcodes z
JOIN cart_items i
ON (z.cookie = i.cookie)
JOIN products p
ON (i.prod_id = p.prod_id)
GROUP BY z.cookie, zipcode, steps_completed;
Step #5: Create a Table Using a UDF to Estimate Shipping Cost
We finally have all the information we need to estimate the shipping cost for each abandoned order. One of the developers on our team has already written, compiled, and packaged a Hive UDF that will calculate the shipping cost given a ZIP code and the total weight of all items in the order.
12. Before you can use a UDF, you must make it available to Hive. First, copy the file to HDFS so that the HiveServer can access it. You may use the Hue File Browser or the hdfs command:
$ hdfs dfs -put \
$ADIR/exercises/transform/geolocation_udf.jar \
/dualcore/
13. Next, register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate the UDF with the alias CALC_SHIPPING_COST:
CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST
AS 'com.cloudera.hive.udf.UDFCalcShippingCost'
USING JAR 'hdfs:/dualcore/geolocation_udf.jar';
14. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:
CREATE TABLE cart_shipping AS
SELECT cookie, steps_completed, total_price, total_cost,
CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost
FROM cart_orders;
15. Finally, verify your table by running the following query to check a record:
SELECT * FROM cart_shipping WHERE cookie='100002920697';
This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.
You may now shut down the Data Analyst VM and launch the Spark VM. Run the following script:
$ ~/scripts/sparkdev/training_setup_sparkdev.sh
This is the end of the Exercise
Hands-On Exercise: View the Spark Documentation
In this exercise, you will familiarize yourself with the Spark documentation.
You must now shut down the Data Analyst VM and launch the Spark VM, if you have not already done so.
IMPORTANT: In order to prepare for this exercise, you must run the following command before continuing:
$ ~/scripts/sparkdev/training_setup_sparkdev.sh
1. Start Firefox in your Virtual Machine and visit the Spark documentation on your local machine, using the provided bookmark or opening the URL file:/usr/lib/spark/docs/_site/index.html
2. From the Programming Guides menu, select the Spark Programming Guide. Briefly review the guide. You may wish to bookmark the page for later review.
3. From the API Docs menu, select either Scala or Python, depending on your language preference. Bookmark the API page for use during class. Later exercises will refer you to this documentation.
This is the end of the Exercise
Hands-On Exercise: Use the Spark Shell
In this exercise, you will start the Spark Shell and view the SparkContext object.
You may choose to do this exercise using either Scala or Python. Follow the instructions below for Python, or skip to the next section for Scala.
Most of the later exercises assume you are using Python, but Scala solutions are provided on your virtual machine, so you should feel free to use Scala if you prefer.
Using the Python Spark Shell
1. In a terminal window, start the pyspark shell:
$ pyspark
You may get several INFO and WARNING messages, which you can disregard. If you don't see the In [n]> prompt after a few seconds, hit Return a few times to clear the screen output.
Note: Your environment is set up to use the IPython shell by default. If you would prefer to use the regular Python shell, set IPYTHON=0 before starting pyspark.
2. Spark creates a SparkContext object for you called sc. Make sure the object exists:
pyspark> sc
Pyspark will display information about the sc object such as:
<pyspark.context.SparkContext at 0x2724490>
3. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.
4. You can exit the shell by hitting Ctrl-D or by typing exit.
Using the Scala Spark Shell
5. In a terminal window, start the Scala Spark Shell:
$ spark-shell
You may get several INFO and WARNING messages, which you can disregard. If you don't see the scala> prompt after a few seconds, hit Enter a few times to clear the screen output.
6. Spark creates a SparkContext object for you called sc. Make sure the object exists:
scala> sc
Scala will display information about the sc object such as:
res0: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@2f0301fa
7. Using command completion, you can see all the available SparkContext methods: type sc. (sc followed by a dot) and then the [TAB] key.
8. You can exit the shell by hitting Ctrl-D or typing exit.
This is the end of the Exercise
Hands-On Exercise: Use RDDs to Transform a Dataset
Files and Data Used in This Exercise:
Data files (local):
~/training_materials/sparkdev/data/frostroad.txt
~/training_materials/sparkdev/data/weblogs/2013-09-15.log
Solutions:
~/training_materials/sparkdev/solutions/LogIPs.pyspark
~/training_materials/sparkdev/solutions/LogIPs.scalaspark
In this exercise you will practice using RDDs in the Spark Shell.
You will start by reading a simple text file. Then you will use Spark to explore and transform the Apache web server output logs of the customer service site of a fictional mobile phone service provider called Loudacre.
Loading and Viewing a Text File
1. Review the simple text file we will be using by viewing (without editing) the file in a text editor. The file is located at: ~/training_materials/sparkdev/data/frostroad.txt
2. Start the Spark Shell if you exited it from the previous exercise. You may use either Scala (spark-shell) or Python (pyspark).
3. Define an RDD to be created by reading in a simple test file. For Python, enter:
pyspark> mydata = sc.textFile(\
"file:/home/training/training_materials/sparkdev/\
data/frostroad.txt")
Or for Scala, enter:
scala> val mydata = sc.textFile(
"file:/home/training/training_materials/sparkdev/data/frostroad.txt")
• Note: For the remainder of the Hands-On Exercises, note the color coding and prompt in exercise text snippets to follow the instructions for whichever language you are using.
4. Note that Spark has not yet read the file. It will not do so until you perform an operation on the RDD. Try counting the number of lines in the dataset:
pyspark> mydata.count()
scala> mydata.count()
The count operation causes the RDD to be materialized (created and populated), after which the result (23) is displayed. The example below shows the output for Pyspark (Scala produces the same result, but the output format will differ slightly):
Out[2]: 23
5. Try executing the collect operation to display the data in the RDD. Note that this returns and displays the entire dataset. This is convenient for very small RDDs like this one, but be careful using collect for large datasets, which are common when using Spark.
pyspark> mydata.collect()
scala> mydata.collect()
6. Using command completion, you can see all the available transformations and operations you can perform on an RDD. Type mydata. and then the [TAB] key.
Exploring the Loudacre Web Log Files
In this exercise, you will be using data in ~/training_materials/sparkdev/data/weblogs. Initially you will work with the log file from a single day. Later you will work with the full dataset consisting of many days' worth of logs.
7. Review one of the .log files in the directory. Note the format of the lines, e.g.:
116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] "GET /KBDOC-00031.html HTTP/1.0" 200 1388 "http://www.loudacre.com" "Loudacre CSR Browser"
(The line starts with the IP address, user ID, and request fields.)
8. In the previous example you used a local data file. In the real world, you will almost always be working with data on the HDFS cluster instead. Create an HDFS directory for the course, and then copy the dataset into it. In a separate terminal window (not your Spark shell), execute:
$ hdfs dfs -mkdir /loudacre
$ hdfs dfs -put \
~/training_materials/sparkdev/data/weblogs/ \
/loudacre/
9. In your Spark Shell, set a variable for the data file so you do not have to retype it each time.
pyspark> logfile="/loudacre/weblogs/2013-09-15.log"
scala> val logfile="/loudacre/weblogs/2013-09-15.log"
10. Create an RDD from the data file.
pyspark> logs = sc.textFile(logfile)
scala> val logs = sc.textFile(logfile)
11. Create an RDD containing only those lines that are requests for JPG files.
pyspark> jpglogs=\
logs.filter(lambda x: ".jpg" in x)
scala> val jpglogs = logs.
filter(line => line.contains(".jpg"))
12. View the first 10 lines of the data using take:
pyspark> jpglogs.take(10)
scala> jpglogs.take(10)
13. Sometimes you do not need to store intermediate data in a variable, in which case you can combine the steps into a single line of code. For instance, if all you need is to count the number of JPG requests, you can execute this in a single command:
pyspark> sc.textFile(logfile).filter(lambda x: \
".jpg" in x).count()
scala> sc.textFile(logfile).filter(line =>
line.contains(".jpg")).count()
14. Now try using the map function to define a new RDD. Start with a very simple map that returns the length of each line in the log file.
pyspark> logs.map(lambda s: len(s)).take(5)
scala> logs.map(line => line.length).take(5)
This prints out an array of five integers corresponding to the first five lines in the file.
15. That's not very useful. Instead, try mapping each line to an array of words:
pyspark> logs.map(lambda s: s.split()).take(5)
scala> logs.map(line => line.split(' ')).take(5)
This time it prints out five arrays, each containing the words in the corresponding log file line.
16. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the log file. (The IP address is the first field in each line.)
pyspark> ips = logs.map(lambda s: s.split()[0])
pyspark> ips.take(5)
scala> val ips = logs.map(line => line.split(' ')(0))
scala> ips.take(5)
17. Although take and collect are useful ways to look at data in an RDD, their output is sometimes not very readable. Fortunately, though, they return arrays, which you can iterate through:
pyspark> for x in ips.take(5): print x
scala> ips.take(5).foreach(println)
18. Finally, save the list of IP addresses as a text file:
pyspark> ips.saveAsTextFile("/loudacre/iplist")
scala> ips.saveAsTextFile("/loudacre/iplist")
19. In a terminal window, list the contents of the /loudacre/iplist directory in HDFS:
$ hdfs dfs -ls /loudacre/iplist
20. You should see multiple files. The one you care about is part-00000, which should contain the list of IP addresses. "Part" (partition) files are numbered because there may be results from multiple tasks running on the cluster; you will learn more about this later.
If You Have More Time
If you have more time, attempt the following challenges:
21. Challenge 1: As you did in the previous step, save a list of IP addresses, but this time use the whole web log dataset (weblogs/*) instead of a single day's log.
• Tip: You can use the up arrow to edit and execute previous commands. You should only need to modify the lines that read and save the files.
22. Challenge 2: Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for an HTML file. (Disregard requests for other file types.) The user ID is the third field in each log file line.
Display the data in the form ipaddress/userid, such as:
165.32.101.206/8
100.219.90.44/102
182.4.148.56/173
246.241.6.175/45395
175.223.172.207/4115
…
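One possible PySpark approach to Challenge 2 (a sketch only; it assumes an RDD named logs created from the full dataset, and that requests for HTML files contain the string ".html"):

pyspark> htmllogs = logs.filter(lambda line: ".html" in line) \
    .map(lambda line: line.split()[0] + "/" + line.split()[2])
pyspark> for pair in htmllogs.take(5): print pair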
This is the end of the Exercise
Hands-On Exercise: Process Data Files with Spark
Files and Data Used in This Exercise:
Data files (local):
~/training_materials/sparkdev/data/activations/*
~/training_materials/sparkdev/data/devicestatus.txt (Bonus)
Stubs: stubs/ActivationModels.pyspark
stubs/ActivationModels.scalaspark
Solutions: solutions/ActivationModels.pyspark
solutions/ActivationModels.scalaspark
solutions/DeviceStatusETL.pyspark (Bonus)
solutions/DeviceStatusETL.scalaspark (Bonus)
In this exercise you will parse a set of activation records in XML format to extract the account numbers and model names.
One of the common uses for Spark is doing data Extract/Transform/Load operations. Sometimes data is stored in line-oriented records, like the web logs in the previous exercise, but sometimes the data is in a multi-line format that must be processed as a whole file. In this exercise you will practice working with file-based instead of line-based formats.
Reviewing the API Documentation for RDD Operations
Visit the Spark API page you bookmarked previously. Follow the link at the top for the RDD class and review the list of available methods.
The Data
1. Review the data in activations (in the course data directory). Each XML file contains data for all the devices activated by customers during a specific month.
Sample input data:
<activations>
<activation timestamp="1225499258" type="phone">
<account-number>316</account-number>
<device-id>
d61b6971-33e1-42f0-bb15-aa2ae3cd8680
</device-id>
<phone-number>5108307062</phone-number>
<model>iFruit 1</model>
</activation>
…
</activations>
2. Copy this data to HDFS:
$ hdfs dfs -put \
~/training_materials/sparkdev/data/activations \
/loudacre/
The Task
Your code should go through a set of activation XML files, extract the account number and device model for each activation, and save the list to a file as account_number:model.
The output will look something like:
1234:iFruit 1
987:Sorrento F00L
4566:iFruit 1
…
3. Start with the ActivationModels stub script. (A stub is provided for Scala and Python; use whichever language you prefer.) Note that for convenience you have been provided with functions to parse the XML, as that is not the focus of this exercise. Copy the stub code into the Spark Shell.
4. Use wholeTextFiles to create an RDD from the activations dataset. The resulting RDD will consist of tuples, in which the first value is the name of the file and the second value is the contents of the file (XML) as a string.
5. Each XML file can contain many activation records; use flatMap to map the contents of each file to a collection of XML records by calling the provided getactivations function. getactivations takes an XML string, parses it, and returns a collection of XML records; flatMap maps each record to a separate RDD element.
6. Map each activation record to a string in the format account-number:model. Use the provided getaccount and getmodel functions to find the values from the activation record.
7. Save the formatted strings to a text file in the directory /loudacre/account-models.
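A condensed PySpark sketch of steps 4 through 7 (the getactivations, getaccount, and getmodel functions come from the provided stub; the variable names here are illustrative):

pyspark> actfiles = sc.wholeTextFiles("/loudacre/activations")
pyspark> actrecords = actfiles \
    .flatMap(lambda (filename, xmlstring): getactivations(xmlstring))
pyspark> actstrings = actrecords \
    .map(lambda record: getaccount(record) + ':' + getmodel(record))
pyspark> actstrings.saveAsTextFile("/loudacre/account-models")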
Bonus Exercise
Another common part of the ETL process is data scrubbing. In this bonus exercise, you will process data in order to get it into a standardized format for later processing.
Review the contents of the data file devicestatus.txt. This file contains data collected from mobile devices on Loudacre's network, including device ID, current status, location, and so on. Because Loudacre previously acquired other mobile providers' networks, the data from different subnetworks has a different format. Note that the records in this file
have different field delimiters: some use commas, some use pipes (|), and so on. Your tasks, sketched after this list, are to:
• Load the dataset
• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter)
• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values)
• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields, respectively)
• The second field contains the device manufacturer and model name (e.g. Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g., manufacturer Ronin, model S2).
• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.
• Confirm that the data in the file(s) was saved correctly.
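A possible PySpark sketch of the steps above. Assumptions: position 19 is 0-based, field indexes are 0-based, and the saved records keep only the manufacturer portion of the second field, matching the sample data shown in the later "Implement an Iterative Algorithm" exercise:

pyspark> raw = sc.textFile("/loudacre/devicestatus.txt")
pyspark> # the delimiter is the character at position 19
pyspark> parsed = raw.map(lambda line: line.split(line[19]))
pyspark> valid = parsed.filter(lambda fields: len(fields) == 14)
pyspark> # keep date, manufacturer, device ID, latitude, longitude
pyspark> cleaned = valid.map(lambda fields: (fields[0], \
    fields[1].split(' ')[0], fields[2], fields[12], fields[13]))
pyspark> cleaned.map(lambda vals: ','.join(vals)) \
    .saveAsTextFile("/loudacre/devicestatus_etl")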
This is the end of the Exercise
Hands-On Exercise: Use Pair RDDs to Join Two Datasets
Files Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
Data files (local):
~/training_materials/sparkdev/data/accounts.csv
Solution: solutions/UserRequests.pyspark
solutions/UserRequests.scalaspark
In this exercise you will continue exploring the Loudacre web server log files, as well as the Loudacre user account data, using key-value Pair RDDs.
Exploring Web Log Files
Continue working with the web log files, as in the previous exercise.
Tip: In this exercise you will be reducing and joining large datasets, which can take a lot of time. You may wish to perform the exercises below using a smaller dataset, consisting of only a few of the web log files, rather than all of them. Remember that you can specify a wildcard; textFile("/loudacre/weblogs/*6.log") would include only filenames ending with the digit 6 and having a .log extension.
1. Using map and reduceByKey, count the number of requests from each user.
a. Use map to create a Pair RDD with the user ID as the key and the integer 1 as the value. (The user ID is the third field in each line.) Your data will look something like this:

(userid, 1)
(userid, 1)
(userid, 1)
…
b. Use reduceByKey to sum the values for each user ID. Your RDD data will be similar to:

(userid, 5)
(userid, 7)
(userid, 2)
…
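A minimal sketch of step 1 in PySpark, assuming the web log lines are in an RDD named logs, as in the previous exercise:

pyspark> userreqs = logs \
    .map(lambda line: (line.split()[2], 1)) \
    .reduceByKey(lambda v1, v2: v1 + v2)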
2. Use countByKey to determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times, and so on.
a. Use map to reverse the key and value, like this:

(5, userid)
(7, userid)
(2, userid)
…

b. Use the countByKey action to return a Map of frequency:user-count pairs.
3. Create an RDD where the user ID is the key and the value is the list of all the IP addresses that user has connected from. (The IP address is the first field in each request line.) The result will be similar to:

(userid, [20.1.34.55, 74.125.239.98])
(userid, [75.175.32.10, 245.33.1.1, 66.79.233.99])
(userid, [65.50.196.141])
…

• Hint: Map to (userid, ipaddress) pairs, like this:

(userid, 20.1.34.55)
(userid, 245.33.1.1)
(userid, 65.50.196.141)
…

and then use groupByKey.
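A possible sketch of step 3, again assuming the logs RDD:

pyspark> userips = logs \
    .map(lambda line: (line.split()[2], line.split()[0])) \
    .groupByKey()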
Joining Web Log Data with Account Data
4. Copy the accounts.csv data file to HDFS:
$ hdfs dfs -put \
~/training_materials/sparkdev/data/accounts.csv \
/loudacre/
This dataset consists of information about Loudacre's user accounts. The first field in each line is the user ID, which corresponds to the user ID in the web server logs. The other fields include account details such as creation date, first and last name, and so on.
5. Join the accounts data with the weblog data to produce a dataset keyed by user ID which contains the user account information and the number of website hits for that user.
a. Map the accounts data to key/value-list pairs: (userid, [values…]), for example:

(userid1, [userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…])
(userid2, [userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…])
(userid3, [userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…])
…

b. Join the Pair RDD with the set of userid/hit-count pairs calculated in the first step. The result will be similar to:

(userid1, ([userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905 Olive Street,San Francisco,CA,…], 4))
(userid2, ([userid2,2008-11-23 14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl Street,Richmond,CA,…], 8))
(userid3, ([userid3,2008-11-02 17:12:12,2013-07-18 16:42:36,Melissa,Roman,3539 James Martin Circle,Oakland,CA,…], 1))
…
c. Display the user ID, hit count, first name, and last name (first and last name are the 4th and 5th fields of the account record) for the first 5 elements, e.g.:
userid1 4 Cheryl West
userid2 8 Elizabeth Kerns
userid3 1 Melissa Roman
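One way to produce this display in PySpark (a sketch; it assumes the userreqs Pair RDD from the step 1 sketch and the accounts.csv file copied above):

pyspark> accounts = sc.textFile("/loudacre/accounts.csv") \
    .map(lambda line: line.split(',')) \
    .map(lambda values: (values[0], values))
pyspark> for (userid, (values, hits)) in \
             accounts.join(userreqs).take(5):
...          print userid, hits, values[3], values[4]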
Bonus Exercises
If you have more time, attempt the following challenges:
6. Challenge 1: Use keyBy to create an RDD of account data with the postal code (9th field in the CSV file) as the key.
• Tip: Assign this new RDD to a variable for use in the next challenge.
7. Challenge 2: Create a Pair RDD with postal code as the key and a list of names (Last Name,First Name) in that postal code as the value.
• Hint: First name and last name are the 4th and 5th fields, respectively.
• Optional: Try using the mapValues operation.
8. Challenge 3: Sort the data by postal code, then for the first five postal codes, display the code and list the names in that postal zone, such as:
--- 85003
Jenkins,Thad
Rick,Edward
Lindsay,Ivy
…
--- 85004
Morris,Eric
Reiser,Hazel
Gregg,Alicia
Preston,Elizabeth
…
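A combined sketch of the three challenges in PySpark (variable names are illustrative):

pyspark> accountsByPCode = sc.textFile("/loudacre/accounts.csv") \
    .map(lambda line: line.split(',')) \
    .keyBy(lambda values: values[8])
pyspark> namesByPCode = accountsByPCode \
    .mapValues(lambda values: values[4] + ',' + values[3]) \
    .groupByKey()
pyspark> for (pcode, names) in namesByPCode.sortByKey().take(5):
...          print '---', pcode
...          for name in names: print name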
This is the end of the Exercise
Hands-On Exercise: Write and Run a Spark Application
Files and Directories Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
Scala Project Directory: projects/countjpgs
Scala Classes: stubs.CountJPGs
solution.CountJPGs
Python Stub: stubs/CountJPGs.py
Python Solution: solutions/CountJPGs.py
In this exercise, you will write your own Spark application instead of using the interactive Spark Shell application.
Write a simple program that counts the number of JPG requests in a web log file. The name of the file should be passed into the program as an argument.
This is the same task you did earlier in the "Getting Started With RDDs" exercise. The logic is the same, but this time you will need to set up the SparkContext object yourself.
Depending on which programming language you are using, follow the appropriate set of instructions below to write a Spark program.
Before running your program, be sure to exit from the Spark Shell.
Writing a Spark Application in Python
You may use any text editor you wish. If you don’t have an editor preference, you may
wish to use gedit, which includes language-specific support for Python.
1. A simple stub file to get you started has been provided: ~/training_materials/sparkdev/stubs/CountJPGs.py. This stub imports the required Spark class and sets up your main code block. Copy this stub to your work area and edit it to complete this exercise.
2. Set up a SparkContext using the following code:
sc = SparkContext()
3. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.
4. At the end of the program, be sure to call:
sc.stop()
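Putting the pieces together, the completed Python program might look something like this (a sketch; the stub's exact layout may differ):

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext()
    # the log file (or file glob) is passed as the first argument
    logfile = sys.argv[1]
    count = sc.textFile(logfile) \
        .filter(lambda line: ".jpg" in line).count()
    print "Number of JPG requests:", count
    sc.stop()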
5. Run the program locally, passing the name of the log file to process, such as:
$ spark-submit CountJPGs.py /loudacre/weblogs/*
6. Skip the Scala section below and proceed to the section "Starting the Spark Standalone Cluster."
Writing a Spark Application in Scala
You may use any text editor you wish. If you don't have an editor preference, you may wish to use gedit, which includes language-specific support for Scala. If you are familiar with the IntelliJ IDEA IDE, you may choose to use that; the provided project directories include IntelliJ configuration.
A Maven project to get you started has been provided in the projects/countjpgs directory.
7. Edit the Scala code in src/main/scala/stubs/CountJPGs.scala.
8. Set up a SparkContext using the following code:
val sc = new SparkContext()
9. In the body of the program, load the file passed into the program, count the number of JPG requests, and display the count. You may wish to refer back to the "Getting Started with RDDs" exercise for the code to do this.
10. At the end of the program, be sure to call:
sc.stop
11. From the countjpgs working directory, build your project using the following command:
$ mvn package
12. If the build is successful, it will generate a JAR file called countjpgs-1.0.jar in countjpgs/target. Run the program using the following command:
$ spark-submit \
--class stubs.CountJPGs \
target/countjpgs-1.0.jar /loudacre/weblogs/*
• Note: Use --class solution.CountJPGs to run the solution instead.
Starting the Spark Standalone Cluster
13. In a terminal window, start the Spark Master and Spark Worker daemons:
$ sudo service spark-master start
$ sudo service spark-worker start
Note: You can stop the services by replacing start with stop, or force a service to restart by using restart. You may need to do this if you suspend and restart the VM.
14. View the Spark Standalone Cluster UI: start Firefox on your VM and visit the Spark Master UI by using the provided bookmark or visiting http://localhost:18080/.
15. You should not see any applications in the Running Applications or Completed Applications areas because you have not run any applications on the cluster yet.
16. A real-world Spark cluster would have several workers configured. In this class we have just one, running locally, which is named by the date it started, the host it is running on, and the port it is listening on.
17. Click on the worker ID link to view the Spark Worker UI and note that there are no executors currently running on the node.
18. In the previous section, you ran your application locally because you did not specify a master when starting it. Re-run the program, specifying the cluster master in order to run it on the cluster.
For Python:
$ spark-submit --master spark://localhost:7077 \
CountJPGs.py /loudacre/weblogs/*
For Scala:
$ spark-submit \
--class stubs.CountJPGs \
--master spark://localhost:7077 \
target/countjpgs-1.0.jar /loudacre/weblogs/*
19. Visit the Standalone Spark Master UI and confirm that the program is running on the cluster.
This is the end of the Exercise
Hands-On Exercise: Configure a Spark Application
Files Used in This Exercise:
Data files (HDFS): /loudacre/weblogs
Properties files (local): spark.conf
log4j.properties
In this exercise, you will practice setting various Spark configuration options.
You will work with the CountJPGs program you wrote in the prior exercise.
Setting Configuration Options at the Command Line
1. Re-run the CountJPGs Python or Scala program you wrote in the previous exercise, this time specifying an application name. For example:
$ spark-submit --master spark://localhost:7077 \
--name 'Count JPGs' \
CountJPGs.py /loudacre/weblogs/*
$ spark-submit \
--class stubs.CountJPGs \
--master spark://localhost:7077 \
--name 'Count JPGs' \
target/countjpgs-1.0.jar /loudacre/weblogs/*
2. Visit the Standalone Spark Master UI (http://localhost:18080/) and note that the application name listed is the one specified on the command line.
3. Optional: While the application is running, visit the Spark Application UI and view the Environment tab. Take note of the spark.* properties such as master, appName, and driver properties.
Setting Configuration Options in a Configuration File
4. Change to your working directory. (If you are working in Scala, that is the countjpgs project directory.)
5. Using a text editor, create a file in the working directory called myspark.conf, containing settings for the properties shown below:
spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M
6. Re-run your application, this time using the properties file instead of the script options to configure Spark properties:
$ spark-submit --properties-file myspark.conf \
CountJPGs.py /loudacre/weblogs/*
$ spark-submit --properties-file myspark.conf \
--class stubs.CountJPGs \
target/countjpgs-1.0.jar /loudacre/weblogs/*
7. While the application is running, view the Standalone Spark Master UI to confirm that the application name is correctly displayed as "My Spark App."
Setting Logging Levels
8. Copy the template file /etc/spark/conf/log4j.properties.template to log4j.properties in your working directory.
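For example, assuming your working directory is the current directory:

$ cp /etc/spark/conf/log4j.properties.template log4j.properties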
9. Edit log4j.properties. The first line currently reads:
log4j.rootCategory=INFO, console
Replace INFO with DEBUG:
log4j.rootCategory=DEBUG, console
10. Re-run your Spark application. Because the current directory is on the Java classpath, your log4j.properties file will set the logging level to DEBUG.
11. Notice that the output now contains both the INFO messages it did before and DEBUG messages, similar to what is shown below:
15/03/19 11:40:45 INFO MemoryStore: ensureFreeSpace(154293) called with
curMem=0, maxMem=311387750
15/03/19 11:40:45 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 150.7 KB, free 296.8 MB)
15/03/19 11:40:45 DEBUG BlockManager: Put block broadcast_0 locally took
79 ms
15/03/19 11:40:45 DEBUG BlockManager: Put for block broadcast_0 without
replication took 79 ms
Debug logging can be useful when debugging, testing, or optimizing your code, but in most cases it generates unnecessarily distracting output.
12. Edit the log4j.properties file to replace DEBUG with WARN and try again. This time notice that no INFO or DEBUG messages are displayed; only WARN messages.
13. You can also set the log level for the Spark Shell by placing the log4j.properties file in your working directory before starting the shell. Try starting the shell from the directory in which you placed the file and note that only WARN messages now appear.
Note: During the rest of the exercises, you may change these settings depending on whether you find the extra logging messages helpful or distracting.
This is the end of the Exercise
Hands-On Exercise: View Jobs and Stages in the Spark Application UI
Files and Data Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
/loudacre/accounts.csv
Solutions: solutions/SparkStages.pyspark
solutions/SparkStages.scalaspark
In this exercise you will use the Spark Application UI to view the execution stages for a job.
In a previous exercise, you wrote a script in the Spark Shell to join data from the accounts dataset with the weblogs dataset, in order to determine the total number of web hits for every account. Now you will explore the stages and tasks involved in that job.
Exploring Partitioning of File-Based RDDs
1. Start (or restart, if necessary) the Spark Shell. Although you would typically run a Spark application on a cluster, your course VM cluster has only a single worker node that can support only a single executor. To simulate a more realistic multi-node cluster, run in local mode with two threads. For Python:
$ pyspark --master local[2]
or for Scala:
$ spark-shell --master local[2]
2. Create an RDD based on the accounts data file (/loudacre/accounts.csv) and then call toDebugString on the RDD. The number of partitions is displayed in parentheses () before the RDD ID. How many partitions are in the resulting RDD?
pyspark> accounts=sc.textFile("/loudacre/accounts.csv")
pyspark> print accounts.toDebugString()
scala> val accounts=sc.
textFile("/loudacre/accounts.csv")
scala> accounts.toDebugString
3. Repeat this process, but specify a minimum of three partitions: textFile(filename,3). Does the RDD correctly have three partitions?
4. Create another RDD based on all the weblogs dataset files (/loudacre/weblogs/*) and then call toDebugString on the RDD. How many partitions are in the weblogs RDD?
pyspark> weblogs=sc.textFile("/loudacre/weblogs/*")
pyspark> print weblogs.toDebugString()
scala> val weblogs=sc.textFile("/loudacre/weblogs/*")
scala> weblogs.toDebugString
How does the number of files in the dataset compare to the number of partitions in the RDD?
5. Repeat this process, but specify only a subset of the files: those for the month of October 2013, /loudacre/weblogs/2013-10-*.log.
6. Bonus: Use foreachPartition to print out the first record of each partition.
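One possible approach to the bonus (a sketch; foreachPartition passes each partition to the function as an iterator, this assumes every partition has at least one record, and in local mode the output prints to the shell console):

pyspark> def printFirst(partition): print next(partition)
pyspark> weblogs.foreachPartition(printFirst)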
Setting up the Job
7. First, create an RDD of accounts, keyed by ID, with a lastname,firstname string as the value.
pyspark> accountsByID = accounts \
.map(lambda s: s.split(',')) \
.map(lambda values: \
(values[0],values[4] + ',' + values[3]))
scala> val accountsByID = accounts.
map(line => line.split(',')).
map(values => (values(0),values(4)+','+values(3)))
8. Construct an RDD with the total number of web hits for each user ID:
pyspark> userreqs = weblogs \
.map(lambda line: line.split()) \
.map(lambda words: (words[2],1)) \
.reduceByKey(lambda v1,v2: v1+v2)
scala> val userreqs = weblogs.
map(line => line.split(' ')).
map(words => (words(2),1)).
reduceByKey((v1,v2) => v1 + v2)
9. Then join the two RDDs by user ID, and construct a new RDD containing the name and total-hit values:
pyspark> accounthits = accountsByID.join(userreqs)\
.values()
scala> val accounthits =
accountsByID.join(userreqs).values
10. Print the results of accounthits.toDebugString and review the output. Based on this, see if you can determine:
a. How many stages are in this job?
b. Which stages are dependent on which?
c. How many tasks will each stage consist of?
Running and Reviewing the Job in the Spark Application UI
11. In your browser, visit the Spark Application UI by using the provided toolbar bookmark, or by visiting http://localhost:4040/.
12. In the Spark UI, make sure the Jobs tab is selected. No jobs are running yet, so the list will be empty.
13. Return to the shell and start the job by executing an action (saveAsTextFile):
pyspark> accounthits.\
saveAsTextFile("/loudacre/userreqs")
scala> accounthits.
saveAsTextFile("/loudacre/userreqs")
14. Reload the Spark UI Jobs page in your browser. Your job will appear in the Active Jobs list until it completes, and then it will display in the Completed Jobs list.
15. Click on the job description (which is the last action in the job) to see the stages. As the job progresses you may want to refresh the page a few times.
Things to note:
a. How many stages are in the job? Does it match the number you expected from the RDD's toDebugString output?
b. The stages are numbered, but the numbers do not relate to the order of execution. Note the times the stages were submitted to determine the order. Does the order match what you expected based on RDD dependency?
c. How many tasks are in each stage? The number of tasks in the first stages corresponds to the number of partitions.
d. The Shuffle Read and Shuffle Write columns indicate how much data was copied between tasks. This is useful to know because copying too much data across the network can cause performance issues.
16. Click on a stage to view details about that stage.
Things to note:
a. The Summary Metrics area shows you how much time was spent on various steps. This can help you narrow down performance problems.
b. The Tasks area lists each task. The Locality Level column indicates whether the process ran on the same node where the partition was physically stored. Remember that Spark will attempt to always run tasks where the data is, but may not always be able to if the node is busy.
c. In a real-world cluster, the Executor column in the Tasks area would display the different worker nodes that ran the tasks. In this single-node cluster, all tasks run on the same host: localhost.
17. When the job is complete, return to the Jobs tab to see the final statistics for the number of tasks executed and the time the job took.
18. Optional: Try re-running the last action. You will need to either delete the saveAsTextFile output directory in HDFS, or specify a different directory name. You will probably find that the job completes much faster, and that several stages (and the tasks in them) show as "skipped."
Bonus question: Which tasks were skipped and why?
Leave the Spark Shell running for the next exercise.
This is the end of the Exercise
Hands-On Exercise: Persist an RDD
Files and Data Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
/loudacre/accounts.csv
Job Setup: solutions/SparkStages.pyspark
solutions/SparkStages.scalaspark
In this exercise you will explore the performance effect of caching (that is, persisting to memory) an RDD.
1. Make sure the Spark Shell is still running from the last exercise. If it isn't, restart it (in local mode with two threads) and paste in the job setup code from the solution file or the previous exercise.
2. This time, to start the job, you are going to perform a slightly different action than last time: count the number of user accounts with a total hit count greater than five:
pyspark> accounthits\
.filter(lambda (firstlast,hitcount): hitcount > 5)\
.count()
scala> accounthits.filter(pair => pair._2 > 5).count()
3. Cache (persist to memory) the RDD by calling accounthits.persist().
4. In your browser, view the Spark Application UI and select the Storage tab. At this point, you have marked your RDD to be persisted, but have not yet performed an action that would cause it to be materialized and persisted, so you will not yet see any persisted RDDs.
5. In the Spark Shell, execute the count again.
6. View the RDD's toDebugString. Notice that the output indicates the persistence level selected.
7. Reload the Storage tab in your browser, and this time note that the RDD you persisted is shown. Click on the RDD ID to see details about partitions and persistence.
8. Click on the Executors tab and take note of the amount of memory used and available for your one worker node.
Note that the classroom environment has a single worker node with a small amount of memory allocated, so you may see that not all of the dataset is actually cached in memory. In the real world, for good performance a cluster will have more nodes, each with more memory, so that more of your active data can be cached.
9. Optional: Set the RDD's persistence level to StorageLevel.DISK_ONLY and compare the storage report in the Spark Application Web UI. Hint: Because you have already persisted the RDD at a different level, you will need to call unpersist() before you can set a new level.
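A sketch of the optional step in PySpark (StorageLevel is imported from the pyspark package):

pyspark> from pyspark import StorageLevel
pyspark> accounthits.unpersist()
pyspark> accounthits.persist(StorageLevel.DISK_ONLY)
pyspark> accounthits.count()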
This is the end of the Exercise
Hands-On Exercise: Implement an Iterative Algorithm
Files and Data Used in This Exercise:
Data files (HDFS): /loudacre/devicestatus_etl/
Stubs: stubs/KMeansCoords.pyspark
stubs/KMeansCoords.scalaspark
Solutions: solutions/KMeansCoords.pyspark
solutions/KMeansCoords.scalaspark
In this exercise, you will practice implementing iterative algorithms in Spark by calculating k-means for a set of points.
Reviewing the Data
In the bonus section of the "Use RDDs to Transform a Dataset" exercise, you used Spark to extract the date, maker, device ID, latitude, and longitude from the devicestatus.txt data file, and store the results in the HDFS directory /loudacre/devicestatus_etl.
If you did not have time to complete that bonus exercise, run the solution script now following the two steps below.
• Copy ~/training_materials/sparkdev/data/devicestatus.txt to the /loudacre/ directory in HDFS.
• Run the Spark script ~/training_materials/solutions/DeviceStatusETL (either .pyspark or .scalaspark, depending on which language you are using).
Examine the data in the dataset. Note that the latitude and longitude are the 4th and 5th fields, respectively, such as:
2014-03-15:10:10:20,Sorrento,8cc3b47e-bd01-4482-b500-
28f2342679af,33.6894754264,-117.543308253
2014-03-15:10:10:20,MeeToo,ef8c7564-0a1a-4650-a655-
c8bbd5f8f943,37.4321088904,-121.485029632
Calculate k-means for Device Location
If you are already familiar with calculating k-means, try doing the exercise on your own. Otherwise, follow the step-by-step process below.
1. Start by copying the provided KMeansCoords stub file, which contains the following convenience functions used in calculating k-means:
• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point
• addPoints: given two points, returns a point which is the sum of the two points; that is, (x1+x2, y1+y2)
• distanceSquared: given two points, returns the squared distance of the two. This is a common calculation required in cluster analysis.
2. Set the variable K (the number of means to calculate). For this exercise use K = 5.
3. Set the variable convergeDist. This will be used to decide when the k-means calculation is done: when the amount the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0; this number represents a "good enough" solution. For this exercise, use a value of 0.1.
4. Parse the input file, which is comma-delimited, into (latitude, longitude) pairs (the 4th and 5th fields in each line). Only include known locations; that is, filter out (0,0) locations. Be sure to persist (cache) the resulting RDD because you will access it each time through the iteration.
5. Create a k-length array called kPoints by taking a random sample of k location points from the RDD as starting means (center points); for example:
data.takeSample(False, K, 42)
6. Iteratively calculate a new set of k means until the total distance between the means calculated for this iteration and the last is smaller than convergeDist. For each iteration:
a. For each coordinate point, use the provided closestPoint function to map each point to the index in the kPoints array of the location closest to that point. The resulting RDD should be keyed by the index, and the value should be the pair (point, 1). The value 1 will later be used to count the number of points closest to a given mean; for example:
(1, ((37.43210, -121.48502), 1))
(4, ((33.11310, -111.33201), 1))
(0, ((39.36351, -119.40003), 1))
(1, ((40.00019, -116.44829), 1))
…
b. Reduce the result: for each center in the kPoints array, sum the latitudes and longitudes, respectively, of all the points closest to that center, and count the number of closest points. For example:
(0, ((2638919.87653,-8895032.182481), 74693)))
(1, ((3654635.24961,-12197518.55688), 101268))
(2, ((1863384.99784,-5839621.052003), 48620))
(3, ((4887181.82600,-14674125.94873), 126114))
(4, ((2866039.85637,-9608816.13682), 81162))
c. The reduced RDD should have (at most) k members. Map each to a new center point by calculating the average latitude and longitude for each set of closest points: that is, map (index, ((totalX, totalY), n)) to (index, (totalX/n, totalY/n)).
d. Collect these new points into a local map or array keyed by index.
e. Use the provided distanceSquared method to calculate how much each center "moved" between the current iteration and the last. That is, for each center in kPoints, calculate the distance between that point and the corresponding new point, and sum those distances. That is the delta between iterations; when the delta is less than convergeDist, stop iterating.
f. Copy the new center points to the kPoints array in preparation for the next iteration.
7. When the iteration is complete, display the final k center points.
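A condensed PySpark sketch of the whole loop, assuming the closestPoint, addPoints, and distanceSquared functions from the stub are already defined:

pyspark> K = 5
pyspark> convergeDist = 0.1
pyspark> data = sc.textFile("/loudacre/devicestatus_etl/*") \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: (float(fields[3]), float(fields[4]))) \
    .filter(lambda point: point != (0.0, 0.0)) \
    .persist()
pyspark> kPoints = data.takeSample(False, K, 42)
pyspark> tempDist = float("inf")
pyspark> while tempDist > convergeDist:
...          closest = data.map(
...              lambda p: (closestPoint(p, kPoints), (p, 1)))
...          pointStats = closest.reduceByKey(
...              lambda (p1, n1), (p2, n2): (addPoints(p1, p2), n1 + n2))
...          newPoints = pointStats.map(
...              lambda (i, ((totalX, totalY), n)):
...                  (i, (totalX / n, totalY / n))).collectAsMap()
...          tempDist = sum(distanceSquared(kPoints[i], newPoints[i])
...                         for i in newPoints)
...          for i in newPoints:
...              kPoints[i] = newPoints[i]
pyspark> print kPoints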
This is the end of the Exercise
Hands-On Exercise: Use Broadcast Variables
Files Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
Data files (local):
~/training_materials/sparkdev/data/targetmodels.txt
Stubs: stubs/TargetModels.pyspark
stubs/TargetModels.scalaspark
Solutions: solutions/TargetModels.pyspark
solutions/TargetModels.scalaspark
In this exercise, you will filter web requests to include only those from devices included in a list of target models.
Loudacre wants to do some analysis on web traffic produced by specific devices. The list of target models is in ~/training_materials/sparkdev/data/targetmodels.txt.
Filter the web server logs to include only those requests from devices in the list. The model name of the device will appear in the log file line. Use a broadcast variable to pass the list of target devices to the workers that will run the filter tasks.
Hint: Use the stub file for this exercise in ~/training_materials/sparkdev/stubs for the code to load the list of target models.
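A minimal sketch of one possible approach (the stub contains the actual loading code; this assumes the model name appears verbatim in matching log lines, and the variable names are illustrative):

pyspark> targetfile = \
    "/home/training/training_materials/sparkdev/data/targetmodels.txt"
pyspark> targetlist = [line.rstrip() for line in open(targetfile)]
pyspark> targetModels = sc.broadcast(targetlist)
pyspark> targethits = sc.textFile("/loudacre/weblogs/*") \
    .filter(lambda line: \
        any(model in line for model in targetModels.value))
pyspark> targethits.count()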
This is the end of the Exercise
Hands-On Exercise: Use Accumulators
Files Used in This Exercise:
Data files (HDFS): /loudacre/weblogs/*
Solutions: solutions/RequestAccumulator.pyspark
solutions/RequestAccumulator.scalaspark
In this exercise, you will count the number of different types of files requested in a set of web server logs.
Using accumulators, count the number of each type of file (HTML, CSS, and JPG) requested in the web server log files.
Hint: Use the file extension string to determine the type of request, such as .html, .css, or .jpg.
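One possible sketch using three accumulators (function and variable names are illustrative):

pyspark> jpgcount = sc.accumulator(0)
pyspark> htmlcount = sc.accumulator(0)
pyspark> csscount = sc.accumulator(0)
pyspark> def countFileType(line):
...          if '.jpg' in line: jpgcount.add(1)
...          elif '.html' in line: htmlcount.add(1)
...          elif '.css' in line: csscount.add(1)
pyspark> sc.textFile("/loudacre/weblogs/*").foreach(countFileType)
pyspark> print "JPG:", jpgcount.value, "HTML:", htmlcount.value, \
             "CSS:", csscount.value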
This is the end of the Exercise
Hands-On Exercise: Use Spark SQL for ETL
Files and Data Used in this Exercise
MySQL table: loudacre.webpage
Output HDFS directory: /loudacre/webpage_files
Solutions: solutions/SparkSQL-webpage-files.pyspark
solutions/SparkSQL-webpage-files.scalaspark
In this exercise, you will use Spark SQL to load data from MySQL, process it, and store it to HDFS.
Reviewing the Data in MySQL
Review the data currently in the MySQL loudacre.webpage table.
1. List the columns and types in the table:
$ mysql -utraining -ptraining loudacre \
-e"DESCRIBE webpage"
2. View the first few rows from the table:
$ mysql -utraining -ptraining loudacre \
-e"SELECT * FROM webpage LIMIT 5"
Note that the data in the associated_files column is a comma-delimited string. Loudacre would like to make this data available in an Impala table, but in order to perform the required analysis, the associated_files data must be extracted and normalized. Your goal in the next section is to use Spark SQL to extract the data in the column, split the string, and create a new dataset in HDFS containing each web page number and its associated files in separate rows.
Loading the Data from MySQL
3. If necessary, start the Spark Shell.
4. Import the SQLContext class definition, and define a SQL context:
pyspark> from pyspark.sql import SQLContext
pyspark> sqlCtx = SQLContext(sc)
scala> import org.apache.spark.sql.SQLContext
scala> val sqlCtx = new SQLContext(sc)
5. Create a new DataFrame based on the webpage table from the database:

pyspark> webpages = sqlCtx.load(source="jdbc", \
url="jdbc:mysql://localhost/loudacre?user=training&password=training", \
dbtable="webpage")
scala> val webpages = sqlCtx.load("jdbc",
Map("url" -> "jdbc:mysql://localhost/loudacre?user=training&password=training",
"dbtable" -> "webpage"))
6. Examine the schema of the new DataFrame by calling webpages.printSchema().
7. Create a new DataFrame by selecting the web_page_num and associated_files columns from the existing DataFrame:

pyspark> assocfiles = \
webpages.select(webpages.web_page_num, \
webpages.associated_files)
scala> val assocfiles =
webpages.select(webpages("web_page_num"),webpages("associated_files"))
8. In order to manipulate the data using Spark, convert the DataFrame into a Pair RDD using the map method. The input into the map method is a Row object. The key is the web_page_num value (the first value in the row), and the value is the associated_files string (the second value in the row).
In Python, you can dynamically reference the column value of the row by name:
pyspark> afilesrdd = assocfiles.map(lambda row: \
(row.web_page_num,row.associated_files))
In Scala, use the correct get method for the type of value with the column index:
scala> val afilesrdd = assocfiles.map(row =>
(row.getInt(0),row.getString(1)))
9. Now that you have an RDD, you can use the familiar flatMapValues transformation to split and extract the filenames in the associated_files column:
pyspark> afilesrdd2 = afilesrdd\
.flatMapValues(lambda filestring:filestring.split(','))
scala> val afilesrdd2 =
afilesrdd.flatMapValues(filestring =>
filestring.split(','))
10. Create a new DataFrame from the RDD:
pyspark> afiledf = sqlCtx.createDataFrame(afilesrdd2)
scala> val afiledf = sqlCtx.createDataFrame(afilesrdd2)
11. Call printSchema on the new DataFrame. Note that Spark SQL gave the columns generic names: _1 and _2.
12. Create a new DataFrame by renaming the columns to reflect the data they hold.
In Python, use the withColumnRenamed method to rename the two columns:
pyspark> finaldf = afiledf. \
withColumnRenamed('_1','web_page_num'). \
withColumnRenamed('_2','associated_file')
In Scala, you can use the toDF shortcut method to create a new DataFrame based on an existing one with the columns renamed:
scala> val finaldf = afiledf.
toDF("web_page_num","associated_file")
13. Call printSchema to confirm that the new DataFrame has the correct column names.
14. Your final DataFrame contains the processed data, so call finaldf.collect() to confirm the data is correct.
15. Optional: Save the final DataFrame in Parquet format (the default) in /loudacre/webpage_files. The code is the same in Scala and Python.
> finaldf.save("/loudacre/webpage_files")
Confirm that the data saved in HDFS is correct. Note that the data will be in Parquet file format, which is a binary format. This means that when you view the file, only some of the content will be in readable string form. This is expected behavior.
This is the end of the Exercise
Appendix A: Enabling iPython Notebook
iPython Notebook is installed on the VM for this course. To use it instead of the command-line version of iPython, follow these steps:
1. Open the following file for editing: /home/training/.bashrc
2. Uncomment the following line (remove the leading #).
# export PYSPARK_DRIVER_PYTHON_OPTS='notebook ……..jax'
3. Save the file.
4. Open a new terminal window. (It must be a new terminal so that it loads your edited .bashrc file.)
5. Enter pyspark in the terminal. This will cause a browser window to open showing the iPython Notebook home page.
6. On the right-hand side of the page, select Python 2 from the New menu.
7. Enter some Spark code, such as the example below, and use the play button to execute it.
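For example (any small job will do; this one assumes the frostroad.txt file from the earlier exercises is still present):

sc.textFile("file:/home/training/training_materials/sparkdev/data/frostroad.txt").count()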
8. Notice the output displayed.
This is the end of the Appendix
Data Model Reference
Note that not all of the information below applies to this customized version of the course.
Tables Imported from MySQL
The following depicts the structure of the MySQL tables imported into HDFS using Sqoop. The primary key column from the database, if any, is denoted by bold text:
customers: 201,375 records (imported to /dualcore/customers)

Index  Field    Description           Example
0      cust_id  Customer ID           1846532
1      fname    First name            Sam
2      lname    Last name             Jones
3      address  Address of residence  456 Clue Road
4      city     City                  Silicon Sands
5      state    State                 CA
6      zipcode  Postal code           94306
employees: 61,712 records (imported to /dualcore/employees and later used as an external table in Hive)

Index  Field      Description              Example
0      emp_id     Employee ID              BR5331404
1      fname      First name               Betty
2      lname      Last name                Richardson
3      address    Address of residence     123 Shady Lane
4      city       City                     Anytown
5      state      State                    CA
6      zipcode    Postal code              90210
7      job_title  Employee's job title     Vice President
8      email      E-mail address           [email protected]
9      active     Is actively employed?    Y
10     salary     Annual pay (in dollars)  136900
orders: 1,662,951 records (imported to /dualcore/orders)

Index  Field       Description         Example
0      order_id    Order ID            3213254
1      cust_id     Customer ID         1846532
2      order_date  Date/time of order  2013-05-31 16:59:34

order_details: 3,333,244 records (imported to /dualcore/order_details)

Index  Field     Description  Example
0      order_id  Order ID     3213254
1      prod_id   Product ID   1754836
products: 1,114 records (imported to /dualcore/products)

Index  Field        Description                  Example
0      prod_id      Product ID                   1273641
1      brand        Brand name                   Foocorp
2      name         Name of product              4-port USB Hub
3      price        Retail sales price, in cents 1999
4      cost         Wholesale cost, in cents     1463
5      shipping_wt  Shipping weight (in pounds)  1
suppliers: 66 records (imported to /dualcore/suppliers)

Index  Field    Description          Example
0      supp_id  Supplier ID          1000
1      fname    First name           ACME Inc.
2      lname    Last name            Sally Jones
3      address  Address of office    123 Oak Street
4      city     City                 New Athens
5      state    State                IL
6      zipcode  Postal code          62264
7      phone    Office phone number  (618) 555-5914
Hive/Impala Tables
The following is a record count for tables that are created or queried during the hands-on exercises. Use the DESCRIBE tablename command to see the table structure.

Table Name          Record Count
ads 788,952
cart_items 33,812
cart_orders 12,955
cart_shipping 12,955
cart_zipcodes 12,955
checkout_sessions 12,955
customers 201,375
employees 61,712
loyalty_program 311
order_details 3,333,244
orders 1,662,951
products 1,114
ratings 21,997
web_logs 412,860
Other Data Added to HDFS
The following describes the structure of other important datasets added to HDFS.
Combined Ad Campaign Data: 788,952 records total, stored in two directories:
• /dualcore/ad_data1 (438,389 records)
• /dualcore/ad_data2 (350,563 records)

Index  Field         Description                 Example
0      campaign_id   Uniquely identifies our ad  A3
1      date          Date of ad display          05/23/2013
2      time          Time of ad display          15:39:26
3      keyword       Keyword that triggered ad   tablet
4      display_site  Domain where ad shown       news.example.com
5      placement     Location of ad on Web page  INLINE
6      was_clicked   Whether ad was clicked      1
7      cpc           Cost per click, in cents    106
access.log: 412,860 records (uploaded to /dualcore/access.log). This file is used to populate the web_logs table in Hive. Note that the RFC931 and Username fields are seldom populated in log files for modern public Web sites and are ignored in our RegexSerDe.

Index  Field/Description     Example
0      IP address            192.168.1.15
1      RFC931 (Ident)        -
2      Username              -
3      Date/Time             [22/May/2013:15:01:46 -0800]
4      Request               "GET /foo?bar=1 HTTP/1.1"
5      Status code           200
6      Bytes transferred     762
7      Referer               "http://dualcore.com/"
8      User agent (browser)  "Mozilla/4.0 [en] (WinNT; I)"
9      Cookie (session ID)   "SESSION=8763723145"
Regular Expression Reference
The following is a brief tutorial intended for the convenience of students who don't have experience using regular expressions or may need a refresher. A more complete reference can be found in the documentation for Java's Pattern class:
http://tiny.cloudera.com/regexpattern
Introduction to Regular Expressions
Regular expressions are used for pattern matching. There are two kinds of patterns in regular expressions: literals and metacharacters. Literal values are used to match precise patterns while metacharacters have special meaning; for example, a dot will match any single character. Here's the complete list of metacharacters, followed by explanations of those that are commonly used:
< ( [ { \ ^ - = $ ! | ] } ) ? * + . >
Literal characters are any characters not listed as a metacharacter. They're matched exactly, but if you want to match a metacharacter, you must escape it with a backslash. Since a backslash is itself a metacharacter, it must also be escaped with a backslash. For example, you would use the pattern \\. to match a literal dot.
Regular expressions support patterns much more flexible than simply using a dot to match any character. The following explains how to use character classes to restrict which characters are matched.

Character Classes
[057]  Matches any single digit that is either 0, 5, or 7
[0-9]  Matches any single digit between 0 and 9
[3-6]  Matches any single digit between 3 and 6
[a-z]  Matches any single lowercase letter
[C-F]  Matches any single uppercase letter between C and F

For example, the pattern [C-F][3-6] would match the string D3 or F5 but would fail to match G3 or C7.
There are also some built-in character classes that are shortcuts for common sets of characters.
Predefined Character Classes
\\d  Matches any single digit
\\w  Matches any word character (letters of any case, plus digits or underscore)
\\s  Matches any whitespace character (space, tab, newline, etc.)

For example, the pattern \\d\\d\\d\\w would match the string 314d or 934X but would fail to match 93X or Z871.
Sometimes it's easier to choose what you don't want to match instead of what you do want to match. These three can be negated by using an uppercase letter instead.

Negated Predefined Character Classes
\\D  Matches any single non-digit character
\\W  Matches any non-word character
\\S  Matches any non-whitespace character

For example, the pattern \\D\\D\\W would match the string ZX# or @ P but would fail to match 93X or 36_.
The metacharacters shown above each match exactly one character. You can specify them multiple times to match more than one character, but regular expressions support the use of quantifiers to eliminate this repetition.

Matching Quantifiers
{5}    Preceding character may occur exactly five times
{0,6}  Preceding character may occur between zero and six times
?      Preceding character is optional (may occur zero or one times)
+      Preceding character may occur one or more times
*      Preceding character may occur zero or more times

By default, quantifiers try to match as many characters as possible. If you used the pattern ore.+a on the string Dualcore has a store in Florida, you might be surprised to learn that it matches ore has a store in Florida rather than ore ha or ore in Florida as you might have expected. This is because matches are "greedy" by default. Adding a question mark makes the quantifier match as few characters as possible instead, so the pattern ore.+?a on this string would match ore ha.
Finally, there are two special metacharacters that match zero characters. They are used to ensure that a string matches a pattern only when it occurs at the beginning or end of a string.

Boundary Matching Metacharacters
^  Matches only at the beginning of a string
$  Matches only at the end of a string

NOTE: When used inside square brackets (which denote a character class), the ^ character is interpreted differently. In that context, it negates the match. Therefore, the pattern [^0-9] is equivalent to the predefined character class \\D described earlier.