
Explore Big Data with Hadoop and InfoSphere BigInsights

    Cynthia M. Saracco ([email protected])

    August 15, 2014


    Contents

Lab 1  Overview
    1.1. About your environment
    1.2. Getting started

Lab 2  Issuing basic Hadoop commands
    2.1. Creating a directory in your distributed file system
    2.2. Copying data into HDFS
    2.3. Running a sample MapReduce application

Lab 3  Exploring and administering your cluster with the BigInsights Web console
    3.1. Getting started with the Web console
    3.2. Administering BigInsights
    3.3. Working with the distributed file system (HDFS)
    3.4. Managing and launching pre-built applications from the Web catalog

Lab 4  Analyzing social media data with BigSheets
    4.1. Creating a workbook
    4.2. Analyzing and customizing your workbook
    4.3. Creating charts
    4.4. Creating a Big SQL table based on your workbook
    4.5. Optional: Exporting your workbook data

Lab 5  Querying data with Big SQL
    5.1. Creating a project and executing Big SQL statements
    5.2. Creating sample tables and loading sample data
    5.3. Querying tables with joins, aggregations and more
    5.4. Optional: Using SerDes for non-traditional data
    5.5. Optional: Developing a JDBC client application with Big SQL

Lab 6  Summary


    Lab 1 Overview

    In this hands-on lab, you'll learn how to work with Big Data using Apache Hadoop and InfoSphere BigInsights, IBM's Hadoop-based platform. In particular, you'll learn the basics of working with the Hadoop Distributed File System (HDFS) and see how to administer your Hadoop-based environment using the BigInsights Web console. After launching a sample MapReduce application, you'll explore a more sophisticated scenario involving social media data. In doing so, you'll learn how to use a spreadsheet-style interface to discover insights about the global coverage of a popular brand without writing any code. Finally, you'll learn how to apply industry standard SQL to data managed by BigInsights through IBM's Big SQL technology. Indeed, you'll have a chance to create tables and execute complex queries over data in HDFS, including data derived from a relational data warehouse.

    Ready to get started?

After completing this hands-on lab, you'll be able to:

    Work directly with Apache Hadoop through file system commands

    Inspect and administer your cluster through the BigInsights Web Console

    Explore big data using a spreadsheet-style tool

    Use Big SQL to create tables and issue complex queries

    Allow 2 - 3 hours to complete this lab.

    This lab was developed by Cynthia M. Saracco, IBM Silicon Valley Lab. Please post questions or comments about this lab or the technologies it describes to the forum on Hadoop Dev at https://developer.ibm.com/hadoop/.

    1.1. About your environment

    This lab was developed for the InfoSphere BigInsights 3.0 Quick Start Edition VMware image. If necessary, download and install the single-node cluster VMware image from this site: http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html

    The VMware image is set up in the following manner:

Account                      User ID    Password
VM image root account        root       password
VM image lab user account    biadmin    biadmin
BigInsights administrator    biadmin    biadmin
Big SQL administrator        bigsql     bigsql
Lab user                     biadmin    biadmin


Property                       Value
Host name                      bivm.ibm.com
BigInsights Web Console URL    http://bivm.ibm.com:8080
Big SQL database name          bigsql
Big SQL port number            51000


    About the screen captures, sample code, and environment configuration

    Screen captures in this lab depict examples and results that may vary from what you see when you complete the exercises. In addition, some code examples may need to be customized to match your environment. For example, you may need to alter directory path information or user ID information.

    1.2. Getting started

    To get started with the lab exercises, you need to install and launch the VMware image as well as start the required services.

    __1. If necessary, obtain a copy of the BigInsights 3.0 Quick Start Edition VMware image from IBM's external download site (http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/downloads.html). Use the image for the single-node cluster.

    __2. Follow the instructions provided to decompress (unzip) the file and install the image on your laptop. Note that there is a README file with additional information.

    __3. If necessary, install VMware player or other required software to run VMware images. Details are in the README file provided with the BigInsights VMware image.

    __4. Launch the VMware image. When logging in for the first time, use the root ID (with a password of password). Follow the instructions to configure your environment, accept the licensing agreement, and enter the passwords for the root and biadmin IDs (root/password and biadmin/biadmin) when prompted. This is a one-time only requirement.


    __5. When the one-time configuration process is completed, you will be presented with a SUSE Linux log in screen. Log in as biadmin with a password of biadmin.


    __6. Verify that your screen appears similar to this:

    __7. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh)

    Wait until the operation completes. This may take several minutes, depending on your machine's resources.


    __8. Verify that all required BigInsights services are up and running. From a terminal window, issue this command: $BIGINSIGHTS_HOME/bin/status.sh.

    __9. Inspect the results, a subset of which are shown below. Verify that, at a minimum, the following components started successfully: hdm, zookeeper, hadoop, catalog, hive, bigsql, oozie, console, and httpfs.

    Now you're ready to start working with big data!

If you have any questions or need help getting your environment up and running, visit Hadoop Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to the forum.

    You cannot proceed with subsequent lab exercises until you've logged into the VMware image and launched the necessary BigInsights services.


    Lab 2 Issuing basic Hadoop commands

In this exercise, you'll work directly with Apache Hadoop to perform some basic tasks involving the Hadoop Distributed File System (HDFS) and launch a sample application. All the work you'll perform here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As mentioned earlier, Hadoop is part of IBM's InfoSphere BigInsights platform.

    Allow 15 minutes to complete this lab module.

    2.1. Creating a directory in your distributed file system

    __1. Click the BigInsights Shell icon.

    __2. Select the Terminal icon to open a terminal window.

    __3. Execute the following Hadoop file system command to create a directory in HDFS for your work:

    hadoop fs -mkdir /user/biadmin/test

    Note that HDFS is distinct from your Unix/Linux local file system directory, and working with HDFS requires using hadoop fs commands.
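For example, the same listing operation targets a different file system depending on which command you issue. A quick sketch using paths from this lab:

ls /home/biadmin                # lists a directory in the local Linux file system
hadoop fs -ls /user/biadmin     # lists a directory in HDFS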

    2.2. Copying data into HDFS

    __1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses directory.

    ls /home/biadmin/licenses

    Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a sample data file for a future exercise.

    __2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.

    hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test


    __3. List the contents of your target HDFS directory to verify that the file was successfully copied.

    hadoop fs -ls /user/biadmin/test

    2.3. Running a sample MapReduce application

    WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in Java, it simply scans through input document(s) and, for each word, returns the total number of occurrences found. You can read more about WordCount on the Apache wiki (http://wiki.apache.org/hadoop/WordCount).

    Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to do that with WordCount.

    __1. Execute the following command to launch the sample WordCount application provided with your Hadoop distribution.

    hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount /user/biadmin/test WordCount_output

This command specifies that the wordcount application contained in the specified .jar file is to be launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This directory will be created automatically as a result of executing this application.

    NOTE: If the output folder already exists or if you try to rerun a successful MapReduce job with the same parameters, you will receive an error message. This is the default behavior of the sample WordCount application.
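If you do want to rerun the job, one option is to remove the previous output directory first. A sketch, assuming the Hadoop 2.x shell syntax shipped with BigInsights 3.0:

hadoop fs -rm -r /user/biadmin/WordCount_output

You can then reissue the original hadoop jar command with the same parameters.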


    __2. Inspect the output of your job.

    hadoop fs -ls WordCount_output

In this case, the output was small and contained in a single file. If you had run WordCount against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00001, part-r-00002, and so on).

__3. To view the contents of the part-r-00000 file, issue this command:

    hadoop fs -cat WordCount_output/*00

    Partial output is shown here:


__4. Optionally, inspect details about your job. Open a Web browser, or click on the web console icon on your desktop and open a new tab. Access the URL for Hadoop's Job Tracker (http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to locate the Job ID associated with the Word Count application. Click on the Job ID link to review details, such as the number of Map and Reduce tasks launched for your application, the number of bytes read and written, etc. Partial output is shown in the second image that follows.


    Lab 3 Exploring and administering your cluster with the BigInsights Web console

    As you saw in the previous lab, Apache Hadoop users typically work through a command line interface to perform many common tasks. This lab introduces you to the BigInsights Web console, which enables you to administer your cluster, work with HDFS, launch jobs, and perform many other tasks using a graphical interface.

After completing this hands-on lab, you'll be able to:

    Launch the Web console.

    Work with popular resources accessible through the Welcome page.

    Administer BigInsights by inspecting the status of your cluster and accessing tools for open source components provided with BigInsights.

    Work with the distributed file system. In particular, you'll explore the HDFS directory structure, create subdirectories, and upload files to HDFS.

    Manage and launch pre-built applications from a Web catalog.

    Inspect the status of previously launched applications (jobs) and review their output.

Allow 30 minutes to complete this section of the lab.

    This lab is an introduction to a subset of console functions. Real-time monitoring, dashboards, alerts, and application linking are among the more advanced console functions that are beyond this lab's scope.

    3.1. Getting started with the Web Console

    In this exercise, you will launch the console and inspect its Welcome page.

    __1. Launch the BigInsights Web console. Direct your browser to http://bivm.ibm.com:8080 or click the Web Console icon on your desktop.

    __2. Log in with your user name and password (biadmin / biadmin).


    __3. Verify that your Web console appears similar to this:

    __4. Briefly skim through the links provided in these sections to become familiar with resources available to you:

Tasks: Quick access to popular BigInsights tasks
Quick Links: Links to internal and external resources and downloads to enhance your environment
Learn More: Online resources for learning more about BigInsights

    3.2. Administering BigInsights

The Web console allows administrators to inspect the overall health of the system as well as perform basic functions, such as starting and stopping specific servers or components, adding nodes to the cluster, and so on. You'll explore a subset of these capabilities here.


    __5. Click on the Cluster Status tab at the top of the page.

__6. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster that had several services running. One service -- Monitoring -- was unavailable. Your display may differ somewhat. It's not necessary for all BigInsights services to be running to complete the exercises in this lab.

    __7. Click on the Hive service and note the detailed information provided for this service in the pane at right. For example, you can see the URL for Hive's Web interface and its process ID. In addition, note that you can start and stop services (such as the Hive service) from the Cluster Status page of the console.


__8. Optionally, cut-and-paste the URL for Hive's Web interface into a new tab of your browser. You'll see an open source tool provided with Hive for administration purposes, as shown below.

    Other open source tools provided with Apache Hadoop are also available through IBM's packaged distribution (BigInsights), as you'll see shortly. Close this browser tab.

    __9. Click on the Welcome page of your Web console.

    __10. Click on the Access secure cluster servers button in the Quick Links section at right.

    If nothing appears, verify that the pop-up blocker of your browser is disabled; a prompt should appear at the top of the page if pop-ups are blocked.

__11. Inspect the list of server components for which there are additional Web-based tools. The BigInsights console displays the URLs you can use to access each of these Web sites directly. (This information will only appear if the pop-up blocker is disabled in your browser.)

    __12. Click on the jobtracker alias. The display should be familiar to you -- it's the same one you saw in the previous lab that introduced you to some basic Hadoop facilities.


    3.3. Working with the distributed file system (HDFS)

    In this section, you'll learn how to use the Web console to create directories in HDFS, navigate the file system, and upload small files -- tasks you performed earlier through a command-line interface. In addition, you'll perform a few other file-related tasks as well. Many people find the console's graphical interface to be easier to use than the command-line interface.

    __1. Click on the Files tab at the top of the page.

__2. Expand the DFS directory tree in the left pane to display the contents of /user/biadmin. Note the presence of the /WordCount_output and /test subdirectories, which you created in an earlier lab. If desired, expand each directory and inspect its contents.


    __3. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply position your cursor on each icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file or directory from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page.

__4. Delete the /user/biadmin/test directory and its contents. Position your cursor on this directory, click the red X icon, and click Yes when prompted.


__5. Create a new subdirectory in /user/biadmin. With your cursor positioned on /user/biadmin, click the create directory icon.

    __6. When a pop-up window appears, specify test2 as the new directory's name and click OK.


    __7. Expand the directory hierarchy to verify that your new subdirectory was created.

    __8. Upload a file into this directory from your local file system. Click the upload icon.


    __9. When a pop-up window appears, click the Browse button to navigate through your local file system to /home/biadmin/licenses. Select the BIlicense_en.txt file and click Open.

__10. Expand the /user/biadmin/test2 directory and verify that the BIlicense_en.txt file was successfully copied into HDFS. Note that the right pane of the Web console previews the file's contents.


    3.4. Managing and launching pre-built applications from the Web catalog

The Web console includes a catalog of ready-made applications that users can launch through a graphical interface. Each application's status, execution history, and output are easy to monitor from this page as well. In this exercise, you'll first manage the catalog's contents, selecting one of more than 20 pre-built applications provided with BigInsights to deploy on your cluster. Once deployed, the application will be visible to all authorized users. You'll then launch the application, monitor its execution status, and inspect its output.

    As you might have guessed, the sample application used in this lab is Word Count -- the same application you ran from a command line earlier.

    __1. Click the Applications tab of the Web console. No applications are deployed on a new cluster, so there won't be much to see yet.

__2. In the upper left corner, click Manage. A list of applications available for deployment is displayed.


    __3. Expand the Test category and click on the Word Count application.

    __4. Click Deploy.

    __5. When a pop-up window appears, accept the defaults for all settings and click Deploy.


    __6. After the application has been deployed, you're ready to run it. Click Run in the upper left pane.

    __7. Verify that the Word Count application appears in the catalog. (Any other applications that were previously deployed to the Web catalog will also appear.)


    __8. Click on the Word Count icon. The pane at right prompts you to enter appropriate information. For this application, you need to specify an execution name for your application's run, the HDFS directory containing the input document(s) for the Word Count application, and an output directory in HDFS.

    __9. For the Execution name, enter My Test Run 1.

    __10. For the Input path, click Browse and navigate to /user/biadmin/test2. Click OK.

__11. For the Output path, type /user/biadmin/WordCount_console_output. (Recall that the Word Count application creates this output directory at run time. If you specify an existing HDFS directory for the output, the application will fail.)

    __12. Verify that your display appears similar to this and click Run.


    __13. As your application executes, monitor its status through the Application History pane at lower right.

    __14. When the application completes successfully, click the link provided in the Output column to see the application's output.

    __15. Optionally, return to the Applications page of the console and click on the link provided in the Details column for your application's run.


    __16. Note that the console displays the Application Status page, which contains information about the Oozie workflow for your application as well as the application itself. If desired, click on one or more available links to explore details available for your review.


    Lab 4 Analyzing social media data with BigSheets

To help business analysts and those without a programming background analyze big data, IBM provides a spreadsheet-style tool called BigSheets. In this lab, you'll learn how you can explore big data through this tool without writing any scripts or MapReduce applications. The sample data for this lab consists of social media posts about a popular brand (IBM Watson) that were collected using a sample application provided with BigInsights. For background information, you may want to read the article Analyzing social media and structured data with InfoSphere BigInsights at http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html

After completing this hands-on lab, you'll be able to:

    Create a BigSheets workbook

    Analyze and customize a workbook

    Visualize your workbook's data in a chart

    Create a Big SQL table based on your workbook

    Export your workbook's data into one of several popular formats

Allow 45 - 60 minutes to complete this lab.

    4.1. Creating a workbook

    To get started, copy the sample blogs-data.txt file to HDFS and create a master workbook for it.

__1. Obtain the blogs-data.txt file. You'll find this in the sampleData.zip file provided with the article mentioned earlier.

__2. Use Hadoop file system commands or the BigInsights Web console to create subdirectories in HDFS for your sample data. Under /user/biadmin, create a /sampleData directory. Beneath /user/biadmin/sampleData, create the /IBMWatson subdirectory.
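If you opt for the command line, the commands mirror the mkdir syntax from Lab 2. A sketch:

hadoop fs -mkdir /user/biadmin/sampleData
hadoop fs -mkdir /user/biadmin/sampleData/IBMWatson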

    Where did this data come from?

For time efficiency, social media data about "IBM Watson" was already collected using the Boardreader sample application, which collects social media data from various global sites and writes the output in JSON array format to files. This lab focuses on blog data collected about IBM Watson over a six-month interval.

Boardreader is an IBM business partner that offers a social media content aggregation and provisioning service based on multilingual data dating back to 2001. The service searches message boards / forums, social networks, blogs/comments, microblogs, reviews, videos/comments and online news. Customers who want to use the Boardreader service should contact the firm directly to obtain a license key.


    If you forgot how to create a subdirectory in HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.

__3. Upload the blogs-data.txt file to the /user/biadmin/sampleData/IBMWatson directory. You can use Hadoop file system commands or the BigInsights Web console to do this, as sketched below. (If you forgot how to copy a file to HDFS, consult the earlier labs on Issuing Basic Hadoop Commands or Exploring and Administering Your Cluster with the BigInsights Web Console.)
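A command-line sketch of the upload; the local source path here is hypothetical and should point to wherever you extracted sampleData.zip:

hadoop fs -put /home/biadmin/sampleData/blogs-data.txt /user/biadmin/sampleData/IBMWatson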

__4. From the Files page of the Web console, position your cursor on the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file, as shown in the previous image.

    __5. Click the Sheet radio button to preview this data in a spreadsheet-style format.

__6. Because the sample blog data for this lab uses a JSON Array structure, you must click on the pencil icon to select an appropriate reader (data format translator) for this data. Select the JSON Array reader and click the green check.
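For context, the file holds an array of JSON records, which is why a reader is needed to map the data into rows and columns. A single record might look roughly like this sketch -- the field names match the workbook columns you'll see shortly, but the values are hypothetical:

[
  { "Country": "US",
    "FeedInfo": "...",
    "IsAdult": false,
    "Language": "en",
    "Published": "2014-02-14",
    "SubjectHtml": "IBM Watson in the news",
    "Tags": "...",
    "Type": "blog",
    "Url": "http://example.com/watson-post" },
  ...
]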


    __7. Save this as a Master Workbook named Watson Blogs. Optionally, provide a description. Click Save.

    __8. Note that the BigSheets page of the Web console will open and your new workbook will be displayed.

    Now you're ready to begin exploring this data using BigSheets.

    4.2. Analyzing and customizing your workbook

    BigSheets offers analysts a variety of macros, functions, and built-in analytical features. You'll learn about a few here.


    __1. To make it easier to search and manage your workbooks, add a few tags to the Watson Blogs master workbook you just created. In the upper right corner, click the icon to toggle the workbook display to show additional fields.

Depending on the size of your browser window, an additional scroll bar may appear at right.

__2. Scroll down to the Workbook Details section. Locate the Tags field, select the green plus sign (+), enter a tag for Watson, and click the green check mark. Repeat the process to add separate tags for IBM and blogs.

__3. Click on the Workbooks link in the upper left corner of your open workbook.

    __4. From the list of available workbooks, you can quickly search for a specific tag. Use the drop-down Tags menu to select the blogs tag or type tag: blogs into the box.


    __5. Open the Watson Blogs master workbook again. (Double click on it.)

__6. Create a new workbook based on this master workbook. In BigSheets, a master workbook is a base workbook that offers only a limited set of editing operations. So, to manipulate the data contained within a workbook, you create a new workbook derived from the master.

    __a. Click the Build new Workbook button.

    __b. When the new Workbook appears, change its default name. Click the pencil icon next to the name, enter Watson Blogs Revised as the new name, and click the green check mark.

__c. Click the Fit column(s) button to more easily see columns A through H on your screen.

__7. Remove the IsAdult column from your workbook. This is currently column E. Click on the triangle next to the IsAdult column name and select Remove.


__8. In this case, you want to keep only a few columns. To easily remove several columns, click the triangle again (from any column) and select Organize Columns.

    __a. Click the red X button next to each column you want to remove.

In this case, KEEP the following columns:

__i. Country
__ii. FeedInfo
__iii. Language
__iv. Published
__v. SubjectHtml
__vi. Tags
__vii. Type
__viii. Url

    __b. Click the green check mark button when you are ready to remove the columns you selected.

    Did I lose data?

    Deleting a column does not remove data. Deleting a column in a workbook just removes the mapping to this column.


    __9. Click on the Fit column(s) button again to show columns A through H. Verify that your screen appears similar to this:

__10. From the Save menu at upper left, select Save. Provide a description for your workbook if you'd like.

    __11. Apply a built-in function to further investigate the contents of this workbook. Click the Add Sheets button in the lower left corner.


    __12. From the pop-up menu, select Function. You're going to apply a built-in function that extracts the URL Host information from the full URL links associated with the blog data that was captured. Doing so will enable you to identify and chart sites with greatest blog coverage of IBM Watson.

    __13. From the Function menu, click Categories and Url.

    __14. Select the URLHOST function.

__15. In the new menu that appears, enter Get Host URL as the sheet name and select the Url column as the source of input to the URLHOST function.


    __16. At the bottom of the menu, click the Carry Over tab to specify which columns from the workbook you'd like to retain. Select Add All and click the green check mark.

    __17. Verify that your workbook contains a new URLHOST column and all previously existing columns. (Whenever you create a new Sheet or edit your workbook in some way, BigSheets will preview the results of your work against a small sample of the data represented by your workbook.) If desired, click the Fit Column button to show more columns on your screen.


    __18. Click Save > Save & Exit.

    __19. When prompted to Run or Close the workbook, click Run. "Running" a workbook instructs BigSheets to apply the logic you specified graphically against all data associated with your workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

    __20. When the operation completes, verify that your workbook appears similar to this:


__21. If desired, use the Next button in the lower right corner to page through the content a few times, noting the various URLHOST values. You could also use built-in BigSheets features to sort the data based on URLHOST (or other) values, filter records (such as blogs written in the English language), etc. But perhaps the quickest way to see which sites published the most blogs about IBM Watson during this time period is to chart the results. You'll do that next.

    4.3. Creating charts

    Now that you've customized your workbook to eliminate some unwanted columns and generate a new column containing URL host information, it's time to visualize the results. In this short exercise, you'll create two simple charts that identify the top 10 global sites with the most blog posts about IBM Watson.

__1. If necessary, open the Watson Blogs Revised workbook.

    __2. Click on the Add chart link in the lower left.


    __3. Select chart > Bar as the chart type.

    __4. Specify appropriate properties for the bar chart, paying close attention to these fields:

    __a. Title: Top 10 Blog Sites for IBM Watson

    __b. X Axis: URLHOST

    __c. Sort By: Y Axis

    __d. Occurrence Order: Descending

    __e. Limit: 10


    __5. Click the green check mark.

    __6. When prompted, Run the chart. This causes BigSheets to apply your instructions to the entire data set.

__7. Inspect the results. Are you surprised that ibm.com wasn't the top site for blog posts about IBM Watson?


    __8. If desired, hover over each bar to see the URL host name and the number of blogs posted at that site.

    __9. Next, create a new chart of a different type to visualize the information in a different format. Select Add Chart > Categories > cloud > Bubble Cloud.

    __10. Provide appropriate values for the following fields:

__a. Title: Top 10 Blog Sites for IBM Watson

    __b. Tags: URLHOST

    __c. Occurrence Order: Descending

    __d. Sort By: Count

    __e. Limit: 10


    __11. Click the green check mark.

    __12. When prompted, Run the chart.

    __13. Inspect the results. If desired, hover over a bubble to see the number of blog postings for that site.

    4.4. Creating a Big SQL table based on your workbook

    BigSheets offers a wide range of built-in features, including the ability to create a Big SQL table from your workbook. This is quite handy if you have SQL-based tools or applications that you'd like to use with data you've customized in BigSheets.


    __1. If necessary, open your Watson Blogs Revised workbook.

__2. Click the Create Table button just above the columns of your workbook. When prompted, accept sheets as the target schema name and type mywatsonblogs as the target table name.

    __3. Click Confirm.

    __4. From the Files page of the Web console, click the Catalog Tables tab in the navigation window and expand the sheets folder.

    __5. Click the mywatsonblogs file. Note that a preview of the table appears in the pane at right.

    __6. Click the Welcome tab of the Web console. In the Quick Links section, click the Run Big SQL queries link.


    __7. A new tab will appear in your Web browser.

    __8. In the box where you're prompted to enter your Big SQL query, type this statement:

    select urlhost, language, subjecthtml from sheets.mywatsonblogs

    fetch first 10 rows only;

    __9. Verify that the Big SQL radio button is checked (not the Big SQL V1 radio button).

    __10. If necessary, use the scroll bar at right to expose the Run button just below the radio buttons. Click Run.

    __11. Inspect the results.


    __12. Close the Big SQL browser tab.

    4.5. Optional: Exporting your workbook data

    In this optional exercise, you'll see how easy it is to export data in your workbook to one of several popular formats so that other applications can easily access the data.

    __1. If necessary, open your Watson Blogs Revised workbook.

    __2. Click Export data. From the drop-down menu, select TSV (tab separated value) as the format type.

    __3. Click the File radio button to export the data to a file in your distributed file system.

    Querying tables with Big SQL

    While the Web console's Big SQL query interface is handy for executing test queries that return a small amount of data, it's best to use other facilities provided by IBM or third parties to execute Big SQL queries that return larger volumes of data to avoid memory constraints imposed by your browser. In a subsequent lab, you'll learn how to execute Big SQL queries from Eclipse.


__4. Use the Browse button to navigate to the directory in HDFS where you would like to export this workbook. In this case, select /user/biadmin/sampleData/IBMWatson. In the box below the directory tree, enter myworkbook as the file name. Do not add a file extension such as .tsv. Click OK.

    __5. Click OK again to initiate the data export operation.

    __6. When a message appears indicating that the operation has finished, click OK.

    __7. On the Files page of the Web console, navigate to the directory you specified for the export (/user/biadmin/sampleData/IBMWatson) and locate your new myworkbook.tsv file.


    __8. Optionally, click the download icon to copy the file from HDFS to a directory of your choice in your local file system.


    Lab 5 Querying data with Big SQL

Now that you know how to work with HDFS and analyze your data with a spreadsheet-style tool, it's a good time to explore how you can query your data with Big SQL. Big SQL provides broad SQL support based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights in the same way that you access relational databases from your enterprise applications. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions.

This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-party retailer stores as well as directly to consumers through its online store. It maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In this lab, you will explore how to create, populate, and query a subset of the star schema database to investigate the company's performance and offerings. Note that BigInsights provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab.

To execute the queries in this lab, you will use the open source Eclipse environment provided with the BigInsights Quick Start Edition VMware image. Of course, you can use other tools or interfaces to invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that's easy to read and encourages you to collect your SQL statements into scripts for editing and testing.

    After you complete the lessons in this module, you will understand how to:

    Connect to the Big SQL server from Eclipse

    Execute individual or multiple Big SQL statements

    Create Big SQL tables in Hadoop

    Populate Big SQL tables with data from local files

    Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular expressions.

    Create and query a view based on multiple Big SQL tables.

    Create and run a JDBC client application for Big SQL using Eclipse.

Allow 45 - 60 minutes to complete this lab.

    5.1. Creating a project and executing Big SQL statements

    To begin, create a BigInsights project and Big SQL script.

    __1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted.

    __2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select BigInsights Project, and then click Next.


    __3. Type myBigSQL in the Project name field, and then click Finish.

    __4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective.

    __5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select SQL script, and then click Next.

    __6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder.

    __7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish.

In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties displayed in the Properties field. Verify that the connection uses the JDBC driver and database name shown in the Properties pane here.


    About the driver selection

You may be wondering why you are using a connection that employs the com.ibm.db2.jcc.DB2Driver class. In 2014, IBM released a common SQL query engine as part of its DB2 and BigInsights offerings. Doing so provides for greater SQL commonality across its relational DBMS and Hadoop-based offerings. It also brings a greater breadth of SQL function to Hadoop (BigInsights) users. This common query engine is accessible through the DB2 driver. The Big SQL driver remains operational and offers connectivity to an earlier, BigInsights-specific SQL query engine. This lab focuses on using the common SQL query engine.
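To make the driver discussion concrete, here is a minimal Java sketch of a client that connects through the common engine. It assumes the connection properties listed in Lab 1 for the VMware image (host bivm.ibm.com, port 51000, database bigsql, user biadmin/biadmin) and the IBM JCC driver jar on the classpath; it is illustrative only, not the JDBC exercise in Lab 5.5:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlConnectSketch {
    public static void main(String[] args) throws Exception {
        // JDBC URL built from the host, port, and database name noted in Lab 1
        String url = "jdbc:db2://bivm.ibm.com:51000/bigsql";
        try (Connection con = DriverManager.getConnection(url, "biadmin", "biadmin");
             Statement stmt = con.createStatement();
             // Any small query serves as a connectivity test
             ResultSet rs = stmt.executeQuery(
                 "SELECT tabschema, tabname FROM syscat.tables FETCH FIRST 5 ROWS ONLY")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "." + rs.getString(2));
            }
        }
    }
}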

    __8. Click Edit to edit this connection's log in information.


    __9. Change the user name and password properties to match your user ID and password (e.g., biadmin / biadmin). Leave the remaining property values intact.

    __10. Click Test Connection to verify that you can successfully connect to the server.

    __11. Check the Save password box and click OK.

    __12. Click Finish to close the connection window. Your empty SQL script will be displayed.

    __13. Copy the following statement into your SQL script:

    create hadoop table test1 (col1 int, col2 varchar(5));


    Because you didn't specify a schema name for the table, it will be created in your default schema, which is your user name (biadmin). Thus, the previous statement is equivalent to

    create hadoop table biadmin.test1 (col1 int, col2 varchar(5));

    In some cases, the Eclipse SQL editor may flag certain Big SQL statements as containing syntax errors. Ignore these false warnings and continue with your lab exercises.

    __14. Save your file (press Ctrl + S or click File > Save).

    __15. Right mouse click anywhere in the script to display a menu of options.

    __16. Select Run SQL or press F5. This causes all statements in your script to be executed.

__17. Inspect the SQL Results pane that appears towards the bottom of your display. (If desired, double click on the SQL Results tab to enlarge this pane. Then double click on the tab again to return the pane to its normal size.) Verify that the statement executed successfully. Your Big SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and table name were folded into upper case.


    For the remainder of this lab, you should execute each SQL statement individually. To do so, highlight the statement with your cursor and press F5.

When you're developing a SQL script with multiple statements, it's generally a good idea to test each statement one at a time to verify that each is working as expected.

    __18. From your Eclipse project, query the system for meta data about your test1 table:

    select tabschema, colname, colno, typename, length

from syscat.columns where tabschema = USER and tabname = 'TEST1';

In case you're wondering, syscat.columns is one of a number of views supplied over system catalog data automatically maintained for you by the Big SQL service.
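If you'd like to explore further, other catalog views can be queried the same way. For example, this sketch (assuming the syscat.tables view exposed by the common SQL engine) lists the tables in your current schema:

select tabname, create_time
from syscat.tables
where tabschema = USER;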

    __19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1 tab to view its output.

    __20. Finally, clean up the object you created in the database.

    drop table test1;

    __21. Save your file. If desired, leave it open to execute statements for subsequent exercises.

Now that you've set up your Eclipse environment and know how to create SQL scripts and execute queries, you're ready to develop more sophisticated scenarios using Big SQL. In the next lesson, you will create a number of tables in your schema and use Eclipse to query them.

    5.2. Creating sample tables and loading sample data

    In this lesson, you will create several sample tables and load data into these tables from local files.


    __1. Determine the location of the sample data in your local file system and make a note of it. You will need to use this path specification when issuing LOAD commands later in this lab.

Subsequent examples in this section presume your sample data is in the /opt/ibm/biginsights/bigsql/samples/data directory. This is the location of the data on the BigInsights VMware image, and it is the default location in typical BigInsights installations.

Furthermore, the /opt/ibm/biginsights/bigsql/samples/queries directory contains SQL scripts that include the CREATE TABLE, LOAD, and SELECT statements used in this lab, as well as other statements.

    __2. Create several tables to track information about sales. Issue each of the following CREATE TABLE statements one at a time, and verify that each completed successfully:

-- dimension table for region info
CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90)
, country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90)
, country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90)
, country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90)
, country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90)
, country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90)
, country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90)
, country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- dimension table tracking method of order for the sale (e.g., Web, fax)
CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- look up table with product brand info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- product dimension table
CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- look up table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for marketing promotions
CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Let's briefly explore some aspects of the CREATE TABLE statements shown here. If you have a SQL background, the majority of these statements should be familiar to you. However, after the column specification, there are some additional clauses unique to Big SQL -- clauses that enable it to exploit Hadoop storage mechanisms (in this case, Hive). The ROW FORMAT clause specifies that fields are to be terminated by tabs (\t) and lines are to be terminated by new line characters (\n). The table will be stored in a TEXTFILE format, making it easy for a wide range of applications to work with. For details on these clauses, refer to the Apache Hive documentation.


    __3. Load data into each of these tables using sample data provided in files. One at a time, issue each of the following LOAD statements and verify that each completed successfully. Remember to change the file path shown (if needed) to the appropriate path for your environment. The statements will return a warning message providing details on the number of rows loaded, etc.

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_ORDER_METHOD_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_DIM overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LOOKUP overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_SALES_FACT overwrite;

load hadoop using file url
'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE MRK_PROMOTION_FACT overwrite;


    Let's briefly explore the LOAD syntax shown in these examples. The first line of each example loads data into your table using a file URL specification and then specifies the full path to the data source file on your local file system. Note that the path is local to the Big SQL server (not your Eclipse client). The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs (\t). The INTO TABLE clause identifies the target table for the LOAD operation. The OVERWRITE keyword indicates that any existing data in the table will be replaced by data contained in the source file. (If you wanted to simply add rows to the table's content, you could specify APPEND instead.)
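    For example, a follow-up load that preserves rows already in the table would look like this sketch, which is simply the first LOAD statement above with APPEND substituted for OVERWRITE:

    load hadoop using file url
    'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
    with SOURCE PROPERTIES ('field.delimiter'='\t')
    INTO TABLE GO_REGION_DIM append;

    (Running this immediately after the OVERWRITE load would duplicate the 21 region rows, so treat it as illustrative rather than as a step to execute.)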

    Note that loading data from a local file is only one of several available options. You can also load data using FTP or SFTP. This is particularly handy for loading data from remote file systems, although you can practice using it against your local file system, too. For example, the following statement for loading data into the GOSALESDW.GO_REGION_DIM table using SFTP is equivalent to the syntax shown earlier for loading data into this table from a local file:

    load hadoop using file url
    'sftp://myID:myPassword@myServer:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
    with SOURCE PROPERTIES ('field.delimiter'='\t')
    INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

    (Here, myID, myPassword, and myServer are placeholders for a valid user ID, password, and host name in your environment.)

    Big SQL supports other LOAD options, including loading data directly from a remote relational DBMS via a JDBC connection. See the product documentation for details.

    __4. Query the tables to verify that the expected number of rows was loaded into each table. Execute each query that follows individually and compare the results with the number of rows specified in the comment line preceding each query.

    -- total rows in GO_REGION_DIM = 21
    select count(*) from GO_REGION_DIM;

    -- total rows in SLS_ORDER_METHOD_DIM = 7
    select count(*) from sls_order_method_dim;

    -- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
    select count(*) from SLS_PRODUCT_BRAND_LOOKUP;

    -- total rows in SLS_PRODUCT_DIM = 274
    select count(*) from SLS_PRODUCT_DIM;

    -- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
    select count(*) from SLS_PRODUCT_LINE_LOOKUP;

    -- total rows in SLS_PRODUCT_LOOKUP = 6302
    select count(*) from SLS_PRODUCT_LOOKUP;

    -- total rows in SLS_SALES_FACT = 446023
    select count(*) from SLS_SALES_FACT;

    -- total rows in gosalesdw.MRK_PROMOTION_FACT = 11034
    select count(*) from MRK_PROMOTION_FACT;
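    If you prefer to check all of the counts with a single statement, a convenience sketch such as the following (not part of the original lab script) returns one labeled row per table:

    -- one labeled count per table, for quick comparison with the expected values
    select 'GO_REGION_DIM' as table_name, count(*) as row_count from GO_REGION_DIM
    union all
    select 'SLS_ORDER_METHOD_DIM', count(*) from SLS_ORDER_METHOD_DIM
    union all
    select 'SLS_PRODUCT_BRAND_LOOKUP', count(*) from SLS_PRODUCT_BRAND_LOOKUP
    union all
    select 'SLS_PRODUCT_DIM', count(*) from SLS_PRODUCT_DIM
    union all
    select 'SLS_PRODUCT_LINE_LOOKUP', count(*) from SLS_PRODUCT_LINE_LOOKUP
    union all
    select 'SLS_PRODUCT_LOOKUP', count(*) from SLS_PRODUCT_LOOKUP
    union all
    select 'SLS_SALES_FACT', count(*) from SLS_SALES_FACT
    union all
    select 'MRK_PROMOTION_FACT', count(*) from MRK_PROMOTION_FACT;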


    5.3. Querying tables with joins, aggregations and more

    Now you're ready to query your tables. Based on earlier exercises, you've already seen that you can perform basic SQL operations, including projections (to extract specific columns from your tables) and restrictions (to extract specific rows meeting certain conditions you specified). Let's explore a few examples that are a bit more sophisticated.

    In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change that value in the Data Management preferences section, retain the default setting for this lab.

    __1. Join data from multiple tables to return the product name, quantity, and order method of goods that have been sold. To do so, execute the following query.

    -- Fetch the product name, quantity, and order method
    -- of products sold.
    -- Query 1
    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales, sls_product_dim prod,
    sls_product_lookup pnumb, sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
    AND sales.product_key=prod.product_key
    AND prod.product_number=pnumb.product_number
    AND meth.order_method_key=sales.order_method_key;

    Let's briefly review a few aspects of this query:

    Data from four tables drives the results of this query (see the tables referenced in the FROM clause). Relationships between these tables are resolved through three join predicates specified as part of the WHERE clause. The query relies on three equi-joins to filter data from the referenced tables. (Predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in the two tables.)

    For improved readability, this query uses aliases in the SELECT and FROM clauses when referencing tables. For example, pnumb.product_name refers to pnumb, which is the alias for the gosalesdw.sls_product_lookup table. Once defined in the FROM clause, an alias can be used in the WHERE clause so that you do not need to repeat the complete table name.

    The predicate pnumb.product_language='EN' helps to further narrow the result to English output only. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.
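    If you prefer explicit join syntax, Query 1 can also be written with ANSI JOIN ... ON clauses. The sketch below should be logically equivalent to the comma-separated form above; it introduces no new tables or predicates:

    -- Query 1 rewritten with explicit joins (equivalent logic)
    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales
    JOIN sls_product_dim prod ON sales.product_key = prod.product_key
    JOIN sls_product_lookup pnumb ON prod.product_number = pnumb.product_number
    JOIN sls_order_method_dim meth ON meth.order_method_key = sales.order_method_key
    WHERE pnumb.product_language = 'EN';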


    __2. Modify the query to restrict the order method to a single type: those involving a Sales visit. To do so, add the following query predicate just before the semi-colon:

    AND order_method_en='Sales visit'
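    For reference, here is a sketch of the full statement after this change (Query 1 plus the new predicate):

    -- Query 2: Query 1 restricted to sales-visit orders
    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales, sls_product_dim prod,
    sls_product_lookup pnumb, sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
    AND sales.product_key=prod.product_key
    AND prod.product_number=pnumb.product_number
    AND meth.order_method_key=sales.order_method_key
    AND order_method_en='Sales visit';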

    __3. Inspect the results, a subset of which is shown below:


    __4. To find out which sales method of all the methods has the greatest quantity of orders, add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. Finally, this query cleans up the output a bit by using aliases (e.g., AS Product) to substitute more readable column headers.

    -- Query 3
    SELECT pll.product_line_en AS Product,
    md.order_method_en AS Order_method,
    sum(sf.quantity) AS total
    FROM sls_order_method_dim AS md, sls_product_dim AS pd,
    sls_product_line_lookup AS pll, sls_product_brand_lookup AS pbl,
    sls_sales_fact AS sf
    WHERE pd.product_key = sf.product_key
    AND md.order_method_key = sf.order_method_key
    AND pll.product_line_code = pd.product_line_code
    AND pbl.product_brand_code = pd.product_brand_code
    GROUP BY pll.product_line_en, md.order_method_en;


    __5. Inspect the results, which should contain 35 rows. A portion is shown below.
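    Note that Query 3 doesn't order its output, so the method with the greatest total won't necessarily appear first. One possible refinement (not part of the original lab script) is to append an ORDER BY clause just before the semi-colon:

    -- sort the grouped results so the largest total appears first
    ORDER BY total DESC

    With that addition, the first row of the result answers the question directly.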

    5.4. Optional: Using SerDes for non-traditional data

    While data structured in CSV and TSV columns is often stored in BigInsights and loaded into Big SQL tables, you may also need to work with other types of data, such as data that requires the use of a serializer / deserializer (SerDe). SerDes are common in the Hadoop environment. You'll find a number of SerDes available in the public domain, or you can write your own following typical Hadoop practices.

    Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need, just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL service, and specify the SerDe class name when you create your table.

    In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog file for this exercise is the same blog file you used as input to BigSheets in a prior lab.

    __1. Download the hive-json-serde-0.2.jar file into a directory of your choice on your local file system, such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar.)

    __2. Register the SerDe with BigInsights.

    __a. Stop the Big SQL server. From a terminal window, issue this command:

    $BIGINSIGHTS_HOME/bin/stop.sh bigsql

    __b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib directories.
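    A sketch of the copy commands, assuming the JAR was downloaded to /home/biadmin/sampleData as suggested in step __1 and that $BIGSQL_HOME and $HIVE_HOME are set in your shell:

    # copy the SerDe JAR where Big SQL and Hive can find it
    cp /home/biadmin/sampleData/hive-json-serde-0.2.jar $BIGSQL_HOME/userlib/
    cp /home/biadmin/sampleData/hive-json-serde-0.2.jar $HIVE_HOME/lib/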


    __c. Restart the Big SQL server. From a terminal window, issue this command:

    $BIGINSIGHTS_HOME/bin/start.sh bigsql

    Now that you've registered your SerDe, you're ready to use it. In this section, you will create a table that relies on the SerDe you just registered. For simplicity, this will be an externally managed table, i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory will contain the table's data in files. As part of this exercise, you will upload the sample blogs-data.txt file into the target DFS directory.

    Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table's contents. This is consistent with the Hive concept of an externally managed table.

    Once the table is created, you'll query that table. In doing so, you'll note that the presence of a SerDe is transparent to your queries.

    __3. If necessary, download the .zip file containing the sample data from the bottom half of the article referenced in the introduction. Unzip the file into a directory on your local file system, such as /home/biadmin. You will be working with the blogs-data.txt file.

    From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory of your distributed file system. Use the create directory button to create a subdirectory named SerDe-Test.

    __4. Upload the blogs-data.txt file into /user/biadmin/sampleData/SerDe-Test.

  • Hands On Lab Page 65

    __5. Return to the Big SQL execution environment of your choice (JSqsh or Eclipse).

    __6. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION clause that specifies the DFS directory containing your sample blogs-data.txt file:

    create hadoop table if not exists testblogs
    ( Country String
    , Crawled String
    , FeedInfo String
    , Inserted String
    , IsAdult int
    , Language String
    , Postsize int
    , Published String
    , SubjectHtml String
    , Tags String
    , Type String
    , Url String
    )
    row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
    location '/user/biadmin/sampleData/SerDe-Test';
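    Because the table is created over the directory you populated in step __4, it is immediately queryable, and the SerDe is transparent to your queries. A simple sketch (the FETCH FIRST clause just limits the output):

    -- query the JSON-backed table like any other Big SQL table
    select country, language, url
    from testblogs
    fetch first 5 rows only;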

    5.5. Optional: Developing a JDBC client application with Big SQL

    You can write a JDBC client application that uses Big SQL to open a database connection, execute queries, and process the results. In this optional exercise, you'll see how writing a client JDBC application for Big SQL is like writing a client application for any relational DBMS that supports JDBC access.

    __1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next.


    __2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.

    __3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights, located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar.

    __4. Click Finish. Click Yes when you are asked if you want to open the Java perspective.

    __5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.


    __6. Right-click the aJavaPackage4me package, and click New > Class.

    __7. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish.

    __8. Replace the default code for this class by copying or typing the following code into the SampApp.java file (this code is also available in /opt/ibm/biginsights/bigsql/samples/data/SampApp.java):

    package aJavaPackage4me;

    //a. Import required package(s)
    import java.sql.*;

    public class SampApp {

        /** @param args */
        //b. set JDBC & database info
        //change these as needed for your environment
        static final String db = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
        static final String user = "YOUR_USER_ID";
        static final String pwd = "YOUR_PASSWORD";

        public static void main(String[] args) {

            Connection conn = null;
            Statement stmt = null;
            System.out.println("Started sample JDBC application.");

            try {
                //c. Register JDBC driver -- not needed for DB2 JDBC type 4 connection
                // Class.forName("com.ibm.db2.jcc.DB2Driver");

                //d. Get a connection
                conn = DriverManager.getConnection(db, user, pwd);
                System.out.println("Connected to the database.");

                //e. Execute a query
                stmt = conn.createStatement();
                System.out.println("Created a statement.");
                String sql;
                sql = "select product_color_code, product_number from sls_product_dim " +
                      "where product_key=30001";
                ResultSet rs = stmt.executeQuery(sql);
                System.out.println("Executed a query.");

                //f. Obtain results
                System.out.println("Result set: ");
                while (rs.next()) {
                    //Retrieve by column name
                    int product_color = rs.getInt("PRODUCT_COLOR_CODE");
                    int product_number = rs.getInt("PRODUCT_NUMBER");
                    //Display values
                    System.out.print("* Product Color: " + product_color + "\n");
                    System.out.print("* Product Number: " + product_number + "\n");
                }

                //g. Close open resources
                rs.close();
                stmt.close();
                conn.close();
            } catch (SQLException sqlE) {
                // Process SQL errors
                sqlE.printStackTrace();
            } catch (Exception e) {
                // Process other errors
                e.printStackTrace();
            } finally {
                // Ensure resources are closed before exiting
                try {
                    if (stmt != null) stmt.close();
                } catch (SQLException sqle2) {
                } // nothing we can do
                try {
                    if (conn != null) conn.close();
                } catch (SQLException sqlE) {
                    sqlE.printStackTrace();
                }
            } // end try block
            System.out.println("Application complete");
        }
    }

    __a. After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming (import java.sql.*;).

    __b. Set up the database information so that you can refer to it. Be sure to change the user ID, password, and connection information as needed for your environment.

    __c. Optionally, register the JDBC driver. The class name is provided here for your reference. When using the DB2 Type 4.0 JDBC driver, it's not necessary to specify the class name.

    __d. Open the connection.

    __e. Run a query by submitting an SQL statement to the database.

    __f. Extract data from the result set.

    __g. Clean up the environment by closing all of the database resources.
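    As a side note, on Java 7 and later the cleanup in steps f and g can be expressed more compactly with try-with-resources, which closes the connection, statement, and result set automatically. A sketch (not the lab's code), reusing the db, user, pwd, and sql values defined in the sample:

    // sketch only: try-with-resources closes conn, stmt, and rs automatically
    try (Connection conn = DriverManager.getConnection(db, user, pwd);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
        while (rs.next()) {
            System.out.println("* Product Color: " + rs.getInt("PRODUCT_COLOR_CODE"));
            System.out.println("* Product Number: " + rs.getInt("PRODUCT_NUMBER"));
        }
    } catch (SQLException sqlE) {
        sqlE.printStackTrace();
    }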

    __9. Save the file, then right-click the Java file and click Run > Run As > Java Application.

    __10. The results appear in the Console view of Eclipse:

    Started sample JDBC application.
    Connected to the database.
    Created a statement.
    Executed a query.
    Result set:
    * Product Color: 908
    * Product Number: 1110
    Application complete


    Lab 6 Summary

    In this lab, you gained hands-on experience using many popular capabilities of InfoSphere BigInsights, IBM's Hadoop-based platform for analyzing big data. You explored your BigInsights cluster using a Web-based console and manipulated social media data using a spreadsheet-style interface. You also created Big SQL tables for your data and executed several complex queries over this data.

    To expand your skills even further, visit the HadoopDev web site (https://developer.ibm.com/hadoop/), which contains links to free online courses, tutorials, and more.

    Now you're ready to get started using BigInsights for your own projects. What will you do with big data?


  • © Copyright IBM Corporation 2014.

    The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way.

    IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.