DATA524 - Information Visualization
Big Data Lab 2 Using Splunk Software
2016 John Hsu
DATA524 - Big Data Information Visualization
Table of Contents

Introduction
    About the UONA DATA524 Lab 2 - Accessing data in HDFS
    Concepts
HDFS Introduction
Part 1: Upload data to Hadoop Distributed File System (HDFS)
    Step 1: Log in to the UONA DATA524 Lab 2 Hadoop web site
    Step 2: Go to the root path of HDFS
    Step 3: Change the file location to your subdirectory
    Step 4: Upload the data into HDFS
Part 2: Showing HDFS data using Splunk software
    Step 1: Log in to the UONA DATA524 Lab 2 Splunk web site
    Step 2: Searching data in HDFS via Splunk
    Step 3: Visualizing the data in HDFS using Splunk
UONA DATA523 - BIG DATA TECHNOLOGY FUNDAMENTALS 3
Introduction
About the UONA DATA524 Lab 2 - Accessing data in HDFS

The lab in this manual shows you how to use Splunk with the Apache Hadoop file system: add data to HDFS, then check your data and run a simple search on the Hadoop directory. This lab is built for users who are new to the Hadoop Distributed File System (HDFS), Splunk Enterprise, and the Splunk Search feature.

What's in this lab?

This manual guides a first-time user through searching and visualizing the data. If you're new to Splunk Search, this is the place to start.

• Part 1: Upload data to Hadoop Distributed File System (HDFS) takes you through the steps to access the DATA524 Lab's HDFS web site.
• Part 2: Showing HDFS data using Splunk software describes the steps to retrieve and visualize the data on the DATA524 Lab's HDFS web site.
Concepts:

Apache Hadoop: Apache Hadoop® is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services an enterprise needs to deploy, integrate, and work with Hadoop.

Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is a Java-based file system that provides scalable and reliable data storage, and it is designed to span large clusters of commodity servers.
Ambari: The Apache Ambari project is aimed at making Hadoop management
simpler by developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:

• Provision a Hadoop cluster
    Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
    Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop cluster
    Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop cluster
    Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
    Ambari leverages the Ambari Metrics System for metrics collection.
    Ambari leverages the Ambari Alert Framework for system alerting, and it will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low).

Ambari enables Application Developers and System Integrators to:

• Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
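As a small illustration of the REST APIs mentioned above, the sketch below only builds the URL for Ambari's service-list endpoint (GET /api/v1/clusters/&lt;name&gt;/services); it does not contact a server. The host name, port, and cluster name are placeholders, not values from this lab.

```python
from urllib.parse import urlencode

AMBARI = "http://ambari.example.com:8080"  # placeholder host/port, not a lab value
CLUSTER = "MyCluster"                       # placeholder cluster name

def services_url(fields=None):
    """Build the URL for Ambari's service-list endpoint.

    An optional `fields` filter (e.g. "ServiceInfo/state") limits the
    attributes returned, per Ambari's partial-response convention.
    """
    url = f"{AMBARI}/api/v1/clusters/{CLUSTER}/services"
    if fields:
        url += "?" + urlencode({"fields": fields})
    return url

print(services_url())
print(services_url(fields="ServiceInfo/state"))
```

Sending this URL with HTTP basic authentication (as in the lab's web login) would return the cluster's services as JSON; the same URL pattern is what the Ambari web UI itself calls behind the scenes.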
HDFS Introduction

A single physical machine becomes saturated as data outgrows its storage capacity, which creates the need to partition your data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware.

Hadoop Distributed File System

HDFS is a distributed file system designed for storing large data files. It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
An HDFS cluster comprises a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes; it sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:

• replicate blocks to other nodes,
• remove local block replicas,
• re-register and send an immediate block report, or
• shut down the node.
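To make the block-and-replica description concrete, here is a short sketch that computes how a file is split and stored, assuming the typical 128 MiB block size mentioned above and the common default replication factor of 3 (an assumption, not a lab-specific setting):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # typical HDFS block size: 128 MiB
REPLICATION = 3                 # common default replication factor (assumed)

def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored across DataNodes)."""
    blocks = math.ceil(file_bytes / block_size)  # all blocks full-size except the last
    replicas = blocks * replication              # each block is independently replicated
    raw_bytes = file_bytes * replication         # the last block is not padded to full size
    return blocks, replicas, raw_bytes

# A 300 MiB file splits into 3 blocks (128 + 128 + 44 MiB) and 9 block replicas.
print(hdfs_footprint(300 * 1024 * 1024))
```

Note that because the last block only occupies its actual length, the raw storage cost is the file size times the replication factor, not the block count times the block size.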
For more details on HDFS: http://hortonworks.com/hadoop/hdfs/
UONA DATA524 Big Data Lab Environment

• Lab data is stored on the Splunk server.
• The search engine runs on the Splunk server.
• Users access the servers from the internet.
Part 1: Upload data to Hadoop Distributed File
System (HDFS)
The following steps (the same as in lab 4) prepare the data for the Part 2 search engine.

Prerequisites:

Find your username in "UONA LAB Account for DATA524".

Download the lab data file prices.csv to your local computer. We will use prices.csv later.

You may preview the prices.csv file with a text editor or a spreadsheet application, as shown below:
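If you prefer the command line, a few lines of Python can preview the file as well. The column names and sample values below are assumptions based on the fields used in the searches later in this lab (productId, product_name, price, sale_price, Code); your copy of prices.csv may differ.

```python
import csv
import io

# Stand-in for prices.csv with assumed columns; the real file's rows will differ.
sample = io.StringIO(
    "productId,product_name,price,sale_price,Code\n"
    "DB-SG-G01,Mediocre Kingdoms,24.99,19.99,A\n"
    "DC-SG-G02,Dream Crusher,39.99,24.99,B\n"
)

# DictReader maps each data row onto the header fields.
for row in csv.DictReader(sample):
    print(row["productId"], row["product_name"], row["price"])
```

To preview your actual download, replace the `io.StringIO(...)` stand-in with `open("prices.csv", newline="")`.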
Step 1: Log in to the UONA DATA524 Lab 2 Hadoop web site

http://uona.dynu.net:8708

Fill in your username and password:

username: bd524??
password: your password
Step 2: Get into the root path of HDFS
Step 2-1: Go to the Ambari Dashboard and open the HDFS User View by clicking the User Views icon and selecting the HDFS Files menu item.

Step 2-2: Move the mouse over the User Views icon, then select "HDFS Files":
OR click the User Views icon, then select "HDFS Files":
Step 3: Change the file location to your subdirectory

Your working subdirectory is /user/splunk/lab/bd524??.

Starting from the root of the HDFS file system, click the "user" subdirectory:
Step 3-1: From /user in the HDFS file system, click the "splunk" subdirectory.

Step 3-2: From /user/splunk in the HDFS file system, click the "lab" subdirectory.
Step 3-3: From /user/splunk/lab in the HDFS file system, find and click your subdirectory, bd524??:
You may switch the sorting order as shown below:

OR search for your subdirectory using your account name, bd524??:
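The directory browsing above uses the Ambari Files view, but HDFS also exposes the same listing through its WebHDFS REST API (the LISTSTATUS operation). The sketch below only constructs the request URL; the NameNode host and port are placeholders, and the account ID bd52401 is a hypothetical example.

```python
from urllib.parse import urlencode

NAMENODE = "http://namenode.example.com:50070"  # placeholder; 50070 is a common WebHDFS port

def liststatus_url(path, user):
    """Build the WebHDFS URL that lists an HDFS directory (GET ...?op=LISTSTATUS)."""
    query = urlencode({"op": "LISTSTATUS", "user.name": user})
    return f"{NAMENODE}/webhdfs/v1{path}?{query}"

# Hypothetical account ID used for illustration only:
print(liststatus_url("/user/splunk/lab/bd52401", user="bd52401"))
```

A GET on this URL returns the directory's entries as a JSON `FileStatuses` document, which is essentially what the Files view renders as a table.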
Step 4: Upload the data into HDFS

Upload the lab data file prices.csv to the HDFS file system.

Step 4-1: Make sure you are in your subdirectory, /user/splunk/lab/bd524??.

Click "Upload":

Click "Browse" to select the lab data file, prices.csv, that you downloaded to your local computer earlier:
Find and select prices.csv on your local computer (you downloaded it earlier).

Step 4-2: Make sure you are in your subdirectory, bd524??. Click "Upload":

The prices.csv file will be uploaded to HDFS, as shown below:
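For reference, the same upload can be done without the browser via WebHDFS's two-step CREATE: a PUT to the URL below returns a redirect to a DataNode, and a second PUT sends the file bytes to that redirect location. This sketch only builds the first-step URL; the host/port and the account ID bd52401 are placeholders, not lab values.

```python
from urllib.parse import urlencode

NAMENODE = "http://namenode.example.com:50070"  # placeholder, not the lab's actual NameNode

def create_url(path, user, overwrite=False):
    """Build the step-1 WebHDFS upload URL (PUT ...?op=CREATE)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),  # WebHDFS expects lowercase true/false
    })
    return f"{NAMENODE}/webhdfs/v1{path}?{query}"

# Hypothetical account ID used for illustration only:
print(create_url("/user/splunk/lab/bd52401/prices.csv", user="bd52401", overwrite=True))
```

The Ambari Files view's Upload button performs this same CREATE operation on your behalf.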
Part 2: Showing HDFS data using Splunk software

The following steps retrieve the HDFS data you uploaded in Part 1, which resides on the Hadoop server.

Step 1: Log in to the UONA DATA524 Lab 2 Splunk web site:

https://uona.dynu.net:8803

Follow the prompt to authenticate with your credentials. Chrome:
MS IE:
username: bd524??
password: your_password
The first page you see is Splunk Home.
Step 2: Searching data in HDFS via Splunk

Step 2-1: From Splunk Home, click Search & Reporting under Apps.

Step 2-2: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1 of this lab:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv

Note: replace bd524?? with your account ID.
The data you uploaded to HDFS will appear as shown below. Notice that the CSV records are automatically rendered in JSON format.
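The CSV-to-JSON rendering you see can be imitated locally: each CSV record becomes one JSON object whose keys are the header fields. The field names and sample values below are assumptions based on this lab's search, not actual lab output.

```python
import csv
import io
import json

# Stand-in for one record of prices.csv (assumed columns and sample values).
sample = io.StringIO(
    "productId,product_name,price,sale_price,Code\n"
    "DB-SG-G01,Mediocre Kingdoms,24.99,19.99,A\n"
)

rows = list(csv.DictReader(sample))
event = json.dumps(rows[0], indent=2)
print(event)  # one JSON object per CSV record, keyed by the header fields
```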
Step 2-3: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), then compare the result with the original text file from the prerequisites section of Part 1:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv
| table productId product_name price sale_price Code

Note: replace bd524?? with your account ID.
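The `table` command in the search above projects every event down to just the listed fields, in the given order. A rough local equivalent, using assumed field names and sample values rather than real lab results:

```python
# Events roughly as Splunk might return them (sample values, assumed fields).
events = [
    {"productId": "DB-SG-G01", "product_name": "Mediocre Kingdoms",
     "price": "24.99", "sale_price": "19.99", "Code": "A", "_raw": "..."},
    {"productId": "DC-SG-G02", "product_name": "Dream Crusher",
     "price": "39.99", "sale_price": "24.99", "Code": "B", "_raw": "..."},
]

FIELDS = ["productId", "product_name", "price", "sale_price", "Code"]

def table(events, fields):
    """Mimic SPL's `table`: keep only the named fields, in the given order."""
    return [{f: e.get(f, "") for f in fields} for e in events]

for row in table(events, FIELDS):
    print(row)
```

Fields not listed (such as Splunk's internal `_raw`) are dropped from the result, which is why the table output looks like the original CSV rows.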
Step 3: Visualizing the data in HDFS using Splunk

Step 3-1: Type the following search string (the same as in step 2-3) in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv
| table productId product_name price sale_price Code

Note: replace bd524?? with your account ID.
Click Visualization, then select Column Chart.

The column chart is an easy way to compare sale_price with price.
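What the column chart shows side by side can also be computed directly: for each product, the gap between price and sale_price is the discount. The rows below are sample values with assumed field names; the numbers in your prices.csv will differ.

```python
# Sample rows with assumed field names; the real prices.csv values will differ.
rows = [
    {"product_name": "Mediocre Kingdoms", "price": 24.99, "sale_price": 19.99},
    {"product_name": "Dream Crusher", "price": 39.99, "sale_price": 24.99},
]

for r in rows:
    discount = r["price"] - r["sale_price"]
    pct = 100 * discount / r["price"]
    print(f'{r["product_name"]}: price={r["price"]} sale_price={r["sale_price"]} '
          f'discount={discount:.2f} ({pct:.0f}%)')
```

Each pair of bars in the column chart corresponds to one such row, making products with large discounts easy to spot at a glance.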
After lab 1 and lab 2, you should be familiar with uploading data to HDFS and retrieving HDFS data using Splunk.
This is the end of UONA DATA524 Big Data Lab 2.