DATA524 - Information Visualization

Big Data Lab 2 Using Splunk Software

2016 John Hsu

DATA524 - Big Data Information Visualization

Table of Contents

Introduction

About the UONA DATA524 Lab 2 - Accessing data in HDFS

Concepts

HDFS Introduction

Part 1: Upload data to Hadoop Distributed File System (HDFS)

Step 1: Login to UONA DATA524 Lab 2 Hadoop Web site

Step 2: Get into the root path of HDFS

Step 3: Change file location to your subdirectory

Step 4: Upload the data into HDFS

Part 2: Showing HDFS data using Splunk software

Step 1: Login to UONA DATA524 Lab 2 Splunk Web site

Step 2: Searching data in HDFS via Splunk

Step 3: Visualizing the data in HDFS by Using Splunk

UONA DATA523 - BIG DATA TECHNOLOGY FUNDAMENTALS 3

Introduction

About the UONA DATA524 Lab 2 - Accessing data in HDFS

The lab in this manual shows you how to use Splunk with the Apache Hadoop file system. You add data to HDFS, then check your data and run a simple search on the Hadoop directory. This lab is built for users who are new to the Hadoop Distributed File System (HDFS), Splunk Enterprise, and the Splunk Search feature.

What's in this lab?

This manual guides first-time users through searching and visualizing the data. If you're new to Splunk Search, this is the place to start.

• Part 1: Upload data to Hadoop Distributed File System (HDFS) takes you through the steps to access DATA524 Lab’s HDFS web site.

• Part 2: Showing HDFS data using Splunk software describes the steps to retrieve and visualize the data in DATA524 Lab’s HDFS web site.

Concepts:

Apache Hadoop: Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.

Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.

Ambari: The Apache Ambari project is aimed at making Hadoop management

simpler by developing software for provisioning, managing, and monitoring Apache

Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management

web UI backed by its RESTful APIs.

Ambari enables System Administrators to:

Provision a Hadoop Cluster

• Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.

• Ambari handles configuration of Hadoop services for the cluster.

Manage a Hadoop Cluster

• Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

Monitor a Hadoop Cluster

• Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.

• Ambari leverages Ambari Metrics System for metrics collection.

• Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

Ambari enables Application Developers and System Integrators to:

• Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
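As a rough illustration of the REST integration mentioned above, the sketch below builds (without sending) an HTTP request for the list of clusters an Ambari server manages. The host name, port, and credentials are placeholders, and the helper name is hypothetical; the lab's own Ambari server may not allow API access for student accounts.

```python
import base64
import urllib.request


def build_clusters_request(base_url, user, password):
    """Build (but do not send) a GET request for Ambari's cluster list."""
    # /api/v1/clusters is the Ambari REST resource that lists managed clusters.
    url = base_url.rstrip("/") + "/api/v1/clusters"
    # Ambari accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    # Placeholder host and credentials, for illustration only.
    req = build_clusters_request("http://ambari.example.com:8080", "admin", "secret")
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req).read()
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and headers before pointing the code at a real server.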

HDFS Introduction

As data grows, a single physical machine eventually runs out of storage capacity, so the data must be partitioned across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware.

Hadoop Distributed File System

HDFS is a distributed file system that is designed for storing large data files. HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes. The NameNode actively monitors the number of replicas of a block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.

The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:

• replicate blocks to other nodes,

• remove local block replicas,

• re-register and send an immediate block report, or

• shut down the node.
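The block splitting and replica monitoring described above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: the function names are hypothetical, the 128 MB block size is the typical value quoted above, and the replication factor of 3 is an assumption (it is the HDFS default, but a cluster may be configured differently).

```python
from typing import Dict, List, Set

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted above
REPLICATION = 3                 # assumption: the HDFS default replication factor


def block_sizes(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block a file is split into."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def missing_replicas(block_locations: Dict[str, Set[str]],
                     live_nodes: Set[str],
                     replication: int = REPLICATION) -> Dict[str, int]:
    """For each under-replicated block, count how many new replicas are needed."""
    needed = {}
    for block_id, nodes in block_locations.items():
        alive = len(nodes & live_nodes)  # replicas on DataNodes that still respond
        if alive < replication:
            needed[block_id] = replication - alive
    return needed


if __name__ == "__main__":
    mb = 1024 * 1024
    print([size // mb for size in block_sizes(300 * mb)])
    # Hypothetical block map: DataNode dn1 has failed.
    locations = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
    print(missing_replicas(locations, live_nodes={"dn2", "dn3", "dn4"}))
```

In the example, a 300 MB file occupies two full blocks plus a 44 MB block, and the loss of one DataNode leaves only the block that had a replica on it under-replicated.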

For more details on HDFS: http://hortonworks.com/hadoop/hdfs/

UONA DATA524 Big Data Lab Environment

• Lab data is stored on the Splunk server.

• The search engine runs on the Splunk server.

• Users access the servers over the internet.

Part 1: Upload data to Hadoop Distributed File System (HDFS)

The following steps are the same as in lab 4; they prepare the data for the Part 2 search.

Prerequisite:

Find your username in “UONA LAB Account for DATA524”

Download the lab data file prices.csv to your local computer.

We will use the prices.csv later.

You may preview the prices.csv file with a text editor or a spreadsheet as below:
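The file can also be previewed from a script. The sketch below is a minimal example that reads the header and the first few rows with Python's csv module. The sample rows are invented for illustration, and the column names are assumed from the fields used in the searches later in this lab; the real prices.csv may differ.

```python
import csv
import io

# Hypothetical sample in the shape of prices.csv; the real rows will differ.
SAMPLE = """productId,product_name,price,sale_price,Code
P-001,Example Widget,24.99,19.99,A
P-002,Example Gadget,39.99,34.99,B
"""


def preview(csv_file, limit=5):
    """Return the header and up to `limit` data rows from an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    # zip stops after `limit` rows without reading the rest of the file.
    rows = [row for _, row in zip(range(limit), reader)]
    return header, rows


if __name__ == "__main__":
    # For the downloaded file, use: open("prices.csv", newline="")
    header, rows = preview(io.StringIO(SAMPLE))
    print(header)
    for row in rows:
        print(row)
```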

Step 1: Login to UONA DATA524 Lab 2 Hadoop Web site

http://uona.dynu.net:8708

Fill in your username and password:

username: bd524??

password: your_password

Step 2: Get into the root path of HDFS

Step 2-1: Go to the Ambari Dashboard and open the HDFS User View by clicking the User Views icon and selecting the HDFS Files menu item.

Step 2-2: Move the mouse over the User Views icon, then select “HDFS Files”:

OR click the User Views icon, then select “HDFS Files”:

Step 3: Change file location to your subdirectory:

Your working subdirectory will be at /user/splunk/lab/bd524??

Starting from the root of the HDFS file system, click the “user” subdirectory:

Step 3-1: From /user in the HDFS file system, click the “splunk” subdirectory.

Step 3-2: From /user/splunk in the HDFS file system, click the “lab” subdirectory.

Step 3-3: From /user/splunk/lab in the HDFS file system, find your subdirectory, bd524??, and click it:

You may switch the sort order as below:

OR search for your subdirectory using your account name, bd524??

Step 4: Upload the data into HDFS

Upload the lab data file prices.csv to the HDFS file system:

Step 4-1: Make sure you are in your subdirectory, /user/splunk/lab/bd524??.

Click “Upload”:

Click “Browse” to select the lab data file, prices.csv, that you downloaded to your local computer earlier:

Find and select prices.csv on your local computer (you downloaded it earlier).

Step 4-2: Make sure you are in your subdirectory, bd524??. Click “Upload”:

The prices.csv file is uploaded to HDFS as below:

Part 2: Showing HDFS data using Splunk software

The following steps retrieve the HDFS data you uploaded in Part 1, which is stored on the Hadoop server.

Step 1: Login to UONA DATA524 Lab 2 Splunk Web site

Log in to the UONA DATA524 Lab 2 Splunk web site:

https://uona.dynu.net:8803

Follow the message to authenticate with your credentials.

Chrome:

MS IE:

username: bd524??

password: your_password

The first page you see is Splunk Home.

Step 2: Searching data in HDFS via Splunk:

Step 2-1: From Splunk Home, click Search & Reporting under Apps.

Step 2-2: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1 of this lab:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv

Note: replace bd524?? with your account ID
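To avoid typing mistakes when substituting the account ID, the search string can be generated. The sketch below is only an illustration: the function name is hypothetical, and the account ID bd52401 used in the example is made up. The index name and directory are copied from the search string above.

```python
INDEX = "uona2_68_lab"         # index name from the lab's search string
LAB_ROOT = "/user/splunk/lab"  # HDFS directory that holds the student folders


def build_search(account_id, filename="prices.csv"):
    """Return the search string for one student's uploaded file."""
    if not account_id or "?" in account_id:
        # The bd524?? placeholder must be replaced with a real account ID.
        raise ValueError("replace bd524?? with your account ID")
    return f"index={INDEX} source={LAB_ROOT}/{account_id}/{filename}"


if __name__ == "__main__":
    # bd52401 is a made-up account ID used only for this example.
    print(build_search("bd52401"))
```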

The data you uploaded to HDFS will show up as below. You can see that the CSV data is automatically converted to JSON format.
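To get a feel for what a CSV row looks like as a JSON object, the sketch below maps each data row to an object keyed by the header names. It illustrates the idea only and makes no claim about how Splunk performs the conversion internally; the sample row is invented, and the column names are assumed from the fields used in this lab's searches.

```python
import csv
import io
import json

# Hypothetical row in the shape of prices.csv.
SAMPLE = """productId,product_name,price,sale_price,Code
P-001,Example Widget,24.99,19.99,A
"""


def csv_to_json_events(csv_text):
    """Turn each CSV data row into one JSON object keyed by the header names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(dict(row)) for row in reader]


if __name__ == "__main__":
    for event in csv_to_json_events(SAMPLE):
        print(event)
```

Note that every value stays a string here; a real pipeline would decide separately whether fields such as price should be treated as numbers.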

Step 2-3: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), then compare the result with the original text file from the prerequisite section of Part 1.

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv | table productId product_name price sale_price Code

Note: replace bd524?? with your account ID
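The table command in the search above keeps only the listed fields and shows them as columns in the order given. A rough Python equivalent, with a hypothetical function name and an invented event, looks like this:

```python
def table(events, *fields):
    """Keep only the named fields of each event, in the order given."""
    # A field missing from an event becomes an empty string here; that choice
    # is an assumption made to keep every output row the same width.
    return [[event.get(name, "") for name in fields] for event in events]


if __name__ == "__main__":
    # Invented event; "source" stands in for any extra field on the event.
    events = [{
        "source": "/user/splunk/lab/bd52401/prices.csv",
        "productId": "P-001",
        "product_name": "Example Widget",
        "price": "24.99",
        "sale_price": "19.99",
        "Code": "A",
    }]
    for row in table(events, "productId", "product_name", "price", "sale_price", "Code"):
        print(row)
```

Because the extra field is dropped and the column order is fixed, the output is easy to compare line by line with the original CSV file.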

Step 3: Visualizing the data in HDFS by Using Splunk:

Step 3-1: Type the following search string (same as step 2-3) in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv | table productId product_name price sale_price Code

Note: replace bd524?? with your account ID

Click Visualization, then select Column chart.

The column chart is an easy way to compare sale_price with price.
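The comparison that the column chart makes can be imitated in plain text. The sketch below draws one bar per value, scaled to the largest value in the data. It is not how Splunk renders charts; the function name and the sample rows are invented for illustration.

```python
def text_columns(rows, width=40):
    """Render price and sale_price of each product as text bars."""
    if not rows:
        return []
    # Scale every bar against the largest value so the longest bar is `width` wide.
    top = max(max(r["price"], r["sale_price"]) for r in rows) or 1.0
    lines = []
    for r in rows:
        for field in ("price", "sale_price"):
            bar = "#" * round(r[field] / top * width)
            lines.append(f"{r['product_name']:<16}{field:<11}{bar} {r[field]:.2f}")
    return lines


if __name__ == "__main__":
    # Invented sample values.
    sample = [
        {"product_name": "Example Widget", "price": 24.99, "sale_price": 19.99},
        {"product_name": "Example Gadget", "price": 39.99, "sale_price": 34.99},
    ]
    for line in text_columns(sample):
        print(line)
```

Placing the two bars for each product next to each other makes the gap between price and sale_price visible at a glance, which is the same reading the column chart supports.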

After lab 1 and lab 2, you should be familiar with uploading data to HDFS and retrieving HDFS data using Splunk.

This is the end of UONA DATA524 Big Data lab 2