DATA524 - Information Visualization
Big Data Lab 2 Using Splunk Software
2016 John Hsu
DATA524 - Big Data Information Visualization
Table of Contents

Introduction
    About the UONA DATA524 Lab 2 - Accessing data in HDFS
    Concepts
HDFS Introduction
Part 1: Upload data to Hadoop Distributed File System (HDFS)
    Step 1: Log in to the UONA DATA524 Lab 2 Hadoop web site
    Step 2: Go to the root path of HDFS
    Step 3: Change the file location to your subdirectory
    Step 4: Upload the data into HDFS
Part 2: Showing HDFS data using Splunk software
    Step 1: Log in to the UONA DATA524 Lab 2 Splunk web site
    Step 2: Searching data in HDFS via Splunk
    Step 3: Visualizing the data in HDFS using Splunk
UONA DATA523 - BIG DATA TECHNOLOGY FUNDAMENTALS 3
Introduction
About the UONA DATA524 Lab 2 - Accessing data in HDFS

The lab in this manual shows you how to use Splunk with the Apache Hadoop file system: add data to HDFS, then check your data and run a simple search on the Hadoop directory. This lab is built for users who are new to the Hadoop Distributed File System (HDFS), Splunk Enterprise, and the Splunk Search feature.

What's in this lab?

This manual guides a first-time user through searching and visualizing the data. If you're new to Splunk Search, this is the place to start.

• Part 1: Upload data to Hadoop Distributed File System (HDFS) takes you through the steps to access the DATA524 Lab's HDFS web site.
• Part 2: Showing HDFS data using Splunk software describes the steps to retrieve and visualize the data on the DATA524 Lab's HDFS web site.
Concepts:

Apache Hadoop: Apache Hadoop® is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services an enterprise needs to deploy, integrate, and work with Hadoop.

Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is a Java-based file system that provides scalable and reliable data storage, and it is designed to span large clusters of commodity servers.
Ambari: The Apache Ambari project is aimed at making Hadoop management
simpler by developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:

• Provision a Hadoop cluster
    Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
    Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop cluster
    Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop cluster
    Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
    Ambari leverages the Ambari Metrics System for metrics collection.
    Ambari leverages the Ambari Alert Framework for system alerting, and it will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low).

Ambari enables Application Developers and System Integrators to:

• Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
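As a small illustration of the REST APIs mentioned above, the sketch below only builds the URL for Ambari's service-list endpoint (GET /api/v1/clusters/&lt;name&gt;/services); it does not contact a server. The host name, port, and cluster name are placeholders, not values from this lab.

```python
from urllib.parse import urlencode

AMBARI = "http://ambari.example.com:8080"  # placeholder host/port, not a lab value
CLUSTER = "MyCluster"                       # placeholder cluster name

def services_url(fields=None):
    """Build the URL for Ambari's service-list endpoint.

    An optional `fields` filter (e.g. "ServiceInfo/state") limits the
    attributes returned, per Ambari's partial-response convention.
    """
    url = f"{AMBARI}/api/v1/clusters/{CLUSTER}/services"
    if fields:
        url += "?" + urlencode({"fields": fields})
    return url

print(services_url())
print(services_url(fields="ServiceInfo/state"))
```

Sending this URL with HTTP basic authentication (as in the lab's web login) would return the cluster's services as JSON; the same URL pattern is what the Ambari web UI itself calls behind the scenes.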
HDFS Introduction

A single physical machine becomes saturated as data outgrows its storage capacity, which creates the need to partition your data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware.

Hadoop Distributed File System

HDFS is a distributed file system designed for storing large data files. It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
An HDFS cluster comprises a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes; it sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:

• replicate blocks to other nodes,
• remove local block replicas,
• re-register and send an immediate block report, or
• shut down the node.
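To make the block-and-replica description concrete, here is a short sketch that computes how a file is split and stored, assuming the typical 128 MiB block size mentioned above and the common default replication factor of 3 (an assumption, not a lab-specific setting):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # typical HDFS block size: 128 MiB
REPLICATION = 3                 # common default replication factor (assumed)

def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored across DataNodes)."""
    blocks = math.ceil(file_bytes / block_size)  # all blocks full-size except the last
    replicas = blocks * replication              # each block is independently replicated
    raw_bytes = file_bytes * replication         # the last block is not padded to full size
    return blocks, replicas, raw_bytes

# A 300 MiB file splits into 3 blocks (128 + 128 + 44 MiB) and 9 block replicas.
print(hdfs_footprint(300 * 1024 * 1024))
```

Note that because the last block only occupies its actual length, the raw storage cost is the file size times the replication factor, not the block count times the block size.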
For more details on HDFS: http://hortonworks.com/hadoop/hdfs/
UONA DATA524 Big Data Lab Environment

• Lab data is stored on the Splunk server.
• The search engine runs on the Splunk server.
• Users access the servers from the internet.
Part 1: Upload data to Hadoop Distributed File
System (HDFS)
The following steps (the same as in lab 4) prepare the data for the Part 2 search engine.

Prerequisites:

Find your username in "UONA LAB Account for DATA524".

Download the lab data file prices.csv to your local computer. We will use prices.csv later.

You may preview the prices.csv file with a text editor or a spreadsheet application, as shown below:
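If you prefer the command line, a few lines of Python can preview the file as well. The column names and sample values below are assumptions based on the fields used in the searches later in this lab (productId, product_name, price, sale_price, Code); your copy of prices.csv may differ.

```python
import csv
import io

# Stand-in for prices.csv with assumed columns; the real file's rows will differ.
sample = io.StringIO(
    "productId,product_name,price,sale_price,Code\n"
    "DB-SG-G01,Mediocre Kingdoms,24.99,19.99,A\n"
    "DC-SG-G02,Dream Crusher,39.99,24.99,B\n"
)

# DictReader maps each data row onto the header fields.
for row in csv.DictReader(sample):
    print(row["productId"], row["product_name"], row["price"])
```

To preview your actual download, replace the `io.StringIO(...)` stand-in with `open("prices.csv", newline="")`.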
Step 1: Log in to the UONA DATA524 Lab 2 Hadoop web site

http://uona.dynu.net:8708

Fill in your username and password:

username: bd524??
password: your password
Step 2: Get into the root path of HDFS
Step 2-1: Go to the Ambari Dashboard and open the HDFS User View by clicking the User Views icon and selecting the HDFS Files menu item.

Step 2-2: Move the mouse over the User Views icon, then select "HDFS Files":
OR click the User Views icon, then select "HDFS Files":
Step 3: Change the file location to your subdirectory

Your working subdirectory is /user/splunk/lab/bd524??.

Starting from the root of the HDFS file system, click the "user" subdirectory:
Step 3-1: From /user in the HDFS file system, click the "splunk" subdirectory.

Step 3-2: From /user/splunk in the HDFS file system, click the "lab" subdirectory.
Step 3-3: From /user/splunk/lab in the HDFS file system, find and click your subdirectory, bd524??:
You may switch the sorting order as shown below:

OR search for your subdirectory using your account name, bd524??:
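The directory browsing above uses the Ambari Files view, but HDFS also exposes the same listing through its WebHDFS REST API (the LISTSTATUS operation). The sketch below only constructs the request URL; the NameNode host and port are placeholders, and the account ID bd52401 is a hypothetical example.

```python
from urllib.parse import urlencode

NAMENODE = "http://namenode.example.com:50070"  # placeholder; 50070 is a common WebHDFS port

def liststatus_url(path, user):
    """Build the WebHDFS URL that lists an HDFS directory (GET ...?op=LISTSTATUS)."""
    query = urlencode({"op": "LISTSTATUS", "user.name": user})
    return f"{NAMENODE}/webhdfs/v1{path}?{query}"

# Hypothetical account ID used for illustration only:
print(liststatus_url("/user/splunk/lab/bd52401", user="bd52401"))
```

A GET on this URL returns the directory's entries as a JSON `FileStatuses` document, which is essentially what the Files view renders as a table.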
Step 4: Upload the data into HDFS

Upload the lab data file prices.csv to the HDFS file system.

Step 4-1: Make sure you are in your subdirectory, /user/splunk/lab/bd524??.

Click "Upload":

Click "Browse" to select the lab data file, prices.csv, that you downloaded to your local computer earlier:
Find and select prices.csv on your local computer (you downloaded it earlier).

Step 4-2: Make sure you are in your subdirectory, bd524??. Click "Upload":

The prices.csv file will be uploaded to HDFS, as shown below:
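For reference, the same upload can be done without the browser via WebHDFS's two-step CREATE: a PUT to the URL below returns a redirect to a DataNode, and a second PUT sends the file bytes to that redirect location. This sketch only builds the first-step URL; the host/port and the account ID bd52401 are placeholders, not lab values.

```python
from urllib.parse import urlencode

NAMENODE = "http://namenode.example.com:50070"  # placeholder, not the lab's actual NameNode

def create_url(path, user, overwrite=False):
    """Build the step-1 WebHDFS upload URL (PUT ...?op=CREATE)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),  # WebHDFS expects lowercase true/false
    })
    return f"{NAMENODE}/webhdfs/v1{path}?{query}"

# Hypothetical account ID used for illustration only:
print(create_url("/user/splunk/lab/bd52401/prices.csv", user="bd52401", overwrite=True))
```

The Ambari Files view's Upload button performs this same CREATE operation on your behalf.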
Part 2: Showing HDFS data using Splunk software

The following steps retrieve the HDFS data you uploaded in Part 1, which resides on the Hadoop server.

Step 1: Log in to the UONA DATA524 Lab 2 Splunk web site:

https://uona.dynu.net:8803

Follow the prompt to authenticate with your credentials. Chrome:
MS IE:
username: bd524??
password: your_password
The first page you see is Splunk Home.
Step 2: Searching data in HDFS via Splunk

Step 2-1: From Splunk Home, click Search & Reporting under Apps.

Step 2-2: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1 of this lab:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv

Note: replace bd524?? with your account ID.
The data you uploaded to HDFS will appear as shown below. Notice that the CSV records are automatically rendered in JSON format.
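The CSV-to-JSON rendering you see can be imitated locally: each CSV record becomes one JSON object whose keys are the header fields. The field names and sample values below are assumptions based on this lab's search, not actual lab output.

```python
import csv
import io
import json

# Stand-in for one record of prices.csv (assumed columns and sample values).
sample = io.StringIO(
    "productId,product_name,price,sale_price,Code\n"
    "DB-SG-G01,Mediocre Kingdoms,24.99,19.99,A\n"
)

rows = list(csv.DictReader(sample))
event = json.dumps(rows[0], indent=2)
print(event)  # one JSON object per CSV record, keyed by the header fields
```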
Step 2-3: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), then compare the result with the original text file from the prerequisites section of Part 1:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv
| table productId product_name price sale_price Code

Note: replace bd524?? with your account ID.
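The `table` command in the search above projects every event down to just the listed fields, in the given order. A rough local equivalent, using assumed field names and sample values rather than real lab results:

```python
# Events roughly as Splunk might return them (sample values, assumed fields).
events = [
    {"productId": "DB-SG-G01", "product_name": "Mediocre Kingdoms",
     "price": "24.99", "sale_price": "19.99", "Code": "A", "_raw": "..."},
    {"productId": "DC-SG-G02", "product_name": "Dream Crusher",
     "price": "39.99", "sale_price": "24.99", "Code": "B", "_raw": "..."},
]

FIELDS = ["productId", "product_name", "price", "sale_price", "Code"]

def table(events, fields):
    """Mimic SPL's `table`: keep only the named fields, in the given order."""
    return [{f: e.get(f, "") for f in fields} for e in events]

for row in table(events, FIELDS):
    print(row)
```

Fields not listed (such as Splunk's internal `_raw`) are dropped from the result, which is why the table output looks like the original CSV rows.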
Step 3: Visualizing the data in HDFS using Splunk

Step 3-1: Type the following search string (the same as in step 2-3) in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS) that you uploaded in Part 1:

index=uona2_68_lab source=/user/splunk/lab/bd524??/prices.csv
| table productId product_name price sale_price Code

Note: replace bd524?? with your account ID.
Click Visualization, then select Column Chart.

The column chart is an easy way to compare sale_price with price.
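What the column chart shows side by side can also be computed directly: for each product, the gap between price and sale_price is the discount. The rows below are sample values with assumed field names; the numbers in your prices.csv will differ.

```python
# Sample rows with assumed field names; the real prices.csv values will differ.
rows = [
    {"product_name": "Mediocre Kingdoms", "price": 24.99, "sale_price": 19.99},
    {"product_name": "Dream Crusher", "price": 39.99, "sale_price": 24.99},
]

for r in rows:
    discount = r["price"] - r["sale_price"]
    pct = 100 * discount / r["price"]
    print(f'{r["product_name"]}: price={r["price"]} sale_price={r["sale_price"]} '
          f'discount={discount:.2f} ({pct:.0f}%)')
```

Each pair of bars in the column chart corresponds to one such row, making products with large discounts easy to spot at a glance.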
After lab 1 and lab 2, you should be familiar with uploading data to HDFS and retrieving HDFS data using Splunk.
This is the end of UONA DATA524 Big Data Lab 2.