DATA524 - Information Visualization
Big Data Lab 4 Using Splunk Software
© 2016 John Hsu.
DATA524 - Big Data Information Visualization
Table of Contents
Introduction
About the UONA DATA524 Lab 4 - Accessing data in HDFS
Concepts
HDFS Introduction
Part 1: Showing HDFS data using Splunk software
Step 1: Login to UONA DATA524 Lab 4 Splunk Web site
Step 2: Searching data in HDFS via Splunk
Step 3: Visualizing the data in HDFS by using Splunk
Part 2: Predict future Purchasing using Splunk software
UONA DATA523 - BIG DATA TECHNOLOGY FUNDAMENTALS 3
Introduction
About the UONA DATA524 Lab 4 - Accessing data in HDFS
The lab contained in this manual shows you how to use Splunk software with the Apache Hadoop file system. You will add data to HDFS, then check your data and run a simple search on the Hadoop directory. This lab is designed for users who are new to the Hadoop Distributed File System (HDFS), Splunk Enterprise, and the Splunk Search feature.
What's in this lab?
This manual guides the first-time user through searching and visualizing the data. If you're new to Splunk Search, this is the place to start.
• Part 1: Showing HDFS data using Splunk software: takes you through the steps to access, search, and visualize the data in the DATA524 Lab's HDFS system.
• Part 2: Predict future Purchasing using Splunk software: describes the steps to retrieve, predict, and visualize the data in the DATA524 Lab's HDFS system.
Concepts
Apache Hadoop: Apache Hadoop® is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate, and work with Hadoop.

Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.
Ambari: The Apache Ambari project is aimed at making Hadoop management
simpler by developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
• Provision a Hadoop cluster: Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts, and it handles configuration of Hadoop services for the cluster.
• Manage a Hadoop cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop cluster: Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster. It leverages the Ambari Metrics System for metrics collection and the Ambari Alert Framework for system alerting, and it will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low).

Ambari enables Application Developers and System Integrators to:
• Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
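As a sketch of how an application might call the Ambari REST API, the snippet below builds an authenticated GET request for a cluster description. The host, port, and cluster name are illustrative assumptions, not part of this lab environment:

```python
import base64
import urllib.request

# Hypothetical Ambari server and cluster name -- replace with your own.
AMBARI_URL = "http://ambari.example.com:8080/api/v1/clusters/mycluster"

def build_ambari_request(url, user, password):
    """Build a GET request for the Ambari REST API.

    Ambari uses HTTP Basic authentication; the X-Requested-By header is
    required by Ambari for modifying requests, so we set it here as well.
    """
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("X-Requested-By", "ambari")
    return req

req = build_ambari_request(AMBARI_URL, "admin", "admin")
# urllib.request.urlopen(req) would return the cluster description as JSON.
```

This is only a conceptual sketch of the REST access pattern; in practice an integrator would add error handling and use the specific endpoints their application needs.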
HDFS Introduction
A single physical machine becomes saturated as its stored data grows, which creates the need to partition data across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware.
Hadoop Distributed File System
HDFS is a distributed file system designed for storing large data files. It is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data-access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.
An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file systems of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.

The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode does not directly send requests to DataNodes; it sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:
• replicate blocks to other nodes,
• remove local block replicas,
• re-register and send an immediate block report, or
• shut down the node.
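To make the block and replica bookkeeping concrete, here is a minimal Python sketch of the two ideas above: splitting a file into fixed-size blocks, and detecting under-replicated blocks. The function and variable names are illustrative assumptions, not HDFS source code:

```python
BLOCK_SIZE = 128 * 1024 * 1024   # typical HDFS block size: 128 MB
REPLICATION = 3                  # common default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to store a file of file_size bytes."""
    return max(1, -(-file_size // block_size))  # ceiling division

def under_replicated(block_replicas, target=REPLICATION):
    """Given {block_id: [datanode, ...]}, return blocks with too few replicas.

    A real NameNode would schedule re-replication of these blocks by
    replying to DataNode heartbeats with replication commands.
    """
    return [b for b, nodes in block_replicas.items() if len(nodes) < target]

# A 300 MB file needs 3 blocks of 128 MB.
print(split_into_blocks(300 * 1024 * 1024))          # -> 3
# Block "blk_2" lost a replica (e.g., a DataNode failed).
print(under_replicated({"blk_1": ["dn1", "dn2", "dn3"],
                        "blk_2": ["dn1", "dn2"]}))   # -> ['blk_2']
```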
For more details on HDFS: http://hortonworks.com/hadoop/hdfs/
UONA DATA524 Big Data Lab Environment
• Lab data is stored on the Splunk server.
• The search engine runs on the Splunk server.
• Users access the servers over the internet.
Part 1: Showing HDFS data using Splunk software
The following steps retrieve the HDFS data you uploaded in Lab 1, which resides on the Hadoop server.
Step 1: Login to UONA DATA524 Lab 4 Splunk Web site
Log in to the UONA DATA524 Splunk Web site:
https://uona.dynu.net:8803
username: bd524??
password: your_password
The first page you see is Splunk Home.
Step 2: Searching data in HDFS via Splunk:
Step 2-1: From Splunk Home, click Search & Reporting under Apps.
Step 2-2: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), which you uploaded in the previous lab:
index=uona2_68_lab source=/user/splunk/lab/bd524??/tutorialdata.gz
action=purchase
Note: replace bd524?? with your account ID
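Conceptually, this search scans the indexed events and keeps only those that come from the given source and whose action field equals purchase. A minimal Python sketch of that filtering follows; the event format and paths are illustrative assumptions, not Splunk internals:

```python
# Each event is a dict of extracted fields, roughly as Splunk presents them.
events = [
    {"source": "/user/splunk/lab/tutorialdata.gz", "action": "purchase"},
    {"source": "/user/splunk/lab/tutorialdata.gz", "action": "view"},
    {"source": "/var/log/other.log", "action": "purchase"},
]

def search(events, source, action):
    """Keep events matching both the source and the action field,
    mirroring `source=... action=purchase` in the Splunk search string."""
    return [e for e in events
            if e["source"] == source and e["action"] == action]

matches = search(events, "/user/splunk/lab/tutorialdata.gz", "purchase")
print(len(matches))  # -> 1
```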
The data you uploaded to HDFS will show up as below.
Step 3: Visualizing the data in HDFS by using Splunk
Step 3-1: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), which you uploaded in the previous lab:
index=uona2_68_lab source="/user/splunk/lab/bd524??/tutorialdata.gz"
action=purchase | timechart count by action
Note: replace bd524?? with your account ID
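The `| timechart count by action` stage buckets the matching events by time and counts them per bucket. A rough Python sketch of that daily bucketing follows; the event timestamps are illustrative, not taken from the lab data:

```python
from collections import Counter
from datetime import datetime

# Illustrative purchase events with ISO timestamps.
events = [
    {"_time": "2016-03-01T10:15:00", "action": "purchase"},
    {"_time": "2016-03-01T18:40:00", "action": "purchase"},
    {"_time": "2016-03-02T09:05:00", "action": "purchase"},
]

def timechart_count(events):
    """Count events per calendar day, like `timechart span=1d count`."""
    counts = Counter()
    for e in events:
        day = datetime.fromisoformat(e["_time"]).date()
        counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))

print(timechart_count(events))  # -> {'2016-03-01': 2, '2016-03-02': 1}
```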
The daily purchase statistics are shown below:
Click the Visualization tab.
The bar chart of the data will show up as below.
Select a different chart type, for example a line chart:
Below is the line chart of the purchase trend:
Part 2: Predict future Purchasing using Splunk software
In this part, you predict future purchases based on the previous purchase numbers.
The predict command forecasts values for one or more sets of time-series data. The
command can also fill in missing data in a time-series and provide predictions for the next
several time steps.
The predict command provides confidence intervals for all of its estimates. The command adds a predicted value and an upper and lower 95th-percentile range to each event in the time series. See the Splunk documentation for the predict command for details.
Step 1: Type the following search string in the Search bar and press Enter to search for the data in the Hadoop Distributed File System (HDFS), which you uploaded in the previous lab:
index=uona2_68_lab source="/user/splunk/lab/bd524??/tutorialdata.gz"
action=purchase | timechart span=1d count(action) as count | predict count
Note: replace bd524?? with your account ID
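Splunk's predict command uses Kalman-filter-based forecasting; as a deliberately simplified stand-in (not the actual algorithm), the following Python sketch forecasts the next daily count with a moving average and an approximate 95% band derived from the sample standard deviation. The daily counts are illustrative:

```python
import statistics

def predict_next(counts, window=5):
    """Forecast the next daily count as the mean of the last `window` values,
    with an approximate 95% interval of mean +/- 1.96 standard deviations.

    This is a simple stand-in for Splunk's `predict` command, which uses
    Kalman-filter-based forecasting and produces a band per time step.
    """
    recent = counts[-window:]
    mean = statistics.fmean(recent)
    sd = statistics.stdev(recent) if len(recent) > 1 else 0.0
    return mean, mean - 1.96 * sd, mean + 1.96 * sd

daily_purchases = [40, 42, 39, 45, 44]   # illustrative daily purchase counts
forecast, lower, upper = predict_next(daily_purchases)
print(round(forecast, 1))  # -> 42.0
```

The point of the sketch is only the shape of the output: one predicted value plus an upper and lower bound, which is what the predict command appends to each event in the time series.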
This is the end of UONA DATA524 Big Data Lab 4.