Big Data Lab 4 Using Splunk Software · 2018. 9. 4. · UONA DATA523 - BIG DATA TECHNOLOGY...

  • DATA524 - Information Visualization

    Big Data Lab 4 Using Splunk Software

    2016 John Hsu.

    DATA524 - Big Data Information Visualization

    Table of Contents

    Introduction

    About the UONA DATA524 Lab 4 - Accessing data in HDFS

    Concepts:

    HDFS Introduction

    Part 1: Showing HDFS data using Splunk software

    Step 1: Login to UONA DATA524 Lab 4 Splunk Web site

    Step 2: Searching data in HDFS via Splunk:

    Step 3: Visualizing the data in HDFS by Using Splunk:

    Part 2: Predict future Purchasing using Splunk software

    Introduction

    About the UONA DATA524 Lab 4 - Accessing data in HDFS

    The lab contained in this manual shows you how to use Splunk with the Apache Hadoop file system. It covers adding data to HDFS, then shows you how to check your data and run a simple search on the Hadoop directory. This lab is built for the user who is new to the Hadoop Distributed File System (HDFS), Splunk Enterprise, and the Splunk Search feature.

    What's in this lab?

    This manual guides the first-time user through searching and visualizing the data. If you're new to Splunk Search, this is the place to start.

    • Part 1: Showing HDFS data using Splunk software: Takes you through the steps to log in to the DATA524 Lab's Splunk web site, then search and visualize the data stored in HDFS.

    • Part 2: Predict future Purchasing using Splunk software: Describes the steps to retrieve, predict, and visualize the data in the DATA524 Lab's HDFS system.

    Concepts:

    Apache Hadoop: Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate, and work with Hadoop.

    Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers.

    Ambari: The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

    Ambari enables System Administrators to:

    Provision a Hadoop Cluster

    • Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.

    • Ambari handles configuration of Hadoop services for the cluster.

    Manage a Hadoop Cluster

    • Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

    Monitor a Hadoop Cluster

    • Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.

    • Ambari leverages Ambari Metrics System for metrics collection.

    • Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

    Ambari enables Application Developers and System Integrators to:

    • Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
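
    As a rough illustration of what integrating with the Ambari REST APIs can look like, the sketch below builds (but does not send) an HTTP GET request that would list the services of a cluster. The host name, credentials, and cluster name are hypothetical placeholders, and the port (8080) and the /api/v1/clusters path are assumptions based on a typical Ambari installation; check them against your own server before use.

```python
import base64
import urllib.request


def build_ambari_request(host, user, password, cluster=None, port=8080):
    """Build (but do not send) a GET request for the Ambari REST API.

    Assumptions: Ambari listens on `port` over plain HTTP and uses
    HTTP Basic authentication. All argument values are placeholders.
    """
    path = "/api/v1/clusters"
    if cluster is not None:
        # List the services of one named cluster instead of all clusters.
        path += f"/{cluster}/services"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"http://{host}:{port}{path}", method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host, credentials, and cluster name, for illustration only.
req = build_ambari_request("ambari.example.com", "admin", "admin", cluster="lab")
print(req.get_method(), req.full_url)
```

    Passing the returned object to urllib.request.urlopen would send the request; that step is left out here so the sketch runs without a live Ambari server.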

    HDFS Introduction

    As data grows, a single physical machine eventually runs out of storage capacity, so the data must be partitioned across separate machines. A file system that manages the storage of data across a network of machines is called a distributed file system. HDFS is a core component of Apache Hadoop and is designed to store large files with streaming data access patterns, running on clusters of commodity hardware.

    Hadoop Distributed File System

    HDFS is a distributed file system designed for storing large data files. It is Java-based, provides scalable and reliable data storage, and is designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

    An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.

    The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. The NameNode actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.

    The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:

    • replicate blocks to other nodes,

    • remove local block replicas,

    • re-register and send an immediate block report, or

    • shut down the node.
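
    The block splitting and re-replication described above can be sketched as a small toy model. This is not HDFS code: the replication factor of 3, the round-robin placement, and the node names are assumptions made only for illustration, and real HDFS placement is rack-aware and driven by heartbeats.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size named in the text
REPLICATION = 3                 # assumption: not stated in this lab manual


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: map each block to a set of DataNodes."""
    return {
        block: {datanodes[(block + i) % len(datanodes)] for i in range(replication)}
        for block in range(num_blocks)
    }


def handle_datanode_failure(placement, failed, live_nodes, replication=REPLICATION):
    """Drop the failed node and return (block, source, target) replicate commands."""
    commands = []
    for block, nodes in placement.items():
        nodes.discard(failed)
        while nodes and len(nodes) < replication:
            candidates = [n for n in live_nodes if n not in nodes]
            if not candidates:
                break  # not enough live nodes to restore full replication
            source, target = sorted(nodes)[0], candidates[0]
            nodes.add(target)
            commands.append((block, source, target))
    return commands


# Hypothetical four-node cluster storing one 1 GB file.
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(block_count(1024 ** 3), nodes)
commands = handle_datanode_failure(placement, "dn2", ["dn1", "dn3", "dn4"])
print(len(placement), "blocks,", len(commands), "replicate commands")
```

    In this model a 1 GB file occupies 8 blocks, and losing one of the four nodes leaves some blocks under-replicated until a new copy is made on a surviving node.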

    For more details on HDFS: http://hortonworks.com/hadoop/hdfs/

    UONA DATA524 Big Data Lab Environment

    • Lab data is stored on the Splunk server.

    • The search engine runs on the Splunk server.

    • Users access the servers over the internet.

    Part 1: Showing HDFS data using Splunk software

    The following steps retrieve the HDFS data you uploaded in Lab 1, which is stored on the Hadoop server.

    Step 1: Login to UONA DATA524 Lab 4 Splunk Web site

    Log in to the UONA DATA524 Lab 4 Splunk web site:

    https://uona.dynu.net:8803

    username: bd524??

    password: your_password

    The first page you see is Splunk Home.

    Step 2: Searching data in HDFS via Splunk:

    Step 2-1: From Splunk Home, click Search & Reporting under Apps.

    Step 2-2: Type the following search string in the Search bar and press Enter to search for the purchase events in the data you previously uploaded to the Hadoop Distributed File System (HDFS):

    index=uona2_68_lab source="/user/splunk/lab/bd524??/tutorialdata.gz" action=purchase

    Note: replace bd524?? with your account ID.
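
    To make the substitution in the note concrete, here is a minimal sketch that assembles the search string for a given account ID. The function name and the account ID bd52499 are hypothetical, used only for illustration; the index, path, and action values are the ones given in this step, with the source path quoted as it is in Step 3.

```python
def build_search(account_id, index="uona2_68_lab", action="purchase"):
    """Return the lab's search string with the account ID filled in."""
    if not account_id or "?" in account_id:
        # The literal placeholder "bd524??" must be replaced before searching.
        raise ValueError("replace bd524?? with your real account ID")
    source = f"/user/splunk/lab/{account_id}/tutorialdata.gz"
    return f'index={index} source="{source}" action={action}'


# "bd52499" is a made-up account ID, not a real lab account.
print(build_search("bd52499"))
```

    Pasting the printed string into the Search bar is equivalent to editing the placeholder by hand.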

    The data you uploaded to HDFS will show up as below.

    Step 3: Visualizing the data in HDFS by Using Splunk:

    Step 3-1: Type the following search string in the Search bar and press Enter to count the purchase events over time in the data you previously uploaded to the Hadoop Distributed File System (HDFS):

    index=uona2_68_lab source="/user/splunk/lab/bd524??/tutorialdata.gz" action=purchase | timechart count by action

    Note: replace bd524?? with your account ID.
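
    Conceptually, the timechart part of this search groups events into time buckets and counts them. The sketch below imitates that for daily buckets in plain Python. It is only an illustration of the idea, not how Splunk implements timechart; the sample events are made up, and the daily bucket size is an assumption (Splunk picks the span from the search time range unless one is given).

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up sample events, for illustration only.
events = [
    {"time": "2016-03-01T09:15:00", "action": "purchase"},
    {"time": "2016-03-01T17:40:00", "action": "purchase"},
    {"time": "2016-03-01T18:05:00", "action": "addtocart"},
    {"time": "2016-03-03T11:30:00", "action": "purchase"},
]


def daily_counts(events, action="purchase"):
    """Count matching events per calendar day, filling empty days with 0."""
    counts = Counter(
        datetime.fromisoformat(e["time"]).date()
        for e in events
        if e["action"] == action
    )
    if not counts:
        return {}
    day, last = min(counts), max(counts)
    series = {}
    while day <= last:
        series[day.isoformat()] = counts[day]  # a Counter yields 0 for missing days
        day += timedelta(days=1)
    return series


print(daily_counts(events))
```

    Each key of the result corresponds to one row of the statistics table, and each value to the height of one bar or point in the chart.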

    The daily purchase statistics are shown below:

    Click the Visualization tab.

    The bar chart of the data will show up as below.

    Select a different chart type, for example a line chart:

    Below is the line chart of the purchase trend:

    Part 2: Predict future Purchasing using Splunk software

    Predict future purchases based on the previous purchase numbers.

    The predict command forecasts values for one or more sets of time-series data. The command can also fill in missing data in a time series and provide predictions for the next several time steps.

    The predict command provides confidence intervals for all of its estimates. The command adds a predicted value and an upper and lower 95th percentile range to each event in the time series. See the Usage section of the predict command topic in the Splunk documentation.

    Step 1: Type the following search string in the Search bar and press Enter to chart the daily purchase counts in the HDFS data uploaded in the previous lab and forecast future values:

    index=uona2_68_lab source="/user/splunk/lab/bd524??/tutorialdata.gz" action=purchase | timechart span=1d count(action) as count | predict count

    Note: replace bd524?? with your account ID.
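
    To show the shape of what a forecast with a 95% range looks like, the sketch below uses a deliberately naive stand-in: it predicts the historical mean for every future step and sets the range to the mean plus or minus 1.96 sample standard deviations. This is not the algorithm the predict command uses; the function name, the result keys, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def naive_forecast(counts, steps=5, z=1.96):
    """Naive stand-in for a forecast: historical mean with a +/- z*stdev band."""
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    center = mean(counts)
    spread = z * stdev(counts)
    return [
        {"prediction": center, "lower95": center - spread, "upper95": center + spread}
        for _ in range(steps)
    ]


# Made-up daily purchase counts, for illustration only.
daily_purchases = [10, 12, 14]
for row in naive_forecast(daily_purchases, steps=2):
    print(row)
```

    The real command fits a time-series model, so its predictions can follow trends and its range typically widens further into the future, which this flat estimate does not capture.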

    This is the end of UONA DATA524 Big Data Lab 4.