CS226 Big-Data Management (eldawy/18WCS226/slides/CS226-01-Intro.pdf)

CS226 Big-Data Management Instructor: Ahmed Eldawy 01/09/2018

Page 1

CS226

Big-Data Management

Instructor: Ahmed Eldawy

01/09/2018

Page 2

Welcome (back) to UCR!

Page 3

Class information

Classes: Tuesday and Thursday 9:40 AM – 11:00 AM at WCH 142

Instructor: Ahmed Eldawy

Office hours: Tuesday and Thursday 11:00 AM – 12:00 PM at WCH 357. Conflicts?

Website: http://www.cs.ucr.edu/~eldawy/18WCS226/

Email: [email protected]

Subject: “[CS226] …”

Page 4

Course work

Active participation in the class (10%)

Class presentation (10%)

Reading and review tasks (5%)

Assignments (15%)

Project (35%)

Final exam (25%)

Date: Thursday, March 22, 2018

Time: 8:00 a.m. - 11:00 a.m.

Location: TBD

Page 5

Project

Groups of 2-3 students

Project proposal. Due on Thursday, 1/25

Literature survey. Due on Thursday, 2/8

Paper outline. Due on Thursday, 2/22

Final paper. Due on Thursday, 3/8

Project presentation. On 3/13 and 3/15

Page 6

Course goals

What are your goals?

Understand what big data means

Identify the internal components of big data platforms

Recognize the differences between different big data platforms

Explain how a distributed query runs on big data

Page 7

Super Hero

Page 8

Big-data Expert

Understand how the big-data platforms really work

Control those thousands of processors efficiently to carry out your task

Page 9

Syllabus

Overview of big data

Big-data storage

Big-data processing

Big-data indexing

Big-SQL processing

Programming packages

Page 10

Introduction

Page 11

Page 12

Page 13

Jan 2012: World Economic Forum Report

Page 14

Interest in Big Data in the US

March 2012: Obama administration unveils BIG DATA initiative: $200 million in R&D investment

June 2013: The Washington Post calls Obama “The Big Data President”

Page 15

Interest in Big Data in Europe

March 2014: David Cameron and Angela Merkel talking about Big Data at a computer expo in Hannover, Germany

Page 16

The Market of Big Data

Page 17

Four (previously Three) V’s of Big Data

Page 18

Big Data Vs Big Computation

Full scans (e.g., log processing)

Range scans

Point lookups

Iterations

Joins (self, binary, or multiway)

Proximity queries

Closures and graph traversals

Page 19

Big Data Applications

Web search

Marketing and advertising

Data cleaning

Knowledge base

Information retrieval

Internet of Things (IoT)

Visualization

Behavioral studies

Page 20

Publicly Available Datasets

Data.gov

Data.gov.uk

Twitter Streaming API

Yahoo! Webscope [http://webscope.sandbox.yahoo.com/]

GDELT [http://www.gdeltproject.org/]

Instagram API

Page 21

Big Data Landscape 2012

http://mattturck.com/2012/06/29/a-chart-of-the-big-data-ecosystem/

Page 22

Big Data Landscape 2014

http://mattturck.com/2014/05/11/the-state-of-big-data-in-2014-a-chart/

Page 23

Big Data Landscape 2016

http://mattturck.com/2016/02/01/big-data-landscape/

Page 24

Components of Big Data

Page 25

Storage of Big Data

Data is growing faster than Moore’s Law

Too much data to fit on a single machine

Partitioning

Replication

Fault-tolerance

Page 26

Hadoop Distributed File System (HDFS)

The most widely used distributed file system

Fixed-sized partitioning

3-way replication

Write-once read-many

[Figure: a file stored as a sequence of 128MB blocks]
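A minimal sketch of the two HDFS ideas on this slide, fixed-size partitioning and 3-way replication. The function names, the node names, and the round-robin placement policy are assumptions made for illustration; real HDFS placement is rack-aware and is decided by the NameNode.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size shown on the slide
REPLICATION = 3                 # 3-way replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs; only the last block may be shorter."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only; it ignores racks and load.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return [[nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)]


# A 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
```

Because every block except the last has the same size, locating the block that holds a given byte offset is a single integer division, which is one reason fixed-size partitioning is attractive for a file system.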

Page 27

Indexing

Data-aware organization

Global Index partitions the records into blocks

Local Indexes organize the records in a partition

Challenges:

Big volume

HDFS limitation

New programming paradigms

Ad-hoc indexes

[Figure: a global index over the partitions, each partition with its own local index]
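The global/local split can be illustrated with a small in-memory sketch. It assumes one-dimensional keys and range partitioning, which the slide does not specify; the class name `TwoLevelIndex` and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right


class TwoLevelIndex:
    """Range-partitioned keys: a global index picks the partition,
    a local (sorted) index finds the key inside it."""

    def __init__(self, keys, num_partitions):
        keys = sorted(keys)
        size = max(1, -(-len(keys) // num_partitions))  # ceiling division
        # Local indexes: each partition keeps its keys in sorted order.
        self.partitions = [keys[i:i + size] for i in range(0, len(keys), size)]
        # Global index: the smallest key stored in each partition.
        self.boundaries = [p[0] for p in self.partitions]

    def contains(self, key):
        # Global step: find the last partition whose smallest key is <= key.
        i = bisect_right(self.boundaries, key) - 1
        if i < 0:
            return False
        # Local step: binary search inside that single partition.
        part = self.partitions[i]
        j = bisect_left(part, key)
        return j < len(part) and part[j] == key


index = TwoLevelIndex([15, 3, 42, 8, 23, 4, 16], num_partitions=3)
```

A lookup touches the small global index plus exactly one partition, so the other partitions never have to be read.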

Page 28

Fault Tolerance

Replication

Redundancy

Multiple masters

Page 29

Streaming

Sub-second latency for queries

One scan over the data

(Partial) preprocessing

Continuous queries

Eviction strategies

In-memory indexes

[Figure: a processing window over a continuous bit stream]
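A minimal sketch of a continuous query over a processing window, assuming a count-based window and a "how many 1-bits are in the window" query (both are assumptions; the slide only shows a bit stream and a window). The class name `SlidingWindowCounter` is hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Maintains the number of 1-bits among the last `size` items."""

    def __init__(self, size):
        self.size = size
        self.window = deque()   # in-memory state, bounded by the window size
        self.ones = 0

    def push(self, bit):
        """Consume one item (single scan) and return the updated answer."""
        self.window.append(bit)
        self.ones += bit
        if len(self.window) > self.size:
            # Eviction strategy: drop the oldest item once the window is full.
            self.ones -= self.window.popleft()
        return self.ones


counter = SlidingWindowCounter(size=4)
# Feed a short prefix of the bit stream shown on the slide.
answers = [counter.push(int(b)) for b in "10001000101"]
```

Each item is seen once and the answer is updated in constant time, which is what makes sub-second latency plausible even though the stream itself is unbounded.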

Page 30

Task Execution

MapReduce

Map-Shuffle-Reduce

Resiliency through materialization

Resilient Distributed Datasets (RDD)

Directed-Acyclic-Graph (DAG)

In-memory processing

Resiliency through lineages

Hyracks

Stragglers

Load balance

[Figure: mappers M1, M2, …, Mm feeding reducers R1, R2, …, Rn]
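The Map-Shuffle-Reduce flow can be sketched in a single process. Word count is an assumed example job, and the function names are hypothetical; a real engine would run the mappers and reducers on different machines and materialize the intermediate pairs to disk.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs, num_reducers):
    """Shuffle: route each key to one reducer and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)


def run_job(records, num_reducers=2):
    pairs = [kv for record in records for kv in map_phase(record)]
    output = {}
    for partition in shuffle(pairs, num_reducers):
        for key, values in partition.items():
            word, count = reduce_phase(key, values)
            output[word] = count
    return output


counts = run_job(["big data", "Big compute", "data data"])
```

Hash partitioning sends every occurrence of a key to the same reducer, so each reducer can finish its keys independently. A skewed key distribution is one source of the stragglers and load-balance problems listed on the slide.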

Page 31

Query Optimization

Finding the most efficient query plan

e.g., grouped aggregation

Cost model (CPU – Disk – Network)

[Figure: two alternative plans for grouped aggregation, one built from Agg and Merge operators vs. one built from Partition and Agg operators]
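One possible reading of the grouped-aggregation example, sketched under stated assumptions: plan A aggregates locally on each node and then merges the partial results, while plan B repartitions the raw records by key and aggregates afterwards. The function names are hypothetical, and the cost is reduced to a single number, the rows sent over the network, assuming in plan B that every record has to move.

```python
from collections import Counter


def plan_aggregate_then_merge(node_data):
    """Each node pre-aggregates; only partial sums are shipped and merged."""
    partials = [Counter() for _ in node_data]
    for partial, records in zip(partials, node_data):
        for key, value in records:
            partial[key] += value
    shipped = sum(len(p) for p in partials)   # one row per (node, group)
    result = Counter()
    for p in partials:
        result.update(p)                      # Counter.update adds the values
    return dict(result), shipped


def plan_partition_then_aggregate(node_data):
    """Raw records are shipped by key, then aggregated at the destination."""
    shipped = sum(len(records) for records in node_data)  # every record moves
    result = Counter()
    for records in node_data:
        for key, value in records:
            result[key] += value
    return dict(result), shipped


# Two hypothetical nodes holding (group, value) records.
node_data = [[("a", 1), ("a", 2), ("b", 1)],
             [("a", 5), ("b", 2), ("b", 3)]]
result_a, rows_a = plan_aggregate_then_merge(node_data)
result_b, rows_b = plan_partition_then_aggregate(node_data)
```

Both plans return the same groups, but they ship different amounts of data. With few distinct groups, pre-aggregation ships less; with almost as many groups as records, it spends CPU without reducing network traffic, which is the kind of trade-off a CPU, disk, and network cost model is meant to capture.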

Page 32

Provenance

Debugging in distributed systems is painful

We need to keep track of transformations on each record
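A minimal sketch of per-record provenance, assuming the simplest possible design: each record carries the list of transformations applied to it together with the input each one saw. The class `TrackedRecord` and the step names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedRecord:
    value: object
    history: list = field(default_factory=list)

    def apply(self, name, fn):
        """Return a new record holding fn(value) plus the extended history."""
        return TrackedRecord(fn(self.value), self.history + [(name, self.value)])


record = TrackedRecord(" 42 ")
record = (record.apply("strip", str.strip)
                .apply("to_int", int)
                .apply("double", lambda x: x * 2))
```

When an output value looks wrong, the history shows which step produced it and from what input. The price is extra storage per record, which is why practical systems tend to track provenance more coarsely or only on demand.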

Page 33

Big Graphs

Motivated by social networks

Billions of nodes and trillions of edges

Tens of thousands of insertions per second

Complex queries with graph traversals
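As a small illustration of a traversal query, the sketch below finds everyone within a given number of hops of a user with a breadth-first search. The adjacency-list representation, the function name `within_hops`, and the sample names are assumptions; at the scale described above the graph would be partitioned across machines rather than held in one dictionary.

```python
from collections import deque


def within_hops(adjacency, source, max_hops):
    """Breadth-first traversal returning {node: hop distance} up to max_hops."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue                      # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:  # first visit gives the shortest distance
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# A hypothetical friendship graph.
adjacency = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
friends_of_friends = within_hops(adjacency, "ann", max_hops=2)
```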

Page 34

Hadoop Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

MapReduce Query Engine

Administration

Pig

Page 35

Spark Ecosystem

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Resilient Distributed Dataset (RDD) a.k.a Spark Core

Data Frames, MLlib, GraphX, SparkR, Spark Streaming, Spark SQL

Page 36

Hyracks Data-parallel Platform

Algebricks Algebra Layer

Hadoop MapReduce Compatibility

Pregelix

Hivesterix, AsterixDB, Other compilers

Hyracks jobs, Pregel jobs, MapReduce jobs

PigLatin, HiveQL, AsterixQL

Page 37

Impala

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Query Executor

Query Planner

Query Parser

Page 38

SpatialHadoop

Hadoop Distributed File System (HDFS) + Spatial Indexing

Yet Another Resource Negotiator (YARN)

MapReduce Processing + Spatial Query Processing

Spatial Visualization

Pig Latin + Pigeon

Page 39

Reading Material

“The Age of Analytics in a Data-driven World” [Executive Summary] by McKinsey & Company
