Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Machine Learning | AWS...

Post on 23-Jan-2018


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Greg Khairallah, Business Development Manager, AWS

Malini Saxena, Senior Consultant, AWS

Raj Chary, VP of Technology / Architecture, WagglePractice

Lige Hensley, Chief Technology Officer, Ivy Tech

June 20, 2016

Easy Analytics with AWS

What to expect from this session

• AWS toolkit for analytics

• Understand stakeholders

• Demo

• Case Study – WagglePractice

• Case Study – Ivy Tech

• Q&A

Big data portfolio: but what do I recommend?

[Diagram: AWS big data services arranged across three stages: Collect, Store, and Analyze. Services shown: AWS Import/Export, AWS Direct Connect, Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Database Migration, Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, AWS Data Pipeline, Amazon CloudSearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon Kinesis Analytics, Amazon QuickSight]

Match toolset to right persona

• Business intelligence (BI) analyst

• Primary tool is SQL

• Historical data resides in a data warehouse such as Amazon Redshift

• Data scientist: uses programming languages such as R or Python

• Application developer: requires an API to integrate with AWS services

BI analyst

BI analyst with existing BI tools

• Primary tool is SQL

• Data is largely structured with well-known data sources

• Primary concern is fast, consistent performance

• Need to extend SQL with custom functions

[Diagram labels: BI analyst; BI tools on Amazon EC2; Amazon QuickSight; QuickSight API; Amazon Redshift]

Amazon Redshift system architecture

Leader node

• SQL endpoint

• Stores metadata

• Coordinates query execution

Compute nodes

• Local, columnar storage

• Execute queries in parallel

• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH

Two hardware platforms

• Optimized for data processing

• DS2: HDD; scale from 2 TB to 2 PB

• DC1: SSD; scale from 160 GB to 356 TB

[Diagram labels: client connections over JDBC/ODBC; node interconnect over 10 GigE (HPC)]

New SQL functions

We add SQL functions regularly to expand Amazon Redshift’s query capabilities

Added 25+ window and aggregate functions since launch, including:

LISTAGG

[APPROXIMATE] COUNT

DROP IF EXISTS, CREATE IF NOT EXISTS

REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE

PERCENTILE_CONT, _DISC, MEDIAN

PERCENT_RANK, RATIO_TO_REPORT

We’ll continue iterating but also want to enable you to write your own

Window function examples: http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_examples.html
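To make two of the listed window functions concrete, here is a minimal plain-Python sketch of what RATIO_TO_REPORT and PERCENT_RANK compute over a partition. It only illustrates the arithmetic and is not how Amazon Redshift implements them; the function names, the `region` and `sales` columns, and the sample rows are hypothetical.

```python
from collections import defaultdict


def ratio_to_report(rows, partition_key, value_key):
    """Each row's value divided by the sum of values in its partition."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[partition_key]] += row[value_key]
    return [row[value_key] / totals[row[partition_key]] for row in rows]


def percent_rank(values):
    """(rank - 1) / (n - 1); tied values share the lowest rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    ordered = sorted(values)
    # The first position of a value in sorted order equals its rank minus one.
    return [ordered.index(v) / (n - 1) for v in values]


# Hypothetical sample data, for illustration only.
sales = [
    {"region": "east", "sales": 30},
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 50},
]
print(ratio_to_report(sales, "region", "sales"))  # [0.75, 0.25, 1.0]
print(percent_rank([10, 20, 20, 40]))
```

In SQL the same results come from expressions of the form `RATIO_TO_REPORT(sales) OVER (PARTITION BY region)` and `PERCENT_RANK() OVER (ORDER BY sales)`.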

Scalar user defined functions

You can write UDFs using Python 2.7

• Syntax is largely identical to PostgreSQL UDF

• Python execution is performed in parallel

• System and network calls within UDFs are prohibited

Comes integrated with Pandas, NumPy, SciPy, DateUtil, and Pytz analytic libraries

• Import your own libraries for even more flexibility

• Take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL
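As a rough sketch of what a scalar Python UDF can look like, the block below defines a small function and the matching CREATE FUNCTION statement that would register the same body in Amazon Redshift. The UDF name `f_domain`, its argument, and the email use case are hypothetical; the body avoids system and network calls, in line with the restriction above, and is written to run under Python 2.7 as well as locally under Python 3.

```python
def f_domain(email):
    """Local copy of the UDF body: lower-cased domain of an email address."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


# Statement that would register the same body as a scalar UDF.
# f_domain is a hypothetical name; plpythonu is the language tag for Python UDFs.
CREATE_UDF_SQL = """
CREATE OR REPLACE FUNCTION f_domain (email VARCHAR)
RETURNS VARCHAR
IMMUTABLE
AS $$
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()
$$ LANGUAGE plpythonu;
"""

print(f_domain("Ada@Example.COM"))  # example.com
```

Once created, the function could be called like any built-in, for example `SELECT f_domain(email) FROM users;` against a hypothetical `users` table. Testing the body locally first is a convenient way to catch logic errors before registering it.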

What is Amazon QuickSight?

A very fast, cloud-powered business intelligence service for 1/10 the cost of traditional BI software

[Diagram: business users access QuickSight through the QuickSight UI (mobile devices, web browsers) and through partner BI products via the QuickSight API. QuickSight components: SPICE, Connectors, Data Prep, Metadata, Suggestions. Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, files, third-party]

Data scientist

Data scientist with existing toolsets

• Work with unstructured datasets

• Use existing toolsets to connect to Amazon Redshift

[Diagram labels: data scientist; toolkits like SAS or R Studio installed on Amazon EC2; unstructured data in Amazon S3; structured data in Amazon Redshift]

Querying Amazon Redshift with R packages

• RJDBC: supports SQL queries

• dplyr: uses R code for data analysis

• RPostgreSQL: R-compliant driver or database interface (DBI)

[Diagram labels: R user; R Studio on Amazon EC2; unstructured data in Amazon S3; user profile in Amazon RDS; Amazon Redshift]

Connecting R with Amazon Redshift blog post: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift

Querying Amazon Redshift with R packages example
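The R code from this slide did not survive in the transcript. As a loosely analogous sketch for the Python-using data scientist mentioned earlier, the block below runs a SQL query through the standard Python DB-API and returns rows as dictionaries. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; against Amazon Redshift one would presumably pass a connection from a PostgreSQL-compatible driver such as psycopg2 instead (an assumption, not something the slides state). The helper name, table, and columns are hypothetical.

```python
import sqlite3


def fetch_dicts(conn, sql, params=()):
    """Run a query on any DB-API connection and return rows as dicts."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        columns = [d[0] for d in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Stand-in database: SQLite replaces Redshift purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (term TEXT, students INTEGER)")
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [("fall", 120), ("spring", 95), ("fall", 30)],
)

rows = fetch_dicts(
    conn,
    "SELECT term, SUM(students) AS students "
    "FROM enrollments GROUP BY term ORDER BY term",
)
print(rows)
```

Note that parameter placeholders differ between drivers (`?` for sqlite3, `%s` for psycopg2), so parameterized queries need adjusting when the connection type changes.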

Application developer

Application developers can build smart applications using Amazon Machine Learning

• All skill levels

• Amazon Machine Learning technology is accessed through APIs and SDKs

• Embed visualizations in applications

[Diagram labels: application; Amazon Machine Learning (generate/query predictions); Amazon Redshift (structured data/predictions); Amazon QuickSight (visualize)]
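A minimal sketch of how an application might request a real-time prediction through the SDK. It assumes a client exposing the boto3 `machinelearning` `predict` call with its `MLModelId`, `Record`, and `PredictEndpoint` parameters; the helper name, model ID, endpoint URL, and feature names are hypothetical, and a stub client stands in for AWS so the example runs offline.

```python
def predict_label(client, model_id, endpoint_url, record):
    """Send one record to a real-time endpoint; return (label, scores)."""
    # Feature values are sent as strings, whatever their original type.
    payload = {name: str(value) for name, value in record.items()}
    response = client.predict(
        MLModelId=model_id,
        Record=payload,
        PredictEndpoint=endpoint_url,
    )
    prediction = response["Prediction"]
    return prediction.get("predictedLabel"), prediction.get("predictedScores", {})


class FakeMLClient:
    """Offline stand-in; a real client would be boto3.client("machinelearning")."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        self.last_record = Record
        return {"Prediction": {"predictedLabel": "1", "predictedScores": {"1": 0.87}}}


client = FakeMLClient()
label, scores = predict_label(
    client,
    "ml-example-model",                  # hypothetical model ID
    "https://example.invalid/endpoint",  # hypothetical endpoint URL
    {"gpa": 2.1, "credits": 12},         # hypothetical features
)
print(label, scores)
```

Passing the client in as a parameter keeps the function easy to test and leaves credential and region handling to the caller.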

Demo

Raj Chary, Vice President of Technology/Architecture, WagglePractice

What is Waggle?

Smart, responsive practice

Math and ELA (Grades 2-8)

Provides students the right challenge at the right time

Right Challenge, Right Time

Waggle looks for more than correct answers. Waggle continually analyzes each student’s decisions and progress. That way, students get tougher material right when they’re ready.

Productive Struggle

Waggle motivates students to push themselves forward. How? Through helpful hints, supportive feedback, and achievement badges that build grit and confidence.

Constructive Grouping

Waggle’s insights mean you can easily group students together based on learning needs. All without sacrificing the quality of individual instruction.

Waggle: Product Demo

• Data Creators

– Differentiated learning experience

– Fun and engaging

• Data Visualizers

– Seamless integration with application

– Analytics with a Story

– Actionable Data

Redshift: Data Warehouse Layout

• Write cluster: Compute, dw2.large

• Read cluster: Compute, dw2.large

• History cluster: Density, dw1.xlarge

[Diagram labels: data sources (APIs, OLTP); S3 COPY; staging; data mart (aggregations); nodes; initial and incremental {processed} data loads; S3 UnLoad and Load; S3 UnLoad and Load + UPSERTS; for serving Jaspersoft reports; periodic data snapshots for historical analysis]
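The layout mentions staging tables and UPSERTS. One common way to get upsert behavior is a staging-table merge: delete the target rows whose keys appear in staging, insert everything from staging, and commit both steps together. The sketch below shows that pattern under stated assumptions; it is not WagglePractice's actual code, the table and column names are hypothetical, and an in-memory SQLite database stands in for Redshift so the example can run locally.

```python
import sqlite3


def upsert_from_staging(conn, target, staging, key):
    """Merge staging rows into target: delete matching keys, then insert.

    Table and column names are interpolated into SQL, so they must come
    from trusted code, never from user input.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            f"DELETE FROM {target} WHERE {key} IN (SELECT {key} FROM {staging})"
        )
        cur.execute(f"INSERT INTO {target} SELECT * FROM {staging}")
        cur.execute(f"DELETE FROM {staging}")
        conn.commit()  # all three steps become visible together
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


# Stand-in database and hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student_id INTEGER, score INTEGER)")
conn.execute("CREATE TABLE scores_staging (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(1, 50), (2, 60)])
conn.executemany("INSERT INTO scores_staging VALUES (?, ?)", [(2, 75), (3, 80)])

upsert_from_staging(conn, "scores", "scores_staging", "student_id")
merged = conn.execute("SELECT * FROM scores ORDER BY student_id").fetchall()
print(merged)  # [(1, 50), (2, 75), (3, 80)]
```

Running the delete and insert in one transaction means readers never see the target table with the old rows removed but the new rows not yet loaded.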

Results and Lessons Learned

• Performance Metrics

– Millions of records are processed in <1 minute (LOAD/UNLOAD commands | UPSERTS | S3 COPY Command)

– Report queries average < 1 to ~1.5 seconds

– {compression} – gained 20+% efficiencies in data retrieval

• Best Practices

– {sort keys} – lens-based data model: visualize data in a variety of ways

– {commit stats} – Redshift is not a transactional system

– {nested loops} – no Cartesian products, ensure joins well managed

– {queries that queue} – tune the WLM configuration

– {query runtimes} – faster query means less queuing

– {stats missing} – analyze and vacuum when possible

– {alerts with tables} – monitor to ensure queries running optimally
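As an illustration of the commands these lessons refer to, here is a small sketch that builds COPY, UNLOAD, VACUUM, and ANALYZE statements as strings. The helper names, the bucket path, the IAM role ARN, and the table name are hypothetical, and the options shown (pipe delimiter, gzip) are one plausible choice rather than the settings WagglePractice used.

```python
def copy_from_s3(table, s3_path, iam_role):
    """COPY statement that loads gzip-compressed, pipe-delimited files from S3."""
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' GZIP DELIMITER '|';"


def unload_to_s3(query, s3_prefix, iam_role):
    """UNLOAD statement that writes a query result to S3 as gzip files."""
    # The query is passed as a quoted string, so inner quotes are doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' GZIP DELIMITER '|';"
    )


def maintenance_statements(table):
    """Re-sort and reclaim space, then refresh planner statistics."""
    return [f"VACUUM {table};", f"ANALYZE {table};"]


# Hypothetical names, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
print(copy_from_s3("scores", "s3://example-bucket/scores/", ROLE))
print(unload_to_s3("SELECT * FROM scores WHERE term = 'fall'",
                   "s3://example-bucket/unload/scores_", ROLE))
print(maintenance_statements("scores"))
```

Generating the statements in one place makes it easier to keep load and unload options consistent between clusters.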

Thank You

Ivy Tech & Amazon Redshift

May 25, 2016

What We’re Doing

• Transforming the culture of the College to be more data driven

• Moving from reporting silos to an Integrated Analytics system; we call this a Data Democracy

• Collecting and analyzing a vast variety of data at a scale that no one in Higher Ed is doing

• Using machine learning tools to identify students who may need further assistance

• Starting this fall, we are implementing a one-on-one coaching initiative for the students we identified with the machine learning tools

96% of organizations in the United States use data in the same way.

…and it’s wrong.

But it’s not just education…

The “Standard” Approach

VIP

Relevant Data for Everyone

Data Regimes

Data Dictatorship: Data is controlled and its use is restricted. There is asymmetric distribution of information based on your position

Data Aristocracy: Data analysts, scientists and PhDs are needed to do anything meaningful. Power concentrates in the hands of these employees and their supervisors

Data Anarchy: Business users feel underserved and take matters into their own hands. They create “shadow IT” systems and work around the “unresponsive” IT group

Data Democracy: Everybody gets timely and equitable access to data. Line of business users are empowered and “own” the data. Executives and IT get out of the way

1 Shash Hegde, Mariner, “The Rise of Data Regimes”, 9/12/13, http://www.mariner-usa.com/rise-data-regimes/ (image substitution for Mao Zedong)

Every organization moves through increasingly complex stages of data accessibility.

Data Maturity Model

… very few complete the transition to Integrated Analytics

Stage 1: Report Silos

[Diagram: source systems Request Tracker, Banner, Blackboard, Luminis, Starfish, SCCM, CAS Authentication]

This is what we have had for decades at Ivy Tech…

Stage 2: Data Warehousing

[Diagram: source systems Request Tracker, Banner, Blackboard, Luminis, Starfish, SCCM, CAS Authentication]

This is what most companies do… but we are taking this a step further…

Stage 3: Integrated Analytics

[Diagram: source systems Request Tracker, Banner, Blackboard, Luminis, Starfish, SCCM, CAS Authentication; curated collections Students by Financial Aid, Students by Award, Students by Term, Students by Class, Classes by Class Section, Students]

These curated collections of data are designed to enable direct access to the data you need, regardless of where it came from. Quickly. Easily.

GPA Graduation - Cumulative

Graduation Grade Point Average (Cumulative) is an indication of a student's academic progress for all semester credit classes for all registered terms up to and including the selected term. Letter grades are assigned points (A=4, B=3, C=2, D=1, F=0) and the GPA is calculated by taking the number of grade points a student earned in a selected period of time divided by the total number of classes taken during that same period.

GPA Graduation Cumulative = Sum of a student's total grade points earned in credit classes for all classes for all registered terms up to and including the selected term / Sum of student's total classes taken during that same period

NOTES ON USING THIS TERM: GPA Graduation - Cumulative does not include grades from remedial classes.

Related Terms: [GPA Graduation - Term]
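The definition above can be sketched as a small calculation. This is an illustration of the stated formula only, not Ivy Tech's implementation; the record layout (`term`, `grade`, `remedial`), the integer term ordering, and the choice to skip grades outside A-F are assumptions.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def cumulative_graduation_gpa(classes, selected_term):
    """Grade points earned divided by classes taken, up to selected_term.

    Remedial classes are excluded, as the definition requires. Grades
    outside A-F are skipped (an assumption of this sketch).
    """
    counted = [
        c for c in classes
        if c["term"] <= selected_term
        and not c["remedial"]
        and c["grade"] in GRADE_POINTS
    ]
    if not counted:
        return None  # no qualifying classes, so the GPA is undefined
    return sum(GRADE_POINTS[c["grade"]] for c in counted) / len(counted)


# Hypothetical transcript; terms are numbered in chronological order.
transcript = [
    {"term": 1, "grade": "A", "remedial": False},
    {"term": 1, "grade": "C", "remedial": False},
    {"term": 2, "grade": "B", "remedial": False},
    {"term": 2, "grade": "F", "remedial": True},   # excluded: remedial
    {"term": 3, "grade": "A", "remedial": False},  # excluded: after term 2
]
print(cumulative_graduation_gpa(transcript, selected_term=2))  # 3.0
```

Note that the definition divides by the number of classes rather than by credit hours, so every counted class carries equal weight.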

Questions?

Resources

Amazon Redshift Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

Scalar UDF Documentation: http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html

Introduction to Python UDFs in Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1IHV1G67CY53T/Introduction-to-Python-UDFs-in-Amazon-Redshift

Connecting R with Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift

Databricks Apache Spark–Amazon Redshift Tutorial: https://github.com/databricks/spark-redshift/tree/master/tutorial

Amazon ML Getting Started Guide: https://aws.amazon.com/machine-learning/getting-started/

Amazon QuickSight (Preview Registration): https://aws.amazon.com/quicksight/

Thank you!