Aws Atlanta meetup Amazon Athena
AWS Athena
Querying Data in S3 without the need for EMR
Sponsors
Find me on LinkedIn
AWS Certifications
Presented by Adam Book
What is Athena?
Big Data Offerings Announced at re:Invent 2016
AWS Glue Amazon Athena AWS Greengrass
What is Athena?
Athena allows you to run interactive SQL queries on S3 data
WITHOUT the need to CREATE, MANAGE, or SCALE infrastructure
About Athena
Athena is currently available in only 2 regions:
• US-EAST-1 (Northern Virginia)
• US-WEST-2 (Oregon)
Pricing: $5 per TB scanned
No charge for failed queries
No charge for Data Definition Language (DDL) statements:
• CREATE / ALTER / DROP TABLE
• Managing partitions
What is Athena?
Athena allows you to run interactive SQL queries on S3 data
• S3 data is never modified (data is loaded into read-only memory)
• Cross-region buckets are supported
• You can access Athena via JDBC
WITHOUT the need to CREATE, MANAGE, or SCALE infrastructure
• You only pay for the queries you run
• Queries execute in parallel, so results are FAST, even with large datasets and complex queries
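As a sketch of what such an interactive query looks like (the table, columns, and date value here are hypothetical, and assume a table has already been defined over the S3 data):

```sql
-- Hypothetical example: count requests per status code in log data on S3.
-- Assumes a table "my_logs" with these columns has already been created.
SELECT status_code,
       COUNT(*) AS request_count
FROM my_logs
WHERE request_date = '2017-01-15'
GROUP BY status_code
ORDER BY request_count DESC;
```

Athena charges only for the data this query scans, so narrowing the columns and rows read keeps the cost down.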
Athena Queries
Service based on Presto (which is available in Amazon EMR)
ANSI SQL operators and functions
Table creation uses Apache Hive DDL
• CREATE EXTERNAL TABLE only
• CREATE TABLE AS SELECT is not supported
Unsupported operations:
• User-Defined Functions (UDFs or UDAFs)
• Stored procedures
• Any transactions found in Hive or Presto
• LZO compression (use Snappy instead)
Running queries on Athena
Run queries straight from the AWS Console
• Use the wizard for schema definition
• It can save queries
• It can run multiple queries in parallel
Running queries on Athena
Run queries from your favorite tool (SQL Workbench, Aginity, Aqua Data Studio)
• Requires the JDBC driver
JDBC 4.1-compatible driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar
Or via the AWS CLI:
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar [local_directory]
See the Amazon Athena documentation for full details
Available Athena File Formats
• Apache Web Logs
• CSV
• TSV
• JSON
• Parquet
• ORC
• Text files with custom delimiters
• AVRO (COMING SOON)
Tips on File Formats
Use compressed formats: Snappy, Zlib, GZIP (no LZO)
• Less I/O = better performance and more cost savings
Use structured / columnar formats
• Apache Parquet
• Apache ORC (Optimized Row Columnar)
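A columnar table is declared the same way as any other; a minimal sketch, assuming hypothetical table name, columns, and bucket path, and Parquet files already written to S3:

```sql
-- Hypothetical sketch: defining a table over Parquet files in S3.
-- Table name, columns, and bucket path are illustrative assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_parquet (
  order_id bigint,
  customer string,
  amount   double
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales-parquet/';
```

Because Athena bills per TB scanned, a columnar layout that lets queries read only the referenced columns directly reduces cost as well as I/O.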
The Difference between Hive and Presto
[Diagram: Hive vs. Presto execution models]
HIVE:
• Map and reduce stages write their data to disk
• Wait between stages
PRESTO:
• All stages are pipelined: no wait time
• Memory-to-memory data transfer: no disk I/O
• Each data chunk must fit in memory
• No fault-tolerance
First, create a new database
Choose the data format
Then lay out the columns
Optionally, add partitions
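Partitions can also be declared and added in DDL rather than through the wizard; a sketch with hypothetical names and paths:

```sql
-- Hypothetical sketch: a partitioned table, plus adding one partition manually.
-- Table name, columns, and bucket layout are illustrative assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id string,
  payload  string
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/events/';

-- Register the S3 prefix that holds one partition's data.
ALTER TABLE events ADD PARTITION (year = '2017', month = '01')
LOCATION 's3://my-bucket/events/2017/01/';
```

Queries that filter on the partition columns scan only the matching S3 prefixes, which cuts both runtime and the per-TB-scanned cost.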
Bypassing the wizard
CREATE EXTERNAL TABLE IF NOT EXISTS default.elb_logs (
  `request_timestamp` string,
  `elb_name` string,
  `request_ip` string,
  `request_port` int,
  `backend_ip` string,
  `backend_port` int,
  `request_processing_time` double,
  `backend_processing_time` double,
  `client_response_time` double,
  `elb_response_code` string,
  `backend_response_code` string,
  `received_bytes` bigint,
  `sent_bytes` bigint,
  `request_verb` string,
  `url` string,
  `protocol` string,
  `user_agent` string,
  `ssl_cipher` string,
  `ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?'
)
LOCATION 's3://athena-examples/elb/plaintext/';
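Once a table like this exists, it can be queried with standard ANSI SQL; for example (using only columns from the DDL above), counting requests per load balancer and backend response code:

```sql
-- Query the ELB log table defined above.
SELECT elb_name,
       backend_response_code,
       COUNT(*) AS requests
FROM default.elb_logs
GROUP BY elb_name, backend_response_code
ORDER BY requests DESC
LIMIT 10;
```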
Athena Catalog Management
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3
For more information Read the documentation page for Catalog Management
Athena Features: Saved Queries
Athena Tips and Tricks
You will need some understanding of the structure of your data, or a DDL meta store, before you can start querying with Athena
When looking through your S3 buckets and finding one of the allowed formats (CSV, TSV, Parquet, etc.), you will need to know which columns are there so you can create the Athena table before you can start querying
Avoid surprises when working with Athena by using the following tips and tricks
Athena Tips and Tricks
Table names that begin with an underscore
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable ( `_id` string, `_index` string, ...
Athena Tips and Tricks
For the location clause, use a trailing slash
In the LOCATION clause, use a trailing slash for your folder, NOT filenames or glob characters
Don't USE:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
USE:
s3://path_to_bucket/
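The rule above applies in the DDL itself; a minimal sketch with hypothetical table and column names:

```sql
-- LOCATION points at the folder, with a trailing slash:
-- not a single file and not a glob pattern.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  col1 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path_to_bucket/';
```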
Athena Tips and Tricks cont.
Athena table names are case insensitive. If you are interacting with Apache Spark, however, your table and column names must be lowercase, because Spark requires lowercase names.
Athena table names allow only the underscore as a special character; no other special characters are permitted
Questions?
Image by http://www.gratisography.com/
Interested in Sponsoring AWS Atlanta?