Aws Atlanta meetup Amazon Athena
AWS Athena
Querying Data in S3 without the need for EMR
Sponsors
Find me on LinkedIn
AWS Certifications
Presented by Adam Book
What is Athena?
Big Data Offerings Announced at re:Invent 2016
AWS Glue Amazon Athena AWS Greengrass
What is Athena?
Athena allows you to run interactive SQL queries on S3 data
WITHOUT the need to CREATE, MANAGE, or SCALE infrastructure
About Athena
Athena is currently available in only 2 regions:
• US-EAST-1 (Northern Virginia)
• US-WEST-2 (Oregon)
Pricing: $5 per TB scanned
No charge for failed queries
No charge for Data Definition Language (DDL) statements:
• CREATE / ALTER / DROP TABLE
• Managing partitions
What is Athena?
Athena allows you to run interactive SQL queries on S3 data
• S3 data is never modified (data is loaded into read-only memory)
• Cross-region buckets are supported
• You can access Athena via JDBC
WITHOUT the need to CREATE, MANAGE, or SCALE infrastructure
• You only pay for the queries you run
• Queries execute in parallel, so results are FAST, even with large datasets and complex queries
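As a sketch of what such an interactive query looks like (the table, columns, and date value here are hypothetical, and assume a table has already been defined over the S3 data):

```sql
-- Hypothetical example: count requests per status code in log data on S3.
-- Assumes a table "my_logs" with these columns has already been created.
SELECT status_code,
       COUNT(*) AS request_count
FROM my_logs
WHERE request_date = '2017-01-15'
GROUP BY status_code
ORDER BY request_count DESC;
```

Athena charges only for the data this query scans, so narrowing the columns and rows read keeps the cost down.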
Athena Queries
Service based on Presto (which is available in Amazon EMR)
ANSI SQL operators and functions
Table creation uses Apache Hive DDL
• CREATE EXTERNAL TABLE only
• CREATE TABLE AS SELECT is not supported
Unsupported operations:
• User-Defined Functions (UDFs or UDAFs)
• Stored procedures
• Any transactions found in Hive or Presto
• LZO compression (use Snappy instead)
Running queries on Athena
Run queries straight from the AWS Console
• Use the wizard for schema definition
• It can save queries
• It can run multiple queries in parallel
Running queries on Athena
Run queries from your favorite tool (SQL Workbench, Aginity, Aqua Data Studio)
• Requires the JDBC driver
JDBC 4.1-compatible driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar
Or via the AWS CLI:
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar [local_directory]
See the Amazon Athena documentation for full details
Available Athena File Formats
• Apache Web Logs
• CSV
• TSV
• JSON
• Parquet
• ORC
• Text files with custom delimiters
• AVRO (COMING SOON)
Tips on File Formats
Use compressed formats: Snappy, Zlib, GZIP (no LZO)
• Less I/O = better performance and more cost savings
Use structured / columnar formats
• Apache Parquet
• Apache ORC (Optimized Row Columnar)
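A columnar table is declared the same way as any other; a minimal sketch, assuming hypothetical table name, columns, and bucket path, and Parquet files already written to S3:

```sql
-- Hypothetical sketch: defining a table over Parquet files in S3.
-- Table name, columns, and bucket path are illustrative assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_parquet (
  order_id bigint,
  customer string,
  amount   double
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales-parquet/';
```

Because Athena bills per TB scanned, a columnar layout that lets queries read only the referenced columns directly reduces cost as well as I/O.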
The Difference between Hive and Presto
[Diagram: Hive vs. Presto execution models]
HIVE:
• Map and reduce stages write their data to disk
• Wait between stages
PRESTO:
• All stages are pipelined: no wait time
• Memory-to-memory data transfer: no disk I/O
• Each data chunk must fit in memory
• No fault-tolerance
First, create a new database
Choose the data format
Then lay out the columns
Optionally, add partitions
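Partitions can also be declared and added in DDL rather than through the wizard; a sketch with hypothetical names and paths:

```sql
-- Hypothetical sketch: a partitioned table, plus adding one partition manually.
-- Table name, columns, and bucket layout are illustrative assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id string,
  payload  string
)
PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/events/';

-- Register the S3 prefix that holds one partition's data.
ALTER TABLE events ADD PARTITION (year = '2017', month = '01')
LOCATION 's3://my-bucket/events/2017/01/';
```

Queries that filter on the partition columns scan only the matching S3 prefixes, which cuts both runtime and the per-TB-scanned cost.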
Bypassing the wizard
CREATE EXTERNAL TABLE IF NOT EXISTS default.elb_logs (
  `request_timestamp` string,
  `elb_name` string,
  `request_ip` string,
  `request_port` int,
  `backend_ip` string,
  `backend_port` int,
  `request_processing_time` double,
  `backend_processing_time` double,
  `client_response_time` double,
  `elb_response_code` string,
  `backend_response_code` string,
  `received_bytes` bigint,
  `sent_bytes` bigint,
  `request_verb` string,
  `url` string,
  `protocol` string,
  `user_agent` string,
  `ssl_cipher` string,
  `ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?'
)
LOCATION 's3://athena-examples/elb/plaintext/';
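Once a table like this exists, it can be queried with standard ANSI SQL; for example (using only columns from the DDL above), counting requests per load balancer and backend response code:

```sql
-- Query the ELB log table defined above.
SELECT elb_name,
       backend_response_code,
       COUNT(*) AS requests
FROM default.elb_logs
GROUP BY elb_name, backend_response_code
ORDER BY requests DESC
LIMIT 10;
```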
Athena Catalog Management
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3
For more information Read the documentation page for Catalog Management
Athena Features: Saved Queries
Athena Tips and Tricks
You will need some understanding of the structure of your data, or a DDL meta store, before you can start querying with Athena
When looking through your S3 buckets and finding one of the allowed formats (CSV, TSV, Parquet, etc.), you will need to know which columns are there so you can create the Athena table before you can start querying
Avoid surprises when working with Athena by using the following tips and tricks
Athena Tips and Tricks
Table names that begin with an underscore
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable ( `_id` string, `_index` string, ...
Athena Tips and Tricks
For the location clause, use a trailing slash
In the LOCATION clause, use a trailing slash for your folder, NOT filenames or glob characters
Don't USE:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
USE:
s3://path_to_bucket/
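The rule above applies in the DDL itself; a minimal sketch with hypothetical table and column names:

```sql
-- LOCATION points at the folder, with a trailing slash:
-- not a single file and not a glob pattern.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  col1 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path_to_bucket/';
```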
Athena Tips and Tricks cont.
Athena table names are case insensitive. If you are interacting with Apache Spark, however, your table and column names must be lowercase, because Spark requires lowercase names.
Athena table names allow only the underscore as a special character; no other special characters are permitted
Questions?
Image by http://www.gratisography.com/
Interested in Sponsoring AWS Atlanta?