Post on 16-Apr-2017
AWS: Redshift overview
PRESENTATION PREPARED BY VOLODYMYR ROVETSKIY

Agenda
What is AWS Redshift
Amazon Redshift Pricing
AWS Redshift Architecture
• Data Warehouse System Architecture
• Internal Architecture and System Operation
Query Planning and Designing Tables
• Query Planning And Execution Workflow
• Columnar Storage
• Zone Maps
• Compression
• Referential Integrity
Data Location in Redshift
• The Sort Key
• The Distribution Key
Workload Management (WLM)
Loading Data
• What is Amazon S3
• Data Loading from Amazon S3
• COPY from Amazon S3
Redshift table maintenance operations
• ANALYZE
• VACUUM
Amazon Redshift Snapshots
Amazon Redshift Security
Monitoring Cluster Performance
Useful resources
Conclusion
What is Amazon Redshift
Cluster architecture
Columnar storage
Zone maps
Compression
Read optimized
No referential integrity by design
Redshift is Amazon's cloud data warehousing service; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud-based system it is rented by the hour from Amazon, and broadly, the more storage you rent the more you pay.
Amazon Redshift features
Amazon Redshift Pricing
Clients pay an hourly rate based on the type and number of nodes in their cluster. Discounts of up to 75% over On-Demand rates are available by committing to use Amazon Redshift for a 1- or 3-year term.
Prices include two additional copies of your data, one on the cluster nodes and one in Amazon S3.
Amazon Redshift takes care of backup, durability, availability, security, monitoring, and maintenance.
Price depends on the chosen region.
Dense Storage (DS) nodes allow you to create large data warehouses using hard disk drives (HDDs) at a low price point.
Dense Compute (DC) nodes allow you to create high-performance data warehouses using fast CPUs, large amounts of RAM, and solid-state disks (SSDs).
Data Warehouse System Architecture
Leader node
• Stores metadata
• Manages communications with client programs and compute nodes
• Manages distributing data to the slices on compute nodes
• Develops and distributes execution plans for compute nodes

Compute nodes
• Execute the query segments in parallel and send results back to the leader node for final aggregation
• Each compute node has its own dedicated CPU, memory, and attached disk storage
• User data is stored on the compute nodes

Node slices
• Each slice is allocated a portion of the node's memory and disk space
• The slices work in parallel to complete the operation
• The number of slices per node is determined by the node size of the cluster
• The rows of a table are distributed to the node slices according to the distribution key

Client applications
• Amazon Redshift is based on industry-standard PostgreSQL
The following diagram shows a high level view of internal components and functionality of the Amazon Redshift data warehouse.
Internal Architecture and System Operation
Query Planning And Execution Workflow
The query planning and execution workflow follows these steps:
1. The leader node receives the query and parses the SQL.
2. The parser produces an initial query tree that is a logical representation of the original query. Amazon Redshift then inputs this query tree into the query optimizer.
3. The optimizer evaluates the query and, if necessary, rewrites it to maximize its efficiency.
4. The optimizer generates a query plan for the execution with the best performance.
5. The execution engine translates the query plan into compiled C++ code.
6. The compute nodes execute the compiled code segments in parallel.
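You can inspect the plan the optimizer produces with the EXPLAIN command; the table and column names below are hypothetical:

```sql
-- Show the query plan (steps, data redistribution, and relative
-- cost estimates) without actually running the query.
EXPLAIN
SELECT region, COUNT(*)
FROM sales
GROUP BY region;
```

The output lists the plan steps, such as scans, aggregates, and any data redistribution between compute nodes, together with their estimated relative costs.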
Columnar Storage
Pic.1 shows how records from database tables are typically stored into disk blocks by row.
Pic.2 shows how, with columnar storage, the values for each column are stored sequentially into disk blocks.
Columnar storage optimizes analytic query performance because:
• it reduces the overall disk I/O requirements
• it reduces the amount of data you need to load from disk
• each block holds the same type of data
• block data can use a compression scheme selected specifically for the column data type
Zone Maps
The zone map is held separately from the block, like an index.
The zone map holds only two data points per block, the highest and lowest values in the block.
Redshift uses the zone map when executing queries, and excludes the blocks that the zone map indicates won’t be returned by the WHERE clause filter.
Zone maps filter data blocks efficiently if the filtered columns are used as the sort key.
Compression
Benefits of Compression
• Reduces the size of data when it is stored or read from storage
• Conserves storage space
• Reduces the amount of disk I/O
• Improves query performance

Redshift recommendations and advice:
• Use the COPY command to apply automatic compression (COMPUPDATE ON)
• Produce a report with the suggested column encoding schemes for the tables analyzed (ANALYZE COMPRESSION)
• Compression type cannot be changed for a column after the table is created
• Highly compressed sort keys mean many rows per block, so you'll scan more data blocks than you need
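The encoding report mentioned above can be produced with ANALYZE COMPRESSION; the table name here is hypothetical:

```sql
-- Sample the table's rows and report the suggested compression
-- encoding for each column, with the estimated space savings.
ANALYZE COMPRESSION sales;
```

Because the command takes a table lock while sampling, it is typically run during quiet periods rather than as part of a regular load.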
Referential integrity. Redshift unsupported features:
• Table partitioning
• Tablespaces
• Constraints:
  ◦ Unique
  ◦ Foreign key
  ◦ Primary key
  ◦ Check constraints
  ◦ Exclusion constraints
• Indexes
Important: Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.
• Collations
• Stored procedures
• Triggers
• Table functions
• Sequences
• Full text search
• Exotic data types (arrays, JSON, geospatial types, etc.)
Data Location in Redshift
The Sort Key
• Each table can have a single sort key – a compound key, comprised of 1 to 400 columns from the table
• Redshift stores data on disk in sort key order
• Sort keys should be selected based on how the table is used:
• Columns that are used to join to other tables should be included in the sort key;
• Date type columns that are used in filtering operations should be included;
• Redshift stores metadata about each data block, including the min and max of each column value – using the sort key, Redshift can skip entire blocks when answering a query.
Sort keys and Zone Maps
Without a sort key:

CREATE TABLE SOME_TABLE (
  SALESID INTEGER NOT NULL,
  DATE DATETIME NOT NULL
);

SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013';

With a sort key:

CREATE TABLE SOME_TABLE (
  SALESID INTEGER NOT NULL,
  DATE DATETIME NOT NULL
) SORTKEY (DATE);

SELECT COUNT(*) FROM SOME_TABLE
WHERE DATE = '09-JUNE-2013';
The Sort keys – Single Column
Table is sorted by 1 column [ SORTKEY ( date ) ]. Best for:
• Queries that use the 1st column (i.e. date) as primary filter
• Can speed up joins and group by
• Quickest to VACUUM
Example:
create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey(date);
The Sort keys – Compound
Table is sorted by the 1st column, then the 2nd column, etc. [ SORTKEY COMPOUND ( date, region, country) ]. Best for:
• Queries that filter on a prefix of the sort key columns (e.g. date, or date and region)
• Joins, GROUP BY, and ORDER BY on the sort key columns
Example:
create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey compound (date, region, country);
The Sort keys – Interleaved
Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ]. Best for:
• Queries that use different columns in the filter
• Queries get faster the more columns are used in the filter
• The slowest to VACUUM
Example:
create table sales(
  date datetime not null,
  region varchar not null,
  country varchar not null)
distkey(date)
sortkey interleaved(date, region, country);
Data Location in Redshift
The Distribution Key
• Redshift will distribute and replicate data between compute nodes
• By default, data will be spread evenly across all compute node slices (EVEN distribution)
• The even distribution of data across the nodes is vital to ensuring consistent query performance
• If data is denormalised and does not participate in joins, then an EVEN distribution won't be problematic
• Alternatively, a distribution key can be provided (KEY distribution)
• The distribution key helps distribute data across a node's slices
• The distribution key is defined on a per-table basis
• The distribution key is comprised of only a single column
Distribution styles by example
KEY distribution:
• Large fact tables
• Large dimension tables
ALL distribution:
• Medium dimension tables (1K – 2M rows)
EVEN distribution:
• Tables with no joins or group by
• Small dimension tables (<1000 rows)
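As a sketch of KEY distribution, the pair of tables below (hypothetical names and columns) share a distribution key, so rows that join on customerid land on the same slice and the join needs no data redistribution:

```sql
-- Hypothetical fact table distributed by its join column.
CREATE TABLE sales (
  salesid    INTEGER NOT NULL,
  customerid INTEGER NOT NULL
) DISTKEY (customerid);

-- Dimension table distributed by the same key, enabling a
-- co-located join with the fact table.
CREATE TABLE customers (
  customerid INTEGER NOT NULL,
  name       VARCHAR(100)
) DISTKEY (customerid);
```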
Workload Management (WLM)
WLM allows you to:
• Manage and adjust query concurrency
• Increase query concurrency up to 15 in a queue
• Define user groups and query groups
• Segregate short and long running queries
• Help improve performance of individual queries

Be aware:
• Query workload is distributed to every compute node
• Increasing concurrency may not always help due to resource contention (CPU, memory, I/O)
• Total throughput may increase by letting one query complete first and allowing other queries to wait

WLM options by default:
• 1 queue with a concurrency of 5
• Define up to 8 queues with a total concurrency of 15
• Redshift has a super user queue internally
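Queries can be routed to a specific WLM queue by setting the session's query group; the group name 'reports' and the table queried below are hypothetical:

```sql
-- Route subsequent queries in this session to the WLM queue
-- whose configuration matches the query group 'reports'.
SET query_group TO 'reports';

SELECT COUNT(*) FROM sales;  -- runs in the matching queue

-- Return to the default queue assignment.
RESET query_group;
```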
Short Description of Amazon Simple Storage Service (S3)
• Cloud storage for web applications
• Origin store for content distribution
• Staging area and persistent store for Big Data analytics
• Backup and archive target for databases
To use Amazon S3, you need an AWS account.
Before you can store data in Amazon S3, you must create a bucket.
Add an object to the created bucket (a text file, a photo, a video and so forth).
When objects are added to the bucket, you can view and manage them.
Data Loading from Amazon S3
Best practices and recommendations:
• S3 bucket and your cluster must be created in the same region
• Split your data on S3 into multiple files
• Use the COPY command to load data
• Load your data in sort key order to avoid needing to vacuum
• Organize your data as a sequence of time-series tables
• Run the VACUUM command whenever you add, delete, or modify a large number of rows
• Run the ANALYZE command whenever you've made a non-trivial number of changes, to update table statistics
COPY from Amazon S3 Syntax Parameters
FROM - the path to the Amazon S3 objects that contain the data
MANIFEST - The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same region as the Amazon Redshift cluster.
ENCRYPTED - Specifies that the input files on Amazon S3 are encrypted using client-side encryption.
REGION [AS] 'aws-region' - Specifies the AWS region where the source data is located.
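A minimal COPY sketch using the parameters above; the bucket path, IAM role ARN, and table name are hypothetical:

```sql
-- Load all objects under the S3 prefix into the sales table,
-- authorizing through an IAM role and naming the source region.
COPY sales
FROM 's3://mybucket/data/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
REGION AS 'us-east-1';
```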
Redshift table maintenance operations
ANALYZE: The command used to capture statistical information about a table for use by the query planner.
• Run before running queries.
• Run against the database after a regular load or update cycle.
• Run against any new tables that you create.
• Consider running ANALYZE operations on different schedules for different types of tables and columns, depending on their use in queries and their propensity to change.
• You do not need to analyze all columns in all tables regularly or on the same schedule. Analyze the columns that are frequently used in the following:
  • Sorting and grouping operations
  • Joins
  • Query predicates
This command can analyze the whole table or specified columns:
ANALYZE <TABLE NAME>;
ANALYZE <TABLE NAME> (<COLUMN1>,<COLUMN2>);
VACUUM: A process to physically reorganize tables after load activity.
• Can be run in 4 modes:
• VACUUM FULL - reclaims space and re-sorts
• VACUUM DELETE ONLY - reclaims space but does not re-sort
• VACUUM SORT ONLY - re-sorts but does not reclaim space
• VACUUM REINDEX - used for INTERLEAVED sort keys; re-analyzes sort keys and then runs a full VACUUM
VACUUM is an I/O intensive operation and can take time to run. To minimize the impact of VACUUM:
• Run VACUUM on a regular schedule during time periods when you expect minimal activity on the cluster
• Use TRUNCATE instead of DELETE where possible
• TRUNCATE or DROP test tables
• Perform a deep copy instead of VACUUM
• Load data in sort order and remove the need for VACUUM
TO threshold PERCENT - the threshold above which VACUUM skips the sort phase, and the target threshold for reclaiming space in the delete phase. If you include the TO threshold PERCENT parameter, you must also specify a table name. This parameter can't be used with REINDEX.
For example, if you specify 75 for threshold, VACUUM skips the sort phase if 75 percent or more of the table's rows are already in sort order. For the delete phase, VACUUM sets a target of reclaiming disk space such that at least 75 percent of the table's rows are not marked for deletion following the vacuum. The threshold value must be an integer between 0 and 100. The default is 95.
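The threshold parameter can be combined with a full vacuum as in this sketch (the table name is hypothetical):

```sql
-- Re-sort rows and reclaim space in the sales table, skipping the
-- sort phase if at least 75 percent of rows are already sorted.
VACUUM FULL sales TO 75 PERCENT;
```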
Amazon Redshift Snapshots
Automated snapshots:
• Enabled by default when the cluster is created
• Taken periodically from the cluster (every eight hours or every 5 GB of data changes)
• Deleted at the end of a retention period (1 day by default)
• Can be disabled (set the retention period to 0)
Manual Snapshots
• Can be taken whenever you want
• Are never automatically deleted
• Accrue storage charges
Excluding Tables From Snapshot
• To create a no-backup table, include the BACKUP NO parameter when you create the table
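A no-backup table can be declared as in this sketch (the staging table's name and columns are hypothetical):

```sql
-- Staging table excluded from automated and manual snapshots.
CREATE TABLE staging_sales (
  salesid  INTEGER NOT NULL,
  saledate DATE    NOT NULL
) BACKUP NO;
```

This is typically used for staging tables whose contents can be reloaded from the source data at any time.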
Copying Snapshots to Another Region
• Copying snapshots across regions incurs data transfer charges
Restoring a Table from a Snapshot (feature added March 10, 2016)
• You can restore a table only to the current, active running cluster and from a snapshot that was taken of that cluster.
• You can restore only one table at a time.
• You cannot restore a table from a cluster snapshot that was taken prior to a cluster being resized.
Amazon Redshift Security
Cluster security: Controlling access to the Redshift cluster for management
• The cluster runs within a Virtual Private Cloud (VPC) managed by the Amazon Redshift service

Connection security: Controlling clients that can connect to the Redshift cluster
• Users can only connect to the cluster using ODBC or JDBC connections. You may optionally only permit connections to the Amazon Redshift cluster from a VPC you control.
Database object security: Controlling which users have access to which database objects
• At the database security level Amazon Redshift uses the Postgres security model, with user name / password authentication. Database user accounts are configured separately from Redshift's management security using SQL commands.

Data security: Encryption of data at rest (load data, table data, and backup data)
• You can encrypt data that is loaded into Amazon Redshift, encrypt the data stored in the Amazon Redshift tables, and encrypt the backups.
Monitoring Cluster Performance
Amazon CloudWatch metrics help you monitor physical aspects of your cluster, such as CPU utilization, latency, and throughput.
Performance data helps you monitor database activity and performance. This data is aggregated in the Amazon Redshift console to help you easily correlate what you see in Amazon CloudWatch metrics.
Useful resources to learn more about Redshift
Redshift Documentation
• https://aws.amazon.com/redshift
• http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
Open Source Scripts and Tools
• https://github.com/awslabs/amazon-redshift-utils
• http://www.aginity.com/redshift
Conclusion
Amazon Redshift’s features
• Optimized for Data Warehousing – It uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. Redshift has a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations to take advantage of all available resources.
• Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse up or down as your performance or capacity needs change.
• No Up-Front Costs – You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing.
• Fault Tolerant – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3.
• SQL – Amazon Redshift is a SQL data warehouse and uses industry-standard ODBC and JDBC connections and Postgres drivers.
• Isolation – Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster.
• Encryption – With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest.
Jeff Bezos reacted to my payment :-))