Teradata Overview

Teradata: An Overview

Access patterns are different

The access patterns of the two approaches (transaction processing and decision support) are very different, and hence they make very different demands on the underlying database engine. The basic database architecture has to be different to be optimized for one type of processing or the other. Teradata is a leader in the DSS and data warehouse space.

What is Teradata?

Teradata is a Relational Database Management System (RDBMS) composed of hardware and software, designed for the world's largest commercial databases. It is used by customers looking for answers to their business questions from data of over 1 terabyte, including:

- 6 of the top 10 retailers
- 6 of the top 9 communications companies
- Over 40% of the world's leading manufacturers
- 3 of the top 4 Blue Cross/Blue Shield insurance companies
- Many of the world's leading banks

Teradata: a brief history

- 1979 - Teradata Corp founded in Los Angeles, California; development begins on a massively parallel database computer
- 1984 - Teradata sells its first DBC/1012
- 1986 - Product of the Year
- 1990 - First terabyte system installed and in production
- 1992 - Teradata is merged into NCR
- 1995 - Teradata Version 2 for UNIX operating systems released

Why Teradata?

Capacity:

Scales from gigabytes to terabytes of detailed data stored in billions of rows, and to thousands of millions of instructions per second (MIPS) to process that data.

Performance:

A shared-nothing architecture achieves parallelism at each and every stage of query execution, making the Teradata Database faster than other relational systems.

Single Data Store:

Can be accessed by both network-attached and channel-attached systems, and supports the requirements of many diverse clients.

Fault Tolerance & Availability:

High fault tolerance with no single point of failure; the system automatically detects and recovers from hardware failures.

Data Integrity:

Ensures that transactions either complete or roll back to a stable state if a fault occurs.

Scalability:

Linearly expandable: as your database grows, additional nodes may be added, allowing expansion without sacrificing performance.

Teradata Architecture: the SMP

An SMP node contains CPUs (processors), PEs, AMPs, and their vdisks.

Vprocs (virtual processors) are sets of software processes running on a node. Each vproc is a separate, independent copy of the processor software, isolated from the other vprocs but sharing some of the node's physical resources, such as memory and CPUs.

The Parsing Engine (PE):

- Checks the SQL syntax
- Checks resource availability and access rights
- Parses the SQL
- Generates AMP steps and creates the plan
- Dispatches the steps to the AMPs over the BYNET
- Creates answer sets for clients
- Performs EBCDIC-ASCII conversion
- Handles up to 120 user sessions

The AMPs:

- Store and retrieve rows to and from the disks
- Lock management
- Sort rows and aggregate columns
- Join processing
- Output conversion and formatting
- Disk space management and accounting
- Special utility protocols
- Recovery processing

This is called SMP (Symmetric Multiprocessing): a multiprocessing node that contains a number of central processing units sharing a single memory pool. "Shared-nothing architecture" means that each AMP has its own disk (data), shares it with no other AMP, and is solely responsible for any changes to or access of that data.
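As a minimal sketch of the shared-nothing idea (plain Python; the `Amp` class and `owning_amp` routing function are illustrative, not Teradata interfaces), each AMP owns a private row store and is the only component that ever reads or writes those rows:

```python
# Illustrative sketch of "shared nothing": each AMP owns a private row store
# and is the only component that ever touches those rows.

class Amp:
    def __init__(self, amp_id):
        self.amp_id = amp_id
        self.rows = {}                      # this AMP's "vdisk"; no other AMP sees it

    def insert(self, key, row):
        self.rows[key] = row

    def select(self, key):
        return self.rows.get(key)

amps = [Amp(i) for i in range(4)]

def owning_amp(primary_index_value):
    # Stand-in for Teradata's hash map: route a PI value to exactly one AMP.
    return amps[hash(primary_index_value) % len(amps)]

owning_amp("cust-1001").insert("cust-1001", {"name": "Acme"})
print(owning_amp("cust-1001").select("cust-1001"))  # only the owning AMP holds this row
```

Because no two AMPs ever touch the same rows, work on different rows can proceed in parallel without cross-AMP coordination, which is the property the slide is pointing at.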

And then comes the MPP

MPP (Massively Parallel Processing) consists of a number of nodes (SMPs) that work on a problem at the same time. Each node (SMP) has one or more CPUs, its own memory, I/O, network connections, and disk arrays, and does not share its resources with other nodes.

BYNET: a dual-redundant, fault-tolerant, bi-directional interconnect network that enables:

- Automatic load balancing of message traffic
- Automatic reconfiguration after fault detection
- Scalable bandwidth as nodes are added

The BYNET is responsible for:

- Broadcast, multicast, and point-to-point communications between nodes and virtual processors
- Merging answer sets back to the PE
- Making Teradata parallelism possible

Important components

SMP - Symmetric Multiprocessing: a single node that contains multiple CPUs sharing a memory pool.

MPP - SMP nodes combined with a communication network (the BYNET) form an MPP. An MPP comprises two or more loosely coupled SMP nodes connected by the BYNET, with shared SCSI access to multiple disk arrays.

BYNET - The hardware inter-processor network that links the nodes of an MPP system. It implements point-to-point, multicast, or broadcast communication depending on the situation, and is typically used for merging and sorting data from different nodes; the accumulated data is then sent back to the user.

Disk array - Teradata employs RAID storage technology, where drives are configured logically into one or more logical units (LUNs). Each LUN is further sliced into pdisks, which are assigned to AMPs; the group of pdisks assigned to an AMP is called a vdisk (see the storage sketch below).
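A rough illustration of that LUN / pdisk / vdisk grouping in Python (the names and the round-robin assignment policy here are hypothetical; real configuration is done with Teradata utilities):

```python
# Hypothetical illustration of LUN -> pdisk -> vdisk grouping.
# Each LUN is sliced into pdisks; the pdisks assigned to one AMP form its vdisk.

luns = {"LUN0": ["pdisk0", "pdisk1"], "LUN1": ["pdisk2", "pdisk3"]}

# Flatten all pdisks and deal them out round-robin to the AMPs.
all_pdisks = [p for pdisk_list in luns.values() for p in pdisk_list]
num_amps = 2
vdisks = {amp: all_pdisks[amp::num_amps] for amp in range(num_amps)}

print(vdisks)   # {0: ['pdisk0', 'pdisk2'], 1: ['pdisk1', 'pdisk3']}
```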

More definitions

PDE - Parallel Database Extension: an interface layer on top of the operating system. It enhances processing by providing parallel processing and priority scheduling, and it executes the vprocs. It takes advantage of the BYNET and shared disk hardware to improve performance, and can be visualized as a layer on top of the operating system.

File System - Teradata File System service calls allow the Teradata RDBMS to store and retrieve data efficiently without being concerned about the underlying operating system interfaces. It divides the disk into logical blocks (MI, CI, CID, DB, DBD).

TPA - Teradata Parallel Application: responsible for the distribution, coordination, and balancing of processes/threads across nodes.

TDP - Teradata Director Program: responsible for session balancing across multiple PEs, failure notification, logging, verification, recovery, restart, and security.

Logical processors

VPROCs - Virtual processors: sets of software processes that run on a node under the Teradata PDE within the multitasking environment of the operating system. A single node (SMP) can have up to 128 vprocs.

PE - The Parsing Engine performs session control and dispatches tasks to fetch, return, and merge data. It communicates with the client system on one side and with the AMPs on the other side (via the BYNET).

AMP - The Access Module Processor retrieves and updates data on the virtual disks. It is responsible for locking, joining, sorting, aggregation, data conversion, disk space management, accounting, and journaling.

A single PE handles one request at a time: the request is parsed and optimized, steps are built, and the steps are dispatched to the corresponding AMP(s). Each AMP has 80 worker tasks that perform the different kinds of work associated with the steps. If the request is a SELECT, the worker tasks, once finished, send their data to the BYNET, where it is merged and sorted, and the PE then dispatches the resulting data to the user.

Query lifecycle

Example requests: SELECT * FROM t1 WHERE id = 4; (single-AMP) and SELECT * FROM t1 WHERE id IN (2,8); (multi-AMP).

1. The application sends the request to the PE (via CLI and the TDP, the Teradata Director Program); the PE acknowledges it to the application.
2. The SQL is parsed by the PE.
3. The PE uses the hash map to locate the AMP (single-AMP request) or AMPs (multi-AMP request) that hold the qualifying rows.
4. The PE sends the request to that AMP or to the individual AMPs, which acknowledge it back to the PE.
5. Each AMP retrieves the data from its own vdisk.
6. For a single-AMP request, the AMP sends the data directly to the PE; for a multi-AMP request, the AMPs send their data to the BYNET, which merges it and passes the merged data to the PE.
7. The PE sends the result to the application, and the application acknowledges it back to the PE.

(The original diagram shows rows with PI values 1-8 and descriptions A-H spread across four AMPs and their vdisks.)
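A toy end-to-end simulation of that lifecycle in Python (the `ParsingEngine` and `Amp` classes are illustrative, not Teradata components): a single-value request touches exactly one AMP, while an IN-list request fans out to several AMPs and the partial answer sets are merged before the result is returned.

```python
# Toy simulation of the query lifecycle: the PE locates AMPs via the hash map,
# each AMP reads only its own vdisk, and multi-AMP results are merged ("BYNET").

class Amp:
    def __init__(self):
        self.vdisk = {}                        # primary-index value -> row

    def retrieve(self, ids):
        return [self.vdisk[i] for i in ids if i in self.vdisk]

class ParsingEngine:
    def __init__(self, amps):
        self.amps = amps

    def locate(self, pi_value):                # stand-in for the hash-map lookup
        return self.amps[hash(pi_value) % len(self.amps)]

    def select_by_id(self, ids):
        # Group the requested PI values by owning AMP (single- or multi-AMP plan).
        plan = {}
        for i in ids:
            plan.setdefault(self.locate(i), []).append(i)
        partials = [amp.retrieve(wanted) for amp, wanted in plan.items()]
        # "BYNET merge": combine and sort the partial answer sets for the PE.
        return sorted(row for part in partials for row in part)

amps = [Amp() for _ in range(4)]
pe = ParsingEngine(amps)
for i in range(1, 9):                          # load rows with ids 1..8 across the AMPs
    pe.locate(i).vdisk[i] = (i, chr(ord("A") + i - 1))

print(pe.select_by_id([4]))                    # single-AMP request
print(pe.select_by_id([2, 8]))                 # multi-AMP request, merged result
```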


Data is distributed across all AMPs based on row-hash of PI

Data distribution and access methods

Hashing: Teradata uses hashing for data distribution and access. Each data row is hashed based on its primary index value, and the hash map directs the row to a particular AMP based on that hash value:

PI value -> row hash -> hash map -> AMP
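A small sketch of that distribution step (the hash function, bucket count, and round-robin hash map below are illustrative stand-ins; Teradata's actual hashing algorithm is proprietary), showing that rows land roughly evenly across the AMPs:

```python
import hashlib
from collections import Counter

NUM_AMPS = 4
NUM_BUCKETS = 65536                          # stand-in for the hash-map buckets

# Hash map: bucket number -> AMP (round-robin here; the real map is configured).
hash_map = {b: b % NUM_AMPS for b in range(NUM_BUCKETS)}

def row_hash(pi_value):
    # Illustrative stand-in for Teradata's proprietary 32-bit row hash.
    digest = hashlib.md5(str(pi_value).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def amp_for(pi_value):
    bucket = row_hash(pi_value) % NUM_BUCKETS   # pick a hash-map bucket
    return hash_map[bucket]

# Distribute 100,000 rows by primary index value and check the spread.
counts = Counter(amp_for(i) for i in range(100_000))
print(counts)                                # roughly 25,000 rows per AMP
```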

Hashing and indexing

Indexing: a data value (or values, if the index is compound) from a row acts as an index key to that row. The index associates the index key with a relative row address that gives the location of the row on disk. Index entries are stored in order of their index key values and are said to be value-ordered.

Hashing:

The index key data value is transformed by a mathematical function to produce an abstract value that is not related to the original data value in any obvious way. Hashed data is assigned to hash buckets, and the hash map relates each hash code to an AMP location in a 1:1 manner. There is no obvious correspondence between a hash code and the location of the row it refers to.
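To make the contrast concrete, here is a minimal sketch (plain Python, not Teradata internals): a value-ordered index keeps keys in key order, so neighbouring key values sit next to each other and range scans are cheap, while hashing the same keys produces codes with no obvious relation to the original values.

```python
import bisect
import zlib

keys = [5, 17, 23, 42, 99]

# Value-ordered index: keys kept sorted, so lookups and range scans can use
# binary search and adjacent key values are stored next to each other.
sorted_keys = sorted(keys)
print("position of 23:", bisect.bisect_left(sorted_keys, 23))   # neighbours are 17 and 42

# Hashing: the same keys map to codes with no obvious relation to their values,
# which spreads rows evenly but gives up cheap range scans.
for k in keys:
    print(k, "->", zlib.crc32(str(k).encode()))
```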

Teradata does not use traditional indexing: what we refer to as indexes are either row hash values or data tables (join indexes).

Tradeoffs between hashing and indexing:

- Hashing is far better suited to the parallel database architecture.
- Hashing provides consistently better performance because rows are always distributed evenly across the AMPs.
- Primary indexes are not stored in an index subtable; they are stored directly as part of the row data.
- Primary index columns on frequently used join constraints allow the joining rows to be co-located on the same AMP.
- Value-ordered indexing remains better suited to range queries and to retrievals whose selection criteria involve only part of a multicolumn hash key.

Hashing

Teradata Database hashing algorithms are proprietary mathematical functions that transform an input data value of any length into a 32-bit row hash value; 32 bits provide about 4.2 billion possible values. The first 16 bits of the row hash form the 16-bit Destination Selection Word (the hash bucket number) that the hash map uses to select an AMP, and the row hash forms part of the Row ID used to locate the row on that AMP.
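A hedged sketch of that arithmetic (CRC-32 stands in for the proprietary hash, and the uniqueness value is just a constant here; the exact Row ID layout is simplified):

```python
import zlib

def row_hash(pi_value):
    # CRC-32 as a stand-in for Teradata's proprietary 32-bit hashing algorithm.
    return zlib.crc32(str(pi_value).encode()) & 0xFFFFFFFF

h = row_hash("cust-1001")
dsw = h >> 16                     # first (high-order) 16 bits: Destination Selection Word
print(f"row hash:          {h:#010x}")
print(f"DSW / hash bucket: {dsw:#06x}  -> looked up in the hash map to pick the AMP")

# The row hash, extended with a uniqueness value assigned on the AMP, forms the
# Row ID that identifies one specific row (layout simplified for illustration).
row_id = (h, 1)
print("row id:", row_id)
```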