Lecture 12 Scalable Computing

Transcript of Lecture 12 Scalable Computing

Page 1: Lecture 12 Scalable Computing

Lecture 12

Scalable Computing

Graduate Computer Architecture

Fall 2005

Shih-Hao Hung

Dept. of Computer Science and Information Engineering

National Taiwan University

Page 2: Lecture 12 Scalable Computing

Scalable Internet Services

• Lessons from Giant-Scale Services
http://www.computer.org/internet/ic2001/w4046abs.htm

– Access anywhere, anytime.
– Availability via multiple devices.
– Groupware support.
– Lower overall cost.
– Simplified service updates.

Page 3: Lecture 12 Scalable Computing

Giant-Scale Services: Components

Page 4: Lecture 12 Scalable Computing

Network Interface

• A simple network connecting two machines

• Message

Page 5: Lecture 12 Scalable Computing

Network Bandwidth vs Message Size

Page 6: Lecture 12 Scalable Computing

Switch: Connecting More than 2 Machines

Page 7: Lecture 12 Scalable Computing

Switch

Page 8: Lecture 12 Scalable Computing

Network Topologies

• Relative performance for 64 nodes

Page 9: Lecture 12 Scalable Computing

Packets

Page 10: Lecture 12 Scalable Computing

Load Management

• Balancing loads (load balancer)
– Round-robin DNS
– Layer-4 (Transport layer, e.g., TCP) switches
– Layer-7 (Application layer) switches
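To illustrate the simplest of these schemes, here is a minimal sketch (not from the slides) of round-robin selection over a pool of replica servers, roughly what round-robin DNS or a layer-4 switch does when spreading incoming connections; the backend addresses are hypothetical.

    import itertools

    class RoundRobinBalancer:
        """Hand successive requests to backends in a fixed rotation."""
        def __init__(self, backends):
            self._pool = itertools.cycle(backends)

        def pick(self):
            # Each new connection/request gets the next backend in turn.
            return next(self._pool)

    # Hypothetical replica pool.
    balancer = RoundRobinBalancer(["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"])
    for _ in range(4):
        print(balancer.pick())   # 10.0.0.1:80, 10.0.0.2:80, 10.0.0.3:80, 10.0.0.1:80

A layer-7 switch differs in that it inspects the request content (for example, the requested URL) before choosing a backend; the OSI layer summary on the following pages makes that distinction precise.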

Page 11: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

Page 12: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

• Application (Layer 7) This layer supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific; it provides services for file transfers, e-mail, and other network applications such as Telnet and FTP.

• Presentation (Layer 6) This layer provides independence from differences in data representation (e.g., encryption) by translating from application to network format, and vice versa. The presentation layer works to transform data into the form that the application layer can accept. This layer formats and encrypts data to be sent across a network, providing freedom from compatibility problems. It is sometimes called the syntax layer.

• Session (Layer 5) This layer establishes, manages and terminates connections between applications. The session layer sets up, coordinates, and terminates conversations, exchanges, and dialogues between the applications at each end. It deals with session and connection coordination.

Page 13: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

• Transport (Layer 4) This layer provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer. Example: TCP.

• Network (Layer 3) This layer provides switching and routing technologies, creating logical paths, known as virtual circuits, for transmitting data from node to node. Routing and forwarding are functions of this layer, as well as addressing, internetworking, error handling, congestion control and packet sequencing.

• Data Link (Layer 2) At this layer, data packets are encoded and decoded into bits. It furnishes transmission protocol knowledge and management and handles errors in the physical layer, flow control and frame synchronization. The data link layer is divided into two sublayers: The Media Access Control (MAC) layer and the Logical Link Control (LLC) layer. The MAC sublayer controls how a computer on the network gains access to the data and permission to transmit it. The LLC layer controls frame synchronization, flow control and error checking.

• Physical (Layer 1) This layer conveys the bit stream (electrical impulses, light, or radio signals) through the network at the electrical and mechanical level. It provides the hardware means of sending and receiving data on a carrier, including defining cables, cards, and physical aspects. Fast Ethernet, RS-232, and ATM are protocols with physical layer components.
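To make the layer-4/layer-7 distinction from the Load Management slide concrete, here is a small, self-contained sketch (not from the slides): the TCP socket is the layer-4 endpoint that a layer-4 switch balances on, while the HTTP request inside it is the layer-7 content a layer-7 switch can inspect. The host example.com is only a placeholder.

    import socket

    # Layer 4: open a TCP connection. A layer-4 switch balances on IP
    # addresses and ports and never looks at the payload.
    with socket.create_connection(("example.com", 80), timeout=5) as sock:
        # Layer 7: the HTTP request itself, which a layer-7 switch can parse,
        # e.g., to send /images/* to one server pool and /search to another.
        request = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
        sock.sendall(request)
        reply = sock.recv(4096)
        print(reply.split(b"\r\n", 1)[0])   # status line, e.g. b'HTTP/1.1 200 OK'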

Page 14: Lecture 12 Scalable Computing

OSI

Page 15: Lecture 12 Scalable Computing

Simple Web Farm

Page 16: Lecture 12 Scalable Computing

Search Engine Cluster

Page 17: Lecture 12 Scalable Computing

High Availability

• High availability is a major driving requirement behind giant-scale system design.
– Uptime: typically measured in nines; traditional infrastructure systems such as the phone system aim for four or five nines ("four nines" implies 0.9999 uptime, or roughly 60 seconds of downtime per week).
– Mean time between failures (MTBF)
– Mean time to repair (MTTR)
– uptime = (MTBF - MTTR) / MTBF
– yield = queries completed / queries offered
– harvest = data available / complete data
– DQ Principle: data per query × queries per second → constant
– Graceful degradation
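The definitions above can be exercised with a small worked example; the MTBF, MTTR, and query counts below are invented purely for illustration.

    # Hypothetical numbers, used only to exercise the definitions above.
    MTBF_s = 30 * 24 * 3600.0      # mean time between failures: 30 days, in seconds
    MTTR_s = 10 * 60.0             # mean time to repair: 10 minutes, in seconds

    uptime = (MTBF_s - MTTR_s) / MTBF_s            # ~0.99977
    yield_ = 9_990_000 / 10_000_000                # queries completed / queries offered
    harvest = 95 / 100                             # data available / complete data

    # DQ principle: data per query x queries per second stays roughly constant
    # for a given installation, so trading harvest against query rate is the
    # main degree of freedom during overload or faults.
    data_per_query = 1.0                           # arbitrary units
    queries_per_s = 1000.0
    dq = data_per_query * queries_per_s

    print(f"uptime={uptime:.5f}  yield={yield_:.4f}  harvest={harvest:.2f}  DQ={dq:.0f}")

Graceful degradation then amounts to deciding which factor of DQ to give up when capacity drops: serve fewer queries (lower yield) or serve each query from less data (lower harvest).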

Page 18: Lecture 12 Scalable Computing

Clusters in Giant-Scale Services

– Scalability
– Cost/performance
– Independent components

Page 19: Lecture 12 Scalable Computing

Cluster Example

Page 20: Lecture 12 Scalable Computing

Lessons Learned

• Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
• Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.

• Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.

• Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.

• Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.

• Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.

• Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
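As a rough sketch of the "DQ analysis on all upgrades" and rolling-upgrade points (the cluster size and per-node upgrade time below are hypothetical, not from the slides): a rolling upgrade takes nodes down one at a time, so the capacity lost is proportional to the fraction of nodes offline and to how long each node is down.

    # Hypothetical cluster: estimate DQ capacity lost during a rolling upgrade.
    nodes = 40
    per_node_upgrade_min = 15        # time each node is offline
    nominal_dq = 1.0                 # treat full-cluster DQ as 1.0

    # While one node is down, capacity drops to (nodes - 1)/nodes of nominal.
    loss_fraction = 1.0 / nodes
    total_upgrade_min = nodes * per_node_upgrade_min
    dq_minutes_lost = nominal_dq * loss_fraction * total_upgrade_min

    print(f"~{loss_fraction:.1%} capacity reduction for {total_upgrade_min} min "
          f"({dq_minutes_lost:.1f} DQ-minutes lost)")

Capacity planning then means scheduling the upgrade when offered load is low enough that this temporary DQ loss does not show up as lost yield.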

Page 21: Lecture 12 Scalable Computing

Deep Scientific Computing
Kramer et al., IBM J. R&D, March 2004

• High-performance computing (HPC)
– Resolution of a simulation
– Complexity of an analysis
– Computational power
– Data storage

• New paradigms of computing
– Grid computing
– Network

Page 22: Lecture 12 Scalable Computing

Themes (1/2)

• Deep science applications must now integrate simulation with data analysis. In many ways this integration is inhibited by limitations in storing, transferring, and manipulating the data required.

• Very large, scalable, high-performance archives, combining both disk and tape storage, are required to support this deep science. These systems must respond to large amounts of data—both many files and some very large files.

• High-performance shared file systems are critical to large systems. The approach here separates the project into three levels—storage systems, interconnect fabric, and global file systems. All three levels must perform well, as well as scale, in order to provide applications with the performance they need.

• New network protocols are necessary as the data flows are beginning to exceed the capability of yesterday's protocols. A number of elements can be tuned and improved in the interim, but long-term growth requires major adjustments.

Page 23: Lecture 12 Scalable Computing

Themes (2/2)

• Data management methods are key to being able to organize and find the relevant information in an acceptable time.

• Security approaches are needed that allow openness and service while providing protection for systems. The security methods must understand not just the application levels but also the underlying functions of storage and transfer systems.

• Monitoring and control capabilities are necessary to keep pace with the system improvements. This is key, as the application developers for deep computing must be able to drill through virtualization layers in order to understand how to achieve the needed performance.

Page 24: Lecture 12 Scalable Computing

Simulation: Time and Space

Page 25: Lecture 12 Scalable Computing

More Space

Page 26: Lecture 12 Scalable Computing

NERSC System

Page 27: Lecture 12 Scalable Computing

High-Performance Storage System (HPSS)

Page 28: Lecture 12 Scalable Computing

Networking for HPC Systems

• End-to-end network performance is a product of
– Application behavior
– Machine capabilities
– Network path
– Network protocol
– Competing traffic

• Difficult to ascertain the limiting factor without monitoring/diagnostic capabilities
– End host issues
– Routers and gateways
– Deep security
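One common back-of-the-envelope way to reason about the factors above (not from the slides, and the numbers below are invented) is that the achieved rate of a single flow is bounded by the slowest individual limit, including the protocol's window/RTT cap:

    # Hypothetical per-component limits for one TCP flow, in Mbit/s.
    app_rate        = 8000.0                  # how fast the application produces data
    host_copy_limit = 5000.0                  # end-host memory/PCI copy ceiling
    path_capacity   = 10000.0                 # bottleneck link minus competing traffic
    window_bytes    = 4 * 1024 * 1024         # TCP window
    rtt_s           = 0.05                    # 50 ms round trip
    protocol_limit  = window_bytes * 8 / rtt_s / 1e6   # window/RTT bound, in Mbit/s

    achievable = min(app_rate, host_copy_limit, path_capacity, protocol_limit)
    print(f"protocol limit ~{protocol_limit:.0f} Mbit/s, achievable ~{achievable:.0f} Mbit/s")

With these made-up numbers the flow is window-limited to roughly 670 Mbit/s, which is why the next slides examine end-host copies, bus bandwidth, and protocol tuning separately.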

Page 29: Lecture 12 Scalable Computing

End Host Issues

• Throughput limit
– Time to copy data from user memory to kernel across the memory bus (2 memory cycles)
– Time to copy from kernel to NIC (1 I/O cycle)
– If limited by memory BW: (a possible completion of this bound is sketched below)
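A possible completion of the bound left open above (this is an assumption; the formula itself is not visible in the transcript): if both halves of the user-to-kernel copy and the NIC's DMA read of the kernel buffer all cross the memory bus, then

\[
\text{network throughput} \;\lesssim\; \frac{\text{memory bandwidth}}{3},
\]

and roughly memory bandwidth / 2 if the DMA transfer does not contend for the memory bus. The zero-copy and offload techniques on a later slide attack exactly these crossings.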

Page 30: Lecture 12 Scalable Computing

Memory & I/O Bandwidth

• Memory BW
– DDR: 650-2500 MB/s

• I/O BW
– 32-bit/33 MHz PCI: 132 MB/s
– 64-bit/33 MHz PCI: 264 MB/s
– 64-bit/66 MHz PCI: 528 MB/s
– 64-bit/133 MHz PCI-X: 1056 MB/s
– PCI-E x1: ~1 Gbit/s
– PCI-E x16: ~16 Gbit/s
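To relate these figures to the 10 Gb NIC case on the next slide (a back-of-the-envelope check, not from the slides): a saturated 10 Gbit/s link needs about

\[
\frac{10\ \text{Gbit/s}}{8\ \text{bit/byte}} = 1250\ \text{MB/s}
\]

of sustained I/O bandwidth, which already exceeds the 1056 MB/s peak of 64-bit/133 MHz PCI-X before any protocol, descriptor, or DMA overhead is counted.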

Page 31: Lecture 12 Scalable Computing

Network Bandwidth

• VT600, 32-bit/33 MHz PCI, DDR400, AMD 2700+, 850 MB/s memory BW
– 485 Mbit/s

• 64-bit/133 MHz PCI-X, 1100-2500 MB/s memory BW
– Limited to 5000 Mbit/s
– Also limited by DMA overhead
– Only reaches half of a 10 Gb NIC's rate

• I/O architecture
– On-chip NIC?

• OS architecture
– Reduce the number of memory copies? Zero-copy?
– TCP/IP overhead
– TCP/IP offload
– Maximum Transmission Unit (MTU)
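As a quick illustration of why the MTU matters at these rates (a sketch, not from the slides): the per-packet work the host does, interrupts and TCP/IP processing, scales with the packet rate, which a larger MTU cuts sharply.

    LINK_BITS_PER_S = 10e9        # 10 Gbit/s link

    def packets_per_second(mtu_bytes):
        # Ignores framing overhead; good enough for an order-of-magnitude view.
        return LINK_BITS_PER_S / (mtu_bytes * 8)

    for mtu in (1500, 9000):      # standard Ethernet MTU vs. jumbo frames
        print(f"MTU {mtu}: ~{packets_per_second(mtu) / 1e6:.2f} M packets/s")
    # MTU 1500: ~0.83 M packets/s; MTU 9000: ~0.14 M packets/s, so jumbo frames
    # cut the per-packet overhead by roughly a factor of six.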

Page 32: Lecture 12 Scalable Computing

Conclusion

• High-performance storage and network
• End host performance
• Data management
• Security
• Monitoring and control

Page 33: Lecture 12 Scalable Computing

Petaflop Computing

Page 34: Lecture 12 Scalable Computing

Science-driven System Architecture

• Leadership Computing Systems
– Processor performance
– Interconnect performance
– Software: scalability & optimized libraries

• Blue Planet
– Redesigned Power5-based HPC system
• Single-core nodes
• High memory bandwidth per processor

– ViVA (Virtual Vector Architecture) allows the eight processors in a node to be treated as a single processor with peak performance of 60+ Gigaflop/s.

Page 35: Lecture 12 Scalable Computing

ViVA-2: Application Accelerator

• Accelerates particular application-specific or domain-specific features
– Irregular access patterns
– High load/store issue rates
– Low cache-line utilization

• ISA enhancements
– Instructions to support prefetching of irregular data accesses
– Instructions to support sparse, non-cache-resident loads
– More registers for software pipelining
– Instructions to initiate many dense/indexed/sparse loads

• Proper compiler support will be a critical component

Page 36: Lecture 12 Scalable Computing

Leadership Computing Applications

• Major computational advances
– Nanoscience
– Combustion
– Fusion
– Climate
– Life Sciences
– Astrophysics

• Teamwork
– Project team
– Facilities
– Computational scientist

Page 37: Lecture 12 Scalable Computing

Supercomputers 1993-2000

• Clusters vs MPPs

Page 38: Lecture 12 Scalable Computing

Clusters

• Cost-performance

Page 39: Lecture 12 Scalable Computing

Total Cost of Ownership (TCO)

Page 40: Lecture 12 Scalable Computing

Google

• Built with lots of PCs
• 80 PCs in one rack

Page 41: Lecture 12 Scalable Computing

Google

• Performance
– Latency: < 0.5 s
– Bandwidth, scaled with # of users

• Cost
– Cost of PCs keeps shrinking
– Switches, racks, etc.
– Power

• Reliability
– Software failures
– Hardware failures (about 1/10 as frequent as SW failures)
• DRAM (1/3 of HW failures)
• Disks (2/3 of HW failures)
– Switch failures
– Power outages
– Network outages
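A small sketch of how the failure ratios on this slide combine (the total software-failure count is hypothetical): hardware failures are taken as one tenth of software failures, with DRAM and disks splitting the hardware share 1/3 to 2/3.

    # Hypothetical: 1,000 software failures observed over some period.
    software = 1000
    hardware = software / 10            # slide: HW failures ~1/10 of SW failures
    dram     = hardware * (1 / 3)       # slide: DRAM ~1/3 of HW failures
    disks    = hardware * (2 / 3)       # slide: disks ~2/3 of HW failures

    print(f"software={software}, hardware={hardware:.0f} "
          f"(dram={dram:.0f}, disks={disks:.0f})")

With these proportions, tolerating software failures in the cluster design matters even more than surviving the occasional DRAM or disk fault.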