Lecture 12 Scalable Computing

Transcript of Lecture 12 Scalable Computing

Page 1: Lecture 12 Scalable Computing

Lecture 12

Scalable Computing

Graduate Computer Architecture

Fall 2005

Shih-Hao Hung

Dept. of Computer Science and Information Engineering

National Taiwan University

Page 2: Lecture 12 Scalable Computing

Scalable Internet Services

• Lessons from Giant-Scale Services
http://www.computer.org/internet/ic2001/w4046abs.htm

– Access anywhere, anytime.
– Availability via multiple devices.
– Groupware support.
– Lower overall cost.
– Simplified service updates.

Page 3: Lecture 12 Scalable Computing

Giant-Scale Services: Components

Page 4: Lecture 12 Scalable Computing

Network Interface

• A simple network connecting two machines

• Message

Page 5: Lecture 12 Scalable Computing

Network Bandwidth vs Message Size

Page 6: Lecture 12 Scalable Computing

Switch: Connecting More than 2 Machines

Page 7: Lecture 12 Scalable Computing

Switch

Page 8: Lecture 12 Scalable Computing

Network Topologies

• Relative performance for 64 nodes

Page 9: Lecture 12 Scalable Computing

Packets

Page 10: Lecture 12 Scalable Computing

Load Management

• Balancing loads (load balancer)
– Round-robin DNS
– Layer-4 (Transport layer, e.g., TCP) switches
– Layer-7 (Application layer) switches
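To illustrate the simplest of these schemes, here is a minimal sketch (not from the slides) of round-robin selection over a pool of replica servers, roughly what round-robin DNS or a layer-4 switch does when spreading incoming connections; the backend addresses are hypothetical.

    import itertools

    class RoundRobinBalancer:
        """Hand successive requests to backends in a fixed rotation."""
        def __init__(self, backends):
            self._pool = itertools.cycle(backends)

        def pick(self):
            # Each new connection/request gets the next backend in turn.
            return next(self._pool)

    # Hypothetical replica pool.
    balancer = RoundRobinBalancer(["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"])
    for _ in range(4):
        print(balancer.pick())   # 10.0.0.1:80, 10.0.0.2:80, 10.0.0.3:80, 10.0.0.1:80

A layer-7 switch differs in that it inspects the request content (for example, the requested URL) before choosing a backend; the OSI layer summary on the following pages makes that distinction precise.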

Page 11: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

Page 12: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

• Application (Layer 7) This layer supports application and end-user processes. Communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific; it provides services for file transfers, e-mail, and other network applications such as Telnet and FTP.

• Presentation (Layer 6) This layer provides independence from differences in data representation (e.g., encryption) by translating from application to network format, and vice versa. The presentation layer works to transform data into the form that the application layer can accept. This layer formats and encrypts data to be sent across a network, providing freedom from compatibility problems. It is sometimes called the syntax layer.

• Session (Layer 5) This layer establishes, manages and terminates connections between applications. The session layer sets up, coordinates, and terminates conversations, exchanges, and dialogues between the applications at each end. It deals with session and connection coordination.

Page 13: Lecture 12 Scalable Computing

The 7 OSI (Open System Interconnection) Layers

• Transport (Layer 4) This layer provides transparent transfer of data between end systems, or hosts, and is responsible for end-to-end error recovery and flow control. It ensures complete data transfer. Example: TCP.

• Network (Layer 3) This layer provides switching and routing technologies, creating logical paths, known as virtual circuits, for transmitting data from node to node. Routing and forwarding are functions of this layer, as well as addressing, internetworking, error handling, congestion control and packet sequencing.

• Data Link (Layer 2) At this layer, data packets are encoded and decoded into bits. It furnishes transmission protocol knowledge and management and handles errors in the physical layer, flow control and frame synchronization. The data link layer is divided into two sublayers: The Media Access Control (MAC) layer and the Logical Link Control (LLC) layer. The MAC sublayer controls how a computer on the network gains access to the data and permission to transmit it. The LLC layer controls frame synchronization, flow control and error checking.

• Physical (Layer 1) This layer conveys the bit stream (electrical impulses, light, or radio signals) through the network at the electrical and mechanical level. It provides the hardware means of sending and receiving data on a carrier, including defining cables, cards, and physical aspects. Fast Ethernet, RS-232, and ATM are protocols with physical layer components.
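To make the layer-4/layer-7 distinction from the Load Management slide concrete, here is a small, self-contained sketch (not from the slides): the TCP socket is the layer-4 endpoint that a layer-4 switch balances on, while the HTTP request inside it is the layer-7 content a layer-7 switch can inspect. The host example.com is only a placeholder.

    import socket

    # Layer 4: open a TCP connection. A layer-4 switch balances on IP
    # addresses and ports and never looks at the payload.
    with socket.create_connection(("example.com", 80), timeout=5) as sock:
        # Layer 7: the HTTP request itself, which a layer-7 switch can parse,
        # e.g., to send /images/* to one server pool and /search to another.
        request = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
        sock.sendall(request)
        reply = sock.recv(4096)
        print(reply.split(b"\r\n", 1)[0])   # status line, e.g. b'HTTP/1.1 200 OK'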

Page 14: Lecture 12 Scalable Computing

OSI

Page 15: Lecture 12 Scalable Computing

Simple Web Farm

Page 16: Lecture 12 Scalable Computing

Search Engine Cluster

Page 17: Lecture 12 Scalable Computing

High Availability

• High availability is a major driving requirement behind giant-scale system design.
– Uptime: typically measured in nines; traditional infrastructure systems such as the phone system aim for four or five nines ("four nines" implies 0.9999 uptime, or roughly 60 seconds of downtime per week).
– Mean time between failures (MTBF)
– Mean time to repair (MTTR)
– uptime = (MTBF - MTTR) / MTBF
– yield = queries completed / queries offered
– harvest = data available / complete data
– DQ Principle: data per query × queries per second → constant
– Graceful degradation
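The definitions above can be exercised with a small worked example; the MTBF, MTTR, and query counts below are invented purely for illustration.

    # Hypothetical numbers, used only to exercise the definitions above.
    MTBF_s = 30 * 24 * 3600.0      # mean time between failures: 30 days, in seconds
    MTTR_s = 10 * 60.0             # mean time to repair: 10 minutes, in seconds

    uptime = (MTBF_s - MTTR_s) / MTBF_s            # ~0.99977
    yield_ = 9_990_000 / 10_000_000                # queries completed / queries offered
    harvest = 95 / 100                             # data available / complete data

    # DQ principle: data per query x queries per second stays roughly constant
    # for a given installation, so trading harvest against query rate is the
    # main degree of freedom during overload or faults.
    data_per_query = 1.0                           # arbitrary units
    queries_per_s = 1000.0
    dq = data_per_query * queries_per_s

    print(f"uptime={uptime:.5f}  yield={yield_:.4f}  harvest={harvest:.2f}  DQ={dq:.0f}")

Graceful degradation then amounts to deciding which factor of DQ to give up when capacity drops: serve fewer queries (lower yield) or serve each query from less data (lower harvest).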

Page 18: Lecture 12 Scalable Computing

Clusters in Giant-Scale Services

– Scalability
– Cost/performance
– Independent components

Page 19: Lecture 12 Scalable Computing

Cluster Example

Page 20: Lecture 12 Scalable Computing

Lessons Learned

• Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
• Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.

• Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.

• Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.

• Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.

• Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.

• Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
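As a rough sketch of the "DQ analysis on all upgrades" and rolling-upgrade points (the cluster size and per-node upgrade time below are hypothetical, not from the slides): a rolling upgrade takes nodes down one at a time, so the capacity lost is proportional to the fraction of nodes offline and to how long each node is down.

    # Hypothetical cluster: estimate DQ capacity lost during a rolling upgrade.
    nodes = 40
    per_node_upgrade_min = 15        # time each node is offline
    nominal_dq = 1.0                 # treat full-cluster DQ as 1.0

    # While one node is down, capacity drops to (nodes - 1)/nodes of nominal.
    loss_fraction = 1.0 / nodes
    total_upgrade_min = nodes * per_node_upgrade_min
    dq_minutes_lost = nominal_dq * loss_fraction * total_upgrade_min

    print(f"~{loss_fraction:.1%} capacity reduction for {total_upgrade_min} min "
          f"({dq_minutes_lost:.1f} DQ-minutes lost)")

Capacity planning then means scheduling the upgrade when offered load is low enough that this temporary DQ loss does not show up as lost yield.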

Page 21: Lecture 12 Scalable Computing

Deep Scientific Computing
Kramer et al., IBM J. R&D, March 2004

• High-performance computing (HPC)
– Resolution of a simulation
– Complexity of an analysis
– Computational power
– Data storage

• New paradigms of computing
– Grid computing
– Network

Page 22: Lecture 12 Scalable Computing

Themes (1/2)

• Deep science applications must now integrate simulation with data analysis. In many ways this integration is inhibited by limitations in storing, transferring, and manipulating the data required.

• Very large, scalable, high-performance archives, combining both disk and tape storage, are required to support this deep science. These systems must respond to large amounts of data—both many files and some very large files.

• High-performance shared file systems are critical to large systems. The approach here separates the project into three levels—storage systems, interconnect fabric, and global file systems. All three levels must perform well, as well as scale, in order to provide applications with the performance they need.

• New network protocols are necessary as the data flows are beginning to exceed the capability of yesterday's protocols. A number of elements can be tuned and improved in the interim, but long-term growth requires major adjustments.

Page 23: Lecture 12 Scalable Computing

Themes (2/2)

• Data management methods are key to being able to organize and find the relevant information in an acceptable time.

• Security approaches are needed that allow openness and service while providing protection for systems. The security methods must understand not just the application levels but also the underlying functions of storage and transfer systems.

• Monitoring and control capabilities are necessary to keep pace with the system improvements. This is key, as the application developers for deep computing must be able to drill through virtualization layers in order to understand how to achieve the needed performance.

Page 24: Lecture 12 Scalable Computing

Simulation: Time and Space

Page 25: Lecture 12 Scalable Computing

More Space

Page 26: Lecture 12 Scalable Computing

NERSC System

Page 27: Lecture 12 Scalable Computing

High-Performance Storage System (HPSS)

Page 28: Lecture 12 Scalable Computing

Networking for HPC Systems

• End-to-end network performance is a product of
– Application behavior
– Machine capabilities
– Network path
– Network protocol
– Competing traffic

• Difficult to ascertain the limiting factor without monitoring/diagnostic capabilities
– End host issues
– Routers and gateways
– Deep security
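One common back-of-the-envelope way to reason about the factors above (not from the slides, and the numbers below are invented) is that the achieved rate of a single flow is bounded by the slowest individual limit, including the protocol's window/RTT cap:

    # Hypothetical per-component limits for one TCP flow, in Mbit/s.
    app_rate        = 8000.0                  # how fast the application produces data
    host_copy_limit = 5000.0                  # end-host memory/PCI copy ceiling
    path_capacity   = 10000.0                 # bottleneck link minus competing traffic
    window_bytes    = 4 * 1024 * 1024         # TCP window
    rtt_s           = 0.05                    # 50 ms round trip
    protocol_limit  = window_bytes * 8 / rtt_s / 1e6   # window/RTT bound, in Mbit/s

    achievable = min(app_rate, host_copy_limit, path_capacity, protocol_limit)
    print(f"protocol limit ~{protocol_limit:.0f} Mbit/s, achievable ~{achievable:.0f} Mbit/s")

With these made-up numbers the flow is window-limited to roughly 670 Mbit/s, which is why the next slides examine end-host copies, bus bandwidth, and protocol tuning separately.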

Page 29: Lecture 12 Scalable Computing

End Host Issues

• Throughput limit
– Time to copy data from user memory to kernel across the memory bus (2 memory cycles)
– Time to copy from kernel to NIC (1 I/O cycle)
– If limited by memory BW: (a possible completion of this bound is sketched below)
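A possible completion of the bound left open above (this is an assumption; the formula itself is not visible in the transcript): if both halves of the user-to-kernel copy and the NIC's DMA read of the kernel buffer all cross the memory bus, then

\[
\text{network throughput} \;\lesssim\; \frac{\text{memory bandwidth}}{3},
\]

and roughly memory bandwidth / 2 if the DMA transfer does not contend for the memory bus. The zero-copy and offload techniques on a later slide attack exactly these crossings.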

Page 30: Lecture 12 Scalable Computing

Memory & I/O Bandwidth

• Memory BW
– DDR: 650-2500 MB/s

• I/O BW
– 32-bit/33 MHz PCI: 132 MB/s
– 64-bit/33 MHz PCI: 264 MB/s
– 64-bit/66 MHz PCI: 528 MB/s
– 64-bit/133 MHz PCI-X: 1056 MB/s
– PCI-E x1: ~1 Gbit/s
– PCI-E x16: ~16 Gbit/s
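To relate these figures to the 10 Gb NIC case on the next slide (a back-of-the-envelope check, not from the slides): a saturated 10 Gbit/s link needs about

\[
\frac{10\ \text{Gbit/s}}{8\ \text{bit/byte}} = 1250\ \text{MB/s}
\]

of sustained I/O bandwidth, which already exceeds the 1056 MB/s peak of 64-bit/133 MHz PCI-X before any protocol, descriptor, or DMA overhead is counted.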

Page 31: Lecture 12 Scalable Computing

Network Bandwidth

• VT600, 32-bit/33 MHz PCI, DDR400, AMD 2700+, 850 MB/s memory BW
– 485 Mbit/s

• 64-bit/133 MHz PCI-X, 1100-2500 MB/s memory BW
– Limited to 5000 Mbit/s
– Also limited by DMA overhead
– Only reaches half of a 10 Gb NIC's rate

• I/O architecture
– On-chip NIC?

• OS architecture
– Reduce the number of memory copies? Zero-copy?
– TCP/IP overhead
– TCP/IP offload
– Maximum Transmission Unit (MTU)
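As a quick illustration of why the MTU matters at these rates (a sketch, not from the slides): the per-packet work the host does, interrupts and TCP/IP processing, scales with the packet rate, which a larger MTU cuts sharply.

    LINK_BITS_PER_S = 10e9        # 10 Gbit/s link

    def packets_per_second(mtu_bytes):
        # Ignores framing overhead; good enough for an order-of-magnitude view.
        return LINK_BITS_PER_S / (mtu_bytes * 8)

    for mtu in (1500, 9000):      # standard Ethernet MTU vs. jumbo frames
        print(f"MTU {mtu}: ~{packets_per_second(mtu) / 1e6:.2f} M packets/s")
    # MTU 1500: ~0.83 M packets/s; MTU 9000: ~0.14 M packets/s, so jumbo frames
    # cut the per-packet overhead by roughly a factor of six.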

Page 32: Lecture 12 Scalable Computing

Conclusion

• High-performance storage and network
• End host performance
• Data management
• Security
• Monitoring and control

Page 33: Lecture 12 Scalable Computing

Petaflop Computing

Page 34: Lecture 12 Scalable Computing

Science-driven System Architecture

• Leadership Computing Systems
– Processor performance
– Interconnect performance
– Software: scalability & optimized libraries

• Blue Planet
– Redesigned Power5-based HPC system
• Single-core nodes
• High memory bandwidth per processor

– ViVA (Virtual Vector Architecture) allows the eight processors in a node to be treated as a single processor with peak performance of 60+ Gigaflop/s.

Page 35: Lecture 12 Scalable Computing

ViVA-2: Application Accelerator

• Accelerates particular application-specific or domain-specific features
– Irregular access patterns
– High load/store issue rates
– Low cache-line utilization

• ISA enhancements
– Instructions to support prefetching of irregular data accesses
– Instructions to support sparse, non-cache-resident loads
– More registers for software pipelining
– Instructions to initiate many dense/indexed/sparse loads

• Proper compiler support will be a critical component

Page 36: Lecture 12 Scalable Computing

Leadership Computing Applications

• Major computational advances
– Nanoscience
– Combustion
– Fusion
– Climate
– Life Sciences
– Astrophysics

• Teamwork
– Project team
– Facilities
– Computational scientist

Page 37: Lecture 12 Scalable Computing

Supercomputers 1993-2000

• Clusters vs MPPs

Page 38: Lecture 12 Scalable Computing

Clusters

• Cost-performance

Page 39: Lecture 12 Scalable Computing

Total Cost of Ownership (TCO)

Page 40: Lecture 12 Scalable Computing

Google

• Built with lots of PCs
• 80 PCs in one rack

Page 41: Lecture 12 Scalable Computing

Google

• Performance
– Latency: < 0.5 s
– Bandwidth, scaled with # of users

• Cost
– Cost of PCs keeps shrinking
– Switches, racks, etc.
– Power

• Reliability
– Software failures
– Hardware failures (about 1/10 as frequent as SW failures)
• DRAM (1/3 of HW failures)
• Disks (2/3 of HW failures)
– Switch failures
– Power outages
– Network outages
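A small sketch of how the failure ratios on this slide combine (the total software-failure count is hypothetical): hardware failures are taken as one tenth of software failures, with DRAM and disks splitting the hardware share 1/3 to 2/3.

    # Hypothetical: 1,000 software failures observed over some period.
    software = 1000
    hardware = software / 10            # slide: HW failures ~1/10 of SW failures
    dram     = hardware * (1 / 3)       # slide: DRAM ~1/3 of HW failures
    disks    = hardware * (2 / 3)       # slide: disks ~2/3 of HW failures

    print(f"software={software}, hardware={hardware:.0f} "
          f"(dram={dram:.0f}, disks={disks:.0f})")

With these proportions, tolerating software failures in the cluster design matters even more than surviving the occasional DRAM or disk fault.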