Programming and porting in a Microsoft 64 bit world Shanthal Vasanth Hewlett Packard.


Programming and porting in a Microsoft 64 bit world

Shanthal Vasanth

Hewlett Packard

Agenda

• Why 64 bit CPU?

• Introducing Itanium (EPIC)

• About Opteron and EM64T

• HP Itanium server platforms

• 64 bit porting

Why 64 bit CPU?


• Ability to use massive amounts of memory

• Data in memory is faster than getting it from the disk

• A 32-bit CPU can address at most 4 GB (2^32 bytes), while a 64-bit CPU can address up to 2^64 bytes (about 18 exabytes)

• Ability to handle larger integer values natively

• 32-bit CPUs rely on software emulation (multi-instruction sequences) for integer values ≥ 2^32

• 64-bit CPUs can handle integer values up to 2^64 − 1 in a single register

• Process data/instructions in chunks of 64 bits in a single clock cycle
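The address-space and integer-range figures above can be checked with a few lines of C. This is a minimal sketch; the helper names `max_address` and `add_wide` are hypothetical and not part of the original slides.

```c
#include <assert.h>
#include <stdint.h>

/* Highest byte address reachable with `bits` address bits (hypothetical helper). */
uint64_t max_address(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64) {
        return UINT64_MAX;              /* 2^64 - 1; shifting by 64 would be undefined */
    }
    return ((uint64_t)1 << bits) - 1;   /* 2^bits - 1 */
}

/* A sum that overflows 32-bit unsigned arithmetic but fits in 64 bits. */
uint64_t add_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a + (uint64_t)b;
}
```

With 32 address bits the highest address is 4,294,967,295 (the 4 GB limit); with 64 bits it is 18,446,744,073,709,551,615, which is where the "18 exabytes" figure comes from.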

Introducing Itanium (EPIC)

[Figure: performance versus time across architecture generations: CISC, RISC, Superscalar, EPIC. Age labels shown in the figure: 20+, 10-15+, 9+, and 2+ for the Intel® Itanium® Processor.]

Performance Enhancement

• Maximize instructions executed in parallel
• Improve floating point
• Speculation & predication to overcome memory latency & branch misses (common to databases)

Massive On-chip Resourcing

• Large and fast on-die cache
• Expanded number of registers (general, FP, branch)
• Efficient management engine
– Register stack engine
– Register windowing
– 4 GB page size

Scalable & Reliable

• Systems scalable to 512 CPUs and beyond
• 64b for large memory addressability
• High internal & external bus & memory bandwidth
• MCA (Machine Check Architecture) for error detection, correction, & logging

* Source: Computer Organization and Architecture, 1999, W. Stallings

Architectural Evolution

EPIC (Explicitly Parallel Instruction Computing)

• Largest, most demanding workloads require a new approach
• Benefits from the experience of past architectures
• Goal to move beyond RISC performance bounds with explicit parallel instruction streams
• Developed by the best CPU and server architecture minds in the industry

RISC (Reduced Instruction Set Computing)

• Goal to optimize performance with simpler instructions (this effort coined the term CISC)

Traditional Architecture Limiters

• Today's processors are often 60% idle

• Execution units are available but used inefficiently

[Figure: original source code → compiler → sequential machine code → hardware re-parallelizes the code for multiple functional units]

Technology case for a new Architecture

• Superscalar complexity growth
– Functional unit area grows linearly with the number of units
– Scheduler area grows as the square of the number of units

• Cost-performance reaches a point of diminishing returns

[Figure: cost-performance versus number of functional units, 1 to 5]

Itanium Hardware/Software Synergy

[Figure: two compilation pipelines compared]

• Traditional architecture: original source code → compiler → sequential machine code → implicitly parallel hardware → multiple execution units

• Itanium-based: original source code → compiler → parallel machine code → massive resources, with multiple execution units used more efficiently

Itanium architecture

• Developed by HP and Intel
• Next pervasive computer architecture
• Explicitly Parallel Instruction Computing (EPIC)
• Supports multiple operating systems: Windows, HP-UX, and Linux

Itanium architecture features

• Explicit parallelism
• Predication
• Enhanced speculation
• Floating-point architecture
• Multi-processor scalability
• Large number of registers

• Built-in instruction-level parallelism
• Massive on-chip resources
• Up to 2x instructions/clock cycle
• CPU clock and compiler maturity curve

• Fewer memory loads/stores on complex workloads
• Huge memory address spaces
• 60% shorter memory pipeline
• Latency avoidance
• Instruction predication
• Data and control speculation
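Predication replaces a branch with instructions that are guarded by a condition. On Itanium the compiler does this itself using predicate registers. The C below only mimics the idea at source level and is a sketch, not what the compiler emits; both function names are hypothetical.

```c
#include <assert.h>

/* Branching version: the CPU must predict which way the `if` goes. */
int max_branchy(int a, int b)
{
    if (a > b) {
        return a;
    }
    return b;
}

/* "Predicated" version: both candidates are computed and the
   condition selects one, so there is no branch to mispredict. */
int max_predicated(int a, int b)
{
    int take_a = (a > b);               /* plays the role of a predicate bit */
    assert(take_a == 0 || take_a == 1); /* relational operators yield 0 or 1 */
    return take_a * a + (1 - take_a) * b;
}
```

Both functions return the same value for every input; the second form trades a possible branch miss for a little extra arithmetic.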

Itanium’s unique advantages

Customer benefits

• Higher performance in FP-intensive and complex technical workloads
• 2X performance of x86, at any clock speed, for faster:
– Image manipulation
– Voice encoding/recognition
– Encryption
• Industry leading performance and scalability for demanding and unpredictable commercial applications:
– OLTP
– Database query (TPC-H)
– Sorting

Intel Itanium architecture: ultimate technology for technical computing

About Opteron & EM64T

Opteron

• Integrated DDR memory controllers (5.3 GB/s)

• Enable direct access to memory reducing latency

• HyperTransport links

• a chip-to-chip interconnect technology that operates at memory speeds (6.4 GB/s)

• AMD 64 bit extensions on top of IA-32 instructions

• 32 bit binaries run natively

• Does not support Streaming SIMD Extensions 3

Xeon/EM64T - Extended Memory 64 Technology

• 64 bit Instruction set extensions to IA-32

• Compatible with AMD’s 64-bit instructions

• Traditional North Bridge Architecture

• Memory is accessed through the north bridge

• Advantage: no need to re-spin the CPU to support next-generation memory

• Does not support 3DNow

How next-generation x86 extensions address needs for improved performance

AMD Opteron™ utilizes 3 key innovations

• Memory controller on-board with the processor, allowing the controller to run at the full speed of the processor core.

• The HyperTransport link between the processors, and between the processors and I/O, is also extremely low-latency (6.4GB/sec of data throughput).

• Capability for maximum 32-bit performance, and excellent 64-bit performance.

• Increased frequency headroom; 3.6 GHz/1MB cache

• 800 MHz FSB - 1.5x system bus speed vs 533MHz platform

• DDR2-400 - Faster memory technology

• PCI Express 4x - 8x - Faster I/O

[Figure: Opteron block diagram showing the AMD64 core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, and HyperTransport™ links]

Multitude of features enhance Xeon performance

Xeon extensions add

• Additional registers: 8 SSE & 8 general purpose
• Double precision (64-bit) integer support
• Extended memory addressability: 64-bit pointers, registers

Opteron Compared to Itanium 2

                        Opteron* Processor              Itanium® 2 Processor
Architecture            x86 with extra memory bits      Itanium Architecture
Core Frequency          ~2.0 GHz                        1.5 GHz
System Bus Bandwidth    6.4 GB/s (16x16 HTT)            6.4 GB/s
Memory Addressing       1 TB                            1024 TB
On-die Cache            1 MB                            6 MB
On-die Registers        40 registers                    264 application registers + 64 predicate registers*
Pipeline Stages         12                              8
Instructions / Clk      3 instructions / cycle          6 instructions / cycle
Execution Units         3 integer                       6 integer, 3 branch
                        Fmul, Fadd                      2 FP (FMAC)
                        1 for SIMD                      1 SIMD
                        2 load or 2 store               2 load and 2 store

[Figure: the chart also has an Issue Ports row, drawn as numbered boxes (1 to 6 and 1 to 11)]

* Intel’s EPIC technology includes 64 single-bit predicate registers to accelerate loop unrolling and branch intensive code execution.

Opteron and Itanium: A comparison

                              Opteron         Madison
Clock (for this comparison)   1.8 GHz         1.5 GHz
Physical address space        40 bit          50 bit
Virtual address space         48 bit          56 bit
Int (=GRs) registers          16              128
Float registers               8               128
Supported page sizes          4 KB, 2 MB      4 KB … 4 GB
Instructions/clock            3               6
On-die cache                  1 MB            9 MB
IA32 applications             Native          Emulated
Native 64-bit OS              Win/Linux       Win/Linux/UX
Memory access                 On-chip MCU     North bridge

HP Itanium OS & Server Choices

HP Integrity Servers

• 1 to 128-way* Itanium processor architecture
• Large scale applications and databases
• Complex workloads – technical and commercial
• Primarily back end DB & application tier
• Enterprise scale up and scale out
• Server consolidation

HP ProLiant Servers

• 1 to 8-way x86 processor architecture
• Small to medium scale applications and databases
• Well-defined, less-complex workloads
• Primarily front-end/network edge & application tier
• Scale out and small to mid-size scale up

ProLiant and Integrity servers: customer choice, driven by customer-specific needs

* Future

HP delivering choice

• Price/performance leadership with 32/64-bit co-existence

• Highest clock speed, peak performance

• Extensive 32-bit, and emerging 64-bit ecosystems

• Scale-out for simple, highly parallel workloads (2p nodes)

• Linux & Windows

• Price/performance leadership with 32/64-bit co-existence

• 32-bit throughput performance leadership

• Highest bandwidth for sustained performance

• Extensive 32-bit, and emerging 64-bit ecosystems

• Scale-out for moderate workloads (2p/4p nodes)

• Linux & Windows

• Highest performance 64-bit processor core for sustained performance

• Highest SMP scalability (to 128p)

• HP-UX for mission-critical technical computing

• Extensive 64-bit ecosystem (and 32/64-bit on HP-UX)

• Scale-up and scale-out for complex workloads

• HP-UX, Linux & Windows

ProLiant Servers with 64-bit extensions

Integrity Servers

Complete choice to meet diverse needs across the data center

[Figure: workloads mapped to server families across three data center tiers]

• Tiers: Front-end (1 - 4 processors); Application & data-tier (4 - 8 processors); Large scale data tier (8 - 64+ processors)

• Workloads shown: workgroup; file, print; mail, messaging; directory, DNS, firewall, security; services, caching, proxy, Web infrastructure; parallel computing, clustering (HPC); app tier (ERP, biz logic, app server); OLTP mid size; ERP medium; biz intelligence / SCM planning; biz intelligence with very large data sets; back-end for CRM, SCM, ERP with large data sets; OLTP with large DB and high transaction volumes; ERP large; large SMP, large memory HPC

• Server families shown: ProLiant; ProLiant & Integrity; Integrity; Integrity & NonStop; mix of ProLiant, Integrity & NonStop

Unique options on Itanium

[Figure: operating systems and application types supported on Itanium]

• Operating systems (native boot): W2003-64, Linux64, HP-UX 11i

• Application types: MS Win64 applications compiled for IPF (64 bit); MS W2K apps for IA32; Linux 64 applications compiled for IPF (64 bit); Linux 64 applications (also Linux 32); HP-UX applications compiled for IPF; PA apps (ARIES)

• ABIs: HP-UX ABI, Linux ABI

Itanium®-based hp servers: The most scalable roadmap from any vendor

[Figure: server roadmap, 2001 to 2004, split into scalable (cell-based) servers and entry servers; roadmap subject to change]

• Systems shown: Itanium2 2-way 2U; SuperDome Madison; rp8400 Madison; rp7410 Madison; Itanium2 4-way 7U; Itanium 4-ways; DL 590; rx4610; rx9610; future Itanium processor; Madison; rp5400; Itanium 16-way; rx5670; rx2600; Itanium2 Madison 4-way 4U

HP Itanium Processor Family

– HP's entire family of Itanium-based servers, including the midrange 8-way and 16-way and the high-end Superdome 64-way, will support the 64-bit version of Microsoft Windows Server 2003

– HP is successfully running Windows on HP's Superdome server configured with 64 Itanium 2 processors and 512 GB memory

– HP is successfully running Windows, HP-UX, and Linux in separate partitions on a 64-processor Superdome

[Figure: server lineup: 2-way, 4-way, 8+ way (8-socket), 16+ way (16-socket), 64+ way (64-socket)]

64 bit porting

Porting to 64 bit : When?

• Your application manipulates very large data sets and needs to exceed the 4 GB (32-bit address space) limit.

• Your application is I/O bound and can use memory to cache data instead of going to disk.

• Your platform does not have a 32-bit option.

• You already have a 64-bit application.

Categories of applications good for 64 bit

• Database:
– Benefit greatly from larger physical address space
– Run entire database out of memory rather than from disk

• Email:
– Larger address space: larger number of users per server

• Terminal Server:
– Avoiding kernel address space limitations when hosting multiple applications
– Example: Microsoft Office hosting on Terminal Server in a 64-bit environment supports 50% more users than in a 32-bit environment

• Business Applications:
– Apps that have high memory requirements
– Apps that have high computational requirements

• Technical / Scientific computing:
– Need for a large virtual and physical address space
– Complex computations

Choose an appropriate porting model

• 64 bit full port
• Small address space with 64 bit pointers
• Small address space with 32 bit pointers
• Win32 application
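The porting models differ mainly in how wide a pointer is. 64-bit Windows uses the LLP64 data model: `int` and `long` stay 32 bits while pointers and `long long` are 64 bits. The sketch below reports those sizes for whatever target it is compiled for; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 when pointers are wider than 32 bits on the build target. */
int pointers_are_64_bit(void)
{
    return sizeof(void *) * 8 > 32;
}

/* Writes the type sizes (in bytes) that distinguish the porting models
   into buf and returns the snprintf result. */
int describe_data_model(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "int=%u long=%u long long=%u pointer=%u",
                    (unsigned)sizeof(int), (unsigned)sizeof(long),
                    (unsigned)sizeof(long long), (unsigned)sizeof(void *));
}
```

Running `describe_data_model` in both the 32-bit and the 64-bit build is a quick way to see which types changed size and therefore which assumptions in the code need review.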

General porting process

• Find all your source code – check under the couch.

• Acquire 64-bit versions of all 3rd party libraries.

• Port all of your own libraries to 64-bit.

• Rewrite assembly code to be 64-bit.

• Fix migration warnings

Porting steps:

• Compile the code in Win32 and eliminate warnings
• Remove /FD compiler flag (when exporting makefiles from Visual Studio)
• Change linker option for machine type from IX86 to IA64
• Remove /Gm compiler flag (minimal rebuild, ignored by IA64)
• Add compiler options -Wp64 and -W4
• For 32 bit pointer variables, add -Ap32
• For 4 GB (small) address space use -As32
• Clean all 32 bit objects, rebuild for 64 bit
• Fix all warnings (use conditional compilation)
• Test
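The "use conditional compilation" step can look like the sketch below. `_WIN64` is predefined by Microsoft compilers for 64-bit targets; the `UINTPTR_MAX` test is an added fallback so the sketch also builds with other compilers. The `APP_*` names, the cache limits, and `cache_entries` are hypothetical and only illustrate the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_WIN64) || (UINTPTR_MAX > 0xFFFFFFFFu)
#define APP_IS_64BIT 1
#define APP_MAX_CACHE_BYTES ((size_t)8 * 1024 * 1024 * 1024)  /* hypothetical 8 GB cap */
#else
#define APP_IS_64BIT 0
#define APP_MAX_CACHE_BYTES ((size_t)1 * 1024 * 1024 * 1024)  /* hypothetical 1 GB cap */
#endif

/* Number of fixed-size entries that fit in the cache for this build. */
size_t cache_entries(size_t entry_size)
{
    assert(entry_size > 0);
    return APP_MAX_CACHE_BYTES / entry_size;
}
```

Keeping the 32/64-bit differences behind a few macros like these lets one source tree serve both builds while the port is in progress.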

32 / 64 bit issues – Common pitfalls

– Storing pointers in ints
– Truncating function return values
– Casting pointers to ints or ints to pointers
– Using unnamed or unqualified bit fields
– Using literals and masks that assume data sizes
– Hard coding size of data types
– Hard coding bit shift values
– Inline assembly
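Three of the pitfalls above (masks that assume a data size, hard-coded shift values, hard-coded type sizes) have the same cure: derive the constant from the type instead of writing it down. A minimal sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Pitfall: `addr & 0xFFFFFFF0` clears the upper half of a 64-bit address.
   Building the mask from the operand's own type keeps every upper bit. */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* Pitfall: `1 << 31` assumes a 32-bit word.
   Compute the top-bit position from the type's size instead. */
uintptr_t top_bit(void)
{
    return (uintptr_t)1 << (sizeof(uintptr_t) * CHAR_BIT - 1);
}

/* Pitfall: `n * 4` assumes 4-byte pointers. Use sizeof. */
size_t pointer_table_bytes(size_t n)
{
    assert(n <= SIZE_MAX / sizeof(void *));   /* guard against overflow */
    return n * sizeof(void *);
}
```

The same source then gives correct results in both the 32-bit and the 64-bit build without conditional compilation.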

Example 1:

What is wrong with the following program ?

#include <stdlib.h>

void main() {

int *p = malloc(100);

int i = (int)p;

}
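The problem in Example 1 is the cast `(int)p`: on 64-bit Windows an `int` is still 32 bits, so the upper half of the pointer is lost. One possible fix, sketched below, is to use `intptr_t` from `<stdint.h>`, an integer type meant to hold a pointer. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stores a pointer in an integer and converts it back.
   Returns 1 if the pointer survived the round trip, 0 otherwise. */
int pointer_roundtrip_ok(void)
{
    int *p = malloc(100);
    intptr_t i;
    int *q;
    int same;

    assert(sizeof(intptr_t) >= sizeof(void *));   /* the property int lacks */
    if (p == NULL) {
        return 0;
    }
    i = (intptr_t)p;      /* wide enough for a pointer on 32-bit and 64-bit */
    q = (int *)i;
    same = (q == p);
    free(p);
    return same;
}
```

With `int` in place of `intptr_t` the round trip can fail on a 64-bit build whenever the allocation lands above the 4 GB boundary.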

Example 2:

void func(int p);

char *ptr;

func(ptr);
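In Example 2 a 64-bit `char *` is passed where the prototype expects a 32-bit `int`, so only the low half of the address reaches `func`. The sketch below shows what such a truncation keeps and one possible corrected signature. `low_32_bits` and `func_fixed` are hypothetical names, and the body of `func_fixed` is invented for illustration since the slide does not show what `func` does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a 32-bit int parameter keeps of a 64-bit address: the low 32 bits. */
uint32_t low_32_bits(uint64_t address)
{
    return (uint32_t)(address & 0xFFFFFFFFu);
}

/* Corrected prototype: take the pointer as a pointer.
   The body just counts characters so the sketch does something. */
size_t func_fixed(const char *p)
{
    size_t n = 0;
    assert(p != NULL);
    while (p[n] != '\0') {
        n++;
    }
    return n;
}
```

An address such as 0x0000000100000010 would arrive as 0x10 through the original prototype, which points somewhere else entirely.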

Example 3:

Why would the following C program dump core in 64 bit mode?

void main() {

char *p = malloc(2000);

*p = 'A';

}
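The likely answer for Example 3: `<stdlib.h>` is not included, so an older C compiler assumes `malloc` returns `int`. The 64-bit pointer is cut to 32 bits on return, and the store through `p` hits an invalid address. A sketch of the corrected code follows; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>   /* declares malloc as returning void *, not int */

/* Allocates `size` bytes, stores `value` in the first byte and reads it back.
   Returns '\0' if the allocation fails. */
char store_and_read(size_t size, char value)
{
    char *p;
    char seen;

    assert(size > 0);
    p = malloc(size);     /* full-width pointer, thanks to the prototype */
    if (p == NULL) {
        return '\0';
    }
    *p = value;
    seen = *p;
    free(p);
    return seen;
}
```

Compiling with warnings enabled (for example the -W4 step in the porting list) flags the missing prototype before it turns into a crash.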

For more information

• HP DSPP Site
– www.hp.com/dspp

• For information on HP products in The Netherlands
– www.hp.nl
– Contact addresses
• Wouter Smit: [email protected]
• Laurent Van Veen: [email protected]

• Route64 training
– www.route64.net