Programming and porting in a Microsoft 64 bit world
Shanthal Vasanth, Hewlett Packard
Agenda
• Why 64 bit CPU?
• Introducing Itanium (EPIC)
• About Opteron and EM64T
• HP Itanium server platforms
• 64 bit porting
Why 64 bit CPU?
• Ability to use massive amounts of memory
– Data in memory is faster than getting it from the disk
– A 32-bit CPU can address at most 4 GB (2^32 bytes), while a 64-bit CPU can address up to about 18 exabytes (2^64 bytes)
• Ability to handle larger integer values natively
– 32-bit CPUs rely on software emulation (multi-instruction sequences) for values > 2^32
– 64-bit CPUs can handle values up to 2^64
• Process data/instructions in chunks of 64 bits in a single clock cycle
[Figure: performance over time for successive architectures: CISC, RISC, superscalar, EPIC]
Performance enhancement
• Maximize instructions executed in parallel
• Improve floating point
• Speculation & predication to overcome memory latency & branch misses (common to databases)
Massive on-chip resourcing
• Large and fast on-die cache
• Expanded number of registers (general, FP, branch)
• Efficient management engine
– Register stack engine
– Register windowing
– 4 GB page size
Scalable & reliable
• Systems scalable to 512 CPUs and beyond
• 64-bit for large memory addressability
• High internal & external bus & memory bandwidth
• MCA (Machine Check Architecture) for error detection, correction, & logging
Architectural Evolution
EPIC (Explicitly Parallel Instruction Computing)
• Largest, most demanding workloads require a new approach
• Benefits from the experience of past architectures
• Goal to move beyond RISC performance bounds with explicit parallel instruction streams
• Developed by the best CPU and server architecture minds in the industry
RISC (Reduced Instruction Set Computing)
• Goal to optimize performance with simpler instructions (this effort coined the term CISC)
[Figure labels: architecture ages of 20+, 10-15+, and 9+ years; Intel® Itanium® processor, age 2+]
* Source: W. Stallings, Computer Organization and Architecture, 1999
Traditional Architecture Limiters
• Today's processors are often 60% idle
• Execution units are available but used inefficiently
[Figure: original source code → compiler → sequential machine code → hardware re-parallelizes the code → multiple functional units]
Technology case for a new Architecture
• Superscalar complexity growth
– Functional unit area grows linearly with the number of units
– Scheduler area grows as the square of the number of units
• Cost-performance reaches a point of diminishing returns
Itanium Hardware/Software Synergy
[Figure: two pipelines compared.
Traditional architecture: original source code → compiler → sequential machine code → implicitly parallel hardware → multiple execution units.
Itanium-based: original source code → compiler → parallel machine code → massive resources, multiple execution units used more efficiently.]
Itanium architecture
• Developed by HP and Intel
• Next pervasive computer architecture
• Explicitly Parallel Instruction Computing (EPIC)
• Supports multiple operating systems: Windows, HP-UX, and Linux
Itanium architecture features
• Explicit parallelism
• Predication
• Enhanced speculation
• Floating-point architecture
• Multi-processor scalability
• Large number of registers
Itanium's unique advantages
• Built-in instruction-level parallelism
• Massive on-chip resources
• Up to 2x instructions/clock cycle
• CPU clock and compiler maturity curve
• Fewer memory loads/stores on complex workloads
• Huge memory address spaces
• 60% shorter memory pipeline
• Latency avoidance
• Instruction predication
• Data and control speculation
Customer benefits
• Higher performance in FP-intensive and complex technical workloads
• 2X performance of x86, at any clock speed, for faster:
– Image manipulation
– Voice encoding/recognition
– Encryption
• Industry-leading performance and scalability for demanding and unpredictable commercial applications:
– OLTP
– Database query (TPC-H)
– Sorting
Intel Itanium architecture: ultimate technology for technical computing
Opteron
• Integrated DDR memory controllers (5.3 GB/s)
– Enable direct access to memory, reducing latency
• HyperTransport links
– A chip-to-chip interconnect technology that operates at memory speeds (6.4 GB/s)
• AMD 64-bit extensions on top of IA-32 instructions
– 32-bit binaries run natively
• Does not support Streaming SIMD Extensions 3 (SSE3)
Xeon/EM64T - Extended Memory 64 Technology
• 64-bit instruction set extensions to IA-32
– Compatible with AMD's 64-bit instructions
• Traditional north bridge architecture
– Memory is accessed through the north bridge
– Advantage: no need to re-spin the CPU to support next-generation memory
• Does not support 3DNow
How next-generation x86 extensions address needs for improved performance
AMD Opteron™ utilizes 3 key innovations
• Memory controller on-board with the processor, allowing the controller to run at the full speed of the processor core.
• The HyperTransport link between the processors, and between the processors and I/O, is also extremely low-latency (6.4 GB/s of data throughput).
• Capability for maximum 32-bit performance, and excellent 64-bit performance.
[Figure: Opteron block diagram: AMD64 core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, HyperTransport™]
Multitude of features enhance Xeon performance
• Increased frequency headroom; 3.6 GHz/1 MB cache
• 800 MHz FSB: 1.5x system bus speed vs the 533 MHz platform
• DDR2-400: faster memory technology
• PCI Express 4x - 8x: faster I/O
Xeon extensions add
• Additional registers: 8 SSE & 8 general purpose
• Double precision (64-bit) integer support
• Extended memory addressability: 64-bit pointers, registers
Opteron Compared to Itanium 2
Feature                 Opteron* processor                 Itanium® 2 processor
Architecture            x86 with extra memory bits         Itanium architecture
Memory addressing       1 TB                               1024 TB
System bus bandwidth    6.4 GB/s (16x16 HTT)               6.4 GB/s
On-die cache            1 MB                               6 MB
On-die registers        40 registers                       264 application registers + 64 predicate registers*
Execution units         3 integer; Fmul, Fadd;             6 integer, 3 branch; 2 FP (FMAC);
                        1 for SIMD; 2 load or 2 store      1 SIMD; 2 load and 2 store
Core frequency          ~2.0 GHz                           1.5 GHz
Instructions / clock    3 instructions / cycle             6 instructions / cycle
Pipeline stages         12                                 8
Issue ports             (chart scale 1 to 6)               (chart scale 1 to 11)
* Intel's EPIC technology includes 64 single-bit predicate registers to accelerate loop unrolling and branch-intensive code execution.
Opteron and Itanium: A comparison
                              Opteron        Madison
Clock (for this comparison)   1.8 GHz        1.5 GHz
Physical address space        40 bit         50 bit
Virtual address space         48 bit         56 bit
Int (=GRs) registers          16             128
Float registers               8              128
Supported page sizes          4 KB, 2 MB     4 KB … 4 GB
Instructions/clock            3              6
On-die cache                  1 MB           9 MB
IA-32 applications            Native         Emulated
Native 64-bit OS              Win/Linux      Win/Linux/UX
Memory access                 On-chip MCU    North bridge
ProLiant and Integrity servers
HP Integrity Servers: 1 to 128-way*, Itanium processor architecture
• Large scale applications and databases
• Complex workloads, technical and commercial
• Primarily back end DB & application tier
• Enterprise scale up and scale out
• Server consolidation
HP ProLiant Servers: 1 to 8-way, x86 processor architecture
• Small to medium scale applications and databases
• Well-defined, less-complex workloads
• Primarily front-end/network edge & application tier
• Scale out and small to mid-size scale up
* Future
HP delivering choice
Customer choice, driven by customer-specific needs
ProLiant Servers with 64-bit extensions
• Price/performance leadership with 32/64-bit co-existence
• Highest clock speed, peak performance
• Extensive 32-bit, and emerging 64-bit ecosystems
• Scale-out for simple, highly parallel workloads (2p nodes)
• Linux & Windows

• Price/performance leadership with 32/64-bit co-existence
• 32-bit throughput performance leadership
• Highest bandwidth for sustained performance
• Extensive 32-bit, and emerging 64-bit ecosystems
• Scale-out for moderate workloads (2p/4p nodes)
• Linux & Windows
Integrity Servers
• Highest performance 64-bit processor core for sustained performance
• Highest SMP scalability (to 128p)
• HP-UX for mission-critical technical computing
• Extensive 64-bit ecosystem (and 32/64-bit on HP-UX)
• Scale-up and scale-out for complex workloads
• HP-UX, Linux & Windows
Complete choice to meet diverse needs across the data center
[Figure: workload placement across the data center]
Front-end (1 - 4 processors)
• Workgroup: file, print
• Mail: messaging
• Infrastructure: directory, DNS, firewall, security
• Web: services, caching, proxy
• HPC: parallel computing, clustering
Application & data tier (4 - 8 processors)
• OLTP (medium): OLTP mid size
• App tier: ERP, biz logic, app server
• ERP (medium)
• BI: biz intelligence/SCM planning
Large scale data tier (8 - 64+ processors)
• BI: biz intelligence, very large data sets
• ERP (large): back-end for CRM, SCM, ERP, large data sets
• OLTP (large): OLTP large size DB, high transaction volumes
• HPC: large SMP, large memory
Server families shown: ProLiant; ProLiant & Integrity; Integrity; Integrity & NonStop; mix of ProLiant, Integrity & NonStop
Unique options on Itanium
• W2003-64 (native boot)
– MS Win64 applications compiled for IPF (64-bit)
– MS W2K apps for IA32
• Linux64 (native boot)
– Linux 64 applications compiled for IPF (64-bit)
• HP-UX 11i (native boot), with HP-UX ABI and Linux ABI
– HP-UX applications compiled for IPF
– Linux 64 applications (also Linux 32)
– PA apps, through ARIES
Itanium®-based HP servers: the most scalable roadmap from any vendor
[Figure: server roadmap, 2001 to 2004; roadmap subject to change]
• Scalable (cell-based) servers: Itanium 16-way, rx9610, rp7410 Madison, rp8400 Madison, SuperDome Madison
• Entry servers: Itanium 4-ways (DL 590, rx4610), rp5400, rx5670, rx2600, Itanium2 2-way 2U, Itanium2 4-way 7U, Itanium2 Madison 4-way 4U
• Processor generations: Itanium, Itanium2, Madison, future Itanium processor
HP Itanium Processor Family
– HP's entire family of Itanium-based servers, including the midrange 8-way and 16-way and the high-end Superdome 64-way, will support the 64-bit version of Microsoft Windows Server 2003
– HP is successfully running Windows on HP's Superdome server configured with 64 Itanium 2 processors and 512 GB memory
– HP is successfully running Windows, HP-UX, and Linux in separate partitions on a 64-processor Superdome
[Figure: server range: 2-way, 4-way, 8-socket (8+ way), 16-socket (16+ way), 64-socket (64+ way)]
Porting to 64 bit : When?
• Your application manipulates very large data sets and needs to exceed the 4 GB (32-bit address space) limit.
• Your application is I/O bound and can use memory in place of disk I/O.
• Your platform does not have a 32-bit option.
• You already have a 64-bit application.
Categories of applications good for 64 bit
• Database:
– Benefit greatly from larger physical address space
– Run entire database out of memory rather than from disk
• Email:
– Larger address space: larger number of users per server
• Terminal Server:
– Avoiding kernel address space limitations when hosting multiple applications
– Example: Microsoft Office hosting on Terminal Server in a 64-bit environment supports 50% more users than in a 32-bit environment
• Business Applications:
– Apps that have high memory requirements
– Apps that have high computational requirements
• Technical / Scientific computing:
– Need for a large virtual and physical address space
– Complex computations
Choose an appropriate porting model
• 64 bit full port
• Small address space with 64 bit pointers
• Small address space with 32 bit pointers
• Win32 application
General porting process
• Find all your source code – check under the couch.
• Acquire 64-bit versions of all 3rd party libraries.
• Port all of your own libraries to 64-bit.
• Rewrite assembly code to be 64-bit.
• Fix migration warnings
Porting steps:
• Compile the code in Win32 and eliminate warnings
• Remove the /FD compiler flag (when exporting makefiles from Visual Studio)
• Change the linker option for machine type from IX86 to IA64
• Remove the /Gm compiler flag (minimal rebuild; ignored by IA64)
• Add compiler options -Wp64 and -W4
• For 32-bit pointer variables, add -Ap32
• For a 4 GB (small) address space, use -As32
• Clean all 32-bit objects, rebuild for 64 bit
• Fix all warnings (use conditional compilation)
• Test
32 / 64 bit issues – Common pitfalls
– Storing pointers in ints
– Truncating function return values
– Casting pointers to ints or ints to pointers
– Using unnamed or unqualified bit fields
– Using literals and masks that assume data sizes
– Hard coding size of data types
– Hard coding bit shift values
– Inline assembly
Example 1:
What is wrong with the following program ?
#include <stdlib.h>
void main() {
int *p = malloc(100);
int i = (int)p;
}
Example 3:
Why would the following C program dump core in 64 bit mode?
void main() {
char *p = malloc(2000);
*p = 'A';
}
For more information
• HP DSPP Site
– www.hp.com/dspp
• For information on HP products in The Netherlands
– www.hp.nl
– Contact addresses
• Wouter Smit: [email protected]
• Laurent Van Veen: [email protected]
• Route64 training
– www.route64.net