Alexey Pakhunov /XCG, Microsoft Research/ alexeypa@microsoft.com March 30th, 2011.


SCC Development Experiences

Alexey Pakhunov /XCG, Microsoft Research/ alexeypa@microsoft.com

March 30th, 2011


Overview

Black Cloud OS: a fork of Singularity OS
  Our playground for experimenting with message passing in a non-cache-coherent environment
This presentation covers only our development experiences on the SCC
  Submission of the paper is on its way


What is Singularity?

A quote from the Singularity home page: “A research operating system prototype, extending programming languages, and developing new techniques and tools for specifying and verifying program behavior”
Written in managed code
  Some assembler and C++ in the boot loader and kernel
IPC and inter-component communications are based on passing messages


Our setup

[Diagram: the SCC chip with 24 tiles, each attached to a router (R), four DDR3 memory controllers (MC), the VRC and the System Interface. The System Interface connects over PCI-E to the Management Console (Linux), which runs sccTcpServer/mceGui. The Management Console connects over TCP/IP to a Desktop PC (Windows), which runs RcLoader.Net, KdProxy, WinDbg, etc.]


RcLoader.Net

Configuration
  Generates the system memory map
  Configures the SCC registers
  Uploads the boot loader and OS images
  Supports manual editing of the SCC configuration
Debugging
  Allows inspecting the memory and configuration registers


The memory map

Address range                    Region
0xFC000000 - 0xFFFFFFFF          Shared memory (OS image, the initial jmp)
                                 Unused
0xC0000000 - 0xC3FFFFFF          Shared memory buffers (256 KB per core)
0xA0000000 - 0xB7FFFFFF          Configuration space
0x80000000 - 0x97FFFFFF          MPB (16 KB per tile)
                                 Unused
0x00000000 - up to 0x54FFFFFF    Private memory (336 MB - 1360 MB)
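The layout can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the pairing of region names with address ranges is inferred from the order in which they appear on the slide, and `REGIONS`, `size_mb` and `region_of` are hypothetical names, not part of RcLoader.Net.

```python
# Illustrative model of the memory map. The pairing of region names with
# address ranges is an assumption based on the order shown on the slide.
MB = 1 << 20

REGIONS = [
    # (first address, last address, description)
    (0x00000000, 0x54FFFFFF, "Private memory (upper bound)"),
    (0x80000000, 0x97FFFFFF, "MPB"),
    (0xA0000000, 0xB7FFFFFF, "Configuration space"),
    (0xC0000000, 0xC3FFFFFF, "Shared memory buffers"),
    (0xFC000000, 0xFFFFFFFF, "Shared memory (OS image, initial jmp)"),
]


def size_mb(first, last):
    """Size of an inclusive address range in megabytes."""
    return (last - first + 1) // MB


def region_of(addr):
    """Return the description of the region containing addr, or 'Unused'."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    return "Unused"


if __name__ == "__main__":
    for first, last, name in REGIONS:
        print(f"{first:#010x} - {last:#010x}  {size_mb(first, last):5d} MB  {name}")
    # The x86 reset vector lies in the topmost region.
    print(region_of(0xFFFFFFF0))
```

Under these assumptions the private range ends at exactly 1360 MB, matching the upper figure on the slide, and the x86 reset vector at 0xFFFFFFF0 falls inside the top shared region, which is consistent with the "initial jmp" living there.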


Debugging challenges

No serial port or console
  Memory at 0xb8000 is the console buffer
I/O redirection doesn’t work as expected
  Executing an IN or OUT instruction effectively halts the core and sccTcpServer
Serial KD transport is emulated
  A couple of ring buffers on the SCC side
  KdProxy.exe exposes a named pipe interface for the debugger
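The slide says only that the serial transport is emulated with "a couple of ring buffers". As a rough sketch of that idea, the following single-producer, single-consumer byte ring buffer shows how one direction of such a link could work. The class name, the capacity, and the one-empty-slot convention are assumptions made for illustration; the actual Black Cloud OS layout is not described in the slides, and on real hardware the indices and data would have to live in memory that both sides can see without stale caching.

```python
class RingBuffer:
    """Single-producer, single-consumer byte ring buffer (illustrative only)."""

    def __init__(self, capacity):
        # One slot is kept empty so that head == tail always means "empty".
        self.buf = bytearray(capacity)
        self.head = 0  # next index to write, advanced only by the producer
        self.tail = 0  # next index to read, advanced only by the consumer

    def write(self, data):
        """Copy as many bytes as fit and return the number written."""
        written = 0
        for byte in data:
            nxt = (self.head + 1) % len(self.buf)
            if nxt == self.tail:  # buffer is full
                break
            self.buf[self.head] = byte
            self.head = nxt
            written += 1
        return written

    def read(self, limit):
        """Remove and return up to limit bytes."""
        out = bytearray()
        while self.tail != self.head and len(out) < limit:
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % len(self.buf)
        return bytes(out)


# Two buffers, one per direction, emulate a full-duplex serial line.
to_debugger = RingBuffer(16)    # written by the core, drained by the host side
from_debugger = RingBuffer(16)  # written by the host side, drained by the core
```

Because each index is advanced by only one side, no lock is needed between the producer and the consumer, which is one reason this structure suits a link between a core and an external tool.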


Porting challenges

No BIOS
  The system memory map is patched directly in the boot loader
No standard devices
  The local APIC is used instead of the i8254 timer and PIC
  No RTC
No modern instructions supported
  Context handling code was updated due to the lack of MMX
    ▪ The 32-bit flavor of Singularity uses only x87 for floating point calculations
  The Bartok compiler was patched due to the lack of CMOV instructions


Experimental hardware

Turning on the MPB bypass bit causes a race that corrupts memory
  It cost us three days of debugging :-)
  We couldn’t take advantage of fast MPB access
Large pages cannot be used together with the MPB
  Singularity uses large pages to create the identity mapping spanning 4 GB
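A back-of-the-envelope calculation shows why losing large pages is costly for a 4 GB identity mapping. The page sizes below are assumptions (standard 32-bit x86 paging without PAE, with 4 KB small pages and 4 MB large pages); the slide does not state them, and `identity_map_cost` is a hypothetical helper.

```python
KB = 1 << 10
MB = 1 << 20
GB = 1 << 30

ADDRESS_SPACE = 4 * GB
SMALL_PAGE = 4 * KB        # assumed: standard 32-bit x86 page
LARGE_PAGE = 4 * MB        # assumed: 32-bit x86 large page without PAE
ENTRY_SIZE = 4             # bytes per page directory or page table entry
ENTRIES_PER_TABLE = 1024   # entries in one page directory or page table


def identity_map_cost(page_size):
    """Return (mapping entries, bytes of paging structures) for a 4 GB identity map."""
    entries = ADDRESS_SPACE // page_size
    directory = ENTRIES_PER_TABLE * ENTRY_SIZE  # one page directory is always needed
    if page_size == LARGE_PAGE:
        # Large pages are mapped directly by page directory entries.
        return entries, directory
    tables = entries // ENTRIES_PER_TABLE
    return entries, directory + tables * ENTRIES_PER_TABLE * ENTRY_SIZE


if __name__ == "__main__":
    for label, size in (("large", LARGE_PAGE), ("small", SMALL_PAGE)):
        entries, total = identity_map_cost(size)
        print(f"{label} pages: {entries} entries, {total // KB} KB of paging structures")
```

With these assumed sizes, the large-page mapping fits in a single 4 KB page directory, while mapping the same 4 GB with small pages needs 1024 additional page tables, about 4 MB of paging structures in total.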


Interface

A telnet connection to each core
  The same serial transport emulation via KdProxy.exe was used


Cache coherency matters

A read-only OS image is shared among all cores
Message passing code uses MPB-mapped buffers and CL1FLUSH-aware memcpy()
Large shared memory storage is accessible via dynamically remapped LUTs
  R/W access is possible with proper cache flushing and/or caching settings in PTEs
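To illustrate why a flush-aware copy is needed without hardware cache coherency, here is a toy model. It is not the Black Cloud OS code: `SharedBuffer`, `Core` and `flush_aware_read` are hypothetical names, the cache is modeled as a simple write-through dictionary, and `invalidate` merely stands in for whatever cache invalidation the real memcpy() performs.

```python
class SharedBuffer:
    """Memory visible to every core (stands in for an MPB-mapped buffer)."""

    def __init__(self, size):
        self.data = bytearray(size)


class Core:
    """A core with a private, non-coherent cache in front of shared memory."""

    def __init__(self, shared):
        self.shared = shared
        self.cache = {}  # address -> cached byte value

    def read(self, addr):
        # A cache hit returns the cached value even if shared memory has changed.
        if addr not in self.cache:
            self.cache[addr] = self.shared.data[addr]
        return self.cache[addr]

    def write_through(self, addr, value):
        # Simplification: writes update the local cache and shared memory at once.
        self.cache[addr] = value
        self.shared.data[addr] = value

    def invalidate(self):
        # Stands in for the invalidation step of a flush-aware copy.
        self.cache.clear()

    def flush_aware_read(self, addr, length):
        """Invalidate first, then copy, so the data comes from shared memory."""
        self.invalidate()
        return bytes(self.read(addr + i) for i in range(length))


shared = SharedBuffer(8)
sender, receiver = Core(shared), Core(shared)

receiver.read(0)                           # receiver caches the old value
sender.write_through(0, 42)                # sender publishes a new value
stale = receiver.read(0)                   # plain read still sees the old value
fresh = receiver.flush_aware_read(0, 1)    # invalidating first sees the new one
```

In this model the plain read returns the stale byte, while the flush-aware read returns the value the sender wrote. Nothing tells the receiver that its cached copy is out of date, so the message passing code has to invalidate explicitly.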


Performance

Core’s memory interface bandwidth is limited
  One outstanding memory operation

Frequency (MHz)    Cache line write latency (cycles)    Maximum bandwidth (MB/s)
533                39 (cached)                          417
533                131 (uncached)                       124.2
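The bandwidth figures follow directly from the latency figures once only one memory operation can be outstanding at a time. The sketch below reproduces them under two assumptions that the slide does not state: a 32-byte cache line, and "MB" meaning 2**20 bytes. The function name is hypothetical.

```python
CACHE_LINE_BYTES = 32   # assumed: cache line size of the SCC cores
MB = 1 << 20            # assumed: the slide's "MB" means 2**20 bytes


def max_bandwidth_mb_s(freq_mhz, cycles_per_line):
    """Bandwidth when one cache line write completes every cycles_per_line cycles."""
    lines_per_second = freq_mhz * 1_000_000 / cycles_per_line
    return lines_per_second * CACHE_LINE_BYTES / MB


if __name__ == "__main__":
    print(round(max_bandwidth_mb_s(533, 39), 1))    # cached writes
    print(round(max_bandwidth_mb_s(533, 131), 1))   # uncached writes
```

With these assumptions the results round to the 417 MB/s and 124.2 MB/s shown in the table, which suggests the "maximum bandwidth" column is derived from the measured latencies rather than measured separately.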


Performance

Memory controller bandwidth is limited

[Chart: bandwidth versus number of cores. X axis: Number of Cores (1 to 12). Y axis: Bandwidth in MB/s (tick labels from 0 to 700000000). Series: Write Bandwidth (uncached), Read Bandwidth (uncached), Write Bandwidth (cached), Read Bandwidth (cached).]


Conclusions

The SCC is an experimental platform tailored for message passing
  Lack of cache coherency makes us think hard about message passing
  The chip has enough cores to play with scalability
Compare apples to apples
  The cache and memory subsystems are significantly different
  The SCC is super parallel, not super fast


Q&A