EVOLUTION OF THE WINDOWS KERNEL ARCHITECTURE

Dave Probert, Ph.D. - Windows Kernel Architect
Microsoft Windows Division

08.10.2009
Buenos Aires

Copyright Microsoft Corporation

About Me
Ph.D. in Computer Engineering (Operating Systems w/o Kernels)
Kernel Architect at Microsoft for over 13 years
  Managed platform-independent kernel development in Win2K/XP
  Working on multi-core & heterogeneous parallel computing support
  Architect for UMS in Windows 7 / Windows Server 2008 R2
Co-instigator of the Windows Academic Program
  Providing kernel source and curriculum materials to universities
  http://microsoft.com/WindowsAcademic or [email protected]
Wrote the Windows material for leading OS textbooks
  Tanenbaum, Silberschatz, Stallings
  Consulted on others, including a successful OS textbook in China

UNIX vs NT Design Environments

Environment which influenced fundamental design decisions

UNIX [1969]
  16-bit program address space
  Kbytes of physical memory
  Swapping system with memory mapping
  Kbytes of disk, fixed disks
  Uniprocessor
  State-machine based I/O devices
  Standalone interactive systems
  Small number of friendly users

Windows (NT) [1989]
  32-bit program address space
  Mbytes of physical memory
  Virtual memory
  Mbytes of disk, removable disks
  Multiprocessor (4-way)
  Micro-controller based I/O devices
  Client/Server distributed computing
  Large, diverse user populations

Effect on OS Design

Although both Windows and Linux have adapted to changes in the environment, the original design environments (i.e. in 1989 and 1969) heavily influenced the design choices:

NT vs UNIX (influencing factors in parentheses)
  Unit of concurrency: Threads vs processes (Addr space, uniproc)
  Process creation: CreateProcess() vs fork() (Addr space, swapping)
  I/O: Async vs sync (Swapping, I/O devices)
  Namespace root: Virtual vs Filesystem (Removable storage)
  Security: ACLs vs uid/gid (User populations)

Today’s Environment [2009]

64-bit addresses
GBytes of physical memory
TBytes of rotational disk
New Storage hierarchies (SSDs)
Hypervisors, virtual processors
Multi-core/Many-core
Heterogeneous CPU architectures, Fixed function hardware
High-speed internet/intranet, Web Services
Media-rich applications
Single user, but vulnerable to hackers worldwide

Convergence: Smartphone / Netbook / Laptop / Desktop / TV / Web / Cloud

Windows Architecture

[Diagram: Windows architecture. Recoverable labels, top to bottom:]
User Mode
  System Processes: Service Control Mgr., LSASS, WinLogon, Session Manager
  Services: SvcHost.Exe, WinMgt.Exe, SpoolSv.Exe, Services.Exe
  Applications: Task Manager, Explorer, User Application
  Environment Subsystems: OS/2, POSIX, Windows
  Subsystem DLLs, Windows DLLs
  NTDLL.DLL
Kernel Mode
  System Threads
  System Service Dispatcher
  (kernel mode callable interfaces)
  I/O Mgr, File System Cache, Object Mgr., Plug and Play Mgr., Power Mgr., Security Reference Monitor, Virtual Memory, Processes & Threads, Configuration Mgr (registry), Local Procedure Call, Windows USER, GDI
  Device & File Sys. Drivers, Graphics Drivers
  Kernel
  Hardware Abstraction Layer (HAL)
hardware interfaces (buses, I/O devices, interrupts, interval timers, DMA, memory cache control, etc., etc.)

Kernel-mode Architecture of Windows

[Diagram: layered kernel-mode architecture. Recoverable labels, top to bottom:]
user mode
  NT API stubs (wrap sysenter) -- system library (ntdll.dll)
kernel mode
  NTOS kernel layer: Trap/Exception/Interrupt Dispatch; CPU mgmt: scheduling, synchr, ISRs/DPCs/APCs
  NTOS executive layer: Procs/Threads, Virtual Memory, IPC, glue, I/O, Object Mgr, Registry, Security, Caching Mgr
  Drivers: Devices, Filters, Volumes, Networking, Graphics
  Hardware Abstraction Layer (HAL): BIOS/chipset details
firmware/hardware
  CPU, MMU, APIC, BIOS/ACPI, memory, devices

Kernel/Executive layers

Kernel layer (ntos/ke, ~5% of NTOS source)
  Abstracts the CPU
    Threads, Asynchronous Procedure Calls (APCs)
    Interrupt Service Routines (ISRs)
    Deferred Procedure Calls (DPCs, aka Software Interrupts)
  Provides low-level synchronization
Executive layer
  OS Services running in a multithreaded environment
  Full virtual memory, heap, handles
Extensions to NTOS: drivers, file systems, network, …

NT (Native) API examples

NtCreateProcess (&ProcHandle, Access, SectionHandle, DebugPort, ExceptionPort, …)

NtCreateThread (&ThreadHandle, ProcHandle, Access, ThreadContext, bCreateSuspended, …)

NtAllocateVirtualMemory (ProcHandle, Addr, Size, Type, Protection, …)

NtMapViewOfSection (SectionHandle, ProcHandle, Addr, Size, Protection, …)

NtReadVirtualMemory (ProcHandle, Addr, Size, …)

NtDuplicateObject (srcProcHandle, srcObjHandle, dstProcHandle, dstHandle, Access, Attributes, Options)

Windows Vista Kernel Changes
Kernel changes mostly minor improvements
  Algorithms, scalability, code maintainability
CPU timing: Uses Time Stamp Counter (TSC)
  Interrupts not charged to threads
  Timing and quanta are more accurate
Communication
  ALPC: Advanced Lightweight Procedure Calls
  Kernel-mode RPC
  New TCP/IP stack (integrated IPv4 and IPv6)
I/O
  Remove a context switch from I/O Completion Ports
  I/O cancellation improvements
Memory management
  Address space randomization (DLLs, stacks)
  Kernel address space dynamically configured
Security: BitLocker, DRM, UAC, Integrity Levels

Windows 7 Kernel Changes
Miscellaneous kernel changes
MinWin
  Change how Windows is built
  Lots of DLL refactoring
  API Sets (virtual DLLs)
Working-set management
  Runaway processes quickly start reusing own pages
  Break up kernel working-set into multiple working-sets
    System cache, paged pool, pageable system code
Security
  Better UAC, new account types, fewer BitLocker blockers
Energy efficiency
  Trigger-started background services
  Core Parking
  Timer-coalescing, tick skipping
Major scalability improvements for large server apps
  Broke apart last two major kernel locks, >64p
Kernel support for ConcRT
  User-Mode Scheduling (UMS)

MinWin
MinWin is first step at creating architectural partitions
  Can be built, booted and tested separately from the rest of the system
  Higher layers can evolve independently
  An engineering process improvement, not a microkernel NT!
MinWin was defined as set of components required to boot and access network
  Kernel, file system driver, TCP/IP stack, device drivers, services
  No servicing, WMI, graphics, audio or shell, etc, etc, etc
MinWin footprint: 150 binaries, 25MB on disk, 40MB in-memory

MinWin Layering

[Diagram: two layers]
Upper layers: Shell, Graphics, Multimedia, Layered Services, Applets, Etc.
MinWin: Kernel, HAL, TCP/IP, File Systems, Drivers, Core System Services

Timer Coalescing
Secret of energy efficiency: Go idle and Stay idle
Staying idle requires minimizing timer interrupts
Before, periodic timers had independent cycles even when period was the same
New timer APIs permit timer coalescing
  Application or driver specifies tolerable delay
  Timer system shifts timer firing
[Figure: periodic timer events against the 15.6 ms timer tick, Vista vs Windows 7. Credit: MarkRuss]
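The coalescing idea above can be sketched as a small simulation. This is only an illustrative model, not the Windows timer implementation: the function names, the 50 ms alignment interval and the sample timers are all assumptions made for the example. Each timer carries a tolerable delay, and its firing is shifted to the next alignment boundary when that boundary falls within the tolerance.

```python
def coalesced_fire_time(due_ms, tolerable_delay_ms, align_ms):
    """Delay a timer to the next multiple of align_ms, if that stays within tolerance."""
    aligned = -(-due_ms // align_ms) * align_ms  # integer ceiling to an alignment boundary
    return aligned if aligned - due_ms <= tolerable_delay_ms else due_ms


def wakeups(timers, horizon_ms, align_ms=None):
    """Return the sorted distinct times at which the CPU must wake up.

    timers: list of (first_due_ms, period_ms, tolerable_delay_ms) tuples.
    align_ms=None models uncoalesced timers.
    """
    times = set()
    for first_due, period, tolerance in timers:
        due = first_due
        while due < horizon_ms:
            if align_ms is None:
                times.add(due)
            else:
                times.add(coalesced_fire_time(due, tolerance, align_ms))
            due += period
    return sorted(times)


# Three periodic timers with the same 100 ms period but different phases (hypothetical values).
timers = [(10, 100, 50), (35, 100, 50), (45, 100, 50)]
uncoalesced = wakeups(timers, horizon_ms=300)
coalesced = wakeups(timers, horizon_ms=300, align_ms=50)
print(len(uncoalesced), "wakeups without coalescing")
print(len(coalesced), "wakeups with coalescing:", coalesced)
```

In this toy setup the three timers wake the CPU nine times when they run on independent cycles and three times once their firings are shifted onto shared boundaries, which is what lets the processor stay idle longer.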

Broke apart the Dispatcher Lock
Scheduler Dispatcher lock hottest on server workloads
  Lock protects all thread state changes (wait, unwait)
  Very hot lock at >64x
Dispatcher lock broken up in Windows 7 / Server 2008 R2
  Each object protected by its own lock
  Many operations are lock-free

Removed PFN Lock
Windows tracks the state of pages in physical memory
  In use: in working sets
  Not assigned: on paging lists: free, modified, standby, …
Before, all page state changes protected by global PFN (Page Frame Number) lock
As of Windows 7 the PFN lock is gone
  Pages are now locked individually
  Improves scalability for large memory applications
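A minimal sketch of the "pages are locked individually" idea, assuming a toy page database: the `PageFrame` class, the `transition` helper and the state names as an enum are hypothetical and only mirror the states listed on the slide. The kernel's actual PFN database and locking primitives are not shown in the source and are not modeled here.

```python
import threading
from enum import Enum


class PageState(Enum):
    FREE = "free"
    MODIFIED = "modified"
    STANDBY = "standby"
    IN_USE = "in use"


class PageFrame:
    def __init__(self):
        self.state = PageState.FREE
        self.lock = threading.Lock()   # one lock per page, no global lock

    def transition(self, expected, new):
        """Move the page from `expected` to `new`; return False if the state differs."""
        with self.lock:
            if self.state is not expected:
                return False
            self.state = new
            return True


frames = [PageFrame() for _ in range(8)]


def fault_in(frame_numbers):
    for pfn in frame_numbers:
        frames[pfn].transition(PageState.FREE, PageState.IN_USE)


# Two threads change the state of disjoint pages without contending on a shared lock.
workers = [threading.Thread(target=fault_in, args=(range(start, 8, 2),)) for start in (0, 1)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print([f.state.value for f in frames])
```

With a single global lock, both workers would serialize on every state change even though they never touch the same page; with a lock per page they only contend when they actually race for one frame.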

The Silicon Power Wall
The situation:
  Power2 ∝ Clock frequency
  Voltage ∝ Power2
  ⇨ Clock frequency and Voltage offset each other
  Clock frequency inversely proportional to logic path length
Bad News:
  Power is about as low as it can go
  Logic paths between clocked elements are pretty short
Good News:
  Moore’s Law continues (# transistors doubles ~22 months)
  All that parallel computational theory is going into practice
Transistors going into more cores, not faster cores!
Software subject to Amdahl’s Law, not Moore’s Law
(or Gustafson’s Law – if my wife can find large enough datasets she cares about)
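To make the Amdahl vs Gustafson remark concrete, here is a short sketch using the standard textbook forms of both laws. The function names and the 95% parallel fraction are assumptions chosen for illustration; note that in Gustafson's law the parallel fraction is measured on the parallel run, where the problem size scales with the number of cores.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Fixed problem size: the serial fraction bounds the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Problem size grows with the core count (scaled speedup)."""
    return (1.0 - parallel_fraction) + parallel_fraction * cores


for cores in (4, 64, 1024):
    print(cores,
          round(amdahl_speedup(0.95, cores), 2),
          round(gustafson_speedup(0.95, cores), 2))
```

Under Amdahl's Law a 5% serial fraction caps the speedup below 20 no matter how many cores are added, while the scaled speedup keeps growing as long as the dataset grows with the machine.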

Approaches to HW parallelism
Homogeneous
  More big superscalar cores
    Extend with private (or shared) SIMD engines (SSE on steroids)
    (Maybe) not very energy efficient
  A few more big cores and lots of smaller, slower, cooler cores
    Use SIMD for performance
    Shut off idle small cores for energy efficiency (but leakage?)
  Lots of little fully programmable cores, all the same
    Nobody has ever gotten this to work – more on this later
Heterogeneous
  Programmable Accelerators (e.g. GPUs)
    Attach loosely-coupled, specialized (non-x86), energy-efficient cores
  Fixed-function Accelerators
    Very energy-efficient, device-like computational units for very specific tasks

User Mode Scheduling (UMS)
Improve support for efficient cooperative multithreaded scheduling of small tasks (over-decomposition)
  ⇒ Want to schedule tasks in user-mode
  ⇒ Use NT threads to simulate CPUs, multiplex tasks onto these threads
When a task calls into the kernel and blocks, the CPU may get scheduled to a different app
  ⇒ If a single NT thread per CPU, when it blocks it blocks.
  ⇒ Could have extra threads, but then kernel and user-mode are competing to schedule the CPU
Tasks run arbitrary Win32 code (but only x64/IA64)
  ⇒ Assumes running on an NT thread (TEB, kernel thread)
Used by ConcRT (Visual Studio 2010’s Concurrency Run-Time)

Windows 7 User-Mode Scheduling
UMS breaks NT thread into two parts:
  UT: user-mode portion (TEB, ustack, registers)
  KT: kernel-mode portion (ETHREAD, kstack, registers)
Three key properties:
  User-mode scheduler switches UTs w/o ring crossing
  KT switch is lazy: at kernel entry (e.g. syscall, pagefault)
  CPU returned to user-mode scheduler when KT blocks
KT “returns” to user-mode by queuing completion
  User-mode scheduler schedules corresponding UT
  (similar to scheduler activations, etc)

Normal NT Threading

[Diagram labels: user side: UT0, UT1, UT2; kernel side: KT0, KT1, KT2, trap code, Kernel-mode Scheduler, NTOS executive; x86 core]

NT Thread is Kernel Thread (KT) and User Thread (UT)
UT/KT form a single logical thread representing NT thread in user or kernel
  KT: ETHREAD, KSTACK, link to EPROCESS
  UT: TEB, USTACK

User-Mode Scheduling (UMS)

[Diagram labels: user side: Primary Thread, User-mode Scheduler, UT0, UT1, UT Completion list; kernel side: trap code, NTOS executive, Thread Parking with KT0, KT1, KT2; annotation: KT0 blocks]

Only primary thread runs in user-mode
Trap code switches to parked KT
KT blocks ⇒ primary returns to user-mode
KT unblocks & parks ⇒ queue UT completion

UMS
Based on NT threads
  ⇒ Each NT thread has user & kernel parts (UT & KT)
  ⇒ When a thread becomes UMS, KT never returns to UT
    ⇒ (Well, sort of)
  ⇒ Instead, the primary thread calls the USched
USched
  ⇒ Switches between UTs, all in user-mode
  ⇒ When a UT enters kernel and blocks, the primary thread will hand CPU back to the USched declaring UT blocked
  ⇒ When UT unblocks, kernel queues notification
  ⇒ USched consumes notifications, marks UT runnable
Primary Thread
  ⇒ Self-identified by entering kernel with wrong TEB
  ⇒ So UTs can migrate between threads
  ⇒ Affinities of primaries and KTs are orthogonal issues
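The USched behaviour described above (switch UTs in user mode, get the CPU back when a UT blocks, pick up unblocked UTs from a completion list) can be sketched as a toy simulation. This is a conceptual model only, not the Windows UMS API: `run_usched`, the generator-based UTs and the tick-based "kernel" are all assumptions made for illustration.

```python
from collections import deque


def run_usched(uts):
    """Simulate one primary thread multiplexing user threads (UTs).

    uts: dict mapping a UT name to a generator. A UT yields "yield" to give
    up the CPU cooperatively, or ("block", ticks) to model a blocking
    kernel call that lasts `ticks` scheduler steps (ticks >= 1).
    Returns the order in which UTs were given the CPU.
    """
    ready = deque(uts)           # runnable UTs, switched purely in "user mode"
    blocked = {}                 # UT name -> ticks left in the simulated kernel
    completion_list = deque()    # notifications queued when a UT unblocks
    trace = []
    while ready or blocked or completion_list:
        # The scheduler consumes notifications and marks those UTs runnable.
        while completion_list:
            ready.append(completion_list.popleft())
        if ready:
            name = ready.popleft()
            trace.append(name)
            try:
                event = next(uts[name])
            except StopIteration:
                pass                          # UT finished
            else:
                if event == "yield":
                    ready.append(name)
                else:                         # ("block", ticks)
                    blocked[name] = event[1]  # CPU goes back to the scheduler
        # One tick of simulated kernel time passes for every blocked UT.
        for name in list(blocked):
            blocked[name] -= 1
            if blocked[name] == 0:
                del blocked[name]
                completion_list.append(name)
    return trace


def io_ut():
    yield ("block", 2)   # blocking system call


def cpu_ut():
    yield "yield"
    yield "yield"


trace = run_usched({"io": io_ut(), "cpu": cpu_ut()})
print(trace)
```

The point of the trace is that the scheduler keeps running the compute-bound UT while the other one is blocked, and only reschedules the blocked UT after its completion has been queued.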

UMS Thread Roles

Primary threads: represent CPUs. Normal app threads enter the USched world and become primaries; primaries can also be created by UScheds to allow parallel execution.

Primaries represent concurrent execution

UMS threads (UT/KTs): allow blocking in the kernel without losing the CPU

UMS threads represent concurrent blocking in kernel

Thread Scheduling vs UMS

[Figure: two panels. "Thread Scheduling": Core 1 and Core 2, Thread1 to Thread6, with the threads not on a core shown as non-running threads. "Cooperative Scheduling": Core 1 and Core 2, User Thread 1 to 6 paired with Kernel Thread 1 to 6. Credit: MarkRuss]

Win32 compat considerations

Why not Win32 fibers?
TEB issues
  ⇒ Contains TLS and Win32-specific fields (incl LastError)
  ⇒ Fibers run on multiple threads, so TEB state doesn’t track
Kernel thread issues
  ⇒ Visibility to TEB
  ⇒ I/O is queued to thread
  ⇒ Mutexes record thread owner
  ⇒ Impersonation
  ⇒ Cross-thread operations expect to find threads and IDs
  ⇒ Win32 code has thread and affinity awareness
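The TEB issue can be illustrated by analogy. In this sketch a Python generator plays the role of a fiber and `threading.local` stands in for per-thread TEB fields such as LastError; both are assumptions of the example, not Win32 code. When the "fiber" is resumed on a different OS thread, the per-thread state it wrote earlier does not follow it.

```python
import threading

teb = threading.local()   # stands in for per-thread TEB fields such as LastError


def fiber_body():
    teb.last_error = 5                         # like SetLastError(5)
    yield "switched out"                       # the fiber gives up its thread here
    yield getattr(teb, "last_error", None)     # like GetLastError() after resuming


def resume_on_new_thread(fiber, results):
    worker = threading.Thread(target=lambda: results.append(next(fiber)))
    worker.start()
    worker.join()


# Same thread for both steps: the per-thread state is still there.
same = fiber_body()
same_thread_results = [next(same), next(same)]

# A different OS thread for each step: the second thread sees no LastError.
migrating = fiber_body()
migrated_results = []
resume_on_new_thread(migrating, migrated_results)
resume_on_new_thread(migrating, migrated_results)
print(same_thread_results, migrated_results)
```

UMS avoids this class of problem because each UT keeps its own TEB, so code that assumes it is running on a real NT thread continues to see consistent per-thread state.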

Futures: Master/Slave UMS?

[Diagram labels: Remote x86, Remote Scheduler, UT0, UT1, UT2; Syscall Request Queue, Syscall Completion Queue; remote kernel, x86 core, trap code, NTOS executive, Kernel-mode Scheduler, Thread Parking with KT0, KT1, KT2]

UTs (can) run on accelerators or x86s
KTs run on x86s, syscalls remoted/batched
Pagefaults are just like syscalls
Accelerator never “loses the CPU” (implicit primary)

Operating Systems Futures
Many-core challenge
  New driving force in software innovation:
    Amdahl’s Law overtakes Moore’s Law as high-order bit
  Heterogeneous cores?
OS Scalability
  Loosely-coupled OS: mem + cpu + services?
  Energy efficiency
Shrink-wrap and Freeze-dry applications?
Hypervisor/Kernel/Runtime relationships
  Move kernel scheduling (cpu/memory) into run-times?
  Move kernel resource management into Hypervisor?

Windows Academic Program
Windows Kernel Internals
  Windows kernel in source (Windows Research Kernel – WRK)
  Windows kernel in PowerPoint (Curriculum Resource Kit – CRK)
Based on Windows Server 2003 Service Pack 1
  Latest kernel at time of release
  First kernel release with AMD64 support
Joint program between Windows Product Group and MS Academic Groups
  Program directed by Arkady Retik (Need a DVD? Have questions?)
Information available at http://microsoft.com/WindowsAcademic OR [email protected]

Microsoft Academic Contacts in Buenos Aires
  Miguel Saez ([email protected]) or Ezequiel Glinsky ([email protected])

Thank you very much