1 Lecture 5: Scheduling and Reliability Topics: scheduling policies, handling DRAM errors.

Lecture 5: Scheduling and Reliability

• Topics: scheduling policies, handling DRAM errors

PAR-BS Mutlu and Moscibroda, ISCA’08

• A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests

• Within a batch, priority is first given to row buffer hits, then to threads with a higher “rank”, then to older requests

• Rank is computed based on the thread’s memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time and overall throughput

• By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling

TCM Kim et al., MICRO 2010

• Organize threads into latency-sensitive and bw-sensitive clusters based on memory intensity; former gets higher priority

• Within bw-sensitive cluster, priority is based on rank

• Rank is determined based on “niceness” of a thread and the rank is periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)

• Threads with low row buffer hit rates and high bank level parallelism are considered “nice” to others

Minimalist Open-Page Kaseridis et al., MICRO 2011

• Place 4 consecutive cache lines in one bank, then the next 4 in a different bank and so on – provides the best balance between row buffer locality and bank-level parallelism

• Don’t have to worry as much about fairness

• Scheduling first takes priority into account, where priority is determined by wait-time, prefetch distance, and MLP in thread

• A row is precharged after 50 ns, or immediately following a prefetch-dictated large burst

Other Scheduling Ideas

• Using reinforcement learning Ipek et al., ISCA 2008

• Co-ordinating across multiple MCs Kim et al., HPCA 2010

• Co-ordinating requests from GPU and CPU Ausavarungnirun et al., ISCA 2012

• Several schedulers in the Memory Scheduling Championship at ISCA 2012

• Predicting the number of row buffer hits Awasthi et al., PACT 2011

Basic Reliability

• Every 64-bit data transfer is accompanied by an 8-bit (Hamming) code – typically stored in a x8 DRAM chip

• Guaranteed to detect any 2 errors and recover from any single-bit error (SECDED)

• Such DIMMs are commodities and are sufficient for most applications that are inherently error-tolerant (search)

• 12.5% overhead in storage and energy

• For a BCH code, to correct t errors in k-bit data, need an r-bit code, r = t * ceil (log2 k) + 1

Terminology

• Hard errors: caused by permanent device-level faults

• Soft errors: caused by particle strikes, noise, etc.

• SDC: silent data corruption (error was never detected)

• DUE: detected unrecoverable error

• DUE in memory caused by a hard error will typically lead to DIMM replacement

• Scrubbing: a background scan of memory (1GB every 45 mins) to detect and correct 1-bit errors

Field Studies Schroeder et al., SIGMETRICS 2009

• Memory errors are the top causes for hw failures in servers and DIMMs are the top component replacements in servers

• Study examined Google servers in 2006-2008, using DDR1, DDR2, and FBDIMM

Field Studies Schroeder et al., SIGMETRICS 2009

• A machine with past errors is more likely to have future errors• 20% of DIMMs account for 94% of errors• DIMMs in platforms C and D see higher UE rates because they do not have chipkill• 65-80% of uncorrectable errors are preceded by a correctable error in the same month – but predicting a UE is very difficult• Chip/DIMM capacity does not strongly influence error rates• Higher temperature by itself does not cause more errors, but higher system utilization does• Error rates do increase with age; the increase is steep in the 10-18 month range and then flattens out

Chipkill

• Chipkill correct systems can withstand failure of an entire DRAM chip

• For chipkill correctness the 72-bit word must be spread across 72 DRAM chips or, a 13-bit word (8-bit data and 5-bit ECC) must be spread across 13 DRAM chips

RAID-like DRAM Designs

• DRAM chips do not have built-in error detection

• Can employ a 9-chip rank with ECC to detect and recover from a single error; in case of a multi-bit error, rely on a second tier of error correction

• Can do parity across DIMMs (needs an extra DIMM); use ECC within a DIMM to recover from 1-bit errors; use parity across DIMMs to recover from multi-bit errors in 1 DIMM

• Reads are cheap (must only access 1 DIMM); writes are expensive (must read and write 2 DIMMs) Used in some HP servers

RAID-like DRAM Udipi et al., ISCA’10

• Add a checksum to every row in DRAM; verified at the memory controller

• Adds area overhead, but provides self-contained error detection

• When a chip fails, can re-construct data by examining another parity DRAM chip

• Can control overheads by having checksum for a large row or one parity chip for many data chips

• Writes are again problematic

Virtualized ECC Yoon and Erez, ASPLOS’10

• Also builds a two-tier error protection scheme, but does the second tier in software

• The second-tier codes are stored in the regular physical address space (not specialized DRAM chips); software has flexibility in terms of the types of codes to use and the types of pages that are protected

• Reads are cheap; writes are expensive as usual; but, the second-tier codes can now be cached; greatly helps reduce the number of DRAM writes

• Requires a 144-bit datapath (increases overfetch)

LoT-ECC Udipi et al., ISCA 2012

• Use checksums to detect errors and parity codes to fix• Requires access of only 9 DRAM chips per read, but the storage overhead grows to 26%

• Bullet

1 Lecture 5: Scheduling and Reliability Topics: scheduling policies, handling DRAM errors.

Documents

Transcript of 1 Lecture 5: Scheduling and Reliability Topics: scheduling policies, handling DRAM errors.

Adaptive-Latency DRAM: Optimizing DRAM Timing …users.ece.cmu.edu/~omutlu/pub/adaptive-latency-dram_hpca15.pdf · Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common ...

Gather-Scatter DRAM: In-DRAM Address ranslationT …people.inf.ethz.ch/omutlu/pub/GSDRAM-gather-scatter-dram...Gather-Scatter DRAM: In-DRAM Address ranslationT to Improve the Spatial

DRAM capacity

DDR3 Synchronous DRAM Memory - … · DDR3 Synchronous DRAM Memory ... Read/Write leveling . DDR3 Synchronous DRAM 3 8-bit prefetch/burst length . DDR3 Synchronous DRAM 4 Commands

Scheduling Jobs with Estimation Errors for Multi-Server ...

DRAM Errors in the Wild: A Large-Scale Field Study · DRAM Errors in the Wild: A Large-Scale Field Study Schroeder et al. In Proc. of SIGMETRICS 2009 Presented by: John Otto Irene

Co-Architecting Controllers and DRAM to Enhance DRAM ...camelab.org/uploads/Main/Co-Architecting Controllers and DRAM to... · Co-Architecting Controllers and DRAM to Enhance DRAM

Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design

A Read-Write Aware DRAM Scheduling for Power … · A Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core System DRAM Power Management Demand of low power systems DRAM

HY425 Lecture 15: DRAM Technology - University of Cretehy425/2012f/lectures/lecture15...DRAM basics Advanced DRAM technology Virtual memory HY425 Lecture 15: DRAM Technology Dimitrios

High Speed DRAM Interface - Hanyangiclab.hanyang.ac.kr/research/data/High-speed DRAM interface.pdf · High Speed DRAM Interface Changsik Yoo DRAM Design 3 Samsung Electronics. csyoo@ieee.orgcsyoo@ieee.org

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for

DRAM Memory Controllerscs7810/pres/dram-cs7810-mem-ctlr-x2.pdf · DRAM Memory Controllers Reference: “Memory Systems: Cache, DRAM, Disk Bruce Jacob, Spencer Ng, & David Wang Today’s

DRAM - Mouser

DRAM Lecture2

Lecture05 Dram

DRAM Technology

Scalable Many-Core Memory Systems Lecture 1, Topic 1: DRAM ...users.ece.cmu.edu/.../onur-ACACES2013-Lecture1-dram... · Solution 1: Tolerate DRAM ! Overcome DRAM shortcomings with

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Using Scheduling Templates to Your Advantage · cut back on confusion and scheduling errors Created appointment called “Day Clinic” and only schedulers are allowed to use this