mentorpaper_88190


SYSTEM DESIGN WHITE PAPER

www.mentor.com

DDR4 BOARD DESIGN AND SIGNAL

INTEGRITY VERIFICATION CHALLENGES

NITIN BHAGWATH, MENTOR GRAPHICS

CHUCK FERRY, MENTOR GRAPHICS

ARPAD MURANYI, MENTOR GRAPHICS

ATSUSHI SATO, FUJITSU SEMICONDUCTOR LIMITED

MOTOAKI MATSUMURA, FUJITSU SEMICONDUCTOR LIMITED

AKIHIRO MIKI, FUJITSU VLSI LIMITED

RANDY WOLFF, MICRON TECHNOLOGY

This paper was presented at DesignCon, January 29, 2015.

It is reproduced by permission.

ABSTRACT

Besides faster data rates, the new DDR4 standard incorporates additional changes from

prior DDR technologies which impact the board design engineer. New factors in DDR4

such as an asymmetric termination scheme, data bus inversion and signal validation

using eye masks require new methods of validating designs through simulation. This

paper investigates the effects of DDR4’s Pseudo Open Drain (POD) driver on data bus

signaling and describes methodologies for dynamically calculating the DRAM’s internal

VrefDQ level required for data eye analysis, methodologies for generating and verifying

the data eye as well as ways of incorporating write leveling and calibration into the

simulation. Additionally, evaluating Simultaneous Switching Noise (SSN) by incorporating power integrity effects into the signal integrity analysis is critical to board design and timing closure, and will be elaborated with examples. A system

design example using IBIS 5.0 power aware models will be described including a

simulation accuracy study comparing the IBIS results with transistor-level models.



INTRODUCTION

DDR4 is the next step in JEDEC’s family of DRAM parts. It has been developed to serve the market needs of higher

speeds and lower power consumption. These factors have contributed to new features in DDR4, as well as new

requirements which need to be accounted for while designing a DDR4 system.

The first sections of this paper investigate DDR4’s Pseudo Open Drain driver and what its use means for power

consumption and Vref levels for the receivers. Subsequent sections of the paper look at a DDR4 system design

example and the need for simulating with IBIS power aware models versus transistor level models for Simultaneous

Switching Noise characterization.

ADVANTAGE OF POD OVER SSTL

One of the major market forces acting on the DRAM industry is the demand for lower power consumption of the

memory devices. To this end, DDR4 uses a new drive standard, known as Pseudo Open Drain, or POD. In POD, the receiver terminates the signal to the high rail (VDDQ), rather than to half the rail voltage as in the Stub Series Terminated Logic (SSTL) scheme used by DDR3.

To see the difference that the termination scheme makes in the total power consumption, the current draws in the

low and high states can be compared.

When in the low state, there is a current draw in both SSTL and POD. In fact, POD might draw slightly higher current, since it terminates to the full voltage rail whereas SSTL terminates to only half the rail. This is somewhat offset by the lower voltage rail in DDR4 (1.2V VDDQ, versus 1.5V in DDR3).

Figure 1 - Termination of DDR4 (POD) and DDR3 (SSTL)

Figure 2 - Current Comparison when driving Low


3/18

www.mentor.com3


However, the main difference between the two drive options is highlighted when a high is driven. Whereas SSTL

continues to draw current at a rate approximately equal to when driving low, POD draws no power when driving a

high.
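The current comparison can be made concrete with a DC sketch of the two termination schemes. It assumes a very short line, a purely resistive driver (R_ON) and termination (R_TT), and DC conditions only. SSTL is modeled as a single termination to VDDQ/2, so only the current through the signal path is counted; the 34-ohm and 40-ohm values and the function names are hypothetical.

```python
def sstl_current_a(vddq, r_on, r_tt):
    """Static current magnitude (A) in a simplified SSTL link: driver
    resistance r_on, receiver termination r_tt to VDDQ/2.

    Driving high or low places VDDQ/2 across the same path, so the magnitude
    is identical in both states; only the direction of the current changes.
    """
    return (vddq / 2) / (r_on + r_tt)


def pod_current_a(vddq, r_on, r_tt, drive_high):
    """Static current magnitude (A) in a simplified POD link: receiver
    termination r_tt to VDDQ."""
    if drive_high:
        # Driver pulls to VDDQ and the termination sits at VDDQ:
        # no voltage across the path, so no current flows.
        return 0.0
    return vddq / (r_on + r_tt)


# Hypothetical values: 34-ohm driver, 40-ohm ODT, nominal rail of each standard.
R_ON, R_TT = 34.0, 40.0
for state in ("low", "high"):
    high = state == "high"
    i_sstl = sstl_current_a(1.5, R_ON, R_TT)        # DDR3, VDDQ = 1.5 V
    i_pod = pod_current_a(1.2, R_ON, R_TT, high)    # DDR4, VDDQ = 1.2 V
    print(f"driving {state}: DDR3/SSTL {i_sstl * 1e3:.1f} mA, "
          f"DDR4/POD {i_pod * 1e3:.1f} mA")
```

With these assumed values the POD low-state current is higher than the SSTL current, because the full 1.2V rather than 0.75V appears across the path, while the POD high-state current drops to zero.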

So, the way to decrease system power with DDR4 is to maximize the number of highs being driven. This is where the Data Bus Inversion (DBI) feature comes in handy. If at least 5 of the 8 DQ signals in a byte lane would be driven low, then all bits are inverted, and the DBI signal is asserted low to indicate that the inversion has taken place. This way, out of the total of 9 signals (8 DQ signals and one DBI), at least five are driven high. If the original data contains four or more DQ signals driven high, the data is sent unchanged and the DBI signal is de-asserted high, once again ensuring that at least 5 of the 9 bits are driven high. Thus, on every transfer, at least 5 of the 9 bits are guaranteed to be in the power-reduced high state.
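The inversion rule can be sketched for one byte lane and one transfer. The function names are hypothetical, and the sketch models only the rule described above (invert when five or more of the eight DQ bits are low), not any particular controller or DRAM implementation.

```python
def dbi_encode(byte):
    """Apply the DBI rule to one 8-bit lane for one transfer.

    Returns (dq, dbi_n): dq is the byte actually driven on the DQ pins and
    dbi_n is the level of the active-low DBI signal (0 = byte was inverted).
    """
    byte &= 0xFF
    lows = 8 - bin(byte).count("1")
    if lows >= 5:                    # five or more lows: invert the whole byte
        return (~byte) & 0xFF, 0     # ...and assert DBI low
    return byte, 1                   # otherwise send as-is with DBI high


def dbi_decode(dq, dbi_n):
    """Receiver side: undo the inversion when DBI is asserted (low)."""
    return (~dq) & 0xFF if dbi_n == 0 else dq


def highs_driven(dq, dbi_n):
    """Number of the 9 lane signals (8 DQ + DBI) in the high state."""
    return bin(dq).count("1") + dbi_n


# Exhaustive check over all 256 data bytes.
assert all(dbi_decode(*dbi_encode(b)) == b for b in range(256))
print(min(highs_driven(*dbi_encode(b)) for b in range(256)))  # 5
```

The exhaustive loop confirms both properties stated in the text: the receiver always recovers the original byte, and no byte pattern leaves fewer than 5 of the 9 signals high.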

CALCULATION OF VREF

In DDR3, an external reference voltage is compared with the input signal to distinguish a high from a low. This external voltage is often generated either by a filtered voltage divider or by an external precision voltage regulator. DDR4, however, requires that the data reference voltage (VrefDQ) be generated within the DRAM and be adjustable. The VrefDQ value is set on each power-up.

NEED FOR DYNAMICALLY CALCULATING VREF

To highlight why this variable Vref is needed in DDR4, consider a simple setup of a DDR3 driver and a DDR4 driver, each driving into a termination resistor strapped to the appropriate voltage. The average of the receiver voltages when a high is driven and when a low is driven is the ideal reference threshold voltage, since this level is equidistant from the high and low signals.

Figure 3 - Current comparison when driving High

Figure 4 - With DBI, if 5 or more lows are driven, toggle the entire byte




To calculate this center voltage, a simple setup with a driver and termination is analyzed. To simplify the calculations, the transmission line is taken to be very short, and the driver strength is assumed to be equal when driving high and low.

We can first consider the DDR3 case.

When driving high, the voltage at the receiver will be the superposition of the effect of the two voltage sources,

Vdd/2 and Vdd.

When driving low, the voltage at the receiver is set by a simple voltage divider.

The center voltage for this DDR3 setup can then be obtained by taking the average of the two results.

This value is always half the rail voltage. It is constant with respect to all other aspects of the setup, including

termination values or drive strengths.
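The DDR3 calculation can be written out as follows, using hypothetical symbols R_S for the driver output resistance (equal when driving high and low) and R_TT for the receiver termination tied to V_DD/2, under the short-line simplification above:

$$V_{RX,H} = V_{DD}\,\frac{R_{TT}}{R_S + R_{TT}} + \frac{V_{DD}}{2}\,\frac{R_S}{R_S + R_{TT}}$$

$$V_{RX,L} = \frac{V_{DD}}{2}\,\frac{R_S}{R_S + R_{TT}}$$

$$V_{center} = \frac{V_{RX,H} + V_{RX,L}}{2} = \frac{V_{DD}}{2}\cdot\frac{R_{TT} + R_S}{R_S + R_{TT}} = \frac{V_{DD}}{2}$$

The resistances cancel in the sum, which is why the DDR3 center does not depend on drive strength or termination value.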

Figure 5 - DDR3 driving high (left) and low (right)




We can then consider the DDR4 case, and apply the same sequence as above. When DDR4 is driven high, the

voltage at the receiver is simply Vdd since both the termination and the driver are strapped to Vdd. Similar to DDR3

when driving low, the receiver voltage is the result of the voltage divider.

Again, the center of the receiver eye will be the average of the two values.

Note that in this situation, the center voltage depends not only on the supply voltage but also on the characteristics of the transmitter and receiver. This implies that the ideal reference voltage at the receiver will depend on the setup, on silicon process variation, on whether the access is a read or a write, and on other system variables.
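The DDR4 case can be written out the same way, again using hypothetical symbols R_S for the driver output resistance (equal when driving high and low) and R_TT for the receiver termination, here tied to V_DD:

$$V_{RX,H} = V_{DD}$$

$$V_{RX,L} = V_{DD}\,\frac{R_S}{R_S + R_{TT}}$$

$$V_{center} = \frac{V_{RX,H} + V_{RX,L}}{2} = \frac{V_{DD}}{2}\left(1 + \frac{R_S}{R_S + R_{TT}}\right) = V_{DD}\,\frac{2R_S + R_{TT}}{2\,(R_S + R_{TT})}$$

Unlike the DDR3 result, R_S and R_TT do not cancel: as R_TT decreases relative to R_S, V_center rises toward V_DD, which is the direction of movement seen in the DDR4 ODT sweep of Figure 9.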

To see the effects, we can consider a simple setup of a driver, a transmission line and a receiver. The termination resistance of the receiver is varied to see its effect on the eye for DDR3 and DDR4.

First, with DDR3, as the receiver termination is weakened from 40 Ohms to 60 Ohms to 120 Ohms, the signal is allowed to swing freely toward greater extremes, both high and low. However, the center of the eye for all three settings stays fixed at Vdd/2.

Figure 7 - setup to observe eye center vs. receiver termination

Figure 6 - DDR4 driving high (left) and low (right)




For the DDR4 setup, the receiver ODT is varied from 40 to 60 to 80 Ohms. With the weaker termination, the lows are allowed to go lower, but the high value stays more or less fixed. This causes the center value of the eye to increase with stronger (lower value) termination.
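The closed-form eye centers can be swept over the termination values used in Figures 8 and 9 with a short sketch. The 34-ohm driver resistance and the nominal 1.5V (DDR3) and 1.2V (DDR4) rails are assumed values, and the short-line, equal-drive-strength simplification from the text is kept; this is not the simulation behind the figures.

```python
def ddr3_center(vdd, r_s, r_tt):
    """Eye center for the simplified DDR3 setup: termination r_tt to vdd/2."""
    v_high = (vdd * r_tt + (vdd / 2) * r_s) / (r_s + r_tt)  # superposition
    v_low = (vdd / 2) * r_s / (r_s + r_tt)                  # voltage divider
    return (v_high + v_low) / 2


def ddr4_center(vdd, r_s, r_tt):
    """Eye center for the simplified DDR4 (POD) setup: termination r_tt to vdd."""
    v_high = vdd                         # driver and termination both at vdd
    v_low = vdd * r_s / (r_s + r_tt)     # voltage divider to ground
    return (v_high + v_low) / 2


R_S = 34.0  # hypothetical driver output resistance, ohms

for r_tt in (40.0, 60.0, 120.0):
    print(f"DDR3 ODT {r_tt:5.0f} ohm: center = {ddr3_center(1.5, R_S, r_tt):.3f} V")
for r_tt in (40.0, 60.0, 80.0):
    print(f"DDR4 ODT {r_tt:5.0f} ohm: center = {ddr4_center(1.2, R_S, r_tt):.3f} V")
```

Under these assumptions the DDR3 center stays at 0.750V for every termination, while the DDR4 center falls from about 0.876V at 40 ohms to 0.817V at 60 ohms and 0.779V at 80 ohms, matching the trend described above.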

COMPARISON OF ALGORITHMS FOR CALCULATING VREF

Now, each pin of a given device might have a different reference voltage requirement due to slight variations between the pins and in the layout. However, the device cost, both in silicon and in power, would make a separate reference for each pin prohibitive. So, a common reference voltage needs to be calculated which optimizes the response of all the pins using this reference. We will compare two options for calculating this common reference voltage below.

The first option (“Option 1”) to generate the reference voltage takes the average of all the signals. The second option (“Option 2”) uses only the extreme signals, taking the average of the highest and the lowest.

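As an example, if the optimal reference voltages on the 8 pins of an 8-pin device are 800mV, 750mV, 730mV, 725mV, 720mV, 710mV, 705mV and 700mV, then the device Vref will be as follows:

$$V_{ref,\,\text{Option 1}} = \frac{800 + 750 + 730 + 725 + 720 + 710 + 705 + 700}{8}\ \text{mV} = 730\ \text{mV}$$

$$V_{ref,\,\text{Option 2}} = \frac{800 + 700}{2}\ \text{mV} = 750\ \text{mV}$$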

To analyze the effects of using the two options, let us first consider the average margin loss for the device when

using each of these options.

Consider a pin ‘x’ in a device ‘d’. The receiver eye at the pin might have a voltage center, Vx, which is optimal for

that pin. The device however has a different reference voltage, Vd, which it uses for several pins in that device.

Corresponding to each of these reference voltages will be a set of high and low threshold levels.

Figure 8 - DDR3 eyes with ODT sweep

Figure 9 - DDR4 eyes with ODT sweep




Now, for a given signal (shown in red in Figure 10), the optimal margin from the threshold level would be Mp. However, since the threshold actually used might not be optimal, the actual margin is Md. The margin lost on account of using the device reference instead of the optimal reference of the pin is therefore

$$\Delta M = M_p - M_d = V_d - V_x$$

where the last form applies to the high side: raising the threshold from V_x to V_d reduces the high-side margin by V_d - V_x.

Note that the margin lost can be negative, which implies a margin gain. A margin lost on the high side is matched by an equal margin gained on the low side, and vice versa.

Next, we can compare the average margin loss using two different algorithms to determine Vref.

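Option 1 uses the average of all the signals as the reference voltage:

$$V_d = \frac{1}{N}\sum_{x=1}^{N} V_x$$

where N is the number of pins sharing the reference. Since the high-side margin lost at pin x is V_d - V_x, the average margin loss is therefore

$$\overline{\Delta M} = \frac{1}{N}\sum_{x=1}^{N}\left(V_d - V_x\right) = V_d - \frac{1}{N}\sum_{x=1}^{N} V_x = 0$$

This is intuitively expected. When the reference voltage is obtained by using all the signals, the average margin loss is zero because the margin lost at some pins is offset by the margin gained at other pins in the group.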

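Now, considering option 2, the reference voltage for the device is taken as the average of the highest and lowest voltages:

$$V_d = \frac{\max_x V_x + \min_x V_x}{2}$$

resulting in an average margin loss over the N pins of

$$\overline{\Delta M} = \frac{\max_x V_x + \min_x V_x}{2} - \frac{1}{N}\sum_{x=1}^{N} V_x$$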

Figure 10 - Margin for pin




The average margin loss is, in general, not zero: on one of the two sides (high or low), it will be greater than zero. However, although option 1 may appear to be the better choice, option 2 actually works out better when the extreme signals are considered.
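The average margin loss for both options can be computed directly from the 8-pin example. This is a minimal sketch: it takes the per-pin high-side loss as Vd - Vx, and the function name is hypothetical.

```python
# Per-pin optimal reference voltages from the 8-pin example, in mV.
pins_mv = [800, 750, 730, 725, 720, 710, 705, 700]


def average_margin_loss(vd, pin_optima):
    """Average high-side margin lost when all pins share reference vd.

    Raising a pin's threshold from its optimum vx to vd shrinks its
    high-side margin by (vd - vx) and grows its low-side margin equally.
    """
    return sum(vd - vx for vx in pin_optima) / len(pin_optima)


vd_opt1 = sum(pins_mv) / len(pins_mv)          # Option 1: mean of all pins
vd_opt2 = (max(pins_mv) + min(pins_mv)) / 2    # Option 2: midpoint of extremes

print(vd_opt1, average_margin_loss(vd_opt1, pins_mv))  # 730.0 0.0
print(vd_opt2, average_margin_loss(vd_opt2, pins_mv))  # 750.0 20.0
```

Option 2 therefore gives up 20mV of high-side margin on average for this example; the extreme-signal argument that follows explains why that trade is still worthwhile.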

Let’s take as an example an eyemask requirement of DDR4. The eyemask height requirement is 136mV, or threshold

±68mV.

So, if option 1 were used, then the requirement for the signal would be 730±68, or 662mV on the low side and

798mV on the high side. Similarly, with option 2, the requirement for the signal would be 750±68, or 682mV on the

low side and 818mV on the high side.

Next, we can take a look at the pin which has an optimal center of 800mV, and compare the results with a reference voltage using option 1 vs. option 2. Let us assume that the signal arriving at this pin has a peak-to-peak swing of 260mV, or 800±130mV (930mV on the high side and 670mV on the low side). This eye would pass the mask if an optimal threshold were selected.

Figure 11 - DDR4 eyemask example

Figure 12 - Eye behavior with different Vref calculations




As Figure 12 shows, with option 1 (730mV) as Vref this signal has a large margin on the high side but fails on the low side. With option 2 (750mV) as Vref, it has a smaller margin on the high side but passes on the low side as well.

In general, if the extreme high and low requirements across all pins of the device are given by Vh and Vl, the incoming eye needs an eye height of at least Vh-Vl if a common reference voltage is to be used. In this case, assuming that the high and low requirements from the threshold are equal, the threshold needs to be at (Vh+Vl)/2, which depends only on the two extreme signals and not on the others. A threshold set to any other level might cause some signals to fail even if the eye opening is at least Vh-Vl.

Taking only the extreme signals into account when calculating the reference voltage may reduce the margin of the remaining signals. However, ensuring that the extreme high and low signals pass also ensures that all the other signals pass.
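The pass/fail argument can be checked numerically for the 8-pin example. The sketch treats the mask only as a voltage window at the sampling instant (mask width is ignored) and assumes, as a simplification beyond the text, that every pin sees the same 260mV peak-to-peak eye centered on its own optimum. The helper names are hypothetical.

```python
MASK_HALF_MV = 68     # DDR4 example: 136 mV mask height, i.e. threshold +/- 68 mV
SWING_HALF_MV = 130   # assumed: every pin sees a 260 mV p-p eye around its optimum

# Per-pin optimal reference voltages from the 8-pin example, in mV.
pins_mv = [800, 750, 730, 725, 720, 710, 705, 700]


def passes_mask(vref, pin_center, half_swing=SWING_HALF_MV, half_mask=MASK_HALF_MV):
    """True if the eye at one pin clears the mask placed around the shared vref."""
    high, low = pin_center + half_swing, pin_center - half_swing
    return high >= vref + half_mask and low <= vref - half_mask


def failing_pins(vref, pins):
    """Pins whose eye violates the mask for the given shared reference."""
    return [p for p in pins if not passes_mask(vref, p)]


vref1 = sum(pins_mv) / len(pins_mv)            # Option 1 -> 730 mV
vref2 = (max(pins_mv) + min(pins_mv)) / 2      # Option 2 -> 750 mV
print("Option 1 failures:", failing_pins(vref1, pins_mv))  # [800]
print("Option 2 failures:", failing_pins(vref2, pins_mv))  # []
```

With the 730mV reference the 800mV pin's low level (670mV) sits above the 662mV low threshold and fails, whereas the 750mV reference (682mV and 818mV thresholds) lets every pin pass, as argued above.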

GENERATION OF THE DATA EYE

DDR4 has borrowed the concept of an eye-mask from SerDes technology to validate receiver signals, as seen in

Figure 11. However, unlike SerDes signals, the DDR4 signals don’t have a clock embedded within the data stream.

The data is clocked by an external signal – DQS for data and CLK for the address/command.

So, whether simulating or measuring on an oscilloscope, the eye must not be generated by wrapping the DQ waveform around itself at a fixed bit period. Since the DQ signal is sampled using the explicit DQS, the eye must be formed by sampling around the DQS. This accounts for the irregularities of the DQS signal.

One method to do so is to sample the data signal for a predetermined time window around each strobe crossing, as in Figure 13. If the strobe is early or late, the data appears correspondingly shifted within the window, and parts of the data bit may fall outside the window. This is how the actual device would react, since any shift in the DQS affects the sampling of the signal.

Figure 13 - Eye created by windowing data around strobe




To illustrate this, the following is a simulation of a DQS and a DQ. The two legs of the differential DQS are intentionally mismatched to create a non-ideal strobe at the receiver. The signal is run at 2400Mbps.

If the receiver signal is simply wrapped around at 416.67ps (one UI at 2400Mbps), then the eye has a jitter of about

12ps.

However, if the eye is created by sampling the signal around the strobe, the jitter seen by the data signal increases to 20ps, even discounting the runt signal caused by the initial strobe transient.
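The difference between the two eye-construction methods can be sketched on edge timing alone. This is a simplified illustration, not the simulator's algorithm: it works on already-extracted threshold-crossing times rather than full waveforms, pairs each bit with the strobe crossing that samples it, and uses a made-up strobe error. The function names and the ±3ps error are hypothetical.

```python
UI_PS = 1e6 / 2400  # one unit interval at 2400 Mbps, in ps (about 416.67)


def jitter_fixed_ui(dq_edges, ui):
    """Peak-to-peak edge jitter when the DQ waveform is wrapped at a fixed UI."""
    # Fold every edge time into a window centered on the nominal edge position.
    phases = [((t + ui / 2) % ui) - ui / 2 for t in dq_edges]
    return max(phases) - min(phases)


def jitter_strobe_referenced(dq_edges, strobe_crossings):
    """Peak-to-peak edge jitter when each bit is timed from its own strobe.

    dq_edges[k] is the transition that starts bit k and strobe_crossings[k]
    is the DQS crossing that samples bit k, so strobe timing errors appear
    directly in the edge position seen by the receiver.
    """
    offsets = [e - s for e, s in zip(dq_edges, strobe_crossings)]
    return max(offsets) - min(offsets)


# Synthetic example: ideal DQ edges and a hypothetical center-aligned strobe
# whose crossings alternate by +/-3 ps (e.g. from a mismatched DQS pair).
n_bits = 64
dq = [k * UI_PS for k in range(n_bits)]
dqs = [(k + 0.5) * UI_PS + (3.0 if k % 2 else -3.0) for k in range(n_bits)]

print(f"fixed-UI wrap:     {jitter_fixed_ui(dq, UI_PS):.1f} ps")
print(f"strobe-referenced: {jitter_strobe_referenced(dq, dqs):.1f} ps")
```

Because the DQ edges are ideal here, the fixed-UI fold reports essentially zero jitter, while the strobe-referenced view exposes the 6ps peak-to-peak strobe error. The same mechanism, with real waveforms, is why the strobe-referenced eye in the text shows more jitter.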

POWER AWARE IBIS FOR DDR4 SIMULATION

IO MODELS: SPICE NETLIST OR IBIS 4.2 VS. IBIS 5.0

Wide parallel memory busses can present significant design challenges when it comes to designing a robust power

delivery network (PDN). One critical focus of PDN design is delivery of power to the memory chip output drivers.

The on-chip data (DQ) drivers can require significant amounts of current delivered through sometimes highly

inductive package connections. These simultaneous switching outputs (SSO) can cause significant noise issues that

translate into timing jitter and signal integrity (SI) problems.

Figure 14 - Setup to highlight effect of imperfect strobe

Figure 15 - Data eye with no strobe variation effect (left) vs. Data eye including strobe imperfections (right)




Mitigating SSO issues in a system requires optimizing the design of the PDN of the printed circuit board (PCB),

package and on-die. Detailed circuit models are needed for each piece. Historically, these circuit models are

combined and simulated in SPICE-based simulators to analyze SSO effects. These simulations are computationally

intensive and lead to lengthy simulation times of hours to days. For solution-space and what-if analysis, simulation

times are simply too long.

SPICE-based transistor level models of the on-die drivers are often the most complex part of the system model.

This is especially true for the most accurate models that include layout-based RC parasitic circuit elements. One

effective way to reduce simulation time is to use behavioral buffer models. Behavioral models use simpler

algorithms than SPICE models, enabling faster simulation with often similar levels of accuracy.

The I/O Buffer Information Specification (IBIS) is a behavioral modeling format used industry-wide for SI simulation.

Commonly used versions of the IBIS specification include IBIS 4.2 and IBIS 5.0. Figure 16 shows common

implementations of an IO circuit model using a SPICE netlist, IBIS 4.2 and IBIS 5.0.

IBIS 4.2 and IBIS 5.0 have tables of data that describe circuit characteristics of the final IO buffer.

IBIS 4.2 assumes ideal power connected to the buffer, so SSO noise cannot be taken into account in the simulation. IBIS 5.0 extended the specification's usefulness for Power Integrity (PI) simulation, specifically enabling the simulation of SSO noise. New keywords in IBIS 5.0 specific to PI include [Composite Current], [ISSO PU] and [ISSO PD].

Figure 16 - Setup comparison between SPICE, IBIS 4.2 and IBIS 5.0




[Composite Current] data are I-T tables that describe the shape of the rising and falling edge current waveforms

from the power reference terminal of the buffer (VDE). This switching current includes contributions from the

on-die decoupling circuit, crow-bar current, any termination current, signal driver current and pre-driver current.

The final driver current alone can be derived accurately by simulating IBIS 4.2 models, but without details of the pre-driver contribution this can significantly underestimate the total driver current.

[ISSO PU] and [ISSO PD] data are tables describing the effective current of the pullup and pulldown driver

transistors as a function of the voltage on the pullup and pulldown supply reference nodes. The PI problem being

modeled is known as gate modulation and is caused by drooping power supply voltages on-die as the die PDN

attempts to pull current instantaneously through the inductive package PDN.

In addition to the [Composite Current], [ISSO PU] and [ISSO PD] data tables in the IBIS file, it is necessary to include

the characteristics of the on-die power supply decoupling structure. Due to limitations in the IBIS specification, a

model of the decoupling’s electrical behavior must be included in SSO simulations external to the IBIS buffer

model, connected across the power and ground reference terminals.

TRADE-OFFS BETWEEN SPICE NETLIST, IBIS 4.2 AND IBIS 5.0

Table 1 compares simulation time and accuracy between the models. Simulation with IBIS models (both 4.2 and 5.0) is about ten times faster than with the SPICE netlist. The simulation time of SPICE netlists tends to increase with faster data rates.

                          SPICE NETLIST   IBIS 4.2   IBIS 5.0
Simulation time           Longer          Shorter    Shorter
SI simulation accuracy    High            High       High
PI simulation accuracy    High            Low        High

For improved SI, DDR3 used ZQ (Zero Quotient) calibration and ODT (On Die Termination). In addition to those,

DDR4 has Vref training functionality in the IO circuit, which can make SPICE netlists much larger. For example, in DDR4 the number of elements (including both MOSFETs and parasitic RC) per SPICE netlist balloons to several tens of thousands. In order to simulate SSO noise, it is necessary to model the full data channel, so the number of elements can reach several hundreds of thousands. With this many elements, simulation can take days, which makes DDR4 simulation with SPICE netlists of the IO impractical. Since IBIS models contain only data tables modeling the output circuits, simulation time is significantly shortened.

IBIS 4.2 and IBIS 5.0 both provide accurate simulation results under ideal power conditions. When SSO noise is imposed, IBIS

4.2 has accuracy issues, but IBIS 5.0 gives a good match to SPICE netlist simulation. As seen in Figure 17, there is a

trade-off between simulation time and accuracy when SPICE netlists and IBIS 4.2 models are the choices. However,

IBIS 5.0 balances both performance and accuracy well.

Table 1 - Modeling trade-offs




OLD BUT NEW ISSUE OF IBIS MODEL OVERCLOCKING

IBIS models assume a maximum working frequency up to which they provide good accuracy. If a buffer is switched faster than that frequency, accuracy is sacrificed. This phenomenon is called “Over Clocking.” The maximum working frequency depends on the waveform described in the V-T tables. These waveforms can be broken down into three sections: initial delay area, active area, and inactive area, as seen in Figure 18. To get an accurate result, the following condition needs to be satisfied:

$$\text{Length of a half cycle} \ge \text{Length of Initial Delay Area} + \text{Length of Active Area}$$

It is common in simulation software to remove the initial delay area in an IBIS 4.2 model, which helps avoid the Over Clocking problem. However, this technique cannot be used for IBIS 5.0 models: because [Composite Current] includes the current of the pre-driver, current flows during the initial delay area, so that area cannot be truncated (see Figure 19).

To avoid Over Clocking problems in IBIS 5.0 simulation, simulators are required to deal with waveforms that have long initial delay areas. The simulator must support “Length of a half cycle” = “Length of Active Area.” Figure 20 illustrates how each area of the V-T and I-T waveforms needs to be handled to generate correct voltage and current waveforms that do not show overclocking artifacts.
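The region bookkeeping can be made concrete with a small sketch. Both the tolerance-based rule for locating the initial delay and active areas and the inequality used in is_overclocked (half cycle at least as long as initial delay plus active area, or just the active area once the delay is truncated) are assumptions consistent with the description above, not a published algorithm. The V-T table values and the function names are invented for illustration.

```python
def waveform_regions(times, volts, tol):
    """Split a V-T table into (initial delay, active, inactive) durations.

    Hypothetical rule: the initial delay lasts while the waveform stays within
    tol of its starting value; the active area lasts until it settles within
    tol of its final value; the remainder is the inactive area.
    """
    v_start, v_final = volts[0], volts[-1]
    first_move = next(i for i, v in enumerate(volts) if abs(v - v_start) > tol)
    last_move = max(i for i, v in enumerate(volts) if abs(v - v_final) > tol)
    t_delay_end = times[first_move - 1]   # last sample still at the start value
    t_settled = times[last_move + 1]      # first sample settled at the final value
    return (t_delay_end - times[0],
            t_settled - t_delay_end,
            times[-1] - t_settled)


def is_overclocked(half_cycle, t_delay, t_active, delay_truncated=False):
    """Hypothetical criterion: the half cycle must cover the active area, plus
    the initial delay unless the simulator has removed it."""
    needed = t_active if delay_truncated else t_delay + t_active
    return half_cycle < needed


# Invented rising-edge V-T table (time in ps, voltage in V).
times = [0, 100, 200, 300, 350, 400, 450, 500, 700]
volts = [0.0, 0.0, 0.0, 0.0, 0.3, 0.8, 1.1, 1.2, 1.2]
delay, active, inactive = waveform_regions(times, volts, tol=0.01)
half_cycle = 1e6 / 2400   # one bit time at 2400 Mbps, in ps

print(delay, active, inactive)                                          # 300 200 200
print(is_overclocked(half_cycle, delay, active))                        # True
print(is_overclocked(half_cycle, delay, active, delay_truncated=True))  # False
```

Truncating the delay (the IBIS 4.2 shortcut) makes this invented waveform fit within one 2400Mbps bit time. For an IBIS 5.0 buffer, the pre-driver current in [Composite Current] rules that shortcut out, which is why the simulator itself must handle a half cycle equal to the active area.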

Figure 17 - Voltage noise comparison between SPICE netlist, IBIS 4.2 and IBIS 5.0

Figure 18 - Overclocking waveform regions




Figure 19 - Clipping output waveforms with composite current

Figure 20 - Composite Current calculation




Figure 21 shows a comparison between two simulation engines. The blue line in the figure shows the waveform

generated by a traditional simulator. The correct waveform result in red is generated by a simulator that employed

the improved simulation technique.

CHALLENGES TO DDR4 SSO NOISE SIMULATION

Figure 22 shows the simulation setup for a DDR4 memory interface design (also see simulation schematic in Figure

23). The controller is in an FCBGA (Flip Chip Ball Grid Array) package and two 2400Mbps speed-grade DDR4

SDRAMs are mounted on a 6-layer PCB. Various simulations were run using either SPICE netlists or IBIS 5.0 models

for the IO circuits. For the SPICE netlist simulation, the controller IO signal and power circuits are modeled in the

SPICE netlist, the packages and PCB are modeled with S-parameters, and the SDRAMs are modeled with IBIS 5.0.

For the IBIS 5.0 simulation, the controller IO signal circuit is modeled with an IBIS 5.0 buffer model, the IO power

circuit is modeled as an RC equivalent circuit, and the package, PCB, and SDRAMs are modeled the same as in the

SPICE netlist simulation.

Figure 21 - Waveform incorporating Overclocking

Figure 22 - Simulation Setup




First, a comparison was done between the SPICE netlist model and IBIS 5.0 model simulations that do not have SSO

noise or crosstalk noise. The DQS signal and one DQ bit were stimulated at 2400Mbps in Write mode. The

measurement was done at the die pad of the SDRAM with the DQS as the trigger. Both simulations matched well as

seen in Figure 24. The eye widths, referred to as TdiVW, were within 10ps of each other. For this simulation, the IBIS 5.0

model provides enough accuracy.

Figure 23 - Simulation Schematic

Figure 24 - SPICE and IBIS 5.0 comparison without SSO effects




Next, SSO noise was examined. The DQS signals and 32 DQ bits were operated at 2400Mbps in Write mode, and

the SDRAM die pad and VDE voltage at the controller were measured. The upper waveform in Figure 25 shows the

VDE waveform, and the lower waveform is a DQ signal’s waveform at the SDRAM die pad. Due to the 32 bits of DQ

signals switching, VDE voltage at the controller is fluctuating, which is the SSO noise. The SPICE netlist model (blue

line) and IBIS 5.0 model (red line) match almost perfectly, confirming that SSO noise was accurately simulated using the IBIS 5.0 model.

Next, a comparison was done where SSO noise and crosstalk noise were imposed. The DQS signals and one DQ

signal (victim) were operated at 2400Mbps in Write mode with the other 31 bits (aggressors) operated both in

phase and out-of-phase with the DQ victim. The measurement was done at the SDRAM die pad with the DQS as

the trigger. Results are shown in Figure 26.

Figure 25 - Power affected by DQ signal

Figure 26 - Eyes incorporating SSO Noise



TECH12690-wMF 2-15

©2015 Mentor Graphics Corporation, all rights reserved. This document contains information that is proprietary to Mentor Graphics Corporation and may be duplicated in whole or in part by the original recipient for internal business purposes only, provided that this entire notice appears in all copies. In accepting this document, the recipient agrees to make every reasonable effort to prevent unauthorized use of this information. On March 1st 2015, the system LSI businesses of Fujitsu Limited and Panasonic Corporation were consolidated and transferred to Socionext Inc. The contents of this whitepaper, which contain the company name “Fujitsu Semiconductor”, are still valid by replacing the name with “Socionext”. All trademarks mentioned in this document are the trademarks of their respective owners.

For the latest product information, call us or visit: www.mentor.com


The eye widths in Figure 26 became generally smaller than the widths in Figure 24 due to the SSO noise.

Comparing results between the SPICE netlist model simulation and IBIS 5.0 model simulation shows that IBIS 5.0

eye width is larger (300ps versus 278ps). The IBIS 5.0 model simulation underestimated the SSO noise influence by

22ps (8%). This underestimation was caused by ignoring the delay fluctuations in the pre-driver circuitry. IBIS 5.0 models ignore the effects of voltage changes on pre-driver circuitry. Increasing voltage on-die will make transistors in the pre-driver circuits switch faster; the opposite effect is seen with decreasing voltage. These voltage changes can lead to mismatches in timing between pre-driver pullup and pulldown signal paths as well as overall increased or decreased delay of the driver switching.

Finally, simulation times were compared. One cycle of PRBS7 stimulus for DDR4-2400Mbps is 60ns. It took 221

hours (9.2 days) to simulate the schematic shown in Figure 23 with the SPICE netlist model. The simulation of the

IBIS 5.0 model was completed in 3 hours, which is a 98.6% reduction from the SPICE netlist model. IBIS 5.0 is useful

for large scale simulation, which is required for chip-package-PCB level co-design.

Note: The performance results are based on simulations in which no attempt was made to ensure equivalent simulation conditions such as time step, hardware, etc.
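The stimulus and run-time figures above can be reproduced with a short sketch. The LFSR below is one common way to generate PRBS7; the paper does not say which PRBS7 implementation or seed its simulator used, so the function name, seed and tap form are assumptions. The run-time check uses only the 221-hour and 3-hour figures from the text.

```python
def prbs7(n_bits, seed=0x7F):
    """Return n_bits of a PRBS7 pattern from a 7-bit Fibonacci LFSR.

    Taps on bits 6 and 5 (the form commonly written as x^7 + x^6 + 1) give a
    maximal-length sequence that repeats every 2**7 - 1 = 127 bits for any
    non-zero seed.
    """
    if seed & 0x7F == 0:
        raise ValueError("seed must be non-zero")
    state = seed & 0x7F
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


pattern = prbs7(127)
print(len(pattern), sum(pattern))   # 127 64  (one period contains 64 ones)

# Run-time arithmetic quoted in the text.
spice_hours, ibis_hours = 221, 3
print(f"{spice_hours / 24:.1f} days")                             # 9.2 days
print(f"{100 * (1 - ibis_hours / spice_hours):.1f}% reduction")   # 98.6% reduction
```

The second part simply confirms the quoted conversions: 221 hours is about 9.2 days, and 3 hours is a 98.6% reduction from 221 hours.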

CONCLUSION

A successful DDR4 board design can be accomplished using the analysis techniques described in this paper. EDA

software updated to support DDR4 simulation can help the designer properly use DBI, calculate the proper Vref

level for analysis, apply the DDR4 receiver mask for timing verification and generate data eyes with correct jitter

contributions. Using IBIS 5.0 power aware models can significantly speed up simulation time while allowing for

reasonably accurate simulation of SSO jitter effects.

Figure 27 - Execution time comparison
