TECHMAN Electronics
1
Confidential
TECHMAN XC100 NVMe SSD
Technical White Paper
v1.0
April 2016
Techman reserves the right to change products, information and specifications
without notice.
Information in this document is provided in connection with Techman products.
No license, express or implied, by estoppel or otherwise, to any intellectual property
rights is granted by this document. Except as provided in Techman's terms and
conditions of sale for such products, Techman assumes no liability whatsoever and
Techman disclaims any express or implied warranty, relating to sale and/or use of
Techman products including liability or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent, copyright or other
intellectual property right. Unless otherwise agreed in writing by Techman, the
Techman products are not designed nor intended for any application in which the
failure of the Techman product could create a situation where personal injury or
death may occur.
All brand names, trademarks and registered trademarks belong to their
respective owners.
Revision History
Version 1.0
Date Apr, 2016
Author Ilong.Hsiao, Ted.Hsieh
Approver Ilong.Hsiao
Amendment Robert.Hsiao
Contents
Overview
Part 1: High Performance Hardware
1-1: Multi-core Computing
1-2: Multi-channel Flash Controller
1-3: Multi-queue Engines
1-4: Embedded XOR & Randomizer
1-5: Strong BCH ECC
Part 2: Advanced NAND Flash Management
2-1: Bad Block Management
2-2: Read Disturb Policy
2-3: Data Retention Policy
2-4: Smart Read Retry Policy
Part 3: Data Integrity Guarantee
3-1: End-to-end Data Protection
3-2: Adaptive RAID Data Protection
3-3: Thermal Throttling Protection
3-4: Power Loss Protection
3-5: Firmware & Metadata Protection
Part 4: Intelligent Firmware Management
4-1: High Performance FTL
4-2: Global Wear Leveling
4-3: Efficient Garbage Collection
4-4: Fast Power-on Rebuild
4-5: TRIM support
4-6: Intelligent Write Flow Control
4-7: Intelligent Read Sequence Control
Part 5: Dual Port for High Availability
OVERVIEW
The digital universe is exploding. The data in the whole world is expected to
reach 17 zettabytes in 2017 and 44 zettabytes in 2020, driven by the emergence of the IoT.
Some 90% of the data on Earth was generated within the last 2 years. According to IDC,
every 60 seconds there will be: 72 hours of video uploaded to YouTube, 350 GB of
data generated on Facebook, 571 new websites created, 277,000 tweets on Twitter,
100 million emails sent, and over 2 million Google search queries.
Whether it is a YouTube broadcaster streaming a live game or a seismology
professor analyzing earthquake data, both require fast and stable processing, i.e.,
consistent, low-latency IO. To fulfill such requirements, the server
systems in charge of these workloads must enhance the capability of both their computing
cores and their storage devices. The traditional HDD is becoming a performance
bottleneck due to its extremely high latency. A PCIe SSD, whose latency is
second only to DRAM, is the best option to avoid such an IO bottleneck.
To be ready for the era of High Speed Computing, covering Cloud services, Big Data
Analysis, Online Transaction Processing, and High-Frequency Financial Trading, storage
devices must evolve in step with computing processors to avoid
becoming an obstacle to overall performance. As a result, Techman SSD has
decided to focus on Enterprise-grade storage design and development. Based on PCI
Express Gen3 x4, Techman XC100 further builds on the NVM Express (NVMe) protocol,
supporting a much higher volume of command queues. With our optimized designs,
Techman XC100 guarantees high-speed processing with very stable response times
over long operation periods.
The following chapters describe the technologies Techman XC100 has designed and
delivered to achieve the highest performance with great consistency. Please
enjoy.
PART 1: HIGH PERFORMANCE HARDWARE
Multi-core Computing
The evolution of the SSD controller closely mirrors that of the whole
computer industry: the more processors an SSD supports, the higher the performance it
delivers. To keep up with the system's increasing performance, an SSD must
increase its processor/core count accordingly.
The controller adopted by XC100 supports a 16-core architecture. With these 16
cores, the commands/threads from the host system can be processed in parallel at
high speed. Among the 16 cores, certain cores have their own dedicated
managers to handle specific functions, e.g., the Boot Processor Core with its ROM
manager. All threads communicate via Inter-Process Communication (IPC), which
transfers at high speed and allows information sharing.
Furthermore, the SRAM and DRAM inside XC100 are shared among all 16
cores and threads, freeing up cache memory and CPU resources.
Together, the 16-core architecture, IPC, dedicated manager functions, and
the shared cache design allow multiple requests and commands from the host system
to be handled quickly and efficiently, meeting the high-performance requirement.
Multi-channel Flash Controller
For an SSD, the more flash channels it controls, the higher the performance it delivers.
XC100 supports up to 16 channels of NAND Flash control.
The XC100 controller utilizes all 16 channels simultaneously during
operation. All the Read/Program/Erase commands and data from the host system will
be coordinated and distributed evenly, through XC100's Flash Interface, across all
16 channels. With multi-channel coordination and distribution, Quality of
Service (QoS) is guaranteed.
Multi-queue Engines
Today, deploying multiple high-performance processors is a basic requirement
not only for a server system but also for a personal computer. Thanks to the
development of NVM Express (NVMe), the interface protocol can now carry
far more command queues from host to storage device than before.
Without careful design, the storage device itself risks becoming the bottleneck
of overall performance.
To absorb this rapid flood of IOs from the host, a storage device must be capable
of handling them at maximum speed. XC100 has already adopted a 16-core
controller and supports the NVMe protocol. Between the controller and the protocol,
XC100 further adds Multi-queue Engines to process these high-speed,
frequent IOs.
The Multi-queue Engines of XC100 include 1 Admin Queue, 128 Submission
Queues, and 128 Completion Queues. Each queue supports up to 1024 entries
(queue depth 1024).
IOs from the multiple cores of the host system are first latched in the Submission
Queue Engine and then distributed to XC100's multi-core controller for
processing. Once a command completes, the controller submits a completion queue entry to
the Completion Queue Engine and rings a notification interrupt (doorbell) to the host.
Finally, the host processes the completion queue entry in the Completion Queue Engine and
releases its resources. Simply put, the frequent IOs issued by the host's multiple cores,
passing through the NVMe protocol, are handled rapidly and efficiently by the
Multi-queue Engines of XC100.
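The submission/completion flow above can be sketched as a toy model. All names, the queue depth, and the status codes here are illustrative only, not XC100 internals:

```python
from collections import deque

class QueuePair:
    """Toy model of one NVMe-style submission/completion queue pair."""
    def __init__(self, depth=1024):
        self.depth = depth
        self.sq = deque()          # submission queue entries
        self.cq = deque()          # completion queue entries

    def submit(self, command):
        """Host side: latch a command; the doorbell ring is implicit here."""
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(command)

    def process_one(self):
        """Device side: pop a command, execute it, post a completion."""
        cmd = self.sq.popleft()
        self.cq.append({"cid": cmd["cid"], "status": 0})  # 0 = success

    def reap(self):
        """Host side: consume completion entries, freeing their slots."""
        done = list(self.cq)
        self.cq.clear()
        return done
```

With 128 such pairs and a multi-core controller draining them in parallel, many host cores can issue IOs without contending on a single queue.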
Embedded XOR & Randomizer
One interesting characteristic of NAND is that repeatedly storing identical
data patterns into the flash degrades data integrity and accuracy.
To avoid this symptom, a randomizer scheme helps:
a well-randomized data pattern stored in NAND Flash reduces
data errors during read-back. In addition, XC100 provides an XOR Calculator and an XOR
Engine for flash-aware RAID functionality, offering extra protection to increase
data integrity. The XOR Calculator computes the parity information for each Flash
RAID stripe, and the XOR Engine delivers high-performance Flash RAID rebuild
operations.
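The XOR parity idea is simple enough to show directly. This is a minimal sketch of stripe parity and rebuild; the real XOR Engine does this in hardware at line rate:

```python
def xor_parity(pages):
    """Compute byte-wise XOR parity over a RAID stripe of equal-size pages."""
    parity = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            parity[i] ^= b
    return bytes(parity)

def rebuild(surviving_pages, parity):
    """Recover one lost page by XOR-ing the parity with the survivors."""
    return xor_parity(list(surviving_pages) + [parity])
```

Because XOR is its own inverse, any single lost page in the stripe can be reconstructed from the remaining pages plus the parity page.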
Strong BCH ECC
For error correction at the bit/byte level, XC100 adopts
the Bose-Chaudhuri-Hocquenghem (BCH) ECC scheme. This function supports
error correction of up to 100 bits within 4320 bytes of data.
With such capability, XC100 can easily fulfill: (1) the 40-bit/1000-byte ECC
requirement of the TOSHIBA 15nm MLC adopted in XC100; (2) the UBER ≤ 10⁻¹⁶
requirement in the JEDEC Enterprise SSD specification.
PART 2: ADVANCED NAND FLASH MANAGEMENT
Bad Block Management
There are always some unhealthy cells in NAND flash memory, whether present
from birth or developed through use. Blocks containing such cells are called
"Bad Blocks" and are no longer suitable for storing data. An SSD must therefore
continuously monitor and record the health of all blocks, from the beginning of
its life to its end.
There are 2 types of Bad Blocks: Original Bad Blocks (OBB) and Growth Bad Blocks
(GBB). OBB are those that already exist after the SSD manufacturing process, while
GBB are those generated during SSD runtime operation.
XC100 builds in processes and functions to manage Bad Blocks well. During SSD
manufacturing, the BURN-IN process of XC100 locates Bad Blocks by scanning
all cells in the NAND. Together with those flagged during the NAND vendor's
wafer and package process, all these OBB are marked before the SSD ships, so
customers never use them.
Once XC100 begins runtime operation in the field, it activates a
real-time monitor function to mark and record blocks that encounter: (1) Block Erase
failures; (2) Page Program failures. These 2 types of blocks are categorized as
GBB.
Via such Bad Block Management, XC100 marks and records all possible Bad
Blocks to assure the health of the whole SSD throughout its life span.
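The OBB/GBB bookkeeping described above amounts to two sets and a failure hook. This is a minimal sketch (the class and method names are illustrative, not XC100 firmware symbols):

```python
class BadBlockManager:
    """Track Original Bad Blocks (burn-in) and Growth Bad Blocks (runtime)."""
    def __init__(self, obb):
        self.obb = set(obb)        # found by BURN-IN scan / vendor marking
        self.gbb = set()           # accumulated during runtime operation

    def is_usable(self, block):
        """A block is usable only if it is in neither bad-block set."""
        return block not in self.obb and block not in self.gbb

    def report_failure(self, block, kind):
        """Mark a block as GBB after an erase or program failure."""
        if kind in ("erase", "program"):
            self.gbb.add(block)
```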
Read Disturb Policy
One of the most interesting characteristics of NAND flash memory is the Read
Disturb phenomenon. The electrons of cells adjacent to the cell BEING READ will be
influenced, resulting in data loss in the adjacent cells. This is the so-called "Read
Disturb". For example, when reading cell B, the NAND circuitry also applies 5V to its
adjacent cells A and C. After reading cell B 10,000 times or more, the data in
cell A or cell C might no longer be readable.
To avoid this phenomenon, XC100 will: (1) monitor and record the read count
of each block; (2) detect the error bits of the block being read; (3) refresh the block
with the Garbage Collection (GC) function based on the information from (1) and (2).
With these operations, XC100 keeps the Read Disturb phenomenon at bay.
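Steps (1)-(3) above can be sketched as a simple monitor. The 10,000-read limit comes from the example in the text; the error-bit limit of 50 is an illustrative assumption, not an XC100 parameter:

```python
READ_DISTURB_LIMIT = 10_000   # from the cell-B example above
ERROR_BIT_LIMIT = 50          # illustrative refresh threshold

class ReadDisturbMonitor:
    def __init__(self):
        self.read_counts = {}

    def on_read(self, block, error_bits):
        """Count the read; trigger a GC refresh when limits are exceeded."""
        self.read_counts[block] = self.read_counts.get(block, 0) + 1
        if (self.read_counts[block] >= READ_DISTURB_LIMIT
                or error_bits > ERROR_BIT_LIMIT):
            self.refresh(block)
            return True
        return False

    def refresh(self, block):
        """Data is rewritten to a fresh block via GC; counter restarts."""
        self.read_counts[block] = 0
```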
Data Retention Policy
The retention phenomenon is another interesting characteristic of NAND flash.
Under some conditions, e.g., a high-temperature environment or long power-off,
the data inside the NAND may disappear after a period of time.
The root cause is charge leakage from the floating gate after every page program.
The number of Program/Erase cycles (P/E cycles) also influences the retention time.
Even when certain cells have reached their P/E cycle limit (i.e., end of life, as
specified in the NAND datasheet), the data retention time of the SSD must still
fulfill the requirement defined by JEDEC: for client-grade MLC, 1 year at 30℃;
for enterprise-grade MLC, 3 months at 40℃.
So how does an SSD make sure its data retention fulfills the JEDEC requirements?
In XC100, we: (1) monitor the retention period of each block; (2) detect the bit error
rate of the block being read; (3) refresh the block with the GC function based on the
information from (1) and (2). Via these operations, XC100 assures that data retention
meets the JEDEC specification.
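The retention check reduces to an age-or-errors decision per block. A minimal sketch, where the 90-day limit reflects the enterprise-MLC JEDEC figure quoted above and the error-bit limit is an illustrative assumption:

```python
RETENTION_LIMIT_DAYS = 90     # enterprise-grade MLC: 3 months at 40 C

def needs_retention_refresh(programmed_day, today, error_bits,
                            error_limit=50):
    """Decide whether a block must be refreshed (rewritten via GC)
    before its data fades: either it has sat too long since being
    programmed, or its read-back error bits are climbing."""
    age_days = today - programmed_day
    return age_days >= RETENTION_LIMIT_DAYS or error_bits > error_limit
```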
Smart Read Retry Policy
NAND flash cell quality degrades with all operations, e.g., P/E cycling,
Read/Write disturbance, retention, and temperature. The cell voltage distribution
also shifts, which means the read threshold voltage (Vth) may require adjustment
to determine whether a cell holds 0 or 1. Under such circumstances, the read process
may require several retries to complete.
However, whenever a Read Retry occurs, it impacts the SSD's performance, so
minimizing the retry impact is an important task for SSD designers. XC100 supports
a Smart Read Retry scheme to protect data integrity. Our scheme includes:
(a) Apply a fast, adjustable Vth setting to read back data even with error bits.
(b) Apply the previous Vth as the initial (near-optimal) value to reduce retry time and latency.
(c) Refresh the data block once error bits exceed preset limits.
(d) Use the flash-level RAID function as the last resort for data recovery.
Via this scheme, Read Retry will not influence the overall performance of
XC100.
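Points (a), (b), and (d) above can be sketched as a retry loop. The function and parameter names are illustrative; `read_fn` stands in for a raw NAND read at a given Vth that returns `None` when ECC fails:

```python
def read_with_retry(read_fn, vth_candidates, last_good_vth=None):
    """Try the last successful Vth first (point b), then sweep the
    remaining candidates (point a). Returns (data, vth) on success;
    raises when all thresholds fail, so the caller can fall back to
    flash-level RAID recovery (point d)."""
    order = ([last_good_vth] if last_good_vth is not None else []) + \
            [v for v in vth_candidates if v != last_good_vth]
    for vth in order:
        data = read_fn(vth)
        if data is not None:          # ECC succeeded at this threshold
            return data, vth
    raise IOError("uncorrectable: fall back to flash-level RAID rebuild")
```

Caching `last_good_vth` per block means the common case completes on the first attempt, which is why retries need not hurt steady-state latency.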
PART 3: DATA INTEGRITY GUARANTEE
End-to-End Data Protection
Data integrity is extremely important for both service providers and service users.
To protect the data in the storage device, XC100 supports End-to-End Data Protection to
maintain data accuracy and integrity. The End-to-End Data Protection
adopted by XC100 includes:
(a) Protections similar to the T10-DIF/DIX specifications
(b) XTS-AES-256 data encryption
(c) XOR data protection on DRAM; the DRAM bus carries 64 bits of data and 8 bits
of ECC
(d) BCH ECC with 4176 bytes of data and 200 bytes of parity
(e) Flash-based RAID protection
With such End-to-End Data Protection, the data on every path within the SSD is
integrity-guaranteed.
Adaptive RAID Protection
End-to-End Data Protection covers bit/byte-level data protection. For
page/block-level protection, XC100 adopts Adaptive RAID Protection.
This protection is similar to RAID-5 at the device level, except that XC100
operates it across all flash channels. The concept is to store the parity information
in 1 randomly selected page among the n+1 pages of each stripe. Since the parity page
is distributed, the same protection as a RAID-5 scheme is achieved.
Furthermore, the RAID stripe size is not fixed. XC100 dynamically shrinks the
stripe once a Bad Block symptom occurs. However, if the number of Bad Blocks
in a stripe exceeds 8, XC100 marks the corresponding stripe as bad and activates the
refresh function accordingly.
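The two ideas above, random parity placement and adaptive stripe sizing, can be sketched as follows. The function names and the deterministic `slot` parameter are illustrative; only the bad-block limit of 8 comes from the text:

```python
import random

MAX_BAD_BLOCKS_PER_STRIPE = 8   # beyond this, the stripe is marked bad

def place_parity(data_pages, parity, slot=None):
    """Insert the parity page at a randomly selected position among the
    n+1 pages of the stripe (deterministic when `slot` is given)."""
    if slot is None:
        slot = random.randrange(len(data_pages) + 1)
    stripe = list(data_pages)
    stripe.insert(slot, parity)
    return stripe, slot

def adjust_stripe(stripe_size, bad_blocks_in_stripe):
    """Shrink the stripe as bad blocks appear; return None when the
    stripe must be abandoned and its data refreshed elsewhere."""
    if bad_blocks_in_stripe > MAX_BAD_BLOCKS_PER_STRIPE:
        return None
    return stripe_size - bad_blocks_in_stripe
```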
Thermal Throttle Protection
All electronic devices generate heat, and an SSD is no exception. Under high
performance demand, the temperature of a PCIe/NVMe SSD operating at full speed
will ramp up very rapidly. Thanks to air-flow design, this seldom causes trouble
in a well-ventilated system; however, a good designer must hope for
the best and prepare for the worst. Therefore, XC100 adopts Thermal Throttle
Protection to prevent any possible thermal damage to the SSD device.
There are 3 preset temperature thresholds in the XC100 design. When the embedded
thermal sensor reading exceeds one of these thresholds, XC100 throttles the data
transfer rate to the corresponding level to reduce heat generation. Once the
internal temperature drops back below the threshold, the limitation is lifted.
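The three-threshold scheme maps a temperature reading to an allowed transfer rate. A minimal sketch; the threshold temperatures and rate factors below are invented for illustration, since the paper does not publish XC100's actual values:

```python
# (threshold_celsius, fraction_of_full_rate), hottest first.
# Illustrative values only; not XC100 specifications.
THRESHOLDS = [(85, 0.25), (78, 0.50), (70, 0.75)]

def throttle_factor(temp_c):
    """Return the fraction of the full transfer rate allowed at temp_c.
    Below all thresholds, the limitation is lifted entirely."""
    for limit, factor in THRESHOLDS:
        if temp_c >= limit:
            return factor
    return 1.0
```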
Power Loss Protection
Data integrity in an Enterprise system is critically important even when
encountering an unexpected power shutdown. A system or a storage device must
guarantee data integrity by all means.
XC100 provides a Power Loss Protection (PLP) function to avoid data loss
when an ungraceful power shutdown occurs. With PLP, XC100 can operate normally
for a limited period of time without its original power source. The concept is
depicted in the figure below.
(1) In normal mode (green path), XC100 operates on the normal power source while the
PLP capacitors are fully charged as a backup power source.
(2) In abnormal mode (red dotted path), XC100 opens the switch (SW), activating
the previously charged PLP capacitors as the backup power source to keep
XC100 operating normally for a short while.
The backup power must last long enough for the SSD to flush all important data
into NAND flash. Thus, the PLP function must be well designed and optimized with all
other SSD functions, such as the FTL, WL, and GC, to prevent data loss.
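How long the capacitors can hold the drive up is a simple energy calculation. This is a back-of-envelope sizing sketch with wholly illustrative numbers, not XC100's actual capacitance, voltages, or power draw:

```python
def holdup_time_s(capacitance_f, v_start, v_min, power_w):
    """Usable hold-up time of a PLP capacitor bank: the energy stored
    between the charged voltage and the minimum regulator input voltage
    (E = 1/2 * C * (V1^2 - V2^2)), divided by the flush power draw."""
    energy_j = 0.5 * capacitance_f * (v_start**2 - v_min**2)
    return energy_j / power_w
```

For example, a hypothetical 10 mF bank charged to 35 V with a 10 V regulator floor and an 8 W flush load yields roughly 0.7 s, which must exceed the worst-case time to flush cached data and metadata to NAND.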
Metadata & Firmware Protection
Metadata mainly includes (I) FTL table info, (II) Wear Leveling info, (III)
Write/Read/Erase counts of every block, (IV) bad and free block info, and (V)
firmware info. In other words, beyond user data, metadata contains a great deal of
extremely important information. To protect metadata and firmware, XC100 adopts
2 schemes: (1) pseudo-SLC mode and (2) multi-copy backup.
(1) Pseudo-SLC (pSLC) is a transformation from MLC to SLC. SLC NAND has much
better endurance (P/E cycles ≈ 60,000) and faster processing times
than MLC NAND (P/E cycles ≈ 3,000). By configuring some MLC blocks into pSLC
mode, XC100 extends the P/E cycles of these pSLC blocks to 30,000. The
metadata and firmware in pSLC blocks are therefore far better protected, and also much
faster to access.
(2) The multi-copy concept is depicted in the figure below. By distributing metadata
across different LUNs and different blocks, XC100 can still operate normally even if
errors occur in some copies of the metadata.
As for firmware protection, since the NVMe 1.1 protocol defines the SLOT
concept for storing firmware images, XC100 adopts this multi-slot scheme to store
its firmware images. Up to 3 versions of the firmware image can be stored in XC100,
and each version has another 3 backup copies (figure below).
Furthermore, all these firmware images are distributed across LUNs, similar to
the metadata protection concept. Via such protection, the firmware stays intact even
through an ungraceful power-off.
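The multi-copy read path reduces to "return the first copy that verifies". A minimal sketch using a toy checksum purely for illustration (a real implementation would use CRC or ECC):

```python
def read_metadata(copies):
    """Multi-copy backup: try each (payload, checksum) copy in turn and
    return the first one that verifies. Because copies live in different
    LUNs/blocks, one corrupted copy is not fatal."""
    for payload, checksum in copies:
        if sum(payload) % 256 == checksum:   # toy checksum, not the real one
            return payload
    raise IOError("all metadata copies corrupted")
```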
PART 4: INTELLIGENT FIRMWARE MANAGEMENT
High Performance FTL
During some operations, such as Garbage Collection, the SSD moves valid user
data from a block about to be erased to another location without notifying the user.
This means the Physical Block Address (PBA) of the valid data changes while its
Logical Block Address (LBA) stays the same. Such operations require the Flash
Translation Layer (FTL), which monitors and records the mapping between LBA and
PBA. Moreover, due to the frequent IO commands from host to SSD, the FTL is
updated rapidly and frequently, so FTL performance heavily influences overall
SSD performance.
XC100 has designed and supports an optimized, high-speed Flash Translation
Layer (FTL) scheme. The scheme includes:
(1) High-speed direct mapping between LBA and PBA;
(2) 4 KB-based mapping granularity, the most common IO size across operating systems;
(3) Keeping the FTL in SRAM/DRAM for fast, frequent update operations;
(4) Optimization with WL and GC for better endurance and lower latency;
(5) A periodically-saved Snapshot algorithm balancing system performance and
faster rebuild time.
Via these intelligent designs and detailed verification, XC100's data mapping
function, the FTL, operates not only at high speed but also with consistency.
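At its core, a page-level FTL is a logical-to-physical map whose updates invalidate old physical pages. A minimal sketch of that mapping (class and method names are illustrative):

```python
class FTL:
    """4 KB-granularity logical-to-physical mapping, kept in DRAM."""
    def __init__(self):
        self.l2p = {}                  # LBA -> PBA

    def write(self, lba, new_pba):
        """Point the LBA at its new physical page; return the old PBA,
        which is now invalid and awaits Garbage Collection."""
        old_pba = self.l2p.get(lba)
        self.l2p[lba] = new_pba
        return old_pba

    def read(self, lba):
        """Translate a host LBA to the physical page to read."""
        return self.l2p[lba]
```

Every host overwrite thus produces one invalid physical page, which is exactly the "garbage" that GC (next sections) exists to reclaim.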
Global Wear-Leveling
Consider the unique behaviors of NAND flash: programming is page-based;
erasing is block-based; a block consists of many pages; erase must precede
program; P/E cycles are limited; reads induce disturbance; data has retention
limits; and so on. Almost every operation applied to a NAND cell impacts its life span. To
avoid uneven usage of NAND cells, "Wear Leveling" (WL) must be adopted and
carefully designed.
Wear Leveling is the function that tracks the P/E counts of all blocks and
moves user data from block to block to assure that all blocks are evenly used, i.e.,
with similar P/E cycles. Needless to say, such actions involve the FTL and GC
mentioned previously. Again, the WL, GC, and FTL functions must all be well
designed and optimized together to avoid impacting SSD performance.
Data from the host can be roughly separated into 2 categories: Hot Data and Cold Data.
Hot Data are updated very frequently, while Cold Data might not be updated for a
very long time. Based on these 2 categories, XC100 implements 2 corresponding types
of WL:
(1) Dynamic WL: mainly applied to Hot Data. XC100 dynamically prioritizes the
blocks with the lowest P/E counts to store Hot Data. Via the Global FTL, the original
PBA of the Hot Data is marked invalid, waiting for the Garbage Collection (GC)
function to collect, erase, and release it.
(2) Static WL: mainly applied to Cold Data. As previously mentioned, XC100
monitors the P/E counts of all blocks. When a Cold Data block has
the minimum P/E count, XC100 activates Static WL, moving the Cold Data to
another area and releasing the block.
XC100's outstanding endurance specifications (3 and 7 DWPD) reflect the
careful design of its WL scheme and its optimization together with GC and the FTL.
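The two WL policies above can be sketched as selection rules over per-block P/E counters. The function names and the `margin` heuristic are illustrative assumptions, not the XC100 algorithm:

```python
def pick_block_for_hot_data(free_blocks, pe_counts):
    """Dynamic WL: direct hot writes to the free block with the fewest
    P/E cycles, so lightly-worn blocks absorb the churn."""
    return min(free_blocks, key=lambda b: pe_counts[b])

def needs_static_wl(cold_block, pe_counts, margin=100):
    """Static WL: when a cold block's P/E count lags far behind the
    average, its data should be moved so the fresh block can be reused.
    The margin is an illustrative hysteresis to avoid churn."""
    avg = sum(pe_counts.values()) / len(pe_counts)
    return pe_counts[cold_block] + margin < avg
```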
Efficient Garbage Collection
Many versions of the same data may be stored in NAND flash, but there is ONLY
ONE up-to-date version; the others are out of date. Such out-of-date data
are usually referred to as invalid data. Garbage Collection (GC) reclaims this
invalid data and releases free space for further use. However, overly frequent GC
operations increase overhead and P/E cycles, impacting overall
performance and endurance. Also, GC always operates in the background; if GC,
along with the FTL and WL, is not well designed, host commands and device
responses will be severely affected.
GC operations of XC100 include:
(1) Select the GC target block;
(2) Acquire all valid/invalid page information of the GC block;
(3) Select free blocks as the GC destination;
(4) Copy valid data to the destination block, leaving only invalid data on the GC target block;
(5) Erase the GC target block to release it as free.
With a unique, smart selection algorithm for GC blocks, the GC scheme of XC100
has been efficiently optimized together with the FTL and WL to reach the highest
performance.
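Steps (1)-(5) above can be sketched end to end. This uses a simple greedy victim choice (most invalid pages); XC100's actual selection algorithm is not disclosed:

```python
def garbage_collect(blocks, free_blocks):
    """One GC pass. `blocks` maps block id -> {"valid": [...pages...],
    "invalid": [...pages...]}; `free_blocks` is a list of erased blocks.
    Greedy policy: reclaim the block holding the most invalid pages."""
    victim = max(blocks, key=lambda b: len(blocks[b]["invalid"]))   # (1)(2)
    dest = free_blocks.pop()                                        # (3)
    moved = list(blocks[victim]["valid"])                           # (4)
    blocks[dest] = {"valid": moved, "invalid": []}
    del blocks[victim]                                              # (5) erase
    free_blocks.append(victim)                                      # ...and free
    return victim, dest, moved
```

Greedy selection minimizes the valid data copied per reclaimed block, which is what keeps GC overhead (write amplification) low.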
Fast Power-On Rebuild
Compared to an HDD, an SSD is much faster not only in operation but also in
rebooting, because the native processing speed of NAND is much higher than that
of an HDD. However, there are still some topics a good SSD design must cover,
e.g., the FTL rebuild speed during power-on.
At power-on (rebooting), the SSD must first acquire and rebuild all FTL
information and load it into DRAM for upcoming operations. When power is shut down
gracefully, the system waits while the SSD flushes the complete FTL from DRAM
back to NAND. With a complete FTL in NAND, the power-on process is very fast. But
after an abnormal power-off, the FTL in DRAM is usually lost immediately, so the
SSD would need to scan every page in every block to rebuild the complete FTL during
power-on. Such a "scan-everything" process takes much longer than the normal case.
In XC100, a Snapshot scheme with some special designs is adopted to avoid
long power-on duration.
During normal operation:
(1) The Snapshot function periodically saves and updates the FTL data back to NAND;
(2) The Snapshot update frequency is optimized to avoid performance impact;
(3) Snapshot data is stored in the pSLC area for better endurance and faster access.
At power-on:
(4) Retrieve information from the latest Snapshot data first;
(5) Scan any data not yet captured by the Snapshot to retrieve the remaining information;
(6) Rebuild the whole FTL from (4) and (5).
With this Snapshot scheme, XC100 assures that the power-on rebuild is fast after
both normal and abnormal power-off.
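Steps (4)-(6) amount to "restore the snapshot, then replay only what came after it". A minimal sketch, where the post-snapshot updates are represented as a journal of (LBA, PBA) pairs recovered by the limited scan in step (5):

```python
def rebuild_ftl(snapshot, journal_since_snapshot):
    """Power-on rebuild: start from the last periodic snapshot of the
    L2P table, then replay only the mapping updates made after it,
    instead of scanning every page in every block."""
    l2p = dict(snapshot)                       # step (4): load snapshot
    for lba, pba in journal_since_snapshot:    # step (5): replay the delta
        l2p[lba] = pba
    return l2p                                 # step (6): complete FTL
```

The rebuild cost is proportional to the writes since the last snapshot, not to the drive's capacity, which is why the snapshot interval trades runtime overhead against recovery time.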
TRIM Command Support
Unlike an HDD, which allows data to be overwritten in place, an SSD must erase a
flash cell before programming new data into it. In this case, the SSD must activate
WL to determine which block contains the most invalid data and use it as the data
target block. Furthermore, the host system usually does not reveal to the SSD
which data (LBA) is no
longer valid. As a result, the more invalid data there is, the less free NAND space
remains. To release free NAND space, GC is activated, and once GC is active,
performance starts to decrease gradually.
The TRIM command removes this inconvenience. The host can issue TRIM
to the SSD, indicating which data (LBA) is no longer valid. The SSD can then activate
GC in the background to collect and erase the invalid data and release more
space. Thus SSD performance is sustained at a certain level instead of
continuously decreasing.
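On the device side, handling TRIM reduces to dropping mappings and marking their physical pages invalid so GC can reclaim them early. A minimal sketch (the structure names are illustrative):

```python
def trim(l2p, invalid_pages, lbas):
    """Host-issued TRIM: remove the L2P mappings for the given LBAs and
    mark their physical pages invalid, so GC can reclaim them without
    waiting for the host to overwrite those LBAs."""
    for lba in lbas:
        pba = l2p.pop(lba, None)      # unknown/already-trimmed LBAs are no-ops
        if pba is not None:
            invalid_pages.add(pba)
```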
Intelligent Write Data Flow Control
XC100 has designed an intelligent scheme for each of Read and Write data flow
management. For Write flow management, XC100 treats GC data and host data both
as input and adaptively balances the two, keeping XC100's performance consistent
while maintaining sufficient free blocks.
Intelligent Read Sequence Control
For Read flow management, XC100 adopts a Re-scheduler function to
rearrange command sequences so as to utilize as many channels simultaneously
as possible. The advantages of this function are: (1) Read commands are not
jammed on individual flash channels; (2) Read latency is much lower thanks to the
Pending Queue mechanism of the Re-scheduler.
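The re-scheduling idea can be sketched as per-channel pending queues drained round-robin, so no single busy channel blocks the others. A minimal sketch; the real Re-scheduler's policy is not disclosed:

```python
from collections import defaultdict, deque

def reschedule_reads(commands, num_channels=16):
    """Sort incoming reads into per-channel pending queues, then drain
    them round-robin so all 16 flash channels stay busy in parallel.
    `commands` is a sequence of (cmd_id, channel) pairs."""
    pending = defaultdict(deque)
    for cmd in commands:
        pending[cmd[1]].append(cmd)            # cmd[1] is the flash channel
    order = []
    while any(pending.values()):
        for ch in range(num_channels):         # one command per channel per round
            if pending[ch]:
                order.append(pending[ch].popleft())
    return order
```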
PART 5: DUAL PORT SUPPORT
The PCI Express bus can run with 2, 4, 8, or 16 lanes; the more lanes it supports,
the more bandwidth it provides. So why would a PCIe storage device with 4 lanes
decide to split itself into two 2-lane ports? The answer is High Availability.
High Availability (HA) ensures a certain degree of operational continuity during a
given measurement period and avoids a Single Point of Failure (SPOF). A system
with HA provides: (1) a certain amount of uptime; (2) access to critical functions
of the system; (3) redundancy.
For example, suppose only one server system with one XC100 provides service
to customers. If a failure occurs in this server, its service must be shut down for
repair; although the data itself is intact, customers still have to wait until
the repair is finished. This is a SPOF. If instead 2 server systems are connected
to one single XC100, then when one of the two paths fails, the other takes over the
failed one's jobs using the same data storage (the XC100) and continues working.
As a result, service users experience no downtime while the system maintainer
repairs the failed path in the meantime.
Currently, the Techman design and validation teams are working together with our
server partners on evaluations of this Dual Port feature. Expected in early June,
Techman SSD will introduce our dual-port series, the XC200, to the market.