ECC/DSP System Architecture for Enabling Reliability Scaling in Sub-20nm NAND

Eran Sharon, Idan Alrod, Avi Klein, Alon Eyal, Ofer Shapira
*Intelligent Memory Systems, Memory Division, SanDisk Corp.*

August 2013
Forward-Looking Statements

During our presentation today we may make forward-looking statements.

Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, industry trends, future memory technology, technology transitions and future products. This presentation contains information from third parties which reflect their projections as of the date of issuance.

Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.

We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.

Disclaimer: This tutorial provides an overview of various techniques and concepts, some or all of which may not necessarily reflect what SanDisk is actually using in their products.
Outline

- Gap Between Product Requirements and Technology Capability
  - Applications Requirements: Endurance, Performance, Power
  - Reliability Challenges with Scaling

- ECC/DSP solutions
  - Tier 0: Adaptive NAND Parameters Optimization
  - Tier 1: Noise Reduction
  - Tier 2: Advanced Error Correction Coding (ECC)
  - Tier 3: Second level Error Correction (RAID)
  - Tier 4: Flash Management Algorithms
  - Tier 5: Host Data Manipulation

- Summary

Increasing Product Requirements

- Faster Application Execution
- Faster Web Browsing
- Smoother Multi-tasking
- Computational Photography
- Longer Battery Life, Power Savings
- Sharing, Connectivity
- Higher Resolution Video

Gap Between Raw Memory Capability and Applications Requirements

[Figure: raw flash versus application requirements across technology nodes: 56nm, 43nm, 32nm, 24nm, 19nm, 1Ynm, 12nm]

- Raw flash, scaled for cost reduction: reliability degrades, performance deteriorates
- Application requirements (Email, Music, & Video; Multitasking Apps; Computing): lower power consumption, better performance, higher endurance
- Products: iNAND™, EFD “Classic”
- Adaptive Flash Management™ (AFM) Technology:
  - AFM 1.0: Performance Leadership
  - AFM 2.0: Application Awareness
  - AFM 3.0: Usage based
  - AFM 4.0: Multi-dimensional

Optimized **Endurance** for enhanced video download & application caching

Optimized Performance for superior gaming experience

Optimized Power Consumption for longer web browsing
Reliability Challenges with Scaling

As an example, we will describe the phenomenon of Read Disturb.
Read Operation

Threshold Voltage numbers are nominal

1. BL Pre-Charge
   - Opens “select gate” transistors

2. Gate Voltages
   - Opens unselected cells – “Victims”
   - Senses state of selected cell – “Target”
Read Operation

Target Cell is Erased – Read “1”

1. BL Pre-Charge
2. Gate Voltages
3. Sensing
Read Operation
Erased Cell – Read “1”

1. BL Pre-Charge
2. Gate Voltages
3. Sensing

Current is Sensed
➢ Cell is Erased!
Read “1”
Read Operation

Target cell is Programmed – Read “0”

1. BL Pre-Charge
2. Gate Voltages
3. Sensing

Current is NOT Sensed

➢ Cell is Programmed!
Read “0”
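
The three read steps above can be sketched as a toy model of a single NAND string. The voltages, the threshold values, and the function names below are illustrative assumptions and are not taken from the slides; the sketch only captures the rule that current is sensed when every cell in the string conducts.

```python
# Toy model of reading one cell in a NAND string.
# All voltages are illustrative assumptions, not device data.
V_PASS = 6.0   # assumed pass voltage applied to unselected word lines ("victims")
V_READ = 0.0   # assumed read voltage applied to the selected word line ("target")

def cell_conducts(gate_voltage, threshold_voltage):
    """A cell conducts when its gate voltage exceeds its threshold voltage."""
    return gate_voltage > threshold_voltage

def read_bit(string_vts, target_index):
    """Return 1 if current is sensed on the bit line (target erased), else 0."""
    for i, vt in enumerate(string_vts):
        gate = V_READ if i == target_index else V_PASS
        if not cell_conducts(gate, vt):
            return 0  # the string is cut off: no current is sensed
    return 1  # every cell conducts: current is sensed

# In this toy model erased cells have a negative threshold, programmed cells a positive one.
string_vts = [-2.0, 2.5, -2.0, 2.5]
erased_read = read_bit(string_vts, 0)
programmed_read = read_bit(string_vts, 1)
```

Note that the unselected cells see the high pass voltage on every read of the string, which is the stress behind the Read Disturb effect.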
Read Disturb

Er → A

- P/E cycles lead to Tunnel Oxide (Tox) degradation that creates traps
- “Weak Programming” in unselected cells due to unintentional tunneling of electrons to the FG
ECC/DSP Methods: from NAND to System

- Adaptive NAND Parameters Optimization (Tier 0)
- Noise Reduction (Tier 1)
- Advanced Error Correction Coding (ECC) (Tier 2)
- Second level Error Correction (RAID) (Tier 3)
- Flash Management Algorithms (Tier 4)
- Host Data Manipulation (Tier 5)

NAND (Raw) to System
Error Handling System Solutions

Early Technologies

ECC: Error Correction Coding (BCH)

[Figure: error-rate scale from 1e-1 (many errors) to 1e-21 (few errors)]

Basic ECC sufficient to meet application requirements
Error Handling System Solutions
Advanced (Sub-20nm)

Sophisticated ECC and DSP techniques are applied to mitigate the natural drift in reliability and to meet the more demanding requirements of embedded applications.
Tier 0: Adaptive NAND Parameters Optimization

Adaptive NAND parameters optimization along the memory lifetime.

Parameter setting ("trimming") of the Program, Erase and Read parameters.

System level feedback adapts the parameters to:
- Memory wearing and error rates along the lifetime
- Die to die, block to block, WL to WL variations within the memory
- Host data patterns

Once NAND level optimization has been exhausted, the residual noises and errors need to be handled at system level.
Adaptive Read Thresholds – Example

Problem:
- The cell voltage distribution is not fixed:
  - It changes along the memory lifetime with W/E cycling and with time (data retention, DR)
  - It varies within a die: from block to block, WL to WL, …
- Using a fixed set of default thresholds results in high BER and decoding failures

Solution:
- Adaptive read thresholds
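
As a minimal sketch of the idea, the snippet below picks, from a set of candidate thresholds, the one with the lowest bit error rate on a simulated two-state page whose distributions have drifted. The Gaussian model, the drift amount, the candidate grid, and the assumption that the programmed state of each cell is known (for scoring) are all hypothetical choices for illustration, not a description of an actual controller algorithm.

```python
import random

def bit_error_rate(cells, threshold):
    """Fraction of cells whose read state (voltage >= threshold) differs from the programmed state."""
    errors = sum((v >= threshold) != programmed for v, programmed in cells)
    return errors / len(cells)

def adapt_threshold(cells, candidates):
    """Pick the candidate read threshold with the lowest error rate."""
    return min(candidates, key=lambda t: bit_error_rate(cells, t))

random.seed(0)
# Hypothetical two-state page: the state means have drifted down by 0.4 V from (1.0, 3.0).
cells = [(random.gauss(0.6, 0.35), False) for _ in range(5000)]
cells += [(random.gauss(2.6, 0.35), True) for _ in range(5000)]

DEFAULT_THRESHOLD = 2.0  # midpoint of the undrifted distributions (assumed)
candidates = [DEFAULT_THRESHOLD + 0.05 * k for k in range(-20, 21)]
best = adapt_threshold(cells, candidates)

default_ber = bit_error_rate(cells, DEFAULT_THRESHOLD)
adapted_ber = bit_error_rate(cells, best)
```

In this toy setup the adapted threshold follows the drifted distributions downward, so `adapted_ber` is no higher than `default_ber`.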
Tier 1: Noise Reduction

- System level residual NAND “noise” reduction via DSP and coding techniques, aimed at reducing error rates to a bare minimum level
  - Tier 1 countermeasures may reduce raw NAND error rates from a ~1E-1 error level to ~1E-2 error level
  - Tier 1 countermeasures are aimed at:
    - Ensuring that the next Tier 2 Error Correction Coding (ECC) is cost effective (i.e. less redundancy)
    - Maximizing performance and reducing power consumption
  - Tier 1 countermeasures deal with non-intrinsic “noises”, which can be cancelled out, mitigated or compensated for:
    - Data-dependent noises such as cross-coupling induced widening, back pattern effects, Program and Read Disturbs, over-programming errors, etc.
NAND Scaling Challenges – Interferences

Source: Semiconductor Insights
Cross Coupling Widening effect
Cell-to-Cell Coupling (CCC) Trend

- With technology scaling, CCC increases dramatically
- Air Gap (AG) technology makes the CCC at 19nm (with AG) equivalent to that at 24nm (without AG), a ~27% reduction
Mitigating Data Dependent Noises

Digitally mitigating cross coupling and other data-dependent noises during read by taking into account the read state of the neighboring cells

Example: Read LSB page

Without utilizing neighbor cell information:
LSBit = “0” with low reliability

Utilizing neighbor cell information:
LSBit = “1” with high reliability

Assume we know that the neighbor was programmed to the highest state
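
The LSB example above can be sketched as a read that subtracts an assumed coupling shift before comparing against the threshold. The threshold value, the per-state shift table, and the use of the distance from the threshold as a reliability measure are hypothetical and chosen only to reproduce the situation on the slide.

```python
# Assumed values for illustration only.
READ_THRESHOLD = 2.0                                # volts, LSB read threshold
COUPLING_SHIFT = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.3}   # apparent shift per neighbor state

def read_lsb(cell_voltage, neighbor_state=None):
    """Return (bit, margin); margin is the distance from the threshold (larger = more reliable)."""
    shift = COUPLING_SHIFT[neighbor_state] if neighbor_state is not None else 0.0
    compensated = cell_voltage - shift
    bit = 1 if compensated < READ_THRESHOLD else 0
    return bit, abs(compensated - READ_THRESHOLD)

sensed = 2.05                                      # just above the threshold
bit_plain, margin_plain = read_lsb(sensed)         # neighbor ignored
bit_aware, margin_aware = read_lsb(sensed, 3)      # neighbor at the highest state
```

Without neighbor information the cell reads as “0” with a small margin; once the neighbor's highest state is accounted for, the same cell reads as “1” with a larger margin.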
Tier 2: Advanced Error Correction Coding (ECC)

- Advanced Error Correction Coding (ECC) is required in order to handle the residual errors of tier 1
  - Tier 2 ECC can reduce the ~1E-2 residual error levels of tier 1 to a ~1E-16 error level
  - State of the art iterative coding techniques, such as LDPC, are replacing algebraic coding techniques, such as BCH codes
  - Advanced ECC techniques are essential for achieving an optimal cost, endurance and performance tradeoff, as they allow operation near the theoretical limit (Shannon limit), providing maximal correction capability for a given amount of overprovisioning ("ECC redundancy")
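
To give a feel for how correction capability trades against residual failure rate, the sketch below computes the probability that a codeword contains more errors than a bounded-distance decoder can correct. It assumes independent bit errors and an idealized decoder that corrects up to `t` errors (a simple model of an algebraic code such as BCH); it does not model LDPC decoding, and the parameter values in the usage line are arbitrary.

```python
from math import exp, lgamma, log, log1p

def codeword_failure_prob(n, t, p):
    """P(more than t of n bits are in error), for independent bit errors with probability p in (0, 1)."""
    total = 0.0
    for k in range(t + 1, n + 1):
        # Binomial term C(n, k) * p^k * (1 - p)^(n - k), evaluated in the log domain
        # so that long codewords do not overflow or underflow prematurely.
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log1p(-p))
        total += exp(log_term)
    return total

# Arbitrary example: 1000-bit codeword at a raw bit error rate of 1e-2.
failure_by_t = {t: codeword_failure_prob(1000, t, 1e-2) for t in (20, 40, 60)}
```

Under these assumptions the failure probability drops steeply as `t` grows, which is why more capable codes allow the same target error level with less margin elsewhere.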
Flash Information Theory...

How can we compute the Flash capacity? *Information Theory* (Shannon 1948)

Based on knowing the probability of reading a voltage level $Y$ given that a voltage level $X$ was programmed:

$$C = \max_{P(X)} I(X;Y) = \max_{P(X)} \sum_{X,Y} P(X)P(Y|X) \log_2 \left( \frac{P(Y|X)}{\sum_{X'} P(X')P(Y|X')} \right)$$

Actual computations are more complicated; they depend on:

- Verify and read voltage levels
- Data retention
- P/E cycles
- Temperature
- Tuning voltage ambiguity
- Cross coupling
- Back pattern
- Program/Read disturb

...
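
A minimal sketch of the mutual information term in the capacity formula, for a discrete channel given as a transition table. The function name and the binary symmetric channel used as input are illustrative assumptions; the maximization over $P(X)$ is not performed here, and the uniform input is used because it is known to achieve capacity for a symmetric binary channel.

```python
from math import log2

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits for input distribution p_x[x] and channel p_y_given_x[x][y]."""
    n_y = len(p_y_given_x[0])
    # Output distribution P(Y) = sum over X of P(X) P(Y|X).
    p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(len(p_x))) for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            pyx = p_y_given_x[x][y]
            if px > 0 and pyx > 0:  # zero-probability terms contribute nothing
                total += px * pyx * log2(pyx / p_y[y])
    return total

# Binary symmetric channel with a 1e-2 crossover probability (assumed example).
p = 1e-2
bsc = [[1 - p, p], [p, 1 - p]]
capacity = mutual_information([0.5, 0.5], bsc)
```

For this example the result is roughly 0.92 bits per cell read, so under this simplified channel about 8% redundancy is the theoretical minimum for reliable storage.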
Approaching the Shannon Limit

Source: Forward Insights
Tier 3: Second level Error Correction (RAID)

- For enhanced reliability, required especially for SSD applications, a second level of error correction is needed to deal with complete NAND failures that result in colossal errors. RAID-like techniques are used for that purpose.
  - Tier 3 level protection is used for both:
    - Reducing tier 2 error rates from ~1E-16 to ~1E-24 or lower
    - Reducing dPPM levels due to gross NAND failures, such as WL breaks, WL shorts, etc.
  - Tier 3 protection may require extra overprovisioning, or may only maintain the overprovisioning temporarily in the controller until data integrity is verified.
RAID Example
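
One simple RAID-like scheme is XOR parity across pages, sketched below. The stripe layout (one page per die plus one parity page) and the function names are assumptions for illustration; the original figure for this slide is not available, so this is not necessarily the scheme it showed.

```python
from functools import reduce

def xor_pages(pages):
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))

def rebuild(stripe, parity, lost_index):
    """Recover the page at lost_index from the surviving pages and the parity page."""
    survivors = [page for i, page in enumerate(stripe) if i != lost_index]
    return xor_pages(survivors + [parity])

# Pages assumed to be stored on different dies, with parity kept on another die.
stripe = [b"page-A..", b"page-B..", b"page-C.."]
parity = xor_pages(stripe)
recovered = rebuild(stripe, parity, 1)  # e.g. after a WL short destroys page 1
```

A single parity page can recover one lost page per stripe; tolerating more simultaneous failures requires additional redundancy.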
Tier 4: Flash Management Algorithms

- Back-end Flash management algorithms manage how logical data is stored at the physical NAND level, so as to provide the best performance (sequential, random, or any combined use case) and the best endurance.

- Examples of Flash management functions are:
  - Logical to physical address mapping
  - Wear leveling
  - Garbage collection
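
The three functions listed above can be sketched together in a toy flash translation layer. The class below is a deliberately simplified assumption: it maps at block granularity, reclaims the stale copy immediately on overwrite, and has no handling for running out of free blocks. Real flash management is considerably more involved.

```python
class ToyFTL:
    """Toy flash translation layer: block-granularity mapping with wear-aware allocation."""

    def __init__(self, num_blocks):
        self.erase_count = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))
        self.l2p = {}    # logical block address -> physical block
        self.data = {}   # physical block -> stored payload

    def write(self, lba, payload):
        # Wear leveling: take the free block that has been erased the fewest times.
        target = min(self.free_blocks, key=lambda b: (self.erase_count[b], b))
        self.free_blocks.remove(target)
        self.data[target] = payload
        old = self.l2p.get(lba)
        self.l2p[lba] = target       # logical-to-physical update
        if old is not None:
            self._reclaim(old)       # garbage collection of the stale copy

    def _reclaim(self, block):
        del self.data[block]
        self.erase_count[block] += 1  # erasing the block wears it
        self.free_blocks.add(block)

    def read(self, lba):
        return self.data[self.l2p[lba]]

ftl = ToyFTL(num_blocks=4)
for version in range(10):
    ftl.write(0, f"v{version}")      # the host keeps rewriting the same logical address
```

Even though the host rewrites a single logical address, the erases end up spread across all physical blocks instead of wearing out one of them.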
Tier 5: Host Data Manipulation

Host data manipulation leverages the inherent “redundancy” in the host data to improve endurance, performance and power

- Examination of host data produced by users or arising from various operating and file systems shows that a significant fraction of the data is of low entropy, with many repetitive data patterns
- Low entropy data from the host can be manipulated by the controller in various ways:
  - Compression, Endurance coding, Deduplication
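
As a rough illustration of low-entropy host data, the sketch below measures the empirical byte entropy of a buffer and how much it shrinks under a general-purpose compressor. The use of `zlib` as a stand-in for a controller's compression engine and the sample buffer are assumptions made for the example.

```python
import zlib
from collections import Counter
from math import log2

def byte_entropy(data):
    """Empirical zeroth-order entropy in bits per byte."""
    counts = Counter(data)
    return -sum(c / len(data) * log2(c / len(data)) for c in counts.values())

def compressed_fraction(data):
    """Compressed size divided by original size (smaller means fewer bytes written to NAND)."""
    return len(zlib.compress(data)) / len(data)

# Hypothetical low-entropy buffer: zero padding followed by a repeated pattern.
repetitive = b"\x00" * 3072 + b"header" * 170
entropy_bits = byte_entropy(repetitive)
fraction = compressed_fraction(repetitive)
```

Writing fewer bytes for the same host data reduces program operations, which is how this kind of manipulation can help endurance, performance and power.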
Summary

- **Tier 0:** Adaptive Parameters
- **Tier 1:** Noise Reduction
- **Tier 2:** Advanced ECC
- **Tier 3:** Second Level Error Correction
- **Tier 4:** Flash Management Algorithms
- **Tier 5:** Host Data Manipulation

Optimization for different tradeoffs
Thank you!