Overview and Motivation

- Big Data
- HPC as defined and requested by President Obama for the ExaScale Challenge (1 ExaFLOPS at 20 MW)
  - Unstructured data = random distribution of data across all addresses in the address space.
  - Accesses to random addresses reduce the efficiency of CPU caching strategies, which rely on spatial and, to some degree, temporal locality.
  - Worse than lack of locality is the need to swap to disk – even to and from an SSD or PCIe-attached Flash.
  - There is plenty of evidence that Big Data and HPC workloads fare better on computers with very large main memories, even if those memories are slower than DRAM.
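
The effect of random addressing on caching can be illustrated with a toy model. The following is a minimal sketch, not a model of any real CPU: the direct-mapped cache, the 64-byte line size, the 512-line capacity and the 64 GiB address range are all assumptions chosen for illustration.

```python
import random

LINE_BYTES = 64        # assumed cache-line size
NUM_LINES = 512        # assumed number of lines (32 KiB direct-mapped cache)

def hit_rate(addresses):
    """Fraction of accesses that hit in a simple direct-mapped cache."""
    tags = [None] * NUM_LINES          # memory line currently held by each slot
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // LINE_BYTES      # memory line the address falls in
        index = line % NUM_LINES       # cache slot that line maps to
        if tags[index] == line:
            hits += 1
        else:
            tags[index] = line         # miss: fill the slot
        total += 1
    return hits / total

N = 100_000
# Sequential 8-byte reads: 8 accesses per 64-byte line, so 7 of every 8 hit.
sequential = [i * 8 for i in range(N)]
# Uniformly random 8-byte-aligned reads across an assumed 64 GiB address space.
rng = random.Random(0)
scattered = [rng.randrange(0, 64 * 2**30, 8) for _ in range(N)]

seq_rate = hit_rate(sequential)
rand_rate = hit_rate(scattered)
print(f"sequential hit rate: {seq_rate:.3f}")
print(f"random hit rate:     {rand_rate:.3f}")
```

In this model the sequential stream hits on 7 of every 8 accesses purely through spatial locality, while the scattered stream almost never finds its line in the cache.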
Motivation for Big Main Memory

- Why is there a need for a new type of memory?
- The problem size (Big Data, HPC) keeps growing
- Economic considerations rule out SRAM and DRAM
- What we really need is
  - Big
  - Fast
  - Cheap
  - Energy-efficient
Very large capacity Main Memory

- Total Main Memory size must grow to accommodate in-situ processing
- SRAM and DRAM are not dense enough and consume too much power
- SRAM and DRAM are too expensive
- SSDs and PCIe-attached Flash are too slow
- A very large main memory can often avoid swapping
- Really, Big Data means never having to go to Disk
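
Why a large but slower main memory can beat a small DRAM that swaps is easiest to see with the usual effective-access-time formula. The sketch below is only illustrative: every latency and the 1% swap rate are assumed round numbers, not measurements of any product.

```python
def effective_access_ns(mem_ns, fault_rate=0.0, swap_ns=0.0):
    """Average access time when a fraction of accesses must go to swap."""
    return (1.0 - fault_rate) * mem_ns + fault_rate * swap_ns

# All figures below are illustrative assumptions, not measurements.
DRAM_NS = 100.0          # assumed DRAM access latency
BIG_MEM_NS = 300.0       # assumed latency of a larger, slower main memory
SSD_SWAP_NS = 100_000.0  # assumed cost of servicing one swap from flash
FAULT_RATE = 0.01        # assumed share of accesses that miss the small DRAM

small_dram_with_swap = effective_access_ns(DRAM_NS, FAULT_RATE, SSD_SWAP_NS)
big_slow_memory = effective_access_ns(BIG_MEM_NS)

print(f"small DRAM + swap: {small_dram_with_swap:.0f} ns per access")
print(f"big slow memory:   {big_slow_memory:.0f} ns per access")
```

Under these assumptions, even a 1% swap rate dominates the average, so a memory three times slower than DRAM still comes out ahead as long as the whole working set fits in it.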
Practical Solutions

- Direct attachment to the CPU is preferred over SAS, SATA or PCIe for latency reasons
- DDR3, DDR4 and HBM rely on outdated buses
- A faster infrastructure is needed, such as Hybrid Memory Cube and its High Speed Serial Links
- The memory controller(s) should reside with the memory, not on the CPU
- 3D XPoint is in its infancy
- It stores data as a change in a material property (bulk resistance) at the intersections of wires
Current CPU & Memory

Processor with DDR-3/4 DRAM Controller

DDR-3/4 DRAM DIMM

DDR-3/4 DRAM DIMM

Shared SSTL-15 (DDR3) or POD-12 (DDR4) Interface
Bandwidth: 17 GB/s

DDR-3/4 DRAM DIMM

DDR-3/4 Flash DIMM
Multi-Core CPU with Memory

Multi-Core CPU with in-order DRAM Controllers

- L2 Cache SRAM
- L2 Cache Controller
- CPU Core(s)

Mux/Demux

- DRAM Ctrl0
- DRAM Ctrl1
- DRAM Ctrl2
- DRAM Ctrl3

Multi-Drop DRAM Arrays with SSTL-15 (DDR3) or POD-12 (DDR4) Bus

- DIMM 0, 1, 2 (on each of DRAM Ctrl0 through Ctrl3)
Host CPU to Disk I/O

- Host CPU
- NorthBridge With PCIe Root Complex
- PCIe SSD Controller
- Flash Array
- PCIe SATA/SAS Controller
- Flash Controller
- Flash Array
Single-Port HMC-based Memory

Processor with HMC Host Adapter

HMC Host Adapter

HMC Flash Module

HMC DRAM Module

HMC Flash Module

HMC Flash Module

FDX Interface Bandwidth: 60 GB/s (on each of the four links shown)
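
The bandwidth figures on this slide and on the DDR-3/4 slide can be put side by side with a little arithmetic. This is a rough sketch under stated assumptions: that the 17 GB/s figure corresponds to one 64-bit DDR4-2133 channel (consistent with the PC4-17000 module rating), that each HMC module has its own 60 GB/s link, and that the four links run independently at peak rate.

```python
DDR_SHARED_BUS_GBPS = 17.0   # shared DDR-3/4 interface, from the slides
HMC_LINK_GBPS = 60.0         # one FDX link, from the slides
HMC_LINKS = 4                # assumed: one link per module in the diagram

def aggregate_gbps(per_link_gbps, links):
    """Peak aggregate bandwidth, assuming the links operate independently."""
    return per_link_gbps * links

# Assumed origin of the 17 GB/s figure: 2133 MT/s on a 64-bit (8-byte) channel.
ddr4_2133_gbps = 2133e6 * 8 / 1e9

hmc_total = aggregate_gbps(HMC_LINK_GBPS, HMC_LINKS)
ratio = hmc_total / DDR_SHARED_BUS_GBPS

print(f"DDR4-2133 channel: {ddr4_2133_gbps:.3f} GB/s")
print(f"HMC aggregate:     {hmc_total:.0f} GB/s")
print(f"HMC vs shared bus: {ratio:.1f}x")
```

The key structural difference is that the DDR DIMMs share one bus, so their bandwidth does not add up, whereas point-to-point links scale with the number of links.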
SSRLabs Unified HMC Memory

Processor with HMC Host Adapter

HMC Host Adapter

HMC DRAM Controller & TSV DRAM Interface

HMC DRAM

Unified HMC Memory

HMC Base Logic (Parser, Switch, Command xlat)

HMC Flash Controller & TSV Flash Interface

HMC Flash

TSV-attached DRAM

TSV-attached Flash

TSV-attached Flash

FDX Interface Bandwidth: 60 GB/s
# Cost Comparison

- **Assumption:** 512 GB Memory Array
- **Source:** DRAMeXchange

<table>
<thead>
<tr>
<th>Type</th>
<th>Per-Unit Cost</th>
<th>Number needed</th>
<th>Total Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>DDR4 DRAM Chip</td>
<td>$3.35</td>
<td>1024</td>
<td>$3,430.40</td>
</tr>
<tr>
<td>4 GB Registered DIMM (DDR4)</td>
<td>$66.99</td>
<td>128</td>
<td>$8,574.72</td>
</tr>
<tr>
<td>32 GB DDR4 PC4-17000 Load-Reduced ECC DIMM, 1.2 V, 4096Meg x 72</td>
<td>$469.99</td>
<td>16</td>
<td>$7,519.84</td>
</tr>
</tbody>
</table>
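
The totals in the table are simply per-unit cost times the number needed for a 512 GB array. The short sketch below recomputes them and adds a cost-per-GB figure for comparison. The prices are copied from the table; the function names and shortened labels are hypothetical, introduced only for this example.

```python
ARRAY_GB = 512  # memory array size assumed in the comparison

# (type, per-unit cost in USD, number needed), taken from the table
options = [
    ("DDR4 DRAM chip", 3.35, 1024),
    ("4 GB registered DIMM (DDR4)", 66.99, 128),
    ("32 GB DDR4 load-reduced DIMM", 469.99, 16),
]

def total_cost(unit_cost, count):
    """Total cost in USD, rounded to whole cents."""
    return round(unit_cost * count, 2)

def cost_per_gb(unit_cost, count, array_gb=ARRAY_GB):
    """Cost in USD per GB for the whole array."""
    return total_cost(unit_cost, count) / array_gb

for name, unit_cost, count in options:
    print(f"{name}: ${total_cost(unit_cost, count):,.2f} total, "
          f"${cost_per_gb(unit_cost, count):.2f}/GB")
```

Note that the bare-chip row excludes the board, buffers and assembly that the DIMM prices include, so the rows are not strictly like for like.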
Benefits of a Unified HMC Memory

- 3D and TSV manufacturing is maturing
- All components are readily available
- Internal and port bandwidth exceed all legacy memory architectures
- Better performance than DDR3/4 DRAM at lower cost, higher density and lower power