Computational RAM

Computational RAM or C-RAM is random access memory with processing elements integrated into the design. This enables C-RAM to be used as a SIMD computer. It can also make more efficient use of the memory bandwidth available within a memory chip.
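The SIMD view can be pictured as many small processing elements, each attached to its own column of memory, all executing the same broadcast instruction at once. The following minimal C sketch simulates that arrangement; the names (NUM_PES, cram_broadcast_add) are illustrative, not any vendor's interface.

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch (not a real device API): model C-RAM as an array of
     * processing elements, each owning one column of memory.  A single
     * broadcast instruction is applied by every PE to its own local data,
     * which is the SIMD behaviour described above. */
    #define NUM_PES 1024

    static uint8_t column_mem[NUM_PES];   /* one memory column per PE */

    /* Broadcast "add a constant" to all PEs; each touches only local data. */
    static void cram_broadcast_add(uint8_t value)
    {
        for (int pe = 0; pe < NUM_PES; pe++)
            column_mem[pe] += value;      /* in hardware, all PEs act at once */
    }

    int main(void)
    {
        cram_broadcast_add(5);
        printf("PE 0 now holds %u\n", column_mem[0]);
        return 0;
    }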

Perhaps the most influential implementations of computational RAM came from The Berkeley IRAM Project.

Some embarrassingly parallel computational problems are already limited by the von Neumann bottleneck between the CPU and the DRAM. Some researchers expect that, for the same total cost, a machine built from computational RAM will run orders of magnitude faster than a traditional general-purpose computer on these kinds of problems.[1]

As of 2011, the "DRAM process" (few layers; optimized for high capacitance) and the "CPU process" (many layers; optimized for high frequency; relatively expensive per square millimeter) are distinct enough that there are three approaches to computational RAM:

  • starting with a CPU-optimized process and a device that uses a large amount of embedded SRAM, add an additional process step (making it even more expensive per square millimeter) so that the embedded SRAM can be replaced with embedded DRAM (eDRAM), giving roughly a 3x area saving in the SRAM regions (and so lowering the net cost per chip).
  • starting with a system that has a separate CPU chip and DRAM chip(s), add small amounts of "coprocessor" computational ability to the DRAM, working within the limits of the DRAM process and adding only a small amount of area, to do things that would otherwise be slowed down by the narrow bottleneck between CPU and DRAM: zero-fill selected areas of memory, copy large blocks of data from one location to another, find where (if anywhere) a given byte occurs in some block of data, and so on (see the sketch after this list). The resulting system (the unchanged CPU chip plus "smart DRAM" chip(s)) is at least as fast as the original system, and potentially slightly lower in cost. The cost of the small amount of extra area is expected to be more than paid back in savings in expensive test time, since a "smart DRAM" now has enough computational capability for a wafer full of DRAM to do most testing internally in parallel, rather than the traditional approach of fully testing one DRAM chip at a time with expensive external automatic test equipment.
  • starting with a DRAM-optimized process, tweak the process to make it slightly more like the "CPU process", and build a (relatively low-frequency, but low-power and very high-bandwidth) general-purpose CPU within the limits of that process. The Berkeley IRAM Project and TOMI Technology[2] take this approach.
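
The "smart DRAM" operations named in the second approach can be pictured as a small host-side command set. The C sketch below is hypothetical: the smart_dram_* names and signatures are illustrative only, and ordinary library calls stand in for work that a smart DRAM would carry out internally without moving the data across the CPU-DRAM bus.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical host-side view of "smart DRAM" coprocessor commands.
     * Plain C library calls model the effect of each command; in a real
     * smart DRAM the work would happen inside the memory chip. */

    /* Zero-fill a selected region of memory. */
    static void smart_dram_zero_fill(uint8_t *base, size_t len)
    {
        memset(base, 0, len);                 /* in-DRAM: row-wide clear */
    }

    /* Copy a large block from one location to another. */
    static void smart_dram_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        memmove(dst, src, len);               /* in-DRAM: row-to-row transfer */
    }

    /* Find the first occurrence of a byte; return its offset, or -1 if absent. */
    static long smart_dram_find_byte(const uint8_t *base, size_t len, uint8_t needle)
    {
        const uint8_t *hit = memchr(base, needle, len);
        return hit ? (long)(hit - base) : -1;
    }

    int main(void)
    {
        uint8_t block[64] = "computational RAM";
        smart_dram_zero_fill(block + 32, 32);
        smart_dram_copy(block + 20, block, 17);
        printf("'R' found at offset %ld\n",
               smart_dram_find_byte(block, sizeof block, 'R'));
        return 0;
    }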

References

  • Duncan Elliott, Michael Stumm, W. Martin Snelgrove, Christian Cojocaru, Robert McKenzie, "Computational RAM: Implementing Processors in Memory", IEEE Design and Test of Computers, vol. 16, no. 1, pp. 32–41, Jan.–Mar. 1999.