Scratchpad RAM

Scratchpad memory (SPM), also known as a scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval.

It can be considered similar to an L1 cache in that it is the memory closest to the ALU after the internal registers, but with explicit instructions to move data to and from main memory, often using DMA-based data transfer. In contrast with a system that uses caches, a system with scratchpads has non-uniform memory access latencies, because the access latencies to the different scratchpads and to main memory vary. Another contrast with a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in main memory.
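
The distinction shows up directly in code. Below is a minimal C sketch contrasting the two models; the scratchpad base address, its size, and the blocking dma_copy() helper are hypothetical placeholders, since the real mechanism (memory-mapped DMA registers, intrinsics, or an OS driver) varies by platform:

    #include <stdint.h>
    #include <stddef.h>

    #define SPM_BASE 0x10000000u      /* hypothetical scratchpad base address  */
    #define SPM_SIZE 4096u            /* hypothetical scratchpad size in bytes */

    /* Hypothetical blocking DMA helper: on real hardware this would program
     * DMA controller registers and poll or wait for a completion interrupt. */
    extern void dma_copy(void *dst, const void *src, size_t n);

    /* Cache-based system: just touch main memory; the hardware decides
     * what to keep close to the core, transparently. */
    int32_t sum_cached(const int32_t *data, size_t n) {
        int32_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += data[i];             /* misses and refills handled by cache */
        return s;
    }

    /* Scratchpad-based system: software explicitly stages data into the
     * scratchpad before computing on it; nothing is mirrored automatically. */
    int32_t sum_scratchpad(const int32_t *data, size_t n) {
        int32_t *spm = (int32_t *)SPM_BASE;
        size_t chunk = SPM_SIZE / sizeof(int32_t);
        int32_t s = 0;
        for (size_t base = 0; base < n; base += chunk) {
            size_t len = (n - base < chunk) ? n - base : chunk;
            dma_copy(spm, data + base, len * sizeof(int32_t)); /* explicit move */
            for (size_t i = 0; i < len; i++)
                s += spm[i];          /* guaranteed on-chip access latency */
        }
        return s;
    }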

Scratchpads are employed to simplify caching logic, and to guarantee that a unit can work without main-memory contention in a system employing multiple processors, especially in multiprocessor systems-on-chip for embedded systems. They are most suited to storing temporary results (such as would be found in the CPU stack, for example) that would not normally need committing to main memory; when fed by DMA, however, they can also be used in place of a cache to mirror the state of slower main memory. The same issues of locality of reference apply to efficiency of use, although some systems allow strided DMA to access rectangular data sets. A further difference is that scratchpads are explicitly manipulated by applications, as the double-buffering sketch below illustrates.
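
Mirroring slower main memory through a scratchpad is typically done with double buffering, so that the DMA transfer of the next block overlaps computation on the current one. This is a minimal sketch using the same kind of hypothetical asynchronous DMA interface as above; real platforms expose their own primitives:

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 1024u               /* elements per scratchpad buffer (assumed) */

    /* Hypothetical asynchronous DMA: start a transfer under tag 'tag',
     * then later block until that tag completes. */
    extern void dma_copy_async(void *dst, const void *src, size_t n, int tag);
    extern void dma_wait(int tag);

    extern void process(int32_t *buf, size_t n);  /* application-defined work */

    /* Two scratchpad-resident buffers: compute on one while the DMA engine
     * fills the other, hiding main-memory latency behind useful work. */
    static int32_t spm_buf[2][BLOCK]; /* assumed placed in scratchpad, e.g.
                                         via a linker section */

    void stream_process(const int32_t *main_mem, size_t n_blocks) {
        dma_copy_async(spm_buf[0], main_mem, sizeof(spm_buf[0]), 0);
        for (size_t b = 0; b < n_blocks; b++) {
            int cur = b & 1, nxt = cur ^ 1;
            if (b + 1 < n_blocks)     /* start fetching the next block early */
                dma_copy_async(spm_buf[nxt], main_mem + (b + 1) * BLOCK,
                               sizeof(spm_buf[nxt]), nxt);
            dma_wait(cur);            /* ensure the current block has arrived */
            process(spm_buf[cur], BLOCK); /* compute while next DMA is in flight */
        }
    }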

Scratchpads are not used in mainstream desktop processors, where generality is required so that legacy software can run from generation to generation even as the available on-chip memory size changes. They are better suited to embedded systems, special-purpose processors and games consoles, where chips are often manufactured as an MPSoC, and where software is often tuned to one hardware configuration.

Examples of use

* The SH-2 and SH-4 used in Sega's consoles could lock cache lines to an address outside of main memory, for use as a scratchpad.

* The Sony PS1's R3000 had a scratchpad instead of an L1 data cache. It was possible to place the CPU stack here, an example of the temporary-workspace usage.

* Sony's PS2 Emotion Engine, a customized MIPS (R5900) core, employed a 16 KiB scratchpad, to and from which DMA transfers could be issued to its GS and to main memory.

* The Cell's SPEs are restricted purely to working in their "local store", relying on DMA for transfers from/to main memory and between local stores, much like a scratchpad (a minimal DMA sketch follows this list). In this instance, an additional benefit derives from the lack of hardware to check and update coherence between multiple caches: the design takes advantage of the assumption that each processor's workspace is separate and private. This benefit is expected to become more noticeable as the number of processors scales into the "many-core" future.

* Many other processors allow L1 cache lines to be locked.

* Most DSPs use a scratchpad. Many past 3D accelerators and games machines (including the PS2) have used DSPs for vertex transformations. This contrasts with the stream-based approach of modern GPUs, which have more in common with a CPU cache in function.

* NVIDIA's 8800 GPU running under CUDA provides 16 KiB of scratchpad ("shared memory" in CUDA terms) per thread block when used for GPGPU tasks.

* Ageia's PhysX chip utilizes scratchpad RAM in a manner similar to the Cell, the theory being that a cache hierarchy is of less use than software-managed memory for physics and collision calculations. These memories are also banked, and a switch fabric manages transfers between them.
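
The Cell entry above can be made concrete. On an SPE, data movement is explicit through the memory flow controller (MFC); the following minimal C sketch uses the DMA intrinsics from the IBM Cell SDK's spu_mfcio.h (the buffer size and effective address are illustrative):

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 3                     /* DMA tag group, 0..31 */

    /* Local-store buffer; everything an SPE touches must live here. DMA
     * addresses must be at least 16-byte aligned (128 for best performance). */
    static uint8_t ls_buf[4096] __attribute__((aligned(128)));

    void fetch_and_process(uint64_t ea /* effective address in main storage */) {
        /* Pull 4 KiB from main storage into the local store... */
        mfc_get(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);

        /* ...and block until every transfer in tag group TAG completes. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        /* compute on ls_buf at local-store latency, then write results back */
        mfc_put(ls_buf, ea, sizeof(ls_buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }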

Alternatives

Cache control vs Scratchpads

Many architectures, such as PowerPC, attempt to avoid the need for cache-line locking or scratchpads through the use of cache control instructions. By marking an area of memory with "Data Cache Block Zero" (allocating a line but setting its contents to zero instead of loading it from main memory) and discarding it after use ("Data Cache Block Invalidate", signaling that main memory need not receive any updated data), the cache is made to behave as a scratchpad. Generality is maintained because these are hints: the underlying hardware will function correctly regardless of the actual cache size.
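
A hedged C sketch of this pattern, using GCC-style inline assembly, follows. The cache block size is implementation-specific (32 bytes is assumed here), and dcbi is a supervisor-level instruction on most PowerPC implementations, so supervisor context is assumed; user-level code would flush with dcbf or simply let the lines age out instead:

    #include <stddef.h>

    #define CACHE_BLOCK 32u  /* assumed PowerPC cache block size; varies by chip */

    /* Claim 'n' bytes at 'p' as zeroed cache lines without reading main memory. */
    static void cache_claim_zero(void *p, size_t n) {
        for (size_t off = 0; off < n; off += CACHE_BLOCK)
            __asm__ volatile ("dcbz 0,%0" :: "r"((char *)p + off) : "memory");
    }

    /* Discard the lines after use so they are never written back.
     * dcbi is typically privileged; supervisor context assumed here. */
    static void cache_discard(void *p, size_t n) {
        for (size_t off = 0; off < n; off += CACHE_BLOCK)
            __asm__ volatile ("dcbi 0,%0" :: "r"((char *)p + off) : "memory");
    }

    /* Usage: treat a buffer as scratch without ever touching main memory. */
    void scratch_example(void) {
        static char buf[256] __attribute__((aligned(32)));
        cache_claim_zero(buf, sizeof(buf)); /* lines allocated, zero-filled  */
        /* ... use buf as temporary workspace at cache latency ... */
        cache_discard(buf, sizeof(buf));    /* never written back to memory  */
    }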

Shared L2 vs Cell local stores

Regarding interprocessor communication in a multi-core setup, there are similarities between the Cell's inter-local-store DMA and a shared L2 cache setup, as in the Core 2 Duo or the Xbox 360's custom PowerPC: the L2 cache allows processors to share results without those results having to be committed to main memory. This can be an advantage where the working set for an algorithm encompasses the entirety of the L2. However, when a program can be written to take advantage of inter-local-store DMA, the Cell has the benefit of each other local store serving as both the private workspace of a single processor and the point of sharing between processors; that is, viewed from one processor, the other local stores are on a similar footing to the shared L2 in a conventional chip. The trade-off is memory wasted on buffering, and programming complexity for synchronization, though this would be similar to precached pages in a conventional chip. Domains where using this capability is effective include:

* Pipeline processing (where one achieves the same effect as an increased L1 size by splitting one job into smaller chunks).

* Extending the working set, e.g. a sweet spot for a merge sort where the data fits within 8×256 KiB.

* Sharing code uploads, e.g. loading a piece of code to one SPU, then copying it from there to the others to avoid hitting main memory again (see the sketch below).
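
The third item can be sketched from the SPE side. Once the PPE has mapped each SPE's local store into the effective address space (e.g. with libspe2's spe_ls_area_get()) and distributed those addresses, one SPE can DMA code or data directly out of another's local store; the buffer layout and the way addresses are passed in are assumptions here:

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TAG 5

    /* Destination region in this SPE's local store for the incoming upload. */
    static uint8_t code_buf[8192] __attribute__((aligned(128)));

    /* 'peer_ls_ea' is the effective address at which a neighbouring SPE's
     * local store is mapped (obtained on the PPE and passed in via program
     * arguments or a mailbox; assumed here). 'offset' is where the already
     * uploaded code sits inside that local store. */
    void fetch_code_from_peer(uint64_t peer_ls_ea, uint32_t offset) {
        /* DMA directly from the peer's local store; main memory is untouched. */
        mfc_get(code_buf, peer_ls_ea + offset, sizeof(code_buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
        /* code_buf now holds the shared upload at local-store latency */
    }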

It would be possible for a conventional processor to gain similar advantages with cache control instructions, e.g. allowing prefetching to L1 that bypasses L2, or an eviction hint that signaled a transfer from L1 to L2 without committing to main memory. At present, however, no systems offer this capability in a usable form, and such instructions would in effect mirror the explicit transfer of data among the cache areas used by each core.

