Chipkill


Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology. It protects computer memory systems from the failure of any single memory chip, and from multi-bit errors in any portion of a single memory chip. One simple scheme scatters the bits of a Hamming code ECC word across multiple memory chips, so that the failure of any one chip affects only one bit of each ECC word. This allows memory contents to be reconstructed even when one chip fails completely. Typical implementations use more advanced codes, such as BCH codes, that can correct multiple bits with less overhead. The equivalent system from Sun Microsystems is called Extended ECC, and the equivalent from HP is called Chipspare; a similar system from Intel is called SDDC.
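The simple scattering scheme can be sketched in a few lines. This is a minimal illustration, not IBM's implementation. It assumes a toy Hamming(7,4) code instead of the wider codes used in real memory, and a hypothetical layout of seven chips in which chip i stores bit i of every codeword. A simulated chip failure inverts every bit that chip returns. Because that corrupts exactly one bit per codeword, the single-error-correcting decoder recovers all the data.

```python
# Toy model: Hamming(7,4) codewords scattered so that each chip holds one bit
# of every word. NUM_CHIPS and the layout are illustrative assumptions.
NUM_CHIPS = 7


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        c[syndrome - 1] ^= 1  # a single error sits at position == syndrome
    return [c[2], c[4], c[5], c[6]]


def write_words(words):
    """Scatter each codeword across chips: chip i stores bit i of every word."""
    chips = [[] for _ in range(NUM_CHIPS)]
    for data in words:
        for i, bit in enumerate(hamming74_encode(data)):
            chips[i].append(bit)
    return chips


def read_words(chips, failed_chip=None):
    """Gather bit i of each word from chip i; a failed chip returns inverted bits."""
    out = []
    for w in range(len(chips[0])):
        code = []
        for i in range(NUM_CHIPS):
            bit = chips[i][w]
            if i == failed_chip:
                bit ^= 1  # simulated chip failure: every bit it returns is wrong
            code.append(bit)
        out.append(hamming74_decode(code))
    return out


if __name__ == "__main__":
    words = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
    chips = write_words(words)
    for failed in range(NUM_CHIPS):
        assert read_words(chips, failed_chip=failed) == words
    print("data recovered after any single chip failure")
```

The scattering is what makes this work. If one chip held two bits of the same codeword, its failure would cause a double-bit error, and a single-error-correcting Hamming code would silently "correct" it to the wrong data. Handling that case is why practical designs move to stronger codes such as BCH.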

Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or exceeds a threshold of bit errors), a spare memory chip is used in place of the failed one. The concept is similar to RAID, which protects against disk failure, but applied to individual memory chips. IBM developed the technology in the early-to-mid 1990s. An important RAS (reliability, availability and serviceability) feature, Chipkill technology is deployed primarily on SSDs, mainframes and midrange servers.
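The steering decision can be modeled as a small piece of controller state. The sketch below is hypothetical. The class name, the single spare chip, and the rule "steer once corrected errors exceed the threshold" are assumptions made for illustration, not a description of IBM's hardware. It only tracks which logical chip is served by the spare. A real controller would also copy the retired chip's contents onto the spare, using ECC-corrected reads.

```python
class BitSteeringController:
    """Hypothetical model of dynamic bit-steering onto a single spare chip.

    Chips 0..num_chips-1 hold data and ECC bits; index num_chips is the spare.
    """

    def __init__(self, num_chips, threshold):
        self.spare = num_chips       # assumed: the spare's index follows the other chips
        self.threshold = threshold   # corrected errors tolerated before steering
        self.error_counts = [0] * num_chips
        self.retired = None          # chip currently replaced by the spare, if any

    def _steer(self, chip):
        """Retire `chip` if the spare is still free; return True on success."""
        if self.retired is None:
            # Real hardware would now rebuild the chip's contents on the spare
            # from ECC-corrected reads; this sketch only records the mapping.
            self.retired = chip
            return True
        return False

    def report_corrected_error(self, chip):
        """Count a corrected error; steer once the count exceeds the threshold."""
        self.error_counts[chip] += 1
        if self.error_counts[chip] > self.threshold and self.retired != chip:
            return self._steer(chip)
        return False

    def report_chip_failure(self, chip):
        """A hard chip failure is steered immediately, without waiting for a count."""
        return self._steer(chip)

    def route(self, chip):
        """Map a logical chip index to the physical chip that now serves it."""
        return self.spare if chip == self.retired else chip


if __name__ == "__main__":
    ctl = BitSteeringController(num_chips=7, threshold=3)
    print([ctl.report_corrected_error(2) for _ in range(4)])  # fourth error triggers steering
    print(ctl.route(2), ctl.route(5))
```

With `threshold=3`, the fourth corrected error on chip 2 retires it, and accesses to chip 2 are routed to the spare at index 7 while the other chips are unaffected. Having only one spare mirrors the RAID hot-spare analogy. Once the spare is in use, the sketch refuses further steering and relies on ECC alone.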

A 2009 paper using data from Google's datacentres [1] provided evidence that, in Google's systems, DRAM errors recurred at the same location and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month". DIMMs with chipkill error correction showed a lower fraction of DIMMs reporting uncorrectable errors than DIMMs whose error-correcting codes can correct only single-bit errors.

References

  1. Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). "DRAM errors in the wild: a large-scale field study". Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '09). ACM. pp. 193–204. doi:10.1145/1555349.1555372. http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf. Retrieved 7 September 2011.

Wikimedia Foundation. 2010.
