- Cell (microprocessor)
Cell is a
microprocessor architecture jointly developed by Sony Computer Entertainment, Toshiba, and IBM, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas over a four-year period beginning March 2001 on a budget reported by IBM as approaching US$400 million. [cite web|url=http://ps3.qj.net/Cell-Designer-talks-about-PS3-and-IBM-Cell-Processors/pg/49/aid/14805|title=Cell Designer talks about PS3 and IBM Cell Processors|accessdate=2007-03-22] Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated "CBEA" in full or "Cell BE" in part. Cell combines a general-purpose Power Architecture core of modest performance with streamlined coprocessing elements [cite web|url=http://www.research.ibm.com/people/m/mikeg/papers/2006_ieeemicro.pdf|title=Synergistic Processing in Cell's Multicore Architecture|publisher=IEEE|accessdate=2007-03-22|format=PDF] which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's
PlayStation 3 game console. Mercury Computer Systems has a dual Cell server, a dual Cell blade configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba has announced plans to incorporate Cell in high definition television sets. Exotic features such as the XDR memory subsystem and coherent Element Interconnect Bus (EIB) interconnect [cite web|url=http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T2.pdf|title=Cell Broadband Engine Interconnect and Memory Interface|publisher=IBM|accessdate=2007-03-22|format=PDF] appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point kernels. IBM has announced plans to incorporate Cell processors as add-on cards into IBM System z9 mainframes, to enable them to be used as servers for MMORPGs. [cite web|url=http://www-03.ibm.com/press/us/en/pressrelease/21433.wss|title=Cell Broadband Engine Project Aims to Supercharge IBM Mainframe for Virtual Worlds|publisher=IBM|date=2007-04-26]
The Cell architecture includes a novel
memory coherence architecture for which IBM received many patents. The architecture emphasizes performance per watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. [cite news|url=http://news.com.com/Octopiler+seeks+to+arm+Cell+programmers/2100-1007_3-6042132.html|title=Octopiler seeks to arm Cell programmers|first=Stephen|last=Shankland|publisher=CNET|date=2006-02-22 |accessdate=2007-03-22] IBM provides a comprehensive Linux-based Cell development platform to assist developers in confronting these challenges. [cite news|url=http://lwn.net/Articles/159564/|title=Cell Broadband Engine Software Development Kit Version 1.0|publisher=LWN|date=2005-11-10 |accessdate=2007-03-22] Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation. [cite web|url=http://www.cs.berkeley.edu/~samw/research/papers/cf06.pdf|title=The Potential of the Cell Processor for Scientific Computing|coauthors=Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick|publisher=Computational Research Division, Lawrence Berkeley National Laboratory|accessdate=2007-03-18|format=PDF]
In November 2006,
David A. Bader at Georgia Tech was selected by Sony, Toshiba, and IBM from more than a dozen universities to direct the first STI Center of Competence for the Cell Processor. [cite news|first=Gary|last=Goettling|url=http://gtalumni.org/Publications/magazine/win07/research.html|title=Power Cell|work=Georgia Tech Alumni Magazine Online|publisher=Georgia Tech Alumni Association|date=Winter 2007|accessdate=2007-03-22] [cite pressrelease|url=http://www-03.ibm.com/industries/media/doc/content/news/pressrelease/1875614111.html|title=College of computing at Georgia tech selected as the first Sony-Toshiba-IBM center of competence focused on the cell processor|publisher=IBM|date=2006-11-05|accessdate=2007-03-22] [cite news|first=Bob|last=Keefe|url=http://www.cc.gatech.edu/~bader/news/AustinAmericanStatesman-061114.pdf|title=Georgia, not Austin, gets chip center|work=Austin American Statesman|date=2006-11-14 |accessdate=2007-03-22|format=PDF] This partnership is designed to build a community of programmers and broaden industry support for the Cell processor. A Cell programming tutorial video is available. [cite web|url=http://www.cc.gatech.edu/~bader/CellProgramming.html|title=One-Day IBM Cell Programming Workshop at Georgia Tech: Streaming Presentation of the full-day workshop|publisher=Georgia Tech College of Computing|accessdate=2007-03-22]
History
In 2000,
Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance ("STI") to design and manufacture the processor.
The STI Design Center opened in March 2001. [cite news | title=Introduction to the Cell multiprocessor | publisher=IBM Journal of Research and Development | url=http://researchweb.watson.ibm.com/journal/rd/494/kahle.html|date=2005-08-07 |accessdate=2007-03-22] The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.
During this period, IBM filed many
patents pertaining to the Cell architecture, manufacturing process, and software environment. An early patent version of the Broadband Engine was shown to be a chip package comprising four "Processing Elements", the patent's term for what is now known as the "Power Processing Element". Each Processing Element contained 8 "APUs", which are now referred to as SPEs on the current Broadband Engine chip. This chip package was widely regarded as running at a clock speed of 4 GHz; with 32 APUs providing 32 GFLOPS each, the Broadband Engine was shown to have 1 teraFLOPS of raw computing power. This design was fabricated using a 90 nm SOI process.
In March 2007 IBM announced that the 65 nm version of Cell BE was in production at its plant in
East Fishkill, New York. [cite web | title=IBM Produces Cell Processor Using New Fabrication Technology. | url=http://www.xbitlabs.com/news/cpu/display/20070312121941.html | publisher=X-bit labs | accessmonthday=March 12 | accessyear=2007] [cite news|url=http://www.psu.com/node/7409|title=65nm CELL processor production started|publisher=PlayStation Universe|date=2007-01-30 |accessdate=2007-05-18]
In February 2008, IBM announced that it would begin to fabricate Cell processors with the 45 nm process. [http://arstechnica.com/news.ars/post/20080207-ibm-shrinks-cell-to-45nm-cheaper-ps3s-will-follow.html IBM shrinks Cell to 45nm. Cheaper PS3s will follow]
In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor, the PowerXCell 8i, at the 65 nm feature size. [cite web | title=IBM Offers Higher Performance Computing Outside the Lab | url=http://www-03.ibm.com/press/us/en/pressrelease/24180.wss | publisher=IBM | accessmonthday=May 15 | accessyear=2008]
In May 2008, an
Opteron- and Cell BE-based supercomputer, the IBM Roadrunner system, became the world's first and thus far only system to achieve one petaFLOPS. The Cell BE-based Roadrunner system is currently the world's fastest supercomputer as represented by the Top500 list. The world's three most energy-efficient supercomputers, as represented by the Green500 list, are similarly based on the PowerXCell 8i.
Commercialization
On
May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the forthcoming PlayStation 3 console. [cite news|first=David|last=Becker|url=http://news.com.com/PlayStation+3+chip+has+split+personality/2100-1043_3-5566340.html?tag=nl|title=PlayStation 3 chip has split personality|work=CNET |date=2005-02-07 |accessdate=2007-05-18] [cite news|url=http://www.windowsitpro.com/Articles/ArticleID/46431/46431.html?Ad=1|title=Sony Ups the Ante with PlayStation 3|first=Paul|last=Thurrott|publisher=WindowsITPro|date=2005-05-17 |accessdate=2007-03-22] [cite news|url=http://gear.ign.com/articles/615/615521p1.html|title=E3 2005: Cell Processor Technology Demos|first=Chris|last=Roper|publisher=IGN|date=2005-05-17 |accessdate=2007-03-22] This Cell configuration will have one Power processing element (PPE) on the core, with eight physical SPEs in silicon. In the PlayStation 3, one SPE is locked out during the test process, a practice which helps to improve manufacturing yields, and another one is reserved for the OS, leaving 6 free SPEs to be used by games' code. The target clock frequency at introduction is 3.2 GHz. The introductory design is fabricated using a 90-nanometre SOI process, with initial volume production slated for IBM's facility in East Fishkill, New York.
Note that the relationship between cores and threads is a common source of confusion. The PPE core is dual threaded and manifests in software as two independent threads of execution, while each active SPE manifests as a single thread. In the PlayStation 3 configuration as described by Sony, the Cell processor provides nine independent threads of execution.
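The thread count above follows from simple accounting, which the short Python sketch below spells out. The constant and function names are illustrative only; the figures themselves are the ones quoted in the preceding paragraphs.

```python
# Thread accounting for the PlayStation 3 configuration described above.
PPE_THREADS = 2          # the PPE core is dual threaded
PHYSICAL_SPES = 8        # SPEs present in silicon
LOCKED_OUT_SPES = 1      # disabled during test to improve yield
OS_RESERVED_SPES = 1     # set aside for the operating system


def hardware_threads(physical_spes=PHYSICAL_SPES, locked_out=LOCKED_OUT_SPES,
                     ppe_threads=PPE_THREADS):
    """Independent threads of execution: PPE threads plus one per active SPE."""
    active_spes = physical_spes - locked_out
    return ppe_threads + active_spes


def spes_available_to_games(physical_spes=PHYSICAL_SPES,
                            locked_out=LOCKED_OUT_SPES,
                            os_reserved=OS_RESERVED_SPES):
    """SPEs left for game code after the locked-out and OS-reserved ones."""
    return physical_spes - locked_out - os_reserved


if __name__ == "__main__":
    print("hardware threads:", hardware_threads())
    print("SPEs for games:", spes_available_to_games())
```

Note that the OS-reserved SPE still counts as an independent thread of execution; it is simply not available to game code.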
On
June 28, 2005, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications. [cite news|url=http://www.supercomputingonline.com/article.php?sid=13477|title=Mercury Wins IBM PartnerWorld Beacon Award|publisher=Supercomputing Online|date=2007-04-12 |accessdate=2007-05-18] Mercury has since released blades, conventional rack servers and PCI Express accelerator boards with Cell processors.
In the fall of 2006, IBM released the QS20 blade module, which uses two Cell BE processors and reaches a peak of 410 gigaFLOPS per module. The
QS22, based on the PowerXCell 8i processor, is used for the IBM Roadrunner supercomputer. Mercury and IBM use the fully utilized Cell processor with 8 active SPEs. On April 8, 2008, Fixstars Corporation released a PCI Express accelerator board based on the PowerXCell 8i processor. [cite web | title=Fixstars Releases Accelerator Board Featuring the PowerXCell 8i | publisher=Fixstars Corporation | url=http://www.fixstars.com/en/company/press/20080403.html|date=2008-04-08 ]
Overview
The Cell Broadband Engine—or "Cell" as it is more commonly known—is a microprocessor designed to bridge the gap between conventional desktop processors (such as the
Athlon 64 and Core 2 families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics processors (GPUs). The longer name indicates its intended use, namely as a component in current and future digital distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as computer entertainment systems for the HDTV era. Additionally the processor may be suited to digital imaging systems (medical, scientific, etc.) as well as physical simulation (e.g., scientific and structural engineering modeling).
In a simple analysis, the Cell processor can be split into four components: external input and output structures, the main processor called the "Power Processing Element" (PPE) (a two-way simultaneous multithreaded Power ISA v.2.03 compliant core), eight fully functional co-processors called the "Synergistic Processing Elements", or SPEs, and a specialized high-bandwidth
circular data bus connecting the PPE, input/output elements and the SPEs, called the "Element Interconnect Bus" or EIB.
To achieve the high performance needed for mathematically intensive tasks, such as decoding/encoding
MPEG streams, generating or transforming three-dimensional data, or undertaking Fourier analysis of data, the Cell processor marries the SPEs and the PPE via EIB to give access, via fully cache coherent DMA (direct memory access), to both main memory and to other external data storage. To make the best of EIB, and to overlap computation and data transfer, each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine. Since the SPE's load/store instructions can only access its own local memory, each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories. A DMA operation can transfer either a single block area of size up to 16KB, or a list of 2 to 2048 such blocks. One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer, with a view to enabling maximal asynchrony and concurrency in data processing inside a chip. [cite web | title= Chip multiprocessing and the cell broadband engine | first=Michael |last=Gschwind | publisher= ACM | url=http://portal.acm.org/citation.cfm?id=1128023 | year= 2006 | accessdate= 29 June | accessyear= 2008]
The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. To this end the PPE has additional instructions relating to control of the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. Despite having
Turing complete architectures, the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. Though most of the "horsepower" of the system comes from the synergistic processing elements, the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge to software developers who wish to make the most of this horsepower, demanding careful hand-tuning of programs to extract maximal performance from this CPU.
The PPE and bus architecture include various modes of operation giving different levels of
memory protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.
Both the PPE and SPE are
RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit AltiVec register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 128 bits in size or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values for a theoretical address range of 2^64 bytes (16,777,216 terabytes). In practice, not all of these bits are implemented in hardware. Local store addresses internal to the SPU processor are expressed as a 32-bit word. In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
PowerXCell 8i Variant
In 2008, IBM announced a revised variant of the Cell called the PowerXCell 8i, which is available in QS22 Blade Servers from IBM. The PowerXCell is manufactured on a
65 nm process, and adds support for up to 32 GB of slotted DDR2 memory, as well as dramatically improving double-precision floating-point performance on the SPEs from a peak of about 14 GFLOPS to 102 GFLOPS total for 8 SPEs. The IBM Roadrunner supercomputer, currently the world's fastest, consists of nearly 13,000 PowerXCell 8i processors, along with almost 7,000 AMD Opteron processors. [cite web | title= IBM announces PowerXCell 8i, QS22 blade server | publisher= Beyond3D | url= http://www.beyond3d.com/content/news/640 |month= May | year= 2008 | accessdate= 10 June | accessyear= 2008]
Influence and contrast
In some ways the Cell system resembles early
Seymour Cray designs in reverse. The famed CDC 6600 used a single very fast processor to handle the mathematical calculations, while a series of ten slower systems were given smaller programs to keep the main memory fed with data. In the Cell the problem has been reversed: reading the data is no longer the difficult problem due to the complex encodings used in industry; today the problem is efficiently decoding that data into an ever-less-compressed version as quickly as possible.
A more recent equivalent is the
Parallax Propeller, which has eight "cog" processors controlled by a single "hub"; however this is designed more for flexibility and low chip count in embedded situations than for high performance.
Modern
graphics cards and GPUs (including recent GPUs developed for GPGPU) have multiple processing elements similar to the SPEs, known as shader units, with an attached high speed scratchpad RAM and shared cache. Programs, known as "shaders", are loaded onto the units to process the input data streams fed from the previous stages (possibly the CPU), according to the required operations.
The main differences between Cell and traditional GPUs are:
# the Cell's SPEs are much more general purpose than traditional shader units, both in instruction set and in methods to transfer data among processing units, enabling programs to flexibly and dynamically chain the SPEs under program control, through high-bandwidth DMAs;
# that a shader unit in a GPU usually allows for more than one hardware thread with shared local memory whereas an SPE uses (cooperative) software threads within a single hardware thread;
# that a GPU as a whole assumes the use of shared data (texture data) stored in a relatively large shared cache (one for each group of shader units) mapped from global memory, following the framework of the standard memory hierarchy with partitioned large L2 cache for multiple cores, which is a less flexible way of sharing data among processing units than DMA-based data sharing in Cell;
# that, as a result, Cell can efficiently handle more complex programming tasks including graphics, sound, or essentially any other workload which consists of multiple compute-intensive parallel tasks and their interactions, in comparison with GPUs. At the same time, GPUs are much more economical in their transistor usage for specific tasks and task patterns for which they are tailored, such as those found in the standard rendering pipeline.
While these comparisons are generally relevant for recent GPUs developed for general-purpose GPU computing such as AMD's FireStream and Intel's coming Larrabee, these new GPUs enjoy a general-purpose instruction set and a more flexible cache sharing scheme than traditional GPUs (for example through pre-fetch and eviction hint instructions in the case of Larrabee), giving a middle ground between Cell and traditional GPUs as general-purpose parallel high-performance computing platforms.
Architecture
While the Cell chip can have a number of different configurations, the basic configuration is a multi-core chip composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE"). [cite news | title=Cell Microprocessor Briefing | publisher=IBM, Sony Computer Entertainment Inc., Toshiba Corp. | url=http://pc.watch.impress.co.jp/docs/2005/0208/kaigai153.htm |date=7 February 2005 ] The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB"). Due to the nature of its applications, Cell is optimized towards single precision floating point computation. The SPEs are capable of performing double precision calculations, albeit with an order of magnitude performance penalty. New chips expected mid-2008 are rumored to boost SPE double precision performance as high as 5x over pre-2008 designs. In the meantime, there are ways to circumvent this in software using iterative refinement, which means values are calculated in double precision only when necessary. Jack Dongarra and his team [http://www.netlib.org/lapack/lawnspdf/lawn175.pdf demonstrated] a 3.2 GHz Cell with 8 SPEs delivering a performance equal to 100 GFLOPS on an average double precision Linpack 4096x4096 matrix.
Power Processor Element
The "PPE" is the
Power Architecture based, two-way multithreaded core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 32 KiB instruction and a 32 KiB data Level 1 cache and a 512 KiB Level 2 cache. The size of a cache line is 128 bytes. Additionally, IBM has included an AltiVec unit [cite news | title=Power Efficient Processor Design and the Cell Processor | publisher=IBM | url=http://www.cerc.utexas.edu/vlsi-seminar/spring05/slides/2005.02.16.hph.pdf |date=16 February 2005 |format=PDF] which is fully pipelined for single precision floating point. (AltiVec does not support double precision floating-point vectors.) Each PPU can complete two double precision operations per clock cycle using a scalar fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle with a vector fused multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz. [cite web | title= Cell Broadband Engine Architecture and its first implementation | publisher=IBM developerWorks | url= http://www-128.ibm.com/developerworks/power/library/pa-cellperf/|date=November 29 2005 | accessdaymonth=6 April | accessyear= 2006]
Synergistic Processing Elements (SPE)
Each SPE is composed of a "Synergistic Processing Unit", SPU, and a "Memory Flow Controller", MFC (DMA, MMU, and bus interface). [cite web | title=IBM Research - Cell | work=IBM | url=http://www.research.ibm.com/cell/ | accessdaymonth=11 June | accessyear=2005] An SPE is a
RISC processor with 128-bit SIMD organization [cite web | title= Synergistic Processing in Cell's Multicore Architecture | publisher= IEEE Micro | url= http://citeseer.ist.psu.edu/cache/papers/cs2/624/http:zSzzSzwww.research.ibm.comzSzpeoplezSzmzSzmikegzSzpaperszSz2006_ieeemicro.pdf/gschwind06synergistic.pdf |month= March | year= 2006 | accessdate=1 November | accessyear= 2006|format=PDF] [cite web | title= A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor | publisher=Hot Chips 17 | url= http://www.hotchips.org/archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf |date=August 15 2005 | accessdate=1 January | accessyear= 2006|format=PDF] for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM for instruction and data, called "Local Storage" (not to be mistaken for "Local Memory" in Sony's documents that refer to the VRAM) which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB of local store memory. The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit, 128-entry register file and measures 14.5 mm² on a 90 nm process. An SPE can operate on 16 8-bit integers, 8 16-bit integers, 4 32-bit integers, or 4 single precision floating-point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a
set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.
Compared to a modern
personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. However, comparing only floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors. The Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created. For double-precision floating point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude, but still reaches 14 GFLOPS (the PowerXCell 8i variant, which was specifically designed for double-precision, reaches 102.4 GFLOPS in double-precision calculations [cite web | title= Cell successor with turbo mode - PowerXCell 8i | publisher= PPCNux | url= http://www.ppcnux.com/?q=node/7144 |month= November | year= 2007 | accessdate= 10 June | accessyear= 2008]).
Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel matrix multiplication.
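The per-SPE single precision figure quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that the peak corresponds to one 4-wide fused multiply-add per clock, with each multiply-add counted as two floating-point operations; that counting convention matches the PPE figures given earlier but is an assumption here, and the function name is illustrative. The eight-SPE aggregate is derived by the sketch rather than quoted from the text.

```python
SIMD_LANES = 4      # single precision values per 128-bit register
FLOPS_PER_FMA = 2   # assumption: a fused multiply-add counts as two operations


def peak_gflops(clock_ghz, lanes=SIMD_LANES, flops_per_lane=FLOPS_PER_FMA,
                units=1):
    """Theoretical peak in GFLOPS for `units` identical SIMD units."""
    return clock_ghz * lanes * flops_per_lane * units


per_spe = peak_gflops(3.2)               # one SPE at 3.2 GHz
eight_spes = peak_gflops(3.2, units=8)   # all eight SPEs together
sustained = 0.98 * per_spe               # the 98%-of-peak result cited above

if __name__ == "__main__":
    print(f"per SPE:    {per_spe:.1f} GFLOPS")
    print(f"eight SPEs: {eight_spes:.1f} GFLOPS")
    print(f"98% of one: {sustained:.2f} GFLOPS")
```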
Toshiba has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine, designed to accelerate 3D and movie effects in consumer electronics.
Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants. The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'.
The EIB is presently implemented as a circular ring comprised of four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate the effective channel rate is 16 bytes every two system clocks. At maximum
concurrency, with three active transactions on each of the four rings, the peak "instantaneous" EIB bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints which are discussed in the Bandwidth Assessment section below.
IBM Senior Engineer
David Krolak, EIB lead designer, explains the concurrency model:
:"A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track." [cite web|url=http://www-128.ibm.com/developerworks/power/library/pa-expert9/|title=Meet the experts: David Krolak on the Cell Broadband Engine EIB bus|publisher=IBM|date=2005-12-06|accessdate=2007-03-18]
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity often regarded as 8B per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
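As a rough illustration of what queuing a long transfer involves, the Python sketch below breaks a transfer into pieces that respect the limits given in the Overview (a single block of up to 16 KB, or a list of up to 2048 such blocks). It models only those two limits; the alignment and size-granularity rules of the real hardware are deliberately ignored, and the function names are hypothetical rather than part of any Cell SDK.

```python
MAX_BLOCK_BYTES = 16 * 1024   # largest single DMA block, per the Overview
MAX_LIST_BLOCKS = 2048        # most blocks in one DMA list, per the Overview


def split_into_blocks(total_bytes):
    """Return block sizes, each at most 16 KB, that cover total_bytes."""
    if total_bytes <= 0:
        raise ValueError("transfer size must be positive")
    full_blocks, remainder = divmod(total_bytes, MAX_BLOCK_BYTES)
    blocks = [MAX_BLOCK_BYTES] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def plan_dma(total_bytes):
    """Group the blocks into operations of at most 2048 blocks each.

    A group holding one block stands for a plain single-block transfer;
    a group of two or more stands for a DMA list.
    """
    blocks = split_into_blocks(total_bytes)
    return [blocks[i:i + MAX_LIST_BLOCKS]
            for i in range(0, len(blocks), MAX_LIST_BLOCKS)]


if __name__ == "__main__":
    for size in (4096, 40000, 64 * 1024 * 1024):
        plan = plan_dma(size)
        print(size, "bytes ->", len(plan), "operation(s),",
              sum(len(group) for group in plan), "block(s)")
```

Under these limits one DMA list can describe at most 2048 × 16 KB = 32 MB, so anything larger needs more than one queued operation.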
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances "are" detrimental to the overall performance of the EIB as they reduce available concurrency.
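The six-step rule can be expressed as a shortest-path choice on a twelve-position ring. The sketch below is a simplified model: the numbering of positions from 0 to 11 and the direction labels are hypothetical, since the physical ordering of the units is not given here, and arbitration between channels is not modelled.

```python
RING_SIZE = 12               # participants on the EIB
MAX_STEPS = RING_SIZE // 2   # longest distance between any two participants


def route(src, dst, ring_size=RING_SIZE):
    """Return (direction, steps) for the shorter way around the ring.

    Positions are numbered 0..ring_size-1 in the clockwise direction.
    Ties (exactly half way round) are resolved as clockwise.
    """
    if not (0 <= src < ring_size and 0 <= dst < ring_size):
        raise ValueError("position outside the ring")
    clockwise = (dst - src) % ring_size
    counterclockwise = (src - dst) % ring_size
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)


if __name__ == "__main__":
    print(route(0, 3))   # short hop in the clockwise direction
    print(route(0, 7))   # shorter the other way round
    print(route(0, 6))   # exactly half way: either direction takes six steps
```

Because the route never exceeds `MAX_STEPS`, no transfer occupies more than half of a channel, which is what leaves room for other transactions on the same channel.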
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explains:
:"Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth."
Bandwidth assessment
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.
At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak "instantaneous" EIB bandwidth scaled by processor frequency. [cite web|url=http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf|title=Cell Multiprocessor Communication Network: Built for Speed|publisher=IEEE|accessdate=2007-03-22|format=PDF]
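The figures in this paragraph follow directly from the channel width, the half-speed bus clock, and the number of concurrent transactions. A minimal sketch of the arithmetic, with illustrative names, assuming 1 GB = 10^9 bytes:

```python
CHANNEL_WIDTH_BYTES = 16      # each EIB channel is 16 bytes wide
CLOCKS_PER_TRANSFER = 2       # the EIB runs at half the system clock
RINGS = 4                     # unidirectional channels
TRANSACTIONS_PER_RING = 3     # concurrent transactions per channel at best


def channel_rate_gb_s(system_clock_hz):
    """Data rate of one channel in GB/s."""
    return CHANNEL_WIDTH_BYTES * system_clock_hz / CLOCKS_PER_TRANSFER / 1e9


def peak_bytes_per_system_clock():
    """Peak 'instantaneous' EIB throughput in bytes per system clock."""
    return RINGS * TRANSACTIONS_PER_RING * CHANNEL_WIDTH_BYTES // CLOCKS_PER_TRANSFER


def peak_instantaneous_gb_s(system_clock_hz):
    """Abstract EIB bandwidth with all twelve transactions in flight."""
    return RINGS * TRANSACTIONS_PER_RING * channel_rate_gb_s(system_clock_hz)


if __name__ == "__main__":
    clock = 3.2e9
    print(f"per channel: {channel_rate_gb_s(clock):.1f} GB/s")
    print(f"per clock:   {peak_bytes_per_system_clock()} B")
    print(f"abstract:    {peak_instantaneous_gb_s(clock):.1f} GB/s")
```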
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
:"Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128Bx1.6 GHz = 204.8 GB/s."
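The snoop-limited figure in the quotation is the product of the largest transfer per snooped address and the bus clock, which is half the system clock. A short sketch of that calculation, using an illustrative function name:

```python
BYTES_PER_SNOOPED_REQUEST = 128   # most data one snooped address can move
SYSTEM_CLOCKS_PER_BUS_CYCLE = 2   # the EIB runs at half the system clock


def snoop_limited_gb_s(system_clock_hz):
    """Peak EIB data bandwidth when one address is snooped per bus cycle."""
    bus_clock_hz = system_clock_hz / SYSTEM_CLOCKS_PER_BUS_CYCLE
    return BYTES_PER_SNOOPED_REQUEST * bus_clock_hz / 1e9


if __name__ == "__main__":
    for clock_hz in (3.2e9, 4.0e9):
        print(f"{clock_hz / 1e9:.1f} GHz -> "
              f"{snoop_limited_gb_s(clock_hz):.1f} GB/s")
```

The same function evaluated at a 4 GHz system clock gives 256 GB/s, which is why publications that assume the higher clock quote a larger arbitration-limited figure.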
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined and the two IO controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered, the theoretical 204.8 GB/s number most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz, so this number is a fair reflection of practice as well. [http://www-128.ibm.com/developerworks/power/library/pa-cellperf/]
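The competing bandwidth figures quoted in this section can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not an IBM formula: the function name and dictionary keys are illustrative, and the assumption that the EIB runs at half the processor clock is inferred from the quoted "128B x 1.6 GHz" figure for a 3.2 GHz part.

```python
def eib_figures(core_clock_ghz):
    """Return the EIB bandwidth figures discussed above, in GB/s.

    Assumption (inferred from the quoted 1.6 GHz bus figure at 3.2 GHz):
    the EIB is clocked at half the processor frequency.
    """
    bus_clock_ghz = core_clock_ghz / 2
    # Each unit can send and receive 16 bytes per bus cycle.
    channel_gbs = 16 * bus_clock_ghz
    # "Instantaneous" view: twelve concurrent transactions at channel rate.
    instantaneous_gbs = 12 * channel_gbs
    # Arbitration view: one snooped address per bus cycle, up to 128 bytes each.
    snoop_limited_gbs = 128 * bus_clock_ghz
    return {
        "channel_gbs": round(channel_gbs, 1),
        "instantaneous_gbs": round(instantaneous_gbs, 1),
        "snoop_limited_gbs": round(snoop_limited_gbs, 1),
    }


if __name__ == "__main__":
    for clock in (3.2, 4.0):
        print(clock, "GHz:", eib_figures(clock))
```

At 3.2 GHz this yields 25.6, 307.2 and 204.8 GB/s, and at 4 GHz it yields the 384 and 256 GB/s figures found in older publications, which shows that the different numbers come from the same two formulas evaluated at different clock speeds.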
Optical interconnect
Sony is currently developing an optical interconnection technology for use in the device-to-device or internal interface of various types of Cell-based digital consumer electronics and game systems.
Memory controller and I/O
Cell contains a dual-channel next-generation Rambus XIO macro, which interfaces to Rambus XDR memory. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit wide point-to-point path. Five lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. Four inbound and four outbound lanes support memory coherency.
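The memory and I/O figures above follow from simple per-pin and per-lane arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the per-lane FlexIO rate is not stated in the text but is derived here by dividing the quoted directional totals by the lane counts.

```python
def xdr_peak_gbs(gbit_per_pin=3.2, channels=2, bits_per_channel=32):
    """Peak XDR bandwidth in GB/s: pins x per-pin rate, converted from bits to bytes."""
    total_gbit = gbit_per_pin * channels * bits_per_channel
    return total_gbit / 8


def flexio_per_lane_gbs(out_total=36.4, out_lanes=7, in_total=26.0, in_lanes=5):
    """Per-lane FlexIO rate implied by the quoted totals (a derived value)."""
    return out_total / out_lanes, in_total / in_lanes


if __name__ == "__main__":
    print("XDR peak (GB/s):", xdr_peak_gbs())
    out_lane, in_lane = flexio_per_lane_gbs()
    print("FlexIO per outbound lane (GB/s):", round(out_lane, 2))
    print("FlexIO per inbound lane (GB/s):", round(in_lane, 2))
    print("FlexIO total (GB/s):", round(36.4 + 26.0, 1))
```

Both directions work out to the same rate of about 5.2 GB/s per lane, so the asymmetry between inbound and outbound bandwidth comes purely from the five-to-seven split of lanes.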
Possible applications
Video Processing Card
Some companies, such as Leadtek, plan to release a PCI-E card based upon the Cell to allow for "faster than real time" transcoding of H.264, MPEG-2 and MPEG-4 video. [cite web | title=Leadtek demos Cell chip on a card | url=http://www.custompc.co.uk/news/604962/leadtek-demos-cell-chip-on-a-pci-e-card.html]
Blade Server
On 29 August 2007, IBM announced the BladeCenter QS21. Generating a measured 1.05 giga floating point operations per second (GigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS, it is one of the most power-efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera floating point operations per second (TeraFLOPS), and a standard 42U rack over 25.8 TeraFLOPS. [http://www-03.ibm.com/press/us/en/pressrelease/22258.wss IBM Press Release]
On 13 May 2008, IBM announced the BladeCenter QS22. The QS22 introduces the PowerXCell™ 8i processor, with five times the double-precision floating point performance of the QS21 and the capacity for up to 32 GB of DDR2 memory on-blade. [http://www-03.ibm.com/press/us/en/pressrelease/24180.wss IBM Press Release]
Console video games
Sony's PlayStation 3 video game console contains the first production application of the Cell processor, clocked at 3.2 GHz and shipped with seven of its eight SPEs operational, which allows Sony to increase the yield of processor manufacture. Only six of the seven SPEs are accessible to developers, as one is reserved by the OS. [cite news |title=Optimizing Cell Core | author=Martin Linklater | work=Game Developer Magazine, April 2007 | pages=15-18 | quote=To increase fabrication yelds, Sony ships PlayStation 3 Cell processors with only seven working SPEs. And from those seven, one SPE will be used by the operating system for various tasks, This leaves six SPEs for game programmer to use.]
Home cinema
Reportedly, Toshiba is considering producing HDTVs using Cell. They have already presented a system to decode 48 standard-definition MPEG-2 streams simultaneously on a 1920×1080 screen. [cite news | title=Toshiba Demonstrates Cell Microprocessor Simultaneously Decoding 48 MPEG-2 Streams | publisher=Tech-On! | url=http://techon.nikkeibp.co.jp/english/NEWS_EN/20050425/104149/?ST=english |date=April 25 2005] [cite news | title= Winner: Multimedia Monster | publisher = IEEE Spectrum | url=http://www.spectrum.ieee.org/jan06/2609| date =1 January 2006] This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Supercomputing
IBM's latest supercomputer, IBM Roadrunner, is a hybrid of general-purpose CISC Opteron processors and Cell processors. This system assumed the #1 spot on the June 2008 Top 500 list as the first supercomputer to run at petaFLOPS speeds, having sustained 1.026 petaFLOPS on the standard LINPACK benchmark. IBM Roadrunner uses the PowerXCell 8i version of the Cell processor, manufactured using 65 nm technology, with enhanced SPUs that can handle double-precision calculations in the 128-bit registers, reaching 102 double-precision GFLOPS per chip. [cite web | title=Beyond a Single Cell | url=http://www.cs.utk.edu/~dongarra/cell2006/cell-slides/04-Ken-Koch.pdf | publisher=Los Alamos National Laboratory |accessmonthday=October 25 | accessyear=2006|format=PDF] [cite web | title=The Potential of the Cell Processor for Scientific Computing | url=http://repositories.cdlib.org/cgi/viewcontent.cgi?article=4262&context=lbnl | publisher= ACM Computing Frontiers |accessdate=October | accessyear=2006]
Cluster computing
Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. The Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth. [cite web | title=SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3 | url=http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/scop3.pdf | publisher=Computer Science Department, University of Tennessee | accessmonthday=May 8 | accessyear=2007|format=PDF] Terra Soft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux pre-installed, an implementation of Dongarra's research. As reported by Wired Magazine on October 17, 2007, astrophysicist Dr. Gaurav Khanna replaced time used on supercomputers with a cluster of eight PlayStation 3s. [cite web |title=Astrophysicist Replaces Supercomputer with Eight PlayStation 3s|url=http://www.wired.com/techbiz/it/news/2007/10/ps3_supercomputer/ |publisher=Wired Magazine |accessdate=2007-10-17] In 2007, the computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra in Barcelona deployed a BOINC system called PS3GRID [cite web | title=PS3GRID.net | url=http://www.ps3grid.net] for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.
Distributed Computing
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@Home has been recognized by Guinness World Records as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS, which had never been reached before by a distributed computing network. Additionally, the collective efforts enabled PS3 alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world's second most powerful supercomputer at the time, IBM's BlueGene/L, performed at around 478.2 teraFLOPS. This means Folding@Home's computing power is approximately twice BlueGene/L's (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home).
Mainframes
IBM announced on April 25, 2007 that it would begin integrating its Cell Broadband Engine Architecture microprocessors into the company's line of mainframes. [cite web|url=http://www.eweek.com/article2/0,1895,2122352,00.asp?kc=EWEWKEMLP042807BOE1|title=IBM Mainframes Go 3-D|publisher=eWeek |date=2007-04-26 |accessdate=2007-05-18]
Software engineering
Because of the flexible nature of the Cell, its resources can be used in several ways, not limited to just different computing paradigms: [cite news | title=CELL: A New Platform for Digital Entertainment | publisher=Sony Computer Entertainment Inc. | url=http://www.research.scea.com/research/html/CellGDC05/ |date=March 9 2005]
Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
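The job queue model can be sketched on an ordinary computer using threads in place of SPEs. This is only an analogy under stated assumptions, not Cell SDK code: `spe_mini_kernel`, `run_jobs` and the worker count are hypothetical names and values chosen for illustration, and Python threads stand in for SPE hardware contexts.

```python
import queue
import threading

NUM_SPES = 6  # hypothetical number of simulated SPEs


def spe_mini_kernel(spe_id, jobs, results):
    """Loop run by each simulated SPE: fetch a job, execute it, report back."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel from the "PPE": no more work
            jobs.task_done()
            break
        job_id, func, arg = job
        results.put((job_id, spe_id, func(arg)))
        jobs.task_done()  # synchronize with the "PPE"


def run_jobs(work_items, func, num_spes=NUM_SPES):
    """Play the PPE role: maintain the queue, dispatch jobs, wait for completion."""
    work_items = list(work_items)
    jobs = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=spe_mini_kernel, args=(i, jobs, results))
        for i in range(num_spes)
    ]
    for worker in workers:
        worker.start()
    for job_id, item in enumerate(work_items):
        jobs.put((job_id, func, item))
    for _ in workers:
        jobs.put(None)  # one stop sentinel per worker
    jobs.join()  # monitor progress: block until every job is marked done
    for worker in workers:
        worker.join()
    ordered = [None] * len(work_items)
    while not results.empty():
        job_id, _spe_id, value = results.get()
        ordered[job_id] = value
    return ordered


if __name__ == "__main__":
    print(run_jobs(range(10), lambda x: x * x))
```

The central queue keeps scheduling decisions in one place, which mirrors the text's description of the PPE as the component that owns the queue while each SPE runs only a small fetch-execute-synchronize loop.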
Self-multitasking of SPEs
The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores, as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
Stream processing
Each SPE runs a distinct program. Data comes from an input stream and is sent to the SPEs. When an SPE has finished processing, the output data is sent to an output stream.
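The stream model can be sketched in the same spirit, with one thread per simulated SPE and queues standing in for the data streams. Chaining the stages into a pipeline is one possible arrangement assumed here for illustration; the names `spe_stage` and `run_pipeline` are hypothetical and do not come from any Cell toolkit.

```python
import queue
import threading

_END = object()  # marker for the end of a stream


def spe_stage(kernel, inbox, outbox):
    """One simulated SPE: apply its own program to each item flowing through."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # pass the end marker downstream
            break
        outbox.put(kernel(item))


def run_pipeline(kernels, input_stream):
    """Feed input_stream through one stage per kernel and collect the output."""
    queues = [queue.Queue() for _ in range(len(kernels) + 1)]
    stages = [
        threading.Thread(target=spe_stage, args=(kernel, queues[i], queues[i + 1]))
        for i, kernel in enumerate(kernels)
    ]
    for stage in stages:
        stage.start()
    for item in input_stream:
        queues[0].put(item)
    queues[0].put(_END)
    output = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        output.append(item)
    for stage in stages:
        stage.join()
    return output


if __name__ == "__main__":
    # Two distinct "programs": add one, then double.
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

Because each stage is a single consumer of a first-in, first-out queue, items leave the pipeline in the order they entered, and each stage can be given a different program without the others being aware of it.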
This provides a flexible and powerful architecture for stream processing, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks, but are limited by the kernel loaded.
Open source software development
An open source software-based strategy was adopted to accelerate the development of a Cell BE ecosystem and to provide an environment to develop Cell applications. [cite web|url=http://www.research.ibm.com/people/m/mikeg/papers/2007_ieeecomputer.pdf|title=An Open Source Environment for Cell Broadband Engine System Software|date=2007-06|format=PDF] In 2005, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers. [cite web|first=Arnd|last=Bergmann|url=http://lkml.org/lkml/2005/6/21/390|title=ppc64: Introduce Cell/BPA platform, v3|date=2005-06-21 |accessdate=2007-03-22] Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTag 2005. [cite web | title=The Cell Processor Programming Model | work=LinuxTag 2005 | url=http://www.linuxtag.org/typo3site/freecongress-details.html?talkid=156 | accessdaymonth=11 June | accessyear=2005]
Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries.
Terra Soft Solutions provides Yellow Dog Linux for IBM and Mercury Cell-based systems, as well as for the PlayStation 3. [cite web|url=http://www.terrasoftsolutions.com/news/2006/2006-10-17.shtml|title= Terra Soft to Provide Linux for PLAYSTATION3] Terra Soft strategically partnered with Mercury to provide a Linux Board Support Package for Cell, and support and development of software applications on various other Cell platforms, including the IBM BladeCenter JS21 and Cell QS20, and Mercury Cell-based solutions. [http://www.terrasoftsolutions.com/products/mercury/intro.shtml Terra Soft - Linux for Cell, PlayStation PS3, QS20, QS21, QS22, IBM System p, Mercury Cell, and Apple PowerPC] Terra Soft also maintains the Y-HPC (High Performance Computing) Cluster Construction and Management Suite and Y-Bio gene sequencing tools. Y-Bio is built upon the RPM Linux standard for package management, and offers tools which help bioinformatics researchers conduct their work with greater efficiency. [cite web | title= Y-Bio | url=http://www.terrasoftsolutions.com/products/y-bio/programs.shtml|date=2007-08-31] IBM has developed a pseudo-filesystem for Linux called "Spufs" that simplifies access to and use of the SPE resources. IBM currently maintains the Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils). [cite news | title=Arnd Bergmann on Cell | publisher=IBM developerWorks | url=http://www-128.ibm.com/developerworks/power/library/pa-expert4/|date=2005-06-25]
In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0", consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website. [cite web|url=http://www.bsc.es/projects/deepcomputing/linuxoncell/|title=Linux on Cell BE-based Systems|publisher=Barcelona Supercomputing Center|accessdate=2007-03-22]
In August 2007, Mercury Computer Systems released a Software Development Kit for PLAYSTATION(R)3 for High-Performance Computing. [cite web | title=Mercury Computer Systems Releases Software Development Kit for PLAYSTATION(R)3 for High-Performance Computing | publisher=PRNewswire-FirstCall | url=http://www.mc.com/mediacenter/pressrelease.aspx?id=10454|date=2007-08-03]
With the release of kernel version 2.6.16 on March 20 2006, the Linux kernel officially supports the Cell processor. [cite news|url=http://news.com.com/2100-7344_3-6052314.html|title=Linux gets built-in Cell processor support|first=Stephen|last=Shankland|publisher=CNET|date=2006-03-21 |accessdate=2007-03-22]
References
External links
* [http://www-128.ibm.com/developerworks/power/cell/ Cell Broadband Engine resource center]
* [http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html Standards and documentation]
* [http://cell.scei.co.jp/ Sony Computer Entertainment Incorporated's CELL resource page]
* [http://www.cmpware.com/Docs/ProductBrief_3.0_CellBE.pdf Cmpware Configurable Multiprocessor Development Kit for Cell BE]
* [http://www.gamezero.com/team-0/articles/interviews/dr_h_peter_hofstee/ The Soul of Cell: An Interview with Dr. H. Peter Hofstee, Chief Architect of the Cell Synergistic Processor]
* [http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318 ISSCC 2005: The CELL Microprocessor, a comprehensive overview of the CELL microarchitecture]
* [http://members.forbes.com/forbes/2006/0130/076.html Holy Chip!]
* [http://www.ibm.com/developerworks/power/library/pa-tacklecell5/index.html The little broadband engine that could]
* [http://arstechnica.com/articles/paedia/cpu/cell-1.ars Introducing the IBM/Sony/Toshiba Cell Processor — Part I: the SIMD processing units]
* [http://arstechnica.com/articles/paedia/cpu/cell-2.ars Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture]
Wikimedia Foundation. 2010.