- Stanford Smart Memories Project
Advances in VLSI technology now permit multiple processors to reside on a single integrated circuit chip, or IC. Such a processing system is known as a chip multiprocessor, or multi-core CPU system. Building on this technology, the Stanford Smart Memories Project places several processors on an IC, along with several independent memory blocks. In addition, the processors can be connected to the memory blocks in various ways, with the ability to form and change connections even while the processors are running. This ability is a kind of reconfigurable computing.
Depending on how processors talk to memory and to one another, they form a system that can be tailored more or less to a given style of computation. A fixed (non-reconfigurable) compute system might do well at supporting one style of computation, but perform poorly on a different style. A reconfigurable computer, however, can adapt to many different styles of computing, and thus provide reasonably good performance across a wide range. Smart Memories has been shown to be effective for diverse compute styles including MESI-style shared-memory cache coherence, thread-level speculation (TLS), streaming and [http://tcc.stanford.edu/ transactional coherence and consistency (TCC)] (K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, [http://mos.stanford.edu/papers/km_isca_00.pdf Smart Memories: A Modular Reconfigurable Architecture,] International Symposium on Computer Architecture, June 2000).
The Stanford Smart Memories Project is an effort to develop a computing infrastructure for the next generation of applications. It is a multicore system with coarse-grain reconfiguration capabilities for supporting diverse computing models, like speculative multithreading and streaming architectures. These features allow the system to run a broad range of applications efficiently. Research in this area involves VLSI circuits, computer architecture, compilers, operating systems, computer graphics and computer networking.
Smart Memories is a project in the [http://csl.stanford.edu/ Computer Systems Laboratory,] a joint laboratory of the [http://www-ee.stanford.edu/ Electrical Engineering] and [http://www-cs.stanford.edu/ Computer Science] departments at [http://www.stanford.edu/ Stanford University].
Project overview
The Stanford Smart Memories Project aims to design a single-chip computing element that provides configurable hardware support for diverse computing models, and that maps efficiently to future wire-limited VLSI technologies.
The Smart Memory chip architecture exploits the fact that wire-delay limitations in future VLSI chips will impose a fine-grained partitioning of processors, memories, and interconnects. Adding programmable wires and logic to this inherently modular organization allows on-chip memories and communication paths to be customized to the particular computing problem at hand. This allows performance competitive with application-specific architectures, but with lower cost and increased flexibility. This fine-grained partitioning of processing and memory resources also enables substantial hardware parallelism. Effectively exploiting this parallelism in the face of global wiring delays requires aggressive methods for reducing on-chip communication overhead between the various processing and memory structures.
To develop a configurable micro-architecture, the Smart Memories group is studying diverse classes of computing problems (such as ray tracing, multimedia and DSP, speech and voice recognition, and probabilistic reasoning) and the specialized architectures that have been optimized for these problems. This will provide insight into the hardware primitives and configurable mechanisms required to implement a universal computing substrate. The group is mainly interested in the requirements that such classes of applications place on the memory system of a multiprocessor environment, and is investigating strategies for building a reconfigurable memory system.
Architecture overview
Smart Memories is a multiprocessor system with coarse-grain reconfiguration capabilities. Processing units in this system take the form of "tiles" which, when put together in groups of four, form "quads". These elements connect in a hierarchical manner: a set of intra-quad connections provides communication facilities for tiles inside a quad, while a mesh interconnection network connects quads together. Tiles inside a quad share a network interface to connect to the outside world ([http://mos.stanford.edu/smart_memories/gif/overvi1.gif Figure 1]).
Each tile in the Smart Memories system consists of four major parts ([http://mos.stanford.edu/smart_memories/gif/archit1.gif Figure 2]): two processor cores, a set of configurable memory mats, a crossbar interconnect and a load/store unit (LSU). Either or both of the processors inside a tile can be turned off when their processing power is not needed, which saves power and lets the tile act purely as a memory resource.
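The tile/quad/mesh hierarchy can be sketched as a small data model. This is only an illustration, not the project's simulator: the class names, the default of 16 mats per tile (an assumption based on the 16 mat-side crossbar ports) and the 2 x 2 mesh size are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List

TILES_PER_QUAD = 4       # from the text: four tiles form a quad
CORES_PER_TILE = 2       # from the text: two processor cores per tile


@dataclass
class Tile:
    """One tile: two cores plus a set of configurable memory mats."""
    mats: int = 16                                   # assumed mat count
    core_enabled: List[bool] = field(
        default_factory=lambda: [True] * CORES_PER_TILE)

    def disable_core(self, index: int) -> None:
        """Turn one core off, e.g. to save power."""
        self.core_enabled[index] = False

    @property
    def is_memory_only(self) -> bool:
        """True when both cores are off and the tile is only a memory resource."""
        return not any(self.core_enabled)


@dataclass
class Quad:
    """Four tiles sharing one protocol controller and one network interface."""
    tiles: List[Tile] = field(
        default_factory=lambda: [Tile() for _ in range(TILES_PER_QUAD)])

    def active_cores(self) -> int:
        return sum(sum(t.core_enabled) for t in self.tiles)


def build_mesh(rows: int, cols: int) -> List[List[Quad]]:
    """Arrange quads in a rows x cols mesh (dimensions are illustrative)."""
    return [[Quad() for _ in range(cols)] for _ in range(rows)]


mesh = build_mesh(2, 2)
tile = mesh[0][0].tiles[0]
tile.disable_core(0)
tile.disable_core(1)
print(tile.is_memory_only)            # True
print(mesh[0][0].active_cores())      # 6
```

Disabling both cores of one tile leaves six active cores in that quad, while the tile's mats remain available to the other tiles.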
Tile: Processors, memory mats, crossbar and LSU
Processors
Smart Memories leverages [http://www.tensilica.com/html/xtensa_lx.html Xtensa LX] commercial configurable processing cores from [http://www.tensilica.com/ Tensilica.] The cores are 32-bit RISC machines with a flexible 16/24-bit instruction length. They have been configured to be 3-way issue VLIW with flexible instruction formats. The Xtensa LX has a seven-stage pipeline, with two stages for memory access. It has 64 general-purpose registers, a 32-bit floating-point unit and 32 floating-point registers.
Processors are configured and extended using the TIE (Tensilica Instruction Extension) language. The Smart Memories group has defined new interfaces to the memory, plus state registers and custom instructions for supporting different programming models.
Memory mats
[http://mos.stanford.edu/smart_memories/gif/archit2.gif Figure 3] shows the block diagram of a reconfigurable memory mat. Each memory mat has 1024 32-bit words in its main data array. Each word is associated with six control bits in a separate control array. A programmable logic array (PLA) performs a read-modify-write operation on the control bits after each access to the memory word. The mat can perform read, write and compare operations on each 32-bit data word.
Each memory mat also has two pairs of pointer/stride registers, which can be used to implement two separate hardware FIFOs inside the mat. Mats are connected via a two-bit inter-mat communication network, which allows them to exchange control information. They can be configured to be used as caches, FIFOs or scratchpads.
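A memory mat's behaviour can be approximated in software as follows. This is a minimal sketch under stated assumptions: the PLA is modelled as a plain function from old control bits and operation name to new control bits, the "dirty bit" program is a made-up example, and a single FIFO is built from one head and one tail pointer with a shared stride and no full/empty detection. None of these names come from the Smart Memories design.

```python
from typing import Callable, List

WORDS = 1024             # from the text: 1024 words per mat
WORD_MASK = 0xFFFFFFFF   # 32-bit data words
CTRL_MASK = 0x3F         # six control bits per word

# Hypothetical PLA model: (old control bits, operation name) -> new control bits.
ControlUpdate = Callable[[int, str], int]


def mark_written(ctrl: int, op: str) -> int:
    """Example PLA program: set bit 0 (a 'dirty' flag) on every write."""
    return ctrl | 0x1 if op == "write" else ctrl


class MemoryMat:
    def __init__(self, pla: ControlUpdate = mark_written) -> None:
        self.data: List[int] = [0] * WORDS
        self.ctrl: List[int] = [0] * WORDS
        self.pla = pla
        # Assumed FIFO layout: one head pointer, one tail pointer, one stride.
        self.head, self.tail, self.stride = 0, 0, 1

    def _update_ctrl(self, addr: int, op: str) -> None:
        # Read-modify-write of the control bits after the data access.
        self.ctrl[addr] = self.pla(self.ctrl[addr], op) & CTRL_MASK

    def read(self, addr: int) -> int:
        value = self.data[addr]
        self._update_ctrl(addr, "read")
        return value

    def write(self, addr: int, value: int) -> None:
        self.data[addr] = value & WORD_MASK
        self._update_ctrl(addr, "write")

    def compare(self, addr: int, value: int) -> bool:
        match = self.data[addr] == (value & WORD_MASK)
        self._update_ctrl(addr, "compare")
        return match

    def push(self, value: int) -> None:
        """FIFO enqueue: write at the tail pointer, then advance by the stride."""
        self.write(self.tail, value)
        self.tail = (self.tail + self.stride) % WORDS

    def pop(self) -> int:
        """FIFO dequeue: read at the head pointer, then advance by the stride."""
        value = self.read(self.head)
        self.head = (self.head + self.stride) % WORDS
        return value


mat = MemoryMat()
mat.write(5, 0x1_0000_00FF)     # value is truncated to 32 bits
print(hex(mat.read(5)))         # 0xff
print(mat.ctrl[5])              # 1
mat.push(10)
mat.push(20)
print(mat.pop(), mat.pop())     # 10 20
```

Swapping in a different `pla` function changes how the control bits evolve without touching the data path, which is the property that lets one mat serve as a cache, a FIFO or a scratchpad.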
Crossbar
A crossbar inside the tile connects the memory mats to the two processor cores inside the tile, and to the tile's quad interface. The crossbar has four ports at the processor (LSU) side, two ports to the quad interface and 16 ports to the memory mats.
Load/Store Unit (LSU)
A load/store unit interfaces the two Tensilica cores to the rest of the memory system. It provides basic interfacing and support for the custom memory operations that were defined using the TIE language. The LSU also communicates with the quad's cache controller to request cache refills, access off-tile memory and report other special events, such as synchronization misses.
Quad: Four tiles, cache controller, network interface
Each group of four tiles forms one "quad". Each quad has a shared cache or protocol controller, which provides support for the processors inside. It also has a network interface, which sends, receives and routes packets on the mesh-like network, and provides an interface to the outside world.
Cache (protocol) controller
The protocol controller is considered to be the heart of the quad. It can perform a variety of actions to support the processors' memory access needs under different programming models. Briefly, the protocol controller services cache evictions and refills, provides access to memory mats in one tile for a processor in another tile (off-tile accesses), enforces cache coherence invariants (MESI protocol), acts as a DMA engine to move data in and out of the quad, and provides support for transactions.
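For readers unfamiliar with MESI, the per-line state transitions can be sketched as below. This is the textbook protocol for a single cache line, not the Smart Memories protocol controller; the event names and the `others_have_copy` flag are hypothetical, and side effects such as write-backs are only noted in comments.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state: State, event: str, others_have_copy: bool = False) -> State:
    """Return the next MESI state of one cache line.

    event is one of 'local_read', 'local_write', 'remote_read', 'remote_write'.
    others_have_copy only matters for a local read that misses (state INVALID).
    """
    if event == "local_write":
        return State.MODIFIED                    # writer gains sole ownership
    if event == "remote_write":
        return State.INVALID                     # another cache takes ownership
    if event == "local_read":
        if state is State.INVALID:
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state                             # hit: state is unchanged
    if event == "remote_read":
        if state is State.INVALID:
            return State.INVALID
        return State.SHARED                      # M writes back; M and E downgrade
    raise ValueError(f"unknown event: {event}")


s = State.INVALID
for ev in ["local_read", "local_write", "remote_read", "remote_write"]:
    s = next_state(s, ev)
    print(ev, "->", s.value)
# local_read -> E
# local_write -> M
# remote_read -> S
# remote_write -> I
```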
Network interface
The network interface is a simple router that connects each quad to its neighbors via a set of wires. It receives packets from the protocol controller or other neighbors and routes them to appropriate destinations.
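The text does not say which routing algorithm the network interface uses. As one plausible illustration of a "simple router" on a mesh, the sketch below applies dimension-order (XY) routing; the choice of algorithm, the coordinate scheme and the function names are assumptions, not details of the Smart Memories design.

```python
from typing import List, Tuple

Coord = Tuple[int, int]   # (x, y) position of a quad in the mesh


def next_hop(current: Coord, dest: Coord) -> Coord:
    """Dimension-order routing: correct the X offset first, then Y."""
    x, y = current
    dx, dy = dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return current            # already at the destination: deliver locally


def route(src: Coord, dest: Coord) -> List[Coord]:
    """List every quad a packet visits on its way from src to dest."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


print(route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Each router needs only its own coordinates and the packet's destination to pick the next neighbor, so no routing tables are required.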
Programming models / software
Smart Memories is designed to efficiently support different programming models, allowing an application to be programmed and run in the model that gives the best performance, the greatest programming ease, or both. Smart Memories can reconfigure its memory system to meet the distinct memory access requirements of each of three major models: shared memory, streaming, and transactional coherence and consistency.
Shared memory / multi-thread mode
This programming model gives the programmer a cache-coherent shared memory environment. Multi-thread programs are supported using different APIs, such as pthreads or ANL macros. There are ongoing efforts to map different application classes to the Smart Memories architecture using this programming model, including probabilistic reasoning applications, global illumination and data structure pre-fetching.
Probabilistic reasoning applications
Probabilistic reasoning is an influential approach in artificial intelligence, where it has been shown to successfully tackle difficult problems in growing fields such as data mining, image analysis, robotics, and genetics. Given the increasingly complex models and large data sets used in these emerging applications, the performance of reasoning algorithms is likely to become important for future computing systems. These algorithms tend to be inherently parallel, but are demanding in compute, memory and bandwidth resources. Mapping these algorithms onto the Smart Memories architecture makes it possible to evaluate the effectiveness of the various reconfigurable components in the design.
Global illumination on parallel architectures
Monte-Carlo ray tracing to generate scenes with global illumination is an application that demands a lot from a memory system. The [http://www.tacc.utexas.edu/~cslugg/lightray/lightray.php application] has been coded using pthreads and simulated on the Smart Memories simulator. Although real-time performance on a single Smart Memories chip is achieved, higher performance over current processors is possible.
Related publications: C. Burns, [http://www.tacc.utexas.edu/~cslugg/thesis.pdf Global Illumination on Parallel Architectures,] Senior Thesis, University of Texas Department of Computer Sciences, Dec. 2004.
Data structure pre-fetching
Hardware-based or compiler-assisted pre-fetching techniques work well for array-based programs but are less effective in hiding memory latency for pointer-intensive programs. By using a data-structure-centric approach to pre-fetching (as opposed to control-flow-centric approaches), the Smart Memories project exploits libraries of data structures to help with pre-fetching data stored in the data structures. Taking advantage of the recent success of chip multiprocessors, an idle or under-utilized processor can pre-fetch data using a pre-fetch thread.
A library is modified by adding code for the pre-fetch thread as well as a few lines to communicate information from the library code to the pre-fetch thread. The pre-fetch thread uses its knowledge of the data structures in the library to identify the memory traversal patterns and issues pre-fetches accordingly. This contrasts with issuing pre-fetches for individual load instructions independently. The approach can obtain performance improvements without the assistance of a profiling compiler or costly hardware, even while restricted to the paradigm of sequential programming languages. Furthermore, it makes pre-fetching transparent to the programmer using the library, since the application code need not be modified at all.
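The structure of such a library can be sketched with a helper thread and a hint queue. Everything here is illustrative: `PrefetchingList`, `PREFETCH_DEPTH` and the `touched` set are hypothetical names, and because Python cannot issue hardware pre-fetches, the helper merely records which nodes it walked instead of warming a cache.

```python
import queue
import threading
from typing import Optional, Set

PREFETCH_DEPTH = 4   # hypothetical: how many nodes ahead the helper walks


class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.next: Optional["Node"] = None


class PrefetchingList:
    """Linked-list library with a built-in pre-fetch helper thread."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.touched: Set[int] = set()          # values the helper has visited
        self._hints: queue.Queue = queue.Queue()
        self._helper = threading.Thread(target=self._prefetch_loop, daemon=True)
        self._helper.start()

    def push_front(self, value: int) -> None:
        node = Node(value)
        node.next, self.head = self.head, node

    def _prefetch_loop(self) -> None:
        while True:
            node = self._hints.get()
            if node is None:                    # shutdown sentinel
                return
            # The helper knows the structure is a list, so it follows `next`.
            for _ in range(PREFETCH_DEPTH):
                if node is None:
                    break
                self.touched.add(node.value)    # stands in for a pre-fetch
                node = node.next

    def total(self) -> int:
        """Library routine; the application calls this and nothing else."""
        acc, node = 0, self.head
        while node is not None:
            if node.next is not None:
                self._hints.put(node.next)      # the "few lines" of communication
            acc += node.value
            node = node.next
        return acc

    def close(self) -> None:
        """Stop the helper after it has drained all pending hints."""
        self._hints.put(None)
        self._helper.join()


lst = PrefetchingList()
for v in range(5, 0, -1):
    lst.push_front(v)          # list is 1 -> 2 -> 3 -> 4 -> 5
print(lst.total())             # 15
lst.close()
print(sorted(lst.touched))     # [2, 3, 4, 5]
```

The application only calls `total()`; the hint traffic and the helper thread stay inside the library, which is what makes the pre-fetching transparent to the caller.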
Streaming
Streaming is the second programming model supported in the Smart Memories system. For data-parallel applications, such as those in the multimedia and DSP domains, the stream programming model gives high performance. Because a program's computation and communication are separated into kernels and streams of data, a compiler can perform many static optimizations. A high-level compiler such as [http://www.reservoir.com/r-stream.php Reservoir Labs R-Stream] maps compute kernels to stream co-processors and manages the transfer of data to software-managed local memories. It generates SVM (Stream Virtual Machine) code, which is C with SVM API calls, and this code is then compiled by the Tensilica XCC compiler. The SVM runtime implements the SVM API calls to allow a stream program to run on Smart Memories.
Smart Memories is an active participant in the [http://www.morphware.org Morphware Forum], which develops standards such as the [http://www.morphware.org/standards.html Stream Virtual Machine.]
Related publications: F. Labonte, P. Mattson, I. Buck, C. Kozyrakis and M. Horowitz, [http://mos.stanford.edu/papers/svm_pact04.pdf The Stream Virtual Machine,] PACT, September 2004.
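The separation of computation (kernels) from communication (streams) can be illustrated with a toy example. The function names and the four-word local memory are hypothetical and do not correspond to the SVM API; the sketch only shows a kernel that touches staged data while a separate loop handles the transfers.

```python
from typing import Callable, List, Sequence

LOCAL_MEM_WORDS = 4   # hypothetical capacity of a software-managed local memory


def scale_kernel(block: Sequence[int]) -> List[int]:
    """Compute kernel: touches only data already staged in local memory."""
    return [2 * x for x in block]


def run_stream(kernel: Callable[[Sequence[int]], List[int]],
               in_stream: Sequence[int]) -> List[int]:
    """Communication side: stage blocks in, run the kernel, copy results out."""
    out_stream: List[int] = []
    for start in range(0, len(in_stream), LOCAL_MEM_WORDS):
        local = list(in_stream[start:start + LOCAL_MEM_WORDS])   # transfer in
        out_stream.extend(kernel(local))                           # compute, transfer out
    return out_stream


print(run_stream(scale_kernel, list(range(10))))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the block size and access pattern are known before the kernel runs, a compiler can schedule the transfers statically, which is the kind of optimization the stream model enables.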
Transactional Coherence and Consistency (TCC)
The last major programming model in the Smart Memories system is transactions. By executing all code as transactions on the memory system, TCC offers a simpler way to parallelize applications than using explicit threads. For more details about TCC, refer to the [http://tcc.stanford.edu Stanford TCC website.]
Smart Memories test chips
Memory test chip
In February 2003, the Smart Memories group taped out a reconfigurable memory test chip on the TSMC 0.18um process. The test chip consisted of four memory blocks, a low-swing crossbar, and testing infrastructure circuits. The chips were successfully tested in the lab, operating at a clock frequency of 1.1 GHz at the nominal voltage of 1.8 volts (Figure 1). Results were published at ISSCC 2004 (K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz. [http://mos.stanford.edu/papers/km_isscc_04.pdf Architecture and Circuit Techniques for a Reconfigurable Memory Block.] ISSCC, February 2004).
[http://mos.stanford.edu/smart_memories/gif/ken_chip.jpg Figure 1 - Smart Memories test chip, memory blocks and low-swing crossbar]
Interconnect test chip
In April 2002, Smart Memories taped out a low-swing interconnect test chip on the TSMC 0.18um process (Figure 2). The test chip consisted of multiple low-swing bus topologies as well as some full-swing buses for comparison. The test chip also contained a sense amplifier offset measurement block (later re-spun on a National Semiconductor 0.25um process). The chips have been tested, and a paper was presented at the 2003 VLSI Circuits Symposium (R. Ho, K. Mai, M. Horowitz. [http://mos.stanford.edu/papers/rh_vlsi03.pdf Efficient On-Chip Global Interconnects.] IEEE Symposium on VLSI Circuits, June 2003).
[http://mos.stanford.edu/smart_memories/gif/chip.jpg Figure 2 - Low-swing interconnect test chip]
References
Papers and presentations
Architecture papers
J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, C. Kozyrakis, "Comparative Evaluation of Memory Models for Chip Multiprocessors," ACM Transactions on Architecture and Code Optimization (to appear in 2008).
A. Solomatnikov, A. Firoozshahian, W. Qadeer, O. Shacham, K. Kelley, Z. Asgar, M. Wachs, R. Hameed, M. Horowitz, "Chip Multi-Processor Generator," Wild and Crazy Ideas session at Design Automation Conference, June 2007.
J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, M. Horowitz, C. Kozyrakis, "Comparing Memory Systems for Chip Multiprocessors," International Symposium on Computer Architecture, June 2007.
F. Labonte, P. Mattson, I. Buck, C. Kozyrakis and M. Horowitz, [http://mos.stanford.edu/papers/svm_pact04.pdf The Stream Virtual Machine,] International Conference on Parallel Architectures and Compiler Techniques, September 2004.
K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz, [http://mos.stanford.edu/papers/km_isca_00.pdf Smart Memories: A Modular Reconfigurable Architecture,] International Symposium on Computer Architecture, June 2000.
VLSI papers
O. Shacham, M. Wachs, A. Solomatnikov, A. Firoozshahian, S. Richardson and M. Horowitz, [http://mos.stanford.edu/papers/scoreboard.pdf Verification of Chip Multiprocessor Memory Systems Using A Relaxed Scoreboard,] MICRO-41, November 2008. ([http://www-vlsi.stanford.edu/smart_memories/RSB/index.html Addendum])
K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. Horowitz, [http://mos.stanford.edu/papers/km_isscc_04.pdf Architecture and Circuit Techniques for a Reconfigurable Memory Block,] International Solid-State Circuits Conference, February 2004.
R. Ho, K. Mai, M. Horowitz, [http://mos.stanford.edu/papers/rh_vlsi03.pdf Efficient On-Chip Global Interconnects,] IEEE Symposium on VLSI Circuits, June 2003.
R. Ho, K. Mai, and M. Horowitz, [http://mos.stanford.edu/papers/rh_ieeeproc_01.pdf The Future of Wires,] Proceedings of the IEEE, April 2001, pp. 490-504.
R. Ho, K. Mai, H. Kapadia and M. Horowitz, [http://mos.stanford.edu/papers/rh_iccad_99.pdf Interconnect Scaling Implications For CAD,] IEEE/ACM International Conference on Computer-Aided Design, 1999, San Jose, CA.
W. J. Dally and S. Lacy, [http://cva.stanford.edu/publications/1999/arvlsi99.pdf VLSI Architecture: Past, Present, and Future,] Proceedings of the Advanced Research in VLSI conference, 1999, Atlanta, GA.
W. J. Dally and A. Chang, [http://cva.stanford.edu/publications/2000/dac00.pdf The Role of Custom Design in ASIC Chips,] Proceedings of the 37th Design Automation Conference, June 2000, Los Angeles, CA.
Presentations
* Nuwan Jayasena, [http://mos.stanford.edu/smart_memories/stuff/presentations/DetailedArchitecture.pdf Detailed Hardware Architecture]
* DARPA site visit, 14 October 2002: Amin Firoozshahian, [http://mos.stanford.edu/smart_memories/stuff/presentations/Smart%20Memories%20Hardware%20Presentation%20-%20DARPA%20site%20visit%2010-17-02%20-%20No%20notes.pdf Smart Memories Hardware]; Kenneth Mai, Ron Ho, [http://mos.stanford.edu/smart_memories/stuff/presentations/darpa_review_10_14_02.pdf Smart Memories Test Chips]
Posters
People involved
Faculty/staff: [http://www-flash.stanford.edu/~horowitz/ Mark Horowitz]; [http://csl.stanford.edu/~billd/ Bill Dally]; [http://ogun.stanford.edu/~kunle/ Kunle Olukotun]; [http://csl.stanford.edu/~christos/ Christos Kozyrakis]; [http://www.stanford.edu/~engler/ Dawson Engler]; [http://www.stanford.edu/~steveri/ Stephen Richardson].
Graduate students: [http://www.stanford.edu/~sols/ Alex Solomatnikov]; [http://www.stanford.edu/~aminf13/ Amin Firoozshahian]; [http://www.stanford.edu/~vicwong/ Vicky Wong]; Varun Sagar Malhotra; Wajahat Qadeer; Zain Asgar; Rehan Hameed; [http://www.stanford.edu/~shacham/ Ofer Shacham]; Kyle Kelley; Megan Wachs.
Alumni: Mike Chen; [http://www-lance.stanford.edu/ Lance Hammond]; [http://www-vlsi.stanford.edu/~flabonte/ Francois Labonte]; Ken Mai; [http://ogun.stanford.edu/~mkprabhu/ Manohar Prabhu]; Ayodele Thomas.
Project resources
: [http://mos.stanford.edu/smart_memories/protected/applications.html Application Studies]
: [http://mos.stanford.edu/smart_memories/protected/documents.html Documents]
: [http://mos.stanford.edu/smart_memories/protected/bugzilla.htm Internal Bug Database (bugzilla)]
: [http://mos.stanford.edu/smart_memories/sm/html/sm/index.html E-mail Archive]
: [http://mos.stanford.edu/smart_memories/protected/seminar_slides.html Seminar Slides]
: [http://mos.stanford.edu/smart_memories/protected/meetings.html Meeting Schedules]
: [http://mos.stanford.edu/smart_memories/protected/docs/Smart_Memories_project.html Project Schedule]
: [http://mos.stanford.edu/smart_memories/protected/hardware_meetings.htm Hardware Meetings]
Similar projects and links
Architecture projects
: [http://iram.cs.berkeley.edu/ Berkeley IRAM] - Processor plus DRAM on the same chip.
: [http://www-hydra.stanford.edu/fast/fast.shtml Stanford FAST] - FPGA prototype board for e.g. Hydra, below (2003-2006).
: [http://www-flash.stanford.edu/ Stanford FLASH Multiprocessor] - Multiple processor boards on a backplane. Each board includes a processor, DRAM, L2 cache, and configurable interconnect (1992-1997).
: [http://www-hydra.stanford.edu/ Stanford Hydra] - Four-processor CMP with support for TLS (1994-2005).
: [http://cva.stanford.edu/imagine/index.html Stanford Imagine]
: [http://www.cs.wisc.edu/~mscalar/ Wisconsin MultiScalar] - Origins of speculative multithreading, similar to TLS. Uses multiple functional units to attack different segments of the out-of-order window in parallel (1995-2001).
Reconfigurable/polymorphic architectures
: [http://www.ece.cmu.edu/research/piperench/ CMU PipeRench] - Pipelineable FPGA, sort of (1997-2000).
: [http://www.cag.lcs.mit.edu/raw/ MIT RAW] - Multiple processor tiles connected by a reconfigurable fabric (1997-2004).
: [http://www.cs.utexas.edu/users/cart/trips/ Texas TRIPS] - Multiple tiles connected via a network. Each tile has two processors, sixteen ALUs, L1 and L2 caches (2001-2006).
Related projects
: [http://graphics.stanford.edu/sss/ Stanford Streaming Supercomputer (SSS)]
: [http://graphics.stanford.edu/streamlang/ Streaming Languages]
: [http://suif.stanford.edu/ SUIF Compiler]
: [http://simos.stanford.edu/ SimOS]
: [http://tiny-tera.stanford.edu/tiny-tera/ Tiny-Tera]
Research groups at Stanford
: [http://mos.stanford.edu/ VLSI Research Group]
: [http://cva.stanford.edu/ Concurrent VLSI Architecture Group (CVA)]
: [http://www-graphics.stanford.edu/ Computer Graphics Laboratory]
: [http://getafix.stanford.edu/cad/ High-Level Design Group]
: [http://klamath.stanford.edu/networking/ Networking Research Groups]
: [http://chronos.stanford.edu/users/nanni/research/ CAD Synthesis Group]
General links
: [http://www.cs.wisc.edu/~arch/www/ Wisconsin Computer Architecture Home Page]
: [http://www.acm.org/sigarch/ SIGARCH]
: [http://www.morphware.org/ Morphware Forum]