Flexible Architecture for Simulation and Testing
The FAST Project is a new hybrid hardware prototyping platform that integrates a variety of hardware components on a printed circuit board (PCB) to implement Chip Multiprocessor (CMP) or Multiprocessor (MP) systems. The Flexible Architecture for Simulation and Testing (FAST) combines the flexibility of software simulation with the accuracy and speed of a hardware implementation, enabling computer architects to implement new multithreaded, multiprocessor, or CMP architectures for in-depth evaluation and software development. This is accomplished by combining dedicated microprocessor chips and SRAMs with Field Programmable Gate Arrays (FPGAs), enabling FAST to operate at hardware speeds while still emulating the latency and bandwidth characteristics of a modern CMP architecture. FAST provides the foundation for TLP-focused computer architecture research and software development that is currently not possible or practical using software-only or hardware-only solutions. Thus, FAST can be used to implement or emulate a variety of TLP, CMP, or MP computer architectures using a common hardware platform. FAST is a project in the [http://csl.stanford.edu/ Computer Systems Laboratory], a joint laboratory of the [http://www-ee.stanford.edu/ Electrical Engineering] and [http://www-cs.stanford.edu/ Computer Science] departments at [http://www.stanford.edu/ Stanford University].
Project Overview
The Stanford FAST project is a hybrid, heterogeneous platform for hardware simulation, emulation, or implementation. The FAST project started in 2002 with the goal of building both hardware and software infrastructure capable of emulating or implementing a variety of CMP and multiprocessor computer systems. By combining FPGAs, SRAMs, and dedicated MIPS R3000 microprocessor chips with a software toolbox, FAST provides the benefits of both software simulation and hardware prototyping, bridging the research gap that inhibits software development and in-depth research. FAST covers the implementation of the printed circuit board and the initial software infrastructure contained in the FAST Software Toolbox. FAST enables full-system implementation at the hardware and software level, including a variety of memory levels and structures as well as operating system and application development.
FAST enables hardware implementation or emulation by providing base functionality in both hardware and software. The hardware leverages the functionality of the base components. FPGAs provide the flexibility, expandability, and interconnectivity to map a variety of designs to a single hardware platform. SRAMs provide high-speed, high-density memories for a variety of storage structures and multiple memory levels. The dedicated microprocessors provide optimized 32-bit datapaths for both integer and floating-point operations. Combined on a printed circuit board (PCB), these components provide the foundation for a reusable, modular, and scalable framework for multiprocessor systems research.
Layered on top of the hardware is the FAST Software Toolbox. This toolbox is composed of base morphware modules that describe the connectivity and functionality of the PCB. Additional morphware modules define functionality that can be modified for specific computer systems, for example, level 1 and level 2 caches or other memory structures, predictors, or performance counters. FAST also has a predefined "morphware" tool chain for mapping designs to the FPGAs, and an application tool chain for application development. Finally, the base components, and in particular the microprocessors, can run a ported, fully functional operating system such as Linux.
Figure 1 illustrates the complete FAST hardware and software stack. The FAST PCB is at the bottom of the stack with the base functionality and connectivity described in Verilog for the base FAST morphware. There is also a collection of FAST morphware modules that can be modified by FAST users to model different systems. These three components are specific to the FAST hardware and enable morphing FAST to the target architecture. The next two components are enabled by the hardware.
FAST is an initial proof of concept that demonstrates the feasibility of a hybrid hardware prototyping platform that integrates hardware back into the computer architecture research cycle, enabling both in-depth software development and hardware research.
Architecture Overview
The FAST system is a collection of hardware and software components that manage and configure on-board resources to enable full-system prototyping and software development. The on-board resources exist as functional layers that together can prototype multiprocessor hardware systems, similar to those shown in Figure 2, at high speeds. The layers include (1) the hardware: fixed-function SRAM memories, microprocessors, and FPGA devices that can be "morphed" to provide different system-level functionality using Verilog models (morphware); and (2) the software, which we call the FAST Software Toolbox: the morphware itself, application benchmarks to be evaluated, and low-level software and operating system functionality that manages tasks such as program loading and I/O.
The FAST architecture facilitates the combination of these components to prototype a variety of architectures. FPGAs are the key components: their high density has moved them beyond simple glue logic toward complex system integration, enabling FAST and other FPGA prototyping systems. Using a PCB substrate, FAST tightly couples four processors and provides shared memory (SM) and I/O capabilities, as shown in Figure 2.
We define processor tiles rather than just processors because adding functionality to a processor requires additional chips, such as the speculative thread support provided by coprocessor 2 (CP2) in Hydra. There is also a PCB interconnect that enables interprocessor and memory system communication; it includes the read and write buses, as well as some of the control signals for memory arbitration. The shared memory (SM) provides the L2 cache infrastructure required by CMPs. Finally, FAST requires I/O capabilities for moving program data and other information on and off the PCB.
Figure 3 provides more detail on the general FAST architecture. There are four tightly coupled processor tiles, each composed of the L1 memory, CPU, and Processor FPGA, connected to the Hub FPGA. A very wide bus connects the Hub FPGA to the processor tiles, and both point-to-point and shared buses connect all of the FPGAs for a variety of data movement and control signal configurations. Centralized auxiliary components provide a base system clock, PCB management, and transparency. Several features enable off-PCB I/O, scalability to create a FAST compute fabric, and extensibility. The Hub FPGA is also connected to a second level of memory that can be shared or partitioned into private blocks. The expandable interface can also be used for a third level of memory, if desired. Because FPGAs are the flexible building blocks of the FAST architecture, the hardware can be morphed into a variety of systems, all running at hardware speeds.
Implementation Overview
The FAST PCB uses 35 unique components, but each PCB requires 4260 total components; testpoints and decoupling capacitors make up the majority of the parts. Figure 4 shows the FAST PCB with the major components labeled. Figure 5 shows the 8 JTAG zones that can be used for device programming and I/O pin observability. The PCB retains the overall structure of the FAST architecture, but the implementation differs with respect to device count: both the Processor FPGA and the Hub FPGA are each implemented with two FPGAs rather than one. Table 1 lists the main FAST components, their quantities, operating voltages, and maximum operating frequencies.
Processor Tile
The selection of the MIPS R3000 shaped the rest of the FAST implementation. The R3000 and R3010 use 5 V power, and at FAST design time no contemporary FPGAs had 5 V tolerant I/Os. This presented an implementation challenge for the Processor FPGA. There were two options: (1) use current-generation FPGAs and place level shifters between the FPGA and the CPU and FPU, or (2) use previous-generation FPGAs that are 5 V tolerant to interface to the R3000s and R3010s. Adding level shifters would have increased the complexity of the FAST design by increasing the parts count and potentially hurting performance because of the delay introduced by the level-shifting transitions. Using previous-generation FPGAs removed the need for level shifters, but these FPGAs are more resource limited than current-generation parts. Therefore, FAST uses two previous-generation FPGAs in each processor tile instead of one current-generation FPGA with level shifters. The Processor FPGA has two distinct roles: local memory controller (e.g., cache controller) and coprocessor interface. These roles map well to the two FPGAs used in the processor tile, so the additional integration and partitioning effort is minimal and neither performance nor correctness is compromised, as would have occurred with a delay element (the level shifters) in the data bus and control signals.
The FAST PCB processor tile also provides the capability to create a variety of memory subsystems by inserting an FPGA between the MIPS components and the L1 SRAMs, as shown in Figure 6. As described above, the Processor FPGA is split into two FPGAs: one that interfaces to the SRAMs and provides a memory controller, and another that supports the MIPS processor and implements additional coprocessor features. First, the L1 controller (L1C) FPGA can define different interfaces between the MIPS processor and the SRAMs, and can also create the translation mechanisms to service memory or cache requests. Thus, the L1C can present the normal L1 cache interface to the MIPS components while simultaneously using a completely different memory structure resident in the L1 SRAMs. The second FPGA, coprocessor 2 (CP2), provides additional functionality. This main XCV1000 FPGA can be used to add instructions to the MIPS ISA, add new compute engines, maintain statistics counters, or facilitate interprocessor communication.
Hub FPGA
Like the Processor FPGA, the Hub FPGA has two distinct functions: shared memory controller and interprocessor communication. Thus, we split the Hub FPGA into two XC2V6000 FPGAs to provide similar functionality while increasing the amount of available logic, reducing the resource constraints. The Hub FPGA became the Read/Write Controller FPGA, used for interprocessor communication and coordination, and the Shared Memory Controller FPGA, used to interface to the 64 MB of L2 SRAM chips. In both cases, for the Hub and Processor FPGAs, the implementation required two reconfigurable components even though the initial architecture specified one. Furthermore, parts availability and lead time were also concerns with some of the newest parts.
Read/Write Controller
The Read/Write Controller (RWC) handles memory hierarchy events that propagate past the primary caches, such as write-throughs and cache misses. A wide bus permits the observation of memory traffic to and from all processor tiles on a cycle-by-cycle basis, making it possible to implement cache coherence protocols using snooping on primary cache contents in other processor tiles. This controller could also be used for interprocessor tile messaging in systems that do not use traditional memory coherence.
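As a concrete illustration, the minimal sketch below shows one way the RWC could turn an observed write-through into invalidation requests for the other tiles. The signal names are hypothetical, and a single write-through per cycle on the shared bus is assumed; the shipped FAST prototype did not implement coherence.
<pre>
// Minimal snooping sketch for the RWC (hypothetical names): a
// write-through observed from one tile is broadcast as an invalidate
// to the other three tiles on the following cycle.
module snoop_invalidate (
    input             Clk,
    input             wr_valid,   // a write-through is on the bus
    input      [1:0]  wr_tile,    // tile that issued the write
    input      [31:0] wr_addr,    // address being written
    output reg [3:0]  inv_valid,  // per-tile invalidate strobes
    output reg [31:0] inv_addr    // address to invalidate
);
    integer t;
    always @(posedge Clk) begin
        inv_addr <= wr_addr;
        for (t = 0; t < 4; t = t + 1)
            // invalidate every tile except the writer
            inv_valid[t] <= wr_valid && (wr_tile != t);
    end
endmodule
</pre>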
Shared Memory Controller and L2 SRAMs
The Shared Memory Controller (SMC) manages 16M x 36 bits (including 4 parity bits per word) of secondary memory. These synchronous SRAMs can be configured as a secondary cache or as general-purpose, off-chip memory. The entire memory can be shared by all processor tiles or segmented, e.g., into 4M x 36 bit private partitions assigned to each processor tile. Furthermore, time-division multiplexing can be used to implement various set-associative cache configurations. The SMC also controls an 80-pin expansion connector. This multi-purpose header can be used to connect multiple FAST prototyping substrates together to create a larger FAST emulation fabric, to add daughter cards, or to attach additional memory, such as a DRAM main memory bank or Compact Flash daughter cards. The 80-pin connector can carry two 40-pin IDE interfaces simultaneously, each usable for either a Compact Flash card or a hard drive.
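The address arithmetic behind the shared and partitioned modes is simple; the following sketch (with hypothetical names) shows how two tile-ID bits can select a private 4M x 36 quadrant of the 16M-entry array when sharing is disabled.
<pre>
// Address mapping sketch for the SMC (hypothetical names). In shared
// mode all tiles see the full 16M-entry space; in private mode each
// tile's requests are confined to its own 4M-entry quadrant.
module smc_addr_map (
    input         shared,     // 1: flat shared space, 0: partitioned
    input  [1:0]  tile_id,    // requesting processor tile
    input  [23:0] req_addr,   // tile-issued word address (2^24 = 16M)
    output [23:0] sram_addr   // address driven to the L2 SRAMs
);
    assign sram_addr = shared ? req_addr
                              : {tile_id, req_addr[21:0]}; // 2^22 = 4M
endmodule
</pre>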
FAST PCB
FAST PCB details:
*16" X 16" PCB
* 20 Layer board
** 13 Signal layers
** 7 Power/ground planes
* Mixed voltage design
** 5 V, 3.3 V, 2.5 V, 1.5 V
* 4200 nets, 17000 routes, 32000 vias
** 4300 total parts (2500 testpoints for visibility and fault injection)
** 43 BGA packaged parts
* 8 JTAG chains to guarantee signal integrity
FAST PCB Stack-up: (TOP, BOTTOM, and INNER* are signal routing layers)
* TOP
* GND1
* PWR (5.0V, 2.5V, and 1.5V combined plane)
* GND2
* INNER2 (clock and reset)
* INNER3 (clock and reset)
* GND3
* INNER4
* INNER5
* INNER6
* INNER7
* INNER8
* INNER9
* INNER10
* INNER11
* INNER12
* GND4
* 3V
* GND5
* BOTTOM
FAST Software Toolbox Overview
The FAST software architecture encompasses many levels of software and tools, from low-level software that makes the PCB functional (morphware), to applications running on the prototype system (software). We present a bottom-up view of the software components and tools required to make FAST functional. These tools and software components apply to all similar flexible prototyping systems.
FAST VAL
The FAST VAL is the software layer that describes and implements the base hardware functionality. These base Verilog modules provide the MIPS interface for all designs. Additional Verilog modules are built on top of the FAST VAL and provide the prototype or thread-level parallel (TLP) architecture definition. There are also hardware definitions that provide hooks for profiling and performance counter definition. FAST also has the ability to run fully functional operating systems (OS); however, the OS must be ported, and auxiliary software must be provided to boot FAST and support the OS operations. For simplicity, we group the low-level boot software and the OS together. There are also additional drivers and TLP APIs that provide prototype-specific functionality. Finally, applications are compiled for the FAST or TLP prototype target using common, widely available tools.
Verilog Wrappers
The FAST PCB has 11 configurable devices with almost 6500 I/O pins that are mapped to various components on the PCB. The fixed function components like the SRAMs and Flash chips have no configuration to manage. For each FPGA and PLD on the FAST PCB, a generic user constraint file (UCF) maps the device pin name to the Verilog port name. Also, a Verilog wrapper for each FPGA provides the top-level port list that corresponds to the names used in the UCF file. This is the base infrastructure required to use the FAST PCB. By bundling the UCF file with a supplied Verilog FPGA wrapper, the end user can use all or a portion of the UCF file and Verilog wrapper to implement a particular architecture. Furthermore, the Verilog wrappers in combination with the UCF files provide some initial guidelines on mapping new prototype architectures to FAST by guiding prototype design partitioning across the FPGAs and defining the inter-FPGA buses.
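To make the pairing concrete, here is a minimal sketch of what such a wrapper looks like, with the corresponding UCF lines shown as comments. All port names and pin locations here are hypothetical; the real wrappers enumerate every pin on each device.
<pre>
// Hypothetical excerpt of a FAST Verilog wrapper. The matching UCF
// fragment would read, e.g.:
//   NET "Clk"          LOC = "A12";
//   NET "SramAddr<0>"  LOC = "C7";
//   NET "SramWe_n"     LOC = "B3";
module l1c_fpga_wrapper (
    input         Clk,       // tile system clock
    input         Rst_n,     // active-low reset
    inout  [35:0] CpuBus,    // shared R3000 data/tag bus
    output [17:0] SramAddr,  // L1 SRAM address pins
    inout  [35:0] SramData,  // L1 SRAM data pins (32 data + 4 parity)
    output        SramWe_n   // L1 SRAM write enable, active low
);
    // User morphware (e.g., a cache controller) is instantiated here
    // and connects only to these top-level ports.
endmodule
</pre>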
MIPS Interface Verilog
The CP2 FPGA generates the four MIPS double-frequency clocks (Clk2XSys, Clk2XSmp, Clk2XRd, and Clk2XPhi); the R3000 uses a DLL to lock onto the supplied clocks and generates the system clock, which is distributed back to all of the components in the processor tile.
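As an illustration of how such a clock can be produced on the CP2's Virtex part, the sketch below uses the Virtex CLKDLL primitive to derive a frequency-doubled clock. The module and net names are hypothetical, and the real FAST clock network is considerably more involved.
<pre>
// Hypothetical sketch: deriving one double-frequency clock on a Virtex
// (XCV1000) FPGA with the CLKDLL primitive.
module clk2x_gen (
    input  ClkIn,     // board reference clock
    output Clk2X,     // frequency-doubled clock, e.g. Clk2XSys
    output Locked     // DLL lock indicator
);
    wire clk0, clk0_buf, clk2x;

    CLKDLL dll (
        .CLKIN (ClkIn),
        .CLKFB (clk0_buf),   // deskew feedback from the 1x clock
        .RST   (1'b0),
        .CLK0  (clk0),
        .CLK2X (clk2x),
        .LOCKED(Locked)
    );

    BUFG bufg_1x (.I(clk0),  .O(clk0_buf));  // global buffer, 1x clock
    BUFG bufg_2x (.I(clk2x), .O(Clk2X));     // global buffer, 2x clock
endmodule
</pre>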
At power-on or system reset, the MIPS R3000 must be initialized. First, the R3000 must lock onto the clocks generated by the CP2 FPGA and output the MIPS system clock; this takes about 4000 cycles of the FAST initialization phase. Five additional cycles at the end of this period initialize the R3000. The initialization uses the interrupt pins to set the data block refill size, extended cache size, byte order, output tri-state, cacheless operation, data/tag bus drive control, additional phase delay, instruction streaming, CPU mode (R3000 or R2000), partial stores, and multiprocessor support.
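A minimal sketch of this reset-time configuration, assuming hypothetical names and bit assignments: during the initialization window the configuration bits are driven on the interrupt pins, which afterwards revert to carrying normal interrupt requests.
<pre>
// Hypothetical sketch of reset-time mode configuration. During the
// initialization window the CP2 drives mode-select bits on the R3000
// interrupt pins; afterwards the pins carry ordinary interrupts.
module r3000_mode_init (
    input        init_phase,  // high during the initialization window
    input  [5:0] mode_bits,   // e.g., byte order, refill size, streaming
    input  [5:0] int_req,     // normal interrupt requests (active high)
    output [5:0] Int_n        // R3000 interrupt pins (active low)
);
    assign Int_n = init_phase ? ~mode_bits : ~int_req;
endmodule
</pre>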
Once the processor is initialized, it jumps to the reset vector at address 0xbfc0_0000. This address lies in kseg1, the uncached kernel segment (and maps to physical address 0x1fc0_0000). The MIPS memory space is divided into four segments: kuseg, kseg0, kseg1, and kseg2. Both kuseg and kseg2 are translated through the translation lookaside buffer (TLB), an address translation mechanism that requires software handlers to manage the TLB entries. The processor automatically starts in kseg1, the uncached kernel memory space; in this mode, the processor uses its system bus to fill all memory requests, and each request takes at least four cycles, dramatically degrading system performance. The kseg0 memory segment can use the caches without requiring the TLB, which makes this address space the most attractive for the initial working prototypes because it minimizes software handler development. Finally, the MIPS R3000 supports protected and unprotected memory accesses by segregating the memory space into kernel and user space. This allows the R3000 and the FAST system to run modern, multitasking operating systems, an option still unavailable, circa 2006, to the software-defined processor cores from Altera and Xilinx.
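The segment boundaries are fixed by the MIPS-I architecture, so the decode is a simple function of the top address bits, as the sketch below shows (module and signal names are illustrative).
<pre>
// MIPS-I virtual-address segment decode (illustrative module).
module kseg_decode (
    input  [31:0] vaddr,
    output        mapped,   // 1: translated through the TLB
    output        uncached, // 1: unmapped and uncached (kseg1)
    output [31:0] paddr     // physical address for unmapped segments
);
    wire kuseg = ~vaddr[31];               // 0x0000_0000-0x7fff_ffff
    wire kseg0 = (vaddr[31:29] == 3'b100); // 0x8000_0000-0x9fff_ffff, cached
    wire kseg1 = (vaddr[31:29] == 3'b101); // 0xa000_0000-0xbfff_ffff, uncached
    wire kseg2 = (vaddr[31:30] == 2'b11);  // 0xc000_0000-0xffff_ffff

    // kuseg and kseg2 go through the TLB; cacheability for those comes
    // from the TLB entry, not from the segment itself.
    assign mapped   = kuseg | kseg2;
    assign uncached = kseg1;
    // kseg0/kseg1 both alias the low 512 MB of physical memory, so the
    // reset vector 0xbfc0_0000 corresponds to physical 0x1fc0_0000.
    assign paddr    = {3'b000, vaddr[28:0]};
endmodule
</pre>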
The MIPS R3000 and R3010 use a shared data and tag bus that serves as both the cache and system bus interface. This shared bus services two different transactions per clock cycle: the instruction address is supplied during phase two of the clock, and the instruction comes back on the bus during the subsequent phase one. The R3000 overlaps the data request by providing the data address during phase one and then reading or writing on the bus during the next phase two. If a cache miss occurs, the R3000 holds the address constant and enters at least one stall cycle while it waits for the memory request to be filled. Once the data is presented to the R3000, it transitions to a fix-up cycle before continuing to the normal run cycle. Thus, at least four cycles are required to service a cache miss.
The split-transaction, dual-purpose tag and data buses provide the external and internal interface to the R3000 and all of the coprocessors. It is crucial for this interface to work, but its operation and complexity should be abstracted from the rest of the system. The L1C FPGA services all cache transactions, latching the instruction and data addresses in order to provide a full clock cycle to fulfill each cache request in the correct phase. When a cache miss occurs, the CP2 FPGA handles the memory request by forwarding the address to the higher levels of memory; the CP2 services all non-cache memory requests for the processor tile while the R3000 stalls, waiting for the data. Thus, given these predefined interfaces, a FAST user only needs to provide the higher-level memory implementations and interface to a data and address bus with a few control signals for transaction handshaking.
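A minimal sketch of the address latching described above, assuming phase one corresponds to the high half of the system clock and using hypothetical signal names:
<pre>
// Phase-based address capture sketch for the L1C (hypothetical names).
// The R3000 presents the data address in phase 1 and the instruction
// address in phase 2 on the same multiplexed bus; latching each at the
// end of its phase gives the cache SRAMs a full cycle to respond.
module addr_latch (
    input             Clk,     // system clock; phase 1 = high half
    input      [31:0] BusAddr, // multiplexed address from the R3000
    output reg [31:0] DAddr,   // captured data address
    output reg [31:0] IAddr    // captured instruction address
);
    always @(negedge Clk)  // end of phase 1
        DAddr <= BusAddr;
    always @(posedge Clk)  // end of phase 2
        IAddr <= BusAddr;
endmodule
</pre>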
Stanford Small Benchmarks
The Stanford Small Benchmark Suite was assembled by John Hennessy and Peter Nye around the same time as the MIPS R3000 processors. The suite contains ten applications: eight integer benchmarks and two floating-point benchmarks. The original suite measured the execution time in milliseconds for each benchmark. The Stanford Small Benchmark Suite includes the following programs:
* Permute: A tightly recursive permutation program.
* Towers: The canonical Towers of Hanoi problem.
* Queens: The eight-queens chess problem, solved 50 times.
* Integer MM: Two 2-D integer matrices multiplied together.
* FP MM: Two 2-D floating-point matrices multiplied together.
* Puzzle: A compute-bound program.
* Quicksort: An array sorted using the quicksort algorithm.
* Bubblesort: An array sorted using the bubblesort algorithm.
* Treesort: An array sorted using the treesort algorithm.
* FFT: A floating-point Fast Fourier Transform program.
For simplicity, FAST focused on integer programs that did not require libc or floating-point support, because those would have required more FAST infrastructure development time. The simple prototype described in the next section ran the following benchmarks: Permute, Towers, Queens, Integer MM, Puzzle, Quicksort, and Bubblesort.
Simple Prototype
To demonstrate FAST's capabilities, a simple 4-way decoupled CMP (FAST CMP 4W-NC) was developed as part of the FAST Software Toolbox. This base functionality demonstrates the potential of FAST as an architecture prototyping platform. Before the rest of the FAST Software Toolbox is described, it is useful to present this motivating example, which the rest of the software components build upon.
By using the SRAM chips, the L1 cache capacity for the data and tag arrays increases by two orders of magnitude, to 256 KB plus parity bits, compared to a BRAM-only L1 cache implemented in the L1C FPGA. Again, a single SRAM-chip Verilog interface was implemented and used for both the data and instruction caches; as expected, the timing and control signals are the only Verilog interface definitions required for the SRAM data and instruction caches.
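A minimal sketch of such a single-chip SRAM interface, with hypothetical names and a parameterized address width, might look as follows; the same module would be instantiated once for the instruction cache and once for the data cache.
<pre>
// Hypothetical single-chip SRAM interface, reused for the instruction
// and data caches. Only timing and control behavior is defined here.
module l1_sram_if #(parameter AW = 16) (
    input           we_n,      // active-low write enable from the L1C
    input  [AW-1:0] addr,      // word address
    input  [35:0]   wdata,     // 32 data bits + 4 parity bits
    output [35:0]   rdata,     // read data back to the L1C
    // SRAM device pins
    output [AW-1:0] SramAddr,
    output          SramWe_n,
    inout  [35:0]   SramData
);
    assign SramAddr = addr;
    assign SramWe_n = we_n;
    assign SramData = we_n ? 36'bz : wdata; // drive the bus only on writes
    assign rdata    = SramData;
endmodule
</pre>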
The next component of the FAST CMP 4W-NC is the private, per-tile L2 memory. The L2 memory is implemented in the RWC using four BRAM blocks, one block for each processor tile. Each BRAM block is generated using the Xilinx CORE Generator, a tool that specifies all of the configuration details for BRAMs, including the width, the number of entries, handshaking, the BRAM primitive, and the initialization file, to name a few. Each L2 memory is 8192 x 36 bits (36 KB including parity, 32 KB of data). The BRAM blocks use the 512 x 36 bit primitive, which requires multiplexers to select the BRAM primitive for reading or writing. This BRAM configuration is not optimized for performance, but it allows the memory block to use the parity bits in the primitive; if performance were an issue, bit slicing could be used instead.
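For illustration, the sketch below composes one 8192 x 36 block from sixteen 512 x 36 RAMB16_S36 primitives, with the upper address bits steering the write enable and the registered read multiplexer; the real blocks were produced by CORE Generator rather than written by hand.
<pre>
// Illustrative hand-written equivalent of one per-tile L2 BRAM block:
// sixteen 512 x 36 primitives form an 8192 x 36 memory.
module l2_bram_block (
    input         Clk,
    input         we,
    input  [12:0] addr,   // 8192 entries
    input  [35:0] wdata,
    output [35:0] rdata
);
    wire [3:0]  sel = addr[12:9];  // selects one of 16 primitives
    reg  [3:0]  sel_q;
    wire [35:0] rd [0:15];

    always @(posedge Clk)
        sel_q <= sel;  // align the read mux with the synchronous read

    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : banks
            RAMB16_S36 bram (
                .CLK (Clk),
                .EN  (sel == i),
                .SSR (1'b0),
                .WE  (we && (sel == i)),
                .ADDR(addr[8:0]),
                .DI  (wdata[31:0]),  .DIP(wdata[35:32]),
                .DO  (rd[i][31:0]),  .DOP(rd[i][35:32])
            );
        end
    endgenerate

    assign rdata = rd[sel_q];
endmodule
</pre>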
This simple system integrates the L1 SRAMs, CP2 FPGA, L1C FPGA, and RWC FPGA. Even though this system is not cache coherent, communication between the CP2 and the RWC demonstrates that the additional coherence functionality could be implemented. Thus, time and software implementation effort are the only limiting factors in mapping new architectures onto the FAST system.
Papers and Presentations
# [http://portal.acm.org/ft_gateway.cfm?id=1105740&type=pdf&coll=GUIDE&dl=GUIDE&CFID=11016435&CFTOKEN=85293490 A chip prototyping substrate: the flexible architecture for simulation and testing (FAST).] Davis, J.D., et al., SIGARCH Comput. Archit. News, 2005. 33(4): p. 34-43.
# [http://www.cse.ucsd.edu/~rakumar/dasCMP05/talk04.pdf.gz A chip prototyping substrate: the flexible architecture for simulation and testing (FAST).] Davis, J.D., et al., dasCMP presentation, 2005.
# [http://www.cag.csail.mit.edu/warfp2005/submissions/33-davis.pdf A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems.] Davis, J.D., et al., WARFP 2005, abstract.
# [http://www.cag.csail.mit.edu/warfp2005/slides/davis-warfp2005.pdf A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems.] Davis, J.D., et al., WARFP 2005, presentation.
People Involved
Faculty: [http://ogun.stanford.edu/~kunle/ Kunle Olukotun], Stephen Richardson
Post Doc: Lance Hammond
Staff: Alan Swithenbank
Graduate Student: [http://research.microsoft.com/users/john.d/ John D. Davis]
Masters Students: Charis Charitsis, Wade Gupta, Lark-Hoon Leem, & Amarachi Okorie
Undergraduate Students: Joshua Baylor, Bob Bell, Daxia Ge, Bryan Golden, Brian Mariner, Garrett Smith, Bert Shen, John Vinyard, & Michael Yu
Project Home Page
: [http://www-hydra.stanford.edu/fast/fast.shtml Stanford's Flexible Architecture for Simulation and Testing (FAST) ]
Similar Projects and Links
Architecture Projects
: [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/1995/02/r2toc.xml&DOI=10.1109/2.347997 RPM]
: [http://ieeexplore.ieee.org/iel4/54/15307/00706042.pdf RPM2]
: [http://www.cag.lcs.mit.edu/raw/ MIT RAW]
: [http://bwrc.eecs.berkeley.edu/Research/BEE/ BEE]
: [http://bee2.eecs.berkeley.edu/ BEE2]
: [http://www-hydra.stanford.edu/ Stanford Hydra]
: [http://tcc.stanford.edu/prototypes/#atlas ATLAS]
: [http://ramp.eecs.berkeley.edu/ RAMP]
References
: "FAST: A FLEXIBLE ARCHITECTURE FOR SIMULATION AND TESTING OF MULTIPROCESSOR AND CMP SYSTEMS", Davis, J.D., thesis in Electrical Engineering. 2006, Stanford University: Stanford, CA: [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/1995/02/r2toc.xml&DOI=10.1109/2.347997 RPM: A Rapid Prototyping Engine for Multiprocessor Systems.] Barroso, L., et al., Computer, 1995. 28(2): p. 26-34.: [http://ramp.eecs.berkeley.edu/publications.php RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform. ] Arvind, et al., 2005: BEE2: A High-End Reconfigurable Computing System. Chang, C., J. Wawrzynek, and R.W. Brodersen, IEEE Des. Test, 2005. 22(2): p. 114-125. : The Stanford Hydra CMP. Hammond, L., et al., IEEE Micro, 2000. 20(2): p. 71-84. : Stanford Small Benchmark Suite. Hennessy, J. and P. Nye, 1989.