Register renaming

Register renaming

In computer engineering, register renaming refers to a technique usedto avoid unnecessary serialization of program operations imposed by the reuseof registers by those operations.

Problem definition

Programs are composed of instructions which operate on values. Theinstructions must name these values in order to distinguish them from oneanother. A typical instruction might say, add X and Y and put the resultin Z. In this instruction, X, Y, and Z are the names of storage locations.

In order to have a compact instruction encoding, most processor instructionsets have a small set of special locations which can be directly named.For example, the x86 instruction set architecture has 8 integer registers,x86-64 has 16, many RISCs have 32, and IA-64 has 128. In smaller processors,the names of these locations correspond directly to elements of a
register file.

Different instructions may take different amounts of time (e.g, CISC architecture). For instance, a processor may be able to execute hundreds of instructions while asingle load from main memory is in progress. Shorter instructions executed while the load is outstanding will finish first, thus the instructions are finishing out of the original program order. Out-of-order execution has been used in most recent high-performance CPUs to achieve some of their speed gains.

Consider this piece of code running on an out-of-order CPU:

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, butthe processor cannot finish 4 until 3 is done, because 3 would then writethe wrong value.

We can eliminate this restriction by changing the names of some of theregisters:

Now instructions 4, 5, and 6 can be executed in parallel with instructions1, 2, and 3, so that the program can be executed faster.

When possible, the compiler performs this renaming. The compiler is constrained in many ways, primarily by the finite number of register namesin the instruction set. Many high performance CPUs have more physical registers than may be named directly in the instruction set, so rename registers in hardware to achieve additional parallelism.

Hazards and renaming

When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to three kinds of problems, also known as hazards:

*Read-after-write (RAW):A read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.

*Write-after-write (WAW):Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.

*Write-after-read (WAR):A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies.

:Instead of delaying the write until all reads are completed, two copies of the location can be maintained, the old value and the new value. Reads that precede, in program order, the write of the new value can be provided with the old value, even while other reads that follow the write are provided with the new value. The false dependency is broken and additional opportunities for out-of-order execution are created. When all reads needing the old value have been satisfied, it can be discarded. This is the essential concept behind register renaming.

Anything that is read and written can be renamed. While the general-purpose and floating-point registers are discussed the most, flag and status registers or even individual status bits are commonly renamed as well.

Memory locations can also be renamed, although it is not commonly done to the extent practiced in register renaming. The Transmeta Crusoe processor's gated store buffer is a form of memory renaming.

If programs refrained from reusing registers immediately, therewould be no need for register renaming. Some instruction sets (e.g. IA-64)specify very large numbers of registers for specifically this reason.There are limitations to this approach:

* It is very difficult for the compiler to avoid reusing registers without large code size increases. In loops, for instance, successive iterations would have to use different registers, which requires replicating the code (but see register rotation)
* Large numbers of registers require lots of bits to specify those registers, making the code size increase.
* Many instruction sets historically specified smaller numbers of registers and cannot be changed now.

Code size increases are important because when the program code islarger, the instruction cache misses more often and the processorstalls waiting for new instructions.

Architectural vs physical registers

Machine language programs specify reads and writes to a limited set of registers specified by the instruction set architecture (ISA).For instance, the DEC Alpha ISA specifies 32 integer registers, each 64 bits wide, and 32 floating-point registers, each 64 bits wide. These are the architectural registers. Programs written for processors running the Alpha instruction set will specify operations reading and writing those 64 registers. If a programmer stops the program in a debugger, she or he can observe the contents of these 64 registers (and a few status registers) to determine the progress of the machine.

One particular processor which implements this ISA, the Alpha 21264, has 80 integer and 72 floating-point physical registers. There are, on an Alpha 21264 chip, 80 physically separate locations which can store the results of integer operations, and 72 locations which can store the results of floating point operations. (In fact, there are even more locations than that, but those extra locations are not germane to the register renaming operation.)

Below are described two styles of register renaming, distinguished by the circuit which holds data ready for an execution unit.

In all renaming schemes, the machine converts the architecturalregisters referenced in the instruction stream into tags. Where thearchitectural registers might be specified by 3 to 5 bits, the tagsare usually a 6 to 8 bit number. The rename file must havea read port for every input of every instruction renamed every cycle,and a write port for every output of every instruction renamed everycycle. Because the size of a register file generally grows as thesquare of the number of ports, the rename file is usually physicallylarge and consumes significant power.

In the tag-indexed register file style, there is one large register file for data values, containing one register for every tag. For example, if the machine has 80 physical registers, then it would use 7 bit tags. 48 of the possible tag values in this case are unused.

In this style, when an instruction is issued to an execution unit, the tags of the source registers are sent to the physical register file, where the values corresponding to those tags are read and sent to the execution unit.

In the reservation station style, there are many small associative register files, usually one at the inputs to each execution unit. Each operand of each instruction in an issue queue has a place for a value in one of these register files.

In this style, when an instruction is issued to an execution unit, the register file entries corresponding to the issue queue entry are read and forwarded to the execution unit.

Architectural Register Fileor Retirement Register File (RRF)The committed register state of the machine.RAM indexed by logical register number.Typically written into as results are retiredor committed out of a reorder buffer.
Future FileThe most speculative register state of the machine.RAM indexed by logical register number.
Active Register FileThe Intel P6 group's term for Future File.
History BufferTypically used in combination with a future file.Contains the "old" values of registers that havebeen overwritten. If the producer is still in flightit may be RAM indexed by history buffer number.After a branch misprediction must use resultsfrom the history buffer - either they are copied,or the future file lookup is disabled and thehistory buffer is CAM indexed by logical registernumber.
Reorder Buffer (ROB)

Pretty much any structure that is sequentially (circularly)indexed on a per operation basis, for instructions in flight.Except... differs from a history buffer, in that thereorder buffer typically comes after the future file (if it exists)and before the architectural register file.

Reorder buffers come in data-less and data-ful versions.

In Willamette's ROB, the ROB entries point to registersin the physical register file (PRF), and also contain otherbookkeeping. This was also the first OOO design done by Andy Glew,at Illinois with HaRRM.

In P6's ROB, the ROB entries contain data; there is noseparate PRF. Data values from the ROB are copiedfrom the ROB to the RRF at retirement.

Small detail: if there is temporallocality in ROB entries, i.e. if instructions close togetherin the Von Neuman instruction sequence write backclose together in time, it may be possible to perform writecombining on ROB entries and so have fewer ports thana separate ROB/PRF would. It's not clear if it makes a difference,since a PRF should be banked.

ROBs usually don't have associative logic,and certainly none of the ROBs designed by Andy Glew have CAMs.Keith Dieffendorf insisted that ROBs havecomplex associative logic for many years. The firstROB proposal may have had CAMs.

Details: tag-indexed register file

This is the renaming style used in the MIPS R10000, the 21264, and inthe FP section of the AMD Athlon.

In the renaming stage, every architectural register referenced (forread or write) is looked up in an architecturally-indexed remap file.This file returns a tag and a ready bit. The tag is unreadyif there is a queued instruction which will write to it that has notyet executed. For read operands, this tag takes the place of thearchitectural register in the instruction. For every register write,a new tag is pulled from a free tag FIFO,and a new mapping is written into the remap file, so thatfuture instructions reading the architectural register will refer tothis new tag. The tag is marked as unready, because the instructionhas not yet executed. The previous physical register allocated forthat architectural register is saved with the instruction in thereorder buffer, which is a FIFO that holds the instructions inprogram order between the decode and graduation stages.

The instructions are then placed in various issue queues.

As instructions are executed, the tags for their results arebroadcast, and the issue queues match these tags against the tagsof their unready source operands. A match means that the operandis ready. The remap file also matches these tags, so that it can markthe corresponding physical registers as ready.

When all the operands to an instruction in an issue queue are ready,that instruction is ready to issue. The issue queues pick readyinstructions to send to the various functional units each cycle.Unready instructions stay in the issue queues. This unordered removalof instructions from the issue queues is one of the things that makesthem large and use lots of power.

Issued instructions read from a tag-indexed physical register file(bypassing just-broadcast operands), then execute.

Execution results are written to tag-indexed physical register file,as well as broadcast to the bypass network preceding each functionalunit.

Graduation puts the previous tag for the written architecturalregister into the free queue so that it can be reused for a newlydecoded instruction.

An exception or branch misprediction causes the remap file to back upto the remap state at last valid instruction via combination of statesnapshots and cycling through the previous tags in the in-orderpre-graduation queue. Since this mechanism is required, and since itcan recover any remap state (not just the state before the instructioncurrently being graduated), branch mispredictions can be handledbefore the branch reaches graduation, potentially hiding the branchmisprediction latency.

Details: reservation stations

This is the style used in the integer section of the AMD K7 and K8designs.

In the renaming stage, every architectural register referenced forreads is looked up in both the architecturally-indexed future fileand the rename file.The future file read gives the value of that register, if there is nooutstanding instruction yet to write to it (i.e. it's "ready").When the instruction is placed in an issue queue, the values read fromthe future file are written into the corresponding entries in thereservation stations. Register writes in the instruction cause a new,unready tag to be written into the rename file. The tag number isusually serially allocated in instruction order -- no free tag FIFO isnecessary.

Just as with the tag-indexed scheme, the issue queues wait for unreadyoperands to see matching tag broadcasts. Unlike the tag-indexedscheme, matching tags cause the corresponding broadcast value to bewritten into the issue queue entry's reservation station.

Issued instructions read their arguments from the reservation station,bypass just-broadcast operands, and then execute. As mentionedearlier, the reservation station register files are usually small,with perhaps eight entries.

Execution results are written to the reorder buffer, to the reservationstations (if the issue queue entry has a matching tag), and to thefuture file if this is the last instruction to target thatarchitectural register (in which case register is marked ready).

Graduation copies the value from the reorder buffer into thearchitectural register file. The sole use of the architecturalregister file is to recover from exceptions and branch mispredictions.

Exceptions and branch mispredictions, recognised at graduation, causethe architectural file to be copied to the future file, and allregisters marked as ready in the rename file. There is usually no way toreconstruct the state of the future file for some instruction intermediatebetween decode and graduation, so there is usually no way to do earlyrecovery from branch mispredictions.

Comparison between the schemes

In both schemes, instructions are inserted in-order into the issuequeues, but are removed out-of-order. If the queues do not collapse"holes", then they will either have many unused entries, or requiresome sort of variable priority encoding for when multiple instructionsare simultaneously ready to go. Queues that collapse holes havesimpler priority encoding, but require simple but large circuitry toadvance instructions through the queue.

Reservation stations have better latency from rename to execute,because the rename stage finds the register values directly, ratherthan finding the physical register number, and then using that to findthe value. This latency shows up as a component of the branchmispredict latency.

Reservation stations also have better latency from instruction issueto execution, because each local register file is smaller than thelarge central file of the tag-indexed scheme. Tag generation andexception processing are also simpler in the reservation stationscheme, as discussed below.

The physical register files used by reservation stations usuallycollapse unused entries in parallel with the issue queue they serve,which makes these register files larger in aggregate, and burn more power,and more complicated than the simpler register files used in a tag-indexedscheme. Worse yet, every entry in each reservation station can bewritten by every result bus, so that a reservation-station machinewith e.g. 8 issue queue entries per functional unit will typicallyhave 9 times as many bypass networks as an equivalent tag-indexedmachine. Result forwarding thus takes much more power and area thanin a tag-indexed design.

Furthermore, the reservation station scheme has four places (FutureFile, Reservation Station, Reorder Buffer and Architectural File)where a result value can be stored, where the tag-indexed scheme hasjust one (the physical register file). Because the results from thefunctional units, broadcast to all these storage locations, must reacha much larger number of locations in the machine than in thetag-indexed scheme, this function consumes more power, area, and time.

History

The original R10000 design had neither collapsing issue queues norvariable priority encoding, and suffered starvation problems as aresult -- the oldest instruction in the queue would sometimes not beissued until both instruction decode stopped completely for lack ofrename registers, and every other instruction had been issued. Laterrevisions of the design used a partially variable priority encoder tomitigate this problem.

Early out-of-order machines did not separate the renaming andROB/PRF storage functions. For that matter, some of theearliest, such as Sohi's RUU or the Metaflow DCAF,combined scheduling, renaming, and storage all in the samestructure.

Most modern machines do renaming by RAM indexing a map tablewith the logical register number. E.g. P6 did this; future files do this,and have data storage in the same structure.

However, earlier machines used CAMs in the renamer.E.g. the HPSM RAT, or Register Alias Table, essentiallyused a CAM on the logical register number in combinationwith different versions of the register.

In many ways, the story of out-of-order microarchitecturehas been how these CAMs have been progressively eliminated.Small CAMs are useful; large CAMs are impractical.

The P6 "microarchitecture" was the first Intel based processors that implemented both out-of-order execution and register renaming. The P6 microarchitecture manifested in Pentium Pro, Pentium II, Pentium III, Pentium M, Core, and Core 2 microprocessors.

References

* [http://lmi17.cnam.fr/~anceau/Documents/smith.pdf "Implementing Precise Exceptions in Pipelined Processors"] , J.E. Smith and A.R. Pleszkun, 1985


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

  • Register Renaming — Registerumbenennung (engl. Register Renaming) bezeichnet eine Phase der Befehlsverarbeitung, die hauptsächlich in superskalaren Mikroprozessoren zum Einsatz kommt. Üblicherweise findet in der Dekodierstufe (ID) während der Dekodierung einer… …   Deutsch Wikipedia

  • Register renaming — Registerumbenennung (engl. Register Renaming) bezeichnet eine Phase der Befehlsverarbeitung, die hauptsächlich in superskalaren Mikroprozessoren zum Einsatz kommt. Üblicherweise findet in der Dekodierstufe (ID) während der Dekodierung einer… …   Deutsch Wikipedia

  • Register file — A register file is an array of processor registers in a central processing unit (CPU). Modern integrated circuit based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having… …   Wikipedia

  • Register window — In computer engineering, the use of register windows is a technique to improve the performance of a particularly common operation, the procedure call. By devoting hardware to this problem, almost all computer programs will run… …   Wikipedia

  • Register Alias Table — In a computer processor, a register alias table is a unit that maps architectural registers to physical registers. See also Register renaming …   Wikipedia

  • Processor register — In computer architecture, a processor register is a small amount of storage available as part of a CPU or other digital processor. Such registers are (typically) addressed by mechanisms other than main memory and can be accessed more quickly.… …   Wikipedia

  • National Register of Historic Places listings in Detroit, Michigan — Location of Detroit in Michigan …   Wikipedia

  • Comparison of CPU architectures — Contents 1 Factors 1.1 Bits 1.2 Operands 1.3 Endianess 2 Architectures …   Wikipedia

  • Microarchitecture — Computer organization redirects here. For organizations that make computers, see List of computer system manufacturers. For one classification of computer architectures, see Flynn s taxonomy. For another classification of instruction set… …   Wikipedia

  • x86 — This article is about Intel microprocessor architecture in general. For the 32 bit generation of this architecture which is also called x86 , see IA 32. x86 Designer Intel, AMD Bits 16 bit, 32 bit, and/or 64 bit Introduced 1978 Design …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”