Software pipelining

Software pipelining

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.

Example

Consider the following loop: for i = 1 to bignumber A(i) B(i) C(i) end

In this example, let A(i), B(i), C(i), be instructions, each operating on data i, that are dependent on each other. In other words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations will execute in the following sequence: A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 7 cycles to complete (3 + 3 + 1, because A(i+1) does not have to wait for C(i)).

Now consider the following sequence of instructions (with software pipelining): A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for the moment that bignumber is divisible by 3): for i = 1 to (bignumber - 2) step 3 A(i) A(i+1) A(i+2) B(i) B(i+1) B(i+2) C(i) C(i+1) C(i+2) end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the number of iterations will be divisible by the number of iterations we unroll. See the article on loop unrolling for more on solutions to this problem. It should be noted, however, that software pipelining prevents the use of Duff's device, a widely known and efficient solution to this problem.

In the general case, loop unrolling may not be the best way to implement software pipelining. Consider a loop containing instructions with a high latency. For example, the following code: for i = 1 to bignumber A(i) ; 3 cycle latency B(i) ; 3 C(i) ; 12(perhaps a floating point operation) D(i) ; 3 E(i) ; 3 F(i) ; 3 endwould require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache performance, "see code bloat"). Even worse, the prolog (code before the loop for handling the case of bignumber not divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the execution will spend most of its time in this inefficient prolog code, rendering the software pipelining optimization ineffectual.

Here is an alternative implementation of software pipelining for our example (the "prolog" and "postlog" will be explained later): prolog for i = 1 to (bignumber - 6) A(i+6) B(i+5) C(i+4) D(i+2) ; note that we skip i+3 E(i+1) F(i) end postlogBefore getting to the prolog and postlog, which handle iterations at the beginning and end of the loop, let's verify that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration 7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from iteration 7 of the original loop. The sequence of instructions is::Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1):Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2):Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3):Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4):Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5):Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6):Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoid the bottleneck at instruction C. Note that there are 12 instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7) are used for other instructions instead of being wasted.

The prolog and postlog handle iterations at the beginning and end of the loop. Here is a possible prolog for our example above: ; loop prolog (arranged on lines for clarity) A(1) A(2), B(1) A(3), B(2), C(1) A(4), B(3), C(2) A(5), B(4), C(3), D(1) A(6), B(5), C(4), D(2), E(1)Each line above corresponds to an iteration of the pipelined loop, with instructions for iterations that have not yet begun removed. The postlog would look similar.

Difficulties of implementation

The requirement of a prolog and postlog is one of the major difficulties of implementing software pipelining. Note that the prolog in this example is 18 instructions, 3 times as large as the loop itself. The postlog would also be 18 instructions. In other words, the prolog and postlog together are "6 times as large as the loop itself". While still better than attempting loop unrolling for this example, software pipelining requires a trade-off between speed and memory usage. Keep in mind, also, that if the code bloat is too large, it will affect speed anyway via a decrease in cache performance.

A further difficulty, which may render this implementation of software pipelining useless, is that on many architectures, most instructions use a register as an argument, and that the specific register to use must be hard-coded into the instruction. In other words, on many architectures, it is impossible to code such an instruction as "multiply the contents of register X and register Y and put the result in register Z", where X, Y, and Z are numbers taken from other registers or memory. To see why this is a problem, consider the following example: for i = 1 to bignumber load from memory x [i] into register r1 ; A(i) load from memory y [i] into register r2 ; B(i) multiply r3 = r1 * r2 ; C(i) store register r3 into memory z [i] ; D(i) end

It should be clear that instruction C(i) relies upon the results from A(i) and B(i) still being stored in their respective registers. Similarly, instruction D(i) relies upon the results of instruction C(i) still being stored in register r3. If we are to pipeline this loop, we need to first make some changes to which registers are used by each instruction, and also insert some instructions to move values from register to register. For example, we could decide that instructions A and B still use registers r1 and r2, but that instruction C uses registers r3, r4, and r5, and that instruction D uses register r6. We might then rewrite the loop as follows: for i = 1 to bignumber load from memory x [i] into register r1 ; A(i) load from memory y [i] into register r2 ; B(i) move the contents of r1 to r3 ; A'(i) move the contents of r2 to r4 ; B'(i) multiply r5 = r3 * r4 ; C(i) move the contents of r5 to r6 ; C'(i) store register r6 into memory z [i] ; D(i) end

Now, the requirements for scheduling of instruction A'(i), for example, are:
# A'(i) must occur after A(i)
# A'(i) must occur before C(i)
# A'(i) must occur after the previous value of r3 is read (in the particular pipelined version below, this will be in instruction C(i-1))
# A'(i) must occur before the next value of r1 is written (in the particular pipelined version below, this will be in instruction A(i+1))

With these requirements (and similar requirements for B'(i) and C'(i)) in mind, we can pipeline the loop as follows: prolog for i = 1 to (bignumber - 2) load from memory x [i+2] into register r1 ; A(i+2) load from memory y [i+2] into register r2 ; B(i+2) move the contents of r5 to r6 ; C'(i) multiply r5 = r3 * r4 ; C(i+1) store register r6 into memory z [i] ; D(i) move the contents of r1 to r3 ; A'(i+2) move the contents of r2 to r4 ; B'(i+2) end postlog

This particular instruction scheduling was done with the assumption that moving the contents of one register into another is a relatively fast operation, and should be scheduled as late as possible. To verify that this has the same affect as the non-pipelined version, let us follow the instructions for iteration 3 in the original loop::Iteration 1: A(3) B(3) C'(1) C(2) D(1) A'(3) B'(3):Iteration 2: A(4) B(4) C'(2) C(3) D(2) A'(4) B'(4):Iteration 3: A(5) B(5) C'(3) C(4) D(3) A'(5) B'(5)

To evaluate what affect the pipelining would have on performance in this example, let us assume, as before, that instructions with no dependencies can be dispatched in the immediately following cycle. Let us also assume latency of 3 for each instruction, except the move instructions, for which we will assume a latency of 1. Let us look at Iteration 2 above:
* A(4) and B(4) cannot execute until A'(3) and B'(3), respectively, have completed (otherwise, there would be contention for registers r1 and r2). However, since we have assumed a latency of 1 for the latter instructions, A(4) and B(4) will not be delayed by this dependency.
* C'(2) similarly depends on C(2) and D(1). D(1) occurs 5 instructions prior to C(2) and therefore will not cause C'(2) to be delayed.
* Similarly, for C(3), D(2), A'(4), and B'(4), it can be verified that none of the dependencies cause a delay in scheduling.Therefore, the 7 instructions per iteration are dispatched in 7 cycles, giving a total cost of 7 cycles per iteration. Compare this to the non-pipelined case:
* C(i) must wait for A(i) and B(i) to complete, resulting in C(i) being dispatched 3 cycles after B(i)
* D(i) must wait for C(i) to complete, resulting in D(i) being dispatched 3 cycles after C(i)
* A(i+1) can be dispatched immediately after D(i) (no dependency)This results in 1 + 3 + 3 + 1 = 8 cycles per iteration. In other words, the extra instructions we inserted have almost entirely offset the benefit of pipelining, at a cost of using twice as many registers.

This difficulty is very common and makes software pipelining difficult on many architectures.

IA-64 implementation

Intel's IA-64 architecture provides an example of an architecture designed with the difficulties of software pipelining in mind. Some of the architectural support for software pipelining includes:
* A "rotating" register bank; instructions can refer to a register number that is redirected to a different register each iteration of the loop (eventually looping back around to the beginning). This makes the extra instructions inserted in the previous example unnecessary.
* Predicates (used to "predicate" instructions; "see Branch predication") that take their value from special looping instructions. These predicates turn on or off certain instructions in the loop, making a separate prolog and postlog unnecessary.


Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Software Pipelining — bezeichnet die Programmierung eines Prozessors mit mehreren Ausführungseinheiten, sodass möglichst viele von ihnen gleichzeitig beschäftigt sind. Das Verfahren dient also dazu, die Zeit für eine Berechnung zu verkürzen, indem mehrere parallele… …   Deutsch Wikipedia

  • Cell software development — Software development for the cell microprocessor involve a mixture of conventional development practices for the POWER architecture compatible PPU core, and novel software development challenges with regards to the functionally reduced SPU… …   Wikipedia

  • HTTP pipelining — is a technique in which multiple HTTP requests are written out to a single socket without waiting for the corresponding responses. Pipelining is only supported in HTTP/1.1, not in 1.0.The pipelining of requests results in a dramatic improvement… …   Wikipedia

  • Программная конвейеризация — циклов (англ. software pipelining)  это техника, используемая компиляторами, для оптимизации циклов, по аналогии с вычислительным конвейером в микропроцессорах. Является формой внеочередного исполнения с той разницей, что… …   Википедия

  • Compiler optimization — is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the… …   Wikipedia

  • Pipeline (Begriffsklärung) — Eine Pipeline (Englisch „Rohrleitung“, von pipe „Rohr, Röhre“ und line „Linie, Leitung, Verbindung“) ist: ein Rohrleitungssystem zum Transport von Flüssigkeiten oder Gasen, siehe Pipeline, in der Massenproduktion eine „Bearbeitungsröhre“, siehe… …   Deutsch Wikipedia

  • Open64 — Developer(s) Silicon Graphics, Inc., Institute of Computing Technology, Chinese Academy of Sciences, Hewlett Packard, University of Delaware Initial release 2002 Stable release 5.0 / November 10, 2011; 9 days ago ( …   Wikipedia

  • Cydrome — was a computer company started in 1984 in San Jose, California whose mission was to develop a numeric processor. The founders were David Yen, Wei Yen, Ross Towle, Arun Kumar, and Bob Rau (the chief architect). In order to improve performance in a …   Wikipedia

  • Stream Processors, Inc — Infobox Company company name = Stream Processors Incorporated (SPI) company company type=Private foundation = 2004 location=455 DeGuigne Drive Sunnyvale, California flagicon|USA USA key people =Bill Dally, Co Founder and Chairman Chip Stearns,… …   Wikipedia

  • Codeplay — Software Limited is a compiler and software tools developer based in Edinburgh, Scotland. Codeplay develop C/C++ compilers for multi core and special purpose processor architectures, offering their Sieve C++ Multicore Programming System as a… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”