Vectorization (computer science)

Vectorization, in computer science, is the process of converting a computer program from a scalar implementation, which performs an operation on one pair of operands at a time, to a vectorized implementation, in which a single instruction performs the same operation on a pair of vector operands (series of adjacent values). Vector processing is a major feature of both conventional and modern supercomputers.

One major research topic in computer science is the search for methods of automatic vectorization: techniques that would allow a compiler to convert scalar programs into vectorized programs without human assistance.

Automatic vectorization

Automatic vectorization, in the context of a computer program, refers to the automatic transformation of a series of operations performed sequentially, one operation at a time, to operations performed in parallel, several at once, in a manner suitable for processing by a vector processor.

An example would be a program to multiply two vectors of numeric data. A scalar approach would be something like:

for (i = 0; i < 1024; i++) {
    C[i] = A[i] * B[i];
}

This could be transformed to vectorized code something like:

for (i = 0; i < 1024; i += 4) {
    C[i:i+3] = A[i:i+3] * B[i:i+3];
}

Here, C[i:i+3] represents the four array elements from C[i] to C[i+3], and we assume that the vector processor can perform four operations for a single vector instruction. Since the four operations complete in roughly the same time as a single scalar instruction, the vector code can run up to four times faster than the original code.

There are two distinct compiler approaches: one based on the conventional vectorization technique and the other based on loop unrolling.

Loop-level automatic vectorization

This technique, used for conventional vector machines, tries to find and exploit SIMD parallelism from the loop level. It consists of two major steps as follows.

# Find an innermost loop that can be vectorized
# Transform the loop and generate vector code

In the first step, the vectorizing compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is a true data dependence shorter than the vector length. Other obstacles include function calls and short iteration counts.

Once the loop is determined to be vectorizable, it is strip-mined by the vector length and each scalar instruction within the loop body is replaced with the corresponding vector instruction. Below, the component transformations for this step are shown using the above example.
* After strip-mining

  for (i = 0; i < 1024; i += 4) {
      for (ii = 0; ii < 4; ii++) {
          C[i+ii] = A[i+ii] * B[i+ii];
      }
  }

* After loop distribution using temporary arrays

  for (i = 0; i < 1024; i += 4) {
      for (ii = 0; ii < 4; ii++) tA[ii] = A[i+ii];
      for (ii = 0; ii < 4; ii++) tB[ii] = B[i+ii];
      for (ii = 0; ii < 4; ii++) tC[ii] = tA[ii] * tB[ii];
      for (ii = 0; ii < 4; ii++) C[i+ii] = tC[ii];
  }

* After replacing with vector codes

  for (i = 0; i < 1024; i += 4) {
      vA = vec_ld(&A[i]);
      vB = vec_ld(&B[i]);
      vC = vec_mul(vA, vB);
      vec_st(vC, &C[i]);
  }

Basic block level automatic vectorization

This relatively new technique specifically targets modern SIMD architectures with short vector lengths. Although loops can be unrolled to increase the amount of SIMD parallelism in basic blocks, this technique exploits SIMD parallelism within basic blocks rather than across loop iterations. As a result, SIMD code can be generated for basic blocks outside loop nests. The two major steps are as follows.

# The innermost loop is unrolled by a factor of the vector length to form a large loop body.
# Isomorphic scalar instructions (that perform the same operation) are packed into a vector instruction if dependencies do not prevent doing so.

To show step-by-step transformations for this approach, the same example is used again.
* After loop unrolling (by the vector length, assumed to be 4 in this case)

  for (i = 0; i < 1024; i += 4) {
      sA0 = ld(&A[i+0]);
      sB0 = ld(&B[i+0]);
      sC0 = sA0 * sB0;
      st(sC0, &C[i+0]);
      ...
      sA3 = ld(&A[i+3]);
      sB3 = ld(&B[i+3]);
      sC3 = sA3 * sB3;
      st(sC3, &C[i+3]);
  }

* After packing

  for (i = 0; i < 1024; i += 4) {
      (sA0, sA1, sA2, sA3) = ld(&A[i+0:i+3]);
      (sB0, sB1, sB2, sB3) = ld(&B[i+0:i+3]);
      (sC0, sC1, sC2, sC3) = (sA0, sA1, sA2, sA3) * (sB0, sB1, sB2, sB3);
      st((sC0, sC1, sC2, sC3), &C[i+0:i+3]);
  }

* After code generation

  for (i = 0; i < 1024; i += 4) {
      vA = vec_ld(&A[i]);
      vB = vec_ld(&B[i]);
      vC = vec_mul(vA, vB);
      vec_st(vC, &C[i]);
  }

Here, sA1, sB1, ... represent scalar variables and vA, vB, and vC represent vector variables.

Most commercial compilers that vectorize automatically use the conventional loop-level approach; the IBM XL Compiler uses both.

Vectorization in the presence of control flow

When there are if statements in the loop body, the instructions in all control paths have to be executed so that the multiple values of a variable can be merged. One general approach is to go through a sequence of code transformations: predication → vectorization (using one of the above methods) → removal of vector predicates → removal of scalar predicates. The following code is used as an example to show these transformations:

for (i = 0; i < 1024; i++) {
    if (A[i] > 0)
        C[i] = B[i];
    else
        D[i] = D[i-1];
}

* After predication

  for (i = 0; i < 1024; i++) {
      P  = A[i] > 0;
      NP = !P;
      C[i] = B[i];     (P)
      D[i] = D[i-1];   (NP)
  }

  where (P) denotes a predicate guarding the statement.

* After vectorization

  for (i = 0; i < 1024; i += 4) {
      vP  = A[i:i+3] > (0, 0, 0, 0);
      vNP = vec_not(vP);
      C[i:i+3] = B[i:i+3];            (vP)
      (NP1, NP2, NP3, NP4) = vNP;
      D[i]   = D[i-1];                (NP1)
      D[i+1] = D[i];                  (NP2)
      D[i+2] = D[i+1];                (NP3)
      D[i+3] = D[i+2];                (NP4)
  }

* After removing vector predicates

  for (i = 0; i < 1024; i += 4) {
      vP  = A[i:i+3] > (0, 0, 0, 0);
      vNP = vec_not(vP);
      C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
      (NP1, NP2, NP3, NP4) = vNP;
      D[i]   = D[i-1];                (NP1)
      D[i+1] = D[i];                  (NP2)
      D[i+2] = D[i+1];                (NP3)
      D[i+3] = D[i+2];                (NP4)
  }

* After removing scalar predicates

  for (i = 0; i < 1024; i += 4) {
      vP  = A[i:i+3] > (0, 0, 0, 0);
      vNP = vec_not(vP);
      C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
      (NP1, NP2, NP3, NP4) = vNP;
      if (NP1) D[i]   = D[i-1];
      if (NP2) D[i+1] = D[i];
      if (NP3) D[i+2] = D[i+1];
      if (NP4) D[i+3] = D[i+2];
  }

Reducing vectorization overhead in the presence of control flow

Having to execute the instructions in all control paths in vector code has been one of the major factors that slow down the vector code with respect to the scalar baseline. The more complex the control flow becomes, and the more instructions are bypassed in the scalar code, the larger the vectorization overhead grows. To reduce this vectorization overhead, vector branches can be inserted to bypass vector instructions, similar to the way scalar branches bypass scalar instructions. Below, AltiVec predicates are used to show how this can be achieved.
* Scalar baseline (original code)

  for (i = 0; i < 1024; i++) {
      if (A[i] > 0) {
          C[i] = B[i];
          if (B[i] < 0)
              D[i] = E[i];
      }
  }

* After vectorization in the presence of control flow

  for (i = 0; i < 1024; i += 4) {
      vPA = A[i:i+3] > (0, 0, 0, 0);
      C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vPA);
      vT = B[i:i+3] < (0, 0, 0, 0);
      vPB = vec_sel((0, 0, 0, 0), vT, vPA);
      D[i:i+3] = vec_sel(D[i:i+3], E[i:i+3], vPB);
  }

* After inserting vector branches

  for (i = 0; i < 1024; i += 4) {
      if (vec_any_gt(A[i:i+3], (0, 0, 0, 0))) {
          vPA = A[i:i+3] > (0, 0, 0, 0);
          C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vPA);
          vT = B[i:i+3] < (0, 0, 0, 0);
          vPB = vec_sel((0, 0, 0, 0), vT, vPA);
          if (vec_any_ne(vPB, (0, 0, 0, 0))) {
              D[i:i+3] = vec_sel(D[i:i+3], E[i:i+3], vPB);
          }
      }
  }

There are two things to note in the final code with vector branches. First, the predicate-defining instruction for vPA is also included within the body of the outer vector branch by using vec_any_gt. Second, the profitability of the inner vector branch for vPB depends on the conditional probability of vPB having false values in all fields given that vPA has false values in all fields.

Consider an example where the outer branch in the scalar baseline is always taken, bypassing most instructions in the loop body. Without vector branches, the vector code, shown in the second loop above, must execute all vector instructions. When vector branches are used, as in the final code, both the comparison and the branch are performed in vector mode, potentially gaining performance over the scalar baseline.


Wikimedia Foundation. 2010.
