General Matrix Multiply
The General Matrix Multiply (GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) that performs matrix multiplication. Its variants are:
- SGEMM for single precision,
- DGEMM for double precision,
- CGEMM for complex single precision, and
- ZGEMM for complex double precision.
GEMM is often tuned by High Performance Computing vendors to run as fast as possible, because it is the building block for so many other routines. It is also the most important routine in the LINPACK benchmark. For this reason, implementations of fast BLAS libraries typically focus first on GEMM performance.
Operation
The xGEMM routine calculates the new value of matrix C based on the matrix product of matrices A and B, and the old value of matrix C:

C ← αAB + βC

where α and β are scalar coefficients.
The Fortran interface for these procedures is:
SUBROUTINE xGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
where TRANSA and TRANSB specify whether the matrices A and B are to be transposed. M is the number of rows in matrix C and, depending on TRANSA, the number of rows in the original matrix A or its transpose. N is the number of columns in matrix C and, depending on TRANSB, the number of columns in the matrix B or its transpose. K is the number of columns in matrix A (or its transpose) and the number of rows in matrix B (or its transpose). LDA, LDB and LDC specify the size of the first dimension of the matrices as laid out in memory, i.e., the memory distance between the starts of consecutive rows or columns, depending on the memory layout (Dongarra et al. 1990).
The C interface for these procedures is similar:
void cblas_xgemm ( const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const SCALAR alpha, const TYPE * A, const int lda, const TYPE * B, const int ldb, const SCALAR beta, TYPE * C, const int ldc)
where the 'x' in the name is s, d, c, or z, meaning single-precision real, double-precision real, single-precision complex, or double-precision complex; the other arguments have the same meanings as in the Fortran interface.
For the C interface, higher-level APIs also exist. For example, the corresponding interface in the GNU Scientific Library is (taking the double-precision real case as an instance):
int gsl_blas_dgemm( CBLAS_TRANSPOSE_t TransA, CBLAS_TRANSPOSE_t TransB, double alpha, const gsl_matrix * A, const gsl_matrix * B, double beta, gsl_matrix * C)
where TransA is CblasNoTrans or CblasTrans for A or transposed A respectively, and TransB has similar choices (GSL Team 2007).
Optimization
GEMM is not only an important building block of other numerical software; a GEMM on large matrices can itself be built from GEMM calls on smaller ones. By decomposing one or both of the input matrices into block matrices, GEMM can be applied repeatedly to the smaller blocks to build up the result for the full matrix. This is one of the motivations for including the β parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case β = 1, which many implementations optimize for, eliminating one multiplication for each element of C.
This decomposition allows for better locality of reference, both in space and time, of the data used in the product. This, in turn, takes advantage of the cache on the system (Golub & Van Loan 1996, Ch. 1). For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS. A highly tuned implementation based on these ideas is part of the GotoBLAS.
References
- Dongarra, Jack J.; Croz, Jeremy Du; Hammarling, Sven; Duff, Iain S. (1990), "A set of level 3 basic linear algebra subprograms", ACM Transactions on Mathematical Software 16 (1): 1–17, doi:10.1145/77626.79170, ISSN 0098-3500.
- Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd ed.), Johns Hopkins, ISBN 978-0-8018-5414-9.
- GSL Team (2007), "§12.1.3 Level 3 GSL BLAS Interface", GNU Scientific Library. Reference Manual, http://www.gnu.org/software/gsl/manual/html_node/Level-3-GSL-BLAS-Interface.html.
- Goto, Kazushige; van de Geijn, Robert A. (2008), "Anatomy of High-Performance Matrix Multiplication", ACM Transactions on Mathematical Software 34 (3): Article 12, 25 pages, doi:10.1145/1356052.1356053.