Automatically Tuned Linear Algebra Software

Automatically Tuned Linear Algebra Software

Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of BLAS APIs for C and Fortran77.

ATLAS is often recommended as a way to automatically generate an optimized BLAS library. While its performance often trails that of specialized libraries written for one specific hardware platform, it is often the first or even only optimized BLAS implementation available on new systems and is a large improvement over the generic BLAS available at Netlib. For this reason, ATLAS is sometimes used as a performance baseline for comparison with other products.

ATLAS runs on most Unix-like operating systems and on Microsoft Windows (using Cygwin). It is released under a BSD-style license without advertising clause, and many well-known mathematics applications including MATLAB, Mathematica, and GNU Octave use it.

Functionality

ATLAS provides a full implementation of the BLAS APIs as well as some additional functions from LAPACK, a higher-level library built on top of BLAS. In BLAS, functionality is divided into three groups called levels 1, 2 and 3.

* Level 1 contains "vector operations" of the form

:mathbf{y} leftarrow alpha mathbf{x} + mathbf{y} !

:as well as scalar dot products and vector norms, among other things.

* Level 2 contains "matrix-vector operations" of the form

:mathbf{y} leftarrow alpha A mathbf{x} + eta mathbf{y} !

:as well as solving T mathbf{x} = mathbf{y} for x with T being triangular, among other things.

* Level 3 contains "matrix-matrix operations" such as the widely used General Matrix Multiply (GEMM) operation

:C leftarrow alpha A B + eta C !

:as well as solving B leftarrow alpha T^{-1} B for triangular matrices T, among other things.

Optimization approach

The optimization approach is called Automated Empirical Optimization of Software (AEOS), which identifies four fundamental approaches to computer assisted optimization of which ATLAS employs three: [cite web
author = R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra
year = 2000
url = http://www.netlib.org/lapack/lawnspdf/lawn147.pdf
title = Automated Empirical Optimization of Software and the ATLAS Project
format = PDF
accessdate = 2006-10-06
]

# Parameterization -- searching over the parameter space of a function, used for blocking factor, cache edge, ...
# Multiple implementation -- searching through various approaches to implementing the same function, e.g., for SSE support before intrinsics made them available in C code
# Code generation -- programs that write programs incorporating what knowledge they can about what will produce the best performance for the system

* Optimization of the level 1 BLAS uses parameterization and multiple implementation: Every ATLAS level 1 BLAS function has its own kernel. Since it would be difficult to maintain thousands of cases in ATLAS there is little architecture specific optimization for Level 1 BLAS. Instead multiple implementation is relied upon to allow for compiler optimization to produce high performance implementation for the system.

* Optimization of the level 2 BLAS uses parameterization and multiple implementation: With N^2 data and N^2 operations to perform the function is usually limited by bandwidth to memory, and thus there is not much opportunity for optimization: All routines in the ATLAS level 2 BLAS are built from two Level 2 BLAS kernels::* GEMV -- matrix by vector multiply update:::mathbf{y} leftarrow alpha A mathbf{x} + eta mathbf{y} !:* GER -- general rank 1 update from an outer product:::mathbf{y} leftarrow alpha mathbf{x}^T * y + eta mathbf{y} !

* Optimization of the level 3 BLAS uses code generation and the other two techniques: Since we have N^3 ops with only N^2 data, many opportunities for optimization

Level 3 BLAS

Most of the Level 3 BLAS is derived from GEMM, so that is the primary focus of the optimization.

:O(n^3) operations vs. O(n^2) data

The intuition that the n^3 operations will dominate over the n^2 data accesses only works for roughly square matricies. The real measure should be some kind of surface area to volume.The difference becomes important for very non-square mats.

Can it afford to copy?

Copying the inputs allows the data to be arranged in a way that provides optimal access for the kernel functions, but this comes at the cost of allocating temporary space, and an extra read and write of the inputs.

So the first question GEMM faces is, can it afford to copy the inputs?

If so,
* Put into block major format with good alignment
* Take advantage of user contributed kernels and cleanup
* Handle the transpose cases with the copy: make everything into TN (transpose - no-transpose)
* Deal with α in the copy

If not,
* Use the nocopy version
* Make no assumptions on the stride of matrix "A" and "B" in memory
* Handle all transpose cases explicitly
* No guarantee about alignment of data
* Support α specific code
* Run the risk of TLB issues, bad strides, ...

The actual decision is made through a simple heuristic which checks for "skinny cases".

Cache edge

For 2nd Level Cache blocking a single cache edge parameter is used.The high level choose an order to traverse the blocks: "ijk, jik, ikj, jki, kij, kji". These need not be the same order as the product is done within a block.

Typically chosen orders are "ijk" or "jik".For "jik" the ideal situation would be to copy "A" and the "NB" wide panel of "B". For "ijk" swap the role of "A" and "B".

Choosing the bigger of "M" or "N" for the outer loop reduces the footprint of the copy.But for large "K" ATLAS does not even allocate such a large amount of memory.Instead it defines a parameter, "Kp", to give best use of the L2 cache. Panels are limited to "Kp" in length.It first tries to allocate (in the "jik" case) M*Kp + NB*Kp + NB*NB.If that fails it tries 2*Kp*NB + NB*NB.(If that fails it uses the no-copy version of GEMM, but this case is unlikely for reasonable choices of cache edge.)"Kp" is a function of cache edge and "NB".

LAPACK

When integrating the ATLAS BLAS with LAPACK an important consideration is the choice of blocking factor for LAPACK. If the ATLAS blocking factor is small enough the blocking factor of LAPACK could be set to match that of ATLAS.

To take advantage of recursive factorization, ATLAS provides replacement routines for some LAPACK routines. These simply overwrite the corresponding LAPACK routines from Netlib.

Need for installation

Installing ATLAS on a particular platform is a challenging process which is typically done by a system vendor or a local expert and made available to a wider audience.

For many systems, architectural default parameters are available; these are essentially saved searches plus the results of hand tuning. If the arch defaults work they will likely get 10-15% better performance than the install search. On such systems the installation process is greatly simplified.

References

External links

* [http://math-atlas.sourceforge.net/ math-atlas.sourceforge.net] Project homepage
* [http://math-atlas.sourceforge.net/devel/atlas_contrib/ User contribution to ATLAS]
* [http://math-atlas.sourceforge.net/devel/atlas_devel/ A Collaborative guide to ATLAS Development]
* The [http://math-atlas.sourceforge.net/faq.html#doc FAQ] has links to the Quick reference guide to BLAS and Quick reference to ATLAS LAPACK API reference
* [http://www.terborg.net/research/kml/installation.html Microsoft Visual C++ Howto] for ATLAS


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Automatically Tuned Linear Algebra Software — ATLAS Aktuelle Version 3.8.4 (14. Mai 2011) Aktuelle Vorabversion 3.9.53 (12. Oktober 2011) Betriebssystem POSIX Programmier­sprache C, Fortran …   Deutsch Wikipedia

  • Automatically Tuned Linear Algebra Software — Значимость предмета статьи поставлена под сомнение. Пожалуйста, покажите в статье значимость её предмета, добавив в неё доказательства значимости по частным критериям значимости или, в случае если частные критерии значимости для… …   Википедия

  • Basic Linear Algebra Subprograms — (BLAS) is a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication. They were first published in 1979, and are used to build larger packages …   Wikipedia

  • Basic Linear Algebra Subprograms — BLAS Betriebssystem plattformunabhängig Kategorie Programmbibliothek für: Lineare Algebra www.netlib.org/blas Basic Linear Algebra Subprograms (BLAS) bezeichnet eine Softwareb …   Deutsch Wikipedia

  • Basic Linear Algebra Subprograms — BLAS (англ. Basic Linear Algebra Subprograms  базовые подпрограммы линейной алгебры)  стандарт де факто интерфейса программирования приложений для создания библиотек, выполняющих основные операции линейной алгебры, такие как… …   Википедия

  • Basic Linear Algebra Subprograms — (BLAS) sont un ensemble de fonctions standardisées (interface de programmation) réalisant des opérations de base de l algèbre linéaire comme des multiplications de vecteurs ou de matrices. Ces fonctions ont d abord été publiées en 1979 et sont… …   Wikipédia en Français

  • Comparison of linear algebra libraries — The following tables provide a comparison of linear algebra software libraries, either specialized or general purpose libraries with significant linear algebra coverage. Dense linear algebra General information Creator Language First public… …   Wikipedia

  • Atlas (Computersoftware Lineare Algebra) — Automatically Tuned Linear Algebra Software (ATLAS) ist eine Unterprogrammbibliothek für Lineare Algebra. ATLAS ist eine Open Source Implementierung von Basic Linear Algebra Subprograms (BLAS) und von Teilen von LAPACK für C und für Fortran. Es… …   Deutsch Wikipedia

  • Algoritmo QMR — El algoritmo QMR fue creado para resolver el sistema lineal Ax = b donde A es una matriz cuadrada que no requiere ser simétrica. Contenido 1 Introducción 2 Quas Minimal Residual 2.1 Biortogonalización de Lanczos …   Wikipedia Español

  • Algoritmo TFQMR — El Algoritmo TFQMR fue creado para resolver el sistema lineal Ax = b donde A es una matriz cuadrada que no requiere ser simétrica. Contenido 1 Introducción 2 Transpose Free QMR 3 Algoritmo Transpose Free QM …   Wikipedia Español

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”