Accelerating GPU kernels for dense linear algebra

Rajib Nath, Stanimire Tomov, Jack Dongarra. DOI: 10.1007/978-3-642-19328-6_10

Citations: 3
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building blocks of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can lead to algorithms that are two times faster. Similarly, the matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library.
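The abstract names Pointer Redirecting but does not spell out its mechanism. The sketch below is an illustration only, written as plain Python rather than a GPU kernel, and it rests on an assumption: that a "thread" whose row index falls past the end of the matrix is redirected to read the last valid row, so the inner loop needs no bounds check, and its result is simply not stored. The names `NB`, `gemv_redirected`, and `gemv_reference` are hypothetical and do not come from MAGMA BLAS.

```python
NB = 4  # hypothetical fixed blocking size: rows handled per emulated thread block


def gemv_redirected(A, x, nb=NB):
    """Compute y = A * x block by block, redirecting out-of-range row reads."""
    m, n = len(A), len(x)
    y = [0.0] * m
    num_blocks = (m + nb - 1) // nb  # round up so the ragged last block is covered
    for block in range(num_blocks):
        for t in range(nb):  # t plays the role of a thread index inside the block
            row = block * nb + t
            src = min(row, m - 1)  # redirect: rows past the end read the last valid row
            acc = 0.0
            for j in range(n):  # inner loop runs without any bounds check
                acc += A[src][j] * x[j]
            if row < m:  # only in-range threads store their result
                y[row] = acc
    return y


def gemv_reference(A, x):
    """Plain row-by-row product, used only to check the blocked version."""
    return [sum(a * b for a, b in zip(r, x)) for r in A]
```

Calling `gemv_redirected` on a matrix with 6 rows and `NB = 4` exercises the ragged case: the second block covers rows 4 to 7, and the two emulated threads for rows 6 and 7 read row 5 and discard their sums. The result matches `gemv_reference`, which is the property the sketch is meant to show.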
Conference: Vector and Parallel Processing - VECPAR, pp. 83-92, 2010
    • ...The CUBLAS dgemm performance and the MAGMA dgetrf/dgetrs performance is reduced when the sizes (or the leading dimensions) of the matrix are not multiples of the inner blocking size [7]...
    • ...The new CUBLAS 3.2 indeed increases performance for non block multiple matrix sizes through MAGMA code [7]...

    P. Fortin et al. Deployment on GPUs of an Application in Computational Atomic Physics
