Opened 13 years ago

Last modified 13 years ago

#4033 assigned Patches

optimized matrix products

Reported by: Gunter Owned by: Gunter
Milestone: Boost 1.43.0 Component: uBLAS
Version: Boost 1.42.0 Severity: Optimization
Keywords: matrix multiply, performance Cc: consulting@…

Description

proposal from Jörn Ungermann:

Abstract

The (lacking) performance of sparse-matrix products was quite often noted on this list. Currently there are several functions which implement matrix products. Each is efficient under certain conditions only and some of them are not even documented. I think it important for users to have one (near-)optimal function only. The attached file contains an improved axpy_prod that matches the performance of "prod", "axpy_prod" and "sparse_prod" and is thereby optimal for all matrix types.

Details

The optimal choice of kernel for a matrix product depends on the properties of all involved matrices. The current implementations specialize only on the type of target matrix. By also determining the kernel depending on the source matrices, one can choose the most efficient kernel. My aim was to create a single function to match the performance of prod, axpy_prod and sparse_prod for all combinations of compressed_matrix/matrix and column/row-majority. The other matrix types should also be handled efficiently by my product, too, but I did check it only spuriously, as it should be obvious that those types are more suited for matrix setup, not for actual calculation. My axpy_prod implementation (called axpy_prod2 in the attached file) does not offer support for the undocumented triangular_expression stuff contained in the original axpy_prod, which however seems to be buggy in the current code for dense matrices. The kernels are largely based on existing kernels with one or two new ones being thrown in. They are as abstract as possible to handle arbitrary expressions efficiently. Specializing, e.g. directly on compressed_matrix would give another very significant performance boost, but also many more additional kernels, possibly more than could be maintained, especially as the transposedness might have to be handled explicitly, too.

It would be very nice, if someone could rewrite prod/prec_prod to handle matrix products in the same way as my axpy_prod2 does, but I did not look deep enough into the expression-templates to do this myself or to even know if this were possible.

In fact, I'd propose to have two sets of interfaces:

1) One convenient one, possibly compromising efficiency
2) One modeled closely after C-BLAS, delivering utmost efficiency.

The latter one could then be *very easily* overloaded by the numeric bindings for dense matrices. I added a possible generic implementation for a gemm call that could be trivially overloaded for dense matrices and forwarded to, e.g., ATLAS numeric bindings.

If one could achieve the same efficiency and automatic (i.e. by including a header) coupling to numeric bindings using only *one* interface, I'd prefer that naturally. However currently, we have not just two, but too many product functions (prod, prec_prod, axpy_prod, sparse_prod, opb_prod, block_prod).

The following table gives the result for all 64 combinations of compressed_matrix/matrix and row/column-majorities for the three involved in this case 2000x2000 matrices.

com_rm is a compressed_matrix of row_major type.
den_cm is a matrix of column_major type.
The  4th column indicates the used kernel.
The  5th column gives the runtime for axpy_prod2  ( clock()/1000 )
The  6th column gives the runtime for sparse_prod ( clock()/1000 )
The  7th column gives the runtime for axpy_prod   ( clock()/1000 )
The  8th column gives the runtime for prod        ( clock()/1000 )
The 10th column gives the speedup of axpy_prod2 compared to sparse_prod.
The 11th column gives the speedup of axpy_prod2 compared to axpy_prod.
The 12th column gives the speedup of axpy_prod2 compared to prod.

Larger matrix sizes result in prohibitive runtimes for the "slow" products, but can be used to analyse pure-sparse products.

The runtime shall be taken only qualitatively.

One can see that the only cases where the new implementation is slower are of relatively small runtime, so it may be negligible. sparse_prod uses an optimization that is very efficient if the target matrix has few off-diagonal elements, but is very inefficient if it does. The results will therefore vary depending on the test matrices.

It is also obvious to see, why some people complain about product performance, as especially axpy_prod and prod are sometimes ridiculously slow for sparse matrices and sparse_prod does not seem to be documented in the special products section. My favorite case is "com_cm, com_cm, den_rm" one, which actually occurred in the diagnostics part of our application and was the reason why we started looking into this topic.

Attachments (1)

axpy_prod2.tgz (4.9 KB ) - added by Gunter 13 years ago.
proposed new product and example

Download all attachments as: .zip

Change History (3)

by Gunter, 13 years ago

Attachment: axpy_prod2.tgz added

proposed new product and example

comment:1 by Gunter, 13 years ago

Status: newassigned

first timing result from Jörn

com_rm
com_rm, com_rm, com_rm  .ururur     10.     10.     10.  88340. spdup:    1.00    1.00 8834.00
com_rm, com_rm, den_rm  .ururur     40.     30.   3270.  87490. spdup:    0.75   81.75 2187.25
com_rm, com_rm, com_cm  .ucucuc    320.    340.    360.  85750. spdup:    1.06    1.12  267.97
com_rm, com_rm, den_cm  .ururur     20.    380. 193010.  87020. spdup:   19.00 9650.50 4351.00
com_rm, den_rm, com_rm  .ururur    210.    840.   4210.   2100. spdup:    4.00   20.05   10.00
com_rm, den_rm, den_rm  .ururur    220.    890.  16540.   2080. spdup:    4.05   75.18    9.45
com_rm, den_rm, com_cm  .urdcuc    800.  90010. 100030.   2050. spdup:  112.51  125.04    2.56
com_rm, den_rm, den_cm  .urdcuc    630.  93330. 187840.   2190. spdup:  148.14  298.16    3.48
com_rm, com_cm, com_rm  .ururur    310.    310.    330.    670. spdup:    1.00    1.06    2.16
com_rm, com_cm, den_rm  .ururur    310.    320. 184540.    710. spdup:    1.03  595.29    2.29
com_rm, com_cm, com_cm  .ucucuc    310.    310.    280.    660. spdup:    1.00    0.90    2.13
com_rm, com_cm, den_cm  .ucucuc    330.    310. 182840.    720. spdup:    0.94  554.06    2.18
com_rm, den_cm, com_rm  .ururur    300.    910.   4310.   2090. spdup:    3.03   14.37    6.97
com_rm, den_cm, den_rm  .ururur    340.    970.  72690.   2120. spdup:    2.85  213.79    6.24
com_rm, den_cm, com_cm  .urdcuc    690.  91900. 101780.   2030. spdup:  133.19  147.51    2.94
com_rm, den_cm, den_cm  .urdcuc    540.  92090. 181780.   2000. spdup:  170.54  336.63    3.70
com_cm
com_cm, com_rm, com_rm  .ururur     40.     60.     60. 173020. spdup:    1.50    1.50 4325.50
com_cm, com_rm, den_rm  .ucurdr     40.     70.   3200. 165360. spdup:    1.75   80.00 4134.00
com_cm, com_rm, com_cm  .ucucuc     50.     60.     60. 166080. spdup:    1.20    1.20 3321.60
com_cm, com_rm, den_cm  .ucucuc     70.     80.   3250. 168970. spdup:    1.14   46.43 2413.86
com_cm, den_rm, com_rm  .ururur    160.    860.   4480.  94040. spdup:    5.38   28.00  587.75
com_cm, den_rm, den_rm  .ucurdr    210.    840.  16320.  91700. spdup:    4.00   77.71  436.67
com_cm, den_rm, com_cm  .ucucuc    680.    670.   6480.  92700. spdup:    0.99    9.53  136.32
com_cm, den_rm, den_cm  .ucucuc    700.    770.   3410.  94180. spdup:    1.10    4.87  134.54
com_cm, com_cm, com_rm  .ururur    350.    330.    350.  83880. spdup:    0.94    1.00  239.66
com_cm, com_cm, den_rm  .ucucuc     30.    360. 186920.  82810. spdup:   12.00 6230.67 2760.33
com_cm, com_cm, com_cm  .ucucuc     10.     10.     10.  87320. spdup:    1.00    1.00 8732.00
com_cm, com_cm, den_cm  .ucucuc     40.     20.   3350.  86040. spdup:    0.50   83.75 2151.00
com_cm, den_cm, com_rm  .ururur    310.    870.   4830.  94410. spdup:    2.81   15.58  304.55
com_cm, den_cm, den_rm  .ucucuc    600.    900.  76000.  93800. spdup:    1.50  126.67  156.33
com_cm, den_cm, com_cm  .ucucuc    630.    600.   6410.  93480. spdup:    0.95   10.17  148.38
com_cm, den_cm, den_cm  .ucucuc    620.    640.   3330.  92700. spdup:    1.03    5.37  149.52
den_rm
den_rm, com_rm, com_rm  .ururur    550.    640.   6630.  96910. spdup:    1.16   12.05  176.20
den_rm, com_rm, den_rm  .ururur    490.    630.   3110.  96140. spdup:    1.29    6.35  196.20
den_rm, com_rm, com_cm  .ucucuc    410.    880.   4880.  98340. spdup:    2.15   11.90  239.85
den_rm, com_rm, den_cm  .ururur    500.    930.  76210.  96800. spdup:    1.86  152.42  193.60
den_rm, den_rm, com_rm  .d-d-ur  72560. 204380.1736640.  70530. spdup:    2.82   23.93    0.97
den_rm, den_rm, den_rm  .drdrdr  15910. 204960.  15880.  70870. spdup:   12.88    1.00    4.45
den_rm, den_rm, com_cm  .d-d-uc  62130. 225070.1728390.  62830. spdup:    3.62   27.82    1.01
den_rm, den_rm, den_cm  .d-d-uc  61330. 224670.  73000.  61530. spdup:    3.66    1.19    1.00
den_rm, com_cm, com_rm  .drucur    740.  90370.  99300.   2040. spdup:  122.12  134.19    2.76
den_rm, com_cm, den_rm  .ucucuc    380.  95150. 184160.   2100. spdup:  250.39  484.63    5.53
den_rm, com_cm, com_cm  .ucucuc    360.    950.   4470.   2070. spdup:    2.64   12.42    5.75
den_rm, com_cm, den_cm  .ucucuc    380.    980.  74490.   2150. spdup:    2.58  196.03    5.66
den_rm, den_cm, com_rm  .d-d-ur  15950. 232660.1762960.  16230. spdup:   14.59  110.53    1.02
den_rm, den_cm, den_rm  .d-d-ur  16030. 228230.  73100.  15410. spdup:   14.24    4.56    0.96
den_rm, den_cm, com_cm  .d-d-uc  15990. 224510.1716590.  16140. spdup:   14.04  107.35    1.01
den_rm, den_cm, den_cm  .d-d-uc  15460. 224790.  73850.  15580. spdup:   14.54    4.78    1.01
den_cm
den_cm, com_rm, com_rm  .ururur    580.    680.   6350.  95920. spdup:    1.17   10.95  165.38
den_cm, com_rm, den_rm  .ucurdr    710.    690.   3280.  97030. spdup:    0.97    4.62  136.66
den_cm, com_rm, com_cm  .ucucuc    260.    840.   4810.  96100. spdup:    3.23   18.50  369.62
den_cm, com_rm, den_cm  .ucucuc    230.    880.  16280.  95940. spdup:    3.83   70.78  417.13
den_cm, den_rm, com_rm  .d-d-ur 125890. 211500.1752280. 123910. spdup:    1.68   13.92    0.98
den_cm, den_rm, den_rm  .drdrdr  16320. 207460.  16810. 124920. spdup:   12.71    1.03    7.65
den_cm, den_rm, com_cm  .d-d-uc 126900. 206840.1645940. 122030. spdup:    1.63   12.97    0.96
den_cm, den_rm, den_cm  .dcdcdc  16390. 209770.  16810. 121270. spdup:   12.80    1.03    7.40
den_cm, com_cm, com_rm  .drucur    760.  91830.  96810.   2120. spdup:  120.83  127.38    2.79
den_cm, com_cm, den_rm  .drucur    610.  92990. 187750.   2280. spdup:  152.44  307.79    3.74
den_cm, com_cm, com_cm  .ucucuc    190.    920.   4310.   2270. spdup:    4.84   22.68   11.95
den_cm, com_cm, den_cm  .ucucuc    230.    910.  17800.   2310. spdup:    3.96   77.39   10.04
den_cm, den_cm, com_rm  .d-d-ur  66980. 240030.1864980.  65220. spdup:    3.58   27.84    0.97
den_cm, den_cm, den_rm  .d-d-ur  63400. 233410.  76700.  63820. spdup:    3.68    1.21    1.01
den_cm, den_cm, com_cm  .d-d-uc  75140. 214000.1711410.  74020. spdup:    2.85   22.78    0.99
den_cm, den_cm, den_cm  .dcdcdc  16460. 215050.  16820.  74570. spdup:   13.07    1.02    4.53

comment:2 by consulting@…, 13 years ago

Cc: consulting@… added
Note: See TracTickets for help on using tickets.