This is the first in series of posts for matrix multiplication.
I will assume that everyone here would know as to what how matrix multiplication with arrays are computed. Below is the serial implementation for matrix multiplication where matrix A and B are multiplied and the result is stored in matrix C. j and I are the dimensions of the matrix. Matrix multiplication are highly regular and can be parallelised and they are in the order of n3.
Now coming to parallelisation. Which loop do we parallelise? With our knowledge of asymptotic analysis (due credit to my teacher Viraj Kumar), we can figure out that to extract the maximum parallelism, we need s(n) to be o(r(n)) where s(n) is the inherently serial part of our code and r(n) is the parallel region.
The above solution is the most simplistic and realistic solution where we parallelise the middle loop to get the maximum performance. This is because in this case the parallel code is of the order n2 and serial code is of the order n. In this scenario we can get a speed up upto the order of parallelism as supported by our machine.
The best performance was seen by the above optimisation which was able to harness the nested parallelism. It collapses the outer two loops to form one and the other loop is also parallelised. Although in such cases we need to worry when we parallelise the inner loop because creation, synchronisation and destruction of threads add on to the overhead and may kill any performance benefits we may be getting out of the program by parallelising the code.
You may ask as to why do we prefer shared(cij) over reduction(+:cij) and also a doubt may arise to will not there be a data race in the case of cij? Because cij is shared and multiple threads will try to update it. There will BE a data race as such but it DOES NOT bother us because no matter which thread wins, eventually it is the sum that matters to us. It is like 2+3+5 or 2+5+3, the answer still remains the same. When reduction(+: cij) clause is used, a private copy for each thread has to be created and also at the same time there will be no need for an extra barrier which waits till it gets the values of cij from each thread and then sum it up all. Hence, this approach saved me a second extra or improved performance over when reduction clause is used.
You may find the code here.
Next in series – Part 2. Blocking Implementation