Current OpenMP programming language is tile oblivious, although it is the de facto standard for writing parallel programs on shared memory systems. Tiling not only can improve data locality for both the sequential and parallel programs, but also can help the compiler to maximise parallelism and minimise synchronisation for programs running on parallel machines. There is a promising future work in trying to make OpenMP tile-aware and suitably add directives which can be sort of an annotation to mark regions that can help OpenMP become tile-aware.

Currently, OpenMP doesn’t support reduction on multi dimensional array and it requires us to manually add the code. Work has been done to add functionality to OpenMP and do parallel reductions on multi dimensional arrays and not just on scalar variables. Also apparently, the current OpenMP parallelisation techniques can not harvest the maximum parallelism and data locality in the code at the same time. They suffer from either insufficient parallelism or poor data locality.

This can be done basically in four steps

Distribute the iterations of the parallelized loop among the threads

Allocate memory for the private copy of the tile used in the local recursive calculation

Perform the local recursive calculation which is specified by the reduction kernelloops

Update the global copy of the reduction tile

In my solution, I have instead used a flattened 2D array. A private variable is then created and updates happen to that variable which is then stored back to the cell.

Consider the above example for matrix multiplication. We can think of the (1,1) entry of the product as an inner product of the first row and the first column of the two matrices in the product: 650 = 1 · 21 + 5 · 22 + 9 · 23 + 13 · 24.

Similarly, we can think of partitioning all three matrices into 2-by-2 blocks, and writing the first block of the product in terms of the first block row and block column of the two terms in the product.

The advantage of thinking about matrix multiplication in this way is that even if our original matrices don’t fit into cache, the little blocks will; and if we break the matrices into b-by-b blocks, then each block multiplication involves 2b^{2} data. Note that the idea of blocking can be applied recursively: a large matrix might be partitioned into big blocks that fit in L2 cache, which in turn are partitioned into smaller blocks that fit in L1 cache, which in turn are partitioned into tiny blocks that fit into the registers. This is serial tuning optimisation which tries to improve cache performance and hence the overall speedup.

By the means of this simple tuning, for a test case of 1024×1024 and the block size being 16, the parallelised version of the naive implementation was 6 times slower than that of the parallelised version. There is an unnecessary extra computation for i being done in the bj loop which can be easily shifted to the above loop and we can remove the collapse clause, the performance for the above test case are in the same ball park. The collapse clause better distributes the work but that added bit of performance gets nullified by unnecessary n^{2} computation of iinstead of n.

I will assume that everyone here would know as to what how matrix multiplication with arrays are computed. Below is the serial implementation for matrix multiplication where matrix A and B are multiplied and the result is stored in matrix C. j and I are the dimensions of the matrix. Matrix multiplication are highly regular and can be parallelised and they are in the order of n^{3}.

Now coming to parallelisation. Which loop do we parallelise? With our knowledge of asymptotic analysis (due credit to my teacher Viraj Kumar), we can figure out that to extract the maximum parallelism, we need s(n) to be o(r(n)) where s(n) is the inherently serial part of our code and r(n) is the parallel region.

The above solution is the most simplistic and realistic solution where we parallelise the middle loop to get the maximum performance. This is because in this case the parallel code is of the order n^{2} and serial code is of the order n. In this scenario we can get a speed up upto the order of parallelism as supported by our machine.

The best performance was seen by the above optimisation which was able to harness the nested parallelism. It collapses the outer two loops to form one and the other loop is also parallelised. Although in such cases we need to worry when we parallelise the inner loop because creation, synchronisation and destruction of threads add on to the overhead and may kill any performance benefits we may be getting out of the program by parallelising the code.

You may ask as to why do we prefer shared(cij) over reduction(+:cij) and also a doubt may arise to will not there be a data race in the case of cij? Because cij is shared and multiple threads will try to update it. There will BE a data race as such but it DOES NOT bother us because no matter which thread wins, eventually it is the sum that matters to us. It is like 2+3+5 or 2+5+3, the answer still remains the same. When reduction(+: cij) clause is used, a private copy for each thread has to be created and also at the same time there will be no need for an extra barrier which waits till it gets the values of cij from each thread and then sum it up all. Hence, this approach saved me a second extra or improved performance over when reduction clause is used.

However trivial, there is a lot we can learn about parallelisation from this example and because this is a trivial example, researchers around the world have tried their hand and come up with amazing optimisations and implementations. And also matrix multiplication is the building block for dense linear algebra libraries.

This will be a 3 post series with the optimisations and their parallelised version.

The naive matrix multiplication code was adapted from Cornell University CS5220 course assignment but for some reason the code was dysfunctional. I rewrote the code partially to suit my needs and remove the redundant stuff.

There are several other optimisations as well that one can look forward to for implementing matrix multiplication like copy optimisations. This will be soon covered in my future posts.

I use a MacBook Air with an i5 processor and 4GB RAM with and L1 cache size of 64K. The maximum number of threads that can run on my CPU are 8. All experimental setup and results are based on the above system configuration.

I have referred to Tile Reduction: the First Step Towards Tile Aware Parallelization in OpenMP to explain the their

All the code in the context of this thread can be found here.

In the last post, we discussed about spatial reuse and improved cache optimisations. The other factor which plays a decisive role in improving cache use is temporal locality. If you may have observed that till now in all the algorithms that we saw, we never made use of the recently computed cell values and nor were we able to parallelise the outermost loop because of the carried data dependency. To put it in simple terms, every sweep that occurred on our multidimensional array, the cell values computed were dependent on the last iteration and hence there was no option for us to exploit recently computed values and in turn get the best out of spatial locality.

Time skewed

Visualisation

This algorithm was initially difficult for me to visualise as to how the control of the code moved and how were the stencils being computed in this fashion. I presume that the above two pictures will help you better understand as to what is happening. With the knowledge of the values of the cells computed in the current iteration, we may as well make use of these and perform stencil operation of the next time step albeit on a smaller block than the last time because only that much of data is available for reuse.

Time skewing is a type of cache tiling, that attempts reduce main memory traffic by reusing values in cache as often as possible. However one issue with this is, we need to find the right block size and of the right dimensions for the best use and is a machine dependent optimisation else we cannot extract the expected performance.

The performance gains were significant in the case of 512x512x512 where naive algorithm performs badly because of extensive cache misses. After various trials for finding a suitable tile size for the time skew algorithm, 512x64x8 gave better results in terms of performance.

Although here, there was not a lot of openMP stuff to try out, this definitely goes down well for us that we should first always look for serial optimisations and then lookout to parallelise the program because the gains are always significant in these optimisations.

Tiling is a well known transformation which improves locality by moving reuses to the same data closer in time. This optimisation will allow our stencil computing to make better cache utilisation as tiling improves spatial reuse.

3D PDE solvers suffer poor cache performance because accesses to the same data are usually too far apart, requiring array elements to be brought into cache multiple times per array sweep. Hence the need to perform tiling to reduce the cache misses.

This cache problem arises more often in 3D codes than in 2D codes because we have data more spaced out and as data is stored contagiously in row major fashion.

So what is Tiling? Tiling is an optimisation technique that breaks the problem into tiles and first solves these tiles and then proceed. If you see the image illustrated by me as above, instead of the computing stencils in an iterative manner going through all the cells in the array, we can break down the multi dimensional grid in tiles and then first perform stencil operations on all cells within the tile and then go to the next tile. This improves spatial reuse and hence better exploiting the cache. Well, wikipedia also explains it well and in detail.

Are tiles also 3D arrays? NO, we break the problem in 2D tiles. Why? Like if we are to have 3D tiles, as the elements along the z-axis are far spaced apart, they would cause more harm then do good. So what about 2D PDE solvers? You can but it really would not add on great performance benefits as you can very well fit your arrays in the cache unless you have a really large sized array. Here you do make them into 2D Tiles as they will fit in the cache. Fit into cache? Depending on your architecture, you need to aptly choose the right tile size to get maximum optimisation.

“A 6 point stencil which accesses six columns of B in three adjacent planes at the same time, as shown in Figure 4. With a distance of 2N^{2} between the leading A(I,J,K+1) and trailing A(I,J,K-1) array references, two entire N N planes now need to remain in cache, so only 3D arrays of size 32 32 M can fully exploit reuse for a 16K L1 cache.” Quoting directly from the paper whose reference is given below, is an example for a 6 Point 3D Jacobi solver.

nx, ny, nz are the the dimensions of the array and tx,ty are the tile dimensions.

Now coming to the question of parallelisation? OpenMP is tile oblivious in it’s implementation. Future work provisions for moving towards a tile aware parallelism and harness the available parallelism in a better way. Hence, parallelism is trivial.

So which loop to parallelise? There are six possibilities. First one can be ruled out as there is a carried loop data dependency over the time steps. By asymptotic analysis, we can say that the best parallelisation can be achieved by parallelising ii loop. As the loop has a step increment of block size, the number of iterations are less. In my test case, the optimal tile size giving best cache performance for my architecture was 128 and my block size is 256. So ii loop only iterates 4 times and we have 8 threads. This implies that we are not harnessing all the available parallelism. Hence we collapse the the two loops and now the combined work is distributed amongst the team of threads.

One more important question is what should be thetile size? Since we tile only the J and I loops, the K loop iterates across all array planes but executes only iteration points inside a TIxTJx(N-2) block. In order to preserve all reuse within the TIxTJx(N-2) block (N-2 because the loop iterates from 2 to n-1), it is therefore sufficient that the cache hold a (TI+2)x (TJ+2)x3 subarray. During each TIxTJx(N-2) block of iterations, we access approximately (TI+2)x(TJ+2)xN elements of array Anext. The cost function is – Cost(TI,TJ) = (TI+2)(TJ+2)/(TI TJ). Note that given multiple tile sizes with equal values of TI TJ, this function is minimal when TI and TJ have the smallest difference. The cost function thus favors square tiles.

I also looked if I could achieve any good performance gains by bringing nested parallelism to the table. The results were almost close to that of the above case but I guess that overhead associated with creation, synchronisation and destruction of thread took away any extra benefit coming our way.

So the most tuned code gave us a speedup of almost 3 times compared to that of serial code.

Now that we have discussed optimisation that takes in consideration of the spatial reuse of data, why should we not think of temporal reuse as well? We do not use the just computed values while sweeping over our multidimensional array? After all temporal locality and spatial locality are the two most critical factors that enhance cache performance. Solution to this is a time skewed algorithm.

Stencil codes perform a sequence of sweeps (called timesteps) through a given array. Generally this is a 2- or 3-dimensional regular grid. That is your outer most loop doing the sweeps. The elements of the arrays will be interchangeably referred to as cells. In each timestep, the stencil code updates all array elements. Using neighboring array elements in a fixed pattern , each cell’s new value is computed. In most cases boundary values are left unchanged and hence the loop variables are from 1 to n-2 (index begins from 0).

The cells are computed in a pattern as shown above with red, yellow and blue getting computed correspondingly in the sequence order. nx, ny and nz are the dimensions of our grid along the x, y and the z axis.

Now, that we have understood what and how the code works, lets get into how best can we extract the performance –

Now there happens to be a carried loop data dependency for the outermost loop (time steps) as after every iteration, it’s values are used to compute the cells for the next iteration. So it is not possible that we parallelise the outer loop as of now in this case.

So which of the other three loops are to be parallelised? These slides can be a helpful start for you to understand what implications I am about to make. r(n) is the parallel code and s(n) is the inherently serial code. Our aim is to find a case where s(n) is o(r(n))

If we are to parallelise the kloop, the serial code within it is of the order n^{2} and the parallel code is of the order n and hence it is not worth parallelising that loop. Experimental data also suggested the same. If we are to parallelise the i loop, the overhead associated with creation and synchronisation of new threads just to perform a small task is too expensive and is not worth it to give us the extra performance and rather it just exacerbates the performance.

Hence our solution one is to parallelise the j loop. As all the loop iterations will perform the same amount of work, we can set the scheduling to static. This solution stands out better than the two is because r(n) is n^{2} and s(n) is of the order n. Hence the speedup is the order of the parallelism supported by your machine.

Solution two involves us trying to be able to harness on the nested parallelism. There is a variable that you can set to allow nested parallelism to extract more juice out of your machine. You’ve to insert omp_set_nested(1); to enable nested parallelism. This is what I came across when trying to efficiently parallelise my code but somehow it ended up not showing any results at all in my case probably because of the architecture and a different compiler. So if any of you are able to test the results out, please revert back to me. collapse(n) [OpenMP 3.0+] directive merges n loops and then distributes the work amongst the team of threads. The above solution doesn’t give any significant extra performance add on but overall fairs better than the others. This is because the outer two loops are merged and now r(n) becomes n² and s(n) is of the order n and so is the case with solution 1. The actual difference is coming possibly from the creation of team of threads. For smaller test cases, the overhead associated dominates over the parallelism exploited. All these values are for Size³ grids and for 100 time steps.