However trivial, there is a lot we can learn about parallelisation from this example and because this is a trivial example, researchers around the world have tried their hand and come up with amazing optimisations and implementations. And also matrix multiplication is the building block for dense linear algebra libraries.

This will be a 3 post series with the optimisations and their parallelised version.

The naive matrix multiplication code was adapted from Cornell University CS5220 course assignment but for some reason the code was dysfunctional. I rewrote the code partially to suit my needs and remove the redundant stuff.

There are several other optimisations as well that one can look forward to for implementing matrix multiplication like copy optimisations. This will be soon covered in my future posts.

I use a MacBook Air with an i5 processor and 4GB RAM with and L1 cache size of 64K. The maximum number of threads that can run on my CPU are 8. All experimental setup and results are based on the above system configuration.

I have referred to Tile Reduction: the First Step Towards Tile Aware Parallelization in OpenMP to explain the their

All the code in the context of this thread can be found here.

In the last post, we discussed about spatial reuse and improved cache optimisations. The other factor which plays a decisive role in improving cache use is temporal locality. If you may have observed that till now in all the algorithms that we saw, we never made use of the recently computed cell values and nor were we able to parallelise the outermost loop because of the carried data dependency. To put it in simple terms, every sweep that occurred on our multidimensional array, the cell values computed were dependent on the last iteration and hence there was no option for us to exploit recently computed values and in turn get the best out of spatial locality.

Time skewed

Visualisation

This algorithm was initially difficult for me to visualise as to how the control of the code moved and how were the stencils being computed in this fashion. I presume that the above two pictures will help you better understand as to what is happening. With the knowledge of the values of the cells computed in the current iteration, we may as well make use of these and perform stencil operation of the next time step albeit on a smaller block than the last time because only that much of data is available for reuse.

Time skewing is a type of cache tiling, that attempts reduce main memory traffic by reusing values in cache as often as possible. However one issue with this is, we need to find the right block size and of the right dimensions for the best use and is a machine dependent optimisation else we cannot extract the expected performance.

The performance gains were significant in the case of 512x512x512 where naive algorithm performs badly because of extensive cache misses. After various trials for finding a suitable tile size for the time skew algorithm, 512x64x8 gave better results in terms of performance.

Although here, there was not a lot of openMP stuff to try out, this definitely goes down well for us that we should first always look for serial optimisations and then lookout to parallelise the program because the gains are always significant in these optimisations.

Tiling is a well known transformation which improves locality by moving reuses to the same data closer in time. This optimisation will allow our stencil computing to make better cache utilisation as tiling improves spatial reuse.

3D PDE solvers suffer poor cache performance because accesses to the same data are usually too far apart, requiring array elements to be brought into cache multiple times per array sweep. Hence the need to perform tiling to reduce the cache misses.

This cache problem arises more often in 3D codes than in 2D codes because we have data more spaced out and as data is stored contagiously in row major fashion.

So what is Tiling? Tiling is an optimisation technique that breaks the problem into tiles and first solves these tiles and then proceed. If you see the image illustrated by me as above, instead of the computing stencils in an iterative manner going through all the cells in the array, we can break down the multi dimensional grid in tiles and then first perform stencil operations on all cells within the tile and then go to the next tile. This improves spatial reuse and hence better exploiting the cache. Well, wikipedia also explains it well and in detail.

Are tiles also 3D arrays? NO, we break the problem in 2D tiles. Why? Like if we are to have 3D tiles, as the elements along the z-axis are far spaced apart, they would cause more harm then do good. So what about 2D PDE solvers? You can but it really would not add on great performance benefits as you can very well fit your arrays in the cache unless you have a really large sized array. Here you do make them into 2D Tiles as they will fit in the cache. Fit into cache? Depending on your architecture, you need to aptly choose the right tile size to get maximum optimisation.

“A 6 point stencil which accesses six columns of B in three adjacent planes at the same time, as shown in Figure 4. With a distance of 2N^{2} between the leading A(I,J,K+1) and trailing A(I,J,K-1) array references, two entire N N planes now need to remain in cache, so only 3D arrays of size 32 32 M can fully exploit reuse for a 16K L1 cache.” Quoting directly from the paper whose reference is given below, is an example for a 6 Point 3D Jacobi solver.

nx, ny, nz are the the dimensions of the array and tx,ty are the tile dimensions.

Now coming to the question of parallelisation? OpenMP is tile oblivious in it’s implementation. Future work provisions for moving towards a tile aware parallelism and harness the available parallelism in a better way. Hence, parallelism is trivial.

So which loop to parallelise? There are six possibilities. First one can be ruled out as there is a carried loop data dependency over the time steps. By asymptotic analysis, we can say that the best parallelisation can be achieved by parallelising ii loop. As the loop has a step increment of block size, the number of iterations are less. In my test case, the optimal tile size giving best cache performance for my architecture was 128 and my block size is 256. So ii loop only iterates 4 times and we have 8 threads. This implies that we are not harnessing all the available parallelism. Hence we collapse the the two loops and now the combined work is distributed amongst the team of threads.

One more important question is what should be thetile size? Since we tile only the J and I loops, the K loop iterates across all array planes but executes only iteration points inside a TIxTJx(N-2) block. In order to preserve all reuse within the TIxTJx(N-2) block (N-2 because the loop iterates from 2 to n-1), it is therefore sufficient that the cache hold a (TI+2)x (TJ+2)x3 subarray. During each TIxTJx(N-2) block of iterations, we access approximately (TI+2)x(TJ+2)xN elements of array Anext. The cost function is – Cost(TI,TJ) = (TI+2)(TJ+2)/(TI TJ). Note that given multiple tile sizes with equal values of TI TJ, this function is minimal when TI and TJ have the smallest difference. The cost function thus favors square tiles.

I also looked if I could achieve any good performance gains by bringing nested parallelism to the table. The results were almost close to that of the above case but I guess that overhead associated with creation, synchronisation and destruction of thread took away any extra benefit coming our way.

So the most tuned code gave us a speedup of almost 3 times compared to that of serial code.

Now that we have discussed optimisation that takes in consideration of the spatial reuse of data, why should we not think of temporal reuse as well? We do not use the just computed values while sweeping over our multidimensional array? After all temporal locality and spatial locality are the two most critical factors that enhance cache performance. Solution to this is a time skewed algorithm.

Stencil codes perform a sequence of sweeps (called timesteps) through a given array. Generally this is a 2- or 3-dimensional regular grid. That is your outer most loop doing the sweeps. The elements of the arrays will be interchangeably referred to as cells. In each timestep, the stencil code updates all array elements. Using neighboring array elements in a fixed pattern , each cell’s new value is computed. In most cases boundary values are left unchanged and hence the loop variables are from 1 to n-2 (index begins from 0).

The cells are computed in a pattern as shown above with red, yellow and blue getting computed correspondingly in the sequence order. nx, ny and nz are the dimensions of our grid along the x, y and the z axis.

Now, that we have understood what and how the code works, lets get into how best can we extract the performance –

Now there happens to be a carried loop data dependency for the outermost loop (time steps) as after every iteration, it’s values are used to compute the cells for the next iteration. So it is not possible that we parallelise the outer loop as of now in this case.

So which of the other three loops are to be parallelised? These slides can be a helpful start for you to understand what implications I am about to make. r(n) is the parallel code and s(n) is the inherently serial code. Our aim is to find a case where s(n) is o(r(n))

If we are to parallelise the kloop, the serial code within it is of the order n^{2} and the parallel code is of the order n and hence it is not worth parallelising that loop. Experimental data also suggested the same. If we are to parallelise the i loop, the overhead associated with creation and synchronisation of new threads just to perform a small task is too expensive and is not worth it to give us the extra performance and rather it just exacerbates the performance.

Hence our solution one is to parallelise the j loop. As all the loop iterations will perform the same amount of work, we can set the scheduling to static. This solution stands out better than the two is because r(n) is n^{2} and s(n) is of the order n. Hence the speedup is the order of the parallelism supported by your machine.

Solution two involves us trying to be able to harness on the nested parallelism. There is a variable that you can set to allow nested parallelism to extract more juice out of your machine. You’ve to insert omp_set_nested(1); to enable nested parallelism. This is what I came across when trying to efficiently parallelise my code but somehow it ended up not showing any results at all in my case probably because of the architecture and a different compiler. So if any of you are able to test the results out, please revert back to me. collapse(n) [OpenMP 3.0+] directive merges n loops and then distributes the work amongst the team of threads. The above solution doesn’t give any significant extra performance add on but overall fairs better than the others. This is because the outer two loops are merged and now r(n) becomes n² and s(n) is of the order n and so is the case with solution 1. The actual difference is coming possibly from the creation of team of threads. For smaller test cases, the overhead associated dominates over the parallelism exploited. All these values are for Size³ grids and for 100 time steps.

Stencil computations, involving operations over the elements of an array, are a common programming pattern in scientific computing, games, and image processing. As a programming pattern, stencil computations are highly regular and amenable to optimisation and parallelisation.

This is why this series of posts should interest you for you to sharpen your skills and get you started on a topic with with lots of research happening and also a chance for you to make a tangible difference!

Not a lot is there on the internet for people who are exploring about stencil computations unless you are looking out on Google Scholar! 😛 Hence, my motivation to work on this area and write about it is for all of the community and not just for the scientific and the research community.

So this is what I got started with!

Partial differential equation (PDE) solvers constitute a large fraction of scientific applications in such diverse areas as heat diffusion, electromagnetics, and fluid dynamics. These applications are often implemented using iterative finite-difference techniques, which sweep over a spatial grid, performing nearest neighbor computations called stencils. In a stencil operation, each point in a multidimensional grid is updated with weighted contributions from a subset of its neighbors in both time and space— thereby representing the coefficients of the PDE for that data element. These operations are then used to build solvers that range from simple Jacobi iterations to complex multigrid and adaptive mesh refinement methods.

To summarise it, stencils are computations where each cell’s value in your multi dimensional array is computed by it’s neighbouring cells. This is shown to be true after dealing with some amount of mathematics. The good news for you is that you as programmers need not deal with the equations and the calculus behind it.

For the series of posts that follow, all explanations will be in tandem with 3D 7 point stencil for a heat equation solver. You can also check out my code for 2D 5 point stencil for the same heat solver.

As shown above, all the seven grids in fig (a) are used to compute the value for the cell in the middle and similarly in the case of fig(b).
The stencil equation for which all implementations are tested is Anext(i, j, k) = A0(i, j, k+1)+ A0(i, j, k-1)+ A0(i, j+1, k) + A0(i, j-1, k)+ A0(i+1, j, k)+ A0(i-1, j, k) - 6 * A0(i, j, k)/ factor^{2}

This is iterated over multiple times whose value is called time steps. This is because they happen over time and have a dependency on the previous iteration and hence the name.

The next posts will discuss the advancement in the area and how can we harness the parallelism?

The code for both 2D solver and the 3D solver can be found on Github – Stencil Code.

I use a MacBook Air with an i5 processor and 4GB RAM with and L1 cache size of 64K. The maximum number of threads that can run on my CPU are 8. All experimental setup and results are based on the above system configuration. The results with their checksum value for checking accurate implementation can be found here.

There were many research papers that I looked into to get a glimpse of what is happening and understand what sort of research is taking place. The following are the references to them –

1.Tiling Optimizations for 3D Scientific Computation

2.Implicit and Explicit Optimizations for Stencil Computation

3.Tiling Stencil Computations to Maximize Parallelism

There is a lot of promising work to be done in this field. Other areas on which further posts will be based are –

1. Diamond Tiling: Tiling Techniques to Maximize Parallelism for Stencil Computations

2. Parameterized Diamond Tiling for Stencil Computations

Also, stencil computations on GPU‘s hold a lot of promise and research in the area is also very active. As soon as I am able to lay my hands on a GPU, I will definitely try out all these optimisations and other possible scope of research for improved performance and parallelisations.