This is a continuation from the last post.
Stencil codes perform a sequence of sweeps (called timesteps) through a given array. Generally this is a 2- or 3-dimensional regular grid. That is your outer most loop doing the sweeps. The elements of the arrays will be interchangeably referred to as cells. In each timestep, the stencil code updates all array elements. Using neighboring array elements in a fixed pattern , each cell’s new value is computed. In most cases boundary values are left unchanged and hence the loop variables are from 1 to n-2 (index begins from 0).
The cells are computed in a pattern as shown above with red, yellow and blue getting computed correspondingly in the sequence order. nx, ny and nz are the dimensions of our grid along the x, y and the z axis.
Now, that we have understood what and how the code works, lets get into how best can we extract the performance –
Now there happens to be a carried loop data dependency for the outermost loop (time steps) as after every iteration, it’s values are used to compute the cells for the next iteration. So it is not possible that we parallelise the outer loop as of now in this case.
So which of the other three loops are to be parallelised? These slides can be a helpful start for you to understand what implications I am about to make. r(n) is the parallel code and s(n) is the inherently serial code. Our aim is to find a case where s(n) is o(r(n))
If we are to parallelise the k loop, the serial code within it is of the order n2 and the parallel code is of the order n and hence it is not worth parallelising that loop. Experimental data also suggested the same. If we are to parallelise the i loop, the overhead associated with creation and synchronisation of new threads just to perform a small task is too expensive and is not worth it to give us the extra performance and rather it just exacerbates the performance.
Hence our solution one is to parallelise the j loop. As all the loop iterations will perform the same amount of work, we can set the scheduling to static. This solution stands out better than the two is because r(n) is n2 and s(n) is of the order n. Hence the speedup is the order of the parallelism supported by your machine.
Solution two involves us trying to be able to harness on the nested parallelism. There is a variable that you can set to allow nested parallelism to extract more juice out of your machine. You’ve to insert omp_set_nested(1); to enable nested parallelism. This is what I came across when trying to efficiently parallelise my code but somehow it ended up not showing any results at all in my case probably because of the architecture and a different compiler. So if any of you are able to test the results out, please revert back to me. collapse(n) [OpenMP 3.0+] directive merges n loops and then distributes the work amongst the team of threads. The above solution doesn’t give any significant extra performance add on but overall fairs better than the others. This is because the outer two loops are merged and now r(n) becomes n² and s(n) is of the order n and so is the case with solution 1. The actual difference is coming possibly from the creation of team of threads. For smaller test cases, the overhead associated dominates over the parallelism exploited. All these values are for Size³ grids and for 100 time steps.
Find the full code here.
That takes us to our next stage that involves optimisation for better spatial reuse.
Part 2. Blocking Alogorithm