This is the second in series of posts related to stencil computing.
Tiling is a well known transformation which improves locality by moving reuses to the same data closer in time. This optimisation will allow our stencil computing to make better cache utilisation as tiling improves spatial reuse.
3D PDE solvers suffer poor cache performance because accesses to the same data are usually too far apart, requiring array elements to be brought into cache multiple times per array sweep. Hence the need to perform tiling to reduce the cache misses.
This cache problem arises more often in 3D codes than in 2D codes because we have data more spaced out and as data is stored contagiously in row major fashion.
So what is Tiling? Tiling is an optimisation technique that breaks the problem into tiles and first solves these tiles and then proceed. If you see the image illustrated by me as above, instead of the computing stencils in an iterative manner going through all the cells in the array, we can break down the multi dimensional grid in tiles and then first perform stencil operations on all cells within the tile and then go to the next tile. This improves spatial reuse and hence better exploiting the cache. Well, wikipedia also explains it well and in detail.
Are tiles also 3D arrays? NO, we break the problem in 2D tiles. Why? Like if we are to have 3D tiles, as the elements along the z-axis are far spaced apart, they would cause more harm then do good. So what about 2D PDE solvers? You can but it really would not add on great performance benefits as you can very well fit your arrays in the cache unless you have a really large sized array. Here you do make them into 2D Tiles as they will fit in the cache. Fit into cache? Depending on your architecture, you need to aptly choose the right tile size to get maximum optimisation.
“A 6 point stencil which accesses six columns of B in three adjacent planes at the same time, as shown in Figure 4. With a distance of 2N2 between the leading A(I,J,K+1) and trailing A(I,J,K-1) array references, two entire N N planes now need to remain in cache, so only 3D arrays of size 32 32 M can fully exploit reuse for a 16K L1 cache.” Quoting directly from the paper whose reference is given below, is an example for a 6 Point 3D Jacobi solver.
nx, ny, nz are the the dimensions of the array and tx,ty are the tile dimensions.
Now coming to the question of parallelisation? OpenMP is tile oblivious in it’s implementation. Future work provisions for moving towards a tile aware parallelism and harness the available parallelism in a better way. Hence, parallelism is trivial.
So which loop to parallelise? There are six possibilities. First one can be ruled out as there is a carried loop data dependency over the time steps. By asymptotic analysis, we can say that the best parallelisation can be achieved by parallelising ii loop. As the loop has a step increment of block size, the number of iterations are less. In my test case, the optimal tile size giving best cache performance for my architecture was 128 and my block size is 256. So ii loop only iterates 4 times and we have 8 threads. This implies that we are not harnessing all the available parallelism. Hence we collapse the the two loops and now the combined work is distributed amongst the team of threads.
One more important question is what should be the tile size? Since we tile only the J and I loops, the K loop iterates across all array planes but executes only iteration points inside a TIxTJx(N-2) block. In order to preserve all reuse within the TIxTJx(N-2) block (N-2 because the loop iterates from 2 to n-1), it is therefore sufficient that the cache hold a (TI+2)x (TJ+2)x3 subarray. During each TIxTJx(N-2) block of iterations, we access approximately (TI+2)x(TJ+2)xN elements of array Anext. The cost function is – Cost(TI,TJ) = (TI+2)(TJ+2)/(TI TJ). Note that given multiple tile sizes with equal values of TI TJ, this function is minimal when TI and TJ have the smallest difference. The cost function thus favors square tiles.
I also looked if I could achieve any good performance gains by bringing nested parallelism to the table. The results were almost close to that of the above case but I guess that overhead associated with creation, synchronisation and destruction of thread took away any extra benefit coming our way.
So the most tuned code gave us a speedup of almost 3 times compared to that of serial code.
Now that we have discussed optimisation that takes in consideration of the spatial reuse of data, why should we not think of temporal reuse as well? We do not use the just computed values while sweeping over our multidimensional array? After all temporal locality and spatial locality are the two most critical factors that enhance cache performance. Solution to this is a time skewed algorithm.
Find the full code here.
Next in series – Part 3. Time Skewed Algorithm