This is the second in a series of posts on stencil computing.
In the last post, we discussed spatial reuse and the cache optimisations built on it. The other factor that plays a decisive role in improving cache use is temporal locality. You may have noticed that none of the algorithms we have seen so far made use of recently computed cell values, and that we could not parallelise the outermost loop because of the loop-carried data dependency. To put it simply, in every sweep over our multidimensional array the new cell values depended on the values from the previous iteration, so we had no way to exploit recently computed values and get the best out of temporal locality.
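To make that dependency concrete, here is a minimal sketch of a naive sweep-per-time-step loop. It is not the code from the earlier post: the 7-point averaging stencil, the cubic n x n x n grid, and the names idx, sweep and naive_stencil are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an n*n*n grid stored in row-major order
   (hypothetical helper). */
static size_t idx(size_t n, size_t i, size_t j, size_t k) {
    return (i * n + j) * n + k;
}

/* One sweep: each interior cell of dst becomes the mean of the matching
   cell of src and its six face neighbours (an assumed 7-point stencil).
   Boundary cells are never written. */
void sweep(size_t n, const double *src, double *dst) {
    for (size_t i = 1; i + 1 < n; ++i)
        for (size_t j = 1; j + 1 < n; ++j)
            for (size_t k = 1; k + 1 < n; ++k)
                dst[idx(n, i, j, k)] =
                    (src[idx(n, i, j, k)] +
                     src[idx(n, i - 1, j, k)] + src[idx(n, i + 1, j, k)] +
                     src[idx(n, i, j - 1, k)] + src[idx(n, i, j + 1, k)] +
                     src[idx(n, i, j, k - 1)] + src[idx(n, i, j, k + 1)]) / 7.0;
}

/* Runs `steps` sweeps and returns the buffer holding the final values.
   Both buffers must start with the same boundary values. Sweep t + 1
   reads what sweep t wrote, so the t loop is strictly sequential, and
   every sweep streams the whole grid through the cache again. */
double *naive_stencil(size_t n, int steps, double *a, double *b) {
    assert(n >= 3 && steps >= 0);
    for (int t = 0; t < steps; ++t) {
        sweep(n, a, b);
        double *tmp = a;  /* swap roles: the output becomes the next input */
        a = b;
        b = tmp;
    }
    return a;
}
```

The loops inside sweep are independent of one another and could be parallelised; it is only the outer t loop that carries the dependency.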
I initially found this algorithm difficult to visualise: how control moves through the code, and how the stencils end up being computed in this order. I hope the two pictures above help you understand what is happening. Once we know the cell values computed in the current iteration, we may as well use them right away to perform the stencil operation for the next time step, albeit on a smaller block than before, because only that much data is available for reuse.
Time skewing is a type of cache tiling that attempts to reduce main memory traffic by reusing values in cache as often as possible. One issue, however, is that we need to find a tile of the right size and shape to get the most out of it. This is a machine-dependent optimisation, and with a poorly chosen tile we cannot extract the expected performance.
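The idea is easier to follow in one dimension, so here is a simplified sketch rather than the 3D code this post is about. It assumes a 3-point averaging stencil, two ping-pong buffers that start with identical boundary values, and a tile that slides one cell to the left at every time step. The names avg3, jacobi1d_naive and jacobi1d_skewed and the tile parameter are hypothetical. The naive version is included only so the two results can be compared.

```c
#include <assert.h>

/* 3-point average used by both versions (an assumed stencil). */
static double avg3(const double *src, long i) {
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0;
}

/* Reference: one full sweep over the array per time step.
   Time step s reads buf[s % 2] and writes buf[(s + 1) % 2]. */
double *jacobi1d_naive(long n, int steps, double *even, double *odd) {
    assert(n >= 3 && steps >= 0);
    double *buf[2] = { even, odd };
    for (int s = 0; s < steps; ++s) {
        const double *src = buf[s % 2];
        double *dst = buf[(s + 1) % 2];
        for (long i = 1; i < n - 1; ++i)
            dst[i] = avg3(src, i);
    }
    return buf[steps % 2];
}

/* Time-skewed: take a tile of `tile` cells, advance it through all time
   steps while it is still in cache, then move on to the next tile.
   At step s the tile covers [lo - s, hi - s): the shift to the left
   keeps every read inside data that is already up to date. */
double *jacobi1d_skewed(long n, int steps, long tile,
                        double *even, double *odd) {
    assert(n >= 3 && steps >= 0 && tile >= 1);
    double *buf[2] = { even, odd };
    /* Tiles starting past the right edge finish the cells that the
       skew leaves behind at later time steps. */
    for (long lo = 1; lo < n - 1 + steps; lo += tile) {
        long hi = lo + tile;
        for (int s = 0; s < steps; ++s) {
            long begin = lo - s < 1 ? 1 : lo - s;        /* clamp to interior */
            long end = hi - s > n - 1 ? n - 1 : hi - s;
            const double *src = buf[s % 2];
            double *dst = buf[(s + 1) % 2];
            for (long i = begin; i < end; ++i)
                dst[i] = avg3(src, i);
        }
    }
    return buf[steps % 2];
}
```

Both functions perform exactly the same arithmetic on every cell; only the order of the updates differs. The tile argument is the knob that has to be tuned per machine: too large and the tile no longer fits in cache across the time steps, too small and the loop overhead dominates.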
The performance gains were significant for the 512x512x512 case, where the naive algorithm performs badly because of extensive cache misses. After various trials to find a suitable tile size for the time skewing algorithm, 512x64x8 gave the best performance of the sizes I tried.
Although there was not much OpenMP to try out this time, the lesson is a useful one: we should always look for serial optimisations first and only then look to parallelise the program, because the gains from these optimisations can be significant.
You can find the code on my GitHub page.