Consider the above example for matrix multiplication. We can think of the (1,1) entry of the product as an inner product of the first row and the first column of the two matrices in the product: 650 = 1 · 21 + 5 · 22 + 9 · 23 + 13 · 24.

Similarly, we can think of partitioning all three matrices into 2-by-2 blocks, and writing the first block of the product in terms of the first block row and block column of the two terms in the product.

The advantage of thinking about matrix multiplication in this way is that even if our original matrices don’t fit into cache, the little blocks will; and if we break the matrices into b-by-b blocks, then each block multiplication reads only 2b^{2} data while performing 2b^{3} floating-point operations. Note that the idea of blocking can be applied recursively: a large matrix might be partitioned into big blocks that fit in L2 cache, which in turn are partitioned into smaller blocks that fit in L1 cache, which in turn are partitioned into tiny blocks that fit into the registers. This is a serial tuning optimisation that tries to improve cache performance and hence the overall speedup.

By means of this simple tuning, for a 1024×1024 test case with a block size of 16, the parallelised naive implementation was 6 times slower than the parallelised blocked version. The blocked code also performs an unnecessary extra computation for i inside the bj loop; this can easily be hoisted to the enclosing loop, and we can then remove the collapse clause, with performance for the above test case staying in the same ballpark. The collapse clause distributes the work better, but that added bit of performance is nullified by computing the i-offset n^{2} times instead of n times.