The spikes at matrix dimensions that fit in cache are not due to the
loading of a partial cache line, but rather to the loop unrolling
performed by the compiler's -O3 option. The generated assembler shows that
the compiler unrolled the loops so that the inner loop strides over incr
elements at a time, where incr is the number of doubles that fit into a
cache line. So what we're seeing is not partial loading, but rather loop
overhead.
The inner loop code looks like this after -O3 gets done with it:
incr = line_size / sizeof(double)   /* Number of doubles in a cache line */
round = n - (n % incr)              /* Largest multiple of incr <= n */
/* Strip mine */
do 20 i = 1,round,incr
    do a cache line's worth of s = s + a(k,i)*b(k,j)
    ...
    ...
    ...
20 continue
/* One at a time to take care of leftover */
do 30 i = round+1, n, 1
    s = s + a(k,i)*b(k,j)
30 continue
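
The same strip-mine-plus-leftover shape can be written out in C. This is only a sketch of the pattern described above, not the code the compiler actually emitted. It assumes a 64-byte cache line (so 8 doubles per line) and uses a plain dot product over two contiguous arrays in place of the a(k,i)*b(k,j) indexing. The names dot_strip_mined and leftover_iterations are hypothetical, chosen for illustration.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines; adjust for the target machine. */
#define LINE_SIZE 64
#define INCR (LINE_SIZE / sizeof(double))  /* doubles per cache line (8 here) */

/* Hypothetical helper: number of trips through the one-at-a-time loop. */
size_t leftover_iterations(size_t n)
{
    return n % INCR;
}

/* Hypothetical helper: dot product of x[0..n-1] and y[0..n-1], written in
 * the strip-mined form. The unrolled body below is hand-written for
 * INCR == 8 and must be changed if LINE_SIZE changes. */
double dot_strip_mined(const double *x, const double *y, size_t n)
{
    size_t round = n - (n % INCR);  /* largest multiple of INCR <= n */
    double s = 0.0;
    size_t i;

    /* Strip mine: one cache line's worth of doubles per trip. */
    for (i = 0; i < round; i += INCR) {
        s += x[i + 0] * y[i + 0];
        s += x[i + 1] * y[i + 1];
        s += x[i + 2] * y[i + 2];
        s += x[i + 3] * y[i + 3];
        s += x[i + 4] * y[i + 4];
        s += x[i + 5] * y[i + 5];
        s += x[i + 6] * y[i + 6];
        s += x[i + 7] * y[i + 7];
    }

    /* One at a time to take care of the leftover elements. */
    for (; i < n; i++) {
        s += x[i] * y[i];
    }
    return s;
}
```

Under these assumptions, n = 19 takes two trips through the unrolled loop and then three trips through the leftover loop, each paying its own compare-and-branch, while n = 16 skips the leftover loop entirely.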