The results are not really unexpected. 22 GFLOP/s is quite good for a consumer-grade CPU. There are many possible reasons why the peak performance was not reached. One is that the binary had to share its CPU time with other processes running on the same machine.
It is curious that the \verb|fma8| kernel is faster than the \verb|fma8manpack| kernel, but there is more to this story. At optimization levels other than \verb|-O3|, the \verb|fma8| kernel performs much worse (${\sim}1/3$) than \verb|fma8manpack|. A look at the disassembly reveals that \verb|fma8| uses an odd combination of packed and unpacked FMA instructions. It is not even certain that the optimizer left the effective OI of the \verb|fma8| kernel unchanged.
One takeaway of this report is that the superficial OI of a kernel written in a high-level language does not necessarily match the OI of the kernel once it is compiled to machine code.
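
To make this point concrete, here is a rough sketch of the quantity in question, using the usual roofline-style definition of operational intensity. The symbols $W$ (floating-point operations performed) and $Q$ (bytes moved between memory and the processor) are notation assumed here for illustration; they are not taken from the kernels themselves.
\[
  I_{\mathrm{src}} = \frac{W_{\mathrm{src}}}{Q_{\mathrm{src}}},
  \qquad
  I_{\mathrm{eff}} = \frac{W_{\mathrm{eff}}}{Q_{\mathrm{eff}}}
  \qquad [\mathrm{FLOP/byte}]
\]
Here $I_{\mathrm{src}}$ is what one counts by reading the source code, and $I_{\mathrm{eff}}$ is what the compiled binary actually executes. If the compiler keeps operands in registers instead of reloading them, $Q_{\mathrm{eff}} < Q_{\mathrm{src}}$; if it eliminates or merges arithmetic, $W_{\mathrm{eff}} < W_{\mathrm{src}}$. In either case $I_{\mathrm{eff}}$ may differ from $I_{\mathrm{src}}$, which is why only the disassembly can settle the effective OI of a kernel.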
Please also refer to the header file \verb|aikern.h| for a more technical analysis of the disassembly of the various kernels. The \verb|Makefile| generates many different variations of the kernels to play with. When a binary is run, the \verb|log/| folder will contain details about the time and GFLOP of every kernel.