Kernels with operational intensity (OI) of $\rfrac{1}{16}$ and $8$ have been implemented. The kernels are introduced in the following sections. However the effective operational intensity of a given kernel in a high-level language (as C) is not obvious when compiled to processor instructions. Furthermore, due to today's advanced processor architecture, adaptions had to be made to account for special capabilites. This resulted in several different kernels. Not all of them are machine independent with regard to operational intensity. All kernels were compiled with \verb|gcc 5.3.1| and different options. The compilation was checked with \verb|objdump -d -M intel-mnemonics|. For a more elaborate analysis of the disassembly on the testers computer, please refer to the header file \verb|aikern.h| that should come with this report. Additionally \verb|Makefile| provides all informations about the used and tested compiler options. Good results\footnote{all, including the special FMA kernels, use only expected memory access, doing everything else in registers} were achieved with \verb|-O2 -mavx -mfma|. But \verb|-O2 -maxv -mfma| is a tradeoff between the best possible results and obviously correct compiled code. In fact the disassembly almost looks like handwritten. If even more optimization is wanted \verb|-O3| can be used. To fully utilize FMA with packed doubles \verb|-Ofast| or \verb|-Ofast -ffast-math| has to be used. Be aware that more optimization than \verb|-O2 -maxv -mfma| results in a very hard to understand disassembly. \verb|-ffast-math| can even introduce rounding errors or reduce the executed FLOPs. It is not completely obvious that the highly optimized compiled still has the wanted operational intensity. \verb|-O0| never works out. \bigskip \begin{footnotesize} \noindent\emph{Remark:} Contrary to popular believe the roofline model is built atop the notion of operational intensity\footnote{FLOPs against bytes written to DRAM} kernels. The differences to arithmetic intensities are outlined in~\textcite{williams2009}. Depending on the definition used these two terms are not necessarily interchangeable. The notion of operational intensity in the following sections might be what some would understand by the term arithmetic itensity. \end{footnotesize} \subsection{1/16 $\neq$ 1/16. Or: The Fancy Arithmetics of a Compiler} In order to understand why the following kernels are implemented the way they are, an example of a badly behaving $\rfrac{1}{16}$ OI kernel is given in~\prettyref{lst:1-16-simple-dangerous}. The kernel has one FP operation ($*$) and reads 16 bytes (a[i], b[i]) from memory. But in practice this algorithm does not work as expected. There are several ways how one could write the same kernel. \begin{itemize} \item Submitting \verb|volatile|. This results in the loop being optimized away completely for optimization levels above \verb|-O0|. \item Using no optimization i.e. \verb|-O0|. No advanced features of the processor will be used (e.g., FMA requires at least \verb|-O2|). Also just about everything is read and written from and to the stack. Even loop variables. One may now assume that this is cached anyway --- or one ain't so. \item Using \verb|volatile| and optimization. When volatile is used gcc reads and writes variable \verb|tmp| from and to the stack, even in \verb|-O3|. If tmp is cached or not is hard to predict. It's not improbable but relying on that assumption can yield wrong results. \item Using \verb|register|, \verb|volatile| and optimization. Unfortunately \verb|register| just \emph{advises} the compiler to use a register. It does not force the compiler to do so. Seemingly \verb|volatile| overrules \verb|register| in this case -- \verb|tmp| is read and written from and to the stack. Again assuming any caching behaviour is adventurous at least. \end{itemize} In the worst case found (no optimization, no volatile, no register) this results in reads of 16 bytes (a[i],b[i]) plus 8 bytes (i), and writes of 16 bytes (i, tmp assignment). Making no caching assumptions this results in an effective operational intensity of $\rfrac{1}{40}$ for a superficial $\rfrac{1}{16}$ OI kernel. For more complex kernels the results get even worse. A triad \verb|t=a*b+c| will store easy-to-miss intermediate results on the stack if no special care is taken. To prevent this, one could write assembly directly or rely on compiler intrinsics. The kernels in this report though consist just of normal C code which was hand-crafted until an acceptable compilation was reached. The generated machine code was disassembled and manually checked for hidden memory access. The results are therefore compiler and machine specific, but should be quite generalizable for the most part. \bigskip \begin{lstlisting}[caption={Simple $\rfrac{1}{16}$ kernel with questionable compiled form}, label=lst:1-16-simple-dangerous] volatile register double tmp = 0.1; for(size_t i=0; i