hpc/roofline/report/inputs/kernels.tex

Kernels with operational intensity (OI) of $\rfrac{1}{16}$ and $8$ have been implemented. The kernels are introduced in the following sections.

However the effective operational intensity of a given kernel in a high-level language (as C) is not obvious when compiled to processor instructions. Furthermore, due to today's advanced processor architecture, adaptions had to be made to account for special capabilites. This resulted in several different kernels. Not all of them are machine independent with regard to operational intensity.

All kernels were compiled with \verb|gcc 5.3.1| and different options. The compilation was checked with \verb|objdump -d -M intel-mnemonics|. For a more elaborate analysis of the disassembly on the testers computer, please refer to the header file \verb|aikern.h| that should come with this report. Additionally \verb|Makefile| provides all informations about the used and tested compiler options.

Good results\footnote{all, including the special FMA kernels, use only expected memory access, doing everything else in registers} were achieved with \verb|-O2 -mavx -mfma|. But \verb|-O2 -maxv -mfma| is a tradeoff between the best possible results and obviously correct compiled code. In fact the disassembly almost looks like handwritten. If even more optimization is wanted \verb|-O3| can be used. To fully utilize FMA with packed doubles \verb|-Ofast| or \verb|-Ofast -ffast-math| has to be used. Be aware that more optimization than \verb|-O2 -maxv -mfma| results in a very hard to understand disassembly. \verb|-ffast-math| can even introduce rounding errors or reduce the executed FLOPs. It is not completely obvious that the highly optimized compiled still has the wanted operational intensity. \verb|-O0| never works out.

\bigskip
\begin{footnotesize}
\noindent\emph{Remark:} Contrary to popular believe the roofline model is built atop the notion of operational intensity\footnote{FLOPs against bytes written to DRAM} kernels. The differences to arithmetic intensities are outlined in~\textcite{williams2009}. Depending on the definition used these two terms are not necessarily interchangeable. The notion of operational intensity in the following sections might be what some would understand by the term arithmetic itensity.
\end{footnotesize}

\subsection{1/16 $\neq$ 1/16. Or: The Fancy Arithmetics of a Compiler}

In order to understand why the following kernels are implemented the way they are, an example of a badly behaving $\rfrac{1}{16}$ OI kernel is given in~\prettyref{lst:1-16-simple-dangerous}. The kernel has one FP operation ($*$) and reads 16 bytes (a[i], b[i]) from memory. But in practice this algorithm does not work as expected. There are several ways how one could write the same kernel.

\begin{itemize}
\item Submitting \verb|volatile|. This results in the loop being optimized away completely for optimization levels above \verb|-O0|.
\item Using no optimization i.e. \verb|-O0|. No advanced features of the processor will be used (e.g., FMA requires at least \verb|-O2|). Also just about everything is read and written from and to the stack. Even loop variables. One may now assume that this is cached anyway --- or one ain't so.
\item Using \verb|volatile| and optimization. When volatile is used gcc reads and writes variable \verb|tmp| from and to the stack, even in \verb|-O3|. If tmp is cached or not is hard to predict. It's not improbable but relying on that assumption can yield wrong results.
\item Using \verb|register|, \verb|volatile| and optimization. Unfortunately \verb|register| just \emph{advises} the compiler to use a register. It does not force the compiler to do so. Seemingly \verb|volatile| overrules \verb|register| in this case -- \verb|tmp| is read and written from and to the stack. Again assuming any caching behaviour is adventurous at least.
\end{itemize}

In the worst case found (no optimization, no volatile, no register) this results in reads of 16 bytes (a[i],b[i]) plus 8 bytes (i), and writes of 16 bytes (i, tmp assignment). Making no caching assumptions this results in an effective operational intensity of $\rfrac{1}{40}$ for a superficial $\rfrac{1}{16}$ OI kernel. For more complex kernels the results get even worse. A triad \verb|t=a*b+c| will store easy-to-miss intermediate results on the stack if no special care is taken.

To prevent this, one could write assembly directly or rely on compiler intrinsics. The kernels in this report though consist just of normal C code which was hand-crafted until an acceptable compilation was reached. The generated machine code was disassembled and manually checked for hidden memory access. The results are therefore compiler and machine specific, but should be quite generalizable for the most part.

\bigskip
\begin{lstlisting}[caption={Simple $\rfrac{1}{16}$ kernel with questionable compiled form}, label=lst:1-16-simple-dangerous]
volatile register double tmp = 0.1;
for(size_t i=0; i<size; i++)
    tmp = a[i] * b[i];
\end{lstlisting}

\subsection{The 1/16 OI Kernel}
\label{sec:1-16}
Two $\rfrac{1}{16}$ kernels have been implemented. The kernel in~\prettyref{lst:1-16-simple} is a standard kernel which does not assume special processor capabilities. The second kernel in~\prettyref{lst:1-16-fma} however is designed to make use of a processor's FMA unit.

The simple kernel in~\prettyref{lst:1-16-simple} reads 8 bytes (a[i]) once for both operands of $*$ and writes 8 bytes (again to a[i]). This results in 16 byte operations. Only one FP instruction is executed, namely $*$. At \verb|-O2| the loop variable is held in a register. This results in an $\rfrac{1}{16}$ OI kernel.

\bigskip
\begin{lstlisting}[caption={Simple $\rfrac{1}{16}$ OI kernel}, label=lst:1-16-simple]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++)
  a[i] = a[i] * a[i];
\end{lstlisting}
\bigskip

The FMA aware kernel in~\prettyref{lst:1-16-fma} is a bit more involved. First a triad operation is used ($*$ and $+$ operations have to be balanced). This results in 2 FP instructions executed per round. $3*8=24$ bytes have to be read (a[i], b[i], c[i]) and 8 bytes have to be written (a[i]), in sum 32 byte operations. This results in an $\rfrac{2}{32} = \rfrac{1}{16}$  OI kernel. The loop variable is again held in a register.

Be aware that the FMA kernel \emph{cannot} be used on a non-FMA processor. For the FMA aware kernel to work correctly it is important that
\begin{enumerate*}[label=(\roman*)]
\item the processor has an FMA unit
\item the \verb|aikern.c| library is compiled with at least \verb|-O2 -mavx -mfma|
\item the compiled binary really makes use an FMA instruction (such as \verb|vfmadd132sd|~\cite{intelvfmadd132sd} or even \verb|vfmadd132pd|~\cite{intelvfmadd132pd} on the testers machine)
\end{enumerate*}.
Otherwise the results are meaningless due to write-outs of intermediary values.

Also note that in order to use the full capabilities of Intel's FMA the doubles must be packed. This happens if \verb|-Ofast| is given to \verb|gcc| in addition. However this also triggers other optimizations such that the disassembly gets long and complex. It is not immediately obvious that the generated disassembly is correct. But no instructions could be found that do not solely use registers, except loading and storing data from and to the arrays -- just as wanted.

\bigskip
\begin{lstlisting}[caption={FMA aware $\rfrac{1}{16}$ OI kernel}, label=lst:1-16-fma]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++)
  a[i] = a[i] * b[i] + c[i];
\end{lstlisting}
\bigskip

\subsection{The 8 OI Kernel}
In this section the implemented 8 OI kernels are shown.~\prettyref{lst:8-1-simple} is a simple 8 OI kernel which should work on any processor. The kernel in~\prettyref{lst:8-1-fma} is tailored for processors with an FMA unit. For the kernels macros were used to repeat the floating point instructions. In some sense this behaves like a huge loop unrolling. Some of the used repeating macros are shown in~\prettyref{lst:8-1-macros}.

\bigskip
\begin{lstlisting}[caption={Macros for bulk repeating instructions}, label=lst:8-1-macros]
#define REP0(X)
#define REP1(X)   X
#define REP2(X)   REP1(X)  REP1(X)
#define REP3(X)   REP2(X)  REP1(X)
//[...]
#define REP9(X)   REP8(X)  REP1(X)
#define REP10(X)  REP9(X)  REP1(X)
#define REP20(X)  REP10(X) REP10(X)
//[...]
#define REP100(X) REP50(X) REP50(X)
\end{lstlisting}
\bigskip

The simple kernel in~\prettyref{lst:8-1-simple} reads 8 bytes (a[i]) and writes 8 bytes (a[i]) while performing 128 FLOPs in total. Therefore this represents a $\rfrac{128}{16}=\rfrac{8}{1}$ OI kernel.

\bigskip
\begin{lstlisting}[caption={Simple $8$ OI kernel}, label=lst:8-1-simple]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++){
  a[i] = REP100(a[i]*)
         REP20(a[i]*)
         REP8(a[i]*)
         REP1(a[i]);
}
\end{lstlisting}
\bigskip

For the most part things mentioned already in~\prettyref{sec:1-16} hold true for the 8 OI FMA aware kernel too. Please refer to~\prettyref{sec:1-16} for more detailed information about the rationale behind. Compiling this with \verb|-O2 -mavx -mfma| yields an obviously correct result. However if one wants to make use of packed doubles \verb|-Ofast| has to be used which optimizes the code further so that the disassembly is hard to grasp. Anyway it seems that with \verb|-Ofast| at least no malicious read/writes are introduced.

The FMA aware kernel in~\prettyref{lst:8-1-fma} reads 8 bytes (a[i]) and writes 8 bytes (a[i] but only once per iteration), totalling 16 bytes. Please keep in mind that intermediate a[i] are not written back but instead (at least with \verb|-O2| or better) held in a register. There is only one \verb|vmovsd| instruction for writing the value back in each iteration. The kernel executes $64*2 = 128$ FLOPs. Therefore this is a $\rfrac{128}{16} = \rfrac{8}{1}$ OI kernel.

\bigskip
\begin{lstlisting}[caption={FMA aware $8$ OI kernel}, label=lst:8-1-fma]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++){
  REP60(a[i] = a[i] * a[i] + a[i];)
  REP4(a[i] = a[i] * a[i] + a[i];)
}
\end{lstlisting}
\bigskip


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../report"
%%% End: