The best results for the various kernels are given in~\prettyref{tbl:res-kernels}. These results were obtained with the binary \verb|roofline_full_manpack|, which has all optimizations and the intrinsics kernel enabled. It was invoked as \verb|roofline_full_manpack -s 150000000 -r 5|. Each double array was therefore 1144.41\,MiB in size, clearly too large to fit in the cache.
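
As a quick check of the array size, assume that \verb|-s| gives the number of elements per array and that a \verb|double| occupies 8 bytes (usual on x86-64, but neither is stated explicitly here). Under these assumptions the size works out to
\[
  150\,000\,000 \times 8\,\mathrm{B} = 1.2 \times 10^{9}\,\mathrm{B},
  \qquad
  \frac{1.2 \times 10^{9}\,\mathrm{B}}{2^{20}\,\mathrm{B/MiB}} \approx 1144.41\,\mathrm{MiB}.
\]
The value 1144.41 is thus counted in units of $2^{20}$ bytes; in decimal units the same array is 1.2\,GB.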

Note that the \verb|simple8| result is clearly flawed with \verb|-ffast-math| enabled. This is due to the non-IEEE-compliant optimization described in~\prettyref{sec:advanced-kernels}. At this optimization level, only \verb|simple8fastmath| (which is safe under fast-math but flawed at lower optimization levels) should be considered as a \emph{replacement} for \verb|simple8|.

\begin{table}[h!]
  \centering
  \begin{tabular}{lr}
    \toprule
    Kernel          & Max. GFLOP/s \\
    \midrule
    simple16        & 0.9919       \\
    fma16           & 0.9891       \\
    simple8         & 123.4004     \\
    simple8fastmath & 8.7187       \\
    fma8            & 21.7866      \\
    fma8manpack     & 18.9066      \\
    \bottomrule
  \end{tabular}
  \caption{Results for various kernels}
  \label{tbl:res-kernels}
\end{table}

The roofline graph with the best runs of the best kernel from each of the two categories (\verb|simple16| and \verb|fma8|) is shown in~\prettyref{fig:roofline-withres}.

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{res/rooftop_res}
  \end{adjustbox}
  \caption{Roofline graph with kernel results}
  \label{fig:roofline-withres}
\end{figure}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../report"
%%% End: