
The best results for the various kernels are given in~\prettyref{tbl:res-kernels}. These results were obtained with the binary \verb|roofline_full_manpack|, which is built with all optimizations and the intrinsics kernel enabled. It was invoked as \verb|roofline_full_manpack -s 150000000 -r 5|. A single array of doubles therefore occupied 1144.41\,MiB, which is clearly too large to fit in the cache.
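
As a quick check of the array size quoted above, the following sketch assumes 8 bytes per double-precision value and binary megabytes of $2^{20}$ bytes; the symbol $n$ for the number of elements is introduced here for illustration only:
\[
  n \cdot 8\,\mathrm{B} = 150\,000\,000 \cdot 8\,\mathrm{B} = 1.2 \cdot 10^{9}\,\mathrm{B},
  \qquad
  \frac{1.2 \cdot 10^{9}\,\mathrm{B}}{2^{20}\,\mathrm{B}/\mathrm{MiB}} \approx 1144.41\,\mathrm{MiB}
\]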

Note that the \verb|simple8| result is clearly flawed when \verb|-ffast-math| is enabled. This is caused by the non-IEEE-compliant optimization described in~\prettyref{sec:advanced-kernels}. At this level of optimization, only \verb|simple8fastmath| (which is safe under fast-math but flawed at lower optimization levels) should be considered as a \emph{replacement} for \verb|simple8|.

The \verb|simple*| kernels do not make use of FMA and can therefore be used safely on processors without an FMA unit. The \verb|fma*| kernels, on the other hand, are those that should make use of FMA. \verb|simple8fastmath| is a variant of \verb|simple8| that can be used safely with the \verb|-ffast-math| optimization. Finally, \verb|fma8manpack| uses intrinsics to ensure that it operates solely with FMA instructions on packed floats.

\begin{table}[h!]
  \centering
  \begin{tabular}{ll}
    \toprule
    Kernel          & Max. GFLOP/s \\
    \midrule
    simple16        & 0.9919       \\
    fma16           & 0.9891       \\
    simple8         & 123.4004     \\
    simple8fastmath & 8.7187       \\
    fma8            & 21.7866      \\
    fma8manpack     & 18.9066      \\
    \bottomrule
  \end{tabular}
  \caption{Best results for the various kernels with an input size of 150000000}
  \label{tbl:res-kernels}
\end{table}

The roofline graph with the best runs of the two best kernels, one from each category (\verb|simple16| and \verb|fma8|), is depicted in~\prettyref{fig:roofline-withres}.
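
As a reminder of how to read such a graph, the roofline bound can be written as shown below. The symbols are introduced here for illustration only and are not taken from the rest of this report: $P$ is the attainable performance in GFLOP/s, $P_{\mathrm{peak}}$ the peak floating-point performance, $B$ the peak memory bandwidth in GB/s, and $I$ the operational intensity of a kernel in FLOP/byte.
\[
  P = \min\left(P_{\mathrm{peak}},\; B \cdot I\right)
\]
A kernel whose measured point lies below the sloped part of the roof is limited by memory bandwidth, while one below the flat part is limited by compute throughput.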

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{res/rooftop_res}
  \end{adjustbox}
  \caption{Roofline graph with kernel results}
  \label{fig:roofline-withres}
\end{figure}

\FloatBarrier

Best results for an input size of 100000000 are given in~\prettyref{tbl:res-kernels-10} and~\prettyref{fig:roofline-withres-10}.

\begin{table}[h!]
  \centering
  \begin{tabular}{ll}
    \toprule
    Kernel          & Max. GFLOP/s \\
    \midrule
    fma16           & 0.9816       \\
    fma8            & 21.8837      \\
    \bottomrule
  \end{tabular}
  \caption{Best results for an input size of 100000000}
  \label{tbl:res-kernels-10}
\end{table}

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{res/rooftop_res10}
  \end{adjustbox}
  \caption{Roofline graph with best results for an input size of 100000000}
  \label{fig:roofline-withres-10}
\end{figure}

Best results for an input size of 250000000 are given in~\prettyref{tbl:res-kernels-25} and~\prettyref{fig:roofline-withres-25}.

\begin{table}[h!]
  \centering
  \begin{tabular}{ll}
    \toprule
    Kernel   & Max. GFLOP/s \\
    \midrule
    simple16 & 1.0476       \\
    fma8     & 21.4297      \\
    \bottomrule
  \end{tabular}
  \caption{Best results for an input size of 250000000}
  \label{tbl:res-kernels-25}
\end{table}

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{res/rooftop_res25}
  \end{adjustbox}
  \caption{Roofline graph with best results for an input size of 250000000}
  \label{fig:roofline-withres-25}
\end{figure}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../report"
%%% End: