The best results for the various kernels are given in~\prettyref{tbl:res-kernels}. They were obtained with the binary \verb|roofline_full_manpack|, which is built with all optimizations and the intrinsics kernel enabled, invoked as \verb|roofline_full_manpack -s 150000000 -r 5|. A single array of doubles therefore occupies 1144.41\,MiB, which is clearly too large to fit in the cache.
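As a quick check of the 1144.41 figure quoted above, assuming 8-byte doubles and binary megabytes ($1\,\mathrm{MiB} = 2^{20}\,\mathrm{B}$; both are assumptions about the test platform and the unit convention, not statements taken from the program output):
\[
  150\,000\,000 \times 8\,\mathrm{B} = 1.2 \times 10^{9}\,\mathrm{B},
  \qquad
  \frac{1.2 \times 10^{9}\,\mathrm{B}}{2^{20}\,\mathrm{B}/\mathrm{MiB}} \approx 1144.41\,\mathrm{MiB}.
\]
Under the same assumptions, the input sizes 100000000 and 250000000 used further below would correspond to roughly 762.94\,MiB and 1907.35\,MiB per array.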
%

Note that the \verb|simple8| result is clearly flawed when \verb|-ffast-math| is enabled. This is caused by the non-IEEE-compliant optimizations described in~\prettyref{sec:advanced-kernels}. At this optimization level, only \verb|simple8fastmath| (which is safe under \verb|-ffast-math| but flawed at lower optimization levels) should be considered as a \emph{replacement} for \verb|simple8|.

The \verb|simple*| kernels do not make use of FMA and can therefore safely be used on processors without an FMA unit. The \verb|fma*| kernels, on the other hand, are those that should make use of FMA. \verb|simple8fastmath| is a variant of \verb|simple8| that can safely be used with \verb|-ffast-math|. Finally, \verb|fma8manpack| is the kernel that uses intrinsics to ensure that it operates solely with FMA instructions on packed floats.

\begin{table}[h!]
\centering
\begin{tabular}{ll}
\toprule
Kernel & Max. GFLOP/s \\
\midrule
simple16 & 0.9919 \\
fma16 & 0.9891 \\
simple8 & 123.4004 \\
simple8fastmath & 8.7187 \\
fma8 & 21.7866 \\
fma8manpack & 18.9066 \\
\bottomrule
\end{tabular}
\caption{Results for various kernels}
\label{tbl:res-kernels}
\end{table}

The roofline graph with the best runs of the best kernel of each category (\verb|simple16| and \verb|fma8|) is shown in~\prettyref{fig:roofline-withres}.

\begin{figure}
\begin{adjustbox}{center}
\includegraphics[width=0.8\linewidth]{res/rooftop_res}
\end{adjustbox}
\caption{Roofline graph with kernel results}
\label{fig:roofline-withres}
\end{figure}

\FloatBarrier

The best results for an input size of 100000000 are given in~\prettyref{tbl:res-kernels-10} and~\prettyref{fig:roofline-withres-10}.

\begin{table}[h!]
\centering
\begin{tabular}{ll}
\toprule
Kernel & Max. GFLOP/s \\
\midrule
fma16 & 0.9816 \\
fma8 & 21.8837 \\
\bottomrule
\end{tabular}
\caption{Best results for an input size of 100000000}
\label{tbl:res-kernels-10}
\end{table}

\begin{figure}
\begin{adjustbox}{center}
\includegraphics[width=0.8\linewidth]{res/rooftop_res10}
\end{adjustbox}
\caption{Roofline graph with best results for an input size of 100000000}
\label{fig:roofline-withres-10}
\end{figure}

The best results for an input size of 250000000 are given in~\prettyref{tbl:res-kernels-25} and~\prettyref{fig:roofline-withres-25}.

\begin{table}[h!]
\centering
\begin{tabular}{ll}
\toprule
Kernel & Max. GFLOP/s \\
\midrule
simple16 & 1.0476 \\
fma8 & 21.4297 \\
\bottomrule
\end{tabular}
\caption{Best results for an input size of 250000000}
\label{tbl:res-kernels-25}
\end{table}

\begin{figure}
\begin{adjustbox}{center}
\includegraphics[width=0.8\linewidth]{res/rooftop_res25}
\end{adjustbox}
\caption{Roofline graph with best results for an input size of 250000000}
\label{fig:roofline-withres-25}
\end{figure}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../report"
%%% End: