
This section constructs a roofline model~\cite{williams2009} for the Intel\textregistered{} Core\texttrademark{} i5-4210U. \prettyref{sec:peak} derives the theoretical floating-point peak performance of the CPU. \prettyref{sec:memory} then presents memory bandwidth measurements gathered with NUMA-STREAM~\cite{berstrom}. Both ingredients are combined into the roofline model in \prettyref{sec:model}.

\subsection{Theoretical Peak Performance}
\label{sec:peak}
The CPU under test was an Intel\textregistered{} Core\texttrademark{} i5-4210U. \prettyref{tbl:spec-4210} shows the relevant specifications for this processor according to~\textcite{ark4210}.

\begin{table}[h!]
  \centering
  \begin{tabular}{ll}
    \toprule
    Specification             & Value                \\
    \midrule
    \# of Cores               & 2                    \\
    \# of Threads             & 4                    \\
    Microarchitecture         & Haswell              \\
    Max Turbo Frequency       & 2.7 GHz              \\
    Processor Base Frequency  & 1.7 GHz              \\
    Instruction Set Extension & SSE 4.1/4.2, AVX 2.0 \\
    \bottomrule
  \end{tabular}
  \caption{Relevant processor specifications}
  \label{tbl:spec-4210}
\end{table}

According to~\textcite[5-2 Vol.1]{intel2016} the 4th generation Intel Core processors provide FMA (Fused Multiply-Add) units and AVX (Advanced Vector Extensions). While AVX alone can be the main driver of floating-point peak performance, the peak in this case is determined mainly by the FMA units.

In general an FMA unit is capable of multiple floating-point (FP) operations during a single cycle. This is directly backed by the hardware (operations are \emph{``fused''} together). Specifically the FMA unit of a Haswell processor is capable of ``[...] 256-bit floating-point instructions to perform computation on 256-bit vectors''~\cite[5-28 Vol.1]{intel2016}.

Since even a DP (double-precision) FP element is only 64 bits wide, 256 bits would be overprovisioned for a single scalar. However, the FMA instructions do not just take scalars as arguments. Instead, up to four DP FP elements can be packed together into a vector, and the operations are carried out element-wise on these vectors. An example multiply-add instruction is given in \cite{intelvfmadd132pd}.

Unfortunately no definitive source could be found, but according to \textcite{shimpi2012} the Haswell architecture is built with two FMA units per core. Taken together this gives:

\begin{enumerate}
\item Two operations are conducted at once (``fused'') and up to four DP FP elements can be packed into the argument vectors. At optimal utilization one FMA unit therefore provides $2 \cdot 4 = 8~\text{DP FLOPs}$ per cycle.
\item Two cores, each with two FMA units, can then calculate $2 \cdot 2 \cdot 8 = 32~\text{DP FLOPs}$ per cycle.
\end{enumerate}

\noindent{}At maximum turbo frequency the processor therefore has a theoretical peak performance of $32 \cdot 2.7 = 86.4~\text{GFLOP/s}$. At base frequency it is capable of $32 \cdot 1.7 = 54.4~\text{GFLOP/s}$.
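
As a compact restatement of the two steps above, the peak can be written as a single product. The symbols $n_{\text{cores}}$, $n_{\text{FMA}}$, $n_{\text{ops}}$, $n_{\text{lanes}}$ and $f$ are introduced here for illustration only and do not appear in the cited sources. The expression assumes that both FMA units of every core are kept busy in every cycle:
\[
  P_{\text{peak}} = n_{\text{cores}} \cdot n_{\text{FMA}} \cdot n_{\text{ops}} \cdot n_{\text{lanes}} \cdot f
                  = 2 \cdot 2 \cdot 2 \cdot 4 \cdot 2.7~\text{GHz} = 86.4~\text{GFLOP/s}
\]
Replacing $f$ with the base frequency of $1.7~\text{GHz}$ gives the lower value of $54.4~\text{GFLOP/s}$.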

\subsection{Memory Bandwidth}
\label{sec:memory}
To benchmark the memory bandwidth, NUMA-STREAM~\cite{berstrom} was used. The binary ran on a Fedora 23 system with kernel \verb|4.5.7-200.fc23.x86_64 x86_64| in \verb|multi-user.target| to eliminate as many sources of interference as possible. It was compiled with gcc and the following options: \texttt{-O3 -std=c99 -fopenmp -lnuma -DN=80000000 -DNTIMES=100}.


Again the details of the processor architecture pose a challenge. The i5-4210U uses hyper-threading, meaning it provides 4 hardware threads on 2 physical cores, so it is not immediately obvious how many threads NUMA-STREAM should be configured with. For this test both configurations (two and four threads)\footnote{plus two configurations with 8 and 1 threads respectively for cross-checking} were tested and the best one was chosen. The results for NUMA-STREAM configured with two threads are shown in~\prettyref{lst:numa-stream-results}. Prefixes are metric, i.e.\ $\text{M} = 10^6$, not $2^{20}$. The triad function achieved a rate of $10608~\text{MB/s}$, practically on par with the add function at $10615~\text{MB/s}$. The triad function is the most demanding kernel of NUMA-STREAM and is defined in \cite{bergstrom2} as \lstinline[language=C]|a[j] = b[j]+scalar*c[j]|. All other tested configurations gave lower rates for all four kernels, although the difference was at most $300~\text{MB/s}$.

\bigskip
\begin{lstlisting}[caption={NUMA-STREAM results for two threads}, label=lst:numa-stream-results, numbers=none]
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9373.3846       0.1368       0.1366       0.1390
Scale:       9414.1304       0.1361       0.1360       0.1381
Add:        10614.6002       0.1812       0.1809       0.1835
Triad:      10607.7910       0.1813       0.1810       0.1834
\end{lstlisting}
\bigskip
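
The reported triad rate can be cross-checked by hand. This is only a plausibility sketch: it assumes the usual STREAM convention that the rate is the number of bytes moved per iteration divided by the minimum time, and that triad touches three arrays of $N = 80000000$ doubles of $8~\text{B}$ each (two read, one written):
\[
  \frac{3 \cdot 8~\text{B} \cdot 80000000}{0.1810~\text{s}}
  = \frac{1.92 \cdot 10^{9}~\text{B}}{0.1810~\text{s}}
  \approx 10608~\text{MB/s}
\]
This agrees with the triad row of the listing up to the rounding of the printed minimum time.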

\subsection{Graph}
\label{sec:model}

The graph of the roofline model is defined by~\cite{williams2009}:
\begin{verbatim}
  Attainable GFLOP/s = Min(Peak Floating-Point Performance,
                           Peak Memory Bandwidth*Operational Intensity)
\end{verbatim}

\noindent{}The resulting graph for the values obtained in~\prettyref{sec:peak} and~\prettyref{sec:memory} can be seen in \prettyref{fig:roofline}.

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{res/rooftop}
  \end{adjustbox}
  \caption{Roofline graph from the values obtained in~\prettyref{sec:peak} and~\prettyref{sec:memory}}
  \label{fig:roofline}
\end{figure}
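
As a worked example of this definition, the ridge point, i.e.\ the operational intensity at which the bandwidth roof meets the compute roof, can be estimated from the two measured values. The symbols $I_{\text{ridge}}$, $P_{\text{peak}}$ and $B$ are names chosen here for illustration. The triad rate of $10608~\text{MB/s} = 10.608~\text{GB/s}$ is taken as the peak memory bandwidth, which is an assumption about how the two results are combined:
\[
  I_{\text{ridge}} = \frac{P_{\text{peak}}}{B}
  = \frac{86.4~\text{GFLOP/s}}{10.608~\text{GB/s}} \approx 8.1~\text{FLOP/B}
\]
At base frequency the same calculation gives $54.4 / 10.608 \approx 5.1~\text{FLOP/B}$. For comparison, the triad kernel itself performs 2 FLOPs per iteration while moving $3 \cdot 8 = 24~\text{B}$ (ignoring any additional write-allocate traffic), so under these assumptions it is clearly memory bound:
\[
  \min\left(86.4,\; 10.608 \cdot \tfrac{2}{24}\right)~\text{GFLOP/s} \approx 0.88~\text{GFLOP/s}
\]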


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../report"
%%% End: