This section constructs a roofline model~\cite{williams2009} for the Intel\textregistered{} Core\texttrademark{} i5-4210U. \prettyref{sec:peak} calculates the theoretical floating-point peak performance of the CPU. \prettyref{sec:memory} then presents memory bandwidth measurements gathered with NUMA-STREAM~\cite{berstrom}. Both results are combined into the roofline model in \prettyref{sec:model}.
The CPU under test was an Intel\textregistered{} Core\texttrademark{} i5-4210U. \prettyref{tbl:spec-4210} shows the relevant specifications for this processor according to~\textcite{ark4210}.
According to~\textcite[5-2 Vol.1]{intel2016}, the 4th generation Intel Core processors provide FMA (Fused Multiply-Add) units and AVX (Advanced Vector Extensions). Although AVX alone can be the main driver of floating-point peak performance, the peak of this processor is mainly determined by its FMA units.
In general, an FMA unit is capable of multiple floating-point (FP) operations during a single cycle. This is directly backed by the hardware (the operations are \emph{``fused''} together). Specifically, the FMA unit of a Haswell processor is capable of ``[...] 256-bit floating-point instructions to perform computation on 256-bit vectors''~\cite[5-28 Vol.1]{intel2016}.
Since even a double-precision (DP) FP element is only 64 bits wide, 256 bits would be overprovisioned for a single scalar. However, the FMA instructions do not just take scalars as arguments. Instead, up to four DP FP elements can be packed together in a vector, and the operations are conducted element-wise. An example multiply-add instruction is given in \cite{intelvfmadd132pd}.
Unfortunately, no definitive source could be found, but according to \textcite{shimpi2012} the Haswell architecture is built with two FMA units per core. Taken together, this yields:
\begin{enumerate}
\item Two operations are conducted at once (``fused'') and up to four DP FP elements can be packed into the argument vectors. At optimal utilization, each FMA unit therefore provides $2 \cdot 4 = 8~\text{DP FLOPs}$ per cycle.
\item Two cores, each with two FMA units, can then perform $2 \cdot 2 \cdot 8 = 32~\text{DP FLOPs}$ per cycle.
\end{enumerate}
\noindent{}At the maximum turbo frequency of $2.7~\text{GHz}$, the processor therefore has a theoretical peak performance of $32 \cdot 2.7 = 86.4~\text{GFLOP/s}$. At the base frequency of $1.7~\text{GHz}$, it is capable of $32 \cdot 1.7 = 54.4~\text{GFLOP/s}$.
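
The calculation can be condensed into a single expression. The symbols $P_\text{peak}$, $n_\text{cores}$, $n_\text{FMA}$, $w$ (DP elements per vector) and $f$ (clock frequency) are notation introduced here for illustration only, and the value $n_\text{FMA} = 2$ rests on \textcite{shimpi2012} rather than on a definitive vendor source:
\begin{align*}
P_\text{peak} &= n_\text{cores} \cdot n_\text{FMA} \cdot 2 \cdot w \cdot f \\
P_\text{peak}^\text{turbo} &= 2 \cdot 2 \cdot 2 \cdot 4 \cdot 2.7~\text{GHz} = 86.4~\text{GFLOP/s} \\
P_\text{peak}^\text{base} &= 2 \cdot 2 \cdot 2 \cdot 4 \cdot 1.7~\text{GHz} = 54.4~\text{GFLOP/s}
\end{align*}
The constant factor $2$ accounts for the fused multiplication and addition. These values are upper bounds that assume every FMA unit issues a full-width vector instruction in every cycle.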
\subsection{Memory Bandwidth}
\label{sec:memory}
Memory bandwidth was benchmarked with NUMA-STREAM~\cite{berstrom}. The binary ran on a Fedora 23 system with kernel \verb|4.5.7-200.fc23.x86_64 x86_64| in \verb|multi-user.target| to minimize interference from other processes. The benchmark was compiled with gcc and the following options: \texttt{-O3 -std=c99 -fopenmp -lnuma -DN=80000000 -DNTIMES=100}.
Again, the details of the processor architecture pose a challenge. The i5-4210U supports Hyper-Threading, meaning it provides four hardware threads on two physical cores. It is therefore not immediately obvious how many threads NUMA-STREAM should be configured with. For this test, both configurations (two and four threads)\footnote{plus two configurations with 8 and 1 threads respectively for cross-checking} were tested and the best one was chosen. The results for NUMA-STREAM configured with two threads are shown in~\prettyref{lst:numa-stream-results}. Prefixes are given in metric scale, i.e., $\text{M} = 10^6$, not $2^{20}$. The highest achieved rate was $10608~\text{MB/s}$ with the triad function. The triad function is the most demanding kernel of NUMA-STREAM, defined in \cite{bergstrom2} as \lstinline[language=C]|a[j] = b[j]+scalar*c[j]|. All other tested configurations produced worse results for all four kernels, although the difference was at most $300~\text{MB/s}$.
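
To put the triad rate into perspective, the following rough estimate relates it to the data volume of a single triad pass. It assumes the usual STREAM counting convention, in which each iteration touches three arrays of 8-byte doubles and write-allocate traffic is ignored. This convention was not verified against the NUMA-STREAM source, and the symbols $V$, $t$ and $I$ are introduced here for illustration only. With $N = 80 \cdot 10^6$ elements per array, as set by \texttt{-DN=80000000}:
\begin{align*}
V &= 3 \cdot 8~\text{B} \cdot N = 24~\text{B} \cdot 80 \cdot 10^6 = 1920~\text{MB} \\
t &\approx \frac{1920~\text{MB}}{10608~\text{MB/s}} \approx 0.181~\text{s} \\
I &= \frac{2~\text{FLOP}}{24~\text{B}} \approx 0.083~\text{FLOP/B}
\end{align*}
Under these assumptions, the fastest triad pass takes roughly $0.18~\text{s}$. The arithmetic intensity $I$ follows from the one multiplication and one addition performed per iteration.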