aufräumen (clean up)

Armin Friedl 2016-06-23 23:29:47 +02:00
parent 35f3059294
commit ba7a732d31
18 changed files with 397 additions and 238 deletions

View file

@ -25,8 +25,10 @@ while i<=64:
values = []
bandwidth = 10.6
peak = 86.4
basepeak = 54.4
ymem = []
ypeak = []
ybasepeak = []
for i in np.arange(0,64,0.1):
if bandwidth*i < peak:
@ -47,10 +49,12 @@ while i<=64:
ymem.append(None)
i*=2
#plot data
#data = pd.Series(data=values, name='Peak Memory Bandwidth', index=np.arange(0,64,0.1))
data = {'Peak Memory Bandwidth': pd.Series(ymem, index=xlbl), 'Peak Floating-Point Performance': pd.Series(ypeak, index=xlbl)}
data = {'Peak Memory Bandwidth': pd.Series(ymem, index=xlbl), 'Peak Floating-Point Performance (Turbo)': pd.Series(ypeak, index=xlbl)}
df = pd.DataFrame(data)
ax = df.plot()
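The hunk above is only a fragment (`xlbl` and the enclosing loop are cut off by the diff). A self-contained sketch of the same roofline-curve computation, using the constants visible in the diff and treating everything else as an assumption, might look like this:

```python
import numpy as np

# Constants as they appear in the diff above; values are assumptions here.
bandwidth = 10.6   # peak memory bandwidth, GB/s
peak = 86.4        # peak floating-point performance (turbo), GFLOP/s
basepeak = 54.4    # peak floating-point performance (base clock), GFLOP/s

# Operational intensity axis and the two rooflines
oi = np.arange(0.1, 64, 0.1)                      # FLOPs/byte
roofline = np.minimum(bandwidth * oi, peak)       # memory- vs compute-bound
base_roofline = np.minimum(bandwidth * oi, basepeak)

# Ridge point: smallest OI at which the turbo peak is attainable
ridge = peak / bandwidth
```

Plotted on log-log axes this yields the characteristic roof shape; the original script wraps the same arrays in a pandas DataFrame before plotting.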

View file

@ -4,9 +4,8 @@ However the effective operational intensity of a given kernel in a high-level la
All kernels were compiled with \verb|gcc 5.3.1| and different options. The compilation was checked with \verb|objdump -d -M intel-mnemonics|. For a more elaborate analysis of the disassembly on the tester's computer, please refer to the header file \verb|aikern.h| that should come with this report. Additionally \verb|Makefile| provides all information about the used and tested compiler options.
Good results\footnote{all, including the special FMA kernels, use only expected memory access, doing everything else in registers} were achieved with \verb|-O2 -mavx -mfma|. But \verb|-O2 -mavx -mfma| is a tradeoff between the best possible results and obviously correct compiled code. In fact the assembly almost looks handwritten. If even more optimization is wanted \verb|-O3| can be used. To fully utilize FMA with packed doubles \verb|-Ofast| or \verb|-Ofast -ffast-math| has to be used. Be aware that more optimization than \verb|-O2 -mavx -mfma| results in disassembly that is very hard to understand. \verb|-ffast-math| can even introduce rounding errors. It is not completely obvious that the highly optimized compiled code still has the wanted operational intensity. \verb|-O0| never works out.
Good results\footnote{all, including the special FMA kernels, use only expected memory access, doing everything else in registers} were achieved with \verb|-O2 -mavx -mfma|. But \verb|-O2 -mavx -mfma| is a tradeoff between the best possible results and obviously correct compiled code. In fact the disassembly almost looks handwritten. If even more optimization is wanted \verb|-O3| can be used. To fully utilize FMA with packed doubles \verb|-Ofast| or \verb|-Ofast -ffast-math| has to be used. Be aware that more optimization than \verb|-O2 -mavx -mfma| results in disassembly that is very hard to understand. \verb|-ffast-math| can even introduce rounding errors or reduce the executed FLOPs. It is not completely obvious that the highly optimized compiled code still has the wanted operational intensity. \verb|-O0| never works out.
\bigskip
\bigskip
\begin{footnotesize}
\noindent\emph{Remark:} Contrary to popular belief the roofline model is built atop the notion of the operational intensity\footnote{FLOPs against bytes written to DRAM} of kernels. The differences to arithmetic intensities are outlined in~\textcite{williams2009}. Depending on the definition used, these two terms are not necessarily interchangeable. The notion of operational intensity in the following sections might be what some would understand by the term arithmetic intensity.
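The model the remark alludes to can be stated compactly. With the bandwidth and turbo-peak values used elsewhere in this report ($\beta = 10.6$ GB/s, $\pi = 86.4$ GFLOP/s) the ridge point follows directly; this is the standard formulation, sketched here for reference:

```latex
% Attainable performance P as a function of operational intensity I:
\[
  P(I) = \min\bigl(\pi,\; \beta \cdot I\bigr),
  \qquad
  I_{\text{ridge}} = \frac{\pi}{\beta}
                   = \frac{86.4~\text{GFLOP/s}}{10.6~\text{GB/s}}
                   \approx 8.15~\text{FLOPs/byte}.
\]
```

Kernels with $I < I_{\text{ridge}}$, such as the $\rfrac{1}{16}$ kernels below, are memory bound; the $8$ OI kernels sit close to the ridge.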
@ -40,11 +39,13 @@ Two $\rfrac{1}{16}$ kernels have been implemented. The kernel in~\prettyref{lst:
The simple kernel in~\prettyref{lst:1-16-simple} reads 8 bytes (a[i]) once for both operands of $*$ and writes 8 bytes (again to a[i]). This results in 16 byte operations. Only one FP instruction is executed, namely $*$. At \verb|-O2| the loop variable is held in a register. This results in an $\rfrac{1}{16}$ OI kernel.
\bigskip
\begin{lstlisting}[caption={Simple $\rfrac{1}{16}$ OI kernel}, label=lst:1-16-simple]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++)
a[i] = a[i] * a[i];
\end{lstlisting}
\bigskip
The FMA aware kernel in~\prettyref{lst:1-16-fma} is a bit more involved. First a triad operation is used ($*$ and $+$ operations have to be balanced). This results in 2 FP instructions executed per round. $3*8=24$ bytes have to be read (a[i], b[i], c[i]) and 8 bytes have to be written (a[i]), in sum 32 byte operations. This results in an $\rfrac{2}{32} = \rfrac{1}{16}$ OI kernel. The loop variable is again held in a register.
@ -56,13 +57,15 @@ Be aware that the FMA kernel \emph{cannot} be used on a non-FMA processor. For t
\end{enumerate*}.
Otherwise the results are meaningless due to write-outs of intermediary values.
Also note that in order to use the full capabilities of Intel's FMA the doubles must be packed. This happens if \verb|-Ofast| is given to \verb|gcc| in addition. However this also triggers other optimizations such that the assembly gets long and complex. It is not immediately obvious that the generated assembly is correct. But no instructions could be found that do not solely use registers, except loading and storing data from and to the arrays -- just as wanted.
Also note that in order to use the full capabilities of Intel's FMA the doubles must be packed. This happens if \verb|-Ofast| is given to \verb|gcc| in addition. However this also triggers other optimizations such that the disassembly gets long and complex. It is not immediately obvious that the generated disassembly is correct. But no instructions could be found that do not solely use registers, except loading and storing data from and to the arrays -- just as wanted.
\bigskip
\begin{lstlisting}[caption={FMA aware $\rfrac{1}{16}$ OI kernel}, label=lst:1-16-fma]
(*\textcolor{Orchid}{\#pragma omp parallel for}*)
for(size_t i=0; i<size; i++)
a[i] = a[i] * b[i] + c[i];
\end{lstlisting}
\bigskip
\subsection{The 8 OI Kernel}
In this section the implemented 8 OI kernels are shown.~\prettyref{lst:8-1-simple} is a simple 8 OI kernel which should work on any processor. The kernel in~\prettyref{lst:8-1-fma} is tailored for processors with an FMA unit. For the kernels, macros were used to repeat the floating-point instructions. In some sense this behaves like a huge loop unrolling. Some of the repeating macros used are shown in~\prettyref{lst:8-1-macros}.
@ -70,13 +73,13 @@ In this section the implemented 8 OI kernels are shown.~\prettyref{lst:8-1-simpl
\bigskip
\begin{lstlisting}[caption={Macros for bulk repeating instructions}, label=lst:8-1-macros]
#define REP0(X)
#define REP1(X) X
#define REP2(X) REP1(X) REP1(X)
#define REP3(X) REP2(X) REP1(X)
//[...]
#define REP9(X) REP8(X) REP1(X)
#define REP10(X) REP9(X) REP1(X)
#define REP20(X) REP10(X) REP10(X)
//[...]
#define REP100(X) REP50(X) REP50(X)
\end{lstlisting}

View file

@ -42,7 +42,7 @@ Unfortunately no definite source could be found but according to \textcite{shimp
To benchmark the memory bandwidth NUMA-STREAM~\cite{berstrom} was used. The binary ran on a Fedora 23 system with kernel \verb|4.5.7-200.fc23.x86_64 x86_64| in \verb|multi-user.target| to turn off as many distractors as possible. Compilation was done with gcc and the following options: \texttt{-O3 -std=c99 -fopenmp -lnuma -DN=80000000 -DNTIMES=100}.
Again the details of the processor architecture offer a bit of a challenge. The i5-4210U is hyper-threaded, meaning it provides 4 hardware threads on 2 physical cores. It is not immediately obvious how many threads NUMA-STREAM should be configured with. For this test both configurations\footnote{plus two configurations with 8 and 1 threads respectively for cross-checking} were tested and the best one was chosen. The results for NUMA-STREAM configured with two threads are listed in~\prettyref{lst:numa-stream-results}. Prefixes are given in metric scale, i.e. $M = 10^6$, not $2^{20}$. The highest achieved rate was $10608~\text{MB/s}$ with the triad function. The triad function is the most demanding kernel of NUMA-STREAM, defined in \cite{bergstrom2} as \lstinline[language=C]|a[j] = b[j]+scalar*c[j]|. All other tested configurations had worse results for all 4 kernels, although by at most $300~\text{MB/s}$ difference.
Again the details of the processor architecture offer a bit of a challenge. The i5-4210U is hyper-threaded, meaning it provides 4 hardware threads on 2 physical cores. It is not immediately obvious how many threads NUMA-STREAM should be configured with. For this test both configurations\footnote{plus two configurations with 8 and 1 threads respectively for cross-checking} were tested and the best one was chosen. The results for NUMA-STREAM configured with two threads are in~\prettyref{lst:numa-stream-results}. Prefixes are given in metric scale, i.e. $M = 10^6$, not $2^{20}$. The highest achieved rate was $10608~\text{MB/s}$ with the triad function. The triad function is the most demanding kernel of NUMA-STREAM, defined in \cite{bergstrom2} as \lstinline[language=C]|a[j] = b[j]+scalar*c[j]|. All other tested configurations had worse results for all 4 kernels, although by at most $300~\text{MB/s}$ difference.
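As a quick sanity check on the quoted rates: as STREAM counts it, the triad reads `b[j]` and `c[j]` and writes `a[j]`, i.e. $3 \times 8 = 24$ bytes per element for doubles. With the \texttt{-DN=80000000} used above, each triad pass therefore touches 1.92 GB; a minimal sketch of that arithmetic:

```python
# Bytes moved per STREAM triad pass (two reads + one write of 8-byte doubles).
# N is taken from the -DN=80000000 compile flag quoted in the report.
N = 80_000_000
bytes_per_pass = 3 * 8 * N          # 1.92e9 bytes touched per pass

# At the measured 10608 MB/s (metric prefixes), one pass takes roughly:
seconds_per_pass = bytes_per_pass / 10608e6
```

This puts a single triad pass at roughly 0.18 s, which is consistent with 100 timed repetitions (\texttt{-DNTIMES=100}) finishing in well under a minute.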
\bigskip
\begin{lstlisting}[caption={NUMA-STREAM results for two threads}, label=lst:numa-stream-results, numbers=none]

View file

@ -62,8 +62,8 @@
\newlabel{lst:1-16-fma}{{4}{6}{FMA aware $\rfrac {1}{16}$ OI kernel}{lstlisting.4}{}}
\@writefile{lol}{\defcounter {refsection}{0}\relax }\@writefile{lol}{\contentsline {lstlisting}{\numberline {4}FMA aware ${}^{1}\tmspace -\thinmuskip {.1667em}/_{16}$ OI kernel}{6}{lstlisting.4}}
\@writefile{toc}{\defcounter {refsection}{0}\relax }\@writefile{toc}{\contentsline {subsection}{\numberline {3.3}The 8 OI Kernel}{6}{subsection.3.3}}
\newlabel{lst:8-1-macros}{{5}{6}{Macros for bulk repeating instructions}{lstlisting.5}{}}
\@writefile{lol}{\defcounter {refsection}{0}\relax }\@writefile{lol}{\contentsline {lstlisting}{\numberline {5}Macros for bulk repeating instructions}{6}{lstlisting.5}}
\newlabel{lst:8-1-macros}{{5}{7}{Macros for bulk repeating instructions}{lstlisting.5}{}}
\@writefile{lol}{\defcounter {refsection}{0}\relax }\@writefile{lol}{\contentsline {lstlisting}{\numberline {5}Macros for bulk repeating instructions}{7}{lstlisting.5}}
\newlabel{lst:8-1-simple}{{6}{7}{Simple $8$ OI kernel}{lstlisting.6}{}}
\@writefile{lol}{\defcounter {refsection}{0}\relax }\@writefile{lol}{\contentsline {lstlisting}{\numberline {6}Simple $8$ OI kernel}{7}{lstlisting.6}}
\newlabel{lst:8-1-fma}{{7}{7}{FMA aware $8$ OI kernel}{lstlisting.7}{}}

View file

@ -1,11 +1,11 @@
# Fdb version 3
["biber report"] 1466704438 "report.bcf" "report.bbl" "report" 1466709195
"report.bcf" 1466709195 92382 2683b542d57d2326e3b37a6a44222b52 ""
["biber report"] 1466704438 "report.bcf" "report.bbl" "report" 1466711144
"report.bcf" 1466711144 92382 2683b542d57d2326e3b37a6a44222b52 ""
"roofline.bib" 1466704433 4157 226e47c750579a202f66b6f0e4df67bb ""
(generated)
"report.bbl"
"report.blg"
["pdflatex"] 1466709193 "report.tex" "report.pdf" "report" 1466709195
["pdflatex"] 1466711143 "report.tex" "report.pdf" "report" 1466711144
"/usr/share/texlive/texmf-dist/fonts/enc/dvips/cm-super/cm-super-t1.enc" 1136849721 2971 def0b6c1f0b107b3b936def894055589 ""
"/usr/share/texlive/texmf-dist/fonts/enc/dvips/cm-super/cm-super-ts1.enc" 1136849721 2900 1537cc8184ad1792082cd229ecc269f4 ""
"/usr/share/texlive/texmf-dist/fonts/map/fontname/texfonts.map" 1272929888 3287 e6b82fe08f5336d4d5ebc73fb1152e87 ""
@ -196,22 +196,22 @@
"/usr/share/texlive/texmf-dist/web2c/texmf.cnf" 1455657841 31706 2be2b4306fae7fc20493e3b90c2ad04d ""
"/usr/share/texlive/texmf-var/web2c/pdftex/pdflatex.fmt" 1457104667 3492982 6abaa3262ef9227a797168d32888676c ""
"inputs/introduction.tex" 1466184626 76 eaf0f76fa74815989416f6f6d1c36f8b ""
"inputs/kernels.tex" 1466709193 10203 9325fb415b03bebae73e25df31a02d19 ""
"inputs/roofline.tex" 1466704311 5532 18ef3b0c3e19883f9e4d68e1bd73b31b ""
"report.aux" 1466709195 6200 ef3b9dffee45c82bc6071014e35f58b8 ""
"inputs/kernels.tex" 1466711142 10273 94bc8e1ce2e538a2a1c74426512dcc37 ""
"inputs/roofline.tex" 1466710567 5525 b96d99208485f5095cd10d50a150dff7 ""
"report.aux" 1466711144 6200 b98bcd77a3a008a4d0f92ab6f355b22c ""
"report.bbl" 1466704439 7655 4b5f697a70789470cde9f922b6440ee7 "biber report"
"report.out" 1466709195 566 365a3bdfdb786abd7e70ca003f732afb ""
"report.run.xml" 1466709195 2317 80d7743117fafc51b1e42b536d793f68 ""
"report.tex" 1466708164 4496 4af727a449506efbfccac9df327ec9fe ""
"report.toc" 1466709195 1210 9050233c7a77a885db53f60f534c1c7a ""
"report.out" 1466711144 566 365a3bdfdb786abd7e70ca003f732afb ""
"report.run.xml" 1466711144 2317 80d7743117fafc51b1e42b536d793f68 ""
"report.tex" 1466709836 4497 1f64f8ce17913e2b9dd71c7d6e896da8 ""
"report.toc" 1466711144 1210 9050233c7a77a885db53f60f534c1c7a ""
"res/rooftop-eps-converted-to.pdf" 1466670002 22114 f6f2c1d53d8b6a5f4042e202648c7b36 ""
"res/rooftop.eps" 1466669975 36013 2a6358f72820d80a6e87ee15e92d5669 ""
(generated)
"report.out"
"report.toc"
"report.bcf"
"report.run.xml"
"report-blx.bib"
"report.bcf"
"report.log"
"report.aux"
"report-blx.bib"
"report.pdf"
"report.aux"
"report.out"

View file

@ -1,4 +1,4 @@
This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014) (preloaded format=pdflatex 2016.3.4) 23 JUN 2016 21:13
This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014) (preloaded format=pdflatex 2016.3.4) 23 JUN 2016 21:45
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@ -1369,23 +1369,23 @@ Package pdftex.def Info: res/rooftop-eps-converted-to.pdf used on input line 70
) [3] (./inputs/kernels.tex [4 <./res/rooftop-eps-converted-to.pdf>]
Package hyperref Warning: Token not allowed in a PDF string (PDFDocEncoding):
(hyperref) removing `math shift' on input line 15.
(hyperref) removing `math shift' on input line 14.
Package hyperref Warning: Token not allowed in a PDF string (PDFDocEncoding):
(hyperref) removing `\not' on input line 15.
(hyperref) removing `\not' on input line 14.
Package hyperref Warning: Token not allowed in a PDF string (PDFDocEncoding):
(hyperref) removing `math shift' on input line 15.
(hyperref) removing `math shift' on input line 14.
[5] [6])
[5] [6]) [7]
Overfull \hbox (19.7725pt too wide) in paragraph at lines 116--116
\T1/cmtt/m/n/10.95 blob / e5aa9ca4a77623ff6f1c2d5daa7995565b944506 / stream . c
# L286$[][] \T1/cmr/m/n/10.95 (-20) (vis-ited on 06/20/2016).
[]
[7]
AED: lastpage setting LastPage
[8]
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 117.
@ -1401,13 +1401,13 @@ Package logreq Info: Writing requests to 'report.run.xml'.
Package atveryend Info: Empty hook `AtVeryVeryEnd' on input line 117.
)
Here is how much of TeX's memory you used:
21436 strings out of 493339
338721 string characters out of 6141383
878402 words of memory out of 5000000
21442 strings out of 493339
338775 string characters out of 6141383
879402 words of memory out of 5000000
24309 multiletter control sequences out of 15000+600000
29876 words of font info for 133 fonts, out of 8000000 for 9000
30053 words of font info for 136 fonts, out of 8000000 for 9000
953 hyphenation exceptions out of 8191
48i,8n,76p,1008b,1880s stack positions out of 5000i,500n,10000p,200000b,80000s
48i,8n,76p,1001b,1880s stack positions out of 5000i,500n,10000p,200000b,80000s
{/usr/share/texlive/texmf-dist/fonts/enc/dvips/cm-super/cm-super-ts1.enc}{/us
r/share/texlive/texmf-dist/fonts/enc/dvips/cm-super/cm-super-t1.enc}</usr/share
/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/texli
@ -1427,7 +1427,7 @@ t/fonts/type1/public/cm-super/sfrm1440.pfb></usr/share/texlive/texmf-dist/fonts
/type1/public/cm-super/sfti0900.pfb></usr/share/texlive/texmf-dist/fonts/type1/
public/cm-super/sfti1095.pfb></usr/share/texlive/texmf-dist/fonts/type1/public/
cm-super/sftt1095.pfb>
Output written on report.pdf (8 pages, 328260 bytes).
Output written on report.pdf (8 pages, 328183 bytes).
PDF statistics:
353 PDF objects out of 1000 (max. 8388607)
278 compressed objects within 3 object streams

Binary file not shown.

View file

@ -89,7 +89,7 @@
\maketitle
\begin{abstract}
A \emph{roofline model} for a multicore processor is obtained by calculating the theoretical peak performance of the processor and benchmarking the peak memory bandwidth. Two artificial computational kernels with arithmetic intensities of $\frac{1}{16}$ FLOPs/byte and $8$ FLOPs/byte are devised. The performance of the two kernels is then compared to the theoretical calculations in the roofline model.
A \emph{roofline model} for a multicore processor is obtained by calculating the theoretical peak performance of the processor and benchmarking the peak memory bandwidth. Two artificial computational kernels with operational intensities of $\frac{1}{16}$ FLOPs/byte and $8$ FLOPs/byte are devised. The performance of the two kernels is then compared to the theoretical calculations in the roofline model.
\end{abstract}
\tableofcontents

View file

@ -1,58 +1,66 @@
all: roofline roofline_avx roofline_o3avx roofline_o3 roofline_avxfma roofline_avxfmafast
all: bin lib
# Roofline Binary
bin: roofline roofline_o3 roofline_fma roofline_fma_o3 roofline_fma_fast_o3 roofline_fma_fast_fastmath_o3
mkdir bin
mv $^ bin
roofline: roofline.c aikern.a
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_avx: roofline.c aikern_avx.a
gcc -Wall -Wextra -O3 -std=c99 -fopenmp $^ -o $@
roofline_o3avx: roofline.c aikern_o3avx.a
gcc -Wall -Wextra -O3 -std=c99 -fopenmp $^ -o $@
roofline_o3: roofline.c aikern_o3.a
gcc -Wall -Wextra -O3 -std=c99 -fopenmp $^ -o $@
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_avxfma: roofline.c aikern_avxfma.a
gcc -Wall -Wextra -O3 -std=c99 -fopenmp $^ -o $@
roofline_fma: roofline.c aikern_fma.a
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_fma_o3: roofline.c aikern_fma_o3.a
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_fma_fast_o3: roofline.c aikern_fma_fast_o3.a
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_fma_fast_fastmath_o3: roofline.c aikern_fma_fast_fastmath_o3.a
gcc -Wall -Wextra -std=c99 -fopenmp $^ -o $@
roofline_avxfmafast: roofline.c aikern_avxfmafast.a
gcc -Wall -Wextra -O3 -std=c99 -fopenmp $^ -o $@
# Static Libraries
aikern.a: aikern.c aikern.h
gcc -c -o aikern.o aikern.c
ar rcs aikern.a aikern.o
lib: aikern.a aikern_o3.a aikern_fma.a aikern_fma_o3.a aikern_fma_fast_o3.a aikern_fma_fast_fastmath_o3.a
mkdir lib
mv $^ lib
aikern_avx.a: aikern.c aikern.h
gcc -mavx -c -o aikern_avx.o aikern.c
ar rcs aikern_avx.a aikern_avx.o
aikern.a: aikern.c aikern.h
gcc -Wall -Wextra -Wno-unused -fopenmp -c -o aikern.o $<
ar rcs aikern.a aikern.o
rm aikern.o
aikern_o3.a: aikern.c aikern.h
gcc -O3 -c -o aikern_o3.o aikern.c
ar rcs aikern_o3.a aikern_o3.o
gcc -Wall -Wextra -Wno-unused -O3 -fopenmp -c -o aikern_o3.o $<
ar rcs $@ aikern_o3.o
rm aikern_o3.o
aikern_o3avx.a: aikern.c aikern.h
gcc -O3 -mavx -c -o aikern_o3avx.o aikern.c
ar rcs aikern_o3avx.a aikern_o3avx.o
aikern_fma.a: aikern.c aikern.h
gcc -Wall -Wextra -Wno-unused -O2 -mavx -mfma -fopenmp -c -o aikern_fma.o $<
ar rcs $@ aikern_fma.o
rm aikern_fma.o
# This is the only version that actually uses FMA
aikern_avxfma.a: aikern.c aikern.h
gcc -O2 -mavx -mfma -c -o aikern_avxfma.o aikern.c
ar rcs aikern_avxfma.a aikern_avxfma.o
aikern_fma_o3.a: aikern.c aikern.h
gcc -Wall -Wextra -Wno-unused -O3 -mavx -mfma -fopenmp -c -o aikern_fma_o3.o $<
ar rcs $@ aikern_fma_o3.o
rm aikern_fma_o3.o
aikern_avxfmafast.a: aikern.c aikern.h
gcc -O2 -mavx -mfma -Ofast -c -o aikern_avxfmafastmath.o aikern.c
ar rcs aikern_avxfmafast.a aikern_avxfmafastmath.o
aikern_avxfmafastmath.a: aikern.c aikern.h
gcc -O2 -mavx -mfma -Ofast -ffast-math -c -o aikern_avxfmafastmath.o aikern.c
ar rcs aikern_avxfmafast.a aikern_avxfmafastmath.o
aikern_fma_fast_o3.a: aikern.c aikern.h
gcc -Wall -Wextra -Wno-unused -O3 -mavx -mfma -fopenmp -Ofast -c -o aikern_fma_fast_o3.o $<
ar rcs $@ aikern_fma_fast_o3.o
rm aikern_fma_fast_o3.o
aikern_fma_fast_fastmath_o3.a: aikern.c aikern.h
gcc -Wall -Wextra -Wno-unused -O3 -mavx -mfma -fopenmp -Ofast -ffast-math -c -o aikern_fma_fast_fastmath_o3.o $<
ar rcs $@ aikern_fma_fast_fastmath_o3.o
rm aikern_fma_fast_fastmath_o3.o
# Cleanup
clean:
rm -f roofline roofline_avx roofline_o3avx roofline_o3 roofline_avxfma roofline_avxfmafast
rm -f *.o
rm -f *.a
rm -f *.so
rm -fR bin
rm -fR lib

View file

@ -1,88 +1,46 @@
# include <stdlib.h>
# include <stdio.h>
# include <unistd.h>
# include <stdarg.h>
# include <errno.h>
# include <string.h>
# include <sys/time.h>
# include "aikern.h"
/**
* @brief terminate program on program error
* @param msg additional message to print
* @param ret exit value
*/
static void bail_out(char* fmt, ...);
/**
* @brief microseconds since epoch
*/
static double pin_time(void);
void kernel_1_16_simple(double* a, double* b, double* c, size_t size)
{
double t = pin_time();
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 1 reads, 1 write = 16 bytes
COMP: 1 FLOP
-> AI = 1/16
*/
a[i] = a[i] * a[i];
}
}
void kernel_1_16_fuseaware(double* a, double* b, double* c, size_t size)
{
/* === Warning ===
This is dangerous if FMA is not used/can't be used. Then there
are intermediary writes (and reads) to the stack. With FMA:
vmovsd xmm0,QWORD PTR [rdi+rax*8] # 1 read
vmovsd xmm1,QWORD PTR [rdx+rax*8] # 1 read
vfmadd132sd xmm0,xmm1,QWORD PTR [rsi+rax*8] # 2 FLOPs + 1 read
vmovsd QWORD PTR [rdi+rax*8],xmm0 # 1 write
Uses packed doubles with -Ofast.
*/
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 3 reads, 1 write = 32 bytes
COMP: 2 FLOP
-> AI = 2/32 = 1/16
*/
a[i] = a[i] * b[i] + c[i];
}
}
#define REP0(X)
#define REP1(X) X
#define REP2(X) REP1(X) REP1(X)
#define REP3(X) REP2(X) REP1(X)
#define REP4(X) REP3(X) REP1(X)
#define REP5(X) REP4(X) REP1(X)
#define REP6(X) REP5(X) REP1(X)
#define REP7(X) REP6(X) REP1(X)
#define REP8(X) REP7(X) REP1(X)
#define REP9(X) REP8(X) REP1(X)
#define REP10(X) REP9(X) REP1(X)
#define REP20(X) REP10(X) REP10(X)
#define REP30(X) REP20(X) REP10(X)
#define REP40(X) REP30(X) REP10(X)
#define REP50(X) REP40(X) REP10(X)
#define REP60(X) REP50(X) REP10(X)
#define REP100(X) REP50(X) REP50(X)
void kernel_8_1_simple(double* a, double* b, double* c, size_t size)
{
/* === Warning ===
Seems correct with -O3. Though -O3 does some loop unrolling.
With -O0 this is dangerous, intermediary values stored on stack
who knows if they survive in cache -> unpredictable.
With AVX and -O2 (not necessarily FMA) best results
(obviously correct, only register shuffling). With FMA:
vmovsd xmm1,QWORD PTR [rdi] # 1 read
vmulsd xmm0,xmm1,xmm1 # 1 FLOP+register shuffling
vmulsd xmm0,xmm0,xmm1 # 127x 1 FLOP+register shuffling
# [...]
vmovsd QWORD PTR [rdi-0x8],xmm0 # 1 write
*/
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 1 read+1 write = 16 byte
COMP: 128 FLOPs
-> AI = 128/16 = 8
*/
a[i] = REP100(a[i]*)
REP20(a[i]*)
REP8(a[i]*)
@ -92,88 +50,30 @@ void kernel_8_1_simple(double* a, double* b, double* c, size_t size)
void kernel_8_1_fuseaware(double* a, double* b, double* c, size_t size)
{
/*
With FMA (and -O2):
vmovsd xmm0,QWORD PTR [rdi] # 1 read
vfmadd132sd xmm0,xmm0,xmm0 # 64 x 2 FLOPs+register shuffling
vmovsd QWORD PTR [rdi-0x8],xmm0 # 1 write
Uses packed doubles with -Ofast.
*/
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 1 read + 1 write
COMP: 128 FLOP
-> AI = 8
*/
REP60(a[i] = a[i] * a[i] + a[i];)
REP4(a[i] = a[i] * a[i] + a[i];)
}
}
/* === FAILED KERNELS === */
/*
These are theoretically correct kernels but all of them yield
dangerous results with gcc 5.3.1 (checked the assembly).
*/
void kernel_1_16_simple_dangerous(double* a, double* b, double* c, size_t size)
{
/* === Problem ===
As soon as volatile is used gcc uses the stack for tmp.
Even if "register" is in place. Resulting in one additional write per loop.
Omitting volatile results in optimizing away the whole loop
(checked at -O2, which is necessary for FMA to eventually step in).
Maybe the value stays in cache, maybe not. It does not live in a register.
Even with -O3:
movsd xmm0,QWORD PTR [rdi+rax*8] # 1 read
mulsd xmm0,QWORD PTR [rsi+rax*8] # 1 read (+ write to xmm0, not counted)
# [...] # instructions for loop
movsd QWORD PTR [rsp-0x8],xmm0 # malicious write
Without volatile (-O3):
repz ret # that's it
*/
// volatile to prevent compiler from optimizing this away
// register to advise compiler to put this in register
double tmp = 0.1;
register volatile double tmp = 0.1;
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 2 reads = 16 bytes
COMP: 1 FLOP
-> AI = 1/16
*/
tmp = a[i] * b[i];
}
}
void kernel_8_1_simple_dangerous(double* a, double* b, double* c, size_t size)
{
/* === Problem ===
Same as for kernel_1_16_simple_dangerous
*/
// volatile to prevent compiler from optimizing this away
// register to advise compiler to put this in register
volatile register double tmp = 0.1;
register volatile double tmp = 0.1;
#pragma omp parallel for
for(size_t i=0; i<size; i++){
/*
COMM: 1 read
COMP: 8 FLOP
-> AI = 8
*/
tmp = a[i] * a[i] * a[i] * a[i] *
a[i] * a[i] * a[i] * a[i];
}
@ -183,18 +83,55 @@ void kernel_1_8_vo_dangerous(double* a, double* b, double* c, size_t size)
{
/* This is the 1/8 AI kernel from the lecture
=== Problem ===
Same as for kernel_1_16_simple_dangerous
Without volatile the loop is optimized away completely.
With volatile tmp is written to the stack in every loop
(-O3). tmp could be cached or not. This might depend on
how large the array is and how the CPU works internally
-> unpredictable.
*/
volatile double tmp=0.0;
register volatile double tmp=0.0;
for(size_t i=0; i<size; i++) {
tmp = a[i] * a[i];
}
}
/* === Helper Functions === */
static double pin_time(void)
{
struct timeval tp;
int i;
i = gettimeofday(&tp,NULL);
if(i != 0)
bail_out("Time measurement impossible. gettimeofday error");
return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
static void bail_out(char* fmt, ...)
{
char* prog_name = "aikern";
if(fmt != NULL)
{
char msgbuf[150];
va_list vl;
va_start(vl, fmt);
if(vsnprintf(msgbuf, sizeof(msgbuf), fmt, vl) < 0)
msgbuf[0] = '\0';
va_end( vl);
if(strlen(msgbuf) > 0)
(void)fprintf(stderr, "%s: %s \n", prog_name, msgbuf);
}
if(errno != 0)
(void)fprintf(stderr, "%s: %s\n", prog_name, strerror(errno));
exit(EXIT_FAILURE);
}

View file

@ -1,8 +1,224 @@
#ifndef AIKERN_H
#define AIKERN_H
/**
* @brief A simple 1/16 operational intensity kernel
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Warning ===
* Don't use with -O0: Stores everything on stack
*
* === Description ===
* Uses a simple floating point operation: a[i] = a[i] * a[i];
*
* Runs in a parallelized for loop.
*
* === Analysis ===
* COMM: 1 read (8 byte), 1 write = 16 bytes
* COMP: 1 FLOP
* ---------
* OI: 1/16
*
* === Optimization ===
* Nothing special
*
*/
void kernel_1_16_simple(double* a, double* b, double* c, size_t size);
/**
* @brief A 1/16 operational intensity kernel utilizing FMA
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Warning ===
* This is dangerous if FMA is not used/can't be used. Then there
* are intermediary writes (and reads) to the stack.
*
* === Description ===
* Uses a triad function: a[i] = a[i] * b[i] + c[i]; in order
* to utilize the FMA unit.
*
* Runs in a parallelized for loop.
*
* === Analysis ===
* With gcc -O2 -mavx -mfma FMA compiles to:
* vmovsd xmm0,QWORD PTR [rdi+rax*8] # 1 read (8 byte)
* vmovsd xmm1,QWORD PTR [rdx+rax*8] # 1 read
* vfmadd132sd xmm0,xmm1,QWORD PTR [rsi+rax*8] # 2 FLOPs + 1 read
* vmovsd QWORD PTR [rdi+rax*8],xmm0 # 1 write
* --------
* 1/16 OI
*
* === Optimization ===
* For packed doubles compile with -Ofast
*
*/
void kernel_1_16_fuseaware(double* a, double* b, double* c, size_t size);
/**
* @brief A simple 8/1 operational intensity kernel
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Warning ===
* Don't use with -O0: Stores everything on stack
*
* === Description ===
* Uses a simple floating point operation: a[i] = a[i] * a[i] * ...* a[i];
*
* Runs in a parallelized for loop.
*
* === Analysis ===
* With AVX and -O2 (not necessarily FMA) best results (obviously correct
* easy to read disassembly).
*
* With gcc -O2 -mavx compiles to:
* vmovsd xmm1,QWORD PTR [rdi] # 1 read
* vmulsd xmm0,xmm1,xmm1 # 1 FLOP+register shuffling
* vmulsd xmm0,xmm0,xmm1 # 127x 1 FLOP+register shuffling
* # [...]
* vmovsd QWORD PTR [rdi-0x8],xmm0 # 1 write
* --------
* 128/16 = 8/1 OI
*
* === Optimization ===
* Nothing special
*/
void kernel_8_1_simple(double* a, double* b, double* c, size_t size);
/**
* @brief An 8/1 operational intensity kernel utilizing FMA
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Warning ===
* This is dangerous if FMA is not used/can't be used. Then there
* are intermediary writes (and reads) to the stack.
*
* === Description ===
* Uses multiple triad operations: a[i] = a[i] * a[i] + a[i]; in order
* to utilize the FMA unit.
*
* Runs in a parallelized for loop.
*
* === Analysis ===
* With gcc -O2 -mavx -mfma FMA compiles to:
* vmovsd xmm0,QWORD PTR [rdi] # 1 read
* vfmadd132sd xmm0,xmm0,xmm0 # 64 x 2 FLOPs+register shuffling
* vmovsd QWORD PTR [rdi-0x8],xmm0 # 1 write
* --------
* 128/16 = 8/1 OI
*
* === Optimization ===
* For packed doubles compile with -Ofast
*
*/
void kernel_8_1_fuseaware(double* a, double* b, double* c, size_t size);
/********************************************
* Kernels which potentially compile to *
* different operational intensities than *
* specified *
********************************************/
/**
* @brief A 1/16 operational intensity kernel which might compile to a flawed OI kernel
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Problem ===
* As soon as volatile is used gcc uses the stack for tmp.
* Even if "register" is in place. Resulting in one additional write per loop.
* Omitting volatile results in optimizing away the whole loop
* (checked at -O2, which is necessary for FMA to eventually step in).
* Maybe the value stays in cache, maybe not. It does not live in a register.
*
* Even with -O3:
* movsd xmm0,QWORD PTR [rdi+rax*8] # 1 read
* mulsd xmm0,QWORD PTR [rsi+rax*8] # 1 read (+ write to xmm0, not counted)
* # [...] # instructions for loop
* movsd QWORD PTR [rsp-0x8],xmm0 # malicious write
*
* Without volatile (-O3):
* repz ret # that's it
*/
void kernel_1_16_simple_dangerous(double* a, double* b, double* c, size_t size);
/**
* @brief An 8/1 operational intensity kernel which might compile to a flawed OI kernel
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
* === Problem ===
* Same as for kernel_1_16_simple_dangerous
*/
void kernel_8_1_simple_dangerous(double* a, double* b, double* c, size_t size);
/**
* @brief A 1/8 operational intensity kernel which might compile to a flawed OI kernel
* @param a An array with double values of size param size
* @param b An array with double values of size param size
* @param c An array with double values of size param size
* @param size Size of the three param arrays
*
 * === Problem ===
 * Same as for kernel_1_16_simple_dangerous.
 *
 * Without volatile the loop is optimized away completely.
 * With volatile, tmp is written to the stack in every loop
 * iteration (-O3). tmp may or may not be cached; this could
 * depend on how large the array is and how the CPU works
 * internally -> unpredictable.
*/
void kernel_1_8_vo_dangerous(double* a, double* b, double* c, size_t size);
/****************************************
* Helper macros for repeating things *
****************************************/
#define REP0(X)
#define REP1(X) X
#define REP2(X) REP1(X) REP1(X)
#define REP3(X) REP2(X) REP1(X)
#define REP4(X) REP3(X) REP1(X)
#define REP5(X) REP4(X) REP1(X)
#define REP6(X) REP5(X) REP1(X)
#define REP7(X) REP6(X) REP1(X)
#define REP8(X) REP7(X) REP1(X)
#define REP9(X) REP8(X) REP1(X)
#define REP10(X) REP9(X) REP1(X)
#define REP20(X) REP10(X) REP10(X)
#define REP30(X) REP20(X) REP10(X)
#define REP40(X) REP30(X) REP10(X)
#define REP50(X) REP40(X) REP10(X)
#define REP60(X) REP50(X) REP10(X)
#define REP100(X) REP50(X) REP50(X)
#ifdef ENDEBUG
#define DEBUG(...) do { fprintf(stderr, __VA_ARGS__); fprintf(stderr, "\n"); } while(0)
#else
#define DEBUG(...)
#endif
#endif /* AIKERN_H */

Binary file not shown.

View file

@@ -57,7 +57,7 @@ static int get_int(char* oparg);
/**
* @brief microseconds since epoch
*/
static double mysecond(void);
static double pin_time(void);
/**
* @brief a simple test kernel with ai of 1/16
@@ -103,10 +103,8 @@ int main(int argc, char* argv[]) {
size_t size = get_size(size_arg);
int runs = get_int(runs_arg);
printf("Will run with array sizes of %zu\n", size);
printf("Will run with array sizes of %zu elements\n", size);
printf("Will calculate min, max, avg for %d runs\n", runs);
/* Make this volatile so that nothing is optimized away here */
double* a = malloc(sizeof(double)*(size));
double* b = malloc(sizeof(double)*(size));
double* c = malloc(sizeof(double)*(size));
@@ -114,60 +112,53 @@ int main(int argc, char* argv[]) {
if(a==NULL || b==NULL || c == NULL)
bail_out("One of the mallocs failed.\na = %p, b = %p, c = %p", a, b, c);
printf("Allocated 3 arrays\n");
printf("Allocated 3 arrays (3*%.2f MB = %.2f GB)\n", (sizeof(double)*(size)/1024.0/1024.0), (sizeof(double)*(size)*3/1024.0/1024.0/1024));
printf("Filling arrays with dummy values. This will also warm the cache\n");
printf("Filling arrays with dummy values\n");
/* #pragma omp parallel for
#pragma omp parallel for
for (size_t j=0; j<size; j++)
{
a[j] = 1.0;
b[j] = 2.0;
c[j] = 3.0;
}*/
}
double t;
printf("Warming up cache\n");
t = mysecond();
printf("Heating up machine\n");
t = pin_time();
testkern(a,b,c, size);
t = mysecond() - t;
printf("Cache warming took %.4f microseconds = %.4f seconds (with test AI of 1/16 FLOPs/Byte)\n", (t*1.0E6), t);
t = pin_time() - t;
printf("Machine heating took %.4f microseconds = %.4f seconds (with test OI kernel)\n", (t*1.0E6), t);
/*
TESTS!!
*/
printf("1/16 simple\n");
t = mysecond();
t = pin_time();
kernel_1_16_simple(a,b,c, size);
t = mysecond() - t;
t = pin_time() - t;
printf("kernel_1_16_simple took %.4f microseconds = %.4f seconds (OI 1/16 FLOPs/Byte)\n", (t*1.0E6), t);
printf("1/16 fuseaware\n");
t = mysecond();
t = pin_time();
kernel_1_16_fuseaware(a,b,c, size);
t = mysecond() - t;
t = pin_time() - t;
printf("kernel_1_16_fuseaware took %.4f microseconds = %.4f seconds (OI 1/16 FLOPs/Byte)\n", (t*1.0E6), t);
printf("8 simple\n");
t = mysecond();
t = pin_time();
kernel_8_1_simple(a,b,c, size);
t = mysecond() - t;
t = pin_time() - t;
printf("kernel_8_1_simple took %.4f microseconds = %.4f seconds (OI 8/1 FLOPs/Byte)\n", (t*1.0E6), t);
printf("8 fuseaware\n");
t = mysecond();
t = pin_time();
kernel_8_1_fuseaware(a,b,c, size);
t = mysecond() - t;
t = pin_time() - t;
printf("kernel_8_1_fuseaware took %.4f microseconds = %.4f seconds (OI 8/1 FLOPs/Byte)\n", (t*1.0E6), t);
printf("1/8 vo\n");
t = mysecond();
kernel_1_8_vo_dangerous(a,b,c, size);
t = mysecond() - t;
printf("kernel_1_8_vo_dangerous took %.4f microseconds = %.4f seconds (OI 1/8 FLOPs/Byte)\n", (t*1.0E6), t);
exit(EXIT_SUCCESS);
}
@@ -185,7 +176,7 @@ static void testkern(double* a, double* b, double* c, size_t size)
/* === Helper Functions === */
static double mysecond(void)
static double pin_time(void)
{
struct timeval tp;
int i;

Binary file not shown.
