hpc/reduce/report/report.tex

\documentclass[a4paper, DIV=12]{scrartcl}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage[dvipsnames]{xcolor}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{stmaryrd}
\usepackage{graphicx}
\usepackage{pdflscape}
\usepackage{listingsutf8}
\usepackage{spverbatim}
\usepackage{placeins}
\usepackage{lmodern}
%\usepackage{helvet}
\usepackage{booktabs}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{framed}
\usepackage[colorlinks=true,
            linkcolor=blue,
            urlcolor=blue,
            breaklinks=true,
            citecolor=blue]{hyperref}
\usepackage{prettyref}
\usepackage{lastpage}
\usepackage{subcaption}
\usepackage{tabularx}
\usepackage{adjustbox}
\usepackage{pdfpages}
\usepackage{xspace}
\usepackage[inline]{enumitem}
\usepackage[abbreviate=false,maxbibnames=99,backend=biber]{biblatex}
\usepackage{textcomp}
\usepackage{tikz}
\usepackage[ruled,linesnumbered]{algorithm2e}

\setkomafont{disposition}{\normalfont\bfseries}

\setlist[itemize]{itemsep=0.1em}
\setlist[enumerate]{itemsep=0.1em}


\newrefformat{tbl}{\hyperref[#1]{Table~\ref*{#1}}}
\newrefformat{fig}{\hyperref[#1]{Figure~\ref*{#1}}}
\newrefformat{lst}{\hyperref[#1]{Listing~\ref*{#1}}}
\newrefformat{equ}{\hyperref[#1]{Equation~\ref*{#1}}}
\newrefformat{sec}{\hyperref[#1]{Section~\ref*{#1}}}
\newrefformat{alg}{\hyperref[#1]{Algorithm~\ref*{#1}}}
\renewcommand{\arraystretch}{1.2}

\newcommand\bigforall{\mbox{\Large $\mathsurround0pt\forall$}}
\everymath{\displaystyle}

\lstset{ %
  backgroundcolor=\color{white},   % choose the background color; you must add \usepackage{color} or
  basicstyle=\ttfamily,            % the size of the fonts that are used for the code
  breakatwhitespace=true,          % sets if automatic breaks should only happen at whitespace
  breaklines=true,                 % sets automatic line breaking
  captionpos=b,                    % sets the caption-position to bottom
  escapeinside={(*}{*)},           % if you want to add LaTeX within your code
  extendedchars=true,              % lets you use non-ASCII characters; for 8-bits encodings only, does not work with UTF-8
  frame=single,	                   % adds a frame around the code
  keepspaces=true,                 % keeps spaces in text, useful for keeping indentation of code (possibly needs columns=flexible)
  language=TeX,                    % the language of the code
  numbers=left,                    % where to put the line-numbers; possible values are (none, left, right)
  numbersep=5pt,                   % how far the line-numbers are from the code
  numberstyle=\tiny\color{gray}, % the style that is used for the line-numbers
  rulecolor=\color{black},         % if not set, the frame-color may be changed on line-breaks within not-black text (e.g. comments (green here))
  showspaces=false,                % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
  showstringspaces=false,          % underline spaces within strings only
  showtabs=false,                  % show tabs within strings adding particular underscores
  stepnumber=1,                    % the step between two line-numbers. If it's 1, each line will be numbered
  tabsize=2,	                   % sets default tabsize to 2 spaces
  title=\lstname,                   % show the filename of files included with \lstinputlisting; also try caption instead of title
  emph=[3]{int:,array,set,of,int,if,then,else,constraint,var,union,endif,function,where,in,div,predicate,let,opt,full,format,def,for,True,False,return,or},
  emphstyle=[3]\color{ForestGreen},
  emph=[2]{length,max,forall,startEmptyBuffer,fix,startEmptyBufferShow,exactly,cumulative,occurs,deopt,sum,,all},
  emphstyle=[2]\color{blue},
  commentstyle=\color{BrickRed},
  stringstyle =\color{red},
}

\begin{document}

\subject{High Performance Computing}
\title{Reduction trees for MPI Reductions}
\subtitle{Project 2}

\author{Johannes Winklehner\\1226104 \and Armin Friedl\\1053597}
\date{\today}

\maketitle

\tableofcontents

\newpage

\section{Problem Description}
\label{sec:description}

The purpose of this project is to compare different implementations of the collective communication call MPI\_Reduce.
The compared implementations should all use different forms of Tree Reduction algorithms.
As a baseline for the comparison serves a given implementation of the MPI standard, which is in our case NEC MPI.
\begin{description}
 \item[Binomial Tree]
   A binomial tree has a non-fixed degree where each tree $B_i$ has exactly $i$ subtrees of size $B_0$ to $B_{i-1}$.
   The number of nodes in such a tree is equal to $2^i$ and the depth is $i$.
 \item[Fibonacci Tree]
  The Fibonacci tree uses a fixed degree of $2$ where a tree of size $F_i$ has one subtree of size $F_{i-1}$ and one of $F_{i-2}$.
  Therefore the number of nodes in this kind of tree is $fib(i+3)-1$ using the Fibonacci function $fib(x) = fib(x-1)+fib(x-2)$ and its depth is as well $i$.
 \item[Binary Tree]
  The binary tree used for reduction is a common complete binary tree where a tree $T_i$ has two subtrees $T_{i-1}$.
  Such a tree has $2^{i+1}-1$ nodes and its depth is as for the other types $i$.
\end{description}

\begin{center}
\begin{minipage}{.4\textwidth}
\begin{tikzpicture}
\node [circle,draw]{$B_i$}
child { node [circle,draw]{$B_{i-1}$}}
child {node [circle,draw] {$B_{i-2}$}}
child {node {\dots} edge from parent[draw=none]}
child {node [circle,draw] {$B_0$}};
\end{tikzpicture}
%\caption{Binomial Tree of size $i$}
\end{minipage}
\begin{minipage}{.2\textwidth}
\begin{tikzpicture}
\node [circle,draw]{$F_i$}
child { node [circle,draw]{$F_{i-1}$}}
child {node [circle,draw] {$F_{i-2}$}};
\end{tikzpicture}
%\caption{Fibonacci Tree of size $i$}
\end{minipage}
\begin{minipage}{.2\textwidth}
\begin{tikzpicture}
\node [circle,draw]{$T_i$}
child { node [circle,draw]{$T_{i-1}$}}
child {node [circle,draw] {$T_{i-2}$}};
\end{tikzpicture}
%\caption{Complete Binary Tree of size $i$}
\end{minipage}
\end{center}

All three implementations of the reduce function must use exactly the same interface as the MPI standard defines it.
This interface is shown in \prettyref{lst:reduce}.
This requires that all implementations support any arbitrary MPI datatype as well as operations.
The standard also provides some constraints regarding the associativity and commutativity of executable operations.
Every MPI operation must be associative, but does not necessarily have to be commutative.
This means that all results of the operation must be computed in the MPI rank order of all processes.

\begin{lstlisting}[language=C, caption=MPI Reduce interface, label=lst:reduce]
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
\end{lstlisting}

The standard also defines additional features of the reduce function, for example an in place operator for the root process.
However since those details where not mentioned in the assignment description, we did not consider them as part of the project.

The basic algorithm for a tree reduction, which will be shown in the next section, is very similar for all kinds of trees and uses Point-to-Point communication between tree nodes.
The assumption for our implementations to be efficient is that the underlying communication network is fully connected and allows for bidirectional communication.

\FloatBarrier

\section{Implemented Algorithms}
\label{sec:algorithms}

The basic algorithm for a tree reduction is very simple and is shown in \prettyref{alg:reduce}.
At first the parent and all child nodes have to be determined to know the communication partners of each process.
Then each process receives the partial results from all of its children and calculates its own result from the received data.
To ensure the correctness of the result for non commutative operations the iteration of child nodes has to be done in rank order.
Processes which are leaf nodes in the tree have no children and therefore skip the receiving part of the algorithm.
If a process has a parent and is therefore not the root process, it sends its result to the determined parent node.
However if the process is the root process the reduction is finished and can be returned.

\begin{algorithm}
  \caption{Tree Reduce}
  \label{alg:reduce}
  \KwIn{An array $\vec{a}$ of a given $datatype$ with size $count$ for each process}
  \KwOut{The result of the reduction on the $root$ process}
  determine $parent$ and $children$\;
  $result = \vec{a}$\;
  \ForAll{child in children}{
  	receive $result$ from $child$\;
  	$result =$ local reduce of received array and $result$\;
  }
  \eIf{parent exists}{
  	send $result$ to $parent$\;
  }{
  	$output = result$\;
  }
\end{algorithm}

The calculation of the parent and child nodes is the only aspect which has to be changed for all possible kinds of trees.
However there are of course certain optimizations possible where some knowledge about the structure of the tree can be used.
Such implementation details will be shown in the following part.
The code for all our implementations can be found in the Appendix in \prettyref{sec:appendix}.

\subsection{Binomial Tree Reduce}

The first of the three implementations we completed was the binomial tree reduction.
Since there were already some examples and explanations on how reductions and broadcasts work on binomial trees presented during the lectures, this was probably the most straight forward part of the project.
When looking at some trees of different sizes we quickly noticed, that the position of each node is static and the tree only grows in one direction.
This fact can be used in a sense that the children do not have to be precomputed but instead can be calculated during the loop before the corresponding receive operation.
A comparison between a $B_2$ and a $B_3$ tree is shown in \prettyref{fig:binomtrees}.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{scope}
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}}
child {node [circle,draw] {$2$}
	child {node [circle,draw] {$3$}}};
\end{scope}
\begin{scope}[shift={(5,0)}]
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}}
child {node [circle,draw] {$2$}
	child {node [circle,draw] {$3$}}}
child {node [circle,draw] {$4$}
	child {node [circle,draw] {$5$}}
	child {node [circle,draw] {$6$}
		child {node [circle,draw] {$7$}}}};
\end{scope}
\end{tikzpicture}
\caption{Comparison between a $B_2$ and a $B_3$}
\label{fig:binomtrees}
\end{center}
\end{figure}

From some of those trees we then determined that for the node with rank $r$ the child in each iteration is $r+i$ where $i=1$ at the start and is multiplied by $2$ after each iteration.
Before each iteration there is an additional condition, which checks if the node has any children left or if it should send the result.

\subsection{Fibonacci Tree Reduce}

The core difference of a Fibonacci tree compared to a binomial tree is the fixed degree of $2$.
To guarantee the correct order of the computed operation the position of a node inside the tree is not only dependent on the rank of the process, but also on the total size of the tree.
This is due to the fact that all ranks in one subtree must be lower than the ranks in the second subtree.
Therefore the position of a node with a certain rank changes depending on the tree size.
This can be seen in the comparison of the trees $F_2$ and $F_3$ in \prettyref{fig:fibtrees}.

\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{scope}
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}}
child {node [circle,draw] {$2$}
	child {node [circle,draw] {$3$}}};
\end{scope}
\begin{scope}[shift={(5,0)}]
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}
	child { node [circle,draw]{$2$}}}
child {node [circle,draw] {$3$}
	child {node [circle,draw] {$4$}}
	child {node [circle,draw] {$5$}
		child {node [circle,draw] {$6$}}}};
\end{scope}
\end{tikzpicture}
\caption{Comparison between a $F_2$ and a $F_3$}
\label{fig:fibtrees}
\end{center}
\end{figure}

During the receiving step of the algorithm we do not need a loop any more, since there are always two or less children for each node.
On the other hand the calculation of the children now has to be done in a loop.
As a first step we have to determine the size of the tree which can contain all processes.
This can be done by searching the Fibonacci numbers for the first value which is greater than the number of processes.
Since we know the size of both subtrees using the Fibonacci numbers we can determine whether a node is supposed to be in the left or right subtree.
When doing this recursively the position of a node and its children can be calculated.
The runtime of this part depends on the size of the tree and is therefore bound by the Fibonacci numbers.

Now that all communication partners have been determined each process has to execute at most two receives and afterwards one send command.
Noticeable when comparing this technique to the biomial tree is that there is already one less node in a tree $F_3$ than in the $B_3$.
This means that the binomial tree can handle more processes in the same number of rounds.

\subsection{Binary Tree Reduce}

The reduction using a binary tree can be implemented in a very similar way like the Fibonacci tree since the degree is also two.
The key difference is of course the structure of the trees and therefore the calculation of the children.
Again the position of certain nodes changes depending on the size of the tree since the lower ranks must be in the left subtree and the higher ones in the right subtree.
The structure of such trees can be shown rather nice because they are simply complete binary trees.
There is again a comparison between a tree $T_2$ and $T_3$ which is shown in \prettyref{fig:bintrees}.

\begin{figure}
\begin{center}
\begin{tikzpicture}[
    auto,
    level 1/.style={sibling distance=40mm},
    level 2/.style={sibling distance=20mm},
    level 3/.style={sibling distance=10mm}]
\begin{scope}
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}
	child { node [circle,draw]{$2$}}
	child { node [circle,draw]{$3$}}}
child { node [circle,draw]{$4$}
	child { node [circle,draw]{$5$}}
	child { node [circle,draw]{$6$}}};
\end{scope}
\begin{scope}[shift={(7,0)}]
\node [circle,draw]{$0$}
child { node [circle,draw]{$1$}
	child { node [circle,draw]{$2$}
		child { node [circle,draw]{$3$}}
		child { node [circle,draw]{$4$}}}
	child { node [circle,draw]{$5$}
		child { node [circle,draw]{$6$}}
		child { node [circle,draw]{$7$}}}}
child { node [circle,draw]{$8$}
	child { node [circle,draw]{$9$}
		child { node [circle,draw]{$10$}}
		child { node [circle,draw]{$11$}}}
	child { node [circle,draw]{$12$}
		child { node [circle,draw]{$13$}}
		child { node [circle,draw]{$14$}}}};
\end{scope}
\end{tikzpicture}
\caption{Comparison between a $T_2$ and a $T_3$}
\label{fig:bintrees}
\end{center}
\end{figure}

The tree size as well as the computation of the child nodes can be done using a logarithmic function on the number of processes.
The rest works in exactly the same way as the previously explained algorithms.
When structuring the tree like this the drawback is that in each round a node receives data from both children.
As a result the number of rounds for this algorithm is the size of the tree plus an additional round.

\FloatBarrier

\section{Results}
\label{sec:results}

Before we compared the runtime of our algorithms the correctness has to be tested by a reasonable amount.
As already stated in the project description this process can be done easily by comparing each result to the MPI\_Reduce function.
This can be done for all implementations at once to quickly test them for correctness.
With this method we tried various combinations of array sizes, process numbers as well as different operations and the result always turned out to be correct.

After doing those tests to show the correctness we started doing some benchmarking comparisons of our implementations.
To do this we utilized 36 nodes of the jupiter system with 16 processes each.
We did two kinds of tests which will be explained now to compare the runtime of our implementations as well as the MPI\_Reduce function.

\subsection{Process Count}

For the first benchmark we used a different number of processes to check the scaling of all methods.
The size of the array on each process for this test is $1000$.
This is a rather low value, but since we are using tree reduction which is not optimal for high amounts of data such a value makes sense.
The amount of processes used was increased from starting with only one node up to using all available 36 nodes.
On each node all 16 processes where used in all tests.
Therefore the total process count ranged from 16 to 576.
For all out tests in this project we used a repetition count of 30 which allowed us to run a high number of different inputs in a reasonable amount of time.
In \prettyref{fig:nodeplot} the results of this benchmark are shown.

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{newnodeplot}
  \end{adjustbox}
  \caption{Average runtimes on 1 to 36 nodes with 16 processes each.}
  \label{fig:nodeplot}
\end{figure}

\subsection{Array Size}

Our second used a fixed number of processes but the size of the local arrays was increasing.
This should show how the different implementations perform for small arrays or even a single number.
But it also shows how they perform with a large amount of data on each process.
The amount of nodes was fixed at 36 for the complete test.
The size of the local arrays was increased by a factor of 10 in each iteration starting with just 1 and increased it up to 1000000.
The number of repetitions is the same as for the last test at 30.

\begin{figure}
  \begin{adjustbox}{center}
    \includegraphics[width=0.8\linewidth]{sizeplot}
  \end{adjustbox}
  \caption{Average runtimes on 36 nodes with an array size of 1 to 1000000.}
  \label{fig:nodeplot}
\end{figure}


\FloatBarrier

\section{Analysis}
\label{sec:analysis}

Our first result seen in \prettyref{fig:nodeplot} suggests that our implementation using the binomial tree got the best results by a big margin.
This result was very surprising to us because we expected that the MPI\_Reduce function of the library would outperform our rather simple implementations.
However this seems not to be the case and such tree algorithms are apparently really good for the dataset tested there.
Although the result of the MPI\_Reduce function seems to very unstable and it varies a lot during the test.
This might be due to a too low number of repetitions, the very short execution time or some other factors.
That the binary tree performed better than the Fibonacci tree was also quite surprising, since the communication pattern of the Fibonacci tree is almost round optimal in contrast to the binary tree.


In the second test it is clearly shown that for larger array sizes the tree algorithms perform much worse.
For us the MPI\_Reduce function had a better performance than our implementations when using more than 100000 array elements.
This was pretty much expected after the first benchmark since for such arrays further optimizations like pipelining would be necessary.

\newpage

\section{Appendix}
\label{sec:appendix}

\lstinputlisting[language=C]{../binom_reduce.c}

\lstinputlisting[language=C]{../fib_reduce.c}

\lstinputlisting[language=C]{../bin_reduce.c}

\end{document}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: