\subject{High Performance Computing}
\title{Reduction trees for MPI Reductions}
\subtitle{Project 2}
\author{Johannes Winklehner\\1226104 \and Armin Friedl\\1053597}
\section{Problem Description}
The purpose of this project is to compare different implementations of the collective communication call MPI\_Reduce.
The compared implementations each use a different form of tree reduction algorithm.
The baseline for the comparison is the reduce function of an existing implementation of the MPI standard, in our case NEC MPI.
\item[Binomial Tree]
A binomial tree has no fixed degree: the root of a tree $B_i$ has exactly $i$ subtrees, namely $B_0$ to $B_{i-1}$.
The number of nodes in such a tree is equal to $2^i$ and the depth is $i$.
\item[Fibonacci Tree]
The Fibonacci tree uses a fixed degree of $2$, where a tree $F_i$ has one subtree $F_{i-1}$ and one subtree $F_{i-2}$.
Therefore the number of nodes in this kind of tree is $fib(i+3)-1$, using the Fibonacci function $fib(x) = fib(x-1)+fib(x-2)$ with $fib(1) = fib(2) = 1$, and its depth is also $i$.
\item[Binary Tree]
The binary tree used for reduction is an ordinary complete binary tree, where a tree $T_i$ has two subtrees $T_{i-1}$.
Such a tree has $2^{i+1}-1$ nodes and, as for the other types, its depth is $i$.
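As a quick check of the node counts given above, each formula follows from the recursive structure of the respective tree.
The symbols $N_B(i)$, $N_F(i)$ and $N_T(i)$ for the number of nodes are introduced here only for illustration, and we assume the base cases $fib(1) = fib(2) = 1$, $N_B(0) = N_F(0) = N_T(0) = 1$ and $N_F(-1) = 0$.
\[
\begin{array}{rcl}
N_B(i) &=& 1 + \sum_{k=0}^{i-1} N_B(k) = 2^i \\
N_F(i) &=& 1 + N_F(i-1) + N_F(i-2) = fib(i+3) - 1 \\
N_T(i) &=& 1 + 2\,N_T(i-1) = 2^{i+1} - 1
\end{array}
\]
For $i = 3$ this gives $8$, $7$ and $15$ nodes, respectively.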
\node [circle,draw]{$B_i$}
child { node [circle,draw]{$B_{i-1}$}}
child {node [circle,draw] {$B_{i-2}$}}
child {node {\dots} edge from parent[draw=none]}
child {node [circle,draw] {$B_0$}};
%\caption{Binomial Tree of size $i$}
\node [circle,draw]{$F_i$}
child { node [circle,draw]{$F_{i-1}$}}
child {node [circle,draw] {$F_{i-2}$}};
%\caption{Fibonacci Tree of size $i$}
\node [circle,draw]{$T_i$}
child { node [circle,draw]{$T_{i-1}$}}
child {node [circle,draw] {$T_{i-1}$}};
%\caption{Complete Binary Tree of size $i$}
All three implementations of the reduce function must use exactly the same interface as the MPI standard defines it.
This interface is shown in \prettyref{lst:reduce}.
This requires all implementations to support arbitrary MPI datatypes as well as arbitrary MPI operations.
The standard also places constraints on the associativity and commutativity of the operations.
Every MPI operation must be associative, but does not necessarily have to be commutative.
This means that the partial results must always be combined in the MPI rank order of the processes.
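To illustrate why the rank order matters, the following minimal sketch uses an operation that is associative but not commutative, the composition of affine maps $x \mapsto ax + b$.
The type \texttt{affine\_t} and the functions \texttt{compose}, \texttt{reduce\_in\_order} and \texttt{affine\_equal} are hypothetical names chosen for this example and are not part of MPI or of our implementation.

```c
#include <assert.h>

/* Hypothetical example type: the affine map x -> a*x + b. */
typedef struct { long a, b; } affine_t;

/* compose(f, g) applies f first and then g, i.e. it yields g(f(x)).
 * This operation is associative but not commutative. */
affine_t compose(affine_t f, affine_t g) {
    affine_t r = { f.a * g.a, f.b * g.a + g.b };
    return r;
}

/* Sequential reference: v[0] op v[1] op ... op v[n-1] in rank order. */
affine_t reduce_in_order(const affine_t *v, int n) {
    assert(n > 0);
    affine_t acc = v[0];
    for (int i = 1; i < n; i++) {
        acc = compose(acc, v[i]);
    }
    return acc;
}

/* Two maps are equal if both coefficients agree. */
int affine_equal(affine_t x, affine_t y) {
    return x.a == y.a && x.b == y.b;
}
```

Because the operation is associative, a tree may group the operands freely, for example as $(v_0 \circ v_1) \circ (v_2 \circ v_3)$, and still obtain the sequential result.
Swapping two operands, however, generally changes the result, which is why a tree reduction has to preserve the rank order.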
\begin{lstlisting}[language=C, caption=MPI Reduce interface, label=lst:reduce]
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
\end{lstlisting}
The standard also defines additional features of the reduce function, for example the in-place option (\texttt{MPI\_IN\_PLACE}) for the root process.
However, since those details were not mentioned in the assignment description, we did not consider them as part of the project.
The basic algorithm for a tree reduction, which will be shown in the next section, is very similar for all kinds of trees and uses point-to-point communication between tree nodes.
For our implementations to be efficient, we assume that the underlying communication network is fully connected and allows for bidirectional communication.
\section{Implemented Algorithms}
The basic algorithm for a tree reduction is simple and is shown in \prettyref{alg:reduce}.
First, each process determines its parent and all of its child nodes, which are its communication partners.
Then each process receives the partial results from all of its children and combines them with its own data.
To ensure a correct result for non-commutative operations, the child nodes have to be iterated in rank order.
Processes that are leaf nodes of the tree have no children and therefore skip the receiving part of the algorithm.
If a process has a parent, and is therefore not the root process, it sends its result to that parent.
Otherwise the process is the root, the reduction is finished, and the result can be returned.
\caption{Tree Reduce}
\KwIn{An array $\vec{a}$ of a given $datatype$ with size $count$ for each process}
\KwOut{The result of the reduction on the $root$ process}
determine $parent$ and $children$\;
$result = \vec{a}$\;
\ForAll{$child$ in $children$ in rank order}{
receive $received$ from $child$\;
$result =$ local reduce of $result$ and $received$\;
}
\eIf{$parent$ exists}{
send $result$ to $parent$\;
}{
$output = result$\;
}
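The steps of the algorithm can be traced with a small sequential simulation.
This is only a sketch and not MPI code: the recursion stands in for the messages, the tree is passed as a hypothetical \texttt{parent} array, and we assume that the ranks are numbered such that the subtree of every child covers a contiguous rank range following its parent.
The names \texttt{op\_fn}, \texttt{concat\_digits} and \texttt{tree\_reduce\_sim} are chosen for illustration only.

```c
#include <assert.h>

/* Hypothetical operator type: any associative binary operation on long. */
typedef long (*op_fn)(long, long);

/* Example operation: decimal concatenation, e.g. (12, 34) -> 1234.
 * It is associative but not commutative. Assumes y > 0. */
long concat_digits(long x, long y) {
    assert(y > 0);
    long shift = 1;
    for (long t = y; t > 0; t /= 10) {
        shift *= 10;
    }
    return x * shift + y;
}

/* Simulates the tree reduce for process `rank`.
 * parent[r] is the parent of rank r, and -1 for the root.
 * Returns the value that this process would send to its parent
 * (or, for the root, the final output). */
long tree_reduce_sim(int rank, const int *parent, const long *a,
                     int nprocs, op_fn op) {
    assert(rank >= 0 && rank < nprocs);
    long result = a[rank];                      /* result = own input */
    for (int child = 0; child < nprocs; child++) {  /* ascending rank order */
        if (parent[child] == rank) {
            /* "receive" the partial result of the child's subtree */
            long received = tree_reduce_sim(child, parent, a, nprocs, op);
            result = op(result, received);      /* local reduce */
        }
    }
    return result;
}
```

For a binomial tree on eight processes with $parent = (-1, 0, 0, 2, 0, 4, 4, 6)$ and the inputs $1, \dots, 8$, the simulated root obtains $12345678$, which is exactly the result of combining the inputs in rank order.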
The calculation of the parent and child nodes is the only part of the algorithm that differs between the kinds of trees.
However, knowledge about the structure of a concrete tree can be used for certain optimizations.
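As an example of such a calculation, the following sketch determines the communication partners in a binomial tree.
It assumes that the tree is rooted at rank $0$ and that the ranks are numbered so that the parent of a rank is obtained by clearing its lowest set bit; this is one common numbering and not necessarily the one used in our implementation.
The function names \texttt{binomial\_parent} and \texttt{binomial\_children} are hypothetical.

```c
#include <assert.h>

/* Parent of `rank` in a binomial tree rooted at rank 0.
 * Clearing the lowest set bit yields the parent; the root has none (-1). */
int binomial_parent(int rank) {
    assert(rank >= 0);
    if (rank == 0) {
        return -1;
    }
    return rank & (rank - 1);
}

/* Writes the children of `rank` into children[] in ascending rank order
 * and returns their number. The caller must provide room for one entry
 * per bit of an int (32 entries are enough for a 32-bit int). */
int binomial_children(int rank, int nprocs, int *children) {
    assert(rank >= 0 && rank < nprocs);
    int n = 0;
    /* The children are rank + 2^k for every 2^k below the lowest set bit
     * of rank (for the root: every 2^k), as long as they are < nprocs. */
    for (int bit = 1; rank + bit < nprocs; bit <<= 1) {
        if (rank & bit) {
            break;  /* reached the lowest set bit of rank */
        }
        children[n++] = rank + bit;
    }
    return n;
}
```

With this numbering the subtree of the child $rank + 2^k$ covers the contiguous ranks $rank + 2^k$ to $rank + 2^{k+1} - 1$, so receiving from the children in ascending order preserves the rank order required for non-commutative operations.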
\section{Implementation Details}
