**The problem**: you are given 50,000,000 real numbers (IEEE 794 doubles) normal distributed with mean 0 and standard deviation 1. Sort them as fast as possible. The target machine is a quad core cpu and with 4GB of RAM (thanks to static_rtti for doing the measurements).

The idea is to see if the distribution can be taken into account to improve sorting speed.

**tl;dr**: Not faster than sorting any 50,000,000 real numbers.

**Solution 1 uses std::sort from C++ STL. Simple solution, but not very impressive: 6.425s.**

**Solution 2 uses tbb:parallel_sort from Intel's Thread Building Blocks. Amazingly simple parallel solution: 2.958s. I will look into TBB next time when I'll write multithreaded programs.**

**Solution 3 uses Java7's Fork and Join for parallelization. Surprisingly, it parallelizes better than the TBB solution. Running time is 3.411s**.

**Solution 4**uses Least Significant Digit Radix Sort. This solution is described by Michael Herf here. The solution is easy to understand: stable sort mantissa, then exponent, then sign. Radix sort is very cache unfriendly and hard to parallelize (at least this variant). As an improvement, this implementation uses radix sort to partially sort the data and then insertion sort finalize sorting. Insertion sort is cache friendly and almost linear if elements are partially sorted. A similar trick can be used with qsort: don't qsort ranges shorter than 9 elements, and use a final insertion sort to fix out of order elements. This solution was the fastest sequential solution :

**1.459s**. An attempt to parallelize it is here, but no real speedup was obtained.

**Math**

Up to now no solution uses the fact that the elements were normally distributed. Let's see what this means.

For normal distribution with mean $\mu = 0$ and stddev $\sigma = 1$ we have the cumulative distribution function:

$cdf(x) = P(X < x) = \frac{1}{2}\left[1 + erf\left(\frac{x}{\sqrt{2}}\right)\right]$

where the error function, $erf$, is defined as:$erf(x) = \frac{2}{\sqrt{\pi}}\int_0^x{e^{-t^2}\mathrm{d}t}$

$erf$ and $cdf$ can be calculated only by approximation (see for example Handbook of http://people.math.sfu.ca/~cbm/aands/), but most approximations, if computed as given in the book, involve exponentiation which is slow. However, polynomials can be used to approximate the cdf. To compute polynomial one can use the following little octave program.

`octave:1> output_precision(20)octave:2> x = 0:0.0001:6octave:3> [p] = polyfit(x, stdnormal_cdf(x), 7)p =-6.1477472013548900738e-05 1.4979503290953361146e-03 -1.4572377330342150409e-02 7.0142027233260337282e-02 -1.5795386985864678930e-01 5.6651279954993555288e-02 3.8484595025230294851e-01 5.0081970774936157564e-01octave:4> min(polyval(p, x) - stdnormal_cdf(x))ans = -8.2027858615862925262e-04octave:5> max(polyval(p, x) - stdnormal_cdf(x))ans = 8.2027858615862925262e-04`

The source code can be checked here.

What's the big deal with the $cdf$ function?

- In the settings of our problem (50,000,000 normal distributed reals) $50,000,000 \cdot cdf(x)$ approximates $x$'s position in the sorted array.
- $cdf^{-1}$ can be used to divide the input stream into equal sized chunks. Quick Sort can be enhanced to have $O(N\log N)$ worst case scenario.

**Solution 5**(and here) uses $cdf$ to put elements in the input array in their position in the final array. The solution is a straight forward implementation of the math discussed above and has true $O(N)$ complexity. The running time is

**2.470s**which is twice as slow as the radix sort solution. Much of the penalty came from cache misses and computation of $cdf$.

**Solution 6 is the WINNER**solution with only

**1.188s**. The idea: divide the input array into 512 chunks using the inverse $cdf$ (so all elements in chunk $i$ are smaller than all elements in chunk $i+1$) and sort each chunk individually. It is a straight forward to parallelize: sort different chunks on different threads.

**Conclusions**

- Does knowing the distribution helps parallelization?
- Sequential: No. The fastest solution uses the representation of doubles to achieve linear time.
- Multi-threaded: Maybe. The normal distribution was used to divided the input stream into smaller chunks which were sorted on multiple cores. However, it may be that the speedup is because of sorting and not because of distribution.
- Is it worth it?
- No. Using 4 cores to gain 1.237 speedup is definitely a waste of resources. I think most of the penalty comes from extensive use of memory which creates many cache misses and fake sharing.

if you want more speed you can use CUDA (if you have nvidia) ... programming on GPU can be very fast and with many threads (the more closer to thread limit the better)

ReplyDelete