Thursday, June 16, 2011

How fast can you sort 50,000,000 normally distributed doubles?

This is a post about a contest on CodeGolf.StackExchange.

The problem: you are given 50,000,000 real numbers (IEEE 754 doubles) normally distributed with mean 0 and standard deviation 1. Sort them as fast as possible. The target machine is a quad-core CPU with 4GB of RAM (thanks to static_rtti for doing the measurements).

The idea is to see if the distribution can be taken into account to improve sorting speed.

tl;dr: no faster than sorting any other 50,000,000 real numbers.

Solution 1 uses std::sort from C++ STL. Simple solution, but not very impressive: 6.425s.

Solution 2 uses tbb::parallel_sort from Intel's Threading Building Blocks. An amazingly simple parallel solution: 2.958s. I will look into TBB next time I write multithreaded programs.
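For reference, the parallel sort itself is essentially a one-liner (a minimal sketch; it assumes the numbers are already loaded into a std::vector<double>):

#include <tbb/parallel_sort.h>
#include <vector>

// Sort in parallel; TBB decides how many worker threads to use.
void sort_doubles(std::vector<double> &data) {
    tbb::parallel_sort(data.begin(), data.end());
}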


Solution 3 uses Java 7's Fork/Join framework for parallelization. Surprisingly, it parallelizes better than the TBB solution, although its overall running time, 3.411s, is slightly behind TBB's.

Solution 4 uses a Least Significant Digit radix sort. The technique is described by Michael Herf here. The idea is easy to understand: stable-sort by mantissa, then by exponent, then by sign. Radix sort is very cache unfriendly and hard to parallelize (at least this variant). As an improvement, this implementation uses radix sort to partially sort the data and then a final insertion sort to finish the job. Insertion sort is cache friendly and almost linear if the elements are already mostly sorted. A similar trick can be used with qsort: don't qsort ranges shorter than 9 elements, and use a final insertion sort to fix the out-of-order elements. This was the fastest sequential solution: 1.459s. An attempt to parallelize it is here, but no real speedup was obtained.
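Herf's article presents the trick for 32-bit floats; the 64-bit version of the key mapping looks roughly like this (a sketch of the bit manipulation only, not the submitted code):

#include <cstdint>
#include <cstring>

// Map a double to a uint64_t whose unsigned order matches the double's order:
// negative values get all their bits flipped, non-negative values only get the
// sign bit flipped. The radix passes then sort these keys byte by byte.
uint64_t double_to_key(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    uint64_t mask = (bits >> 63) ? ~uint64_t(0) : (uint64_t(1) << 63);
    return bits ^ mask;
}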

Math


So far none of the solutions uses the fact that the elements are normally distributed. Let's see what that means.

For a normal distribution with mean $\mu = 0$ and stddev $\sigma = 1$ we have the cumulative distribution function:
$cdf(x) = P(X < x) = \frac{1}{2}\left[1 + erf\left(\frac{x}{\sqrt{2}}\right)\right]$
where the error function, $erf$, is defined as:
$erf(x) = \frac{2}{\sqrt{\pi}}\int_0^x{e^{-t^2}\mathrm{d}t}$

$erf$ and $cdf$ can be computed only by approximation (see for example the Handbook of Mathematical Functions, http://people.math.sfu.ca/~cbm/aands/), but most approximations, if computed as given in the book, involve exponentiation, which is slow. However, a polynomial can be used to approximate the cdf. To compute such a polynomial one can use the following little Octave program.

octave:1> output_precision(20)
octave:2> x = 0:0.0001:6
octave:3> [p] = polyfit(x, stdnormal_cdf(x), 7)
p =
-6.1477472013548900738e-05   1.4979503290953361146e-03  -1.4572377330342150409e-02   7.0142027233260337282e-02  -1.5795386985864678930e-01   5.6651279954993555288e-02   3.8484595025230294851e-01   5.0081970774936157564e-01
octave:4> min(polyval(p, x) - stdnormal_cdf(x))
ans = -8.2027858615862925262e-04
octave:5> max(polyval(p, x) - stdnormal_cdf(x))
ans = 8.2027858615862925262e-04


The source code can be checked here.
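polyfit returns the coefficients with the highest power first, so the fitted cdf can be evaluated cheaply with Horner's scheme. A minimal sketch (the clamping to the fitted range [0, 6] and the use of the symmetry cdf(-x) = 1 - cdf(x) are my additions, not necessarily what the submitted code does):

// Degree-7 polynomial fitted above, coefficients as printed by Octave
// (highest power first), evaluated with Horner's scheme.
double approx_cdf(double x) {
    static const double p[8] = {
        -6.1477472013548900738e-05,  1.4979503290953361146e-03,
        -1.4572377330342150409e-02,  7.0142027233260337282e-02,
        -1.5795386985864678930e-01,  5.6651279954993555288e-02,
         3.8484595025230294851e-01,  5.0081970774936157564e-01 };
    double ax = x < 0 ? -x : x;
    if (ax > 6.0) ax = 6.0;                  // the fit only covers [0, 6]
    double y = 0.0;
    for (double c : p) y = y * ax + c;       // Horner's scheme
    return x < 0 ? 1.0 - y : y;              // cdf(-x) = 1 - cdf(x)
}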

What's the big deal with the $cdf$ function?
  1. In the setting of our problem (50,000,000 normally distributed reals), $50,000,000 \cdot cdf(x)$ approximates $x$'s position in the sorted array.
  2. $cdf^{-1}$ can be used to divide the input stream into equal-sized chunks. Quick sort can be enhanced this way to get an $O(N\log N)$ worst case.
Solution 5 (and here) uses $cdf$ to put the elements of the input array directly at their position in the final array. The solution is a straightforward implementation of the math discussed above and has true $O(N)$ complexity. The running time is 2.470s, about 1.7 times slower than the radix sort solution. Much of the penalty comes from cache misses and from computing the $cdf$.

Solution 6 is the WINNER with only 1.188s. The idea: divide the input array into 512 chunks using the inverse $cdf$ (so that all elements in chunk $i$ are smaller than all elements in chunk $i+1$) and sort each chunk individually. It is straightforward to parallelize: sort different chunks on different threads.
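A minimal sketch of the idea. For clarity it applies the exact cdf from <cmath> to every element instead of precomputing chunk boundaries with the inverse cdf (which yields the same chunks); the chunk count and the std::thread scheduling are illustrative, not the submitted code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Bucket by cdf so that every element of chunk i is smaller than every element
// of chunk i+1, then sort the chunks independently and concatenate them.
void cdf_chunk_sort(std::vector<double> &data, int num_chunks = 512) {
    std::vector<std::vector<double>> chunks(num_chunks);
    for (double x : data) {
        double c = 0.5 * std::erfc(-x / std::sqrt(2.0));          // standard normal cdf
        int i = std::min(num_chunks - 1, static_cast<int>(c * num_chunks));
        chunks[i].push_back(x);
    }

    // One worker per core; chunks are assigned round-robin.
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < cores; ++t)
        workers.emplace_back([&chunks, num_chunks, cores, t] {
            for (int i = t; i < num_chunks; i += static_cast<int>(cores))
                std::sort(chunks[i].begin(), chunks[i].end());
        });
    for (std::thread &w : workers) w.join();

    // The chunks are already ordered relative to each other: just concatenate.
    std::size_t pos = 0;
    for (const std::vector<double> &chunk : chunks)
        for (double x : chunk) data[pos++] = x;
}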


Conclusions

  • Does knowing the distribution help?
    • Sequential: No. The fastest solution uses the representation of doubles to achieve linear time.
    • Multi-threaded: Maybe. The normal distribution was used to divide the input stream into smaller chunks which were sorted on multiple cores. However, the speedup may come from the parallel sorting itself and not from the distribution.
  • Is it worth it?
    • No. Using 4 cores to gain a 1.237x speedup is definitely a waste of resources. I think most of the penalty comes from the extensive use of memory, which creates many cache misses and false sharing.

Thursday, March 10, 2011

Tale about binaries

Until recently STRUCTure handled binary clauses separately using the implications graph. The problem with this scheme is that binaries need special attention, which complicates the design and the implementation. I decided to give binaries a less special status and keep them in both the watch lists and the implications graph. Needless to say, the code was greatly simplified.

The upside: the reasoning algorithms now work with binary clauses. Blocked clause elimination, which previously ignored all variables that appeared in binaries, can now eliminate even more clauses. Trivial necessary assignments should be detected by the watch lists, and the rest of them by hyper binary resolution.

The downside: the branching heuristic was greatly affected, because literal propagations were previously approximated using a topological sort of the implications graph. However, I can rebuild the implications graph, so there will be only a small speed penalty.

Sunday, March 6, 2011

Instances solved during a run.

This morning I thought of checking the number of variables in the subinstances solved during a typical run. I created the histogram below (note the log scale on both axes). Over 1,000,000 instances had 3 variables. That's a lot of overhead, considering that an instance with very few variables is easy to solve by checking all possible assignments. In the picture below, created represents the instances created, and solved represents the instances created but not solved by the simplification procedure.



At some point I added a simple backtracking algorithm to do away with small instances, but it didn't influence the results that much. Probably I should test it again. I could use, for example, sat4j to solve small instances, but since sat4j is probably more efficient than STRUCTure, the results would be biased.
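To illustrate how cheap the tiny instances are, here is the kind of exhaustive check that dispatches a 3-variable instance immediately (STRUCTure is written in Java; this C++ fragment and its clause representation are only an illustration, and it is sensible only for a handful of variables):

#include <cstdlib>
#include <vector>

// Brute-force satisfiability check for tiny instances: try all 2^n assignments.
// Clauses use DIMACS-style literals: +v means variable v true, -v means false.
bool tiny_sat(int num_vars, const std::vector<std::vector<int>> &clauses) {
    for (unsigned long long mask = 0; mask < (1ULL << num_vars); ++mask) {
        bool ok = true;
        for (const std::vector<int> &clause : clauses) {
            bool satisfied = false;
            for (int lit : clause) {
                int v = std::abs(lit) - 1;                 // 0-based variable index
                bool value = (mask >> v) & 1ULL;
                if ((lit > 0) == value) { satisfied = true; break; }
            }
            if (!satisfied) { ok = false; break; }
        }
        if (ok) return true;
    }
    return false;
}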

Nevertheless, switching to another algorithm doesn't fix the root problem, which I don't fully understand yet. My guess is that lookahead solvers are not that good on easy instances with many variables that appear in relatively few clauses. What I want to do next is to add another simplification procedure that eliminates these variables, similar to variable elimination in SatElite but simpler, and run it at every branching node.

Wednesday, March 2, 2011

Dependent Variables

Let's talk a little about XOR clauses. If a variable appears in a single XOR clause, we can remove that clause, because it can always be satisfied by setting the proper value of that variable. Let's call such a variable dependent. I was reading Mate Soos's paper on Gaussian elimination, and he suggests that a similar technique can be applied to variables that appear in exactly two XOR clauses. These variables, called mutexes, are very common when splitting long XOR clauses before transforming them into CNF clauses.

For example: \[
\begin{align*}
a+b+c+d+e+f & = 1 \\
\text{can be split into} \\
a+b+c+z & = 0 \\
z+d+e+f & = 1
\end{align*}
\] with $z$ being a mutex.

I think that dependent variables can be generalized further, to all variables that appear only in XOR clauses. Let $u$ be such a variable. Since an XOR clause is actually an equation over $GF(2)$, we can use the shortest XOR clause containing $u$ to knock $u$ out of all other XOR clauses containing $u$, as in Gaussian elimination. Now $u$ appears in a single XOR clause and therefore can be removed together with that clause (a small sketch of this elimination is given below). I wonder if anybody else has thought of this.
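A sketch of the bookkeeping (STRUCTure is written in Java; this C++ fragment and its XorClause representation are only an illustration of the idea):

#include <cstddef>
#include <set>
#include <vector>

// An XOR clause x1 + x2 + ... + xk = parity, viewed as an equation over GF(2).
struct XorClause {
    std::set<int> vars;   // variable indices
    bool parity;          // right-hand side (1 for XOR, 0 for NXOR)
};

// Add equation `pivot` to `c` (GF(2) addition): symmetric difference of the
// variable sets, parity XORed. Used to knock a variable out of `c`.
void add_gf2(XorClause &c, const XorClause &pivot) {
    for (int v : pivot.vars) {
        if (!c.vars.erase(v)) c.vars.insert(v);   // toggle membership
    }
    c.parity ^= pivot.parity;
}

// Eliminate variable u, assuming it appears only in XOR clauses: pick the
// shortest clause containing u, add it to every other clause containing u,
// then drop the pivot clause (u has become a dependent variable).
void eliminate_xor_var(int u, std::vector<XorClause> &xors) {
    int pivot = -1;
    for (std::size_t i = 0; i < xors.size(); ++i)
        if (xors[i].vars.count(u) &&
            (pivot < 0 || xors[i].vars.size() < xors[pivot].vars.size()))
            pivot = static_cast<int>(i);
    if (pivot < 0) return;                         // u does not occur at all
    for (std::size_t i = 0; i < xors.size(); ++i)
        if (static_cast<int>(i) != pivot && xors[i].vars.count(u))
            add_gf2(xors[i], xors[pivot]);
    xors.erase(xors.begin() + pivot);              // u now occurs nowhere
}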

Update: Here is a graph showing the performance improvement of Dependent Variable Elimination. On one instance the speed improvement was almost 200-fold. That's impressive. Of course, the instances without XOR gates were not affected.

SAT Competition 2011

I've just submitted the solver for SAT Competition 2011. The submission can be downloaded from here and the sources from here (note the git tag sc11). There is a huge gap between STRUCTure and CryptoMiniSat 2.6.0. My goal is not to win the competition (which could happen only if the Java HotSpot developers proved that P = NP), but to get a complete, objective evaluation of STRUCTure. IIRC lookahead solvers are somewhat better than CDCL solvers on random instances, and on these instances STRUCTure seems to perform better than CryptoMiniSat.

Update: As it turns out, I had a bug that rendered the extraction of XOR gates ineffective in my SAT Competition 2011 submission. No wonder adjusting the coefficient for XOR gates in the branching heuristic didn't influence the results.

Thursday, February 24, 2011

Improving Branching. Part. II: Polarity

In the previous post I concluded that the polarity searched first affects the solving time of some instances (mostly satisfiable instances, of course). With this in mind I ran four tests with the branching heuristic modified as follows:

  • initial - branch first on positive polarity
  • plus - branch first on polarity with the highest score
  • minus - branch first on polarity with the lowest score
  • random - branch first on a random polarity
The results are in the graph below. Needless to say, random rocks. plus and minus performed similarly. By choosing a random polarity STRUCTure can now solve three more instances from the AProVE class (I am developing negative emotional feelings towards them).

Improving Branching. Part. I: Coefficients

The current algorithm for choosing the branching variable is as follows: compute scores for all literals, compute a score for each variable based on the scores of its two polarities (i.e. the positive and the negative literal), and then choose a variable at random from those with the highest scores.

The score for a literal is computed by propagating the literal and checking the modified clauses (not including satisfied clauses). For every modified clause of length L and for each literal in that clause, the score of the literal is increased by α^L if the clause is a disjunction of literals, or by β^L if the clause is an XOR gate.

The score for a variable is computed using a formula from March_eq: S(b) = 1024 * S'(b) * S'(-b) + S'(b) + S'(-b), where S(b) is the score for variable b, and S'(b) and S'(-b) are the scores for the literals b and -b.
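In code the combination step is tiny; a sketch with an equally simple tie-break (STRUCTure is Java; the score arrays and the random choice among top-scoring variables are assumptions based on the description above, and at least one variable is assumed to exist):

#include <cstdlib>
#include <vector>

// Combine the scores of a variable's two literals (March_eq formula).
double variable_score(double pos, double neg) {
    return 1024.0 * pos * neg + pos + neg;
}

// Pick a branching variable uniformly at random among those with the top score.
// pos_score / neg_score are literal-score arrays indexed by variable.
int pick_branching_variable(const std::vector<double> &pos_score,
                            const std::vector<double> &neg_score) {
    double best = -1.0;
    std::vector<int> candidates;
    for (std::size_t v = 0; v < pos_score.size(); ++v) {
        double s = variable_score(pos_score[v], neg_score[v]);
        if (s > best) { best = s; candidates.clear(); }
        if (s == best) candidates.push_back(static_cast<int>(v));
    }
    return candidates[std::rand() % candidates.size()];
}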

The question is: what values should α and β take? To find out, I chose a training set made of a few instances and ran the solver for a few hundred different values of α and β. I got quite a few timeouts, but eventually I had enough values to compute an interesting heat map (see below: yellow is bad, blue is good). From this I learned a few things: values of α under 0.25 are awful, and β has little effect on the results (notice the vertical bars).

With the best values found for α and β I ran the solver on a more extensive set of instances to see which instances were negatively affected, and extended my initial set. This time I restricted the possible values for α to the interval [0.5, 1), though in retrospect I think I shouldn't have.


Notice the new red vertical band around 0.8, which was dark blue in the previous heat map. Again, β had little effect. Notice that up to now I have eliminated two ranges for α: [0.00, 0.25) and [0.75, 1.00).

With the newly found best values I ran the solver on my usual set of instances (a subset of the easy and medium instances from the SAT09 Competition). The good news is that STRUCTure was able to solve a few more instances, with some speedup on others. Many instances were not affected. The bad news is that one instance that was included in my training set timed out. The difference between training and testing was the number of CPUs used (16 versus 8), which suggests that the solution was on a branch searched by one of the extra CPUs. Two things to remember: 1) run the training set on fewer CPUs; 2) the first polarity searched is relevant.


In conclusion, I think that heat maps are a good way to figure out bad ranges for parameters, but it is hard to find exceptional values with them.

Update: Here is another heat map, which I used for SAT Competition 2011. This time I included the timeouts, and the heat map seems to provide a clearer picture of the interaction between the two coefficients tested. The x-axis is α, described above. On the y-axis are the values for γ, which defines, for an implication u → v, how much v's score influences u's score. There are 4 distinguishable regions on the heat map, from which good values for α and γ can be read off: 0 ≤ α < 0.4 and 0 ≤ γ < 0.3.

Tuesday, February 22, 2011

Blocked Clause Elimination

I just finished implementing a technique called Blocked Clause Elimination (BCE). The idea: for a literal l and a clause C containing l, if for every clause C' containing ¬l the resolvent R = C ⊗l C' is a tautology (where ⊗l is the resolution operator on literal l), then the clause C can be eliminated. The removed clause C is said to be blocked on the literal l. You can check Mate Soos' post on the same topic.
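The check itself is simple; a sketch (again an illustrative C++ fragment with an assumed clause representation, not STRUCTure's Java code; it also assumes neither input clause is itself a tautology):

#include <set>
#include <vector>

// Literals are DIMACS-style ints (+v / -v); a clause is a set of literals.
typedef std::set<int> Clause;

// Is the resolvent of c1 (containing lit) and c2 (containing -lit) on lit a
// tautology? With non-tautological inputs it is enough to look for a
// complementary pair with one literal taken from each clause.
bool resolvent_is_tautology(const Clause &c1, const Clause &c2, int lit) {
    for (int a : c1)
        if (a != lit && c2.count(-a)) return true;
    return false;
}

// C (an element of formula) is blocked on lit if every resolution on lit with
// a clause containing -lit yields a tautology; a blocked clause can be removed.
bool is_blocked(const Clause &c, int lit, const std::vector<Clause> &formula) {
    for (const Clause &other : formula) {
        if (&other == &c || !other.count(-lit)) continue;
        if (!resolvent_is_tautology(c, other, lit)) return false;
    }
    return true;
}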

My implementation is pretty fast since it doesn't run BCE to completion; it iterates over the literals and tries to find clauses blocked on each literal. It also doesn't handle dependent variables (i.e. variables that appear in exactly one XOR clause) or pure literals (literals that appear with a single phase). The graph below compares the performance of BCE when run at different moments:

  • begin - before the first round of reasoning. Should speed up the first node evaluation.
  • end - after the first round of reasoning (just before branching). Should find more blocked clauses.
  • begin_and_end - the previous two combined
  • always - at every node of the search tree. Should discover new blocked clauses created by propagating the branching literal.
  • never - disabled

I expected begin_and_end to be at least as good as end, but that was not the case. It seems that doing BCE before the first node evaluation slows down the search on instances from the AProVE class.

BCE is stronger than pure literal and dependent variable elimination, but the latter two remove most of BCE's benefits. BCE run after the simplifications is still able to find many clauses that were not already eliminated by other techniques, and this is why end shows the best performance in the graph above.

Wednesday, February 16, 2011

XOR gates

I finally got around to extracting XOR gates from CNF formulas. The first thing I wanted to check was how many of the literals come from XOR clauses on a few instances from the SAT09 competition. The table below suggests that handling XOR gates is promising.

Test name                     Literals in XOR gates    Literals in formula    Percent
aprove09/AProVE09-24.cnf      366734                   760056                 48.25
aprove09/AProVE09-22.cnf      65416                    138984                 47.07
aprove09/AProVE09-25.cnf      205992                   447524                 46.03
aprove09/AProVE09-13.cnf      38150                    94732                  40.27
aprove09/AProVE09-19.cnf      162024                   410098                 39.51
aprove09/AProVE09-12.cnf      145400                   371302                 39.16
aprove09/AProVE09-05.cnf      68630                    177140                 38.74
aprove09/AProVE09-06.cnf      361904                   942686                 38.39
aprove09/AProVE09-08.cnf      39716                    103498                 38.37
aprove09/AProVE09-07.cnf      39716                    103528                 38.36
aprove09/AProVE09-11.cnf      109548                   285640                 38.35
aprove09/AProVE09-10.cnf      312994                   884978                 35.37
aprove09/AProVE09-17.cnf      133408                   383728                 34.77
aprove09/AProVE09-15.cnf      372164                   1077424                34.54
aprove09/AProVE09-20.cnf      123232                   385046                 32.00
aprove09/AProVE09-03.cnf      46820                    718720                 6.51
aprove09/AProVE09-01.cnf      72476                    1127880                6.43
aprove09/AProVE09-21.cnf      19180                    305486                 6.28

Update: I finished implementing XOR gates and fixed most of the bugs. The change required changing the way I store clauses. Until now clauses were stored as: a b c ... 0. Now they are stored as (length, type) a b c ..., where length is the length of the clause, type is OR/XOR/NXOR, and (length, type) means that the length and the type are encoded together as a single integer. The most basic solver is working now, but except for equivalent literal renaming the other simplification algorithms are not functional yet. The graph below shows the impact of handling XOR gates.
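An illustrative sketch of such a packed header (the exact bit layout used by STRUCTure is an assumption here; the point is only that one integer can carry both fields):

#include <cstdint>

// Pack clause length and type into one 32-bit header; the two low bits hold the type.
enum ClauseType { OR = 0, XOR = 1, NXOR = 2 };

inline int32_t make_header(int length, ClauseType type) {
    return static_cast<int32_t>((length << 2) | type);
}
inline int header_length(int32_t header) { return header >> 2; }
inline ClauseType header_type(int32_t header) {
    return static_cast<ClauseType>(header & 3);
}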


The benefits of handling XOR gates are multiple. First, formulas are much shorter, because an XOR gate requires an exponential number of CNF clauses. This means that it is less costly to move instances around. Second, the branching heuristic is less fooled by almost identical clauses. Third, most other algorithms can safely ignore XOR clauses, for a bonus speedup. In conclusion, I'm glad I spent the last few days refactoring STRUCTure.

Update: When I enabled all reasoning algorithms (e.g. hyper binary resolution, subsumption, etc.) the performance difference changed dramatically, which leads me to believe that the initial evaluation contained a bug. The graph below shows how STRUCTure behaves with and without XOR gates. It seems that handling XOR gates helps on a few instances, but the instances from the AProVE class are better off without them.



Friday, February 11, 2011

Transitive closure

Given two implications a → b and b → c, by transitivity we can deduce another implication a → c. It is generally good to know many implications because they are used in binary and hyper binary resolution. If the implications are stored as a directed graph we can apply transitive closure on the graph to deduce many more implications. However, transitive closure is very expensive (O(|V|⋅(|V|+|E|)) time), memory hungry, and doesn't always give the expected benefits because the implications graph might already be very dense.

My idea is to do the transitive closure when the graph is sparse and many new binaries can be found. There are two ways to compute the density of a directed graph:

  • log(|E|)/log(|V|) which returns a value in interval (-∞, 2)
  • 2*|E|/(|V|⋅(|V|-1)) which returns a value in interval [0, 2]

I chose the former, though I plan to evaluate the latter in the near future. I tested a few values and found that 1.10 is the best threshold: if the graph is denser than 1.10, the transitive closure is too expensive and is skipped. The next figure shows the results for three thresholds: 1.05 (blue), 1.10 (green) and 1.20 (red). Transitive closure has some benefits, but too much of it can slow down the search.
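The test itself is a one-liner; a sketch using the first density measure and the experimentally found threshold (the guard for tiny graphs is my addition):

#include <cmath>
#include <cstddef>

// Run transitive closure only while the implications graph is sparse enough.
bool should_run_transitive_closure(std::size_t vertices, std::size_t edges,
                                   double threshold = 1.10) {
    if (vertices < 2 || edges == 0) return true;   // trivially sparse
    double density = std::log(static_cast<double>(edges)) /
                     std::log(static_cast<double>(vertices));
    return density <= threshold;
}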

Wednesday, February 9, 2011

STRUCTure

For my master's project I'm writing STRUCTure, a distributed, parallel SAT solver. The project can be downloaded from github.

STRUCTure is built on top of Constellation, a framework for distributed applications. In Constellation you have activities that perform your tasks. Each activity can spawn and communicate with other activities, provided their identifiers are known. There is no distributed shared memory, and an activity can run on any participating machine. With this in mind, let us look at the design of STRUCTure.

STRUCTure is a DPLL-type solver. It is not conflict-driven and doesn't learn new clauses. Let's say we have a SAT formula (usually in CNF), F. On F we can apply different reasoning steps such as:

  • equivalent literals.
  • contradictory literals.
  • transitivity.
  • pure literals.
  • unit propagation.
  • hyper binary resolution.
  • subsumption.
  • self-subsuming resolution (currently one of the clauses must be binary).

After the formula has been simplified there are three possible outcomes:
  • All literals were assigned and thus F is satisfiable.
  • A contradiction (e.g. F contains an empty clause) was found and thus F is unsatisfiable.
  • No solution was found and thus F is unknown.

If the formula F is unknown we can pick a literal a and branch on it, i.e. we try to solve the child formulas F- = {-a} ∧ F and F+ = {a} ∧ F (a small sketch of this recursion follows the list below).
  • if F- and F+ are both unsatisfiable then F is unsatisfiable.
  • else if F- or F+ is satisfiable then F is satisfiable, with the solution given by the satisfied child.
  • else F is unknown (sometimes, e.g. when the maximum depth is reached, we just return unknown without branching).
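A minimal sketch of this control flow on a plain CNF (STRUCTure distributes the two branches over Constellation activities and applies the full list of reasonings above; here the simplification is just clause reduction and the branching literal choice is naive):

#include <vector>

enum Result { SATISFIABLE, UNSATISFIABLE, UNKNOWN };

// Assume `lit` is true: drop satisfied clauses and remove -lit from the rest.
std::vector<std::vector<int>> assume(const std::vector<std::vector<int>> &f, int lit) {
    std::vector<std::vector<int>> g;
    for (const std::vector<int> &clause : f) {
        std::vector<int> reduced;
        bool satisfied = false;
        for (int l : clause) {
            if (l == lit) { satisfied = true; break; }
            if (l != -lit) reduced.push_back(l);
        }
        if (!satisfied) g.push_back(reduced);
    }
    return g;
}

Result solve(const std::vector<std::vector<int>> &f, int depth, int max_depth) {
    if (f.empty()) return SATISFIABLE;                        // every clause satisfied
    for (const std::vector<int> &c : f)
        if (c.empty()) return UNSATISFIABLE;                  // empty clause: contradiction
    if (depth >= max_depth) return UNKNOWN;                   // give up instead of branching

    int a = f.front().front();                                // naive branching literal
    Result minus = solve(assume(f, -a), depth + 1, max_depth);
    if (minus == SATISFIABLE) return SATISFIABLE;             // F- satisfiable => F satisfiable
    Result plus = solve(assume(f, a), depth + 1, max_depth);
    if (plus == SATISFIABLE) return SATISFIABLE;              // F+ satisfiable => F satisfiable
    if (minus == UNSATISFIABLE && plus == UNSATISFIABLE) return UNSATISFIABLE;
    return UNKNOWN;                                           // at least one child is unknown
}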
The performance of STRUCTure can be seen in the next image. The program was run on DAS-4 against SAT09 competition medium-level instances with a time limit of 600s on 8 cores.


Wednesday, January 26, 2011

Testing my new blog!

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer suscipit feugiat nibh sed hendrerit. Nulla feugiat aliquet nibh, tristique aliquam lacus mattis id. Mauris semper, elit sed facilisis sodales, purus neque bibendum turpis, quis venenatis metus risus non metus. Aliquam erat volutpat. Curabitur fermentum ligula nec tellus auctor eu mattis lacus varius. Duis euismod turpis sed dui ultricies ut varius sem sollicitudin. Donec ac erat leo, at condimentum est. In hac habitasse platea dictumst. Curabitur pellentesque auctor ipsum. Cras sagittis urna in justo aliquet placerat sodales dui tristique. Curabitur et leo non sapien egestas accumsan. Suspendisse porta quam in arcu faucibus ac facilisis libero congue. Nunc odio turpis, faucibus non aliquam id, blandit vel enim. Proin sed risus eget sem rhoncus hendrerit. Nulla purus turpis, ultricies vel semper nec, fermentum sit amet est. Donec quis ante dui. Aenean aliquam, nulla eu pharetra bibendum, nisi dui varius diam, eget fermentum nunc orci vitae nunc. Aenean volutpat mi non lacus pulvinar rutrum. Sed ut leo et est placerat tempus vestibulum sit amet neque.

Is LaTeX working $\sqrt{x^2}=|x|$ or \(\sqrt{x^2}=|\pi|\) or $\LaTeX$?