I have release today another sat solver. Gasca is a pure CDCL (Conflict-Driven Clause-Learning) SAT solver written in Go. It was written before I joined Google in Oct. 2011 as a very short project to understand some alternatives to the ideas in my master thesis. Some pieces of code are inspired from Minisat or Cryptominisat, though Gasca is not a port to Go of neither of them.
Go ahead, clone my repository and try it out:
$ hg clone ssh://hg@bitbucket.org/brtzsnr/gasca
I must say that I found Go very refreshing after writing my master thesis in Java. Go is not comparable to C when it comes to controlling every little aspect of the performance, so Gasca will never stand a chance against top SAT solvers. However, I felt that writing programs in Go is a breeze.
Next Thought
A blog about my CS adventures.
Monday, January 16, 2012
Thursday, June 16, 2011
How fast can you sort 50,000,000 normal distributed doubles?
This is a post about a contest on CodeGolf.StackExchange.
The problem: you are given 50,000,000 real numbers (IEEE 794 doubles) normal distributed with mean 0 and standard deviation 1. Sort them as fast as possible. The target machine is a quad core cpu and with 4GB of RAM (thanks to static_rtti for doing the measurements).
The idea is to see if the distribution can be taken into account to improve sorting speed.
tl;dr: Not faster than sorting any 50,000,000 real numbers.
Solution 1 uses std::sort from C++ STL. Simple solution, but not very impressive: 6.425s.
Solution 2 uses tbb:parallel_sort from Intel's Thread Building Blocks. Amazingly simple parallel solution: 2.958s. I will look into TBB next time when I'll write multithreaded programs.
Solution 3 uses Java7's Fork and Join for parallelization. Surprisingly, it parallelizes better than the TBB solution. Running time is 3.411s.
Solution 4 uses Least Significant Digit Radix Sort. This solution is described by Michael Herf here. The solution is easy to understand: stable sort mantissa, then exponent, then sign. Radix sort is very cache unfriendly and hard to parallelize (at least this variant). As an improvement, this implementation uses radix sort to partially sort the data and then insertion sort finalize sorting. Insertion sort is cache friendly and almost linear if elements are partially sorted. A similar trick can be used with qsort: don't qsort ranges shorter than 9 elements, and use a final insertion sort to fix out of order elements. This solution was the fastest sequential solution : 1.459s. An attempt to parallelize it is here, but no real speedup was obtained.
Math
Up to now no solution uses the fact that the elements were normally distributed. Let's see what this means.
For normal distribution with mean $\mu = 0$ and stddev $\sigma = 1$ we have the cumulative distribution function:
The problem: you are given 50,000,000 real numbers (IEEE 794 doubles) normal distributed with mean 0 and standard deviation 1. Sort them as fast as possible. The target machine is a quad core cpu and with 4GB of RAM (thanks to static_rtti for doing the measurements).
The idea is to see if the distribution can be taken into account to improve sorting speed.
tl;dr: Not faster than sorting any 50,000,000 real numbers.
Solution 1 uses std::sort from C++ STL. Simple solution, but not very impressive: 6.425s.
Solution 2 uses tbb:parallel_sort from Intel's Thread Building Blocks. Amazingly simple parallel solution: 2.958s. I will look into TBB next time when I'll write multithreaded programs.
Solution 3 uses Java7's Fork and Join for parallelization. Surprisingly, it parallelizes better than the TBB solution. Running time is 3.411s.
Solution 4 uses Least Significant Digit Radix Sort. This solution is described by Michael Herf here. The solution is easy to understand: stable sort mantissa, then exponent, then sign. Radix sort is very cache unfriendly and hard to parallelize (at least this variant). As an improvement, this implementation uses radix sort to partially sort the data and then insertion sort finalize sorting. Insertion sort is cache friendly and almost linear if elements are partially sorted. A similar trick can be used with qsort: don't qsort ranges shorter than 9 elements, and use a final insertion sort to fix out of order elements. This solution was the fastest sequential solution : 1.459s. An attempt to parallelize it is here, but no real speedup was obtained.
Math
Up to now no solution uses the fact that the elements were normally distributed. Let's see what this means.
For normal distribution with mean $\mu = 0$ and stddev $\sigma = 1$ we have the cumulative distribution function:
$cdf(x) = P(X < x) = \frac{1}{2}\left[1 + erf\left(\frac{x}{\sqrt{2}}\right)\right]$
where the error function, $erf$, is defined as:$erf(x) = \frac{2}{\sqrt{\pi}}\int_0^x{e^{-t^2}\mathrm{d}t}$
$erf$ and $cdf$ can be calculated only by approximation (see for example Handbook of http://people.math.sfu.ca/~cbm/aands/), but most approximations, if computed as given in the book, involve exponentiation which is slow. However, polynomials can be used to approximate the cdf. To compute polynomial one can use the following little octave program.
octave:1> output_precision(20)octave:2> x = 0:0.0001:6octave:3> [p] = polyfit(x, stdnormal_cdf(x), 7)p =-6.1477472013548900738e-05 1.4979503290953361146e-03 -1.4572377330342150409e-02 7.0142027233260337282e-02 -1.5795386985864678930e-01 5.6651279954993555288e-02 3.8484595025230294851e-01 5.0081970774936157564e-01octave:4> min(polyval(p, x) - stdnormal_cdf(x))ans = -8.2027858615862925262e-04octave:5> max(polyval(p, x) - stdnormal_cdf(x))ans = 8.2027858615862925262e-04
The source code can be checked here.
What's the big deal with the $cdf$ function?
- In the settings of our problem (50,000,000 normal distributed reals) $50,000,000 \cdot cdf(x)$ approximates $x$'s position in the sorted array.
- $cdf^{-1}$ can be used to divide the input stream into equal sized chunks. Quick Sort can be enhanced to have $O(N\log N)$ worst case scenario.
Solution 5 (and here) uses $cdf$ to put elements in the input array in their position in the final array. The solution is a straight forward implementation of the math discussed above and has true $O(N)$ complexity. The running time is 2.470s which is twice as slow as the radix sort solution. Much of the penalty came from cache misses and computation of $cdf$.
Solution 6 is the WINNER solution with only 1.188s. The idea: divide the input array into 512 chunks using the inverse $cdf$ (so all elements in chunk $i$ are smaller than all elements in chunk $i+1$) and sort each chunk individually. It is a straight forward to parallelize: sort different chunks on different threads.
Conclusions
- Does knowing the distribution helps parallelization?
- Sequential: No. The fastest solution uses the representation of doubles to achieve linear time.
- Multi-threaded: Maybe. The normal distribution was used to divided the input stream into smaller chunks which were sorted on multiple cores. However, it may be that the speedup is because of sorting and not because of distribution.
- Is it worth it?
- No. Using 4 cores to gain 1.237 speedup is definitely a waste of resources. I think most of the penalty comes from extensive use of memory which creates many cache misses and fake sharing.
Labels:
cg.se,
parallelization,
sorting
Thursday, March 10, 2011
Tale about binaries
Until recently STRUCTure handled binaries separetly using the implications graph. The problem with this scheme is that binaries need special attention which complicates the design and the implementation. I decided to give binaries a less special status and keep them both in the watch lists and implications graph. Needless to say, the code was greatly simplified.
The upside. The reasoning algorithms now work with binary clauses. Blocked clause elimination which was previously ignoring all variables that appeared in binaries can now eliminate even more clauses. Trivial necessary assignments should be detected by watch lists and the rest of them by hyper binary resolution.
The downside. Branching heuristic was greatly affected because the literals propagations were previously approximated using the topological sort of the implication graph. However I can rebuild the implications graph, so there will only be a small speed penalty.
The upside. The reasoning algorithms now work with binary clauses. Blocked clause elimination which was previously ignoring all variables that appeared in binaries can now eliminate even more clauses. Trivial necessary assignments should be detected by watch lists and the rest of them by hyper binary resolution.
The downside. Branching heuristic was greatly affected because the literals propagations were previously approximated using the topological sort of the implication graph. However I can rebuild the implications graph, so there will only be a small speed penalty.
Labels:
binaries,
implication-graph,
sat
Sunday, March 6, 2011
Instances solved during a run.
This morning I though of checking the number of variables in subinstances solved during a typical run. I created a histogram below (note log scale on both axis). Over 1.000.000 instances had 3 variables. That's a lot of overhead considering that an instance with very few variables is easy to solve by checking all possible assignments. In the picture below created represents the instances created and solved represents the instances created but not solved by the simplification procedure.
Nevertheless switching to another algorithm doesn't fix the root problem which I can't understand yet. My guess is that lookahead solvers are not that good on easy instances with many variables that appear in relatively few clauses. What I want to do next is to add another simplification procedure to eliminate these variables like variable elimination in SatElite, but simpler and at every branching node.
At some point I added a simple backtracking algorithm to deal away with small instances, however it didn't influence the results that much. Probably I should test it again. I could use, for example, sat4j to solve small instances, but since probably sat4j is more efficient than STRUCTure the results will be biased.
Nevertheless switching to another algorithm doesn't fix the root problem which I can't understand yet. My guess is that lookahead solvers are not that good on easy instances with many variables that appear in relatively few clauses. What I want to do next is to add another simplification procedure to eliminate these variables like variable elimination in SatElite, but simpler and at every branching node.
Wednesday, March 2, 2011
Dependent Variables
Let's talk a little about XOR clauses. If you have a variable that appears in a single XOR we can remove that clause because it always can be satisfied by setting the proper value of the chosen variable. Let's call this variable dependent. I was reading Mate Soos's paper on Gaussian Elimination and he suggested that we can apply a similar technique for variables that appear in two XOR clauses. These variables, called mutexes, are very common when splitting long XOR clauses before transforming them in CNF clauses.
For example: \[
\begin{align*}
a+b+c+d+e+f & = 1 \\
\text{can be split into} \\
a+b+c+z & = 0 \\
z+d+e+f & = 1
\end{align*}
\] with $z$ being a mutex.
I think that dependent variables can be generalized further for all variables that appears only in XOR clauses. Let $u$ be such a variable. Since a XOR clause is actually a equation in $G_2$ we use the shortest XOR clause containing $u$ to knock out $u$ from all other XOR clauses containing $u$ as in Gaussian Elimination. Now u appears in a single XOR clause and therefore it can be removed.together with the clause. I wonder if anybody else thought of this.
Update: Here is a graph showing the performance improvement of Dependent Variable Elimination. On one instance the speed improvement was almost 200 fold. That's impressive. Of course, the instances without XOR gates were not affected.
For example: \[
\begin{align*}
a+b+c+d+e+f & = 1 \\
\text{can be split into} \\
a+b+c+z & = 0 \\
z+d+e+f & = 1
\end{align*}
\] with $z$ being a mutex.
I think that dependent variables can be generalized further for all variables that appears only in XOR clauses. Let $u$ be such a variable. Since a XOR clause is actually a equation in $G_2$ we use the shortest XOR clause containing $u$ to knock out $u$ from all other XOR clauses containing $u$ as in Gaussian Elimination. Now u appears in a single XOR clause and therefore it can be removed.together with the clause. I wonder if anybody else thought of this.
Update: Here is a graph showing the performance improvement of Dependent Variable Elimination. On one instance the speed improvement was almost 200 fold. That's impressive. Of course, the instances without XOR gates were not affected.
SAT Competition 2011
I've submitted the solver just now for SAT Competition 2011. The submission can be downloaded from here and the sources from here (note the git tag sc11). There is a huge gap between STRUCTure and CryptoMiniSat 2.6.0. My goal is not to win the competition (which could happen only if Java HotSpot developers proved that P = NP), but to get a complete objective evaluation of STRUCTure. IIRC look ahead solvers are somehow better than cdcl solvers on random instances and on these instances STRUCTure seems to perform better than CryptoMiniSat.
Update: As it turns out I had a bug that rendered extraction of XOR gates ineffective on my SAT Competition 2011 submission. No wonder why adjusting the coefficient for XOR gates in branching heuristic didn't influence the results.
Update: As it turns out I had a bug that rendered extraction of XOR gates ineffective on my SAT Competition 2011 submission. No wonder why adjusting the coefficient for XOR gates in branching heuristic didn't influence the results.
Thursday, February 24, 2011
Improving Branching. Part. II: Polarity
In previous post I concluded that variable's polarity searched first affects the solving time of some instances (mostly of satisfiable instances of course). With this in mind I ran four tests with the branching heuristic modified as follows
- initial - branch first on positive polarity
- plus - branch first on polarity with the highest score
- minus - branch first on polarity with the lowest score
- random - branch first on a random polarity
The results are in the graph below. Needless to say that random rocks. plus and minus performed similarly. By choosing a random polarity STRUCTure can now solve three more instances from the AProVE class (I am developing negative emotional feelings towards them).
Subscribe to:
Posts (Atom)