16 de junio de 2015

Optimizing "Send More Money" in Racket

The problem

I read some articles that discussed the problem of finding by brute force a solution with distinct digits that satisfy:
  SEND
+ MORE
------
 MONEY
Let's see how to solve this in Racket, following the idea of using just brute force but completely ignoring the monads. The idea is to make small changes in the program and see how they affect the runtime.

Using a few macros, we can make the program go a x12 or x14 times faster using only small modifications the program (for a loose definition of "small".)

(In this article, I will highlight the macros that we add using m:something in the name. In particular, because they are replacing build-in functions of Racket. In general, the macros are not highlighted and they are just mixed with functions ...)

In the examples I will compare the execution times on my desktop and my netbook. In each case, we show the average of 5 runs the program.

Literal translation of Haskel

The first version is almost an exact copy of the version in Haskel, changing the list monad by a for. (This is very similar to the version proposed by minikomi in HN. In my first version, I did not remember the existence of the remove* function in Racket I had rewrote it.)

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
literal 2648.8 58.8 11051.0 174.6

We draw the conclusion that my netbook is 4 times slower than my desktop :). We could compare the execution times in other languages, but it is clear that this is greatly influenced by the speed of the computer.

#lang racket
(define digits '(0 1 2 3 4 5 6 7 8 9))

(define (to-number digits)
  (foldl {lambda (x y) (+ (* 10 y) x)} 0 digits))

(time (for*/list ([s (remove* (list 0) digits)]
                  [e (remove* (list s) digits)]
                  [n (remove* (list s e) digits)]
                  [d (remove* (list s e n) digits)]
                  [send (in-value (to-number (list s e n d)))]
                  [m (remove* (list 0 s e n d) digits)]
                  [o (remove* (list s e n d m) digits)]
                  [r (remove* (list s e n d m o) digits)]
                  [more (in-value (to-number (list m o r e)))]
                  [y (remove* (list s e n d m o r) digits)]
                  [money (in-value (to-number (list m o n e y)))]
                  #:when (= (+ send more) money))
        (list send more money)))

Using in-list to indicate that they are lists

Unlike in Haskel, in Racket the variables they have no fixed type and there is no type inference. (I'd like to see how the previous program works in Typed Racket.)

So it's convenient to use in-list to give Racket a "hint" that the result of remove* is a list. Actually in-list it is not a "hint", it has the instructions to iterate a list efficiently. Without this "hint", for uses the generic instructions for sequences and they are a little slower.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
literal 2648.8 58.8 11051.0 174.6
in-list 2377.4 15.4 9185.4 62.6

The runtime decreased 10%-20%. The GC time decreased by 60%, probably because it's not necessary to create the intermediate structures of the generic sequences that are used for the cicles.

Note that in absolute value, the CPU time dropped much more than that lowered the GC time. A very small part of the time is lost in the GC.

(time (for*/list ([s (in-list (remove* (list 0) digits))]
                  [e (in-list (remove* (list s) digits))]
                  [n (in-list (remove* (list s e) digits))]
                  [d (in-list (remove* (list s e n) digits))]
                  [send (in-value (to-number (list s e n d)))]
                  [m (in-list (remove* (list 0 s e n d) digits))]
                  [o (in-list (remove* (list s e n d m) digits))]
                  [r (in-list (remove* (list s e n d m o) digits))]
                  [more (in-value (to-number (list m o r e)))]
                  [y (in-list (remove* (list s e n d m o r) digits))]
                  [money (in-value (to-number (list m o n e y)))]
                  #:when (= (+ send more) money))
        (list send more money)))

Replace equal? with =

The previous code uses many equal?s. The worst part is that they are invisible equal?s. It turns out that (remove* proc list) is equivalent to (remove proc list equal?). So we can change it for (remove list proc =). The problem is that equal? can call arbitrary code, for example to check if two structs are equal?. Hence the code that runs equal? is slower and difficult to optimize. Instead, = always calls a simple comparison.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
in-list 2377.4 15.4 9185.4 62.6
= 1507.0 28.2 5559.8 62.8

The run time decreased by 40%. The GC time increaded slightly. It turns out that using equal? is much slower than =.

(time (for*/list ([s (in-list (remove* (list 0) digits =))]
                  [e (in-list (remove* (list s) digits =))]
                  [n (in-list (remove* (list s e) digits =))]
                  [d (in-list (remove* (list s e n) digits =))]
                  [send (in-value (to-number (list s e n d)))]
                  [m (in-list (remove* (list 0 s e n d) digits =))]
                  [o (in-list (remove* (list s e n d m) digits =))]
                  [r (in-list (remove* (list s e n d m o) digits =))]
                  [more (in-value (to-number (list m o r e)))]
                  [y (in-list (remove* (list s e n d m o r) digits =))]
                  [money (in-value (to-number (list m o n e y)))]
                  #:when (= (+ send more) money))
        (list send more money)))

Changing remove* by #:when

The problem is that in each pass remove* makes a new list without the numbers that we don't want to repeat. This creates many many many lists. The GC hardly spends any time because when it runs almost all of these lists already have no references and they simply disappear. However, the lists are created on different memory pages and that makes it slow to find them.

Then we can replace the remove* with #:when. Instead of removing the items from the list before choosing the number, we check that they are not repeated after choosing them.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
= 1507.0 28.2 5559.8 62.8
#:when 1766.0 40.8 6071.8 187.4
The run time increased 10%-20% :(. The GC time increased x2-x3 :(2.

I guess this is because there are more intermediate lists. I still don't understand the details of this. Although the time difference is in the wrong direction, all the functions used here are simpler, so it will be easier to apply the next transformation.

(time (for*/list ([s (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x s)} (list 0)))
                  [e (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x e)} (list s)))
                  [n (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x n)} (list s e)))
                  [d (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x d)} (list s e n)))
                  [send (in-value (to-number (list s e n d)))]
                  [m (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x m)} (list 0 s e n d)))
                  [o (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x o)} (list s e n d m)))
                  [r (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x r)} (list s e n d m o)))
                  [more (in-value (to-number (list m o r e)))]
                  [y (in-list digits)]
                  #:when (not (ormap {lambda (x) (= x y)} (list s e n d m o r)))
                  [money (in-value (to-number (list m o n e y)))]
                  #:when (= (+ send more) money))
        (list send more money)))

Using macros instead of higher-order functions

In the previous version there are many closures {lambda (x) (= x y)} that apparently are created in memory. However Racket has many tricks to transform them internally into normal functions or directly eliminate them.

In this case, the compiler inlines them. Actually, in this case the compiler inlines first ormap and then it inlines the {lambda ...}, so in the code that is executed these closures don't appear even as function calls because the code is already inlined.

The problems are caused by those (list ...) that just create many lists that are destroyed almost immediately. Then we can replace ormap (which is a function) with a new macro m:ormap.

(define-syntax m:ormap
  (syntax-rules (list)
    [(m:ormap proc (list n ...)) (let ([proc-val proc])
                                   (or (proc-val n) ...))]))

As m:ormap is a macro, it can do strange things, like spying the expressions in the parameters positions and see that the second is a explicit list with the pattern (list ...) and then apply directly the proc function to each of the elements, without creating the list.

Here we have a technical detail. We need to create an intermediate variable pro-val to hold the value of proc. If we use directly proc, the expression of proc would be repeated and then the code that generates proc would be executed several times. That can cause unexpected results. For example, let's compare the results when we use the incorrect version:

(define-syntax m:ormap/bad
  (syntax-rules (list)
    [(m:ormap proc (list n ...)) (or (proc n) ...)]))

(ormap (begin (display "*") {lambda (x) (display x) #f}) (list 1 2 3))
(m:ormap (begin (display "*") {lambda (x) (display x) #f}) (list 1 2 3))
(m:ormap/bad (begin (display "*") {lambda (x) (display x) #f}) (list 1 2 3))
(newline)

(ormap (let ([r (random 6)]) {lambda (x) (display r) #f}) (list 1 2 3))
(m:ormap (let ([r (random 6)]) {lambda (x) (display r) #f}) (list 1 2 3))
(m:ormap/bad (let ([r (random 6)]) {lambda (x) (display r) #f}) (list 1 2 3))
(newline)


In general, it is good that the macros are as intuitive as possible, because in Racket it's not usual not indicate in the name that it's a macro, so you have to try to write them without surprising behavior. In this case, it would be ideal that m:ormp behaves as similar as ormap as possible. It's almost like a function, only with the magic of spying inside lists. This is the reason that it's better if it evaluates the first parameter only once.

Another important detail is that if the function is simple, the Racket compiler can inline it. In this case the {lambda (x) (= x n)} are inlined. And as all the instances of proc-val are inlined, so the auxiliary variable disappears during compilation. Then
(ormap {lambda (x) (= x n)} (list s e))
is equal to
(or (= s n) (= e n))

Similarly, we replace foldl with m:foldl. The definition is a little more complicated. Moreover, rather than one intermediate variable it creates a lot of intermediate variables which fortunately disappear (similarly to the intermediary variable  proc-val that disappears in m:ormap).

(define-syntax m:foldl
  (syntax-rules (list)
    [(m:foldl proc ini (list)) ini]
    [(m:foldl proc ini (list x y ...)) (let ([proc-val proc])
                                         (m:foldl proc-val (proc-val x ini) (list y ...)))]))

Here is another technical problem if we use m:fold (which is a macro) inside to-number (which is a function). It turns out that to-number is inlined after the expansion of m:fold. So m:fold doesn't see the explicit list, but it sees only the argument of to-number. In general, you have to be careful because the macros that look like functions sometimes don't compose magically with the actual functions. Therefore we must also make a macro m:to-number.

(define-syntax-rule (m:to-number digits)
  (m:foldl {lambda (x y) (+ (* 10 y) x)} 0 digits))

Now the compiler first expands m:to-number
(m:to-number (list s e n d))
to
(m:foldl {lambda (x y) (+ (* 10 y) x)} 0 (list s e n d))
without breaking the explicit list and then m:fold sees the explicit list when it is expanded.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
#:when 1766.0 40.8 6071.8 187.4
macros 271.4 0.0 951.4 0.0

The run time decreased 85%, or an x6! The GC time dropped to 0. The GC time was small, but this is a good sign that no intermediate lists or closures are created.

The truth is that this use of macros is a little complicated, especially because you have to think carefully how they interact with functions. But one x6 is worth in tight loops and benchmarks. For most of the code, it's better to use functions, and be sure that they handle correctly all the corner cases.

It would be nice that this optimization is applied automatically. It looks like optimizing
(ormap display (list 1 2 3))
to
(ormap (display 1) (display 2) (display 3))
is not very difficult. But when the Racket compiler can act, ormap is already inlined and the structure that the optimizer sees is more complicated. This is one of the optimizations that I have on my wishlist and I think at some point it can be added to Racket.

(define-syntax m:ormap
  (syntax-rules (list)
    [(m:ormap p (list n ...)) (let ([p_ p])
                                (or (p_ n) ...))]))

(define-syntax m:foldl
  (syntax-rules (list)
    [(m:foldl p ini (list)) ini]
    [(m:foldl p ini (list x y ...)) (let ([p_ p])
                                      (m:foldl p_ (p_ x ini) (list y ...)))]))

(define-syntax-rule (m:to-number digits)
  (m:foldl {lambda (x y) (+ (* 10 y) x)} 0 digits))

(time (for*/list ([s (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x s)} (list 0)))
                  [e (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x e)} (list s)))
                  [n (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x n)} (list s e)))
                  [d (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x d)} (list s e n)))
                  [send (in-value (m:to-number (list s e n d)))]
                  [m (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x m)} (list 0 s e n d)))
                  [o (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x o)} (list s e n d m)))
                  [r (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x r)} (list s e n d m o)))
                  [more (in-value (m:to-number (list m o r e)))]
                  [y (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (= x y)} (list s e n d m o r)))
                  [money (in-value (m:to-number (list m o n e y)))]
                  #:when (= (+ send more) money))
        (list send more money)))

Using unsafe functions

This is a simple transformation and perhaps it could have been applied before. The problem is that it transforms code that raises errors into wrong code or a crashing program. So it's better to save this for extreme cases too. The idea is to replace all the operations with their unsafe version. Note that in-list checks first that the argument is really a list, and then uses the unsafe versions internally, so it does not seem necessary to create an unsafe-in-list. (It is also difficult to create this type of sequence generators. It's long enough for another whole article.)

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
macros 271.4 0.0 951.4 0.0
unsafe 234.0 0.0 833.2 0.0

The run time decreases only 15%. Not bad if we are sure that we didn't make a mistake that will crash the program.

(require racket/unsafe/ops)

(define-syntax-rule (m:unsafe-fx-to-number digits)
  (m:foldl {lambda (x y) (unsafe-fx+ (unsafe-fx* 10 y) x)} 0 digits))

(time (for*/list ([s (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x s)} (list 0)))
                  [e (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x e)} (list s)))
                  [n (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x n)} (list s e)))
                  [d (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x d)} (list s e n)))
                  [send (in-value (m:unsafe-fx-to-number (list s e n d)))]
                  [m (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x m)} (list 0 s e n d)))
                  [o (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x o)} (list s e n d m)))
                  [r (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x r)} (list s e n d m o)))
                  [more (in-value (m:unsafe-fx-to-number (list m o r e)))]
                  [y (in-list digits)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x y)} (list s e n d m o r)))
                  [money (in-value (m:unsafe-fx-to-number (list m o n e y)))]
                  #:when (unsafe-fx= (unsafe-fx+ send more) money))
        (list send more money)))

Using in-range instead of in-list

The idea is that instead of using (in-list digits) we can use (in-range 10). In the for the clausule (in-range 10) does not create a sequence, it that has instructions to iterate through the 10 numbers. So again no list nor sequence are created unnecessary. This seems faster because it is only a matter of adding numbers on the micro or the cache, rather than search for items of the list at any memory location.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
unsafe 234.0 0.0 833.2 0.0
in-range 231.0 0.0 764.2 0.0

The runtime dropped only 1% -10%, almost at the error level for the desktop computer. It is less than what I expected.

(time (for*/list ([s (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x s)} (list 0)))
                  [e (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x e)} (list s)))
                  [n (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x n)} (list s e)))
                  [d (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x d)} (list s e n)))
                  [send (in-value (m:unsafe-fx-to-number (list s e n d)))]
                  [m (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x m)} (list 0 s e n d)))
                  [o (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x o)} (list s e n d m)))
                  [r (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x r)} (list s e n d m o)))
                  [more (in-value (m:unsafe-fx-to-number (list m o r e)))]
                  [y (in-range 10)]
                  #:when (not (m:ormap {lambda (x) (unsafe-fx= x y)} (list s e n d m o r)))
                  [money (in-value (m:unsafe-fx-to-number (list m o n e y)))]
                  #:when (unsafe-fx= (unsafe-fx+ send more) money))
        (list send more money)))

Conclusions

Let's see all the run times again.

Average execution time (5 runs, in milliseconds):

CPU GC CPU GC
literal 2648.8 58.8 11051.0 174.6
in-list 2377.4 15.4 9185.4 62.6
= 1507.0 28.2 5559.8 62.8
#:when 1766.0 40.8 6071.8 187.4
macros 271.4 0.0 951.4 0.0
unsafe 234.0 0.0 833.2 0.0
in-range 231.0 0.0 764.2 0.0


To make them easier to compare, let's make two graphs. In a graph we draw the actual times. In the other we draw the crowed times with multipliers to make all of them visible at the same graphic. We divided the time of the netbook by 4 and multiply times of the GC by 10. (So the GC time in the netbook is multiplied by 10/4.) Notice that the GC time is very low.
Actual times (same scale)

Compressed times (different scales)

At each step we made small changes, which allowed us to reach a x12 or x14 times faster version of the program. The only tricky part was to make the macros m:ormap and m:fold that are reusable (and also to remember to make m:to-number). With these macros we can test all the possibilities without using lists and make the program faster.

The problem is that the new program is a bit more difficult to maintain, as it was evident when we had to make the macro m:to-number. So it's best to save special macros for tight loops and benchmarks. (Or for those cases where you really need a macro, of course.)

Extensions for m:ormap

To use m:ormap as a dropdown replacement of ormap in any case without worrying, we should fix some details:
  • It would be good to add error checks to the macro m:ormap. When several macros are composed, if each of them don't have good error checking, the errors are detected too late. At that time, the code has already suffered several transformations and error messages are very cryptic.
  • Better yet, it would be good for m:ormap to transform itself into ormap if it doesn't find an explicit list, so we can always apply it.
  • It would be nice that it also detects cases like '(), '(1 2 3), null, (cons x (list ...)), and if one is brave enough also (list* ...) and (append ...)
  • We should be careful with rare cases with lists whose elements have side effects, like
    (m:ormap {lambda (x) (display 1)} (list (display 2) (error 'error)) (values 3 4) (display 5))
    before generating an error it should show only 2 instead of 21 or 251 or ...
    I think it's possible to fix it, but it would make the macro much longer.
  • As a bonus, we can add some details like the use of the syntax-property of 'disappeared-use so DrRacket puts the arrow that points to list.