9 de septiembre de 2016

Adding a New Primitive to the Innards of Racket

Introduction

The basic functions of Racket are implemented in C and are called primitives, for example cons, car, +, *, and many more (they are about 1500). In this article we will see how to add a new primitive.

In general, it’s not usual to add a new primitive, but this seems like an interesting exercise to understand how the other Racket primitives are implemented. Moreover, the new primitive is useful for the secret hidden type system of the optimizer of Racket (which is unrelated to the type system of TypedRacket).

The new primitive

The new primitive will be
    (strict-true? v)
Returns #t if v is #t and otherwise returns #f.

In general, it is not good Racket code style to compare something with #t. Usually the code should only see that the value is not #f. As it’s a bad style primitive, I will hide it to make it inaccessible.

(Although it is hidden, the internal test may call it by importing it from #%kernel. Anyway, to make some examples for the article, I will export it renamed as ugly-strict-true? that clearly express my opinion.)

The new primitive is similar to
    (not v)
Returns #t if v is #f and returns #f otherwise.

So the plan is to see how not is implemented, and copy its implementation.
You can look at the details and the complete code in github.

Disable cstartup.inc

The first step to add a primitive is to make Racket ignore the cstartup.inc file that is actually a .zo file with the bytecode encoded in a strange way. The content of the .zo files depends on the position of each primitive, so when you add or remove a primitive they become obsolete. In the same step we change the version number so all the .zo files are recalculated. (When the file cstartup.inc disabled everything is slower, so remember to enable it after finishing the changes.)

To disable this cstarup.inc file we have to edit the file schminc.h and put
    #define USE_COMPILED_STARTUP 0

The version number is changed in schminc.h and base/info.rkt

Adding the primitive

First you have to write the C implementation, which in this case goes in the file bool.c. There is nothing surprising here. We just look if the first argument is scheme_true, that is the internal name of the variable that has the value #t.
    static Scheme_Object *strict_true_p_prim(int argc, Scheme_Object *argv[])
    {
      return (SAME_OBJ(argv[0], scheme_true) ? scheme_true: scheme_false);
    }

Then we create a primitive to allow us to use the function in defined in C in the Racket code, and we add it the environment to be recognized for compiling the new programs in Racket.
    p = scheme_make_folding_prim(strict_true_p_prim, "strict-true?", 1, 1, 1);
    SCHEME_PRIM_PROC_FLAGS(p) | = scheme_intern_prim_opt_flags (SCHEME_PRIM_IS_OMITABLE);
    scheme_add_global_constant( "strict-true?", p, env);

This definition expresses many details about the primitive. Luckily it is like not and we can copy it.

Some of the following examples seem ridiculous, but after the original code suffers several steps Racket expansion, macros and inlining, sometimes the result looks like this.

It accepts a single argument

So we don’t need to check the number of arguments. So
    (strict-true? 1 2)
fails at run time.

It returns a single value.

This enables some optimizations, for example
    (lambda (x) (car (cons (strict-true? x) x)))
    ==>
    (lambda (x) (strict-true? x))
instead of
    ==>
    (lambda (x) (values (strict-true? x)))
in which values is used to check that the result has exactly one value.

And it also marks
    (lambda (x)
      (display x)
      (strict-true? x))
as a function that returns a single value, which helps the interpreter/JIT to avoid generating code for the case of multiple values.

We mark it as folding,

So during optimization it can participate in the constant folding, in expressions like (strict-true? #t) ==> #t.

Or more complicated cases, that use the interaction with other optimizations, such as
    (lamba (v)
      (when (vector? v)
      (strict-true? (box? v))))
    ==>
    (lamba (v)
      (when (vector? v)
      (strict-true? #f)))
    ==>
    (lambda (v)
      (when (vector? v)
      #f)

We mark it as OMITABLE,

So will automatically have some reductions as
    (lambda (x)
      (strict-true? x)
      (display x))
    ==>
    (lambda (x)
      (display x))

It’s not marked as UNARY_INLINED

The attentive reader will note that it is not marked as UNARY_INLINED, that is another flag that not has, but we'll add it soon.

Enable cstartup.inc again

Now we can recalculate the file cstartup.inc using
    make cstartup

and re-enable it modifiying the file schminc.h
    #define USE_COMPILED_STARTUP 1

While I'm preparing new code, I like to keep these steps in different commits in case I have to make a rebase and recalculate cstartup.c, but the final version is better to squash them into one.

JIT

If the primitive is simple, it is possible to add it to the list of primitives that are processed by the JIT so that the running code is compiled to machine code rather than interpreted from the bytecode. Racket uses GNU lightning with several conventions and auxiliary functions to facilitate writing code.

First we have to add a flag in the definition of strict-true? in bool.c
    p = scheme_make_folding_prim(strict_true_p_prim, "strict-true?", 1, 1, 1);
    SCHEME_PRIM_PROC_FLAGS(p) |= scheme_intern_prim_opt_flags(SCHEME_PRIM_IS_UNARY_INLINED
                                                              | SCHEME_PRIM_IS_OMITABLE);
    scheme_add_global_constant("strict-true?", p, env);
And then we add in jitinline.c
    else if (IS_NAMED_PRIM(rator, "strict-true?")) {
      generate_inlined_constant_test(jitter, app, scheme_true, NULL, for_branch, branch_short, dest);
      return 1;
    }

This case was easy because we already have the function generate_inlined_constant_test that generates the JIT code to check if the value is equal to the chosen constant, in this case scheme_true. But generally writing the implementation of a primitive for the JIT is very very painful.

To see the effect of this change, we can compare how long it takes the following program to run before and after enabling the JIT version of strict-true?. We compare it with the equivalent version for not.that in both case use the JITed code. (We use a mutable variable x, so that the optimizer is not sure about the value of x and it have to check the value in each cycle, and there is no risk that a clever optimization removes the check.)

    #lang racket / base
   
    (define x 'something-else)
    (set! x x)
   
    (time
      (for/or ([i (in-range 1000000000)])
        (not x)))
   
    (time
      (for/or ([i (in-range 1000000000)])
        (ugly-strict-true? x)))

cpu time (ms)
Before:
After:
not
4109
4109
ugly-strict-true?
9031
4110

These times include time of the the loop, if we could deduct its time the difference would be even more remarkable. Usually, I like to repeat the measurement several times to get an informal version of a statistics, but in this case each test takes about 5-10 seconds and the improvement is x2 so it’s not necessary to repeat it to be sure.

Global variable

With the changes that we have already made the primitive is completely ready. Now we can add some specific optimizations for strict-true?. Some optimizations can recognize strict-true? by name, but for others it is necessary to put strict-true? in a global variable so that it can be recognize or use from other files.

In schpriv.h we add
    extern Scheme_Object *scheme_strict_true_p_proc;

There is no official rule for the name of the global variables, but generally if it is a primitive the name is scheme_<name>_proc, where in the name of the - changes to _ and ? to _p and things like that, with a few exceptions.

In the definition in bool.c we have to save the primitive in the variable and register the variable with the garbage collector.
    READ_ONLY Scheme_Object *scheme_strict_true_p_proc;
    REGISTER_SO(scheme_strict_true_p_proc);
    p = scheme_make_folding_prim(strict_true_p_prim, "strict-true?", 1, 1, 1);
    SCHEME_PRIM_PROC_FLAGS(p) |= scheme_intern_prim_opt_flags (SCHEME_PRIM_IS_UNARY_INLINED
                                                               | SCHEME_PRIM_IS_OMITABLE);
    scheme_strict_true_p_proc = p;
    scheme_add_global_constant("strict-true?", p, env);

Optimizations

There are several interesting optimizations that we can add now. To make this article short I will not put the code here and I’ll left as an exercise for the interested reader to search for the details in the commits. All the changes are in functions in optimize.c.

Relevant predicate

We add strict-true? to the list of relevant predicates, which are the ones that are tracked by the Racket optimizer. These predicates are almost disjoint (I don’t know the technical term), so this automatically enables optimizations such as
    (strict-true? (vector 1 2 3)) ==> #f

and also
    (vector? (if x
               (begin (display x) #t)
               #t))
    ==>
    (begin
      (if x
        (begin (display x) #t)
        #t)
      #f)

Replace a variable by a constant

If the optimizer deduces that a variable verifies strict-true? then it replaces it by #t.
    (lambda (x)
      (when (strict-true? x)
        x))
    ==>
    (lambda (x)
      (when (strict-true? x)
        #t))

Add boolean? as a relevant predicate

We can also add to the primitive boolean? to the relevant predicates. The problem is that boolean? includes strict-true? and not (not is a predicate, is more evident if you write it as not? or strict-false? but for historical reasons we call it not). So we have to add code to handle unions and intersections and see if a variable that satisfies one predicate can satisfy a different predicate. Whith this change we have reductions like:
    (lambda (f)
      (boolean? (if (f)
              (begin (display 1) #t)
              (begin (display 2) #f)))
    ==>
    (lambda (f)
      (if (f)
         (begin (display 1) #t)
         (begin (display 2) #f))
      #t)

Another example that seems to be most interesting is
    (lambda (x)
      (when (boolean? x)
        (if x (box x) (vector x))))
    ==>
    (lambda (x)
      (when (boolean? x)
        (If x (box #t) (vector #f))))

Predicates are Boolean

We can also mark the result of many common predicates as a Boolean, then we have
    (lambda (v)
      (let ([x (vector? v)])
        (if x (box x) (vector x))))
    ==>
    (lambda (v)
      (let ([x (vector? v)])
        (if x (box #t) (vector #f))))

Optimization for or

The macro or is a very usual and interesting case. The expansion (without optimization) is
    (lambda (x)
      (or (symbol? x) 7)
    ==>
    (lambda (x)
      (let ([temp (symbol? x)])
        (if temp temp 7)))

With the new enhancements, the optimizer realizes that temp is Boolean and optimizes
    ==>
    (lambda (x)
      (let ([temp (symbol? x)])
        (if temp #t 7)))

It would be good to move the expression that defines temp to the place of use and get a result like
    ==>
    (lambda (x)
      (if (symbol? x) #t 7))

but the optimizer only moves expressions that define variables used once, and the optimizer does not realize that after the replacement of temp by #t, the variable temp appears only once.

Luckily, there was an ad hoc optimization for expressions like the produced by or, but it was applied only when they were in a Boolean context, such as (if (or ...) ... ...) or (not (or ...)). With some changes we can extend to (or <boolean?> ...).

Conclusion



With these changes, we would have a new primitive in Racket. It is not very useful for use directly, but it helps implement some optimizations for usual idioms, particularly for expressions that combine or and an expression with a Boolean result.

You can look at the details and the complete code in github.