12

The implementation of Integer>>#factorial in Pharo is:

factorial
        "Answer the factorial of the receiver."

        self = 0 ifTrue: [^ 1].
        self > 0 ifTrue: [^ self * (self - 1) factorial].
        self error: 'Not valid for negative integers'

This a tail-recursive definition. However, I can evaluate 10000 factorial without error in the workspace.

Does Pharo perform tail-call optimisation in any circumstances, is it doing some other optimisation, or is it just using a really deep stack?

1
  • 3
    You could just put a breakpoint in the first ifTrue: case and just count the number of times the same method is on the stack... ;-) Commented May 7, 2016 at 14:54

4 Answers 4

8

It's a really deep stack. Or rather, no stack at all.

Pharo is a descendent of Squeak, which inherits its execution semantics directly from Smalltalk-80. There is no linear fixed-size stack, instead every method call creates a new MethodContext object which provides the space for arguments and temporary variables in each recursive call. It also points to the sending context (for later return) creating a linked list of contexts (which is displayed just like a stack in the debugger). Context objects are allocated on the heap just like any other object. That means call chains can be very deep, since all available memory can be used. You can inspect thisContext to see the currently active method context.

Allocating all these context objects is expensive. For speed, modern VMs (such as the Cog VM used in Pharo) do actually use a stack internally, which consists of linked pages, so it can be arbitrarily large as well. The context objects are only created on demand (e.g. while debugging) and refer to the hidden stack frames and vice versa. This machinery behind the scenes is quite complex, but fortunately hidden from the Smalltalk programmer.

1
  • So, Pharo's execution model is close to R6RS Scheme's? But somewhat in reverse: Scheme constructs continuations "forward" - what to execute next, whereas Pharo has a pointer to the previous context - go back there once done. Or am I missing something?
    – mobiuseng
    Commented May 13, 2016 at 10:46
8

There is no mystery in the execution model of Pharo. The recursive fragment

^ self * (self - 1) factorial

that happens inside the second ifTrue: compiles to the following sequence of bytecodes:

39 <70> self                  ; receiver of outer message *
40 <70> self                  ; receiver of inner message -
41 <76> pushConstant: 1       ; argument of self - 1
42 <B1> send: -               ; subtract
43 <D0> send: factorial       ; send factorial (nothing special here!) 
44 <B8> send: *               ; multiply
45 <7C> returnTop             ; return

Note that in line 43 nothing special happens. The code just sends factorial in the same way it would, had the selector been any other. In particular we can see that there is no special manipulation of the stack here.

This doesn't mean that there cannot be optimizations in the underlying native code. But that is a different discussion. It is the execution model the one that matters to the programmer because any optimization underneath bytecodes is meant to support this model at the conceptual level.

UPDATE

Interestingly, the non-recursive version

factorial2
  | f |
  f := 1.
  2 to: self do: [:i | f := f * i].
  ^f

is a little bit slower that the recursive one (Pharo). The reason must be that the overhead associated to increasing i is a little bit greater than the recursive send mechanism.

Here are the expressions I tried:

[25000 factorial] timeToRun
[25000 factorial2] timeToRun
3
  • 1
    OK, so it's just normal recursive calls. Why then does Pharo not overflow the stack? Can you refer me to anything that describes the execution model? Commented May 7, 2016 at 18:53
  • 2
    @WilfredHughes There are cases when you cannot use recursion because you could exhaust the stack. In the case of factorial this is usually not the case because every recursion is just a call with no arguments and therefore only the receiver and the return address are pushed on the native stack. At the same time, factorial grows so fast that one usually reaches the desired result long before consuming all the stack space. For the execution model pelase take a look to Smalltalk-80 The Language and its Implementation. It's online. Commented May 7, 2016 at 21:33
  • 3
    The url for the book mentioned above is [link] (stephane.ducasse.free.fr/FreeBooks/BlueBook/Bluebook.pdf) Hard to believe the book was written 30 years ago. Commented May 8, 2016 at 5:21
1

IMHO, the initial code that is presumed to have a tail-recursive call to factorial

factorial
        "Answer the factorial of the receiver."

        self = 0 ifTrue: [^ 1].
        self > 0 ifTrue: [^ self * (self - 1) factorial].
        self error: 'Not valid for negative integers'

it isn't, actually. The bytecode reported by Leandro's reply proofs that:

39 <70> self                  ; receiver of outer message *
40 <70> self                  ; receiver of inner message -
41 <76> pushConstant: 1       ; argument of self - 1
42 <B1> send: -               ; subtract
43 <D0> send: factorial       ; send factorial (nothing special here!) 
44 <B8> send: *               ; multiply
45 <7C> returnTop             ; return

before the returnTop there is a send of * instead of factorial. I would have written a message using an accumulator as

factorial: acc
    ^ self = 0
        ifTrue: [ acc ]
        ifFalse: [ self - 1 factorial: acc * self ]

that produces the bytecode reported in this picture.

Btw,

n := 10000.
[n slowFactorial] timeToRun .
[n factorial] timeToRun.
[n factorial: 1] timeToRun.

both the first and second ones takes 29 millisecs, the last one 595 millisecs on a fresh Pharo 9 image. Why so slow?

1

No, Pharo and its VM do not optimize recursive tail calls.

It is apparent from running tests on a Pharo 9 image, and this master thesis on the subject confirms that.

As of today Pharo ships with two factorial methods, one (Integer >> factorial) uses a 2-partition algorithm and is the most efficient, the other looks like this:

Integer >> slowFactorial [
    self > 0
        ifTrue: [ ^ self * (self - 1) factorial ].
    self = 0
        ifTrue: [ ^ 1 ].
    self error: 'Not valid for negative integers'
]

It has an outer recursive structure, but actually still calls the non-recursive factorial method. That probably explains why Massimo Nocentini got nearly identical results when he timed them.

If we try this modified version:

Integer >> recursiveFactorial [
    self > 0
        ifTrue: [ ^ self * (self - 1) recursiveFactorial ].
    self = 0
        ifTrue: [ ^ 1 ].
    self error: 'Not valid for negative integers'
]

we now have a real recursive method, but, as Massimo pointed out, it's still not tail recursive.

This is tail recursive:

tailRecursiveFactorial: acc
^ self = 0
    ifTrue: [ acc ]
    ifFalse: [ self - 1 tailRecursiveFactorial: acc * self ]

Without tail call optimization this version shows by far the worst performance, even compared to recursiveFactorial. I think that's because it burdens the stack with all the redundant intermediate results.

Not the answer you're looking for? Browse other questions tagged or ask your own question.