Switching to Stackless: Why this, Why now?

2022 UPDATE: STACKLESS ARRIVES!

2024 UPDATE: Still working and going strong! :man_lifting_weights:


I wanted to write a short explanation of why switching Rebol to a stackless implementation--similar to Stackless Python--has jumped up the priority queue, to become something I'm working on in the near term.

It's not a frivolous feature. It's because:

  • the Wasm build broke. The patch I made to "fix" it means that if you say foo: func [] [foo] and run it, you will irrevocably crash the web REPL.

  • Wasm was allowed to break it. There is nothing in the C standard that lets you predict a stack overflow, and the method R3-Alpha had been using was knowingly non-standard. It was a bit of a thorn in my side--given how much care was taken in general to stay on the right side of the standard--so I was always hoping a better answer would come along.

  • size optimization was eating stack. We like to use emscripten's best options for making a small library. But when we do, clang and emscripten avoid inlining...and instead leave the functions as-is and make a lot of function calls. With inlining disabled, Ren-C uses a fair number of function calls per evaluation step, and that competes with the JavaScript call stack...we were getting stack overflows not just on infinite recursions, but also on seemingly not-too-deep routines (that didn't even have to be recursive).

So this has been a bit of a setback. It took down the REPL site for a while and has involved a fair amount of mulling about what to do.

Can We Just Crash On Stack Overflows for Now and Press On?

We could put in a recursion counter and call that good enough...like old CPython does.

But...the problem with pressing on when you know something is wrong is that stacklessness has some fairly deep impact on how the system and APIs work. If enough groundwork isn't laid to know where the stackless influence is, we might write bad extensions or interoperability code that fundamentally won't work later.

Not only that, we don't want to look bad against the competition. The motivations section of Stackless Python's "Why the C Stack Should Vanish" gives a number of points, one of which is:

"PERL has already no longer a C stack. They will have some good reasons for this. Python has no reason to be limited and stay behind."

We can take some shortcuts. Full stackless operation won't ever be possible if you call out of Rebol. Each transition from Rebol into code that calls back into Rebol without deferring its stack back will not be stackless. So if some Rebol natives fall into this category as well for a while, that's okay. I will likely find a balance point that is "good enough" and leave a lot of natives to be converted on an as-needed basis.

Why Wasn't Stackless Ren-C on the Agenda Before?

I was always contemplating this direction. So while the switch will be by no means trivial, the design of frames and the basic fabric of the interpreter should bend that way more readily than if I hadn't been. Preliminary tests are promising that conversion should progress at a decent pace.

One reason I'd avoided making a commitment to a stackless approach was because I felt it could be obfuscating when debugging the interpreter--and that such obfuscation would run counter to the "keep it simple" Amish principles. It's nice when you can fire into the C debugger and scroll through the C stack frames and see things like code for a native "CASE" statement right there...with the execution point in the middle on the branch it's running right now. A stackless interpreter has its state in data structures that aren't the C stack, and inspecting those data structures is a bit less intuitive.

As an alternative to stacklessness: I had imagined that the platforms C programs are built on would eventually offer a costing model, whereby a recursion could reliably predict whether it would overflow (so long as it wasn't calling 3rd-party code). Then your program wouldn't be stuck simulating a stack to protect itself from the C one it should be able to take for granted. It would ask the C runtime if there was space for a recursion, and the runtime--having a measurement of the worst-case cost of that function--would tell you yes or no...much as it might tell you it couldn't malloc() a certain amount of memory from the heap, so you could raise an error.
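To be clear about what I was wishing for, here is roughly what such an interface might have looked like. This is purely imaginary--nothing like it exists in any C runtime I know of, and every name below is made up for illustration:

#include <stdbool.h>
#include <stddef.h>

/* Imaginary API: no standard C runtime offers anything like this. */
bool c_stack_can_recurse(size_t worst_case_frame_bytes);

int Eval_Step(void) {
    /* 4096 stands in for a compiler-measured worst-case frame cost */
    if (!c_stack_can_recurse(4096))
        return -1;  /* raise a catchable error, as with a failed malloc() */

    /* ...otherwise it would be safe to recurse into Eval_Step() here... */
    return 0;
}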

But it seems indeed this was just in my imagination. Guarantees about the stack aren't on anyone's agenda in the C world; so it's the job of the higher levels like Rebol or Python or whoever to worry about it. So long as Ren-C is written in... C (or anything like it in the systems world), stack overflows are unpredictable and you must just find a way to bound your stack use if it's important.

What Happened To Make Wasm Stack Detection Fail?

While I haven't dug into all the details, the basic understanding is that Emscripten jumped to a new version of Clang, which had a new extension model. Emscripten could then integrate more deeply into the compilation process. So instead of just getting a blob of output from the compiler and transforming it after the fact, they could add more intelligence to the code generation process.

The biggest obvious benefit was that if your browser doesn't have shared memory or threads, the workaround build got a lot better and faster. The new method was "Bsyncify" (an update of an older deprecated technique called "Asyncify"), which leapfrogged the Emterpreter to the point they deprecated it entirely. It actually was not hard to get this workaround build tweaked and going, but missing features in the threaded build kept us on the old Emscripten SDK for threading until those were sorted out.

Fortunately: the threading features were added. Unfortunately: after those features kicked in, the resulting build didn't run at all. (And also unfortunately, debugging this stuff isn't easy.)

That's when I found that our problem was related to how Rebol detected stack overflow conditions. And stack overflows turn out to be a tricky topic that many programmers misunderstand.

Once A Stack Overflow Happens, You Are Hosed

C programmers know that a stack overflow error is not recoverable. (Or at least, they should know.) But people using other languages are under the impression that they can catch stack overflows and do something meaningful, when they cannot do so in any general sense.

Sensibly, Rebol didn't try to install strange hooks to let C code badly "recover" from an actual stack overflow. It tried to conservatively guess how close you probably were to the stack's limit...and pre-emptively generated an error. On most people's machines, this was good enough to not crash the interpreter, and instead give you an error:

>> foo: func [] [foo]
>> foo
** Internal Error: Stack overflow

Though "recovery" has all the same problems you can see in the JavaScript question I ask. A lot of what people experience as "Rebol the language" is written in Rebol. You don't know what mezzanine might have crashed, and if crashing that has left the system in an unstable or unusable state. Bulletproofing all of the system against this...is hard.

Well, There's Some of the Motivation

Early tinkering has been promising...though this really does muck with the primordial debugger. But it's good to have that debugger in there so that it doesn't get lost in the shuffle while considering this.

One nice thing to see in Stackless Python is how they use pluggable evaluator functions which can become part of the stackless approach. That's good, because it ties into how we can have a stackless PARSE. Previously it might have seemed bad that Ren-C roped the frame system used by the evaluator and debugger into each PARSE recursion--due to the inefficiency and extra stack use--but this should address any concerns about that.

1 Like

Because I don't care much for keeping secrets, I'll mention that the "big" thing I am working on is Stackless. This is probably the biggest single change the codebase will have seen in a long time.

It requires a fundamental change to how natives are written in C. Previously anything that invoked the evaluator--such as Rebol's WHILE--would begin running C code that implemented that construct, and not execute a C return statement until all the looping was done. Clearly that means if you put another Rebol WHILE inside a WHILE loop's body, you would have multiple C stack frames for the native WHILE in effect at a time. This is a recipe for growing the C stack arbitrarily deep, in proportion to how deep the Rebol stack is.

The new world requires every native to become a kind of state machine. It will run a C return every time it wants an evaluation...returning not a result, but a request for the evaluator to do that thing and call it back. This state machine has to retain memory of what the last thing it did was... did it ask to run a condition, or did it ask to run a body? Because what it does with the result will depend on that. After many times of return-ing to the evaluator with requests, it will ultimately return a signal of an actual result when the loop condition is false, or when there is a BREAK.
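To make the shape of that concrete, here's a minimal sketch of a WHILE-like native written as a state machine. The names and types are hypothetical stand-ins (the real Ren-C frame, cell, and continuation signals are richer, and BREAK handling is omitted)--but the pattern of returning requests to a trampoline instead of calling the evaluator is the point:

#include <stdbool.h>

/* Hypothetical stand-ins; the real Ren-C cell and frame structures differ. */
typedef struct { bool truthy; } Value;

typedef enum {              /* what the native hands back to the trampoline */
    BOUNCE_EVAL_CONDITION,  /* "please evaluate my condition, then call me back" */
    BOUNCE_EVAL_BODY,       /* "please evaluate my body, then call me back" */
    BOUNCE_FINISHED         /* "I'm done; f->out holds the final answer" */
} Bounce;

typedef struct {
    enum { ST_INITIAL, ST_CONDITION_RAN, ST_BODY_RAN } state;
    Value out;  /* where the trampoline delivers each evaluation product */
} Frame;

/* A WHILE-like native as a state machine: it never calls the evaluator, it
 * only returns requests to it.  The trampoline owns the one real C stack. */
Bounce While_Executor(Frame *f) {
    switch (f->state) {
      case ST_INITIAL:
      case ST_BODY_RAN:  /* after running the body, retest the condition */
        f->state = ST_CONDITION_RAN;
        return BOUNCE_EVAL_CONDITION;

      case ST_CONDITION_RAN:
        if (!f->out.truthy)
            return BOUNCE_FINISHED;  /* condition was false; loop is over */
        f->state = ST_BODY_RAN;
        return BOUNCE_EVAL_BODY;
    }
    return BOUNCE_FINISHED;  /* unreachable */
}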

The More I Work On This, The More Essential It Feels

The legacy "CPython" implementation strongly paralleled historical Rebol and Red in how its interpreter recursed. Modern interpreters are rejecting this notion; the arguments are made clearly for why. When I notice things JavaScript can do with ease that we cannot, I can't allow them to have that edge. The same goes for Perl, Python, or Ruby: when the comparisons are direct, it does not pay to anchor one's ship to a method that is destined to fail.

The ramifications are deep. Yet as intimidating as it is to face a mountain of a task like this, I do not feel someone can announce an "interpreted language of the future" which is missing this key advancement.

From reading the reports on Stackless Python it sounds like it took a year to do. I don't know how long it will take me. What I can report is that there have been some early successes. Even though the exploration of the consequences is making the code a bit haphazard-seeming, strong defensive programming is keeping things running...hopefully well enough to let design improvements take shape so it tidies up.

With this being such a clear imperative, I don't see how it could be put off until after the deep-reaching features like debuggers and tracers. In my opinion the language will never truly be learnable unless it has a full and flexible debugger--so this is me operating on my instincts about what is the highest priority right now.

2 Likes

This sounds intense. Good luck @hostilefork, you're the right person for the challenge!

1 Like

Wanted to give a small status report. Things are advancing a bit each day...painfully, but semi-clearly.

  • The good news of Ren-C entering this with such a wide range of wild abilities is that the design is very informed. So far, no previous fun-and-cool features look like they're going to get the axe because of the imposition of the new architecture. It seems to be a strictly-enabling change. Even though it certainly throws some of the old ideas about how to write a debugger out the window...those ideas were rickety. And I see ways in which it will really empower new methods. I just have to see the rest.

  • Nothing is firmed up enough to where it's productive to start doing formal timing. No doubt some of the stackless operations are going to come with extra cost. But when I look at the kinds of convolutions you'd have to do to achieve these goals without them as interpreter features...it reminds me not to be "penny-wise and pound-foolish". Powerful operations that help you avoid writing huge and broken usermode surrogates that are MUCH slower can win in the big picture...so sacrificing a couple of micro-optimizations is :ok: (see also the old SPECIALIZE performance comparison notes).

By doing things like tackling PARSE and TRAP+FAIL early on, I'm not avoiding hard questions and just going around converting all the easy natives to stackless first. I'm pacing them out (e.g. I just did LOOP and UNTIL today, which took only minutes each). Doing a couple at a time lets each return to the task be a "new" experience where I might have new insights.

The three biggest unknowns right now are:

  • Debugging - It's very much on my mind to create an experience that's not just like the hacked up demo I did...but real. I want it to be better, with real automation and reflection. It's been a trickier thing than I first imagined to design a pleasing/extensible debugger in a non-threaded environment running in the same language it is debugging (!).

  • Path Dispatch - which was always weird and now has to be redone. Not that I anticipate it would be hard to come up with something better than history's method (it seems to be another one of Rebol's semantics-free zones). What worries me is that if I'm going to bother doing a full rewrite...I'd kind of like to make some headway on pinning down just what the heck path dispatch even is. How is it generic, or safe, and how does it apply to custom types? I may have to kick that can down the road again and just focus on making it stackless.

  • Performance - I've been applying intuition to it, but it's going to need some looking at with instrumentation and timings. We've spoken about performance regression testing...and it seems like the web automation demo via headless Firefox might be not just the most relevant target, but also the easiest one to get comparison figures for.

I started this around 18-Apr so it's been a bit over a month. It's a little hard to estimate how fast progress will be, especially in these three areas. But pure stackless processing is probably not more than a couple of months away; with less clear progress on debugging or guarantees about how fast it will be.

In any case, it looks like I might be on the cusp of another relocation shortly. :palm_tree: :oncoming_automobile: :palm_tree: So we'll see if that increases or decreases productivity.

4 Likes

So we have witnessed the first successfully trapped stack overflow error in the Wasm build (both Asyncify and Threaded). This required switching to the ABORTING_MALLOC=0 build setting...because by default, Emscripten assumes a failing malloc() should kill your program instead of returning NULL.

As this shows, the "stack overflow" error manifested as running out of memory:

>> recursive: func [] [array: copy [a b c] recursive]

>> recursive
** Internal Error: not enough memory

I included a COPY to make the point that when you overflow in a "stackless" system, you won't necessarily run out of memory on the attempt to allocate a stack frame. When stack frames are consuming heap/pooled memory just like everything else, it's just as likely that the COPY to make a new array will be the trigger as trying to allocate a new frame.

You may notice that there's no stack frame list to hint that the deep recursion is the cause of the problem. That's because when you run out of memory and are trying to present an error, you can't be guaranteed to have enough memory to allocate an array for that purpose. Similarly, you can't expect to be able to allocate the object representing the error itself. These must be preallocated at some size. That's additional work which needs to be done, and had never been done (Rebol never really handled out-of-memory errors rigorously).
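A sketch of the general technique (hypothetical names, not the actual Ren-C allocator): the error object for the out-of-memory case is reserved at startup, so reporting the failure never requires a fresh allocation.

#include <stdlib.h>
#include <stddef.h>

/* Fixed-size error description: no fields that would need allocating. */
typedef struct {
    const char *id;  /* e.g. "not enough memory" */
} Error;

static Error g_oom_error;  /* storage reserved before any script runs */

void Startup_Errors(void) {
    g_oom_error.id = "not enough memory";
}

/* Allocation wrapper: on failure, hand out the preallocated error rather
 * than trying to build a new error object (which would itself allocate). */
void *Try_Alloc(size_t size, Error **error_out) {
    void *p = malloc(size);
    if (p == NULL)
        *error_out = &g_oom_error;
    return p;
}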

I have removed the C_STACK_OVERFLOWING() check which happened when you pushed function calls. To revisit a writeup on why it was bad, consult the comments. The tests have been adapted so that all the "infinite" tests currently just throw after some large number of recursions, and they complete.

BUT... the evaluator and function calls are not the only places where C_STACK_OVERFLOWING() was used to detect deep recursions. Other operations like scanning, molding, binding, etc. used C stack recursions, and for deeply nested blocks these could overflow. Also, not all such operations were protected. Here's a list of the other clients of C_STACK_OVERFLOWING:

  • Binding - Binding is currently implemented as a native that calls a C function, which is "C-recursive"...that C function calls itself as deeply as the block structure it is binding. With a sufficiently deep block, this could cause an "untrappable" stack overflow. One way to attack this is to make each bind level create a frame--as is done with stackless COMPOSE--using a "Binding Executor". While this may sound like it would be slow, running through common code paths would incentivize optimizations that would--in turn--speed up the whole evaluator, PARSE, etc. (A general sketch of the de-recursion idea appears after this list.)

  • Copying - There are many different forms of copying and options while doing so. Things like "relativizing" blocks are woven in with the copying process...and one can imagine tying it in with binding as well. In any case, the core of copying is currently something called "clonification", which uses C_STACK_OVERFLOWING. It's not obvious to me just by looking at it how this code could use the trampoline to power it...but it's likely possible.

  • Comparing - The general topic of comparison semantics in Rebol is rather neglected, but it's yet another routine that is done through plain recursions. It seems to be something amenable to frame-based recursion.

  • Scanning - If you imagine writing something like load "[[[[[[[[[[[[[..." there is going to be a limit somewhere. Red seems to use a specialized stack for this vs. doing recursions (the error you get for deep scans is PARSE - stack limit reached, and not a typical stack overflow). But Rebol's scanner was based on plain C recursion. I again wonder just how generic and fast the "trampoline" could be engineered to be--close enough in speed to a specialized stack to be worth it.

  • Molding - Another C recursive routine, which might be reframable to use the trampoline. As with most of these routines that use C recursions, there has to be some concern over dealing with cycles. Another wrench thrown in is that within the system, molding is designed to work on so-called "Relative" values, which users never see but are part of the implementation. This means that if all the C parts in the system tried to call a native MOLD, they would not have the proper bit patterns to give as a fulfilled value. :-/
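Whatever form each of these routines ends up taking (trampoline frames or a specialized stack), the common thread is that the traversal state has to move off the C stack and onto the heap. As a purely illustrative sketch of that shape--hypothetical types, error handling elided, and not Ren-C's actual binder or molder--here is a deep walk of a nested structure done with an explicit stack instead of C recursion:

#include <stdlib.h>

typedef struct Node Node;
struct Node {
    Node **children;    /* NULL for a leaf node */
    int num_children;
};

typedef struct {
    Node *node;
    int index;          /* which child gets visited next */
} WalkFrame;

void Visit_Leaf(Node *leaf) { (void)leaf; /* e.g. bind a word, mold it... */ }

/* Walk arbitrarily deep nesting; depth is limited only by how much heap
 * the WalkFrame stack can grow to, not by the C stack. */
void Walk_Deep(Node *root) {
    int capacity = 64;
    WalkFrame *stack = malloc(capacity * sizeof(WalkFrame));
    int depth = 0;
    stack[0].node = root;
    stack[0].index = 0;

    while (depth >= 0) {
        WalkFrame *top = &stack[depth];
        if (top->node->children == NULL) {  /* leaf: visit it and "return" */
            Visit_Leaf(top->node);
            --depth;
        }
        else if (top->index == top->node->num_children) {
            --depth;                        /* all children done: "return" */
        }
        else {
            Node *child = top->node->children[top->index++];
            if (++depth == capacity) {      /* grow the heap stack as needed */
                capacity *= 2;
                stack = realloc(stack, capacity * sizeof(WalkFrame));
            }
            stack[depth].node = child;
            stack[depth].index = 0;
        }
    }
    free(stack);
}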

So that's a little map of some of the parts of the code that could trip up with stack overflows when given a sufficiently pathological data structure. This isn't about operating on big data structures, just deep ones.

I'm not really as concerned about these deep-datastructure recursions as I am about having a story for the debugger in this stackless world, and a bunch of bigger questions.

Also, while stackless is coming along, it's likely the biggest single change ever done (previous competitors: specific binding, UTF-8 Everywhere). So it's probably healthier to take breaks and mix in other things.

2 Likes

Impressive progress @hostilefork

Stackless went pretty well in terms of core design, and showed off generators and coroutines. These seem like non-negotiable features to me, when most of the competition has them (as with multiple returns...)

But the thing about huge systemic overhauls that affect every part of the system is that they make you look at every part of the system. The same thing happened with UTF-8 Everywhere, when I was fairly overwhelmed trying to work on it. So I took a break for some months, then came back and applied it when it was ready.

The list of issues I had to address that were unrelated to stackless was so long that it was demoralizing to feel like I couldn't proceed without fixing them. I got fairly blocked. In addition, I realized that coming to terms with the debugger model I wanted would require a lot of work.

Looking at other things unblocked me, and now I'm feeling progress is at a pretty good clip on solving a fair number of basic longstanding issues.

I'm not entirely sure about the timescale of particular features. But what I like about how things are going now is that it's framing up comfortable decisions in the style that I had wanted for a Beta/One. Taking care of "little things" like knowing how APPEND works w.r.t. splicing or how constness vs. mutability works is fundamental to being able to teach people things they don't have to unlearn, so I'm not begrudging myself time spent on it.

Stackless is still in mind, and changes are being made with it accounted for.

3 Likes

Further validation...

...I moved up to GCC 13, and address sanitizer now breaks the assumptions of Rebol's old C_STACK_OVERFLOWING() check.

MySQL got bit by it (June 2023, they jumped compilers earlier than me):

There is another problem with stack growth detection under Address Sanitizer on modern gcc and clang compilers (possibly under other sanitizers as well but this hasn't been tested).

Starting from clang version 15 and gcc version 13, both of these compilers are using custom technique for allocating stack frames when Asan is enabled (-fsanitize=address compiler flag is specified).

https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn

They rely on a custom function __asan_stack_malloc() that allocates a block of memory from an internal ASan memory pool.

The problem is that this __asan_stack_malloc() may return absolutely random (or, I would at least say, unordered memory addresses) even for consequent calls.

In other words, there is no such thing as "stack growth direction" in this case and we just simply cannot rely on this concept in the MySQL code.

As I mentioned, modern Ren-C isn't completely free of "stackful" recursions yet. But we're not really that reliant on catching those cases overflowing in everyday code. We can live without C_STACK_OVERFLOWING(), and it's probably time we did. All the documentation and writeups about how bad it is, how it can't work in the WebAssembly build, and how it's not in the C standard are unnecessary weight at this point.

But the bootstrap executable isn't stackless at all. So if we tossed the check there, a simple usermode recursion could crash the interpreter--without providing any feedback about what happened. That's...not great. And I do like running address sanitizer on the bootstrap build as much as anywhere else.

For now, it seems I can bypass the arbitrary C stack frame allocations with:

export ASAN_OPTIONS=detect_stack_use_after_return=0

And that's fine for the bootstrap executable. But on balance, I think we're better served by getting rid of the C_STACK_OVERFLOWING() check in the main code, and enabling the new address sanitizer check.

FWIW, Ren-C's scanner is now stackless. (Doesn't sound like a necessity on the surface, but bear in mind API calls use the scanner, so it's kind of entwined in the evaluation model... this lets you unwind mid-API call with a partial scan.)

For historical reference, here's what C_STACK_OVERFLOWING() comments (my explanation) said:

//=////////////////////////////////////////////////////////////////////////=//
//
//  C STACK
//
//=////////////////////////////////////////////////////////////////////////=//
//
// Rebol doesn't want to crash in the event of a stack overflow, but would
// like to gracefully error and return the user to the console.  While it
// is possible for Rebol to set a limit to how deeply it allows function
// calls in the interpreter to recurse, there's no *portable* way to
// catch a stack overflow in the C code of the interpreter itself.
//
// Hence, by default Rebol will use a non-standard heuristic.  A flag is
// passed to say if OS_STACK_GROWS_UP.  If so, it then extrapolates that C
// function call frames will be laid out consecutively, and the memory
// difference between a stack variable in the topmost stacks can be checked
// against some limit.
//
// This has nothing to do with guarantees in the C standard, and compilers
// can really put variables at any address they feel like:
//
// http://stackoverflow.com/a/1677482/211160
//
// Additionally, it puts the burden on every recursive or deeply nested
// routine to sprinkle calls to the C_STACK_OVERFLOWING macro somewhere
// in it.  The ideal answer is to make Rebol itself corral an interpreted
// script such that it can't cause the C code to stack overflow.  Lacking
// that ideal this technique could break, so build configurations should
// be able to turn it off if needed.
//
// In the meantime, C_STACK_OVERFLOWING is a macro which takes the
// address of some variable local to the currently executed function.
// Note that because the limit is noticed before the C stack has *actually*
// overflowed, you still have a bit of stack room to do the cleanup and
// raise the failure.  (You need to take care of any unmanaged series
// allocations, etc).  So cleaning up that state should be doable without
// making deep function calls.
//
// !!! Future approaches should look into use of Windows stack exceptions
// or libsigsegv:
//
// http://stackoverflow.com/questions/5013806/
//

#if TO_EMSCRIPTEN || TO_WASI
    //
    // !!! Catching stack overflows in emscripten stopped working in the
    // BinaryEn build; the stack seems to not grow up or down specifically.
    // As a temporary non-solution, see what happens to just let it crash.
    //
    #define C_STACK_OVERFLOWING(address_of_local_var) \
        false

#elif defined(OS_STACK_GROWS_UP)

    #define C_STACK_OVERFLOWING(address_of_local_var) \
        (i_cast(uintptr_t, (address_of_local_var)) >= g_ts.C_stack_limit_addr)

#elif defined(OS_STACK_GROWS_DOWN)

    #define C_STACK_OVERFLOWING(address_of_local_var) \
        (i_cast(uintptr_t, (address_of_local_var)) <= g_ts.C_stack_limit_addr)

#else

    #define C_STACK_OVERFLOWING(local_var_addr) \
        (g_ts.C_stack_grows_up \
            ? i_cast(uintptr_t, (local_var_addr)) >= g_ts.C_stack_limit_addr \
            : i_cast(uintptr_t, (local_var_addr)) <= g_ts.C_stack_limit_addr)
#endif