Web Build Performance Stats

I resurrected the "stats" function to get some metrics. It's actually a good example of how nicely Ren-C can improve things.

In any case, comparing the statistics between R3-Alpha and Ren-C is going to show a lot more series and memory use in Ren-C. The main reasons are:

  • There's a Windows encapping issue where the whole executable is read into memory to probe it for resource sections. This is especially crazy for debug builds. I'd raised this as an issue for Shixin to look at, but forgot about it.

  • Function frames do not use the data stack; instead, the arguments of functions are stored in individual arrays. While there are some optimizations so that this doesn't require an allocation on quite every function call, it means a good portion of function calls do allocate series. This stresses the GC, but I've mentioned how it was important for many reasons (including that the data stack memory isn't stable, which meant the previous approach had bugs when passing pointers to arguments around). It's a given that this is how things are done now--especially with stackless--so it just needs to be designed around and tuned.

  • WORD!s are special cases of string series. Things like the word table and binding didn't count toward series memory before, and weren't tabulated in R3-Alpha's series count. There are some other examples of this.

  • ACTION!s create more series and contexts. The HELP information for most actions that have it includes two objects linked to the action: one mapping parameter names to datatypes, and one mapping parameter names to descriptions. I'm hoping the one mapping parameter names to datatypes can be covered by the parameter information that the interpreter also sees...but for today there's a difference, because one contains TYPESET!s and the other contains human-readable BLOCK!s.

  • So Much More Is Done In Usermode. Ranging from console code to command-line argument processing, there's more source code (which counts as series itself) and more code running.
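The data-stack instability mentioned in the second bullet is worth a moment: a growable stack can relocate on expansion, invalidating any pointers into it, while a per-call array stays put. Here is a toy C contrast of the per-frame approach (names like `Frame` and `make_frame` are made up for illustration--this is not Ren-C's actual code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: if arguments lived in one growable data stack, a push
 * that forces a realloc() could move the whole stack, invalidating
 * any argument pointers that had been handed out.  Giving each call
 * its own small array keeps those pointers stable for the lifetime
 * of the call... at the cost of an allocation per call. */

typedef struct Frame {
    int *args;    /* per-call allocation: stable until the call ends */
    int n_args;
} Frame;

static Frame *make_frame(const int *args, int n) {
    Frame *f = malloc(sizeof(Frame));
    f->args = malloc(n * sizeof(int));
    memcpy(f->args, args, n * sizeof(int));
    f->n_args = n;
    return f;
}

static void free_frame(Frame *f) {
    free(f->args);   /* this per-call churn is what stresses the GC */
    free(f);
}
```

A later call can never relocate an earlier frame's arguments, which is the property the old data-stack approach lacked.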

I see it as good--not bad--that a ton of things run in the boot process. Although I think you should be able to build and run a minimal system...even one that doesn't waste memory on HELP strings (it's now easier to make such things, since the spec isn't preserved).

But for today, the closest we have to a "minimal build" is the web build. It's a bit more comparable to R3-Alpha in terms of how much startup code it runs.

The Current State

Starting up R3-Alpha on Linux, I get the following for stats/profile:

r3-alpha>> stats/profile
== make object! [
    timer: 0:00:02.639939
    evals: 20375
    eval-natives: 3340
    eval-functions: 369
    series-made: 8393
    series-freed: 2597
    series-expanded: 70
    series-bytes: 2211900
    series-recycled: 2526
    made-blocks: 5761
    made-objects: 64
    recycles: 1
]

Ren-C on the web is considerably heavier, at least when it comes to evals + series made + GC churn (a little less overall series bytes...probably mostly owed to optimizations that fit small series into the place where tracking information would be stored if it were a larger one):

ren-c/web>> stats/profile
== make object! [
    evals: 65422
    series-made: 28569
    series-freed: 11160
    series-expanded: 419
    series-bytes: 1731611
    series-recycled: 8669
    made-blocks: 16447
    made-objects: 109
    recycles: 229  ; !!! see update, this is now 1
]

The increased number of evals just goes with the "a lot more is done in usermode" bit. There are lots of ways to attack that if it's bothersome.

The series-made number is much bigger: 8393 vs. 28569. I mentioned how a lot of this is going to come from the fact that many evals need to make series, but we don't really have a breakdown of the number here to be sure that accounts for them. Anyway, this number isn't all that bothersome to me given that knowledge...but it should be sanity-checked.

What does bother me is the 229 recycles. That's a lot. Despite making 3-4x as many series, I don't see how that translates into 200x the recycling.

UPDATE: This was the result of accidentally committed debug code. It's back to 1.

Writing Down The Current State is Better Than Nothing

Ideally we'd have some kind of performance regression chart that plotted some of these numbers after each build. Though it's really not worth doing that unless the numbers carried more actionable information.

But...lacking an automated method, writing it down now and having a forum thread to keep track of findings and improvements is better than nothing.

There's likely a lot that could be done to help the desktop build (such as obviously tending to that encap-reading issue). But I'd like to focus principally on improvements to the internals that offer benefit to the web build, where I think the main relevance is. And:

  • Having a system built from rigorously understood invariants is the best plan for optimization over the long-term. If you don't have a lot of assertions and confidence about what is and isn't true around your codebase, you can't know if a rearrangement will break it or not. So I spend a lot of time focusing on defining these invariants and making sure they are true.

  • Avoid optimizing things before you're sure they're right. I'm as guilty as anyone of fiddling with things for optimization reasons just because it's cool, or because I get curious about whether something can work. Programmers are tinkerers, and that's just how it is. But it's definitely not time to go over things with a fine-toothed comb when so many design issues aren't worked out.

2 Likes

Because I thought "oh this might be complex" I didn't immediately look at it. But I should have just set a breakpoint, because this was the result of some debugging Recycle() calls accidentally getting committed. It was recycling on every native creation!

Removing it gets us to the expected recycle of 1.

Yay.

2 Likes

Circa January 2021, I did a capture of some statistics about the web build, right after booting the console (the stats/profile output earlier in this thread).

I thought I might run it again here 3 years later. And I was a bit shocked initially, because at first glance it seems like things have gotten completely out of hand:

>> stats/profile
== make object! [
    evals: 3277865
    series-made: 928871
    series-freed: 646233
    series-expanded: 6312
    series-bytes: 6159217
    series-recycled: 196735
    blocks-made: 86893
    objects-made: 220
    recycles: 13
]

That's a factor of 50 more "evals"... with a factor of 32 more series... just to boot the web console, what the heck happened?

Series Count Is Misleading

The first thing to notice is that the console doesn't take 50x as long to load as it used to. That should be a hint that something about the statistics may have gotten thrown off.

One thing that's grossly overinflating the "series" count is that Sea of Words invented a new mechanism for binding in modules. This method makes tiny series "stubs" that get linked onto canon symbols to hold indices during a binding operation. They're very cheap to make and to destroy and aren't involved in GC. They account for at least 1/3 of those "series" being made.

These statistics need to be adjusted somehow to break those out differently.
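To make the stub mechanism concrete, here is a hypothetical C sketch of the idea (the names `Stub`, `Symbol`, `push_binding` etc. are invented for illustration--this is not Sea of Words' actual code): a tiny node gets linked onto a canon symbol to hold an index for the duration of a binding pass, with no GC involvement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: during a module binding pass, each canon symbol
 * gets a small "stub" linked onto it, holding the index that word
 * resolves to.  Stubs stack, so nested binds shadow outer ones.
 * They're plain malloc/free nodes--cheap, and never seen by the GC. */

typedef struct Stub {
    int index;            /* binding index while the pass is active */
    struct Stub *next;    /* stubs can stack for nested binds */
} Stub;

typedef struct Symbol {
    const char *spelling;
    Stub *binding;        /* most recent active stub, or NULL */
} Symbol;

static void push_binding(Symbol *sym, int index) {
    Stub *stub = malloc(sizeof(Stub));
    stub->index = index;
    stub->next = sym->binding;
    sym->binding = stub;
}

static int lookup(const Symbol *sym) {
    return sym->binding ? sym->binding->index : -1;  /* -1: unbound */
}

static void pop_binding(Symbol *sym) {
    Stub *stub = sym->binding;
    sym->binding = stub->next;
    free(stub);
}
```

In a model like this, every bind pass makes one such node per word it touches--which is how a boot that does a lot of binding can inflate a raw "series made" counter without a matching rise in GC pressure.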

Eval Count Is Misleading

Due to stackless processing, what counts as an "eval tick" has multiplied by a lot. It used to be that something like a REDUCE operation would count as one tick, and then each eval step it did would be a tick. Now each time the REDUCE yields to the evaluator to ask for another evaluation, that's a tick...then the evaluator ticks.

(While these added steps may sound like a burden, it actually accelerates the web build. Because it means we're building an unwindable stack that can yield to the browser's event loop, without relying on a crutch like binaryen that would bloat up the entire runtime with "stackless emulation"...that would fail anyway on deep stacks. For a reminder of why this is necessary: "Switching to Stackless: Why This, Why Now?")
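The accounting change can be sketched with a toy C model (this is just bookkeeping arithmetic, not an actual stackless trampoline): under the old scheme a composite like REDUCE cost one tick plus one per evaluation step; now each step costs two ticks--one when the composite yields to request an evaluation, and one when the evaluator performs it.

```c
#include <assert.h>

static long tick = 0;

static int eval_item(int item) {
    tick++;              /* the evaluator ticks */
    return item * 2;     /* stand-in for evaluating one expression */
}

/* Toy model of stackless accounting: the composite no longer counts
 * as a single tick for its whole run.  Each time it wants another
 * evaluation, the yield itself ticks, then the evaluation ticks. */
static int reduce_stackless(const int *items, int n, int *out) {
    for (int i = 0; i < n; i++) {
        tick++;          /* REDUCE yields to ask for an evaluation */
        out[i] = eval_item(items[i]);
    }
    return n;
}
```

So reducing 3 items costs 6 ticks in this model, versus 4 (1 + 3) under the old counting--same work, bigger number.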

Also, parsing got subsumed into the tick count. Even PARSE3 breaks its operations down into ticks now. So if you say parse3 "aaaabbbb" [some ["a" | "b"]], that's not one tick--it's at least 8 * 3, or 24.

The Real "Villain": UPARSE

Once you've cut out some basic accounting anomalies, there's still certainly a lot of work to do. But reading between the lines, there's a clear central source of resource usage going through the roof...which is quite simply, UPARSE.

Not that the web console uses it all that much. But using it pretty much at all explodes the amount of interpreter code that runs, quickly eclipsing everything else.

I don't think the best way to frame it is negatively. The way I see it, this is a big, real challenge for the system. UPARSE is the most real, most powerful, and most tested dialect implementation the Redbol world has ever seen.

So... What Can Be Done?

Counting up all the ways in which UPARSE pays the price for its current all-usermode implementation is too big a job for this post. But it's important to realize that the costly mechanisms it uses all have motivations. Meeting those needs with more efficient tools means other dialects can use those tools too.

To name an example off the top of my head: every combinator function is pre-hooked, with code enclosing each call in a test for "are we hooked? if so, call the hook with the frame; otherwise just call the frame." That's screaming for some kind of generalized answer, where you don't have to hook in advance but can inject hooks later on some category of functions. But using the massively inefficient approach is how we can test the viability of something like the Stepwise PARSE Debugger.
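The pre-hooked pattern described above can be sketched in C with a function pointer check (all names here are hypothetical--this is the shape of the idea, not UPARSE's actual code):

```c
#include <assert.h>
#include <stddef.h>

/* Every combinator call goes through a wrapper that checks "are we
 * hooked?"  If a hook is installed, it receives the frame and decides
 * how to run it; otherwise the frame is just invoked directly.  The
 * cost: every call pays for the check, even with no debugger attached. */

typedef struct Frame {
    int (*combinator)(int input);  /* the underlying combinator */
    int input;                     /* its (simplified) argument */
} Frame;

static int run_frame(Frame *f) {
    return f->combinator(f->input);
}

static int (*g_hook)(Frame *) = NULL;  /* NULL means "not hooked" */

static int call_combinator(int (*combinator)(int), int input) {
    Frame f = { combinator, input };
    if (g_hook != NULL)
        return g_hook(&f);   /* e.g. a stepwise debugger intercepts */
    return run_frame(&f);    /* the common, unhooked path */
}
```

A "generalized answer" would presumably make the unhooked path free--patching the hook in only when some category of functions actually needs instrumenting--instead of wrapping everything up front.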

But at the end of the day, it does come down to the fact that parsing is just too general-purpose and useful to be done with usermode code. It would be nice if we had a language like Red/System to write the combinators in...which could be compiled to WebAssembly in the browser and x86 or ARM on the desktop. Yet the option on the table right now is hand-coding C for the combinators and the process that combines them.

Prioritization Is Difficult

While UPARSE clearly needs to be reviewed and nativized, I'm still not sure it's the next item of business in the priority queue. Binding casts a dark shadow over the entire system--and has its own performance quandaries. Not to mention that it hasn't been figured out how variables can be instantiated mid-parse with something like a "LET combinator".

Any investment in making UPARSE faster that also makes it harder to modify and test under new designs has to be considered carefully. Also, being written in usermode exposes pitfalls you wouldn't see otherwise--like the need for definitional CONTINUE.

Anyway... data is data, and I wanted to look at it. This is where things are at the moment.

1 Like