Boot Footprint: Giant String Literal vs. Encap?

hostilefork · September 24, 2022, 1:57am

One thing you can do with C is embed literal data. This is how R3-Alpha ships with its mezzanine functions "built in". The prep process stores everything in a big compressed array of bytes called (misleadingly) Native_Specs:

https://github.com/rebol/rebol/blob/25033f897b2bd466068d7663563cd3ff64740b94/src/core/b-init.c#L166

The name being Native_Specs might suggest it was the contents of %natives.r. But it's actually a lot more, with glued-together source code... including all of the contents of the %base-xxx.r, %sys-xxx.r, and %mezz-xxx.r files. So I renamed it to Boot_Block_Compressed.

But it doesn't embed the files as-is... it LOADs them and SAVEs them using an already-built version of R3. This round-tripping removes the comments and normalizes the spacing. R3-Alpha also scrambles the result with CLOAK for whatever reason--a waste of time, because you could read all the code with SOURCE if you felt like it. :-/

(Ren-C doesn't use an old-R3's LOAD+SAVE to strip out comments, because it would lock down the format. Your hands would be tied on adding or changing lexical forms in the sys/base/mezzanine. So it has its own STRIPLOAD function that does a light stripping out of comments and spaces for this glue-files-together purpose)

Is Embedding Big Fat C Constants Supported By The Standard?

C compilers are only required to accept string literals of up to 509 characters in C89, and up to 4095 characters in C99. They can allow more, but don't have to.

So I recall R3-Alpha triggering warnings when you turned up --pedantic warning levels, because it used a syntax like:

const char Native_Specs[] = "\x01\x02\x03...";

That warning went away when I changed it to:

const unsigned char Boot_Block_Compressed[] = { 0x01, 0x02, 0x03 ...};
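As a small illustration of the difference between the two forms, here is a sketch with stand-in bytes. The names Demo_As_String and Demo_As_Array are made up for this example; a real build would generate the initializer from the compressed boot data.

```c
#include <stddef.h>

/* String-literal form: subject to the compiler's string literal length
   limit, and the array silently gains a trailing NUL byte. */
const char Demo_As_String[] = "\x01\x02\x03";

/* Array-initializer form: not a string literal, so that limit does not
   apply, and the array holds exactly the bytes listed. */
const unsigned char Demo_As_Array[] = { 0x01, 0x02, 0x03 };

/* sizeof on the array itself (not a pointer to it) gives the byte count
   at compile time, so no separate length constant has to be maintained. */
const size_t Demo_As_String_Size = sizeof Demo_As_String;  /* 3 bytes + NUL */
const size_t Demo_As_Array_Size = sizeof Demo_As_Array;    /* 3 bytes */
```

Note that the array form only sidesteps the string literal limit. The standard also has a separate minimum guaranteed size for a single object (65535 bytes in C99 for hosted environments), though compilers in practice tend to accept far more.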

Regarding the problem of hitting length limits, Ren-C actually breaks things up a bit more...because each extension has its own constant declaration like this for its Rebol portion.
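A rough sketch of what "one constant per extension" could look like is below. The extension names, the Embedded_Blob struct, and find_blob() are all hypothetical and are not taken from the Ren-C sources; the point is only that several smaller arrays can be indexed through a table instead of living in one giant constant.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-extension blobs; real ones would be generated by the
   build from each extension's compressed Rebol portion. */
static const unsigned char Ext_Alpha_Script[] = { 0x10, 0x11, 0x12 };
static const unsigned char Ext_Beta_Script[]  = { 0x20, 0x21 };

typedef struct {
    const char *name;            /* extension name used for lookup */
    const unsigned char *data;   /* start of the embedded bytes */
    size_t size;                 /* number of bytes in the blob */
} Embedded_Blob;

static const Embedded_Blob Embedded_Blobs[] = {
    { "alpha", Ext_Alpha_Script, sizeof Ext_Alpha_Script },
    { "beta",  Ext_Beta_Script,  sizeof Ext_Beta_Script  },
};

/* Linear search by name; returns NULL if no such blob was built in. */
const Embedded_Blob *find_blob(const char *name)
{
    size_t count = sizeof Embedded_Blobs / sizeof Embedded_Blobs[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(Embedded_Blobs[i].name, name) == 0)
            return &Embedded_Blobs[i];
    }
    return NULL;
}
```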

Because this code is decompressed and scanned once--and then tossed--there are probably a number of experiments that could be done. What if the blob were loaded as mutable data, and then used as some kind of buffer for another purpose? Is there some way to hint to the OS that you really are going to use the information only once, so it will throw out the pages from memory? Or will the right thing happen if you just scan it and use it once?
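One such experiment on POSIX systems could use posix_madvise(). The sketch below is an assumption about how that might be tried, not something Ren-C does: advise_blob_unneeded() is a hypothetical helper that rounds the blob's address range inward to whole pages (the call expects page-aligned addresses on common platforms) and passes POSIX_MADV_DONTNEED. The advice is only a hint, and some C libraries may treat this particular value as a no-op.

```c
#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: tell the OS that the whole pages lying inside
   [data, data + size) are not expected to be needed again.
   Returns 0 on success (or when no whole page fits inside the range),
   -1 if the page size cannot be determined, otherwise the error number
   reported by posix_madvise(). */
int advise_blob_unneeded(const unsigned char *data, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    uintptr_t mask = (uintptr_t)page - 1;      /* page size is a power of 2 */
    uintptr_t start = (uintptr_t)data;
    uintptr_t end = start + size;
    uintptr_t first = (start + mask) & ~mask;  /* round start up */
    uintptr_t last = end & ~mask;              /* round end down */

    if (first >= last)
        return 0;  /* blob does not cover a complete page */

    return posix_madvise((void *)first, (size_t)(last - first),
                         POSIX_MADV_DONTNEED);
}
```

It may well turn out that nothing like this is needed: a const array normally sits in a read-only, file-backed part of the executable image, so its pages are clean and the kernel can typically discard them under memory pressure without being asked.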

Long story short--it hasn't been a problem, even with the TCC build. So it has been taken for granted that it works acceptably.

But Would Encapping Be Better?

One vision of how the boot would work is that it would only load enough to get de-encapping working. Then the de-encapping would be how all the blobs for the "built-in" extensions were extracted.

This seems like an interesting vision, because if someone gave you a big fat Ren-C and you wanted any skinnier version, you could basically ask it to cut everything out you don't want and give you a new EXE. You could roll it up with any customizations you like.

But if you're using any "real" form of encapping (e.g. manipulating the resource portions of a Linux ELF file or a Windows PE file) this gets complicated. And Ren-C's encap facilities are written in usermode...so that expects things like file I/O and PARSE of a BINARY!, etc. I also assume that unzip facilities would be part of encapping. So you need a reasonably runnable system just to get to that point.

I've punted on worrying too much about this, because of the focus on the web build.

It would be a bad investment of limited resources to handwrite and maintain encapping code in C, just so that more of the bootstrap could be done through encapping.

Script Code Is Easy to Encap, EXE/DLL Code Is Not

So the "easy" part would be changing the build to go in two steps.

The first step would make an r3-one.exe that is capable of augmenting itself with encapped data. The second step would ask that r3 to fold in various scripts and resources to make an r3-two.exe that had more things in it...such as a console.
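To make the two steps concrete, here is a minimal sketch of the simplest kind of encapping: appending a payload and a small trailer to the end of the executable image. This is not the "real" resource-section approach mentioned above, and it is not Ren-C's actual encap format. The ENCAPEND marker, the 16-byte trailer layout, and both function names are invented for the example, and the code works on in-memory buffers to stay self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ENCAP_MAGIC "ENCAPEND"   /* hypothetical 8-byte marker */
#define ENCAP_TRAILER_SIZE 16    /* 8-byte little-endian length + marker */

/* Step two of the build: copy the phase-one image, then the payload,
   then a trailer recording the payload length.
   Returns the total bytes written, or 0 if `out` is too small. */
size_t encap_append(unsigned char *out, size_t out_cap,
                    const unsigned char *exe, size_t exe_size,
                    const unsigned char *payload, size_t payload_size)
{
    size_t total = exe_size + payload_size + ENCAP_TRAILER_SIZE;
    if (total > out_cap)
        return 0;

    memcpy(out, exe, exe_size);
    memcpy(out + exe_size, payload, payload_size);

    unsigned char *trailer = out + exe_size + payload_size;
    for (int i = 0; i < 8; i++)
        trailer[i] = (unsigned char)((uint64_t)payload_size >> (8 * i));
    memcpy(trailer + 8, ENCAP_MAGIC, 8);
    return total;
}

/* At startup: check the end of the image for the trailer.
   Returns 1 and sets the outputs if a payload is present, else 0. */
int encap_find(const unsigned char *image, size_t image_size,
               const unsigned char **payload, size_t *payload_size)
{
    if (image_size < ENCAP_TRAILER_SIZE)
        return 0;

    const unsigned char *trailer = image + image_size - ENCAP_TRAILER_SIZE;
    if (memcmp(trailer + 8, ENCAP_MAGIC, 8) != 0)
        return 0;

    uint64_t len = 0;
    for (int i = 0; i < 8; i++)
        len |= (uint64_t)trailer[i] << (8 * i);
    if (len > image_size - ENCAP_TRAILER_SIZE)
        return 0;  /* recorded length cannot fit in this image */

    *payload = trailer - (size_t)len;
    *payload_size = (size_t)len;
    return 1;
}
```

In this picture, r3-one.exe would be an image for which encap_find() returns 0, and r3-two.exe would be the output of encap_append() with scripts and resources as the payload.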

This isn't that far out to accomplish. The hard part is when what you're encapping isn't script data, but compiled and executable C code...like bits from a DLL. e.g. encapping "extensions".

What some people do in this situation is to actually glue the DLL file into the executable, but extract it to the filesystem and load the extracted version. If you Google around for "using a DLL as an embedded resource" you'll find people who've done such things...but the answers you find will be from over a decade ago, because no one cares about how they ship such things anymore.
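The extraction half of that trick might look roughly like the POSIX sketch below. extract_blob_to_temp() and the /tmp/embedded-XXXXXX template are assumptions made for illustration. The loading half is left out: it would hand the resulting path to the platform's dynamic loader (dlopen() on POSIX, LoadLibrary() on Windows).

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write an embedded blob to a fresh temporary file.
   On success returns 0 and leaves the file's path in path_out.
   On failure returns -1 and removes any partially written file. */
int extract_blob_to_temp(const unsigned char *data, size_t size,
                         char *path_out, size_t path_cap)
{
    const char tmpl[] = "/tmp/embedded-XXXXXX";
    if (path_cap < sizeof tmpl)
        return -1;
    memcpy(path_out, tmpl, sizeof tmpl);

    int fd = mkstemp(path_out);  /* creates and opens a uniquely named file */
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < size) {
        ssize_t n = write(fd, data + done, size - done);
        if (n <= 0) {            /* sketch: treat any short-circuit as fatal */
            close(fd);
            unlink(path_out);
            return -1;
        }
        done += (size_t)n;
    }

    if (close(fd) != 0) {
        unlink(path_out);
        return -1;
    }
    return 0;
}
```

The caller is responsible for deleting the file once the library has been loaded (or at exit), which is part of what makes this approach awkward.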

Making Encap A Dependency Is Probably Unwise...

It isn't going to be a terribly big win for bootstrap if it can't be used to pull out or put in extensions.

I don't think it's wise to pursue handcrafted C de-encapping. In fact there's no way I'd be writing any kind of encap code right now if it weren't already made. Pretty much the only reason we have the usermode encapping around is that Atronix was using it, and I wanted to keep the feature while cutting it out of the C. It hasn't been tossed entirely because it functions as test code.

We could make a token two-step build (a phase one executable, which is then used to build a phase two executable with encapped data in it).

But it seems what we might want more is an easy option to not build in encapping whatsoever, and have more control over options at build time than the current list of extensions.

For the limited audience looking at desktop builds--I imagine the answer will be that if you want a differently-sized r3.exe, you do it with a C compiler and ticking different boxes. Or you build everything as a DLL and accept it's not all one file.
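"Ticking different boxes" at compile time could be as simple as preprocessor switches. The following is only a sketch: WITH_CONSOLE, WITH_ENCAP, and the feature list are hypothetical names, not actual Ren-C build options.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical build switches; a build could override the defaults with
   compiler flags such as -DWITH_CONSOLE=0 or -DWITH_ENCAP=1. */
#ifndef WITH_CONSOLE
#define WITH_CONSOLE 1
#endif
#ifndef WITH_ENCAP
#define WITH_ENCAP 0
#endif

/* Only the features that were switched on end up in the executable. */
static const char *const Builtin_Features[] = {
    "core",
#if WITH_CONSOLE
    "console",
#endif
#if WITH_ENCAP
    "encap",
#endif
};

/* Returns 1 if the named feature was compiled in, else 0. */
int has_builtin_feature(const char *name)
{
    size_t count = sizeof Builtin_Features / sizeof Builtin_Features[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(Builtin_Features[i], name) == 0)
            return 1;
    }
    return 0;
}
```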