Eliminating "Internal" API Support in User Natives

I'm tackling the build process for user natives, and trying to figure out how they might actually be used. One key point is that you don't use the nuanced and complex internal API with them.

From the README I'm writing:

Symbol linkage to the internal libRebol API is automatically provided by the extension. This is the ergonomic external API (e.g. %rebol.h), not the full internal implementation API (e.g. %sys-core.h).

But the initial implementation of user natives DID make the full internal API available. This involved invoking TCC during the build process to preprocess all of the many files included by %sys-core.h into a single blob, and then packing it into the executable. It also put every API function and constant into a symbol table, along with its name, for use with tcc_add_symbol().

This predated libRebol's modern form, so it was the only choice at the time. But now it does not make sense to let user natives call the API in %sys-core.h. And for anyone actually seeking the "optimal efficiency" (with "maximal effort") that the core API can offer: TCC is not an optimizing compiler in the first place. Any serious larger-scale initiative would want to use the extension model, with a "fancier" compiler and debugger on hand.

libRebol also has a drastically smaller footprint for the header (as %rebol.h is a single standalone file). Not having a large table of every internal API and constant in the system--along with its name--saves space as well. So %rebol.h is what is used, as the most practical option.

It's a good change, along with several other good changes in that README that should push the feature along (and rein it in so it's not holding the build process hostage to a forked version of TCC that I never quite knew how to build).

Because the forum is kind of a "development notebook", I'm taking some of the code that's getting the axe and putting it here with some commentary. This makes it easier to link to and talk about, or find later, vs. having it disappear into git archive history without an explanation.

Here is the rebmake object code that used to be part of the monolithic build process for making %sys-core.i. The concept was that TCC would be invoked for only its preprocessing step, spitting out code that could be pasted into the front of a user native compilation to give it access to the internal API:

sys-core-i: make rebmake/object-file-class [
    compiler: make rebmake/tcc [
        exec-file: cfg-tcc/exec-file
    ]
    output: %prep/include/sys-core.i
    source: src-dir/include/sys-core.h
    definitions: join-of app-config/definitions [ {DEBUG_STDIO_OK} ]
    includes: append-of app-config/includes reduce [tcc-dir tcc-dir/include]
    cflags: compose [{-dD} {-nostdlib} (opt cfg-tcc/cflags)]
]

It would be called as sys-core-i/command/E, where the /E refinement maps to -E, the POSIX-specified C compiler switch for preprocess-only:

Copy C-language source files to standard output, executing all preprocessor directives; no compilation shall be performed. If any operand is not a text file, the effects are unspecified.

This was a little weird, because it meant you had to have a TCC compiler on hand to do this preprocessing during the build (even if you were building the main interpreter with gcc or MSVC), since you'd want the file preprocessed the way TCC would do it.
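
To make the "preprocessed the way TCC would do it" point concrete, here is a minimal sketch of how a header can take a different path depending on which compiler runs the preprocessor. The predefined macros are the real ones each compiler sets; COMPILER_FAMILY and compiler_family() are names made up for the illustration and are not part of %sys-core.h.

```c
#include <assert.h>

/* Which branch survives depends on who is preprocessing.  __TINYC__ is
 * predefined by TCC, _MSC_VER by MSVC, and __GNUC__ by gcc (and by
 * clang, for compatibility). */
#if defined(__TINYC__)
    #define COMPILER_FAMILY "tcc"
#elif defined(_MSC_VER)
    #define COMPILER_FAMILY "msvc"
#elif defined(__GNUC__)
    #define COMPILER_FAMILY "gnu"
#else
    #define COMPILER_FAMILY "unknown"
#endif

/* In preprocess-only output, just one of the literals above remains;
 * the other branches are discarded and cannot be recovered later. */
const char *compiler_family(void)
{
    static const char family[] = COMPILER_FAMILY;
    assert(sizeof family > 1);  /* every branch yields a non-empty name */
    return family;
}
```

A .i file produced by gcc or MSVC would have the "gnu" or "msvc" branch baked in, which is why the build needed TCC itself for this step.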

Bear in mind, this produced quite a big file from all the multitudes of files in the internal implementation of the interpreter. When you compare with the tiny and tight %rebol.h, %sys-core.i was quite the beast!

But anyway, the next step was to pack that up into a big const data blob that was put into the interpreter, so it could paste it in at the beginning of any COMPILE text. But Shixin apparently found it needed a bit of massaging...this may be relevant:

print "------ Building embedded header file"
args: parse-args system/options/args
output-dir: system/options/path/prep
mkdir/deep output-dir/core

inp: read fix-win32-path to file! output-dir/extensions/tcc/sys-core.i
replace/all inp "// #define" "#define"
replace/all inp "// #undef" "#undef"
replace/all inp "<ce>" "##" ;bug in tcc??

; remove "#define __BASE_FILE__" to avoid duplicates
remove-macro: func [
    return: <void>
    macro [any-string!]
    <local> pos-m inc eol
][
    macro: to binary! macro
    if pos-m: find inp macro [
        inc: find/reverse pos-m to binary! "#define"
        eol: find pos-m to binary! newline
        remove/part inc (index? eol) - (index? inc)
    ]
]

remove-macro "__BASE_FILE__"

; remove everything up to DEBUG_STDIO_OK
; (they all seem to be builtin macros)
remove/part inp -1 + index? find inp to binary! "#define DEBUG_STDIO_OK"

comment [write %/tmp/sys-core.i inp] ; for inspection

e: (make-emitter
    "Embedded sys-core.h" output-dir/prep/extensions/tcc/tmp-embedded-header.c)

e/emit {
    #include "sys-core.h"

    extern const REBYTE core_header_source[];
    const REBYTE core_header_source[] = {
        $<Binary-To-C Join-Of Inp #{00}>
    };
}

print "------ Writing embedded header file"
e/write-emitted
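
For readers who don't have the emitter code in their head, the Binary-To-C step above amounts to turning raw bytes into a C initializer list. The following is only a sketch of that idea in C: emit_c_bytes() is a hypothetical stand-in, not the actual helper, and the exact output formatting of the real one may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for Binary-To-C: writes the bytes of `data` as
 * a comma-separated C initializer list into `out`, followed by a 0x00
 * terminator (mirroring the `Join-Of Inp #{00}` in the script above).
 * Returns the number of characters written, or -1 if `out` is too small. */
int emit_c_bytes(const unsigned char *data, size_t len,
                 char *out, size_t out_size)
{
    size_t used = 0;
    size_t i;

    assert(data != NULL && out != NULL);

    for (i = 0; i <= len; ++i) {
        /* the final pass (i == len) emits the appended NUL byte */
        unsigned int byte = (i < len) ? data[i] : 0x00;
        int n = snprintf(out + used, out_size - used,
                         (i < len) ? "0x%02X, " : "0x%02X", byte);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

The appended zero byte presumably exists so that core_header_source can be treated as an ordinary NUL-terminated C string when it is pasted in front of the COMPILE text.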

While it's not all that complicated to do, it isn't a good general solution to the problem of the embedded TCC wanting to find include files. Imagine for instance that you were to want to use a snippet of code you got from someone else in your compilation, and it was in a disk file. What if that file said #include "rebol.h" in it? An approach like this wouldn't be able to find and use that.

So the current strategy is to ship embedded files in the extension "as-is". I've posted to a mailing list about making it possible to use those without extracting them to a temporary directory:

http://lists.nongnu.org/archive/html/tinycc-devel/2018-12/msg00011.html

But in the meantime, there's no shame in extracting them into a temp dir. It helps avoid the trap of getting stuck reimplementing a preprocessor and linker, and paves the way for user natives being able to do more than trivial demos.

For example: the Windows build would likely want to implicitly provide windows.h and a linker .def file, and you should be able to use other .c files in your build that #include "windows.h".
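
The temp-directory approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the extension's code: it is POSIX-only (mkdtemp), the /tmp/user-native- prefix and the extract_header() name are invented for the example, and a Windows build would need a different mechanism.

```c
#define _XOPEN_SOURCE 700  /* for mkdtemp() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes `len` bytes of an embedded header into a fresh temporary
 * directory under the given file name.  On success, `dir_out` holds the
 * directory path and 0 is returned; on failure, -1. */
int extract_header(const char *file_name,
                   const unsigned char *data, size_t len,
                   char *dir_out, size_t dir_size)
{
    char path[512];
    FILE *f;
    size_t written;

    assert(file_name != NULL && data != NULL && dir_out != NULL);

    /* mkdtemp() replaces the XXXXXX suffix with a unique name */
    if (snprintf(dir_out, dir_size, "/tmp/user-native-XXXXXX")
            >= (int)dir_size)
        return -1;
    if (mkdtemp(dir_out) == NULL)
        return -1;

    if (snprintf(path, sizeof path, "%s/%s", dir_out, file_name)
            >= (int)sizeof path)
        return -1;

    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    written = fwrite(data, 1, len, f);
    if (fclose(f) != 0 || written != len)
        return -1;
    return 0;
}
```

Once the file is on disk, the directory can be handed to libtcc with tcc_add_include_path(), after which an #include "rebol.h" in any file taking part in the compilation resolves the ordinary way.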

One consequence of including the entirety of %sys-core.h for use by user natives was that the header alone wasn't enough: pointers to all the non-inline functions (and global constants) also had to be registered by name with tcc_add_symbol(). That's a lot of stuff to include in the executable.

To build up that table, there was a hook added to @Brett's API analyzer callback. This is something that gets called for every function in the system that's exported to the rest of the core. It then emitted a symbol table of C function pointers, and C data pointers.

So first, it emitted some boilerplate definitions:

e-syms: make-emitter "Function Symbols" output-dir/core/tmp-symbols.c

e-syms/emit {
    #include "sys-core.h"

    #define SYM_CFUNC(x) {#x, (CFUNC*)(x)}
    #define SYM_DATA(x) {#x, &x}

    struct rebol_sym_cfunc_t {
        const char *name;
        CFUNC *cfunc;
    };

    struct rebol_sym_data_t {
        const char *name;
        void *data;
    };

    extern const struct rebol_sym_cfunc_t rebol_sym_cfuncs [];
    const struct rebol_sym_cfunc_t rebol_sym_cfuncs [] = ^{
}

That got it ready to start emitting the function symbols. So on each emit-proto callback during the scanning of the C source files for the interpreter, it would add a line, with a slightly different form if something was a "generic dispatcher" for a type (marked REBTYPE):

if "REBTYPE" = proto-parser/proto.id [
    e-syms/emit [the-file proto-parser] {
        /* $<The-File> */ SYM_CFUNC(T_$<Proto-Parser/Proto.Arg.1>),
    }
] else [
    e-syms/emit [the-file proto-parser] {
        /* $<The-File> */ SYM_CFUNC($<Proto-Parser/Proto.Id>),
    }
]
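
The reason for the special case is that a REBTYPE definition doesn't spell out the function's real name in the source; the macro pastes a T_ prefix onto its argument. Here is a reduced sketch of that. The REBTYPE signature below is a simplified, hypothetical one (the real macro takes different parameters), and the CFUNC typedef and entry_is_named() are placeholders for the example.

```c
#include <assert.h>
#include <string.h>

typedef void (CFUNC)(void);  /* placeholder generic function type */

/* Simplified, hypothetical REBTYPE: the real signature differs, but the
 * T_ prefix is token-pasted onto the argument in the same way. */
#define REBTYPE(n) int T_##n(int verb)

#define SYM_CFUNC(x) {#x, (CFUNC*)(x)}

struct rebol_sym_cfunc_t {
    const char *name;
    CFUNC *cfunc;
};

/* A source scanner sees the id "REBTYPE" and the argument "Integer",
 * but the symbol the compiler actually defines is T_Integer. */
REBTYPE(Integer) { return verb + 1; }

static const struct rebol_sym_cfunc_t demo_entry = SYM_CFUNC(T_Integer);

/* Returns nonzero if the table entry carries the expected name */
int entry_is_named(const struct rebol_sym_cfunc_t *e, const char *name)
{
    assert(e != NULL && name != NULL);
    return strcmp(e->name, name) == 0;
}
```

So the first branch rebuilds T_ plus the first argument, while the else branch can use the prototype's id directly.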

Then it would terminate the list of function symbols, and begin the list of global data pointer symbols:

e-syms/emit {
    {NULL, NULL} /* Terminator */
^};

/* Globals from sys-globals.h */
extern const struct rebol_sym_data_t rebol_sym_data [];
const struct rebol_sym_data_t rebol_sym_data [] = ^{
}

Enumerating over function prototypes was already done for other reasons (to make a big system-wide include file, so any part of the core could call any other part without worrying about separately updating prototypes from definitions). But enumerating over the global data symbols appears to have been added just for the TCC extension. Despite that being the only use, I think I'll leave it in the build process "just in case".

Anyway, for each "id" that the parse of %sys-globals.h found, it emitted a line:

e-syms/emit 'id {
    SYM_DATA($<Id>),
}

Finally it wrote the last lines and emitted the file:

e-syms/emit {
    {NULL, NULL} /* Terminator */
^};
}

e-syms/write-emitted
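
The consuming side isn't shown in this post, but the shape of the generated tables suggests a loop along these lines. Treat it as a sketch: the Demo_ functions and global, register_all(), and the recording callback are all invented for the example, and the callback type only mirrors the shape of libtcc's tcc_add_symbol(state, name, value) so the loop can be exercised without linking TCC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (CFUNC)(void);  /* placeholder generic function type */

#define SYM_CFUNC(x) {#x, (CFUNC*)(x)}
#define SYM_DATA(x) {#x, &x}

struct rebol_sym_cfunc_t { const char *name; CFUNC *cfunc; };
struct rebol_sym_data_t { const char *name; void *data; };

/* Same shape as tcc_add_symbol(), with the state as an opaque pointer;
 * a thin wrapper would be needed to adapt the real function. */
typedef int (*add_symbol_fn)(void *state, const char *name, const void *val);

/* Hypothetical stand-ins for core functions and a global */
static int Demo_Add(int a, int b) { return a + b; }
static int Demo_Negate(int a) { return -a; }
static int Demo_Counter = 0;

static const struct rebol_sym_cfunc_t demo_cfuncs[] = {
    SYM_CFUNC(Demo_Add),
    SYM_CFUNC(Demo_Negate),
    {NULL, NULL} /* Terminator */
};

static const struct rebol_sym_data_t demo_data[] = {
    SYM_DATA(Demo_Counter),
    {NULL, NULL} /* Terminator */
};

/* Recording callback used in place of libtcc: remembers the last name */
static char last_name[64];
static int record_symbol(void *state, const char *name, const void *val)
{
    (void)state;
    (void)val;
    strncpy(last_name, name, sizeof last_name - 1);
    return 0;
}

/* Walks both tables up to their terminators, registering each entry.
 * Returns the number of symbols registered, or -1 on the first failure. */
int register_all(add_symbol_fn add, void *state)
{
    int count = 0;
    const struct rebol_sym_cfunc_t *f;
    const struct rebol_sym_data_t *d;

    assert(add != NULL);

    for (f = demo_cfuncs; f->name != NULL; ++f, ++count) {
        /* function-to-object pointer cast: not strict ISO C, but it is
         * what a void*-based symbol API needs on common platforms */
        if (add(state, f->name, (const void *)f->cfunc) < 0)
            return -1;
    }
    for (d = demo_data; d->name != NULL; ++d, ++count) {
        if (add(state, d->name, d->data) < 0)
            return -1;
    }
    return count;
}
```

Calling register_all(record_symbol, NULL) visits three entries here. With the full %sys-core.h surface, the same loop would run over every exported function and global, and each entry carries its name as a string, which is where the executable size went.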