"On Building Ren-C With C++ Compilers"

hostilefork · May 23, 2016, 4:00am

Ren-C is a C codebase, but it can optionally also be built with a C++ compiler. Building as C++ adds extra compile-time checking of the interpreter's implementation. This helps catch bugs that a C build could only check at runtime. Plus--in the debug build only--the C++ build adds some extra runtime assertions that do not affect interpreter semantics.

This document attempts to outline reasons for the choices in tooling, as it pertains to C and C++.

Why use C instead of requiring C++ and using the C++ standard library?

Basically all languages measure their success against others with things like:

(1) How large is the source code a user must write for a given problem
(2) How big in bytes is the installation for a program made with the system
(3) How fast are programs written in the idiomatic style of the language
(4) How well does the language help protect against common mistakes or bugs

If today's Rebol executable were written in C++, it would not affect (1) (since the interpreter would behave the same). Leveraging the C++ language features and the well-vetted C++ standard library would very likely improve Rebol on points like (2), (3), and (4). After all, C++ was designed to not mean compromises on performance or size.

But the Rebol (and Red) languages have unusual ideas of what counts as a "metric of success" or "signal of failure". One metric is the total complexity of the system, when the toolchain used to build the system is included in the equation. This is to say that Rebol's "complexity footprint" must account for the tools used to build it. (This concern can also be found in the Forth community.)

To understand a self-hosting C compiler, you really only have to understand something like "TCC". Its modern forms are around 300 kilobytes on x86, and that is enough to build Ren-C. Yet the smallest full C++ compiler executables are around 3 megabytes. The size of C++ is directly in proportion to the depth of nuances, features, and understandings required. Even very straightforward code, like the implementation of std::vector for gcc, is quite deep about the features of the compiler that it depends upon to work. This is considered a "cost" to be avoided.

Why worry about building a C program as C++?

The practice of building C programs as C++ may sound strange, but it is not. In fact, the GCC compiler itself is written in a generally-compatible subset of C and C++. It is also recommendation CPL.2 from the C++ Core Guidelines:

C rule summary:

CPL.1: Prefer C++ to C

CPL.2: If you must use C, use the common subset of C and C++, and compile the C code as C++

CPL.3: If you must use C for interfaces, use C++ in the code using such interfaces

For Ren-C, the advantages started out small--with enhanced cases of type checking. Eventually this grew into more serious usage as a static analysis tool.
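As a small illustration of the kind of type checking a C++ build adds for free, consider enums. In C, enum values are plain integers, so passing the wrong enum to a function compiles silently; C++ rejects it. The sketch below uses hypothetical `Color` and `Shape` enums (not taken from Ren-C) and spells out the C++ rules as `static_assert`s, which need C++11:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical enums, for illustration only (not from the Ren-C sources).
enum Color { COLOR_RED, COLOR_GREEN };
enum Shape { SHAPE_CIRCLE, SHAPE_SQUARE };

inline Color Other_Color(Color c) {
    return c == COLOR_RED ? COLOR_GREEN : COLOR_RED;
}

// An enum still converts to int implicitly in C++...
static_assert(std::is_convertible<Color, int>::value,
    "enum -> int is implicit");

// ...but the reverse directions, which C accepts, are compile errors in C++.
static_assert(!std::is_convertible<int, Color>::value,
    "int -> enum needs a cast");
static_assert(!std::is_convertible<Shape, Color>::value,
    "one enum does not convert to another");

// Usage: Other_Color(SHAPE_CIRCLE) would compile as C, but not as C++.
inline void Check_Other_Color(void) {
    assert(Other_Color(COLOR_RED) == COLOR_GREEN);
    assert(Other_Color(COLOR_GREEN) == COLOR_RED);
}
```

The same source still builds as C if the `static_assert` lines and the `<type_traits>` include are put behind `#ifdef __cplusplus`; the C build simply loses the checks.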

Next, some very selective runtime checks that lean on C++ were added to the debug build. These checks take advantage of something C++ can do which some C developers hate: namely that you can write a simple thing like *a = *b; and it might actually insert a function call to implement overloaded assignment. When this "hated" property is limited to only adding asserts, it saves the codebase from becoming littered with:

 assert(Can_Assign_Values_Debug(a, b));
 *a = *b;

Having the C++ build do it "invisibly" from just *a = *b; is lighter weight--and keeps the focus on the main program logic. So some asserts do work like this, if they are not that critical. However, the C build uses ordinary macros such as CHECKED_ASSIGN(a, b) if it's important enough to warrant disruption in the source at every callsite.
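A minimal sketch of how such an "invisible" check could be wired up. The `Cell` struct, its fields, and the body of `Can_Assign_Values_Debug()` are all hypothetical here; only the function's name comes from the snippet above, and Ren-C's real checks are not shown in this post:

```cpp
#include <cassert>

// Hypothetical cell type; the real Ren-C cell layout is different.
struct Cell {
    int kind;       // assumed type tag
    long payload;   // assumed value
    bool writable;  // assumed flag that the debug check inspects

#if !defined(NDEBUG)
    // Debug build only: route assignment through a checking operator.
    Cell& operator=(const Cell& other);
#endif
};

// Hypothetical predicate: this version only refuses read-only destinations.
inline bool Can_Assign_Values_Debug(const Cell* dest, const Cell* src) {
    (void)src;
    return dest->writable;
}

#if !defined(NDEBUG)
inline Cell& Cell::operator=(const Cell& other) {
    assert(Can_Assign_Values_Debug(this, &other));
    // Copy every member, exactly as the default assignment would, so the
    // check adds an assert without changing behavior.
    kind = other.kind;
    payload = other.payload;
    writable = other.writable;
    return *this;
}
#endif

// The callsite reads the same in every build.
inline void Copy_Cell(Cell* a, const Cell* b) {
    *a = *b;
}
```

With `NDEBUG` defined the operator disappears and `*a = *b;` is an ordinary memberwise copy, so release builds pay nothing for the check.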

At this point, it would be fair to say that some critical Ren-C features (e.g. Specific Binding and its outgrowths) would have been impractical to implement without the extra compile-time checks from C++. It has really been an interesting study in its own right, of how the benefits of C++ can be applied while sticking to the domain of C. For instance, see "C Casts for the Masses".

How old (or new) a C compiler can be used to build Ren-C?

When Rebol's "R3-Alpha" codebase was released 12-Dec-2012, it was billed as mostly C89-compatible...which is to say that it could be built with compilers that obeyed a specification formally published in 1989. (Some compilers had the features prior to that.) Yet immediately obvious was the use of "C++-style" comments, e.g. // comment instead of /* comment */, which was not adopted formally by C until 1999.

On the question, Carl Sassenrath said:

"In the past it's been determined by practicality: what actually works over the widest range. (I can usually port REBOL3/core to any target machine in 5 minutes.)"

"This hasn't been easy, and it's why there are various C restrictions. I agree... many of the C++isms are nice, but can dramatically reduce portability to older boxes. It's a balancing act."

Since the decisions arose pragmatically, there was no standard to point to. So when Ren-C was started, it whipped the code into shape for C89 compliance, which could be checked with GCC's -std=c89 -pedantic switches. Every warning was attended to, and that included things like adhering to the 509 character maximum for string literals in generated code.

There were two exceptions of features that are not in the C89 standard which were left in early Ren-C:

C++-style // comments, mentioned above
The expectation that a long long datatype existed (see -Wno-long-long), because of the requirement of a 64-bit integer type

Despite attempts to keep these as the only two "non-C89" features, the long long requirement did set a sort of baseline "minimum era" for the expected compiler. And one feature that one could reasonably assume of any 64-bit capable compiler would be declarations in mid-scope, i.e. after statements:

 void foo() {
     int x = 10; // legal in C89
     int y1;
     if (Some_Check()) {
         printf("no declarations after statements.\n");
         return;
     }
     y1 = 20;
     int y2 = 20; // illegal in C89
 }

R3-Alpha almost always used "y1-style" initialization, even though "x-style" was legal. This was perhaps because it is a pain when adding and removing statements to have to switch assignments back and forth between "y1-style" and "y2-style", so just sticking to the always-legal form of separating declarations and assignments was preferred. But not only was there a mixture, the lack of compiler enforcement meant a few cases of "y2-style" had crept in.

So Ren-C originally removed those few "y2-style" stragglers. But it's preferable to do initialization at the point of declaration, and any compiler supporting 64-bit integers almost certainly would be able to do this too. That led this to be one of the -pedantic C89 rules that is disregarded.

The other common C feature that isn't in C89 but exists in most older compilers is support for inline functions. However, it was not until C99 that "inline" was standardized...MSVC used __inline, for instance.

R3-Alpha used a small amount of inlining, and preferred preprocessor macros. But it is common knowledge that if a C macro is "function-like", e.g. #define MIN(a,b) ((a) < (b) ? (a) : (b)) then there can be problems if evaluating an argument has a side-effect. And even if evaluating an argument twice doesn't have side effects, it could be computationally wasteful to evaluate it more than once. It's also easy to forget parentheses around expressions, and long macros are ugly when broken across lines. There are a lot of reasons why "preprocessor macros are bad".
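The double-evaluation problem is easy to demonstrate. The sketch below uses the MIN macro from the paragraph above, plus a hypothetical `Next_Value()` helper whose only job is to have a countable side effect. It sticks to the common subset of C and C++:

```cpp
#include <assert.h>

#define MIN(a,b) ((a) < (b) ? (a) : (b))

static int g_calls = 0;  // hypothetical counter, for illustration only

// Hypothetical helper with a side effect: returns 1, 2, 3... on each call.
static int Next_Value(void) {
    g_calls = g_calls + 1;
    return g_calls;
}

inline static int Min_Int(int a, int b) {
    return a < b ? a : b;
}

// Macro form: the argument expression is pasted in twice, so the helper
// runs once for the comparison and once more to produce the result.
static int Count_Macro_Evaluations(void) {
    int result;
    g_calls = 0;
    result = MIN(Next_Value(), 100);
    assert(result == 2);  // the value from the second call comes back
    return g_calls;
}

// Function form: the argument is evaluated exactly once, before the call.
static int Count_Function_Evaluations(void) {
    int result;
    g_calls = 0;
    result = Min_Int(Next_Value(), 100);
    assert(result == 1);
    return g_calls;
}
```

Here `Count_Macro_Evaluations()` returns 2 and `Count_Function_Evaluations()` returns 1, even though the two callsites look nearly identical.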

Not all C function-like macros can be replaced by C inline functions. Since C lacks templates you lose generality, e.g. inline int min(int a, int b) { return a < b ? a : b; } only works for int and not float. Yet avoiding inline functions in Rebol may have less to do with a lack of compiler support, and more to do with the fact that it is "only a hint". (The ultimate decision of whether inline is used is left to the optimizer.)
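To make the lost generality concrete, here is a sketch contrasting the C-style function with what a C++ template would offer. The names `Min_Int` and `Min_Generic` are made up for this example, and the template half is C++ only, so it could not appear in code that must also build as C:

```cpp
#include <cassert>

// C-style: fixed to int. Other types need their own copies of the function.
inline static int Min_Int(int a, int b) {
    return a < b ? a : b;
}

// C++ only: one definition covers any type that supports operator<
template <typename T>
inline T Min_Generic(T a, T b) {
    return a < b ? a : b;
}

// Usage: the template keeps the argument's type; the int version cannot.
inline void Check_Min_Examples(void) {
    assert(Min_Int(3, 7) == 3);
    assert(Min_Generic(3, 7) == 3);
    assert(Min_Generic(2.5, 1.5) == 1.5);
}
```

A function-like macro gets the same generality in plain C, which is one reason macros stay attractive despite their hazards.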

Ren-C shifted to use of the specific combination inline static, which means that if inline is not available as a language keyword, it can even be defined as nothing. Then the functions are just static, and compiled into each translation unit.
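A sketch of what that fallback could look like. The detection condition is an assumption made for this example (any C++ compiler, or a C compiler reporting C99 or later, is taken to have the keyword); Ren-C's actual configuration logic may differ, and `Add_One`/`Add_Two` are hypothetical functions:

```cpp
#include <assert.h>

// Assumption: a compiler that is neither C++ nor C99-or-later lacks `inline`.
#if !defined(__cplusplus) \
    && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 199901L)
    #define inline  /* expands to nothing: the functions are plain `static` */
#endif

// With the keyword this is `inline static`; without it, just `static`.
// Either way each translation unit gets its own private copy.
inline static int Add_One(int x) {
    return x + 1;
}

inline static int Add_Two(int x) {
    return Add_One(Add_One(x));
}

// Usage: callers cannot tell which definition of `inline` was in effect.
inline static void Check_Add(void) {
    assert(Add_One(1) == 2);
    assert(Add_Two(1) == 3);
}
```

Because the functions are static, defining them in a header never causes duplicate-symbol link errors, whether or not the compiler actually inlines them.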

So the full list of non-C89 features used is:

C++ style comments //
long long
Declarations after statements
inline

Other C99 or C11 features are not used.

How old a C++ compiler can build Ren-C as C++?

While the C build tries to support very old compilers, the C++ checks do not. It is not worth maintaining compatibility with C++98 for the debug checks that the C++ build does.

(Note: It wouldn't necessarily be hard to have conditional usage of C++11 features to make most of it work in C++98, there's just little point in the added complexity and #ifdefs. It is not intended that release builds be made using C++ compilation.)

In terms of newer compilers, C++14 and C++17 have been tested and work under GCC and Clang.

Can a C-built Ren-C library work with a C++-built one?

The external library "libRebol" should not notice a difference. But the system internals ("sys-core.h") do compile differently, so an extension which can pick apart cells and such must be compiled with C++ if the core it runs against is also compiled with C++.

bradrn · January 30, 2024, 2:02am

OK, I got around to watching this talk. It’s pretty cool! (And nice to see who I’m talking to, finally!)

But I’m not convinced it’s a matter of ‘dependency footprint’, per se. Rather it’s about starting from a very small core, and bootstrapping the rest of the environment from there. The important thing is that the core itself is small and minimal, not that it’s implemented using minimal dependencies.

hostilefork · January 30, 2024, 7:40pm

Well, you don't get to tell the practitioners of a religion what counts in their religion or doesn't, it's up to the practitioners... and as the term "Rebol" has been used, there's been more than a little religion in it.

Without that, then I'd certainly prefer to base the code on C++. As I say above:

But not only do we want to build as C, the hope is to one day bootstrap on a dialect like Red/System... and have our "user-natives" implement their bodies as blocks of that dialect, instead of strings of C code, as the TCC extension does today.

(I'd like to see the system dialect extended to implement the useful features that the C++ build offers, without needing to pull in the whole of the C++ language. Though I haven't put much thought into what those features would look like.)

If you ask people who use Forth you'll get similar reactions, that constructions of the language which don't honor the spirit of dependency control in the build process "aren't Forth"... that it is about more than a simple semantics of a usermode specification that defines the language.

bradrn · January 31, 2024, 1:10am

But I should hope I’m allowed to write commentary on their words!

I do think we’re coming at the same thing from different angles. I’m more interested in language design: it’s a matter of choosing primitives carefully and keeping them simple, which incidentally ensures you can implement them using minimal dependencies. But you’re more interested in implementation: it’s about implementing the language in a minimal way, which incidentally forces the primitives to be simple and few. The result is the same, but we’re more interested in different parts of it.