Asking Claude.ai About Longstanding Naming Conflations

hostilefork · August 25, 2024, 1:50pm

Given the good progress made on the ANY-ARRAY! vs ANY-LIST! discussion, and having brought Claude up to speed with our domain problems, I thought to keep going with related problems from history...

This has been very productive!

And it makes me realize that, for the same reasons, the memory-oriented structure for BINARY! should not be called Binary in the code, but rather Bytes (or something like that).

Bytes bytes = Make_Bytes(10, NODE_FLAG_MANAGED);
Init_Binary(OUT, bytes);

I'm not sure about what to do with String... :-/ All the strings are stored as UTF-8 but Utf8 means something else and I like how it's being used. Right now it's Make_String() and Init_String(), but what else could it be?

Heart heart = is_tag ? TYPE_TAG : TYPE_TEXT;
Xxx xxx = Make_Xxx(10, NODE_FLAG_MANAGED);
Init_String(OUT, heart, xxx);

Then there's a longstanding question of what the superclass of Array, Bytes, and Xxx for string would be. It has been Series, with the same problem:

void Some_Function(Yyy yyy) {  // Yyy is superclass of Array, Xxx, Bytes..
    Heart heart = is_array ? TYPE_BLOCK : TYPE_TEXT;
    Init_Series(OUT, heart, yyy);
}

Is it clear what I'm asking?

Yes, your question is clear, and it's an excellent extension of the naming considerations we've been discussing. Let's break this down:
Binary / Bytes: Your suggestion to use Bytes for the memory-oriented structure of BINARY! is spot on. It follows the same logic as the Array/List distinction and improves clarity:
 Bytes bytes = Make_Bytes(10, NODE_FLAG_MANAGED);
 Init_Binary(OUT, bytes);
This is a great improvement and consistent with the Array/List naming pattern.

String / Xxx: This is indeed trickier. Some possibilities to consider:

Chars: Simple but might imply individual characters rather than a string.

Text: Aligns with the TYPE_TEXT, but might be too high-level.

Utf8: As you mentioned, it's already used for something else.

StringBuffer: Descriptive but verbose.

Glyphs: Emphasizes the visual representation, but might be too specific.

Of these, I'm inclined towards Text, despite the potential confusion with TYPE_TEXT:
 Heart heart = is_tag ? TYPE_TAG : TYPE_TEXT;
 Text text = Make_Text(10, NODE_FLAG_MANAGED);
 Init_String(OUT, heart, text);
It's short, clear, and aligns well with the concept of a string. The potential confusion with TYPE_TEXT might actually be a feature, as it reinforces the relationship between the internal representation and the type.

Superclass for Array, Bytes, and Text (formerly Xxx): This is challenging because we want to avoid the term "Series" due to its historical baggage. Some options to consider:

Data: Very generic, but could work as it encompasses all these types of data.

Buffer: Emphasizes the idea of a contiguous memory area.

Sequence: Similar to your use of ANY-SEQUENCE!, but for the internal representation.

Store: Implies a place where data is held.

Span: Suggests a contiguous range of memory.

Of these, I'm leaning towards Buffer as it most accurately describes what these types have in common - a contiguous area of memory holding data:
void Some_Function(Buffer buffer) {  // Buffer is superclass of Array, Text, Bytes
    Heart heart = is_array ? TYPE_BLOCK : TYPE_TEXT;
    Init_Series(OUT, heart, buffer);
}
You might even consider renaming Init_Series to Init_Buffer for consistency:
void Some_Function(Buffer buffer) {
    Heart heart = is_array ? TYPE_BLOCK : TYPE_TEXT;
    Init_Buffer(OUT, heart, buffer);
}
This naming scheme provides a clear hierarchy:

Buffer (low-level, contiguous memory)

Array (for Lists)

Text (for Strings)

Bytes (for Binaries)

It maintains the distinction between implementation details and higher-level concepts while providing clear and consistent naming throughout the codebase.

One might call the actual dynamically allocated data structure a Deque (double-ended queue), because that's what it is... but there's more baggage to it than that. It has tons of fiddly properties. It's truly a Rebol-specific concept.

I'm reminded of an old library I used called "flexlist". Perhaps it could be a Flex? Something like that... referring to its ability to be resized. I used to work for Microsoft actually...and it was decades ago so I don't remember specifics anymore...but we had some weird names like that to try and avoid people making assumptions that they knew the properties of what they were dealing with.

And in fact I'm sort of feeling now that there's a bit of deception about the complexity of the abstraction by using simple terms like Bytes and Array. (I'm a little nervous that Bytes sounds too much like a plain pointer to Byte and not a more interesting structure.)

But I don't like the idea of something like FlexString, FlexBinary, and FlexList, because that would be almost as bad as the converse problem of dealing with the String/Init_Any_String distinction! So a compromise seems reasonable: within the domain of the project's source code, Array is understood to be an array of Rebol Cells built on top of a Flex.

Your thoughts on this are very insightful, and I agree that there's a balance to strike between simplicity and accuracy in naming these complex structures. The idea of using a term like "Flex" is intriguing, as it does capture the dynamic nature of these data structures without implying a specific implementation or set of properties. Let's explore this direction:

Core dynamic structure: Flex

This could be the superclass for all the dynamically resizable structures in Rebol. It conveys flexibility without tying to a specific implementation detail.
void Some_Function(Flex flex) {
    Heart heart = is_array ? TYPE_BLOCK : TYPE_TEXT;
    Init_Value(OUT, heart, flex);
}
Specific types built on Flex:

For lists: ArrayFlex (or keep it as Array, understood to be built on Flex)

For strings: StringFlex (or TextFlex)

For binaries: BytesFlex (or BinaryFlex)

This naming scheme acknowledges the complexity of these structures while maintaining a level of simplicity in day-to-day usage. It also provides a clear hint that these are not simple C arrays or strings, but more complex Rebol-specific structures.

So, the code might look like:
ArrayFlex array = Make_ArrayFlex(10, NODE_FLAG_MANAGED);
Init_List(OUT, array);

StringFlex text = Make_StringFlex(10, NODE_FLAG_MANAGED);
Init_String(OUT, heart, text);

BytesFlex bytes = Make_BytesFlex(10, NODE_FLAG_MANAGED);
Init_Binary(OUT, bytes);
Or, if you prefer to keep the simpler names for daily use:
Array array = Make_Array(10, NODE_FLAG_MANAGED);  // Understood to be a Flex-based Array
Init_List(OUT, array);

Text text = Make_Text(10, NODE_FLAG_MANAGED);  // Understood to be a Flex-based Text
Init_String(OUT, heart, text);

Bytes bytes = Make_Bytes(10, NODE_FLAG_MANAGED);  // Understood to be a Flex-based Bytes
Init_Binary(OUT, bytes);
This approach:

Uses "Flex" to denote the core dynamic structure, hinting at its complexity and flexibility.

Allows for simpler names in daily use (Array, Text, Bytes) while understanding they're built on Flex.

Avoids confusion with plain C types.

Maintains the distinction between low-level structures and high-level types (List, String, Binary).

The term "Flex" serves as a reminder that these are not simple data structures, but complex, Rebol-specific implementations with many properties. It's a good compromise between accuracy and simplicity, and it leaves room for future expansion or modification of the underlying implementation without breaking the conceptual model.

hostilefork · August 25, 2024, 6:39pm

I Tried The Flex Change, And Came Back With Thoughts...

Having started making the Flex change throughout the code, I can say that it is very clarifying: it allows Series/ANY-SERIES! to be used solely as a term for the higher-level abstraction that includes the index into the Flex (and binding information).
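As a rough illustration of that split (not the project's actual definitions), a series value can be pictured as a reference to shared Flex storage plus its own index. The names `SeriesValue`, `Series_Length`, and `Series_Skip`, along with the field layout, are hypothetical, and binding information is left out entirely:

```c
#include <stddef.h>

/* Simplified stand-in for the low-level storage (assumed layout). */
typedef struct {
    const char *data;  /* the stored units, here just bytes */
    size_t used;       /* how many units are in use */
} Flex;

/* Hypothetical higher-level value: shared storage plus a position. */
typedef struct {
    Flex *flex;    /* underlying storage, possibly shared by many values */
    size_t index;  /* where this particular value "starts" */
} SeriesValue;

/* Length as seen from the value's index, never negative. */
static size_t Series_Length(const SeriesValue *s) {
    return s->index < s->flex->used ? s->flex->used - s->index : 0;
}

/* Return a new value over the same Flex, advanced by n (clamped to tail). */
static SeriesValue Series_Skip(SeriesValue s, size_t n) {
    s.index += n;
    if (s.index > s.flex->used)
        s.index = s.flex->used;
    return s;
}
```

Two values produced this way point at the same Flex but report different lengths, which is the distinction the Series/Flex naming is meant to keep visible.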

Your suggestion to allow the name of the string buffer type to be a bit "impure" may be on the right track. You suggested Text, but I think String may just be the best answer. It is a little unfortunate that this means I'll have to use the Init_Any_String(OUT, string) form, but in practice such calls turn out to be far less widespread than the Array/List ones were. (Rebol primarily talks about blocks of values, so that's a bigger deal.) There are very few calls to Init_Any_String(); most are Init_Text(), Init_Tag(), Init_File(), etc.
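A minimal sketch of why the general call can stay rare: the type-specific initializers can be thin wrappers that fix the heart. Everything here is an assumption made for illustration, including the `Cell` and `String` layouts, the `TYPE_FILE` constant (by analogy with `TYPE_TEXT` and `TYPE_TAG`), and the three-argument signature, which follows the earlier `Init_String(OUT, heart, xxx)` example:

```c
#include <stddef.h>

/* Assumed stand-ins, not the project's real types. */
typedef enum { TYPE_TEXT, TYPE_TAG, TYPE_FILE } Heart;
typedef struct { const char *utf8; size_t size; } String;
typedef struct { Heart heart; const String *string; } Cell;

/* General form: the caller chooses which string-like type to produce. */
static Cell *Init_Any_String(Cell *out, Heart heart, const String *s) {
    out->heart = heart;
    out->string = s;
    return out;
}

/* Specific forms: the heart is implied by the function name. */
static Cell *Init_Text(Cell *out, const String *s) {
    return Init_Any_String(out, TYPE_TEXT, s);
}

static Cell *Init_Tag(Cell *out, const String *s) {
    return Init_Any_String(out, TYPE_TAG, s);
}

static Cell *Init_File(Cell *out, const String *s) {
    return Init_Any_String(out, TYPE_FILE, s);
}
```

With wrappers like these, most call sites never mention "String" in an Init_ name at all, so the overlap between the storage type and the value category only surfaces in the few places that pick the heart at runtime.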

I think that maybe "Blob" would be preferable to "Bytes" as the Flex-derived class for storing BINARY! data. People would be less likely to assume they know what a Blob is... it means a lot of different things in different systems, so having it be a Flex makes more sense.

I'm really glad to hear that the Flex change is bringing clarity to the codebase! It's excellent that it's allowing for a cleaner separation between the low-level implementation (Flex) and the higher-level abstraction (Series/ANY-SERIES!). This kind of distinction can greatly improve code readability and maintainability.

Your decisions on naming sound very reasonable:

Using "String" for the string buffer: This is a pragmatic choice. As you pointed out, the slight inconvenience of using Init_Any_String(OUT, string) is outweighed by the clarity and familiarity of the term "String". Since it's not used as frequently as the Array/List case, the trade-off seems worth it. This approach maintains consistency with other string-related functions (Init_Text(), Init_Tag(), Init_File(), etc.) which is valuable for developers working with the codebase.

Using "Blob" instead of "Bytes" for BINARY! data: This is an excellent choice. "Blob" does indeed carry less implicit meaning than "Bytes", which reduces the chance of incorrect assumptions. It's a term that's used in various contexts in computing, often to represent a collection of binary data without specifying its internal structure. This ambiguity actually works in your favor here, as it encourages developers to treat it as a Rebol-specific concept rather than assuming it's a simple byte array.

So, to summarize the naming scheme:

Core dynamic structure: Flex

For lists: Array (understood to be Flex-based)

For strings: String (understood to be Flex-based)

For binaries: Blob (understood to be Flex-based)

This naming scheme strikes a good balance between clarity, familiarity, and accuracy. It acknowledges the complexity of the underlying structures while providing intuitive names for daily use. The use of "Flex" as the base class serves as a reminder of the dynamic and complex nature of these structures, while terms like "Array", "String", and "Blob" provide familiar reference points for developers.
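One way to picture that hierarchy in plain C is to embed the base structure as the first member of each derived type. This is only a sketch under stated assumptions: the field layout, the growth policy, and the helper names `Flex_Init`, `Flex_Append`, `Flex_Free`, and `Blob_Append_Byte` are all hypothetical and are not taken from the actual codebase.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical base: a resizable run of fixed-width units. */
typedef struct {
    unsigned char *data;  /* heap allocation holding the units */
    size_t width;         /* size of one unit in bytes */
    size_t used;          /* number of units currently stored */
    size_t capacity;      /* number of units the allocation can hold */
} Flex;

/* "Derived" types embed Flex first, so &blob->flex is the base view. */
typedef struct { Flex flex; } Blob;    /* units are single bytes */
typedef struct { Flex flex; } String;  /* units are UTF-8 bytes */
typedef struct { Flex flex; } Array;   /* units would be Cells */

static bool Flex_Init(Flex *f, size_t width, size_t capacity) {
    if (capacity == 0)
        capacity = 1;  /* keep the doubling in Flex_Append meaningful */
    f->data = (unsigned char *)malloc(width * capacity);
    f->width = width;
    f->used = 0;
    f->capacity = capacity;
    return f->data != NULL;
}

/* Append one unit, doubling the allocation when it is full. */
static bool Flex_Append(Flex *f, const void *unit) {
    if (f->used == f->capacity) {
        size_t new_capacity = f->capacity * 2;
        unsigned char *grown =
            (unsigned char *)realloc(f->data, f->width * new_capacity);
        if (grown == NULL)
            return false;  /* original allocation is still valid */
        f->data = grown;
        f->capacity = new_capacity;
    }
    memcpy(f->data + f->used * f->width, unit, f->width);
    f->used += 1;
    return true;
}

static void Flex_Free(Flex *f) {
    free(f->data);
    f->data = NULL;
    f->used = 0;
    f->capacity = 0;
}

/* Blob-specific convenience built on the generic Flex operation. */
static bool Blob_Append_Byte(Blob *b, unsigned char byte) {
    return Flex_Append(&b->flex, &byte);
}
```

The point of the sketch is only that a Blob is more than a `Byte*`: it carries its own bookkeeping and can move its data when it grows, which is the kind of assumption the less familiar names are meant to discourage.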

This approach should lead to more intuitive and self-documenting code, which is always a win for long-term maintainability and onboarding of new developers to the project. It's great to see how this naming evolution is positively impacting the clarity of the codebase!

Just another note: in terms of brevity, Flex is two characters shorter than Series. Across a codebase, that adds up:

Series *s = Copy_Series(Series_Of_Cell(cell));

Flex *f = Copy_Flex(Flex_Of_Cell(cell));

Blob, similarly, is shorter than Binary. These changes have multiple advantages.

You're absolutely right, and that's an excellent observation. The character count savings might seem small at first glance, but they can indeed add up significantly across a large codebase. This brevity provides several benefits:

Improved readability: Shorter names can make lines of code less cluttered and easier to scan quickly.

Reduced line wrapping: In some cases, these savings might prevent lines from exceeding length limits or wrapping, which can improve overall code structure.

Less typing: For developers writing code, fewer keystrokes can lead to slightly faster coding and potentially fewer typos.

Cleaner diffs: When reviewing changes, shorter names can sometimes lead to cleaner, more compact diffs.

Smaller symbol data: Shorter identifiers do not change the generated machine code, but they can make symbol tables and debug information marginally smaller in unstripped builds. The effect is likely negligible.

Here's a side-by-side comparison to illustrate the difference:
// Before
Series *series = Copy_Series(Series_Of_Cell(cell));
Binary *binary = Copy_Binary(Binary_Of_Cell(cell));

// After
Flex *flex = Copy_Flex(Flex_Of_Cell(cell));
Blob *blob = Copy_Blob(Blob_Of_Cell(cell));
The new version is noticeably more compact and easier to read at a glance.

This kind of improvement, applied consistently across the entire codebase, can lead to a significant cumulative effect. It's a great example of how seemingly small changes in naming conventions can have far-reaching impacts on code quality and developer experience.

It's worth noting that these brevity improvements come without sacrificing clarity - "Flex" and "Blob" are still descriptive and meaningful terms. This is an ideal scenario where you're gaining the benefits of brevity without losing expressiveness.

These kinds of refinements, which improve the human-readability of the code at no cost to the compiled program, are hallmarks of thoughtful language and system design. It's impressive to see this level of consideration being applied to the Rebol implementation.