What deserves to be a datatype?

This is that thread.

I’ll begin by observing that in Rebol, the complexity of the lexer vs the parser is ‘reversed’ compared to other programming languages. In Rebol, the actual syntax is highly minimalistic: there are only a few constructs which provide explicit grouping, and none provide anything more than a simple list of items. By contrast, the lexer is exceedingly complicated: nearly every datatype has its own literal form, oftentimes more than one.
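
To make this concrete, here’s a small sample of literal forms as they scan in a Rebol2 console (the same literals exist in Ren-C, though some names and details have shifted):

rebol2>> type? first [%notes.txt]
== file!

rebol2>> type? first [12-Dec-2012]
== date!

rebol2>> type? first [$1.00]
== money!

rebol2>> type? first [10:30]
== time!

rebol2>> type? first [user@example.com]
== email!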

Language design ends up ‘reversed’ in a similar way. In most languages, discussion centres around questions like ‘which new syntactic constructs should we add’. By contrast, Rebol (and especially Ren-C) more often poses the question: ‘which new datatypes do we want to include, with which literal syntax?’.

At the moment, I still feel uncomfortable discussing such questions. I don’t feel that I fully understand the criteria we should use to judge whether a datatype is worth including or not. Or, more concisely, I don’t understand how to decide: what deserves to be a Ren-C datatype?


One obvious criterion is simply this: datatypes that represent common kinds of data. This is why we have things like MONEY! and FILE! and DATE! and so on. Ultimately this stems from Rebol’s heritage as a data-transfer format, but obviously these types are far more broadly useful.

Another obvious criterion is syntax which is important for programming. This gives us GROUP! and GET-WORD! and PATH! and so on. These exist as datatypes ultimately because Rebol is homoiconic, but their presence has suggested a wide range of uses beyond simple programming.
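
Again, the same kind of inspection shows that these constructs are themselves just values which can be picked out of a block (Rebol2’s PAREN! is what Ren-C calls GROUP!):

rebol2>> type? first [(1 + 2)]
== paren!

rebol2>> type? first [:foo]
== get-word!

rebol2>> type? first [foo/bar/baz]
== path!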

This accounts for most of the types in Ren-C. And, if that were all there was to it, I’d have no objections.


But, unfortunately, there are some other types whose presence is explained by neither of those criteria. As I’ve said previously, the ones which make me feel most uncomfortable are THE-* and TYPE-*. Neither of these represents a common type of data that one would want to pass around. And, with the possible exceptions of THE-WORD! and TYPE-BLOCK!, they’re basically useless in ‘regular’ programming.
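
(For concreteness, and as far as I understand the current lexer, the literal forms in question look roughly like this: @ is the sigil for the THE-* family, and & for the TYPE-* family.)

@foo           ; THE-WORD!
@[a b c]       ; THE-BLOCK!
&word          ; TYPE-WORD!
&[block]       ; TYPE-BLOCK!
&obj.lucky?    ; TYPE-TUPLE! (this one appears in a PARSE example later in the thread)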

Despite this, @hostilefork has lobbied pretty hard for both of these. Hopefully it should be clear now why I find this viewpoint confusing. I can’t say the existence of these types is problematic, as such, but I feel this indicates a gap in my understanding of the language.

The closest to an explanation I’ve found is that these types are useful in dialecting. That is, they may not be useful for programming per se, but having the syntax around is useful for constructing new languages. (For instance, using TYPE-WORD!s in PARSE dialect, or THE-WORD!s for module inclusion.) The problem with this is, as we’ve established, that there’s a huge number of syntaxes which would be ‘useful in dialecting’: clearly, this is too low a bar for deciding ‘what deserves to be a datatype’.

(And, incidentally, this also establishes that we’re quite willing to reject datatypes that don’t seem to be of sufficiently general usage.)

Another argument is simply consistency: other sigils have versions for words, blocks, tuples, etc., so THE-* and TYPE-* should as well. But this doesn’t strike me as particularly convincing — there’s nothing intrinsic in Ren-C which requires sigils to generalise to all possible types. Indeed, we’re quite willing to avoid doing so when it would make no sense. (For instance, we don’t have ISSUE-TEXT!, ISSUE-BINARY!, ISSUE-EMAIL!… we just have a single textual ISSUE! type, because doing otherwise would be silly.)

So, when all is said and done, we have a set of types which don’t seem to be of general use, and have no convincing reason to exist, but are nonetheless kept in the language. And I want to know why that is, because I can’t figure it out.

A Few Opening Thoughts...

@rgchris wrote the tagline of the 2019 conference as:

Rebol • /ˈrɛbəl/ “It’s about language […] We believe that language itself, when coupled with the human mind, gains significant productivity advantages over traditional software technologies.”

He's elsewhere given the high-level bullet points of:

  • Data/messaging come first, and words without qualifying symbols [are] the premium currency.
  • Source should as much as possible resemble the molded representation of said source loaded.

So Rebol's spiritual inspiration is more-or-less English. In that sense, it's important not to drift too far into "symbol soup" in the general practice. There's probably some sweet spot of the percentage of what your dialects can do with WORD!...and that percentage should be high. (So regardless of the merit of the underlying ideas in Raku, it's a poster child for what we don't want common Rebol code to look like.)

It is certainly possible not to like the premise, e.g. Joe Marshall rejects it, and it's easy to pick out what at least look like bad examples. But if he stopped and reflected on the reflexive and fluid mode his mind was in while writing the paragraph of English critiquing the Rebol code, he would grok that what's being pursued is to put you in that same "zone". We're trying to tap into that Language Instinct (X-bar Theory) that research has shown we all carry in our heads.

So with all this in mind, it's important to realize that Rebol embraces its outgrowth from 10-fingered creatures and QWERTY keyboards, vs. fighting that. Since the inspiration is English, it's an inevitable outcome that it's going to be at odds with the kind of clean and orthogonal model sought by languages which draw their inspiration from other places (e.g. math).

@BlackATTR has said: "[Rebols] are a bit like the family of soft-body invertebrates in the language kingdom. Their special traits don't necessarily shine in common computing domains, but... On a second level I think Domain Specific Languages remains an open frontier largely uncracked by the hidebound languages who originally mapped the territory."

There isn't a "no silliness rule" in effect. What curbs the existence of things like--say--SET-ISSUE! is trying to balance competing meanings, and pick the most useful one.

rebol2>> type? first [#this:is:a-valid-issue]
== issue!

Given that : is legal internally to the ISSUE!, that's one of the points guiding us toward ruling out SET-ISSUE! (as well as ruling out issues as words more generally), and favoring that a colon at the end is literally part of the content.
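
So, for instance (shown here in Rebol2, which as far as I know behaves the same way on this point), the trailing colon simply stays part of the issue's content:

rebol2>> to string! first [#foo:]
== "foo:"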

Note that if colons are internal to things that look word-ish, they are actually URL! (more specifically, "Uniform Resource Names")

>> type? first [urn:isbn:0451450523]
== url!

Silly or not, apostrophes are legal inside words:

rebol2>> type? first ['abc'def]
== lit-word!

rebol2>> to word! first ['abc'def]
== abc'def

I actually don't think that's silly, because I want the words.

if x is 10 [...]  ; there is debate on what this means, but I won't digress
if x isn't 10 [...]  ; natural complement

if did (match [~null~ integer!] null) [  ; did treats "boxed" nulls as truthy
    print "This one exists today, and avoids thinking there was no match."
]
if didn't match ...  ; complement of the above. 

Terminal apostrophes lead to a weird looking thing when quoted, seemingly enclosed in quotes like a string type. But my desire to be able to have "name-prime" or "name-double-prime" style words like foo' or foo'' (especially for variables that hold meta states) makes me tolerate the consequence of 'foo' and learn to read it correctly... though I'll choose @ foo' or the foo' instead when writing in source.

What does this imply for edge cases, like lone apostrophe? Rebol2 and Red call it illegal. Ren-C had a usage as quoted void which was dropped, so it's back on the table. Plain ' could be a WORD!, but then '' can't be... because that needs to be a quoted form of the single apostrophe.

As documented here recently, my own philosophy on how far we should be willing to go with WORD! has faced reckonings. The introduction of SIGIL! solved the problems with wanting : and :: as words, and said "no, they should not be words" and went in a new direction with that.

We've gone into the reasoning for why $foo: does not exist, and why arrays are the "API" for letting you pick these apart as $[foo:] or [$foo]: etc. So again it's nothing to do with avoiding silliness... it's mechanically motivated, with me simply not knowing how to implement the underlying bytes and a pleasing API for destructuring it otherwise.

In terms of prohibiting any items in the SIGIL! table, I am not by-and-large concerned about things that aren't competing for the same lexical space. If there needs to be special code in the lexer to rule something out that works fine otherwise, then in my calculus, throwing in a branch with an error "has a cost" and is (some) addition of complexity.

The error branches are there to stop ambiguities and align with the rules of the implementation--not to keep parts out of your hands.

If that sounds like it's not enough of a constraint, well... it is a big constraint. Ambiguity-wise, Ren-C has a nearly-total lexical saturation of ASCII (insofar as its rules permit). It's less ambiguous than Rebol for the most part, though it introduces some ambiguities of its own... e.g. what is ~/abc/~... is that a quasi-path!, or a plain PATH! with meta-trash in the first and last positions? (Right now this is a leading argument for ruling out path isotopes, because I want tildes in paths more than I want antiform or quasiform paths.)

Anyhoo... my advice to you would be to get a bit more experience in the medium...and "Find The Game", as we say in improv. It's of course perfectly valid and desirable to scrutinize the design and the datatypes. But if you had found the game, then I think you'd see these aspects as more of a tangential detail when weighed against bigger design issues... plus be more targeted in what things needed critique.

For myself, I'm bothered by things like:

>> $1
== $1.00

I have little use for the MONEY! type, and when it can't serve correctly as DOLLAR-INTEGER! it becomes basically completely useless to me.

Some of these questions... like preserving quote-style string vs. non... register on the needle in ways that trying to assassinate THE-BLOCK! or TYPE-TUPLE! do not... especially when I have compelling applications for them.


It’s late here, and that’s a fairly comprehensive post, so it’ll take me a while to absorb it fully. I’ll continue thinking over it until I reach some more definite conclusion.

But, until then, here’s my immediate thoughts:

You’re right; in the scheme of things, it isn’t a significant objection. But from a personal perspective, it’s important for me: it’s a place where I clearly haven’t understood the language, and that annoys me.

(Also, I suspect you misunderstood me. I singled out THE-BLOCK! and TYPE-BLOCK! as two types which do have compelling applications. It’s the other types in their family which confuse me.)

[EDIT: I got confused there, see below]

This is one of the big things I haven’t fully absorbed, I think. ‘You should be able to do most things using WORD!’ is a good summary of the aims.

As it happens, I have a (quite intense) side interest in linguistics. By and large, I strongly reject Chomskyanism, including the ‘language instinct’ idea. As for X-bar theory, that was more or less a fad which has by now long passed. (The Chomskyanists obsess over Minimalism now, though I’m sure that too will pass.)

Insofar as I subscribe to any linguistic theory at all, I tend to sympathise the most with construction grammar… which, interestingly, strikes me as being remarkably close to how we think about Rebol programs. It’s certainly a closer fit to Rebol than generative approaches are: essentially, ‘building sentences [or programs] out of smaller, idiomatic parts’.

Perhaps ‘silliness’ wasn’t quite the right word here — it’s that same sense of ‘most-useful-ness’ which I was trying to get at. SET-ISSUE! is of minimal use, so it gets trashed in favour of the more useful datatype.

That being said, I hadn’t fully appreciated the extent to which these kinds of collisions between syntaxes were possible. There are more ‘competing meanings’ here than I had thought.

By the way, this is legal in Haskell too. It’s particularly common for making ‘primed’ symbols, which as you note is thoroughly useful.

Sure, and I agree with that, which is why I didn’t object to these in my original post.

I don’t see any problem with this; could you elaborate?


Actually… re-reading this, I got confused here. It’s THE-WORD! and TYPE-WORD! which I can see the need for. By contrast, out of all the types we have, THE-BLOCK! is the one which feels most redundant and useless. (And TYPE-TUPLE! and TYPE-BLOCK! aren’t far behind.)

Imagine I decide to use $1 and $2 etc. to be some kind of positional substitution notation in a dialect:

>> substitute [a b $2 c d $1 e f] [<some> <thing>]
== [a b <thing> c d <some> e f]

If I reflect it out to the user in any way, it will carry the decoration I don't want unless I get involved in removing the extra digits:

>> substitute/trace [a b $2 c d $1 e f] [<some> <thing>]
DEBUG: $1.00 is <some>
DEBUG: $2.00 is <thing>
== [a b <thing> c d <some> e f]

Being a headache in that way--and having to decide things like whether to round $1.01 down or raise an error--means it's not a fit for such purposes. It wasn't intended to be used that way, but my point was just that this is one of my pet peeves about the type, because I would use DOLLAR-INTEGER! more than MONEY! in the kinds of things I'd use Rebol stuff for.

Where do you find the time for all these interests? :slight_smile: Point was just whatever it is that is innate in us to let us see structure in streams of words, we want to leverage that "zone" as I called it. Most languages don't try to go there.

Rebol does so on purpose, letting you organically decide when you want to delimit e.g. with BLOCK! in parse rules ([try some integer!] vs. [try [some integer!]]) or GROUP! in evaluator code. As you get more comfortable with things, the training wheels fall off and you tend to delimit less...or at least much more purposefully. As with English.

One place I see THE-BLOCK! coming into great use is in dialects that are not purely mechanical (the way PICK and FOR-EACH are), where leaving the @ off of a block signals that you would like the "INSIDE / IN" binding to be applied automatically.

e.g. some variation of the CIRCLED dialect might work like so:

>> x: 10 y: 20

>> var: circled [x (y)]
== y

>> get var
== 20

>> var: circled @[x (y)]
== y

>> get var
** Error: y is not bound

Being able to draw that distinction is dependent on the callee being able to see the sigil. And I think that if we say that @[...] blocks do not bind (as no other @ types do, at least by default), you help to guide the implementation offerings of the dialects in this direction. If you don't have a binding passed in, you can't give one back (unless the value already had it deliberately glued on).

I've mentioned the other cues here that are useful, such as when ANY and ALL don't want to evaluate but only to run the predicate:

>> any/predicate [3 + 4 10 + 20] :even?
== 30

>> any/predicate @[3 + 4 10 + 20] :even?
== 4

This is important if someone else ran a reduce step and you still have questions about the data... it keeps you from having to do something like MAP-EACH item to a QUOTED! version just to suppress evaluation in the ALL. It could be done with a refinement, but the single-character notation as a convention--which aligns with not adding binding--seems a salient solution.

So it's far from useless in my eyes. And I will reiterate that an important application of TYPE-TUPLE! is when you have a predicate function inside an object or module:

 >> obj: make object! [lucky?: number -> [number = 7]]

 >> parse [7 7 7] [some &obj.lucky?]
 == 7

And then TYPE-PATH! when your function has refinements.

Starting in the middle:

This is simply going back to what I said originally: ‘The closest to an explanation I’ve found is that these types are useful in dialecting’.

But then, if you’re willing to accept that as sufficient reason for syntax to exist… well, you can use that to justify basically any syntax, as indeed I once tried to do:

The lesson I took out of that discussion is that adding syntax purely for usage in dialected code is a bad idea, because it’s impossible to know where to stop.

That being said, beyond dialecting, it is pretty useful to have this distinction of ‘this list represents code, evaluate please’ vs ‘this list is just a list, do not evaluate’. After all, it’s nice to be able to pass around lists without worrying that they’ll be evaluated randomly. And it makes sense to store that distinction on the datatype, rather than as a refinement (cf. /ONLY vs antiforms). So, on balance… yeah, I think THE-* is starting to make much more sense to me now.

Ah-ha, I keep on forgetting that Rebol uses TUPLE!s for access within an object. (In my defense, historical Rebol didn’t use dot-syntax, and neither do some of the languages I use daily, so I sometimes forget it exists.)

Although… then again, this looks like it’s yet another instance of ‘syntax which is only useful in dialects’. That still bothers me, for the reasons I already mentioned.

OK, I didn’t even think of that possibility. That makes sense.

I disagree with it though, for two reasons:

  • I think MONEY! is actually very useful. Working with money comes with specialised requirements (e.g. fixed-point storage), which can be a bit painful to handle by hand — so having that type built into the language eliminates a whole class of subtle errors. And, of course, all kinds of software requires working with money.
  • I think positional substitution is a particularly annoying kind of substitution. I use it in Bash, and hate it. I’d much rather do something like substitute [a b $bar c d $foo e f] {foo: <some> bar: <thing>}, which is less error-prone and more descriptive.

So, I’d rank them the other way around: MONEY! is most useful, and DOLLAR-INTEGER! is less useful.

Admittedly, I would do the same. But for a general-purpose language I think MONEY! is important and very useful.

I couldn’t really tell you; I’ve always just had very broad interests. And linguistics has always been one of my favourite areas.


Returning back to my original post…

The conclusion I’ve taken from this discussion is this: no, I actually didn’t have a gap in my understanding of Ren-C as a language. Instead, the gap was that I hadn’t realised how widely useful THE-* is. Now that I understand that, it fits neatly into my criteria for deciding when datatypes are useful to have.

Admittedly, this still leaves TYPE-* without a motivation. (Other than ‘useful in dialecting’, and I’ve already explained why I dislike that.) However, on reflection, the whole type system of Ren-C does seem to be in flux at the moment… to take but one example, currently we only use TYPE-BLOCK!s which are one element long (at least as far as I’m aware). So perhaps the design just needs to be refined a bit, and then it will become something which makes more sense.


Glad you see the applications of THE-XXX! now. :slight_smile:

There's an uneasy aspect to how far dialects which interact with the evaluator should align with the evaluator's behavior, and how much to balance that. I've already mentioned "circling"... the feature was initially thought of as:

>> [a b]: multi-returning-thing ...
== 10

>> a
== 10

>> b
== 20

>> [x @y]: multi-returning-thing ...
== 20

>> x
== 10

>> y
== 20

(Aside: it really should be fascinating to people that a popular language feature like multi-returning would be implemented in this way... with a part like SET-BLOCK!, that isn't reserved by the system but is free for other designs... and with an "unstable isotope" of antiform block carrying the values, such that it decays to the first element on a normal variable assignment. While Rebol did put the overall weird idea of an evaluator of this style "out there", Ren-C has dialed it up to 11... and this really is where we're talking about the "new artistic medium, unlike anything else".)
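
(A quick sketch of that decay, reusing the hypothetical multi-returning-thing from above: a plain SET-WORD! assignment simply takes the first value of the pack.)

>> x: multi-returning-thing ...
== 10

>> x
== 10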

But here's a problem: if we take @ generally meaning "inert, no binding added", then there's a suggestion of [x @y]: not being able to even see the binding of Y in order to write it.

Does a dialect need to follow the evaluator exactly? No... and here, the SET-BLOCK! can see the evaluator's "current" binding and can find the Y. But how divergent it can be depends on what percentage of the evaluator it's going to interact with. If you're in another case like the variables of FOR-EACH, and instead of [x @y] you have a case like [@y] which you want to further shorthand as @y passed as an argument, the rules of the evaluator will get in your way: you'll be unable to set the variable.

(In this particular case, there are already problems with @ not coupling well with wanting a meta result, e.g. [x ^y]:, how would you circle the Y in that case? GROUP!s escape into evaluation to synthesize the variable name, so they're not available. This seems a good fit for FENCE! for circling, e.g. [x {^y}]:, and is the sort of thing driving the "braces are too valuable to waste on a non-array type" mentality.)

[block] (group) {fence}

Another thing I'll bring up is the current meaning for @ in PARSE, "I mean it literally, not as a rule":

 >> block: ["a" "b"]

 >> parse ["a" "b" "a" "b" "a" "b"] [some block]
 == "b"

 >> parse [["a" "b"] ["a" "b"]] [some @block]
 == ["a" "b"]

Is that good or bad? Is there some other meaning that's more related to the evaluator behavior that's getting elbowed out, here? We want this as a keyword, anyway... maybe LIT/LITERAL:

 >> parse [["a" "b"] ["a" "b"]] [some literal block]
 == ["a" "b"]

 >> parse [["a" "b"] ["a" "b"]] [some lit block]
 == ["a" "b"]

Having lately brought some of the philosophy points to the forefront, I might have been over-concerned about symbol meanings in dialect...and should retrain my focus onto those mechanically essential aspects, like carrying the sigil to someone who will be looking at it.

As mentioned, that should often be done by a word too, probably:

>> block: [a b c]

>> inert block
== @[a b c]

All of this is mad science, but can be very addictive once you get into it.

Very much so! I'm glad you have a pretty full grip on it (what's there is not complicated, outside of maybe isotopes... but you understand those too).

With the time I have, I'm pushing on a lot of things...FENCE! is one. But I hope the type concept gets a shot in the arm like binding has, which has gotten it out of the stalled state.

Well… this behaviour is hard-coded in the evaluator, right? So I don’t see how it’s any less ‘reserved by the system’ than the behaviour of all the other vocabulary which Ren-C gives you.

(But also, there’s parallels in other languages: e.g. Haskell doesn’t have ‘true’ multi-returns, but if you pattern-match on a returned tuple, that looks very much like a multi-return.)

VAR-WORD! instead of THE-WORD!. At least in the case of FOR-EACH, it makes sense with the semantics of ‘use existing variable’. I think I can justify it with SET-BLOCK! too: ‘use as main variable’. (It does make processing more complicated though, since FOR-EACH would have to see what has bindings and what doesn’t.)

I thought you said we could handle this case by doing [x @[^y]]: and so on? It’s a little ugly, but logical.

The one major thing I don’t really understand yet is the various kinds of ‘non-valued intents’ (as you’ve put it). But that can go in a separate thread.