"Finding the Invariant" - Case Study: TO

I was trying to make some fairly generic routines that would accept a WORD!, a TEXT!, a CHAR!, or an INTEGER! as input, and give back either a WORD! or an INTEGER!.

At some point I ran up against this:

>> to integer! "1"
== 1

>> to integer! #"1"  ; I wanted 1...
== 49  ; ..but I got a codepoint (the ASCII value for 1)

That is the status quo. But I thought to compare this with another possibility:

>> to integer! "1"
== 1

>> to integer! #"1"
== 1

>> codepoint of #"1"
== 49

With this, you could imagine passing either a TEXT! or a CHAR! to TO INTEGER!, not knowing which type of input you had, yet either way getting a result with a consistent representational meaning.

But the status quo is unlikely to ever have a useful invariant like that. Even just considering these two cases, TO INTEGER! becomes a bizarre operation. If it had a name it would be something like convert-decimal-string-to-integer-value-unless-char-in-which-case-codepoint.

No one wants that operation. So of course you see it used in cases where people already know which type they have... something like to integer! my-string or to integer! my-char. All that's happening is that the short word TO is being leveraged to get a frequently used integer property.

Yet it's not even that "short" when you have to add a type onto it. As I point out with CODEPOINT OF, might there be clearer ways to say that at little or no extra cost?

 TO INTEGER!
 CODEPOINT OF  ; just one character longer... and more explanatory

If you take a look at what a difference ENBIN and DEBIN are making vs. trying to pick arbitrary TO conversions, I think it tells a similar story.

So might the TO conversions be studied in such a way that there's some actual chance that accepting multiple types as input could be an asset instead of a liability?

I think this case of TO INTEGER! of #"1" is a good talking point for that.

When I gave this example, I was contrasting TEXT! and CHAR! conversions TO INTEGER!.

I should have pointed out that ISSUE! is a more obvious dissonance:

rebol2>> to integer! #1
== 1

rebol2>> to integer! #"1"
== 49

That just looks wrong as-is. But when #1 and #"1" are synonyms as immutable "issuechar!"/TOKEN! (as by all appearances they should be), there can't be a distinction.

It may be that some other operator like as integer! fits into getting the codepoint. But I mention codepoint of as a possibility, which is only one character longer (and doesn't require hitting <shift> to get an !) and helps you know exactly what's going on. It seems like a good choice.

So as part of the issuechar conversion, TO INTEGER! conversions of issuechar! can be temporarily disabled...advising you to use either CODEPOINT OF TOKEN or TO INTEGER! AS TEXT! TOKEN. I'm proposing that for empty issuechar! the codepoint of # will be 0, in order to make sure that we're not actually putting codepoint 0 into any strings. When the transition period is over, you can change the TO INTEGER! AS TEXT! TOKEN to just TO INTEGER! TOKEN.
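The TO INTEGER! AS TEXT! spelling has a rough analogue in Rebol2, where going through a string first already gives the same answer for both input types. A quick check, assuming stock Rebol2 behavior, with TO STRING! standing in for AS TEXT! (which Rebol2 does not have, and which would alias rather than copy):

```rebol
rebol2>> to integer! to string! "1"
== 1

rebol2>> to integer! to string! #"1"
== 1
```

This is only meant to show that the "decimal meaning" reading can be applied uniformly to both types; it says nothing about how the transition itself would be implemented.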

How Does This Inform the TO/MAKE Matrix?

In "Hacking Away on the TO/MAKE Matrix", I'd previously said:

  1. A TO conversion won't run arbitrary code that you pass to it (or possibly: a TO conversion won't even GET any variables, much less evaluate)
  2. Every TO conversion targeting a series type performs a new allocation
  3. TO TEXT! 10 is "10" and TO INTEGER! "10" is 10
  4. A TO conversion of a value to its own datatype will do the same thing as COPY

So here we have another stake in the ground, for TO INTEGER! of #1 and #10.

This emboldened me a bit to try changing the function signatures of TO conversions to not be passed bindings. That would make it mechanically impossible for a TO conversion to do anything resembling a REDUCE or GET to dereference variables on what is passed into it.

Then I hit rule #4, and there's some friction. If you TO BLOCK! a block, you historically have not lost the bindings in that block in the process.

But maybe you should (?)

The historical behavior of TO BLOCK! of TEXT! is to load it, but what you get isn't bound:

rebol2>> blk: to block! "print {Hello}"
== [print "Hello"]

rebol2>> do blk
** Script Error: print word has no context
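
For contrast, LOAD in Rebol2 binds the words it produces, so the same text runs when it is loaded rather than converted. A quick check, assuming stock Rebol2 behavior:

```rebol
rebol2>> do load "print {Hello}"
Hello
```

So the unbound result is specific to the TO conversion, not to turning text into a block in general.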

So maybe TO BLOCK! of another block gives you an unbound block. A problem with that is that if it's giving you a shallow copy, then the blocks underneath it would still be bound. I'm thinking about a feature where there is a way of viewing cells--kind of like CONST--in which you can carry a reference to a block that has bindings but can't use them. So it wouldn't necessarily require a deep copy to get this.

But there's a hint here of what the rules suggest as they tighten. There may well be a difference between TO GROUP! BLOCK and COPY AS GROUP! BLOCK...because the former may give you something unbound, while the latter preserves the binding.

It would actually be good if bindings were erased whenever they could be. Stray bindings not only leak references into code that shouldn't have them, but also mean that anything they point to still appears live to the garbage collector. Inert code meant to be used purely symbolically can wind up costing as much as everything its unused bindings keep alive. So if you can use a form of copying that strips bindings, it would be ideal if you did so.

A long time ago, when debates started about reducing the combinatorics, the question came up as to what TO TEXT! meant... and if it should be the replacement for one of the enigmatic FORM or MOLD operations.

Making it a synonym for the historical behavior of MOLD has been off the table for a long time. But it feels very natural that to text! #1 (and its synonym to text! #"1") will give you "1". So maybe when the requirements are all put together, it makes sense to have TO TEXT! supplant FORM.

But historical FORM is weird. It's not quite MOLD, and it's not quite TO TEXT!

rebol2>> form make char! 0
== "^@"

rebol2>> form make char! 1
== "^A"

We need TO TEXT! of a zero token to be either an empty string or error...and I'm favoring error.

And I don't know quite what the value of FORM is, here. If you asked to "print out a control A", this isn't what you want... and it's not what you get with print [#"^A"] in Rebol2.

If FORM is "make a human readable version of something", and MOLD is "make the Rebol 'REN' readable version of something"... these look like two overlapping concepts.

It might point to the idea that ^A should be a valid TOKEN! literal molded format, vs. needing to say #"^A" or #^A

So I'm kind of thinking FORM doesn't have much of a place in this picture.