Weird WORD!s - Allow, But Escape Them?

hostilefork · March 15, 2021, 3:30am

My feeling is you should be able to build paths and tuples out of anything that's a valid WORD!. But is it time we had an escaping mode for "weird words"?

Let's say you didn't want <.> to be a TAG!, but rather a TUPLE! where the first element was < and the second was >.

We could do something like backquotes:

`<`.`>`

Having an escaping mode for words would open up more lexical space. For instance, I like the idea of allowing $FOO, $(FOO), $FOO/BAR, $[FOO BAR] etc. as another type...

But this would seem to kill off the idea of being able to have $ and $$ etc. as WORD!s, because you get into ambiguous situations... is $/foo a PATH! with the $ word in the first slot, or an ENV-PATH! with an empty first slot?

These ambiguities create problems for other things that might stand alone all right, because we don't want to have "second-class-citizen" WORD!s that can't appear in paths.

But what if we used backticks if they wind up in paths?

`$`/foo   ; PATH! with $ in the first slot
$/foo  ; ENV-PATH! with blank in the first slot

This could give us the likes of : and :: as operators...

>> `:`: does [print "I am colon!"]

>> :
I am colon!

>> type of :`:`
== #[datatype! action!]

It could work for other standalone characters, like @ and perhaps &. % could be the same (with %"" or %{} used for empty file)

I feel like # and / may not be good candidates for this treatment; they would need more thought.

The point wouldn't be that you'd likely be going crazy with paths involving these characters, but rather that you might want to do interesting things with them standalone. It's just to put them on the map as legitimate words.

IngoHohmann · March 15, 2021, 2:18pm

I definitely think there should be a way to escape weird words.
I'm not yet a big fan of this specific proposal, but can't think of anything better.

hostilefork · July 30, 2022, 9:40pm

Long ago, @Mark-hi suggested following Lisp's example and using vertical bars for escaping symbols:

>> |word with spaces|: 10

>> print ["The value is" |word with spaces|]
The value is 10

Maybe this seems more palatable. But Lisp uses backslash to escape its vertical bars. And we don't want to have to mangle things like: some ["a" | "b"] into:

parse "ab" [some ["a" |\|| "b"]]

So we'd have to make some different tradeoffs in the design than Lisp.

All-Vertical-Bar Tokens Could Be Escaping-Exempt

For starters: if a token consists of only vertical bars, we might say we don't think of that as being escaped:

>> as text! '|
== "|"

>> as text! '|||
== "|||"

You might say "Hey, if it's that easy, why wouldn't Lisp have thought of that?"

It's because this sacrifices symbols that start and end with spaces. So if you think this way, you can't have things like:

>> as text! '| this wouldn't be possible |
== " this wouldn't be possible "

You'd have to escape the spaces one way or another:

>> as text! '|\ maybe this would work\ |
== " maybe this would work "

Similar issues arise with commas and other delimiters. We have to be able to decide if the sequence |, is starting some arbitrary WORD! with a comma as the first character, or if that's a vertical bar WORD! followed by a COMMA!. Same for |) where we have to decide whether it's a vertical bar WORD! followed by a parenthesis that might close an existing group... or an arbitrary word with a right parenthesis as the first character.

It seems pretty fair to me to say that delimiter characters can't be in your "arbitrary word" at the beginning or end, at least without escaping. Though having dots in the words is a requested feature:

>> as text! |graham.likes.these|
== "graham.likes.these"

I don't know the exact boundaries here:

>> as text! |should this (work?)|
== "should this (work?)"

But seeing as we've gotten by for a pretty good while without such weirdness in WORD! at all, I don't think these edge cases need to be the focus of the present moment.

Note that even though foo and |foo| could be interchangeable, we can't say | and ||| are interchangeable. Instead, | would be interchangeable with |\||.

What About Things Like "Flags" `<|`

As with the "all bars" cases, we want to be able to use these unescaped as operators. For instance, "left flag" has been used to point to the left evaluation while eliding everything to the right:

>> 1 + 2 <| print "Hello" print "World"
Hello
World
== 3

But just because they would be unescaped when standing alone, doesn't mean we can get away with that everywhere. Let's imagine we want to make the PATH! whose first element is <| and whose second element is |>

Under new design proposals, if we just were to write <|/|>, that's actually a TAG!...the kind of tag that would permit internal < and >

>> as text! <|<|>
== "<"

So we'd get:

>> type of <|/|>
== #[datatype! tag!]

>> as text! <|/|>
== "/"

To try for the PATH! we want, let's think about hypothetically just wrapping the flags in vertical bars:

>> as block! '|<||/||>|
== [<| |>]

It's not the worst looking thing.

But if we're not escaping the vertical bar that's part of the flag, then how would it know that the first element should be <| instead of seeing the |<| pattern and assuming that meant it was < ?

One reasoning could be that as long as it hasn't hit a delimiter (] or ) or , or / or . or space or newline) then all vertical bars are considered content.
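To make that policy concrete, here is a rough Python model of it (not Ren-C scanner code). It is only a sketch: the names DELIMITERS, unwrap and elements are made up for illustration, and the delimiter set is just the one listed above. Each run of non-delimiter characters is taken whole, and only the outermost bars are stripped.

```python
# Hypothetical model of the "bars are content until a delimiter" policy.
# The delimiter set is the one listed in the post; names are illustrative.
DELIMITERS = {"]", ")", ",", "/", ".", " ", "\n"}


def unwrap(run):
    """Strip one outer pair of bars, unless the run is made only of bars."""
    wrapped = len(run) >= 2 and run[0] == "|" and run[-1] == "|"
    all_bars = set(run) == {"|"}
    return run[1:-1] if wrapped and not all_bars else run


def elements(text):
    """Split text at delimiters and unwrap each run between them."""
    items = []
    run = ""
    for ch in text + "\n":  # the trailing newline flushes the final run
        if ch in DELIMITERS:
            if run:
                items.append(unwrap(run))
            run = ""
        else:
            run += ch
    return items


print(elements("|<||/||>|"))  # ['<|', '|>']
print(elements("|a b|"))      # ['|a', 'b|']
```

The second call shows the cost: a space inside the bars ends the run, so the wrapped word falls apart into two tokens.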

This policy would allow:

>> /|<||: does [print "Hello"]

>> <|
Hello

But again we have to ask what such assumptions rule out. And what it rules out are any internal delimiters--so no spaces, parentheses, brackets, dots, slashes.

That seems a bit much to throw out, for the sake of a few weird operators...if-and-when they happen to wind up in paths. So we'd probably have to escape this:

>> /|<\||: does [print "Hello"]

>> <|
Hello

Otherwise, Start-And-End Vertical Bar Must Be Escaped

So the idea is that <| won't require escaping when standing alone as a WORD!, but when it is put in a PATH! it would be written as |<\||.

But what about |<| ? That notation pretty clearly needs to be reserved for how < appears when put in a PATH!, otherwise things like </> would be a PATH! and not a TAG!. (It needs to be a TAG!)

Quick Look At Those Backticks Again

Backticks have the advantage that we seem to be 99% uninterested in them as symbols otherwise. So it's a bit less messy:

`<|`: does [print "Hello"]

But they still might have other applications that are less esoteric. It seems wasteful to apply them here.

Especially because in practice, we have an evaluator to draw from, which could make things look better:

('<|): does [print "Hello"]

Tentative Strategy

I think that it's rare enough that people will be putting vertical-bar-words in paths and tuples that we can just go ahead and say you always escape them.

So if you want two vertical bars in a TUPLE! you'd say |\||.|\|| - and that should be reasonable discouragement against saying it too often.

But spaces and commas and parentheses and such will need to be escaped inside your "weird escaped words", at least at the start and end.

>> block: transcode "(|) (|)"
== [(|) (|)]

>> type of first block
== #[datatype! group!]

>> length of first block
== 2

>> first first block
== |

That is, this is not a GROUP! holding a word with a spelling of ") ("

If it starts AND ends with a vertical bar, it's an escape notation unless it's all vertical bars.
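As a rough illustration of that rule, here is a Python model of how a single token's spelling might be decoded. It is a sketch under assumptions, not the actual scanner: the function name spelling_of is hypothetical, and backslash is assumed to keep the next character literally, as in the examples earlier in this post.

```python
def spelling_of(token):
    """Return the spelling a token would denote under the sketched rule."""
    # Tokens made only of vertical bars are exempt from escaping.
    if token and set(token) == {"|"}:
        return token
    # A token that both starts and ends with a bar is an escape notation.
    if len(token) >= 2 and token[0] == "|" and token[-1] == "|":
        inner = token[1:-1]
        out = []
        i = 0
        while i < len(inner):
            if inner[i] == "\\" and i + 1 < len(inner):
                out.append(inner[i + 1])  # assumed: backslash escapes one char
                i += 2
            else:
                out.append(inner[i])
                i += 1
        return "".join(out)
    # Anything else (foo, <|, |>) is spelled exactly as written.
    return token


print(spelling_of("|foo|"))   # foo
print(spelling_of("|||"))     # |||
print(spelling_of("|\\||"))   # |
print(spelling_of("|<\\||"))  # <|
```

Under this model foo and |foo| come out the same, while | only matches |\||, which is the asymmetry noted earlier.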

Hopefully this is enough of a sketch to enable pushing through to the next dilemma...

IngoHohmann · July 31, 2022, 9:54pm

Another idea might be to have backticked (or something else) braced strings be the notation for weird words.

`{word.with.dots}
`{|||}
`{ and with {} inside and spaces around }

This way there's no need to invent new escaping rules, because it's already there in strings.

hostilefork · August 1, 2022, 8:00pm

I do agree the same escaping rules should be used, but I think the caret-escaping is an increasingly poor idea.

Carl wanted to move away from it and use C's escaping:

String character escapes use C notation. They use backslash notation, for example "\n" for newline and "\t" for tab.

I definitely feel that with caret taking on more of a purpose in the language now for ^META, sticking with the status quo for escaping in strings may be unwise.

Needs thought...

hostilefork · February 18, 2024, 5:22am

Remotivated a bit by my own answers to why there's a 1-sigil limit, I've had something of a Come-to-Jesus Moment on this... and I think it's one of these places where saying "no" is actually the right answer.

|::|: does [print "I look all right, but..."]

|\|\|\||: does [print "THAT's |||: ? ... kill it with fire!"]

>> to path! [<| |>]
== |<\||/|\|>|  ; NOOOOO! 🔥🔥☠️☠️🔥🔥

I believe the way to deal with things that have sigils and delimiters/interstitials in them is with a second-class-citizen symbol type. (It may be too much of a hassle to make it a whole different datatype from WORD!, but the capabilities of some word symbols would be limited in terms of decoratability or putting in paths and tuples.)

The means of applying sigils would be using array types to wrap them, which may not cover all uses (you might need to get inside [] '$, for instance, to defuse it if it's an action) but would get some of the most important ones.

[::]: does [print "This SET-BLOCK! does the job...visually and mechanically."]

>> make tuple! ['map '...]
** Error ; oh well

>> map.[...]: 10

>> map.[...]
== 10

|||: does [print "assigns |||, so back to normalcy for vertical bars"]

...But My Other Language Lets Me Do It!

Well your other language isn't here now, is it?!

As it happens, when first encountering the idea of dialecting I was a pretty staunch believer that system-endorsed escaping was necessary in words. Because when you've cooked up a dialect that uses types to cue behavior, it sucks if you hit a wall where you need spaces in what was otherwise an ANY-WORD!, or maybe you need something to start with a number.

The scenario one would predict would be that everyone would wind up inventing their own escaping system, and having to retrofit their code once they hit an edge case.

Maybe one person would start using two underscores to mean space. Let's say they have ways of sending and receiving, and SET-WORD! means send:

 dialect: [
     thermocouple_mux__pin: (+/-/16.bit) [11 7 913 -34 81]
     ...
 ]

 form-receiver-word: lambda [w [set-word!]] [
     replace/all (to text! w) "__" space
 ]

When what they really wanted was just:

 dialect: [
     |thermocouple_mux pin|: (+/-/16.bit) [11 7 913 -34 81]
     ...
 ]

>> type of first dialect
== &[set-word]

>> form first dialect
== "thermocouple_mux pin"

Maybe 99.9% of the time they never need a space in the name, but some external pressure exists to line up the name with something that expects a space. Or maybe it's something that starts with a number. But once this comes up, the whole bet the person made on using the type might fall apart...

Couldn't the system have some empathy, vs. generating errors the way Ren-C, Red, and R3-Alpha do?

>> to set-word! "thermocouple_mux pin"
*** Script Error: contains invalid characters

Ren-C's Expanded Array Types Are Your Empathy

You have ["thermocouple_mux pin"]: and [thermocouple_mux pin]: if you need them, and (...):, and soon {...}:

I'm not as bothered by the idea that you have to accommodate another type or use your own FORM routine to produce strings, as I am about the idea of the source being crap because you had to distort it. If you don't have to distort it, then a little more work in the implementation is a small price to pay. Let's just work on making sure the implementation is easy to bend--having the language be the best it can.

If you've used all your array variants up for fully unrelated purposes, then what you have is probably verging on unreadable... and maybe you should have considered /send "thermocouple_mux pin" to begin with.

Broadly speaking... to be part of the crusade for simplification, you need to be able to push back against ugly requirements. If you can't push back then you've pretty much already lost the battle for control of complexity. The complexity is controlling you.

If you can't stand up against something like CR LF line endings--that's just a stepping stone on the path of pain:

Fight for the Future: How DELINE will save us from CR LF

(By analogy: I'm sure the people who decided filesystems should allow spaces in them meant well. But the decision was a catastrophic one, creating far more pain than benefit.)

Permissiveness Wouldn't Win Over Any Skeptics

Anyway, I went down the rabbit hole of escaped weird words... and the resulting chaos was beyond my ability to reason about it. It turns out to be very not-Amish.

What comes out of the line of thinking ruins the pleasing aspects of working with the parts. And I don't think giving in to this particular issue would wind up being enough to win over someone who ultimately wants to see the project bent to external constraints. They'll disagree with the premise and leave when they hit the next speedbump. Best to stick to principles, if anything notable is to be accomplished.

hostilefork · February 18, 2024, 11:24am

It may be that with the FENCE! proposal, the "weird" fences could evaluate for the purpose of letting you GET and SET an arbitrary word. I'm not sure what :{...} and {...}: would be used for otherwise.

I mention it because one problem with the dialecting of SET-BLOCK! is you can't use it to set @ or ^ because they are used for nameless variables:

 >> [_ @]: pack [1 2]
 == 2

 >> [^ _]: pack [1 2]
 == '1

This means that instead of [^]: ... assigning a meaning to ^, it signals you want the nameless result to be meta. Maybe that's all right, and {^}: can pick up the slack, as well as offer a way to get the weird word with :{^}.

But it would have to stop there I think, because ${...} would need to mean "bind the fence" not "give back the word in the fence bound".

Something to think about.

hostilefork · February 20, 2024, 2:56am

So I'm having cold feet on the idea that sigils should be able to be WORD!, when you can't create decorated forms of them.

In the "freedom to" and "freedom from" dynamic, it's nice to have freedom from a WORD! that can't be turned into a SET-WORD!.

But there are big convolutions that come into play if colons are words, with people saying something like:

 make object! [
    {:}: does [print "colon as key"]
 ]

...if they expect that to make a key in the object named ":". Having a key that can't be made into a SET-WORD! turns everything on its head if you expect things to handle it.

>> setify ':
== {:}:  ; but not an ANY-WORD!... welcome to consequence city, population: me

It really wrecks the implementation to say that you search not only for SET-WORD!, but also SET-FENCE! (or whatever). It ripples too far.

Same for @ and $ and & and ^.

So here we see another angle of that freedom-to/freedom-from. If you're free from the idea that : is a word, then it can be a COLON!. ^ can be a CARET!. If we have a datatype for COLON!, we can give that colon behavior in the evaluator. Users can give it behavior in dialects.

SET-BLOCK! doesn't have to be concerned about not being able to assign to a variable named "^" because there aren't variables named "^".

When we were at a 64-datatype limit, ^ and @ had to be words, there wasn't space for them to be datatypes. But now this is possible, and I think it's the direction to go.

They could either be their own datatype (as members of the category ANY-SIGIL?), or just use one type and be SIGIL! Probably one type is better.

I do like the idea of having :: as a dialecting part. But as @bradrn said (somewhere), if you're willing to use the justification of "would be nice to have in a dialect" then that could be used to justify practically anything.

Maybe the types are something like COLONS! and CARETS!, and have lengths?

>> type of first [:::]
== &[colons]

>> length of first [:::]
== 3

Then colon? and &colon? could be the type constraint for a single colon. That's kind of batty (and disrupts the idea that these are always sigils).

OR MAYBE... we could argue that :: is a unique datatype that's the answer to SIGIL OF for a SET-WORD!

>> sigil of first [:a]
== :

>> sigil of first [a:]
== ::

>> sigil of first [a]  ; void or null, or maybe BLANK! is an ANY-SIGIL?
== ~null~  ; anti

This gives completeness (one sigil per sigilized form), and adds an extra part that I think might be nice in dialecting. I guess that :: would be named COLONS!, or maybe COLON-2! would be easier to see as distinct from COLON! (or COLON-COLON!... kind of a long type name, but whatever).
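A quick way to see the "one sigil per form" mapping is to model it on token text. The Python below is only a sketch: sigil_of here is a hypothetical stand-in working on strings, not the real SIGIL OF, and Python's None stands in for the null result.

```python
# Hypothetical string-based model of the proposed SIGIL OF mapping.
PREFIX_SIGILS = (":", "^", "&", "@", "$")


def sigil_of(token):
    """Return the sigil a decorated token would report, or None if plain."""
    if len(token) > 1 and token.endswith(":"):
        return "::"      # SET- forms answer with the two-colon sigil
    if len(token) > 1 and token[0] in PREFIX_SIGILS:
        return token[0]  # GET-, META-, TYPE-, THE- and VAR- style prefixes
    return None          # plain forms have no sigil


print(sigil_of(":a"))  # :
print(sigil_of("a:"))  # ::
print(sigil_of("a"))   # None

# Every decorated form answers with exactly one sigil.
dollar_forms = ["$word", "$tu.p.le", "$pa/th", "$[bl o ck]", "$(gr o up)"]
print(all(sigil_of(item) == "$" for item in dollar_forms))  # True
```

Treating a: as answering :: rather than : is what keeps the answer distinct from the one for :a.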

Anyway, this would remove one axis of second-class citizens from the world, and I would be much relieved about the cases of needing to render out MAKE OBJECT! with sigils as keys.

bradrn · February 20, 2024, 3:59am

Here: Upcoming Datatype $WORD... What Will It Mean? - #10 by bradrn.

(I don’t have much to say about the rest of the post, other than that it seems reasonable enough.)

hostilefork · February 20, 2024, 4:26pm

This turned out to be a really good-feeling change! Ugly dark corners got cleaned up.

The test file is the kind of thing that would give @rgchris nightmares, because it's very symbol-y by its nature (but it's a test file...avoid writing code targeting humans that looks like this):

ren-c/tests/datatypes/sigil.test.reb at master · metaeducation/ren-c · GitHub

Hopefully the regularity there has a feeling of completeness, to where you won't be calling for the death of any of the contained types.

Speaking of which... note the application of the @[...] block type to tell ALL not to evaluate, just to apply the predicate to the items in the block as-is:

apply :all [
    @[$word $tu.p.le $pa/th $[bl o ck] $(gr o up)]
    /predicate item -> ['$ = sigil of item]
]

That could be controlled with a refinement, but I think it's nice to carry it on the value. Usually it looks better than that, because it's not used with symbol soup.

You can also do this with INERT, that will put the decoration on the block before it's passed to ALL.

>> inert [a b c]
== @[a b c]

apply :all [
    inert [$word $tu.p.le $pa/th $[bl o ck] $(gr o up)]
    /predicate item -> ['$ = sigil of item]
]

Yep, one type is better. I made it just be a variant of the ISSUE! cell, so the UTF-8 for the sigil is in the cell directly, and it falls mostly under the handling of anything that could take UTF-8-bearing types before.

You can use sigils, words, etc. in some places you could otherwise use quoted strings, e.g. string parsing:

>> str: "AT&T"

>> parse str [try some [change '& ("-and-") | skip 1]]

>> str
== "AT-and-T"

It's not the biggest benefit, but you do save on the visual noise of three teeny vertical lines (and less importantly, you save on creating a string series node). So I like being able to do it.

parse str [try some [change '& ("-and-") | skip 1]]

parse str [try some [change "&" ("-and-") | skip 1]]

Of course if you are parsing an array, the meanings have to be different.

bradrn · February 21, 2024, 1:05am

hostilefork:

Speaking of which... note the application of the @[...] block type to tell ALL not to evaluate, just to apply the predicate to the items in the block as-is:
apply :all [
    @[$word $tu.p.le $pa/th $[bl o ck] $(gr o up)]
    /predicate item -> ['$ = sigil of item]
]

I don’t understand this code… what result would it return?

hostilefork · February 21, 2024, 2:00am

-> is an infix operator that takes its left hand argument literally.

It is a shorthand for making lambdas.

x -> [...]  ; same as lambda [x] [...]

So the above is just another way of writing:

all/predicate @[
    $word $tu.p.le $pa/th $[bl o ck] $(gr o up)
] lambda [item] [
    '$ = sigil of item
]

Because the ALL can see the @ on the block it receives, it can use this as a cue to not evaluate. Hence synonymous with:

all [
    '$ = sigil of '$word
    '$ = sigil of '$tu.p.le
    '$ = sigil of '$pa/th
    '$ = sigil of '$[bl o ck]
    '$ = sigil of '$(gr o up)
]

I think the @XXX behavior is going to turn out to be far more important than even I thought at first (and that it should not bind, if you need it bound say $ @xxx)... I'll be making a writeup on that.

Of course, now that I look at the repetition in the tests, the fact that I didn't actually factor out the tester means this looks a lot messier than if I'd just written it out.

Maybe it would be better (and give Chris less of a heart attack) if it said:

for-each [sigil' items] [
    ~null~ [  word   tu.p.le   pa/th   [bl o ck]   (gr o up)  ]
    '::    [  word:  tu.p.le:  pa/th:  [bl o ck]:  (gr o up): ]
    ':     [ :word  :tu.p.le  :pa/th  :[bl o ck]  :(gr o up)  ]
    '^     [ ^word  ^tu.p.le  ^pa/th  ^[bl o ck]  ^(gr o up)  ]
    '&     [ &word  &tu.p.le  &pa/th  &[bl o ck]  &(gr o up)  ]
    '@     [ @word  @tu.p.le  @pa/th  @[bl o ck]  @(gr o up)  ]
    '$     [ $word  $tu.p.le  $pa/th  $[bl o ck]  $(gr o up)  ]
][
    for-each item items [
        if (unmeta sigil') <> sigil of item [fail [mold item]]
    ]
]

That actually pinpoints which item gave the bad sigil back, if one does...

RE: the UNMETA there, I love how rational the isotopic model is. I chose NULL for the result of SIGIL OF when there isn't one vs. fabricating. That makes the most sense (you can test (if sigil of value)), but that means you can't put the null result in blocks...yet there are answers! Just got to wear your META-hat. ^

hostilefork · March 3, 2024, 3:32pm

A post was split to a new topic: Meet REIFY and DEGRADE