"Raw" Strings

hostilefork · October 23, 2021, 1:33pm

I'm now 99% sure that { } best serves its purpose as a string form.

But there were several talking points that came out of that discussion. One was a desire for "raw" strings.

This is the idea that outside of the termination sequence, there's no escaping. This means you can put pretty much anything in the string. Backslashes, carets...it's all fair game.

Two Very Different String Forms: When Are They Raw?

To me, it made the most sense that the raw form be the braced form...because it is so frequently applied to sections of arbitrary documentation text (such as the Description: in module headers).

Description: {
    If you call this from C, then write:

        if (a ^ b != 0) {  // bitwise XOR
            printf("This is an example\n");
        }

     So there you see carets and backslashes working.
}

With binding support for string interpolation, we can imagine this getting even more useful for representing snippets of other languages with escaped portions inside of them.

However, @giuliolunati favored the idea of making quoted strings mostly-raw, because there was an easy-seeming way to escape quotes using only the quote character.

>> "This would be ""quotes"" inside a string"
== {This would be "quotes" inside a string}
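To make the doubled-quote idea concrete, here is a minimal Python sketch of a reader where "" is the only escape and everything else, backslashes and carets included, is taken as-is. The function name is made up for illustration, and Python is used only as a neutral way to model the rule; it is not how any Rebol scanner is actually written.

```python
from typing import Tuple


def read_doubled_quote_string(source: str, start: int = 0) -> Tuple[str, int]:
    """Read a quoted string where "" is the only escape.

    Returns (content, index just past the closing quote).
    Hypothetical helper: it models the proposal, not an existing scanner.
    """
    if not source.startswith('"', start):
        raise ValueError("expected an opening quote")
    out = []
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == '"':
            # Two quotes in a row stand for one literal quote.
            if i + 1 < len(source) and source[i + 1] == '"':
                out.append('"')
                i += 2
                continue
            return "".join(out), i + 1  # a lone quote closes the string
        out.append(ch)  # everything else is raw, including \ and ^
        i += 1
    raise ValueError("missing closing quote")


text, _ = read_doubled_quote_string('"This would be ""quotes"" inside a string"')
print(text)  # This would be "quotes" inside a string
```

Note the cost this sketch makes visible: any embedded text containing a quote has to be rewritten before it can be pasted in.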

How braces are escaped historically ( e.g. ^} ) is messier. That drags caret into it, so now you're worrying about needing to escape carets and braces. A more uniform approach appealed to Giulio, so he favored the quotes form...and would prefer it to be able to represent characters "as-is" by default, with this exception for embedded quotes.

But from my point of view, what I was seeking to avoid is any need to do search/replace on the embedded information. So mostly-raw wasn't good enough; e.g. I'd like to copy and paste the C code out of the example above, not have its quotes doubled:

Description: "
    if (a ^ b != 0) {  // bitwise XOR
        printf(""This is an example\n"");   <-- doubled quotes not good
    }
"

We also discussed that I'm averse to having ordinary quotes as multi-line strings... though maybe we should allow them. :-/ But even if we did, it feels unintentionally incomplete to see something like:

Description: "

So for these reasons I wanted to focus the raw string effort on braced strings. Yet there are a lot of things that it gets hard to represent in a raw string form when you try to use unmatched braces in the content.

After thinking about it a bit, we came up with the option of being able to set the delimiter according to a number of braces and a vertical bar.

{...}  ; expects any { } inside to be matched pairs 
{|...|}  ; allows internal unpaired and mismatched { }, {| |} matched pairs
{{|...|}}  ; allows internal unpaired and mismatched {| |}, {{| |}} matched

etc. etc.

This can handle some pretty sticky strings like {|ab"c"} {"d"ef|} if need be, where the data extracted is:

ab"c"} {"d"ef
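A rough Python model of how a reader for these delimiters might work, under some assumptions of mine: the opener is a run of braces followed by a bar (otherwise it's the plain { } form), the closer is its mirror image, and only occurrences of that exact opener/closer pair are counted for nesting. The name scan_braced is hypothetical.

```python
from typing import Tuple


def scan_braced(source: str, start: int = 0) -> Tuple[str, int]:
    """Scan {...}, {|...|}, {{|...|}}, etc.

    Returns (content, index just past the closing delimiter).
    Assumption: a run of k braces followed by | selects the barred form;
    anything else is the plain one-brace form.
    """
    if not source.startswith("{", start):
        raise ValueError("expected an opening brace")
    k = 0
    while start + k < len(source) and source[start + k] == "{":
        k += 1
    if start + k < len(source) and source[start + k] == "|":
        opener, closer = "{" * k + "|", "|" + "}" * k
    else:
        opener, closer = "{", "}"
    i = body_start = start + len(opener)
    depth = 1
    while i < len(source):
        if source.startswith(closer, i):
            depth -= 1
            if depth == 0:
                return source[body_start:i], i + len(closer)
            i += len(closer)
        elif source.startswith(opener, i):
            depth += 1  # a matched inner pair of the same delimiter nests
            i += len(opener)
        else:
            i += 1  # any other character is raw content
    raise ValueError("unterminated braced string")


print(scan_braced('{|ab"c"} {"d"ef|}')[0])  # ab"c"} {"d"ef
```

Under this reading {|} never terminates, which is the "sacrificed" case: the reader takes {| as an opener and then looks for a |} that isn't there.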

The approach would allow any number of {{ }}, which sounds like it could get ugly. But it's kind of like generic quoting, where I don't anticipate people using ridiculously high levels like {{{{|...|}}}}. But having it be a general method has value--especially in generated code scenarios.

Strings that start or end with a vertical bar, like {|}, get sacrificed, so you'll have to use "|" instead. But you already have to use "}" for a lone brace, so this just moves | into the same category as } and {.

I also suggested an additional rule...that you don't terminate braced strings except as:

{...} -space-
{...} -newline-
{...}]
{...})

If we limit it to these possibilities, you can write things like:

 code: {char c = '}';}

It seems that a lot of unpaired brace cases are single character literals like this, which don't fall under the rule. You might want to put the bars in for good measure anyway:

 code: {|char c = '}';|}

In any case, the other thing we were leaning toward here was that when quoted strings are escaped, they do so compatibly with classical C backslashes, based on the idea that caret escaping hadn't done the language any particular favors.

Just wanted to write this up while I still was thinking about it...

iArnold · October 23, 2021, 6:40pm

You are very welcome to join the 100% party on this issue

giuliolunati · October 24, 2021, 7:30am

Another set of delimiters could be:

{ ... }
<{ ... }>
<<{ ... }>>
...
(or replace <> with another couple)

This choice has a little advantage: the only troubling chars in content are { }, while < > are fine.

"<" = {<}
">" = {>}

hostilefork · October 24, 2021, 8:10am

Interesting idea; though it does take away from having representations like <{foo}> be a TAG!.

So the question is if being able to say {|} is worth losing that family of tag possibilities, as well as just the general visual issue of "looking like a tag".

It's probably better to sacrifice | ... but good to write up all the options for consideration.

hostilefork · October 24, 2021, 8:31am

Hm...well, it can't be ({...}) or [{...}] as those have meaning.

But so long as we make the rule that you can't put things right up against strings otherwise:

{...}
|{...}|
||{...}||

This would allow {|} to mean "|" and |{}}| to mean "}", for instance.

Maybe a similar technique could help with weird tags?

>> make tag! ">"
== |<>>|

Further thought: I've talked in the past about why banana clips (| a b c |) don't work in Rebol as a delimiter class, but "inverse banana clips" could work:

|(a b c)|

If you squint you might think of that as a symmetrical delimiter class, that has some nicer properties than #(...) in terms of marking both ends...and it has a reasonably "clean" look.

The block form is a bit harder to distinguish from [[a b c]], but...maybe still would have some purpose.

|[a b c]|

Interesting...

giuliolunati · October 24, 2021, 10:18am

|{... is too similar to | {... in PARSE

'{...}' conflicts with '{...} :-/

={...}= could look good in multiline:

={
....
}=

But ={... is too similar to = {...

Maybe *{...}* ?
Also, it's slightly more visible than |{ ... }|

hostilefork · October 24, 2021, 10:43am

I think my default is to agree that the exterior being an expression of the type of the thing is probably best, which is why I suggested that line of thinking first.

But to play devil's advocate a bit for the notation: It would be a rare juxtaposition, and just as with any other "weird" juxtaposition it's something you can control with your choices.

Remember there are all sorts of other situations where you can shuffle it when you're bothered...

if tag <> <div> [...]  ; does <> next to div look weird?

if not tag = <div> [...]  ; you make adjustments if it bugs you

giuliolunati:

Maybe *{...}* ?

Just because the PARSE dialect puts strings up against | and code puts strings up against = doesn't mean there's not another dialect that would put strings up against *.

So if you really want to maximize the differential between strings and arbitrary-thing-next-to-string...then having the braces on the outside is the way to go. This would favor the {{|...|}} style of solutions. There's certainly value in that, and maybe to say <|...|> style solutions for tags too.

It's an issue of what you think is more important: stronger difference from characters of adjacent values or being able to have things like <|> and {|} with their historical meanings.

(Note: <|> is today an "arrow-word" and not a tag, this concept would have to be updated to remove | from the arrow word characters, since <||> would be empty tag and not an arrow.)

Anyway...remember situations that involve escaping are always going to have something suboptimal about them. So definitely do see if you can operate in the realm of real examples. Because if you argue in the abstract about some behavior of {...} when you could just substitute a "..." then it's not a good example. The best examples are those that show no good choices exist.

giuliolunati · October 24, 2021, 11:14am

Ok, that's a strong argument. So {|...|} is good for me.

LkpPo · October 24, 2021, 12:16pm

Why do you need the tuple?

Aren't consecutive brackets sufficient instead of using two symbols?

{...}
{{...}}
{{{...}}}
...

hostilefork · October 24, 2021, 7:51pm

The vertical bars mean taking away only one character instead of two that you can use on the edges...and the belief is that it's more common to want to start or end regions of arbitrary text with { and } than with a bar.

So you can represent things like:

{|{a"b"c}|}

{|x"x}|}

hostilefork · October 24, 2021, 8:08pm

hostilefork:

I also suggested an additional rule...that you don't terminate braced strings except as:

{...} -space-

{...} -newline-

{...}]

{...})

If we limit it to these possibilities, you can write things like:
code: {char c = '}';}

So this sounds nice in theory, but it creates a lot of headaches in terms of how it would interact with wanting to allow matched pairs of braces inside the content.

code: {
   if (true) { printf("the character is %c\n", '}'); }
}

At the point of the '}' we've already seen one open brace. If this "doesn't count as a closing brace", does that mean it leaves the count at 1, or does the "don't count" rule only apply if we're at zero?

C doesn't require the spaces, so what about this?

code: {
   if(true){printf("the character is %c\n", '}');}
}

There's a conflict here because we're not assuming we know anything about how the text inside is being handled. I think when we go to trying to reason about what isn't a legal terminator (outside of allowing matching pairs of the delimiter sequence) it messes with that concept.
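To see how the two readings of the "doesn't count" question pull apart, here is a small Python sketch. It is only my interpretation of the two options: in one, a } never counts unless a legal terminator follows it; in the other, inner braces always count and the terminator rule is checked only for the brace that would close the string. The function name, the flag, and the third test input are invented for illustration.

```python
from typing import Optional

TERMINATORS = {" ", "\n", "]", ")"}


def scan_with_terminator_rule(source: str, rule_only_for_final_close: bool) -> Optional[str]:
    """Return the content of a braced string starting at index 0, or None
    if it never closes. End of input is treated like a newline (assumption)."""
    if not source.startswith("{"):
        raise ValueError("expected an opening brace")
    depth = 1
    for i in range(1, len(source)):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            following = source[i + 1] if i + 1 < len(source) else "\n"
            legal = following in TERMINATORS
            if rule_only_for_final_close:
                counts = legal or depth > 1  # inner closes always count
            else:
                counts = legal  # a } without a terminator is just text
            if counts:
                depth -= 1
                if depth == 0:
                    return source[1:i]
    return None


spaced = r'''{
   if (true) { printf("the character is %c\n", '}'); }
}'''
unspaced_else = "{if(a){b();}else{c();}}"  # made-up input, not from the thread

for flag in (False, True):
    print(flag, repr(scan_with_terminator_rule(spaced, flag)))
    print(flag, repr(scan_with_terminator_rule(unspaced_else, flag)))
```

In this model the "never counts" reading handles the spaced C snippet but never closes on the made-up unspaced one, while the "only for the final close" reading does the opposite and cuts the spaced snippet short. Neither is right without knowing something about the embedded language.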

I'd actually thought of the idea before the beefed up delimiters, so maybe it's best to leave it at that rule:

code: {|
   if(true){printf("the character is %c\n", '}');}
|}

...though we could say that if you open a delimiter sequence at the end of a line, then the closing delimiter has to be at the beginning of a line? This would disallow:

stuff: {
    Then things like this would be illegal. }

hostilefork · October 25, 2021, 1:28am

To add more escaping levels, do you prefer:

{...}

{|...|}

{{|...|}}

{{{|...|}}}

Or:

{...}

{|...|}

{||...||}

{|||...|||}

Both have the same basic weakness that you can't start or end the content with a |.

If you think of this as a generic pattern, e.g. one that TAG!s might follow, the multiple-bar form avoids making you write << or >> sequences to accomplish escaping. Even if these wouldn't technically conflict, their appearance could create visual noise that looks like other, more useful features.

(I still like my console replay proposal, which would be a novel feature other languages don't have.)

It isn't likely to come up that often to need more than {|..|}. Remember that you should only need that when you are trying to wrap code that has unpaired instances of {| and |} in it.

(Of course, the nature of escaping is such that once you come up with these "rare" sequences you affect the situation by now making instances so they become non-rare... so they will start existing.)

 Rebol [
     Title: "The Law Of Escape Sequences"
     Description: {||
         Whenever you introduce a new escaping delimiter, like
         `|}`, then you inevitably end up having to talk about it.

         Though I've pointed out we could avoid needing the {|| ||}
         and just using { } if we made the rule that an open brace
         that is at the end of a line will only be terminated by a
         close brace that's at the beginning of a line.  That seems
         like it would be an improvement over this...but you'd still
         have the issue on single-line literals.
     ||}
 ]

I think going by the bar count is likely better than going by the outer brace or angle bracket count.
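For the bar-count variant, here is a hedged Python sketch of what a MOLD-like routine could do: pick the smallest number of bars for which the content has no unpaired instances of that level's delimiters. The names mold_braced and is_balanced and the max_bars limit are my own; the refusal of content that starts or ends with | follows the weakness mentioned above.

```python
def is_balanced(content: str, opener: str, closer: str) -> bool:
    """True if every closer in content is preceded by a matching opener."""
    depth = 0
    i = 0
    while i < len(content):
        if content.startswith(closer, i):
            depth -= 1
            if depth < 0:
                return False
            i += len(closer)
        elif content.startswith(opener, i):
            depth += 1
            i += len(opener)
        else:
            i += 1
    return depth == 0


def mold_braced(content: str, max_bars: int = 8) -> str:
    """Wrap content in { }, {| |}, {|| ||}, ... using the fewest bars that work."""
    if content.startswith("|") or content.endswith("|"):
        raise ValueError("content starting or ending with | needs a quoted string")
    for bars in range(max_bars + 1):
        opener = "{" + "|" * bars
        closer = "|" * bars + "}"
        if is_balanced(content, opener, closer):
            return opener + content + closer
    raise ValueError("no delimiter level up to max_bars fits this content")


print(mold_braced("plain"))
print(mold_braced('x"x}'))
print(mold_braced("uses `|}` and {|| ||} in prose"))
```

The last call lands on {|| ||}, for the same reason the header above needed it: the text mentions an unpaired |}, which rules out both { } and {| |}.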

hostilefork · October 25, 2021, 6:25am

This raises a couple of additional questions about tags.

First of all, should TAG! be a raw string form? If you insert a backslash is that intended as a backslash, or should carets be carets... etc?

Historically tags were tailored to HTML expectations a bit, such that you could put a literal > inside an attribute string and it wouldn't close the tag. Yes, that is legal in HTML; you don't need to write &gt; for it:

rebol2>> <div label="1 > 2">
== <div label="1 > 2">

Note it's not actually parsing the tag or making any rules about its structure, it's just counting quotes.

rebol2>> <"">
== <"">

rebol2>> <""">
** Syntax Error: Missing " at <"""
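The "just counting quotes" behavior can be modeled in a few lines of Python. This is a sketch of the observable behavior in the three examples above, not the actual Rebol2 scanner; the function name is hypothetical.

```python
def scan_tag_counting_quotes(source: str) -> str:
    """Return the tag content, letting > close the tag only outside "..."."""
    if not source.startswith("<"):
        raise ValueError("expected <")
    in_quotes = False
    for i in range(1, len(source)):
        ch = source[i]
        if ch == '"':
            in_quotes = not in_quotes  # no structure, just a toggle
        elif ch == ">" and not in_quotes:
            return source[1:i]
    raise ValueError('missing "' if in_quotes else "missing >")


print(scan_tag_counting_quotes('<div label="1 > 2">'))  # div label="1 > 2"
```

With an odd number of quotes the toggle is still "inside a string" when the > arrives, so the tag never closes, which is the error case shown above.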

If TAG! were more agnostic about its contents, we'd be talking about something more like:

<div label="I'm a <div>">  ; works because of nesting rules

<|div label="1 > 2"|>  ; needs the | to avoid > terminating the tag

If not obvious in these discussions where we talk about |, it's not in the form'd version... only the molded version to make it LOAD-able:

>> form <|div label="1 > 2"|>
<div label="1 > 2">
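As a sketch of the FORM versus MOLD distinction for a content-agnostic TAG!, assuming only the two levels shown above (plain and single-bar), a Python model might look like this. The function names are made up, and content that would need deeper levels is simply rejected rather than guessed at.

```python
def angle_brackets_balanced(content: str) -> bool:
    """True if every > in content is preceded by a matching <."""
    depth = 0
    for ch in content:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def form_tag(content: str) -> str:
    """FORM-like output: the bars never appear."""
    return "<" + content + ">"


def mold_tag(content: str) -> str:
    """MOLD-like output: add bars only when needed to make it LOAD-able."""
    if content.startswith("|") or content.endswith("|") or "|>" in content or "<|" in content:
        raise ValueError("needs an escaping level this sketch does not model")
    if angle_brackets_balanced(content):
        return "<" + content + ">"  # nesting rules are enough
    return "<|" + content + "|>"


print(mold_tag('div label="1 > 2"'))
print(form_tag('div label="1 > 2"'))
print(mold_tag('div label="I\'m a <div>"'))
```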

Whether you see this as a worthwhile compromise depends on if you see it as a feature or not to be able to do things like:

quote-tag: <">  ; doesn't work today, but could be useful?

plain-tag: <foo>  ; works today
one-quote: <'foo>  ; doesn't work today (in Red)
two-quote: <''foo>  ; works today
three-quote: <'''foo>  ; doesn't work today (in Red)

Since both " and ' can enclose attribute strings, Red is more consistent here... but...

...I'd lean to wanting to scrap the sensitivity to " and '. My feeling is that having a more general TAG! is probably worth the agnosticism about the content. I think this will be even more true if interpolation and string binding work. Because you'll be seeing a lot more of:

text: "One is greater than Two...?  :-/"
html: reword {<div label="1 > 2">$text</div>}

So lone tags will be serving more of a role as a dialecting part, and making them better general strings is probably the strongest play.

^-- attention @rgchris

Note that this also brings <...> into the mix as an option for capturing patterns that use quotes and braces in weird ways.

>> weird: <"}>
>> print as text! weird
"}

This keeps you from having to write {|"}|} or """}" or "\"}" in your source. Which is pretty neat.

(I'm actually on the fence about whether print [<a>] should print <a> or just a. The question is why you think things are in TAG!s: whether the delimiters are part of something visually significant, or are really just there to get a distinct part of speech for the payload of "a". Again, this is something interpolation may tip the balance on, to where those who wish to generate HTML won't be as concerned about the rendering behavior of TAG!, as it would be used in more targeted ways and not for boilerplate.)

giuliolunati · October 25, 2021, 8:36am

Green light for {|||||....

LkpPo · November 27, 2021, 1:14pm

With interpolation you will need a way to protect the variable placeholder when the variable name is tricky. No?

hostilefork · November 27, 2021, 7:53pm

The concept here is that the binding is the built in feature, but interpolation is done by functions. So those functions could have different rules or parameterizations.

e.g. today's REWORD allows you to specify delimiters:

>> reword/escape "abc*var*def" [var "hello"] ["*" "*"]
== "abchellodef"

>> reword/escape "abc*|var|*def" [var "hello"] ["*|" "|*"]
== "abchellodef"
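For readers who don't know REWORD, here is a loose Python analogue of the delimiter parameterization shown above. It only mimics those two calls: the function name matches for readability, but the signature and the use of a dict for the word-to-value mapping are my assumptions, not REWORD's actual interface.

```python
import re


def reword(template: str, values: dict, escape: tuple) -> str:
    """Replace <open>name<close> with values[name]; leave unknown names alone."""
    open_delim, close_delim = escape
    pattern = re.compile(re.escape(open_delim) + r"(\w+)" + re.escape(close_delim))

    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)

    return pattern.sub(substitute, template)


print(reword("abc*var*def", {"var": "hello"}, ("*", "*")))      # abchellodef
print(reword("abc*|var|*def", {"var": "hello"}, ("*|", "|*")))  # abchellodef
```

Because the delimiters are a parameter, a tricky placeholder can be protected by choosing a delimiter pair that doesn't occur in the surrounding text.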

My concept here is that binding would take the place of needing to provide the mapping from words to values, but that options like /ESCAPE could still exist. That would inhibit substitutions such as [1 "hello"] so maybe a more generic REWORD would still be useful.

It's nice to have heuristics that non-word characters act as implicit delimiters. In particular, if we rule out the idea of / for field selection, we could expand URLs and files more simply. Though if you were trying to mix with dots of filenames you'd need something like parentheses:

interpolate http://$domain/$subfolder/$(obj.name).txt
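A minimal Python sketch of that heuristic, with loud assumptions: a plain dict stands in for binding, $name ends at the first non-word character, and $(a.b) does dotted field selection. The interpolate function here, the scope contents, and example.com are all invented for illustration.

```python
import re
from types import SimpleNamespace

# $(dotted.path) is tried first, then a bare $name that stops at any
# non-word character such as / or .
PLACEHOLDER = re.compile(r"\$\(([^)]+)\)|\$(\w+)")


def interpolate(template: str, scope: dict) -> str:
    """Substitute $name and $(obj.field) using scope (a stand-in for binding)."""

    def lookup(path: str):
        head, *fields = path.split(".")
        value = scope[head]
        for field in fields:
            value = getattr(value, field)
        return value

    def substitute(match):
        return str(lookup(match.group(1) or match.group(2)))

    return PLACEHOLDER.sub(substitute, template)


scope = {
    "domain": "example.com",
    "subfolder": "docs",
    "obj": SimpleNamespace(name="report"),
}
print(interpolate("http://$domain/$subfolder/$(obj.name).txt", scope))
# http://example.com/docs/report.txt
```

Without the parentheses, $obj.name.txt would be ambiguous between field selection and a literal dot, which is why the grouped form is needed there.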