Distinguishing Strings By Delimiters Used

If it were possible to transcode {whatever} and "whatever" such that they could be identified as different symbols (not asking for different properties, just the ability to distinguish between them as different symbols), that would open up another lane of lexical space for some dialects.

In other words, you could transcode {whatever} and identify it as a (making up this name) LONG-TEXT!, which would be a form in the TEXT! family with the same properties as any other series.


I'm not sure what specific ideas you had in mind with this...

...but maybe the FENCE! proposal covers it?

Historically strings have presented a challenge to retaining the quoting style you used to make them, because:

  • Some strings are created programmatically and don't come from a LOAD, so they have no initial delimiters.

  • Mutations can make it so you can't use the same delimiter to output the string as when you LOADed it...

    >> str: "abc"
    == "abc"
    
    >> append str {"}
    == {abc"}
    

I'm a few weeks into having FENCE! exist. The fence part was trivial: just another array type with different delimiters. What was harder was the string part.

I decided the easiest thing to do was to make the scanner code between the bootstrap executable and the current executable roughly compatible. So I backpatched a lot of modernness onto the bootstrap EXE in just that part. So now you can scan strings like -{foo}- and --{foo}--.

But the molding questions are trickier. Historical heuristics are weird so even if you say "foo" you can get {foo} in some cases.

Red is similarly lossy, though their heuristics are different...but what they share in common is to try and dodge escaping.

red>> str: "I am a (quoted string) ^"but^" that's not preserved"
== {I am a (quoted string) "but" that's not preserved}
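To make the "dodge escaping" idea concrete, here is a rough model of one such heuristic. It is written in Python only so it can be run anywhere; `mold_text` and `braces_balanced` are made-up names, and the rule itself is an assumption for illustration, not what Red or any Rebol actually implements.

```python
def braces_balanced(text: str) -> bool:
    """True if every '}' closes an earlier '{' and none are left open."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def mold_text(s: str) -> str:
    """Pick whichever delimiter needs no escaping (hypothetical rule)."""
    # Plain quotes work if there is nothing in the string to escape.
    if '"' not in s and "\n" not in s and "^" not in s:
        return '"' + s + '"'
    # Braces tolerate quotes and newlines, as long as braces nest properly.
    if "^" not in s and braces_balanced(s):
        return "{" + s + "}"
    # Otherwise give up on dodging and escape inside quotes.
    escaped = s.replace("^", "^^").replace('"', '^"').replace("\n", "^/")
    return '"' + escaped + '"'


print(mold_text("abc"))
print(mold_text('I am a (quoted string) "but" that\'s not preserved'))
```

Note that the function only ever sees the characters, so whichever delimiter was typed in the source has no way of influencing the result.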

It may not be valueless to have a weak guarantee...that if a string does originate from a LOAD and you don't mess with it, you'll get the same delimiter forms back.
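A sketch of what that weak guarantee might look like, again modeled in Python rather than in the interpreter itself. The `Text` class, its `loaded_as` field, and the `can_render` check are all hypothetical; escaping is left out entirely to keep the model small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Text:
    chars: str
    # Delimiter seen by LOAD ('"' or '{'); None for strings built in code.
    loaded_as: Optional[str] = None


def can_render(chars: str, delim: str) -> bool:
    """Can `chars` be written with `delim` without any escaping?"""
    if "^" in chars:
        return False  # carets always need escaping, which this sketch skips
    if delim == '"':
        return '"' not in chars and "\n" not in chars
    return "{" not in chars and "}" not in chars


def mold(t: Text) -> str:
    # Try the remembered delimiter first, then the usual fallbacks.
    for delim in (t.loaded_as, '"', "{"):
        if delim is not None and can_render(t.chars, delim):
            close = '"' if delim == '"' else "}"
            return delim + t.chars + close
    raise ValueError("needs escaping, which this sketch omits")


untouched = Text("foo", loaded_as="{")
print(mold(untouched))   # remembered delimiter still works

mutated = Text("abc", loaded_as='"')
mutated.chars += '"'     # mutation invalidates the remembered delimiter
print(mold(mutated))     # falls back to the other form
```

The hint only helps while the string is untouched; after a mutation the molder is back to a heuristic, which is why the guarantee can only be a weak one.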

But getting the identical representation back is still hard, unless you store a copy of what was scanned. Because otherwise how would you tell if a codepoint was expressed literally or as an escaped value, if you're allowed to escape codepoints that can render literally? And what if there are multiple forms of escaping, e.g. numeric or symbolic? ("foo^(TAB)bar")
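The escaping problem can be shown with a toy decoder. This is a Python model handling only a small subset of caret escapes as I understand them (`^-`, `^/`, `^^`, `^"`, named forms like `^(TAB)`, and hex forms like `^(09)`); the `decode` function and its tables are assumptions for illustration, not the real scanner.

```python
NAMED = {"TAB": "\t", "LINE": "\n"}               # symbolic escapes (subset)
SHORT = {"-": "\t", "/": "\n", "^": "^", '"': '"'}  # one-character escapes


def decode(body: str) -> str:
    """Turn the source text between delimiters into codepoints."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "^":
            out.append(ch)          # literal codepoint, taken as-is
            i += 1
        elif body[i + 1] == "(":
            end = body.index(")", i)
            name = body[i + 2:end].upper()
            # Symbolic name if known, otherwise treat it as a hex codepoint.
            out.append(NAMED[name] if name in NAMED else chr(int(name, 16)))
            i = end + 1
        else:
            out.append(SHORT[body[i + 1]])
            i += 2
    return "".join(out)


spellings = ["foo\tbar", "foo^-bar", "foo^(TAB)bar", "foo^(09)bar"]
decoded = {decode(s) for s in spellings}
print(len(decoded))  # four source spellings, one resulting string
```

All four spellings collapse to the same codepoints, so nothing short of keeping the scanned source text would let a molder pick the original one back out.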

If we subtyped strings, I kind of wonder if subtyping them into something that can't contain newlines vs. that can might have some kind of benefit.

I don't know. But if we had LONG-TEXT! and TEXT!, then routines that wanted to accept either form would end up taking ANY-TEXT!, and that seems a lot like ANY-STRING!.

All things considered, I’m not sure that this is a desirable property. Because, if you’re only storing the string itself, this is an impossible goal. You’d have to store the original delimiter choice to do this… and is it really worth passing around an extra piece of data, just to make string printing look very slightly nicer?

For once, I think Red is on the right track. Choose a heuristic, ideally the most predictable one possible, and be done with it. {foo} and "foo" are the same string, and I don’t see why Ren-C should disguise that fact.

(Haskell has a similar issue, where strings with non-ASCII characters like "é" get printed back as "\233". In practice it hasn’t caused too many problems: if you want a string to be printed a certain way, there are functions to do that.)
