Removing "&" from Legal Word Characters - Any Objections?

hostilefork · July 10, 2019, 1:19am

The & character is not usually a legal character in identifiers in programming languages, and is saved for operators and such.

I have not seen much in the way of good uses of & in names... it's rather ugly.

AT&T: "this seems pretty pointless"

Even if we took it away from word characters we could have exceptions such as for & and && where those standalone could be WORD!, if anyone really cared about that.

One of my pet usage suggestions for & has been to embrace the HTML Entity List, and allow it as a syntax for characters (minus the semicolon of course).

append some-string &nbsp  ; adding a non-breaking space by entity name

Also, having that table built into the executable and offering it out could be pretty useful, even preferring to mold a known character using that instead of the numeric form.

This would help reduce the over-saturation of usages of #. An additional practical matter for that would be that you could specify characters as &{...} as well as &"..." ... this would help avoid escaping when putting characters in quotes as in the API:

 REBVAL *ch = rebValue("second [10 &{b} 20]");

Today we can't use #{b} instead of #"b" because that gets interpreted as a binary. So you have to do this as the less appealing:

 REBVAL *ch = rebValue("second [10 #\"b\" 20]");

The same thing happens trying to pass characters in double quotes inside --do code on the command line. I think desire for this duality of forms applies to the other string-like things as well (e.g. FILE! should be able to be either %{...} or %"...")

We don't necessarily need to do it right now, but deprecating & in words ASAP helps clear the path to this or other applications.

Does anyone have particularly great arguments for why & should be allowed in WORD!?

swhite · July 10, 2019, 1:34pm

Take it out. Take them all out. I have gotten burned a number of times by special characters in identifiers.

BlackATTR · July 10, 2019, 2:29pm

This is really important for a bunch of the things I'm trying to achieve. Working with XML in particular means dealing with a ton of these entities, and I would love to figure out a way to map these so that someone who's searching (in an automated way) for a piece of HTML stored in XML doesn't have to know the myriad character substitutions to do an effective/comprehensive and accurate search.

hostilefork · July 10, 2019, 5:20pm

Can you give me code examples of what you mean as useful?

I ask because I'm thinking that if a character exists in the table, we would cache its table entry in the cell, so molding could quickly show, even:

 >> first "Ææ"
 == &AElig

 >> second "Ææ"
 == &aelig

This strikes me as appealing, and dovetails well with the web build. But what are you thinking exactly?
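As a rough illustration of the table idea above, here is a minimal Python sketch that uses the standard library's HTML 4 entity table in place of a table built into the executable. The function names `char_from_entity` and `mold_char` are hypothetical, and the `U+XXXX` fallback is only a placeholder for whatever numeric form would really be molded.

```python
from html.entities import codepoint2name, name2codepoint


def char_from_entity(token):
    """Turn a hypothetical '&name' token (no semicolon) into a character."""
    if not token.startswith("&"):
        raise ValueError("expected a token like &nbsp")
    # Raises KeyError if the name is not in the HTML 4 entity table.
    return chr(name2codepoint[token[1:]])


def mold_char(ch):
    """Prefer the entity name when the table knows one, else a numeric form."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return "&" + name
    return "U+{:04X}".format(ord(ch))  # placeholder numeric notation


if __name__ == "__main__":
    print(mold_char("Æ"), mold_char("æ"), mold_char(">"), mold_char("a"))
```

Looking the name up on every mold is the simple version; caching the table entry in the cell, as suggested above, would only be an optimization on top of this.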

BlackATTR · July 10, 2019, 7:57pm

That looks great. I don't have code examples yet. But to give a more specific example, let's say that I have tens of thousands of text files. Some of these are XML files containing embedded HTML. This requires that the embedded HTML tags are carefully escaped. E.g.,

<?xml version="1.0" encoding="UTF-8"?><WYSIWYG>
<GenericHTML>
<Content>&lt;div class="tf module pt-20"&gt;
&lt;div class="content"&gt;
&lt;h1 class="tf content-title"&gt;What do I do if I&amp;rsquo;ve found the right event but the date is far off and I&amp;rsquo;m not sure if my plans will change?&lt;/h1&gt;
&lt;p class="normal pb-20"&gt;We understand a lot can change in a year or two. If you cancel your plans you&amp;rsquo;ll have your credit restored. &lt;br /&gt;&lt;br /&gt; &lt;strong&gt;Note:&lt;/strong&gt; Fees do not cover insurance.&lt;/p&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
</Content></GenericHTML></WYSIWYG>

A person wanting to perform an automated search through thousands of files like this to make updates/replacements to the content would have to search for both the unescaped and escaped forms of these characters and delimiters.

hostilefork · July 10, 2019, 8:04pm

Well we can't achieve DWIM. Are they escaped or not?

The proposal at hand would offer an easy way to speak in terms of the unescaped characters. It's very early in the thinking process to say mold first ">" returning &gt, but that is the kind of thing I'm saying is on the table.

(Really the post is about not making this decision, but restricting use of & so we can open these doors post Beta/One...)

So in that world, you could search for unspaced [mold ch ";"] and replace it, and search for ch and replace it. But again this is highly speculative.
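To make the two-pass search concrete, here is a small Python sketch of the same idea, assuming the standard library's HTML 4 entity table stands in for the proposed built-in one. `replace_both_forms` is a hypothetical helper, not anything that exists in Ren-C.

```python
from html.entities import codepoint2name


def replace_both_forms(text, ch, replacement):
    """Replace a character whether it appears raw or as a named entity."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        # Entity form, e.g. "&rsquo;" for U+2019.
        text = text.replace("&" + name + ";", replacement)
    # Raw (unescaped) form.
    return text.replace(ch, replacement)


if __name__ == "__main__":
    print(replace_both_forms("I&rsquo;ve and I\u2019m", "\u2019", "'"))
```

This only handles one level of escaping. Content that is escaped twice, like the `&amp;rsquo;` in the XML sample earlier in the thread, would need to be unescaped once before a search like this could match it.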

BlackATTR · July 10, 2019, 8:22pm

Yes, I think we agree. There is no DWIM here; somewhere, somehow, the characters need a map/lookup table. That map could certainly get unwieldy to manage manually when you consider all of the foreign character sets. So naturally I support this type of proposal.

LkpPo · July 17, 2019, 8:45pm

Like swhite, I am OK with removing special characters from identifiers. a-zA-Z0-9_ and all kinds of ASCII and Unicode dashes suffice.

iArnold · July 27, 2019, 7:21pm

I am on the strippers team as well!

hostilefork · October 8, 2020, 3:01am

In this topic (Removing “&” from Legal Word Characters - Any Objections?) I suggested that using & for CHAR! would offer some benefits. One was a similar appearance to HTML entities. Another was being able to write characters inside code that is itself enclosed in double quotes, where #"..." would have to be escaped (as in the API):

hostilefork:

An additional practical matter for that would be that you could specify characters as &{...} as well as &"..." ... this would help avoid escaping when putting characters in quotes as in the API:

 REBVAL *ch = rebValue("second [10 &{b} 20]");

Today we can't use #{b} instead of #"b" because that gets interpreted as a binary.

What I've been leaning toward lately is going the other way with this... and giving & to BINARY!, while ISSUE! and CHAR! become unified and use the #.

ISSUECHAR! forms (tentatively TOKEN!):
- #foo
- #"foo bar"
- #{foo bar}
BINARY! forms:
- &aebd
- &"aebd"
- &{aebd}

Reasons to Favor Binary Moving

The & character is too loopy and detailed to look very good sitting next to single letters:

>> first "abc"
== &a

>> first "abc"
== #a

When you think about the landscape of things ISSUE! is suggested to be used for, the ampersand is more polluted-looking:

#323-207-META

&323-207-META

So I feel like # is the better choice here. Binaries won't suffer so much, because they are typically enclosed in delimiters:

&{decafbad}

#{decafbad}

#{DECAFBAD}

&{DECAFBAD}

&"DECAFBAD"

#"DECAFBAD"

I actually kind of like the & better, visually. But beyond that, what I especially like is putting distance between binaries and the ISSUECHAR! type (TOKEN! still sounding like a good name for the unification).

If we wanted to, we could have binaries be expressed even without delimiters, like &ffee or &0011 ... I don't know how desirable that is.
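For a sense of what accepting all three spellings would involve, here is a hedged Python sketch of a reader for the proposed binary forms. Both the grammar and the name `parse_binary` are assumptions made for illustration; nothing here reflects an actual scanner.

```python
import re

# Hypothetical grammar: &{hex}, &"hex", or bare &hex.
BINARY_TOKEN = re.compile(
    r'&(?:\{([0-9A-Fa-f]*)\}|"([0-9A-Fa-f]*)"|([0-9A-Fa-f]+))'
)


def parse_binary(token):
    """Decode one of the proposed binary spellings into bytes."""
    match = BINARY_TOKEN.fullmatch(token)
    if match is None:
        raise ValueError("not a binary token: " + token)
    hex_digits = next(g for g in match.groups() if g is not None)
    # bytes.fromhex raises ValueError on an odd number of digits.
    return bytes.fromhex(hex_digits)


if __name__ == "__main__":
    print(parse_binary("&{decafbad}"), parse_binary("&ffee"))
```

The undelimited form is the one that competes with other uses of a leading &: in this sketch `&ffee` and `&0011` read as binaries, so a name made only of hex digits could not mean anything else.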

hostilefork:

One of my pet usage suggestions for & has been to embrace the HTML Entity List, and allow it as a syntax for characters (minus the semicolon of course).

 append some-string &nbsp  ; adding a non-breaking space by entity name

So this idea would basically not happen. This isn't to say we couldn't make it easier to process HTML entities. But in the theory about tokens, #nbsp would be an immutable literal that fits in a cell and really represents the string content "nbsp". Though you can imagine dialecting such things...perhaps saying that when you quote them you want them as-is but otherwise an HTML entity:

 my-dialect ["abc" #nbsp #nbsp '#nbsp]
 == "abc  nbsp"

I dunno. Anyway, wanted to give an update here...as the ISSUECHAR! thing really does look like it's shaping up very well and is seeming extremely likely to happen.
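The quoting dialect imagined above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the `Token` class and `my_dialect` function are hypothetical stand-ins for the token type and a dialect, and the standard library's HTML 4 entity table supplies the names.

```python
from dataclasses import dataclass
from html.entities import name2codepoint


@dataclass(frozen=True)
class Token:
    """Stand-in for a #token; quoted=True models the '#token form."""
    text: str
    quoted: bool = False


def my_dialect(items):
    """Join strings; quoted tokens stay as-is, others become HTML entities."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item.quoted:
            out.append(item.text)
        else:
            # Raises KeyError for names missing from the entity table.
            out.append(chr(name2codepoint[item.text]))
    return "".join(out)


if __name__ == "__main__":
    items = ["abc", Token("nbsp"), Token("nbsp"), Token("nbsp", quoted=True)]
    print(repr(my_dialect(items)))
```

Here the two unquoted tokens come out as non-breaking spaces (U+00A0) and the quoted one comes out as the literal text "nbsp".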

IngoHohmann · October 8, 2020, 6:45pm

I like the token part, and "&" looks OK for binaries.
HTML entities would have been nice, though.
Just an idea, not a proposal:
If binaries required the delimiters, then plain &nbsp could be free for char entities...

hostilefork · October 9, 2020, 1:30am

Maybe. But with #x and #"x" and #{x} all meaning the same thing, and the same pattern applying to FILE! as %x and %"x" and %{x}... I feel like it would break the pattern in a non-intuitive way.

We can think about it. There's also another option for binaries, as ${decafbad} and $"decafbad"

hostilefork · September 25, 2022, 12:36am

Something I hadn't really thought of here--but that might fit in modern "explosion of dialect parts" thinking--would be if & was just another generic sigil.

&word
&tu.p.le
&pa/th
&[bl o ck]
&(gr o up)

So in your dialect--if you wanted to--you could interpret &word as an HTML entity, if that were what you wanted to do with it. But you could apply other meanings.

There'd be some syntax compatibility problems with HTML, e.g. it has &#32 and I dunno how that would fit in. &[32] or &(32) ?
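One way a dialect might reconcile named and numeric references is sketched below in Python, assuming the sigil forms were handed to the dialect as plain text. The function `resolve` and the decision to treat `&[32]` and `&(32)` alike are assumptions for illustration only.

```python
import re
from html.entities import name2codepoint

NUMERIC = re.compile(r"&[\[(](\d+)[\])]")        # &[32] or &(32)
NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*)")   # &nbsp, &AElig, ...


def resolve(token):
    """Map a hypothetical &-sigil token to the character it stands for."""
    match = NUMERIC.fullmatch(token)
    if match:
        # Plays the role of HTML's decimal reference, e.g. &#32;
        return chr(int(match.group(1)))
    match = NAMED.fullmatch(token)
    if match:
        # Raises KeyError for names missing from the entity table.
        return chr(name2codepoint[match.group(1)])
    raise ValueError("unrecognized token: " + token)


if __name__ == "__main__":
    print(repr(resolve("&(32)")), repr(resolve("&nbsp")))
```

For brevity this regex also lets mismatched brackets such as `&[32)` through; a real scanner would have rejected those before the dialect ever saw them.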