With the divergence of STRING! and BINARY!, you have two similarly diverging needs for BITSET!:
- For STRING!, matching character sequences (including higher Unicode codepoints):
dashes: charset ["-" "–" "—"]
- For BINARY!, matching byte ranges:
upper: charset [128 - 255]
However, there are situations where it'd be desirable to mix usages, for example where UTF-8 sequences are delimited by certain byte sequences:
parse some-stream [2 upper some dashes]
Presumably some-stream would be a BINARY!, as there'd be no non-UTF-8 high-byte sequences in a STRING!.
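Concretely, the mix is awkward because DASHES, interpreted as codepoints, would have to match multi-byte UTF-8 sequences inside the BINARY!, while UPPER tests one byte at a time. A quick sketch (assuming the UTF-8-everywhere string representation under discussion):

    to binary! "-"    ; ASCII hyphen: one byte, #{2D}
    to binary! "—"    ; em dash: one codepoint, but three bytes, #{E28094}

So a byte-oriented bitset can be applied position by position, but a codepoint-oriented one needs the parse to know where each encoded character begins and ends.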
I guess I had thought that whether you were parsing a BINARY! or an ANY-STRING! would provide the interpretation. But you are right that when parsing a BINARY! one might want the codepoint interpretation vs. the byte interpretation.
There are really only two avenues of solution: a PARSE keyword to distinguish the usage, or a datatype distinction (a byteset!?). I have a vague feeling this is an esoteric problem for which PARSE's oddity should be paying the tax, so it should be a PARSE keyword... maybe even something strange like being able to go INTO a BINARY! from a string parse, or vice versa, switching the interpretation mode.
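As a sketch of what the keyword approach might look like: BYTES and TEXT below are invented names for illustration, not existing PARSE features, marking which interpretation a sub-rule uses:

    upper: charset [128 - 255]       ; byte-oriented bitset
    dashes: charset ["-" "–" "—"]    ; codepoint-oriented bitset

    parse some-stream [
        bytes [2 upper]      ; hypothetical: match UPPER against raw bytes
        text [some dashes]   ; hypothetical: match DASHES against decoded codepoints
    ]

The INTO-style variant would instead have a sub-rule treat the same position reinterpreted as the other type, returning to the original mode when the sub-rule finishes.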