Newlina Non Grata

hostilefork · September 15, 2019, 3:21am

There was an old bug raised by @Respectech on Atronix's repo, which I will digest as:

On Windows:
>> write %test.txt "a^/b^/c"
When opening %test.txt in Notepad, for instance, you'll see:
abc
But then...
>> read %test.txt
== #{610A620A63}
Either I expect 'write itself to convert "^/" to "^M^J", or I expect there to be a refinement like 'write/string that does that conversion. If it doesn't, it breaks Carl's original goal of having Rebol handle the details of cross-platform compatibility.

That's his opinion, but I have a different point of view on the resolution, here.

I don't believe in CR LF as a relevant textual exchange format

CR LF is roadkill on the information superhighway, just like UCS2, worrying about UTF-16BE (Big Endian) vs. UTF-16LE (Little Endian), the Byte Order Mark for UTF8, and many other concepts whose time has passed.

I wrote an article that digests my opinion here: "Death to Carriage Return".

This means that, as part of the strong stance I think Rebol needs to take against software complexity, only one line ending format should be supported by default. No "magic" that actually just adds hidden variables to the process. If you're one of those unfortunate souls who can't work with a single character newline--the one the world has come to standardize on--then you need to either work at the BINARY! level or you need to use a special codec.

I am strongly against any magic that migrates CR LF to LF in a non-explicit fashion, and I think reading TEXT! out of a file that has CR LF should error by default.

Here is my response, migrated from where it won't be seen on @szeng's repo to here...where we might discuss it:

My Original Response (8-Oct-2015)

I actually ran into this the other day trying a basic rewriting task on a Windows machine. TO-STRING of a BINARY! and TO-BINARY of a STRING! are not symmetric... the former will take out carriage returns while the latter does not. Input file:
One
Two
Code:
>> bin: read %test.txt
== #{4F6E650D0A54776F}  ; note the 0D 0A

>> str: to-string bin
== "One^/Two"

>> to-binary str
== #{4F6E650A54776F}  ; note just 0A
So basically--if you use strings, Rebol3 is currently normalizing them to unix format, regardless of platform.
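The asymmetry can be modeled outside Rebol. The following Python sketch is only an illustration of the behavior described above, not Rebol's implementation: `lossy_to_string` is a hypothetical stand-in for TO-STRING (assumed here to drop every CR while decoding), and `to_binary` stands in for TO-BINARY.

```python
def lossy_to_string(data: bytes) -> str:
    """Hypothetical stand-in for TO-STRING: decode UTF-8 and silently drop CR."""
    return data.decode("utf-8").replace("\r", "")


def to_binary(text: str) -> bytes:
    """Hypothetical stand-in for TO-BINARY: encode as UTF-8, adding nothing."""
    return text.encode("utf-8")


original = bytes.fromhex("4F6E650D0A54776F")  # "One", CR, LF, "Two"
text = lossy_to_string(original)
round_tripped = to_binary(text)

print(round_tripped.hex().upper())  # 4F6E650A54776F
print(round_tripped == original)    # False: the CR was lost without warning
```

The round trip changes the bytes on disk even though the program never asked for a conversion, which is the hidden variable at issue.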

Part of this appeals to me because I think CR LF is irritating. If there's going to be a risk of making an error, I'd rather see CRs being eliminated from disk files that had them than added to files that don't. :-/ A comment on "The Great Newline Schism" points to my feeling that everything in Windows but Notepad can deal with LF these days:

Wow I just noticed that this is a non-issue with windows 7. Just use unix-newlines. The only application (of the few tested) I have found which does not understand unix-newlines as newlines is the useless notepad. For instance it seems the following applications understand unix-newlines just fine in windows 7:

cmd scripts

powershell scripts

word 2013 (I can open a txt file with unix-newlines, though I never use that, I can also paste text with unix-newlines and get correct/desired line breaking)
OneNote 2013 (pasting text)
wordpad (not that I use it)
Sublime Text 3 (naturally, just on the list because it the best! :smile:)
Eclipse

That cmd-scripts work with unix-newlines was the most surprising and crucial feature for me.

There are bound to be gotchas that may be discovered over the years, but so far so good. I think Microsoft is trying to help here...

So we should help too. But I think the big mistake here is trying to take a real/actual/concrete problem and make it "invisible"...thus losing data without warning.

You can't wish away complexity, but you can ask it to go away. I'd suggest that Rebol favor the universe that Unix/Posix/Linux (then OS X, and now Windows) seem to be going for, with just LF. Look at the move to line-feeds-only as a vote for the future... like using UTF-8 as an exchange medium.

So consider files or binaries with carriage returns in them to be a foreign format. Don't read them or write them without a special codec, the same way you'd need for UCS-2 or anything else.
>> to-string #{4F6E650D0A54776F} 
** Error: Deprecated 0x0D (CR) byte in UTF8 (try decode-utf8-legacy)
Then have the decoder have options to preserve CR bytes, discard them, give errors if they are found standalone vs. paired with an LF, in reverse order, etc. All the lovely issues you have from the two-character sequence.
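To make the proposal concrete, here is a minimal Python sketch of a strict default decoder plus an opt-in legacy decoder. It is an illustration under assumptions, not an existing Rebol API: the name `decode_utf8_legacy` mirrors the `decode-utf8-legacy` hint in the error above, while the `cr` and `allow_lone_cr` options are hypothetical names for the policies just listed.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Default decoder: refuse any CR byte instead of silently normalizing it."""
    if b"\r" in data:
        raise ValueError("Deprecated 0x0D (CR) byte in UTF8 (try decode_utf8_legacy)")
    return data.decode("utf-8")


def decode_utf8_legacy(data: bytes, cr: str = "discard", allow_lone_cr: bool = False) -> str:
    """Opt-in decoder for data that may contain CR (option names are hypothetical).

    cr: "preserve" keeps CR characters; "discard" removes every CR,
        so CR LF becomes LF.
    allow_lone_cr: when False, a CR not immediately followed by LF is an
        error, which also rejects the reversed LF CR order.
    """
    if cr not in ("preserve", "discard"):
        raise ValueError("unknown cr policy: " + cr)
    text = data.decode("utf-8")
    if not allow_lone_cr:
        for i, ch in enumerate(text):
            if ch == "\r" and text[i + 1 : i + 2] != "\n":
                raise ValueError(f"standalone CR at character {i}")
    if cr == "preserve":
        return text
    return text.replace("\r", "")


data = bytes.fromhex("4F6E650D0A54776F")  # "One", CR, LF, "Two"
try:
    decode_utf8_strict(data)
except ValueError as err:
    print(err)  # the strict default is noisy rather than lossy

print(repr(decode_utf8_legacy(data)))                 # 'One\nTwo'
print(repr(decode_utf8_legacy(data, cr="preserve")))  # 'One\r\nTwo'
```

The point of the split is that any loss of CR bytes happens only because the caller picked a policy, never as a side effect of an ordinary conversion.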

It might seem tempting to just say that if you manage to get a string into the system with CR in it that you should write it out. But I'd say the UTF8 default encoder used and standardized by the system should be picky too. Given how much of Rebol's common assumption (and the assumption we'd like to be able to make systemically) is that newline is all you need, if you didn't filter the CRs out you will end up with a mixture of line endings most of the time.
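The encoder side of the same idea could look like the following Python sketch. Both function names are hypothetical and chosen only for illustration: the default encoder rejects text containing CR, and writing CR LF requires calling a separate, explicit encoder.

```python
def encode_utf8_strict(text: str) -> bytes:
    """Default encoder: refuse to write text that still contains a CR."""
    if "\r" in text:
        raise ValueError("CR (U+000D) in text; use an explicit legacy encoder to write it")
    return text.encode("utf-8")


def encode_utf8_crlf(text: str) -> bytes:
    """Explicit opt-in (hypothetical): write every LF as CR LF."""
    if "\r" in text:
        raise ValueError("text must be LF-only before CR LF conversion")
    return text.replace("\n", "\r\n").encode("utf-8")


# A CR LF fragment joined with LF-only text yields mixed line endings.
mixed = "One\r\nTwo" + "\n" + "Three"
try:
    encode_utf8_strict(mixed)
except ValueError as err:
    print(err)  # the mixture is caught before it reaches disk

print(encode_utf8_strict("One\nTwo").hex().upper())  # 4F6E650A54776F
print(encode_utf8_crlf("One\nTwo").hex().upper())    # 4F6E650D0A54776F
```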

So...

Make a strong decision about the default: LF is favored by everyone these days but Notepad, and it's better to help facilitate living in that world.

Standardize that when Rebol files are exchanged over the network they will not have CR LF in them. Don't load source that contains CR unless a special command-line switch or mode is set...default is OFF. (I feel the same way about tabs.) No matter what tolerance these modes give, do not let string literals have the "bad" characters in them.

If someone is working in a hybrid environment where their data files do have CR in them, be noisy. Don't read as strings or write back out with CR unless they really know what they are doing and demand it. Make it as easy as feasible to demand and give guidance...but make it clear that the native tongue is no-CR.

hostilefork · September 15, 2019, 3:33am

This goes along with what I've said about removing NUL ('\0') from TEXT!. You see this on the rise in specs, as people are realizing the security implications.

If we are effective about the distinction between BINARY! and TEXT!, and make it as easy as we can to parse and process a BINARY! with TEXT! fragments... then perhaps TEXT! would prohibit control codes entirely, except for newline and space.
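As a rough Python illustration of what such a rule could mean, the sketch below rejects any character in Unicode general category Cc (control) other than LF. The function name `check_text` is hypothetical, and treating "control codes" as exactly category Cc is an assumption; under it NUL, CR, and tab are all rejected, while space (category Zs, not a control code) passes.

```python
import unicodedata


def check_text(text: str) -> str:
    """Return text unchanged if it has no control codes other than LF.

    Assumption: "control code" means Unicode general category Cc.
    """
    for i, ch in enumerate(text):
        if ch != "\n" and unicodedata.category(ch) == "Cc":
            raise ValueError(f"control code U+{ord(ch):04X} at index {i} is not allowed in text")
    return text


print(repr(check_text("One\nTwo words")))  # 'One\nTwo words'

for bad in ("a\0b", "a\r\nb", "a\tb"):
    try:
        check_text(bad)
    except ValueError as err:
        print(err)
```

Data that fails this check would stay a BINARY!, which is where the non-printables are meant to be quarantined.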

Making BINARY! a fluid "unclean" type that can do nearly everything a string can do, and quarantining the non-printables to it, might be the most forward-looking concept we have at hand. Let's just look at practical cases and how those with BINARY!-ish needs can be appeased while removing hidden variables and vulnerabilities as best we can.