Fight for the Future: How DELINE will save us from CR LF

hostilefork · February 23, 2020, 7:22pm

An early idea that seems to have been in Rebol's "easy cross-platform vision" was the desire to simplify strings in the language to have a single codepoint to represent line breaks. This had been the standard for Unix machines for some time, and when Apple went to a Unix basis for OS X they adopted it too.

... but to try and be a good citizen, Rebol didn't want to do this at the cost of bucking the trend on Windows...where files on disk had two-byte "CR LF" sequences. Despite most every programmer's editor being able to handle plain LF on Windows for decades, the rigid holdout of NOTEPAD.EXE would continue to make such files render with everything on one line.

(How did NOTEPAD.EXE become so powerful? It's a good question--read some modern opinions.)

The good instinct here was that a single codepoint is a reduced complexity situation. Whether it's a PARSE rule or any other code that does string manipulation, that single codepoint for the idea of a line break is way easier to look for and manage.

But it's not 1997 anymore (and it never should have been, at least in this respect). Having carriage returns in your files is an artifact of history. The untold man-hours lost by developers trying to appease NOTEPAD.EXE were wasted--and it's one of those cases where people should have been firm and standardized on LF. So while you're getting those old files upgraded to UTF-8 and out of whatever codepage 866 or formats with byte-order-marks you have... lose the CRs too.

The fact is that when the interpreter core attempts to be magical about this it becomes a mess. Low-level C can no longer assume it can work with strings directly--with the actual bytes that are in them--and instead has to make copies or move memory around to remove the things you don't want to be there. You wind up entangled in questions of what to do if you see [CR CR LF] or [LF CR]. And when you start mutating the user's input behind the scenes without explicit say-so, the "magic" often leads to mysterious side effects and information loss.

New Answer: Strict Core, but Enhance DELINE

Historically DELINE took strings. But now I'm going to make it take BINARY!, FILE!, and URL!.

If you suspect that a file has CR LF sequences in it and you wish to be tolerant of this, then:

do deline %some-wonky-local-file.reb

Otherwise, the default behavior is that DO will error on CR. TO TEXT! of a BINARY! will consider CR to be codepoint-non-grata...you will need to use DELINE and it will now accept BINARY!.

TO TEXT! is going to be prescriptive in other ways. It's not going to allow embedded 0 bytes, because that creates risky interactions with old-style zero-terminated C strings. It means you cannot trust a simple extractor of a char* as giving you all the relevant data--you always need to worry about a length output parameter. I think that's an undesirable property for the string extractor for libRebol, and you should use a BINARY! in such cases where you are forced to always get back a size in the API.
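
To illustrate the zero-byte hazard outside of Rebol, here is a small Python sketch using the standard ctypes module. Python is only a stand-in here; nothing about it reflects how libRebol is implemented. A C-style read of the buffer stops at the first 0 byte, so the tail is silently lost unless the caller also tracks a length:

```python
import ctypes

# A buffer with an embedded zero byte, as a BINARY! might contain.
data = b"ab\x00cd"
buf = ctypes.create_string_buffer(data)

# .value reads the buffer the way a C char* consumer would:
# everything up to (not including) the first NUL.
c_string_view = buf.value

# .raw returns every byte, which only works because the size is known.
# create_string_buffer appends one terminating NUL of its own.
full_view = buf.raw

print(c_string_view)
print(full_view)
```

The short view drops "cd" without any error, which is the kind of silent loss that banning zero bytes from TO TEXT! is meant to rule out.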

Q: How "Platform-Sensitive" Should It Be? (A: None?)

It seems like DELINE pretty much by default needs to accept files that either have CR LF in them or that do not. Because if you say:

do deline https://example.com/some-wonky-internet-file.reb

You are dealing with a file that's not on your computer, and we don't want to create a disincentive to the person hosting it cleaning it up. They can't edit your script to take the DELINE out, so your script has to keep working once their file becomes LF-only.

But then we have to ask about things like whether ENLINE should default to being a no-op on Linux platforms, and only adding the CRs on Windows. Historical Rebol added it either way.

My hope is that people will really avoid using ENLINE and DELINE if at all possible, and get their files in order. But as tools they will be there for people who find themselves stuck and can't do that. I feel this is definitely a step in the right direction, and overall code cleanliness and performance will benefit from it.

hostilefork · February 27, 2020, 5:29pm

I haven't gotten a lot of feedback on my concepts here, but I want to stress it is important. I've tried to be clear: I'm uncomfortable with changing data out from under people. It's the opposite of fighting complexity, it's making things unpredictable.

Here's Rebol2 on Windows behavior (which matches Red 0.6.4 on Windows):

rebol2>> write %foo.txt "a^M^/b"

rebol2>> read/binary %foo.txt
== #{610D0D0A62}  ; actually wrote ["a" CR CR LF "b"]

rebol2>> read %foo.txt
== "a^/^/b"

Here's R3-Alpha on Linux behavior:

r3-alpha>> write %foo.txt "a^M^/b"

r3-alpha>> read %foo.txt
== #{610D0A62}  ; wrote ["a" CR LF "b"]

r3-alpha>> read/string %foo.txt
== "a^/b"

The CR LF situation started out bad for people who are on their own, in languages that chose not to get involved. But this state of affairs in the Rebolverse adds up to being worse. The people whose files these are won't appreciate the laissez-faire way in which bytes are doubled or thrown out, etc. It only gets worse the more weird your usage is because the code doing this processing is haphazard. Predicting its behavior on CR CR LF or LF CR CR or other edge cases is going to be nigh impossible.
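
The doubled CR in the Rebol2 transcript is what you would get from a writer that blindly expands every LF to CR LF without checking for a CR that is already there. That this is the actual mechanism inside Rebol2 is an assumption, not something read from its source. A minimal Python sketch of that kind of writer (the name naive_enline is hypothetical):

```python
def naive_enline(data: bytes) -> bytes:
    """Blindly turn every LF into CR LF, ignoring any CR already present."""
    return data.replace(b"\n", b"\r\n")


# The same bytes as the string "a^M^/b" in the transcript: a, CR, LF, b.
already_crlf = b"a\r\nb"

doubled = naive_enline(already_crlf)
hex_doubled = doubled.hex().upper()

print(hex_doubled)  # 610D0D0A62
```

The output has the same bytes as the read/binary result in the Rebol2 session: the existing CR is kept and a second one is added in front of the LF.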

People need to be source-level participants in us throwing data out (or in). This is what makes me favor words like DELINE and ENLINE that clearly indicate mutation, and encourage that people have to use them in order to get the mutations.

So expressions that read like TO TEXT! BIN have two options:

Leave the CR LF situation as-is
Raise an error and guide you to a preserving or mutating primitive.

That's not what happened in R3-Alpha, where TO STRING! acts like this:

r3-alpha>> to string! to binary! "a^M^/b"
== "a^/b"

In my thinking, to get that behavior you use DELINE on the BINARY!. Then, TO TEXT! would error saying "hey, you need to either use DELINE -or- TO-TEXT/RELAX depending on your intent".

Note the delining behavior of TO STRING! wasn't in Rebol2 (nor now in Red 0.6.4):

rebol2>> to string! to binary! "a^M^/b"
== "a^M^/b"

But that leads to an inconsistency, where a casual programmer who writes to string! read/binary %foo.txt gets a different answer from someone who just says read %foo.txt and gets a string back.

I feel we're headed in a better direction with explicit DELINE.

Let's try and lean really strongly to everyone weaning themselves off of CR LF. Those who can't are going to need to get involved at the source level.

hostilefork · February 27, 2020, 5:30pm

Here are some new tests of AS TEXT! and TO TEXT! demonstrating the subtleties:

CR codepoints (^M) are illegal in TO-string conversion unless /RELAX is used. They are legal in AS-conversions unless /STRICT mode is used.

    str: "a^M^/b"
    a-bin: as binary! str  comment {remembers it was utf-8, optimizes!}
    t-bin: to binary! str  comment {makes dissociated/unconstrained copy}

    ('illegal-cr = pick trap [to text! t-bin] 'id)
    ('illegal-cr = pick trap [to-text t-bin] 'id)
    (str = to-text/relax t-bin)

    ('illegal-cr = pick trap [to text! a-bin] 'id)
    ('illegal-cr = pick trap [to-text a-bin] 'id)
    (str = to-text/relax a-bin)

    (str = as text! t-bin)
    (str = as-text t-bin)
    ('illegal-cr = pick trap [as-text/strict t-bin] 'id)

    (str = as text! a-bin)
    (str = as-text a-bin)
    ('illegal-cr = pick trap [as-text/strict a-bin] 'id)

#{00} bytes are illegal in strings regardless of /RELAX or /STRICT

    ('illegal-zero-byte = pick trap [to text! #{00}] 'id)
    ('illegal-zero-byte = pick trap [to-text #{00}] 'id)
    ('illegal-zero-byte = pick trap [to-text/relax #{00}] 'id)

    ('illegal-zero-byte = pick trap [as text! #{00}] 'id)
    ('illegal-zero-byte = pick trap [as-text #{00}] 'id)
    ('illegal-zero-byte = pick trap [as-text/strict #{00}] 'id)
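
For readers who don't have a Ren-C build handy, the rules those tests exercise can be modeled in a few lines of Python. This is only a sketch of the checks: the function and exception names are hypothetical, and Python cannot express the difference between AS (aliasing the same bytes) and TO (making a copy), so only the defaults for CR and zero bytes are captured.

```python
class IllegalCR(ValueError):
    """Stands in for the 'illegal-cr error id in the tests."""


class IllegalZeroByte(ValueError):
    """Stands in for the 'illegal-zero-byte error id in the tests."""


def _decode(data: bytes, allow_cr: bool) -> str:
    # Zero bytes are rejected regardless of any refinement.
    if b"\x00" in data:
        raise IllegalZeroByte("zero bytes are never legal in text")
    if not allow_cr and b"\r" in data:
        raise IllegalCR("CR found; deline first or use a relaxed conversion")
    return data.decode("utf-8")


def to_text(data: bytes, relax: bool = False) -> str:
    """Model of TO TEXT!: CR is an error unless relax is requested."""
    return _decode(data, allow_cr=relax)


def as_text(data: bytes, strict: bool = False) -> str:
    """Model of AS TEXT!: CR is allowed unless strict is requested."""
    return _decode(data, allow_cr=not strict)


print(repr(to_text(b"a\r\nb", relax=True)))
print(repr(as_text(b"a\r\nb")))
```

The two conversions share one check and differ only in which way the default leans, which is the subtlety the tests are pinning down.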

IngoHohmann · February 28, 2020, 6:40am

Sounds good to me.

How do I check whether a string, e.g. from a file/network, uses LF or CR LF? I know some really ancient programs are still running, and may need CR LF.

hostilefork · March 3, 2020, 12:56pm

Checking would be the same as you would think:

bin: read %wherever.txt
if find bin cr [print "Has a CR"]

You can deline it after that. But note that DELINE is now strict; it's not the same thing as replace/all bin cr null...as it will error if the input file isn't strictly CR LF sequences. The goal is that ENLINE and DELINE work with valid formats and so if you have a Frankenstein file you'll find out.
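
To make the difference from a blunt replace concrete, here is a rough Python model of a strict deline. The exact rules Ren-C applies may differ; in this sketch the assumptions are that LF-only input passes through unchanged, that any CR not immediately followed by LF is an error, and that mixing CR LF with bare LF is also an error. The name deline_strict is hypothetical.

```python
def deline_strict(data: bytes) -> bytes:
    """Convert CR LF to LF, refusing input that is not a clean format."""
    if b"\r" not in data:
        return data  # already LF-only, nothing to do

    # Take out every well-formed CR LF pair and see what is left over.
    leftovers = data.replace(b"\r\n", b"")
    if b"\r" in leftovers:
        raise ValueError("stray CR that is not part of a CR LF pair")
    if b"\n" in leftovers:
        raise ValueError("input mixes CR LF and bare LF line endings")

    return data.replace(b"\r\n", b"\n")


def blunt_remove_cr(data: bytes) -> bytes:
    """The replace-everything approach: never complains, just drops CRs."""
    return data.replace(b"\r", b"")


print(deline_strict(b"a\r\nb\r\n"))
print(blunt_remove_cr(b"a\r\r\nb"))
```

The blunt version quietly turns a CR CR LF file into something that looks fine, while the strict version raises on the same input, which is how you find out you have a Frankenstein file.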

(I should take this time to point out how cool it is that we actually have the null state to mean "replace with nothing".)

giuliolunati · October 3, 2020, 4:54pm

My 2cents...

I think CR is valid UTF-8, so TO-TEXT and AS-TEXT should preserve it, without /RELAX.
Instead, DELINE/RELAX could be useful for inconsistent files (if a file mixes CR+LF, single CR and single LF, then convert all to LF)
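
As a rough illustration of what such a DELINE/RELAX could do, here is a Python sketch. The refinement is only a proposal in this thread, so the function name and the exact behavior are assumptions.

```python
import re

# Match CR LF first so a pair becomes one LF; any remaining lone CR
# is matched by the second alternative.
_ANY_LINE_END = re.compile(rb"\r\n|\r")


def deline_relax(data: bytes) -> bytes:
    """Normalize CR LF, lone CR, and lone LF all to LF."""
    return _ANY_LINE_END.sub(b"\n", data)


mixed = b"a\r\nb\rc\nd"
print(deline_relax(mixed))
```

Note that under this rule a CR CR LF sequence comes out as two line breaks, since the first CR is treated as a line ending of its own.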

iArnold · October 3, 2020, 7:51pm

Sold! For 2 cents to the gentleman mr Lunati!

kealist · June 13, 2021, 2:40pm

I don't mean for this to come across as too harsh, but I'm just feeling frustrated. I always come back to this and I just hate the current behavior. I don't have to deal with this kind of stuff in any other language I use. I banged my head against which file was causing the error for a while before I finally got it sorted out. I'm trying to teach a non-programmer friend a little bit about parse, and I'm having to get them to change how git handles line endings on their Windows computer to even be able to run a program. Something should be happening seamlessly under the hood to handle the issues, or it could be controlled explicitly by someone who wants to, but I would argue against this being the default behavior. I don't use Rebol that often anymore, but every time I come back to Ren-C I hit this, and as a Windows user, I don't want using Rebol to be a constant struggle against line endings. This is mostly about having to deal with line endings again in scripts I write, and having to switch every text editor I use to LF line endings. I don't have to do this in any other language I use.

kealist · June 13, 2021, 2:58pm

Just for context, it goes like this.

I want to write a scraper for some data
Make a new script, %script.r3. Write print to text! read https://blah.com
Run script (do %script.r3), error, illegal line ending
Search, find this post, oh, I have to use to-text/relax
Run script, error, illegal line ending
Waste time running in console, it seems the code itself works ok
Waste more time.
Oh, I have to save script.r3 in LF mode only
Search how to do that in whatever text editor I am using
Switch VSCode to do that
Save, run, it works
Put it in repo, and share with friend
They clone and try to run do %script.r3
Error, line endings.

This is pretty much my experience every time I come back to Ren-C

hostilefork · June 13, 2021, 5:57pm

Feedback is good. No worries.

Hopefully you've been following some of the developments in UPARSE.

It's still a prototype and not fast enough for use on anything but small bits of input yet. But it's very good!

Of course things are developing some speedbumps here and there as PARSE and UPARSE are brought into sync. I have no problem with people learning Red or Rebol2, and coming to Ren-C at some later time when their tastes evolve...

I've put forth that if every script in Ren-C ever written must have the same line endings, this offers a distinct feature for those sharing scripts. People won't fight over it, because the fight is decided for you. And the LF-only version has very clear advantages.

You raise a good point here that this can all be for naught if other people are using a "corrupting" transfer process like git cloning with CR LF translation.

But I think you're doing your friend a big favor by having them fix it as early as possible. And be aware of it as early as possible. There are just so many good reasons not to use that translation:

Windows Git Users: Don't Convert Line Endings

The story I wish came across wouldn't be that this is about one language you use, but about shifting gears so you systemically cleanse this from all the development in which you participate.

e.g. you might take a day to convert all the files on your system to the LF world, and all your editors to LF only. If you ever see a CR LF you know to kill it...like squishing a bug. So then the behavior isn't a frustration...it's a warning...one that helps you keep all your files in shape and know when a rogue tool is CR-LF-ing you.
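
As a starting point for that kind of cleanup day, something like the following Python sketch could list which script files under a directory still contain a CR. The function name and the default suffix list (taken from the extensions mentioned in this thread) are assumptions; it only reports files and does not modify anything.

```python
from pathlib import Path


def files_with_cr(root, suffixes=(".reb", ".r", ".r3")):
    """Return the script files under root that contain at least one CR byte."""
    found = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Read as bytes so no newline translation can hide the CRs.
        if b"\r" in path.read_bytes():
            found.append(path)
    return found


if __name__ == "__main__":
    for offender in files_with_cr("."):
        print(offender)
```

Reading the files as bytes matters here: opening them in text mode would let the runtime translate line endings and hide exactly what you are looking for.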

The trouble with Windows CRLF

This is how I experience the feature. It's not an inconvenience...it's an alarm going off. It tells me "corruption has crept in, kill it now, before it does any more damage." I then think "whew, good thing I found that before it wound up spreading to another machine..."

I'm certain I want the hard sell of the philosophy to be something that comes in by default. Because I've looked at the angles and I think it is the correct way of working. A good number of people agree with "LF only, even on Windows" now...and I think that will just keep increasing every year.

I don't want to make a global way to turn off the feature. Because if people can fire and forget about this...in some .ini file...you wind up with hidden state that makes a script on one system run differently than a script on another system.

But I don't want to lose users and contributors. I just want to keep a "clear and present warning" that they are disbelievers, so people know to be wary of their scripts.

Perhaps a file naming pattern, like script.lax.r ?

In the absence of being able to convince a casual user like yourself to "take the plunge and cleanse thyself", I would feel better about that as a compromise than other ideas.

As for the reading foreign data sources issue, I feel like the codec-level questions are different and will need to be addressed with other finesses. Wish we knew more about that, because the definition of what a READ or LOAD should mean still is quite nebulous.

In any case, thank you again for speaking up and explaining your experience.