Fight for the Future: How DELINE will save us from CR LF

An early idea that seems to have been in Rebol's "easy cross-platform vision" was the desire to simplify strings in the language to have a single codepoint to represent line breaks. This had been the standard for Unix machines for some time, and when Apple moved to a Unix basis for OS X, they adopted it too.

... but to try and be a good citizen, Rebol didn't want to do this at the cost of bucking the trend on Windows...where files on disk had two-byte "CR LF" sequences. Despite most every programmer's editor being able to handle plain LF on Windows for decades, the rigid holdout of NOTEPAD.EXE would continue to make such files render with everything on one line.

(How did NOTEPAD.EXE become so powerful? It's a good question--read some modern opinions.)

The good instinct here was that a single codepoint is a reduced complexity situation. Whether it's a PARSE rule or any other code that does string manipulation, that single codepoint for the idea of a line break is way easier to look for and manage.

But it's not 1997 anymore (and it never should have been, at least in this respect). Having carriage returns in your files is an artifact of history. The untold man-hours lost by developers trying to appease NOTEPAD.EXE were wasted--and it's one of those cases where people should have been firm and standardized on LF. So while you're getting those old files upgraded to UTF-8 and out of whatever codepage 866 or formats with byte-order-marks you have... lose the CRs too.

The fact is that when the interpreter core attempts to be magical about this, it becomes a mess. Low-level C code can no longer assume it can work with strings directly (with the actual bytes that are in them) and instead has to make copies or move memory around to remove the things you don't want to be there. You wind up entangled in questions of what to do if you see [CR CR LF] or [LF CR]. And when you start mutating the user's input behind the scenes without explicit say-so, the "magic" often leads to mysterious side effects and information loss.

New Answer: Strict Core, but Enhance DELINE

Historically DELINE took strings. But now I'm going to make it take BINARY!, FILE!, and URL!.

If you suspect that a file has CR LF sequences in it and you wish to be tolerant of this, then:

do deline %some-wonky-local-file.reb

Otherwise, the default behavior is that DO will error on CR. TO TEXT! of a BINARY! will consider CR to be codepoint-non-grata...you will need to use DELINE and it will now accept BINARY!.
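To make the intended split concrete, here is a small Python model of the behavior described above. It is only a sketch, not Ren-C code: the names to_text, deline, and IllegalCR are made up for illustration, and the relax flag stands in for an explicit opt-in to keep CRs.

```python
class IllegalCR(ValueError):
    """Raised when a carriage return appears where only LF is allowed."""


def to_text(data: bytes, relax: bool = False) -> str:
    """Decode UTF-8 bytes, refusing CR unless the caller opts in with relax."""
    # Byte 0x0D never occurs inside a multi-byte UTF-8 sequence,
    # so a plain byte search is enough to detect CR.
    if not relax and b"\r" in data:
        raise IllegalCR("CR found; deline the data first or pass relax=True")
    return data.decode("utf-8")


def deline(data: bytes) -> bytes:
    """Turn each CR LF pair into LF (a mutation the caller asked for)."""
    return data.replace(b"\r\n", b"\n")


raw = b"a\r\nb"

try:
    to_text(raw)
except IllegalCR as err:
    print("error:", err)

print(repr(to_text(deline(raw))))      # caller chose to drop the CRs
print(repr(to_text(raw, relax=True)))  # caller chose to keep the CRs
```

The point of the sketch is that neither outcome happens silently: the caller has to write either the delining step or the opt-in at the call site.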

TO TEXT! is going to be prescriptive in other ways. It's not going to allow embedded 0 bytes, because that creates risky interactions with old-style zero-terminated C strings. Allowing them would mean you cannot trust a simple extractor of a char* to give you all the relevant data--you would always need to worry about a length output parameter. I think that's an undesirable property for the string extractor for libRebol, and you should use a BINARY! in such cases, where you are forced to always get back a size in the API.
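The zero-byte hazard is easy to reproduce outside of Rebol. The following Python sketch uses the standard ctypes module to mimic a C-style extraction; it says nothing about how libRebol itself is implemented, and the variable names are only illustrative.

```python
import ctypes

payload = b"ab\x00cd"  # five bytes with an embedded zero

# create_string_buffer copies the bytes and appends a terminating zero.
buf = ctypes.create_string_buffer(payload)

# .value stops at the first zero byte, the way strlen() would in C.
as_c_string = buf.value

# Recovering everything requires carrying the length separately.
with_length = buf.raw[:len(payload)]

print(as_c_string)   # the data after the zero is silently lost
print(with_length)   # all five bytes, but only because we kept the size
```

Anything that hands back just a pointer behaves like the first extraction, which is why data that may contain zero bytes is better served by an interface that always reports a size.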

Q: How "Platform-Sensitive" Should It Be? (A: None?)

It seems like DELINE pretty much by default needs to accept files that either have CR LF in them or that do not. Because if you say:

do deline https://example.com/some-wonky-internet-file.reb

You are dealing with a file that's not on your computer, and we don't want to create a disincentive to the person hosting it cleaning it up. If DELINE rejected LF-only input, their cleanup would break your script, and they may not be able to edit your script.

But then we have to ask about things like whether ENLINE should default to being a no-op on Linux platforms, and only adding the CRs on Windows. Historical Rebol added it either way.

My hope is that people will really avoid using ENLINE and DELINE if at all possible, and get their files in order. But as tools they will be there for people who find themselves stuck and can't do that. I feel this is definitely a step in the right direction, and overall code cleanliness and performance will benefit from it.

I haven't gotten a lot of feedback on my concepts here, but I want to stress it is important. I've tried to be clear: I'm uncomfortable with changing data out from under people. It's the opposite of fighting complexity, it's making things unpredictable.

Here's Rebol2 on Windows behavior (which matches Red 0.6.4 on Windows):

rebol2>> write %foo.txt "a^M^/b"

rebol2>> read/binary %foo.txt
== #{610D0D0A62}  ; actually wrote ["a" CR CR LF "b"]

rebol2>> read %foo.txt
== "a^/^/b"

Here's R3-Alpha on Linux behavior:

r3-alpha>> write %foo.txt "a^M^/b"

r3-alpha>> read %foo.txt
== #{610D0A62}  ; wrote ["a" CR LF "b"]

r3-alpha>> read/string %foo.txt
== "a^/b"

The CR LF situation started out bad for people who are on their own, in languages that chose not to get involved. But this state of affairs in the Rebolverse adds up to being worse. The people whose files these are won't appreciate the laissez-faire way in which bytes are doubled or thrown out, etc. It only gets worse the weirder your usage is, because the code doing this processing is haphazard. Predicting its behavior on CR CR LF or LF CR CR or other edge cases is going to be nigh impossible.
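The doubled CR in the Rebol2 transcript is what you would get from a writer that expands every LF without looking at what precedes it. The Python sketch below reproduces those bytes. It is a guess at the mechanism rather than a reading of the Rebol2 source, and the function names naive_enline and strict_enline are hypothetical.

```python
def naive_enline(text: str) -> bytes:
    """Blindly expand every LF to CR LF, ignoring CRs already present."""
    return text.replace("\n", "\r\n").encode("utf-8")


def strict_enline(text: str) -> bytes:
    """Expand LF to CR LF, but refuse input that already contains a CR."""
    if "\r" in text:
        raise ValueError("input already contains CR; refusing to guess")
    return text.replace("\n", "\r\n").encode("utf-8")


# Same string as the transcripts: "a", CR, LF, "b"
written = naive_enline("a\r\nb")
print(written.hex().upper())  # 610D0D0A62

try:
    strict_enline("a\r\nb")
except ValueError as err:
    print("error:", err)

print(strict_enline("a\nb").hex().upper())  # 610D0A62
```

The strict variant gives up convenience on purpose: input that is already partly converted gets reported instead of being quietly made worse.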

People need to be source-level participants in us throwing data out (or in). This is what makes me favor words like DELINE and ENLINE that clearly indicate mutation, and encourage that people have to use them in order to get the mutations.

So expressions that read like TO TEXT! BIN have two options:

  1. Leave the CR LF situation as-is
  2. Raise an error and guide you to a preserving or mutating primitive.

That's not what happened in R3-Alpha, where TO STRING! acts like this:

r3-alpha>> to string! to binary! "a^M^/b"
== "a^/b"

In my thinking, to get that behavior you use DELINE on the BINARY!. Then, TO TEXT! would error saying "hey, you need to either use DELINE -or- TO-TEXT/RELAX depending on your intent".

Note the delining behavior of TO STRING! wasn't in Rebol2 (nor now in Red 0.6.4):

rebol2>> to string! to binary! "a^M^/b"
== "a^M^/b"

But that leads to an inconsistency of its own...where a casual programmer who writes to string! read/binary %foo.txt gets a different answer from someone who just says read %foo.txt and gets a string back.

I feel we're headed in a better direction with explicit DELINE

Let's try and lean really strongly to everyone weaning themselves off of CR LF. Those who can't are going to need to get involved at the source level.


Here are some new tests of AS TEXT! and TO TEXT! demonstrating the subtleties:

CR codepoints (^M) are illegal in TO-string conversion unless /RELAX is used. They are legal in AS-conversions unless /STRICT mode is used

    str: "a^M^/b"
    a-bin: as binary! str  comment {remembers it was utf-8, optimizes!}
    t-bin: to binary! str  comment {makes dissociated/unconstrained copy}

    ('illegal-cr = pick trap [to text! t-bin] 'id)
    ('illegal-cr = pick trap [to-text t-bin] 'id)
    (str = to-text/relax t-bin)

    ('illegal-cr = pick trap [to text! a-bin] 'id)
    ('illegal-cr = pick trap [to-text a-bin] 'id)
    (str = to-text/relax a-bin)

    (str = as text! t-bin)
    (str = as-text t-bin)
    ('illegal-cr = pick trap [as-text/strict t-bin] 'id)

    (str = as text! a-bin)
    (str = as-text a-bin)
    ('illegal-cr = pick trap [as-text/strict a-bin] 'id)

#{00} bytes are illegal in strings regardless of /RELAX or /STRICT

    ('illegal-zero-byte = pick trap [to text! #{00}] 'id)
    ('illegal-zero-byte = pick trap [to-text #{00}] 'id)
    ('illegal-zero-byte = pick trap [to-text/relax #{00}] 'id)

    ('illegal-zero-byte = pick trap [as text! #{00}] 'id)
    ('illegal-zero-byte = pick trap [as-text #{00}] 'id)
    ('illegal-zero-byte = pick trap [as-text/strict #{00}] 'id)

Sounds good to me.

How do I check whether a string, e.g. from a file/network, uses LF or CR LF? I know some really ancient programs are still running, and may need CR LF.

Checking would be the same as you would think:

bin: read %wherever.txt
if find bin cr [print "Has a CR"]

You can deline it after that. But note that DELINE is now strict; it's not the same thing as replace/all bin cr null...as it will error if the input file isn't strictly CR LF sequences. The goal is that ENLINE and DELINE work with valid formats and so if you have a Frankenstein file you'll find out.

(I should take this time to point out how cool it is that we actually have the null state to mean "replace with nothing")
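For readers who want to see the difference between a strict DELINE and a blunt "remove every CR", here is a Python sketch. It rests on one reading of the posts above, which is an assumption: input with no CR at all passes through unchanged, and once any CR is present every line break must be exactly CR LF. The name strict_deline is hypothetical and this is not the Ren-C implementation.

```python
def strict_deline(data: bytes) -> bytes:
    """Convert CR LF to LF, raising if the input is not uniformly formatted."""
    if b"\r" not in data:
        return data  # LF-only (or no line breaks at all): nothing to do

    # Remove every well-formed CR LF pair; whatever CR or LF is left over
    # was not part of a pair.
    leftover = data.replace(b"\r\n", b"")
    if b"\r" in leftover:
        raise ValueError("stray CR that is not followed by LF")
    if b"\n" in leftover:
        raise ValueError("bare LF mixed in with CR LF line endings")
    return data.replace(b"\r\n", b"\n")


print(strict_deline(b"a\r\nb\r\n"))  # well-formed CR LF input
print(strict_deline(b"a\nb\n"))      # already LF-only, returned as-is

frankenstein = b"a\r\r\nb"
print(frankenstein.replace(b"\r", b""))  # blunt removal hides the problem

try:
    strict_deline(frankenstein)
except ValueError as err:
    print("error:", err)
```

The blunt replacement turns the malformed input into something that looks clean, while the strict version reports it, which is the "you'll find out" behavior described above.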