An early idea that seems to have been in Rebol's "easy cross-platform vision" was the desire to simplify strings in the language to have a single codepoint to represent line breaks. This had been the standard for Unix machines for some time, and when Apple went to a unix basis for OS X they adopted it too.
... but to try and be a good citizen, Rebol didn't want to do this at the cost of bucking the trend on Windows...where files on disk had two-byte "CR LF" sequences. Despite most every programmer's editor being able to handle plain LF on Windows for decades, the rigid holdout of NOTEPAD.EXE would continue to make such files render with everything on one line.
(How did NOTEPAD.EXE become so powerful? It's a good question--read some modern opinions.)
The good instinct here was that a single codepoint is a reduced complexity situation. Whether it's a PARSE rule or any other code that does string manipulation, that single codepoint for the idea of a line break is way easier to look for and manage.
But it's not 1997 anymore (and it never should have been, at least in this respect). Having carriage returns in your files is an artifact of history. The untold man-hours lost by developers trying to appease NOTEPAD.EXE were wasted--and it's one of those cases where people should have been firm and standardized on LF. So while you're getting those old files upgraded to UTF-8 and out of whatever codepage 866 or formats with byte-order-marks you have... lose the CRs too.
The fact is that when the interpreter core attempts to be magical about this it becomes a mess. Low-level C stops being able to assume it can work with strings directly--with the actual bytes that are in them--to having to make copies or move memory around to remove the things you don't want to be there. You wind up entangled in questions of what to do if you see [CR CR LF] or [LF CR]. And when you start mutating the user's input behind the scenes without explicit say-so, the "magic" often leads to mysterious side effects and information loss.
New Answer: Strict Core, but Enhance DELINE
Historically DELINE took strings. But now I'm going to make it take BINARY!, FILE!, and URL!.
If you suspect that a file has CR LF sequences in it and you wish to be tolerant of this, then:
do deline %some-wonky-local-file.reb
Otherwise, the default behavior is that DO will error on CR. TO TEXT! of a BINARY! will consider CR to be codepoint-non-grata...you will need to use DELINE and it will now accept BINARY!.
TO TEXT! is going to be prescriptive in other ways. It's not going to allow embedded
0 bytes, because that creates risky interactions with old-style zero-terminated C strings. It means you cannot trust a simple extractor of a
char* as giving you all the relevant data--you always need to worry about a length output parameter. I think that's an undesirable property for the string extractor for libRebol, and you should use a BINARY! in such cases where you are forced to always get back a size in the API.
Q: How "Platform-Sensitive" Should It Be? (A: None?)
It seems like DELINE pretty much by default needs to accept files that either have CR LF in them or that do not. Because if you say:
do deline https://example.com/some-wonky-internet-file.reb
You are dealing with a file that's not on your computer, and we don't want to create a disincentive to the person hosting it cleaning it up. They may not be able to edit your script.
But then we have to ask about things like whether ENLINE should default to being a no-op on Linux platforms, and only adding the CRs on Windows. Historical Rebol added it either way.
My hope is that people will really avoid using ENLINE and DELINE at all possible, and get their files in order. But as tools they will be there for people who find themselves stuck and can't do that. I feel this is definitely a step in the right direction, and overall code cleanliness and performance will benefit from it.