How Would Stream PARSE Handle Positions?

hostilefork · October 6, 2021, 4:40pm

Generically speaking: when dealing with a streaming data source, you often don't know how many bytes or characters you want to read in advance. You're looking for some pattern in the input to delimit it.

(The simplest-yet-very-common example would be reading until a newline.)

Go has dedicated operations for reading up to and including one delimiter byte, returning the result either as a byte slice (ReadBytes(delim)) or as a string (ReadString(delim)). These are methods of the buffered IO abstraction, because without a buffering middleman you'd have to call a lower-level Read() just one byte at a time. If a Read() came back with a bigger chunk, an unbuffered reader would have nowhere to hold the extra data queued up after the newline.

Rigging up anything more detailed in Go is harder. So this is where I thought a streaming PARSE would offer an interesting answer for a lot of scenarios. Getting PARSE worked out correctly would save people the trouble of having to drive the progressive Read process themselves, just to get a more nuanced condition than "until a certain byte is seen".

But Streams Aren't Series...So How Would You Call PARSE?

When you parse a series, you don't "consume" it:

>> data: "aaa"

>> uparse data [some "a" (<Yay, some A!>)]
== <Yay, some A!>

>> data
== "aaa"  ; hasn't changed

And you can do partial processing and get a position via <here>:

>> data: "aaabbb"

>> uparse data [some "a" <here>]
== "bbb"  ; this is a "position" that points into `data`

>> data
== "aaabbb"  ; again, the unchanged input

However, streams don't have any position but "here". So how would <here> be any different from <input>?

Some Streams May Internally Know A Position, But Not All

In Go we saw an example of how streaming is an interface that something can offer, while having other methods depending on the data source. Those other methods can offer features like timeouts. Or something like a file could offer the ability to seek, so that the next read from the stream starts at an arbitrary position.

But that's all outside of the streaming interface. The stream itself is a black box. And the position is "inside"...all references to the same stream interface will be updated if you read from any reference.

With Rebol series, the position is "outside"...each instance has its own index. So when you NEXT a series, you have to save the result, or you will get the same thing again:

>> series: [a b c]

>> next series
== [b c]

>> next series
== [b c]

If streams worked this way, you'd have to constantly be saving the new stream value every time you read from it, as another return value of the READ process.

[data stream]: read/part stream 10

But if you did have to code like that, how would it react to a situation like this?

>> [data newstream]: read/part oldstream 10

>> read/part oldstream 10

The (presumably) buffered stream no longer has the data on hand. So it either preserves the data indefinitely or some of these calls would fail.
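To make the tradeoff concrete, here is a sketch of a value-style stream in Go, where the position travels with the value and every read hands back a new one. The cursor type and readPart method are hypothetical. Re-reading through the old value only works here because the whole input is kept in memory.

```go
package main

import "fmt"

// cursor is a hypothetical value-type stream: the position lives in the
// value, so reading returns a new cursor instead of mutating the old one.
type cursor struct {
	data []byte // the entire input, retained as long as any cursor exists
	pos  int
}

// readPart returns up to n bytes plus a new cursor positioned after them.
func (c cursor) readPart(n int) ([]byte, cursor) {
	end := c.pos + n
	if end > len(c.data) {
		end = len(c.data)
	}
	return c.data[c.pos:end], cursor{data: c.data, pos: end}
}

func main() {
	old := cursor{data: []byte("0123456789abcdefghij")}

	data, next := old.readPart(10)
	again, _ := old.readPart(10) // the old cursor still yields the same bytes
	more, _ := next.readPart(10)

	fmt.Printf("%s %s %s\n", data, again, more)
}
```

A real network stream can't promise this without buffering everything it has ever received, which is exactly the "preserve indefinitely or fail" choice.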

So Parsing Consumes Streams, But Not Series?

Right now, there's no way to leave a stream alone, because reading it consumes it.

The only way you'd be able to "consume" a series value (that is, advance the index of the input) would be to pass PARSE a variable holding the input, because the index of a series is stored directly in the value itself.

It seems unfortunate that something like a FILE! can know how to do random seeks, and not be able to save and restore positions in PARSE. But if it did, what would the type of <here> be? It would have to create a new stream instance into the same file...this would be like being able to say:

>> s2: clone stream  ; maybe file reads support this, but tcp reads don't?

>> read stream
== #{ABCD0102}

>> read s2
== #{ABCD0102}

It's probably bad for PARSE to be going this direction.

A Better Idea (?): Some Streams Offer <index>, Some Don't

It's already the case that SEEK will accept either an index number or a series position. So when you ask for <index> it could tell you the position in the file.

(Although I should mention that file seeking has historically always been offset-based, starting with zero. This was true in R3-Alpha and is also true in 1-based languages like Julia.)

This just rules out having <here> on a stream altogether, which basically stamps out the concept that there is such a thing as a "stream-at-position". You only deal with positions separate from streams...and only on the streams that happen to offer them.
