"Streaming" Survey from Other Languages

While PORT!s were ambiguous beasts, one objective we can see was that it should be possible to pipe them together.

This can be seen in the HTTP port. It was written to (supposedly) not care what kind of port it was reading from, so long as that port could supply a stream of bytes.

  • If you feed HTTP from a TCP port, you get plain old http

  • If you feed HTTP from a TLS port, you get https

    • The TLS port is, in turn, fed from TCP

I've written about how this was pretty convoluted...but it is approaching a less tangled state. So now we can look at it in a comparative light and redesign it.

We'd really hope that something about Redbol nature can make this more interesting than what more performance-oriented languages offer.

But if nothing unique can be offered, it should at least be no worse...otherwise it's both slower and worse (which is what it has historically been).

Comparisons follow...


Go

Go's streaming centers around io.Reader and io.Writer. (Here's a Tutorial, if you're interested...which you are not.)

There's more than one thing called "Reader", but io.Reader is an "interface" that consists simply of one method:

type Reader interface {
    Read(buf []byte) (n int, err error)
}

If you don't read Go syntax, that says Read has one input value (a buffer) and two outputs (a number of bytes read, and an error).

There is an implicit maximum length of the request, because Go slices carry a length along with them. (Making a new slice initializes it up to its length with 0.) So when you call the Read function, it can check len(buf) to know the maximum amount of data you want back. But n returns how much data was actually received.
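
To make that concrete, here's a minimal sketch of a type satisfying the interface (byteSource is an invented name for illustration; it behaves like a tiny bytes.Reader). It copies at most len(buf) bytes per call and reports how many it actually delivered:

package stream

import "io"

// byteSource is a hypothetical minimal Reader that serves bytes
// out of an in-memory slice.
type byteSource struct {
    data []byte
}

func (s *byteSource) Read(buf []byte) (n int, err error) {
    if len(s.data) == 0 {
        return 0, io.EOF // exhausted: signal end-of-file
    }
    n = copy(buf, s.data) // copy() transfers at most len(buf) bytes
    s.data = s.data[n:]   // advance past what was consumed
    return n, nil
}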

Here's the detailed contract:

Read reads up to len(p) bytes into p. It returns the number of bytes read (0 <= n <= len(p)) and any error encountered. Even if Read returns n < len(p), it may use all of p as scratch space during the call. If some data is available but not len(p) bytes, Read conventionally returns what is available instead of waiting for more.

When Read encounters an error or end-of-file condition after successfully reading n > 0 bytes, it returns the number of bytes read. It may return the (non-nil) error from the same call or return the error (and n == 0) from a subsequent call. An instance of this general case is that a Reader returning a non-zero number of bytes at the end of the input stream may return either err == EOF or err == nil. The next Read should return 0, EOF.

Callers should always process the n > 0 bytes returned before considering the error err. Doing so correctly handles I/O errors that happen after reading some bytes and also both of the allowed EOF behaviors.

Implementations of Read are discouraged from returning a zero byte count with a nil error, except when len(p) == 0. Callers should treat a return of 0 and nil as indicating that nothing happened; in particular it does not indicate EOF.

Implementations must not retain p.
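
The practical consequence of that contract is the shape of the canonical consumer loop: handle the n > 0 bytes first, then look at the error. A sketch (drain and the process callback are invented names):

package stream

import "io"

// drain shows the canonical consumer loop: the n > 0 bytes are
// handled *before* the error is examined, which covers both of the
// allowed EOF behaviors described above.
func drain(r io.Reader, process func([]byte)) error {
    buf := make([]byte, 4096)
    for {
        n, err := r.Read(buf)
        if n > 0 {
            process(buf[:n]) // data first...
        }
        if err == io.EOF {
            return nil // ...then the clean end-of-stream
        }
        if err != nil {
            return err // ...or a real failure
        }
    }
}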

Slightly Higher Level Calls Are Implemented with io Helper Functions

The low-level Read interface means that new data sources only have to implement that one function.

There's no parameterization of Read that says "I want you to read exactly n bytes in a blocking fashion; keep reading until you have it." Instead, you call standard utility functions. These functions take the thing implementing the Reader interface as a parameter, so they can call Read() as many times as necessary.

Right now I'll just point out these 3:

  • ReadFull - You still pass in a buffer to this function, and it keeps calling Read until the full buffer length is filled. It still gives back the number of bytes it read, but if that number isn't the size of the buffer, the error return will be set (EOF if nothing was read, ErrUnexpectedEOF if it was cut off partway). You don't get an EOF error signal if the returned n is the length of the buffer...only if it's cut off.

  • ReadAtLeast - Takes a minimum number of bytes to read as a parameter, in addition to the buffer.

  • ReadAll - Doesn't take a buffer to fill as a parameter at all. It returns a buffer sized to the data that was read, or an error. EOF doesn't come back as an error...since the contract was to read until the end of file.
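
Roughly how those three look in use, with strings.NewReader standing in for a real data source (a sketch, not tied to any particular port design):

package main

import (
    "fmt"
    "io"
    "strings"
)

func main() {
    // ReadFull: keep reading until the buffer is completely filled.
    buf := make([]byte, 5)
    n, err := io.ReadFull(strings.NewReader("hello world"), buf)
    fmt.Println(n, err, string(buf)) // 5 <nil> hello

    // ReadFull on a too-short source: partial data + ErrUnexpectedEOF.
    n, err = io.ReadFull(strings.NewReader("hi"), buf)
    fmt.Println(n, err) // 2 unexpected EOF

    // ReadAtLeast: at least 3 bytes into buf, possibly more.
    n, err = io.ReadAtLeast(strings.NewReader("hello"), buf, 3)
    fmt.Println(n, err) // 5 <nil>

    // ReadAll: no caller-supplied buffer; EOF isn't reported as an error.
    all, err := io.ReadAll(strings.NewReader("entire stream"))
    fmt.Println(string(all), err) // entire stream <nil>
}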

Much Higher Level Calls Are Implemented with bufio Helper Objects

The read functions talked about so far all speak in terms of the number of bytes requested to read. But quite often when dealing with streaming, you don't know how many bytes you want in advance. You're reading until the next newline (for instance). When you look at real-world examples like HTTP chunking, you see it reading the hex size of the chunk as ASCII digits, then a CR LF, and then a big blob of data of that size.

So here we enter the concept of buffered io, or Go's bufio. It has objects called bufio.Reader and bufio.Writer... not to be confused with the interfaces io.Reader and io.Writer.

(Naming-wise, they might have called the interfaces Readable and Writable or something of the sort to help with the confusion...and then you could attach Reader objects to Readable things, and Writer objects to writable things.)

In R3-Alpha, a port that you would perform READs on would act as a kind of buffered IO abstraction...each partial read adding to a BINARY! found in port.data. This binary could grow to an arbitrary size, and it wasn't clear when that buffer should be shrunk...or exactly what the concept of ownership was.

Go's buffered IO buffer never gets resized. If you do something like ReadBytes('\n')...which reads out of the buffered IO until a '\n' is seen...it makes a copy of the buffer's contents each time the buffer fills up and adds that to a collection; the overall ReadBytes operation then fuses the pieces together into one big array to return. The default buffer size is 4096 bytes (though you can override that), and over the lifetime of the Reader or Writer the size does not change.
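
As a sketch of the delimiter-then-blob pattern this enables (a simplified HTTP-chunk-like format; readChunk is an invented name, and real chunked encoding has trailing CR LFs, extensions, and trailers that this ignores):

package main

import (
    "bufio"
    "fmt"
    "io"
    "strconv"
    "strings"
)

// readChunk reads a hex size line ending in CR LF, then a blob of
// exactly that many bytes.
func readChunk(br *bufio.Reader) ([]byte, error) {
    line, err := br.ReadString('\n') // the delimiter is included
    if err != nil {
        return nil, err
    }
    size, err := strconv.ParseUint(strings.TrimRight(line, "\r\n"), 16, 64)
    if err != nil {
        return nil, err
    }
    blob := make([]byte, size)
    _, err = io.ReadFull(br, blob) // bufio.Reader is itself an io.Reader
    return blob, err
}

func main() {
    br := bufio.NewReader(strings.NewReader("b\r\nhello there"))
    blob, err := readChunk(br)
    fmt.Printf("%q %v\n", blob, err) // "hello there" <nil>
}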

bufio.Reader Doesn't Have General Unget or Unread

I've mentioned how some streams are able to "unread" or "unget" to push back data into the buffer...and that this is connected to backtracking in stream parsing.

Because Go's bufio buffer is finite size, you definitely can't just keep pushing an arbitrary amount of data back into it.

They do offer the ability to unread one byte or one UTF-8 character codepoint (which they call a "rune"...a term Go inherited from Plan 9, and which doesn't seem to have caught on elsewhere).

The ability to unread one thing would seem to exist because their reading routines that get data up to a delimiter include the delimiter. Rather than making those APIs take a flag for whether or not to include the delimiter with the data read, they always include it. (So if you do ReadBytes('\n'), the binary blob you get back has the '\n' in it.)
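
That one-step pushback is enough for a peek-at-the-next-byte pattern; a sketch:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    br := bufio.NewReader(strings.NewReader("#comment line"))

    b, _ := br.ReadByte()
    if b != '#' {
        br.UnreadByte() // not a comment marker: give the byte back
    }
    rest, _ := br.ReadString('\n') // hits EOF; returns what it has
    fmt.Println(string(b), rest)   // # comment line
}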

Timeouts Are Beyond The Scope of the io.Reader Interface

When your "Reader" interface contains just one Read method, that's a bit of a clue that a lot of concerns aren't going to be covered by that.

  • What if the data source just hangs and doesn't return any data... for like, a week?

  • What if the data source has some data to return real-soon-now, but the user hits Ctrl-C? How does this cause an interrupt?

You can find people asking about this, and read confusing answers.

What's needed is some way for an outside force to cause the Read() call to generate an error.

There are plenty of Rube Goldberg concepts you might imagine. For instance: if the Reader is for a file, you could set a timer on another thread of execution (another goroutine) and close the file after the timer elapses. This will unblock the Read() and give you a "file was closed" error that you could choose to interpret as a timeout.
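
A sketch of that file-plus-timer hack (whether closing actually unblocks a pending Read depends on the platform and on what kind of file it is, so treat this as illustrative only):

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    f, err := os.Open("/dev/stdin") // something that can block on Read
    if err != nil {
        panic(err)
    }

    // Close the file from another goroutine after 5 seconds; a blocked
    // Read then fails with a "file already closed" style error that the
    // caller chooses to interpret as a timeout.
    timer := time.AfterFunc(5*time.Second, func() { f.Close() })
    defer timer.Stop()

    buf := make([]byte, 1024)
    n, err := f.Read(buf)
    fmt.Println(n, err)
}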

Other data sources might offer you more of a legitimate way of doing this. TCP connections give you SetReadDeadline.
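
For instance (a sketch against a host that accepts the connection but is sent no request, so the Read times out):

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Any Read still blocked 2 seconds from now fails with a timeout
    // error instead of hanging forever.
    conn.SetReadDeadline(time.Now().Add(2 * time.Second))

    buf := make([]byte, 4096)
    n, err := conn.Read(buf) // no request was sent, so nothing comes back
    fmt.Println(n, err)      // 0 ... i/o timeout
}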

But if you attach a compressor Reader on top of a make-random-numbers Reader...and you ask for a large amount of random data and then change your mind...there's no cross-cutting way to cancel or time out.
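
One pattern people reach for (sketched here with the invented name ctxReader) is wrapping a Reader so each call checks a context first. Note the limitation: it only notices cancellation between Reads, not during one that is already blocked:

package stream

import (
    "context"
    "io"
)

// ctxReader is a hypothetical wrapper that checks a context before
// delegating each Read. It cannot interrupt a Read that is already
// blocked inside the underlying Reader.
type ctxReader struct {
    ctx context.Context
    r   io.Reader
}

func (c *ctxReader) Read(buf []byte) (int, error) {
    if err := c.ctx.Err(); err != nil {
        return 0, err // context canceled or deadline exceeded
    }
    return c.r.Read(buf)
}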

For a Go-oriented insight into the many complexities of managing timeouts, deadlines, and cancellation, here is an article:

Make resilient Go net/http servers using timeouts, deadlines and context cancellation | Ilija Eftimov ⚡️

That's a lot to think about. :frowning:

Summary And Thoughts

There's no particular magic here to speak of. Just a good design choice in making "Read" a single simple method, and pinning down its semantics. Making a new data source is easy, and this means one piece of code for parsing out lines can be used for all of them.

R3-Alpha didn't make this separation. I've pointed to the bad situation of READ/LINES and how any given port implementation could just forget to look at the /LINES refinement. Correctly streaming with /LINES shouldn't require reading the entirety of a multi-megabyte file into memory and then splitting that giant blob into a block of lines. And you shouldn't be able to "forget" to heed the /LINES refinement.

But you also run into problems even with READ/PART. After my synchronous change, I thought it would be possible to ask for a known amount of data in the HTTP protocol when the transfer was not chunked...the header tells you the expected size, so you could READ/PART that size (similar to Go's ReadFull). Yet I realized that just because the native TCP connection implemented /PART didn't mean the TLS connection did...so it only worked when using HTTP and not HTTPS.

The confusing timeouts/cancellation situation in Go may be an opportunity to design a better solution for casual users. (At least something like hooking Read() so that each call to it also tests for a timeout, so you can catch signals between reads?) But doing anything meaningful here would require truly understanding the needs of relevant scenarios. I don't have that understanding...I'm no expert in timeouts, and I can assure you that it's more or less voodoo to get Ctrl-C or Escape to work in almost any system I've seen.

Go doesn't really have the ability to put things back into a stream. Seeing a stream as something you can push an arbitrary amount of data back into feels like the best abstraction to build stream parsing on top of. But if you want to take advantage of the fact that the data source may be able to give you that data back itself, it would mean Unread() would have to be part of the Reader interface...with a default fallback of saying Unread() isn't available, so the buffering layer has to remember the data. (Think about how you can "put back" information from a file you've read just by moving the file pointer backwards...because the disk is already where you're paying for storage of that file.)
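
If one were sketching that in Go terms (purely hypothetical, not anything the standard library offers), an optional Unread could work the way other optional interfaces do, via type assertion:

package stream

import (
    "errors"
    "io"
)

// Unreader is a hypothetical extension interface: sources that can
// cheaply take data back (e.g. a file that just moves its pointer
// backwards) implement it; others don't.
type Unreader interface {
    Unread(data []byte) error
}

var errNoUnread = errors.New("source can't take data back; buffering layer must remember it")

// pushBack hands data back to the source if it supports that natively;
// otherwise it reports that the caller has to do the buffering.
func pushBack(r io.Reader, data []byte) error {
    if u, ok := r.(Unreader); ok {
        return u.Unread(data)
    }
    return errNoUnread
}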
