Semantics of READ and TCP Streams: Past And Future

First things first: If you want to work on network protocols at all--in the Redbolverse or elsewhere--please heed this warning!

:biohazard: There's a really big potential misunderstanding about TCP when you offer someone a plain READ operation. :biohazard:

You may get the impression that the other side of the connection is sending "messages" with specific lengths. This is not the case!

Let's imagine you could say:

>> read tcp-port
== #{1FFEC02A}  ; 4 bytes

>> read tcp-port
== #{E0C1}  ; 2 bytes

Just because you got two chunks of information from that connection does not mean that there were 2 sends from the other side. The other side might have sent 100 bytes and this is just how you are getting the first 6. Or it could have done 6 individual 1 byte writes, along with any number of 0 byte writes (which are legal).

This means all understanding of the data you receive has to be in terms of a protocol. The only number that matters is how many bytes the protocol tells you to expect...never the length of whatever chunk a particular READ happens to hand you.
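To make that concrete, here is a minimal sketch in Python (not Rebol) of one common protocol convention: every message is preceded by a 4-byte big-endian length. The framing scheme and the names `recv_exact` and `recv_message` are assumptions made for illustration, not anything from R3-Alpha or Ren-C. A local `socket.socketpair()` stands in for a TCP connection, since it has the same byte-stream behavior.

```python
import socket
import struct

def recv_exact(sock, count):
    """Keep calling recv() until exactly `count` bytes have arrived."""
    parts = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)  # may return fewer bytes than asked for
        if not chunk:
            raise ConnectionError("peer closed before the message was complete")
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)

def recv_message(sock):
    """Read one message framed as: 4-byte big-endian length, then payload."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo: the sender's write boundaries are deliberately unrelated to the
# message boundaries.
sender, receiver = socket.socketpair()
wire = (struct.pack(">I", 4) + bytes.fromhex("1FFEC02A")
        + struct.pack(">I", 2) + bytes.fromhex("E0C1"))
sender.sendall(wire[:3])    # part of the first length header
sender.sendall(wire[3:9])   # rest of header, payload, start of next header
sender.sendall(wire[9:])    # everything else
sender.close()

first = recv_message(receiver)
second = recv_message(receiver)
receiver.close()
```

The receiver recovers the two messages regardless of how the three writes were split, because it only ever trusts the lengths the protocol gives it.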

Note: Today's READ On TCP:// Doesn't Work Like That

Rebol hasn't historically let you READ a TCP connection in the way shown above.

( Though honestly, the code for everything would have been more understandable if you could have just done that! All the convoluted WAITs and AWAKE handlers were in service of some asynchronous nirvana that never materialized. So what ended up emerging was the most convoluted and poorly engineered way to write ultimately synchronous protocols the world has likely ever seen. :angry: )

What happens instead is that READ basically just makes a request, and returns nothing. If you want anything back, you have to poke an AWAKE handler function onto the port to receive the data. If you WAIT on the port...then eventually during the course of that wait you should get an EVENT! passed to that AWAKE handler.

The only thing that event holds is the word "READ", so the place you need to look for the data would be in the port's DATA member. That data member would just get bigger with each READ. So it was the responsibility of the user of the port to clear that data out, or it would just accrue indefinitely.

If a READ didn't come back with the number of bytes you wanted, you'd have to call READ again...and then return FALSE from your port's AWAKE handler to say you were not done yet. Returning TRUE from the port's AWAKE handler would indicate that enough progress had been made that something which was performing a WAIT on that port should be unblocked from the WAIT.
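For readers who never used it, the control flow described above can be imitated with a toy model. This is a Python simulation under loud assumptions: `ToyPort`, its `wait` loop and the `awake` function are hypothetical stand-ins written for this post, not the R3-Alpha implementation. It only mirrors the described behavior, where READ files a request, data piles up in the port, and the handler's TRUE/FALSE result decides whether the WAIT unblocks.

```python
class ToyPort:
    """Hypothetical stand-in for an R3-Alpha style port (not a real API)."""

    def __init__(self, incoming_chunks, awake):
        self.data = bytearray()            # grows with every completed READ
        self.awake = awake                 # one handler for all events
        self._incoming = list(incoming_chunks)
        self._pending_read = False

    def read(self):
        """Only files a request; returns nothing."""
        self._pending_read = True

    def wait(self):
        """Pump events until the AWAKE handler returns True."""
        while self._pending_read and self._incoming:
            self._pending_read = False
            self.data += self._incoming.pop(0)
            if self.awake(self, "read"):   # the event carries only a word
                return True
        return False  # ran out of events before the handler was satisfied

WANTED = 6
result = []

def awake(port, event):
    if event == "read":
        if len(port.data) < WANTED:
            port.read()                    # not enough yet: ask again...
            return False                   # ...and keep the WAIT blocked
        result.append(bytes(port.data[:WANTED]))
        del port.data[:WANTED]             # the user must clear the buffer
        return True                        # unblock the WAIT
    return False

port = ToyPort([b"\x1f\xfe\xc0\x2a", b"\xe0\xc1"], awake)
port.read()
done = port.wait()
```

Even in this tiny form, the logic for "read 6 bytes" ends up smeared across a request, a handler and shared state on the port.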

Can We Improve This? (rhetorical question)

If we were only judging R3-Alpha PORT! vs. the low-level unix recv() function, we might say that it's an improvement. It doesn't concern you with the dynamic memory management of having to pass small fixed size buffers to the recv() and stitch them together into a large blob. The BINARY! grows on its own, and you take what you want from it.

The truth is somewhat more complex. The BINARY! lives in the port, and the semantics of how you interact with it have traditionally not been clear. What if you increment the index, e.g. with a SKIP of 100?

The next time data is added to the buffer, should the contents slide forward to reuse the unused space at the head? Or should those 100 bytes at the beginning be preserved indefinitely?

The way most languages would resolve such questions about "buffered IO" would be to narrow the interface, through something like Go's bufio.Reader. You are given specific APIs to "peek" at the data without removing it. Or if you do remove it, then whatever remains is always slid forward to the front of the buffer.
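As a rough illustration of what "narrowing the interface" means, here is a small Python sketch. The `StreamBuffer` class and its `feed`/`peek`/`take` methods are hypothetical names chosen for this example. The point is only that the caller never touches an index, so the question of what happens to skipped bytes cannot come up.

```python
class StreamBuffer:
    """Hypothetical narrow interface over a growing byte buffer."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Append newly received bytes at the tail."""
        self._buf += chunk

    def peek(self, count):
        """Return up to `count` bytes from the head without removing them."""
        return bytes(self._buf[:count])

    def take(self, count):
        """Remove and return up to `count` bytes; the rest slides forward."""
        taken = bytes(self._buf[:count])
        del self._buf[:count]
        return taken

    def __len__(self):
        return len(self._buf)

buf = StreamBuffer()
buf.feed(bytes.fromhex("1FFEC02A"))
buf.feed(bytes.fromhex("E0C1"))
peeked = buf.peek(2)     # look, but leave the buffer alone
taken = buf.take(4)      # consume the first four bytes
rest = buf.peek(10)      # only what is left comes back
```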

Attacking Asynchronousness In A Modern Fashion

Empirically people must have noticed that R3-Alpha never delivered the goods on its promised asynchronousness. I've pointed out some of the reasons, and how "WAIT" has lacked a semantic definition of what it is you're "waiting for" on ports.

By building on libuv, the nuts and bolts at the systems/C level of being able to make requests and then get a callback are now available...with reasonable handling of errors. But it would be a big mistake to expose that mechanism by replacing the PORT!/AWAKE mire with some kind of READ/CALLBACK situation where you pass a function that gets called with either the data you asked for or an error. No one wants to code like that...which is why Node.js callbacks are all being replaced by async/await patterns.

My feeling is that if you want to disentangle users of a scripting-class language from the problems that come with threads and mutexes and the like, there have emerged modern answers. And Go is one of the better examples of this:

"Do not communicate by sharing memory. Share memory by communicating."

Code would get clearer if we rolled it back to where you write things as if they are synchronous. And often that's probably going to be fine for people. But if it's not, then you use channels to split off what you are doing.
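Here is a small sketch of that style, using Python's asyncio as a stand-in because it has cooperative tasks and queues out of the box. This is an analogy, not a proposal for Ren-C syntax: `asyncio.Queue` plays the role of a Go channel, and the function names are made up for the example. Each task reads as straight-line code, and the only thing they share is the channel.

```python
import asyncio

async def producer(channel):
    """Written as straight-line code; each await is a possible switch point."""
    for chunk in (b"\x1f\xfe\xc0\x2a", b"\xe0\xc1"):
        await asyncio.sleep(0)       # stand-in for waiting on the network
        await channel.put(chunk)
    await channel.put(None)          # sentinel: no more data

async def consumer(channel):
    """Collect chunks until the sentinel arrives."""
    received = bytearray()
    while True:
        chunk = await channel.get()
        if chunk is None:
            return bytes(received)
        received += chunk

async def main():
    channel = asyncio.Queue()        # plays the role of a Go channel here
    producer_task = asyncio.create_task(producer(channel))
    data = await consumer(channel)
    await producer_task
    return data

data = asyncio.run(main())
```

Neither function contains a callback or a lock; the scheduler interleaves them at the await points.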

So I think the likely right answer is just to push forward on stacklessness as the basis for green threading, used to implement asynchronousness as it is needed.

This would mean that I think all the asynchronous port stuff that exists so far should just be scrapped. #andnothingofvaluewaslost

So... READ on a TCP PORT! Should Give Back BINARY!, then?

Okay we're back to this:

>> read tcp-port
== #{1FFEC02A}  ; 4 bytes

>> read tcp-port
== #{E0C1}  ; 2 bytes

As I just said, synchronous reading like this is more in line with how we express ourselves...and we get asynchronousness by virtue of some scheduler that can rearrange things as a master of stack-time-and-stack-space. (I will point out that pretty well developed experiments with that scheduler have been reached in the past, and can be reached again...with more insight now.)

But I think we like the concept of READ defaulting to "give me all you've got until the EOF".

Multi-returns can help us here. Remember that a function knows how many returns you requested, so it can selectively invoke a behavior when you do so.

>> [data eof]: read tcp-port  ; asking for EOF means don't force read to EOF
== #{1FFEC02A}

>> eof
== false

>> [data @]: read tcp-port  ; remember "circling" and other neat tricks..
== #[true]  ; asked for eof to be the main return result

>> data
== #{E0C1}

-- or --

>> read tcp-port  ; don't ask for EOF means read until EOF
== #{1FFEC02AE0C1}

This feels much more solid.
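The two behaviors can be sketched in Python for comparison. Python functions cannot see how many results the caller asked for, so an explicit `want_eof` flag stands in for "the caller requested the EOF return"; that flag, the function name `read`, and the choice to report EOF only when `recv()` comes back empty are all assumptions of this sketch, not the proposed Ren-C semantics.

```python
import socket

def read(sock, want_eof=False, chunk_size=4096):
    """Hypothetical READ: the flag stands in for 'caller asked for EOF'."""
    if want_eof:
        chunk = sock.recv(chunk_size)    # one chunk, whatever has arrived
        return chunk, chunk == b""       # empty recv() means the peer closed
    parts = []
    while True:                          # default: read until EOF
        chunk = sock.recv(chunk_size)
        if not chunk:
            return b"".join(parts)
        parts.append(chunk)

sender, receiver = socket.socketpair()
sender.sendall(bytes.fromhex("1FFEC02A"))
chunk, eof = read(receiver, want_eof=True)     # some bytes, not at EOF
sender.sendall(bytes.fromhex("E0C1"))
sender.close()
rest = read(receiver)                          # default: drain until EOF
tail, at_end = read(receiver, want_eof=True)   # nothing left: EOF reported
receiver.close()
```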

Weird Concept Idea: Buffer As BLOCK! ?

Cleansing ourselves of the dead-end of R3-Alpha's asynchronous plan, there are some areas we might look to to play to language strengths.

I've mentioned the importance of being able to "push things back" into the buffer after having read them...and that it's likely the best way of doing PARSE on streams.

So I began to wonder if the thing that ports would accrue might be a BLOCK! instead of a BINARY!?

Imagine a TCP port would be feeding in little blobs of BINARY! at the tail. But when you got the chance to process it, you could make the decision to fold that into some kind of structure. Then you could emit this higher level processed structure to something that listens down the "pipe".

These are the kinds of novel directions I'd like to see...where we can do streamed block PARSE on a PORT! that feeds arbitrary values, which were decoded from binary, which was decompressed by a streaming codec on top of a TLS decoder...

So this might lead to some weird stuff. Like if you start asking to look into the buffer you'd see that it's a block and see the blobs it plans on giving you in the next READs:

>> peek/part tcp-port 2
== [#{1FFEC02A} #{E0C1}]
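To give the idea some shape, here is a speculative Python sketch of a buffer that holds a sequence of values instead of raw bytes. Everything in it is hypothetical: the `BlockBuffer` class, its method names, and the choice to "fold" a blob into an integer are just placeholders for whatever higher-level structure a real decoder would produce.

```python
from collections import deque

class BlockBuffer:
    """Hypothetical port buffer holding a sequence of values, not raw bytes."""

    def __init__(self):
        self._items = deque()

    def feed(self, blob):
        """The transport appends each received blob at the tail."""
        self._items.append(blob)

    def peek(self, count):
        """Show the next `count` items without consuming them."""
        return list(self._items)[:count]

    def read(self):
        """Consume and return the item at the head."""
        return self._items.popleft()

    def push_back(self, item):
        """Return an item (possibly a processed one) to the head."""
        self._items.appendleft(item)

buf = BlockBuffer()
buf.feed(bytes.fromhex("1FFEC02A"))
buf.feed(bytes.fromhex("E0C1"))
preview = buf.peek(2)

# Fold the first blob into a higher-level value and push it back, so a
# consumer further down the pipe sees structure instead of bytes.
blob = buf.read()
buf.push_back(int.from_bytes(blob, "big"))
after = buf.peek(2)
```

After the fold, the head of the buffer is a decoded value while the tail is still an unprocessed blob, which is the mixed state a streaming PARSE would have to be comfortable with.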

Anyway, long post...but I feel slightly optimistic that it points toward some of how to dig out of the R3-Alpha port debacle.

In one of those bittersweet feelings of catharsis... all the AWAKE handlers are now gone.

That means no more wrote and no more on-wake-up or sync-op...or the other nightmares that were easy to get out of whack.

Because the code has historically been so brittle I was worried that I'd end up spending weeks dealing with the fallout of the change. But it took just about a day...and every change I've made has just made the HTTP and TLS code more accessible. Which was a relief.

What Were AWAKE Handlers?

AWAKE handlers were a bizarre twist on a fairly mundane idea of callbacks. Callbacks are what you get stuck with in single-threaded codebases when you want to do something asynchronous, as in a very basic JavaScript timer:

setTimeout(function () {
    console.log("10 seconds later...");
}, 10000);

Due to being single-threaded, you won't get that message unless you yield to the event loop by returning out of all the stacks that are executing. Rebol didn't have the option of returning out of all the stacks (because that would just end the program). So instead the event pump would run when you called an explicit WAIT function.

But rather than passing in arbitrary free functions like JavaScript with each request, every asynchronous request had to be connected to a PORT!. There was one callback function that served all requests of all types on that port. It was a member of that port called AWAKE.

In explaining the EVENT! datatype, I pointed out the only information you get in the callback is implicitly the port, and a word (unless it's a GUI event, in which case you get some coordinates and which GOB! was clicked).

So this pretty much means you have to assume that whatever notification you're getting is for the last request of that type that was made on that port. Practically this means you can only have one request in flight at a time. And the event notification itself doesn't carry any data; the only place you can look for data is in the port itself.

It Was A Dead End to Asynchronous Programming

Whether you could read the R3-Alpha code well enough to critique it or not, you can certainly judge by the results:

Little was achieved in Rebol codebases on top of these foundations that wasn't essentially synchronous. It just got harder and more confusing to do synchronous things.

I can't speak too much to how it interacted with the GUI, other than to know by hearsay that it was painful.

In the core build, the only thing that wound up being asynchronous was getting timeouts. And generally the only reaction to timeouts was to error. You can work that into synchronous methodology pretty well.
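As an illustration of how little machinery that needs, here is a Python sketch where a timeout on a blocking read simply becomes an error the caller can trap. `read_or_fail` is a made-up helper name; `settimeout` and `socket.timeout` are standard library features. The peer in the demo never sends anything, so the read gives up after the deadline.

```python
import socket

def read_or_fail(sock, count, seconds):
    """Blocking read that turns a timeout into an ordinary exception."""
    sock.settimeout(seconds)
    try:
        return sock.recv(count)
    except socket.timeout:
        raise TimeoutError("no data within %.2f seconds" % seconds) from None

quiet_peer, receiver = socket.socketpair()   # the peer never sends anything
try:
    read_or_fail(receiver, 4, 0.05)
    outcome = "data"
except TimeoutError:
    outcome = "timed out"
finally:
    quiet_peer.close()
    receiver.close()
```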

Putting the Design In Context

If you're convinced that each recursion of a Rebol stack needs to line up with a recursion of the C stack, then your options for how to do asynchronous programming are limited.

They are especially limited if you want to be able to run asynchronously on systems without linking to a threading library. Threads let you do other stuff when a blocking READ operation happens. Without them, you're going to need callbacks and a message pump...it's very Windows 3.1.

People coming from an embedded background can be resistant to threads, because they are thinking of what it takes to run on bare metal... no operating system, so no threads. And on operating systems with threads they have a pretty high cost; threads need their own stacks. So you might be averse to them even then.

When you look at what happened with JavaScript moving from callbacks to ASYNC/AWAIT, it's clear that the direction people are moving in is to express themselves synchronously and make the ability to interrupt the code come from the language substrate itself. For those who stress over the cost of OS threading, doing "green threads" (as in Go) offers an answer...and I've made arguments for why the "stacklessness" it would take to implement that is critical for other reasons.