Parsing Giant Streams Without Consuming Tons of Memory: How?

I've mentioned that before I go through optimizing UPARSE I wanted to make it good at one thing that's been a bit of a pet wish of mine...

...and that's to be able to PARSE a giant file or network stream without needing to read it all into memory at once.

There are two levels of goal here:

  1. Big Crazy Goal: To have something like a long network stream of data (that you can't seek backwards in) and be able to have records in it parsed discretely one at a time. Even if the stream is 100GB in size, you'd only use a fixed amount of memory to do your processing.

  2. More Modest Goal: To let the PARSE be woven in with the read process, so it can start processing and reacting without waiting for all the data to be read...even if it ultimately isn't able to avoid reading the whole stream of data.

Getting 2 to work is the easier of these. -But- let me be clear that given the lack of existence of "streams" in historical Rebol, it by no means easy!

1 is the more tricky and interesting one to my tastes, so I'll start by talking about that.

If Combinators Inform Us, Then (1) Seems Tractable :tractor:

Let's say we're trying to parse lines with just the letters A and B, and count them as we go:

p: open %giant-file.txt

count: 0

parse p [while [
   some ["A" | "B"] newline
   (count: count + 1)
then [
    print ["Your giant file had" count "lines of ABs"]
else [
    print ["Giant file wasn't just lines of ABs"]

Our intuition tells us that we can can do this one line at a time, throwing it out as you go. But how might PARSE know that?

It builds a WHILE combinator and can see that's the last thing in the rule block. Assuming the user doesn't capture any positions with <here> or do any SEEK, there is no way that the WHILE will ever backtrack from the point where it started. Each iteration can throw out the ability to backtrack.

But right now WHILE is a black box; doing whatever it wants until it finally returns a result. From PARSE's perspective there's nothing from the outside that differentiates it from something called WHILE-HALF that will repeat a rule some number of times, and then jump back in time to the halfway point of the match:

>> parse "abababab" [while-half "ab", return <here>]
== "abab"

>> parse "abababababab" [while-half "ab", return <here>]
== "ababab"

Without some additional information, the system doesn't know that WHILE won't make a decision like WHILE-HALF would. It has to let it run until it is finished.

How Can Combinators Tell PARSE About Backtrack Needs?

One way of looking at this is that the combinator itself becomes responsible for storing any memory that it requires for backtracking.

That is to say that it pulls information out of the stream...and if it wants to backtrack it pushes it back in.

>> parse "aaab" [while ["a"] "b"]    
  • WHILE combinator grabs an "a" from stream, matches the "a"
  • WHILE combinator grabs an "a" from stream, matches the "a"
  • WHILE combinator grabs an "a" from stream, matches the "a"
  • WHILE combinator grabs a "b" from stream, doesn't like it, pushes it back and ends
  • TEXT! combinator grabs a "b" from the stream, matches the "b"

If the WHILE becomes responsible for pushing back any input it doesn't like, then the stream can just discard everything as it goes (in cases where it doesn't see any potential for some rule down the line to request backtrack). This means offering some kind of "push back into stream" operator that combinators can use if they need to back out.

This concept of putting back the character is actually how many things like this work.

In C++, unget() requires you give what you read in. By doing so, then if the data is no longer in a buffer and you're reading from a doesn't need to do anything but push its file offset backwards. Haskell's unRead and C++ putback() let you push back something different than what you read...and considers that a feature (we'll assume it does a similar optimization to unget() if you were reading from a file and it noticed what you pushed back was the same as the data in the buffer?)

"Going Unit-By-Unit Sounds Laborious, and Slow...?"

It may seem laborious on the surface, but as far as I can tell this is the way streams work.

I was just working on an implementation of READ-LINE for standard input. And all the prescribed methods of reading one line at a time from a file in C would go one character at a time. That sounds like a lot of I/O requests, but the thing is that basically all I/O systems have buffering in them...if you ask to read a character from a file, it isn't going to your hard drive or making a network request for that one character. It buffers it--and if you try to buffer it yourself you're likely just going to be adding complexity/code/memory and making things worse.

Unfortunately libuv() streams don't have any putback() or ungetc() ability. There's no going back in time with them. :-/

And as it turns out Boost.ASIO doesn't have it either. (Which surprises me.)

This means if we were building combinators on top of an ungetc()-type logic...and want to go back in time to read a file and not have it fully in memory...we'd have to be using the raw file API if we wanted to keep sync'd to the data that's already on disk and be able to use it instead of keeping the full buffer contents.

That's a bit depressing. But if there's any good news, it's that Rebol datatypes are optimized specifically for "unget". If the buffers are BLOCK!s or BINARY!s or TEXT!s then when you "NEXT" them the data is still there, and you just BACK it to do an ungetc.

Plus, we'd have to have our own layer for managing this if we were going to seek back in time using HTTP Range Requests on networks.

I guess I'll just experiment and see what I can work out. :-/


An Additional Note On the Practicality Of Discarding Input

I painted a picture above of how WHILE could become more participatory in helping PARSE avoid being responsible for keeping around all the past input in a parse stream for all time.

The rule I was examining looked like this:

parse p [while [
    some ["A" | "B"] newline
    (count: count + 1)

What if your rule was instead:

parse p [
    while [
        some ["A" | "B"] newline
        (count: count + 1)

The WHILE may know it's never going back after a match it makes, for each line. And it may effectively communicate that through its input consumption (with the promise it will be responsible for giving back any input it decides it does not like).

But since there's an alternate rule coming up, the PARSE can't use that knowledge to let the information go that the WHILE claims to be done with (or responsible for giving back). Because the "some other rule" could run if the WHILE ultimately doesn't match...requiring backtrack all the way to the beginning state before the WHILE.

This points out that | in a BLOCK! rule is basically a SEEK operation. It's a "seek the position before the last alternate grouping" rule. If the block rule is at a position and wants to know if it can throw out data it's streamed before the current point, it only needs to look forward and see if there are any | in the block ahead of it, and if so... it can't throw out the data.

But coming to the rescue is ||...which I have called the inline sequencing operator. If the BLOCK! combinator looks ahead and sees one of those, then it knows any | after it aren't relevant to whether it can throw out the past. The current rule will have to complete on the input it is given or things will fail.

All this may sound too tricky to be worth it, but this is a topic that interests me...and I have to stay interested, right? :slight_smile: