Updated Parse Machine

I've an updated version of my Parse Machine (currently for R3C) that binds the whole operation to the SYSTEM/CODECS object.

Could use more thought about how it works, but I thought I'd share in its current iteration as I'm finding it increasingly useful.

Features

  • It somewhat standardizes a location for grammar rules
  • It somewhat standardizes the FSM parsing method for any codecs written therein
  • The above points together conceptually would permit mixing and matching grammars (e.g. CSS within Markup) as well as using fragments, though I'd say there's likely more to figure out before that'd all be practical.
  • It somewhat separates grammar rules from the application logic that processes them, thus the same rules can be used for consuming whole files or on a token-by-token streaming basis
  • Leans also to codecs rooted in the spirit of be liberal in what you accept.

I've used somewhat above to underscore that while this idea is opinionated about the conventions used, it does not dictate anything. There's still undiscovered patterns in there

Limitations

  • For efficiency, the machine only exists as a single module and can only institute one state value at a time, thus decoders built atop it either can't be called recursively or have to manage recursion explicitly

  • The guts of the machine is a Parse WHILE loop which has to look up the current rule each iteration

  • The state value is presumed to be a MAP! chosen for its less ornery handling of keys, not for its suitability/performance

  • These two points would indicate that this approach could be optimized if it proves as useful as I believe it might

  • The codec installer (PARSER/NEW) only has params for name, file extensions, options and grammar rules. It doesn't include encode/decode/identify functions. This could change to something like:

    parser/new 'name [
        suffixes: [...]
        identify?: func [...] [...]
        rules: [rule-1 [...] [...]]
        encode: func [...] [...]
        decode: func [...] [...]
    ]
    

    Although it's not certain that a codec may have it's own set of rules (e.g. 'markup can reuse 'html rules)

Example

This example is a little contrived, but demonstrates how to install and deploy a grammar, and how flow can both be embedded in the grammar or in functions built around the Parse Machine API:

Rebol [
    Title: "Test Parser Module"
    Needs: [
        %parser.reb
    ]
]

make object! [
    parser/new 'abba [%.abba] [] [  ; the empty block is 'options
        is-a [
            #"a" (emit 'a-in-a)
            |
            #"b" (emit 'b-in-a-to-b use is-b)
            |
            #"c" (emit 'c?! stop)
            |
            skip (report ["Misplaced " to char! mark/1])
            |
            end (emit 'end-in-a use done)
        ]

        is-b [
            mark:
            #"a" (emit 'a-in-b-to-a use is-a) :mark
            |
            #"b" (emit 'b-in-b)
            |
            #"c" (emit 'c-in-b)
            |
            skip (report ["Misplaced " to char! mark/1])
            |
            end (emit 'end-in-b-to-a use is-a)
        ]

        done [
            end (emit 'done stop)
        ]
    ]

    system/codecs/abba/decode: func [
        blob [binary!]
        <local> state
    ][
        state: parser/init 'abba blob

        state/out: make block! 10
        state/errors: make block! 0

        state/emit: func [token] [
            append state/out token
        ]

        state/report: func [token] [
            append state/errors token
        ]

        parser/start

        case [
            not state/is-done [
                fail "Blob did not meet strict ABBA requirements"
            ]

            not empty? state/errors [
                probe state/errors
                fail "Multiple ABBA errors"
            ]

            <else> [
                state/out
            ]
        ]
    ]
]

probe equal? decode 'abba to binary! "aba"
[a-in-a b-in-a-to-b a-in-b-to-a a-in-a end-in-a done]

probe equal? decode 'abba to binary! "bcab"
[b-in-a-to-b c-in-b a-in-b-to-a a-in-a b-in-a-to-b end-in-b-to-a end-in-a done]

probe error? trap [decode 'abba to binary! "abcd"]
2 Likes

I don't completely follow what it's for. But it looks like a very good non-trivial use of binding to study and try to make better.

For instance, looking at:

set in system/codecs grammar make object! compose/only [  ; why I like ((splice))
    name: quote (grammar)  ; today, this could be done as name: '(grammar)
    suffixes: (suffixes)
    identify?: _
    rules: (bind rules parser)  ; v-- see remarks below
    options: (options)
    decode: _
    encode: _
]

Here's one of those situations where (bind rules parser) not only mutably binds, but it yields something that gets further bound into the make object!. So if the rules use the word "name" it would be redirected into this object. Which is presumably not what you wanted. It's an example of the need for the ideas in "Breaking MAKE OBJECT! into Multiple Operations".

There's also an aspect here of "Separating Parse Rules Across Contexts". Because if someone uses a component subrule that isn't a direct descendant of the rules, they presumably want access to the elements of parser as well.

The idea of having bindings that only apply to the GROUP! portions of code in a PARSE is a likely desired feature as well.

Retaking the term USE in DO-oriented code is one of those uncomfortable-yet-heretically-foundational ideas...and it makes you wonder exactly what the prescribed way of subverting that is. (Do you always go through LIB/XXX? How do people know what's bound out from under them and what isn't?
I've wondered about this with things like adding keywords to PARSE...each new keyword you add runs the risk of breaking a use for another purpose...)

I've an updated version of my Parse Machine (currently for R3C)

If you'd be willing to go ahead and jump this particular codebase to mainline Ren-C, and work up some tests for it, then it could be a lens for how we look at evolving these features. I'd be able to help keep it running...so changes wouldn't be as much a hassle.

It's definitely not throwing softballs--which I like. This is supposed to be a language you can extend, and that shouldn't entail "write an entirely new evaluator and PARSE command if you want something that deviates a bit".

keep [] ; force collect to return a block

I thought the "null if empty" optimization had been removed from R3C. In any case, COLLECT with no KEEPs currently returns an empty block... while COLLECT* returns NULL.

The convention I'm shooting for here is that you use the state map to track any variables. As state is a consistent value through all grammars, every grammar shares the same view on, well, state. There's a few aspects where adhering to conventions such as this offer some insight in to what the system might be capable of.

Yes, though as parser only has one state value at a time, any rule 'outside' the binding just needs access to parser/state to access parsing details.

It's certainly something I've considered. Red doesn't have use and a common pattern is to create namespaces which are kind of what module! is for. Another unstated goal here is to make the code surrounding these grammars as basic as possible, to make them maintainable and optimize-able.

I'm a little less familiar with latest Ren-C than I was, especially the Parse changes. It's not a complex module, but it may take a bit of effort demonstrating something useful.

There's a few such things that I've come across. Working to get them into tests.

That avoids naming conflicts, though it's a bit heavy-handed. And even in this case it's not true of all variables you're using, e.g. mark has potential to collide. (BTW: Ren-C took MARK and SEEK as keywords. There are several reasons, including that I never liked SET-WORD! and GET-WORD! for the purpose.)

In any case...my point stands, that you likely only intended here to bind rules parser. Yet with the historic workings of MAKE OBJECT!, when you do this with a COMPOSE and the block becomes a child of the object body, you will get the additional bindings of name, suffixes, identify?, rules, options, decode, encode... etc. Imagine adding mark to this list in your example...the user code bound to it could wind up corrupting state it wasn't supposed to see.

When people make these little "parse machine" snippets, I'd imagine they'd like to be able to take for granted that the parse machine engine could interpret those snippets without having to be given all of them up front in a block. Especially if the snippet might be generated dynamically. It would be nice if you could be on more of an even turf with the keywords of PARSE in this respect.

To try and move in this direction, I've gone ahead and pushed through PARSE/INSIDE:

%parse-inside.test.reb

The feature itself is fairly small, but the supporting commits to get it working with some level of efficiency were exhausting to look at and think about. So there's not much in the way of tests.

But parse machine seems like it would be a good test, and could push definition of these sorts of features forward.

This is not the final form the feature would likely take. It should be able to give bindings that apply more granularly.

2 Likes

Ah, I get what you mean. Yep, that did catch me out. My inclination though is to reach for map! instead as it has no context and potentially doesn't require a preset list of fields, or another way of designating the rule (e.g. [rule: bind 'rule 'some-other-word-in-the-parent-context-that-isnt-in-the-object]—not ideal and a more nuanced constructor would be better in this instance). This goes back once again to thinking about what objects are for—1) single use, code-grouping objects such as this vs. 2) prototypes representing things in the classical object-oriented sense—which I'd still prefer to see prioritized. Again, I'm sympathetic to the need for different constructors, but still have reservations about the implications of my/method as a solution to binding when cloning. Anyways, this is more a response to that topic than this one.

Specific usage of mark here is a behavioural legacy and a mistake, the convention I'm looking to use here would be state/mark. I'm not sure I see the state basket as being heavy handed, I think it's a reasonable response to the complexity of what I'm trying to achieve. Could well also be my familiarity bias too, but for now it's workable as a means to explore the Parse Machine concept to a conclusion.

BIND does not currently work on MAP!. So prior to the existence of PARSE/INSIDE, you wouldn't be able to use it in this way.

Is there a reason you haven't reacted to PARSE/INSIDE, when it has demonstrated a novel new feature I'd think would be interesting to you--given that you try things like this (and want to be able to separate parse rules across contexts)?

Just haven't had a chance to think about it yet.

1 Like

It should indeed be a good test and could be way to solve the recursion issue. I presume if it encountered a word not in the object, it'd revert to the word's containing context?

If there's one advantage to using a MAP!*, it's that if you do use overlapping grammars**, you don't have to preset the object to use all the variables used by each grammar.

*but I suppose can use a map within an object

**latest monstrosity encountered today comes from Google: a semantic information schema encoded as JSON contained within the HTML content of a MIME-based email all designed to make your email look better in the Promotions tab in Gmail—data format mad-libs!

Another thought—and this is definitely half-baked, but in the same vein as CONSTANT/MUTABLE, I'd prefer to get an error when an implied binding operation happens on a 'locked' block. In this case, I might say bind/lock rules parser and get an error when the object tries to bind. Or it may be that blocks in source get one shot at implicit binding thus when you reference the block later on after reducing/composing, you get an error. Would need to think about the implications.

e.g.

This is fine:

make object! [
    foo: 1
    bar: [foo]
]

But this would raise a binding error:

baz: [foo]
make object! compose/only [
     foo: 2
     bar: (baz)
]
=> ** Baz already bound to context

But this would raise a binding error:

baz: [foo]
make object! compose/only [
    foo: 2
    bar: (baz)
]
=> ** Baz already bound to context

A problem with outlawing binding is that in the general case, code doesn't really know how much binding it has been through already. That can be the loading process behind a module... or an abstract dialect that has slots in it where you can put code.

Just consider things like the OPERATION abstraction just added to Whitespace. You want the code in the body to be able to see library definitions you load in to the module as a whole--that foundation is done by a bunch of binding. Then it's building a function behind the scenes...a process that means binding to locals and (synthesized) arguments.

While it's using "global" variables to access parts of the virtual machine state, this needs to change to something that's more narrow. Maybe it gets at everything through one variable like your STATE approach, so it would have to say things like state/stack. But as I mentioned that feels heavy-handed... I think it's nice to just have an implicit stack variable since messing with it is so common. So broadening this set out to N implicit variables seems better, maybe just a few shorthands (e.g. STACK allowed instead of VM/STACK, but other things might require VM/XXX)

In such a layered world, what kind of rule could MAKE OBJECT! enforce to disallow binding? What's the test it does on a block to say "hmm, this is too bound up".

Yep! :slight_smile:

But I'm very glad you are picking up involvement in this line of thinking. Because having an interest in this topic at all seems to be pathologically absent from most of the Rebolverse (whereas I believe it is kind of the essential question of what the semantic basis for the language could possibly be...)

In terms of inch-by-inch progress: the baby step I've been trying for starters is to stop mutable binds from happening to CONST data. Since functions make their bodies CONST, that means anything that would "bind implicitly" (such as MAKE OBJECT!) would have to be switched from BIND-style binding to IN-style binding.

There's actually still some violators, due to a lot of little technicalities in a very sprawling and organically evolved codebase. But good news is that the compiler can catch the violations (just by making a C++ build). There just have to be little "ignore the violation" tweaks in--for now--for some bits to run.

I hope you take a look through the steps to upgrade the Whitespace example and some of the questions it raises. Right out of the gate, it needed a variant of MAKE OBJECT! which didn't broadcast the bindings down through the body...and you can see how I did that in the CATEGORY implementation.

This is a very excruciatingly manual form of what I talk about in "breaking MAKE OBJECT! into multiple operations". But this is a very common need and people shouldn't have to write their own version of it.

I doubt the needs are all that different from your Parse Machine. But the advantage the whitespace case has is that it's a rather easy-to-understand and documented scenario that can be compared with implementations in other languages. So I hope you can chime in on helping to solve these fundamental and mission-critical problems...what are the steps, what do we call them, and how does it all plug together to make a compelling story.

If a Redbol can't solve problems like this, then the big-picture promises are basically snake oil. People are merely going down a Turing tarpit of making weirdly brittle spaghetti code in Rebol--simply because it can't stop you from solving a problem using it (not because it's providing material support). I think Ren-C stands on reasonable ground to build the Actual Value solution. But it needs your help.

1 Like