PARSE on PORT! and avoiding generic behavior w.r.t. ANY-SERIES!

If you look at Rebol2, R3-Alpha, and Red, they all do the same thing with a FILE! that they would with text (STRING!) when you pass it as the first argument to PARSE:

rebol2>> parse %aaa.txt [some "a" ".txt"]
== true

r3-alpha>> parse %aaa.txt [some "a" ".txt"]
== true

red>> parse %aaa.txt [some "a" ".txt"]
== true

I've always had the ambition that you be able to PARSE a PORT!. If that's possible, it seems that you should be able to shortcut actually opening and closing the port yourself by saying something like:

parse %some-200-megabyte-file.txt [
    some "a" end (print "Your giant file was all the letter A")
]

parse http://example.com/some-net-data/ [
    thru <title> copy title to </title> (print ["Title was" title])
]

The vision would be that PARSE would assume when you gave it a FILE! or URL! that you meant to operate on that as a PORT!...opening it, parsing it, and closing it. If you gave it a regular PORT! it would assume you would take care of closing it yourself.

Furthermore, it would be efficient so that it didn't need to load all of it into memory at once. (There could be some heuristic on a "chunk size" it picked automatically, paging in only as much of the file as it needed at a time. But you could perhaps tweak that manually by opening the port yourself and doing some settings. This seems to be a property of the PORT! and not of PARSE, though there may be PARSE-specific settings. Perhaps those settings would be looked for on the port itself as an extensible set of headers, vs. being some strange refinement you'd pass.)
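To make the chunking idea concrete, here is a minimal sketch in Python rather than Rebol, since there is no streaming PARSE to show. It checks the equivalent of the `some "a" end` rule from the example above one chunk at a time, so only one chunk is in memory at once. The names `all_one_char` and `parse_all_one_char` and the 64 KB default chunk size are made up for illustration. It also mirrors the ownership rule described earlier: a path is opened and closed by the function, while an already-open stream is left for the caller to close.

```python
import io
import os


def all_one_char(stream, char="a", chunk_size=64 * 1024):
    """Return True if the stream is non-empty and holds only `char`.

    Reads `chunk_size` characters at a time instead of the whole input.
    """
    seen_any = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                        # end of input reached
            return seen_any                  # SOME needs at least one match
        if chunk.count(char) != len(chunk):  # some other character appeared
            return False
        seen_any = True


def parse_all_one_char(source, char="a", chunk_size=64 * 1024):
    """Hypothetical front end: paths are opened and closed here,
    open streams are used as-is and left open for the caller."""
    if isinstance(source, (str, os.PathLike)):
        with open(source, "r", encoding="utf-8") as stream:
            return all_one_char(stream, char, chunk_size)
    return all_one_char(source, char, chunk_size)


if __name__ == "__main__":
    # An in-memory stream stands in for an already-open port.
    print(parse_all_one_char(io.StringIO("aaaa"), chunk_size=3))
```

A real PARSE would also have to keep earlier chunks around for as long as a rule can still backtrack into them, which this single-rule check sidesteps.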

In any case, the appeal of having that work for FILE! and URL! certainly seems to suggest that it's a much better use of the type variety than as a synonym for:

>> did parse as text! %aaa.txt [some "a" ".txt" end]
== #[true]

There's a clear need for PARSE to run on TEXT!, BINARY!, and BLOCK! input. I'm not sure how this applies to INTO. There also might be a parse/only (or parse/into? which would be type-preserving?)

Not just for PARSE: a General Philosophy of ANY-SERIES!

This ties into what I think should be a very restrained tendency to use ANY-STRING! types in ways that make them equivalent to the behavior on TEXT!.

I've said similar things about why type of first ['''a] should not be conflated with plain WORD!. There should be a default of discernment, leaving room open for distinct meanings.

So be on the lookout for cases where a datatype is being underused, even if it's not able to do the ideal magic today. Seeing PARSE run on PORT! is a pretty big wishlist item for me, so maybe it's not impossible that it could happen... (!)


But we still need to modify URL!s and FILE!s et cetera without modifying the contents they would point to if they were opened as ports. In other words, I would be against being forced to write myfilename: as file! append as text! %abc as text! %.txt in order to create a new FILE! value by appending one old one to another. There's a slippery slope when you treat a string as if its value is the string contents returned by some routine that interprets it.

In my view the need for parsing a PORT! (which I agree is a genuine need) should be met by a routine designed for it, something like say PARSE-CONTENTS, and refinements to it are where the chunk-size parameters and such-like belong, not cluttering up the PORT! object itself. What if there are other ways of operating on PORT! contents besides parse? Do we add their parameters to the PORT! object? No.

In fact I even disapprove of DO of a TAG! acting like DO of the contents of the (file? url? which one?) whose name is the string contents of the tag (and whose directory or base url is where exactly?). It should be DO-CONTENTS, with (exclusive, and possibly in some fashion semi-permanent, needs more thought) BASE and DIR refinements. But at least in that case there is no prior functionality that is being usurped, so I have kept my big mouth shut about it until now.

If things like to file! :[base %.txt] "just worked" like you had written as file! unspaced [base %.txt], we might look at things differently. Life wouldn't have to be as miserable as the worst case you give. So we should be a bit circumspect.

It seems desirable to want append %my-file.txt {Some text} to write to the file instead of giving you %"my-file.txtSome text". I've pointed out before that PORT! itself faces ambiguity in its dual life as an ANY-CONTEXT!...and APPEND is a good example. APPEND to an OBJECT! would add fields given as SET-WORD! and data pairs... while APPEND to a PORT! is re-triggered as WRITE/APPEND. Which behaviors qualify for subversion of the underlying type, and which do not?

I don't know what the full answer is...I just don't want to leave the most interesting behaviors off the table to preserve a kind of trivial mechanical consistency. Though I am a huge fan of mechanical consistency--it's just a matter of where you establish that firm ground.

It does make sense to me that if I'm going to make successive calls to PARSE on a PORT! that the persistence of those settings be something on the port if they are not applicable to a parse of a string in memory.

I would offer HTTP headers as an example of a sort of labeling protocol ecology (in fact it's an ecology that PORT! might do well to inherit from or be compatible with).

I've already suggested that if you say read myport://something that it run a generic process regardless of port type spiritually akin to:

p: open myport://something
append p/headers [num-reads-to-come: 1]
data: read p
close p
data

How many reads are to come might default to unknown in order to inform a keepalive on a TCP connection, and other ports might ignore it altogether. But it would be there to draw from.

It's not just network connections that could have this kind of meta information, and I see no reason why it couldn't be there to draw from for specific clients like PARSE.


This sounds pretty amazing to me, especially the part about managing files larger than available memory.

You are right that this feature is somewhat "special", and it would be better if it were easier to verify the lookup table. At the same time I actually like this shortcut, and it would be great if you could add your own personal links to it.

I agree with @Mark-hi's assessment. I'd also push back against assuming HTTP(S) scheme-based ports would behave the same way as FILE. Perhaps as it's configured now with its simple read mechanism, but it's more likely from a practical/durable standpoint that one would want the various components of an HTTP response: code/headers/content. It's more likely a scheme would determine whether derived PORT! values are parseable and how chunks—whether characters or values—are procured.

(it's perhaps conceivable that by default the HTTP scheme returns a block [status [integer!] headers [map! or block!] content [binary!]] or [error [error!]] if no connection was established)

If TCP ports were parseable, how would PARSE behave if there were any kind of lag in the response?

I've been thinking of extending the convention of FOO* being the "foundational tool" for those who want to forego convenience. While FOO undecorated is the DWIM (Do What I Mean).

I think when most people say parse http://example.com [...] they aren't meaning to look for slashes and suss out the dots.

But perhaps if we offer both alternatives it will be better. What if PARSE* is foundational and the rest layers above it?

Couple of thoughts,

1/ I don't like THIS* convention, it looks a little slipshod. I'd rather see a core namespace apart from those types of functions: core/parse thing [the core way] — this way documentation/discovery for this class of function is consolidated.

2/ I quite often use PARSE on URL or FILE values. It's one of the perks of the language that they are first class string values.

I was assuming the common join/rejoin approach will continue to work for these operations.

I like the idea of parsing on a port, but I'm also not opposed to requiring a refinement, keyword or new notation, although the latter is probably undesired.

On that last aside, for fun I'll throw out a half-baked thought:

What if there were a notation for TEXT!, URL! (and values which can be READ), something like the opposite of quoting, i.e. GET'ing:

:https://example.com
-- This acts like a WORD! value, where URL! as a unique handle (as all URL! and URI's are) by default does an implicit GET of the streamable bytes/input, e.g.,
== {...Example.com is provided by...}
-- The antonym of this (for no apparent justification) could involve generalized quoting:
'https://example.com

and similar for:
:%/c/files/foo.txt
== {It was a dark and stormy night...}
'%/c/files/foo.txt
== '%/c/files/foo.txt

A related option could be to be able to implicitly QUERY a URL!, the way one might a FILE! today (or the way you used to be able to PROBE an EMAIL! value for user and host) where you'd get an object! back, e.g.,

foo: https://example.com
:foo
== https://example.com

::foo
== make object! [
name: https://example.com
headers:..
body:..
size: 64897
]

bar: ::foo
== make object! ...

bar/body
== {Example.com is provided by...}

bar/headers
== {HTTP2.1... etc.}

foo
== https://example.com

Just some ideas drifting from the hazy smoke of my crack pipe. I had been thinking of a ::word notation as a kind of first derivative, where you might be able to GET the value of a reference from deeper than one level away, in the way that generalized QUOTING "protects" values at several levels.

:man_shrugging:

pfft

"Mmm, this is some good crack!"

The idea of there being a GET-URL! that acts like read url isn't necessarily that crazy. I've looked at similar concepts with append data :[some items] acting like append data reduce [some items], e.g. another way of saying repend data [some items].

But for instance: when you look at that being a general replacement for REPEND, you run into a problem when the block is in a variable. :my-block gets a BLOCK! from a WORD!... it does not operate on the block itself. Similar problems would happen with abstraction of the URL! via a variable.

There might be something that GET-GROUP! could help with there, e.g. :(my-url) could act as :http://example.com/whats/in/my-url.

But I prefer the emphasis on language mechanics instead of lexical complexity. I think I would like PARSE on a URL! to not parse the characters of the URL but interpret it as a port. Maybe that's not the out-of-the-box behavior, but if you want it you just say:

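 ; Sketch: wrap LIB/PARSE so that a URL! input is opened as a PORT! first
 parse: enclose 'lib/parse func [f] [  ; f is the FRAME! of PARSE's arguments
     let port: null
     if url? f/data [
         f/data: port: open f/data  ; swap the URL! for a port opened from it
     ]
     do f  ; run the original PARSE on the (possibly modified) frame
     elide if port [close port]  ; close only a port opened here; ELIDE keeps DO's result
 ]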

Which is something you could have done in historical Rebol too, though the idea is that the inheritance from the original function saves you from rewriting the interface...and has the potential to be more efficient (in theory).

Anyway... I think my original guidance remains valid... don't make string-based routines just assume that all ANY-STRING! types act the same because you can't think of anything original to do for them that you wouldn't do for a plain TEXT!.

It's similar to how I think historical DO was kind of a turkey for saying:

>> do {print "Hello"}
Hello

>> do <print "Hello">
== <print "Hello">

>> do 1
== 1

With a limited supply of short words, there's a true un-interestingness when the "I don't know" space just gives back the original thing. I can't think of many situations where such a polymorphism would be interesting ("I don't know if I have a tag or a string or a URL! or an INTEGER!... but I'm going to just DO it anyway and the result is somehow meaningful"). What use case is that, unless DO is understood as an operation for finding the location containing the thing to DO?

Whether the default behavior is to error and let people pick what fits, or if you fill in a nice default meaning is one of those open questions.

Yes, my reply is a tangent from your original topic. I was talking more about code-golf stuff, since there isn't much you can do with a URL! or a FILE! except READ, WRITE or simply RENAME its handle (which is a superficial property/metadata of the data). The majority of the time you're going to READ the file, and often the READ just adds line noise -- the READ isn't the interesting thing you do with the FILE!, the interesting thing is whatever you do after the obligatory READ. Anyway, the GET-GROUP! is a viable pathway for this very low priority idea.


I wanted to point out two things.

One is something Red did, which is to make the CHECKSUM function accept FILE!. Ultimately one would hope this would use a PORT! and streaming method to checksum a very large file and not read it all into memory.

(I've mentioned that by using the mbedTLS functions, we have streaming abilities in the ciphers and hashes themselves. It's just a matter of wiring that up to some kind of progressive streaming PORT! model for the ciphers and hashes that needs to be figured out.)

This is a good example of a case where it is likely more useful to do this than to get the hash of the UTF-8 filename...even though both are potential usages. But the parallels to the PARSE case come up... what if you get in the habit of passing string variables, and then one day the string you pass is in the form of a FILE!, and it is obscured behind the checksum var? Is it bad to have it non-obvious that files were opened in the process?
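As a rough illustration of the two readings, here is a Python sketch (not Red's or Ren-C's implementation) that uses the standard `hashlib` module. The helper names `checksum_stream`, `checksum_file`, and `checksum_name` are hypothetical, and SHA-256 is just an assumed default. The streaming version feeds the hash one chunk at a time, so a very large file never has to be in memory all at once, while `checksum_name` shows the other possible meaning: hashing the UTF-8 bytes of the filename itself.

```python
import hashlib
import io


def checksum_stream(stream, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a binary stream incrementally, one chunk at a time."""
    digest = hashlib.new(algorithm)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:              # end of input reached
            break
        digest.update(chunk)       # the hash object keeps running state
    return digest.hexdigest()


def checksum_file(path, algorithm="sha256", chunk_size=64 * 1024):
    """Reading 1: the FILE! means "the contents of that file"."""
    with open(path, "rb") as stream:
        return checksum_stream(stream, algorithm, chunk_size)


def checksum_name(path, algorithm="sha256"):
    """Reading 2: the FILE! is just a string, so hash its UTF-8 bytes."""
    return hashlib.new(algorithm, str(path).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # An in-memory stream stands in for a large file here.
    payload = io.BytesIO(b"a" * 200_000)
    print(checksum_stream(payload, chunk_size=1000))
```

Splitting the two meanings into separately named functions is one way to keep it obvious at the call site whether a file gets opened, which is the concern raised above.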

The second thing I wanted to mention is that I did encounter a case of wanting to PARSE a FILE! and I felt in doing so that it was useful to not have to convert it to a string first. So there's a bit of ambivalence here. I just wanted to mention these things and keep the topic open.