PARSE on PORT! and avoiding generic behavior w.r.t. ANY-SERIES!

If you look at Rebol2, R3-Alpha, and Red, they all do the same thing with FILE! that they would with text (STRING!) when you pass it as the first argument to PARSE:

rebol2>> parse %aaa.txt [some "a" ".txt"]
== true

r3-alpha>> parse %aaa.txt [some "a" ".txt"]
== true

red>> parse %aaa.txt [some "a" ".txt"]
== true

I've always had the ambition that you be able to PARSE a PORT!. If that's possible, it seems that you should be able to shortcut actually opening and closing the port yourself by saying something like:

parse %some-200-megabyte-file.txt [
    some "a" end (print "Your giant file was all the letter A")
]

parse http://example.com/some-net-data/ [
    thru <title> copy title to </title> (print ["Title was" title])
]

The vision would be that PARSE would assume when you gave it a FILE! or URL! that you meant to operate on that as a PORT!...opening it, parsing it, and closing it. If you gave it a regular PORT! it would assume you would take care of closing it yourself.

Furthermore, it would be efficient so that it didn't need to load all of it into memory at once. (There could be some heuristic on a "chunk size" it picked automatically, paging in only as much of the file as it needed at a time. But you could perhaps tweak that manually by opening the port yourself and doing some settings. This seems to be a property of the PORT! and not of PARSE, though there may be PARSE-specific settings. Perhaps those settings would be looked for on the port itself as an extensible set of headers, vs. being some strange refinement you'd pass.)
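To make the "chunk size" idea concrete, here is a minimal sketch in Python (used only as a neutral illustration, not as a proposal for how PARSE would be implemented). It checks the `some "a" end` rule against a stream while holding only one chunk in memory at a time. The function name `all_letter_a` and the default chunk size are assumptions made for this example.

```python
import io

def all_letter_a(stream, chunk_size=64 * 1024):
    """Return True if the stream is non-empty and every byte is b'a'.

    Only one chunk is held in memory at a time, so the input may be
    far larger than available RAM.
    """
    seen_any = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                      # end of input reached
            return seen_any                # `some` needs at least one match
        if chunk.count(b"a") != len(chunk):
            return False                   # found a byte that is not "a"
        seen_any = True

# An in-memory stream stands in for a very large file here.
if all_letter_a(io.BytesIO(b"a" * 1000), chunk_size=64):
    print("Your giant file was all the letter A")
```

A rule like this only ever looks forward, which is why paging works so easily. Rules that backtrack would need the port to buffer or seek, which is one reason the chunking policy plausibly belongs to the port.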

In any case, the appeal of having that work for FILE! and URL! certainly seems to suggest that it's a much better use of the type variety than as a synonym for:

>> did parse as text! %aaa.txt [some "a" ".txt" end]
== #[true]

There's a clear need for PARSE to run on TEXT!, BINARY!, and BLOCK! input. I'm not sure how this applies to INTO. There also might be a parse/only (or parse/into? which would be type-preserving?)

Not just for PARSE: a General Philosophy of ANY-SERIES!

This ties into what I think should be a very restrained tendency to use ANY-STRING! types in ways that make them equivalent to the behavior on TEXT!.

I've said similar things about why type of first ['''a] should not be conflated with plain WORD!. There should be a default of discernment; leaving the room open for distinct meanings.

So be on the lookout for cases where a datatype is being underused, even if it's not able to do the ideal magic today. Seeing PARSE run on PORT! is a pretty big wishlist item for me, so maybe it's not impossible that it could happen... (!)

But we still need to modify URL!s and FILE!s et cetera without modifying the contents they would point to if they were opened as ports. In other words, I would be against being forced to write myfilename: as file! append as text! %abc as text! %.txt in order to create a new FILE! value by appending one old one to another. There's a slippery slope when you treat a string as if its value is the string contents returned by some routine that interprets it.

In my view the need for parsing a PORT! (which I agree is a genuine need) should be met by a routine designed for it, something like say PARSE-CONTENTS, and refinements to it are where the chunk-size parameters and such-like belong, not cluttering up the PORT! object itself. What if there are other ways of operating on PORT! contents besides parse? Do we add their parameters to the PORT! object? No.

In fact I even disapprove of DO of a TAG! acting like DO of the contents of the (file? url? which one?) whose name is the string contents of the tag (and whose directory or base url is where exactly?). It should be DO-CONTENTS, with (exclusive, and possibly in some fashion semi-permanent, needs more thought) BASE and DIR refinements. But at least in that case there is no prior functionality that is being usurped, so I have kept my big mouth shut about it until now.

If things like to file! :[base %.txt] "just worked" like you had written as file! unspaced [base %.txt], we might look at things differently. Life wouldn't have to be as miserable as the worst case you give. So we should be a bit circumspect.

It seems desirable to want append %my-file.txt {Some text} to write to the file instead of giving you %"my-file.txtSome text". I've pointed out before that PORT! itself faces ambiguity in its dual life as an ANY-CONTEXT!...and APPEND is a good example. APPEND to an OBJECT! would add fields given as SET-WORD! and data pairs...while APPEND to a PORT! is re-triggered as WRITE/APPEND. Which behaviors qualify for subversion of the underlying type, and which do not?

I don't know what the full answer is...I just don't want to leave the most interesting behaviors off the table to preserve a kind of trivial mechanical consistency. Though I am a huge fan of mechanical consistency--it's just a matter of where you establish that firm ground.

It does make sense to me that if I'm going to make successive calls to PARSE on a PORT! that the persistence of those settings be something on the port if they are not applicable to a parse of a string in memory.

I would offer http headers as an example of a sort of labeling protocol ecology (in fact it's an ecology that PORT! might do well to inherit from or be compatible with).

I've already suggested that if you say read myport://something that it run a generic process regardless of port type spiritually akin to:

p: open myport://something
append p/headers [num-reads-to-come: 1]
data: read p
close p
data

How many reads are to come might default to unknown in order to inform a keepalive on a TCP connection, and other ports might ignore it altogether. But it would be there to draw from.

It's not just network connections that could have this kind of meta information, and I see no reason why it couldn't be there to draw from for specific clients like PARSE.

This sounds pretty amazing to me, esp the part about managing files larger than available memory.

You are right that this feature is somewhat "special", and it would be better if it were easier to verify the lookup table. At the same time, I actually like this shortcut, and it would be great if you could add your own personal links to it.

I agree with @Mark-hi's assessment. I'd also push back against assuming HTTP(S) scheme-based ports would behave the same way as FILE. Perhaps as it's configured now with its simple read mechanism, but it's more likely from a practical/durable standpoint that one would want the various components of an HTTP response: code/headers/content. It's more likely a scheme would determine whether derived PORT! values are parseable and how chunks—whether characters or values—are procured.

(it's perhaps conceivable that by default the HTTP scheme returns a block [status [integer!] headers [map! or block!] content [binary!]] or [error [error!]] if no connection was established)

If TCP ports were parseable, how would PARSE behave if there were any kind of lag in the response?

I've been thinking of extending the convention of FOO* being the "foundational tool" for those who want to forgo convenience, while undecorated FOO is the DWIM (Do What I Mean) version.

I think when most people say parse http://example.com [...] they aren't meaning to look for slashes and suss out the dots.

But perhaps if we offer both alternatives it will be better. What if PARSE* is foundational and the rest layers above it?

Couple of thoughts,

1/ I don't like THIS* convention, it looks a little slipshod. I'd rather see a core namespace apart from those types of functions: core/parse thing [the core way] — this way documentation/discovery for this class of function is consolidated.

2/ I quite often use PARSE on URL or FILE values. It's one of the perks of the language that they are first class string values.

I was assuming the common join/rejoin approach will continue to work for these operations.

I like the idea of parsing on a port, but I'm also not opposed to requiring a refinement, keyword or new notation, although the latter is probably undesired.

On that last aside, for fun I'll throw out a half-baked thought:

What if there were a notation for TEXT!, URL! (and values which can be READ), something like the opposite of quoting, i.e. GET'ing:

:https://example.com
-- This acts like a WORD! value, where URL! as a unique handle (as all URL! and URI's are) by default does an implicit GET of the streamable bytes/input, e.g.,
== {...Example.com is provided by...}
-- The antonym of this (for no apparent justification) could involve generalized quoting:
'https://example.com

and similar for:
:%/c/files/foo.txt
== {It was a dark and stormy night...}
'%/c/files/foo.txt
== '%/c/files/foo.txt

A related option could be to be able to implicitly QUERY a URL!, the way one might a FILE! today (or the way you used to be able to PROBE an EMAIL! value for user and host) where you'd get an object! back, e.g.,

foo: https://example.com
:foo
== https://example.com

::foo
== make object! [
name: https://example.com
headers:..
body:..
size: 64897
]

bar: ::foo
== make object! ...

bar/body
== {Example.com is provided by...}

bar/headers
== {HTTP2.1... etc.}

foo
== https://example.com

Just some ideas drifting from the hazy smoke of my crack pipe. I had been thinking of a ::word notation as a kind of first derivative, where you might be able to GET the value of a reference from deeper than one level away, in the way that generalized QUOTING "protects" values at several levels.

"Mmm, this is some good crack!"

The idea there being a GET-URL! that acts like read url isn't necessarily that crazy. I've looked at similar concepts with append data :[some items] acting like append data reduce [some items], e.g. another way of saying repend data [some items].

But for instance: when you look at that as a general replacement for REPEND, you run into a problem when the block is in a variable. :my-block gets a BLOCK! from a WORD!...it does not operate on the block itself. Similar problems would happen with abstraction of the URL! via a variable.

There might be something that GET-GROUP! could help with there, e.g. :(my-url) could act as :http://example.com/whats/in/my-url.

But I prefer the emphasis on language mechanics instead of lexical complexity. I think I would like PARSE on a URL! to not parse the characters of the URL but interpret it as a port. Maybe that's not the out of the box behavior, but if you want it you just say:

 parse: enclose 'lib/parse func [f] [
     let port: null
     if url? f/data [
         f/data: port: open f/data
     ]
     do f
     elide if port [close port]
 ]

Which isn't something you couldn't do in historical Rebol, though the idea is that the inheritance from the original function saves you on rewriting the interface...and has the potential to be more efficient (in theory).

Anyway... I think my original guidance remains valid: string-based routines should not just assume that all ANY-STRING! types act the same because no one could think of anything original to do for them beyond what would be done for a plain TEXT!.

It's similar to how I think historical DO was kind of a turkey for saying:

>> do {print "Hello"}
Hello

>> do <print "Hello">
== <print "Hello">

>> do 1
== 1

With a limited supply of short words, there's a true un-interestingness when the "I don't know" space just gives back the original thing. I can't think of many situations where such a polymorphism would be interesting ("I don't know if I have a TAG! or a string or a URL! or an INTEGER!... but I'm going to just DO it anyway and the result is somehow meaningful"). What use case is that, unless DO is understood as an operation for finding the location containing the thing to DO?

Whether the default behavior is to error and let people pick what fits, or if you fill in a nice default meaning is one of those open questions.

Yes, my reply is a tangent from your original topic. I was talking more about code-golf stuff, since there isn't much you can do with a URL! or a FILE! except READ, WRITE or simply RENAME its handle (which is a superficial property/metadata of the data). The majority of the time you're going to READ the file, and often the READ just adds line noise -- the READ isn't the interesting thing you do with the FILE!, the interesting thing is whatever you do after the obligatory READ. Anyway, the GET-GROUP! is a viable pathway for this very low priority idea.

I wanted to point out two things.

One is something Red did, which is to make the CHECKSUM function accept FILE!. Ultimately one would hope this would use a PORT! and streaming method to checksum a very large file and not read it all into memory.

(I've mentioned that by using the mbedTLS functions, we have streaming abilities in the ciphers and hashes themselves. It's just a matter of wiring that up to some kind of progressive streaming PORT! model for the ciphers and hashes that needs to be figured out.)

This is a good example of a case where it is likely more useful to do this than to get the hash of the UTF-8 filename...even though both are potential usages. But the parallels to the PARSE case come up... what if you get in the habit of passing string variables, and then one day the string you pass is in the form of a FILE!, and it is obscured behind the checksum var? Is it bad to have it non-obvious that files were opened in the process?
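As a rough illustration of the streaming approach (in Python, not the mbedTLS wiring discussed above), the sketch below feeds a hash object one chunk at a time so the whole file never has to be in memory. `streaming_sha256` and its chunk size are names and values assumed for this example. The final lines also show the ambiguity in question: hashing the name and hashing the contents give different answers.

```python
import hashlib
import io

def streaming_sha256(stream, chunk_size=64 * 1024):
    """Hash a stream incrementally, holding one chunk in memory at a time."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

name = b"foo.txt"              # the UTF-8 filename
contents = b"hello world"      # what the file would contain

of_name = hashlib.sha256(name).hexdigest()
of_contents = streaming_sha256(io.BytesIO(contents), chunk_size=4)
print(of_name != of_contents)  # the two readings of "checksum file" differ
```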

The second thing I wanted to mention is that I did encounter a case of wanting to PARSE a FILE! and I felt in doing so that it was useful to not have to convert it to a string first. So there's a bit of ambivalence here. I just wanted to mention these things and keep the topic open.

I'm trying to straighten out some of the semantics of the file operations while standardizing on libuv.

I found the behavior that append on a port! will act like write/append. The mechanism that R3-Alpha used to implement the retriggering of APPEND was bad and dangerous. So I'd done it a different hacky way that was less dangerous.

But in the comments I had remarked about the difference between:

write/append %foo.txt "data"  ; appends "data" to the foo.txt disk file

and

>> append %foo.txt "data"
== %foo.txtdata

One conclusion to draw from this is that trying to make the PORT! accept the APPEND verb is misguided, because it doesn't work equally well in the "implicit port" style.

But my inclination is that this supports my thesis that FILE!'s true purpose in life is as the signal for implicit ports (and whatever use you want in your dialects, though more often than not probably relating to files).

I don't know what's so special about a language that just has one more "flavor" of string, that just acts like a string. That's weak magic. But implicit ports are stronger magic.

We can actually cook that magic into the evaluator... so that if a function takes PORT! but does not take FILE!, the call itself does the coercion to do the OPEN and then the CLOSE when that particular function call is over.
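A loose analogy for that coercion, sketched in Python rather than in the evaluator: a wrapper that opens the file when handed a path, runs the function on the open handle, and closes it when the call ends, while leaving an already-open handle for the caller to manage. Here `Path` stands in for FILE! and a file object for PORT!. The names `implicit_port` and `first_bytes` are hypothetical.

```python
import functools
import io
import tempfile
from pathlib import Path

def implicit_port(func):
    """Let a function written for open handles also accept a Path.

    A Path is opened for the duration of the call and closed afterwards.
    Anything else is assumed to be a handle the caller will close.
    """
    @functools.wraps(func)
    def wrapper(source, *args, **kwargs):
        if isinstance(source, Path):
            with source.open("rb") as port:       # open, call, then close
                return func(port, *args, **kwargs)
        return func(source, *args, **kwargs)      # caller owns this handle
    return wrapper

@implicit_port
def first_bytes(port, n):
    return port.read(n)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"
    path.write_bytes(b"abcdef")
    print(first_bytes(path, 3))                   # opened and closed implicitly

stream = io.BytesIO(b"xyz123")
print(first_bytes(stream, 3), stream.closed)      # explicit handle stays open
```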

This would mean taking FILE! and URL! out of the ANY-STRING! category. As @BlackATTR suggests, we'd want JOIN to still work as expected...e.g. it would not try to "join the contents of files together".

I'm kind of getting certain about this as the true way. I also think that the true way may be immutability of these strings, so that URL!s in particular can have a moment of validation after which they are not allowed to be twisted into going "bad".

(I'll also point out that this gives FILE! a bit more of a legitimacy for owning the term FILE!...as opposed to FILENAME!. Because it really does turn into a case where it is a stand-in for the file itself.)

Lots of Questions to Answer, Though...

write %file1.txt %file2.txt

What did you mean there? Did you mean:

port1: open %file1.txt
port2: open %file2.txt
write port1 read port2
close port2
close port1

It might seem you can do so easily enough like this:

write %file1.txt read %file2.txt

But that misses an opportunity for a streaming mechanism inside of write. You'd have to READ the entire contents of file2 before the write even got a chance to start, and if it was a 100 MB file then you'd have to read all 100 MB into memory.

There'd be the opportunity with the syntax of write %file1.txt %file2.txt to use a streaming implementation, since it leaves the decisions until later.
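For comparison, this is roughly what such a streaming copy looks like when both ends are open handles. It is a Python sketch, with `stream_copy` and the chunk size chosen for this example. Python's standard library offers the same loop as `shutil.copyfileobj`.

```python
import io

def stream_copy(src, dst, chunk_size=64 * 1024):
    """Copy src to dst one chunk at a time and return the bytes copied.

    Memory use is bounded by chunk_size rather than by the size of src.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:          # source exhausted
            return total
        dst.write(chunk)
        total += len(chunk)

# In-memory streams stand in for the two files.
source = io.BytesIO(b"x" * 1000)
target = io.BytesIO()
print(stream_copy(source, target, chunk_size=64))
```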

But maybe there are other mechanics possible, like read %something.txt not synchronously reading the data, but creating a one-off port that knows to close when it's used. And writing into a BINARY! of a variable counts as a "use".

:thinking:

I don't know, but I'm making my intent known that I don't think FILE! and URL! are destined to remain in the ANY-STRING! category...

This is the direction I've decided to take things in. However I do think (maybe) that what's happening is that some of the ideas behind PORT! and user-defined types are kind of merging. This means that if someone wanted to take this in another direction they might be able to...by saying that a FILE! is a container for a TEXT!... and instead of operating "portishly" it would respond "stringishly" to things like APPEND.

We'll see.

But since these are shorthands for ports, getting at the actual filename could come by way of something like path of:

parse path of %some-200-megabyte-file.txt [
    thru "-" 200 "-" thru ".txt"
]

Which might also have other offerings:

>> directory of %foo/bar.txt
== %foo/

>> path of %foo/bar.txt
== %foo/bar.txt

>> filename of %foo/bar.txt  ; would just "name of" be clear?
== %bar.txt

But the more fundamental offerings that don't talk to the PORT! would be available, and faster. If people find writing AS TEXT! to be laborious there will be a cleaner shorthand when the name is what you think is important:

parse spell %some-200-megabyte-file.txt [
    thru "-" 200 "-" thru ".txt"
]

In terms of questions of "what has worked and what hasn't", I think read-only URLs (for example) have worked out okay for me.

If you want to mutate a URL!, turn it into a string and mutate that. If you can turn it back into a URL! then great. If you can't then it's probably good we don't have to come up with some weird way of representing URL!s that aren't URLs.

Or this kind of garbage:

red>> code: reduce [reverse http://hostilefork.com]
== [moc.krofelitsoh://ptth]

red>> type? first code
== #[datatype! url!]  ; doesn't look like one, and wouldn't LOAD as one

red>> clear first code

red>> code
== [
;  ^-- yup, it's just one left bracket

red>> type? first code
== #[datatype! url!]

But regarding this other behavior I talk about in the thread above:

That may not be working out. I guess I can understand saying that if you want a PARSE on a PORT! that auto opens and closes you'd use something else. Python has a "WITH" construct that takes care of the open and close, so:

with port: http://example.com/some-net-data [
    parse port [
        thru <title> copy title to </title> (print ["Title was" title])
    ]
]

; ... or maybe like LET, scoped without a code block

(with port: http://example.com/some-net-data
parse port [
    thru <title> copy title to </title> (print ["Title was" title])
])  ; close happens when scope ends?
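For reference, this is the shape of the Python construct being borrowed. It is a minimal sketch in which an in-memory stream stands in for the network port (nothing is fetched), and the regular expression is only a stand-in for the PARSE rule. The `html` sample text is made up for the example.

```python
import io
import re
from contextlib import closing

# Stand-in for the data a port on the URL would deliver.
html = "<html><title>Example Domain</title></html>"

# closing() guarantees port.close() runs when the block exits,
# even if the body raises an error.
with closing(io.StringIO(html)) as port:
    match = re.search(r"<title>(.*?)</title>", port.read())
    title = match.group(1) if match else None

print("Title was", title, "| closed:", port.closed)
```

The name bound by `with` is still visible after the block in Python. Only the resource is released, which is one possible answer to the "close happens when scope ends?" question.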

Having taken PARSE of URL! away I'm seeing enough use cases to be empathetic that having to turn things into strings yourself is a hassle if you want to extract some data from a URL.

But I think what should happen when you PARSE/UPARSE a URL! is you get a read-only string alias. If you really feel the need to do mutations, then you need to make a string out of it (a full copy, not just a string alias of the bytes)...and then if you really want a URL back you have to prove the thing is still a URL and coerce it.

So no CHANGE etc. on a plain URL! parse. And when you copy segments out of it you're getting string segments...not little bits of text that are flavored URL!.

Fair enough?

If you're doing this to protect someone from doing something dumb, they still can:

>> val: bob:pip
== bob:pip

>> lav: reverse val
== pip:bob

>> conforms-to-url-syntax? lav
== true

It's unclear to me to what purpose you may reverse a URL, but I could conceive of other reasons for having a URL value be in a non-conforming intermediate state. I'm not saying this is how I'd do this specific task, however it does demonstrate that a fragment plus type has semantic value:

http: to url! "http"
...
address: combine [
    http if is-secure ["s"] "://" host path
]