Taming the Pathology of PATH!

hostilefork · January 11, 2019, 3:00am

PATH! has long been a thorn. Because it has been considered an ANY-SERIES!--with a position and an index--you can get into all kinds of trouble. Such as decaying into something indistinguishable from a WORD!, and then to nothing at all. In Rebol2/Red:

>> p: to path! [a b]
== a/b

>> type? next p
== path!
>> next p
== b  ;-- ack, looks like a WORD!

>> type? next next p
== path!
>> next next p
==    ;-- uhhhh, nothing?

That's a glaring problem, but there are many other reasons it makes a bad generic array. Try putting WORD! variations in it:

red>> to get-path! [a b]
== :a/b
red>> p: to path! [:a b]
== :a/b
red>> to get-path! p
== ::a/b

Worse still, put a PATH! in a PATH!.

>> left: clear make path! [x x]
>> append left 'a/b
>> append left 'c

>> right: clear make path! [x x]
>> append right 'a
>> append right 'b/c

>> left
== a/b/c
>> first left
== a/b

>> right
== a/b/c
>> first right
== a

New Paradigm: PATH! is NOT an ANY-SERIES!

I did something that swept away a big pile of these concerns. I took PATH! out of the ANY-SERIES! category and made paths immutable. Since only a controlled number of places can make paths, you can set rules for them (e.g. no fewer than two elements, no paths-in-paths). And since there are no direct modifiers, they can't be changed to disobey these rules. Since there is no INDEX OF, due to it not being an ANY-SERIES!, you can never think of a path as being anywhere but "at the head".

It's not as limiting as it may sound at first. You can still PICK elements out of a path by index, or use FOR-EACH on them. If you ever get to a point where you really want to rearrange and restructure a path, you can convert it to a BLOCK! or GROUP! and then back. And while making operators that remove items from paths might be a little tricky, aggregating them together is not.
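As a rough illustration of the idea (in Python rather than Ren-C, and not a description of the actual implementation), here is a minimal sketch of an immutable path whose rules are checked only at creation. The names `Path`, `to_path`, `pick`, and `to_block` are hypothetical; the two rules are the ones named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical model of an immutable PATH!; not Ren-C's implementation."""
    items: tuple

    def __post_init__(self):
        # Rules are checked once, at the only place a path can be made.
        if len(self.items) < 2:
            raise ValueError("a path needs at least two elements")
        if any(isinstance(item, Path) for item in self.items):
            raise ValueError("paths may not contain paths")

    def pick(self, index):
        # 1-based like PICK; out-of-range gives None in this sketch.
        if 1 <= index <= len(self.items):
            return self.items[index - 1]
        return None

    def __iter__(self):
        # Iteration is allowed (like FOR-EACH); there is no position to move.
        return iter(self.items)

    def to_block(self):
        # A mutable copy, standing in for conversion to BLOCK!.
        return list(self.items)


def to_path(block):
    return Path(tuple(block))


p = to_path(["a", "b", "c"])

# Restructuring goes through a block and back, re-running the checks.
block = p.to_block()
block.reverse()
q = to_path(block)
```

Because the dataclass is frozen, there is no way to mutate `p` into a one-element or nested path after the fact; the only route to a different path is through `to_path`, which validates again.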

Surprisingly (or perhaps not?), this didn't actually cause that much of a ripple. Basically nothing was using PATH! as a generic container anyway--because compared to GROUP! and BLOCK!, paths were really bad at being generic containers. They're never all that long, because they ignore newline handling (embedded blocks/groups can have newlines, but at the level of the slashes in the path itself, there are no newlines).

It's been great so far, and I think there's no going back.

How Many Constraints Should There Be?

I mentioned length of at least 2, and no paths-in-paths. Those are pretty obvious.

But what else? We can stop ::a/b from ever existing. But historically, the following has been idiomatic and accepted as a common and correct syntax:

 a/:b: c

I've wondered if a/(b): c is superior to the point that the path creation rules prohibit embedded get-words. If you couldn't put any GET-SET-LIT inside path elements, it could stop ambiguities.

Furthermore, some types (like FILE! or URL!) have slashes in them. Should inserting them into paths be an error, or at least use those slashes to point out where path segments are and split along them?

Why this is in "Philosophy": The Role of PATH!s in Dialects

One thing that got me to think about this is that I've got a dialect which lets you define BLOCK! rules or PATH! rules:

 e: 'j/k/l
 h: [m n/o p]
 dialect [a/b/c [d e f] g/h/i]

Pathing means "AND these things together". Blocks mean "OR these things together". And like PARSE rules, if you look up a word and get to a BLOCK! or PATH! that's just recursed on and used as if you'd written the rule right there.
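To make the AND/OR reading concrete, here is a small Python model of such a dialect. It is only a sketch under assumptions: tuples stand in for PATH!, lists for BLOCK!, strings for words, and an "elemental rule" is taken to mean membership in a set of facts (the real dialect's elemental rules are not described here). The function name `matches` is hypothetical.

```python
def matches(rule, facts, variables):
    """Hypothetical evaluator: tuple = path (AND), list = block (OR), str = word."""
    if isinstance(rule, tuple):
        return all(matches(r, facts, variables) for r in rule)
    if isinstance(rule, list):
        return any(matches(r, facts, variables) for r in rule)
    if rule in variables:
        # A word bound to a block or path is recursed on,
        # as if the rule had been written right there.
        return matches(variables[rule], facts, variables)
    # Assumption: an elemental rule succeeds if the word is a known fact.
    return rule in facts


# e: 'j/k/l
# h: [m n/o p]
variables = {
    "e": ("j", "k", "l"),
    "h": ["m", ("n", "o"), "p"],
}

# dialect [a/b/c [d e f] g/h/i]
rule = [("a", "b", "c"), ["d", "e", "f"], ("g", "h", "i")]
```

With this model, the facts `{"g", "n", "o", "i"}` satisfy the rule through `g/h/i`, because `h` expands to a block in which `n/o` succeeds.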

Some of the elemental rules were GET-WORD!. If GET-WORD! weren't legal in paths, that would put a constraint on this dialect regarding its elements that the block wouldn't impose.

But...you can work around this with a block.

 dialect [[:a]/b/c [d :e f] g/h/[:i]]

That feels very...clean. Now you have a generic solution where you're using PATH!s as a dialect component that doesn't lose any capability BLOCK! or GROUP! had, without worrying about tapdancing around gibberish paths.

And we actually are entering an era of what are called "mirrored types", which would allow 1-element blocks and 1-element groups that are immutable to fit entirely in a cell with no dynamic allocation or pointer to elsewhere.

Mirrored types were invented so /foo could be a PATH! and cost no more than the old word-class REFINEMENT! did. But seeing them in action suggests applying them to GROUP!s and BLOCK!s too. Those embedded blocks could cost no more than a plain GET-WORD! today. With PATH! being immutable, making those blocks and groups immutable makes sense too. (By default on scanning, I mean... if you make a path with a length-1 mutable block under the path level, it can preserve that mutability.)

When you put all these concepts together, it feels like it ties up loose ends and ambiguities. Will people miss a/:b:...or can the likes of a/(b): and a/[b]: or :(a)/b and :[a]/b cover pretty much everything?

hostilefork · January 11, 2019, 5:52am

The only thing I can think of that would make me upset about losing a/:b and having to use a/(b) is having that get involved in COMPOSE/DEEP when I didn't mean it to.

 compose/deep [
      .../(don't want composed): [(want composed) ...]
 ]

But we have better solutions to this today.

 compose/deep <*> [
      .../(don't want composed): [(<*> want composed) ...]
 ]

...and a shallow compose won't see groups in paths. I think that is enough for me.

The other issue is that right now GET refuses to fetch paths if they contain any GROUP!s. We could update this rule to make it refuse to fetch paths if they contain anything that runs any ACTION!s, so any inert groups would be fair game.

IngoHohmann · January 18, 2019, 5:54pm

What is the problem you see with get-words in paths, and set-words at the end ?

hostilefork · January 18, 2019, 7:20pm

Ambiguity. If GET-WORD!s can be put in paths, then you can't tell if :a/b/c is an ordinary PATH! with a GET-WORD! at the beginning, or a GET-PATH! with an ordinary WORD! at the beginning.

Same for a/b/c:... is that an ordinary PATH! with a SET-WORD! at the end, or a SET-PATH! with an ordinary WORD! at the end?
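The ambiguity can be shown with a toy molder, sketched here in Python. The representation (a kind string plus a list of `(kind, name)` pairs) and the function names are hypothetical; the point is only that two different structures produce the same text, so the text cannot be loaded back unambiguously.

```python
def mold_word(kind, name):
    return {"word": name, "get-word": ":" + name, "set-word": name + ":"}[kind]


def mold_path(kind, items):
    body = "/".join(mold_word(k, n) for k, n in items)
    return {"path": body, "get-path": ":" + body, "set-path": body + ":"}[kind]


words = [("word", "a"), ("word", "b"), ("word", "c")]

# Ordinary PATH! whose first element is a GET-WORD!
plain_with_get_word = mold_path(
    "path", [("get-word", "a"), ("word", "b"), ("word", "c")]
)
# GET-PATH! made of ordinary WORD!s
get_path_plain_words = mold_path("get-path", words)

# Ordinary PATH! whose last element is a SET-WORD!
plain_with_set_word = mold_path(
    "path", [("word", "a"), ("word", "b"), ("set-word", "c")]
)
# SET-PATH! made of ordinary WORD!s
set_path_plain_words = mold_path("set-path", words)
```

Both members of each pair mold identically (`:a/b/c` and `a/b/c:` respectively), which is the case a creation-time rule against GET-WORD! and SET-WORD! elements would rule out.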

Every now and again it has been wondered if this suggests that there shouldn't be a SET-PATH! and GET-PATH!, but that those should simply be ordinary PATH!s with SET-WORD!s at the tail and GET-WORD!s at the head. This breaks down when you want a/b/(c + d): because you'd need a SET-GROUP!, or a/b/1: because you'd need a SET-INTEGER!, etc. for all types. It also breaks down because it inhibits the cheap/easy transformation of these path types into each other by flipping one byte without affecting the shared path array itself.
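The "flip one byte, share the array" point can be modeled like this in Python. It is a sketch only: the `kind` field stands in for the type byte and the tuple for the shared path array, and `AnyPath` and `as_kind` are made-up names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AnyPath:
    kind: str     # stands in for the type byte
    items: tuple  # stands in for the shared, immutable path array


def as_kind(path, kind):
    # Only the kind changes; the items tuple is reused, not copied.
    return replace(path, kind=kind)


p = AnyPath("path", ("a", "b", 1))
g = as_kind(p, "get-path")
s = as_kind(p, "set-path")
```

If the GET or SET status lived in the first or last element instead, `g` and `s` would each need their own modified copy of the items.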

States that don't seem ambiguous, like ::a/b/c are still quite ugly...and actually can still be ambiguous. e.g. is that a three-element GET-PATH! with a GET-WORD! :a at the head, or a two-element GET-PATH! with a GET-PATH! :a/b at the head. I also think things like a/b/:c: are awful-looking, and don't have good bones for the language.

But the good news of all of this is that I think I have an answer for all of this with immutable paths, that are checked for properties at time of creation, to address all these issues...and I may be able to do it quite efficiently.

hostilefork · March 12, 2019, 7:27pm

This is pretty parallel to a problem in URL!.

We have the problem of:

>> reverse http://hostilefork.com
== moc.krofelitsoh//:ptth

What you end up with still claims to be a URL!, but wouldn't load back as one. In fact, it would be loaded under today's conventions as a PATH!, since // denotes a BLANK! path segment. :-/
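A quick way to see the round-trip failure, sketched in Python: the check below is a deliberately simplistic "scheme followed by ://" test that I made up for illustration, not Ren-C's scanner.

```python
import re

# Assumption: a URL is recognized by a scheme followed by "://".
SCHEME_URL = re.compile(r"^[a-z][a-z0-9+.-]*://\S+$")


def looks_like_url(text):
    return bool(SCHEME_URL.match(text))


url = "http://hostilefork.com"
reversed_text = url[::-1]
```

Here `reversed_text` is `moc.krofelitsoh//:ptth`, which no longer passes `looks_like_url` even though a mutable URL! would still carry the URL! type.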

Historically the idea was that it's just a matter of noticing when something isn't rendering as a "natural" of its type, and falling back on some alternate notation. Whatever it would be, we'd hope it wouldn't be any uglier than:

 #[url! "moc.krofelitsoh//:ptth"]

But as with PATH!, we can question just how useful is it to allow freaks of nature to exist vs. making them immutable and not allowing them. If you could turn URL!s into TEXT! easily enough, and turn them back, isn't that good enough?

Pieces of URL! being URL! is not that interesting. Consider being rid of the PARSE behavior of matching datatypes for COPY:

 url: http://example.com/foo
 parse url ["http://example.com/" copy stuff: to end]

That will now again give you STUFF as the neutral string "foo" like in Rebol2. It's not a URL! of simply "foo" (and hence really should show as #[url! "foo"]). That seems more desirable.

Is pretty much any kind of surgery on URL!s necessary? How often could one's needs not just be taken care of by JOIN-ing them...as is being tried for PATH!, and converting otherwise?

IngoHohmann · March 13, 2019, 6:15am

In my own url! handling I only needed joining so far.
I think I used deeper surgery for helping with url-encoding issues.
Today, how about doing url-encoding on strings, and nothing on quoted strings?

hostilefork · March 13, 2019, 10:32am

Good data point, thanks!

If you mean URL-encoding on URL!, the behavior of retaining "as-is" behavior was a request from @rgchris.
Here is the rationale I summarized in the comments:

// While Rebol2, R3-Alpha, and Red attempted to apply some amount of decoding
// (e.g. how %20 is "space" in http:// URL!s), Ren-C leaves URLs "as-is".
// This means a URL may be copied from a web browser bar and pasted back.
// It also means that the URL may be used with custom schemes (odbc://...)
// that have different ideas of the meaning of characters like `%`.
//
// !!! The current concept is that URL!s typically represent the *decoded*
// forms, and thus express unicode codepoints normally...preserving either of:
//
//     https://duckduckgo.com/?q=hergé+&+tintin
//     https://duckduckgo.com/?q=hergé+%26+tintin
//
// Then, the encoded forms with UTF-8 bytes expressed in %XX form would be
// converted as TEXT!, where their datatype suggests the encodedness:
//
//     {https://duckduckgo.com/?q=herg%C3%A9+%26+tintin}
//
// (This is similar to how local FILE!s, where e.g. slashes become backslash
// on Windows, are expressed as TEXT!.)
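For reference, the relationship between the second decoded form and the encoded TEXT! form in that comment can be reproduced with Python's standard library. This only illustrates the encoding itself; the choice of `safe` characters is mine, made so that existing delimiters and the existing `%26` are left alone.

```python
from urllib.parse import quote

decoded = "https://duckduckgo.com/?q=hergé+%26+tintin"

# Percent-encode only what is not ASCII-safe; keep delimiters and
# already-present %XX sequences untouched (assumption: "%" is safe here).
encoded = quote(decoded, safe=":/?=+&%")
```

The result is `https://duckduckgo.com/?q=herg%C3%A9+%26+tintin`, i.e. the UTF-8 bytes of `é` in %XX form.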

Offhand, it sounds more likely to me that URL! is most convenient when people can round-trip the rather lenient expressions being put in their browser, as-is. Attempts we make at LOAD-time to canonicalize as part of the datatype itself may frustrate, and do more harm than good.

I'm not in-the-know enough to know about the legality of schemes or fragments/pieces of URL!s where % does not mean URL-encoding. Obviously a Rebol scheme could do this, but I don't know if any official legal URL ever can. If not, it's probably inadvisable to permit Rebol schemes to.

I do understand this is an important issue, but it would help to see complete scenarios that are pain points and a list of all the tradeoffs.

IngoHohmann · March 13, 2019, 2:20pm

I actually meant when a text! is joined to a url!, or otherwise converted to one, and maybe vice versa.

hostilefork · March 13, 2019, 2:43pm

I had a thought in this vein, when considering TO in terms of the "stable state round-trip" philosophy. I've thought it might be interesting if TO TEXT! was all it took for a URL to get encoded, and TO URL! took it back to readable again. If such TO conversions were being run automatically, then maybe it would be sensible for operations like JOIN to behave similarly when working with combinations.

What made me hesitant to pursue it is that if we went this route with FILE!, it would result in inconsistent behavior on Windows vs. Linux. You'd have to make sure every TEXT! plus FILE! operation had the backslashes going the right way in the TEXT!.

linux>> join "abc" %d/e/f
== "abcd/e/f"

windows>> join "abc" %d/e/f
== "abcd\e\f"
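The same platform split exists in other languages' path libraries. As an analogy only (Python's `pathlib`, not anything in Ren-C), rendering one abstract path under two flavors gives the two results above; `join_text_and_file` is a hypothetical helper.

```python
from pathlib import PurePosixPath, PureWindowsPath


def join_text_and_file(text, file_path, flavor):
    # The "magic" conversion: the file is rendered in the local convention
    # before being joined to the text.
    return text + str(flavor(file_path))


linux_result = join_text_and_file("abc", "d/e/f", PurePosixPath)
windows_result = join_text_and_file("abc", "d/e/f", PureWindowsPath)
```

The two results differ (`abcd/e/f` vs. `abcd\e\f`) even though the inputs are identical, which is the inconsistency being weighed.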

It may be that URL encoding is a different beast. But I think there's a pretty high bar for "magic" in a system--if there's going to be any, it better be really good magic with a clearly amazing payoff. Otherwise it's just more complexity, which is a net negative--even if you tinker around and find it has a slight advantage overall.

hostilefork · June 19, 2019, 7:01pm

I think immutable paths make sense, and disallowing single-element paths or paths with anything but WORD!, TEXT!, INTEGER!, DECIMAL!, BLOCK!, GROUP!, and BLANK! may make sense. (TAG! or other types perhaps, also, but the key is just to stop ambiguous constructions at the moment of creation.)

But taking ANY-PATH! out of ANY-SERIES! may be unnecessary.

The way it could make sense is if we follow the concept that arrays with iteration positions other than the head are rendered with the index included. e.g.:

 >> p: 'foo/baz/bar
 == foo/baz/bar

 >> next p
 == 2|foo/baz/bar

Not saying this notation is ideal, but the concept is that any non-head series shows its index value and the full data. So no matter how big your data, offsetting it by an index doesn't hide it in molding. If you have the need to truncate a series, you either have it mutable or COPY the data out.
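A sketch of that molding rule in Python, using the tentative `N|` notation from the example. Both function names are hypothetical, and the second function is only there for contrast with the Rebol2/Red behavior at the top of the thread.

```python
def mold_with_index(items, index=1):
    """Always show the full data; prefix the index when not at the head."""
    body = "/".join(items)
    return body if index == 1 else f"{index}|{body}"


def mold_from_position(items, index=1):
    """Contrast: the historical behavior, which hides everything before the index."""
    return "/".join(items[index - 1:])


p = ["foo", "baz", "bar"]
```

`mold_with_index(p, 2)` gives `2|foo/baz/bar`, while `mold_from_position(p, 3)` gives just `bar`, which is indistinguishable from a WORD!.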

The console could truncate to show you the most likely relevant portion, e.g. from the index position on. This would be similar to how it truncates the output of very long molds at the tail...it's just truncating long molds at the head.

This ties in with the very important "Where the Series Ends" post. That discusses the hidden index semantics, proposing there be no difference between that and a separately tracked INTEGER!...so it is purely an efficiency trick. Looking at it this way helps answer a lot of questions about why that field is there, and prevents having two separate branches of semantics...one for when you use the index internal to the series, and one for when you are operating relative to some external index.

rgchris · August 15, 2022, 1:06pm

I get that this may be mainly an R3C consideration, as there are assumptions that Ren-C has moved past:

I don't think I'm too fussed about mutability or removal of ANY-SERIES designation so long as you can round-trip from blocks to construct/deconstruct
Given immutability, perhaps PATH!, GET-PATH!, and LIT-PATH! status can be gleaned from the first value in the path. In this way, e.g. a GET-PATH! is a path that starts with a GET-WORD! or GET-GROUP!, never a WORD!. In this way, there is no lexical ambiguity (no ::a/b or ':a/b or a/b::). SET-PATH! would end with a SET-WORD!/SET-GROUP! so long as the first value is a WORD! or GROUP!
```
to get-path! 'a/b
; LIT-PATH! ['a b]
; evaluated to PATH! [a b]
; transformed to GET-PATH! [:a b]
```
I think it'd be possible to come up with a reasonably finite set of rules that would keep things somewhat sane. Immutability means you couldn't change the first value of a GET-PATH! to a WORD! any way other than to construct a new path so the integrity of rules would be maintained.
Any file or URL value in a path must be the last value (assuming the conservative convention of using / as a path delimiter)
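One possible reading of this proposal, sketched in Python: the path's kind is derived from the sigils on its first and last items rather than stored separately. Items are plain strings here, and `item_kind` and `path_kind` are hypothetical names; this is an interpretation of the rules as stated, not a worked-out design.

```python
def item_kind(text):
    # Sigils on a word or group: :a / :(a) is "get", 'a is "lit", a: / (a): is "set".
    if text.startswith(":"):
        return "get"
    if text.startswith("'"):
        return "lit"
    if text.endswith(":"):
        return "set"
    return "plain"


def path_kind(items):
    head, tail = item_kind(items[0]), item_kind(items[-1])
    if head == "get":
        return "get-path!"
    if head == "lit":
        return "lit-path!"
    # A SET-PATH! ends with a set item, so long as the head is plain.
    if head == "plain" and tail == "set":
        return "set-path!"
    return "path!"
```

Under this reading `[":a", "b"]` is a GET-PATH!, `["a", "b:"]` is a SET-PATH!, and converting between them means building a new item list each time.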

hostilefork · August 15, 2022, 1:38pm

Yes, you can round trip...also COMPOSE in particular is useful:

>> compose '(spread [a b])/([a b])/c
== a/b/[a b]/c

>> compose '(if true [<a>])/b
== <a>/b

>> compose '(if true [_])/b
== /b

>> compose '(if false [<a>])/b
== b

The mechanics of JOIN are still being fretted over...I mention some of the problems in this thread.

Read-onlyness solves one problem with putting such parts in the path, but there are others I mention:

- It demands more forms that we don't necessarily want, e.g. block.1: (or block/1:) would require a SET-INTEGER!.
- It inhibits easy transformation via the type byte, e.g. to simply change a GET-PATH! to a SET-PATH! and share the underlying series. So consider an underlying array like [a (b) 1] ... this can be shared between GET-PATH! SET-PATH! PATH! and quoted path instances. Otherwise you have to keep making copies and transforming them.

Cheap quoting and unquoting via the quote byte on the path as a whole is also a generalized mechanism that works quite well.

I'd imagine such exceptions would probably cause more confusion than anything (consider the logic you'd have to put in things like COMPOSE and JOIN to make sure it wasn't appending to a path that was formerly valid, that would be becoming invalid by adding to it).

rgchris · August 15, 2022, 3:17pm

Sticky wicket, for sure.

Why would the burden be on individual functions. If they created a new series (because they wouldn't be modifying the old one), wouldn't the effort be shut down consistently by the path creation process?

hostilefork · August 15, 2022, 3:24pm

I guess if you truly round trip to a block and then to a path...it would...which would apply to usermode code.

But native code that does the building could previously assume that if a well-formed path fragment exists, it could reuse it without checks...and only check itemwise on new material. It would complicate those code paths. I'd rather avoid it unless there's a clear argument for why the irregularity was needed.

rgchris · August 15, 2022, 3:36pm

Understood. To understand the cost: for those functions it would be a check along the lines of if path-op ... if new-path-good ..., and for path creation it'd be if valid-series? ... else bomb. Though valid-series? itself would only be really costly if the paths themselves became overly long, right?

hostilefork · August 15, 2022, 3:57pm

Things aren't very optimized at this moment. So everything that makes a new path (as opposed to switching between GET/SET variations) is running through a unified validation...as if you made it from scratch. I think. This is just a matter of trying to get things working, and optimize later.

So I'm speaking sort of more about "the principle of the thing". Right now there are two (or, two and a half?) rules:

at least 2 items
every item is from the list of valid items
- PATH!'s list of valid items has one more thing in it than TUPLE!'s list... i.e., TUPLE! is allowed

(Oh...tuple exception rule. TUPLE!s of length 2 that are both INTEGER! are illegal. Although I've suggested this might render as a pair, which would mean PAIR! would just be a type constraint on tuples. For the moment, 1. and .1 are also disabled...but I'm suggesting those are more useful as tuples than decimals. If so, then with the 1x2 rendering of a 2-element integer tuple there would be no exceptions in the supported elements.)

If there are any weirder rules than that (e.g. a special list for head items or tail items)... that basically would cast doubt on any efforts to optimize the implementation of things like COMPOSE or JOIN by taking parts it already had for granted. So there'd never be optimized processes--the assumption would be that it would have to treat the whole thing as if the path was being created from scratch.

But beyond the optimization, I favor the simplicity of the rules. It makes paths feel like a reliable/predictable part (vs. a "pathological" one).

If a truly killer feature were shown to be enabled by a more asymmetric rule, then it's by no means impossible to support. It just introduces a tax on everyone writing logic that operates on paths--to where they can't themselves just check if something is in the ANY-PATH-ITEM! typeset and know in advance if what they're making would be legal. They only find out by passing their particular configuration through the TO PATH! validation algorithm. It's a more uneasy foundation.
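To illustrate why a uniform, item-wise rule is the easier foundation, here is a Python sketch. The set of valid types is a stand-in I chose for the example (it is not the real ANY-PATH-ITEM! typeset), paths are modeled as tuples, and `is_path_item`, `make_path`, and `join_validated` are hypothetical names.

```python
# Hypothetical stand-in for the ANY-PATH-ITEM! typeset.
VALID_ITEM_TYPES = (str, int, float, list)


def is_path_item(value):
    # Position-independent: legality depends only on the item itself.
    # (bool is excluded because it is an int subclass in Python.)
    return isinstance(value, VALID_ITEM_TYPES) and not isinstance(value, bool)


def make_path(items):
    """Full validation, as if the path were made from scratch."""
    items = tuple(items)
    if len(items) < 2:
        raise ValueError("at least 2 items")
    for item in items:
        if not is_path_item(item):
            raise ValueError(f"not a valid path item: {item!r}")
    return items


def join_validated(path, new_items):
    """Trust the already-valid path; check only the new material item-wise."""
    new_items = tuple(new_items)
    for item in new_items:
        if not is_path_item(item):
            raise ValueError(f"not a valid path item: {item!r}")
    return path + new_items


base = make_path(["a", "b"])
joined = join_validated(base, [1, ["c"]])
```

Because no item's legality depends on where it sits, `joined` is guaranteed to equal what `make_path` would accept for the whole sequence. A head-only or tail-only rule would break that guarantee: an item that was legal at the tail of `base` might not be legal in the middle of `joined`, so the shortcut would have to go.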