Taming the Pathology of PATH!

#1

PATH! has long been a thorn. Because it has been considered an ANY-SERIES!--with a position and an index--you can get into all kinds of trouble. Such as decaying into something indistinguishable from a WORD!, and then to nothing at all. In Rebol2/Red:

>> p: to path! [a b]
== a/b

>> type? next p
== path!
>> next p
== b  ;-- ack, looks like a WORD!

>> type? next next p
== path!
>> next next p
==    ;-- uhhhh, nothing?

That's a glaring problem, but there's many other reasons it makes a bad generic array. Try putting WORD! variations in it:

red>> to get-path! [a b]
== :a/b
red>> p: to path! [:a b]
== :a/b
red>> to get-path! p
== ::a/b

Worse still, put a PATH! in a PATH!.

>> left: clear make path! [x x]
>> append left 'a/b
>> append left 'c

>> right: clear make path! [x x]
>> append right 'a
>> append right 'b/c

>> left
== a/b/c
>> first left
== a/b

>> right
== a/b/c
>> first right
== a

New Paradigm: PATH! is NOT an ANY-SERIES!

I did something that swept away a big pile of these concerns. I took PATH! out of the ANY-SERIES! category and made them immutable. Since there's a controlled number of points that can make paths, you can set rules for them (e.g. no fewer than two elements, no paths-in-paths). And since there are no direct modifiers, they can't be changed to disobey this rule. Since there is no INDEX OF due to it not being an any series, you can never think of it as being anywhere but "at the head".

It's not as limiting as it may sound at first. You can still PICK elements out of a path by index, or use FOR-EACH on them. If you ever get to a point where you really want to rearrange and restructure a path, you can convert it to a BLOCK! or GROUP! and then back. And while making operators that remove items from paths might be a little tricky, aggregating them together is not.

Surprisingly (or perhaps not?), this didn't actually cause that much of a ripple. Basically nothing was using PATH! as a generic container anyway--because compared to GROUP! and BLOCK!, paths were really bad at being generic containers. They're never all that long, because they ignore newline handling (embedded blocks/groups can have newlines, but at the level of the slashes in the path itself, there are no newlines).

It's been great so far, and I think there's no going back.

How Many Constraints Should There Be?

I mentioned length of at least 2, and no paths-in-paths. Those are pretty obvious.

But what else? We can stop ::a/b from ever existing. But historically, the following has been idiomatic and accepted as a common and correct syntax:

 a/:b: c

I've wondered if a/(b): c is superior to the point that the path creation rules prohibit embedded get-words. If you couldn't put any GET-SET-LIT inside path elements, it could stop ambiguities.

Furthermore, some types (like FILE! or URL!) have slashes in them. Should inserting them into paths be an error, or at least use those slashes to point out where path segments are and split along them?

Why this is in "Philosophy": The Role of PATH!s in Dialects

One thing that got me to think about this is that I've got a dialect which lets you define BLOCK! rules or PATH! rules:

 e: 'j/k/l
 h: [m n/o p]
 dialect [a/b/c [d e f] g/h/i]

Pathing means "AND these things together". Blocks mean "OR these things together". And like PARSE rules, if you look up a word and get to a BLOCK! or PATH! that's just recursed on and used as if you'd written the rule right there.

Some of the elemental rules were GET-WORD!. If GET-WORD! weren't legal in paths, that would put a constraint on this dialect regarding its elements that the block wouldn't impose.

But...you can work around this with a block.

 dialect [[:a]/b/c [d :e f] g/h/[:i]]

That feels very...clean. Now you have a generic solution where you're using PATH!s as a dialect component that doesn't lose any capability BLOCK! or GROUP! had, without worrying about tapdancing around gibberish paths.

And we actually are entering an era of what are called "mirrored types", which would allow 1-element blocks and 1-element groups that are immutable to fit entirely in a cell with no dynamic allocation or pointer to elsewhere.

Mirrored types were invented so /foo could be a PATH! and cost no more than the old word-class REFINEMENT! did. But seeing them in action, it suggests applying it for GROUP!s and BLOCK!s too. Those embedded blocks could cost no more than a plain GET-WORD! today. With PATH! being immutable, making those blocks and groups immutable makes sense too. (By default on scanning I mean... if you make a path with a length-1 immutable block under the path level, it can preserve that mutability.)

When you put all these concepts together, it feels like it ties up loose ends and ambiguities. Will people miss a/:b:...or can the likes of a/(b): and a/[b]: or :(a)/b and :[a]/b cover pretty much everything?

Skepticism of PARSE preserving *precise* input type
You can has GET-BLOCK! (SET-BLOCK!, GET-GROUP!, SET-GROUP!)
Mirrored Type Bytes, Explained
#2

The only thing I can think of that would make me be upset about losing a/:b and having to use a/(b), is having that get involved in COMPOSE/DEEP when I didn't mean it to.

 compose/deep [
      .../(don't want composed): [(want composed) ...]
 ]

But we have better solutions to this today.

 compose/deep <*> [
      .../(don't want composed): [(<*> want composed) ...]
 ]

...and a shallow compose won't see groups in paths. I think that is enough for me.

The other issue is that right now GET refuses to fetch paths if they contain any GROUP!s. We could update this rule to make it refuse to fetch paths if they contain anything that runs any ACTION!s, so any inert groups would be fair game.

#3

What is the problem you see with get-words in paths, and set-words at the end ?

#4

Ambiguity. If GET-WORD!s can be put in paths, then you can't tell if :a/b/c is an ordinary PATH! with a GET-WORD! at the beginning, or a GET-PATH! with an ordinary WORD! at the beginning.

Same for a/b/c:... is that an ordinary PATH! with a SET-WORD! at the end, or a SET-PATH! with an ordinary WORD! at the end?

Every now and again it has been wondered if this suggests that there shouldn't be a SET-PATH! and GET-PATH!, but that those should simply be ordinary PATH!s with SET-WORD!s at the tail and GET-WORD!s at the head. This breaks down when you want a/b/(c + d): because you'd need a SET-GROUP!, or a/b/1: because you'd need a SET-INTEGER!, etc. for all types. It also breaks down because it inhibits the cheap/easy transformation of these path types into each other by flipping one byte without affecting the shared path array itself.

States that don't seem ambiguous, like ::a/b/c are still quite ugly...and actually can still be ambiguous. e.g. is that a three-element GET-PATH! with a GET-WORD! :a at the head, or a two-element GET-PATH! with a GET-PATH! :a/b at the head. I also think things like a/b/:c: are awful-looking, and don't have good bones for the language.

But the good news of all of this is that I think I have an answer for all of this with immutable paths, that are checked for properties at time of creation, to address all these issues...and I may be able to do it quite efficiently.

1 Like
#5

This is pretty parallel to a problem in URL!.

We have the problem of:

>> reverse http://hostilefork.com
== moc.krofelitsoh//:ptth

What you end up with still claims to be a URL!, but wouldn't load back as one. In fact, it would be loaded under today's conventions as a PATH!, since // denotes a BLANK! path segment. :-/

Historically the idea was that it's just a matter of noticing when something isn't rendering as a "natural" of its type, and falling back on some alternate notation. Whatever it would be, we'd hope it wouldn't be any uglier than:

 #[url! "moc.krofelitsoh//:ptth"]

But as with PATH!, we can question just how useful is it to allow freaks of nature to exist vs. making them immutable and not allowing them. If you could turn URL!s into TEXT! easily enough, and turn them back, isn't that good enough?

Pieces of URL! being URL! is not that interesting. Consider being rid of the PARSE behavior of matching datatypes for COPY:

 url: http://example.com/foo
 parse url ["http://example.com/" copy stuff: to end]

That will now again give you STUFF as the neutral string "foo" like in Rebol2. It's not a URL! of simply "foo" (and hence really should show as #[url! "foo"]). That seems more desirable.

Is pretty much any kind of surgery on URL!s necessary? How often could one's needs not just be taken care of with JOIN-ing them...as being tried for PATH!, and convert otherwise?

#6

In my own url! handling I only needed joining so far.
I think I used deeper surgery for helping with url-encoding issues.
Today, how about doing url-encoding on strings, and nothing on quoted strings?

#7

Good data point, thanks!

If you mean URL-encoding on URL!, the behavior of retaining "as-is" behavior was a request from @rgchris.
Here is the rationale I summarized in the comments:

// While Rebol2, R3-Alpha, and Red attempted to apply some amount of decoding
// (e.g. how %20 is "space" in http:// URL!s), Ren-C leaves URLs "as-is".
// This means a URL may be copied from a web browser bar and pasted back.
// It also means that the URL may be used with custom schemes (odbc://...)
// that have different ideas of the meaning of characters like `%`.
//
// !!! The current concept is that URL!s typically represent the *decoded*
// forms, and thus express unicode codepoints normally...preserving either of:
//
//     https://duckduckgo.com/?q=hergé+&+tintin
//     https://duckduckgo.com/?q=hergé+%26+tintin
//
// Then, the encoded forms with UTF-8 bytes expressed in %XX form would be
// converted as TEXT!, where their datatype suggests the encodedness:
//
//     {https://duckduckgo.com/?q=herg%C3%A9+%26+tintin}
//
// (This is similar to how local FILE!s, where e.g. slashes become backslash
// on Windows, are expressed as TEXT!.)

Offhand, I feel like it sounds more likely to me that URL! is most convenient when people can round-trip the rather lenient expressions being put in their browser, as-is. Attempts we do at LOAD-time to canonize as part of the data type itself may frustrate, and do more harm than good.

I'm not in-the-know enough to know about the legality of schemes or fragments/pieces of URL!s where % does not mean URL-encoding. Obviously a Rebol scheme could do this, but I don't know if any official legal URL ever can. If not, it's probably inadvisable to permit Rebol schemes to.

I do understand this is an important issue, but it would help to see complete scenarios that are pain points and a list of all the tradeoffs.

#8

I actually meant when a text! is joined to a url!, or otherwise converted to one, and maybe vice versa.

#9

I had a thought in this vein, when considering TO in terms of the "stable state round-trip" philosophy. I've thought that might be interesting if TO TEXT! was all it took for a URL to get encoded, and TO URL! took it back to readable again. If such TO conversions were being run automatically, then maybe it would be sensible for operations like JOIN to be similar when working with combinations.

What made me hesitant to pursue is that if we went this route with FILE!, it would result in inconsistent behavior on Windows vs. Linux. You'd have to make sure every TEXT! plus FILE! operation had the backslashes going the right way in the TEXT!.

linux>> join "abc" %d/e/f
== "abcd/e/f"

windows>> join "abc" %d/e/f
== "abcd\e\f"

It may be that URL encoding is a different beast. But I think there's a pretty high bar for "magic" in a system--if there's going to be any, it better be really good magic with a clearly amazing payoff. Otherwise it's just more complexity, which is a net negative--even if you tinker around and find it has a slight advantage overall.

1 Like
#10

I think immutable paths make sense, and disallowing single-element paths or paths with anything but WORD!, TEXT!, INTEGER!, DECIMAL!, BLOCK!, GROUP!, and BLANK! may make sense. (TAG! or other types perhaps, also, but the key is just to stop ambiguous constructions at the moment of creation.)

But taking ANY-PATH! out of ANY-SERIES! may be unnecessary.

The way it could make sense could be if we follow the concept that arrays with iteration positions other than the head are rendered is to include the index. e.g.:

 >> p: 'foo/baz/bar
 == foo/baz/bar

 >> next p
 == 2|foo/baz/bar

Not saying this notation is ideal, but the concept is that any non-head series shows its index value and the full data. So no matter how big your data, offsetting it by an index doesn't hide it in molding. If you have the need to truncate a series, you either have it mutable or COPY the data out.

The console could truncate to show you the most likely relevant portion, e.g. from the index position on. This would be similar to how it truncates the output of very long molds at the tail...it's just truncating long molds at the head.

This ties in with the very important "Where the Series Ends" post. That discusses the hidden index semantics, proposing there being no difference between that and a separately tracked INTEGER!...so it is purely an efficiency trick. Looked at this way helps answer a lot of questions about why that field is there, and prevents having two separate branches of semantics...one for when you use the index internal to the series, and one for when you are operating relative to some external index.