Working with Sub-Series vs /PART (Series "Slicing")

rgchris · November 30, 2020, 11:45pm

A common pattern in Rebol is identifying a sub-portion of a series. That a series can have an index that in many ways appears as the beginning of a block is useful:

>> find "string! text!" "ring"
== "ring! text!"

However, there are many instances where you'd want a similarly constricted 'window' at the tail as well.

The thought is that addition to an INDEX, a series reference also includes a LIMIT as well. This value could be a positive number (relative to INDEX) or negative number (relative to TAIL).

An example might be:

>> blob: "Here is some text"
== "Here is some text"

>> part: find blob "is some"
== "is some"

>> uppercase part
== "IS SOME"

>> blob
== "Here IS SOME text"

There are a couple of motivations here:

To not have to COPY a substring to manipulate it (hence optimization tag)
to avoid the whole uppercase copy/part find blob thing find/tail blob thing dance

There are a whole bunch of unanswered questions:

What would a ZERO be relative to? The INDEX or the TAIL?
it's easy to reset the INDEX, how would you reset the LIMIT?

>> part/limit: tail part
== "IS SOME text"

^^^ obviously don't like using a word here

Anyways, just throwing this out here as a half-baked thought. I feel if there was a natural solution to this, it could remove a swathe of common problems

hostilefork · December 1, 2020, 1:30am

The term used for this (in other languages) is "array slicing". (Rust and Python have slices.)

Having thought about this in the past, I'm worried about questions of semantics of slices on mutable contents. It feels sketchy and confusing.

>> data: "abcd"

>> stuff: slice data 2 2  ; position length
== "bc"

>> clear data

>> stuff
== ""

>> append stuff "x"
== ""

>> append stuff "y"
== "y"

We don't have Rust's luxury here of having a borrow checker to make temporary slices that hold locks on data, and then release it.

In Rust's world, you can ensure there aren't mutating references to an array extant...make a slice and work with it...then have that slice released so you can pass a mutating reference around. That's cool.

To do that we'd need some kind of make-slice that took a hold, and destroy-slice that released it. How to do that cleanly falls under the general question of how to bracket the lifetime of an "iterating" construct. (Please share your thoughts on the critical post: "Modifying While Iterating: Crash, Nonsense, Predictable, or Illegal")

(It could hold a lock on length changes, as opposed to element mutations.)

I'm not sure what Python does, I guess it wouldn't hurt to check.

python: import numpy

python: a = numpy.array([10,20,30,40,50])
   out: array([10,20,30,40,50])

python: s = a[2:4]
   out: array([30,40])

python: numpy.resize(a, 2)
   out: array([10,20])

python: s
   out: array([30,40])

python: s[1] = 77
python: s
   out: array([77,40])

python: numpy.resize(a, 5)
   out: array([10,20,77,40,50])

Er, okay, well...whatever. We'll assume it's a data blob they're refcounting. Typical Python as "doesn't crash, someone might think that's useful, won't win any awards for design innovation or rigor."

I think resizing is an anomaly in their arrays, whereas it's foundational to series.

I'm Skeptical, Especially if it's Not Rendered Concretely

I already think the hidden index in series is something that needs to be handled more carefully and in people's awareness. I've written about the fact that sticking it in the series is a weird trick that doesn't really offer anything at a foundational level a separate index does not.

This would exacerbate that problem, to hold something in your hand that looks like a simple string but has all these trapdoors.

Were we to try this, one mechanical thing that won't work is adding a second number in series cells where the INDEX is. There's simply not enough room...all the slots are used up.

You cause a lot of problems if you pretend you have room, by making a memory allocation with two integers in it, and put a pointer to it where the index was before. Then you're GC'ing these pairs implicitly without user involvement as they increment and decrement, and chewing up all available memory during a FOR-NEXT.

(Note: This is the same reason that series indices shouldn't be bignums, and should remain constrained as platform integers.)

I'm not opposed to trying something for the sake of being able to avoid having /PART refinements on everything. But I'd like to see hard cases fleshed out first from a user perspective.

rgchris · December 1, 2020, 3:07am

If it's prohibitive from a structural point of view in existing types, could you have a pseudo 'window!' type that provides this frame?

hostilefork · December 1, 2020, 2:38pm

A single type wouldn't convey what type of series it was slicing, and most benefits are lost if you make a new type. The key purported advantage is being able to drop /PART refinements from functions and write them otherwise unchanged.

The likely better way of looking at the problem is to think of using generic iteration. This line of thinking would go further and fold in other things (like /SKIP) as well.

So rather than trying to imagine all the parameters someone might want to pass to UPPERCASE, you'd make the foundational operation apply to characters...which makes sense. Then you combine that with the iterator...running operations at each position. This means every time you come up with a transformation of char => char you don't have to fret about coming up with a kitchen-sink of refinements to that function; you just mix it with your MAP(-EACH) etc.

But as my iteration-questions link above shows, you'll wind up with some version of the mutation-while-iterating issues...no matter how you SLICE it.

BlackATTR · December 1, 2020, 3:22pm

It's an interesting idea. I'm sure there's more to it, but I think of this as a kind of QUERY or SEARCH operation, rather than one where you'd gather up indexes to living substrings and then be able to apply functions to those fragments.

I think this is closer to PARSE than FIND. For the sake of discussion here, I'll call it SEARCH. Let's say SEARCH takes the following arguments:

target TEXT! source
TEXT! fragment to locate
a word-boundary definition (basically a parse rule or charset to act as the WINDOW or LIMIT).

A refinement of SEARCH might be SEARCH/ALL or SEARCH/MAX, so that you can scan the entire TEXT! for all occurrences or a max of X number, etc. But you get the idea.

I know this doesn't deliver fully what you described, which is a live handle to specific substrings. In my less enlightened version, you first QUERY to match substrings, and then you need to have a second go at it, i.e., the explicit steps to apply changes to the text matches, e.g., the commands for CRUD operations, i.e., CREATE, UPDATE, DELETE.