QUERY Dialect Brainstorming

So @BlackATTR has brought up the idea that there is a likely very large opportunity in what might be thought of as a sibling dialect to PARSE, namely QUERY.

(For anyone who didn't see his talk on the subject, it's about 10 minutes long (edited), and good to re-watch for context.)

Rebol2 had a native QUERY target /clear which "Returns information about a file or URL." The /CLEAR refinement would "Clear modified state of target object".

R3-Alpha instead had QUERY target /mode field, remarking that /MODE would "Get mode information", and passing NONE for the mode would return a list of available modes to pass.

Being able to at least narrow down what you are asking for is important. Because in the OS, a lot of times each property might require a different method call. This is costly for filesystems, and even moreso over a network.

But the /MODE refinement seems ignored:

r3-alpha>> query/mode %README.md none
== make object! [
    name: %/home/hostilefork/Projects/ren-c/prebuilt/README.md
    size: 1425
    date: 28-Oct-2020/3:50:27-5:00
    type: 'file
]

We Must Replace This Weak Form of QUERY

The proposal is to model after SQL, which is widely known and understood.

So we'd be looking at something a bit more like:

>> query [select date from %README.md]
== 28-Oct-2020/3:50:27-5:00

The idea of this grammar being rearrangeable is something being suggested as well:

>> query [from %README.md, select date]
== 28-Oct-2020/3:50:27-5:00

Usual questions abound: should you say select 'date and use a LIT-WORD!, or select #date, or select <date> or select /date etc. etc.

The opinion of @rgchris has historically been that undecorated words are the thing that give leverage. SQL is successful and doesn't make you decorate the names, so you only need to escape when you trip across a keyword. The use of quoted words may be the most natural way to slip into this distinction when you have to make it.

But then, if you have field: 'date (or whatever) you can't expect indirection implicitly from select field...you have to do something like select :field or select @field, or do you always use GROUP!s for composition with select (field) ?

My point is that across dialects, it's not all that comforting to say "each one has whatever its own rules are". It would be nice if there were some systemic guidelines.

Some Opening Thoughts

The move to a dialect where "everything is in a block" is a pretty radical one. PARSE could have done that, enabling weird features like adding more input for the same operation.

>> parse [input "aaabb" rules [some "a"] input "bbcc" [some "b" some "c"]]
== "aaabbbbcc"

(I'll point out that Lisp's LOOP macro has some of this same feel, which has made me wonder if the LOOP word should kick into a similar dialect.)

Thinking of query as not trying to pull any aspects out into separated parameters sends it into its own universe. It can't be a beneficiary of things like specialization. You'd probably see a lot of cases of:

my-query-date: func [file] [
     query compose [select date from (file)]
]

In situations like this, it's just a fact that the system is going to have to break down the dialect into parts in order to process it. Those parts will be parameterized with formal arguments, digested and collated into a machine-processable form. There would likely be a lot of applications for these routines, where monolithic QUERY wouldn't be suitable.

To model what the internal parts would look like, I've suggested looking at the plugin architecture of systems like osquery. They've abstracted querying across a wide range of non-database things:

https://osquery.io/schema/4.5.1/

This might let us get a glimpse of what the middle forms would be. I did a bit of investigation into building osquery, which I should get back to looking at...

I lamented while trying to make BITWISE that dialecting is very poorly defined. We don't know what role binding plays in dialect evaluation or what good and bad are. So really, any thoughts here on the tradeoffs or even examples of what not to do would be good...

3 Likes

I have taken a step back to look at making the SQL dialect generalizable. In other words, looking at it as a structure where meaningful fragments (clauses) of an expression are validated and parsed in succession, like a finite state machine.

This means that in addition to having a useful dialect with well-known syntax (SQL), it allows modification of its syntax-- or even just taking the clause-based system to create a dialect with new/different syntax.

So... a dialect model with SQL syntax (for Text! based values) as its first working example. This model of course has some limitations in that it would support a limited family of "keyword dialects" rather than, say, a BITWISE dialect, or dialects for chemical reactions or whatever, but this still has some value.

**A keyword dialect (in my definition) would be a command-language or messaging syntax which is primarily word! based. SQL fits this model. A home automation dialect, or the user navigation commands for a text-based game, a static site publishing dialect etc.

As for the thornier questions raised about decorated words:

I think if the user is evaluating a QUERY block of Rebol values, then all must conform to scanner rules, and therefore a minor subset of SQL syntax will not be supportable. I worked around this (in interest of non Ren-C users), so users could provide an ordinary input (Text!) at a command-line REPL which massages the Text! input into Rebol values, arranges into optimal clause order, produces errors/remediation, and formats output as necessary.

I'll set aside the REPL piece, as the scope here is how to perform a QUERY natively within Ren-C rather than as a external DSL tailored for the average SQL user.

One other point I'll make is that SQL clause: FROM <text!, file!, dir!> is obviously a great starting point, but additional thought should be given to Ren-C's other rich built-in data-sources:

FROM <url!, ftp!>
FROM <object!, port!>
FROM <query!>
FROM <rebol source code!>
FROM <your documentation/markdown comments in rebol source!>

Another thing I'll mention about the current QUERY. You can only get a few datapoints about a file, e.g.,

  • name
  • size (in bytes)
  • date (modified)
  • type

To be useful you'd probably also want some of the following as standard fields:

  • base name
  • extension
  • path
  • full path
  • accessed date
  • access permissions

And "field" keywords to refer to the text/data within the file. So you can use filter clauses, e.g.,

select name, date, full-path
from %my-dir/
where extension = %.r and date = now/date and filetext like "black?hole?"

My personal preference is to use a sigil to indicate any structured data fields of the target FROM data source. These can and will vary depending on the data source (e.g., a file's properties will be different from a URL!). True that it adds line noise:

select #name, #date, #full-path
from %my-dir/
where #extension = %.r and #date = now/date and #filetext like "black?hole?"

(And no, I'm not attached to the #, it could also be a dot or whatever)
But some form of visual cue allows the user to quickly scan the expression for relevant NOUNS, while delineating from dialect keywords as well as embedded rebol expressions. SQL syntax is pretty minimal, and users are usually focused on the nouns (data fields) of the expression rather than the SQL syntax which tends to fade into the background.

select #name #date #full-path
from %my-dir/
where #extension like %.r??
and #date between (ask ["Between date1:" Date!]) and (ask ["And date2:" Date!])

I wonder if quoted words splits the difference:

select 'name, 'date, 'full-path
from %my-dir/
where 'extension = %.r and 'date = now/date and 'filetext like "black?hole?"

Even so: I'm sympathetic to the idea that anytime you mandate extra decoration which context alone can provide, you are kind of slipping away from what makes written languages like English work. Written languages drift to omit and gloss out superflous syllables or decorations. Not embracing that feels like falling short...to where SQL will always seem "better" and you'll have to transform from some text format. I call that failure.

With what I laid out in "speaking with tics"...maybe we could collapse this all into a kind of "strict mode" setting that you could apply as a whole...or call out constructs which it applied to. Maybe they'd use an operation like word-from-literal which a configuration setting could determine the tolerances of.

As a baseline minimum idea, you could say query: :query/strict at the top of your file.


For variable substitution, colon-escaping lines up with what we are thinking about quoted arguments.

field1: 'name
field2: 'date
field3: 'full-path
rule2: ['date = now/date]

select :field1, :field2, :field3
from %my-dir/
where 'extension = %.r and :(rule2) and 'filetext like "black?hole?"

The idea that maybe if you're splicing more than one element's worth you use a doubled group is interesting, to help convey "this is more than a single term"

where 'extension = %.r and :((rule2)) and 'filetext like "black?hole?"

The other possibility would be the @ words:

select @field1, @field2, @field3
from %my-dir/
where 'extension = %.r and @(rule2) and 'filetext like "black?hole?"

But what I've suggested this meaning is not so much "pretend I had written this here" (source level substitution) but rather "this is really the actual thing I'm looking for". I don't know what that would be in a querying dialect... but in spirit imagine:

; foo.txt contains "abcdef"

>> query [select third character from %foo.txt]
== #c  ; assumes you meant open that file and look at the contents

>> filename: %foo.txt
>> query [select third character from :filename]
== #c  ; substitution principle, so same as if you'd written it

>> query [select third character from @filename]
== #o  ; literal interpretation...third letter of FILE! string `foo.txt`

The example is easier to understand in PARSE, where block or :block would mean "use as a rule" and @rule means "look for the literal BLOCK! value in the input".

(Note: This means I'm basically committed to calling @foo a LIT-WORD!, now)

My point with all of this is that I'd like to see some consistency with this across dialects and in the evaluator. If they deviate, they should deviate for a reason.

While we don't have a lot of clarity, I feel like there's measurably more than there was.

I like the "spirit" example.

One small point: Keep in mind that SELECT in SQL and SELECT in rebol don't mean the same thing. In rebol SELECT is a PICK. In SQL SELECT 'thing means to "return all occurrences of thing from the dataset". To constrain the result set, you can say:

SELECT TOP 10 'thing from data-source

or

SELECT 'thing from data-source LIMIT 10

If you want the third character from a text! or file! in SQL, you'd probably say:

SELECT substr(:filename, 3, 3)

:roll_eyes:
Yes, SQL relies quite heavily on column functions (e.g., CONCAT, LENGTH, UPPERCASE, TRIM) -- replete with parentheses- to modify/change/format the output fields for each row of the SELECT clause. Note that this leads down the path I mentioned some time ago about parsing nested group!s, since these SQL column functions can be nested to an arbitrary depth. :grinning:

Looking at these examples of semantic/syntax differences-- between shared fns/keywords such as SELECT, or allowing GROUP! to follow the SQL format of fn(value), one needs to decide whether to support a sawed-off/bastardized version of SQL to fits within the Ren-C milieu, or instead to hew towards greater compatibility with SQL. The former is cleaner for Ren-C semantics, the latter is better if you view QUERY (as I do) as an opportunity to create a zero entry pool gently descending into the Ren-C lake.

A further thought on the messiness of column functions (scalar, aggregate and string). Although I think the following should be supported, we might come up with a cleaner rebol format to sidestep this:

SELECT concat("Sale price:", round('price), "$USD") ...

In other words, permit a format which involves embedding rebol code in-situ similar to the way PARSE dialect embeds rebol:

SELECT :[spaced ["Sale price:", round 'price, "$USD"]] ...

The SELECT clause behaves like the BODY of a FOR-EACH-- it gets evaluated for each match (aka "row") added to the result set. But we would need to create a way for the price field in the above clause to be identified and substituted with the new value on each iteration of the loop-- it probably needs to be something other than a lit-word! tick to differentiate as a field in the SQL dialect rather than a regular rebol code value.

I was just making up some pseudocode to try and motivate what "@" might signify, given how I envision it being used elsewhere.

Perhaps we should start calling these two contrasting approaches QUERY [...] and SQL [...].

While both are interesting, I think the more pressing need is QUERY: to lay foundations for new dialect authors to know best practices. So weigh both perspectives in the design. But what we'd be using in-the-box for getting file sizes would not carry forward things like CONCAT. The term wouldn't appear in its implementation, and only in a not-in-the-box script for specialized use.

I will point out that unfortunately...when you are running database operations that execute "code" inside the database itself, you don't have the choice of using a language the database doesn't know. These situations pretty much have to offer the option of going to the raw string level...you'll probably never quite get the syntax it supports.


Back to QUERY--where it's kept it in the best practices of dialects in the language...the way you'd ask new dialect new designers to do it.

But dialect best practices are still a big unknown so I'd like us to chart this out, and look for consistency. It may feed back into changes to PARSE and inform things like BITWISE.

We can think about select [a b] from vs. select a, b from and ask ourselves if there's something intrinsically better about the BLOCK! and the COMMA! should be saved for "purely optional" delimiting markers. (Should working code always run the same with or without a comma, and commas only trigger errors if the code wasn't broken up in the way the author thought?)

Maybe that's an example of a point where SQL and QUERY diverge. Maybe SQL thinks of commas differently and is more limited in their usage to avoid the "better" use of blocks or groups.

Yet I mention that even QUERY is going to need to be broken into layers in order to implement it. Maybe the "commas-are-commantary" rule doesn't apply to it, and it uses them instead of blocks? We just don't know yet, these things are still new.

It feels to me like select top 10 definitely leans into the "oops, hope I didn't name my column 'top'" kind of situation where I'd think that's getting too dicey.

Anyway, the main thing is just to not be shy about it and keep pushing. All these things take time. We've got more tools than ever with generic TUPLE!/PATH!, and COMMA!, and "new LIT-WORD!/PATH!/GROUP!/TUPLE!/BLOCK!"