How Much Determinism Should Be Pursued?

While writing tests for the QUERY dialect against some sample files in a directory, I ran into an issue with the order in which those files were given back. Operating system APIs generally do not return the list of files in a deterministic order, and the ordering also varies across filesystems.

This means that even with the same files, you could have the lists come back differently. One OS could say:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/Disk50.txt 26-Jul-2021]
2 [%tests/file-tests/11barz99.txt 26-Jul-2021]
3 [%tests/file-tests/Apple3.txt 26-Jul-2021]
4 [%tests/file-tests/Banana1.txt 26-Jul-2021]
5 [%tests/file-tests/BANANA22.txt 26-Jul-2021]
...

While another would say:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/Apple3.txt 26-Jul-2021]
2 [%tests/file-tests/Banana1.txt 26-Jul-2021]
3 [%tests/file-tests/BANANA22.txt 26-Jul-2021]
4 [%tests/file-tests/Disk50.txt 26-Jul-2021]
5 [%tests/file-tests/11barz99.txt 26-Jul-2021]
...

This made it hard to get reproducible outputs to verify against.

I Made QUERY use SORT/CASE on the READ DIR Result

Getting determinism in the output meant using a function that guarantees an ordering for filenames:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/11barz99.txt 26-Jul-2021]
2 [%tests/file-tests/Apple3.txt 26-Jul-2021]
3 [%tests/file-tests/BANANA22.txt 26-Jul-2021]
4 [%tests/file-tests/Banana1.txt 26-Jul-2021]
5 [%tests/file-tests/Disk50.txt 26-Jul-2021]
...

Having to pay for the sort adds a little bit of overhead, but it's not that significant.
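As a rough illustration of the same idea outside of QUERY, here is a minimal Python sketch (not the actual QUERY or READ DIR code). It assumes a plain code-point comparison behaves like the case-sensitive sort above. The `sorted_listing` helper is hypothetical, and the file names are just the ones from the example output:

```python
import os
import tempfile

def sorted_listing(directory):
    """Return the directory's entry names in a fixed, case-sensitive order."""
    # os.listdir() is documented as returning entries in arbitrary order,
    # so sort explicitly. sorted() on str compares by code point, which
    # puts digits before uppercase and uppercase before lowercase.
    return sorted(os.listdir(directory))

# Stand-ins for the sample files shown in the post.
names = ["Disk50.txt", "11barz99.txt", "Apple3.txt", "Banana1.txt", "BANANA22.txt"]

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x")
    listing = sorted_listing(tmp)

print(listing)
```

With this comparison, BANANA22.txt lands before Banana1.txt because uppercase "A" sorts ahead of lowercase "a", which matches the ordering in the sorted output above.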

Should READ DIR be Sorted By Default?

WASI in WebAssembly is looking to chase down sources of non-determinism and see what it can do to eliminate them. They mention directory listing order as one potential source of problems:

Roadmap to determinism in WASI · Issue #190 · WebAssembly/WASI · GitHub

They seem to believe that on the same OS the directory ordering would be deterministic for the same files, but I don't know of any guarantee of that.

All This Points to Bigger Issues About Reproducibility

We can pick many examples... like whether a MAP! will always enumerate in the same order on different platforms, or with the same contents. Using a deterministically sorted implementation of map would seem to have a number of advantages.
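To make the map case concrete, here is a small Python sketch of enumerating by sorted key, so that two maps with the same contents always enumerate identically regardless of how they were built. This is only an illustration of the idea, not how MAP! works; `items_in_key_order` is a hypothetical helper:

```python
def items_in_key_order(mapping):
    """Enumerate (key, value) pairs in sorted key order."""
    # Sorting on the key alone means values never need to be comparable.
    return sorted(mapping.items(), key=lambda pair: pair[0])

# Same contents, built in different orders.
first = {"size": 100, "name": "Apple3.txt", "date": "26-Jul-2021"}
second = {"date": "26-Jul-2021", "size": 100, "name": "Apple3.txt"}

print(items_in_key_order(first))
print(items_in_key_order(second))
```

Both calls produce the same sequence, at the cost of a sort on each enumeration (or of keeping the structure sorted as it is modified).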

This matters all the more because there's a growing push in software toward giving deterministic outputs by default. If you want some of the reasoning, see this article:

Determinism in software engineering • Buttondown

The more testing one does, the more important it seems.

If I'm not mistaken, SQL gives no expectation that the result set comes back in a consistent order unless you ask for one in the query. You would use an ORDER BY clause to do this:

Domain-SQL>> select 'name 'date from %tests/file-tests/ where 'size > 100 and 'date = 26-Jul-2021 order by 'name asc

The options in this clause are ASC (ascending) or DESC (descending) based on the chosen column.
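For comparison, here is roughly how that looks in real SQL, using Python's built-in sqlite3 module. The `files` table and the sizes in it are made up for the example; only the names come from the listings above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [
        ("Disk50.txt", 500),    # sizes here are hypothetical
        ("11barz99.txt", 300),
        ("Apple3.txt", 50),     # filtered out by size > 100
        ("Banana1.txt", 200),
    ],
)

# Without ORDER BY, the row order is whatever the engine happens to produce.
# With it, the order is part of the query's contract.
ascending = [row[0] for row in conn.execute(
    "SELECT name FROM files WHERE size > 100 ORDER BY name ASC")]
descending = [row[0] for row in conn.execute(
    "SELECT name FROM files WHERE size > 100 ORDER BY name DESC")]

print(ascending)
print(descending)
conn.close()
```

SQLite's default text collation is a byte-wise comparison, so the ascending result here follows the same digits-before-letters ordering as a case-sensitive sort.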


I happened across an article discussing what someone felt was lacking in Go:

An unordered list of things I miss in Go — kokada

The title is a joke about how the only built-in option for maps is one whose enumeration order is randomized. They felt there should be a way to explicitly ask for an ordered map without a third-party library.

It also links through to a mention that, as of Python 3.7, the standard "dict" dictionary preserves insertion order:

[Python-Dev] Guarantee ordered dict literals in v3.7?
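A quick sketch of what that guarantee means in practice (the keys are just the file names from earlier, and the values are arbitrary):

```python
entries = {}
entries["Disk50.txt"] = 1
entries["Apple3.txt"] = 2
entries["Banana1.txt"] = 3

# Enumeration follows insertion order, not key order or hash order.
order_after_inserts = list(entries)

# Assigning to an existing key keeps its original position...
entries["Disk50.txt"] = 10
order_after_update = list(entries)

# ...but deleting and re-adding a key moves it to the end.
del entries["Disk50.txt"]
entries["Disk50.txt"] = 10
order_after_reinsert = list(entries)

print(order_after_inserts)
print(order_after_update)
print(order_after_reinsert)
```

Note that this is insertion order, not sorted order, so two dicts with equal contents can still enumerate differently if they were built in a different sequence.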

My historical bias, coming from C++, would likely have been toward Go's choice: actually forcing randomization so that people can't come to depend on an order they don't care about. This helps with fuzz testing, and it ensures the default doesn't overspecify in a way that prevents optimizations of the structure.

But for a higher-level interpreted language in the space of something like Python (or Rebol), that tradeoff may not be the right one for modern concerns.