How Much Determinism Should Be Pursued?

While writing tests for the QUERY dialect against some sample files in a directory, I ran into an issue with the order in which those files were returned. Operating system APIs generally do not return the list of files in a deterministic order, and the ordering also varies across filesystems.

This means that even with the same files, the lists can come back in a different order. One OS could say:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/Disk50.txt 26-Jul-2021]
2 [%tests/file-tests/11barz99.txt 26-Jul-2021]
3 [%tests/file-tests/Apple3.txt 26-Jul-2021]
4 [%tests/file-tests/Banana1.txt 26-Jul-2021]
5 [%tests/file-tests/BANANA22.txt 26-Jul-2021]

While another would say:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/Apple3.txt 26-Jul-2021]
2 [%tests/file-tests/Banana1.txt 26-Jul-2021]
3 [%tests/file-tests/BANANA22.txt 26-Jul-2021]
4 [%tests/file-tests/Disk50.txt 26-Jul-2021]
5 [%tests/file-tests/11barz99.txt 26-Jul-2021]

This made it hard to get reproducible output to verify against.

I Made QUERY use SORT/CASE on the READ DIR Result

Getting determinism in the output meant using a function that guarantees an ordering for filenames:

Domain-SQL>> select 'name 'date from %tests/file-tests/
    where 'size > 100 and 'date = 26-Jul-2021

1 [%tests/file-tests/11barz99.txt 26-Jul-2021]
2 [%tests/file-tests/Apple3.txt 26-Jul-2021]
3 [%tests/file-tests/BANANA22.txt 26-Jul-2021]
4 [%tests/file-tests/Banana1.txt 26-Jul-2021]
5 [%tests/file-tests/Disk50.txt 26-Jul-2021]

Having to pay for the sort adds a little bit of overhead, but it's not that significant.
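
As a rough analogy outside the dialect, here is a minimal Python sketch of the same idea: take whatever order the OS hands back and sort it case-sensitively before using it. The helper names `list_dir_sorted` and `demo` are made up for illustration, and this is not how QUERY itself is implemented.

```python
import os
import tempfile


def list_dir_sorted(path):
    """Return the entries of `path` in a fixed, case-sensitive order."""
    # os.listdir() returns entries in arbitrary order, so sort explicitly.
    # Python compares str by code point: digits sort before uppercase
    # letters, which sort before lowercase letters.
    return sorted(os.listdir(path))


def demo():
    # Hypothetical test fixture using the filenames from the examples above.
    names = ["Disk50.txt", "11barz99.txt", "Apple3.txt",
             "Banana1.txt", "BANANA22.txt"]
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            with open(os.path.join(tmp, name), "w") as f:
                f.write("x")
        return list_dir_sorted(tmp)


if __name__ == "__main__":
    print(demo())
```

The sort is O(n log n) in the number of entries, which is usually small next to the cost of the directory read itself.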

Should READ DIR be Sorted By Default?

WASI in WebAssembly is looking to chase down sources of non-determinism and see what it can do to stop them. They mention directory listing order as one potential source of problems:

Roadmap to determinism in WASI · Issue #190 · WebAssembly/WASI · GitHub

They seem to believe that on the same OS the directory ordering would be deterministic for the same files, but I don't know of any guarantee of that.

All This Points to Bigger Issues About Reproducibility

We can pick many examples... like whether a MAP! will always enumerate in the same order on different platforms, or with the same contents. Using a deterministically sorted implementation of map would seem to have a number of advantages.
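
To make the map question concrete, here is a small Python sketch (an analogy, not MAP! itself). Python dicts enumerate in insertion order, so two maps with the same contents can still enumerate differently; going through sorted keys removes that dependence. The helper name `canonical_items` is hypothetical.

```python
def canonical_items(mapping):
    """Enumerate key/value pairs sorted by key, independent of insertion order."""
    # Keys in a mapping are unique, so the sort is decided by keys alone.
    return sorted(mapping.items())


a = {"size": 100, "name": "Apple3.txt"}
b = {"name": "Apple3.txt", "size": 100}

# Same contents, but plain enumeration follows insertion order...
assert list(a) != list(b)
# ...while enumeration by sorted key is identical for both.
assert canonical_items(a) == canonical_items(b)
```

The trade-off is the same as with the directory listing: a sort on every enumeration, or a map structure that keeps its keys ordered as they are inserted.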

This matters all the more because there's a growing push in software toward deterministic outputs by default. For some of the reasoning, see this article:

Determinism in software engineering • Buttondown

The more testing one does, the more important it seems.

If I'm not mistaken, SQL gives no expectation that the result set comes back in a consistent order unless you ask for one in the query. You would use an ORDER BY clause to do this:

Domain-SQL>> select 'name 'date from %tests/file-tests/ where 'size > 100 and 'date = 26-Jul-2021 order by 'name asc

The clause sorts on the chosen column, and the options are ASC (ascending) or DESC (descending).
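
For comparison with real SQL, here is a sketch using Python's built-in sqlite3 module. The table name `files`, the helper `names_ordered`, and the sizes are all invented for illustration; only the filenames come from the examples above. SQLite's default text collation is byte-wise, so it gives the same case-sensitive order as the sorted listing.

```python
import sqlite3


def names_ordered(direction="ASC"):
    """Return names of files larger than 100, ordered by name."""
    # Only two fixed keywords are accepted, so interpolating is safe here.
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE files (name TEXT, size INTEGER)")
        conn.executemany(
            "INSERT INTO files VALUES (?, ?)",
            [("Disk50.txt", 500), ("11barz99.txt", 300), ("Apple3.txt", 200),
             ("Banana1.txt", 150), ("BANANA22.txt", 120), ("tiny.txt", 10)],
        )
        rows = conn.execute(
            f"SELECT name FROM files WHERE size > 100 ORDER BY name {direction}"
        )
        return [name for (name,) in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    print(names_ordered("ASC"))
    print(names_ordered("DESC"))
```

Without the ORDER BY, the query would still run, but the row order would be whatever the engine happened to produce.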