Roaring BITSET!s

BITSET! has a very simplistic implementation: it is essentially a BINARY! that's as many bits long as it takes to reach the highest set bit. Because there's no form of compression, setting a single bit for a large number extends the size of the binary... e.g. if you set the millionth bit you get (1 million bits / 8-bits-per-byte) = 125,000 bytes. That one bit just cost you about 122 KiB.

(Note that there are 1,114,112 codepoints in unicode, so if you want to talk about whitespace characters in a unicode-aware way that might create problems. This kind of thing has contributed to reluctance to build in character set classes.)

There's one twist to bitsets to cut down on storage, and that's negation. If you NOT a bitset it doesn't change the bits inside of it... it just remembers that the set is negated, and gives you the negated answer on lookup.

>> bits-two: make bitset! [2]
== make bitset! #{20}

>> find bits-two 1
== none

>> find bits-two 2
== true

>> bits-not-two: negate bits-two
== make bitset! [not bits #{20}]  ; not a huge bitset of all negated options

>> find bits-not-two 1
== true

>> find bits-not-two 2 
== none

That seems nice... in theory. However, the bitset negation never accounted for set operations like EXCLUDE and friends. They're all buggy:

>> bits-nothing: exclude bits-not-two make bitset! [1]
== make bitset! #{20}

>> find bits-nothing 1
== none

>> find bits-nothing 2
== true

The code doesn't honor the negation flag, so all it did was clear that flag on the result... and remove 1 (which wasn't in the underlying data being negated anyway), leaving the 2 set.

Red has the same kinds of problems (in this particular case it thinks the result of the exclusion is the bitset #{60}).

This is a Non-Trivial Problem

Carl wrote:

Furthermore, we will need to consider the various results of logical operations performed on bitsets, including AND, OR, XOR, but also UNION, INTERSECT, DIFFERENCE, and EXCLUDE. These become just a bit more complicated internally to get the desired results. No pun intended.

But it's not just a bit more complicated, it's a lot more complicated. When you think about it, it's really hairy to implement set operations like EXCLUDE or XOR unless you are willing to flatten out the negated set... which would mean a lot of bits. Otherwise you're maintaining a lot of little patches, and there are tons of issues about compaction (plus opportunities for mistakes).

Is the Solution "Roaring Bitmaps"?

This problem is not new, so people have tackled it...and there's a popular library called "Roaring Bitmaps":

They're addressing exactly this problem, and they have C code for it. They claim to need C11, but skimming it I think that requirement could probably be avoided... they seem to be using stdbool and stdint (for which we have shims) and __restrict__, which we could define away as a no-op if it's not available.

It's a somewhat heavyweight implementation compared to the "simple" concept of just using a thin wrapper around a BINARY!. But I think it's easy to say that the simple concept is not fit for purpose.


I don't know where this rests within the list of overall priorities, but it sounds like a solid option.

Me neither... but the thing is, whenever you try to work on something foundational like BINARY! behavior, you wind up having to accommodate bitsets. And it's a reminder of how much they don't work.

Today's tripping over bitset-bad-binary mojo led me to search and ask "hasn't someone already written a library for this?"... and the answer is apparently "yes". Bitsets are pretty useful, and it may be that the reach of applications is greater than just providing them for users. They may be a trick for other parts of the system too.

I might hack on it some to see just how tractable using this is.


I did an intensive dive into the code to check it out... and tried linking it into R3 to see what the potential problems would be. Here are the things I found out.

  • "Roaring Bitmaps" refers more to a methodology than to a specific implementation. Bindings in several other languages (including C++, Rust, Python, C#, Node.JS) are wrappers for the "CRoaring" implementation. Java has its own implementation (which may be the canon one for the research), and there is also a manually written Go version.

  • It seems to do what it claims, and the API is not difficult to use. It covers everything one would probably like to do with a bitset or negated bitset in Redbol...and then some.

  • The code isn't particularly polished, and error handling is not part of the API. When out-of-memory errors occur, it prints a message (using fprintf(stderr, ...))... and we don't even link to printf. They are aware this is suboptimal, and the issue database mentions it as a task someone should work on fixing.

  • C99 is used in basic ways, like putting variable declarations inside for loops (for (int x = 0; x < 10; ++x) {...}), and they also use tagged structs. None of that would be hard to hack out if building on C89 were a priority. But being able to build on C89 isn't something they would care about upstream (the way something like zlib cares about it), so it would have to be done as a fork.

Most significantly, the library is not set up so you can pick features from it à la carte (the way mbedTLS is set up). This means if you compile it as-is, you wind up fattening your executable by around 60K.

I think that can be whittled down significantly... because a lot of that cost is tied to platform-independent serialization features (so you can save bitsets to files from a C codebase and load them from Java code). There are also a bunch of variations of functions that do things like unions in place vs. creating a new set, plus other code pertaining to levels of optimization beyond what we would care about. So I think it could be made smaller.

But one barrier to paring it down is that they do some pretty weird stuff with linking inline functions. You can't really #ifdef things out very easily in a way they'd be happy with if you hacked it into the code directly. So it would have to be done by some automated process (like how %make-zlib.r processes the github repository into a single header and single implementation file).

How Does it Work?

The data structure is very simple. Each roaring bitset is a list of:

  • arrays - just some sorted list of numbers
  • runs - starting value and a length
  • bitmaps - a set of bits representing numbers starting at a certain point (like today's BINARY!-based bitset, but shifted to start at some index)

We might imagine this represented in Rebol as a list of BLOCK!s, PAIR!s, and INTEGER! followed by a BINARY!

[2 5 9] 15x5 1000 #{8F}

So let's say that's a bitset with the values 2, 5, 9, 15-19, 1000, and 1004-1007. (8F is 10001111 in binary, so that's 5 set bits at the corresponding positions relative to 1000.) We could imagine this as a wordier dialect as well:

[2 5 9] 15 thru 5 #{8F} at 1000
2 5 9 15 thru 5 #{8F} at 1000
2 5 9 (15 thru 5) (#{8F} at 1000)  ; ...etc.

The point is that the representation isn't anything profound. All the interesting work is in the algorithms, for efficiently unioning these things and xor'ing or negating them. The C implementation painstakingly handles optimization combinatorics during the operations... though it seems that won't guarantee you the best results. For truly maximum compression, it seems you have to call a compaction function explicitly.

Should We Use It?

One main finding is this:

The C code has low-enough dependencies and enough people invested in it that it seems far wiser to find ways to improve and adapt it--and get those changes taken back--than to try and develop an independent codebase for these features.

So I think we should scrap any concept of implementing things like union or XOR on negated bitsets as some parallel evolution.

One thought I have is that there could be independent extensions providing bitset-simple and bitset-roaring. If you build with bitset-simple, then bitsets work as they do today, implemented with a BINARY!... and negated bitsets give errors if you try to do anything with them besides simple searching.

It might not be hard to hack up a bitset-roaring experiment... just to get some forward-looking experience with the idea. I'm noticing patterns similar to how integrating bignums will need to work... and since arbitrarily-sized INTEGER! is increasingly an obvious non-negotiable for a language of this type, answers need to be ready for that rather than having to redesign the entire system.


You're the best person to understand what it takes to pull this off. My only thoughts are about priorities and when it might be the best time to dig into this (as opposed to finishing stackless, etc.). But if this is an integration which offers a refreshing change of scenery, of course go for it.

So I have this hanging around on a branch, which I've rebased to bring up to date. It's more or less working.

Earlier in the year, I contributed a significant number of changes to the library, and it's amenable to our use.

There is absolutely no reason to try and write this feature as an independent project. We'd use the same approach, and just have to reinvent the algorithms. It has been hardened by a lot of tests and deployment--so even though some of the code is a little scrappy and hard to follow, it delivers the goods.

Among my patches was making it possible to build as C++, which is provably useful for static analysis. That also meant I got rid of the tagged-struct dependency.

We are no longer worrying about building the core as C89, and accept C99. (The API still can be used by C89 programs.) This may mean we move to using declarations inside for loops as well. It's another one of those "not hard to preprocess out" things:

for (int x = 0; x < 10; ++x) { ... }

{  /* to be generic, wrap in a scope so `x` doesn't conflict with other loops */
    int x = 0;
    for (; x < 10; ++x) { ... }
}

They Have A Single-File Form

I'd missed that they had a single-header and single-c file form of the project. Partially because it's called the "Unity Build" and Unity means something else in my programming world.

GitHub - lemire/CRoaringUnityBuild: Dumps of CRoaring unity builds (for convenience)

That really pushes the "we wouldn't do better than this" over the top.

So... I Think We Should Push This Through

Well, what's done is done...and this is close to done.

It will make the executable larger, but, like I've mentioned... it doesn't take much to make a gigantic bitset when the implementation is a binary. The size of the code will pale in comparison to a bitset with a high bit set.

I'm going to push it through and merge the branch.