BITSET! has a very simplistic implementation. They are essentially a BINARY! that's as many bits long as it takes to represent the set bits. Because they don't have any form of compression, that means setting a single bit for a large number will extend the size of the binary... e.g. if you set the millionth bit you will have (1 million bits / 8-bits-per-byte) bytes. That one bit just cost you 122kb.
(Note that there are 1,114,112 codepoints in unicode, so if you want to talk about whitespace characters in a unicode-aware way that might create problems. This kind of thing has contributed to reluctance to build in character set classes.)
There's one twist to bitsets to cut down on storage, and that's negation. If you NOT a bitset then it will not change the bits inside of it...but just remember that the set is negated, and give you the negative answer.
>> bits-two: make bitset! [2]
== make bitset! #{20}
>> find bits-two 1
== none
>> find bits-two 2
== true
>> bits-not-two: negate bits-two
== make bitset! [not bits #{20}] ; not a huge bitset of all negated options
>> find bits-not-two 1
== true
>> find bits-not-two 2
== none
That seems nice... in theory. However, the bitset negation never accounted for set operations like EXCLUDE and friends. They're all buggy:
>> bits-nothing: exclude bits-not-two make bitset! [1]
== make bitset! #{20}
>> find bits-nothing 1
== none
>> find bits-nothing 2
== true
The code doesn't honor the negation bit, so all it did was clear it on the result...and clear out 1 (that wasn't there in the data being negated), leaving the 2.
Red has the same kinds of problems (in this particular case it thinks the bitset is #{60) as a result of the exclusion.)
This is a Non-Trivial Problem
Carl wrote in http://www.rebol.net/r3blogs/0114.html:
Furthermore, we will need to consider the various results of logical operations performed on bitsets, including AND, OR, XOR, but also UNION, INTERSECT, DIFFERENCE, and EXCLUDE. These become just a bit more complicated internally to get the desired results. No pun intended.
But it's not just a bit more complicated, it's a lot more complicated. When you think about it, it's really hairy to implement set operations like EXCLUDE or XOR unless you are willing to flatten out the negated set...which would mean a lot of bits. Otherwise you're maintaining a lot of little patches and there's tons of issues about compaction (plus opportunities for mistakes).
Is the Solution "Roaring Bitmaps"?
This problem is not new, so people have tackled it...and there's a popular library called "Roaring Bitmaps":
https://roaringbitmap.org/about/
They're addressing exactly this problem, and they have C code for it. They claim to need C11, but skimming it I think it's probably something that could be avoided... they seem to be using stdbool and stdint (for which we have shims) and __restrict__
which we could define away as a no-op if it's not available.
It's a somewhat heavyweight implementation compared to the "simple" concept of just using a thin wrapper around a BINARY!. But I think it's easy to say that the simple concept is not fit for purpose.