Checksums and Secure Hashes in Ren-C

hostilefork · May 9, 2022, 4:04am

I thought I'd write up a quick post about algorithms that take in a bunch of bytes--like a BINARY!--and produce a (usually) shorter "fingerprint" of that data.

It's not hard to make an insecure hash...like taking the first couple of letters of a string: "DICKENS" => DI. That might be useful if you're trying to divide book authors into drawers of an old-fashioned library card catalog.
But it's easy to find another string that would make the same "DI" fingerprint, like "DIABOLICAL" or "DIR^/FORMAT C:"
If you were using a fingerprint to trust something was what it was supposed to be, the easiness of generating these "collisions" is a liability!!
Secure hashes do complicated "one-way functions" in math to make it really hard on today's computers to fabricate any binary sequence--even garbage--that will give you the same fingerprint as any other input.

You've certainly seen web pages that redirect you off to download a file, but beforehand give you some bytes of what the file should securely hash to. _{(And you've probably never checked to make sure they match.)}

If the Fingerprint is Shorter than the data, COLLISIONS EXIST!

You obviously can't give every 500mb file in the universe a unique fingerprint that's 32-bits, 128-bits, 512-bits, 2048-bits etc.

So the concept behind a secure hash is just supposed to be you can't find a collision until we're all dead and no one cares. Less grimly: we might pair a hash with an expiration date to say not to trust a hash after a very conservative guess at how long it would take for computers to advance enough to break it.

When this hope falls apart (e.g. if any researcher can show two inputs that generate the same fingerprint) it's generally considered that the "secure" part of that hash is broken...for the purposes of whatever time-bubble you're living in.

We Have Four Secure Hashes in the Box Right Now...

It shouldn't be surprising that longer fingerprints correlate with being harder to find collisions:

>> checksum/method {DICKENS} 'md5
== #{3717A787E2F16310EA51DC0308E88803}  ; considered vulnerable

>> checksum/method {DICKENS} 'sha1
== #{DFE5DD61B2B19C319DBC4F44328CDF8D24366F88}  ; considered vulnerable

>> checksum/method {DICKENS} 'sha256
== #{52A095CF1F0319EF44FF9134AAD2EF5E2BFE1A48307DCB0AD0408F1CE393C950}

>> checksum/method {DICKENS} 'sha512
== #{
E8DBC26DE28FAA1BC2A6A3E1BC6DD22C1ECB3FD0D5FCBBFB69BC63C6AAC6A9CE
FB76294EFFB0522D4C90A5E5829233FC5BC5B811AE6684A6EA632ECB3FF88DA1
}

>> checksum/method {DICKENS} 'ripemd160         ; bitcoin uses this one
== #{ADC86945BE4CEF31F0CFCAF66775E1DA5160F877}  ; ...no one knows why

But longer isn't intrinsically better...there might be a weakness to exploit in the method used by a longer hash which isn't present in shorter ones. However, a bunch of mathematicians look at this stuff and we would generally hope that we'd be getting what we pay for--more bytes meaning more security (unless they're from the NSA and trying to punk us).

We Also Have Three Insecure Hashes...

These are fine when you're doing something that is not supposed to be protecting against adversarial attacks. They're smaller to store and much cheaper to calculate...but it's trivial to find other input data that would produces the same 32-bit result:

>> checksum/method {DICKENS} 'crc32
== #{FB05F0BC}  ; used to verify gzip files (among other places)

>> checksum/method {DICKENS} 'adler32
== #{0202D207}  ; used by zlib deflate and inflate

If you want a really cheapskate insecure checksum, we include the one that is in TCP packets on the internet:

>> checksum/method {DICKENS} 'tcp
== #{CA32}

Fun Corner: Let's find a collision!
>> until [
    string: copy {}
    repeat (random 10) [append string make char! 64 + random 26]
    #{CA32} = checksum/method string 'tcp
]
== #[true]

>> print string
AOGTDDN

>> checksum/method {AOGTDDN} 'tcp
== #{CA32}
That wasn't so hard, but... if you can find collisions for any of the secure hashes above (even the relatively-weak MD5 or SHA1) you will be famous!

We're Primed To Make STREAMING Secure (or Insecure) Hashes !

R3-Alpha did not have fancy modern hashes like SHA256 or SHA512. But also the code it used was copy-pasted out of a library that required you to have all the data at once.

Hence if you have a multi-gigabyte DVD .iso file that you want to checksum, you have to read that into a multi-gigabyte BINARY! to process it.

But since Ren-C is leveraging the cryptography of the pure C library known as mbedTLS, the foundations are there to stream in little blobs at a time...and it is generalized so we can just flip on or off any hashes we care about. If we only knew how to express streaming with PORT!s (or whatever).

And also, the CRC32 and ADLER32 algorithms we have "for free" by including Zlib are now set up in a way that the insecure hashes can be streamed too.

"How Does This Tie Into The Web Repl Story"?

I really want to have a laser focus on whether investing effort into something is going to be something that pays off or not, and to me a part of that payoff question is "will people using the web build care".

Right now the web build does not include the CHECKSUM function at all.

That is too bad, because I do believe that putting secure hashes into the hands of users at the Web Repl prompt (as well as other basic crypto parts) would be a great playground.

But it would be rather heavyweight to push secure hashing into the default .wasm being pulled down on every site if it didn't use it. This is why I really want to get "Wasm extensions" working, that can be dynamically loaded. There's something called "side modules" that I have meant to explore but haven't.

Anyway, I hope this summary gives a little insight into where this is at, and perhaps educational for those who don't have experience with the difference between secure/insecure hashing.

iArnold · June 9, 2022, 5:57am

And also if the contents of the file were corrupted to serve a malicious purpose, it the must be adapted to result in the same hash value while keeping the filesize the same as the original.

hostilefork · June 9, 2022, 6:04am

Requiring the length to be the same does add another layer of security. But it also requires communicating the length along with the hash as part of what you check.

You could give the first 10 bytes of the file and the last 10 bytes of the file, too. And that would be another layer of security to the check.

But if you take this to extremes, the site publishing the thing you should check about the file is publishing something as long (or longer) than the file itself.

So at least for the moment...I think the only thing that tools check for is the hash...just because it reduces the number of things to communicate that you have to check.