Originally posted on LinkedIn.
Today I’d like to explore the foundational concept of a cryptographic hash function. Let’s start by establishing a clear definition. Here’s a concise explanation, generated by ChatGPT:
A cryptographic hash function is a mathematical algorithm that takes an input (or “message”) of any size and produces a fixed-length output, usually written as a string of hexadecimal digits. The key properties of cryptographic hash functions are:
- Deterministic: For the same input, a cryptographic hash function will always produce the same output.
- Fast Computation: The hash function should be computationally efficient, meaning it can quickly produce the hash value for any given input.
- Pre-image Resistance: It should be computationally infeasible to reverse the hash value to determine the original input (i.e., given a hash value, it should be nearly impossible to find the input that produced it).
- Small Changes in Input Result in Significant Changes in Output: A small change in the input should produce a significantly different hash value (avalanche effect). This means that even a one-bit change in the input should result in a substantially different hash value.
- Collision Resistance: It should be computationally infeasible to find two different inputs that produce the same hash value. Collisions must exist, because there are more possible inputs than outputs, but nobody should be able to find one in practice.
- Pseudorandomness: The hash values should appear random and uniformly distributed, even though the hash function is deterministic.
- Fixed Output Length: Regardless of the size of the input, the hash function produces a fixed-size hash value.
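To make a few of these properties tangible, here is a minimal sketch using SHA-256 from Python's standard `hashlib` module. The helper `differing_bits` and the sample messages are only illustrative choices, not part of any particular library.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


msg1 = b"hello world"
msg2 = b"hello worle"  # 'd' (0x64) and 'e' (0x65) differ in exactly one bit

d1 = hashlib.sha256(msg1).digest()
d2 = hashlib.sha256(msg2).digest()
big = hashlib.sha256(b"x" * 1_000_000).digest()

# Fixed output length: 32 bytes for an 11-byte input and for a 1 MB input.
print(len(d1), len(big))  # prints "32 32"

# Determinism: hashing the same input again gives the same digest.
print(hashlib.sha256(msg1).digest() == d1)  # prints "True"

# Avalanche effect: a one-bit input change flips roughly half of the 256 output bits.
print(differing_bits(d1, d2), "of 256 output bits differ")
```

Pre-image and collision resistance cannot be demonstrated this way; they are claims about what an attacker cannot compute in practice.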
The definition may seem a bit dry, so let’s now explore how this concept comes to life in a practical context.
In Cinode, each piece of information is housed within a data blob. Each blob is given a unique identifier, which essentially serves as the data’s address. What’s intriguing is how these identifiers are generated: through a specific cryptographic hash function called SHA-256. For static blobs, the input message of SHA-256 is the content of the blob itself, while dynamic links rely on a combination of the public key and a nonce to pinpoint a specific instance of the dynamic link.
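The two naming schemes can be sketched roughly as follows. This is not Cinode's actual implementation: the function names, the hex encoding of the result, and the way the public key and nonce are combined (length-prefixed concatenation) are all assumptions made for illustration.

```python
import hashlib


def static_blob_name(content: bytes) -> str:
    """Hypothetical static blob name: SHA-256 over the blob content."""
    return hashlib.sha256(content).hexdigest()


def dynamic_link_name(public_key: bytes, nonce: bytes) -> str:
    """Hypothetical dynamic link name: SHA-256 over public key and nonce.

    Each field is length-prefixed (an assumption, not Cinode's format) so
    that different (key, nonce) pairs never produce the same hash input.
    """
    h = hashlib.sha256()
    for field in (public_key, nonce):
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()


# Placeholder values, not real key material.
example_key = bytes(range(32))
example_nonce = b"\x00" * 8

print(static_blob_name(b"some blob content"))
print(dynamic_link_name(example_key, example_nonce))
```

Note the difference: a static blob's name changes whenever its content changes, while a dynamic link's name depends only on the key and nonce, so the content behind it can be updated without changing the name.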
What are the advantages of using cryptographic hashing functions for blob identifiers in Cinode? Let’s explore the benefits:
- Fixed Output Length: Or to be more precise, a short fixed-length output. Regardless of the input message’s size, which can even be gigabytes of data, the resulting blob name remains compact and manageable.
- Collision Resistance: Because finding two inputs with the same SHA-256 hash is computationally infeasible, we can treat blob names as globally unique for all practical purposes. Anyone worldwide can generate their own blobs without worrying about name collisions.
- Determinism: With the same input, you always get the same hash function output. In Cinode, this leads to natural deduplication. Even if two blobs with identical content are uploaded independently, they produce exactly the same hash, so a blob datastore only needs to store one copy. To see how efficient this can be, check out my blog post on maps in Cinode, where the data is reduced by 90%.
- Pseudorandomness: Hashing functions produce uniformly distributed blob names, even for very similar input messages. This uniform distribution makes it easy to shard and distribute data evenly, and Cinode leverages this to optimize on-disk layout within the Cinode datastore.
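Deduplication and sharding both fall out of content-based naming almost for free. The toy in-memory store below illustrates the idea; the `BlobStore` class, its shard count, and the prefix-based shard selection are hypothetical and do not reflect how the Cinode datastore is actually laid out.

```python
import hashlib


class BlobStore:
    """Toy in-memory content-addressed store (illustrative only)."""

    def __init__(self, shard_count: int = 16) -> None:
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, name: bytes) -> dict:
        # The leading bytes of a SHA-256 digest are close to uniformly
        # distributed, so they spread blobs evenly across shards.
        index = int.from_bytes(name[:4], "big") % len(self.shards)
        return self.shards[index]

    def put(self, content: bytes) -> str:
        name = hashlib.sha256(content).digest()
        # Identical content maps to an identical name, so a second upload
        # of the same content does not create a second copy.
        self._shard_for(name).setdefault(name, content)
        return name.hex()

    def get(self, name_hex: str) -> bytes:
        name = bytes.fromhex(name_hex)
        return self._shard_for(name)[name]

    def blob_count(self) -> int:
        return sum(len(shard) for shard in self.shards)


store = BlobStore()
first = store.put(b"same content")
second = store.put(b"same content")   # independent upload of identical data
other = store.put(b"other content")

print(first == second)     # prints "True"
print(store.blob_count())  # prints "2"
```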
Blob names are just one example of the versatile applications of cryptographic hash functions within Cinode. They are also harnessed for tasks such as:
- Generating encryption keys: Here, the deterministic nature of hash functions helps mitigate the risk of attacks that exploit a weak source of randomness.
- Creating digital signatures: The streaming capability of the SHA-256 function allows for a reduction of temporary storage.
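Both ideas can be sketched with Python's standard library. The code below is an assumption-laden illustration, not Cinode's scheme: `derive_key`, its `context` label, and the choice of HMAC-SHA-256 over the blob content are hypothetical, and the actual signing step is omitted because it would need a signature library.

```python
import hashlib
import hmac
import io


def streaming_sha256(stream, chunk_size: int = 64 * 1024) -> bytes:
    """Hash a stream chunk by chunk without holding all of it in memory."""
    h = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.digest()


def derive_key(content: bytes, context: bytes = b"example-key-derivation") -> bytes:
    """Hypothetical deterministic key derivation from blob content.

    No random number generator is involved, so a weak randomness source
    cannot weaken the key. The context label separates this use of the
    hash from others (an assumption for this sketch).
    """
    return hmac.new(context, content, hashlib.sha256).digest()


data = b"example payload " * 10_000

# Streaming in small chunks gives the same digest as hashing in one call,
# so only the 32-byte digest has to be kept around for signing.
digest = streaming_sha256(io.BytesIO(data), chunk_size=4096)
print(digest == hashlib.sha256(data).digest())  # prints "True"

key = derive_key(data)
print(len(key))  # prints "32"
```

A signature scheme would then sign the 32-byte digest rather than the full data, which is where the saving in temporary storage comes from.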