Originally posted on LinkedIn.

When we have to deal with large quantities of data, we need a good way to identify concrete pieces of information. In traditional web applications, this is usually realized through some kind of identifier in a database: each user, product, setting, or any other data entity is assigned an ID, such as a random value (e.g. a UUID) or a sequential number. Such an ID is then used to uniquely identify and look up the data.

Instead of generating identifiers randomly, we can derive the identifier from the data itself. Let’s take an example: a database of song lyrics could take the first verse of each song and use it to identify the whole song. This approach would work as long as no two songs share the same first verse.

To avoid collisions, we can take advantage of a hashing function: a method of transforming content into a short digest that, for a good cryptographic hash, is unique in practice. That digest can then serve as the identifier. This property forms the basis of Content-Addressable Storage, abbreviated as CAS, which I’d like to briefly describe today.
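
As a minimal sketch (in Go; the names and structure here are mine, not taken from any particular system), content addressing can be as simple as keying an in-memory store by the hex-encoded SHA-256 digest of each blob:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// A minimal in-memory content-addressable store: the identifier of
// each blob is the hex-encoded SHA-256 digest of its content.
type CAS struct {
	blobs map[string][]byte
}

func NewCAS() *CAS {
	return &CAS{blobs: map[string][]byte{}}
}

// Put stores the content and returns its content-derived identifier.
func (c *CAS) Put(content []byte) string {
	sum := sha256.Sum256(content)
	id := hex.EncodeToString(sum[:])
	c.blobs[id] = content
	return id
}

// Get looks content up by its identifier.
func (c *CAS) Get(id string) ([]byte, bool) {
	content, ok := c.blobs[id]
	return content, ok
}

func main() {
	cas := NewCAS()
	id := cas.Put([]byte("Is this the real life? Is this just fantasy?"))
	fmt.Println("ID:", id)

	if content, ok := cas.Get(id); ok {
		fmt.Println("Content:", string(content))
	}
}
```

Storing the same content twice yields the same ID, and any change to the content yields a completely different one.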

You can find a pretty good explanation of CAS on Wikipedia, but let me bring up a few of its interesting properties.

  1. Content immutability in CAS

    Because the identifier is tightly coupled with the content, any change in the content means that we have to create a completely new object. This may sound like a huge limitation, but in certain scenarios it is actually a big advantage. Let me bring up one of the most widely used systems built on top of CAS principles: the Git version control system.

    On a lower level, everything in Git is stored as objects (blobs, trees, commits) identified by hashes. Yes, those commit hashes displayed on GitHub are references to such objects. Since Git stores the whole history of changes, we naturally want to keep the old dataset and add data for new commits on top of it (as long as we don’t rewrite history, which I intentionally skip here).

    By building on a solid CAS foundation, Git preserves its history: new changes produce new object IDs, and protection against data overwrites is guaranteed by the CAS semantics of the storage layer.

  2. Only a storage layer

    In Git, there are other layers on top of CAS too. Blobs have to be interpreted in such a way that we can experience them as files. Git also needs some data that is not stored in CAS: configuration and references (branches, tags, and so on). Such additional information is necessary to make CAS-based systems useful in practice. When we talk about Git commits, we rarely use commit hashes. It’s more about branches and tags, which are human-readable entry points into objects stored in CAS.

    The same pattern can be seen in many other systems that rely on CAS as a storage layer. In BitTorrent, there are magnet links; in IPFS, there’s IPNS, the InterPlanetary Name System; and in blockchains, we’ve got things such as Namecoin.

  3. Great decentralization and scaling potential

    CAS-based storage systems usually use a strong cryptographic hash function, which guarantees a uniform distribution of hash values. A hash can be seen as just a huge number.

    For a better understanding, let’s imagine a simple chart with one axis. For a large number of input values, let’s draw points on that axis at positions representing their hash values. Those points would be spread uniformly along the axis: there must be no area where hash values cluster, as this is a requirement for a cryptographically secure hashing function.

    But we could also take only a portion of a hash value - for example, one or two bytes. The distribution of such truncated values would also be very uniform.

    A portion of the hash value can be used to distribute the data. Let’s say we have a large number of servers: 1000. A common technique is to take the hash value, compute the remainder of its division by 1000, and use the result to pick the server that hosts the data. That way, all input values are distributed evenly across all servers (a small sketch of this follows after this list). If we apply this approach in a CAS system, we can easily create huge storage systems.

  4. Great match for persistent data structures

    What are persistent data structures? The name may be a bit misleading: they are not about persisting data to disk. Instead, they are about the way modifications are applied. The main rule is that the old state is never altered; when applying a modification, unchanged portions of the old state are reused, and only small new pieces are added on top.

    A good example is text in an editor. When adding a new word or paragraph, the majority of the text stays unchanged, and only a small portion of the whole document needs to be updated.

    Does this sound familiar? This is exactly what Git does with its commit history. Yes, Git is essentially a huge persistent data structure representing a file tree over time. And if we look a bit deeper, the CAS layer in Git is a perfect tool for building persistent data structures.

    Each complex data structure consists of small portions of information: these can be stored in CAS, and references between pieces of data can be encoded using CAS IDs. When a change is applied, new data blobs with new IDs are generated, and the old state is referenced through already existing IDs (a sketch of this also follows below).
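
To illustrate the sharding idea from point 3, here is a small sketch, assuming SHA-256 identifiers and the 1000 servers from the example above; the target server is simply the hash, interpreted as a number, modulo the number of servers:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Assumed cluster size, matching the example above.
const numServers = 1000

// shardFor picks the server that should host the given content.
// Taking just the first 8 bytes of the digest is enough: for a
// cryptographic hash, any portion is still uniformly distributed.
func shardFor(content []byte) uint64 {
	sum := sha256.Sum256(content)
	prefix := binary.BigEndian.Uint64(sum[:8])
	return prefix % numServers
}

func main() {
	for _, s := range []string{"first blob", "second blob", "third blob"} {
		fmt.Printf("%q -> server %d\n", s, shardFor([]byte(s)))
	}
}
```

In practice, large systems often prefer consistent hashing over a plain modulo, so that adding or removing servers moves only a small fraction of the data.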

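And to make point 4 concrete, here is a minimal sketch of a persistent tree on top of such a store. Each directory node references its children by their content-derived IDs, so updating one file produces a new blob and a new root, while every unchanged entry is reused by its existing ID (the directory encoding below is ad hoc, purely for illustration):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// A tiny in-memory CAS: the ID of a blob is the hex SHA-256 of its content.
var blobs = map[string][]byte{}

func put(content []byte) string {
	sum := sha256.Sum256(content)
	id := hex.EncodeToString(sum[:])
	blobs[id] = content
	return id
}

// putDir stores a directory node that references its children by ID.
// The "ID name" line format is ad hoc, purely for illustration.
func putDir(children map[string]string) string {
	names := make([]string, 0, len(children))
	for name := range children {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %s\n", children[name], name)
	}
	return put([]byte(b.String()))
}

func main() {
	// Initial state: a root directory with two files.
	a1 := put([]byte("first version of a"))
	b1 := put([]byte("contents of b"))
	root1 := putDir(map[string]string{"a.txt": a1, "b.txt": b1})

	// "Modifying" a.txt creates a new blob and a new root,
	// while b.txt is reused unchanged, under the very same ID.
	a2 := put([]byte("second version of a"))
	root2 := putDir(map[string]string{"a.txt": a2, "b.txt": b1})

	fmt.Println("old root:", root1)
	fmt.Println("new root:", root2)
	fmt.Println("old root still readable:", blobs[root1] != nil)
}
```

This is, in miniature, what Git does with blobs, trees, and commits: every new revision adds a handful of new objects, and everything untouched is shared with the previous one.
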
Cinode as a CAS system, sort of

Cinode operates on two major categories of data: static blobs and dynamic links. If we focus solely on static blobs, these form exactly a CAS storage system. Each piece of information is stored in a blob, and that blob is named by hashing its content. By doing so, we can utilize all the advantages of a CAS storage system: we gain immutability, scalability, decentralization, history… all those benefits.

But, similar to branches in Git, pure CAS storage would be ineffective without some method to reference its data with stable anchors. In Cinode, this is achieved through dynamic links. These links can point to other data in Cinode, whether static or dynamic. Such pointers can be altered over time by an authorized publisher, and the change propagates to readers who are aware of the link.

Like static content, links are identified by an ID. In this case, it is constructed from a hash of the link’s unchanging identification data, which includes its public key and a nonce value. Both static blobs and dynamic links can thus share the same ID space. The unified ID carries information about both the hash and the type of the blob (static blob or dynamic link).
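
Purely as an illustration (the layout below is hypothetical and is not Cinode’s actual encoding), such a unified ID space could be built by prefixing each hash with a blob-type byte, so that static blobs and dynamic links can be told apart while living in one namespace:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Hypothetical type markers; the actual values used by Cinode may differ.
const (
	typeStatic      byte = 0x01
	typeDynamicLink byte = 0x02
)

// staticID names a static blob by hashing its content.
func staticID(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(append([]byte{typeStatic}, sum[:]...))
}

// linkID names a dynamic link by hashing its unchanging identification
// data (public key and nonce), so the ID stays stable even though the
// target the link points to may change over time.
func linkID(publicKey, nonce []byte) string {
	h := sha256.New()
	h.Write(publicKey)
	h.Write(nonce)
	return hex.EncodeToString(append([]byte{typeDynamicLink}, h.Sum(nil)...))
}

func main() {
	fmt.Println("static blob ID: ", staticID([]byte("some static content")))
	fmt.Println("dynamic link ID:", linkID([]byte("public-key-bytes"), []byte("nonce-0001")))
}
```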

By using a single space for all identifiers, the entire dataset can be easily distributed and sharded. Any piece of data can also be referenced through a single identifier, whether it is a static blob or a link pointing to one.