Blobs Anatomy - Dynamic Links - part 1
I made a lot of improvements on the hardware side recently. But finally it’s time to get back to designing and coding 😄. Today I won’t be showing something new. Instead I’d like to start a series of posts describing Cinode internals in great detail. I’ve been slowly but steadily working on this content for a few months already - let’s finally release it.
The first area covered will be dynamic links. This and the following posts should be a good source of information and also an interesting opportunity for me to further validate the whole model. I’ve already covered a lot of details about dynamic links in one of the previous posts where I was exploring the idea, so there will be some overlap. This time, however, I’ll be able to describe the design based on the experience I gained while working on the implementation. I’m also hoping to make the content a bit easier to understand and better structured, since it may become the basis for future documentation.
I’ll try to be as straightforward as possible, but I’ll also be honest: this post series may require some knowledge of cryptography. And since it will be a rather detailed description with lots of comments and reasoning about my decisions, I totally understand that some of you may find such content boring. But fear not, I still plan to interleave the “serious” content with “fun” stuff once in a while to bring extra positive vibes to this project 😉.
Now that you know what you’re signing up for, let’s take that deep dive into a dynamic link’s anatomy.
Public and private parts
As described in my previous posts, Cinode splits the processing of data into public and private layers. The public layer is meant to be processed “publicly” - at this level we only operate on ciphertext, but in such a way that invalid data is rejected. The ability to detect such bogus data and reject it before it’s decrypted is a crucial property of Cinode. Being at the core of the design, it allows unconstrained data propagation: the data spreads in encrypted form, which limits the ability to apply content-based filters, and yet we can still rely on the content being genuine. This core property is of course a major factor shaping all the validation algorithms.
Built on top of the public one, there’s the private layer that operates on the plaintext data. Access to this data is controlled through the availability of matching encryption keys. We won’t look at the private side today though. To keep this post at a reasonable length, I will only focus on the public layer and leave the rest for future updates.
Let’s start with a picture of all the data bits and pieces used there:
The top row of blocks in that picture represents the low-level serialization format of a dynamic link. This is the representation used by nodes to send dynamic links between themselves and would be found inside the data packets if one eavesdropped on the communication between them.
There are two main sections in a dynamic link’s data. The one shown on the left side cannot be changed for a given link instance and is tightly coupled with the link’s unique blob name. The one on the right side carries the dynamic data that can be updated.
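To make the layout easier to follow, here’s a rough Go sketch of the serialized fields in order. The struct and field names are illustrative only - they are not the actual types from Cinode’s implementation:

```go
// Illustrative sketch of the on-wire layout of a dynamic link;
// names are mine, not Cinode's actual Go types.
type dynamicLinkWire struct {
	// Unchanging part - hashed into the blob name:
	FormatVersion byte   // currently always 0x00
	PublicKey     []byte // ed25519 public key
	Nonce         uint64 // stored big-endian

	// Dynamic part - may change between link updates:
	Signature      []byte // ed25519 signature
	ContentVersion uint64 // stored big-endian
	IV             []byte // stored with a 1-byte length prefix
	EncryptedData  []byte // everything up to the end of the blob
}
```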
Unchanging part of the dynamic link
Format version prefix
A link’s data starts with a fixed 0x00 byte identifying the link’s format version. It is a safety fuse reserved for future upgrades and currently it can only be 0. This format value dictates the interpretation of the rest of the data, so it must sit at a precisely defined position - the beginning of the data is the obvious choice here (we could put it at any fixed offset, but I don’t see any advantage of an offset other than zero). If there’s ever a need to change the format of the link for whatever reason (e.g. to introduce a different public/private key scheme), that byte would have to be changed.
The blob name that uniquely identifies the link relies on the format version, so it won’t be possible to update the link’s format without generating new blob names. That was done intentionally - a different format version would mean a different link structure, maybe another asymmetric encryption scheme. When we deal with such a change, leaving any chance of producing the same blob name for different format versions would most likely be exploitable. Adding an explicit version byte to the set of inputs for name generation completely mitigates that risk.
Summary:
The format version prefix allows updating the dynamic link format in the future. Without such a prefix, changing the link’s format would become troublesome.
Public key
Right after the reserved byte we can see the ed25519 public key. You may wonder why ed25519, since there are so many asymmetric key schemes out there. I’ve chosen a fixed scheme to simplify the initial design and implementation - having a single algorithm here means that we don’t have to deal with the variability of different keys: variable key sizes, different signature formats, different sets of supported keys across implementations, etc.
And why ed25519? It is a well-known and battle-tested algorithm with many practical applications: TLS, Signal and Tor, just to name a few. It is also a deterministic signature scheme - it does not require a high-quality random source and will always produce the same signature for the same input. Since no random source is needed during signing, a whole class of problems related to weak randomness goes away: it is easier to generate signatures on low-spec devices that may not have access to a high-quality random source, and there’s no need to be concerned about attacks through a purposely weakened random source.
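A minimal demonstration of this determinism, using Go’s standard library (only key generation consumes randomness):

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Key generation is the only step that needs a random source.
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)

	msg := []byte("dynamic link payload")
	sig1 := ed25519.Sign(priv, msg)
	sig2 := ed25519.Sign(priv, msg)

	fmt.Println(bytes.Equal(sig1, sig2))        // true - signing is deterministic
	fmt.Println(ed25519.Verify(pub, msg, sig1)) // true - signature is valid
}
```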
Summary:
This field contains the main public key associated with the dynamic link.
Nonce
Following the public key there’s an additional 64-bit nonce value (big-endian format). The nonce is still in the unchanging part of the link, which means that different nonce values produce completely different dynamic links. What is it for, then? In my initial design I did not have this value at all. But then I realized that a key pair is like an identifier of the publisher. Without such a nonce, a given public key could only be used to generate a single dynamic link - equivalent to a single piece of content. Not quite right if the publisher would like to publish a lot of different things using a single identity.
With the nonce, though, we can produce many different dynamic links that all use the same public-private key pair. Because the same key pair is reused, it is provable that the same identity created them. It may also come in handy when the private key is stored inside a hardware device: such a device only has to manage a single private key for many links.
Of course, the presence of a nonce value does not rule out a single person using different key pairs - there are security concerns to take into account here. If a given key pair is reused between different links and the private key accidentally leaks, all those links must be considered compromised. With the nonce we gain flexibility and can adjust key reuse to the required security level.
Summary:
This field contains an additional nonce value that can be used to distinguish between different links sharing the same public/private key pair.
Blob name and validation of the unchanging link part
The blob name of a dynamic link is created by hashing all the unchanging information with the SHA-256 function. The hash itself is not enough to identify the blob, though. That’s because static blobs use the same hashing function to generate a blob name, but with the blob content as input. One could come up with a collision by exchanging a static blob’s (encrypted) content with the serialized unchanging part of a dynamic link. That’s why the blob name also carries information about the blob type.
Summing up the unchanging part: a dynamic link with a given blob name is tightly coupled with 1) the format version of that link, 2) its public key and 3) the nonce value. Indirectly, it is also tightly connected to the private key that matches the public one.
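A hedged sketch of how the name derivation could look in Go - the blob-type constant and the exact concatenation order are my assumptions based on the description above, not Cinode’s actual code:

```go
package dynamiclink

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
)

const blobTypeDynamicLink = 0x01 // illustrative value, not Cinode's real constant

func dynamicLinkBlobName(formatVersion byte, pub ed25519.PublicKey, nonce uint64) []byte {
	h := sha256.New()
	h.Write([]byte{formatVersion}) // the 0x00 format version byte
	h.Write(pub)                   // the ed25519 public key
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], nonce)
	h.Write(n[:]) // the 64-bit nonce, big-endian

	// The name also carries the blob type so that a dynamic link can
	// never collide with a static blob named after its content hash.
	return append([]byte{blobTypeDynamicLink}, h.Sum(nil)...)
}
```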
Part of the dynamic link’s validation logic is recalculating the blob name from its data and comparing it with the intended blob name. That way a node ensures that someone does not try to fool it by sending data of a completely unrelated link. Once the unchanging part of the link is confirmed to be valid, we can, for example, rely on the genuineness of the public key, which is critical for the validation of the dynamic part of the link.
Summary:
Deriving a dynamic link’s blob name from the link’s type and all the unchanging parts of the link guarantees the genuineness of this static information associated with the link.
Dynamic part of the link
The dynamic part of the link is where the encrypted link data is located. It is called “dynamic” because a link with the same blob name can change this part over time and still be valid. Let’s take a look at the raw dataset then.
Signature
The changing section starts with an ed25519 signature computed using the corresponding private key. What data is being signed is detailed later in this post, but for now let’s assume that it effectively proves the authenticity of the whole link dataset - both the unchanging and the changing part. The signature is located at the beginning of the dynamic dataset, which may look a bit odd - we’re used to signatures placed at the end of the data. But this way we can easily put the encrypted dataset at the end and process it in a streaming way.
Summary:
The signature guarantees the genuineness of the whole link data, both the unchanging and changing parts, leaving no room for any information within the link that does not go through the blob’s validation.
Content version number
Right after the signature there’s a 64-bit unsigned integer (big-endian format) representing the version of the link. As you may remember from previous blog posts, Cinode requires a forward progress rule to be defined for each blob type. That rule is used when we have two or more different datasets for the same blob name and it requires a deterministic way to compute a single merged value. In the case of dynamic links the rule is very simple: we select the data of the link with the highest version number (discarding the others), and if the version number is the same for different links, we select the link with the largest signature value (comparing signature bytes lexicographically).
There’s no single obligatory way to produce that version number. The only requirement is that it must be greater than the previous one whenever we want to publish a new version of the link. In practice, a good solution is to use the current unix timestamp with microsecond resolution, as sketched below.
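A sketch of the forward progress rule and the timestamp-based versioning in Go - the struct and helper names are illustrative:

```go
package dynamiclink

import (
	"bytes"
	"time"
)

type linkData struct {
	version   uint64
	signature []byte
}

// preferredLink deterministically merges two datasets for the same blob
// name: the higher version wins, ties are broken by comparing the
// signature bytes lexicographically.
func preferredLink(a, b linkData) linkData {
	if a.version != b.version {
		if a.version > b.version {
			return a
		}
		return b
	}
	if bytes.Compare(a.signature, b.signature) >= 0 {
		return a
	}
	return b
}

// newVersion returns the current unix timestamp with microsecond
// resolution - in practice greater than any previously published value.
func newVersion() uint64 {
	return uint64(time.Now().UnixMicro())
}
```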
Summary:
The content version number is the main mechanism guaranteeing controlled conflict resolution when two versions of the link are confronted with each other.
IV (Initialization Vector)
Next we have the IV - the initialization vector used for encrypting the link’s data. To quickly recap some cryptography theory: in all practical applications, symmetric encryption requires two values apart from the ciphertext to decrypt the data - the first is the secret encryption key and the second is the IV. Only the encryption key has to be secret; the IV does not, so we can safely put it in the public data. Even though it is public, it must be different for every plaintext encrypted with the same key, and in addition to that, it must be a pseudorandom value. I’ll write a bit more about the IV in another post, but for now let’s assume it fulfills all the necessary requirements.
The IV is stored with a 1-byte length prefix. That’s because at this point we don’t yet specify which symmetric encryption scheme will be used for the main blob data. In the first version there will of course be only one, but if we ever decide to extend it and allow some choice here, we will not have to update the dynamic link format to support other IV sizes.
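Parsing a length-prefixed field like this is straightforward - a minimal Go sketch:

```go
package dynamiclink

import "io"

// readIV reads the 1-byte length prefix followed by the IV itself.
func readIV(r io.Reader) ([]byte, error) {
	var prefix [1]byte
	if _, err := io.ReadFull(r, prefix[:]); err != nil {
		return nil, err
	}
	iv := make([]byte, prefix[0]) // a single length byte caps the IV at 255 bytes
	if _, err := io.ReadFull(r, iv); err != nil {
		return nil, err
	}
	return iv, nil
}
```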
Summary:
The IV, along with the secret key (not present in the public dataset), allows secure encryption of the link’s data and must be different for different links to fully protect the ciphertext.
Encrypted link data
At the end of the dataset there’s the encrypted data of the link. It does not have a predefined length and can thus be of any size - we just consume all the remaining bytes of the link’s dataset. In fact, the way the link format is designed allows processing the data in a streaming way, slowly digesting the input stream and decrypting it on the fly. Currently a dynamic link should only point to some other blob, so the length of the encrypted data should be small. For that reason the current Go implementation does not process the link in a streaming way, which keeps the code a bit simpler. But the construction does not rule out the possibility of storing huge datasets - e.g. a whole video a few GB in size - in the link’s data.
Summary:
The encrypted link data contains the essential link information in encrypted form.
Calculation of the signature
Hash indirection
The signature must prove the authenticity of the whole link. But there’s one indirection added here: we’re not signing the link’s data directly - we first calculate a hash of that data and sign that hash instead. Such a signature transitively authenticates the hashed data too.
That may look like unnecessary complexity - if you look into the implementation of the ed25519 scheme, you can see that it internally uses hashing functions too. But there’s a concrete reason for it: the ability to process a dynamic link’s data in a streaming way. The internal construction of ed25519 requires two separate passes over the source data - that’s what makes ed25519 deterministic. But it means the input data has to be processed twice, which rules out streaming without rewinding the stream back to the beginning of the data.
Now if you take a look at where Cinode will be used - mostly data transfers between nodes, which could be bi-directional but could also be one-way - streaming is necessary there to avoid keeping the whole dataset in memory or storing it on disk. Hashing functions, contrary to ed25519, can be calculated over data processed in a streaming way while the data is being transferred and decrypted. We tee such a data stream into the hashing function and do further processing in parallel. Once the data stream ends, we can calculate the final hash value of the whole stream and validate that hash by checking its signature - all without the need for a large memory buffer or temporary storage. We can thus have streaming processing and deterministic ed25519 at the same time - a clear win-win situation.
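Here’s what that pattern could look like in Go, assuming the standard library: the stream is teed into a SHA-256 hasher while being processed, and the signature is checked against the final digest.

```go
package dynamiclink

import (
	"crypto/ed25519"
	"crypto/sha256"
	"errors"
	"io"
)

func verifyWhileStreaming(
	pub ed25519.PublicKey,
	signature []byte,
	data io.Reader,
	process func(io.Reader) error, // e.g. decrypt-and-store on the fly
) error {
	h := sha256.New()
	// Every byte consumed by process() is also fed into the hasher.
	if err := process(io.TeeReader(data, h)); err != nil {
		return err
	}
	// Only the fixed-size digest is verified - no buffering required.
	if !ed25519.Verify(pub, h.Sum(nil), signature) {
		return errors.New("dynamic link signature mismatch")
	}
	return nil
}
```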
Summary:
Instead of calculating the signature over the data directly, we sign the hash of that data to enable streaming.
Signed dataset
One more thing left to decide here is what exact information we’d like to protect with the signature. It is important to remember that, overall, the whole link dataset must be validated. So far, through the validation of the blob name, we were able to ensure that the link’s unchanging data is genuine. The dynamic part is not yet protected in any way, so let’s focus on the dynamic components for a moment. Of course we cannot sign the signature itself, so this piece of information will not be included. What’s left is the following: the content version, the IV and the encrypted link data. But it is not enough to hash only this part of the dynamic link and claim success. Below I’ll explain why.
Data type indicator
At the beginning of the data to be hashed we also add two more sections. The first one is a single-byte prefix - a fixed value of 0x00. Why do we put it here? In the later phases of working on the design of dynamic links I realized that we may want to create signatures using the same private key not only for the data of dynamic links but for other things as well. And here things become a bit fragile.
As soon as we start talking about different signature purposes, there’s one crucial property to enforce. Since the signed data is just a sequence of bytes, we must ensure that different types of data will never be represented by the same byte sequence. Otherwise an attacker could somehow request a signature over data of one type and use that signature to authenticate data of another type. Whether this is possible of course depends on how the different signatures are used. But to avoid any possibility of exploitation we must add some protection against it, and a single prefix specifying the type of the data is enough in this case.
Summary:
Each data type signed with the link’s key must have a unique prefix to avoid exploitable collisions with other data types.
Signing blob name
We also include the blob name in the source dataset used to generate the signature. Without the blob name here (or something serving the same purpose) we would have a critical vulnerability in the whole signing scheme. Can you think of why it is needed?
The main reason is to ensure that the signature covers the nonce value from the unchanging link data. The nonce is what allows using the same key pair for different blob names. If it were not anywhere in the signed input, an attacker could take the dynamic link data from one blob name and push that data onto a link with another blob name if those shared the same key pair.
Instead of just the nonce, we use the whole blob name in the signature. That covers not only the nonce differentiation but may also protect against other vulnerabilities in the future. Because the whole blob name is signed, changing anything in the link - whether it’s just the blob type or the asymmetric key algorithm - would affect the signature. That effectively prevents any reuse of a signature for a different blob name.
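Putting the pieces together, here’s a hedged Go sketch of assembling the to-be-signed hash. The 0x00 data type prefix and the inclusion of the blob name follow the text above, while the exact order of the remaining fields is my assumption:

```go
package dynamiclink

import (
	"crypto/sha256"
	"encoding/binary"
)

func signedLinkHash(blobName []byte, contentVersion uint64, iv, encrypted []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x00}) // data type indicator: dynamic link data
	h.Write(blobName)     // transitively covers format version, public key and nonce
	var v [8]byte
	binary.BigEndian.PutUint64(v[:], contentVersion)
	h.Write(v[:])      // the content version number
	h.Write(iv)        // the initialization vector
	h.Write(encrypted) // the encrypted link data
	return h.Sum(nil)  // this digest is what the private key actually signs
}
```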
Summary:
The signature includes the blob name to ensure that distinct blob names produce different signatures, even for the same link content.
Where’s the private key?
The private key is only needed to calculate the signature - there’s no other data flow in which it is accessed. Other than that, it could just be removed from the whole picture. As an example, take the public layer validation: only the public key is ever needed there, and that key is sent within the dynamic link data.
And that’s the way private keys are supposed to be used: they must only be accessible to the creator. And because the private key is so strictly isolated in the data processing pipeline, we may take advantage of that isolation in the implementation - e.g. the private key could be stored in an HSM or a YubiKey, devices that never leak the private key (at least not on purpose), and the signature could even be generated on a dedicated machine that is completely offline.
Summary:
The private key is not used outside of the link generation process and only signs the data; read operations never touch the private key.
Validation
In order to accept a dynamic link on the public layer, its data must pass a few validation steps. Let me sum them up here.
For each step I added a pseudo-code expression that must be true for that validation step. It uses a few shortcuts - e.g. concatenation of byte arrays is written with the || operator and all fields are treated as implicitly convertible to such byte arrays. There’s also a mysterious blob_name_from_hash function which I won’t present here. That function will be a good topic for a short post in the future, but for now you can find it in Cinode’s source code.
Step 1: Check the format version byte
Currently the format version byte must be 0. The other validation steps rely heavily on the format of the link, so if a different link format ever appears, the validation algorithm has to be updated too. Even if other, valid format versions are added in the future, an implementation that does not recognize a given version must reject it, because it won’t know how to ensure the correctness of the information. Similarly, if a critical vulnerability is found in some format version, that version should be rejected straight away to avoid any exploitation attempts.
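The check itself is trivial - using the conventions above:

```
format_version == 0x00
```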
Step 2: Validation of the blob name using the unchanging link part
Validation always happens in the context of a particular link with a given blob name. When a link’s data is ingested, it is important that the data belongs to that specific blob instance. The blob name is recalculated from the unchanging link data and must match the blob name from the context. That way we can be sure that the public key present in the link’s data is the correct one.
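A sketch of the check using the conventions above - how exactly the blob type enters blob_name_from_hash is my assumption:

```
blob_name == blob_name_from_hash(
    BLOB_TYPE_DYNAMIC_LINK,
    sha256(format_version || public_key || nonce)
)
```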
Step 3: Validation of the signature of the dynamic part
The signature present at the beginning of the dynamic part of the link protects the rest of the link. A link where the signature does not match must be rejected.
The validation can only be applied once the whole link dataset has been digested. That may be a limiting factor in some cases, especially when we try to process the link in a streaming fashion - e.g. it may force the node to do some unnecessary work just to figure out that the data is not trustworthy. The same issue happens in the case of static blobs though, and it hints that the network should use some kind of peer reputation system: peers that try to poison the network with bogus data should be kicked out quickly. That topic is a completely different part of the design though.
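A sketch of the check - the order of the hashed fields follows the “Signed dataset” section above and is partly my assumption:

```
ed25519_verify(
    public_key,
    sha256(0x00 || blob_name || content_version || iv || encrypted_link_data),
    signature
)
```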
See you next time
That would be it for today. We’ve covered pretty much all the data bits and pieces related to the public side of a dynamic link. The next topic in this deep-dive series will be the internal representation of the dynamic link’s public part.
Author BYO
LastMod 2023-08-16