Disclaimer: Tech heavy, no security guarantees

This will be a tougher post, covering many threads and details. The ideas I present may be hard to understand - apologies if something is hard to digest here, I’ll try to do my best. I believe, though, that the topic I cover here - dynamic links - is what makes Cinode fundamentally usable, and I want to materialize it as fast as possible.

Also please note that by designing dynamic links I’ve broken one of the most important rules of cryptography - I started building my own cryptographic constructions.

Because of that, I now claim that there’s a high probability of security flaws in Cinode. I will keep stating this as long as there is no formal verification of the encryption schemes I’ve created.

I hope to do such verification in the future, but it may not happen soon.

Now that you’ve been warned, let’s continue…

So far

It’s been a while since I worked on the design… but it’s high time to get back to work!

So far I discussed static blobs - how those can be stored in an encrypted form, how to generate keys for them, how to build more complex data structures like directories, and how to maintain access control by using encryption.

This seems like a solid base for even more complex systems but it lacks an important property - dynamism.

Why do we need dynamism?

Let me recall how the current test app works (the one that is hosting a copy of this blog). I’m writing the blog using Hugo. Once a new post is added and committed, Hugo builds a static set of files that can then be opened in a web browser. Those files are then processed to build a Cinode dataset consisting of static blobs reflecting the original file tree. Finally, a small Cinode proxy server attaches to the blobs generated earlier and serves the plaintext data over standard HTTPS.

But next to the set of encrypted blobs, the tool that generates the dataset also creates something that is not part of the Cinode data - information about the entrypoint, the root of the file tree. It is stored in a small file containing the blob name of the root directory and its encryption key. That is sufficient information to access the whole directory structure.

Of course this entrypoint information will change whenever the content changes. However, there is currently no built-in mechanism in Cinode where such entrypoint information could be stored. The entrypoint data must be given to the proxy server as external information, and each time we want to update it, the server has to be restarted with the new entrypoint.

That’s where the lack of dynamism in Cinode shows up. We’re missing something that would let us work with fixed entrypoint information - an entrypoint that can stay the same even if the underlying content is modified.

A good analogy would be the DNS system - when we access a web page, it is identified by its domain name. That name is then translated to an IP address, and this IP address is then used to establish the network connection to the server. If someone wants to migrate a web page from one machine to another, it can be done by just changing the DNS record with the associated IP address. End users accessing the web page don’t have to do anything (well, maybe wait a little bit for DNS propagation).

Another example would be branches in git - a branch can point to some commit and can easily be updated. We don’t have to pass git hashes all around (although hashes are still useful, since a git hash points to a very precise point in the commit history).

A dynamic link in Cinode looks similar to a symbolic link in Unix/Linux filesystems. Internally, such a symbolic link only contains a path pointing to another location in the filesystem. But when we try to access the data behind it, the OS automatically follows that path just as if the link itself was the file or directory it points to.

And the same happens with dynamic links in Cinode - but (as with everything there) it is based on an encryption mechanism: asymmetric cryptography.

Each link is based on an associated public/private key pair. This key pair does not change and is also used to uniquely identify the link itself. This is the static part of the link that is used as an entrypoint. The link also contains a dynamic part - data holding a reference to some other blob name in Cinode. Of course this information can not be publicly available, thus it must also be encrypted with an extra symmetric key, similarly to how static blobs protect their content.

Here is a rough picture showing that idea:

[Diagram: PUBLISHER region - private key; PRIVATE region - link encryption key; PUBLIC region - public key (producing the blob ID), encrypted link data and its signature.]

This is a very simplified view and I’ll gradually expand it with more details. But let’s first get a good overview of all the pieces of this puzzle.

Data access regions

First of all, there are three different data access regions: PUBLISHER, PRIVATE and PUBLIC. Those correspond to different access levels to the link. The levels are hierarchical - PRIVATE access contains the PUBLIC one, and the PUBLISHER one contains both the PUBLIC and PRIVATE levels.

PRIVATE and PUBLIC regions are what we’ve seen before when discussing static blobs.

The PUBLIC region deals with data that can be shared across public parts of the system - even if this information is leaked, it still can not be used to recover the plaintext link data. However, it must contain enough information to ensure that falsified data can not be injected. That way the network can ensure that only genuine information is propagated across nodes, even though it still lacks access to the plaintext link data. That’s the reason why the signature is calculated from the encrypted link and is publicly available next to the public key.

The PRIVATE region gives access to the plaintext link data - at this level one has access to the encryption key, which means one can recover plaintext link data from the encrypted form. Since the key is known, it is also possible at this level to create an encrypted link from a plaintext one. However, it is not yet possible to generate a valid signature, thus one can not produce new link content that will be accepted by the network.

The third region, not used so far - the PUBLISHER one - is granted to those who can publish new link data. It contains information that allows generating new valid content. Only at that access level do we gain access to the private key, which is necessary to generate a valid and acceptable signature for the encrypted link.

Do I trust you?

Because we’re using asymmetric cryptography here, the authenticity of the link is guaranteed by the signature.

Is it unbreakable? Well… math is pretty strong here. We have been using strong asymmetric cryptography algorithms for some time already, and so far they stand the test of time. Many of them are extensively used in TLS or cryptocurrencies and are still considered secure even though many people try to break them on a daily basis. But there are always human and technical factors that need to be factored in here.

The authenticity of the link is guaranteed as long as the private key used for signatures is kept secret. There are many ways to reveal or steal it - like just breaking into someone’s computer, or performing side-channel attacks (especially dangerous in shared VMs, thus in most of our cloud environments). There are also problems related to key rotation etc.

I won’t try to deal with all those problems now. Not at this stage of the project. For sure we need some key hierarchies or chains, methods to deprecate older keys, some key expiration etc. But let’s start with a simple model, test it in some running code and extend it later.

There’s a small detail that you may have already noticed here. The signature is calculated not only from the link data but also from the blob name. Why is this important? We could sign only the encrypted link content, but then such a signature could be reused for another link that was created using the same public/private key pair. Adding the blob name to the signed dataset means that the signature will only be valid for a single link.
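To make this concrete, here’s a minimal sketch in Go. The message layout and the use of Ed25519 are my assumptions for illustration - the real Cinode format may differ:

```go
package sketch

import "crypto/ed25519"

// signLink is a hypothetical helper showing the idea: the signed message
// covers the blob name as well as the encrypted link data, so the
// signature can not be replayed on a different link created with the
// same key pair. The concatenation format is illustrative only.
func signLink(priv ed25519.PrivateKey, blobName, encryptedLink []byte) []byte {
	msg := append(append([]byte{}, blobName...), encryptedLink...)
	return ed25519.Sign(priv, msg)
}
```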

The rule of forward progress

Now that the basics of the system are established, we can get into more details.

Let’s imagine that Cinode is a highly distributed system - something like a P2P network where all the data is shared among different nodes. In such a setup, different peers spread parts of the overall dataset between themselves. Many networks work that way - e.g. BitTorrent. In such networks, each piece of information has a unique ID assigned, and we can ask other peers of the network for the content behind such an ID.

In all those networks, including Cinode, we have to be very careful when trusting other nodes - received information must be validated and any attempt to inject falsified data must be stopped.

In the case of a static blob, the check is simple. The ID of a blob is the hash of its encrypted content. After retrieving the data, the receiver calculates the hash and compares it with the ID of the blob. The blob is rejected if the ID does not match the hash.
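A minimal sketch of that check, assuming SHA-256 stands in for whatever hash the actual blob naming scheme uses:

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// validateStaticBlob rejects a blob whose ID does not match the hash of
// its encrypted content.
func validateStaticBlob(blobID, encryptedContent []byte) error {
	sum := sha256.Sum256(encryptedContent)
	if !bytes.Equal(blobID, sum[:]) {
		return errors.New("blob rejected: ID does not match content hash")
	}
	return nil
}
```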

A similar check must also be done for dynamic links. This time, however, it is based on the signature. The ID of the blob is calculated from the public key and other publicly available data. Knowing the public key, the encrypted link data and the signature, we can check whether the signature is correct and whether the blob name corresponds to the given public key. Any attempt to falsify the data, such as sending a wrong public key or a signature that does not match, can be easily detected and rejected.
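And the corresponding verification side of the earlier signing sketch - again just an illustration of the flow, not the actual validation code:

```go
package sketch

import (
	"crypto/ed25519"
	"errors"
)

// verifyLink recomputes the signed message from the public data and
// rejects a link with an invalid signature. Checking that the blob name
// is really derived from the public key is left out for brevity.
func verifyLink(pub ed25519.PublicKey, blobName, encryptedLink, sig []byte) error {
	msg := append(append([]byte{}, blobName...), encryptedLink...)
	if !ed25519.Verify(pub, msg, sig) {
		return errors.New("link rejected: invalid signature")
	}
	return nil
}
```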

So far so good, but there’s one problem. Dynamic links change over time, and in a distributed network there will be conflicts to solve. While looking up and fetching the content of a given dynamic link, we could retrieve different links from different network nodes, so there must be a way to determine which content is the valid one. Being able to make such a selection is what I call the rule of forward progress.

Present, Past and the Future (so just common versioning)

Let’s now assume that there’s only one publisher for some dynamic link. That publisher will occasionally update the published content. Solving a conflict here is simple - the newer link should be preferred over the old one. But how do we determine which link is newer? Such information is not present in our current model, thus let’s add it:

[Diagram: PUBLISHER region - private key; PRIVATE region - link encryption key; PUBLIC region - public key (producing the blob ID), version, encrypted link data and its signature.]

What has been added here is a new Version field. This field could be just an arbitrary integer value that is increased whenever new content is published, or it could be the current timestamp. It is not that important how this value is generated, but it must increase whenever new content is created by the publisher.

Of course this value must be public. Without it, the network wouldn’t be able to do a proper selection if conflicting links are available. This value must also be signed along with the encrypted link. If it wasn’t signed, an attacker could just take some old link, bump its version number and inject it into the network, performing a replay attack that would result in the genuine new data being discarded.

Even more conflicts

But just adding a simple Version field is not enough.

The rule of forward progress requires that we must always be able to deterministically select the better version of the link if there’s a conflict. But what if there’s some malicious publisher that produces two or more different links with the same version number? In such a case we must also be able to select one of those links - and every node in the system must make exactly the same decision.

There are a few ways to deterministically select one of such blobs. For sure we have to compare some publicly available dynamic property of both links and choose one of them.

Currently we have: the encrypted link, the version and the signature. Choosing the encrypted link itself would be enough for conflict resolution. Two separate links would have different encrypted blobs, and comparing them lexicographically (byte by byte) would resolve the conflict.

We did not consider the version here though, so let’s think about a scenario where this could be abused.

Fair draw

Let’s imagine a scenario where there’s some kind of lottery in which we have to deterministically and pseudo-randomly choose one link from a group of links, in such a way that skewing the selection is close to impossible.

For this selection to be fair for all participants, all parameters used for the final selection must be known upfront. Here that would be: the blob name used for the selection and its public data (public key, configuration, the version), but also the symmetric key used for link encryption. Ensuring those are known up-front means that they can not be modified later to prioritize some links over others.

All participants must submit their links in a given time slot. They don’t know the private key, thus they can’t guess the signature that will be generated. Participants could of course submit more than one link, thus there must be some way to ensure that each participant can submit only one link, or each submission must be associated with some cost.

Once the time limit for submissions is reached, all valid submitted links are signed by the publisher using the previously announced version, and the link that is naturally selected by the network is the winner.

For this selection to be fair, we must ensure that submitters can not skew the selection algorithm. Because of that, selection based on the encrypted link only is not good enough. Participants know the symmetric key before the draw, thus they can calculate the encrypted link. A malicious participant could start preparing many links to find one that has a higher chance of winning and only submit that one, circumventing the submission limits or cost that way. Such a link could even be reused later for different version values (e.g. multiple rounds of some lottery) if it is done using the same dynamic link.

Now let’s consider another comparison method - what if we take signatures and compare those instead? Signatures are built from the encrypted link data and the version, which means that one can not reuse the same prepared data for different versions. And in addition to that, lottery submitters can not figure out what the signature will be, since they don’t know the private key - it’s just a blind guess. This sounds like a good idea but (as usual) there are a few details here.

The first is that we should not compare signatures directly. A byte representation of such signatures may not be fully random. This means the publisher could have an opportunity to skew the selection of the winner by careful creation of the signatures themselves. This can be easily solved by hashing the signature first and comparing signature hashes.

Another requirement here is that the signature must be deterministic - for the same input data it must always produce the same signature. Otherwise the publisher could repeat signature generation for some links and finally choose the one that produces the greatest hash. Requiring trust in the publisher and his signature generation (and most likely the quality of the publisher’s pseudorandom data) looks like a very bad assumption here.

Deterministic signatures are not that common and in fact in many situations they are considered less secure than non-deterministic ones. But fortunately there are well-known algorithms, such as Ed25519, which were designed with determinism in mind.

Looks like we’ve found a good solution for our selection algorithm, at least in the case of our lottery example. The selection would first compare versions and, if those are equal, hash the (deterministic) signatures and compare the hashes.
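A sketch of that selection rule in Go - “greater hash wins” is an arbitrary but deterministic choice, and SHA-256 stands in for whichever hash would really be used:

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
)

// betterLink returns true if link A should win the conflict with link B:
// the higher version wins, and on a tie the greater signature hash wins.
// Every node applying this same rule reaches the same decision.
func betterLink(versionA uint64, sigA []byte, versionB uint64, sigB []byte) bool {
	if versionA != versionB {
		return versionA > versionB
	}
	hashA := sha256.Sum256(sigA)
	hashB := sha256.Sum256(sigB)
	return bytes.Compare(hashA[:], hashB[:]) > 0
}
```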

But let’s also think about different scenarios. Sometimes we could actually be interested in finding a way to increase the strength of a signature hash used for conflict resolution. In such a case, a non-deterministic signature would be beneficial. Repeating the signing process would yield a different signature each time, meaning that re-signing and keeping the largest hash would make the link stronger over time. Hmm, does that sound familiar? …

Burning cycles

Such retrying of link signature generation is very similar to something we know from cryptocurrencies - proof of work. In Bitcoin, for example, nodes try to hash the block data with different base seeds, and the one that finds a hash with a given number of leading zeros wins the block.

Now let’s imagine that someone unauthorized got into possession of the private key. Can such an evil person put some falsified data behind the corresponding link? The trivial way to do so would be to just increase the version number and publish new data.

But we could have links that do not have a version number at all, or the version number may be used only for signature calculation and ignored in the first step of the selection algorithm where version numbers are compared.

If the version can not be used to directly affect the selection, the attacker would have to spend enough time to generate a new signature that would win over the existing one. This would make a lot of attacks impractical due to their computational cost.

We could even add something that makes links stronger. The link comes with a configuration. We could store there a minimal hash value that the signature’s hash must reach for the link to be valid. Such links would then require proof of work to even be accepted by the network, before any selection happens.
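A sketch of such a validity check, assuming the minimal hash value is simply compared lexicographically against the hash of the signature:

```go
package sketch

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// checkLinkDifficulty accepts a link only if the hash of its signature
// reaches the minimum stored in the link's configuration - producing an
// acceptable signature then requires proof of work.
func checkLinkDifficulty(sig, minHash []byte) error {
	sum := sha256.Sum256(sig)
	if bytes.Compare(sum[:], minHash) < 0 {
		return errors.New("link rejected: signature hash below required minimum")
	}
	return nil
}
```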

Now one may wonder if we need to use non-deterministic signatures here. It looks like we don’t have to. Making a non-deterministic signature out of a deterministic one is in fact a trivial operation. We can simply add some nonce value to the signing process and publish it along with the public link information. This could be the version in the previously mentioned mode where the version-based first step of the selection is ignored, or we could introduce another, publicly exposed seed value.
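A sketch of that trick - mixing a fresh nonce into the signed message makes each signing attempt yield a different signature, and publishing the nonce keeps the signature verifiable:

```go
package sketch

import (
	"crypto/ed25519"
	"crypto/rand"
)

// signWithNonce turns deterministic Ed25519 into an effectively
// non-deterministic scheme: each call signs the message extended with a
// random nonce, which must be exposed next to the other public link
// data so verifiers can rebuild the signed message.
func signWithNonce(priv ed25519.PrivateKey, msg []byte) (sig, nonce []byte) {
	nonce = make([]byte, 16)
	if _, err := rand.Read(nonce); err != nil {
		panic(err) // the system randomness source failed
	}
	sig = ed25519.Sign(priv, append(append([]byte{}, msg...), nonce...))
	return sig, nonce
}
```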

Fragile encryption

The goal of a dynamic link is to be addressable using some static information. The content of the link can then dynamically change over time reflecting updates to the content it links to.

The entrypoint information for such a link must for sure include the blob name. But there’s still another piece of information needed to access the data behind it: the encryption key used to encrypt and decrypt the link data.

This key would have to be static for a given blob name - to be able to read the link, one would have to get the blob name and the key. Sounds like a sane idea, as long as we don’t dive into some cryptographic details here.

The key is only a part of the input for the encryption and decryption process. Next to it there’s the IV - the initialization vector. This value is necessary to protect against serious attacks when different data sets are encrypted with the same key. Depending on the underlying algorithm, reusing an IV may reveal parts of, or even the full, plaintext without knowing the key (some examples can be found on Wikipedia).

There are two ways to deal with this problem. The first is to find a way to properly generate the IV; the other would be to use different keys for different links.

Synthetic IV

Let’s try to deal with calculating different IVs for different links. Those values do not have to be secret, but in some encryption algorithms they must be calculated in a pseudorandom way so that they can not be guessed upfront. Pseudorandom output can easily be achieved by using cryptographic hash functions. In order to ensure the IV is not repeated for different links, we would have to compute the hash over the blob name, the version and the plaintext data (note that we can not use anything that requires the encrypted link, since it does not yet exist - producing it requires the IV to already be known).

Because a pseudorandom IV is used, we must take the birthday paradox into account. In the case of AES, for example, the IV is 128 bits (the AES block size). That gives us a pretty large number of links before the probability of an IV collision becomes too high - we’re talking about 2^63-like values here. A similar method of IV generation can be found in AES-SIV.

Here’s how it would look:

[Diagram: PUBLISHER region - private key, combined hash of blob name, version and plaintext; PRIVATE region - link encryption key; PUBLIC region - public key (producing the blob ID), version, IV, encrypted link data and its signature.]
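A minimal sketch of the derivation, with the field encoding being my assumption (the hash is truncated to the 16-byte AES block size):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/binary"
)

// syntheticIV computes a pseudorandom IV from the blob name, the version
// and the plaintext link data. Hashing all three ensures the IV is never
// repeated across links or versions (up to the birthday bound).
func syntheticIV(blobName []byte, version uint64, plaintext []byte) [16]byte {
	h := sha256.New()
	h.Write(blobName)
	var v [8]byte
	binary.BigEndian.PutUint64(v[:], version)
	h.Write(v[:])
	h.Write(plaintext)
	var iv [16]byte
	copy(iv[:], h.Sum(nil)) // truncate the 32-byte hash to the block size
	return iv
}
```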

Of course, because the IV is necessary to decrypt the data, it must be a part of the public dataset. Decryption must then be followed by an additional recalculation of the IV from the source data and a comparison of that value with the one that was publicly exposed. This opens a question about the propagation of invalid blobs - a bogus publisher could publish a blob that is accepted by the public network but would fail upon a decryption attempt. It would effectively make the data behind the link inaccessible. Whether this opens another attack surface is not that obvious though. If an attacker can produce links that are accepted by the network, it means the attacker gained access to the private key. This by itself is enough to start sabotaging the network with links using some huge version numbers, or containing random data instead of the encrypted link.

Derived key

Instead of protecting against the reuse of IVs, let’s discuss the alternative: generating a different key for each blob. Let’s assume that the IV is well-known up-front and does not change (and that it is a secure value for the given encryption mode). We must guarantee that the key is never reused - each plaintext link data must be encrypted with a unique key.

A very important fact to observe here is that this key must guarantee uniqueness even if the publisher is an attacker. The final encryption key should thus be a function over data that’s part of the link: the blob name, the version and the plaintext data. This is exactly the same dataset as in the case of the synthetic IV. A simple solution would be to hash those values and use that hash as the encryption key. Because we’re calculating the hash over the blob name and the version, precalculated values can not be reused for different links or different versions of the same link.
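The derivation mirrors the synthetic IV sketch above - only here the full 32 bytes of the hash are kept, to serve as an AES-256 key (encoding details are, again, assumptions):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/binary"
)

// derivedKey hashes the blob name, version and plaintext into a unique
// per-version AES-256 key; a precomputed key can not be reused for a
// different link or version because both are part of the hashed data.
func derivedKey(blobName []byte, version uint64, plaintext []byte) [32]byte {
	h := sha256.New()
	h.Write(blobName)
	var v [8]byte
	binary.BigEndian.PutUint64(v[:], version)
	h.Write(v[:])
	h.Write(plaintext)
	var key [32]byte
	copy(key[:], h.Sum(nil))
	return key
}
```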

Now, since the key generated that way must somehow be accessible when decrypting the link, it must be recoverable from the public data. This can be done by using key links, mentioned before when we were discussing encryption keys for static blobs. What it means in practice is that the encryption key would itself be encrypted with a base key that is static for the link. Such an encrypted (locked) form would then be attached to the public data. Without knowledge of the base key there’s no way to recover the key that was used to encrypt the link data.

[Diagram: PUBLISHER region - private key, combined hash encrypted/decrypted with the base key; PRIVATE region - base key; PUBLIC region - public key (producing the blob ID), version, encrypted key, encrypted link data and its signature.]

The encrypted key must be authenticated by including it in the signature. But there’s one thing missing here. The encryption of the key itself does not yet include an IV - if we included one here, we would end up in the same situation as with the previous approach: how to securely generate that IV so that its uniqueness is always guaranteed. For that purpose, an encryption mode that generates the IV internally, such as the already mentioned AES-SIV or AES-GCM-SIV, would be needed.

We could also try to be super-brave (or stupid) and just avoid the IV by using the ECB chaining mode here. At first glance this could work - we are only encrypting keys, which should already look like pseudorandom data unique for each link version, being the output of a hash function. But it is a risky solution and I wouldn’t like to go that way.

What is better?

Comparing both solutions, it looks like generating the IV is much better and simpler than generating the key.

Key generation has one advantage though. IVs are usually smaller than keys - e.g. with AES-256, the 256-bit key is twice as long as the 128-bit IV. That means that when using unique keys, the risk of a collision between different link versions is greatly reduced (so much that we should not care about the birthday paradox at all).

But considering that a collision is a local problem for a single link, generating IVs still looks like a very viable solution to our problem.

Getting the base key

So far we have assumed that the key used for encryption and decryption is already known. This is exactly how we deal with static blobs - to decrypt one, the key must be given to an authorized user along with the blob name. In the case of dynamic links, it is the same. The key will be provided upfront and it will be static over time, even when the link data changes (it could actually be rotated occasionally to ensure that access to the data behind the link is not granted forever, but this is a topic for another post).

The method chosen for such key generation must not open up possibilities to exploit the encrypted data itself. For example, it should not be a known weak or already leaked key. And of course it must not be easy to guess.

We can also bring back the discussion of key generation schemes from when we talked about static blobs. It brought up three key generation methods:

  • Purely random keys
  • Keys generated from the encrypted blob content
  • Keys generated from the signature of the encrypted blob content

Each of those methods has its pros and cons. In case of dynamic links we could come up with a very similar approach.

The random key case is obvious - when generating the link, one could also generate a random key associated with it. It would work as long as we trust the randomness of such a key.

Generating the key from content is a bit more problematic here. Contrary to static blobs, when generating the link we don’t yet have the content to encrypt. Thus the key must be built from something else - static data associated with the link that, additionally, must not be publicly available information (at least not all of it).

When looking at the available static data, the only non-public information connected to the link itself is the private key.

We could derive the encryption key from the private key (e.g. by taking its fingerprint). But this means that it would be impossible to validate whether such an encryption key is genuine without revealing the private key itself. From the perspective of a user with public access only, it would be indistinguishable from a randomly generated one.

Alternatively, we could use the private key to generate a signature, and that signature would then be used to build the base key. The signature should be calculated from all the public information of the blob - similarly to how the blob name is generated. That means that the signature of the blob name itself would be a good data source. Of course such a signature should not be the key itself - it should be fed to some key derivation function.
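A sketch of that idea, using Ed25519 (deterministic, so the publisher always obtains the same signature for the blob name) and plain SHA-256 standing in for a proper key derivation function such as HKDF:

```go
package sketch

import (
	"crypto/ed25519"
	"crypto/sha256"
)

// baseKeyFromSignature derives the static base key of a link: sign the
// blob name with the link's private key and feed the deterministic
// signature through a hash acting as a (simplistic) key derivation step.
func baseKeyFromSignature(priv ed25519.PrivateKey, blobName []byte) [32]byte {
	sig := ed25519.Sign(priv, blobName)
	return sha256.Sum256(sig)
}
```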

But this makes the verification of the encryption key a bit more complicated. To verify it, we would have to get the original source signature from the link. Let’s call this information a key validation block. That data could contain the blob name signature encrypted with the key itself, so that a user with access to the private data can verify that signature, hash it and compare the result with the encryption key.

Also, let’s emphasize here that such a blob name signature would have to be protected at all costs and the publisher must not be tricked into revealing it. We should, for example, analyze whether it is possible to trick the publisher into signing some specific link data that would result in a calculation equivalent to signing the blob name. Such a signature would then be attached as public information to the link, revealing the key and as a result totally destroying the secrecy of the linked data.

(Here you can see that there are a lot of small details that could simply be done incorrectly when building such crypto schemes, which further proves that I went into a very dangerous area of cryptography.)

There’s also another mode of encryption key generation that I didn’t mention yet - one that is based on some additional private key. Such a key generation method is indistinguishable from the random key as long as the publisher does not reveal the public key. But since that looks like a separate topic, let’s leave it for some other post.

Impl

Ok, I covered a huge amount of information here already. It turned out to be a pretty large post. But writing it down in one place will help with the implementation now. Enough talking… let’s make some code out of it.