Password please

Good Keys, bad keys

When using encryption, good encryption keys are essential. They must be generated randomly and must contain enough entropy. Otherwise we open the door to a wide range of attacks on the encrypted data. In addition to the key, we also need an Initialization Vector (IV). It doesn’t necessarily have to be secret, but it should be either pseudorandom or (for some encryption primitives) just unique when used together with the same key (the IV is then also called a nonce).

We would like to apply encryption to our data blobs. Just to briefly recall, our blobs are static and cannot change. Every blob has its own unique name. This name is a cryptographically secure hash of the blob’s data. This hash also guarantees the integrity of the information.

Randomize everything

So, what would be the most trivial way of generating keys and IVs?

If keys should be random, and random IVs are also a good choice, why not just randomly generate a key for every blob? With the 256-bit keys of AES-256 we would have to generate more than 2^128 blobs (due to the birthday paradox) before a key is reused with a probability of ~0.5, so even if we always use the same IV with nonce-based encryption primitives, we’re still pretty safe.
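The 2^128 figure can be checked with the usual birthday approximation. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes keys are drawn uniformly at random.

```python
import math

KEY_BITS = 256  # size of an AES-256 key


def keys_for_collision(bits: int, p: float = 0.5) -> float:
    """Approximate number of uniformly random keys that must be drawn
    before any two of them collide with probability p
    (standard birthday approximation: sqrt(2 * ln(1 / (1 - p)) * 2^bits))."""
    return math.sqrt(2 * math.log(1 / (1 - p))) * 2.0 ** (bits / 2)


n = keys_for_collision(KEY_BITS)
print(f"{n / 2.0 ** 128:.2f} * 2^128")  # prints "1.18 * 2^128"
```

So "more than 2^128" is slightly conservative in wording but right in magnitude.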

What would be the properties of such a solution:

Every blob has a unique key
The IV can be hardcoded for nonce-based encryption modes since we’re changing the key for every blob
One blob encrypted twice will get a different key and a different encrypted blob
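A minimal sketch of the random-key approach, using only Python's standard library. The function name and the 32-byte key size (matching AES-256) are assumptions for illustration; the encryption step itself is left out.

```python
import secrets

KEY_SIZE = 32  # bytes, i.e. a 256-bit key (assumed: AES-256)


def random_blob_key() -> bytes:
    """Return a fresh key from the operating system's CSPRNG."""
    return secrets.token_bytes(KEY_SIZE)


# The same blob encrypted twice gets two unrelated keys,
# so the two encrypted blobs will differ as well.
key_first = random_blob_key()
key_second = random_blob_key()
print(key_first != key_second)
```

Each key has to be stored somewhere next to the blob's name, which is the key management cost of this scheme.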

All those properties may be both advantages and disadvantages. Uniqueness of the key does guarantee high security, but it requires managing those keys and won’t let us initialize the cipher once and reuse it for multiple blobs.

Let’s take a look at the fact that re-encryption will use a different key. One benefit of this approach is that if two people encrypt the same data, the keys will be different and the encrypted blob data will be different too. The only thing they have in common is the size of the blob (which can be handled by some clever splitting and merging of data). Overall this could look like a good solution. But it has one drawback: it takes away deduplication, which comes naturally in CAS systems. We could argue that deduplication is not that necessary, and I agree. But if you are uploading your own files, finding duplicates would be useful. Also, for public data or data shared among many people, it’s better to have the same keys for the same blobs. This avoids unnecessary duplicates and helps reduce maintenance costs, since less data is always easier to manage and scale.

Derived Deterministic keys

Ok, let’s try the opposite side. Let’s try to assign a key to a blob in such a way that each key can be unambiguously assigned to its blob.

The key must be random. One way to build a pseudorandom number generator is to use a cryptographic hash, because a cryptographic hash function should behave similarly to a random function. So why not just take the hash of the blob’s contents and use it as the key? The IV could be handled as before. Every blob still gets a different key, so just as with random keys we can use one predefined IV for all our cases (as long as we’re using nonce-based encryption primitives, of course).

What properties would we get here:

Every blob has a unique key
The IV can be hardcoded for nonce-based encryption modes since we’re changing the key for every blob
Every blob of data will have exactly one key assigned and the same encrypted representation
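A minimal sketch of a content-derived key. SHA-256 and the function name are assumptions for illustration; the text above only asks for a cryptographic hash of the blob's contents.

```python
import hashlib


def content_key(plaintext: bytes) -> bytes:
    """Derive a 256-bit key from the blob's unencrypted contents.

    Assumption: SHA-256 is the hash; the same plaintext always
    gives the same key, whoever computes it.
    """
    return hashlib.sha256(plaintext).digest()


alice_key = content_key(b"shared document")
bob_key = content_key(b"shared document")
print(alice_key == bob_key)  # True: same data, same key
```

One caveat of this sketch: it only makes sense if the blob's public name is not the very same hash of the plaintext (for example because the name is computed over the encrypted bytes), otherwise the name would reveal the key.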

The first two points are the same as in the previous case. The third one is the opposite, so let’s take a closer look.

Now we are able to introduce a global deduplication mechanism. If two people upload the same data to one CAS, those uploads would be identical. Such blobs would share keys, IVs and encrypted representation. This could mean a significant reduction in space usage and may significantly reduce network costs. But it does reveal extra information to an external observer. If the observer finds out that Alice and Bob uploaded the same blob, he can assume that they have access to the same piece of unencrypted information. If Alice protects her environment well and is able to fight off all attacks, the attacker may try to attack Bob instead. If Bob is not that good at protecting himself, the attacker may obtain the plain data that was shared between Alice and Bob.

Pretty nasty, right? We could try to weaken the proof of possession here by forcing Alice’s environment to also store some blobs she doesn’t have keys to. But if we’d like to ensure that possession of encrypted data does not imply possession of the unencrypted data, the percentage of stored data that Alice is able to decrypt would have to be really low.

When could such storage overkill be practical? One case that comes to my mind is when Alice provides storage services to Bob and many other people and keeps backups of their encrypted blobs. It sounds like a good idea to explore in the future, though I still see many potential risks here.

Derived Nondeterministic Keys

There’s a simple method, though, to keep some benefits of deduplication and still provide security against blob correlation between users.

What if we were able to securely generate keys from the blob’s contents, but those keys were deterministic for one user only? To do that we need one more thing: something that represents the user’s identity. One obvious answer here is to use asymmetric keys.

A private key in asymmetric crypto can be used to sign a message. Signing different messages produces different results, and deterministic signature schemes (for example Ed25519 or RSA with PKCS#1 v1.5 padding) yield the same signature for the same data, so they behave like pure functions. Randomized schemes such as RSA-PSS or ECDSA with a random nonce give a new signature every time, so they don’t fit here. If the signature has enough entropy to be a good key source (and it has to, otherwise we would be able to forge signatures), we can apply a key derivation function to it to build the encryption key.

This structure has a few really nice properties:

Every blob has a unique key
The IV can be hardcoded for nonce-based encryption modes since we’re changing the key for every blob
Two different user private keys will generate different encryption keys for the same plaintext data
The same user will generate the same key and the same encrypted data for the same plaintext data
The encryption key is indistinguishable from a random key for someone who doesn’t have access to the user’s private key
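The shape of this derivation can be sketched with the standard library only. Python has no signature scheme built in, so this sketch swaps the signature for HMAC-SHA256 under a per-user secret: it is a stand-in, not a signature, and it cannot be verified with a public key. The function names and the "blob-encryption-key" label are made up for illustration. The KDF step is HKDF (RFC 5869) with SHA-256, written out by hand for a single output block.

```python
import hashlib
import hmac


def deterministic_tag(user_secret: bytes, plaintext: bytes) -> bytes:
    """Stand-in for a deterministic signature over the blob's contents.

    NOTE: HMAC is not a signature; it only mimics the property that
    one user always gets the same value for the same data.
    """
    return hmac.new(user_secret, plaintext, hashlib.sha256).digest()


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256 and an all-zero salt, limited to one block."""
    if not 0 < length <= 32:
        raise ValueError("this sketch supports at most 32 bytes of output")
    prk = hmac.new(b"\x00" * 32, ikm, hashlib.sha256).digest()    # extract
    okm = hmac.new(prk, info + b"\x01", hashlib.sha256).digest()  # expand
    return okm[:length]


def user_blob_key(user_secret: bytes, plaintext: bytes) -> bytes:
    """Per-user, per-blob encryption key: KDF(tag(user, contents))."""
    tag = deterministic_tag(user_secret, plaintext)
    return hkdf_sha256(tag, b"blob-encryption-key")


data = b"shared document"
print(user_blob_key(b"alice-secret", data) == user_blob_key(b"alice-secret", data))
print(user_blob_key(b"alice-secret", data) == user_blob_key(b"bob-secret", data))
```

With a real deterministic signature in place of `deterministic_tag`, the rest of the structure would stay the same: sign the contents, then run the signature through the KDF.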

Ok, so we see that keys are reproducible, but only with the same private key. But I’d like to take a look at the last property: indistinguishability from random. This could be really useful in some cases. Let’s say someone publishes some important information using this key generation method. The published blob would at first look like one encrypted with a random key. If someone later requires proof of authenticity of this data, the publisher can publish the signature of this blob. Such a signature would a) produce the encryption key for the blob when put through the KDF and b) prove that the blob was generated by someone in possession of the private key.

Now there’s a property of deniability here. If the signature is not published, there’s no way to connect the blob with the publisher. Whether it will prove itself useful, I don’t know yet, but it’s good to have such ideas for later.

Other options

We could also find a few other methods to generate keys here. For example, we could generate one global encryption key for the user and, for every blob, use a hash of the blob’s unencrypted contents as the IV. We could also build keys by hashing the blob’s contents with some prefix and suffix. But in general we can divide our methods into three groups:

Method	Deduplication	Good usage
Random	None	Highly confidential data
Derived Deterministic	Global	Published data
Derived Nondeterministic	Local	Backup with deduplication

I don’t think any one of these key generation schemes should be preferred over the others. Instead, I think key generation should be adjusted to how the blob is used. If the blob will become public at some point, a key derived from its contents is the best way to go. Mixing it with private key signing allows an extra proof of identity. Purely random keys could also be useful if you’d like to hide the fact that you possess shared data. A robust system should enable them all.
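The two extra variants mentioned at the start of this section (an IV taken from a hash of the contents, and a key built by hashing the contents with a prefix and suffix) could look roughly like this. It is only a sketch: SHA-256, the 12-byte IV length and the function names are all assumptions for illustration.

```python
import hashlib

IV_SIZE = 12  # bytes; assumed nonce length, not taken from the text


def content_iv(plaintext: bytes) -> bytes:
    """IV derived from the blob's unencrypted contents,
    meant to be used together with one global per-user key."""
    return hashlib.sha256(plaintext).digest()[:IV_SIZE]


def affixed_key(plaintext: bytes, prefix: bytes, suffix: bytes) -> bytes:
    """Key built by hashing the contents wrapped in a prefix and suffix."""
    return hashlib.sha256(prefix + plaintext + suffix).digest()


blob = b"some blob contents"
print(content_iv(blob).hex())
print(affixed_key(blob, b"prefix-", b"-suffix").hex())
```

Both variants fall into the derived groups of the table: whether deduplication ends up global or local depends on whether the user key, prefix and suffix are shared or per-user.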

Transparent transport

You may have already noticed that no matter which key generation scheme we use, it is not relevant from the point of view of the CAS itself. As long as no plaintext keys are sent to the storage system, it doesn’t have access to the plaintext data.

Such storage is able to do some basic verification of the blob though. If the user has to upload a blob along with its name, the blob can be rejected if the name does not match the contents. It’s a pretty nice property: even without knowledge of the plaintext data, the storage system can do a basic integrity check that prevents the transfer of malformed data.
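On the storage side, such a check could be as small as the sketch below. It assumes blob names are hex-encoded SHA-256 digests of the uploaded (already encrypted) bytes; the function name is made up for illustration.

```python
import hashlib


def accept_upload(name: str, data: bytes) -> bool:
    """Accept the blob only if its name matches the hash of its bytes.

    The storage never needs the encryption key for this check.
    """
    return hashlib.sha256(data).hexdigest() == name.lower()


encrypted_blob = b"\x8f\x01opaque encrypted bytes"
good_name = hashlib.sha256(encrypted_blob).hexdigest()

print(accept_upload(good_name, encrypted_blob))         # True
print(accept_upload(good_name, encrypted_blob + b"!"))  # False
```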

That’s it for today. Hope to write something more soon.
