The Search for Searchable Encryption

Time for a technical blog involving… encryption!

That’s right: no Flutter or Xamarin, but a topic that can be just as interesting. Before I started working at Baseflow, I graduated in Cyber Security. My research focused on the topic of searchable encryption. Allow me to take you on a journey through the wonderful world of privacy, information leakage and hashing algorithms.

The problem

Nowadays, we store loads of data; meeting notes, pet pictures, and all seasons of Mr. Robot. We typically store this data in the cloud, so it is always accessible to us. Also, keeping the data online means we do not have to buy the newest hard drive to personally store all of our data. It does not sit well for most people to trust big companies like Microsoft, Google, and Apple with their personal data. And while I do not believe big tech companies to be inherently evil, I also do not think they view their user’s privacy as their primary concern.

What can we do to prevent these companies from accessing our data while still using their services? Or to turn the question around, what can a service provider do to respect and guarantee a customer’s privacy? Encryption! Users encrypt the data that they share with the service provider. Then, when they retrieve the data, they can decrypt it themselves, and everything is fine. Right? Well, not really. See, the problem is that while this approach works from a data security perspective, by encrypting all data, there is no way to find out which data you want to retrieve from the cloud in the first place. For example, if I want to find an important email that Harry sent me, I go to my Gmail and type in ‘Harry’. As Google can read my emails, it can show me all relevant mail. However, if I were to encrypt all my emails, Google would have no idea what to show me! Google could send all the emails to me, and I could decrypt them all to find out which emails are relevant, but that kind of defeats the purpose of storing data in the cloud.

The solution

So then, what is the solution? Searchable encryption! Searchable encryption schemes are designed to provide both data privacy and allow for search queries. These schemes exist in all sorts of forms and are typically achieved by using clever mathematical properties.

Some schemes are fast and provide a basic search operation, where one can search for a single word. Other methods allow more detailed searches, such as ‘wildcard queries’. The wildcard query ‘l*g’ would match documents containing the words ‘lag’, ‘leg’, and ‘log’ for example. Another aspect that differs per scheme is its security. Searchable encryption schemes almost always leak some amount of data. This can be the user's search query or the number of words present in a document. Restricting the leakage usually comes with efficiency penalties. Typically, designers of searchable encryption schemes need to carefully balance the three properties, taking into account the environment the scheme will be used in.

Constructing a simple scheme

If you are still reading at this point, kudos to you for sticking with me! Let’s dive a bit deeper and look at how a basic scheme operates. We will construct a simple scheme that leaks tons of information to an outside observer. Nonetheless, it will show the basic structure of most modern schemes.

Data format

First of all, most schemes consider data in a document id-keyword pair format. This means that any document, email or other data types we want to store are represented as multiple pairs consisting of its id and one keyword each.

Say we have a grocery list that we want to store. We give the list a number and consider every item on the list as a keyword. We then send the pairs 1-Hagelslag, 1-Chickpeas etc. to the server individually.

The server stores these pairs in what we call an index. An index can take many different shapes depending on the scheme, but in its simplest form, it is just a list of all id-keyword pairs.

Algorithms

Most modern schemes consist of three pairs of algorithms.

AddToken is used by the client to generate a token from an id-keyword pair and send it to the server. Add is then used by the server to use this token to update its index.

DeleteToken and Delete work in the same way to remove pairs from the index.

SearchToken is used by the client to instantiate a token based on a search query. The Search algorithm is used by the server to check which ids match based on the search token. These ids are then used to determine which files must be returned to the client. The implementations of these algorithms differ per scheme and often get quite complicated. To give you a basic idea, let us create a simple scheme based on a hashing function.

The scheme

The AddToken algorithm will simply take the id-keyword pair and hash the keyword before sending the pair to the server. The Add algorithm will add the pair directly to the index, which is a list in this scheme.

The SearchToken algorithm hashes a single-keyword-query before sending it to the server.

The Search algorithm simply checks the index for matching hashes. Every matching hash corresponds to an id of a matching document.

Let’s take a look at what this scheme achieves, and what it doesn’t.

First of all, the scheme works. Yay! We can update and search data (I’ll leave deletion as an exercise for you!). Unfortunately, as you may have noticed already, we are only able to search for a single keyword at a time. With this scheme, we cannot search for documents containing multiple keywords or do fancy queries such as a wildcard search.

Let’s now direct our attention to security, since that is what started this whole quest. The good news: assuming we use a secure hashing function, both the content of the data and the client’s search query contents are hidden. The server never gets to see the actual keywords. However, we do reveal when we add or search the same keyword multiple times, as the hashes would be identical. An observer could see when we upload documents with the same keywords and deduce private information using clever analytical techniques. Additionally, the scheme leaks additional information such as the number of keywords in a document and metadata such as when we remove a specific document.

Closing words

Together, we dived into the world of searchable encryption and designed a simple scheme. In my research, I designed a system to create wildcard-supporting schemes with additional security guarantees. Unfortunately, there is no blog post long enough for me to explain all intrinsics. I hope to have piqued your interest in the fascinating world that is searchable encryption. Please feel free to contact me if you want to know more about this subject!