Anonymizing Relationships in Databases
This is a technical article for software developers concerned with privacy of users. However, it’s written in a general way, so that a layman can read it.
While many databases store users’ data encrypted, additional metadata is stored in databases in order to look up records. This metadata is not encrypted and can reveal relationships between people.
Let’s say we have two people A and B who wish to store in the database a reference to a datum R. If a hacker steals the database, would he be able to identify that both A and B share a common reference R?
Let’s take a few concrete examples. Suppose President X sends a file to President Y, using an online file sharing service. While the file itself is encrypted, the presidents do not want it to be known that the transmission occurred.
Or, suppose that certain Social Network users wish to keep their friends lists private. However, if someone steals the database of the network, will he be able to identify who befriended whom?
The way databases were designed, before anonymity was a concern, was to minimize data duplication and to reuse data as much as possible. In relational databases such references are called foreign keys. For the social network example, there would be a table of users with fields (user_id, name, email) per row. And there would be a table with fields (user_id1, user_id2) describing who is the friend of who. This second table is called the “join” table and it allows to retrieve in a single JOIN SQL query emails of the friends of a certain user.
Similarly, in the case of a file sharing example, there would be a table describing files with fields (file_id, file_size, encryption_key, …) and a join table associating users with files, with fields (user_id, file_id).
Clearly, such database can reveal who knows whom, or who has communicated with whom. That is the case even if the values of names and email addresses are stored encrypted. You may not know who is A and who is B, but you would know that A is related to B.
“So what?”, you may say. A knows B, B knows C, C knows D, but we don’t know who they all are. It is still anonymous. This is exactly the situation with Bitcoin addresses, which offer pseudo anonymity. But, the actual identity of each person can be further deduced by knowing additional information. Various social engineering techniques can be used.
Also, one of the key identifying data sources are access logs which list IP addresses and exposes approximate geographic locations. The logs often contain REST API accesses which specify database identifiers on the URLs. Combined with IP address information, it is possible to deduce where geographically people A and B were.
For example, an access log entry to get the friends of user with ID 1234 may be:
2019–01–04 10:03:45 https://example.com/api/user/1234/friends
The fact that access logs also have precise timestamps for each request, allows to even more confidently associate people. If people A and B both referenced the same datum within as small window of time, then the two people are even more closely related.
This timing information also allows a hacker to mount active attacks, that are triggered based on certain API requests. For example, if the hacker detects that President X has sent a file, he can wait for the API request arriving to retrieve this file. He can then mount a timely spearfishing attack on the person retrieving the file.
Is it possible to completely avoid metadata? The most secure database would be a map of identifiers to encrypted blobs of data. This means, however, that a user would have to load his full encrypted data blob, in order to find a datum in it. For example, if a user has 1000 friends on the social network, and he wishes to page through his friends 10 at a time, then he would have to load all 1000 friends upfront. This is not a problem if the overall size of the encrypted blob is small. However, in many cases this is not the case.
However, even such secure database can’t protect from leaking of relationships via access logs. When user A wishes to see details about friend B, he will make API access to record for B, and therefore the access logs will have this relationship.
One can solve this by not storing access logs at all. This, however, makes it difficult to resolve technical issues. It also makes it difficult for security teams to identify hacking threats and to block “bots”.
Alternatively, identifying information in URLs can be moved into the body of requests, which is not stored in logs. But this makes the logs of little use. Also, this means that all REST requests must not use the “GET” method, which is not possible when supporting regular web links.
Note that even without logs, an active attacker, can inspect internet packets as they travel, and to identify IP information. Lack of anonymity is inherent in the very nature of the internet. (An attempt to create a secure anonymous network is FreeNet, but the product lacks features and is too slow.)
A better solution is to delete old access logs. Then if a hacker steals the database and the access logs, then he will be able to reverse engineer only recently created relationships. However, some companies may be under government regulations not to delete access logs.
Users too can hide their IP addresses by using VPNs, double VPNs, and triple VPNs. However, VPN services may be required to keep all IP information, which too can be stolen. A simpler and a more reliable solution for users to hide own IP addresses, is to go to coffee shops, malls, public libraries and to use their internet connections.
Thus, let’s put aside the leakage of information from IP addresses, and let’s design the database with the assumption that we don’t have to solve this problem.
We do want to avoid having access logs containing references to data that can identify relationships. For example, we do not want to have access URLs like this:
2019–01–04 10:03:45 https://example.com/api/user/1234/friend/6789
Such a URL shows that person with ID 12345 is a friend of person with ID 56789.
Although we don’t want the database to identify relationships, we need a way to fragment the database, so that not all of his data must be loaded by each user upfront. One way to fragment the data is to store all user’s friends in one blob, all his settings in another blob, all notes in another blob, and so on.
The best case is to store each datum separately and look it up by an identifier that is unique per user. This can be accomplished by generating random identifiers, or by “salting” identifiers with something that is unique per user, and is known only by the user.
For example, suppose person A with ID 1234 wants to store in the database the date at which he friended his friend B, identified by 5678. Instead of storing the information in a blob describing all his friendships, he wants to store it as a separate record in the database. To do this, he computes the value H(key|“5678") where the key parameter is known only to him. Suppose that the result of the computation is 9385. He then creates a record in the database table friendships (id, friend_id, friendship_date, status) = (1234, 9385, 2020–01–05 12:04:34, “active”).
Now, person A can find, with a single query executed by the database engine, which friends he friended last year, all without ever downloading all the friendship records. The record can be further updated by updating the status field to, say, “inactive.”
The function H can be any keyed hash function. The simplest way is to merely concatenate secret key material with the ID string, and to hash it. Alternatively, it can be the HMAC-SHA256 algorithm (Hash-based Message Authentication Code). The encryption key is the key used by the user to encrypt his personal data stored in the database.
Using completely random identifiers does not give any advantage, and has the disadvantage that the user must somewhere store the map from real IDs to random IDs.
In summary, we have seen that it’s not so trivial to store data in a way that doesn’t allow to reverse engineer relationships between users. We may conjecture, therefore, that many, if not most, databases out there do not go to such extremes and user’s privacy is at risk.
Boris Reitman runs a meetup group focusing on Security and Privacy in Vancouver, Canada.