Risks of anonymization

Posted: 4.8.2014 in Testing
Tags: , ,

The anonymization is the process where all sensitive data is removed from the production data before it is used at testing. That is supposed to produce test data which is anonymous, and can be used without security and privacy risks. It has to be done really carefully, and the structure of production data must be understood very well before this can success. At this blog entry I show how difficult it actually is and how it can fail.

Software under testing is Twitter-like website. There is possibility to send public and private messages between users. Our simplified forum includes following three database tables:

User profile – it has numerical identifier for user (UID), username, password hash and e-mail address
Message – it has ID for message, UID of submitter, content as text, timestamp
Private message – ID for private message, UID of submitter, UID of receiver, content, timestamp

The site is already in production and open, so anyone can go to check what others has written in the public messages. Also the profiles are public. The UID is used to identify them.

Let’s start to anonymize this data. If we start from the username and message contents, are those really enough to make data anonymous? Definitely not If the original forum is public, anyone can still check private information like who has messaged and to who. Numerical UID is still the same, so we have change that also. If we want to keep statistics correct, we can’t just assign random number to messages and user profiles. E.g. if account “Teme” had UID 1, then to maintain proper statistics we have to convert all UID 1 to e.g 234.

It is still very easy to find that UID 1 is changed to 234. The features which reveal the information are timestamps and amount of messages. So we have to change also all timestamps. The new timestamps must change the order of message to keep things anonymous. We have change the number of messages of each profile also.

Even after this change we can still in some cases find the real profile from test data. Small external information like “I know that this person has messaged to that person” can help the evil person to find the real profile.

Instead of anonymizing the production data the test data should be based to model of production data. For example at production data we should have same amount of users and messages as the production data. But then on the other hand it usually doesn’t matter if the users have proper distribution of messages.

Instead of using production data like that, generate your own data and inspect, what kind of properties are the most important for your testing. Good test data increases the test coverage and possibility to find the bugs.

See also Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization by Paul Ohm

  1. Tim Hall says:

    For the example above, what you’ve outlined is the optimum approach.

    What about far more complex systems with large numbers of data dependencies, or financial data or other figures that are all supposed to add up?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s