Posts Tagged ‘test data’

Anonymization is the process of removing all sensitive data from production data before it is used in testing. The result is supposed to be test data that is anonymous and can be used without security or privacy risks. It has to be done very carefully, and the structure of the production data must be understood thoroughly before it can succeed. In this blog entry I show how difficult it actually is and how it can fail.

The software under test is a Twitter-like website where users can send public and private messages to each other. Our simplified forum includes the following three database tables:

User profile – numerical identifier for the user (UID), username, password hash and e-mail address
Message – message ID, UID of the submitter, content as text, timestamp
Private message – private message ID, UID of the submitter, UID of the receiver, content, timestamp
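The three tables could be sketched like this. This is a minimal illustration using SQLite; the table and column names are my own assumptions, not from any real forum schema:

```python
import sqlite3

# In-memory database for illustration only; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_profile (
    uid           INTEGER PRIMARY KEY,   -- numerical identifier
    username      TEXT NOT NULL,
    password_hash TEXT NOT NULL,
    email         TEXT NOT NULL
);
CREATE TABLE message (
    id         INTEGER PRIMARY KEY,
    sender_uid INTEGER REFERENCES user_profile(uid),
    content    TEXT,
    created_at TEXT                       -- timestamp
);
CREATE TABLE private_message (
    id           INTEGER PRIMARY KEY,
    sender_uid   INTEGER REFERENCES user_profile(uid),
    receiver_uid INTEGER REFERENCES user_profile(uid),
    content      TEXT,
    created_at   TEXT
);
""")
conn.execute("INSERT INTO user_profile VALUES (1, 'Teme', 'hash', 'teme@example.com')")
print(conn.execute("SELECT username FROM user_profile WHERE uid = 1").fetchone()[0])
```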

The site is already in production and open, so anyone can check what others have written in the public messages. The profiles are also public, and the UID is used to identify them.

Let's start anonymizing this data. If we begin with the usernames and message contents, is changing those really enough to make the data anonymous? Definitely not. If the original forum is public, anyone can still dig out private information, such as who has messaged whom. The numerical UID is still the same, so we have to change that as well. If we want to keep the statistics correct, we can't just assign a random number to each message and user profile. For example, if the account "Teme" had UID 1, then to maintain proper statistics we have to convert every occurrence of UID 1 to the same new value, e.g. 234.
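A consistent UID remapping could be sketched like this. The function, the seed and the sample data are my own illustration; the point is only that every occurrence of an old UID maps to the same new UID, so per-user statistics stay intact:

```python
import random

def remap_uids(uids, seed=42):
    """Build a consistent old-UID -> new-UID mapping.

    Every occurrence of the same old UID gets the same new UID,
    so message counts per user are preserved.
    """
    rng = random.Random(seed)
    distinct = sorted(set(uids))
    new_ids = rng.sample(range(100, 100 + 10 * len(distinct)), len(distinct))
    return dict(zip(distinct, new_ids))

messages = [  # (message_id, sender_uid) -- hypothetical data
    (1, 1), (2, 1), (3, 7), (4, 1),
]
mapping = remap_uids(uid for _, uid in messages)
anonymized = [(mid, mapping[uid]) for mid, uid in messages]
print(anonymized)
```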

It is still very easy to find out that UID 1 was changed to 234. The features that reveal this are the timestamps and the number of messages. So we also have to change all the timestamps, and the new timestamps must change the ordering of the messages to keep things anonymous. We also have to change the number of messages in each profile.
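Perturbing the timestamps could look roughly like this. This is a sketch under my own assumptions (shift range, seed); each timestamp is shifted independently, so neither the absolute times nor the relative ordering survive:

```python
import random
from datetime import datetime, timedelta

def jitter_timestamps(timestamps, max_shift_hours=72, seed=7):
    """Shift each timestamp by an independent random offset so that
    both absolute times and message ordering change."""
    rng = random.Random(seed)
    return [
        ts + timedelta(minutes=rng.randint(-max_shift_hours * 60, max_shift_hours * 60))
        for ts in timestamps
    ]

original = [datetime(2014, 7, 1, 12, 0) + timedelta(hours=i) for i in range(5)]
shifted = jitter_timestamps(original)
print(shifted)
```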

Even after these changes we can, in some cases, still find the real profile in the test data. A small piece of external information like "I know that this person has messaged that person" can help an attacker to identify the real profile.

Instead of anonymizing the production data, the test data should be based on a model of the production data. For example, the test data could have the same number of users and messages as production. On the other hand, it usually doesn't matter whether the users have exactly the same distribution of messages.

Instead of using production data like that, generate your own data and work out which properties are the most important for your testing. Good test data increases test coverage and the chance of finding bugs.

See also Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization by Paul Ohm


Production data in testing

Posted: 30.7.2014 in Testing

Several times I've seen an organization where the test data is taken from production. I have to say that in that case the normal cases are covered quite well, but I hate that kind of approach. Why?

There are a couple of different risks. The first and probably the biggest is the security and privacy risk. Production data usually contains security- or privacy-sensitive data: e-mail addresses, usernames, password hashes, birthdays and so on. Usually the test environment is not as well secured as the production environment. For debugging reasons, testers and developers may have access to the database, and the testing may be outsourced outside the organization. During testing we can, accidentally or on purpose, see sensitive data which we definitely should not see.

Another reason is that the data can often cause extra traffic towards the user, by e-mail or even snail mail. Imagine a situation where the tested application contains healthcare patient records. The tester creates a death certificate and presses the button: "Send to the relatives." Usually the test environment should stop there, but a misconfigured environment can be connected to printing and mailing services. This can cause a heart attack for the receiver.

Production data has problems of its own, too. There can be a "one in a million" case which hasn't happened in production yet. As testers, our data should cover that as well, so even with production data we have to check whether the test data covers all required cases. With half a million user records that can be an enormous task.

Instead of using the production data, investigate what kind of data it contains and create artificial data based on that information. While doing this, consider which aspects are important for the testing. For example, if the age distribution doesn't matter for the testing, why spend time on it? But what if it does? Then investigate what the distribution looks like and base your data on that information.
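Sampling artificial data from an observed distribution could be sketched like this. The age buckets and their shares are entirely hypothetical; in practice they would come from analyzing the real production data:

```python
import random

# Hypothetical age distribution observed in production: bucket -> share.
AGE_BUCKETS = {(18, 25): 0.35, (26, 40): 0.45, (41, 65): 0.20}

def sample_ages(n, seed=1):
    """Draw n artificial ages that follow the observed bucket shares."""
    rng = random.Random(seed)
    buckets = list(AGE_BUCKETS)
    weights = [AGE_BUCKETS[b] for b in buckets]
    ages = []
    for _ in range(n):
        low, high = rng.choices(buckets, weights=weights)[0]
        ages.append(rng.randint(low, high))
    return ages

ages = sample_ages(1000)
print(min(ages), max(ages))
```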

Often in performance testing the test environment (or even the production environment) is loaded with production data. Even in those cases I try to write my performance testing scripts so that they do not touch any sensitive data (e.g. user data). I always try to create artificial data which can easily be cleaned up from the database. For example, if there are first and last names, I usually create them with suffixes: -CUSTOMER (Teemu-CUSTOMER) and -TESTING (Vesala-TESTING). This kind of naming distributes the data realistically in the database, and later it is easy to find such users and remove them. If you use Fake Name Generator to create your data, it is very easy to modify the names with a short Perl script.
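The suffix convention and its cleanup could be sketched like this (the helper function is my own illustration; only the -CUSTOMER/-TESTING suffixes come from the post):

```python
def tag_test_user(first: str, last: str) -> tuple[str, str]:
    """Tag an artificial record with fixed suffixes so that cleanup
    later reduces to a single query, e.g.
      DELETE FROM user_profile WHERE username LIKE '%-TESTING';
    """
    return f"{first}-CUSTOMER", f"{last}-TESTING"

first, last = tag_test_user("Teemu", "Vesala")
print(first, last)  # Teemu-CUSTOMER Vesala-TESTING
```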

Last time I wrote about test data in manual testing. This blog entry says a few words about test data in test automation.

Now the computer takes care of submitting and checking the data, so we can use a different kind of data. We don't have to keep the data simple, and we can use more data than during manual testing. You wouldn't want to type 100 megabytes of data, but test automation can do that. Test automation can also detect small differences which are difficult for a human to notice. For example, distinguishing between I (capital i), l (lower-case L) and 1 (number one) can be difficult for a human. But test automation usually compares at the binary level, and all of those have different ASCII codes, so it can detect them.
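The point about lookalike characters is easy to demonstrate: they read almost identically but have distinct character codes, so a byte-level comparison catches substitutions a human reader would miss:

```python
# 'I' (capital i), 'l' (lower-case L) and '1' (one) look alike on screen
# but differ at the binary level.
codes = {ch: ord(ch) for ch in "Il1"}
print(codes)  # {'I': 73, 'l': 108, '1': 49}
```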

On a discussion forum, checking that a thread of 100 messages pages correctly would be nearly impossible for a manual tester. Or at least the result would be poor, because they would most likely just put a few characters into each message. But with test automation we can create longer and shorter messages with realistic-looking data. What is realistic in this case, and how do we know? The Internet is full of forums, so we can pick almost any of them to analyze. We could measure, for example, how many words each message has on average and what the standard deviation is, and then construct the test data based on that information.
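Generating message bodies from such statistics could look roughly like this. The mean and standard deviation here are invented; in practice they would come from analyzing a real forum, as described above:

```python
import random

def generate_message_lengths(n, mean_words=24.0, stddev_words=12.0, seed=3):
    """Draw per-message word counts from a normal distribution
    (hypothetical mean/stddev), clamped to at least one word."""
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean_words, stddev_words))) for _ in range(n)]

lengths = generate_message_lengths(100)
messages = [" ".join(["word"] * k) for k in lengths]
print(len(messages), min(lengths), max(lengths))
```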

Test automation can use random data, created at run time by the computer. This has some implications which we have to remember: the test must be reproducible. The random generator should not be cryptographically strong; it should be such that when we set its seed to a specific value, it always produces the same sequence. The initial seed can be derived from the time, but then every subsequent value must be based on the previous one. The logging should also make sure that all steps and initial states are logged.
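A minimal sketch of seeded, reproducible generation (function name and seed value are my own illustration):

```python
import random

def generate_inputs(seed, n=5):
    """With the same seed the generator always yields the same inputs,
    which is what makes a randomised test reproducible."""
    rng = random.Random(seed)  # a per-test generator, never the shared global one
    return [rng.randint(0, 10**6) for _ in range(n)]

run1 = generate_inputs(seed=20140730)
run2 = generate_inputs(seed=20140730)
print(run1 == run2)  # True: same seed, same data -- log the seed with the test run
```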

One good way to cheat at randomness is to create a predefined list of data. You can reproduce it every time, and it is "random enough" if you trust that a different order doesn't change the result. In one project I wanted to test how an application reacted to broken input files. Manually it would have been impossible to test, but with a computer it was quite simple: I just had to create the inputs, and I used Radamsa for that. The seeds were a couple of working inputs. During testing I found one major crash which could also have been a security issue. Without test automation and the ability to test a huge mass of data, we could have left a major security issue in the application.
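The idea behind that kind of tool can be sketched with a toy byte-flipping mutator. This is not Radamsa itself, just an illustration of the principle: start from a known-good input and deterministically corrupt a few random bytes:

```python
import random

def mutate(data: bytes, n_flips=3, seed=0) -> bytes:
    """Toy mutator: corrupt n_flips random bytes of a known-good input.
    Deterministic per seed, so every broken input can be reproduced."""
    rng = random.Random(seed)
    out = bytearray(data)
    for _ in range(n_flips):
        pos = rng.randrange(len(out))
        out[pos] = rng.randrange(256)
    return bytes(out)

good_input = b'{"user": "Teme", "uid": 1}'  # hypothetical working seed file
broken = [mutate(good_input, seed=i) for i in range(10)]
print(len(broken), "mutated inputs generated")
```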

It seems that I should have structured this series a bit differently. There are still plenty of important things to describe, and they touch test automation as well as manual testing, but also security and performance testing. And then there are plenty of miscellaneous issues, such as using production data.

This blog entry starts my test data related blog series. I will post a new one every now and then. If you have questions or comments, please let me know. I like to learn and I like to give new ideas, so feel free to disagree with me.

Throughout the series I'm using a single example application: discussion forum software (like SMF or phpBB). Its key features are registration, logging in and out, different kinds of user rights, posting public and private messages, reading them and searching.

One key feature of manual testing is the human aspect: a human submits the data and a human checks the results. All test data should take that into account. We have only a limited ability to notice small differences.

A common bias I've seen in manual testing is the use of repetitive patterns. If the date format is ddmmyy, even a tester quite often uses a date like 010101 or 111111. They are simple and quick to type and always valid, but as test data they miss many possible error cases. What if the age calculation swaps day and month? Data like that won't catch it. Much better is something that makes the mistake impossible to miss, for example 230142. The day, month and year fields are all different, so if something fails, it is noticed immediately.
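The day/month swap example is easy to demonstrate. A tiny sketch (the bug simulation is my own illustration): with a repetitive date the bug is invisible, with distinct fields it shows up immediately:

```python
def swap_day_month(ddmmyy: str) -> str:
    """Simulate a bug that swaps the day and month fields of a ddmmyy date."""
    return ddmmyy[2:4] + ddmmyy[0:2] + ddmmyy[4:6]

print(swap_day_month("111111"))  # 111111 -- identical, the bug is invisible
print(swap_day_month("230142"))  # 012342 -- month 23 after the swap, clearly wrong
```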

The forum also requires text data: if you test the messages, you need plenty of text. Testers and developers often use lorem ipsum as test data, but that practice should be avoided, because it has multiple failure points. You can't read the text, so you most likely won't notice if characters have been swapped for something else. Line wrapping should also be done correctly, which is difficult to verify with lorem ipsum. All other content-related errors are likewise masked behind the nonsense. If I need plenty of text, I usually take it from Project Gutenberg.

Many organizations still do scripted manual testing. There you have to decide whether the test data is part of the test case or not. If it is not, the tester is free to use different kinds of inputs, but some of them might be weak, and reproducing a situation suffers. If it is, the same data is used over and over again, and in the end we can say that the application works at least with the given data, but we can't be sure about other data.

In my opinion you should consider which inputs are important, and specify those. If it is possible to say "this kind of data" instead of "exactly this data", prefer "this kind". That allows more variation in the inputs than an exact value, but still tests what is wanted. "Not so important" configuration data should be specified so that it is easy to take into use. I've been on a project where configuring the test environment took almost a whole day. In a case like that, all configurations should be specified so that I can find them right away and, in the best case, also take them into use right away.

Thanks to @HelenaJ_M for the question about reusing data versus creating it from scratch every time.