Production data in testing

Posted: 30.7.2014 in Testing

Several times I’ve seen organizations where the test data is taken from production. I have to say that in that case the normal cases are covered quite well, but I hate that kind of approach. Why?

There are a couple of different risks. The first and probably the biggest one is the security and privacy risk. Production data usually contains security- or privacy-sensitive information, e.g. e-mail addresses, usernames, password hashes, birthdays and so on. Usually the test environment is not as well secured as the production environment. E.g. for debugging reasons testers and developers can access the database, and the testing may be outsourced to somewhere outside the organization. During testing we can accidentally or on purpose see sensitive data which we definitely should not see.

Another reason is that the data can often trigger unwanted traffic to real users. The traffic can be e-mail or even snail mail. Imagine a situation where the tested application contains healthcare patient records. A tester creates a death certificate and presses the button “Send to the relatives.” Normally the test environment should stop the message there. But a misconfigured environment can be connected to real printing and mailing services. This can cause heart attacks for the recipients.

Production data can have gaps, too. There can be a “one in a million” case which hasn’t happened in production yet. As testers, our data should cover that as well. So even with production data we have to check whether the test data covers all required cases. With half a million user records that can be an enormous task.
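One way to make that check less painful is to write each required case down as a small predicate and let a script report which cases no record satisfies. The sketch below is only an illustration: the record fields (`name`, `age`, `email`) and the three example cases are assumptions, not taken from any real system.

```python
# Hypothetical "required cases", each expressed as a named predicate.
# Both the cases and the record fields are invented for illustration.
REQUIRED_CASES = {
    "minor (under 18)": lambda u: u["age"] < 18,
    "no e-mail address": lambda u: not u["email"],
    "name with non-ASCII characters": lambda u: not u["name"].isascii(),
}


def uncovered_cases(users):
    """Return the names of required cases that no record in `users` matches."""
    missing = set(REQUIRED_CASES)
    for user in users:
        # Drop every still-missing case that this record satisfies.
        missing -= {name for name in missing if REQUIRED_CASES[name](user)}
        if not missing:
            break  # everything is covered, no need to scan further
    return sorted(missing)


users = [
    {"name": "Teemu-CUSTOMER", "age": 34, "email": "teemu@example.com"},
    {"name": "Päivi-CUSTOMER", "age": 16, "email": "paivi@example.com"},
]
print(uncovered_cases(users))  # ['no e-mail address']
```

The same list of predicates works whether the records come from production or are artificial, so it also documents which cases the test data is supposed to contain.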

Instead of using the production data, investigate what kind of data it contains. Make the artificial data based on that information. While doing this, consider which aspects are important for the testing. E.g. if the age distribution doesn’t matter for the testing, why spend time on it? And if it does matter? Then investigate what the distribution looks like and base your data on that.
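As a rough sketch of basing artificial data on a known distribution, the Python snippet below draws ages from a bucketed summary. The bucket boundaries and shares are made up for illustration; in practice they would come from an aggregate query against production, not from copying individual records.

```python
import random

# Assumed summary of the production age distribution: (low, high, share).
# These numbers are invented for illustration only.
AGE_BUCKETS = [
    (18, 29, 0.25),
    (30, 49, 0.40),
    (50, 69, 0.25),
    (70, 95, 0.10),
]


def sample_ages(count, seed=None):
    """Draw `count` artificial ages that follow the bucket shares above."""
    rng = random.Random(seed)
    weights = [share for _, _, share in AGE_BUCKETS]
    # Choose a bucket for each artificial user according to its share.
    buckets = rng.choices(AGE_BUCKETS, weights=weights, k=count)
    # Pick a uniformly random age inside each chosen bucket.
    return [rng.randint(low, high) for low, high, _ in buckets]


ages = sample_ages(10_000, seed=1)
share_30_49 = sum(30 <= a <= 49 for a in ages) / len(ages)
print(f"share aged 30-49: {share_30_49:.2f}")  # should land near 0.40
```

Passing a fixed `seed` keeps the generated data reproducible between test runs, which makes failures easier to investigate.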

Often in performance testing the test environment (or even the production environment) is loaded with production data. Even in those cases I try to write my performance testing scripts so that they do not touch any sensitive data (e.g. user data). I always try to create artificial data which can be easily cleaned up from the database. E.g. if there are first and last names, I usually create them as the first name plus -CUSTOMER (Teemu-CUSTOMER) and the last name plus -TESTING (Vesala-TESTING). This kind of naming distributes the data correctly in the database, and later it is easy to find such users and remove them. If you use Fake Name Generator (http://www.fakenamegenerator.com/) to create your data, it is very easy to modify the names with a short Perl script.
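The suffix idea can be sketched in a few lines. The example below uses Python rather than Perl, and an in-memory SQLite database stands in for the real one; the `users` table, its column names and the sample names are all hypothetical.

```python
import sqlite3

FIRST_SUFFIX = "-CUSTOMER"
LAST_SUFFIX = "-TESTING"


def tag_name(first, last):
    """Append the marker suffixes so the record is recognisable as test data."""
    return first + FIRST_SUFFIX, last + LAST_SUFFIX


def load_and_clean(existing_rows, generated_names):
    """Insert tagged test users next to existing rows, then remove them again."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", existing_rows)
    tagged = [tag_name(first, last) for first, last in generated_names]
    conn.executemany("INSERT INTO users VALUES (?, ?)", tagged)
    # Cleanup: only rows where both names carry their suffix are deleted.
    cursor = conn.execute(
        "DELETE FROM users WHERE first_name LIKE ? AND last_name LIKE ?",
        ("%" + FIRST_SUFFIX, "%" + LAST_SUFFIX),
    )
    deleted = cursor.rowcount
    remaining = conn.execute("SELECT first_name, last_name FROM users").fetchall()
    conn.close()
    return deleted, remaining


deleted, remaining = load_and_clean(
    [("Maija", "Virtanen")],
    [("Teemu", "Vesala"), ("Anna", "Korhonen")],
)
print(deleted, remaining)  # 2 [('Maija', 'Virtanen')]
```

Requiring both suffixes in the delete condition lowers the chance of removing a real user whose name happens to end with one of the markers.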

Comments
  1. Tim Hall says:

    We’ve frequently used production data for testing, although we’ve taken steps to anonymise it first using scripts that randomise all personal identification.

    • Teemu Vesala says:

      That approach has its own risks, too. It is very difficult to create truly anonymous data if any real information is left untouched. There can unintentionally be traces to real persons which you don’t think of. E.g. in a patient record system, if there is a check box for “Dangerous blood contact”, all location- and age-related information should be removed from the data (e.g. which hospital, town, doctor info, nurse info etc.). Otherwise the person can be connected quite easily to the issue.

      It seems that I’ll have to finish my text about anonymization related problems quite soon.

      • Tim Hall says:

        We had similar issues with confidential information; the system was a housing management system that contained a lot of confidential financial information, for example rent arrears.

        Also details of repairs and maintenance, which could be traced via contractors, which also needed to be anonymised.

        Worst bit was using customer data for a sales demo integrating with Google Maps. In the end we selected a range of properties and gave them new locations in what was actually an industrial estate in Scunthorpe.

  2. Matt Spencer says:

    I’m also usually looking for a test data service for our business application that creates random credit card numbers and UPS tracking numbers WITH VALID FORMAT (of course, none of them are real).

    fakenamegenerator[dot]com is my choice. Recently I have also been using namegenerators[dot]org and am very happy with it.
