Anonymization is the process of removing all sensitive data from production data before it is used in testing. The goal is test data which is anonymous and can be used without security and privacy risks. It has to be done very carefully, and the structure of the production data must be understood well before it can succeed. In this blog entry I show how difficult it actually is and how it can fail.

The software under test is a Twitter-like website. Users can send public and private messages to each other. Our simplified forum includes the following three database tables:

User profile – a numerical identifier for the user (UID), username, password hash and e-mail address
Message – an ID for the message, the UID of the submitter, the content as text, and a timestamp
Private message – an ID for the private message, the UID of the submitter, the UID of the receiver, the content, and a timestamp
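
As a sketch, the schema could look like the following in SQLite (the table and column names here are my own invention, not from any real system):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway database, just for illustration
    conn.executescript("""
        CREATE TABLE user_profile (
            uid           INTEGER PRIMARY KEY,
            username      TEXT NOT NULL,
            password_hash TEXT NOT NULL,
            email         TEXT NOT NULL
        );
        CREATE TABLE message (
            id      INTEGER PRIMARY KEY,
            sender  INTEGER REFERENCES user_profile(uid),
            content TEXT,
            created TIMESTAMP
        );
        CREATE TABLE private_message (
            id       INTEGER PRIMARY KEY,
            sender   INTEGER REFERENCES user_profile(uid),
            receiver INTEGER REFERENCES user_profile(uid),
            content  TEXT,
            created  TIMESTAMP
        );
    """)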

The site is already in production and open, so anyone can check what others have written in the public messages. The profiles are also public, and the UID is used to identify them.

Let’s start to anonymize this data. If we start from the username and message contents, are those really enough to make the data anonymous? Definitely not. If the original forum is public, anyone can still check private information, like who has messaged whom. The numerical UID is still the same, so we have to change that as well. And if we want to keep the statistics correct, we can’t just assign a random number to every message and user profile. E.g. if the account “Teme” had UID 1, then to maintain proper statistics we have to convert every occurrence of UID 1 to the same new value, e.g. 234.
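
A minimal sketch of such a consistent mapping, assuming the rows are plain dicts (the helper name is mine):

    import random

    uid_map = {}  # old UID -> new UID, shared by every table

    def pseudonymize_uid(old_uid):
        """Return the same new UID every time the same old UID is seen."""
        if old_uid not in uid_map:
            new_uid = random.randrange(1, 10**9)
            while new_uid in uid_map.values():  # avoid collisions
                new_uid = random.randrange(1, 10**9)
            uid_map[old_uid] = new_uid
        return uid_map[old_uid]

    messages = [{"sender": 1, "content": "hello"}, {"sender": 1, "content": "again"}]
    for row in messages:
        row["sender"] = pseudonymize_uid(row["sender"])
    # Both rows now carry the same new UID, so per-user statistics stay intact.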

It is still very easy to find out that UID 1 was changed to 234. The features which reveal the information are the timestamps and the number of messages. So we also have to change all the timestamps, and the new timestamps must change the order of the messages to keep things anonymous. We have to change the number of messages in each profile as well.

Even after these changes we can still, in some cases, find the real profile from the test data. A small piece of external information, like “I know that this person has messaged that person”, can help an attacker find the real profile.

Instead of anonymizing the production data, the test data should be based on a model of the production data. For example, the test data should have the same number of users and messages as the production data. On the other hand, it usually doesn’t matter whether the users have exactly the same distribution of messages.

Instead of using production data like that, generate your own data and inspect what kinds of properties are the most important for your testing. Good test data increases the test coverage and the possibility of finding bugs.
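
A rough sketch of such model-based generation, where only the counts are taken from production and everything else is artificial (all numbers here are invented):

    import random
    import string

    random.seed(42)  # reproducible test data

    N_USERS, N_MESSAGES = 1000, 20000  # counts taken from the production model

    users = [f"user{i:04d}" for i in range(N_USERS)]

    def random_text(words):
        return " ".join(
            "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 10)))
            for _ in range(words)
        )

    messages = [
        {"uid": random.randrange(N_USERS), "content": random_text(random.randint(1, 40))}
        for _ in range(N_MESSAGES)
    ]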

See also Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization by Paul Ohm

Production data at testing


Several times I’ve seen an organization where the test data is taken from production. I have to say that in that case the normal cases are covered quite well, but I hate that kind of approach. Why?

There are a couple of different risks. The first, and probably the biggest, is the security and privacy risk. Production data usually contains some security- or privacy-sensitive data: e.g. e-mail addresses, usernames, password hashes, birthdays and so on. Usually the test environment is not as well secured as the production environment. E.g. for debugging reasons testers and developers can access the database, or the testing can be outsourced outside the organization. During testing we can, accidentally or on purpose, see sensitive data which we definitely should not see.

Another reason is that the data often triggers extra traffic towards the user. The traffic can be e-mail or even snail mail. Imagine a situation where the tested application contains healthcare patient records. The tester creates a death certificate and presses the button “Send to the relatives”. Usually the test environment should stop there. But a misconfigured environment can be connected to real printing and mailing services. That can cause a heart attack for the receiver.

Production data can have problems of its own, too. There can be a “one in a million” case which simply hasn’t happened in production yet. As testers, our data should cover that as well. So even with production data we have to check whether the test data covers all the required cases. With half a million user records that can be an enormous task.

Instead of using the production data, investigate what kind of data it contains and make artificial data based on that information. While doing this you should consider which aspects are important for the testing. E.g. if the age distribution doesn’t matter for the testing, why spend time on it? And what if it does? Then investigate what the distribution looks like and base your data on that information.
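
As a sketch: measure the distribution from production once, then sample artificial values from it. The bucket boundaries and weights below are invented:

    import random

    random.seed(1)

    # Age buckets and their shares, measured once from production (invented here).
    buckets = [(18, 25), (26, 35), (36, 50), (51, 80)]
    weights = [0.30, 0.35, 0.20, 0.15]

    def random_age():
        low, high = random.choices(buckets, weights=weights)[0]
        return random.randint(low, high)

    ages = [random_age() for _ in range(10000)]  # artificial but realistically distributed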

Often in performance testing the test environment (or even the production environment) is loaded with production data. Even in those cases I try to write my performance testing scripts so that they do not touch any sensitive data (e.g. user data). I always try to create artificial data which can easily be cleaned up from the database. E.g. if there are first and last names, I usually create them with the suffixes -CUSTOMER (Teemu-CUSTOMER) and -TESTING (Vesala-TESTING). This kind of naming distributes the data realistically in the database, and later it is easy to find such users and remove them. If you use Fake Name Generator (http://www.fakenamegenerator.com/) to create your data, it is very easy to modify the names with a short Perl script.
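
A rough equivalent of that script in Python, assuming a Fake Name Generator CSV export with GivenName and Surname columns (the column names may differ in your export):

    import csv
    import sys

    # Tag the generated names so the test users are easy to find and delete later.
    with open(sys.argv[1], newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            first = row["GivenName"] + "-CUSTOMER"  # e.g. Teemu-CUSTOMER
            last = row["Surname"] + "-TESTING"      # e.g. Vesala-TESTING
            print(first, last)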


Last time I wrote about test data in manual testing. This blog entry has several words about test data in test automation.

Now the computer takes care of submitting and checking the data, so we can use a different kind of data. We don’t have to keep the data simple, and we can use more data than during manual testing. E.g. you don’t want to type 100 megabytes of data, but test automation can do that. Test automation can also detect small changes which are difficult for a human to notice. For example, telling apart I (capital i), l (lower-case L) and 1 (number one) can be difficult for a human. But test automation usually compares data at the binary level, and all of those have different ASCII codes, so it can detect them.
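
A trivial illustration: to a human eye the strings look alike, but a byte-level comparison separates them immediately:

    for ch in "Il1":  # capital i, lower-case L, digit one
        print(ch, hex(ord(ch)))  # 0x49, 0x6c, 0x31 -- all different codes

    assert "Il1" != "I11"  # the machine never confuses the two strings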

On a discussion forum, checking that a 100-message thread is paged correctly would be nearly impossible for a manual tester. Or at least the result would be poor, because he would most likely just put a few characters in each message. But with test automation we can create longer and shorter messages with realistic-looking data. What is realistic in this case? And how do we know? The Internet is full of forums, so we can pick almost any of them for analysis. We could measure e.g. how many words each message has on average, and what the standard deviation is. Then the test data can be constructed based on that information.
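
As a sketch, assuming we have scraped the messages of some public forum into a list of strings:

    import random
    import statistics

    corpus = ["first scraped message ...", "another scraped message ..."]

    lengths = [len(m.split()) for m in corpus]
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths) if len(lengths) > 1 else 0.0

    def realistic_message(word_pool):
        """Build a message whose word count follows the measured distribution."""
        n = max(1, round(random.gauss(mean, stdev)))
        return " ".join(random.choices(word_pool, k=n))

    words = [w for m in corpus for w in m.split()]
    thread = [realistic_message(words) for _ in range(100)]  # a 100-message thread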

Test automation can use random data which is created at run time by the computer. There are a few issues which we have to remember. The tests must be reproducible, so the random generator should not be cryptographically strong: it should be such that when we set its seed to a specific value, it always produces the same sequence. The initial seed can be derived from the time, but after that every value must be based on the previous one. The logging should also make sure that all steps and initial states, including the seed, are logged.
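
A minimal sketch in Python: a dedicated, seeded generator whose seed is always logged, so any run can be repeated exactly:

    import logging
    import random
    import time

    logging.basicConfig(level=logging.INFO)

    seed = int(time.time())                   # the initial seed may come from time ...
    logging.info("test data seed: %d", seed)  # ... but it MUST be logged

    rng = random.Random(seed)  # own instance; deterministic, not cryptographically strong

    data = [rng.randint(0, 9999) for _ in range(10)]
    # Re-running with the logged seed reproduces exactly the same inputs:
    assert [random.Random(seed).randint(0, 9999) for _ in range(10)] == data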

One good way to cheat at randomness is to create a predefined list of data. You can reproduce it every time, and it is “random enough” if you trust that a different order doesn’t change the result. In one project I wanted to test how an application reacted to broken input files. Manually it would have been impossible to test, but with a computer it was quite simple. I just had to create the inputs, and I used Radamsa for that (http://code.google.com/p/ouspg/wiki/Radamsa). The seeds for it were a couple of working inputs. During testing I found one major crash which could also have been a security issue. Without test automation and the possibility to test a huge mass of data, we could have left a major security issue in the application.
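
Roughly how such a Radamsa run can be driven (the application command and file names are placeholders; check Radamsa’s own help for its current flags):

    import subprocess

    SEEDS = ["working1.xml", "working2.xml"]  # a couple of valid input files

    for i in range(1000):
        fuzzed = f"fuzz-{i}.xml"
        # Ask radamsa to mutate the valid samples into one broken input.
        with open(fuzzed, "wb") as out:
            subprocess.run(["radamsa", "--seed", str(i), *SEEDS], stdout=out, check=True)
        # Feed the broken file to the application under test and watch for crashes.
        result = subprocess.run(["./application-under-test", fuzzed])
        if result.returncode < 0:  # killed by a signal, most likely a crash
            print("crash with", fuzzed, "signal", -result.returncode)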

It seems that I should have structured my series a bit differently. There are still plenty of important things to describe. They touch test automation as well as manual testing, but also security and performance testing. And then there are plenty of miscellaneous issues, like using production data.


Security is an important part of modern systems. But as testers we usually think only about application security. Unfortunately, the reality is that the easiest way to intrude into a system is to cheat people. This is a story about me and Santa Claus.

This year I won’t be without Christmas presents! Actually, I managed to find out how to change my present allocation to 10% of Santa’s budget. It started with an e-mail I got. I investigated its headers, and there was the domain santa-claus-intelligence-agency.com. (I’ll call the agency SCIA.) It got me really excited! So I started to dig around.

The first thing I found was their beta site, at public-beta.santa-claus-intelligence-agency.com. It still had some debugging options on, so the web application revealed the structure of the file system. The most interesting path was ../snippets/handlefiles.php, so I started to investigate what it contained. It seemed that the developers of SCIA had poor knowledge of version control: there was a backup file, handlefiles.php~. I read it, and it mentioned a browser-accessible directory called ../uploads/. That directory was misconfigured, and I got the directory listing. I decided to check every file in it. The most interesting one was birthdays.xls.

I opened birthdays.xls. It was a real treasure! It listed all the SCIA elves, their birthdays, and their e-mail addresses. I was much wiser about their internal structure after that. I tried to google more information about them, but I found only a couple of blogs – one about fishing, another about snow sports – and a couple of LinkedIn accounts. The birthday of the “fisher elf” was on the 8th of June. I found the file in May.

When the 8th of June approached, I bought the domain fishing-stuff.com and set up a shopping site. Then, on the 7th of June, I sent mail to the fisher elf. I wrote:

Congratulations! Your birthday is near, and we at Fishing-Stuff.com want to give you a present. You’ll get a 40% discount on all items which you order during this week.

The hook was set! On that very same day it happened: someone from secret-nat.santa-claus-intelligence-agency.com logged in. Now I had some additional information about their network.

First I decided to check the names of the IPs they had. I found webmail and proxy. So I pointed my browser to the webmail first, and I realised that their mail application was vulnerable to an open redirect! That means that if I manage to get a user to click a specific link, he can end up on ANY website I want him to end up on. I continued the investigation and read the LinkedIn profiles carefully. From them I managed to find the code names of their projects.

The Lady of Chaos has given me imagination. So I started to think about how to get someone to click my malicious link and submit his e-mail credentials to me. The first step was to create a “login page” which looked exactly like the one they used in their webmail. The only difference was that it stored the username and password to a file which I was able to read, and then redirected back to their original e-mail application.

The mail was:

Hi, higher official elf Jack Elvish. Our new project XyzEdfg requires your input. Could you please check our site https://webmail.santa-claus-intelligence-agency.com/index.php?redirect=http://10.128.34.53/index.php.

I used the e-mail address of Jone Lesselvish, which was Jone.Lesselvish@santa-claus-intelligence-agency.com. His LinkedIn profile mentioned XyzEdfg, so I thought it would be safe to use. E-mail is an insecure protocol where you can use any sender name you want. It does not ask for passwords; it just trusts what it gets.

I had to wait only a few hours to get the username and password of Jack Elvish. The next step was easy: I just logged into his e-mail and read what kinds of systems they had. The most interesting was the “accounting” system. It was part of the internal network, so to intrude there I had to find some way into their internal networks. The next step was to find an error in the firewalls or another insecure application. I decided to take the risk and investigate the computer named proxy. Port 8090 had an open HTTP proxy, and it was also able to access the internal network.

At this point I decided that it was too risky to attack right away, so I waited until the end of September. If they had noticed me before, they couldn’t connect my new attempts to my older attacks. At the end of September I finally logged in to Jack Elvish’s e-mail. The password was still working! Then I tried to log in to the accounting system. It asked for an e-mail address, so I used Jack’s address and tried the same password as for the e-mail. Oh well, it didn’t work. So I used the “I forgot the password” functionality: just the e-mail address, and the answer to the question “What’s the month of your birthday?” And Jack had a new password, which I also knew right away.

I had full access to the accounting system. There were all kinds of secrets! SCIA had done a wonderful job of tracking people; even the NSA couldn’t do the same. Now I know all YOUR secrets! I was able to tune everything. After my modifications I had cured cancer and AIDS and managed to bring peace between Palestine and Israel. In the end my present allocation was 10%!

Insecurity is not just application security; it is more or less a human aspect. Misconfigured servers, small security issues, fooling people with different techniques, and insecure protocols – they all managed to make this task so simple. No matter what kind of testing we do and where we are working, we should always ask: “Can someone misuse this? Can he break the confidentiality, integrity or availability of the system?” If the answer is “yes” at any point, then there is a problem, and the risks should be estimated. If the service has a public interface, rather overestimate the risks than underestimate them.

This was originally written for Qentinel’s internal newsletter.


This blog entry starts my test data related blog series. I will post a new entry every now and then. If you have questions or comments, please let me know. I like to learn and I like to get new ideas, so disagree freely with me.

Throughout the series I’m using a single example application: discussion forum software (like e.g. SMF or phpBB). The key features are registration, logging in and out, different kinds of user rights, posting public and private messages, reading them, and searching.

One key feature of manual testing is the human aspect. A human submits the data, and a human checks the results. All test data should take that into account: we have a limited ability to notice small differences.

A normal bias which I’ve seen in manual testing is the use of repetitive patterns. If the date format is ddmmyy, even a tester quite often uses a date like 010101 or 111111. They are simple and quick to type, and always valid. But as test data they miss many possible error cases. What if the age calculation swaps day and month? Data like that won’t notice it. Much better is something which makes missing the mistake impossible. It could be, for example, 230142: the day, month and year parts are all different, so if something fails, it is immediately noticed.
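
A small illustration of why 230142 beats 111111 here; the swap bug is simulated on purpose:

    from datetime import datetime

    def buggy_parse(ddmmyy):
        """Simulated bug: day and month are swapped."""
        return datetime.strptime(ddmmyy, "%m%d%y")  # should be "%d%m%y"

    print(buggy_parse("010101"))  # looks fine, the bug stays hidden
    try:
        print(buggy_parse("230142"))  # month 23 is impossible: the error exposes the bug
    except ValueError as e:
        print("bug exposed:", e)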

The forum also requires text data. When testing the messages, the test needs plenty of text. Usually testers and developers use lorem ipsum as test data, but that practice should be avoided. There are multiple failure points. You can’t read the text, so you most likely won’t notice if characters are swapped to something else. Wrapping of the lines should also be done correctly, which is difficult to see from lorem ipsum. And all other content-related errors are masked behind the nonsense. If I need plenty of text, I usually take it from Project Gutenberg.

Many organizations are still doing scripted manual testing. There you have to decide whether the test data is part of the test case or not. When it is not, the tester has the possibility to use different kinds of inputs, but some of them might be weak, and reproducing a situation suffers. When it is, the same data is used over and over again, and in the end we can say that the application works at least with the given data, but we cannot be sure about any other data.

In my opinion you should consider what the important inputs are, and specify those. If there is a possibility to say “this kind of data” instead of “exactly this data”, rather use “this kind”. It gives a larger variation of inputs than an exact value, but is still able to test what is wanted. “Not so important” configuration data should be specified so that it is easy to take into use. I’ve been on a project where configuring the test environment took almost a whole day. In that case all configurations should be specified so that I can find them right away, and in the best case also take them into use right away.

Thanks to @HelenaJ_M for the question about reusing data versus creating it from scratch every time.


Test automation and testability should appear in the same sentence. If the application is not testable, it will be hard to automate. This short blog entry lists a few things which benefit the testability of web applications.

What kind of application is testable? The primary point is that it is easy to test and doesn’t need plenty of guessing. Test automation should be able to easily find the components which it clicks or edits. There should also be some easy way to check whether the results are as expected. A testable application helps to create test automation which tolerates changes in the application.

Most web applications have forms. They are the most difficult part of test automation. Many modern web application frameworks create their own strange names and ids for fields. When the application is modified, the field names change as well, or they can be more or less random. For example, a SharePoint-based web application has field names like ‘ctl00_m_g_18447b1a_f553_3331_d034_7f143559a4fe_ctl01_stateSelection_input’. In the worst case the field name is tied to the session, or changes dynamically when the page is revisited at a later point of the test. A testable application should have a simple and descriptive id or name attribute for input fields and links; in this case it could be “stateSelection”. In most test automation frameworks it is much easier to refer to the name attribute than to some overly cryptic XPath. It’s also more debuggable than e.g. //div[@id=’mainpart’]/table[0]/tr[1]/input[contains(@id,’stateSelection’)].
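
With Selenium’s Python bindings, for example, the difference looks like this (the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com/form")  # placeholder URL

    # Robust: a short, descriptive name attribute survives layout changes.
    field = driver.find_element(By.NAME, "stateSelection")

    # Fragile: breaks as soon as the surrounding structure changes.
    # field = driver.find_element(
    #     By.XPATH, "//div[@id='mainpart']//input[contains(@id,'stateSelection')]")

    field.send_keys("Finland")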

A test should be able to assert whether the response was correct or not. This is usually done by checking whether some specific text or component is on the web page. To simplify the testing, the HTML code should be as simple as possible to avoid misinterpretations. It can contain e.g. HTML comments describing what the web application itself thought about the result. But no matter what the application does, it should never spit out a stack trace or some other error message which reveals the internals of the application. That information belongs only in the logs. In an error case the content should have something which can be easily linked to the error log. That doesn’t help only the test automation; it also helps to debug production-time problems.
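
One way to do that, sketched here with invented names: show the user (and the test) only a short error reference, and log the details server-side under the same reference:

    import logging
    import traceback
    import uuid

    logging.basicConfig(filename="server.log", level=logging.ERROR)

    def render_error_page():
        ref = uuid.uuid4().hex[:8]  # short reference shared by the page and the log
        logging.error("error %s: %s", ref, traceback.format_exc())
        # The page reveals nothing internal but links cleanly to the log entry.
        return f"<html><body>Something went wrong. Reference: {ref}</body></html>"

    try:
        1 / 0
    except ZeroDivisionError:
        print(render_error_page())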

Proper server-side logging, and simple access to the logs, also increases testability. In case of failure the logs should have stack traces, exception information and other debugging information. It is very useful if the test automation can retrieve these in case of an error and add them to its own logs.

Ajax is a real pain for test automation, performance testing, manual testing and security testing. But developers and users love it! One major problem with Ajax is that the web page loading is marked done before all the content has been loaded, so it is very difficult for the automation to understand whether the loading has succeeded or not. There are a couple of things which should be done. First, an agreement between testers and other stakeholders on the maximum time the content is allowed to take to load. Second, some clear mark in the dynamic content when it has been loaded and rendered to the screen. This can be e.g. a small comment in the HTML page. That way the test automation can have good timeouts, and it can detect the finished page.
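
If the page signals readiness with e.g. an element id like “content-ready” (my invented convention), the automation can wait for exactly that, with the agreed timeout:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("https://example.com/ajax-page")  # placeholder URL

    # Agreed maximum loading time: 10 seconds. The page adds the marker
    # element only when all Ajax content has been rendered.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "content-ready"))
    )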

This short blog entry describes just a small part of web application testability, but in my experience these are the most common problems.


When I see a number input, I have several patterns which I like to test. Here are a few of them:

  • 08, 0100 – the reason behind these is that text-to-integer conversion might interpret them as octal numbers. In that case 08 is an illegal value, which can result in strange things.
  • 8-1 – sometimes an SQL query ends up calculating that
  • 0xa – text-to-integer conversion might translate the number as hex
  • 1e3 and 1e-3 – those might be interpreted as 1*10^3 and 1*10^-3 (= 1000 and 0.001)
  • 2147483646, 2147483647, 2147483648 – around the maximum of a signed 32-bit int
  • -2147483647, -2147483648, -2147483649 – around the minimum of a signed 32-bit int
  • 4294967294, 4294967295, 4294967296 – around the maximum of an unsigned 32-bit int
  • Some huge number far beyond the previous ones

And let’s go into a bit more detail and some real-life situations with a few of these.

In one C++ project the logic was the following:

    #include <iostream>
    int x;                       // input number X
    std::cin >> x;
    if (x + 2 < FIXED_NUMBER)    // x + 2 overflows when x is near INT_MAX
        for (int i = 0; i <= x; ++i) { /* loop from 0 to x */ }

So if we input anything sensibly below 2147483646, we get the correct functionality. But if we input 2147483646, x + 2 suddenly overflows to -2147483648, the comparison passes, and we enter a loop of over two billion rounds. This is far from the expected result, and in the worst case it even opens a security issue. That system didn’t crash; it just stalled for 15 minutes, which blocked some batch processing.

Then another issue is 8-1. I usually test it on web applications where I expect the number to be an index. If the result is the same with 7 and with 8-1, there is most likely an SQL injection issue. In the code there is a query like SELECT * FROM table WHERE id=$input$. If it calculates 8-1, then it can also parse any other query fragment. That can be e.g. 8+or+1=1, which might cause some really exciting results. Or it can even be a query which dumps out the user database.
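
The underlying mistake and its fix, sketched with SQLite (table and column names invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO item VALUES (7, 'seven')")

    user_input = "8-1"

    # Vulnerable: the input is pasted into the query, so "8-1" is evaluated to 7.
    rows = conn.execute(f"SELECT * FROM item WHERE id={user_input}").fetchall()
    print(rows)  # [(7, 'seven')] -- the same result as with plain 7: injection!

    # Safe: with a bound parameter, "8-1" is just a string, not arithmetic.
    rows = conn.execute("SELECT * FROM item WHERE id=?", (user_input,)).fetchall()
    print(rows)  # []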

08 is interesting. I’ve seen it misbehave only in a build number and as a compile-time error, but give it a shot; you never know what kind of number parser is in the engine room. It can lead to strange errors or some other fancy effect which the user might dislike. In that case try 0100 as well, because if it is parsed as octal, the result is 64, which is clearly wrong. 0xa is the same kind of thing. If the parser parses it as 10, then you will be in trouble if users don’t know that 0x is the prefix for hexadecimal numbers. 0x100 is not the same as 100; it is 256.
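
Python’s own parser demonstrates the prefix handling nicely when the base is left for it to guess:

    print(int("0x100", 0))  # 256: the "0x" prefix means hexadecimal
    print(int("0o100", 0))  # 64: the "0o" prefix means octal
    try:
        print(int("08", 0))  # a leading zero is rejected, like the old octal rules
    except ValueError as e:
        print("parser refused:", e)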

1e3 is an exciting one. I’ve met that kind of input being parsed wrongly once. The system was going through documents and catching all URLs. For some reason, if a URL contained that kind of string, it was parsed as a number and converted to normal format; e.g. 1e3 would have become 1000.

Of course I also try the normal border cases, equivalence classes, some random text, etc. But these are the cases outside them. Do you have some specific patterns which you try? And why do you try them? Leave me a comment.