Bias in Testing

I’ve been trying a number of strategies to improve the performance of a very complex form (Oracle Forms 6i) currently in development. We’ve already done a fair amount of work making the code as efficient as possible while keeping it reasonably maintainable, so there doesn’t seem to be any low-hanging fruit left to pick.

One aspect of the performance is the form startup time, which is on the order of 9-17 seconds. The kinds of changes I’m testing don’t make an obvious difference to this time, so to work out whether they’re worthwhile I have to be a bit more scientific about performance measurement. That means I have to use statistics (disclaimer: I am not a statistician, so take what I say with a grain of salt and do your own research!). I test the form with the original code and with the modified code to see whether a particular code change makes any measurable difference to the performance.

I start the form, wait for it to load, then run our debug tool, which reports (to the nearest second) when the form started and when control was passed to the user. I record these times in a spreadsheet, close the form, and repeat the test two or three times, recording each result. I then take the average and write it down. Next I change back to the original (unimproved) code, run the whole test again, and compare the average results. Sometimes the average is better than the original, sometimes worse, and sometimes the same.

This is what I was doing today – I’d made a change which I thought might knock a second or two off the startup time, so I ran my two-phase timing test:

Improved code timings: 19s, 8s, 7s, 7s
Original code timings: 9s, 8s, 8s

What can I conclude from these data? What I did conclude was that the first timing (19s) was an outlier – it was the first time I’d run the form today so it probably had to load various caches and possibly even perform hard-parses for some of the queries. The original code had an average startup time of 8.3 seconds, and the improved code loaded in 7.3 seconds on average (discounting the outlier). My improvement saved 1 second!
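Just to show the arithmetic (nothing here beyond the handful of numbers above), here it is in Python:

```python
from statistics import mean

improved = [19, 8, 7, 7]
original = [9, 8, 8]

print(round(mean(original), 1))      # 8.3  -> the original code's average
print(round(mean(improved), 1))      # 10.2 -> the 19s run dominates the mean
print(round(mean(improved[1:]), 1))  # 7.3  -> the "saved 1 second" figure, once 19s is discounted
```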

Or did it? These tests were performed between 8:16 and 8:27 on a Tuesday morning. I later thought of a slightly different way of doing the code improvement, implemented it and ran the whole test again:

Improved(#2) code timings: 11s, 9s, 12s, 9s
Original code timings: 11s, 12s, 11s, 7s, 8s

The averages have risen to 10.3 and 9.8 seconds, and the “improved” code actually looks worse here! This second battery of tests was performed between 9:13 and 9:22 on the same morning. You’ll also notice that I ran the test more often this time. Why? Well, I couldn’t believe that the improved code had slowed the form to an average of 10.7 seconds over the first three runs, so I ran it a fourth time; and due to my increasing suspicion that these timings were more random than I’d realised, I ran the final set of original-code tests five times. In the end I decided the slower times must be because more developers had arrived by 9am, making the app server busier and throwing my test results into confusion.
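With hindsight, looking at the spread of each sample as well as its average would have told me how little the averages mean here. A quick check using only the numbers above and the standard library:

```python
from statistics import mean, stdev

improved2 = [11, 9, 12, 9]
original2 = [11, 12, 11, 7, 8]

for name, data in [("improved #2", improved2), ("original", original2)]:
    print(f"{name}: mean {mean(data):.2f}s, sample std dev {stdev(data):.2f}s, n={len(data)}")

# improved #2: mean 10.25s, sample std dev 1.50s, n=4
# original: mean 9.80s, sample std dev 2.17s, n=5
# The half-second gap between the means is well inside the run-to-run
# noise, so these few runs can't tell the two versions apart.
```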

One thing I’ll confess is that I’ve had to resist a temptation all along to include only timings that seem “reasonable” – e.g. if the form took 19 seconds to load I’d immediately suspect some other process on the server had slowed my session down, and I’d be tempted not to include it in the results spreadsheet. I knew enough about statistics to know that doing this would risk introducing confirmation bias into the results, so I diligently recorded every result I observed.
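If I were scripting this instead of typing numbers into a spreadsheet, something like the following minimal Python sketch would make that discipline automatic (the function name and CSV path are placeholders, not part of our actual tooling):

```python
import csv
from datetime import datetime

def record_run(variant, startup_seconds, path="startup_timings.csv"):
    """Append one observation (variant, wall-clock time, duration) to a CSV log.

    Every run gets written down, however unreasonable it looks; deciding
    what to exclude (if anything) happens later, with all the data in view.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([variant, datetime.now().isoformat(), startup_seconds])

# e.g. the debug tool just reported a 19-second startup for the improved code:
record_run("improved", 19)
```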

After considering the apparent failure of my attempt to prove that my code change made a discernible difference to the performance, I started thinking a bit more about what was going on as I ran the tests and recorded the results. I realised that a more subtle form of confirmation bias had crept into my results because I hadn’t decided beforehand how many tests I would run. I simply ran the test three or four times, depending on whether I was in a hurry, and if it didn’t seem to be coming up with the numbers I was expecting, I’d keep running it until it did!

Writing it out like that, it seems blatantly obvious, but when you’re in the middle of running these tests and recording results it’s very easy to slip up.

What did I learn from all this? Before running any tests, write down exactly how many tests will be run, and at what times of the day. In other words, try to eliminate irrelevant variables such as concurrent activity on the server by spreading the tests randomly throughout the day, and try to avoid confirmation bias by predicting ahead of time how many tests will be needed to reduce the impact of outliers on the average result.
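As a rough illustration (the run count, working hours and function name are just assumptions for the sake of example, not a prescription), a pre-registered schedule could be generated up front like this:

```python
import random

def make_schedule(variants=("original", "improved"), runs_per_variant=10,
                  start_hour=8, end_hour=17, seed=1):
    """Fix the number of runs per version up front and scatter them
    randomly across the working day, so server load at any particular
    time of day doesn't favour one version over the other."""
    random.seed(seed)
    runs = [v for v in variants for _ in range(runs_per_variant)]
    random.shuffle(runs)
    times = sorted(random.uniform(start_hour, end_hour) for _ in runs)
    return [(f"{int(t):02d}:{int(t % 1 * 60):02d}", v) for t, v in zip(times, runs)]

for slot, variant in make_schedule():
    print(slot, variant)   # e.g. "08:41 improved", "09:05 original", ...
```

Work through the list in order, record every result however it looks, and there’s no room left for the on-the-fly decisions that tripped me up above.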