Five years ago I wrote a blog post, “You Can’t Improve What You Can’t Measure”. While “can’t” may have been a little strong, the point stands: making changes to your website can deliver results you did not expect. If you want to be sure a change actually makes your site better, you need some form of measurement.
Luckily, many content platforms today have metric collection built in. Page views, bounce rates, referrer metrics – these are increasingly common metrics to have easy access to. And if they are not built into the platform you are using, you can use tools like Google Analytics instead. So some form of metrics is almost always available.
So why bother with A/B testing?
What is A/B Testing?
A/B testing is where you split traffic to your site into multiple (two or more) experiences. One group is normally a control group that sees the current site unchanged, and the others each get some change introduced. The results for each change can then be compared against the control group to see whether a proposed change improves or degrades the site experience.
You can also run what is called a reverse A/B test, where you make the new experience the default and give a test group the old site. This can be useful for code development reasons. For example, implementing a new feature may have required significant code rework, so the reverse test is not exactly the old site, but close enough to confirm the new changes are an improvement. It’s a form of back testing after a change, to make sure your final implementation delivers the test results you were expecting.
Note that you don’t have to send the same volume of traffic to each experience. A test needs a reasonable amount of traffic for the results to be trustworthy, but less traffic is needed if less accuracy is required. For example, when back testing you may allocate the back test only a small percentage of traffic, since accuracy matters less – the back test is just a precaution.
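To make this concrete, here is a minimal sketch of deterministic, weighted variant assignment. The function name and the weights are my own illustration, not from any particular tool: hashing the user id means a returning visitor always lands in the same experience without storing any state, and the weights let you send, say, only 10% of traffic to a back test.

```python
import hashlib

def assign_variant(user_id: str, weights: dict) -> str:
    """Map a user id to a variant name according to traffic weights.

    Weights should sum to 1.0. The assignment is deterministic:
    the same user id always gets the same variant.
    """
    # Hash the user id to a stable point in [0, 1).
    digest = hashlib.md5(user_id.encode()).hexdigest()
    point = int(digest, 16) / 16**32
    # Walk the cumulative weight distribution to pick a variant.
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guard against floating-point rounding at 1.0

# 90/10 split, e.g. a back test where accuracy matters less
variant = assign_variant("user-42", {"new_site": 0.9, "old_site": 0.1})
```

An unequal split like this trades accuracy on the smaller arm for less exposure, which is exactly the precaution described above.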
A/B Testing Tools and Implications
There are several tools around to help with A/B testing. Google offers a free tool under the Google Analytics umbrella, Google Optimize, for collecting data on your tests and analyzing the results. It’s worth checking out the different tools to see which works best with your platform and budget. Remember your time is money too, so make sure the tool is not too hard to use in your environment.
Test Results Confidence
Given all this complexity, why use A/B testing instead of, say, changing the site once a day or once a week and watching the results? If you don’t want the complexity of full A/B testing this is an option, but it can be risky. There is frequently a seasonality to sales, so week-over-week comparisons struggle to avoid that bias. Seasonality can be due to holidays, sports events, the weather – there are all sorts of influences. You also need enough traffic for the test results to be statistically valid. If the test is not on a well-trafficked page, the length of time the test has to run may not be clear up front. Most (I would hope all!) A/B test platforms include confidence measurements – the mathematical confidence in the results based on the data collected so far. Google Optimize, with its ability to use Bayesian analysis, can provide actual conversion probabilities instead of p-values, so you can use it on a site with low traffic and make good decisions earlier.
Confidence is a measure of how trustworthy the test results are. It’s all a bit of a numbers game, but put simply, if your test shows a 5% improvement with a 10% margin of error, then the change could reasonably result in anywhere from a 15% improvement to a 5% degradation. That would be a low confidence change. You want to see improvements larger than the error margin. The longer you run a test and the more test data you collect, the greater the confidence you will have in the results.
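As a sketch of how the Bayesian approach mentioned above turns raw counts into a probability (my own stdlib-only illustration, not Google Optimize’s actual implementation): with a uniform prior, each arm’s conversion rate has a Beta posterior, and sampling from both posteriors estimates the probability that the variant genuinely beats the control.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=20000):
    """Estimate P(variant conversion rate > control conversion rate).

    With a uniform Beta(1, 1) prior, observing `conv` conversions out of
    `n` visitors gives a Beta(1 + conv, 1 + n - conv) posterior for the
    conversion rate. Monte Carlo sampling from both posteriors counts how
    often the variant comes out ahead.
    """
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples

# Made-up numbers: 500/10000 control vs 560/10000 variant conversions
p = prob_b_beats_a(500, 10000, 560, 10000)
```

A number like 0.95 here reads directly as “the variant is better 95% of the time”, which is easier to act on than a p-value, especially with the modest traffic volumes discussed above.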
This also means that A/B testing may be less valuable for low traffic sites. The length of time the tests need to run may simply be too long.
Back when I was at eBay, there were multiple A/B tests running all the time. It was almost a battle to be allocated a fraction of the available test bandwidth (even with eBay traffic volumes), as you had to be careful that tests from different teams did not collide. There were also questions such as how much ramp-up time you would need. People dislike change, so to correctly measure a change it was sometimes necessary to let it “bake in” first, then start collecting results. Otherwise you measured only the impact of changing, not the impact of the change once users got used to it.
A Quick Survey
I held a quick poll on Twitter regarding A/B testing. Sorry! No confidence score from me! With only 33 votes, I would not rate it very high. The results: 21% of respondents don’t bother with A/B testing, 6% thought it was too hard to try, 15% used it and found simple changes (e.g. CSS font, color, and positioning changes) sufficient, and 58% needed more.
I actually found this a positive result. It indicates to me that people wanting to do A/B testing take it seriously. They want to do more than just change the font size or the position of content on the page. Many of the A/B testing blog posts I find on the web seem to stop at the simple things. They talk about font changes, margin changes, or moving calls to action above or below the fold. While important (and easily done by marketing teams without developer support), such changes feel incomplete. What about changes to search term synonym lists? Different personalization segmentation? Changing the number of pages in the checkout flow? Such changes cannot be easily achieved with simple CSS tricks. (Some actually can, by hiding or revealing different URLs for links or buttons.) Ideally your A/B test platform can support such tests as well, so all tests can be compared and contrasted from the one platform.
So where does this leave us today?
- A/B testing is more work; it takes effort to get the benefits
- Parallel A/B testing is more reliable than running tests sequentially due to external influences like seasonality, but needs better tooling
- The volume of traffic affects the reliability of the results
On a positive note, it has amazed me at times how much lift some sites have seen making even simple changes. Some changes were as simple as changing the label on a call to action button, or increasing the contrast of a button so people noticed it more.
If you do nothing else after reading this blog, I would recommend becoming more familiar with tools to measure your current site’s performance. Try making a change, measure the difference, and see what you can learn – or discover that the tools you are using today are not good enough.
Wrapping up, I still stand by my old blog post. Making changes to your site without measuring the impact is dangerous. You could be reducing your conversion rates without knowing it, or failing to realize how much better your site could be with a few simple changes.