The Value of A/B Testing

Five years ago I wrote a blog post titled “You Can’t Improve What You Can’t Measure”. While the use of “can’t” may have been a little strong, the point stands: making changes to your website can deliver results you did not expect. If you want to make sure a change actually improves your site, you need some form of measurement.

Luckily, many content platforms today have metric collection built in. Page views, bounce rates, referrer metrics – these are increasingly easy to get access to. And if they are not built into the platform you are using, tools like Google Analytics can provide them instead. So some form of metrics is almost always available.

So why bother with A/B testing?

What is A/B Testing?

A/B testing is where you split the traffic to your site into multiple (two or more) experiences. One group is normally a control group that sees the current site unchanged, and the others each get some change introduced. The results for each change can then be compared against the control group to see whether a proposed change improves or degrades the site experience.
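
For the comparison to be fair, each visitor should see the same experience on every visit, which is usually done by hashing a stable identifier (a user ID or a first-party cookie) into a bucket. Here is a minimal sketch of that idea in JavaScript – the hash function, variant names, and weights are illustrative, not taken from any particular testing tool:

    // Deterministically map a stable visitor ID to a test variant.
    // FNV-1a gives a cheap, well-spread 32-bit hash; any good hash works.
    function hashToUnitInterval(id) {
      let h = 0x811c9dc5;
      for (let i = 0; i < id.length; i++) {
        h ^= id.charCodeAt(i);
        h = Math.imul(h, 0x01000193);
      }
      return (h >>> 0) / 0x100000000; // unsigned 32-bit, scaled to [0, 1)
    }

    // variants: e.g. [{ name: 'control', weight: 0.5 }, { name: 'b', weight: 0.5 }]
    function assignVariant(userId, variants) {
      const x = hashToUnitInterval(userId);
      let cumulative = 0;
      for (const v of variants) {
        cumulative += v.weight;
        if (x < cumulative) return v.name;
      }
      return variants[variants.length - 1].name; // guard against float rounding
    }

Because the assignment is a pure function of the ID, the same visitor lands in the same group every time, without any server-side state.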

You can also do what is called a reverse A/B test, where you make the new experience the default and give a test group the old site. This can sometimes be useful for code development reasons – for example, implementing a new feature may have required significant code rework, so the reverse test is not exactly the old site, but close enough to confirm that the new changes are an improvement. It’s a form of back testing after a change, to make sure your final implementation got the test results you were expecting.

Note that you don’t have to send the same volume of traffic to each experience. A test needs a reasonable amount of traffic for its results to be trustworthy, but less traffic is needed if less accuracy is required. For example, when back testing you may allocate only a small percentage of traffic to the back test, as accuracy is less important – the back test is just a precaution.
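
Using the illustrative assignVariant sketch above, a back test allocation might look like this – the 90/10 split is just an example:

    // Back test: 90% of visitors keep the new default experience,
    // 10% are shown the old site purely as a precaution.
    const variant = assignVariant(userId, [
      { name: 'new-default', weight: 0.9 },
      { name: 'old-site',    weight: 0.1 },
    ]);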

A/B Testing Tools and Implications

There are several tools around to help with A/B testing. Google, for example, has a free tool under the Google Analytics umbrella, Google Optimize, to help collect data on your tests and analyze the results. It’s worth checking out the different tools to see which works best with your platform and budget. Remember your time is money too, so make sure the tool is not too hard to use in your environment.

It is important to understand when exploring A/B testing that it does require work and can have side effects. For example, many A/B test platforms use JavaScript to modify the presentation of a page in different ways for different test populations. This has the negative consequence of either potentially causing the page to flicker as it changes, or introducing a delay before the page contents are revealed so that it does not flicker. In the latter case, the JavaScript code has to decide which population the user is in (the control group or one of the test variants), then modify the DOM (or not) accordingly, before showing any page contents to the user. That makes the JavaScript render blocking – something you want to avoid on a fast site, and faster sites convert better. So make sure you understand the performance impact of the A/B testing practice you are following.
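
To make that tradeoff concrete, here is a sketch of the common “anti-flicker” pattern many JavaScript tools rely on, inlined at the top of the page. The class name, hook name, and one-second timeout are illustrative choices, not any specific tool’s API:

    // Runs as an inline, render-blocking script in <head>.
    // Hiding the page avoids flicker, but delays first render --
    // exactly the performance cost described above.
    (function () {
      var style = document.createElement('style');
      style.textContent = '.ab-hide { opacity: 0 !important; }';
      document.head.appendChild(style);
      document.documentElement.classList.add('ab-hide');

      function reveal() {
        document.documentElement.classList.remove('ab-hide');
      }

      // Safety net: never keep the page hidden longer than 1 second,
      // even if the testing script fails to load at all.
      setTimeout(reveal, 1000);

      // The testing script would call this once the variant's DOM
      // changes have been applied (illustrative hook name).
      window.__abReveal = reveal;
    })();

Every millisecond the page stays hidden is a millisecond added to your effective render time, so the timeout is a direct tradeoff between flicker and speed.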

Further, using JavaScript to modify the DOM in the browser can be harder to merge into sites with advanced JavaScript of their own (such as a PWA). You need to make sure the A/B test JavaScript does not conflict with your site’s JavaScript. For example, if you are using client-side rendering (React, Vue, Angular, etc.), will the A/B test changes be lost after the framework makes its own DOM manipulations? That would result in misleading results being collected.
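
One defensive pattern (a sketch only, with a hypothetical selector and change) is to re-apply the variant’s DOM edit whenever the framework re-renders the element, using a MutationObserver:

    // Re-apply a DOM change if a client-side framework replaces the
    // element. Assumes this runs after the DOM is ready.
    function keepApplied(selector, applyChange) {
      const apply = () => {
        const el = document.querySelector(selector);
        if (el && !el.dataset.abApplied) {
          applyChange(el);
          el.dataset.abApplied = 'true'; // marker so we don't loop forever
        }
      };
      apply();
      // A framework re-render replaces the node, losing the marker,
      // so the observer fires and the change is applied again.
      new MutationObserver(apply).observe(document.body, {
        childList: true,
        subtree: true,
      });
    }

    // Hypothetical example: variant copy on a call-to-action button.
    keepApplied('#buy-button', (el) => {
      el.textContent = 'Add to basket';
    });

This kind of tug-of-war with the framework is exactly why such integrations are fragile, and why building the variants into the site’s own code (discussed later) can be attractive.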

So why use JavaScript in the browser then? Because many of the A/B testing tools do not require marketers to go back to the developers to make site changes for a test. The marketing team gets simple tools they can use themselves: they can work out a change that measurably helps the site, then have the developers make that change permanent.

Test Results Confidence

Given all this complexity, why use A/B testing instead of, say, changing the site once a day or once a week and watching the results? If you don’t want the complexity of full A/B testing, this is an approach, but it can be risky. There is frequently a seasonality to sales, so week-over-week comparisons struggle to avoid that bias. Seasonality can be due to holiday seasons, sports events, the weather – there are all sorts of influences.

You also need enough traffic for the test results to be statistically valid. If the test is not on a well-trafficked page, the length of time the test has to run may not be clear up front. Most (I would hope all!) A/B test platforms include confidence measurements – the mathematical confidence in the results based on the data collected so far. Google Optimize, with its ability to use Bayesian analysis, can provide actual conversion probabilities instead of p-values, so you can use it on a site with low traffic and make good decisions earlier.
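
To make the Bayesian framing concrete, the probability that a variant genuinely beats the control can be estimated directly from observed counts. A minimal Monte Carlo sketch with uniform priors – illustrative only, and certainly not how Google Optimize computes its numbers:

    // Estimate P(variant B converts better than A) from raw counts,
    // using Beta posteriors (uniform Beta(1,1) prior) and sampling.
    function gaussian() { // standard normal via Box-Muller
      let u = 0, v = 0;
      while (u === 0) u = Math.random();
      while (v === 0) v = Math.random();
      return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
    }

    function sampleGamma(shape) { // Marsaglia & Tsang method
      if (shape < 1) {
        return sampleGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
      }
      const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
      for (;;) {
        let x, v;
        do { x = gaussian(); v = 1 + c * x; } while (v <= 0);
        v = v * v * v;
        const u = Math.random();
        if (u < 1 - 0.0331 * x ** 4) return d * v;
        if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
      }
    }

    function sampleBeta(a, b) {
      const g = sampleGamma(a);
      return g / (g + sampleGamma(b));
    }

    function probBBeatsA(convA, totalA, convB, totalB, n = 100000) {
      let wins = 0;
      for (let i = 0; i < n; i++) {
        const a = sampleBeta(convA + 1, totalA - convA + 1);
        const b = sampleBeta(convB + 1, totalB - convB + 1);
        if (b > a) wins++;
      }
      return wins / n;
    }

    // e.g. 120 conversions from 2400 visitors vs 150 from 2400:
    console.log(probBBeatsA(120, 2400, 150, 2400)); // roughly 0.97

An output around 0.9 would mean roughly a 90% chance the variant really is better – a far more intuitive number for a marketing team than a p-value.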

Confidence is a measure of how trustworthy the test results are. It’s all a bit of a numbers game, but put simply: if your test shows a 5% improvement with a 10% margin of error, the change could reasonably result in anything from a 15% improvement to a 5% degradation of the site. That would be a low-confidence change. You want to see improvements larger than the error margin. The longer you run a test and the more data you collect, the greater the confidence you will have in the results.

This also means that A/B testing may be less valuable for low-traffic sites: the length of time the tests need to run may simply be too long.

Back when I was at eBay, there were multiple A/B tests running all the time. It was almost a battle to be allocated a fraction of the available test bandwidth (even with eBay’s traffic volumes), as you had to be careful that tests from different teams did not collide. There were also issues such as how much ramp-up time a test would need. People dislike change, so to measure a change correctly it was sometimes necessary to let it “bake in” first, then start collecting results. Otherwise you only measured the impact of changing, not the impact of the change once users got used to it.

A Quick Survey

I held a quick poll on Twitter regarding A/B testing. Sorry – no confidence score from me! With only 33 votes, I would not rate it very highly. The results: 21% of respondents don’t bother with A/B testing, 6% thought it was too hard to try, 15% used it and found simple changes (e.g. CSS font, color, and positioning changes) sufficient, and 58% needed more.

I actually found this a positive result. It indicates to me that people wanting to do A/B testing take it seriously. They want to do more than just change the font size or the position of content on the page. Many of the A/B test blog posts I find out on the web seem to stop at the simple things. They talk about font changes, margin changes, or moving calls to action above or below the fold. While important (and easily done by marketing teams without developer support), such changes feel incomplete. What about changes to search term synonym lists? Different personalization segmentation? Changing the number of pages in the checkout flow? Such changes cannot be easily achieved with simple CSS tricks. (Some actually can, by hiding or revealing different URLs for links or buttons.) Ideally your A/B test platform supports such tests as well, all from the one platform, so tests can be compared and contrasted more easily.

One area I am personally still thinking about is whether good A/B testing needs to be baked into the core platforms. You use the core platform technology to create pages and behavior, so why use a different technology to then create different experiences? Why not build all test experiences in the same technology as the base site, making it easier to turn experiments into production code later? And can more be done on the server side (without breaking caching infrastructure) to avoid JavaScript page-rendering delays? This only works, however, if you have sufficient developer resources to help the marketing team perform tests. Look forward to another post, hopefully later in the year!
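
As a sketch of what that server-side direction could look like (using Express with cookie-parser here purely as an example stack – the cookie name, weights, and markup are all illustrative):

    // Server-side assignment: pin the variant in a first-party cookie so
    // each visitor gets a consistent experience, and tell caches to keep
    // separate copies per cookie value.
    const express = require('express');
    const cookieParser = require('cookie-parser');
    const app = express();
    app.use(cookieParser());

    app.get('/', (req, res) => {
      let variant = req.cookies.abVariant;
      if (!variant) {
        variant = Math.random() < 0.5 ? 'control' : 'treatment';
        res.cookie('abVariant', variant, { maxAge: 30 * 24 * 3600 * 1000 });
      }
      // Without this, a shared cache could serve one variant to everyone --
      // the "breaking caching infrastructures" problem mentioned above.
      res.set('Vary', 'Cookie');
      res.send(variant === 'treatment'
        ? '<h1>Try our new checkout</h1>' // illustrative variant markup
        : '<h1>Checkout</h1>');
    });

    app.listen(3000);

The catch is that Vary: Cookie is a blunt instrument – many CDNs treat it as “do not cache at all” – so real deployments often move the assignment logic out to the CDN edge instead.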

Conclusions

So where does this leave us today?

  • A/B testing is extra work – it takes effort to get the benefits
  • Parallel A/B testing is more reliable than running tests sequentially, which is exposed to external influences like seasonality, but it needs better tooling
  • The volume of traffic affects the reliability of the results
  • Many platforms use JavaScript to flip between A/B test variants, but this has the potential to slow down your site – understand by how much, as reduced site speed reduces conversion
  • Understand whether you have developer capacity to help the marketing team with tests; if not, the JavaScript approaches may be your best option

On a positive note, it has amazed me at times how much lift some sites have seen from even simple changes. Some were as simple as changing the label on a call-to-action button, or increasing the contrast of a button so people noticed it more.

If you do nothing else after reading this post, I recommend becoming more familiar with tools to measure your current site’s performance. Try making a change, then see how much difference it makes. Measure the difference and see what you can learn – or discover that the tools you are using today are not good enough.

Wrapping up, I stand by my old blog post: making changes to your site without measuring the impact is dangerous. You could be reducing your conversion rates without knowing it, or failing to realize how much better your site could be with a few simple changes.
