Performance Benchmarking meets Core Web Vitals

Google has announced that Core Web Vitals (CWV), a set of 3 core metrics to measure page experience for users, is a ranking signal. (There are *many* ranking signals.) Why three? Well, you don’t want hundreds of metrics, but the experience of users on a site (as distinct from the quality of the content itself) is an important aspect, especially when multiple sites have similar content. So three metrics were selected:

  • Largest Contentful Paint (LCP) – how long it takes for the “main” content of the page to be displayed
  • Cumulative Layout Shift (CLS) – how much the content shifts around on the page during loading
  • First Input Delay (FID) – how long the page takes to respond to user inputs

Could these change in the future? Yes! As more is learnt about the best way to measure user experience, or problems with measuring the metrics are uncovered, Google will review and adjust them.

One example of a past minor adjustment relates to CLS: the time window over which layout shifts are accumulated was changed. Based on experience and feedback, the new window is a better way to measure CLS, as some pages (for example, ones that stay open for a long time) were being penalized inappropriately.

Disclaimer: I work for Google, but I am writing this on my personal blog without verifying all the details (call me lazy!) so I can get it out faster. I wanted to answer some questions that came up on Twitter after my Magento Association talk on measuring CWV on your own site.

So what numbers for CWV does Google actually use, what can developers measure, and why do all the numbers come out differently for different tools? Here are example influences based on my previous experience in performance analysis of complex systems. TL;DR: performance analysis is HARD!

Lab vs Field Data

Google Search uses field data, the experience of real users using Chrome, to capture CWV metrics. These metrics are grouped into clusters of “similar” pages. (I am not going to explore grouping in this post.) You can access the data that Google uses for ranking purposes via Google Search Console. So if you are tackling CWV to increase your ranking metrics (a fine goal), what Google Search Console reports is what you care about the most.

There are other sources of field data, however, such as the Chrome User Experience Report and using JavaScript to collect analytics data on your own site. The latter takes more effort, but reduces the length of time to get feedback on live site changes.

Field data is based on what users do on your site, but is generally less detailed, making it less useful for analyzing problems. This is where lab tests are useful. You can change one variable on your site at a time and measure the impact; or change nothing and measure the deviation on your site (maybe your site varies a lot between users and you did not realize it!). The tools can also spend more time collecting data on what is happening to process a request, without concern for degrading the site experience of real users (collecting more data generally slows down the user experience).

So lab data generally makes it easier to analyze a site, but it is a proxy for the real goal of improving real user experiences on a site.

Variance in Results

Computers are deterministic, right? So why do you get variance in test results? Well, the answer is yes… and no. They are deterministic if their inputs are deterministic. Once you introduce networks and cloud computing resources that are shared with other applications, things are no longer deterministic, as network delays, CPU availability, etc. can vary over time.

A part of the reason for this is queuing theory. When you have a series of events that are triggered in response to inputs, and events can block or delay other events, then doing things in a different order can have surprising ripple effects on overall performance. My first experience of this was with locks inside a database engine – claiming a global lock can really degrade system throughput, even if you only need the lock for a short time. It is also true when fetching resources over HTTP from a web server. A page may load hundreds of separate resources (HTML, CSS, JavaScript, fonts, images, videos, etc.) where there are dependencies between resources. JavaScript cannot run until downloaded, JavaScript once run might update the HTML on the page causing new images to be requested, etc. The order in which resources are downloaded over a restricted bandwidth network connection can significantly impact performance. (Not always, but it definitely can.)
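To make the ordering effect concrete, here is a toy sketch in Python. It is deliberately simplified: it assumes a single connection that downloads one resource at a time at a fixed bandwidth, which is not how real browsers behave, and every resource name, size, and dependency in it is made up for illustration.

```python
# Hypothetical page: name -> (size in KB, resource that must finish first, or None).
RESOURCES = {
    "index.html": (20, None),
    "app.js": (200, "index.html"),
    "hero.jpg": (300, "app.js"),      # only requested after app.js has run
    "font.woff2": (100, "index.html"),
    "ads.js": (400, "index.html"),
}


def simulate(resources, order, bandwidth_kb_per_s=100):
    """Download resources one at a time over one link; return finish times in seconds.

    `order` is the preferred download order. A resource is skipped until the
    resource it depends on has finished downloading.
    """
    finished = {}
    clock = 0.0
    pending = list(order)
    while pending:
        # Pick the first pending resource whose dependency has already finished.
        for name in pending:
            size_kb, dependency = resources[name]
            if dependency is None or dependency in finished:
                break
        else:
            # The loop ended without finding a downloadable resource.
            raise ValueError("no pending resource has its dependency satisfied")
        pending.remove(name)
        clock += size_kb / bandwidth_kb_per_s
        finished[name] = clock
    return finished


good = simulate(RESOURCES, ["index.html", "app.js", "hero.jpg", "font.woff2", "ads.js"])
bad = simulate(RESOURCES, ["index.html", "ads.js", "font.woff2", "app.js", "hero.jpg"])

for label, result in (("good order", good), ("bad order", bad)):
    print(f"{label}: hero image done at {result['hero.jpg']:.1f}s, "
          f"everything done at {max(result.values()):.1f}s")
```

In this made-up example both orders finish the whole page at the same time, but the hero image (a stand-in for the “main” content that LCP cares about) arrives roughly twice as late in the bad order. Same bytes, same bandwidth, very different experience.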

It reminds me of Tetris. The order things arrive affects how best to pack everything together. There is randomness involved. If things arrive in a good order, it is much easier to pack everything in. Use a bad packing approach and it’s GAME OVER!

Further, when talking about performance measurement, the machine you are on is one of your inputs. In the old days before cloud, it was much easier to lock down a machine to guarantee no other activity on the machine. With the advent of cloud computing, frequently VMs are on machines shared with other users. While the hosting provider tries to provide fair access to resources, even if there are separate network ports, the reality is there is shared infrastructure like memory busses and so on. Noisy neighbors are an issue.

And of course the location of a machine makes a difference in terms of network behavior between the machine hosting a testing tool (like PageSpeed Insights) and the machine being tested.

For real users, there are also issues such as the performance of the device they are using, how much memory/disk is available for caches, the quality of the network they are on, what sites they have previously visited to prime DNS / font / JavaScript library caches, and so on. The cache hit-rate of such services is impacted by their popularity across sites the user visits. (For example, a Google web font downloaded from Google, or a popular JavaScript library downloaded from a public site may require an additional DNS lookup, but may be cached more frequently for users moving across multiple sites.) Lab testing cannot emulate all users with 100% accuracy – lab tools make choices such as disabling caches or assuming a 3G network to simulate a user arriving on a site for the first time (worst case behavior).

This is why Google uses real user data rather than lab data it collects from sites. What matters is the real experience of users. Lab testing is never going to give exactly the same results.

So, when using cloud-hosted tools to measure various sites, variance in results should be expected. Network load varies during the day, the measuring site and site being measured can have different loads at various times, and so on. And queuing models can amplify minor changes due to resource dependencies.

But how much variance?

While some variance is expected, how much should be expected?

Ugh! Don’t ask such difficult questions!!!

It is not an easy question to answer. Because measurements like CLS use a timing window, something that takes 1ms longer can theoretically have a big impact because some other resource or action now falls just inside or outside the time window. These sorts of edge conditions are really hard to track down.
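Here is a minimal sketch in Python of how such an edge case can arise. It assumes the windowed CLS definition as I understand it from the public documentation: layout shifts are grouped into “session windows” where consecutive shifts are less than 1 second apart and a window spans less than 5 seconds, and the reported value is the largest window total. The function name and the shift data are made up for illustration, and this is not the code any Google tool actually runs.

```python
def windowed_cls(shifts, gap_ms=1000, max_window_ms=5000):
    """Return the largest session-window total of layout shift scores.

    `shifts` is a list of (timestamp_ms, score) tuples sorted by timestamp.
    The gap and window limits are assumptions taken from the public CLS docs.
    """
    best = 0.0
    current = 0.0
    window_start = None
    previous = None
    for timestamp, score in shifts:
        starts_new_window = (
            previous is None
            or timestamp - previous >= gap_ms
            or timestamp - window_start >= max_window_ms
        )
        if starts_new_window:
            window_start = timestamp
            current = 0.0
        current += score
        previous = timestamp
        best = max(best, current)
    return best


# Hypothetical page with two equal shifts; only the timing of the second differs.
just_inside = [(0, 0.08), (999, 0.08)]     # second shift lands in the same window
just_outside = [(0, 0.08), (1000, 0.08)]   # 1ms later: a new window starts

print(f"second shift at 999ms:  CLS = {windowed_cls(just_inside):.2f}")
print(f"second shift at 1000ms: CLS = {windowed_cls(just_outside):.2f}")
```

With these made-up numbers, a single millisecond halves the reported score, because the second shift either joins the first window or starts a new one.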

That does not mean you should not ask – it just means answering sometimes is difficult to impossible, which is frustrating for everyone involved. This is just a part of the pain of performance analysis. At scale, it is even more painful as more people hit the edge cases.

Not a very satisfying answer perhaps, but it is the unfortunate reality.

Personally I would offer rough guidance that 10% variance between runs is just par for the course. (This is gut reaction, not measured data.) 20% variance between lab and field data would not surprise me at all. In the Magento Association talk I demonstrated a case where the results differed by a factor of two. I consider lab data useful to identify and retest a particular aspect that needs to be fixed. Did the numbers get better after a change? Great! It’s now a candidate to get pushed to the live site for real user measurement.
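As a rough illustration of how that guidance could be applied, here is a small Python sketch. The 10% noise threshold is just the gut-feel figure from above, not a measured value, and the function names and LCP timings are hypothetical. It measures how much a set of repeated lab runs spreads, then checks whether a before/after difference is bigger than the assumed noise.

```python
from statistics import median


def relative_spread(samples):
    """Return (max - min) / median for a list of timings from repeated runs."""
    return (max(samples) - min(samples)) / median(samples)


def looks_like_real_change(before, after, noise=0.10):
    """True if the medians differ by more than the assumed run-to-run noise."""
    baseline = median(before)
    return abs(median(after) - baseline) / baseline > noise


# Hypothetical LCP timings in milliseconds from five lab runs each.
before = [2400, 2550, 2480, 2620, 2500]
small_tweak = [2300, 2450, 2380, 2350, 2420]
big_fix = [1900, 2000, 1950, 2050, 1980]

print(f"spread across 'before' runs: {relative_spread(before):.1%}")
print("small tweak beats the noise:", looks_like_real_change(before, small_tweak))
print("big fix beats the noise:", looks_like_real_change(before, big_fix))
```

Comparing medians over several runs, rather than single runs, is one simple way to avoid celebrating (or panicking over) a change that is smaller than the noise. It does not replace checking field data afterwards.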

Wrapping up…

If you do spot a variation, feel free to leave a comment on this blog post. I do not promise to have answers, but I do promise to read. I think I can guarantee there will be outliers – it is the nature of the problem. But if many people have the same problem, such data is useful to the teams here inside Google to investigate to see if there is a problem that can be solved.
