Google has announced that Core Web Vitals (CWV), a set of 3 core metrics to measure page experience for users, is a ranking signal. (There are *many* ranking signals.) Why three? Well, you don’t want hundreds of metrics, but the experience of users on a site (as distinct from the quality of the content itself) is an important aspect, especially when multiple sites have similar content. So three metrics were selected:
- Largest Contentful Paint (LCP) – how long it takes for the “main” content of the page to be displayed
- Cumulative Layout Shift (CLS) – how much the content shifts around on the page during loading
- First Input Delay (FID) – how long the page takes to respond to user inputs
Could these change in the future? Yes! As more is learned about the best ways to measure user experience, or as problems with existing metrics are uncovered, Google will review and adjust them.
One example of a past minor adjustment relates to CLS. The CLS measurement window was changed so that layout shifts are grouped into short “session windows” rather than accumulated over the entire lifetime of the page. Based on experience and feedback, this was a better way to measure CLS, as there were cases – such as long-lived pages slowly accumulating small shifts – that were getting penalized inappropriately.
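To make the windowing idea concrete, here is a toy JavaScript sketch of session-window grouping, based on my reading of the publicly documented rules (shifts less than 1 second apart join the current window, a window spans at most 5 seconds, and the largest window total is reported) – not Chrome's actual implementation:

```javascript
// Toy sketch of CLS "session window" grouping (my reading of the documented
// rules, NOT Chrome's actual implementation):
//   - shifts less than 1s apart join the current window
//   - a window spans at most 5s
//   - the page's CLS is the largest window total
function clsFromShifts(shifts) {
  // shifts: [{ time: ms since load, score: layout shift score }], sorted by time
  let best = 0;
  let windowStart = -Infinity;
  let prevTime = -Infinity;
  let sum = 0;
  for (const { time, score } of shifts) {
    if (time - prevTime > 1000 || time - windowStart > 5000) {
      windowStart = time; // gap too long or window full: start a new window
      sum = 0;
    }
    sum += score;
    prevTime = time;
    best = Math.max(best, sum);
  }
  return best;
}

// Twenty small shifts, 2 seconds apart: a lifetime sum would report 1.0,
// but each shift now lands in its own session window.
const trickle = Array.from({ length: 20 }, (_, i) => ({ time: i * 2000, score: 0.05 }));
console.log(clsFromShifts(trickle)); // 0.05
```

The point is the shape of the rule, not the exact code: grouping into capped windows stops a long trickle of tiny shifts from accumulating into a terrible score.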
Disclaimer: I work for Google, but I am writing this on my personal blog without verifying all the details (call me lazy!) so I can get it out faster. I wanted to answer some questions that came up on Twitter after my Magento Association talk on measuring CWV on your own site.
So what numbers for CWV does Google actually use, what can developers measure, and why do the numbers come out differently in different tools? Below are some example influences, based on my previous experience in performance analysis of complex systems. TL;DR: performance analysis is HARD!
Lab vs Field Data
Google Search uses field data, the experience of real users using Chrome, to capture CWV metrics. These metrics are grouped into clusters of “similar” pages. (I am not going to explore grouping in this post.) You can access the data that Google uses for ranking purposes via Google Search Console. So if you are tackling CWV to improve your search ranking (a fine goal), what Google Search Console reports is what you care about the most.
Field data is based on what users do on your site, but it is generally less detailed, making it less useful for analyzing problems. This is where lab tests are useful. You can change one variable on your site at a time and measure the impact; or change nothing and measure the deviation on your site (maybe your site varies a lot between users and you did not realize it!). The tools can also spend more time collecting data on how a request is processed, without concern for degrading the site experience of real users (collecting more data generally slows down the user experience).
So lab data generally makes it easier to analyze a site, but it is a proxy for the real goal of improving real user experiences on a site.
Variance in Results
Computers are deterministic, right? So why do you get variance in test results? Well, the answer is yes… and no. They are deterministic if their inputs are deterministic. Once you introduce networks and shared cloud computing resources, which are shared with other applications, things are no longer deterministic: network delays, CPU availability, and so on can vary over time.
It reminds me of Tetris. The order in which pieces arrive affects how well you can pack everything together, and there is randomness involved. If pieces arrive in a good order, packing them in is much easier. Use a bad packing approach and it's GAME OVER!
Further, when talking about performance measurement, the machine you are on is one of your inputs. In the old days before cloud, it was much easier to lock down a machine to guarantee no other activity on it. With the advent of cloud computing, VMs frequently run on machines shared with other users. While the hosting provider tries to provide fair access to resources, even if there are separate network ports, the reality is there is shared infrastructure like memory buses and so on. Noisy neighbors are an issue.
And of course the location of a machine makes a difference in terms of network behavior between the machine hosting a testing tool (like PageSpeed Insights) and the machine being tested.
This is why Google uses real user data rather than lab data when evaluating sites. What matters is the real experience of users. Lab testing is never going to give exactly the same results.
So, when using cloud hosted tools to measure various sites, variance in results should be expected. Network load varies during the day, the measuring site and site being measured can have different loads at various times, and so on. And queueing models can amplify minor changes due to resource dependencies.
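As a toy illustration of that amplification, the classic M/M/1 queueing formula for mean time in system, W = 1/(μ − λ), shows how response time blows up as load approaches capacity. The numbers here are made up purely for illustration:

```javascript
// Toy M/M/1 queue: mean time in system W = 1 / (mu - lambda), where mu is the
// service rate and lambda the arrival rate (requests/second).
// Illustrative numbers only, not measurements from any real system.
const meanWait = (lambda, mu) => 1 / (mu - lambda);

const mu = 100; // server can process 100 requests/second
console.log(meanWait(50, mu)); // 0.02 -> 20ms at 50% load
console.log(meanWait(90, mu)); // 0.1  -> 100ms at 90% load
console.log(meanWait(99, mu)); // 1    -> 1s at 99% load
```

The exact formula matters less than the shape: the closer a shared resource runs to saturation, the more a tiny change in demand amplifies into visible latency variance.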
But how much variance?
While some variance is expected, how much should be expected?
Ugh! Don’t ask such difficult questions!!!
It is not an easy question to answer. Because measurements like CLS use a timing window, something that takes 1ms longer can theoretically have a big impact because some other resource or action now falls just inside or outside the time window. These sorts of edge conditions are really hard to track down.
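Here is a contrived sketch of that edge condition: a metric that only counts events inside a fixed window (a hypothetical 5-second window, not any real CWV definition), where a single event drifting 1ms across the boundary changes the reported total.

```javascript
// Contrived metric: sum the scores of events that land inside a fixed window.
// (Hypothetical 5-second window, not any real CWV definition.)
const WINDOW_MS = 5000;
const scoreInWindow = (events) =>
  events.filter((e) => e.time <= WINDOW_MS).reduce((sum, e) => sum + e.score, 0);

const runA = [{ time: 1000, score: 0.1 }, { time: 5000, score: 0.1 }];
const runB = [{ time: 1000, score: 0.1 }, { time: 5001, score: 0.1 }]; // 1ms later
console.log(scoreInWindow(runA)); // 0.2
console.log(scoreInWindow(runB)); // 0.1
```

A 1ms difference in one input produced a 2x difference in the output – exactly the kind of discontinuity that makes these edge cases so hard to track down.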
That does not mean you should not ask – it just means answering sometimes is difficult to impossible, which is frustrating for everyone involved. This is just a part of the pain of performance analysis. At scale, it is even more painful as more people hit the edge cases.
Not a very satisfying answer perhaps, but it is the unfortunate reality.
Personally, I would offer rough guidance that 10% variance between runs is just par for the course. (This is gut reaction, not measured data.) 20% variance between lab and field data would not surprise me at all. In the Magento Association talk I demonstrated a case where the results were a factor of two apart. I consider lab data useful to identify and retest a particular aspect that needs to be fixed. Did the numbers get better after a change? Great! It’s now a candidate to get pushed to the live site for real user measurement.
If you do spot a variation, feel free to leave a comment on this blog post. I do not promise to have answers, but I do promise to read. I think I can guarantee there will be outliers – it is the nature of the problem. But if many people have the same problem, such data is useful to the teams here inside Google to investigate to see if there is a problem that can be solved.