This discussion is particularly relevant to e-commerce websites that have a Progressive Web App (PWA) built with a technology such as React/Vue/Angular that want all the product and category pages indexed.
Crawling, Rendering, and Indexing
How does Google collect web content to index? This consists of three steps that work together: crawling, rendering, and indexing.
Crawling is the process of retrieving pages from the web. The Google crawler follows <a href=”…”> links on pages to discover other pages on a site. Sites can have a robots.txt file to block particular pages from being indexed and a sitemaps.xml file to explicitly list URLs that a site would like to be indexed.
To play well with crawlers, sites should have a canonical URL for each page so that a crawler can determine if two URLs lead to the same content. (A site might have multiple URLs that return the same page. One of the URLs should be nominated as the “canonical” URL.)
After a page is retrieved by a crawler, indexing extracts all the content to send off to the search indexes. This is also when <a href=”…”> links to other pages are identified and sent back to the crawler to add to its queue of pages to retrieve.
Another tip – you can also use <noscript><a href=”…”>…</a></noscript> to embed other links you want crawled, but don’t want displayed.
So how best to build a PWA that can also be indexed?
Server Side, Client Side, Dynamic, and Hybrid / Universal Rendering
Server side rendering, as mentioned before, is where the web server returns all the HTML ready for display. This provides a fast first page load experience for users, is very friendly to indexers, but by definition is not a PWA.
Dynamic rendering is introduced in the presentation where a web server looks at the User-Agent header and then returns a server side rendered page when the Google crawler fetches a page and a client-side rendered version for normal users. (The server side rendered page can probably be relatively plain looking, but should contain the same content as the client side rendered page.) You just look for “Googlebot” (or equivalent for other crawlers) in the User-Agent header to work out if the request is coming from a crawler. (For extra safety you can also perform a reverse DNS lookup on the inbound IP address to make sure it is coming from the Googlebot crawler.)
There are other tools that can be worth checking out.
- Rendertron is an open source middleware project which can act as a proxy in front of your web site, doing client side rendering and returning the resultant page.
- The Google Search Console allows you to explore how Google indexes your site. It has a number of new tools such as “show me my page as Googlebot sees it” which is useful for debugging. It also contains a tool to see how mobile-friendly a website is (you should try out multiple pages on your website to try out). This can also be useful to see if robots.txt is blocking files you thought were not necessary, but negatively affect the rendering of a page by Googlebot. (There is a desktop tool as well.)
Some other common issues that arise when pages are crawled include:
- Make sure your pages are performant. Google will timeout and skip pages that are too slow to return.
- Make sure your pages don’t assume the user first visited the home page (to set up “browser data” or similar). Googlebot performs stateless requests – no state from previous requests is retained, to mimic what a user landing on the site will see.