Web Custom Elements for Content Markup

TL;DR: Custom elements (a web standard for extending the set of HTML tag names) provide an interesting way to support rich content markup while decoupling presentation from content, but still have challenges in search engine support.

Custom Elements (a key technology underpinning web components) allow developers to extend HTML with their own set of tags that a browser can directly render. That is, you can create markup such as <blog-link slug="…"></blog-link> and the browser will parse and interpret the markup according to the definition of “blog-link” provided for the page. (Custom elements cannot be self-closing, so the end tag is always required.)

Custom elements are easy to spot in a web page. To avoid collisions with standard HTML tags, custom elements must always contain a hyphen in their name. (Future HTML elements will never contain a hyphen.)

This post explores the idea of using custom elements to allow richer content formats than simple HTML markup, for example in a Content Management System (CMS) where structured content is stored in a database. It also explores some of the practical challenges when search engines attempt to index such pages. (Spoiler: it sounds like a cool idea, but does not work that well in practice.)

Structured Content Markup

What do I mean by structured content? I use the term here to refer to marked up content as distinct from plain text. For the purpose of this post, that means HTML in particular. Structured content includes simple markup of bold and italic text, links to other pages that need special processing to resolve cross page links, and more complicated layout markup like responsive grids.

For example, consider a website that has a desktop and mobile version. The sites may use different URL structures. To allow structured content to contain a link to another page without hard coding the URL to that page, an identifier of some kind is commonly used (such as a page id or slug). This allows the final URL to be different for the mobile and desktop web sites as it is synthesized by the custom element definition loaded by the page.

Traditional Markup Rendering

To support structured content today without custom elements, content retrieved from a database is typically processed on the server to replace reference ids with the correct form of URL for the platform it is being displayed on. XML may be used for the markup stored in the database (because it’s more reliable to parse consistently). The content returned to the browser is then plain HTML.

One approach is to perform such transformations using regular expression searches across the content to replace markup. This treats the content as a string template file without trying to parse the HTML structure from the content. This has the advantage of being simple and relatively fast, but can be less flexible.
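
As a rough sketch of the regular expression approach, assume the stored content uses <blog-link slug="..."> tags and that the desktop and mobile sites use the made-up URL prefixes /blog/ and /m/blog/. The function name and the prefixes are my own assumptions for illustration.

```javascript
// Hypothetical URL prefixes per platform; a real site would supply its own.
const URL_PREFIX = { desktop: '/blog/', mobile: '/m/blog/' };

// Matches <blog-link slug="...">text</blog-link> without parsing the HTML.
const BLOG_LINK = /<blog-link\s+slug="([^"]*)"\s*>([\s\S]*?)<\/blog-link>/g;

// Replace each blog-link tag with a plain anchor for the given platform.
function renderLinks(content, platform) {
  const prefix = URL_PREFIX[platform];
  return content.replace(BLOG_LINK, (match, slug, text) =>
    `<a href="${prefix}${encodeURIComponent(slug)}">${text}</a>`);
}
```

The pattern only recognizes the exact attribute layout shown, which is the flexibility cost mentioned above: extra attributes or single-quoted values would need a more elaborate expression.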

Another approach is to parse the content into a tree-like structure which is then re-rendered as HTML. This approach is more flexible and extensible, but is typically slower as the parsing is more involved and typically has to perform more memory allocations as it builds up the tree-structure representing the content. This approach is used by technologies like XSLT where the original content is parsed as an XML document before being transformed into another representation.
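
The tree-based approach might look like the following sketch. It assumes the content has already been parsed into a tree of plain objects, so only the re-rendering step is shown. The node shape ({ tag, attrs, children }), the handlers table and the /blog/ URL prefix are all invented for illustration.

```javascript
// Escape text so it is safe inside HTML text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Per-tag rewrite rules: each returns the node to render in place of the original.
const handlers = new Map([
  ['blog-link', (node) => ({
    tag: 'a',
    attrs: { href: '/blog/' + encodeURIComponent(node.attrs.slug) },
    children: node.children,
  })],
]);

// Recursively render a node (a string, or { tag, attrs, children }) as HTML.
function render(node) {
  if (typeof node === 'string') return escapeHtml(node);
  const handler = handlers.get(node.tag);
  const out = handler ? handler(node) : node;
  const attrs = Object.entries(out.attrs || {})
    .map(([name, value]) => ` ${name}="${escapeHtml(value)}"`)
    .join('');
  const children = (out.children || []).map(render).join('');
  return `<${out.tag}${attrs}>${children}</${out.tag}>`;
}
```

Adding support for a new tag is just another entry in the handlers table, which is where the extra flexibility comes from. The cost is the intermediate tree of objects that has to be built and walked.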

Using Custom Elements

Custom elements introduce an interesting rendering option to avoid server side processing of content for display. Markup containing custom elements can be stored in a database and sent directly to the browser for display, reducing server side processing. The browser then parses and renders the markup. That is, the content is only parsed in the browser, not on the server.

To render the custom elements differently (say on a desktop vs mobile site), pages load different definitions of the custom elements. For example, a database may store richly marked up content as <blog-link slug="…">. Different web applications including that content would load up their own definition of the “blog-link” custom element, which would form the URL in the correct representation for that platform.
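
A minimal sketch of such a definition follows. The URL formats and the names desktopUrl, mobileUrl and defineBlogLink are my own assumptions. It also assumes the script runs after the content has been parsed (for example as a deferred script), so the element's children already exist when connectedCallback fires.

```javascript
// Hypothetical URL builders; each site would ship only its own.
const desktopUrl = (slug) => `/blog/${encodeURIComponent(slug)}`;
const mobileUrl = (slug) => `/m/blog/${encodeURIComponent(slug)}`;

// Register "blog-link" so that it wraps its content in a real anchor.
function defineBlogLink(toUrl) {
  class BlogLink extends HTMLElement {
    // Called by the browser when the element is attached to the page.
    connectedCallback() {
      // Guard against wrapping twice if the element is moved and reconnected.
      if (this.querySelector(':scope > a')) return;
      const anchor = document.createElement('a');
      anchor.href = toUrl(this.getAttribute('slug') || '');
      // Move the existing children (the link text) inside the anchor.
      anchor.append(...this.childNodes);
      this.append(anchor);
    }
  }
  customElements.define('blog-link', BlogLink);
}

// The desktop page would call this; the mobile page would pass mobileUrl.
// The guard lets the file load outside a browser without errors.
if (typeof customElements !== 'undefined') {
  defineBlogLink(desktopUrl);
}
```

The stored content is identical for both sites. Only the function passed to defineBlogLink changes.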

Custom elements can also be used for richer layout markup, such as <content-header>, <side-bar-note>, <featured-image>, and so forth, better capturing the semantics of the content.

In fact, a platform could go as far as to use custom elements for implementing themes. A theme definition becomes a list of custom element declarations which includes the styling to use for the markup. This gives themes a lot more power than just CSS as the theme developer has full access to JavaScript to perform more sophisticated manipulation of content.

It should be pointed out that not everything has to be a custom element. Responsive grid layouts can frequently be achieved using CSS classes instead (<div class="content-header">). In that case it is the class name that matters. The benefit of custom elements, however, is that being backed by JavaScript they can perform more sophisticated content manipulation beyond what CSS alone is capable of.

Search Engine Indexing

An important consideration for content on a public web page is search engine indexing. Most search engines today see custom elements effectively as <span> elements and ignore them. This is fine for most elements (e.g. layout elements), but is a problem for page links (<a href="…">) and image references (<img src="…">). Search engines want to know about the outbound links of a page.

If the search engine does not execute the JavaScript on the page during indexing, it may not see the URLs to real pages and images.

Even if the search engine does execute the JavaScript to form the final correct URL, there is typically an extra delay before the page is indexed or re-indexed while it waits for the JavaScript to be run. What this implies is the content must store a real URL for the search engine to use, even if the custom element modifies the URL later.

Two approaches to make it easier for search engines include:

  • Wrap custom element markup around pure HTML <a href="…"> and <img src="…"> elements. For example,
<blog-link slug="web-components">
<a href="/blog/web-components">Blog Title</a>
</blog-link>

The “blog-link” custom element can, if it wishes, change the URL used by the child <a href="…"> element. The search engine however would use the URL as specified in the content.

  • Use the “is” attribute (not available in all browsers yet) to annotate existing elements rather than introduce new custom element names. This allows markup of the form
<a href="/blog/web-components" is="blog-link">Blog Title</a>

which is easily understood by search engines and yet can be associated with a custom element definition to change its presentation.
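
For the “is” attribute form, a definition might look like the sketch below. It only has an effect in browsers that support this feature, and the /m prefix rule and the name toMobileHref are assumptions of mine for illustration.

```javascript
// Hypothetical rule: the mobile site serves the same pages under an /m prefix.
function toMobileHref(href) {
  return href.startsWith('/blog/') ? '/m' + href : href;
}

// The guard lets the file load outside a browser without errors.
if (typeof customElements !== 'undefined') {
  // An element used via the "is" attribute extends the anchor's own class.
  class BlogLink extends HTMLAnchorElement {
    connectedCallback() {
      // Read the attribute as written, not the resolved absolute URL.
      const href = this.getAttribute('href');
      if (href) this.setAttribute('href', toMobileHref(href));
    }
  }
  // The third argument ties the definition to <a is="blog-link">.
  customElements.define('blog-link', BlogLink, { extends: 'a' });
}
```

A desktop page would simply not load this definition, leaving the stored URL as is.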

Once content has to store real URLs for the purpose of search indexing, it becomes less desirable to store the page id twice (in the URL and as an attribute). Duplication can lead to inconsistency in markup. Thus in the above example of <blog-link slug="web-components"> it may be better to remove the “slug” attribute and rely on a standardized URL structure which can be parsed to extract the slug instead.

<blog-link><a href="/blog/web-components">Blog Title</a></blog-link>
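
Extracting the slug from the URL could be sketched as follows, assuming the standardized structure is /blog/<slug> and that the mobile site uses /m/blog/<slug>. Both formats and the name slugFromHref are my assumptions. As before, the script is assumed to run after the content has been parsed so the child anchor exists.

```javascript
// Extract the slug from a link of the assumed form /blog/<slug>.
// Returns null when the URL does not follow that structure.
function slugFromHref(href) {
  const match = /^\/blog\/([^\/?#]+)\/?(?:[?#].*)?$/.exec(href);
  return match ? decodeURIComponent(match[1]) : null;
}

// The guard lets the file load outside a browser without errors.
if (typeof customElements !== 'undefined') {
  class BlogLink extends HTMLElement {
    connectedCallback() {
      const anchor = this.querySelector('a[href]');
      const slug = anchor ? slugFromHref(anchor.getAttribute('href')) : null;
      // Leave the stored URL untouched when no slug can be recovered.
      if (slug !== null) {
        anchor.setAttribute('href', '/m/blog/' + encodeURIComponent(slug));
      }
    }
  }
  customElements.define('blog-link', BlogLink);
}
```

Search engines that ignore the custom element still see the real URL stored in the content, while the mobile page rewrites it once the definition runs.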

Of course it is an option to process custom element markup on the server to automatically inject the child <a href="…"> markup based on the slug attribute of the parent, but that goes against the original objective of not having to parse the content before returning it to the browser.

Conclusions

Custom elements are a potentially useful way to capture content with more semantic markup that can be rendered on different platforms in different ways.

However, because search engines need real URLs in the returned content, it may still be better to use a format such as XML and transform that XML into the correct HTML for the display surface when the content is displayed.

There are also security considerations when taking database contents and injecting them into a web page without any checks. There is the risk of an intruder injecting malicious JavaScript into the content, which then opens the browser up to various attacks. Again, XML may be a better choice as an XML Schema can be used to ensure the marked up text conforms to specific rules rather than being completely open ended (e.g. <script> tags could be banned from content).

But it is an interesting approach to capture structured markup that can be returned to a browser for display, and yet change the presentation format of that content without changing the content itself.

 
