Next Generation Digital Commerce Architecture Daydreaming

Every so often I daydream about what a next generation ecommerce platform would look like. (Yes, this is one of those pie-in-the-sky posts.)

First, I think SaaS platforms like Shopify, BigCommerce, Volusion, and so on do a great job for many ecommerce merchants. What I wanted to focus on in this post was businesses that get sufficient benefit from undertaking custom IT development to solve specific needs.

Oh no, not yet another microservice architecture blog post! Well, kinda. But also kinda not. I do believe in microservices (and MACH), but I am thinking more about breaking large problems into smaller problems that can be tackled separately. In practice I normally hear of microservices in the context of a service responding to API calls. These days they are frequently running in a Kubernetes cluster, communicating with something like an Istio service mesh. That was not my intended focus for this post. If you want to tie this post to a concept, Composable Commerce from Gartner is closer.

That is, in this post I wanted to focus on the pattern of independent communicating systems, potentially built by different vendors. A single system may be implemented using one or more in-house microservices, but in the real world a business typically has services provided by different SaaS vendors and ERP (Enterprise Resource Planning) systems, along with customized systems for specific purposes.

Put another way, I think of large systems like storefronts, order management, loyalty programs, and email campaign management. I tend to think of microservices in the context of building one of those systems. In this post I focus more on connecting the systems together, but without the pesky detail that such systems may expose hundreds of touch points (e.g. API calls) for connecting them.

Another decision in this post is to make the most of cloud hosting provider technologies to see how much they could speed up development. An organization may prefer to remain as cloud agnostic as possible, which is also an argument towards technologies such as Kubernetes. In this post I decided to rely on hosting provider technologies (Google Cloud in my case – hint, I work at Google!).

But let’s start with the basics.

Web API calls and Web Hooks

Web APIs have of course been around for many years. JSON over REST APIs are a common way for systems to communicate. GraphQL is used too, but more commonly between web applications and backends. (See the Backend for Frontend, BFF, pattern.) Basically, use a GraphQL service to farm out requests to various internal microservices, aggregate the results, and provide them cleanly to a frontend application implemented in something like React.

A common modern pattern for intersystem communication, building upon APIs, is web hooks. Tell a system to make an API call to you when a specified event occurs. GitHub, for example, can trigger a CI/CD build when a pull request is merged into a branch.
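Since a webhook endpoint is open to the internet, the receiver should verify that a delivery really came from the sender. GitHub does this by signing each delivery with an HMAC of the payload, sent in the X-Hub-Signature-256 header. A minimal verification sketch in Python:

```python
import hashlib
import hmac

def verify_webhook_signature(payload: bytes, secret: bytes, signature_header: str) -> bool:
    """Check a GitHub-style X-Hub-Signature-256 header against the payload."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(expected, signature_header)
```

Reject any delivery where this returns False before doing anything with the event body.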

When using API calls, some form of authentication also needs to be built in. This is frequently done via tokens such as JSON Web Tokens (JWT) in HTTP headers. Using HTTP headers is better than including the token in the request payload, as it keeps authentication independent of the application logic, allowing the authentication scheme to change without the code changing.
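Keeping the token in a header looks like this in practice (a minimal sketch using Python's standard library; the URL and token are placeholders):

```python
import urllib.request

def authed_request(url: str, token: str) -> urllib.request.Request:
    """Build a request carrying the token in a header, not the payload."""
    req = urllib.request.Request(url)
    # Because auth lives in a header, a gateway or proxy can validate or
    # swap the scheme without touching application-level request bodies.
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

A proxy in front of the backend can then check (or replace) the Authorization header while the request body passes through untouched.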

Message queues

The pattern I personally prefer for wiring large systems together, however, is message queues, such as Google Pub/Sub. Benefits include:

  • The receiving system can go down (including for maintenance and upgrades) without losing requests. (In terms of building resilient systems, this is a significant benefit.)
  • You can easily have multiple listeners on the same queue. Want to save a copy of all placed orders in your data warehouse? Keep a copy for a disaster recovery solution? Just add a new subscriber.
  • It can be easier to introduce and test a new version of a backend system in parallel to the existing deployment – just send the same message to multiple applications.
  • Monitoring of request flow rates can also be easier (it’s provided by the queue, no need to build something yourself, and if everything uses queues you can use the one monitoring platform across all systems).
  • It is easy to cope with traffic spikes. This includes batch jobs that may flood a system for short periods of time.
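The "just add a new subscriber" benefit above can be sketched in plain Python. This is an in-process stand-in for a Pub/Sub topic (the class and variable names are mine, not any real Pub/Sub API), showing how each subscription receives its own copy of every message:

```python
import queue

class Topic:
    """Tiny in-process stand-in for a pub/sub topic with fan-out."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self) -> queue.Queue:
        # Each subscriber gets its own queue, so each receives every message.
        q = queue.Queue()
        self.subscriptions.append(q)
        return q

    def publish(self, message) -> None:
        for q in self.subscriptions:
            q.put(message)

orders = Topic()
fulfillment = orders.subscribe()  # existing order-processing consumer
warehouse = orders.subscribe()    # added later: data warehouse copy

orders.publish({"order_id": "A123", "total": 42})
```

The order-processing code never changes when the warehouse subscription is added; with a managed queue the same property holds across separate systems.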

You can do all this with web API calls, but it can be harder without a middleware layer mediating all API calls. For example, if your code makes direct outbound HTTPS calls to a remote system, you cannot intercept those calls easily without a proxy or extra application code. Instead, consider having your custom code send a message to a queue, with a Google Cloud Function listening on the queue that makes the actual call to the remote system. Then you can see the queue stats via the same queue dashboard, and auditing or data sampling becomes easier. (Judgement is needed, of course, as to whether such an approach is overkill.)

Message queues also have their problems.

  • You don’t get a synchronous response. This is a big one! Want to fetch the product details for a specific product? Web APIs are a better fit.
  • Synchronization can also be important when communicating with multiple external systems, requiring different orchestration approaches when using multiple queues.
  • If you are using a cloud provider message queue, make sure you understand the costs, especially if your usage is very chatty.

Asynchronous and dirty data

I don’t want to gloss over the pain that a lack of synchronization brings. How do you show an order number to a shopper who just placed an order if you don’t get it back immediately? There are different solutions to this:

  • Build an asynchronous web storefront. For example, use a WebSocket to push the order number back to the web browser once it is known. Libraries like socket.io provide fallback support for older browsers.
  • Send the order number later in a confirmation email, not immediately on the screen.
  • Create a user-facing order number in the storefront, then use a separate identifier later when the order is created in the backend system. (This may sound ugly, but connecting systems from different vendors may force this upon you anyway.)
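The third option above can be sketched simply. Generate a shopper-facing reference at checkout time, then map it to the backend's own identifier once the order lands there asynchronously (the prefix, format, and helper names here are all illustrative):

```python
import uuid

# Map from shopper-facing reference to backend order id, filled in later.
order_map: dict = {}

def provisional_order_ref() -> str:
    """Generate a shopper-facing reference at checkout time,
    before the backend system has assigned its own order id."""
    return "WEB-" + uuid.uuid4().hex[:8].upper()

def record_backend_id(ref: str, backend_id: str) -> None:
    """Called when the backend confirms the order asynchronously."""
    order_map[ref] = backend_id
```

The shopper only ever sees the WEB- reference; support staff can look up the backend id from it when needed.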

You may have heard of the term eventual consistency. The idea is that in asynchronous systems, not everything will be in sync while requests are still flowing through the system; once everything settles, things become consistent. For example, multiple orders may arrive for the same product, so it looks in stock, but by the time order processing completes you discover someone else has already bought it. Obviously this is undesirable, but let’s look a bit closer. Even if data is synchronized, you can still get problems.

  • Are you selling on your own site and marketplaces? You are not going to be able to be perfectly synchronized.
  • Your stock levels may be incorrect due to damaged goods, or you were shipped the wrong goods by your suppliers and did not discover until too late.
  • Are you using software to track your stock levels? It may shock you, but there are these things called “bugs”! Or systems can crash halfway through a request. Things go wrong in the real world.
  • Are you shipping from stores? Your stock levels may only be updated back to your main systems at the end of each trading day.

I personally think a resilient solution has to cope with noisy data, not pretend it does not exist. Obviously it is good to reduce errors and the chance of making mistakes, but don’t pretend your data is perfect and will never have errors. This is why many systems have safety buffer settings. For example, set a safety stock level of 3 so the occasional error lowers the chance of you actually being out of stock (at the cost of not selling all your inventory). Your last resort is to send a message to the customer apologizing that the item is no longer in stock. You obviously want to avoid this, but design for the fact that it can occur rather than pretending it never will.
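The safety stock idea is just a held-back buffer, as in this sketch (the function name and default of 3 mirror the example above and are not from any real inventory system):

```python
def sellable_quantity(on_hand: int, safety_stock: int = 3) -> int:
    """Quantity to advertise for sale: hold back a safety buffer so that
    small errors in on_hand rarely cause an actual oversell."""
    return max(on_hand - safety_stock, 0)
```

So with 10 units recorded on hand and a buffer of 3, the storefront advertises 7; if the real count was only 8 or 9 due to damage or a bug, no customer is let down.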

This is where machine learning (ML) can start sneaking in as a useful technology. Look at past purchase patterns to predict the right stock levels for reordering, the best safety stock level per product, work out if certain items are more likely to have wrong stock levels than others, how long stock takes to arrive, etc.

Recovering from dirty data

Now, I am not suggesting you should not try to clean up dirty data. I actually think reconciliation is a useful process. Just like physical stores often do stocktakes, it can be useful to reconcile different internal systems. For example, your storefront may keep track of stock levels so it can delist products that are out of stock. These reconciliation jobs can be run during off-peak periods.

Running reconciliation jobs while the systems are idle produces cleaner results, which is another benefit of queues. (Shut down the service pulling requests from the queue, do the reconciliation, then restart the service.) But remember we don’t need perfection: reconciliation can tell other services current stock levels even if there is still a slight error due to orders in flight.
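The core of such a reconciliation job is a diff between two systems' views of the same data. A minimal sketch, assuming the warehouse counts are treated as the source of truth and stock levels are keyed by SKU:

```python
def reconcile(storefront: dict, warehouse: dict) -> dict:
    """Return the corrections to apply to the storefront's stock cache
    so it matches the warehouse counts (treated as source of truth)."""
    corrections = {}
    for sku, true_level in warehouse.items():
        # Only emit a correction where the two systems disagree.
        if storefront.get(sku) != true_level:
            corrections[sku] = true_level
    return corrections
```

The job would then apply the corrections (or publish them to a queue) rather than blindly overwriting everything, which keeps the update small and auditable.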

Flexibility

To be honest, I am still scratching my head over whether there is a need for a new ecommerce platform based on the above. I think there is value in trying to standardize concepts like orders, product descriptions, and system boundaries, but they will never be perfectly standardized. Maybe it is more useful to have a platform that can connect to the rest of the world as it is.

A part of this is driven by where I feel the industry is moving. I think shopper contact surfaces are changing at an increasing rate. Do you want to use QR codes in stores to bring up product reviews or call for an instore expert on that product? Do you want to hook up purchases with your latest social media campaign? Is there a new payment service you want to support? Do you want to start using ML to spot patterns?

I think there is an increasing need to be more agile in backend systems, and value in being able to rewire or introduce new systems with minimal danger to existing ones. This is where I see the greater value of leveraging cloud hosting provider technology offerings. You have to pay for them, and you may be able to build a more cost-effective custom solution yourself, but how quickly can you adjust a custom system to meet continually changing business needs? Cloud solutions with technologies like queues and cloud functions can make it easier to add functionality on demand.

(This also scares me a little at the same time. Discipline is required to not end up in a horrible tangled mess.)

Conclusions

Patterns that “feel right” to me:

  • Force yourself to think in asynchronous patterns (don’t be lazy and fall back to synchronous API calls)
  • Use message queues for events with side effects (such as placing an order) using Google Pub/Sub
  • Use APIs for read-only requests (such as fetching product data)
  • Leverage serverless functions for odd jobs wiring things together on the fly
  • For larger custom processes, consider Google Cloud Run so you can use your technology of choice for more complex processing
  • Listen to cloud provider events to tap into data flows (e.g. save events to BigQuery for Google Data Studio visualization or feeding into ML)
  • Use products like Google Dataflow to massage data between formats to avoid custom code development
  • Build automation with tools like Google Cloud Build
  • Don’t share databases between services to keep systems independent – even if this involves data duplication
  • Plan for reconciliation between systems when there are copies, and disaster recovery
  • Visualize your data using Google Data Studio to get benefit from your data

Kubernetes? Isn’t Google big into Kubernetes? Yes. But a recent article I saw that resonated with me described Kubernetes as a replacement for an application server. Sure, use microservices with direct, high-speed internal API calls within a large service. But I do think event models between systems using asynchronous queues are a better way to go.

Disclaimer: These thoughts are my own, not that of my employer, and I have not built a production grade ecommerce system based on the above, so buyer beware!
