Amazon AWS, Google Cloud Platform, Azure… it's never been easier to spin up a new server. Then comes the raft of managed services: databases, message queues, scheduling, and more. And this is before you get to Kubernetes orchestration and automatic restarts, or serverless programming models.
But with these tools, is it easier to build an e-commerce platform?
My answer is no. Platforms where there is a single web server talking to a database are much easier to develop and operate. It is clear when the service is down. To restart, you start up the server again. Code deployments are simpler – all code is always in sync.
But I believe the question as phrased misses the point. The new expectation is not about overall solution simplicity. It is about making it simpler to build solutions that meet modern expectations of availability, scalability, and reliability. These new services do simplify the construction of more advanced solutions. But it would be wrong to say such solutions are now “easy”.
For example, how do you improve availability? If you are after 100% uptime, there are techniques that can be used. There are lots of great books and articles on the CAP theorem, eventual consistency, event sourcing, the latest design patterns, and more. These concepts are important, but I prefer to stick to basics and talk about relatively simple, real-world concepts rather than quote jargon and theorems.
The first step in making a system always available is ensuring it can survive a server crash or failure (including a zone failure). That means you must have multiple servers running (or use a platform that does this for you) – if you lose one, there are others on hand to take the load.
As soon as you have multiple instances running, to deploy a new version of the code you need to either perform a rolling restart (restart individual services one by one, meaning you can have half running the old code and half running the new code for a period of time), or spin up a completely new set of servers and flip over atomically from the old services to the new.
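To make the difference between the two deployment styles concrete, here is a small simulation. It is only a sketch: the function names and the `v1`/`v2` labels are made up for illustration, and a real deployment would also involve health checks and load balancer changes that are left out here.

```python
def rolling_restart(versions, new_version):
    """Upgrade one instance at a time; yield the fleet state after each restart."""
    fleet = list(versions)
    for i in range(len(fleet)):
        fleet[i] = new_version  # instance i comes back up on the new code
        yield tuple(fleet)


def blue_green(versions, new_version):
    """Start a complete new fleet, then flip traffic to it in a single step."""
    green = [new_version] * len(versions)
    return tuple(green)  # traffic only ever sees all-old or all-new


old_fleet = ["v1"] * 4
steps = list(rolling_restart(old_fleet, "v2"))

# States in which old and new code are serving traffic at the same time.
mixed_states = [s for s in steps if len(set(s)) > 1]

for state in steps:
    print(state)
print("mixed states during rolling restart:", len(mixed_states))
print("after blue/green flip:", blue_green(old_fleet, "v2"))
```

With four instances, the rolling restart passes through three mixed states before the whole fleet is on the new code, whereas the flip approach never exposes a mixed fleet but needs twice the servers for a while.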
However, complexity sneaks in when you come to storage. If the new code requires a change to the storage format on disk (e.g. a database table change), how can you have old code still running that does not yet understand the new schema? A common approach is to first roll out code that understands both the old and new table structures, then perform the schema change once that code is deployed everywhere. This sequencing adds complexity to deployment processes, particularly code rollbacks.
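As a rough illustration of code that understands both table structures, the sketch below uses an in-memory SQLite database. The `customer` table and its `full_name`, `first_name`, and `last_name` columns are invented for this example; the point is only that the same reader keeps working before and after the schema change.

```python
import sqlite3


def make_old_db():
    """Create a database with the (hypothetical) old schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO customer (full_name) VALUES ('Ada Lovelace')")
    conn.commit()
    return conn


def read_customer_names(conn):
    """Return display names, whichever schema version the table has."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(customer)")}
    if "first_name" in columns:
        # New schema: name split across two columns.
        rows = conn.execute("SELECT first_name, last_name FROM customer ORDER BY id")
        return [f"{first} {last}" for first, last in rows]
    # Old schema: single full_name column.
    rows = conn.execute("SELECT full_name FROM customer ORDER BY id")
    return [name for (name,) in rows]


def migrate(conn):
    """Apply the schema change only after the dual-schema reader is deployed."""
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE customer ADD COLUMN last_name TEXT")
    for cid, full in conn.execute("SELECT id, full_name FROM customer").fetchall():
        first, _, last = full.partition(" ")
        conn.execute(
            "UPDATE customer SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, cid),
        )
    conn.commit()


conn = make_old_db()
before = read_customer_names(conn)
migrate(conn)
after = read_customer_names(conn)
print(before, after)
```

Note that the migration keeps the old `full_name` column in place, so code that only knows the old schema can still read during a rollback. Dropping the old column would be a separate, later step.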
As can be seen, if you want to keep the service up during a schema change, life is much harder. Sometimes it is just easier to take the site down, make the schema change, then restart everything. Where possible, keep schema changes out of routine code deployments.
Serverless platforms such as AWS Lambda or Google Cloud Functions are useful, but you do need to be careful. These approaches have real benefits:
- They scale up on demand
- Failover is automatic – you don’t have to code this yourself
- They can be event driven – CPU is only used when required to respond to some event
But there are also negatives:
- If you have sustained load, they typically cost more than dedicated services (when I looked a year or more back it was 5 times the cost)
- Monitoring can be a bit more tricky as there is no server you are running
- They do not magically make service failover and startup times disappear – if you care about system behavior during such incidents serverless programming may not be suitable
- You still need to understand how to do code deployments that cope with schema changes – but with the added complexity that you do not have full control over the code deployment pipelines
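On the cost point in the list above, a back-of-the-envelope calculation shows why sustained load matters. This is a sketch under simplifying assumptions of my own: a dedicated server is billed for every hour whether busy or idle, serverless is billed only for busy time, and the 5x premium is the rough figure I quoted, not a current price. The hourly rate is arbitrary.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def monthly_cost_dedicated(hourly_rate, hours=HOURS_PER_MONTH):
    """A dedicated server is paid for every hour, busy or idle."""
    return hourly_rate * hours


def monthly_cost_serverless(hourly_rate, utilization, premium=5.0, hours=HOURS_PER_MONTH):
    """Serverless is paid only for busy time, at `premium` times the hourly rate."""
    return hourly_rate * premium * utilization * hours


def break_even_utilization(premium):
    """Fraction of time busy at which both options cost the same."""
    return 1.0 / premium


rate = 0.10  # hypothetical dollars per server-hour
for utilization in (0.05, 0.20, 0.80):
    print(
        utilization,
        round(monthly_cost_dedicated(rate), 2),
        round(monthly_cost_serverless(rate, utilization), 2),
    )
print("break-even utilization:", break_even_utilization(5.0))
```

Under these assumptions, serverless only comes out cheaper when the workload is busy less than a fifth of the time, which fits the idea of using it for occasional, event-driven work.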
For me, serverless programming models are useful when triggered by some managed service event, such as a message arriving on a message queue. That is, more like an event hook. I would not build a complete architecture based on them.
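To show what I mean by an event hook, here is a minimal sketch of a function that would run once per queue message. The message shape (a dict with a JSON `body`), the field names, and the in-memory `inventory` dict are all assumptions made for illustration and do not match any particular provider's event format.

```python
import json


def handle_order_message(message, inventory):
    """Hypothetical hook: reserve stock for one order message from a queue."""
    order = json.loads(message["body"])
    sku, quantity = order["sku"], order["quantity"]

    if inventory.get(sku, 0) < quantity:
        # Not enough stock: report the failure and leave inventory untouched.
        return {"status": "rejected", "sku": sku}

    inventory[sku] -= quantity
    return {"status": "reserved", "sku": sku, "remaining": inventory[sku]}


# Simulate the platform delivering two messages to the hook.
inventory = {"widget": 3}
first = handle_order_message(
    {"body": json.dumps({"sku": "widget", "quantity": 2})}, inventory
)
second = handle_order_message(
    {"body": json.dumps({"sku": "widget", "quantity": 2})}, inventory
)
print(first)
print(second)
```

The function holds no state between calls and does nothing until a message arrives, which is the kind of small, self-contained job I think suits the serverless model. In a real system the inventory would live in a database rather than a dict.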
This blog post is the first in a series I am thinking of writing, exploring various cloud technologies to consider how they could support a new e-commerce platform.
The primary message of this post is that this series is not about the easiest possible solution – it is about simplifying the hard problems of designing modern resilient systems.
It is also not about saying existing approaches are wrong. I believe that for the majority of sites, making the site easier to develop and deploy outweighs the benefits of availability and resiliency. For all the disdain “monoliths” get, if you want an easy to run, easy to develop, easy to deploy solution then monoliths are what you are after.
I should also be upfront and say I have no plans to write a single line of code for a new platform. Such an effort is a large undertaking. And platform success is more about ecosystem than clever design.
But I find it interesting to understand the available technologies and how to use them in building modern solutions that address considerations such as availability and resiliency, solutions that meet modern expectations of systems that form the core of modern businesses.
And excuse me if in this series I focus on Google technologies. This is not a work-related series of posts, but I still work there, so I like to understand what the cloud team (which I am not a part of) releases. Other platforms have similar technologies (such as message queues), so the discussion will generally be equally relevant to them.