This post is a thought experiment on how to super-scale Magento. That is, how do you design a system so it can scale by multiple orders of magnitude? Because this is a thought experiment, I am not going to worry about minor details like how much effort the solution would take to implement or how much all the hardware would cost.
Before going too far, it is important to remember that there are many different dimensions that can be scaled: number of products, number of variations, number of concurrent users, number of price rules, number of concurrent administrators, and more. For Magento, it is also important to remember that no two sites are the same. Different extensions may be installed, the HTML and CSS is likely to be adjusted to give the site its own look and feel, etc. So there is not a single rule of “do this and scaling is solved”. Instead there are different techniques that help with different problems. So for any particular site, different combinations of approaches may be appropriate to solve the scaling problem for the site. This post explores common and effective techniques to scale.
Disclaimer: This post is my personal opinion and does not necessarily reflect that of my employer. The details in this post are not a commitment to implement any of the features in Magento.
First, let me start with a few well known generic concepts for scaling systems. Later I move on to some more Magento specific topics.
HTTP Caching / Web Acceleration
One way to handle more load is to cache pages that don’t change server state. For example, viewing a product does not change server state; adding an item to your shopping cart does. (Yes, I know viewing a product might change state if you were tracking the number of product views.)
Magento allows a developer to customize the pages created per site via extensions, changing PHTML files, CSS, and so on. Once a page has been built, however, if it is viewed repeatedly and there is no change to the data used when building it, why not save the page away so it can be retrieved without running the PHP code again? When the next request arrives, you simply return what you worked out last time.
So what if the data used to build a cached page is updated? “There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton” One approach to cache invalidation is to expire cache entries fairly quickly; that way out-of-date data won’t be shown on the site for long. Another is to be smart about which entries can be knocked out of the cache. For example, Magento 2 includes Varnish support where entries are tagged with cache tags (for example, the product ID). An update to a product generates a Varnish PURGE request to remove all cache entries tagged with the specified product ID.
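The tag-based invalidation idea can be sketched in a few lines. This is a toy in-process cache, not Varnish (which implements tagging and purging natively); all names here are illustrative:

```python
class TaggedCache:
    """Toy page cache where every entry carries tags (e.g. product IDs)."""

    def __init__(self):
        self._entries = {}  # url -> (page_body, set_of_tags)

    def put(self, url, body, tags):
        self._entries[url] = (body, set(tags))

    def get(self, url):
        entry = self._entries.get(url)
        return entry[0] if entry else None  # None means a cache miss

    def purge_tag(self, tag):
        """Drop every cached page tagged with `tag` (e.g. after a product update)."""
        stale = [url for url, (_, tags) in self._entries.items() if tag in tags]
        for url in stale:
            del self._entries[url]
        return len(stale)

cache = TaggedCache()
cache.put("/product/42", "<html>Blue Shirt</html>", tags={"product-42"})
cache.put("/category/shirts", "<html>Shirts list</html>", tags={"product-42", "product-43"})

# An admin updates product 42: both pages that rendered it are invalidated,
# while unrelated cache entries are left alone.
purged = cache.purge_tag("product-42")
```

Updating product 42 purges both the product page and the category page that included it; the next request for either page rebuilds it from fresh data.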
Caching can give one to two orders of magnitude improvement for customers hitting a web site when the cache hit rate is high, hence caching is a very useful and effective way to cope with increases in site traffic. Just remember that some pages on a site will cache badly. For example at checkout you will have different items in your cart than other users so caching will be less effective in helping the performance of the checkout flow (other than to reduce the total load on the site).
Resilience
As a site scales, the likelihood of something failing goes up. If a host crashes say once a year on average, then with 365 hosts you can expect a crash every day. Failures will be the norm, not the exception. To cope, resilience needs to be designed in.
A common technique here is to have multiple hosts able to answer any request. Never have only one host able to satisfy the request. It is common for a web server cluster to have all web servers able to satisfy the same request. That way if a web server goes down, it does not matter – the other web servers will be given a bit more work until that web server is brought back online.
Data is however harder to deal with. If I add an item to my cart, then no matter what web server my request goes to, I want to see my cart contents afterwards. If my cart is stored on a single specific host and that host crashes, my cart is no longer available. This is where replication becomes useful, to store copies of data on multiple hosts so if any one host goes down, the data is still accessible on another host.
Replication and resiliency are hard to implement correctly and efficiently. This is why technologies like Redis are such good news. Someone has solved the hard problems of concurrency, replication, cluster management, and consistency – we can just leverage them and never have to know the complexities of what goes on under the covers.
I am not going to go into it much here, but as the size of a solution increases, the importance of automation increases with it. If you have hundreds or thousands of hosts, you want automatic detection and recovery from failures. Each software deployment also needs to be managed carefully.
Queues
Another useful generic technique for handling scale is to use queues whenever possible. Instead of always promising to do work immediately, when possible create a work item and put it on a queue. This helps spread the workload over time. If you get a sudden spike of traffic, you can offload some work to be done after the spike ends, making your site more likely to cope.
Queues are not a perfect solution, however. For example, if I put a request such as “add item to my cart” on a queue and it is not processed by the time I fetch the next page, I won’t see my new item in my cart! It would magically appear later. This is where you need to make a tradeoff between consistency and scale. If the user experience mandates that adding an item to a cart always be seen immediately, then a queue is not ideal. If an administrator changes the price of an item, however, it may be acceptable for that price change to wait until the server is less busy.
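The price-change example can be sketched as a deferred work item. This is a single-process illustration (a real deployment would use a message broker); the function names and data are made up:

```python
from queue import Queue

prices = {"sku-1": 10.0}   # stand-in for the catalog price table
work_queue = Queue()

def request_price_change(sku, new_price):
    # Don't touch the database during the traffic spike;
    # just record the intent as a work item.
    work_queue.put(("reprice", sku, new_price))

def drain_queue():
    # Run by a background worker once load drops.
    while not work_queue.empty():
        op, sku, new_price = work_queue.get()
        if op == "reprice":
            prices[sku] = new_price

request_price_change("sku-1", 8.5)
assert prices["sku-1"] == 10.0   # change not yet visible: eventual consistency
drain_queue()                    # now the new price is applied
```

The assertion in the middle is the tradeoff made explicit: between enqueue and drain, readers still see the old price.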
Distributed Databases
Remembering the discussion on resilience above, a distributed database solution may put multiple copies of the data on different hosts. Magento supports read-only database replication, for example, where the whole database contents are copied and stored on different hosts. This can help support increased database read traffic, but will not cope with increased volumes of database contents. Eventually the database will be too big (or too slow) to fit on one host – something better than “replicate everything to every host” is required.
This is where the more sophisticated distributed database technologies kick in. They deal with the complexity of getting multiple machines to hold different subsets of data, with replication for redundancy, and make sure the query results are consistent and correct. This is really hard to do well.
There are other techniques to help scale a database, but most lose support for database transactions. Having a single database engine (on a single host or distributed across multiple hosts) also makes querying easier: join queries across any combination of tables are relatively straightforward. So moving beyond a distributed database technology is one of the more expensive steps to make.
Personally, I am quite interested to learn more about how far technologies like MySQL Cluster, Clustrix, and Tesora can scale in practice with Magento. How much further can you scale a solution using these technologies before you hit the next bottleneck?
In Memory Databases
Traditional databases write content to disk, so if the power goes off data is not lost. It is always a good idea to write contents to disk periodically for recovery purposes, but there is another approach that can be used to scale performance: an in-memory database. The idea is that, since server contents are replicated onto other hosts anyway, why not rely on this and keep all the data in memory? Queries run much faster this way. As long as there is always a replica of the data (or two for additional safety), it is fine to lose a server; it can be recovered from the data on the other server(s).
That is, using an in-memory database can keep queries and other database operations away from disk I/O. While SSD is faster than magnetic disk, using RAM directly is faster still.
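A minimal sketch of the idea, assuming a toy key-value store with one synchronous replica (a real system such as Redis handles replication, failover, and persistence for you; the class here is invented):

```python
class InMemoryStore:
    """Toy in-memory store that synchronously copies each write to a replica."""

    def __init__(self, replica=None):
        self._data = {}
        self._replica = replica

    def put(self, key, value):
        self._data[key] = value
        if self._replica is not None:
            # Replicate before acknowledging the write, so losing the
            # primary never loses acknowledged data.
            self._replica._data[key] = value

    def get(self, key):
        return self._data.get(key)

replica = InMemoryStore()
primary = InMemoryStore(replica=replica)
primary.put("cart:alice", ["sku-1", "sku-2"])

# The primary host "crashes"; the replica still holds the data
# and can simply be promoted to primary.
primary = replica
```

Every read and write stays in RAM; periodic disk snapshots (not shown) would cover the case where every replica is lost at once.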
Sharding
There are a number of other techniques commonly used outside of Magento to scale a database. One approach is sharding. Instead of storing all data on a single database host, you put a fraction of the data on each host in a set of machines. For example, all users whose names start with A-K might go on the first database, and users starting with L-Z might go on the second.
Sharding implemented in application logic can really help scalability, but the approach is not perfect. Consider the above example of sharding user shopping carts based on the user’s name. If my cart contained just product IDs, then I would need a database JOIN query to get product information for each item in my cart. If I have also sharded products across servers, it is effectively guaranteed that some cart shards and product shards will not be on the same database server. This means query processing logic moves from the database engine to the application, increasing the application complexity. Care must also be taken around transaction functionality – ideally all data to be inserted or updated within a single transaction can be isolated to a single database instance.
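Here is a sketch of what that application-side join looks like. The two-shard layout and the alphabetical routing rule are hypothetical (real systems typically use consistent hashing), and plain dicts stand in for database servers:

```python
# Hypothetical two-shard setup: carts sharded by first letter of username,
# products sharded separately by product ID.
cart_shards = [
    {"alice": ["p1", "p3"]},    # users A-K
    {"zoe": ["p2"]},            # users L-Z
]
product_shards = [
    {"p1": "Blue Shirt", "p3": "Hat"},
    {"p2": "Red Shoes"},
]

def cart_shard_for(user):
    # Routing logic the database engine used to hide now lives in the app.
    return cart_shards[0] if user[0].lower() <= "k" else cart_shards[1]

def lookup_product(product_id):
    # Probe each product shard; carts and products may live on different hosts.
    for shard in product_shards:
        if product_id in shard:
            return shard[product_id]
    return None

def cart_with_names(user):
    # What a single SQL JOIN used to do is now explicit application logic.
    return [lookup_product(pid) for pid in cart_shard_for(user).get(user, [])]
```

Note how the "join" became a loop over shards: the query planner's job has moved into application code, which is exactly the complexity cost described above.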
So to summarize, leverage distributed database technologies as far as possible. Going beyond is possible, but the challenges (such as query evaluation and transaction support) increase substantially.
WHAT ABOUT MAGENTO?
The topics above are pretty generic – valid for a range of web based applications. What about for Magento?
Magento is designed around the model of a single HTTP request coming in and being processed completely by a single web server. The presentation logic, the business logic, and so on are all run within a single PHP server request. This helps reduce overheads – there is no inter-process communication required, just function calls.
Yes, there is of course the performance of PHP itself to consider compared to other languages, but I will leave that to people like Daniel Sloof. Regardless, making PHP faster will not solve the super-scale problem – it will just get you a bit further before you hit the limits again. This is an important point, so I am going to repeat it for clarity: Doubling (or even tripling) the speed of processing a page is great, but does not solve the super-scaling problem. Doubling performance is not an order of magnitude increase.
The Magento data model is quite rich and complex. One of the techniques Magento uses to improve performance is to keep the master copy of data in a rich, sophisticated form, and then build a flattened, denormalized copy set up for high-performance querying. For example, a product listed across multiple sites may have a record created per site, allowing queries to always specify the site.
Of course, while searches are faster with flat indexes, there are additional overheads to create them and keep them in sync with the master data. This has been the source of a number of performance improvements in the 1.X product line – more efficiently updating flat tables when different parts of master data are updated.
So flat indexes help scale some operations and slow down others. In general, however, there is a positive net benefit.
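The flattening idea can be illustrated with a toy example. The attribute rows below stand in for normalized (EAV-style) master data; the data shape is invented for illustration, not Magento's actual schema:

```python
# Master data in a normalized shape: one row per attribute value.
# Answering "show me product p1" requires gathering several rows.
attribute_rows = [
    ("p1", "name", "Blue Shirt"),
    ("p1", "price", 19.99),
    ("p2", "name", "Red Shoes"),
    ("p2", "price", 49.99),
]

def build_flat_index(rows):
    """Denormalize master data into one flat record per product,
    trading extra build/sync work for cheap reads."""
    flat = {}
    for product_id, attribute, value in rows:
        flat.setdefault(product_id, {})[attribute] = value
    return flat

flat_index = build_flat_index(attribute_rows)
# A read is now a single lookup instead of a multi-row join;
# the cost is that the index must be rebuilt (or patched) on every
# master-data update, which is the syncing overhead discussed above.
```

This is the tradeoff in miniature: reads get cheaper, writes pick up the extra work of keeping the flat copy in sync.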
HOW TO MAKE MAGENTO SUPER SCALABLE?
So how do you make Magento super-scalable? Hopefully from reading the above you will have realized there is no single simple solution to all scaling problems. There are some tried and true techniques that can help substantially. But since this post is on super-scaling, once you have put lots of caches in place, horizontally scaled your web servers, and used a distributed database system, what do you do next? How can the Magento architecture make it easier to go even further? What is the difference between the architecture of a Magento site and, say, the ebay.com site?
My best answer for making Magento super-scalable is not some wonderful new technology or super clever algorithm. It actually involves going back to fundamental software engineering principles. If you assume any part of Magento in some scenario is going to become a bottleneck, then make sure, through good API design and layering, that you have logical slice points where you can replace the default implementation with a super-charged one.
The Magento 2 Service Layer (may be renamed to “service contracts”) was designed:
- For modules to be able to define services that the module provides to other modules, with the understanding that other well-behaved modules restrict themselves to using the formally defined API.
- For presentation code to be separated from business logic, allowing the same presentation code to be used with different implementations.
By providing a contract of behavior through an API, replacing a module becomes much simpler. A different implementation (say using database sharding or a NoSQL database) can be swapped in with minimal complications. Thus the solution is not to solve the scaling problem, but rather make it possible to target issues specific to a site and plug that solution in without having to rewrite the whole site from scratch.
A module may also define several internal interface layers making it easier to replace parts of a module without losing all of the useful functionality provided by that module. For example, a well designed module may define a set of PHP interfaces to hide the database access logic from the rest of the module. Swapping out the default MySQL database access with a NoSQL database could then be done without losing all of the rest of the business logic in the module.
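Such an internal storage interface might look like the following sketch (Python rather than PHP for brevity; the class and method names are invented for illustration, not actual Magento APIs):

```python
from abc import ABC, abstractmethod

class ProductRepository(ABC):
    """Internal interface: the rest of the module sees only these methods,
    never the storage technology behind them."""

    @abstractmethod
    def get_name(self, product_id): ...

class MySqlProductRepository(ProductRepository):
    # Stand-in for SQL access; a real one would issue SELECT statements.
    def __init__(self, rows):
        self._rows = rows

    def get_name(self, product_id):
        return self._rows[product_id]

class NoSqlProductRepository(ProductRepository):
    # Swapped-in alternative backend; callers cannot tell the difference.
    def __init__(self, documents):
        self._documents = documents

    def get_name(self, product_id):
        return self._documents[product_id]["name"]

def describe(repo: ProductRepository, product_id):
    # Business logic in the module depends only on the interface,
    # so it survives the storage swap unchanged.
    return f"Product: {repo.get_name(product_id)}"
```

Because `describe()` is written against `ProductRepository`, swapping MySQL for a NoSQL store touches only the repository implementation, not the business logic.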
This boils down to good modular software design. Design good APIs and use them. The same goes for any other code you call. As soon as you go around a defined API, replacing the underlying logic becomes that much harder. So don’t do it.
Extensions – the Power of Magento
It is worth pointing out that one of the benefits of the Magento platform is the availability of third party extensions. Extensions may replace existing functionality or introduce new functionality. So how do you super-scale Magento and still use the rich set of extensions available?
The good news is the above still holds. If you are writing an extension, use the well-defined API of other modules. Try not to assume the database structure or technology another module is using.
Another technique is to avoid cross module join queries whenever possible. If you can perform operations using APIs instead of assuming the database structure used by another module, that will make your extension more able to work with a range of different implementations.
Example – Cart
For example, let us imagine a site where shopping cart throughput is the major bottleneck for a Magento installation. The site has horizontally scaled the web servers and tried a distributed database engine instead of standard MySQL, but it is still not fast enough. One solution may be to use a Redis cluster as an in-memory database, which provides redundancy and horizontal scaling without complicating the application logic. If the whole cluster does go down, the carts are lost, but that is considered acceptable. Being in-memory, performance is significantly improved.
The objective here is to be able to replace the implementation of the cart with the Redis cluster without other modules knowing the change has been made. As long as the cart code provides an API with methods such as addToCart(), the rest of the code does not need to know how the cart is implemented. Having such well-defined “service contracts” is critical to the success of super-scaling Magento. This breaks down as soon as modules make assumptions about how other modules are implemented.
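The cart contract idea can be sketched as follows. Again this is Python rather than PHP, the interfaces are illustrative rather than the actual Magento service contracts, and the "Redis-like" backend is faked with a dict rather than a real Redis client:

```python
from abc import ABC, abstractmethod

class CartService(ABC):
    """The 'service contract': callers only ever see these methods."""

    @abstractmethod
    def add_to_cart(self, user, sku): ...

    @abstractmethod
    def get_cart(self, user): ...

class DefaultCartService(CartService):
    # Stand-in for the stock database-backed implementation.
    def __init__(self):
        self._carts = {}

    def add_to_cart(self, user, sku):
        self._carts.setdefault(user, []).append(sku)

    def get_cart(self, user):
        return self._carts.get(user, [])

class RedisLikeCartService(CartService):
    # Hypothetical drop-in replacement backed by an in-memory cluster.
    # (A real one would call a Redis client; a dict stands in here.)
    def __init__(self):
        self._store = {}

    def add_to_cart(self, user, sku):
        self._store.setdefault(f"cart:{user}", []).append(sku)

    def get_cart(self, user):
        return self._store.get(f"cart:{user}", [])

def checkout_summary(cart: CartService, user):
    # Presentation/business code is written once, against the contract,
    # and never learns which backend is in play.
    return f"{user} has {len(cart.get_cart(user))} item(s)"
```

Swapping `DefaultCartService` for `RedisLikeCartService` is invisible to `checkout_summary()`; that invisibility is the whole point of the contract.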
Of course the super performant cart may be implemented as an extension. A series of common bottlenecks may be identified with implementations being developed for each case.
To repeat, the critical aspect that makes super-scaling possible is not a specific technology – it is well-defined APIs that isolate implementation logic from the rest of the system, allowing highly optimized implementations to be plugged in. To achieve this, all modules need to be reworked to minimize the number of assumptions they make about other modules. This is the process of modularization and decoupling that has been worked on in the Magento 2 code base, and it will continue over subsequent releases.
Will just having better APIs solve all scalability problems? No. APIs alone just allow implementations to be replaced; the replacement implementation obviously still needs to be developed. The point is that for a specific site, when a bottleneck is identified, having well-defined APIs (service contracts) for cross-module calls makes it possible to develop high-performance replacements for targeted areas of the code base without having to rewrite the whole code base.
Another Example – Flat Indexes
To me the introduction of flat indexes presents an interesting use case and one I want to explore more in the future. Ideally the introduction of flat indexes should be hidden from other modules. Flat indexes are after all an optimization and so ideally should not affect other modules as long as they use the standard API. But this is where one of the areas of coupling sneaks in.
Modules that go directly to the database tables created by other modules can be faster than if they restricted themselves to use the published API. However, while faster, they may actually be less scalable. Why? They won’t work with a different implementation of the back end logic. For example, if the catalog was moved out of MySQL to an in-memory database, modules that restricted themselves to using the official API would continue to work. Modules that depend directly on the MySQL tables would not work, and if critical may block the more scalable solution from being introduced.
This can be an interesting tradeoff. An implementation that is faster may be less scalable. If you really need to scale, that may mean you have to accept a slower specific implementation – but the ability to distribute that slower solution across more machines will achieve the scalability that another implementation cannot.
Presentation and Business Logic Decoupling
One of the challenges with Magento 2 is to provide a platform able to scale from a single host servicing the full web server application and MySQL database to a large site. The goal is not to stop supporting this lower end of the market – the goal is to see how far the same Magento framework can be scaled to support larger sites.
Why? Why is it useful to super-scale Magento? Why not just accept that beyond a certain point you need a completely different system? One reason is developer skill sets. Magento is more than just a piece of software. There is a whole ecosystem of trained ecommerce developers and extension providers that Merchants can draw upon. So there is value in, for example, using the Magento presentation tier to develop user experiences even if the Magento backend functionality is completely replaced. This is one of the reasons Magento 2 has invested heavily in decoupling business logic from the presentation tier within the code base and will be encouraging extension developers to do the same. The user interface of an extension may be valuable even if the default MySQL based storage layer provided with the extension is not.
So a goal of Magento 2 is to encourage extension developers to design modular code with good APIs even inside their own extensions. That will enable their extension to be picked up by even larger Merchants even if the default storage implementation provided is not sufficiently scalable. The extension developer needs to provide a working extension – but they don’t have to solve all the scaling problems themselves – they can make it possible for someone else to tackle later.
There are a number of well-known strategies and technologies for improving the scalability of sites like Magento, such as horizontal scaling and database replication. These techniques do help and will get you a long way down the path.
But if you want to go further, if you want to push a Magento site to the extreme, it is not feasible to predict all the combinations different sites may need. This is where good software engineering practice around API design kicks in. Rather than solving the scalability problem out of the box, the introduction of “service contracts” in Magento 2 will make it easier to replace default implementations that are not sufficiently performant with faster implementations, when required.
Magento 2 has started down this path, introducing concepts like service contracts and pulling business logic out from the presentation code. The revised platform has been designed to decouple modules and other parts of the code base, making it more feasible to replace major subsets of functionality. The Magento 2 release started this decoupling work, but more will be required. The decoupling and improved modularization of the current code base is not particularly sexy by itself, and this has been a source of internal team soul searching for some time. For all the investment, is there a corresponding benefit?
My personal feeling is absolutely “YES”! It is hard to quantify – more decoupling is still going to be required after the 2.0.0 release. But the more different subsystems can be decoupled so they are not dependent on specific implementations of each other, the more potential for Magento to be scaled. The decoupling won’t make Magento 2 scalable by itself – optimized implementations of carts, search, catalogs etc will still need to be implemented. The decoupling will make the targeted re-implementation of such functionality feasible. The ability to rework only the parts of an installation that are bottlenecks while retaining the rest of the installation is where the super-scaling potential lies.
I titled this post a “thought experiment”. Hopefully the above shows where I believe the potential is for scaling Magento. What do you think? Can you “see” how the modularization is going to help? What parts do you find not compelling? Please leave a comment! I would love to hear more about the real world experiences of the community.