Building Systems for the Modern World
Here I want to focus on the relatively new notion of building antifragile systems. We will focus mainly on principles and approaches that describe services and products, not organizations or teams. The antifragile approach works for teams and companies too, but I believe that concentrating on what a startup’s activity actually produces is closer to a founder’s needs.
We are past the time when we could build a simple site or service and expect it to provide meaningful value to a customer. One way or another, any business activity eventually produces a complex system.
Exposure to millions of users (which is the goal of a startup) naturally leads to building complex systems. To be modern and able to respond to elevated requirements, these systems need to be adaptive.
A complex adaptive system can be defined as a system (man-made or natural) composed of multiple entities that interact in involved ways.
The need for these components to interact with each other is dictated by the idea behind the system. The components also need to be able to react and adapt to large negative impacts, external or internal, in order for the system to survive.
Even at this high level of defining what we want to create, we can start recognizing the pattern and notion of microservices. The microservices architecture approach naturally fits the structure of a complex adaptive system.
We need to recognize the importance of viewing such a system as a holistic operation. The individual performance of a microservice needs to be considered in the context of the whole system; its own performance, as important as it may be, is not what defines the adaptiveness and antifragility of the whole system. If it does, the system needs to be reviewed and redesigned.
Defining the Scale for Systems Vulnerability
Let’s review and better define the types of systems we recognize, based on their ability to react to external compromising events.
We are all very familiar with these kinds of systems, complex or not. A fragile system is easily impacted by incidents, and recovering from them is usually expensive and slow.
A startup building a new service or product normally starts with building a fragile system. It usually executes the so-called “happy path” of the process and has a very low tolerance to provocations.
Another characteristic of a fragile system is that it usually holds many very deep dependencies on other systems. An example would be reliance on an Internet connection to process transactions, retrieve information, or save gathered data. A simple disconnect from the network can bring the system down and lead to a bad user experience and, even more important, to loss of data, exposure to security attacks, etc.
In some cases, building such a system to impress an investor or simply to shrink the time to market can lead to catastrophic situations in which the startup cannot sustain the load, survive security attacks, or survive the lawsuits that follow. In fact, any MVP represents such a system, especially when built without understanding what an MVP is.
Robust systems are often seen as the logical answer to fragile systems. Fragile systems are vulnerable to the impact of compromising events, while robust systems withstand and/or absorb them.
Robust systems are what we are all trying to create once we survive the initial growth and push through the problems of defining our service or product and delivering it into the hands of our first customers.
Once the service faces our customers, or the product is released, we see how customers really use it, and in many cases, it differs from our initial design, implementation or even idea. Eventually, after a few iterations, we adjust it and have something valuable to offer. Now it is time to stabilize it.
Robustness is seen as the ideal solution and target to aim for. We eliminate one by one existing and newly discovered issues and make the system run 100% of the time. Or so we want to think.
Let’s talk about a system that we want to think is robust and can sustain any negative impact.
The problem is that for any such system that has addressed N potential or existing negative events, we can always come up with at least one more case where the system will be compromised.
Depending on funding, we can continue fixing existing and anticipated problems, whether discovered by working with customers or hypothesized by the development team or management.
Funds are not an unlimited resource, so we always face “triaging”, “mitigating”, etc. In other words, we balance the remaining fragility of the robust system against throwing more resources at it, or we make “executive decisions” about what not to fix.
Unlike robust systems, antifragile systems learn from incidents how to function increasingly well in a changing environment. In fact, for a system to be antifragile, it needs to be attacked and compromised so that it can learn and adapt.
Without the ability to learn from incidents, or without going through such incidents, antifragile systems become fragile systems and are not adequate to the needs of a changing environment.
Antifragility is assessed for a particular type of impact and is best viewed as degrees of antifragility to a specific set of problems.
Characteristics of an Antifragile System
How do we recognize what an antifragile system is? How do we develop one and what are the elements we need to build to achieve one?
Here are the characteristics currently used to define an antifragile system:
A complex system needs to be modular. A local failure due to an internal error or to abnormal interactions between system units, or with the environment, should not take down the whole system.
To achieve this, a system needs to be modularized at both the hardware and software levels, and its dependencies need to be described in terms of strong and weak connections. Weak connections guarantee that changes in one component don’t require or imply changes in another.
Weak links are like circuit breakers: they prevent the results of a compromising event or events from propagating across the whole system. Weak links enhance robustness to propagating failures by restricting damage to a single module.
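To make the circuit-breaker analogy concrete, here is a minimal sketch in Python. The class name, the thresholds (max_failures, reset_after) and the injectable clock are assumptions chosen for illustration, not part of any particular library. After a number of consecutive failures the breaker stops calling the dependency for a cool-down period, so the damage stays inside one module.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call instead of trying the dependency."""


class CircuitBreaker:
    """Hypothetical breaker that turns a strong link into a weak one."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail immediately, do not touch the dependency.
                raise CircuitOpenError("dependency is isolated")
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would wrap each outbound dependency in its own breaker, for example breaker.call(fetch_recommendations, user_id), where fetch_recommendations stands for any hypothetical remote call, and treat CircuitOpenError as a signal to degrade gracefully.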
As part of the design effort, we need to define the links between the modules and identify any strongly connected modules, then work on these links and reduce the dependencies to make the modules more independent of each other.
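One way to carry out this kind of dependency review is to write the links down as data and ask how far a failure would travel. The sketch below is only an illustration: the module names and the "strong"/"weak" labels are invented, and it assumes that a failure propagates only along strong links.

```python
from collections import deque

# Hypothetical links: (dependent module, module it depends on, link strength).
LINKS = [
    ("checkout", "payments", "strong"),
    ("checkout", "recommendations", "weak"),
    ("payments", "ledger", "strong"),
    ("search", "recommendations", "strong"),
]


def blast_radius(links, failed_module):
    """Return the modules that go down with failed_module via strong links."""
    # Index strong links by dependency so we can walk from a failure outward.
    dependents = {}
    for dependent, dependency, strength in links:
        if strength == "strong":
            dependents.setdefault(dependency, set()).add(dependent)

    affected = set()
    queue = deque([failed_module])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in affected:
                affected.add(module)
                queue.append(module)
    affected.discard(failed_module)  # matters only if the links form a cycle
    return affected
```

With these sample links, a ledger failure reaches payments and checkout, while a recommendations failure reaches only search, because checkout depends on it weakly. A large blast radius points at the links worth weakening first.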
According to Nassim Nicholas Taleb, redundancy is an inherent property of antifragile systems. Efficiency is not the primary goal of such systems.
Antifragile systems are designed and built to survive and even thrive in an environment of randomness and uncertainty, and as such “inefficiency” is expected, mostly through layered redundancies.
In other words, expect to run multiple instances of a component, on multiple hardware platforms. This notion is expanded with the requirement for diversity.
Simply put, do not rely on a single technological solution or platform, whether software or hardware.
To ensure a system will sustain interactions with the outside world, it needs to provide almost full duplication of functionality using different approaches, architectures, and platforms. The antifragile system needs to evolve to be both redundant and diverse. Diversity makes it less likely that many modules will fail at the same time as a result of a single compromising event. Only a diverse (and redundant) system is highly robust to propagating failures; single modules, or multiple modules on the same platform, remain fragile.
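As a rough sketch of redundancy combined with diversity, the same operation can be offered by independent implementations on different platforms, with the caller falling through to the next one when a platform fails. The provider names and functions below are hypothetical stand-ins; real implementations would call actual platform SDKs.

```python
def call_with_failover(providers, request):
    """Try each redundant provider in order; return (name, result) of the first success."""
    errors = {}
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:  # a failure on one platform must not stop the others
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {sorted(errors)}")


def store_on_cloud_a(request):
    # Simulated platform-wide outage on the first (hypothetical) platform.
    raise ConnectionError("cloud A region outage")


def store_on_cloud_b(request):
    # Same functionality, implemented independently on a second platform.
    return f"stored {request!r} on cloud B"


PROVIDERS = [("cloud_a", store_on_cloud_a), ("cloud_b", store_on_cloud_b)]
```

Calling call_with_failover(PROVIDERS, "order-1") skips the failed platform and is served by the second one. The point of diversity is that one compromising event is unlikely to break both implementations at once.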
For an adaptive system to be antifragile to classes of negative events, it needs to learn from previous failures and recoveries so that it can adapt.
The worst case for such a running system would be a design that recognizes the need to recover only when some of the managed resources or dependencies are exhausted. That is too late in the management of the system, and it may bring down the whole system simply because it was designed to be too tolerant of resource consumption.
An example would be allowing a software component to slowly creep up in memory usage, exhaust a connection pool, or overload the CPU for an extended period of time.
If the system tolerates such internal events, it may become unable to recover simply because all resources are consumed and none are available for the recovery operations.
Failing fast ensures that resources remain available for the recovery processes, for persisting state, and for feeding the learning mechanisms of the adaptive system.
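The fail-fast idea can be sketched as a guard that trips well before a resource is exhausted, leaving headroom for recovery. Everything here is an assumption made for illustration: the resource names, the limits, and the 80% headroom factor. How usage readings are collected is left out.

```python
class ResourceBudgetExceeded(RuntimeError):
    """Raised while there is still headroom left to recover."""


class FailFastGuard:
    """Hypothetical guard that fails a component before its resources run out."""

    def __init__(self, limits, headroom=0.8):
        self.limits = dict(limits)  # hard capacity per resource
        self.headroom = headroom    # fraction of capacity at which we fail

    def check(self, usage):
        """Raise as soon as any resource crosses its early-warning threshold."""
        for name, used in usage.items():
            threshold = self.limits[name] * self.headroom
            if used >= threshold:
                raise ResourceBudgetExceeded(
                    f"{name} at {used} of {self.limits[name]} "
                    f"(threshold {threshold:g})"
                )


guard = FailFastGuard({"memory_mb": 1024, "db_connections": 50})
```

A component would call guard.check(...) with its current readings on every health probe. When the exception is raised, the supervisor can persist state and restart the component while memory and connections are still available to do so.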
Antifragile systems are an interesting design approach and methodology to research and apply. It is easier to do so when creating a new system, as is the case in startup development.
It is not an easy or inexpensive proposition, but it is the obvious approach for creating adaptive systems.
With the microservices design methodology in our toolboxes, building a system that is close to being antifragile is definitely possible.
It would require recognizing and mitigating dependencies, running multiple instances with the same functionality, introducing similar or identical functionality on different platforms (Azure AND AWS, not OR), and, of course, making such systems report early and fail fast.
All this sounds like a good architectural strategy and something that is achievable with today’s microservices approach. To take it to the next step and make it a truly adaptive (and antifragile) system, it would make sense to utilize machine learning to manage your system.