The Software Crisis

The term "software crisis" was coined by attendees at the first NATO Software Engineering Conference, held in 1968 at Garmisch, Germany, to describe the difficulty of writing useful and efficient computer programs in the time required. The crisis arose from the rapid increase in computer power and in the complexity of the problems that computers were now expected to tackle. As software grew more complex, problems multiplied because the existing development methods were no longer adequate.

2018 marked the 50th anniversary of that original software crisis, and in the intervening years the power of computers has exploded, along with the capability and complexity of the software that runs on them. Yet the same problems still strike projects, which fail to be delivered on time, do not meet the requirements of their users, or collapse in unrecoverable heaps. The cost of maintaining a large system often grows exponentially until it greatly exceeds the cost of building it in the first place, yet this is rarely acknowledged at the outset.

Many explanations have been offered for this phenomenon, but most blame a failure to follow strict rules during software development. The solution proposed each time is to add another layer of formality to the process, in an attempt to squash bugs before they become embedded in the system. The result is increased complexity and a shrinking pool of people who actually understand what is going on. That's just about OK if you can find these people and get them to work for you, but it's not how things happen in the real world, where people stay in a job for a couple of years and then move on.

Given enough perspective it's possible to see a pattern here: we are doomed to repeat the mistakes of the past in an ever-tightening spiral. But one system has never succumbed to this fatal pattern, and it has become the largest software system in the entire world. In spite of failures in its component parts, the system as a whole has never gone down. I refer, of course, to the World Wide Web.

What makes the Web different from everything else? Some attribute this to the fact that it uses Representational State Transfer (REST) as the sole means of communication between its components. But that can't be the whole explanation; other REST-based systems haven't been so lucky. Something else must be happening.

I think the answer is simple. The Web is a huge system, composed of millions of component parts, but it's extremely tolerant of component failure. If a website goes down the rest of the system keeps working, and a work-around is soon found for all but the most serious failures. This is a self-healing mechanism at work. And it works because each component is owned by someone; someone with a personal stake in its success. It's almost as if each component were an individual sentient being, capable of promoting its own survival.

The Web has no central controller. Everything works by agreement between component parts. No overall framework governs this, and here we have a major contrast with most other software systems, which, despite determinedly adopting modular practices, retain a fatal reliance on central direction that cannot know everything. Like a planned economy, things start out well, but after a while the centre loses contact with the component parts. In a software system this is because those parts were built by individual programmers, each with a mind of their own and a less than perfect understanding of the overall system. How could it be otherwise? To fully understand the system you had to have been there while it was being built. Coming in later means you'll never really get that understanding.

So we end up with systems whose component parts are each "owned" by someone who has only a partial understanding of the rest of the system, yet governed by a framework that requires total adherence to its rules in order to operate properly. When bugs are discovered they are worked on by people who also have a less than perfect understanding, so their imperfect fixes are installed back into the system, which degrades slowly but inexorably as time goes by.

So what's the solution? My answer is to do away with frameworks, which represent the dead hand of bureaucracy. It may be early days, but I have seen signs that by removing most of the stifling rules and allowing components to think for themselves we can increase the overall robustness of a software system, making it far leaner and more agile in the process. Sure, we increase duplication, but DRY (Don't Repeat Yourself) was never a panacea, just one of many competing aims. Responsibility should be delegated from the centre to individual self-governing microservices, which interact only with their immediate neighbours and a few independent, global core services. Such systems, mimicking the structure of the Web, can exhibit the self-healing characteristics I describe above.
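
To make this concrete, here is a minimal sketch of what one such self-governing component might look like, written in TypeScript for Node.js. It is an illustration only, not a prescription: the "catalogue" and "pricing" service names, the ports and the URLs are all hypothetical. The component owns its own data, exchanges messages with a single neighbouring service, and degrades gracefully rather than failing when that neighbour goes down.

    // A minimal sketch, assuming a hypothetical "catalogue" service that
    // talks to one neighbouring "pricing" service. Names, ports and URLs
    // are invented for illustration. Requires Node.js 18+ (global fetch).
    import * as http from "http";

    const NEIGHBOUR_URL = "http://localhost:4001/price"; // hypothetical neighbour
    const FALLBACK_PRICE = { amount: null, note: "pricing service unavailable" };

    // Ask the neighbouring service for a price, but never let its failure
    // take this service down: on any error, fall back to a degraded answer.
    async function fetchPrice(): Promise<unknown> {
      try {
        const res = await fetch(NEIGHBOUR_URL, { signal: AbortSignal.timeout(500) });
        if (!res.ok) return FALLBACK_PRICE;
        return await res.json();
      } catch {
        return FALLBACK_PRICE; // neighbour is down: keep working regardless
      }
    }

    // The component exposes only a small message-based surface to the rest
    // of the system; nothing outside it needs to know how it is built.
    http
      .createServer(async (req, res) => {
        if (req.url === "/catalogue/item/42") {
          const price = await fetchPrice();
          res.writeHead(200, { "Content-Type": "application/json" });
          res.end(JSON.stringify({ id: 42, name: "widget", price }));
        } else {
          res.writeHead(404);
          res.end();
        }
      })
      .listen(4000, () => console.log("catalogue service listening on :4000"));

If the pricing neighbour disappears, the catalogue keeps answering requests with a degraded response; a local work-around of exactly the kind that keeps the Web as a whole running.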

A very important characteristic of such a system is that individual components are owned and managed by their own teams, who know only enough about the rest of the system to exchange messages and data with it. The role of the system administrators is to identify what components are needed, never to be involved in how they are constructed, which allows different parts of the system to be built with whatever technologies best suit their needs or the skills available. Indeed, the rapid rise of Node.js and its use to power microservices may signal a significant rebirth of the software industry. It will be interesting to watch how well microservice-based systems stand up to the test of time.