Reliable, Scalable, and Maintainable Applications

1. Data Systems

Modern applications are often data-intensive rather than compute-intensive. That is to say, the bottleneck is not the CPU speed, but some aspect of data, whether that be storage, throughput or some other factor.

These applications are often built by assembling building blocks of well-established data systems. For example, database, caches, search indexes, stream processing, batch processing. The application is essentially a new, special purpose data system built from generic components.

These are often so well abstracted that we do not think too deeply about the internals of their operation. But this is key to how our overall application will perform.

This chapter serves as a gentle introduction to establish some of the key terminology in this field. Later chapters dive deeper into individual data systems that form the layers of an application.

There are three main focuses when designing software systems:

Reliability - the system should continue to work correctly in the face of adversity
Scalability - the system can deal with growth
Maintainability - the system can be operated and extended

2. Reliability

The application “working correctly” broadly means that it meets its functional and non-functional requirements.

Reliability roughly means “the application should work correctly, and continue to work correctly when things go wrong”

The “things that go wrong” are faults; a system that can cope with faults is fault-tolerant or resilient. In practice, we can’t account for every possible fault (what happens if an asteroid hits the Earth) so we specify what types of faults we are tolerant of.

A fault is different from a failure. A fault is where one component of the system deviates from the spec. A failure is when the system as a whole stops provided the required service to the user. There are diminishing returns which means it is generally not possible to reduce the proability of fault to zero, so instead we focus our attention on ensuring faults do not cause failures.

To ensure the fault tolerance of a system is well-exercised, we may (counterintuitively) want to induce faults intentionally to periodically assess the impact on the overall system. This is the idea behind Netflix’s Chaos Monkey approach, where production services are disabled at random to ensure the system as a whole remains functional in such an event.

Categories of faults:

Hardware faults
Software faults
Human errors

Human errors can be minimised with well-designed abstractions that “make it easy to do the right think and make it hard to do the wrong thing”. This can be a tricky balance to get right, as interfaces that are too restrictive will force people to work around them. sandbox environments and making rollbacks of config changes fast can also help mitigate risk.

3. Scalability

Scalability describes the system’s ability to cope with an increased load.

It is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”

3.1. Describe Load

We first need to describe the load. This means summarising the system’s load with load parameters which depend on the system itself. This may be requests per second for a web server, read/write ratio for a database, number of concurrent users, cache hit rate etc.

Twitter Example

Scaling can be made difficult by fan out. For exmaple, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-out, whereby each user follows many people, and each user is followed by many people. When a tweet is published it needs to be updated on the tweeter’s timeline and update on each of their follower’s home screen.

Approach (1) would be to store all data in a relational database and join the follows, tweets and users table on each tweet and timeline update.

Approach (2) is to keep a cache for each user’s timeline, and insert new tweets into the relevant users’ caches when someone they follow tweets. This generally works well because there are far more reads than writes - more lurkers than tweeters.

Twitter started with approach (1) then moved to (2). But for celebrities who have millions of followers, the fan out of approach (2) is prohibitively slow, so they use a hybrid of both where users with lots of followers use approach 1.

3.2. Describe Performance

4. Maintainability

References

Chapter 1 of Designing Data Intensive Applications by Martin Kleppmann