Part of Hints for Operating High Quality Services by
Mark Verber
Very Very Early Draft 0.1 --
January 30, 2004
Design for the future, but only implement what you need now.
While we want design and architecture to be clean, the most important evaluation criterion is whether things work and keep running. The architecture of any non-trivial system requires a number of trade-offs which simplify the system to a manageable number of moving parts. Often there are a number of constraints which seem to be mutually exclusive. The ultimate test is a service which runs and is manageable. The second test is how quickly someone who has never seen the system before can understand the basics of how it works.
Partition the system into scalable modules. You want explicit interfaces between the modules. Interfaces might be programmatic APIs, network-oriented ACLs, routing domains, physical cabinets, etc. (A minimal sketch of a programmatic interface follows the examples below.)
<AOL Networking List of Don'ts>
<MegaPod Mess>
<Switch -vs- Route>
<IBstore>
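As a minimal sketch of what an explicit programmatic interface between modules might look like (the UserStore name and methods below are hypothetical, purely for illustration):

```python
from abc import ABC, abstractmethod


class UserStore(ABC):
    """Explicit interface to the user-data module.

    Callers depend only on this contract; the backing implementation
    (a local dict, a database, a remote service) can be swapped or
    scaled independently of everything else.
    """

    @abstractmethod
    def get_user(self, user_id: str) -> dict:
        """Return the stored record for user_id, or raise KeyError."""

    @abstractmethod
    def put_user(self, user_id: str, record: dict) -> None:
        """Create or replace the record for user_id."""


class InMemoryUserStore(UserStore):
    """Trivial implementation, useful for tests; production would use a real store."""

    def __init__(self) -> None:
        self._users: dict[str, dict] = {}

    def get_user(self, user_id: str) -> dict:
        return self._users[user_id]

    def put_user(self, user_id: str, record: dict) -> None:
        self._users[user_id] = record
```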
Murphy was an optimist. Most of the time you can't use the law of large numbers; you have to use the law of medium numbers, i.e. you will see all the exceptions. It is not a question of whether any component will fail, it is a question of when. In fact, you should expect multiple low-probability faults to happen at the same time. History is filled with perfect storms and Titanics. Most accidents in well-planned systems involve two or more events of low probability occurring in the worst possible combination.
Make sure your testing is valid. Test with real data (replay logs, etc). Don't put anything into production without testing it in the environment you are going to run it in. Beware of code which has special cases for testing. <Hint by mail>
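A hedged sketch of replaying captured logs against a candidate build, assuming a made-up log format (one JSON request per line) and a staging endpoint:

```python
import json
import urllib.request

# Hypothetical: one JSON-encoded request per line, captured from production.
LOG_FILE = "access_log.jsonl"
CANDIDATE = "http://staging.example.com"   # candidate build under test


def replay_log(path: str, base_url: str) -> None:
    """Replay each logged GET request against the candidate and compare status codes."""
    mismatches = 0
    with open(path) as log:
        for line in log:
            entry = json.loads(line)        # e.g. {"path": "/users/42", "status": 200}
            url = base_url + entry["path"]
            with urllib.request.urlopen(url) as resp:
                if resp.status != entry["status"]:
                    mismatches += 1
                    print(f"{entry['path']}: expected {entry['status']}, got {resp.status}")
    print(f"done, {mismatches} mismatches")


if __name__ == "__main__":
    replay_log(LOG_FILE, CANDIDATE)
```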
Common Heuristics
Since everything will fail, you should design for fast recovery. There are many examples of systems which have been well designed for fast recovery; log-based file systems and databases were a huge improvement over their predecessors.
Even in human-scale systems (minutes or hours rather than milliseconds), optimizing for fast recovery should be considered. For example, all machines should be quickly recoverable if their hardware dies; replacing a broken machine should not require expert human attention. Basic software and configuration need to live outside the host on some centralized resource. We should separate the system and configuration from the data.
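One possible sketch of rebuilding a replacement host entirely from a centralized source, keeping system and configuration separate from data; the host names, paths, and rsync layout are assumptions for illustration:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical central locations; real systems might use a package
# repository and a configuration-management service instead.
CONFIG_SERVER = "config.example.com"
DATA_VOLUME = Path("/data")          # user data lives on its own volume
SYSTEM_ROOT = Path("/opt/service")   # software + config, fully rebuildable


def rebuild_host(hostname: str) -> None:
    """Reinstall software and configuration for a replacement machine.

    Nothing under SYSTEM_ROOT is precious: it is re-created entirely from
    the central source, while DATA_VOLUME is left untouched.
    """
    if SYSTEM_ROOT.exists():
        shutil.rmtree(SYSTEM_ROOT)
    SYSTEM_ROOT.mkdir(parents=True)

    # Pull the software bundle and per-host configuration from the central server.
    subprocess.run(
        ["rsync", "-a", f"{CONFIG_SERVER}:/bundles/service/", str(SYSTEM_ROOT)],
        check=True,
    )
    subprocess.run(
        ["rsync", "-a", f"{CONFIG_SERVER}:/configs/{hostname}/", str(SYSTEM_ROOT / 'etc')],
        check=True,
    )
    print(f"{hostname}: system rebuilt; data volume {DATA_VOLUME} untouched")
```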
Trivial fail-fast is appropriate when ...
Common Heuristics
Keep in mind...
Everything will fail (see above). That means you will need to take components in and out of service. You want to be able to test components before you put them back into service, and to remove workload from a component so you can take it out of service for preventative maintenance and upgrades. You need to be able to quiesce services.
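A small sketch of one way to quiesce a component: stop admitting new work, then wait for in-flight work to drain. The class below is illustrative rather than any particular framework's API:

```python
import threading
import time


class Quiescer:
    """Tracks in-flight requests so a server can be drained before maintenance."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._accepting = True

    def try_start_request(self) -> bool:
        """Admit a request only while the server is accepting new work."""
        with self._lock:
            if not self._accepting:
                return False          # load balancer should send the request elsewhere
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def drain(self, poll_seconds: float = 0.5) -> None:
        """Stop admitting work and block until all in-flight requests complete."""
        with self._lock:
            self._accepting = False
        while True:
            with self._lock:
                if self._in_flight == 0:
                    return
            time.sleep(poll_seconds)
```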
Whenever possible, make things stateless.
If something can't be stateless, have someone else hold state for you.
Someone has to hold state, and this is hard to do well. Solve this problem once with a production-quality state store (or maybe a few, depending on what guarantees you need) and have everything else use it.
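A minimal illustration of the idea: frontends stay stateless and push session state into the shared store. The StateStore below is a stand-in (an in-process dict) for whatever production-quality state store the service standardizes on:

```python
import json


class StateStore:
    """Stand-in for the shared, production-quality state store.

    A real implementation would be a replicated service, not a dict.
    """

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> dict | None:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

    def put(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)


def handle_request(store: StateStore, session_id: str, item: str) -> dict:
    """Stateless handler: all session state lives in the store, so any
    frontend instance can serve any request."""
    session = store.get(session_id) or {"cart": []}
    session["cart"].append(item)
    store.put(session_id, session)
    return session
```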
Avoid humans directly touching systems. All system modifications should be made through automation; manage the service using well-built and tested tools. Often the models embedded in service components are different from what operational folks are doing (declarative vs. procedural). Ideally, give ops folks a "waldo" interface so it feels like working directly on a system, or give them "wizards" to walk them through tasks. Ops folks should not be directly updating files, changing configurations, or installing new software by hand; these tasks should be scripted and fully tested before they are applied to a production service.
Expect the service to change over time. It should be easy to make common changes, and possible to make unexpected changes. Build things to minimize the possibility of unintended consequences (e.g. systems where changing one component indirectly affects others) by creating stable interfaces with well-designed semantics.
Within components, separate fast-changing items from those which change extremely slowly. Examples of this are separating configuration parameters from software, and software packages from the base operating system. It should be possible to replace one component without needing to replace all the others.
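For example, fast-changing tunables can live in a small external file that is deployed and versioned separately from the software package; the file name and keys below are illustrative:

```python
import json
from pathlib import Path

# Hypothetical config file, deployed and versioned separately from the code.
CONFIG_PATH = Path("/opt/service/etc/tunables.json")

DEFAULTS = {
    "max_connections": 200,    # slow-changing defaults shipped with the package
    "request_timeout_s": 30,
}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Overlay externally managed parameters on top of built-in defaults,
    so parameters can change without replacing the program."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config
```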
Fred Brooks in The Mythical Man-Month suggested that you should build a prototype system and then build the real system. While in theory this is a great idea, time-to-market pressures will almost always make it impossible. If you believe that your system really needs to be prototyped and then rebuilt from scratch, build the prototype on a platform, in a language, or in an environment which can't be used for the final product. Otherwise, there will be too great a temptation to reuse the prototype code rather than just the lessons learned. In most cases you won't have the luxury of building a prototype system and then the real system, so it is critical to design for evolution. Your first system will almost always be wrong; hopefully you can fix it.
If you have a rapid release cycle, don't try to get it right the first time. Implement your best guess and then learn from the experience. In the next release you can replace what was a bad idea and improve what was good. Also, go after the low-hanging fruit first.
Often a stage will involve a number of "hacks". You need the maturity to go back and fix the bad hacks.
Expect networks to renumber, and be prepared for company mergers.
Common Heuristics
Do not hard-code knowledge into the program. Write little languages, or use extensive configuration. Either can kill you if it becomes too complex or inconsistent.
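A sketch of a deliberately tiny "little language": a key = value line format with comments, and nothing more. The syntax is invented here for illustration; the point is that every feature added to such a language is something operators must learn and debug:

```python
def parse_little_language(text: str) -> dict[str, str]:
    """Parse a minimal 'key = value' configuration language."""
    settings: dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected 'key = value', got {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        if key in settings:
            # Inconsistency is what kills you, so reject it loudly.
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        settings[key] = value
    return settings


EXAMPLE = """
# illustrative settings
listen_port = 8080
log_level   = info
"""

print(parse_little_language(EXAMPLE))   # {'listen_port': '8080', 'log_level': 'info'}
```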
Avoid bulk updates
Provide a way to migrate data between servers and services
Address security at the beginning; it's very expensive to retrofit after the fact. You must address:
Always check and validate all inputs.
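A small sketch of validating input at the service boundary before acting on it; the field names and limits are examples, not a prescription:

```python
import re

# Hypothetical limits for a "create user" request.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,31}$")
MAX_EMAIL_LEN = 254


def validate_new_user(payload: dict) -> list[str]:
    """Return a list of validation errors; empty means the input is acceptable."""
    errors = []
    username = payload.get("username", "")
    email = payload.get("email", "")

    if not isinstance(username, str) or not USERNAME_RE.match(username):
        errors.append("username must be 3-32 chars: lowercase letters, digits, underscore")
    if not isinstance(email, str) or "@" not in email or len(email) > MAX_EMAIL_LEN:
        errors.append("email must contain '@' and be at most 254 characters")

    return errors
```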
It should be possible to use an automated tool to verify that specified configurations are set as expected, software is correctly loaded, expected services are running, etc. These tools need to be created.
Verify that whatever changes are made are effective and durable. For example, if we rev the OS, we need to make sure the services on that machine continue to function correctly. If we change some system parameters, we need to be confident that when that box reboots it will come up correctly.
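A hedged sketch of such a verification tool: it compares a host's actual state against a declared set of expectations (the expected files and processes below are placeholders):

```python
import subprocess
from pathlib import Path

# Hypothetical expectations for one class of host.
EXPECTED_FILES = {
    Path("/etc/ntp.conf"): "server time.example.com",
}
EXPECTED_PROCESSES = ["sshd", "crond"]


def check_host() -> list[str]:
    """Return a list of discrepancies between expected and actual state."""
    problems = []

    for path, needle in EXPECTED_FILES.items():
        if not path.exists():
            problems.append(f"missing file: {path}")
        elif needle not in path.read_text():
            problems.append(f"{path}: expected to contain {needle!r}")

    running = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    for proc in EXPECTED_PROCESSES:
        if proc not in running.splitlines():
            problems.append(f"process not running: {proc}")

    return problems


if __name__ == "__main__":
    for problem in check_host():
        print("FAIL:", problem)
```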
For any given piece of information there should be a single, authoritative source. Ideally most information would come from the same source. You should never have to enter the same data twice, and you should be able to easily discover what the source is. In general, updating the source of truth should be required for things to work, which forces it to stay correct.
We want to minimize the number of management sources. Every component should have instrumentation that records performance and operational data for monitoring and analysis. Anything which writes logs should be able to reopen them so logs can be rotated. We should call out what sorts of things to report on, such as resource utilization, transaction rates, etc.
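A sketch of the "reopen your logs" rule: the process reopens its log file when it receives SIGHUP, so rotation tools can move the old file without a restart. The log path is illustrative:

```python
import signal
import time

LOG_PATH = "/var/log/service/app.log"   # illustrative path

log_file = open(LOG_PATH, "a")


def reopen_log(signum, frame):
    """SIGHUP handler: close and reopen the log so rotation can rename the old file."""
    global log_file
    log_file.close()
    log_file = open(LOG_PATH, "a")


signal.signal(signal.SIGHUP, reopen_log)

while True:
    log_file.write("heartbeat\n")
    log_file.flush()
    time.sleep(60)
```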
Small is beautiful: KISS, Ockham's razor. Unfortunately true simplicity is not always possible, but virtual simplicity can be achieved through a modular / abstraction / information-hiding design.
Build tools, not monolithic systems. It is easier to update small tools, and it is also possible to tie tools together.
Often systems are built which separate when a change is accepted from when that change is applied. With these sorts of systems it is possible for someone to make a change and not know that their change just caused a problem. Furthermore, when someone notices a problem, it is not clear what caused it. There are a variety of techniques which can minimize these issues. One is to have a process which continually tests changes which have yet to be applied to a running system. For example, continuously building a source tree (often called a tinderbox) and tracking what checkins happen between a successful and an unsuccessful build can be quite helpful.
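A rough sketch of a tinderbox-style loop that continuously pulls and builds the tree, and reports which checkins landed between the last good build and the first bad one; the git and make commands are assumptions about the project's tooling:

```python
import subprocess
import time


def current_head() -> str:
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()


def build_ok() -> bool:
    """Hypothetical build step; replace with the project's real build command."""
    return subprocess.run(["make", "all"]).returncode == 0


last_good = None
while True:
    subprocess.run(["git", "pull", "--ff-only"], check=True)
    head = current_head()
    if build_ok():
        last_good = head
    elif last_good:
        # Report exactly which checkins arrived between the good and bad builds.
        log = subprocess.run(
            ["git", "log", "--oneline", f"{last_good}..{head}"],
            capture_output=True, text=True,
        ).stdout
        print(f"build broke between {last_good[:8]} and {head[:8]}:\n{log}")
    time.sleep(300)   # poll every five minutes
```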
Don't forget about the physical realm. Power and A/C tend to get you.
Have some way to manage services that doesn't rely on the service itself functioning correctly.
Needs more thought
Ideally you want everything to look the same
Trade off development time vs. run-time efficiency.
L. Peter Deutsch wrote up the "Eight Fallacies of Distributed Computing" as an internal memo at Sun Labs in 1991. People often make the following assumptions, which prove to be false and result in a lot of pain and brokenness. Don't assume the following:
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Know Underlying Numbers
http://highscalability.com/numbers-everyone-should-know
Hints for Computer System Design, Butler Lampson, ACM Operating Systems Rev. 17, 5 (Oct. 1983), pp 33-48. Reprinted in IEEE Software 1, 1 (Jan. 1984), pp 11-28.
Recovery Oriented Computing, UC Berkeley Computer Science Department Research Project
Lessons Learned from Giant Scale Internet Services, Eric Brewer, IEEE Computer, July/Aug 2001, pp 45-55
Challenges to Building Scalable Services, originally an internal memo from 1999 by Galen Hunt and Steven Levi
On Designing and Deploying Internet-Scale Services by James Hamilton
End-to-End Arguments in System Design, Jerome H. Saltzer, David P. Reed, David D. Clark, ACM Transactions on Computer Systems 2, 4, November 1984, pp 277-288
Rules of Thumb in Data Engineering, Jim Gray & Prashant Shenoy, Microsoft Corporation Technical Report, December 1999, MS-TR-99-100
Eight Fallacies of Distributed Computing Explained, which is an expansion of a 1991 Sun Microsystems Labs Internal Memo by Peter Deutsch
Ten Fallacies of Software Analysis and Design, Carlos E. Perez, blog posting 2004
RSPA's System Engineering Library
Worse Is Better, Disturbing. Not sure if I completely agree, but a perspective that should be considered
Network API Scalability / Benchmarking
RESTful wiki article