Part of Hints for Operating High Quality Services by
Mark Verber
Very Very Early Draft 0.1 --
January 30, 2004
Design for the future, but only implement what you need now.
While we want design and architecture to be clean, the most important evaluation criterion is whether things work and keep running. The architecture of any non-trivial system requires a number of trade-offs which simplify the system to a manageable number of moving parts. Often there are a number of constraints which seem to be mutually exclusive. The ultimate test is a service which runs and is manageable. The second test is how quickly someone who has never seen the system before can understand the basics of how it works.
Partition the system into scalable modules. You want explicit interfaces between the modules. Interfaces might be programmatic APIs, network-oriented ACLs, routing domains, physical cabinets, etc. (A minimal sketch of a programmatic interface follows the examples below.)
<AOL Networking List of Don'ts>
<MegaPod Mess>
<Switch -vs- Route>
<IBstore>
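As a minimal sketch of what an explicit programmatic interface between modules might look like (the UserStore name and methods below are hypothetical, purely for illustration):

```python
from abc import ABC, abstractmethod


class UserStore(ABC):
    """Explicit interface to the user-data module.

    Callers depend only on this contract; the backing implementation
    (a local dict, a database, a remote service) can be swapped or
    scaled independently of everything else.
    """

    @abstractmethod
    def get_user(self, user_id: str) -> dict:
        """Return the stored record for user_id, or raise KeyError."""

    @abstractmethod
    def put_user(self, user_id: str, record: dict) -> None:
        """Create or replace the record for user_id."""


class InMemoryUserStore(UserStore):
    """Trivial implementation, useful for tests; production would use a real store."""

    def __init__(self) -> None:
        self._users: dict[str, dict] = {}

    def get_user(self, user_id: str) -> dict:
        return self._users[user_id]

    def put_user(self, user_id: str, record: dict) -> None:
        self._users[user_id] = record
```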
Murphy was an optimist. Most of the time you can't use the law of large numbers; you have to use the law of medium numbers, i.e. you will see all the exceptions. It is not a question of whether any component will fail, it is a question of when. In fact, you should expect multiple low-probability faults to happen at the same time. History is filled with perfect storms and Titanics. Most accidents in well-planned systems involve two or more events of low probability occurring in the worst possible combination.
Make sure your testing is valid. Test with real data (replay logs, etc). Don't put anything into production without testing it in the environment you are going to run it in. Beware of code which has special cases for testing. <Hint by mail>
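A hedged sketch of replaying captured logs against a candidate build, assuming a made-up log format (one JSON request per line) and a staging endpoint:

```python
import json
import urllib.request

# Hypothetical: one JSON-encoded request per line, captured from production.
LOG_FILE = "access_log.jsonl"
CANDIDATE = "http://staging.example.com"   # candidate build under test


def replay_log(path: str, base_url: str) -> None:
    """Replay each logged GET request against the candidate and compare status codes."""
    mismatches = 0
    with open(path) as log:
        for line in log:
            entry = json.loads(line)        # e.g. {"path": "/users/42", "status": 200}
            url = base_url + entry["path"]
            with urllib.request.urlopen(url) as resp:
                if resp.status != entry["status"]:
                    mismatches += 1
                    print(f"{entry['path']}: expected {entry['status']}, got {resp.status}")
    print(f"done, {mismatches} mismatches")


if __name__ == "__main__":
    replay_log(LOG_FILE, CANDIDATE)
```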
Common Heuristics
Since everything will fail, you should design for fast recovery. There are many examples of systems which have been well designed for fast recovery; log-based file systems and databases were a huge improvement over their predecessors.
Even in human-scale systems (minutes or hours rather than milliseconds), optimizing for fast recovery should be considered. For example, all machines should be quickly recoverable if their hardware dies; replacing a broken machine should not require expert human attention. Basic software and configuration need to live outside the host on some centralized resource. We should separate the system and configuration from the data.
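One possible sketch of rebuilding a replacement host entirely from a centralized source, keeping system and configuration separate from data; the host names, paths, and rsync layout are assumptions for illustration:

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical central locations; real systems might use a package
# repository and a configuration-management service instead.
CONFIG_SERVER = "config.example.com"
DATA_VOLUME = Path("/data")          # user data lives on its own volume
SYSTEM_ROOT = Path("/opt/service")   # software + config, fully rebuildable


def rebuild_host(hostname: str) -> None:
    """Reinstall software and configuration for a replacement machine.

    Nothing under SYSTEM_ROOT is precious: it is re-created entirely from
    the central source, while DATA_VOLUME is left untouched.
    """
    if SYSTEM_ROOT.exists():
        shutil.rmtree(SYSTEM_ROOT)
    SYSTEM_ROOT.mkdir(parents=True)

    # Pull the software bundle and per-host configuration from the central server.
    subprocess.run(
        ["rsync", "-a", f"{CONFIG_SERVER}:/bundles/service/", str(SYSTEM_ROOT)],
        check=True,
    )
    subprocess.run(
        ["rsync", "-a", f"{CONFIG_SERVER}:/configs/{hostname}/", str(SYSTEM_ROOT / 'etc')],
        check=True,
    )
    print(f"{hostname}: system rebuilt; data volume {DATA_VOLUME} untouched")
```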
Trivial fail-fast is appropriate when ...
Common Heuristics
Keep in mind...
Everything will fail (see above). That means you will need to take components in and out of service. You want to be able to test components before you put them back into service, and to remove workload from a component so you can take it out of service for preventative maintenance and upgrades. You need to be able to quiesce services.
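A small sketch of one way to quiesce a component: stop admitting new work, then wait for in-flight work to drain. The class below is illustrative rather than any particular framework's API:

```python
import threading
import time


class Quiescer:
    """Tracks in-flight requests so a server can be drained before maintenance."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._accepting = True

    def try_start_request(self) -> bool:
        """Admit a request only while the server is accepting new work."""
        with self._lock:
            if not self._accepting:
                return False          # load balancer should send the request elsewhere
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def drain(self, poll_seconds: float = 0.5) -> None:
        """Stop admitting work and block until all in-flight requests complete."""
        with self._lock:
            self._accepting = False
        while True:
            with self._lock:
                if self._in_flight == 0:
                    return
            time.sleep(poll_seconds)
```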
Whenever possible, make things stateless.
If something can't be stateless, have someone else hold state for you.
Someone has to hold state, and this is hard to do well. Solve this problem once with a production-quality state store (or maybe a few, depending on what guarantees you need) and have everything else use it.
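A minimal illustration of the idea: frontends stay stateless and push session state into the shared store. The StateStore below is a stand-in (an in-process dict) for whatever production-quality state store the service standardizes on:

```python
import json


class StateStore:
    """Stand-in for the shared, production-quality state store.

    A real implementation would be a replicated service, not a dict.
    """

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> dict | None:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

    def put(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)


def handle_request(store: StateStore, session_id: str, item: str) -> dict:
    """Stateless handler: all session state lives in the store, so any
    frontend instance can serve any request."""
    session = store.get(session_id) or {"cart": []}
    session["cart"].append(item)
    store.put(session_id, session)
    return session
```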
Avoid humans directly touching systems. All system modifications should be made through automation; manage the service using well-built and tested tools. Often the models embedded in service components are different from what operational folks are doing (declarative vs. procedural). Ideally, give ops folks a "waldo" interface so it feels like working directly on a system, or give them "wizards" to walk them through tasks. Ops folks should not be directly updating files, changing configurations, or installing new software by hand; these tasks should be scripted and fully tested before they are applied to a production service.
Expect the service to change over time. It should be easy to make common changes, and possible to make unexpected changes. Build things to minimize the possibility of unintended consequences (e.g. systems where changing one component indirectly affects others) by creating stable interfaces with well-designed semantics.
Within components, separate fast-changing items from those which change extremely slowly. Examples of this are separating configuration parameters from software, and software packages from the base operating system. It should be possible to replace one component without needing to replace all the others.
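For example, fast-changing tunables can live in a small external file that is deployed and versioned separately from the software package; the file name and keys below are illustrative:

```python
import json
from pathlib import Path

# Hypothetical config file, deployed and versioned separately from the code.
CONFIG_PATH = Path("/opt/service/etc/tunables.json")

DEFAULTS = {
    "max_connections": 200,    # slow-changing defaults shipped with the package
    "request_timeout_s": 30,
}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Overlay externally managed parameters on top of built-in defaults,
    so parameters can change without replacing the program."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config
```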
Fred Brooks in The Mythical Man-Month suggested that you should build a prototype system and then build the real system. While in theory this is a great idea, time-to-market pressures will almost always make it impossible. If you believe that your system really needs to be prototyped and then rebuilt from scratch, build the prototype on a platform, in a language, or in an environment which can't be used for the final product. Otherwise, there will be too great a temptation to reuse the prototype code rather than just the lessons learned. In most cases you won't have the luxury of building a prototype system and then the real system, so it is critical to design for evolution. Your first system will almost always be wrong; hopefully you can fix it.
If you have a rapid release cycle, don't try to get it right the first time. Implement your best guess and then learn from the experience. In the next release you can replace what was a bad idea and improve what was good. Also, go after the low-hanging fruit first.
Often a stage will involve a number of "hacks". You need the maturity to go back and fix the bad hacks.
Expect networks to renumber, and be prepared for company mergers.
Common Heuristics
Do not hard-code knowledge into the program. Write little languages, or use extensive configuration. Either can kill you if it becomes too complex or inconsistent.
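A sketch of a deliberately tiny "little language": a key = value line format with comments, and nothing more. The syntax is invented here for illustration; the point is that every feature added to such a language is something operators must learn and debug:

```python
def parse_little_language(text: str) -> dict[str, str]:
    """Parse a minimal 'key = value' configuration language."""
    settings: dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected 'key = value', got {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        if key in settings:
            # Inconsistency is what kills you, so reject it loudly.
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        settings[key] = value
    return settings


EXAMPLE = """
# illustrative settings
listen_port = 8080
log_level   = info
"""

print(parse_little_language(EXAMPLE))   # {'listen_port': '8080', 'log_level': 'info'}
```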
Avoid bulk updates
Provide a way to migrate data between servers and services
Address security at the beginning; it's very expensive to retrofit after the fact. You must address:
Always check and validate all inputs.
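A small sketch of validating input at the service boundary before acting on it; the field names and limits are examples, not a prescription:

```python
import re

# Hypothetical limits for a "create user" request.
USERNAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,31}$")
MAX_EMAIL_LEN = 254


def validate_new_user(payload: dict) -> list[str]:
    """Return a list of validation errors; empty means the input is acceptable."""
    errors = []
    username = payload.get("username", "")
    email = payload.get("email", "")

    if not isinstance(username, str) or not USERNAME_RE.match(username):
        errors.append("username must be 3-32 chars: lowercase letters, digits, underscore")
    if not isinstance(email, str) or "@" not in email or len(email) > MAX_EMAIL_LEN:
        errors.append("email must contain '@' and be at most 254 characters")

    return errors
```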
It should be possible to use an automated tool to verify that specified configurations are set as expected, software is correctly loaded, expected services are running, etc. These tools need to be created.
Verify that whatever changes are made are effective and durable. For example, if we rev the OS, we need to make sure the services on that machine continue to function correctly. If we change some system parameters, we need to be confident that when that box reboots it will come up correctly.
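A hedged sketch of such a verification tool: it compares a host's actual state against a declared set of expectations (the expected files and processes below are placeholders):

```python
import subprocess
from pathlib import Path

# Hypothetical expectations for one class of host.
EXPECTED_FILES = {
    Path("/etc/ntp.conf"): "server time.example.com",
}
EXPECTED_PROCESSES = ["sshd", "crond"]


def check_host() -> list[str]:
    """Return a list of discrepancies between expected and actual state."""
    problems = []

    for path, needle in EXPECTED_FILES.items():
        if not path.exists():
            problems.append(f"missing file: {path}")
        elif needle not in path.read_text():
            problems.append(f"{path}: expected to contain {needle!r}")

    running = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    for proc in EXPECTED_PROCESSES:
        if proc not in running.splitlines():
            problems.append(f"process not running: {proc}")

    return problems


if __name__ == "__main__":
    for problem in check_host():
        print("FAIL:", problem)
```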
For any given piece of information there should be a single, authoritative source. Ideally most information would come from the same source. You should never have to enter the same data twice, and you should be able to easily discover what the source is. In general, updating the source of truth should be required for things to work, which forces it to stay correct.
We want to minimize the number of management sources. Every component should have instrumentation that records performance and operational data for monitoring and analysis. Anything which writes logs should be able to reopen them so logs can be rotated. We should call out what sorts of things to report on, such as resource utilization, transaction rates, etc.
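A sketch of the "reopen your logs" rule: the process reopens its log file when it receives SIGHUP, so rotation tools can move the old file without a restart. The log path is illustrative:

```python
import signal
import time

LOG_PATH = "/var/log/service/app.log"   # illustrative path

log_file = open(LOG_PATH, "a")


def reopen_log(signum, frame):
    """SIGHUP handler: close and reopen the log so rotation can rename the old file."""
    global log_file
    log_file.close()
    log_file = open(LOG_PATH, "a")


signal.signal(signal.SIGHUP, reopen_log)

while True:
    log_file.write("heartbeat\n")
    log_file.flush()
    time.sleep(60)
```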
Small is beautiful: KISS, Ockham's razor. Unfortunately true simplicity is not always possible, but virtual simplicity can be achieved through a modular / abstraction / information-hiding design.
Build tools, not monolithic systems. It is easier to update small tools, and it is also possible to tie tools together.
Often systems are built which separate when a change is accepted from when that change is applied. With these sorts of systems it is possible for someone to make a change and not know that their change just caused a problem. Furthermore, when someone notices a problem, it is not clear what caused it. There are a variety of techniques which can minimize these issues. One is to have a process which continually tests changes which have yet to be applied to a running system. For example, continuously building a source tree (often called a tinderbox) and tracking what checkins happen between a successful and an unsuccessful build can be quite helpful.
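A rough sketch of a tinderbox-style loop that continuously pulls and builds the tree, and reports which checkins landed between the last good build and the first bad one; the git and make commands are assumptions about the project's tooling:

```python
import subprocess
import time


def current_head() -> str:
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()


def build_ok() -> bool:
    """Hypothetical build step; replace with the project's real build command."""
    return subprocess.run(["make", "all"]).returncode == 0


last_good = None
while True:
    subprocess.run(["git", "pull", "--ff-only"], check=True)
    head = current_head()
    if build_ok():
        last_good = head
    elif last_good:
        # Report exactly which checkins arrived between the good and bad builds.
        log = subprocess.run(
            ["git", "log", "--oneline", f"{last_good}..{head}"],
            capture_output=True, text=True,
        ).stdout
        print(f"build broke between {last_good[:8]} and {head[:8]}:\n{log}")
    time.sleep(300)   # poll every five minutes
```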
Don't forget about the physical realm. Power and A/C tend to get you.
Have some way to manage services that doesn't rely on the service itself functioning correctly.
Needs more thought
Ideally you want everything to look the same
Trade off development time vs. run-time efficiency.
L. Peter Deutsch wrote up the "Eight Fallacies of Distributed Computing" as an internal memo at Sun Labs in 1991. People often make the following assumptions, which prove to be false and result in a lot of pain and brokenness. Don't assume the following:
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Know Underlying Numbers
http://highscalability.com/numbers-everyone-should-know
Hints for Computer System Design, Butler Lampson, ACM Operating Systems Rev. 17, 5 (Oct. 1983), pp 33-48. Reprinted in IEEE Software 1, 1 (Jan. 1984), pp 11-28.
Recovery Oriented Computing, UC Berkeley Computer Science Department Research Project
Lessons Learned from Giant Scale Internet Services, Eric Brewer, IEEE Computer, July/Aug 2001, pp 45-55
Challenges to Building Scalable Services, originally an internal memo from 1999 by Galen Hunt and Steven Levi
On Designing and Deploying Internet-Scale Services by James Hamilton
End-to-End Arguments in System Design, Jerome H. Saltzer, David P. Reed, David D. Clark, ACM Transactions on Computer Systems 2, 4, November 1984, pp 277-288
Rules of Thumb in Data Engineering, Jim Gray & Prashant Shenoy, Microsoft Corporation Technical Report, December 1999, MS-TR-99-100
Eight Fallacies of Distributed Computing Explained, which is an expansion of a 1991 Sun Microsystems Labs Internal Memo by Peter Deutsch
Ten Fallacies of Software Analysis and Design, Carlos E. Perez, blog posting 2004
RSPA's System Engineering Library
Worse Is Better, Disturbing. Not sure if I completely agree, but a perspective that should be considered
Network API Scalability / Benchmarking
RESTful wiki article