Since August 1999, Tellme has been running a 24x7 production network answering voice calls. Since July 2000, that production service has been required to sustain greater than 99.95% availability for the general public.
The network has grown from two ports to over three thousand. This document encapsulates the lessons we've learned in building this network.
Know the upper bound of capacity that a particular unit is required to support
Minimize the knowledge any unit needs to have about other units in the system
Make everything look the same
Clearly identify where all state information resides
Fail fast
Design for the future, implement to the minimum requirements
Isolate faults
The system should continue to operate under a certain set of failure conditions until a human can resolve the problem
Translate unknown failures into known failures
Test at maximum capacity, but not beyond
Untested configurations generally fail
Test end-to-end
Understand all problems
Expect frequent failures
Assume that human response time is slow
Expect operator error
Use indicator monitors to compensate for inability to monitor each individual metric
Invest in your tools
If the system is not running in a "known good" configuration, it should not be live
Use interchangeable parts
Lesson 5: Design for rapid creation and interchange of parts
Make frequent but controlled upgrades
Always have a roll-back process
Live upgrades require a quiescing roll-over
Always push during periods when support is available
Separate release dependencies
Decentralize release process as much as possible
Releases should be transparent to other parts of the system
Assume every change will be immediately live on production
Clearly identify, document, execute, and communicate changes
Control the rate of change
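
Several of the upgrade lessons above (a quiescing roll-over, always having a roll-back process, staying out of service unless in a "known good" configuration, and controlling the rate of change) can be pictured together in one small sketch. This is only an illustration under stated assumptions, not a description of Tellme's actual tooling: the `Unit` class, its fields, the `healthy` check, and the `-bad` version suffix are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical call-handling unit; every field here is illustrative."""
    name: str
    version: str
    accepting_calls: bool = True
    active_calls: int = 0

    def healthy(self) -> bool:
        # Stand-in health check (assumption): a real check would exercise
        # the unit end-to-end. Here a "-bad" suffix marks a broken release.
        return not self.version.endswith("-bad")


def drain(unit: Unit) -> None:
    """Quiesce: stop taking new calls and let calls in progress finish."""
    unit.accepting_calls = False
    while unit.active_calls > 0:
        unit.active_calls -= 1  # simulates one call ending normally


def upgrade(unit: Unit, new_version: str) -> bool:
    """Upgrade one unit, rolling back if the new version fails its check.

    Returns True if the new version is live, False if we rolled back.
    """
    previous = unit.version  # remember the known-good version
    drain(unit)
    unit.version = new_version
    ok = unit.healthy()
    if not ok:
        unit.version = previous  # roll back to the known-good version
    # Go live again only in a configuration that passes its check.
    unit.accepting_calls = unit.healthy()
    return ok


def rolling_upgrade(units: list[Unit], new_version: str) -> bool:
    """Upgrade units one at a time and stop at the first failure."""
    for unit in units:
        if not upgrade(unit, new_version):
            return False  # remaining units keep running the old version
    return True
```

Upgrading one unit at a time means a bad release takes at most one unit out of rotation before the roll-out halts, and the units not yet touched keep answering calls on the previous version.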