From: Brad Porter [brad@tellme.com]
Sent: Friday, November 09, 2001 8:47 AM
To: Mark Verber
Subject: Re: principles

Observations on Running a Large Production Network

Author: Brad Porter, Platform Architect

Introduction

Since August of 1999, Tellme has been running a 24x7 production network answering voice calls. Since July of 2000, that production service has been required to sustain greater than 99.95% availability for the general public.

The network has grown from two ports to over three thousand. This document encapsulates the lessons we've learned in building this network.
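
To make the availability target concrete, the downtime it allows can be worked out directly. This is a minimal sketch, assuming the 99.95 figure is a percentage measured over calendar time (the measurement window is not stated above); the function name is illustrative only.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime allowed in a period at the given availability."""
    return (1.0 - availability_pct / 100.0) * period_hours * 60.0

# Assumed windows: a 365-day year and a 30-day month.
per_year = downtime_budget_minutes(99.95, 365 * 24)
per_30_days = downtime_budget_minutes(99.95, 30 * 24)

print(f"{per_year:.1f} minutes per year")
print(f"{per_30_days:.1f} minutes per 30 days")
```

Under these assumptions the budget is roughly 263 minutes (about 4.4 hours) per year, or under 22 minutes in a 30-day month.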

Design

Clearly define the role of a particular unit of infrastructure

Know the upper bound of capacity that a particular unit is required to support

Minimize the knowledge any unit needs to have of other units in the system

Make everything look the same

Clearly identify where all state information resides

Fail fast

Design for the future, implement to the minimum requirements

Fault isolation

The system should continue to operate under a certain set of failure conditions until a human can resolve the problem

Translate unknown failures into known failures
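
Two of the principles above, "Fail fast" and "Translate unknown failures into known failures", can be illustrated together. The sketch below is one possible interpretation, not a description of Tellme's implementation: the names KnownFailure and call_dependency and the two-second default deadline are hypothetical.

```python
import concurrent.futures

class KnownFailure(Exception):
    """A failure the rest of the system already knows how to handle."""

def call_dependency(fn, timeout_s=2.0):
    """Run fn under a deadline; report any problem as a KnownFailure."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        # Fail fast: give up at the deadline instead of hanging.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        # A hang (unknown failure) becomes a timeout (known failure).
        raise KnownFailure("dependency timed out") from exc
    except Exception as exc:
        # Any unexpected error is reported in one known form.
        raise KnownFailure(f"dependency failed: {exc!r}") from exc
    finally:
        # Do not wait for a stuck worker thread before returning.
        pool.shutdown(wait=False)
```

Callers then handle a single failure type, for example by catching KnownFailure around call_dependency(lookup) and routing to a fallback. Note that the worker thread is abandoned rather than killed on timeout, so this pattern suits calls that eventually return on their own.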

Quality

Test with real-world production scenarios

Test at maximum capacity, but not beyond

Untested configurations generally fail

Test end-to-end

Understand all problems

Operations

Run at the upper bound, but never beyond

Expect frequent failures

Assume that human response time is slow

Expect operator error

Use indicator monitors to compensate for the inability to monitor every individual metric

Invest in your tools

If the system is not running in a "known good" configuration, it should not be live

Capacity growth

Know exactly what is required to add new capacity

Use interchangeable parts

Design for rapid creation and interchange of parts

Change Control

Clearly define your change process for each unit

Make frequent but controlled upgrades

Always have a roll-back process

Live upgrades require a quiescing roll-over

Always push during periods when support is available

Separate release dependencies

Decentralize release process as much as possible

Releases should be transparent to other parts of the system

Assume every change will be immediately live on production

Clearly identify changes, document, execute, and communicate

Control the rate of change
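
The "quiescing roll-over" principle above can be sketched as a loop that takes one unit out of service at a time, lets its in-flight work drain, upgrades it, and returns it to service. This is an illustration under assumptions: the Unit class, its fields, and the finish_calls stand-in are hypothetical and do not describe any particular system.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    version: str
    accepting: bool = True   # whether the unit takes new calls
    active_calls: int = 0    # calls currently in progress

def finish_calls(unit):
    """Stand-in for waiting until in-flight calls end naturally."""
    unit.active_calls = 0

def rolling_upgrade(units, new_version, wait_for_idle=finish_calls):
    """Upgrade one unit at a time so the others keep serving."""
    for unit in units:
        previous = unit.version
        unit.accepting = False        # quiesce: stop taking new calls
        wait_for_idle(unit)           # let existing calls drain
        assert unit.active_calls == 0
        unit.version = new_version    # upgrade while the unit is idle
        unit.accepting = True         # return the unit to service
        print(f"{unit.name}: {previous} -> {new_version}")

units = [Unit("a", "1.0", active_calls=3), Unit("b", "1.0")]
rolling_upgrade(units, "1.1")
```

Because each unit records its previous version before changing, the same loop run with the old version string serves as a simple roll-back path.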