While we want everything to be
architecturally clean, the most important thing is to make things
work and keep them running. If something is working, you
should be reluctant to touch it. If we need to get something
running, try to violate as few items on this list as
the constraints allow. We should always deliver on
promises; nothing matters but results. We need to have the maturity to
go back and fix bad hacks.
In general we should be managing the service using well-built and
tested tools. We don't want people to touch things by hand. Updating
files, changing configurations, installing new software, etc. should be
scripted and fully tested before we do them in production.
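As a minimal sketch of what a scripted, repeatable file update might look like: the helper below only writes when the content differs, supports a dry run, and replaces the file atomically. The name `ensure_file_content` and its parameters are hypothetical, not part of any existing tool.

```python
import os
import tempfile


def ensure_file_content(path: str, content: str, dry_run: bool = False) -> bool:
    """Make `path` contain exactly `content`.

    Returns True if a change was made (or would be made in a dry run),
    False if the file was already correct. Hypothetical helper.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # already in the desired state, nothing to do
    except FileNotFoundError:
        pass  # a missing file counts as "needs a change"

    if dry_run:
        return True  # report the pending change without touching anything

    # Write to a temporary file in the same directory, then rename it into
    # place, so readers never see a half-written file.
    # Note: mkstemp creates the file with mode 0600; a real tool would also
    # set the intended owner and permissions.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

Because the function is safe to run repeatedly, the same script can be exercised on a test host first and then run unchanged in production.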
We expect things to change over time. It should be easy to make
common changes, and possible to make unexpected changes. We should
build things to minimize the possibility of unintended consequences
(e.g. systems where changing one component indirectly affects others).
We should be separating systems into components. Within
components, separate fast-changing items from those which change
extremely slowly. Examples of this are separating configuration
parameters from software, software packages from the base operating
system, etc. It should be possible to replace a component without
needing to replace all the other components.
For example, always check and validate all inputs.
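A small illustration of input validation, assuming the input is a TCP port number supplied as text. The function name `parse_port` is hypothetical.

```python
def parse_port(raw: str) -> int:
    """Validate a TCP port number given as text and return it as an int.

    Raises ValueError with a specific message for any invalid input.
    Hypothetical helper, for illustration only.
    """
    text = raw.strip()
    # Accept plain ASCII decimal digits only: no signs, spaces, or decimals.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"port must be a decimal integer, got {raw!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range 1-65535: {port}")
    return port
```

Rejecting bad input at the boundary, with a message that names the problem, keeps the error close to its cause instead of letting it surface later in an unrelated component.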
For example, all machines should be quickly recoverable if their
hardware dies. This should not require expert human attention
except in special situations. So basic software and config need to
live outside the host on some centralized resource. We should
separate the system and config from the data.
It should be possible to use an automated tool to verify that
specified configurations are set as expected, software is
correctly loaded, expected services are running, etc.
These tools need to be created.
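One possible shape for the core of such a tool is a comparison of expected settings against what was actually found on a host. This is only a sketch: the function name, the flat key/value model, and the sample keys are all assumptions, and gathering the actual values from a machine is left out.

```python
from typing import Dict, List


def verify_config(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return a human-readable problem for each expected setting that is
    missing or different in `actual`. An empty list means the host matches.
    Hypothetical helper, for illustration only.
    """
    problems = []
    for key, want in sorted(expected.items()):
        if key not in actual:
            problems.append(f"{key}: missing (expected {want!r})")
        elif actual[key] != want:
            problems.append(f"{key}: expected {want!r}, found {actual[key]!r}")
    return problems


# Example with made-up settings; a real tool would collect `actual`
# from the host being checked.
expected = {"ntp.server": "ntp.example.com", "sshd.running": "yes"}
actual = {"ntp.server": "ntp.example.com", "sshd.running": "no"}
for problem in verify_config(expected, actual):
    print(problem)
```

Settings present on the host but absent from the expected set are deliberately ignored here; whether those should also be reported is a design decision for the real tool.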
Needs more thought
We need to verify that whatever changes we make are effective and
durable. For example, if we rev the OS, then we need to make sure
that the services on that machine continue to function effectively.
If we change some system parameters, we need to be confident that
the box will come up correctly when it reboots.
For any given piece of information there should be a single,
authoritative source of that information. Ideally most information
would be coming from the same source. You should never have to enter
the same data twice. You should be able to easily discover
what the source is.
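To make the single-source idea concrete, the sketch below derives two different outputs from one inventory structure, so a host is entered once and everything else is generated. The inventory layout, host names, and function names are invented for illustration; the addresses come from the reserved documentation range.

```python
from typing import Dict, List

# The single authoritative record of hosts (made-up example data).
INVENTORY: Dict[str, Dict[str, str]] = {
    "web1": {"ip": "192.0.2.10", "role": "web"},
    "web2": {"ip": "192.0.2.11", "role": "web"},
    "db1": {"ip": "192.0.2.20", "role": "db"},
}


def hosts_file_lines(inventory: Dict[str, Dict[str, str]]) -> List[str]:
    """Generate hosts-file style lines ("<ip><TAB><name>") from the inventory."""
    return [f"{info['ip']}\t{name}" for name, info in sorted(inventory.items())]


def hosts_with_role(inventory: Dict[str, Dict[str, str]], role: str) -> List[str]:
    """List host names with the given role, e.g. to build a monitoring target list."""
    return sorted(name for name, info in inventory.items() if info["role"] == role)
```

Adding a host means editing `INVENTORY` only; the name resolution data and the monitoring targets both follow from it and cannot drift apart.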
Everything should have a document describing the guiding principles
and requirements, an architectural overview, a design document for
people who will work on or extend the system, APIs to components, a
user's guide including troubleshooting, and, if appropriate, a
developer's guide. For example, for our package management system on
the UNIX hosts there should be: a principles and requirements doc, an
architecture doc, the API to the system, a guide for people who
use the deployment system along with a troubleshooting guide, a document
describing how to define new machine types, and a document describing
how to build new packages. Documentation is secondary to transparent
design.
We want to minimize the number of
management sources. Every component should have instrumentation that
records performance and operational data for monitoring and
analysis. Things which write logs should be able to reopen them, so
that logs can be rotated. We should
call out what sorts of things we should report on, such as resource
utilization, transaction rates, etc.
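As one illustration of a log writer that reopens its file, Python's standard library provides `logging.handlers.WatchedFileHandler`, which checks before each write whether the file at its path has been moved or removed and reopens it if so. It is intended for Unix-like systems. The function name, logger name, and log format below are assumptions made for this sketch.

```python
import logging
import logging.handlers


def make_logger(path: str, name: str = "service-sketch") -> logging.Logger:
    """Return a logger that writes to `path` and reopens the file if an
    external tool has renamed or removed it (e.g. during log rotation).
    Hypothetical helper, for illustration only.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # WatchedFileHandler notices when the file at `path` is no longer the
    # one it has open, and opens a fresh file before writing the next record.
    handler = logging.handlers.WatchedFileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this arrangement a rotation tool can simply rename the current log file; the next message lands in a new file at the original path without restarting the service. Programs in other languages often get the same effect by reopening their logs when they receive a signal.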
Don't forget about the physical realm.