Operational Principles

needs to work
While we want everything to be architecturally clean, the most important thing is to make things work and keep them running. If something is working, be reluctant to touch it. If we need to get something running, violate as few of the principles in this list as the constraints allow. We should always deliver on promises; nothing matters but results. We need to have the maturity to go back and fix bad hacks.
no direct touch -- everything through automation
In general, we should manage the service using well-built and tested tools. We don't want people to touch things by hand. Updating files, changing configurations, installing new software, etc. should be scripted and fully tested before we do them in production.
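As an illustration of a scripted change, here is a minimal sketch of an idempotent update to a simple key=value configuration file. The function name `ensure_setting` and the file format are assumptions made for this example, not a description of any existing tool. The function reports whether it changed anything, so it can be run repeatedly and tested before it is used in production.

```python
from pathlib import Path


def ensure_setting(path: Path, key: str, value: str) -> bool:
    """Set key=value in a simple key=value file (hypothetical format).

    Returns True if the file was changed, False if it was already correct.
    """
    wanted = f"{key}={value}"
    lines = path.read_text().splitlines() if path.exists() else []
    new_lines = []
    found = False
    for line in lines:
        # Compare only the part before the first "=" so values may contain "=".
        if line.split("=", 1)[0].strip() == key:
            if not found:
                new_lines.append(wanted)  # keep one copy, drop duplicates
            found = True
        else:
            new_lines.append(line)
    if not found:
        new_lines.append(wanted)
    if new_lines == lines:
        return False  # nothing to do; running again is harmless
    path.write_text("\n".join(new_lines) + "\n")
    return True
```

Because the function is idempotent, the same script can be run against a test host first and then against production without special handling for hosts that are already up to date.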
easy to rev
We expect things to change over time. It should be easy to make common changes, and possible to make unexpected changes. We should build things to minimize the possibility of unintended consequences (e.g. avoid systems where changing one component indirectly affects others).
incremental
We should separate systems into components. Within components, separate fast-changing items from those which change extremely slowly. Examples of this are separating configuration parameters from software, software packages from the base operating system, etc. It should be possible to replace a component without needing to replace all the others.
secure
For example, always check and validate all inputs.
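A minimal sketch of input validation, assuming inputs arrive as untrusted strings. The function names and the specific rules (lowercase hostname labels, ports in 1..65535) are illustrative assumptions, not a stated policy. The idea is to reject anything that does not match an explicit pattern rather than trying to list bad values.

```python
import re

# Assumed rule for this example: each label is 1-63 characters of lowercase
# letters, digits, and hyphens, and does not start or end with a hyphen.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_hostname(value) -> str:
    """Return the hostname unchanged if well-formed, else raise ValueError."""
    if not isinstance(value, str) or not 1 <= len(value) <= 253:
        raise ValueError("hostname must be a string of 1 to 253 characters")
    for label in value.split("."):
        if not _LABEL.fullmatch(label):
            raise ValueError(f"invalid hostname label: {label!r}")
    return value


def validate_port(value) -> int:
    """Return the port as an int if it is a plain number in 1..65535."""
    text = str(value)
    # Accept ASCII digits only: no signs, spaces, decimals, or underscores.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"port is not a plain integer: {value!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port
```

Callers would validate at the boundary, for example `host = validate_hostname(raw_host)`, and pass only validated values to anything that builds commands, file paths, or queries.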
fast recovery
For example, all machines should be quickly recoverable if their hardware dies. This should not require expert human attention except in special situations. So basic software and config need to live outside the host on some centralized resource. We should separate the system and config from the data.
Auditable
It should be possible to use an automated tool to verify specified configurations are set as expected, software is correctly loaded, expected services are running, etc. These tools need to be created.
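The core of such an audit tool could be sketched as a comparison between an expected state and the state actually observed on a host. The `audit` function and the keys used below are hypothetical, and how the observed state is collected (package queries, process lists, and so on) is left out of this sketch.

```python
def audit(expected: dict, actual: dict) -> list:
    """Compare expected settings with observed ones.

    Returns a list of human-readable findings; an empty list means the
    host matches its specification.
    """
    findings = []
    for key, want in sorted(expected.items()):
        if key not in actual:
            findings.append(f"{key}: missing (expected {want!r})")
        elif actual[key] != want:
            findings.append(f"{key}: expected {want!r}, found {actual[key]!r}")
    return findings


if __name__ == "__main__":
    # Hypothetical specification and observation for one host.
    expected = {"sshd_running": True, "ntp_server": "ntp1.example.com"}
    actual = {"sshd_running": True, "ntp_server": "ntp2.example.com"}
    for finding in audit(expected, actual):
        print(finding)
```

Keeping the comparison separate from data collection means the same check can be run against any source of observed state, and the comparison itself can be unit tested.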
Consistent - what variation tolerance
Needs more thought
Verifiable
We need to verify that whatever changes we make are effective and durable. For example, if we rev the OS, then we need to make sure that the services on that machine continue to function effectively. If we change some system parameters, we need to be confident that the box will come up correctly when it reboots.
Authoritative source of software and configuration
For any given piece of information there should be a single, authoritative source of that information. Ideally most information would be coming from the same source. You should never have to enter the same data twice. You should be able to easily discover what the source is.
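One way to picture a single authoritative source is an inventory from which every other view is generated, so no fact is typed in twice. The inventory layout, host names, addresses, and function names below are all assumptions made for illustration.

```python
# Hypothetical inventory: the one place where host facts are entered.
INVENTORY = {
    "web01": {"ip": "10.0.0.11", "role": "web"},
    "db01": {"ip": "10.0.0.21", "role": "db"},
}


def render_hosts_file(inventory: dict) -> str:
    """Generate hosts-file lines (IP, tab, name) from the inventory."""
    return "".join(
        f"{facts['ip']}\t{name}\n" for name, facts in sorted(inventory.items())
    )


def hosts_with_role(inventory: dict, role: str) -> list:
    """Generate a list of host names with the given role from the inventory."""
    return sorted(
        name for name, facts in inventory.items() if facts["role"] == role
    )
```

Both outputs are derived, so a change to a host's address is made once in the inventory and the generated files follow from it.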
Documentation
Everything should have a document describing the guiding principles and requirements, an architectural overview, a design document for people who would work on or extend the system, APIs to components, a user's guide including troubleshooting, and, if appropriate, a developer's guide. For example, for our package management system on the UNIX hosts there should be: a principles and requirements doc, an architecture doc, an API interface to the system, a guide for people who use the deployment system and a troubleshooting guide, a document describing how to define new machine types, and a document describing how to build new packages. Documentation is secondary to transparent design.
Has Management API
We want to minimize the number of management sources. Every component should have instrumentation that records performance and operational data for monitoring and analysis. Things which write logs should be able to reopen them (e.g. after the log file has been rotated). We should call out what sorts of things to report on, such as resource utilization, transaction rates, etc.
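The point about reopening logs can be sketched with Python's standard `logging.handlers.WatchedFileHandler`, which on Unix-like systems checks before each write whether the file at the configured path has been moved or removed and reopens it if so. The `make_logger` helper and the logger name are assumptions for this example; other components may instead reopen on a signal.

```python
import logging
import logging.handlers


def make_logger(path: str, name: str = "service") -> logging.Logger:
    """Return a logger that keeps writing to `path` even after rotation."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # WatchedFileHandler reopens the file if an external tool renamed or
    # deleted it, so a rotated log does not silently swallow new entries.
    handler = logging.handlers.WatchedFileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With this handler, an external rotation tool can rename the log file without having to restart or signal the process.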
supports multiple versions / releases
physical realm
Don't forget about the physical realm.