Generic Operational Requirements

Part of Hints for Operating High Quality Services by Mark Verber
Draft, Version 0.3 -- April 15, 2005

For the purposes of these requirements it is assumed that system components provide a single service. Service components might be general computing devices running specialized software or embedded hardware devices such as a network switch.

Documentation

Documentation is important, but is secondary to transparent design. All systems in the same operational administration should use the same term for a given thing. For instance, "fault" describes a condition of the system that is not as it should be; it should not be called "error" or "exception" or another synonym, as these may have other defined meanings.

Overview of the Service

Service component needs to have an architectural overview. The overview must document what function the service is providing, list the external APIs / protocols, a list of services which are expected to contact this service, a list of services which which this component is dependent on. If this service needs to talk to something outside an operation groups control, this should be explicitly called out. Core assumptions should be documented, especially those which might result in a redesign of the component. A description of the flows should be specified in some machine readable markup (XML).

Installation and Configuration Guide

Every service component should have release notes and an installation guide. If previous versions of this service component have been released, changes (especially configuration changes) must be documented. Physically requirements in the case of an integrated hardware solution would include physical space / orientation, power, cooling requirements, wiring, etc. For software based products this should include platform (OS, memory, CPU, etc). The installation guide must indicate how this component scales against offered load and what performance thresholds should be alarmed at. A full list of configuration values should be documents with a descript of what are valid inputs and how those values effective the components behavior.

Operational Documentation & Training

An operational guide should include a flowchart describing how to troubleshoot the service, a description of known limitations and work-a-rounds for those limitations. If this component has a specific SLA, or effects SLA of other services, these should be explicitly documented. Whatever clients are using this service should also be documented. An escalation guide should detail how to get help if the response team is unable to resolve a problem themselves.

Instrumentation

Everything we build should have objective service metrics. Many service will have a client facing SLA. SLA must be specified in a way that can be automatically calculator from data. This means that enough functional loggings has to be performed so an automated tool can determine if a service component is functioning within acceptable parameters.

Unit Tests

All service components mush must have clearly defined interfaces. Each interface should have a set of unit tests which permit functional testing of the the component. These unit tests should ideally be reusable by engineering, quality assurance, and the operations team. It must be possible for these tests to be run automatically at least once a minute with the results reporting success or failure. The result of this test must generate events which are fed into an alert management infrastructure. Ultimately there should be a testing framework that permits operations staff to initiate executing a full testing suite as well an automated monitoring system which would be scheduled to run regularly.

Ops staff must be able, by a written procedure, to trigger a test of the service, and see the result. Interpretation of the output should be as obvious as possible, as far as go/no-go; further interpretation can be directed by written procedures. There should be a "local", internal unit test, which has no external dependencies. It must be possible to run a test which exercises the service component and anything which is “downstream”, e.g. that this service component depends on to function correctly.

Performance Metrics

Every component must have instrumentation for performance and operational data record for monitoring and analysis. Counters must be kept for all critical resources which are related to delivering a service. In almost all cases this will include memory, cpu, disk, and network bandwidth. Some services might also have internal pools which need to be instrumented. All of these counters should collected and trended.

Service components also need to instrument transaction / workload. There should be a counter which indicates the work that the service component is currently performing. This could be characterized by transactions / sec, number of connections being services, etc

A model must be created which predicts what expected resource consumption would be (such as Microsoft's Revisited TCA model) so the resource footprint of the component can be predicted, provisioned, and alarmed when trends are out of bounds. If possible, a tools should be created which can capture real traffic to a service, and another tool which can replay logs in a test environment so synthetic work load will look as much like real world traffic as possible.

Service component must be load tested to document what the maximum number of work it can be expected to perform. There should be a single graph which displays resource consumed with the component workload overlaid. Ideally all the resource (and workload) should be scaled so 100% is the top of the graph.

Logging

Error Messages Logged

Faults must produce meaningful and helpful messages, written to a standardized log which is transported off the component. Things which write logs should reopen them. Logs entries must indicate at least;

the time, in GMT if possible
the identity of the the subsystem making the report. If this is a process it should report PID and argv[0] command name when appropriate. Other possibilities would be UUID, mod_perl modules, the physical card / circuit effected, etc
the general nature of the operation that was happening when the error occurred, e.g. "opening connection to subsystem Z"
the system error code as returned. For example on a UNIX system, report perror();
the specific arguments to a system call or other interface, e.g. full pathnames of files; and most important
a recommendation or clue about what to do about the condition.

Messages can occupy more than one line or log file entry, but should be concise yet complete. They may refer to external documentation for more detailed recommendations about what action to take, but must capture all necessary details. In short, error messages are supposed to hand you solutions, not problems.

Technology commonly used today:

syslog: UDP based, delivery not guaranteed
syslog-ng: can be TCP based
http post to a collection server
snmp traps
special logging systems (recommended)

There should be a process that reviews the filtering in the alert management system on a regular basis. There should be filters for log items which are "good" / "informational" which are filter from the operational view. The number of these should be trended so it is possible to notice if there is a significant change. Log items which are known to be bad should be filtered to show up as alerts. Anything else should be collected as an "unknown" and should be classified and filtered appropriately after the review.

Variable Log Levels Viewing

It must be possible to vary the operations staff visibility to logs. Normally operations staff must not be bothered with log entries except errors, but it must be possible to see logging at a more detailed level. This can be done by doing full logging and have operational staff using a logging viewer, or this could be to have multiple logging locations. Ops staff should be trained to capture the relevant output, and take the first steps to interpret it, before calling the engineers. The principle is that ops staff have brains, and can use them beyond the edge of the written procedures, given some information.

Trace Support

Ops staff must be able to enable full tracing. Data should be dumped in XXX format (readable by ethereal?). Ideally turning tracing one and off could be driven by a reset trigger.

Other Issues

All SLAs must be backed by logging data. If an SLA is provided for a given service, data must be logged which is used to calculate the SLA performance automatically.

Service components must support ntp to keep log timestamps correct

Log Aggregation and Correlation

heartbeat doing real work

Installation Requirements

Component installation should require minimal configuration by hand. Device should be able to do a network based boot.

Configuration Management

All components except devices providing network routing infrastructure must use DHCP to acquire IP addresses, netmasks, default route, and DNS resolver information.

Device should be able to do a network based boot for software installation.

Service components must not have any configuration values which only live on the device. All host based appliances must store configuration values in whatever configuration management system is in use. No components should require configurations performed by hand. The one exception to the no hand configuration would be limited to things like typing IP address into a network routing element, or configuring BIOS of a computing device to support network booting.

Software Management

All host based appliances must use a centralize software configuration management system. Software release and installation should be automated.

Versioning

Should be possible to roll forward, or roll back in a matter of seconds. This means that everything needs to already be laid down on the disk. Switching could be a restart or some sort of traffic redirection. Extra credit if you can running multiple services on the same machine. Pushing service shouldn't end service.

No Downtime Maintaince

well def change boundary
no downtime updates for service (components can go down)
Versioning needs to address partitioning, independence, rendevue.

Runtime Management

service state management

Operations View

flows (backbox - high level)
when misbehaving (see component broken)

Stable Running

Don't need human garbage collection

Manage Resources

If need restart, automate it

If problem which requires log-term fix, automate the detection which provides plenty of time for humans to fix the problem, or better yet automatically clean up the problem.

Service Routing

Must be addresses.

Maintenance of Service State

All services should support at least three states: InService, OutOfService, and Test. It should be possible to take a service “out of service” without breaking existing users. It should be possible to test a service before you put it back into service.

Service Launcher

All long running processes should sit under a service laucher, sometimes called a Service Mommy. This is a process which makes sure all the services which should be running are. Note: this assumes that in the event of inconsistent state, servers will fast fail rather than live in a half alive state. When a child going away the service mommy should:

Log failure
Save core file for later evaluation
Rate limited restart (to prevent spinning)
Extra credit, file bug against service with pointer to core file.

Ops staff must be able to cause the system or subsystem to write a crash dump for later analysis by the engineers. If there is a service mommy, this should be a command to the mommy. Written procedures must say how to cause it, where the dump is written, how to preserve it for analysis, and how to take the first steps to analyze it (see Trace/Debug).

If UNIX service daemons are using under initd, there should be a start/stop interface. Ops must be able to stop, start, and restart the product exclusively by executing scripts in /etc/init.d with only the "stop", "start", and "restart" arguments, per System V standards. Every piece of the product which can foreseeably need to be restarted separately should have a separate script. For processes that fork permanent children, and try (and sometimes fail) to reap them when the parent is stopped, the stop script must find and kill them. It is also necessary, but not sufficient, to document the identifying names of all relevant children, in case they must be identified and killed manually.

Programmatic API w/ Command Line Tool

All services management functions should be accessed through a common, well designed API. The API should be documented so it is possible to easy create scripts to automate operational tasks. A command line interface (and optionally a GUI) should be provided to make it easy to perform any anticipated operational activities. The command line tool should provide a limited set of commands which used defensive programming to reject illegal parameters. Any system-specific interface, if used for engineering debugging as well as operations, must have a "safety catch" of some kind, so that ops staff do not unintentionally enter commands that only engineers should. For instance, Cisco routers have "enable". It may or may not be protected by a password or other authentication, provided the system as a whole is adequately protected from unauthorized use.

Everything is done through a well designed API.
Have your command line tool and GUI use the same management API
Make a command line UI as well which is scripting friendly

Control Port

Servers should have control / status port if long running service

Out of Band Management

Need a plan for out of band management of service components. We can’t assume that a device will always be reachable via our primary IP transport network. Plans could include a simple alternative IP network used only for administration, a serial console system, or humans at the remote site that could type on the console.

Process and Policy

triage (does it need tools) - debugging and escalation
clear owners, roles, escalation
test
take out of service
put into service
capture data / traces for engineer
product lifecycle - specific address sunsetting / deprecation
change process

Security

no shared accounts
audit log of admin functions
Authorization/authentication/accounting requirements
Level of sensitivity of data involved, e.g.:
- Social Sec
- Credit
- marketing details
- other financials
- passwords of any kind

Appendix A: Issues a Design Needs to Address

Behavior When There are Failures

It’s not a question of if any component will fail, it is a question of when. In fact, you should expect multiple low probability faults happening at the same time. History is filled with perfect storm, Titanic disaster, etc. Most accidents in well-planned systems involve two or more events of low probability occurring in the worst possible combination. Every service component needs to address how it will tolerate failures and be recoverable.

Maintenance of Component

Every service component will fail; need to be updated, etc. There must be a way to perform needed maintenance on a service component without impacting the over all service delivery. Either you need some external traffic management, or your service component must he a collection of components which provide a reliable service without impacting external components (which is very challenging).

State Management

Whenever possible you should make servers and services stateless. Push the state to the edges, or hand your state to dedicated state stores. For example, all machines should be quickly recoverable if their hardware dies. This should not require expert human attention to replace a broken machine. Basic software and config need to live outside the host on some centralized resource (see avoid state above). Separate the system and config from the data. Installations should partition where they write there data. There should be separation between: App Binaries, Conf Files, Persistent Data, Transient Data, Communication (lock files, etc), and Log Files. Whenever possible no globals should be used to enable side by side versioning,

Optimize for Fast Recovery

Since everything will fail, you should design for fast recovery.

Use Authoritative Sources, Eliminate Human Based Information Copying

For any given piece of information there should be a single, authoritative source of that information. Ideally most information would be coming from the same source. You should never have to enter the same data twice. You should be able to easily discover what the source is. For example, in large scale deployment you don’t want a user database on each network device which has to be managed independently.

Address end-to-end

Design
Monitoring
Performance
Technology (make use of modern network fabric...)

Appendix B: Useful References

Good Software Guidelines for Developers, Geoff Halprin, Technical Report 1998, SysAdmin Group

Event Helix Real Time Design Patterns