Logging Systems

Instrumentation is absolutely critical to running a large, complex, distributed system. Smaller systems tend to be up or down. There are a small number of machines and it it practical to log into individual machines and look at the logs. In a complex system it's not practical to log into all the machines. To further complication the situation, problems often manifest themselves in performance issues or inconsistent behavior (but not outright failure). It is key to be able to collect and analysis large amounts of data efficiently.

A good logging system is the backbone of an effective monitoring system. Logs can provide a channel which connects system state to external agents. These agents might be software or human beings. As events happen, it must be possible to take an automatic action (invoke a script), set an alarm, forward the event to another system, and/or save the data for future trend analysis.

In distributed systems it's important to be able to tie logs together. Whenever possible the "start" of events should generate a GUID and the GUID should be propigated and logged to make it possible to do path based analysis, event coorilation, etc.

Key to this is a logging system with the following characteristics:

Low overhead - if using the logging system required too many resources it will be avoided. You want developers and operational staff to feel comfortable putting things in the logging system. If people feel they have to justify the cost of logging events, odds are you won't have enough data going into the logs.
high performance, streaming oriented - It should be possible to use logging system for near real-time monitoring. This speaks to both throughput and latency of the overall system. The system should be sufficently responsive that in the midst of a problem people are confident they will get current information when they look at data in the logs.
structured - playloads should be tagged and typed with a rich data representation. This avoids requiring ad hoc data extraction and also makes it easy to evolve a system, Older versions (even if up stream) and just ignore data/tags that it doesn't know.
subscriber based - simplies configuration management and allows information to be used in a variety of ways. More scalable because counters are maintained by the subscriber rather than publisher. Subscribers should be able to request what information it wants with a filter.
reliable - You don't want to lose information due to common problems: system crashes, network congestion, etc. Best effect such as syslog over UDP is not adequate. The logging system should be built on top of a reliable message queue which will be able to detect when data has been lost.
Secure - limit who can connect. Limit what events they can tap. Have a mask which will remove sensitive information from the events returned.
Open- It should be strait-forward to connect all sorts of systems and devices into the logging system.
client and server - You can subscribe to yourself with a filter restricting what data you want but change your retention policy so important information can be stored for logger periods of time.

The event logging system should run on a streaming model. Don't try to make a batch oriented harvesting work. Stream the data up to an aggregating host. Make it easy for the aggregating host(s) to forward the data on as well.

The logging system should supports a rich variety of information sources and values. Each event should have an id, and a collection of tags and corresponding values.

Take action based on the tags and values of an event

rewrite events
discard events. A counter should be kept for each matching rules. It should be possible to set threshold alarms if the counter increases too quickly.
generate events if counter exceeds a threshold, if heart beats don't appear from a system, or if an events content matches a rule set.
commit an event to a standard database
execute a procedure if an event matches a rules. It should be possible to use the contents of the log event as parameters to the script.

A common mistake is to try to feed the logging information into a singla, traditional database. This is always a scaling problem. A global view is needed, but that doesn't required the data to be in a single system. distributing information in a number of systems and then usng a distributed distributed query system such as Google's mapreduce is sufficient.

Examples of people working on this problem include

splunk - A commercial product which offers the more extensive feature set of any logging system with decent support for visualization, search tools, and machine learning. Downside is that it's complex and expensive.
logstash - Accepts input from variety of sources than the streams consalidated data into a number of systems. Part of the elasticsearch ecosystem. This of this as the poor man's Splunk.
NetLogger - A simple but clean system from LBL which is easy to integrate but maybe not as performant as some of the other systems.
Graylog2 - Store and processes logs and stores metadata into a MongoDB instacts and elasticsearch to enable searching of logs. Provides a nice UI and tools to transform free form logs into structured record.
fluentd - Log collection / transformation system built primarily in Ruby which transforms all inputs into Json which can the be persistented / analysed using a varity of tools / systems.
scribe from facebook
Flume - Apache project associated with Hadoop to get logs into HDFS.
Chukwa - Apache project that collects logs onto HDFS
kafka - Kafka was developed at LinkedIn for their activity stream processing and is now an Apache incubator project. Although Kafka could be used for log collection this is not it’s primary use case. Setup requires Zookeeper to manage the cluster state.
ArcSight - Security / audit oriented logging system now owned by HP
sumologic - complaince oriented log auditing system.

Additional References

log pipeline you can't have describes system that uses /dev/kmsg.
Advances and Challenges in Log Analysis in ACM Queue is a good introduction to how people use logs.
The Log: What every software engineer should know about real-time data's unifying abstration by Jay Kreps of LinkedIn. This paper touches on just about everything I think people should know except more advanced techniques such as path based analysis of data in the logs.
Centralized Logging and Centralized Logging Architecture by Jason Wilder which is a brief, DevOps look at logs.
Path-Based Failure and Evolution Management is a good introduction to how logging systems can be used to diagnosis complex, multi-component systems.
liblog is a good example of a system which collects sufficent data to enable log replay in a distributed system which is very useful because it allows testing using "almost real" traffic.