Logging Systems

Instrumentation is critical to running a large, complex, distributed system. Smaller systems tend to be either up or down: there are a small number of machines, and it is practical to log into individual machines and look at the logs. In a complex system it's not practical to log into all the machines. To further complicate the situation, problems often manifest themselves as performance issues or inconsistent behavior rather than outright failure. It is key to be able to collect and analyze large amounts of data efficiently.

A good logging system is the backbone of an effective monitoring system. Logs can provide a channel which connects system state to external agents. These agents might be software or human beings. As events happen, it must be possible to take an automatic action (invoke a script), set an alarm, forward the event to another system, and/or save the data for future trend analysis.

In distributed systems it's important to be able to tie logs together. Whenever possible, the "start" of an event should generate a GUID, and the GUID should be propagated and logged to make it possible to do path-based analysis, event correlation, etc.
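As a minimal sketch of the GUID idea, the Python below creates a GUID at the start of a request, carries it in a context, and logs it with every record so one request's path can be pulled back out. The function names, record fields, and host names are all hypothetical, and an in-memory list stands in for a real log.

```python
import uuid


def new_request_context():
    """Create a context at the start of an event; its GUID ties later log records together."""
    return {"guid": str(uuid.uuid4())}


def log_event(log, context, host, message):
    """Append a record that carries the GUID from the context (hypothetical record layout)."""
    log.append({"guid": context["guid"], "host": host, "message": message})


def trace(log, guid):
    """Path-based analysis: return every record sharing one GUID, in logged order."""
    return [record for record in log if record["guid"] == guid]


# Two requests pass through hypothetical hosts; each keeps its own GUID.
log = []
first = new_request_context()
second = new_request_context()
log_event(log, first, "frontend-1", "request received")
log_event(log, second, "frontend-2", "request received")
log_event(log, first, "backend-7", "query executed")

path = [record["host"] for record in trace(log, first["guid"])]
print(path)
```

In a real system the GUID would travel with the request itself (for example in a message header) so that each machine can log it without coordination.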

Key to this is a logging system with the following characteristics:

The event logging system should run on a streaming model. Don't try to make batch-oriented harvesting work. Stream the data up to an aggregating host. Make it easy for the aggregating host(s) to forward the data on as well.

The logging system should support a rich variety of information sources and values. Each event should have an id, and a collection of tags and corresponding values.

The logging system should be able to take action based on the tags and values of an event.
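The three characteristics above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a reference design: the Event and Aggregator classes, the tag names, and the thresholds are all hypothetical, and plain lists stand in for an alarm channel and long-term storage.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    """One log event: a unique id plus free-form tags and their values."""
    tags: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Aggregator:
    """Handles events one at a time as they arrive (streaming, not batch)."""

    def __init__(self):
        self.rules = []      # (conditions, action) pairs, checked per event
        self.forwards = []   # downstream receivers, e.g. another aggregator

    def add_rule(self, conditions, action):
        self.rules.append((conditions, action))

    def forward_to(self, receiver):
        self.forwards.append(receiver)

    def receive(self, event):
        # Run every rule whose tag conditions all hold for this event.
        for conditions, action in self.rules:
            if all(tag in event.tags and test(event.tags[tag])
                   for tag, test in conditions.items()):
                action(event)
        # Pass the event along so other systems can consume the same stream.
        for receiver in self.forwards:
            receiver(event)


alarms = []    # stands in for setting an alarm
archive = []   # stands in for storage used for future trend analysis

central = Aggregator()
central.add_rule({}, archive.append)  # no conditions: save every event
central.add_rule({"severity": lambda v: v == "error"}, alarms.append)
central.add_rule({"latency_ms": lambda v: v > 1000}, alarms.append)

regional = Aggregator()
regional.forward_to(central.receive)

# A host streams events to the regional aggregator as they happen.
for tags in [
    {"host": "web-1", "severity": "info", "latency_ms": 40},
    {"host": "web-2", "severity": "error", "latency_ms": 15},
    {"host": "web-1", "severity": "info", "latency_ms": 2300},
]:
    regional.receive(Event(tags=tags))

print(len(archive), [event.tags["host"] for event in alarms])
```

Because forwarding is just another receiver, aggregators can be chained, and a rule's action could equally invoke a script or hand the event to another system.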

A common mistake is to try to feed the logging information into a single, traditional database. This is always a scaling problem. A global view is needed, but that doesn't require the data to be in a single system. Distributing information across a number of systems and then using a distributed query system such as Google's MapReduce is sufficient.
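To illustrate how a global view can be assembled without a single database, the sketch below runs the same small query against each shard of log data and then merges the partial results. It only imitates the map-then-merge pattern inside one process; it is not Google's MapReduce, and the shard contents and tag names are made up for the example.

```python
from collections import Counter
from functools import reduce


def map_shard(shard):
    """Map step: count error events per service using only this shard's data."""
    return Counter(event["service"] for event in shard
                   if event["severity"] == "error")


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Each inner list stands in for log data held on a separate system.
shards = [
    [{"service": "billing", "severity": "error"},
     {"service": "search", "severity": "info"}],
    [{"service": "billing", "severity": "error"},
     {"service": "search", "severity": "error"}],
    [{"service": "search", "severity": "info"}],
]

totals = reduce(merge, (map_shard(shard) for shard in shards), Counter())
print(totals)
```

Only the small per-shard counts need to move between systems; the raw events stay where they were collected.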

Examples of people working on this problem include

Additional References