Logging Systems
Instrumentation is absolutely critical to running a large, complex,
distributed system. Smaller systems tend to be up or down. There are
a small number of machines and it it practical to log into individual
machines and look at the logs. In a complex system it's not practical
to log into all the machines. To further complication the situation,
problems often manifest themselves in performance issues or
inconsistent behavior (but not outright failure). It is key to be able
to collect and analysis large amounts of data efficiently.
A good logging system is the backbone of an effective monitoring
system. Logs can provide a channel which connects system state to
external agents. These agents might be software or human beings. As
events happen, it must be possible to take an automatic action (invoke
a script), set an alarm, forward the event to another system, and/or
save the data for future trend analysis.
In distributed systems it's important to be able to tie logs together.
Whenever possible the "start" of events should generate a GUID and the
GUID should be propigated and logged to make it possible to do path
based analysis, event coorilation, etc.
Key to this is a logging system with the
following characteristics:
- Low overhead -
if using the logging system required too many resources
it will be avoided. You want developers and operational staff to feel
comfortable putting things in the logging system. If people feel they have
to justify the cost of logging events, odds are you won't have enough
data going into the logs.
- high performance, streaming oriented -
It should be possible to use
logging system for near real-time monitoring. This speaks to both
throughput and latency of the overall system. The system should be
sufficently responsive that in the midst of a problem people are
confident they will get current information when they look at data in
the logs.
- structured -
playloads should be tagged and typed with a rich data
representation. This avoids requiring ad hoc data extraction and also
makes it easy to evolve a system, Older versions (even if up stream)
and just ignore data/tags that it doesn't know.
- subscriber based -
simplies configuration management and allows
information to be used in a variety of ways. More scalable because
counters are maintained by the subscriber rather than publisher.
Subscribers should be able to request what information it wants with
a filter.
- reliable -
You don't want to lose information due to
common problems: system crashes, network congestion, etc. Best
effect such as syslog over UDP is not adequate. The logging system
should be built on top of a reliable message queue which will be
able to detect when data has been lost.
- Secure -
limit who can connect. Limit what events they can
tap. Have a mask which will remove sensitive information from the
events returned.
- Open-
It should be strait-forward to connect all sorts of systems and
devices into the logging system.
- client and server -
You can subscribe to yourself with a
filter restricting what data you want but change your retention
policy so important information can be stored for logger periods
of time.
The event logging system should run on a streaming model. Don't
try to make a batch oriented harvesting work. Stream the data up to an
aggregating host. Make it easy for the aggregating host(s) to forward the
data on as well.
The logging system should supports a rich variety of
information sources and values. Each event should have an id, and a
collection of tags and corresponding values.
Take action based on the tags and values of an event
- rewrite events
- discard events. A counter should be kept for each
matching rules. It should be possible to set threshold alarms if the counter
increases too quickly.
- generate events if counter exceeds a threshold,
if heart beats don't appear from a system, or if an events content
matches a rule set.
- commit an event to a standard database
- execute a procedure if an event matches a rules.
It should be possible to use the contents of the log event as parameters to
the script.
A common mistake is to try to feed the logging information into a
singla, traditional database. This is always a scaling problem. A global
view is needed, but that doesn't required the data to be in a single system.
distributing information in a number of systems and then usng a distributed
distributed query system such as
Google's mapreduce
is sufficient.
Examples of people working on this problem include
- splunk - A commercial product which offers the more extensive
feature set of any logging system with decent support for visualization,
search tools, and machine learning. Downside is that it's complex and expensive.
- logstash - Accepts input from variety of sources than the streams consalidated data into a number of systems. Part of the
elasticsearch ecosystem. This of this as the poor man's Splunk.
- NetLogger - A simple but clean system from LBL which is easy to integrate but maybe not as performant as some of the other systems.
- Graylog2 - Store and processes logs and stores metadata into a MongoDB instacts and elasticsearch to enable searching of logs. Provides a nice UI and tools to transform free form logs into structured record.
- fluentd - Log collection / transformation system built primarily in Ruby which transforms all inputs into Json which can the be persistented / analysed using a varity of tools / systems.
- scribe from facebook
- Flume - Apache project associated with Hadoop to get logs into HDFS.
- Chukwa - Apache project that collects logs onto HDFS
- kafka - Kafka was developed at LinkedIn for their activity stream processing and is now an Apache incubator project. Although Kafka could be used for log collection this is not it’s primary use case. Setup requires Zookeeper to manage the cluster state.
- ArcSight - Security / audit oriented logging system now owned by HP
- sumologic - complaince oriented log auditing system.
Additional References