Part of Hints for Operating High Quality Services by
Very Very Early Draft 0.13 -- July 10, 2006
The management of persistent state is arguably the thorniest issue in building and operating a high quality service. Whenever possible, design your system to be stateless, so that each transaction / message / whatever fully contains all the information required for the requested action to be performed, removing the need for the service to associate your request with some pile of data.
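As an illustrative sketch of the stateless idea (the token format, SECRET key, and field names here are hypothetical, not from the text above): the service packs everything it needs into a signed token that the client carries, so no server-side session table is required.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-signing-key"  # hypothetical shared signing key

def make_token(payload: dict) -> str:
    # Serialize the entire request context into the token itself, so the
    # server needs no session store to process it later.
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def handle(token: str) -> dict:
    # Stateless handling: everything needed is inside the token; the
    # server only verifies it wasn't tampered with.
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(body))

token = make_token({"user": "alice", "cart": ["book"]})
request = handle(token)  # no lookup against any "pile of data"
```

Any machine holding the signing key can handle the request, which is exactly what makes stateless services easy to scale and fail over.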
When you can't avoid state, try to make it someone else's problem:
There are a variety of factors which will affect what type of state store(s) are appropriate for a given system. Factors which need to be considered include:
If you can't punt state storage... some rules of thumb:
Most people are familiar with complex, relational data which requires ACID guarantees. If you are dealing with this sort of data, then you should go with a real database system. There are a variety of commercial database systems such as Microsoft SQL Server and Oracle. There are also a number of open source databases such as Postgres, SQLite, and MySQL which provide many features found in traditional databases. There have also been discussions about problems with ACID and how to fix them.
In most large, real-world production systems, some of the ACID guarantees have been weakened to improve system performance. There is typically some sort of audit log which enables detecting and repairing data corruption.
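One sketch of the audit-log idea (the hash-chain scheme and record format are my own illustration, not a specific production design): each entry is chained to its predecessor, so replaying the log detects corruption or tampering anywhere in it.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash value for "before the first record"

def append_record(log: list, record: dict) -> None:
    # Chain each entry to the previous one's hash; altering any earlier
    # record invalidates every hash after it.
    prev = log[-1]["hash"] if log else GENESIS
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"record": record, "hash": digest})

def verify(log: list) -> bool:
    # Replay the log, recomputing every hash; any mismatch means the
    # stored data has diverged from what was originally written.
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_record(log, {"op": "credit", "amount": 10})
append_record(log, {"op": "debit", "amount": 3})
assert verify(log)
```

A real system would pair a log like this with a repair path that rebuilds the damaged state from the last good entry forward.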
A very common approach to scaling a large system is to split data up into slowly changing (small) data which must always be available, and larger / more rapidly changing data which can tolerate occasional outages. This is often implemented by having a read-mostly table, replicated on all databases, for the data which must always be available (and which also contains the location of the rapidly changing data), and then tables stored on individual machines (or pairs of machines) which hold the more rapidly changing / larger data. It is expected that some of these clusters will be unavailable from time to time, but in a properly designed system those outages will only affect a small percentage of the overall user population.
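A minimal sketch of that split (the DIRECTORY table, cluster names, and health set below are all hypothetical): the small, replicated read-mostly table is always available and points at the cluster holding a user's bulkier data, so a cluster outage degrades only that cluster's users.

```python
# Read-mostly directory table, replicated on every database: maps each
# user to the cluster holding their larger / rapidly changing data.
DIRECTORY = {
    "alice": "cluster-3",
    "bob": "cluster-7",
}

# Hypothetical view of which data clusters are currently reachable;
# here cluster-7 is suffering an outage.
CLUSTERS_UP = {"cluster-3"}

def fetch_user_data(user: str) -> str:
    cluster = DIRECTORY[user]  # the always-available lookup
    if cluster not in CLUSTERS_UP:
        # Only users homed on the down cluster are affected; everyone
        # else is served normally.
        raise RuntimeError(f"{cluster} unavailable; {user} degraded")
    return f"data for {user} from {cluster}"
```

With many clusters, one being down translates to a small, bounded slice of users seeing errors rather than a site-wide outage.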
Examples of this sort of approach include:
Most of us manage "state" in our day to day lives through simple interactions with a file system. Many services which require state to be stored often start out as files in a basic file system. So long as a file system on a single machine is sufficient, things are easy and straightforward. On the same machine there are a number of options to enable sharing, synchronization, and locking between processes or threads. As soon as a single machine isn't sufficient, things get a lot more complex. The number of system calls which work across machines is often limited. Furthermore, there is the question of how to make the file system reliable in the face of machine failures.
When a service outstrips a single machine's CPU resources, but is fairly light in terms of file system I/O (common in web services), people will often move the file based state store to a dedicated file server, with multiple machines mounting it and performing their services against the common file system. A two-edged sword with this approach is that you can use the same interconnect (IP over your existing Ethernet connection) to offer your services and to access the state store. The possible downside is bandwidth contention. [This shouldn't be a problem in most cases.] There are three possible issues that you can run into when doing this:
Coordination: In theory, fcntl and lockf will work over NFS, but in practice this is often broken, and most lockd implementations don't scale well. Much more common is to use file system operations which are atomic across all machines (like link) together with opportunistic locking, with some way to recover when a lock was grabbed by a machine which later died before the lock was released.
Reliability: Using a commodity server as the NFS server creates a huge single point of failure. One of the most common approaches is to use a hardened file server (such as those made by Network Appliance) which has a high MTBF and fast recovery. People wanting a more reliable solution will go with an expensive clustered solution, typically a pair of machines in a highly available configuration.
Scalability: NFS client libraries send their RPC requests to a specific address for a given file system. This means that an NFS server has to either be a single machine, or a set of machines sharing (and coordinating the service on) a shared IP address. In most cases this is not a problem, but in exceptional, extremely heavy transaction situations, it can be.
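The link-based opportunistic locking described under Coordination might be sketched like this (the lock-file naming and the 30-second staleness threshold are illustrative choices, not a standard): link() is atomic even over NFS, so exactly one client wins, and a lock left behind by a dead machine is eventually stolen.

```python
import os
import tempfile
import time

LOCK_TTL = 30  # assume a lock older than this belongs to a dead holder

def acquire(lockdir: str, name: str) -> bool:
    # os.link() is atomic across NFS clients, unlike fcntl/lockf which
    # are often broken there: exactly one link() to the shared lock
    # name will succeed.
    src = os.path.join(lockdir, f"{name}.{os.getpid()}")
    lock = os.path.join(lockdir, name + ".lock")
    open(src, "w").close()
    try:
        os.link(src, lock)
        return True
    except FileExistsError:
        # Opportunistic recovery: steal locks whose holder apparently
        # died before releasing (mtime older than the TTL).
        if time.time() - os.stat(lock).st_mtime > LOCK_TTL:
            os.unlink(lock)
            try:
                os.link(src, lock)
                return True
            except FileExistsError:
                return False
        return False
    finally:
        os.unlink(src)  # the scratch file is no longer needed either way

def release(lockdir: str, name: str) -> None:
    os.unlink(os.path.join(lockdir, name + ".lock"))

lockdir = tempfile.mkdtemp()
assert acquire(lockdir, "rebuild-index")        # we hold the lock
assert not acquire(lockdir, "rebuild-index")    # a second grab fails
release(lockdir, "rebuild-index")
```

The TTL-based stealing is the weak point of the scheme: pick it too short and you steal live locks, too long and recovery stalls; real deployments often write the holder's identity into the lock file to make recovery safer.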
For more information:
SAN = storage area network. SANs are very profitable for the companies that have succeeded in this marketplace, such as EMC. I am sort of down on SANs. You have to deploy a separate infrastructure for interconnections (I don't like betting against Ethernet as an interconnect), and they tend to be operationally complex, and rather expensive. A SAN is typically made up of interconnected fibers which connect dedicated storage bricks (embedded controllers with disks) to hosts running special drivers. Halfway between NFS based NAS and SAN is iSCSI.
For more information:
There are a number of file systems which have been designed to run on a cluster of machines. These attempt to provide effective coordination, and address reliability through automatic replication. They typically address scalability horizontally (many boxes rather than one big/fast box). Sometimes clustered file systems are not fully integrated into the operating system. For example, GoogleFS and MogileFS require applications to be linked against the "file system" library. Brent Welch wrote up a brief taxonomy of clustered file systems entitled What is a Cluster File System.
For more information:
Oftentimes, it turns out that state management can be reduced to simple atomic actions on simple key/value pairs, hashes, or other basic data structures like B-trees. Historically many services built simple stores on top of file systems because that was the quickest / easiest way to build a blob store. Richard Jones of last.fm wrote up a decent survey of several of the more interesting open source distributed key/value stores. Some projects that have caught my interest over the years:
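A toy sketch of the kind of interface such stores expose (this in-process KVStore class is purely illustrative, not modeled on any particular project above): simple atomic get/set plus compare-and-swap on key/value pairs.

```python
import threading

class KVStore:
    # Minimal in-process key/value store exposing the sort of simple
    # atomic actions (get / set / compare-and-swap) that distributed
    # key/value stores provide over the network.
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def cas(self, key, expected, new):
        # Atomic compare-and-swap: succeeds only if the current value
        # still matches what the caller last saw, so concurrent writers
        # cannot silently clobber each other's updates.
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True
```

Compare-and-swap is the workhorse primitive here: it lets many clients coordinate updates without holding long-lived locks, which is why so many of these stores offer it.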