“How many IT staff does an organization need?” is a commonly asked question that is difficult to answer. There is no magic ratio. There is no ideal staffing level. The appropriate number of staff depends on what the IT organization is responsible for and the level of service expected in each area of responsibility.
This post is primarily about classic enterprise IT staffing… what happens inside a company or a university where IS/IT solutions are delivered. While related, running a production service that delivers software as a service is quite a bit different from enterprise computing. There was a brief article in CIO which demonstrates what happens when you benchmark enterprises against service providers. I discuss some of these issues in my Hints for Operating High Quality Services. James Hamilton has noted that in mega scale operations, human staff accounts for less than 10% of overall costs.
There was a nice graph in the slide deck Impliance: an Information Management Appliance by folks from IBM Research which captured how staffing costs have gone up in comparison to the cost of hardware for enterprise computing. In a production environment, there is typically significantly more investment made in infrastructure and tools, work is often shared between an engineering group and an operations group, and there are often economies of scale that I only hint at in this paper.
What IT Staff Do
There are a number of different kinds of work that IT teams are often responsible for delivering.
User Services (e.g. Helpdesk)
How much hand-holding is expected? Some sites have users who are pretty self-sufficient; other sites have users who need assistance for everything they do. Can your users take care of themselves or do they need and want the administrator to perform even the simplest tasks for them? For example, I have a friend whose users demand that he perform the most basic tasks for them (such as moving their files from one directory to another). Anything that isn’t simply invoking the text editor or reading mail is “UNIX” and hence a job for the administrator. This sort of support requires a ratio something like one administrator for every four users.
Does the site want you to conduct workshops or prepare extensive local documentation? To what extent are you expected to consult on technical issues? Do you concern yourself with just UNIX or other realms? For example, let’s say your site has heavy users of TeX, Mathematica, Common LISP, C++, Python, X11, PostScript, and MySQL. Are you supposed to be able to answer detailed questions on all those topics? Few people are experts at all these things. Something that many people don’t appreciate is that development of expertise in any given topic area requires time to play, experiment, and mature in that area.
How much public domain software or freeware do people want installed? What level of support are they expecting? Just compiling and installing software doesn’t take much time. Often though, software doesn’t just compile and install properly. There are often assumptions in the software which need to be changed before the software can be used at a given site. In addition, administrators are often expected (and rightly so) to continue maintenance of the software (bug fixes and what not) and to become an expert in the use of the software. Compiling and installing (coupled with frequent patches) on many hardware/software platforms can make this incredibly time consuming for even just a few software packages. The time this takes varies with the quality and complexity of the software. Keeping a current version of kermit or perl isn’t hard (I wish everyone did as nice a job as Larry Wall has with perl); keeping up with g++ is much more time-consuming.
Cloud Services Support
Increasingly, organizations use cloud services to solve their business problems. Signing up for a cloud service can often be done by anyone with a credit card. Managing accounts takes a bit more effort, but is often fairly straightforward for organizations that use Google Apps, use an enterprise single sign-on system like Okta, or expose an AD service SAML endpoint. Beyond basic provisioning, the support costs for cloud services vary widely. At the simple end are in-the-cloud editors or file storage systems. At the complex end are workflow systems which need to be configured for an organization’s unique processes.
Most places not only expect their IT staff to manage software provided by vendors but also to create — on demand — tools for the user community. This is understandable, especially in small sites where the IT staff might be the only professional programmers. If there is this expectation, time must be allocated for this development process.
Site Planning/Administration Overhead
How much site planning is the administrator expected to handle? Must the administrator know that the average person generates 115 watts, and how to factor that and the heat loads from machines into sizing appropriate AC/heating loads and power? How much paperwork is there?
Who crawls through the ceiling to pull wires? Who finds the flaky transceiver when the Ethernet starts to go crazy? When a workstation dies, does a secretary just call your vendor and wait, or are more creative solutions required? Does your site buy all its peripherals ready-to-install or do you save money by purchasing components and do the integration yourself? Having IT staff do any of these things takes time.
Is the IT staff supposed to anticipate new technology and advise the company about new approaches? Most places I have worked expect the IT staff to have a good feel for the state of the art and new technology that looks promising (not just products, but research, too). Anticipation is often necessary given many sites have a two-to-five year planning or depreciation schedule. Keeping up with our field isn’t easy. There are a variety of sources one must draw upon to stay current. Trade rags can give you a picture of what is being sold. Blogs (and other electronic media) are great for questions regarding current issues and problems. Professional journals from ACM, IEEE, etc. are useful to see what is happening on the almost-done research front. There is no substitute, however, for a good network of professional contacts. This network can be maintained with phone calls, electronic mail, and attending conferences.
The best way to estimate the number of administrators needed is to figure out what level of service is required and how various factors (for instance networking infrastructure and heterogeneity of the machines being supported) will affect the fulfillment of those responsibilities. Rarely are system administrators doing only “administrator” tasks. The first part of this article will detail the tasks that I find myself performing in addition to the normal “administrator” tasks, such as backups, installing new users, operating-system maintenance, and so forth. Additional tasks are presented (for the most part) in the form of questions. The second part details some of the various factors that will affect staff levels. The third part details some simple perspectives that system administrators can adopt to make their environment more easily administrable. Finally, I will end by quickly examining some ratios which might help you to approximate your staffing needs.
The following is a very rough set of rules I use to estimate staffing requirements. Your mileage will vary. I should note that these numbers assume maintaining a reasonably stable environment. Rapid turnover of the user base or machines, abnormally frequent software changes, growth of the environment, etc. result in more work and affect the ratios.
|Type of Work||Units of labor to deliver best practice performance and scaling factors|
|End User Service||1 unit for every 10 computer-phobic users who need to do “complex things” (hand-holding factor). 1 unit for every 30 users who get good service. 1 unit for every 120 users who get basic service (e.g. students in an educational factory who mostly self-serve 🙂). All of these assume 8×5 support; staffing has to go up if you want the help desk to run extended hours.|
|24 x 7 Support (Partners, clients, etc)||Doing a 24×7 NOC which requires proactive notification and rapid problem resolution scales against the complexity of the service that is being managed and the number of high touch clients. Places that really care about this have a step in cost of 14 people… a manager, an assistant manager, and three shifts of two people each, covered by two teams, one running Sunday-Wednesday and the other running Wednesday-Saturday, so there is overlap between teams, clean handoffs, and time to do group training. Less than this can easily result in shifts not being covered. For example, having a single person per shift can fail if the night shift person falls asleep, or if someone working one of the weekend shifts gets sick. This doesn’t count folks to escalate to. The number of people needed per shift is related to how much normal work there is, and how many simultaneous disasters the team is expected to be able to handle.|
|Operating System Management||2 units for each make of OS requiring basic support. If you are pushing the OS beyond mainstream / tested scale, add an additional 4 units. If you are doing very complex things requiring hacked kernels, non-standard device drivers, etc., add 4 units. If you really care about security, add an additional 4 units. Need functionality which isn’t in the kernel at this time and/or something more than basic jumpstart or kickstart for installation and management? Manage this like a software development project and get good engineers working on it.|
|Hardware Management / Host Imaging (OS Deployment)||1 unit for every 20 boxes if you can’t protect the OS and system configurations from the users (Windows in many environments). 1 unit for every 40 boxes if you can protect the OS from the users without hindering the user, but can’t automatically build / rebuild / update the OS and software without sysadmin oversight. 1 unit for every 120 boxes which have network based software installs (compute clusters or fully automated user workstations with configuration management). Extremely large scale operations (1000s of machines running completely cookie cutter) scale more like 500 boxes / unit and might scale as high as 2500 boxes / unit at a Google scale where you don’t have to worry about the health of individual machines.|
|Platform Interoperability||2 units * # of OS if tight coupling. (shared filing, etc)|
|Simple Network Services||1 unit for every two basic services that are set up network wide instead of machine wide, e.g. newsspool, httpd, DNS, mail, printing, SAMBA. Add 2 units if you want to make them highly available (better than 99.8%). Add 2 units if you care about security. Add 2 units if you are scaling larger than the average. Add 4 if you are scaling to mega size and are beyond what the software was designed for. If you are completely beyond scale, treat it as a development project and staff accordingly with real engineers.|
|Complex Network Services||Highly variable. For example, a multi-terabyte database used for data mining could easily consume multiple DBAs plus multiple senior system administrators who specialize in performance tuning and large scale storage systems.|
|Network Connectivity||Scales against number of network devices, number of networks, security issues, complexity of routing, HA requirements. Don’t have good numbers at this time.|
|Coordination and Management||The larger and more complex an organization, the more there is a need for coordination roles. People who focus on human management, systems architecture, program management, project management. This is quite complex. It would be presumptuous to suggest a ratio.|
A solid SAGE II system administrator can handle 4 units of work. A strong SAGE III system administrator can handle 8 units of work. A superior SAGE IV system administrator can handle 12 units of work. This counting system is loosely based on an equation proposed by Sherwood Botsford and found in the comp.unix.admin FAQ. At some point I will update the counting to use my Operations Skill Matrix (excel).
Sites with one administrator are not very desirable.
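As a rough illustration of how the unit counts and the SAGE capacities combine, here is a minimal sketch. It covers only four rows of the table (end user service, hardware management, basic OS support, and simple network services) and ignores every "add N units" adjustment. The function names and the example site are hypothetical, chosen only to show the arithmetic.

```python
import math

# Users served by one unit of labor, from the "End User Service" row.
USERS_PER_UNIT = {"hand_holding": 10, "good": 30, "basic": 120}

# Boxes handled by one unit of labor, from the "Hardware Management" row.
BOXES_PER_UNIT = {"unprotected": 20, "protected": 40, "automated": 120}

# Units of work one administrator can carry, per the SAGE levels above.
UNITS_PER_ADMIN = {"SAGE II": 4, "SAGE III": 8, "SAGE IV": 12}


def estimate_units(users, service_level, boxes, box_class, os_count, basic_services):
    """Sum the units of labor for a simple, stable site (hypothetical helper)."""
    units = users / USERS_PER_UNIT[service_level]
    units += boxes / BOXES_PER_UNIT[box_class]
    units += 2 * os_count        # 2 units per OS requiring basic support
    units += basic_services / 2  # 1 unit per two network-wide basic services
    return units


def admins_needed(units, level):
    """Round up to whole administrators, all at a single SAGE level."""
    return math.ceil(units / UNITS_PER_ADMIN[level])


# Made-up example site: 300 users getting good service, 240 protected boxes,
# 2 operating systems, and 6 basic network services.
units = estimate_units(300, "good", 240, "protected", 2, 6)
print(units)                             # 23.0
print(admins_needed(units, "SAGE III"))  # 3
```

A real staff mixes skill levels, so treating everyone as the same SAGE level only brackets the answer: the same 23 units would take 6 SAGE II administrators or 2 SAGE IV administrators.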
They are a fact of life since many small sites can neither afford nor justify more than one system administrator. It is difficult for one person to have the breadth of knowledge and experience to run a really first-class site, no matter how few machines it has. There will always be some area that is not the strength of a sole administrator.
Another problem is that the site with a single system administrator has a single point of failure: when the administrator is on vacation (or gets run over by a bus), the site is vulnerable. Carrying a pager on vacation isn’t my idea of fun; however, no one can predict when a crisis might occur. Of course, it’s hard to interest a high-level person in a job that also involves changing the backup tapes and crawling through the ceilings.
The more homogeneous a site is, the easier it is to support.
The number of different platforms supported (different machine architectures or different operating systems) increases the complexity of the support task. Upgrading the operating system will have to be done at least once by hand for each platform. Each operating system has its own idiosyncrasies that must be learned and mastered. Most sites want all the platforms to appear identical so that their users can sit down at any of the workstations and get work done. This requires that each platform have identical tools, window systems, etc. This can greatly increase the amount of work the administrator must do. In the best of circumstances this means recompiling programs for each platform. In the worst circumstances, it involves porting software and fighting with vendor-supplied software. My personal nightmare is trying to support all of X11R4 (from MIT), DECwindows, OSF/Motif, and Sun’s OpenWindows on three different platforms.
Larger sites can exploit economies of scale.
Large sites can expand their administration staffs less rapidly than the number of users (or workstations) grows. The reason for this is that as your staff gets larger it is possible for people to specialize. This specialization permits individual staff members to develop a depth of expertise that enables them to understand all the issues on a given topic and solve more quickly whatever problems crop up.
Secondly, larger sites can leverage off previous work. The first installation of a machine or piece of software is always the most difficult. The second is easier. By the time you have done 50 or 100 installations, you have developed automatic scripts and can do installations in your sleep. I have seen large sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I must caution the reader though: this sort of ratio is only feasible with top-notch people working in a carefully engineered environment with many hundreds of users. Most sites can’t get productive work done with this sort of ratio. This sort of ratio also limits the professional growth of members of the system staff because they will spend most of their time with the day-to-day issues and fire-fighting. This is a shame since an organization’s most valuable resource is its people.
Increased SA Efficiency:
Common SA tools
Robust IS security
Tight control over what gets loaded on HW/SW baseline
Redundancy of critical services
Separating services (single service machines)
Good training program
Detailed disaster recovery plans, by system
Systems which don’t require backups
Good backup/restore program, centrally managed

Decreased SA Efficiency:
Diverse hardware baseline
Diverse software baseline
Lax IS security
Little or no training
A staff that is reactive, not proactive
Ad-hoc backups or no backups
High availability sites require higher staffing.
Sites which need to be highly available (e.g. greater than 99.9% service delivery) will require a higher level of staffing. The reason for this is that you need people who can respond almost immediately to any service issues (e.g. 24×7 coverage, ideally at least 2 people deep who can do first and second level resolution, and who are able to escalate to subject area experts). You also need to have multiple people for each subject area who are able to diagnose and resolve complex issues quickly.
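To make the availability targets concrete, the sketch below converts a percentage into a yearly downtime budget and reproduces the 14-person NOC step cost from the 24×7 row of the staffing table. It assumes a 365-day year, and it reads that row as two managers plus three daily shifts of two people, each covered by two overlapping teams. That reading and the function names are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct):
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (100 - availability_pct) / 100


def noc_headcount(shifts_per_day=3, people_per_shift=2, teams=2, managers=2):
    """Headcount for a 24x7 NOC: managers plus every seat on every team's shift."""
    return managers + shifts_per_day * people_per_shift * teams


print(round(downtime_budget_hours(99.8), 2))  # 17.52
print(round(downtime_budget_hours(99.9), 2))  # 8.76
print(noc_headcount())                        # 14
```

Under nine hours of allowed downtime per year leaves little room for waiting until the one person who understands a system is back at their desk, which is why the coverage has to be at least two people deep.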
What About Other Platforms?
The platform which is being supported makes a great deal of difference. My experience is that support of Macintosh and UNIX communities takes approximately the same staffing levels. Historically, support of PCs running any Microsoft OS seems to require at least double the staffing and delivers a lower level of service. Since Windows XP the ratio doesn’t need to be as high… but I still find administration scales better on UNIX than Windows.
A colleague suggested to me that it is critical to keep in mind what factors affect the scaling of a team. He provided a nice summary table.
Other People’s Ratios
In the last few years a lot of people have talked about the ratios they think are reasonable. It is common to hear people talking about staff/user ratios of 1:60 where there is some variation in the population and a lot of custom work, and staff/user ratios of 1:150 (or higher) in locations that can use “cookie cutter” solutions, e.g. universities with hordes of undergraduates or enterprises where people are using computing as a tool rather than looking to innovate on the machines that are being administered. A more realistic set of ratios (based on best practices in the field rather than vendor white papers on TCO) was the Mega Group’s Improve staffing ratios article. There are a number of other studies that have found that in the real world most organizations have not been able to support ratios greater than 30:1. A Mitre study from 2000 suggested that the ratio is 47:1 +/- 17%. In a video about User to Technician Ratios by Justin Nguyen a base ratio of 60:1 was suggested, with a number of factors which impact this ratio.
An example of overinflated numbers can be found in Staffing for Technology Support, a white paper for education institutions. Unfortunately, these folks are trying to apply staffing ratios from MIT’s Project Athena to the rest of the world. This is flawed for three reasons. First, most sites don’t have the sophisticated tools that Athena had. Second, Athena had people who made Athena run who were not captured in their ratios: student volunteers who did a lot of work and hard core system programmers who developed tools which met MIT’s requirements. Finally, MIT’s user population is not an average user population.
David Cappuccio of the Gartner Group suggested in his article Know The Types: Sizing up Support Staffs that there are two ratios that you need to consider. The first ratio is staff to users, an attempt to capture the human part of the equation. This ratio looks at how many people you need to do what is often called Tier I, help desk, or user support. The second ratio is the number of machines and subsystems per staff member, which captures how many people are needed to take care of the technical infrastructure. While I like David’s framework, I think that his ratios are too high for user support, and that he has failed to capture the diverse set of technologies most organizations deploy: there is much more than print, file, web, and database servers. There are directory, security, messaging, and collaborative services. To complicate matters, many sites are heterogeneous, requiring extra effort to make one service work for all clients, or worse, resulting in the need for separate services based on the client platform. A final complicating factor is that these services often have complex interactions and dependencies which make them more difficult to deploy and maintain. The result is that David’s ratios will result in staffing which will be able to deliver only the most basic services at an adequate level.
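To see how far apart these ratios land in practice, the sketch below applies each of them to one made-up population of 600 users. It treats every ratio as users per staff member, and it reads the Mitre "+/- 17%" as a percentage of the 47:1 ratio itself. Both the population and that reading are assumptions.

```python
import math


def staff_needed(users, users_per_staff):
    """Whole staff required to support `users` at a users-per-staff ratio."""
    return math.ceil(users / users_per_staff)


# Ratios quoted in the text, expressed as users per staff member.
ratios = {
    "real-world ceiling (30:1)": 30,
    "Mitre 2000, low end (47:1 - 17%)": 47 * 0.83,
    "Mitre 2000 (47:1)": 47,
    "Mitre 2000, high end (47:1 + 17%)": 47 * 1.17,
    "varied population, custom work (60:1)": 60,
    "cookie cutter (150:1)": 150,
}

users = 600  # hypothetical user population
for label, ratio in ratios.items():
    print(f"{label}: {staff_needed(users, ratio)} staff")
```

For the same 600 users the answers run from 4 staff at 150:1 up to 20 staff at 30:1, which is why a quoted ratio means little without knowing what service level and environment it assumes.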
The itbenchmark blog has a number of postings on the topic of staff sizing.
The number of administrators required varies greatly from site to site. The one constant is that there are rarely enough system administrators for the responsibilities that they have. My personal experience is that it is possible for a single person to maintain up to 120 machines (with three different platforms) and give adequate user services to a fairly sophisticated user population. My time is divided between user services (30 percent), general system administration tasks (20 percent), installing new machines and hardware/network support (10 percent), software installation and maintenance (40 percent), custom software development and tracking of trends (25 percent), and site planning (10 percent). You will note that this adds up to 135 percent.
History of this Post
In 1991 I posted a note to Usenet responding to a question about staffing ratios. Rob Kostad asked me to expand that short note into an article which ran with the title “How Many Administrators are Enough?” in the magazine Unix Review, April 1991. The original article in troff -ms form is still around. Over the years I have made some, mostly minor updates to the original article. One of these days I will rewrite it completely. While this article was written a long time ago, I find that the ratios are still pretty accurate. If you think I am wrong, send me mail with your experiences.