Originally Published in 1991
Minor updates in 2000 and 2018
“How many IT staff does an organization need?” is a commonly asked question that is difficult to answer. There is no magic ratio. There is no ideal staffing level. The appropriate number of staff depends on what the IT organization is responsible for and the level of service expected in each area of responsibility.
This post is primarily about classic enterprise IT staffing… what happens inside a company or a university where IS/IT solutions are delivered. While related, running a production service which delivers software as a service is quite a bit different from enterprise computing. There was a brief article in CIO which demonstrates what happens when you benchmark enterprises against service providers. I discuss some of these issues in my Hints for Operating High Quality Services. James Hamilton has noted that in mega scale operations, human staff accounts for less than 10% of overall costs.
There was a nice graph in the slide deck Impliance: an Information Management Appliance by folks from IBM Research which captured how staffing costs have gone up in comparison to the cost of hardware for enterprise computing. In a production environment, there is typically significantly more investment made in infrastructure and tools, work is often shared between an engineering group and an operations group, and there are often economies of scale that I only hint at in this paper.
What IT Staff Do
There are a number of different kinds of work that IT teams are often responsible for delivering.
User Services (e.g. Helpdesk)
How much hand-holding is expected? Some sites have users who are pretty self-sufficient; other sites have users who need assistance for everything they do. Can your users take care of themselves or do they need and want the administrator to perform even the simplest tasks for them? For example, I have a friend whose users demand that he perform the most basic tasks for them (such as moving their files from one directory to another). Anything that isn’t simply invoking the text editor or reading mail is “UNIX” and hence a job for the administrator. This sort of support requires a ratio something like one administrator for every four users.
Does the site want you to conduct workshops or prepare extensive local documentation? To what extent are you expected to consult on technical issues? Do you concern yourself with just the operating system like OSX or is the Helpdesk assisting with applications, or even development tools? For example, let’s say your site has heavy users of TeX, Mathematica, C++, Python, X11, R, MySQL, and Google G-Suite. Is user services supposed to be able to answer detailed questions on all those topics? Few people are experts at all these things. Development of expertise in any given topic area requires time to play, experiment, and mature in that area, and there are limits to the number of areas someone can master.
How much public domain software or freeware do people want installed? What level of support are they expecting? Installing RPMs is typically fairly easy. Just compiling and installing software doesn’t take much time. Often though, software doesn’t just compile and install properly. Often there are conflicts between packages which can be challenging to resolve. There are often assumptions in the software which need to be changed before the software can be used at a given site. In addition, administrators are often expected (and rightly so) to continue maintenance of the software (bug fixes, security patches, and what not) and to become an expert in the use of the software. Compiling and installing (coupled with frequent patches) on many hardware/software platforms can make this incredibly time-consuming for even just a few software packages. The time this takes varies with the quality and complexity of the software. Keeping a current version of kermit or perl isn’t hard (I wish everyone did as nice a job as Larry Wall has with perl); keeping up with g++ is much more time-consuming.
Cloud Services Support
Increasingly today, organizations use cloud services to solve their business problems. Staff time for cloud services is often underestimated because the work tends to be distributed. First is the cost of onboarding. Often this starts with someone using a company credit card to sign up for a service, but that is not sustainable. For one thing, the billing really needs to go to the organization, not an individual. More important, though, is ensuring that the correct people have access, and that when people’s roles change, or they leave the company, the right thing happens.
In the best case, the SaaS can leverage an existing account management system such as Google Apps, an enterprise single sign-on system like Okta, or an internal AD service if it provides an externally reachable SAML endpoint. Care has to be taken to ensure that disabling a single sign-on account blocks access (some systems also permit access against a service specific credential). Generally, onboarding takes around 1 man-week per app once you factor in learning about the service, arranging integrations, and producing appropriate documentation.
Beyond getting a SaaS plumbed in, there is a wide range of possible costs depending on the complexity of the integrations, how much customization is required, and whether there will be a need for the IT team to provide customer support.
In our increasingly complex and regulated world, organizations have to track and manage a wide range of processes. As the pace of work has increased, there is the perception that data needs to be available in real time, which has led to the adoption of enterprise resource planning (ERP) systems. The best known systems in this space are SAP and Oracle’s suite of applications. In the mid-tier there are companies like NetSuite and Workday. For small businesses, QuickBooks is commonly used.
In smaller organizations ERP systems are often managed by the team that is responsible for the function. Finance takes care of the accounting software. Human Resources often takes care of whatever tools are used to manage headcount, and uses whatever system finance has selected for doing payroll. In these cases IT staffing is fairly minimal and is typically focused on providing a server and ensuring that its data is regularly backed up.
As a company grows in size and complexity, the requirements of these systems can grow massively. It’s beyond the scope of this post to discuss estimating the staffing requirements for ERP systems, but I will make one observation. In almost all cases, it’s a bad idea to customize ERP systems to exactly match your existing business processes. Whenever possible you should use a system as close to out of the box as you can, and if the choice is to change your process or customize the software, change your process.
Many organizations don’t have a dedicated software engineering department, so the IT staff is called on not just to manage software provided by vendors but also to create — on demand — tools for the user community. This is understandable, especially in small sites where the IT staff might be the only professional programmers. If there is this expectation, time must be allocated for this development process.
Site Planning/Administration Overhead
How much site planning is the administrator expected to handle? Must the administrator know that the average person generates 115 watts of heat, and how to combine that with the heat load from machines to size the AC/heating and power appropriately? How much paperwork is there?
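To give a feel for the heat-load arithmetic, here is a rough sketch. It uses the 115 watts per person quoted above. The machine wattages, the function names, and the simplification that all power drawn by equipment ends up as heat in the room are my own illustrative assumptions, and a real machine room should be sized by someone who does HVAC for a living.

```python
WATTS_PER_PERSON = 115          # figure quoted in the text
BTU_PER_HOUR_PER_WATT = 3.412   # standard conversion: 1 W is about 3.412 BTU/hr
BTU_PER_HOUR_PER_TON = 12_000   # one "ton" of cooling is 12,000 BTU/hr


def heat_load_watts(people, machine_watts):
    """Total heat load in watts from people plus equipment.

    Assumes (simplification) that every watt a machine draws becomes heat.
    """
    return people * WATTS_PER_PERSON + sum(machine_watts)


def cooling_tons(total_watts):
    """Convert a heat load in watts to tons of cooling."""
    return total_watts * BTU_PER_HOUR_PER_WATT / BTU_PER_HOUR_PER_TON


if __name__ == "__main__":
    # Hypothetical room: 4 people and 10 machines drawing 350 W each.
    load = heat_load_watts(people=4, machine_watts=[350] * 10)
    print(f"Heat load: {load} W, about {cooling_tons(load):.2f} tons of cooling")
```

The point is not the exact numbers but that someone on staff has to collect the inputs and redo this estimate every time the room changes.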
Who crawls through the ceiling to pull wires? Who finds the flaky transceiver when the Ethernet starts to go crazy? When a workstation dies, does a secretary just call your vendor and wait, or are more creative solutions required? Do you do board level repairs? Does your site buy all its peripherals ready-to-install or do you save money by purchasing components and do the integration yourself? Having IT staff do any of these things takes time.
Is the IT staff supposed to anticipate new technology and advise the company about new approaches? Most places I have worked expect the IT staff to have a good feel for the state of the art and new technology that looks promising (not just products, but research, too). Anticipation is often necessary given that many sites have a three-to-six year planning or depreciation schedule. Keeping up with our field isn’t easy, and I have found that one must draw upon a variety of sources to stay current. Trade rags can give you a picture of what is being sold; blogs (and other electronic media) are great for questions regarding current issues and problems. Professional journals from ACM, IEEE, etc. are useful to see what is happening on the almost-done research front. There is no substitute, however, for a good network of professional contacts. This network can be maintained with phone calls, electronic mail, participating in online communities, and attending conferences.
The best way to estimate the number of administrators needed is to figure out what level of service is required and how various factors (for instance networking infrastructure and heterogeneity of the machines being supported) will affect the fulfillment of those responsibilities. Rarely are system administrators doing only “administrator” tasks. The first part of this article will detail the tasks that I find myself performing in addition to the normal “administrator” tasks, such as backups, installing new users, operating-system maintenance, and so forth. Additional tasks are presented (for the most part) in the form of questions. The second part details some of the various factors that will affect staff levels. The third part details some simple perspectives that system administrators can adopt to make their environment more easily administrable. Finally, I will end by quickly examining some ratios which might help you to approximate your staffing needs.
The following is a very rough set of rules I use to estimate staffing requirements. Your mileage will vary. I should note that these numbers assume maintaining a reasonably stable environment. Rapid turnover of the user base or machines, abnormally frequent software changes, growth of the environment, etc. result in more work and affect the ratios.
|Type of Work||Units of labor to deliver best practice performance and scaling factors|
|End User Service||1 unit for every 50 users who get good service. 1 unit for every 200 users who get basic service (e.g. students in an educational factory), assuming 8×5 support. Not needed if you are running the service with a separate customer care organization. Staffing has to go up if you want the help desk to run extended hours.|
|24 x 7 Support (Partners, clients, etc)||Doing a 24×7 NOC which requires proactive notification and rapid problem resolution scales against the complexity of the service that is being managed and the number of high touch clients. Places that really care about this have a step in cost of 14 people… a manager, an assistant manager, and three shifts of two people each, staffed by two teams, one running Sunday-Wednesday and the other running Wednesday-Saturday, so there is overlap between teams, clean handoffs, and time to do group training. Less than this can easily result in shifts not being covered. For example, having a single person / shift can fail if the night shift person falls asleep, or if someone working one of the weekend shifts gets sick. This doesn’t count folks to escalate to. The number of people needed per shift is related to how much normal work there is, and how many simultaneous disasters the team is expected to be able to handle.|
|Operating System Management||2 units for each make of OS requiring basic support. If you are pushing the OS beyond mainstream / tested scale, add an additional 4 units. If you are doing very complex things requiring hacked kernels, non-standard device drivers, etc., add 4 units. If you really care about security, add an additional 12 units. Need functionality which isn’t in the kernel at this time and/or something more than basic jumpstart or kickstart for installation and management? Manage this like a software development project and get good engineers working on it.|
|Hardware Management / Host Imaging (OS Deployment)||1 unit for every 50 boxes if you can’t protect the OS and system configurations from the users (Windows in many environments) or where there is high customization which has to be done by IT staff. 1 unit for every 200 boxes if you can protect the OS from the users without hindering the user, but can’t automatically build / rebuild / update the OS and software without IT oversight. 1 unit for every 400 boxes which have network based software installs (compute clusters or fully automated user workstations with configuration management). Extremely large scale operations (1,000s of machines running completely cookie cutter) scale more like 500 boxes / unit and might scale as high as 2500 boxes / unit at Google scale, where you can afford to lose full racks / service units without needing to immediately take action.|
|Appliance Support||1 unit for each simple app. 4 units for any complex app for which staff are expected to be power users. Initial deployment of apps is typically time-consuming. When a large number of apps are deployed, you need to account for the time it takes staff to context switch / “swap in” information.|
|Simple Network Services||1 unit for every two basic services: httpd, DNS, mail, printing, SAMBA, etc. Add 2 units if you want them to have better than 99.8% availability. Add 2 units if you care about security. Add 1 unit if you are scaling larger than the average. Add 4 units if you are scaling to mega size and are beyond what the software was designed for. If you are completely beyond scale, treat it as a development project and staff accordingly with engineers.|
|Complex Network Services||Highly variable. For example, a multi-terabyte database used for data mining could easily consume multiple DBAs + multiple senior system administrators who specialize in performance tuning and large scale storage systems.|
|Network Connectivity||Scales against number of network devices, number of networks, security issues, complexity of routing, HA requirements.|
|Coordination and Management||The larger and more complex an organization, the more there is a need for coordination roles. People who focus on human management, systems architecture, program management, project management. This is quite complex. It would be presumptuous to suggest a ratio.|
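The 14-person step for a 24×7 NOC in the table above can be reproduced with simple arithmetic. The sketch below reads the description as two managers plus two teams (one for each half of the week), each covering three shifts with two people per shift. That reading, and the function and parameter names, are my interpretation rather than a formula from the original table.

```python
def noc_headcount(shifts_per_day=3, people_per_shift=2, teams=2, managers=2):
    """Minimum headcount for a 24x7 NOC, excluding escalation contacts.

    teams:    groups splitting the week (e.g. Sunday-Wednesday, Wednesday-Saturday)
    managers: the manager plus the assistant manager
    """
    return managers + shifts_per_day * people_per_shift * teams


if __name__ == "__main__":
    print("Two people per shift:", noc_headcount())
    print("One person per shift:", noc_headcount(people_per_shift=1))
```

Dropping to one person per shift cuts the headcount to 8, but it also leaves nobody on hand when the night shift person falls asleep or a weekend person gets sick.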
A solid SAGE II system administrator can handle 4 units of work. A strong SAGE III system administrator can handle 8 units of work. A superior SAGE IV system administrator can handle 12 units of work. This counting system is loosely based on an equation proposed by Sherwood Botsford and found in the comp.unix.admin FAQ. At some point I will update the counting to use my SRE Skill Matrix (excel).
Sites with one administrator are not very desirable.
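To show how the units add up, here is a small sketch that totals a few rows of the table for a made-up site and converts the total into headcount using the 4, 8, and 12 units per administrator given above. The site numbers and function names are hypothetical, only four of the table's rows are modeled, and rounding up to whole people is my own choice.

```python
import math

# Units of work one administrator can handle, from the text.
UNITS_PER_ADMIN = {"SAGE II": 4, "SAGE III": 8, "SAGE IV": 12}


def end_user_units(users, users_per_unit=50):
    """1 unit per 50 users for good service (use 200 for basic service)."""
    return users / users_per_unit


def os_units(os_makes):
    """2 units for each make of OS requiring basic support."""
    return 2 * os_makes


def hardware_units(boxes, boxes_per_unit=200):
    """1 unit per 50, 200, or 400 boxes depending on how automated the site is."""
    return boxes / boxes_per_unit


def simple_service_units(services):
    """1 unit for every two basic network services."""
    return services / 2


def admins_needed(total_units, level):
    """Whole administrators of the given SAGE level needed to cover the units."""
    return math.ceil(total_units / UNITS_PER_ADMIN[level])


if __name__ == "__main__":
    # Hypothetical site: 300 well-served users, 2 OS makes, 400 boxes, 4 services.
    total = (end_user_units(300) + os_units(2)
             + hardware_units(400) + simple_service_units(4))
    print(f"Total units of work: {total}")
    for level in UNITS_PER_ADMIN:
        print(f"{level}: {admins_needed(total, level)} administrators")
```

A real team is a mix of levels, so treat the per-level counts as bounds rather than a hiring plan.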
They are a fact of life since many small sites can neither afford nor justify more than one system administrator. It is difficult for one person to have the breadth of knowledge and experience to run a really first-class site, no matter how few machines it has. There will always be some area that is not the strength of a sole administrator.
Another problem is that the site with a single system administrator has a single point of failure: when the administrator is on vacation (or gets run over by a bus), the site is vulnerable. Carrying a pager on vacation isn’t my idea of fun; however, no one can predict when a crisis might occur. Of course, it’s hard to interest a high-level person in a job that also involves changing the backup tapes and crawling through the ceilings.
The more homogeneous a site is, the easier it is to support.
The number of different platforms supported (different machine architectures or different operating systems) increases the complexity of the support task. Upgrading the operating system will have to be done at least once by hand for each platform. Each operating system has its own idiosyncrasies that must be learned and mastered. Most sites want all the platforms to appear identical so that their users can sit down on any of the workstations and get work done. This requires that each platform have identical tools, window systems, etc. This can greatly increase the amount of work the administrator must do. In the best of circumstances this means recompiling programs for each platform. In the worst circumstances, it involves porting software, and fighting with vendor-supplied software. My personal nightmare is trying to support all of X11R4 (from MIT), DECwindows, OSF/Motif, and Sun’s OpenWindows on three different platforms.
Larger sites can exploit economies of scale.
Large sites can expand their administration staffs less rapidly than the number of users (or workstations) grows. The reason for this is that as your staff gets larger it is possible for people to specialize. This specialization permits individual staff members to develop a depth of expertise that enables them to understand all the issues on a given topic and solve more quickly whatever problems crop up.
Secondly, larger sites can leverage off previous work. The first installation of a machine or piece of software is always the most difficult. The second is easier. By the time you have done 50 or 100 installations, you have developed automatic scripts and can do installations in your sleep. I have seen large sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I must caution the reader though: this sort of ratio is only feasible with top-notch people working in a carefully engineered environment with many hundreds of users. Most sites can’t get productive work done with this sort of ratio. This sort of ratio also limits the professional growth of members of the system staff because they will spend most of their time with the day-to-day issues and fire-fighting. This is a shame since an organization’s most valuable resource is its people.
|Increased SA Efficiency||Decreased SA Efficiency|
|Standards (policy, architecture)||Diverse hardware baseline|
|Robust IS security||Diverse software baseline|
|Tight control over what gets loaded on HW/SW baseline||Lax IS security|
|Redundancy of critical services||Little or no training|
|Separating services (single service machines)||A staff that is reactive, not proactive|
|Good training program||Ad-hoc backups or no backups|
|Detailed disaster recovery plans, by system|| |
|Systems which don’t require backups|| |
High availability sites require higher staffing.
Sites which need to be highly available (e.g. greater than 99.9% service delivery) will require a higher level of staffing. The reason for this is you need people who can respond almost immediately to any service issues (e.g. 24×7 coverage, ideally at least 2 people deep who can do first and second level resolution, and be able to escalate to subject area experts). You also need to have multiple people for each subject area who are able to diagnose and resolve complex issues quickly.
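To see why availability targets drive staffing, it helps to turn a percentage into the downtime it allows. The sketch below does that for 99.9% and for the 99.8% threshold used in the table earlier. It assumes a 365-day year and counts all downtime, planned or not, against the budget; both are simplifying assumptions of mine.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.8, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {hours:.2f} hours down per year")
```

At 99.9% the whole year's budget is under nine hours, which a single slow response to one overnight outage can use up.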
What About Other Platforms?
The platform which is being supported makes a great deal of difference. My experience is that support of Macintosh and UNIX communities takes approximately the same staffing levels. Historically, support of PCs running any Microsoft OS seems to require at least double the staffing and delivers a lower level of service. Since Windows XP the ratio doesn’t need to be as high… but I still find administration scales better on UNIX than Windows.
Other People’s Ratios
In the last few years there have been a lot of people who have talked about the ratios they think are reasonable. It is common to hear people talking about staff/user ratios of 1:60 where there is some variation in the population and a lot of custom work, and staff/user ratios of 1:150 (or higher) in locations that can use “cookie cutter” solutions, e.g. universities with hordes of undergraduates or enterprises where people are using computing as a tool rather than looking to innovate on the machines that are being administered. A more realistic set of ratios (based on best practices in the field rather than vendor white papers on TCO) was the Mega Group’s Improve staffing ratios article. There are a number of other studies that have found that in the real world most organizations have not been able to support ratios greater than 30:1. A Mitre study from 2000 suggested that the ratio is 47:1 +/- 17%. In a video about User to Technician Ratios by Justin Nguyen a base ratio of 60:1 was suggested, with a number of factors which impact this ratio.
An example of overinflated numbers can be found in Staffing for Technology Support, a white paper for educational institutions. Unfortunately, these folks are trying to apply staffing ratios from MIT’s Project Athena to the rest of the world. This is flawed for three reasons. First, most sites don’t have the sophisticated tools that Athena had. Second, Athena had people who made Athena run who were not captured in their ratios: student volunteers who did a lot of work and hard core system programmers who developed tools which met MIT’s requirements. Finally, MIT’s user population is not an average user population.
David Cappuccio of the Gartner Group suggested in his article Know The Types: Sizing up Support Staffs that there are two ratios that you need to consider. The first ratio is staff to users, an attempt to capture the human part of the equation. This ratio is looking at how many people you need to do what is often called Tier I, help desk, or user support. The second ratio is the number of machines and subsystems per staff, that is, capturing how many people are needed to take care of the technical infrastructure. While I like David’s framework, I think that his ratios are too high for user support, and that he has failed to capture the diverse set of technologies most organizations deploy: there is much more than print, file, web, and database servers. There are directory, security, messaging, and collaborative services. To complicate matters, many sites are heterogeneous, requiring extra effort to make one service work for all clients, or worse, resulting in the need for services which are based on the client platform. A final complicating factor is that these services often have complex interactions and dependencies which make them more difficult to deploy and maintain. The result is that David’s ratios will result in staffing which will be able to deliver only the most basic services at an adequate level.
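To make these ratios concrete, the sketch below converts each one into a headcount for a hypothetical organization of 1,000 users. The user count and function names are illustrative, and I am assuming the “+/- 17%” on the Mitre figure is relative to the 47:1 midpoint, which the summary above does not spell out.

```python
def staff_needed(users, users_per_staff):
    """Support staff implied by a users-per-staff ratio."""
    return users / users_per_staff


def ratio_band(center, plus_minus_percent):
    """Low and high ends of a ratio quoted as center +/- a relative percentage."""
    delta = center * plus_minus_percent / 100
    return center - delta, center + delta


if __name__ == "__main__":
    users = 1000  # hypothetical organization size
    low, high = ratio_band(47, 17)
    ratios = [
        ("real-world ceiling (30:1)", 30),
        (f"Mitre low end ({low:.0f}:1)", low),
        (f"Mitre high end ({high:.0f}:1)", high),
        ("custom work (60:1)", 60),
        ("cookie cutter (150:1)", 150),
    ]
    for label, ratio in ratios:
        print(f"{label:<28} {staff_needed(users, ratio):5.1f} staff")
```

The spread between the lowest and highest ratios is roughly a factor of five in headcount, which is why it matters so much which ratio a budget was built on.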
The itbenchmark blog has a number of postings on the topic of staff sizing.
The number of administrators required varies greatly from site to site. The one constant is that there are rarely enough system administrators for the responsibilities that they have. At the time this was originally written, I found it was possible for a single person to maintain up to 220 machines (with three different platforms) and give adequate user services to a fairly sophisticated user population of around 80 people. My time is divided between user services (30 percent), general system administration tasks (20 percent), installing new machines and hardware/network support (10 percent), software installation and maintenance (30 percent), custom software development and tracking of trends (35 percent), and site planning (10 percent). You will note that this adds up to 135 percent.
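The overcommitment is easy to check. The sketch below sums the percentages listed above and, assuming a nominal 40-hour week (my assumption, not a figure from the original), shows what that load means in hours.

```python
# Percent of time per activity, as listed in the text.
TIME_ALLOCATION = {
    "user services": 30,
    "general system administration": 20,
    "new machines and hardware/network support": 10,
    "software installation and maintenance": 30,
    "custom software development and trend tracking": 35,
    "site planning": 10,
}

NOMINAL_WEEK_HOURS = 40  # assumption for illustration


def total_percent(allocation):
    """Sum of all activity percentages."""
    return sum(allocation.values())


def implied_week_hours(allocation, nominal_hours=NOMINAL_WEEK_HOURS):
    """Hours per week implied by the allocation at the nominal week length."""
    return nominal_hours * total_percent(allocation) / 100


if __name__ == "__main__":
    print(f"Total allocation: {total_percent(TIME_ALLOCATION)} percent")
    print(f"Implied work week: {implied_week_hours(TIME_ALLOCATION):.0f} hours")
```

With one person covering 220 machines and around 80 users, the extra 35 percent is what the missing headcount looks like in practice.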
History of this Post
In 1991 I posted a note to Usenet responding to a question about staffing ratios. Rob Kostad asked me to expand that short note into an article which ran with the title “How Many Administrators are Enough?” in the magazine Unix Review, April 1991. Over the years I have made some, mostly minor, updates to the original article. One of these days I will rewrite it completely. While this article was written a long time ago, I find that the ratios are still pretty accurate. If you think I am wrong, send me mail with your experiences.