Challenges Using NT For Production Services
Mark Verber
Draft 0.01 / October 8, 1999
Few updates May 10, 2007
The following is a list of things which have made building and deploying a
production service on NT harder than it should be. This list was generated
in about 15 minutes ... in other words, it's shooting from the hip.
Development
Threads are hard, especially when debugging IIS. We get something like
1 GB of icepack logs from a 1 minute run. The tools we know about make it
really tough to comb through the output and find what we are looking for.
[To do: insert a description of some of the instrumentation and tools which
existed in the Cedar environment at PARC which made understanding thread
traces and seeing deadlocks a bit easier.]
APIs, Data Structures, and Libraries
I hate most of the Windows API. I wish the interface police from DEC SRC
(oops, HP Labs) would take a run at things and clean everything up. Oh, but
then we wouldn't be backwards compatible. Sigh.
- API Junk - A lot of the time we find ourselves having to work around some
silliness. Many interfaces aren't clean. Too often parameters are
polymorphic -- not so good when you are working in C. Too many interfaces
aren't consistent with each other.
- Some file system calls are not exposed by Win32, such as an atomic
append at end of file, which makes it hard to get acceptable performance
with appropriate levels of safety. Win32's scatter-gather write doesn't
compare to UNIX writev.
- BSTRs suck... but you have to use them for a lot of things
- using COM in process forces you to use the system heap...
- ... and the heap allocator is single threaded. Should move system malloc to
something like Hoard.
- Many core libraries such as ATL raise dialog boxes on the machine's
display in the case of errors! We have had to overwrite macros, change
code, etc. to prevent this. This is a real pain because if you don't find
these error traps, a simple problem can result in a process locking up
with no immediately obvious reason unless you are in the same physical
location (e.g. can see the screen). This does not scale well.
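
For comparison, here is a minimal sketch of the UNIX behaviour the
atomic-append and writev item above is asking for. It is not a Win32
equivalent; it only shows the POSIX side, using Python's os module on a
UNIX system. The function name and the record contents are made up for
illustration.

```python
import os
import tempfile


def append_record(path, parts):
    """Append one record, given as a list of bytes buffers, with one writev call."""
    # O_APPEND asks the kernel to position every write at the current end of
    # file, so concurrent appenders do not have to seek first.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # writev gathers the separate buffers into a single system call, so
        # they do not have to be copied into one contiguous buffer first.
        return os.writev(fd, parts)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical log file in a scratch directory.
    log_path = os.path.join(tempfile.mkdtemp(), "service.log")
    append_record(log_path, [b"GET /index.html", b" ", b"200", b"\n"])
    append_record(log_path, [b"GET /missing", b" ", b"404", b"\n"])
    with open(log_path, "rb") as f:
        print(f.read().decode())
```

The header, body, and trailer of a record can live in separate buffers and
still reach the file as one append.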
Debugging Pains
- Dr. Watson isn't right. First, it should be designed so that it
can save multiple core dumps. This is important in a service when
you want a quick restart but don't want to lose failure data.
- It would be nice if there were a good debugger framework such that you
could call an appropriate debugger for a particular issue. Right now you
can't call a debugger based on which component failed. You can only set
the global debugger.
- Categorization in IIS - what inside was the guilty party
- ntsd doesn't give easy access to sources
- terminal server debugging - run in a css, no way to communicate
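
The "save multiple core dumps" wish above can be approximated by hand. The
following is a rough sketch, not a description of how Dr. Watson works: a
restart script moves each dump to a unique name before the service comes
back up, so the next crash cannot overwrite it. The function name, file
names, and directory layout are all hypothetical.

```python
import os
import shutil
import tempfile
import time


def archive_dump(dump_path, archive_dir):
    """Move a crash dump into archive_dir under a name that is not yet taken."""
    os.makedirs(archive_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = os.path.basename(dump_path)
    seq = 0
    while True:
        # The sequence number keeps two crashes in the same second apart.
        target = os.path.join(archive_dir, f"{stamp}-{seq:03d}-{base}")
        if not os.path.exists(target):
            break
        seq += 1
    # Not race-free if several archivers run at once; fine for one supervisor.
    shutil.move(dump_path, target)
    return target


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    for crash in (b"first crash", b"second crash"):
        dump = os.path.join(work, "service.dmp")  # hypothetical dump file
        with open(dump, "wb") as f:
            f.write(crash)
        print(archive_dump(dump, os.path.join(work, "dumps")))
```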
General Brokenness
Operations
OS install
- Normal installs require multiple reboots. In our UNIX work everything can
be done with a single reboot. We do a network boot, software
installation, and then a reboot to the newly loaded disk. We have
hacked together a Linux based installation system which lets us install NT
with 2 reboots... we are getting closer.
- No good equivalent of Sun's jumpstart. Jumpstart is a network boot and
software installation system which can automatically customize new machines.
[I believe the internal Microsoft project called "Big"
which would do bare metal provisioning was finally released as part of
Windows Server 2003]
- It's hard to know what to install and what isn't needed.
CPU servers in data centers normally don't need to print. It
would be nice to remove the printer module to lower the chances of yet
another security issue, yet with the fun world of COM
and inter-dependencies, you can't be 100% sure that nothing depends on
those modules, and you won't see it initially due to late binding.
Registry
- tools not great
- not right solution. Want to be able to grep, awk, etc
- No one place to store a piece of information, since one key can't
reference another. In UNIX land it is possible to reference other keys.
For example, people take ($HOSTNAME) on unix much more seriously than they
used to. If you want to change a machine's IP address, you may need to
change tens of keys.
- Encourages people to keep lots of global state
making it hard to do side-by-side versioning
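
As an illustration of the "reference other keys" point above, here is a
small sketch of a plain-text config where later values refer to earlier
ones, so an IP address is written down exactly once. The file format, the
key names, and the addresses are all invented for the example; nothing
here is a real NT or UNIX config format.

```python
import string

# Hypothetical config text: greppable, and each fact is stored once.
CONFIG_TEXT = """\
# machine identity
hostname = web01
ip_address = 10.0.0.5
log_server = ${hostname}-log
bind_address = ${ip_address}
admin_url = http://${ip_address}:8080/admin
"""


def load_config(text):
    """Parse 'key = value' lines, expanding ${key} references to earlier keys."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, raw = line.partition("=")
        # substitute() raises KeyError if a reference names an unknown key.
        values[key.strip()] = string.Template(raw.strip()).substitute(values)
    return values


if __name__ == "__main__":
    for key, value in load_config(CONFIG_TEXT).items():
        print(f"{key} = {value}")
```

Changing the one ip_address line updates every value that refers to it,
and the file itself still works with grep and awk.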
Software install
- Most things are globals - can't compartmentalize
- DLL hell
Logging
- performance is poor
- aggregation is primitive
- not everything logs... some things put it on screen
- many things don't log useful information (just an error code)
- Didn't used to insert a boot event, which made some
debugging harder. Think it's there now.
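
To make the "just an error code" complaint above concrete, here is a small
sketch of what a more useful log line could carry: the operation, the
target, the symbolic name, and the text for the code. It uses UNIX errno
values from Python's standard library purely as a stand-in; the function
name and the example path are hypothetical.

```python
import errno
import os


def describe_error(operation, target, code):
    """Build a log line that says what failed, on what, and why."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{operation} {target!r} failed: {name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # Compare a bare code with the fuller line for the same failure.
    print(f"error {errno.ENOENT}")
    print(describe_error("open", "/var/service/queue.db", errno.ENOENT))
```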
Management APIs
- Want one API, multiple interfaces. Today, the GUI is often the only way
to set all values
- many things require a head.
Remote Management
- Serial lines - not dependent on networking code, basic device driver
- -silent not enough. Errors should go to command shell
- In UNIX everything is a command line. To do remote administration
you just need a remote shell and everything works. In the NT world you
have to make every tool network aware.
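
A sketch of the behaviour the "-silent" item above asks for: a tool that
never raises a dialog, writes errors to the shell's error stream, and
reports status through its exit code, so it works unchanged over any
remote shell. The tool name "setopt" and its arguments are hypothetical.

```python
import io
import sys


def main(argv, out=None, err=None):
    """Hypothetical 'setopt NAME VALUE' tool: text in, text out, status in exit code."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    if len(argv) != 2:
        print("usage: setopt NAME VALUE", file=err)
        return 2
    name, value = argv
    if not value.isdigit():
        # The error goes to the shell, not to a dialog box on the console.
        print(f"setopt: {name}: expected a number, got {value!r}", file=err)
        return 1
    print(f"{name}={value}", file=out)
    return 0


if __name__ == "__main__":
    # Demonstrate a failure with the error text captured as a script would see it.
    captured = io.StringIO()
    status = main(["timeout", "soon"], err=captured)
    print(status, captured.getvalue().strip())
```

A calling script can then test the exit status and read the message from
stderr, with no need to see the machine's display.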
Plug and Pray
- Can you tell it not to do this? e.g. this NIC sucks... ignore it