Monday, July 7, 2008

Admin Basics: One, Some, Many

As I write this post, we're on the eve of a date Windows admins are painfully familiar with: Patch Tuesday. Microsoft releases scheduled updates on the second Tuesday of each month, and because the schedule is predictable, admins can plan ahead to make sure these patches are delivered in a timely manner. I'll defer talking about patch automation until a later date, but for now I'll take this opportunity to talk about my first Admin Basics topic: One, Some, Many.

When taking any action on a computer, there's always some risk that the change you make will break something or have other unintended consequences. You can try to predict what will happen, and you can test rigorously, but chances are that something will eventually fail, and it may not be something you tested for. The larger the change, the greater the chance of something going wrong.

This leads me to the concept of One, Some, Many. It ties in nicely with patching, but it applies to all system changes.

You know you're going to make a change. You know it might have negative consequences. You test it as best you can. How can you limit your risk beyond that?

Simple: Push the change to a single system first and test. If it works, then the chances are reasonable that the change had no negative effects. From there, pick a representative sample of other systems and push that change to them... and wait. If none of the users report problems, you can then push out to a larger group. If you're running a very large group of systems, you may have several groups of "many" for various reasons. If you have a smaller number of systems, you can probably safely patch them all in one big group of "many." If you start to get failures, you can go back to the previous stage and test more rigorously with the new failure information.
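For those who like to see the idea written down, here is a minimal sketch of the staged rollout in Python. Everything in it is an assumption for illustration: the function names `plan_stages` and `staged_rollout`, the `apply_change` and `is_healthy` callbacks, and the host names are all hypothetical and not tied to any real patching tool. In practice the "health check" is often just waiting to see whether users report problems.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[str, List[str]]


def plan_stages(hosts: Sequence[str], some_size: int) -> List[Stage]:
    """Split hosts into three groups: one, some, many (hypothetical helper)."""
    one = list(hosts[:1])
    some = list(hosts[1:1 + some_size])
    many = list(hosts[1 + some_size:])
    return [("one", one), ("some", some), ("many", many)]


def staged_rollout(
    stages: Sequence[Stage],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Push a change one stage at a time.

    Stops at the first stage where any host fails its health check, so
    later (larger) stages are never touched. Returns the names of the
    stages that completed cleanly and the hosts that failed.
    """
    completed: List[str] = []
    for name, hosts in stages:
        # Push the change to every host in this stage first...
        for host in hosts:
            apply_change(host)
        # ...then check them all before moving on to a larger group.
        failed = [host for host in hosts if not is_healthy(host)]
        if failed:
            return completed, failed
        completed.append(name)
    return completed, []


if __name__ == "__main__":
    # Hypothetical host names; "pc3" stands in for a machine the patch breaks.
    hosts = ["pc1", "pc2", "pc3", "pc4", "pc5"]
    patched: List[str] = []
    done, broken = staged_rollout(
        plan_stages(hosts, some_size=2),
        apply_change=patched.append,
        is_healthy=lambda host: host != "pc3",
    )
    print("completed stages:", done)
    print("failed hosts:", broken)
    print("hosts touched:", patched)
```

In the demo, the failure shows up in the "some" stage, so the "many" group is never patched. That is the whole point: the problem is confined to a couple of machines, and you can go back and test with the new failure information.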

Why do this?

1. Vendors can't test every possible scenario, and often patches, updates, and configuration changes are poorly tested.

2. While you have a responsibility to test changes, you will have a hard time testing every scenario. It's efficient to have some users test as well.

3. The risk exposure is lower: If users experience problems in the "some" phase, you've limited the number of people having problems and enhanced your ability to troubleshoot quickly. If nothing else, you can back out their updates and go back to the previous testing step.

If you follow the One, Some, Many strategy, you can still make changes in a reasonable period of time while lessening the risk. It's a very bad feeling when you make a sweeping change and your users start screaming. This will help you not be that guy.
