If you tell enough stories, perhaps the moral will show up.

2011-03-12

Coverage doesn't cover it: why we need delinquency


I don’t think “coverage” targets and metrics for update-led systems like AV and patching do the job. It’s just as important to measure the AGE of non-compliance: the Delinquency. I want the reporting packages that come with AV and patch products to offer that number, and I’m vexed - really quite upset just now - that they don’t.

The purpose and justification of compliance targets is to ensure that "enough" machines are current without wasting effort on too many fixes. In theory, “enough” means
  • In the machine population as a whole, each piece of installed malware infects, on average, (much) less than one other machine, so outbreaks are stifled ("herd immunity"), and
  • The probability of any individual machine being compromised is acceptably small considering its sensitivity.
These factors could be measured and calculated, but in practice they aren’t. There are too many unknowns. In fact we adopt more or less arbitrary targets.

Targets for anti-malware signature systems and patching are almost always set and measured in terms of coverage: what percentage of the population is up to date, and what percentage is “out of date” or has nothing installed at all?

My view is that this suits the manufacturer’s model, where installers and updating systems work reliably and easily, the bulk of the effort lies in the initial setup, and maintenance is mostly a matter of ensuring that new builds all carry the agent. When that’s true, why not aspire to 100% and set a target of 97%? The real world is somewhat different.

Not so long ago, I faced a situation where a major vendor AV product was struggling to attain 80% coverage. The hardware was regular, the OS was XP, the builds were coming off a restricted set of images. We got plenty of high-quality support, but it was all about rectifying individual machines – try this, if that doesn’t work, try this. In that situation, the coverage metric was very unhelpful because it just doesn’t say what to do next, or how to prioritise limited effort.

Coverage can even be a negative guide. If the support teams learn to focus on the easy wins, the troublesome builds, which may never have had a current scan, will never be fixed. If you have a coverage target, it’s natural to fix the easy problems first – would you fix the AV on a dozen standard-build workstations, or that flaky build that runs that special system that nobody really understands? Are you going to reach out to the laptop users who never update? Or if you’re patching every server you have, except the domain controllers, the coverage looks fine, but the situation is dire…

The measure we want – and none of the reporting tools support it – is delinquency. Delinquency measures how long devices – servers, workstations – have been out of compliance. Admins faced with a delinquency target will be more motivated to fix the hard cases, or escalate them out of the system.

Delinquency is the percentage of machines which are out of compliance now and have not been in compliance at any time since some cut-off. If you scan compliance every Monday, then the machines out of compliance today that were first noticed the previous Monday or before are your one-week delinquents. Of those, the ones that first showed up the Monday before that, or earlier, are your (yes) two-week delinquents.
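To make the definition concrete, here is a minimal Python sketch of the calculation. The data model is my own assumption, not anything a reporting package provides: `history` maps each scan date to the set of hostnames found non-compliant on that date, and `population` is the full set of hosts. The function names and the sample hosts are hypothetical. I have also assumed that a machine which was fixed and then relapsed restarts its clock.

```python
from datetime import date, timedelta

def non_compliant_since(history, host):
    """Date of the first scan in the host's current unbroken run of
    non-compliance, or None if it is compliant in the latest scan."""
    since = None
    for scan_date in sorted(history, reverse=True):  # newest scan first
        if host not in history[scan_date]:
            break  # the run of non-compliance ends here
        since = scan_date
    return since

def delinquency(history, population, min_age):
    """Percentage of the population that is non-compliant in the latest
    scan and has been continuously so for at least min_age."""
    latest = max(history)
    delinquents = set()
    for host in population:
        since = non_compliant_since(history, host)
        if since is not None and latest - since >= min_age:
            delinquents.add(host)
    return 100.0 * len(delinquents) / len(population)

# Hypothetical sample: ten hosts, three weekly Monday scans.
population = {f"h{i}" for i in range(10)}
history = {
    date(2011, 2, 21): {"h1", "h2"},
    date(2011, 2, 28): {"h1", "h3"},              # h2 was fixed this week
    date(2011, 3, 7):  {"h1", "h2", "h3", "h4"},  # h2 relapsed, h4 is new
}

print(delinquency(history, population, timedelta(0)))        # non-compliant now
print(delinquency(history, population, timedelta(weeks=1)))  # one-week delinquents
print(delinquency(history, population, timedelta(weeks=2)))  # two-week delinquents
```

On this sample, 40% of hosts are non-compliant today, 20% are one-week delinquents (h1 and h3) and 10% are two-week delinquents (h1 only). A coverage report would show only the first figure.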

The timescales you use depend on how tightly you intend to ride rectification for that particular population. For example, for workstations, I would say that a target might be no more than 10% one-day delinquents, and zero one-weekers. I’m saying that we can accept quite a high percentage of non-compliant hosts, provided that we have confidence that all of them are getting fixed within the week – rebuilt or updated by hand if necessary.

Servers, naturally, are different. For servers, the route to live is six weeks long and we get one reboot window per month. Many rectification processes involve a reboot. That’s part of the reason why coverage targets fail harder for servers. But for delinquency, we can say that our target is zero six-week delinquents – everything has to be fixed in the first reboot cycle after it goes bad – and all of a sudden we are getting somewhere.
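The example targets above can be written down as data and checked mechanically. This is a sketch under stated assumptions: the `TARGETS` table, the `check_targets` function and the sample hosts are hypothetical, and `ages` is assumed to map each host to how long it has been continuously non-compliant (or `None` if it is compliant).

```python
from datetime import timedelta

# Example targets from the text: (minimum age of non-compliance, max allowed %).
TARGETS = {
    "workstation": [(timedelta(days=1), 10.0), (timedelta(weeks=1), 0.0)],
    "server": [(timedelta(weeks=6), 0.0)],
}

def check_targets(ages, kind):
    """Return (min_age, percent, limit, met) for each target of this kind."""
    results = []
    for min_age, limit in TARGETS[kind]:
        count = sum(1 for age in ages.values()
                    if age is not None and age >= min_age)
        percent = 100.0 * count / len(ages)
        results.append((min_age, percent, limit, percent <= limit))
    return results

# Hypothetical sample: 20 workstations, 18 compliant, two not.
workstations = {f"ws{i:02d}": None for i in range(18)}
workstations["ws18"] = timedelta(days=2)
workstations["ws19"] = timedelta(days=8)

report = check_targets(workstations, "workstation")
for min_age, percent, limit, met in report:
    status = "met" if met else "MISSED"
    print(f"{min_age.days}-day delinquents: {percent:.1f}% "
          f"(target {limit:.1f}%) {status}")
```

Here the one-day target is just met at 10%, while the single machine that has been bad for eight days breaks the zero target for one-weekers. That machine is the one to rebuild or escalate.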

I’m not against coverage reporting. It’s good, and it tells a good story at management level. And coverage targets are necessary to control some very obvious ways to game delinquency! But delinquency allows you to manage:
  • It gives you a clear “next action” – pick the oldest – to prioritise your rectification effort, and
  • It’s compatible with a zero target – you just have to set the age of non-compliance to match your environment, available effort, and risk appetite
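The "pick the oldest" rule is simple enough to sketch in a few lines. The host names and the `work_queue` function are hypothetical, and `since` is assumed to map each currently non-compliant host to the date it was first noticed.

```python
from datetime import date

def work_queue(since):
    """Non-compliant hosts ordered longest-standing first
    (ties broken by hostname so the order is stable)."""
    return sorted(since, key=lambda host: (since[host], host))

# Hypothetical sample of currently non-compliant machines.
since = {
    "ws-lab-3": date(2011, 3, 7),
    "laptop-17": date(2011, 2, 21),
    "ws-042": date(2011, 2, 28),
}

queue = work_queue(since)
print(queue)
```

The head of the queue is the laptop that has been out of compliance longest, which is exactly the sort of machine a coverage target lets you ignore.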

On the downside, auditors tend to panic or look blank when you describe it to them. More seriously, it requires a history, and I guess it’s that dependency which means that it seems to be impossible to get figures out of the reporting packages.
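For what it's worth, the history need not be large. A minimal sketch, assuming you can get the list of non-compliant hosts out of each scan: keep one "first noticed" date per non-compliant host and update it after every scan. The `update_since` function and the sample data are hypothetical, and this is not how any particular product stores its data.

```python
from datetime import date

def update_since(since, non_compliant_now, scan_date):
    """Return the new host -> first-noticed mapping after one scan.
    Hosts still non-compliant keep their original date, newly
    non-compliant hosts get scan_date, and fixed hosts drop out."""
    return {host: since.get(host, scan_date) for host in non_compliant_now}

# Hypothetical sample: three weekly scans applied in order.
since = {}
since = update_since(since, {"h1", "h2"}, date(2011, 2, 21))
since = update_since(since, {"h1", "h3"}, date(2011, 2, 28))
since = update_since(since, {"h1", "h2", "h3", "h4"}, date(2011, 3, 7))

for host in sorted(since):
    print(host, since[host])
```

After the third scan, h1 still carries its original date of 21 February, while h2, which was fixed and then went bad again, starts afresh from 7 March.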

And that’s why I’m ranting! I’ve just had to give up delinquency reporting as the hand-built tool I used became too hard to maintain with a change of platform. I’ve had to move back to checking coverage and keeping private little lists of troublemakers, and it feels like a real step backwards.