One winter night every year, New York City tries to count how many homeless people are out on its streets. (This doesn’t include people in shelters, because shelters already keep records.) It’s done in a pretty low-tech way: the Department of Homeless Services recruits a bunch of volunteers, trains them, and sends them out to find and count people.
How do you account for the fact that you probably won’t find everyone? Plant decoys! The city sends out another set of volunteers to pretend to be homeless, to see if they actually get counted. (My social worker wife gets glamorous opportunities like this sent to her on a regular basis.) Once all the numbers are in, you can estimate the total number of homeless as follows:
- Actual homeless counted = Total people counted - Decoys counted
- Percent counted estimate = Decoys counted / Total decoys
- Homeless estimate = Actual homeless counted / Percent counted estimate
For example, say you counted 4080 people total out in the streets. And say you sent out 100 decoys and 80 of them got counted. Then the number of true homeless you counted is 4000 (= 4080 - 80). Your count seems to capture 80% of the people out there, so your estimate for the true number of homeless is 4000 / 80% = 5000 (in other words, 5000 is the number that 4000 is 80% of).
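If you'd rather see the arithmetic as code, here is a minimal Python sketch of the three steps. The function name `estimate_homeless` and its parameter names are made up for illustration; they are not anything the city actually uses.

```python
def estimate_homeless(total_counted, decoys_counted, total_decoys):
    """Decoy-adjusted street count (illustrative sketch, hypothetical names)."""
    if decoys_counted <= 0:
        # With no decoys found there is no capture rate to divide by.
        raise ValueError("need at least one counted decoy")

    # Step 1: remove the decoys from the raw count.
    actual_counted = total_counted - decoys_counted

    # Steps 2 and 3 combined: dividing by (decoys_counted / total_decoys)
    # is the same as multiplying by total_decoys / decoys_counted.
    return actual_counted * total_decoys / decoys_counted


# The worked example from the text: 4080 counted, 80 of 100 decoys found.
print(estimate_homeless(4080, 80, 100))  # 5000.0
```

Folding steps 2 and 3 into one expression just avoids carrying around an intermediate fraction like 0.8; the result is the same as doing the division in two stages.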
But it’s probably not exactly 5000, for two reasons:
- Random error. You happened to count 80% of your decoys, but on another day, you might have counted 78% of your decoys, or 82%, or some other number. In other words, there’s natural randomness in your model, which leads to uncertainty in your answer.
- Systematic error. When you count the homeless, you have some idea of where they’re likely to be. But you don’t really know. And your decoys are likely going to plant themselves in the same general places where you think the homeless are. Put another way, if there are a bunch of homeless in an old abandoned subway station that you have no idea exists, you’re not going to count them. And your decoys won’t know to plant themselves there, so you won’t have any idea that you’re not counting them.
The first kind of error is error inside your model. You can analyze it, and treat it statistically by estimating a confidence interval, e.g., I’m estimating that there are 5000 homeless out there, and there’s a 95% chance that the true number is somewhere between 4500 and 5500, say. The second kind of error is external; it comes from stuff that your model doesn’t capture. It’s more worrying because you don’t — and can’t — know how much of it you have. But at least be aware that almost any model has it, and that even confidence intervals don’t incorporate it.
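To get a feel for the first kind of error, you can simulate it. The sketch below is one rough way to do that, not a description of how the city computes anything. It assumes that each decoy is found independently with the same probability (the observed 80%), and it holds the 4000 counted homeless fixed, so it only captures the randomness in the decoy rate. The name `simulate_interval` and its parameters are hypothetical.

```python
import random


def simulate_interval(total_counted, decoys_counted, total_decoys,
                      n_sims=10_000, seed=0):
    """Rough 95% interval for the estimate, from random error only.

    Assumption (not from the original method): every decoy is found
    independently with probability decoys_counted / total_decoys.
    """
    rng = random.Random(seed)
    actual_counted = total_counted - decoys_counted
    rate = decoys_counted / total_decoys

    estimates = []
    for _ in range(n_sims):
        # Replay the night: how many decoys get found this time?
        found = sum(rng.random() < rate for _ in range(total_decoys))
        if found == 0:
            continue  # no capture rate to divide by; skip this replay
        estimates.append(actual_counted * total_decoys / found)

    # Take the middle 95% of the simulated estimates.
    estimates.sort()
    low = estimates[int(0.025 * len(estimates))]
    high = estimates[int(0.975 * len(estimates)) - 1]
    return low, high


low, high = simulate_interval(4080, 80, 100)
print(f"roughly 95% of simulated estimates fall between {low:.0f} and {high:.0f}")
```

Notice what the simulation cannot do: the abandoned subway station never shows up in it, because nothing in the decoy data knows the station exists. That is the second kind of error, and no amount of resampling will surface it.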