This Atlantic article is a bit highfalutin’, but if you make it past all the metaphors at the beginning, you’ll get to see some good examples of a very important idea: you can’t separate a model from the data that goes into the model. In particular, constraints on the input data become constraints on the model.
A story: my first non-academic job was Modeling Guy at a start-up that was building technology to generate movie recommendations, similar to Netflix or Amazon. (Just so you know how long ago this was, we were going to have recommendation kiosks at video stores! Then the Internet crash happened.) Recommendation models all work in pretty much the same way: the model finds people whose taste (in movies, books, music, whatever) is similar to yours, and recommends movies to you that those people have liked but you might not have seen yet. The basic input data to a model like this is preference information. In less fancy language, you need to know what movies different people like.
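To make the "find people with similar taste" idea concrete, here's a minimal sketch of user-based collaborative filtering. All the names and ratings are made up, and this is nothing like a production system; it just shows the basic loop: measure similarity between users, then score each unseen movie by how much similar users liked it.

```python
# Toy user-based collaborative filtering. Ratings are 1-5; all data is invented.
from math import sqrt

ratings = {
    "alice": {"Airplane!": 5, "Clue": 4, "Chinatown": 2},
    "bob":   {"Airplane!": 5, "Clue": 5, "Vertigo": 4},
    "carol": {"Chinatown": 5, "Vertigo": 5, "Clue": 1},
}

def similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    na = sqrt(sum(ratings[a][m] ** 2 for m in common))
    nb = sqrt(sum(ratings[b][m] ** 2 for m in common))
    return dot / (na * nb)

def recommend(user):
    """Rank movies the user hasn't seen by similarity-weighted ratings."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = similarity(user, other)
        for movie, r in ratings[other].items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # → ['Vertigo']
```

Notice that everything here runs off the ratings dictionary: the quality of the recommendations can only be as good as the preference data you feed in.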
One thing I wanted to account for in my model was that there are multiple movie genres, and people might have similar tastes in some but not others. (You and I could both like pretty much the same comedies, but maybe you like musicals and I hate them. No, really, I hate them.) To make this work, I needed enough data to be able to model preferences in each genre, not just overall. It wasn’t enough to know, for each person, 10 or 20 movies that they liked; I needed to know a few comedies each person liked, a few mysteries, a few musicals (if any), etc. Which meant I needed a larger dataset overall, because there are a lot of genres.
Now, it wasn’t so hard to collect this data. We made a long list of movies, made sure we included a decent number from every genre we wanted to cover, and had people rate the movies on our list. (You can give people a long list, because they usually still remember a movie well enough to rate it long after they saw it.) Long story short, I had enough data to do what I wanted to do — model each genre separately — and my model seemed to work pretty well. (We tested against models that lumped all the movies together, and mine did better.) What I want to highlight is that if I hadn’t been able to collect as much data, my fine-grained approach probably wouldn’t have worked at all. If I’d had only a small dataset, I wouldn’t have been able to say anything about what was going on inside each genre, and grouping people based on all the movies lumped together would have been a better bet. The model wouldn’t have been very precise, but it would have used the little data I did have more efficiently.
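Here's a sketch of the per-genre idea, and of the fallback you're forced into when the data is thin. Everything is hypothetical — the names, the ratings, and the threshold — but the structure is the point: you can only compare two users within a genre if they've rated enough of the same movies in that genre; otherwise all you can do is lump everything together.

```python
# Per-genre similarity with a small-data fallback. All data is invented.
from math import sqrt

MIN_COMMON = 3  # below this, a genre-level similarity is just noise

ratings = {
    "alice": {"Airplane!": 5, "Clue": 4, "Ghostbusters": 5, "Grease": 1},
    "bob":   {"Airplane!": 5, "Clue": 5, "Ghostbusters": 4,
              "Grease": 5, "Cabaret": 4},
}
genre_of = {
    "Airplane!": "comedy", "Clue": "comedy", "Ghostbusters": "comedy",
    "Grease": "musical", "Cabaret": "musical",
}

def cosine(a, b, movies):
    """Cosine similarity restricted to a given set of movies."""
    common = [m for m in movies if m in ratings[a] and m in ratings[b]]
    if not common:
        return None
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    na = sqrt(sum(ratings[a][m] ** 2 for m in common))
    nb = sqrt(sum(ratings[b][m] ** 2 for m in common))
    return dot / (na * nb)

def genre_similarity(a, b, genre):
    """Compare within a genre if there's enough data; else lump everything."""
    movies = [m for m in genre_of if genre_of[m] == genre]
    common = [m for m in movies if m in ratings[a] and m in ratings[b]]
    if len(common) < MIN_COMMON:
        # Not enough genre-specific overlap: fall back to all movies together.
        return cosine(a, b, list(genre_of))
    return cosine(a, b, movies)
```

With this toy data, alice and bob share three comedies, so the comedy comparison is genre-specific; they share only one musical, so that comparison silently falls back to the lumped model. Multiply that by every genre and every pair of users, and you can see why the fine-grained model needs so much more data.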
The upshot is that models depend on data, and data availability (quantity and quality) is always a real-world issue, not just a math issue. A model may make perfect sense in theory, but work badly in practice if reality gets in the way of gathering the data you need to run the model.
Education data is a great example here. There’s a class of models for measuring teacher and school performance, known broadly as value-added models (VAM). The idea is to try to isolate how much “value” a teacher or school adds to students’ learning, where learning is usually measured through test scores. Regardless of what you think about standardized testing, you should know that the modeling here is extremely challenging! The problem is that it’s very hard to break out the impact of a teacher or school from all the other, “external” factors that might affect a kid’s test scores (genetics, at-home support and preparation, attendance, schools attended in the past, just to name a few). To do this, you need a model to estimate a kid’s “expected” test score based on all the external factors. (The “value added” by the school or teacher is then supposed to be captured as the difference between this model-based expected score and the actual score.)
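Here's a toy version of that expected-vs-actual subtraction, assuming a plain least-squares fit on two made-up "external factor" columns (a prior-year score and a lunch-program flag as an income proxy). Real VAMs are far more elaborate, and the numbers here are invented — the point is only to show where the "value added" number comes from.

```python
# Toy value-added calculation: expected score from external factors,
# value added = actual minus expected. All data is invented.
import numpy as np

# Rows are students; columns are prior-year score and lunch-program flag.
X = np.array([
    [620, 1],
    [710, 0],
    [540, 1],
    [680, 0],
    [590, 1],
])
actual = np.array([640, 730, 520, 700, 630])  # this year's scores

# Fit expected = b0 + b1*prior + b2*lunch via least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, actual, rcond=None)
expected = A @ coef

# The "value added" attributed to the teacher/school is the residual.
value_added = actual - expected
print(np.round(value_added, 1))
```

Every gap in the external-factor columns flows straight into `expected`, and from there into `value_added` — which is why missing data isn't a side issue here; it's the whole ballgame.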
To build such a model, you need to model the external factors, which means you need a huge amount of input data. You certainly want a history of past test scores (hard to collect if a kid has moved around between schools where different tests are given; hard to interpret even if you happen to have the data). You likely want to know something about income (typically, eligibility for reduced-price school lunch programs is used as a proxy for this; at my kids’ school, this data was apparently wrong for a couple of years). And you probably want to know something about support at home, out-of-school activities, and lots of other variables — well, good luck! The worst part is that the data gaps tend to be the biggest in the poorest schools (fewer resources to collect data, more kids going in and out, making the data problem harder to begin with). These are precisely the schools where it’s most important to model the challenges the kids face — and yet the data isn’t there to do it.
There’s starting to be a backlash against standardized testing, and against measuring teachers and schools by the results of those standardized tests. And there’s also a backlash to the backlash, with supporters of the VAM framework arguing that it’s the most objective measure of teacher performance and kids’ progress. But models and measures based on data that’s not there, and can’t be filled in, aren’t objective at all.