The Models Were Telling Us Trump Could Win

Nate Silver got the election right.

Modeling this election was never about win probabilities (i.e., saying that Clinton is 98% likely to win, or 71% likely to win, or whatever). It was about finding a way to convey meaningful information about uncertainty and about what could happen. And, despite the not-so-great headline, this article by Nate Silver does a pretty impressive job.

First, let’s have a look at what not to do. This article by Sam Wang (Princeton Election Consortium) explains how you end up with a win probability of 98-99% for Clinton. He starts by aggregating the state polls, and figures that if they’re right on average, then Clinton wins easily (with over 300 electoral votes, I believe). Then he looks for a way to model the uncertainty. He asks, reasonably: what happens if the polls are all off by a given amount? And he answers the question, again reasonably: if Trump overperforms his polls by 2.6%, the election becomes a toss-up. If he overperforms by more, he’s likely to win.

But then you have to ask: how much could the polls be off by? And this is where Wang goes horribly wrong.

The uncertainty here is virtually impossible to model statistically. US presidential elections don’t happen that often, so there’s not much direct history, plus the challenges of polling are changing dramatically as fewer and fewer people are reachable via listed phone numbers. Wang does say that in the last three elections, the polls have been off by 1.3% (Bush 2004), 1.2% (Obama 2008), and 2.3% (Obama 2012). So polls being off by 2.6% doesn’t seem crazy at all.

For some inexplicable reason, however, Wang ignores what is right in front of his nose, picks a tiny standard error parameter out of the air, plugs it into his model, and basically says: well, the polls are very unlikely to be off by very much, so Clinton is 98-99% likely to win.

Always be wary of models, especially models of human behavior, that give probabilities of 98-99%. Always ask yourself: am I anywhere near 98-99% sure that my model is complete and accurate? If not, STOP, cross out your probabilities because they are meaningless, and start again.
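To see how much a headline probability can hinge on that one parameter, here is a minimal sketch. It is not Wang's actual model: it assumes the nationwide polling error is normally distributed around zero, and that Trump wins exactly when he beats his polls by more than the 2.6-point toss-up margin quoted above. The function name and the sigma values are hypothetical, chosen only for illustration.

```python
import math

TOSS_UP_MARGIN = 2.6  # from the text: a 2.6-point polling miss makes the race a toss-up


def prob_error_exceeds(threshold, sigma):
    """P(error > threshold) when error ~ Normal(0, sigma). Assumed model, illustration only."""
    z = threshold / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical standard deviations for the polling error, in percentage points.
for sigma in (1.0, 2.0, 3.0, 4.0):
    p_trump = prob_error_exceeds(TOSS_UP_MARGIN, sigma)
    print(f"sigma = {sigma:.1f}  ->  P(Clinton wins) is about {1 - p_trump:.1%}")
```

Under these assumptions, a sigma near 1 point puts Clinton above 99%, while a sigma near 4 points puts her at roughly three in four. Same polls, same model; the only thing that changed is a number nobody can really pin down.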

How do you come up with a meaningful forecast, though? Once you accept that there’s genuine uncertainty in the most important parameter in your model, and that trying to assign a probability is likely to range from meaningless to flat-out wrong, how do you proceed?

Well, let’s look at what Silver does in this article. Instead of trying to estimate the volatility as Wang does (and as Silver also does on the front page of his web site; people just can’t help themselves), he gives a careful analysis of some possible specific scenarios. What are some good scenarios to pick? Well, maybe we should look at recent cases of when nationwide polls have been off. OK, can you think of any good examples? Hmm, I don’t know, maybe…

Aiiieeee!!!!

Look at the numbers in that Sun cover. Brexit (Leave) won by 4%, while the polls before the election were essentially tied, with Remain perhaps enjoying a slight lead. That’s a polling error of at least 4%. And the US poll numbers are very clear: if Trump overperforms his polls by 4%, he wins easily.

In financial modeling, where you often don’t have enough relevant history to build a good probabilistic model, this technique — pick some scenarios that seem important, play them through your model, and look at the outcomes — is called stress testing. Silver’s article does a really, really good job of it. He doesn’t pretend to know what’s going to happen (we can’t all be Michael Moore, you know), but he plays out the possibilities, makes the risks transparent, and puts you in a position to evaluate them. That is how you’re supposed to analyze situations with inherent uncertainty. And with the inherent uncertainty in our world increasing, to say the least, it’s a way of thinking that we all better start becoming really familiar with.
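Here is a rough sketch of what a stress test looks like in code. Everything in the toy map below is invented for illustration (it is not real 2016 polling, and it is not Silver's model); the only numbers taken from the text are the 2.3-point miss from 2012 and the 4-point Brexit miss. The names TOY_STATES, SCENARIOS, and clinton_electoral_votes are hypothetical.

```python
# Toy map: (electoral votes, Clinton's polling lead in points; negative means Trump leads).
# All values are made up so that the totals add to 538.
TOY_STATES = {
    "Safe Clinton bloc": (200, 12.0),
    "Lean Clinton A": (40, 3.5),
    "Lean Clinton B": (35, 2.5),
    "Toss-up C": (30, 1.0),
    "Lean Trump D": (33, -2.0),
    "Safe Trump bloc": (200, -12.0),
}

# Scenarios: how many points Trump beats his polls by, applied uniformly everywhere.
SCENARIOS = {
    "polls exactly right": 0.0,
    "2012-sized miss": 2.3,
    "Brexit-sized miss": 4.0,
}


def clinton_electoral_votes(states, trump_overperformance):
    """Shift every state toward Trump by the same amount and total Clinton's electoral votes."""
    return sum(ev for ev, lead in states.values() if lead - trump_overperformance > 0)


for name, shift in SCENARIOS.items():
    ev = clinton_electoral_votes(TOY_STATES, shift)
    # 270 wins; the sketch ignores the 269-269 tie case.
    winner = "Clinton" if ev >= 270 else "Trump"
    print(f"{name:20s} Clinton EV = {ev:3d}  ->  {winner}")
```

On this made-up map, Clinton gets 305 electoral votes if the polls are right, 275 after a 2012-sized miss, and 200 after a Brexit-sized miss. No probabilities anywhere: just scenarios and what each one would mean.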

The models were plain as day. What the numbers were telling us was that if the polls were right, Clinton would win easily, but if they were underestimating Trump’s support by anywhere near a Brexit-like margin, Trump would win easily. Shouldn’t that have been the headline? Wouldn’t you have liked to have known that? Isn’t it way more informative than saying that Clinton is 98% or 71% likely to win based on some parameter someone plucked out of thin air?

We should have been going into this election terrified.

Probability For Dummies (And We’re All Dummies)

Sometimes it feels like probability was made up just to trip you up. My undergraduate advisor Persi Diaconis, who started out as a magician and often works on card shuffling and other problems related to randomness, used to say that our brains weren’t wired right for doing probability. Now that I (supposedly!) know a little more about probability than I did as a student, Persi’s statement rings even truer.

I spent a little time this weekend thinking about why probability confuses us so easily. I don’t have all the answers, but I did end up making up a story that I found pretty illuminating. At least, I learned a few things from thinking it through. It’s based on what looks like a very simple example, first popularized by Martin Gardner, but it can still blow your mind a little bit. I actually meant to have a fancier example, but my basic one ended up being more than enough for what I wanted to get across. (Some of these ideas, and the Gardner connection, are explored in a complementary way in this paper by Tanya Khovanova.) Here goes.

Prologue. Say you go to a school reunion, and you find yourself at a dimly-lit late evening reception, talking to your old friend Robin. You haven’t seen each other for years, you’re catching up on family, and you hear that Robin has two children. Maybe the reunion has you thinking back to the math classes you took, or maybe you’ve just been drinking too much, but for some reason, you start wondering whether Robin’s children have the same gender (two boys or two girls) or different genders (one of each). Side note: if you’ve managed to stay sober, this may be the point at which you realize that you’ve not only wandered into a reunion you’re barely interested in, you’ve wandered into a math problem you’re barely… um, well, anyway, let’s keep going.

The gender question is pretty easy to answer, at least in terms of what’s more and less likely. Assuming that any one child is as likely to be a girl as a boy (not quite, but let’s ignore that), and assuming that having one kid be a girl or boy doesn’t change the likelihood of having your other kid be a girl or boy (again, probably not exactly true, but whatever), we find there are four equally likely scenarios (I’m listing the oldest kid first):

(Girl, Girl)      (Girl, Boy)     (Boy, Girl)     (Boy, Boy)

Each of these scenarios has probability 25%. There are two scenarios with two kids of the same sex (total probability 50%), and two scenarios with two kids of opposite sexes (total probability also 50%). Easy peasy.
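If you would rather let a computer do the counting, the four scenarios can be enumerated directly. This is just a sketch of the setup above, using the same two simplifying assumptions (girls and boys equally likely, the two kids independent); the variable names are mine.

```python
from fractions import Fraction
from itertools import product

# Each family is (older, younger); the four combinations are assumed equally likely.
families = list(product(["Girl", "Boy"], repeat=2))
p_each = Fraction(1, len(families))

p_same = sum(p_each for older, younger in families if older == younger)
p_opposite = sum(p_each for older, younger in families if older != younger)

print(families)
print("same gender:", p_same, " opposite genders:", p_opposite)
```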

But things won’t stay simple for long, because you’ve not only wandered into a school reunion and a math problem, you’ve also wandered into a…

Really. So you’re at the reunion, still talking to Robin, only you might be sober, or you might be drunk. Which is it?

Sober Version: You and Robin continue your nice lucid conversation, and Robin says: “My older kid is a girl.” Does the additional information change the gender probabilities (two of the same vs. opposites) at all?

This one looks easy too, especially given that you’re sober. Now that we know the older kid is a girl, things come down to the gender of the younger kid. We know that having a girl and having a boy are equally likely, so two of the same vs. opposite genders should still be 50-50. In terms of the scenarios above, we’ve ruled out the last two scenarios and have a 50-50 choice between the first two.

But now let’s turn the page to the…

Drunk Version: You and Robin have both had more than a little wine, haven’t you? Maybe Robin’s starting to mumble a bit, or maybe you’re not catching every word Robin says any more, but in any case, in this version what you heard Robin say was, “My umuhmuuh kid is a girl.” So Robin might have said older or younger, but in the drunk version, you don’t know which. What are the probabilities now? Are they different from the sober version?

Argument for No: Robin might have said, “My older kid is a girl,” in which case you rule out the last two scenarios as above and conclude the probabilities are still 50-50. Or Robin might have said, “My younger kid is a girl,” in which case you would rule out the second and fourth scenarios but the probabilities would again be 50-50. So it’s 50-50 no matter what Robin said. It doesn’t make a difference that you didn’t actually hear what it was.

Argument for Yes: Look at the four possible scenarios above. All we know now is that one of the kids is a girl, i.e., we’ve only ruled out (Boy, Boy). The other three are still possible, and still equally likely. But now we have two scenarios where the kids have opposite genders, and only one where they have the same gender. So now it’s not 50-50 anymore; it’s 2/3-1/3 in favor of opposite genders.

Both arguments seem pretty compelling, don’t they? Maybe you’re a little confused? Head spinning a little bit? Well, I did tell you this was the drunk version!

To try to sort things out, let’s step back a little bit. Drink a little ice water and take a look around the room. Let’s say you see 400 people at the reunion who have exactly two kids. I won’t count spouses, and I’ll assume that none of your classmates got together to have kids. That keeps things simple: 400 classmates with a pair of kids means 400 pairs of kids. On average, there’ll be 100 classmates for each of the four kid gender combinations. One of these classmates is your friend Robin.

Now imagine that each of your classmates is drunkenly telling a friend about which of their kids are girls. What will they say?

• The 100 in the (Boy, Boy) square would certainly never say, “My umuhmuuh kid is a girl.” We can forget about them.
• The 100 in the (Girl, Boy) square would always say, “My older kid is a girl.”
• The 100 in the (Boy, Girl) square would always say, “My younger kid is a girl.”
• The 100 in the (Girl, Girl) square could say either. There’s no reason to prefer one or the other, especially since everyone is drunk. So on average, 50 of them will say “My older kid is a girl,” and the other 50 will say, “My younger kid is a girl.”

Altogether, there should be 150 classmates who say their older kid is a girl, 150 who say their younger kid is a girl, and 100 who don’t say anything because they have no girl kids.

In the drunk version, where we don’t know what Robin said, Robin could be any of the 150 classmates who would say “My older kid is a girl.” In that case, 100 times out of 150, Robin’s two kids have opposite genders. Or Robin could be any of the 150 classmates who would say, “My younger kid is a girl,” and in that case again, 100 times out of 150, Robin’s two kids have opposite genders.

This analysis is consistent with the Argument for Yes, and leads to the same conclusion: there is a 2-in-3 chance (200 times out of 300) that Robin’s kids have opposite genders. But, it seems to agree with the spirit of the Argument for No as well! It looks like knowing Robin was talking about the older kid actually didn’t add any new information: that 2-in-3 chance would already hold if Robin had soberly said “My older kid is a girl” OR if Robin had just as soberly said “My younger kid is a girl.”
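The head count in the room can be written out as a small table and tallied. This sketch uses the average counts from the text (100 classmates per square, and two-girl parents splitting evenly between the two statements); the names statements and opposite_share are hypothetical.

```python
from fractions import Fraction

# For each (older, younger) square: how many of its 100 classmates make each statement.
# Assumption from the text: a parent of two girls says "older" or "younger" half the time each.
statements = {
    ("Girl", "Girl"): {"older": 50, "younger": 50},
    ("Girl", "Boy"): {"older": 100},
    ("Boy", "Girl"): {"younger": 100},
    ("Boy", "Boy"): {},  # never says "my ... kid is a girl"
}


def opposite_share(said):
    """Return (how many classmates make this statement, fraction of them with one girl and one boy)."""
    total = 0
    opposite = 0
    for (older, younger), counts in statements.items():
        n = counts.get(said, 0)
        total += n
        if older != younger:
            opposite += n
    return total, Fraction(opposite, total)


for said in ("older", "younger"):
    total, share = opposite_share(said)
    print(f'"My {said} kid is a girl": {total} classmates, opposite genders {share} of the time')
```

Both statements come out the same way: 150 classmates each, with opposite genders 2/3 of the time.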

But now something seems really off. Because now it’s starting to look like our analysis of the sober version, apparently the simplest thing in the world, was actually incorrect. In other words, now it seems like we’re saying that finding out Robin’s older kid was a girl actually didn’t leave the gender probabilities at 50-50 like we thought. Which is just… totally… nuts. (And not at all sober.) Isn’t it?

Not necessarily.

Here’s the rub. In the sober version, the conversation could actually have gone a couple different ways:

Sober Version 1.

ROBIN: I’ve got two. My older kid is a junior in high school, plays guitar, does math team, runs track, and swims.

YOU: That’s great. Girls’ or boys’ track? The girls’ track team at my kids’ school is really competitive.

ROBIN: Girls’ track. My older kid is a girl.

Sober Version 2.

YOU: I teach math and science, and I’m really interested in helping girls succeed.

ROBIN: That’s great! Actually, if you’re interested in girls doing math, you might be interested in something that happened to one of my kids. My older kid is a girl, and…

Comparing Versions. In both versions, it looks like you ended up with the same information (Robin’s older kid is a girl). But the conclusions you get to draw are totally different!

Let’s view things in terms of your 400 classmates in the room. In Sober Version 1, the focus is on your classmate’s older kid. The key point is that, in this version of the conversation, in the 100 scenarios in which both of your classmate’s kids are girls, you would hear “my older kid is a girl” in all of them. Of course in the 100 (Girl, Boy) scenarios, you would hear “my older kid is a girl” as well. That makes for 200 “my older kid is a girl” scenarios, 100 of which are same-gender scenarios. The likelihood that both kids are girls is 50-50.

Whereas in Sober Version 2, the focus is on girls. In the 100 scenarios in which both of your classmate’s kids are girls, you should expect to hear a story about the older daughter about half the time, and the younger daughter the other half. (Perhaps not exactly, because the older kid has had more time to have experiences that become the subject of stories, but I’m ignoring this.) Combining this with the 100 (Girl, Boy) scenarios, we get 150 total “my older kid is a girl” scenarios. Only 50 of them are same-gender scenarios, and the likelihood that both kids are girls is only 1-in-3.
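The two sober versions differ in exactly one number: how often a parent of two girls ends up saying “my older kid is a girl.” A short sketch makes that explicit. The function name and its parameter are hypothetical; the probabilities 1 and 1/2 are the ones described in the two versions above.

```python
from fractions import Fraction


def p_both_girls_given_older_statement(p_says_older_if_two_girls):
    """P(two girls | you hear "my older kid is a girl").

    (Girl, Boy) parents always produce the statement; (Girl, Girl) parents
    produce it with the given probability; the other two squares never do.
    """
    girl_girl = Fraction(1, 4) * p_says_older_if_two_girls
    girl_boy = Fraction(1, 4)
    return girl_girl / (girl_girl + girl_boy)


# Version 1: the conversation is about the older kid, so two-girl parents always say it.
version_1 = p_both_girls_given_older_statement(Fraction(1))
# Version 2: the conversation is about girls, so two-girl parents say it half the time.
version_2 = p_both_girls_given_older_statement(Fraction(1, 2))

print("Sober Version 1:", version_1)
print("Sober Version 2:", version_2)
```

Same sentence from Robin, different conversation around it, and the answer moves from 1/2 to 1/3.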

Why Probability Makes Us All Dummies. Probability is about comparing what happened with what might have happened. Math people have a fancy name for what might have happened: they call it the state space. What we see in this example is that when you talk about everyday situations in everyday language, it can be very tricky to pin down the state space. It’s hard to keep ambiguities out.

Even the Sober Version, which sounds very simple at first, turns out to have an ambiguity that we didn’t consider. And when we passed from the Sober Version to the Drunk Version, we got confused because we implicitly took the Sober Version to be Version 1, with a 200-person state space, while we took the Drunk Version to be like Version 2, with a 150-person state space. In other words, in interpreting “My older kid is a girl” vs. “One of my kids is a girl,” we fell into different assumptions about the background. I think this is what it means that our brains aren’t wired right to do probability: it’s incredibly easy for them to miss what the background assumptions are. And when we change the state space without realizing it by changing those background assumptions, we get paradoxes.

Note: while I framed what I’ve been calling the Drunk Version (one of my kids is a girl) in a way that makes Version 2 the natural interpretation, it can also be reframed to sound more like Version 1. In that case, the Argument for No in the Drunk Version is fully correct, and the probabilities are 50-50. From a quick online survey, I’ve found this in a few places, including Wikipedia and the paper I linked at the start. I haven’t seen anyone else note that what I’ve been calling the Sober Version (my older kid is a girl) can also be framed in multiple ways. Just more proof that it’s really easy to miss background assumptions!

Another point of view on this is in terms of information. The Sober vs. Drunk versions confused us because it looked like we had equivalent information (one of the kids is a girl) but ended up with different outcomes. But we didn’t actually have equivalent information; in fact, in the Sober version, there was an essential ambiguity in what information we had! The point here is that just knowing the answer to a question (my older kid is a girl) usually isn’t the full story when it comes to probability problems. We need to know the question (Is your older kid a girl vs. Is one of your kids a girl) as well. The relevant information is a combination of a question and a statement that answers it, not a statement (or set of statements) floating on its own.

Nick Kristof is not Smarter than an 8th Grader

About a week ago, Nick Kristof published this op-ed in the New York Times. Entitled “Are You Smarter Than an 8th Grader?”, the piece discusses American kids’ underperformance in math compared with students from other countries, as measured by standardized test results. Kristof goes over several questions from the 2011 TIMSS (Trends in International Mathematics and Science Study) test administered to 8th graders, and highlights how American students did worse than students from Iran, Indonesia, Ghana, Palestine, Turkey, and Armenia, as well as traditional high performers like Singapore. “We all know Johnny can’t read,” says Kristof, in that finger-wagging way perfected by the current cohort of New York Times op-ed columnists; “it appears that Johnny is even worse at counting.”

The trouble with this narrative is that it’s utterly, demonstrably false.

My friend Jordan Ellenberg pointed me to this blog post, which highlights the problem. In spite of Kristof’s alarmism, it turns out that American eighth graders actually did quite well on the 2011 TIMSS. You can see the complete results here. Out of 42 countries tested, the US placed 9th. If you look at the scores by country, you’ll see a large gap between the top 5 (Korea, Singapore, Taiwan, Hong Kong, and Japan) and everyone else. After that gap comes Russia, in 6th place, then another gap, then a group of 9 closely bunched countries: Israel, Finland, the US, England, Hungary, Australia, Slovenia, Lithuania, and Italy. Those made up, more or less, the top third of all the countries that took the test. Our performance isn’t mind-blowing, but it’s not terrible either. So what the hell is Kristof talking about?

You’ll find the answer here, in a list of 88 publicly released questions from the test (not all questions were published, but this appears to be a representative sample). For each question, a performance breakdown by country is given. When I went through the questions, I found that the US placed in the top third (top 14 out of 42 countries) on 45 of them, the middle third on 39, and the bottom third on 4. This seems typical of the kind of variance usually seen on standardized tests. US kids did particularly well on statistics, data interpretation, and estimation, which have all gotten more emphasis in the math curriculum lately. For example, 80% of US eighth graders answered this question correctly:

Which of these is the best estimate of (7.21 × 3.86) / 10.09?

(A) (7 × 3) / 10   (B) (7 × 4) / 10   (C) (7 × 3) / 11   (D) (7 × 4) / 11

More American kids knew that the correct answer was (B) than Russians, Finns, Japanese, English, or Israelis. Nice job, kids! And let’s give your teachers some credit too!
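For anyone who wants to double-check the answer key, the arithmetic takes a few lines. This is only a check of the question as quoted above; the variable names are mine.

```python
exact = (7.21 * 3.86) / 10.09

# The four answer choices from the question.
options = {
    "A": (7 * 3) / 10,
    "B": (7 * 4) / 10,
    "C": (7 * 3) / 11,
    "D": (7 * 4) / 11,
}

# The best estimate is the option closest to the exact value.
best = min(options, key=lambda choice: abs(options[choice] - exact))

print(f"exact value: {exact:.3f}")
for choice, value in options.items():
    print(f"({choice}) {value:.3f}")
print("closest option:", best)
```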

But Kristof isn’t willing to do either. He has a narrative of American underperformance in mind, and if the overall test results don’t fit his story, he’ll just go and find some results that do! Thus, the examples in his column. Kristof literally went and picked the two questions out of 88 on which the US did the worst, and highlighted those in the column. (He gives a third example too, a question in which the US was in the middle of the pack, but the pack did poorly, so the US’s absolute score looks bad.) And, presto! — instead of a story about kids learning stuff and doing decently on a test, we have yet another hysterical screed about Americans “struggling to compete with citizens of other countries.”

Kristof gives no suggestions for what we can actually do better, by the way. But he does offer this helpful advice:

Numeracy isn’t a sign of geekiness, but a basic requirement for intelligent discussions of public policy. Without it, politicians routinely get away with using statistics, as Mark Twain supposedly observed, the way a drunk uses a lamppost: for support rather than illumination.

So do op-ed columnists, apparently.

You Can’t Separate Models From Data

This Atlantic article is a bit highfalutin’, but if you make it past all the metaphors at the beginning, you’ll get to see some good examples of a very important idea: you can’t separate a model from the data that goes into the model. In particular, constraints on the input data become constraints on the model.

A story: my first non-academic job was Modeling Guy at a start-up that was building technology to generate movie recommendations, similar to Netflix or Amazon. (Just so you know how long ago this was, we were going to have recommendation kiosks at video stores! Then the Internet crash happened.) Recommendation models all work in pretty much the same way: the model finds people whose taste (in movies, books, music, whatever) is similar to yours, and recommends movies to you that those people have liked but you might not have seen yet. The basic input data to a model like this is preference information. In less fancy language, you need to know what movies different people like.
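To make the idea concrete, here is a minimal sketch of that kind of recommender. It is not the model I built back then; it assumes the only input is a set of liked movies per person and uses a simple overlap score (Jaccard similarity) that I picked for illustration. All the people and movie names are made up.

```python
# Toy preference data: person -> set of movies they liked. Invented for illustration.
likes = {
    "you": {"Comedy A", "Comedy B", "Mystery A"},
    "pat": {"Comedy A", "Comedy B", "Comedy C", "Mystery A"},
    "sam": {"Musical A", "Musical B", "Comedy A"},
    "lee": {"Mystery A", "Mystery B", "Comedy B"},
}


def jaccard(a, b):
    """Overlap between two sets of liked movies: 0 means nothing shared, 1 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(person, likes, top_n=3):
    """Score each movie the person hasn't listed by the similarity of the people who liked it."""
    mine = likes[person]
    scores = {}
    for other, theirs in likes.items():
        if other == person:
            continue
        weight = jaccard(mine, theirs)
        for movie in theirs - mine:
            scores[movie] = scores.get(movie, 0.0) + weight
    # Highest score first; ties broken alphabetically so the output is stable.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]


print(recommend("you", likes))
```

With this toy data, the top pick for "you" is the movie liked by the most similar person, "pat".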

One thing I wanted to account for in my model was that there are multiple movie genres, and people might have similar tastes in some but not others. (You and I could both like pretty much the same comedies, but maybe you like musicals and I hate them. No, really, I hate them.) To make this work, I needed enough data to be able to model preferences in each genre, not just overall. It wasn’t enough to know, for each person, 10 or 20 movies that they liked; I needed to know a few comedies each person liked, a few mysteries, a few musicals (if any), etc. Which meant I needed a larger dataset overall, because there are a lot of genres.

Now, it wasn’t so hard to collect this data. We made a long list of movies, made sure we included a decent number from every genre we wanted to cover, and had people rate the movies on our list. (You can give people a long list, because they usually still remember a movie well enough to rate it long after they saw it.) Long story short, I had enough data to do what I wanted to do — model each genre separately — and my model seemed to work pretty well. (We tested against models that lumped all the movies together, and mine did better.) What I want to highlight is that if I hadn’t been able to collect as much data, my fine-grained approach probably wouldn’t have worked at all. If I only had a small dataset, I wouldn’t have been able to say anything about what was going on inside each genre, and grouping people based on all the movies lumped together would have been a better bet. The model wouldn’t have been very precise, but it would have used the little data I did have more efficiently.
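The trade-off between the per-genre approach and the lumped approach shows up even in a tiny example. The sketch below is an illustration under my own assumptions, not the original model: similarity is just the fraction of shared movies two people agree on, and a comparison counts as "not enough data" when fewer than two movies are shared. The movie codes and ratings are invented.

```python
# Made-up catalogue: movie -> genre.
GENRE_OF = {"C1": "comedy", "C2": "comedy", "C3": "comedy", "M1": "musical", "M2": "musical"}


def agreement(ratings_a, ratings_b, min_shared=2):
    """Fraction of co-rated movies on which two people agree; None if too few are shared."""
    shared = ratings_a.keys() & ratings_b.keys()
    if len(shared) < min_shared:
        return None
    return sum(ratings_a[m] == ratings_b[m] for m in shared) / len(shared)


def by_genre(ratings):
    """Split one person's {movie: liked?} ratings into {genre: {movie: liked?}}."""
    out = {}
    for movie, liked in ratings.items():
        out.setdefault(GENRE_OF[movie], {})[movie] = liked
    return out


def compare(ratings_a, ratings_b):
    """Return (lumped agreement, {genre: agreement}) for two people."""
    a, b = by_genre(ratings_a), by_genre(ratings_b)
    per_genre = {g: agreement(a.get(g, {}), b.get(g, {})) for g in ("comedy", "musical")}
    return agreement(ratings_a, ratings_b), per_genre


# Plenty of data: same taste in comedies, opposite taste in musicals.
you = {"C1": True, "C2": True, "C3": False, "M1": True, "M2": True}
me = {"C1": True, "C2": True, "C3": False, "M1": False, "M2": False}
print(compare(you, me))

# Very little data: the lumped score still exists, the per-genre ones do not.
you_small = {"C1": True, "M1": True}
me_small = {"C1": True, "M1": False}
print(compare(you_small, me_small))
```

With the full ratings, the lumped score (0.6) hides the fact that we agree completely on comedies and not at all on musicals. With only two ratings each, the per-genre scores come back as None, and the crude lumped score is all that's left.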

The upshot is that models depend on data, and data availability (quantity and quality) is always a real-world issue, not just a math issue. A model may make perfect sense in theory, but work badly in practice if reality gets in the way of gathering the data you need to run the model.

Education data is a great example here. There’s a class of models for measuring teacher and school performance, known broadly as value-added models (VAM). The idea is to try to isolate how much “value” a teacher or school adds to students’ learning, where learning is usually measured through test scores. Regardless of what you think about standardized testing, you should know that the modeling here is extremely challenging! The problem is that it’s very hard to break out the impact of a teacher or school from all the other, “external” factors that might affect a kid’s test scores (genetics, at-home support and preparation, attendance, schools attended in the past, just to name a few). To do this, you need a model to estimate a kid’s “expected” test score based on all the external factors. (The “value added” by the school or teacher is then supposed to be captured as the difference between this model-based expected score and the actual score.)
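As a bare-bones sketch of the idea (and nothing like a production VAM), suppose the only external factor we can observe is last year's test score. We fit a straight line to predict this year's score from last year's, and call a teacher's "value added" the average gap between actual and expected scores for that teacher's students. The records, names, and the single-predictor model are all assumptions made for illustration.

```python
from statistics import mean

# Invented records: (teacher, prior-year score, this-year score).
records = [
    ("T1", 60, 66), ("T1", 70, 75), ("T1", 80, 86),
    ("T2", 60, 61), ("T2", 70, 70), ("T2", 80, 82),
]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def value_added(records):
    """Average (actual - expected) score per teacher, with expected taken from the fitted line."""
    slope, intercept = fit_line([r[1] for r in records], [r[2] for r in records])
    gaps = {}
    for teacher, prior, actual in records:
        expected = intercept + slope * prior
        gaps.setdefault(teacher, []).append(actual - expected)
    return {teacher: mean(values) for teacher, values in gaps.items()}


print(value_added(records))
```

Everything the expected-score line fails to capture (home support, attendance, a bad data feed) lands in that gap and gets attributed to the teacher, which is why the quality of the input data matters so much.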

To build such a model, you need to model the external factors, which means you need a huge amount of input data. You certainly want a history of past test scores (hard to collect if a kid has moved around between schools where different tests are given; hard to interpret even if you happen to have the data). You likely want to know something about income (typically, eligibility for reduced-price school lunch programs is used as a proxy for this; at my kids’ school, this data was apparently wrong for a couple of years). And you probably want to know something about support at home, out-of-school activities, and lots of other variables. Well, good luck! The worst part is that the data gaps tend to be the biggest in the poorest schools (fewer resources to collect data, more kids going in and out, making the data problem harder to begin with). These are precisely the schools where it’s most important to model the challenges the kids face, and yet the data isn’t there to do it.

There’s starting to be a backlash against standardized testing, and against measuring teachers and schools by the results of those standardized tests. And there’s also a backlash to the backlash, with supporters of the VAM framework arguing that it’s the most objective measure of teacher performance and kids’ progress. But models and measures based on data that’s not there, and can’t be filled in, aren’t objective at all.