How Value Added Models are Like Turds

“Why am I surrounded by statistical illiterates?” — Roger Mexico in Gravity’s Rainbow

Oops, they did it again. This weekend, the New York Times put out this profile of William Sanders, the originator of evaluating teachers using value-added models based on student standardized test results. It is statistically illiterate, uses math to mislead and intimidate, and is utterly infuriating.

Here’s the worst part:

When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.

And later:

Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain.

The implication here is that value added models have scientific credibility because they look like math — they give you a bell curve, you know. That sounds sort of impressive until you remember that the bell curve is also the world’s most common model of random noise. Which is what value added models happen to be.

Just to replace the Times’s name dropping with some actual math: bell curves are ubiquitous because of the Central Limit Theorem, which says that any variable that adds up many small, similar-sized, independent factors looks like a bell curve, no matter what the individual factors look like. For example, take the number of heads you get in 100 coin flips. Each single flip is binary, but one flip doesn’t affect the next, and when you add enough of them up, out comes a bell curve. Or how about height? It depends on lots of factors (heredity, diet, environment, and so on), and you get a bell curve again. The Central Limit Theorem is wonderful because it helps explain the world: it tells you why you see bell curves everywhere. But it also tells you that random fluctuations that don’t mean anything tend to look like bell curves too.
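The coin-flip example is easy to check directly. Here’s a quick simulation (a sketch, standard library only) of many rounds of 100 flips; the head counts settle into a bell curve with mean 50 and standard deviation 5, exactly as the Central Limit Theorem predicts:

```python
import random
from collections import Counter

random.seed(0)

# Flip a fair coin 100 times and count heads; repeat 20,000 times.
trials = 20_000
heads = [sum(random.random() < 0.5 for _ in range(100)) for _ in range(trials)]

mean = sum(heads) / trials
std = (sum((h - mean) ** 2 for h in heads) / trials) ** 0.5
print(f"mean ~ {mean:.1f}, std ~ {std:.1f}")  # CLT predicts 50 and 5

# A crude text histogram makes the bell shape visible.
counts = Counter(heads)
for h in range(38, 63, 2):
    print(f"{h:2d} heads | {'#' * (counts[h] // 100)}")
```

And that’s the point: the bell shape appears no matter what, so a bell curve by itself certifies nothing about what you’re measuring.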

So, just to take another example, if I decided to rate teachers by the size of the turds that come out of their ass, I could wave around a lovely bell-shaped distribution of teacher ratings, sit back, and wait for the Times article about how statistically insightful this is. Because back in the bad old days, we didn’t know how to distinguish between good and bad teachers, but the Turd Size Model™ produces a shiny, mathy-looking distribution — so it must be correct! — and shows us that teacher quality varies for reasons that conventional measures can’t explain.

Or maybe we should just rate news articles based on turd size, so this one could get a Pulitzer.


The Models Were Telling Us Trump Could Win

Nate Silver got the election right.

Modeling this election was never about win probabilities (i.e., saying that Clinton is 98% likely to win, or 71% likely to win, or whatever). It was about finding a way to convey meaningful information about uncertainty and about what could happen. And, despite the not-so-great headline, this article by Nate Silver does a pretty impressive job.

First, let’s have a look at what not to do. This article by Sam Wang (Princeton Election Consortium) explains how you end up with a win probability of 98-99% for Clinton. First, he aggregates the state polls, and figures that if they’re right on average, then Clinton wins easily (with over 300 electoral votes I believe). Then he looks for a way to model the uncertainty. He asks, reasonably: what happens if the polls are all off by a given amount? And he answers the question, again reasonably: if Trump overperforms his polls by 2.6%, the election becomes a toss-up. If he overperforms by more, he’s likely to win.

But then you have to ask: how much could the polls be off by? And this is where Wang goes horribly wrong.

The uncertainty here is virtually impossible to model statistically. US presidential elections don’t happen that often, so there’s not much direct history, plus the challenges of polling are changing dramatically as fewer and fewer people are reachable via listed phone numbers. Wang does say that in the last three elections, the polls have been off by 1.3% (Bush 2004), 1.2% (Obama 2008), and 2.3% (Obama 2012). So polls being off by 2.6% doesn’t seem crazy at all.

For some inexplicable reason, however, Wang ignores what is right in front of his nose, picks a tiny standard error parameter out of the air, plugs it into his model, and basically says: well, the polls are very unlikely to be off by very much, so Clinton is 98-99% likely to win.
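To see how sensitive that 98-99% is to the plucked-out-of-the-air parameter, here’s a back-of-the-envelope sketch (not Wang’s actual model): treat the aggregate polling error as a mean-zero normal and ask how often it stays under the 2.6% tipping point, under two illustrative choices of standard deviation. Both sigma values below are my assumptions for illustration.

```python
import math

def normal_cdf(x, sigma):
    """P(X <= x) for a mean-zero normal with standard deviation sigma."""
    return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))

TIPPING_POINT = 2.6  # Trump overperformance (in %) that makes it a toss-up

# Two illustrative standard deviations: a tiny one, and one roughly in
# line with the recent polling misses (1.3%, 1.2%, 2.3%) mentioned above.
for sigma in (1.1, 2.5):
    p = normal_cdf(TIPPING_POINT, sigma)
    print(f"sigma = {sigma}%: Clinton 'win probability' ~ {p:.0%}")
```

The headline number swings from about 99% down to about 85% just by moving one unobservable parameter within its historically plausible range. That swing is the real story, and it’s the part the 98-99% figure hides.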

Always be wary of models, especially models of human behavior, that give probabilities of 98-99%. Always ask yourself: am I anywhere near 98-99% sure that my model is complete and accurate? If not, STOP, cross out your probabilities because they are meaningless, and start again.

How do you come up with a meaningful forecast, though? Once you accept that there’s genuine uncertainty in the most important parameter in your model, and that trying to assign a probability is likely to range from meaningless to flat-out wrong, how do you proceed?

Well, let’s look at what Silver does in this article. Instead of trying to estimate the volatility as Wang does (and as Silver also does on the front page of his web site, people just can’t help themselves), he gives a careful analysis of some possible specific scenarios. What are some good scenarios to pick? Well, maybe we should look at recent cases of when nationwide polls have been off. OK, can you think of any good examples? Hmm, I don’t know, maybe…

[Image: front page of The Sun announcing the Brexit result]
Look at the numbers in that Sun cover. Brexit (Leave) won by 4%, while the polls before the election were essentially tied, with Remain perhaps enjoying a slight lead. That’s a polling error of at least 4%. And the US poll numbers are very clear: if Trump overperforms his polls by 4%, he wins easily.

In financial modeling, where you often don’t have enough relevant history to build a good probabilistic model, this technique — pick some scenarios that seem important, play them through your model, and look at the outcomes — is called stress testing. Silver’s article does a really, really good job of it. He doesn’t pretend to know what’s going to happen (we can’t all be Michael Moore, you know), but he plays out the possibilities, makes the risks transparent, and puts you in a position to evaluate them. That is how you’re supposed to analyze situations with inherent uncertainty. And with the inherent uncertainty in our world increasing, to say the least, it’s a way of thinking that we all better start becoming really familiar with.
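In code, the whole technique fits in a few lines. Here’s a toy version: the projected-margin “model” is a deliberate oversimplification, and the 2.6% tipping point is taken from Wang’s own analysis quoted above.

```python
# Stress testing: pick a few concrete scenarios, play each one through
# the model, and report the outcomes, with no probabilities attached.

POLLING_LEAD = 2.6  # Clinton's effective polling lead at the tipping point, in %

scenarios = {
    "polls exactly right": 0.0,
    "2012-sized polling miss toward Trump": 2.3,
    "Brexit-sized polling miss toward Trump": 4.0,
}

for name, trump_boost in scenarios.items():
    margin = POLLING_LEAD - trump_boost
    if margin > 0.5:
        outcome = "Clinton wins"
    elif margin < -0.5:
        outcome = "Trump wins"
    else:
        outcome = "toss-up"
    print(f"{name}: projected margin {margin:+.1f}% -> {outcome}")
```

No single number comes out the other end, and that’s a feature: the output is a set of outcomes you can actually reason about.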

The models were plain as day. What the numbers were telling us was that if the polls were right, Clinton would win easily, but if they were underestimating Trump’s support by anywhere near a Brexit-like margin, Trump would win easily. Shouldn’t that have been the headline? Wouldn’t you have liked to have known that? Isn’t it way more informative than saying that Clinton is 98% or 71% likely to win based on some parameter someone plucked out of thin air?

We should have been going into this election terrified.

Cathy’s Book is Out!

Cathy O’Neil’s book Weapons of Math Destruction is out, and it’s already been shortlisted for a National Book Award! Here is a review of the book that I posted on Amazon.

So here you are on Amazon’s web page, reading about Cathy O’Neil’s new book, Weapons of Math Destruction. Amazon hopes you buy the book (and so do I, it’s great!). But Amazon also hopes it can sell you some other books while you’re here. That’s why, in a prominent place on the page, you see a section entitled:

Customers Who Bought This Item Also Bought

This section is Amazon’s way of using what it knows — which book you’re looking at, and sales data collected across all its customers — to recommend other books that you might be interested in. It’s a very simple, and successful, example of a predictive model: data goes in, some computation happens, a prediction comes out. What makes this a good model? Here are a few things:

  1. It uses relevant input data. The goal is to get people to buy books, and the input to the model is what books people buy. You can’t expect to get much more relevant than that.
  2. It’s transparent. You know exactly why the site is showing you these particular books, and if the system recommends a book you didn’t expect, you have a pretty good idea why. That means you can make an informed decision about whether or not to trust the recommendation.
  3. There’s a clear measure of success and an embedded feedback mechanism. Amazon wants to sell books. The model succeeds if people click on the books they’re shown, and, ultimately, if they buy more books, both of which are easy to measure. If clicks on or sales of related items go down, Amazon will know, and can investigate and adjust the model accordingly.
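The core of such a recommender can be sketched in a few lines: count how often other items show up in the same purchase baskets as the one being viewed, and surface the most frequent companions. (The baskets below are made up for illustration, and Amazon’s production system is of course far more elaborate.)

```python
from collections import Counter

def also_bought(item, baskets, n=2):
    """Recommend the n items most often co-purchased with `item`."""
    companions = Counter()
    for basket in baskets:
        if item in basket:
            companions.update(basket - {item})
    return [name for name, _ in companions.most_common(n)]

# Hypothetical purchase histories, one set of titles per customer.
baskets = [
    {"Weapons of Math Destruction", "The Signal and the Noise"},
    {"Weapons of Math Destruction", "The Signal and the Noise", "Freakonomics"},
    {"Weapons of Math Destruction", "Freakonomics"},
    {"The Signal and the Noise", "Moneyball"},
]

print(also_bought("Weapons of Math Destruction", baskets))
```

Notice how the three good-model properties show up directly: the inputs are purchases, the logic is simple enough to audit by hand, and success (clicks and sales on what gets recommended) is directly measurable.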

Weapons of Math Destruction reviews, in an accessible, non-technical way, what makes models effective — or not. The emphasis, as you might guess from the title, is on models with problems. The book highlights many important ideas; here are just a few:

  1. Models are more than just math. Take a look at Amazon’s model above: while there are calculations (simple ones) embedded, it’s people who decide what data to use, how to use it, and how to measure success. Math is not a final arbiter, but a tool to express, in a scalable (i.e., computable) way, the values that people explicitly decide to emphasize. Cathy says that “models are opinions expressed in mathematics” (or computer code). She highlights that when we evaluate teachers based on students’ test scores, or assess someone’s insurability as a driver based on their credit record, we are expressing opinions: that a successful teacher should boost test scores, or that responsible bill-payers are more likely to be responsible drivers.
  2. Replacing what you really care about with what you can easily get your hands on can get you in trouble. In Amazon’s recommendation model, we want to predict book sales, and we can use book sales as inputs; that’s a good thing. But what if you can’t directly measure what you’re interested in? In the early 1980’s, the magazine US News wanted to report on college quality. Unable to measure quality directly, the magazine built a model based on proxies, primarily outward markers of success, like selectivity and alumni giving. Predictably, college administrators, eager to boost their ratings, focused on these markers rather than on education quality itself. For example, to boost selectivity, they encouraged more students, even unqualified ones, to apply. This is an example of gaming the model.
  3. Historical data is stuck in the past. Typically, predictive models use past history to predict future behavior. This can be problematic when part of the intention of the model is to break with the past. To take a very simple example, imagine that Cathy is about to publish a sequel to Weapons of Math Destruction. If Amazon uses only historical purchase data, the Customers Who Bought This Also Bought list would completely miss the connection between the original and the sequel. This means that if we don’t want the future to look just like the past, our models need to use more than just history as inputs. A chapter about predictive models in hiring is largely devoted to this idea. A company may think that its past, subjective hiring system overlooks qualified candidates, but if it replaces the HR department with a model that sifts through resumes based only on the records of past hires, it may just be codifying (pun intended) past practice. A related idea is that, in this case, rather than adding objectivity, the model becomes a shield that hides discrimination. This takes us back to Models are more than just math and also leads to the next point:
  4. Transparency matters! If a book you didn’t expect shows up on The Customers Who Bought This Also Bought list, it’s pretty easy for Amazon to check if it really belongs there. The model is pretty easy to understand and audit, which builds confidence and also decreases the likelihood that it gets used to obfuscate. An example of a very different story is the value added model for teachers, which evaluates teachers through their students’ standardized test scores. Among its other drawbacks, this model is especially opaque in practice, both because of its complexity and because many implementations are built by outsiders. Models need to be openly assessed for effectiveness, and when teachers receive bad scores without knowing why, or when a single teacher’s score fluctuates dramatically from year to year without explanation, it’s hard to have any faith in the process.
  5. Models don’t just measure reality, but sometimes amplify it, or create their own. Put another way, models of human behavior create feedback loops, often becoming self-fulfilling prophecies. There are many examples of this in the book, especially focusing on how models can amplify economic inequality. To take one example, a company in the center of town might notice that workers with longer commutes tend to turn over more frequently, and adjust its hiring model to focus on job candidates who can afford to live in town. This makes it easier for wealthier candidates to find jobs than poorer ones, and perpetuates a cycle of inequality. There are many other examples: predictive policing, prison sentences based on recidivism, e-scores for credit. Cathy talks about a trade-off between efficiency and fairness, and, as you can again guess from the title, argues for fairness as an explicit value in modeling.

Weapons of Math Destruction is not a math book, and it is not investigative journalism. It is short — you can read it in an afternoon — and it doesn’t have time or space for either detailed data analysis (there are no formulas or graphs) or complete histories of the models she considers. Instead, Cathy sketches out the models quickly, perhaps with an individual anecdote or two thrown in, so she can get to the main point — getting people, especially non-technical people, used to questioning models. As more and more aspects of our lives fall under the purview of automated data analysis, that’s a hugely important undertaking.


Probability For Dummies (And We’re All Dummies)

Sometimes it feels like probability was made up just to trip you up. My undergraduate advisor Persi Diaconis, who started out as a magician and often works on card shuffling and other problems related to randomness, used to say that our brains weren’t wired right for doing probability. Now that I (supposedly!) know a little more about probability than I did as a student, Persi’s statement rings even truer.

I spent a little time this weekend thinking about why probability confuses us so easily. I don’t have all the answers, but I did end up making up a story that I found pretty illuminating. At least, I learned a few things from thinking it through. It’s based on what looks like a very simple example, first popularized by Martin Gardner, but it can still blow your mind a little bit. I actually meant to have a fancier example, but my basic one ended up being more than enough for what I wanted to get across. (Some of these ideas, and the Gardner connection, are explored in a complementary way in this paper by Tanya Khovanova.) Here goes.

Prologue. Say you go to a school reunion, and you find yourself at a dimly-lit late evening reception, talking to your old friend Robin. You haven’t seen each other for years, you’re catching up on family, and you hear that Robin has two children. Maybe the reunion has you thinking back to the math classes you took, or maybe you’ve just been drinking too much, but for some reason, you start wondering whether Robin’s children have the same gender (two boys or two girls) or different genders (one of each). Side note: if you’ve managed to stay sober, this may be the point at which you realize that you’ve not only wandered into a reunion you’re barely interested in, you’ve wandered into a math problem you’re barely… um, well, anyway, let’s keep going.

The gender question is pretty easy to answer, at least in terms of what’s more and less likely. Assuming that any one child is as likely to be a girl as a boy (not quite, but let’s ignore that), and assuming that having one kid be a girl or boy doesn’t change the likelihood of having your other kid be a girl or boy (again, probably not exactly true, but whatever), we find there are four equally likely scenarios (I’m listing the oldest kid first):

(Girl, girl)      (Girl, boy)     (Boy, girl)     (Boy, Boy)

Each of these scenarios has probability 25%. There are two scenarios with two kids of the same gender (total probability 50%), and two scenarios with two kids of opposite genders (total probability also 50%). Easy peasy.
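If you like, the little calculation can be written out as an enumeration. It’s a trivial check, but it sets up the bookkeeping we’ll need once things get less sober:

```python
from itertools import product

# All four equally likely (older, younger) combinations: G = girl, B = boy.
scenarios = list(product("GB", repeat=2))  # [('G','G'), ('G','B'), ('B','G'), ('B','B')]

same = sum(a == b for a, b in scenarios) / len(scenarios)
opposite = sum(a != b for a, b in scenarios) / len(scenarios)
print(same, opposite)  # 0.5 0.5
```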

But things won’t stay simple for long, because you’ve not only wandered into a school reunion and a math problem, you’ve also wandered into a…


Really. So you’re at the reunion, still talking to Robin, only you might be sober, or you might be drunk. Which is it?

Sober Version: You and Robin continue your nice lucid conversation, and Robin says: “My older kid is a girl.” Does the additional information change the gender probabilities (two of the same vs. opposites) at all?

This one looks easy too, especially given that you’re sober. Now that we know the older kid is a girl, things come down to the gender of the younger kid. We know that having a girl and having a boy are equally likely, so two of the same vs. opposite genders should still be 50-50. In terms of the scenarios above, we’ve ruled out the last two scenarios and have a 50-50 choice between the first two.

But now let’s turn the page to the…

Drunk Version: You and Robin have both had more than a little wine, haven’t you? Maybe Robin’s starting to mumble a bit, or maybe you’re not catching every word Robin says any more, but in any case, in this version what you heard Robin say was, “My umuhmuuh kid is a girl.” So Robin might have said older or younger, but in the drunk version, you don’t know which. What are the probabilities now? Are they different from the sober version?

Argument for No: Robin might have said, “My older kid is a girl,” in which case you rule out the last two scenarios as above and conclude the probabilities are still 50-50. Or Robin might have said, “My younger kid is a girl,” in which case you would rule out the second and fourth scenarios, but the probabilities would again be 50-50. So it’s 50-50 no matter what Robin said. It doesn’t make a difference that you didn’t actually hear which it was.

Argument for Yes: Look at the four possible scenarios above. All we know now is that one of the kids is a girl, i.e., we’ve only ruled out (Boy, Boy). The other three are still possible, and still equally likely. But now we have two scenarios where the kids have opposite genders, and only one where they have the same gender. So now it’s not 50-50 anymore; it’s 2/3-1/3 in favor of opposite genders.

Both arguments seem pretty compelling, don’t they? Maybe you’re a little confused? Head spinning a little bit? Well, I did tell you this was the drunk version!

To try to sort things out, let’s step back a little bit. Drink a little ice water and take a look around the room. Let’s say you see 400 people at the reunion that have exactly two kids. I won’t count spouses, and I’ll assume that none of your classmates got together to have kids. That keeps things simple: 400 classmates with a pair of kids means 400 pairs of kids. On average, there’ll be 100 classmates for each of the four kid gender combinations. One of these classmates is your friend Robin.

Now imagine that each of your classmates is drunkenly telling a friend about which of their kids are girls. What will they say?

  • The 100 in the (Boy, Boy) square would certainly never say, “My umuhmuuh kid is a girl.” We can forget about them.
  • The 100 in the (Girl, Boy) square would always say, “My older kid is a girl.”
  • The 100 in the (Boy, Girl) square would always say, “My younger kid is a girl.”
  • The 100 in the (Girl, Girl) square could say either. There’s no reason to prefer one or the other, especially since everyone is drunk. So on average, 50 of them will say “My older kid is a girl,” and the other 50 will say, “My younger kid is a girl.”

All together, there should be 150 classmates who say their older kid is a girl, 150 who say their younger kid is a girl, and 100 who don’t say anything because they have no girl kids.

In the drunk version, where we don’t know what Robin said, Robin could be any of the 150 classmates who would say “My older kid is a girl.” In that case, 100 times out of 150, Robin’s two kids have opposite genders. Or Robin could be any of the 150 classmates who would say, “My younger kid is a girl,” and in that case again, 100 times out of 150, Robin’s two kids have opposite genders.

This analysis is consistent with the Argument for Yes, and leads to the same conclusion: there is a 2-in-3 chance (200 times out of 300) that Robin’s kids have opposite genders. But, it seems to agree with the spirit of the Argument for No as well! It looks like knowing Robin was talking about the older kid actually didn’t add any new information: that 2-in-3 chance would already hold if Robin had soberly said “My older kid is a girl” OR if Robin had just as soberly said “My younger kid is a girl.”
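The classmate count is easy to verify by simulation (a sketch using only the standard library; ‘G’/‘B’ for girl/boy, older kid listed first):

```python
import random

random.seed(1)

said_older = 0           # classmates who say "my older kid is a girl"
opposite_given_older = 0  # ...whose kids have opposite genders

for _ in range(150_000):
    kids = (random.choice("GB"), random.choice("GB"))  # (older, younger)
    girls = [i for i, k in enumerate(kids) if k == "G"]
    if not girls:
        continue  # (Boy, Boy) classmates say nothing about a girl
    # A drunk classmate mentions one of their girls, chosen at random.
    if random.choice(girls) == 0:
        said_older += 1
        opposite_given_older += kids[0] != kids[1]

print(round(opposite_given_older / said_older, 3))  # ~ 0.667
```

Out of every 300 classmates who mention a girl, about 150 say “older,” and about 100 of those have kids of opposite genders, matching the 2-in-3 count above.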

But now something seems really off. Because now it’s starting to look like our analysis of the sober version, apparently the simplest thing in the world, was actually incorrect. In other words, now it seems like we’re saying that finding out Robin’s older kid was a girl actually didn’t leave the gender probabilities at 50-50 like we thought. Which is just… totally… nuts. (And not at all sober.) Isn’t it?

Not necessarily.

Here’s the rub. In the sober version, the conversation could actually have gone a couple different ways:

Sober Version 1.

YOU: Tell me about your kids.

ROBIN: I’ve got two. My older kid is a junior in high school, plays guitar, does math team, runs track, and swims.

YOU: That’s great. Girls’ or boys’ track? The girls’ track team at my kids’ school is really competitive.

ROBIN: Girls’ track. My older kid is a girl.

Sober Version 2.

YOU: I teach math and science, and I’m really interested in helping girls succeed.

ROBIN: That’s great! Actually, if you’re interested in girls doing math, you might be interested in something that happened to one of my kids. My older kid is a girl, and…

Comparing Versions. In both versions, it looks like you ended up with the same information (Robin’s older kid is a girl). But the conclusions you get to draw are totally different!

Let’s view things in terms of your 400 classmates in the room. In Sober Version 1, the focus is on your classmate’s older kid. The key point is that, in this version of the conversation, in the 100 scenarios in which both of your classmate’s kids are girls, you would hear “my older kid is a girl” in all of them. Of course in the 100 (Girl, Boy) scenarios, you would hear “my older kid is a girl” as well. That makes for 200 “my older kid is a girl” scenarios, 100 of which are same-gender scenarios. The likelihood that both kids are girls is 50-50.

Whereas in Sober Version 2, the focus is on girls. In the 100 scenarios in which both of your classmate’s kids are girls, you should expect to hear a story about the older daughter about half the time, and the younger daughter the other half. (Perhaps not exactly, because the older kid has had more time to have experiences that become the subject of stories, but I’m ignoring this.) Combining this with the 100 (Girl, Boy) scenarios, we get 150 total “my older kid is a girl” scenarios. Only 50 of them are same-gender scenarios, and the likelihood that both kids are girls is only 1-in-3.
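The two conversations can be simulated side by side. The only difference in the code is how the subject of the conversation gets chosen, and that difference alone moves the answer from 50-50 to 1-in-3 (a sketch, same conventions as before):

```python
import random

random.seed(2)

v1_heard = v1_both_girls = 0  # Version 1: the conversation is about the older kid
v2_heard = v2_both_girls = 0  # Version 2: the conversation is about a girl

for _ in range(150_000):
    kids = (random.choice("GB"), random.choice("GB"))  # (older, younger)

    # Version 1: Robin is describing the older kid, who happens to be a girl.
    if kids[0] == "G":
        v1_heard += 1
        v1_both_girls += kids == ("G", "G")

    # Version 2: Robin tells a story about one of the girls, chosen at random;
    # you hear "my older kid is a girl" only when that girl is the older one.
    girls = [i for i, k in enumerate(kids) if k == "G"]
    if girls and random.choice(girls) == 0:
        v2_heard += 1
        v2_both_girls += kids == ("G", "G")

print(round(v1_both_girls / v1_heard, 2))  # ~ 0.5
print(round(v2_both_girls / v2_heard, 2))  # ~ 0.33
```

Same words out of Robin’s mouth, different state spaces, different answers.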

Why Probability Makes Us All Dummies. Probability is about comparing what happened with what might have happened. Math people have a fancy name for what might have happened: they call it the state space. What we see in this example is that when you talk about everyday situations in everyday language, it can be very tricky to pin down the state space. It’s hard to keep ambiguities out.

Even the Sober Version, which sounds very simple at first, turns out to have an ambiguity that we didn’t consider. And when we passed from the Sober Version to the Drunk Version, we got confused because we implicitly took the Sober Version to be Version 1, with a 200-person state space, while we took the Drunk Version to be like Version 2, with a 150-person state space. In other words, in interpreting “My older kid is a girl” vs. “One of my kids is a girl,” we fell into different assumptions about the background. I think this is what it means that our brains aren’t wired right to do probability: it’s incredibly easy for them to miss what the background assumptions are. And when we change the state space without realizing it by changing those background assumptions, we get paradoxes.

Note: while I framed what I’ve been calling the Drunk Version (one of my kids is a girl) in a way that makes Version 2 the natural interpretation, it can also be reframed to sound more like Version 1. In that case, the Argument for No in the Drunk Version is fully correct, and the probabilities are 50-50. From a quick online survey, I’ve found this observation in a few places, including Wikipedia and the paper I linked at the start. I haven’t seen anyone else note that what I’ve been calling the Sober Version (my older kid is a girl) can also be framed in multiple ways. Just more proof that it’s really easy to miss background assumptions!

Another point of view on this is in terms of information. The Sober vs. Drunk versions confused us because it looked like we had equivalent information – one of the kids is a girl – but ended up with different outcomes. But we didn’t have equivalent information: in the Sober version, there was an essential ambiguity in what information we had! The point is that just knowing the answer to a question (my older kid is a girl) usually isn’t the full story when it comes to probability problems. We need to know the question (Is your older kid a girl? vs. Is one of your kids a girl?) as well. The relevant information is a combination of a question and a statement that answers it, not a statement (or set of statements) floating on its own.

Should You Opt Out of PARCC?

Today’s post is a discussion of education reform, Common Core, standardized testing, and PARCC with my friend Kristin Wald, who has been extremely kind to this blog. Kristin taught high school English in the NYC public schools for many years. Today her kids and mine go to school together in Montclair. She has her own blog that gets orders of magnitude more readers than I do.

We’re cross-posting this on Kristin’s blog and also on Mathbabe (thank you, Cathy O’Neil!)

ES: PARCC testing is beginning in New Jersey this month. There’s been lots of anxiety and confusion in Montclair and elsewhere as parents debate whether to have their kids take the test or opt out. How do you think about it, both as a teacher and as a parent?

KW: My simple answer is that my kids will sit for PARCC. However, and this is where it gets grainy, that doesn’t mean I consider myself a cheerleader for the exam or for the Common Core curriculum in general.

In fact, my initial reaction, a few years ago, was to distance my children from both the Common Core and PARCC. So much so that I wrote to my child’s principal and teacher requesting that no practice tests be administered to him. At that point I had only peripherally heard about the issues and was extending my distaste for No Child Left Behind and, later, Race to the Top. However, despite reading about and discussing the myriad issues, I still believe in change from within and trying the system out to see kinks and wrinkles up-close rather than condemning it full force.


ES: Why did you dislike NCLB and Race to the Top? What was your experience with them as a teacher?

KW: Back when I taught in NYC, there was wiggle room if students and schools didn’t meet standards. Part of my survival as a teacher was to shut my door and do what I wanted. By the time I left the classroom in 2007 we were being asked to post the standards codes for the New York State Regents Exams around our rooms, similar to posting Common Core standards all around. That made no sense to me. Who was this supposed to be for? Not the students – if they’re gazing around the room they’re not looking at CC RL.9-10 next to an essay hanging on a bulletin board. I also found NCLB naïve in its “every child can learn it all” attitude. I mean, yes, sure, any child can learn. But kids aren’t starting out at the same place or with the same support. And anyone who has experience with children who have not had the proper support up through 11th grade knows they’re not going to do well, or even half-way to well, just because they have a kickass teacher that year.

Regarding my initial aversion to Common Core, especially as a high school English Language Arts teacher, the minimal appearance of fiction and poetry was disheartening. We’d already seen the slant in the NYS Regents Exam since the late 90’s.

However, a couple of years ago, a friend asked me to explain the reason The Bluest Eye, with its abuse and rape scenes, was included in Common Core selections, so I took a closer look. Basically, a right-wing blogger had excerpted lines and scenes from the novel to paint it as “smut” and child pornography, thus condemning the entire Common Core curriculum. My response to my friend ended up as “In Defense of The Bluest Eye.”

That’s when I started looking more closely at the Common Core curriculum. Learning about some of the challenges facing public schools around the country, I had to admit that having a required curriculum didn’t seem like a terrible idea. In fact, in a few cases, the Common Core felt less confining than what they’d had before. And you know, even in NYC, there were English departments that rarely taught women or minority writers. Without a strong leader in a department, there’s such a thing as too much autonomy. Just like a unit in a class, a school and a department should have a focus, a balance.

But your expertise is Mathematics, Eugene. What are your thoughts on the Common Core from that perspective?

ES: They’re a mix. There are aspects of the reforms that I agree with, aspects that I strongly disagree with, and then a bunch of stuff in between.

The main thing I agree with is that learning math should be centered on learning concepts rather than procedures. You should still learn procedures, but with a conceptual underpinning, so you understand what you’re doing. That’s not a new idea: it’s been in the air, and frustrating some parents, for 50 years or more. In the 1960’s, they called it New Math.

Back then, the reforms didn’t go so well because the concepts they were trying to teach were too abstract – too much set theory, in a nutshell, at least in the younger grades. So then there was a retrenchment, back to learning procedures. But these things seem to go in cycles, and now we’re trying to teach concepts better again. This time more flexibly, less abstractly, with more examples. At least that’s the hope, and I share that hope.

I also agree with your point about needing some common standards defining what gets taught at each grade level. You don’t want to be super-prescriptive, but you need to ensure some kind of consistency between schools. Otherwise, what happens when a kid switches schools? Math, especially, is such a cumulative subject that you really need to have some big picture consistency in how you teach it.


ES: What I disagree with is the increased emphasis on standardized testing, especially the raised stakes of those tests. I want to see better, more consistent standards and curriculum, but I think that can and should happen without putting this very heavy and punitive assessment mechanism on top of it.

KW: Yes, claiming to want to assess ability (which is a good thing), but then tying the results to a teacher’s effectiveness in that moment, makes for a disingenuous evaluation. And using a standardized test not created by the teacher, covering material not taught in class, as a hard percentage of a teacher’s evaluation makes little sense. I understand that much of the exam is testing critical thinking, ability to reason and use logic, and so on. It’s not about specific content, and that’s fine. (I really do think that’s fine!) Linking teacher evaluations to it is not.

Students cannot be taught to think critically in six months. As you mentioned about the spiraling back to concepts, those skills need to be revisited again and again in different contexts. And I agree, tests needn’t be the main driver for raising standards and developing curriculum. But they can give a good read on overall strengths and weaknesses. And if PARCC is supposed to be about assessing student strengths and weaknesses, it should be informing adjustments in curriculum.

On a smaller scale, strong teachers and staffs are supposed to work as a team, adjusting curriculum to influence the entire school and district. With something as far-reaching as the Common Core, a worrying issue is that different parts of the country have very different needs to meet. Making one-size-fits-all adjustments based on such a wide collection of assessments is counterproductive. Local districts (and the principals and teachers in them) need leeway to apply the standards in ways that best suit their own students.

Even so, I do like some things about data driven curricula. Teachers and school administrators are some of the most empathetic and caring people there are, but they are still human, and biases exist. Teachers, guidance counselors, administrators can’t help but be affected by personal sympathies and peeves. Having a consistent assessment of skills can be very helpful for those students who sometimes fall through the cracks. Basically, standards: yes. Linking scores to teacher evaluation: no.

ES: Yes, I just don’t get the conventional wisdom that we can only tell that the reforms are working, at both the individual and group level, through standardized test results. It gives us some information, but it’s still just a proxy. A highly imperfect proxy at that, and we need to have lots of others.

I also really like your point that, as you’re rolling out national standards, you need some local assessment to help you see how those national standards are meeting local needs. It’s a safeguard against getting too cookie-cutter.

I think it’s incredibly important that, as you and I talk, we can separate changes we like from changes we don’t. One reason there’s so much noise and confusion now is that everything – standards, curriculum, testing – gets lumped together under “Common Core.” It becomes this giant kitchen sink that’s very hard to talk about in a rational way. Testing especially should be separated out because it’s fundamentally an issue of process, whereas standards and curriculum are really about content.

You take a guy like Cuomo in New York. He’s trying to increase the reliance on standardized tests in teacher evaluations, so that value added models based on test scores count for half of a teacher’s total evaluation. And he says stuff like this: “Everyone will tell you, nationwide, the key to education reform is a teacher evaluation system.” That’s from his State of the State address in January. He doesn’t care about making the content better at all. “Everyone” will tell you! I know for a fact that the people spending all their time figuring out at what grade level kids should start to learn about fractions aren’t going to tell you that!

I couldn’t disagree with that guy more, but I’m not going to argue with him based on whether or not I like the problems my kids are getting in math class. I’m going to point out examples, which he should be well aware of by now, of how badly the models work. That’s a totally different discussion, about what we can model accurately and fairly and what we can’t.

So let’s have that discussion. Starting point: if you want to use test scores to evaluate teachers, you need a model because – I think everyone agrees on this – how kids do on a test depends on much more than how good their teacher was. There’s the talent of the kid, what preparation they got outside their teacher’s classroom, whether they got a good night’s sleep the night before, and a good breakfast, and lots of other things. As well as natural randomness: maybe the reading comprehension section was about DNA, and the kid just read a book about DNA last month. So you need a model to break out the impact of the teacher. And the models we have today, even the most state-of-the-art ones, can give you useful aggregate information, but they just don’t work at that level of detail. I’m saying this as a math person, and the American Statistical Association agrees. I’ve written about this here and here and here and here.

Having student test results impact teacher evaluations is my biggest objection to PARCC, by far.

KW: Yep. Can I just cut and paste what you’ve said? However, for me, another distasteful aspect is how technology is tangled up in the PARCC exam.


ES: Let me tell you the saddest thing I’ve heard all week. There’s a guy named Dan Meyer, who writes very interesting things about math education, both in his blog and on Twitter. He put out a tweet about a bunch of kids coming into a classroom and collectively groaning when they saw laptops on every desk. And the reason was that they just instinctively assumed they were either about to take a test or do test prep.

That feels like such a collective failure to me. Look, I work in technology, and I’m still optimistic that it’s going to have a positive impact on math education. You can use computers to do experiments, visualize relationships, reinforce concepts by having kids code them up, you name it. The new standards emphasize data analysis and statistics much more than any earlier standards did, and I think that’s a great thing. But using computers primarily as a testing tool is an enormous missed opportunity. It’s like, here’s the most amazing tool human beings have ever invented, and we’re going to use it primarily as a paperweight. And we’re going to waste class time teaching kids exactly how to use it as a paperweight. That’s just so dispiriting.

KW: That’s something that hardly occurred to me. My main objection to hosting the PARCC exam on computers – and giving preparation homework and assignments that MUST be done on a computer – is the unfairness inherent in accessibility. It’s one more way to widen the achievement gap that we are supposed to be minimizing. I wrote about it from one perspective here.

I’m sure there are some students who test better on a computer, but the playing field has to be designed evenly and offered aggressively. Otherwise, a major part of what the PARCC is testing is how accurately and quickly children use a keyboard. And in the aggregate, the group whose scores will be negatively impacted will be children with less access to the technology used on the PARCC. We don’t need a test to tell us that. When I took the practice tests, I found some questions quite clear, but others were difficult not because of the content but because of the maneuvering required to enter a fraction or other notation. Part of that can be solved through practice and comfort with the technology, but then we return to the question of what we’re actually testing.

ES: Those are both great points. The last thing you want to do is force kids to write math on a computer, because it’s really hard! Math has lots of specialized notation that’s much easier to write with pencil and paper, and learning how to write math and use that notation is a big part of learning the subject. It’s not easy, and you don’t want to put artificial obstacles in kids’ way. I want kids thinking about fractions and exponents and what they mean, and how to write them in a mathematical expression, but not worrying about how to put a numerator above a denominator or do a superscript or make a font smaller on a computer. Plus, why in the world would you limit what kids can express on a test to what they can input on a keyboard? A test is a proxy already, and this limits what it can capture even more.

I believe in using technology in education, but we’ve got the order totally backwards. Don’t introduce the computer as a device to administer tests, introduce it as a tool to help in the classroom. Use it for demos and experiments and illustrating concepts.

As far as access and fairness go, I think that’s another argument for using the computer as a teaching tool rather than a testing tool. If a school is using computers in class, then at least everyone has access in the classroom setting, which is a start. Now you might branch out from there to assignments that require a computer. But if that’s done right, and those assignments grow in an organic way out of what’s happening in the classroom, and they have clear learning value, then the school and the community are also morally obligated to make sure that everyone has access. If you don’t have a computer at home, and you need to do computer-based homework, then we have to get you computer access, after school hours, or at the library, or what have you. And that might actually level the playing field a bit. Whereas now, many computer exercises feel like they’re primarily there to get kids used to the testing medium. There isn’t the same moral imperative to give everybody access to that.

I really want to hear more about your experience with the PARCC practice tests, though. I’ve seen many social media threads about unclear questions, both in a testing context and more generally with the Common Core. It sounds like you didn’t think it was so bad?

KW: Well, “not so bad” in that I am a 45-year-old who was really trying to take the practice exam honestly, but didn’t feel stressed about the results. However, I found the questions with fractions confusing to execute on the computer (I almost gave up), and some of the questions really had to be read more than once. Now, granted, I haven’t been exposed to the language and technique of the exam. That matters a lot. In the SAT, for example, if you don’t know the testing language and format, it will adversely affect your performance. The same goes for any format of exam or task, even putting together an IKEA nightstand.

There are mainly two approaches to preparation, and out of fear of failing, some school districts are doing hardcore test preparation – much like SAT preparation classes – to the detriment of content and skill-based learning. Others are not altering their classroom approaches radically; in fact, some teachers and parents have told me they hardly notice a difference. My unscientific observations point to a separation between the two that falls along socioeconomic lines. If districts feel like they are on the edge or have a lot to lose (autonomy, funding, jobs), it makes sense that they would be reactionary in dealing with the PARCC exam. Ironically, the schools that treat the PARCC like a high-stakes test are the ones losing the most.

Opting Out

KW: Despite my misgivings, I’m not in favor of “opting out” of the test. I understand the frustration that has prompted the push some districts are experiencing, but there have been some compromises in New Jersey. I was glad to see that the NJ Assembly voted to put off using the PARCC results for student placement and teacher evaluations for three years. And I was relieved, though not thrilled, that the percentage of PARCC results to be used in teacher evaluations was lowered to 10% (and now put off). I still think it should not be a part of teacher evaluations, but 10% is an improvement.

Rather than refusing the exam, I’d prefer to see the PARCC in action and compare honest data to school and teacher-generated assessments in order to improve the assessment overall. I believe an objective state or national model is worth having; relying only on teacher-based assessment has consistency and subjectivity problems in many areas. And that goes double for areas with deeply disadvantaged students.

ES: Yes, NJ seems to be stepping back from the brink as far as model-driven teacher evaluation goes. I think I feel the same way you do, but if I lived in NY, where Cuomo is trying to bump up the weight of value added models in evaluations to 50%, I might very well be opting out.

Let me illustrate the contrast – NY vs. NJ, more test prep vs. less — with an example. My family is good friends with a family that lived in NYC for many years, and just moved to Montclair a couple months ago. Their older kid is in third grade, which is the grade level where all this testing starts. In their NYC gifted and talented public school, the test was this big, stressful thing, and it was giving the kid all kinds of test anxiety. So the mom was planning to opt out. But when they got to Montclair, the kid’s teacher was much more low key, and telling the kids not to worry. And once it became lower stakes, the kid wanted to take the test! The mom was still ambivalent, but she decided that here was an opportunity for her kid to get used to tests without anxiety, and that was the most important factor for her.

I’m trying to make two points here. One: whether or not you opt out depends on lots of factors, and people’s situations and priorities can be very different. We need to respect that, regardless of which way people end up going. Two: shame on us, as grown-ups, for polluting our kids’ education with our anxieties! We need to stop that, and that extends both to the education policies we put in place and how we collectively debate those policies. I guess what I’m saying is: less noise, folks, please.

KW: Does this very long blog post count as noise, Eugene? I wonder how this will be assessed? There are so many other issues – private profits from public education, teacher autonomy in high performing schools, a lack of educational supplies and family support, and so on. But we have to start somewhere with civil and productive discourse, right? So, thank you for having the conversation.

ES: Kristin, I won’t try to predict anyone else’s assessment, but I will keep mine low stakes and say this has been a pleasure!

You Can’t Separate Models From Data

This Atlantic article is a bit highfalutin’, but if you make it past all the metaphors at the beginning, you’ll get to see some good examples of a very important idea: you can’t separate a model from the data that goes into the model. In particular, constraints on the input data become constraints on the model.

A story: my first non-academic job was Modeling Guy at a start-up that was building technology to generate movie recommendations, similar to Netflix or Amazon. (Just so you know how long ago this was, we were going to have recommendation kiosks at video stores! Then the Internet crash happened.) Recommendation models all work in pretty much the same way: the model finds people whose taste (in movies, books, music, whatever) is similar to yours, and recommends movies to you that those people have liked but you might not have seen yet. The basic input data to a model like this is preference information. In less fancy language, you need to know what movies different people like.

One thing I wanted to account for in my model was that there are multiple movie genres, and people might have similar tastes in some but not others. (You and I could both like pretty much the same comedies, but maybe you like musicals and I hate them. No, really, I hate them.) To make this work, I needed enough data to be able to model preferences in each genre, not just overall. It wasn’t enough to know, for each person, 10 or 20 movies that they liked; I needed to know a few comedies each person liked, a few mysteries, a few musicals (if any), etc. Which meant I needed a larger dataset overall, because there are a lot of genres.

Now, it wasn’t so hard to collect this data. We made a long list of movies, made sure we included a decent number from every genre we wanted to cover, and had people rate the movies on our list. (You can give people a long list, because they usually still remember a movie well enough to rate it long after they saw it.) Long story short, I had enough data to do what I wanted to do — model each genre separately — and my model seemed to work pretty well. (We tested against models that lumped all the movies together, and mine did better.) What I want to highlight is that if I hadn’t been able to collect as much data, my fine-grained approach probably wouldn’t have worked at all. If I only had a small dataset, I wouldn’t have been able to say anything about what was going on inside each genre, and grouping people based on all the movies lumped together would have been a better bet. The model wouldn’t have been very precise, but it would have used the little data I did have more efficiently.
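The mechanics are simple enough to sketch. Below is a toy version of the genre-by-genre idea in Python – hypothetical data and function names, nothing like the actual start-up code, just the shape of it: measure similarity between users within a single genre, then score the target user’s unseen movies in that genre by similarity-weighted ratings.

```python
# Toy sketch of genre-aware collaborative filtering (illustrative only;
# made-up movies and ratings, not the start-up's actual model).

def similarity(a, b):
    """Cosine similarity between two users over movies both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[m] * b[m] for m in common)
    den = (sum(a[m] ** 2 for m in common) ** 0.5 *
           sum(b[m] ** 2 for m in common) ** 0.5)
    return num / den if den else 0.0

def recommend(target, others, genre_of, genre):
    """Rank the target user's unseen movies in one genre, scored by
    ratings from similar users, where similarity is computed using
    that genre's movies only."""
    def in_genre(ratings):
        return {m: r for m, r in ratings.items() if genre_of[m] == genre}

    t = in_genre(target)
    scores = {}
    for other in others:
        o = in_genre(other)
        w = similarity(t, o)
        if w <= 0:
            continue  # no evidence of shared taste in this genre
        for movie, rating in o.items():
            if movie not in target:  # target hasn't seen it (any genre)
                scores[movie] = scores.get(movie, 0.0) + w * rating
    return sorted(scores, key=scores.get, reverse=True)

genre_of = {"Airplane!": "comedy", "Clue": "comedy",
            "Spaceballs": "comedy", "Cats": "musical"}
me = {"Airplane!": 5, "Clue": 4}                      # likes comedies
buddy = {"Airplane!": 5, "Clue": 5, "Spaceballs": 5, "Cats": 1}
print(recommend(me, [buddy], genre_of, "comedy"))     # -> ['Spaceballs']
```

Notice that with this tiny dataset, the musical “genre” has no rating overlap at all, so the model can say nothing there. That’s the data-quantity point in miniature: per-genre modeling needs per-genre data.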

The upshot is that models depend on data, and data availability (quantity and quality) is always a real world issue, not just a math issue. A model may make perfect sense in theory, but work badly in practice if reality gets in the way of gathering the data you need to run the model.

Education data is a great example here. There’s a class of models for measuring teacher and school performance, known broadly as value-added models (VAM). The idea is to try to isolate how much “value” a teacher or school adds to students’ learning, where learning is usually measured through test scores. Regardless of what you think about standardized testing, you should know that the modeling here is extremely challenging! The problem is that it’s very hard to break out the impact of a teacher or school from all the other, “external” factors that might affect a kid’s test scores (genetics, at-home support and preparation, attendance, schools attended in the past, just to name a few). To do this, you need a model to estimate a kid’s “expected” test score based on all the external factors. (The “value added” by the school or teacher is then supposed to be captured as the difference between this model-based expected score and the actual score.)
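To make the structure concrete, here is a deliberately stripped-down sketch of that calculation, with made-up data and a single external factor (last year’s score). Real VAMs use many covariates and much heavier statistical machinery, but the skeleton is the same: fit an expected-score model, then average each teacher’s actual-minus-expected gaps.

```python
# Minimal value-added sketch: expected score = linear function of the
# prior-year score; "value added" = average (actual - expected) per
# teacher. Hypothetical data; real models use many more external factors.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def value_added(students):
    xs = [s["prior"] for s in students]
    ys = [s["score"] for s in students]
    a, b = fit_line(xs, ys)  # the "expected score" model
    gaps = {}
    for s in students:
        expected = a + b * s["prior"]
        gaps.setdefault(s["teacher"], []).append(s["score"] - expected)
    return {t: sum(g) / len(g) for t, g in gaps.items()}

students = [
    {"teacher": "A", "prior": 70, "score": 70},
    {"teacher": "A", "prior": 90, "score": 90},
    {"teacher": "B", "prior": 70, "score": 74},
    {"teacher": "B", "prior": 90, "score": 94},
]
print(value_added(students))  # teacher B's students beat expectations
```

Even in this cartoon, the fragility is visible: if a kid’s prior-year score is missing (say, they moved from a district with a different test), the model has nothing to fit, and every external factor you leave out gets silently credited to, or blamed on, the teacher.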

To build such a model, you need to model the external factors, which means you need a huge amount of input data. You certainly want a history of past test scores (hard to collect if a kid has moved around between schools where different tests are given; hard to interpret even if you happen to have the data). You likely want to know something about income (typically, eligibility for reduced-price school lunch programs is used as a proxy for this; at my kids’ school, this data was apparently wrong for a couple of years). And you probably want to know something about support at home, out-of-school activities, and lots of other variables — well, good luck! The worst part is that the data gaps tend to be the biggest in the poorest schools (fewer resources to collect data, more kids going in and out, making the data problem harder to begin with). These are precisely the schools where it’s most important to model the challenges the kids face — and yet the data isn’t there to do it.

There’s starting to be a backlash against standardized testing, and against measuring teachers and schools by the results of those standardized tests. And there’s also a backlash to the backlash, with supporters of the VAM framework arguing that it’s the most objective measure of teacher performance and kids’ progress. But models and measures based on data that’s not there, and can’t be filled in, aren’t objective at all.

Two Kinds of Model Error

One winter night every year, New York City tries to count how many homeless people are out in its streets. (This doesn’t include people in shelters, because shelters already keep records.) It’s done in a pretty low-tech way: the Department of Homeless Services hires a bunch of volunteers, trains them, and sends them out to find and count people.

How do you account for the fact that you probably won’t find everyone? Plant decoys! The city sends out another set of volunteers to pretend to be homeless, to see if they actually get counted. (My social worker wife gets glamorous opportunities like this sent to her on a regular basis.) Once all the numbers are in, you can estimate the total number of homeless as follows:

  1. Actual homeless counted = Total people counted − Decoys counted
  2. Percent counted estimate = Decoys counted / Total decoys
  3. Homeless estimate = Actual homeless counted / Percent counted estimate

For example, say you counted 4080 people total out in the streets. And say you sent out 100 decoys and 80 of them got counted. Then the number of true homeless you counted is 4000 (= 4080 − 80), your count seems to capture 80% of the people out there, so your estimate for the true number of homeless is 4000 / 80% = 5000 (in other words, 5000 is the number that 4000 is 80% of).
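The three-step recipe is short enough to write down directly (numbers from the worked example):

```python
def homeless_estimate(total_counted, decoys_sent, decoys_counted):
    actual_counted = total_counted - decoys_counted    # step 1
    pct_counted = decoys_counted / decoys_sent         # step 2
    return actual_counted / pct_counted                # step 3

print(homeless_estimate(4080, 100, 80))  # -> 5000.0
```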

But it’s probably not exactly 5000, for two reasons:

  1. Random error. You happened to count 80% of your decoys, but on another day, you might have counted 78% of your decoys, or 82%, or some other number. In other words, there’s natural randomness in your model which leads to indeterminacy in your answer.
  2. Systematic error. When you count the homeless, you have some idea of where they’re likely to be. But you don’t really know. And your decoys are likely going to plant themselves in the same general places where you think the homeless are. Put another way, if there are a bunch of homeless in an old abandoned subway station that you have no idea exists, you’re not going to count them. And your decoys won’t know to plant themselves there, so you won’t have any idea that you’re not counting them.

The first kind of error is error inside your model. You can analyze it, and treat it statistically by estimating a confidence interval, e.g., I’m estimating that there are 5000 homeless out there, and there’s a 95% chance that the true number is somewhere between 4500 and 5500, say. The second kind of error is external; it comes from stuff that your model doesn’t capture. It’s more worrying because you don’t — and can’t — know how much of it you have. But at least be aware that almost any model has it, and that even confidence intervals don’t incorporate it.
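One way to see the difference between the two kinds of error: the random error shows up if you simulate the count many times; the systematic error never does. Here’s a small simulation sketch using the numbers from the example above (the true population of 5000 and the 80% detection rate are assumptions baked into the simulation, not things you’d know in practice):

```python
# Simulate many independent counting nights to see the *random* error.
# The systematic error, by construction, cannot appear here: the
# simulation only counts people it knows about, just like the model.
import random

random.seed(0)  # for reproducibility

def one_count(true_homeless=5000, decoys=100, p_detect=0.8):
    """Simulate one night's count and return the resulting estimate."""
    homeless_counted = sum(random.random() < p_detect
                           for _ in range(true_homeless))
    decoys_counted = sum(random.random() < p_detect
                         for _ in range(decoys))
    return homeless_counted / (decoys_counted / decoys)

estimates = sorted(one_count() for _ in range(1000))
lo, hi = estimates[25], estimates[-25]  # middle ~95% of simulated nights
print(f"~95% of simulated estimates fall between {lo:.0f} and {hi:.0f}")
```

The spread you see is the random error: same city, same method, different nights, different answers. The homeless in the unknown subway station are invisible to this simulation for exactly the same reason they’re invisible to the model, so no confidence interval built this way will ever account for them.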