Book Review: “Measuring Up” by Daniel Koretz

I wrote this last year as a guest post for mathbabe. Reposting it here with a few edits.

You’ve probably heard of high-stakes testing. Over the last ten or so years, many states have increased the impact of standardized test results on the students who take the tests (you have to pass to graduate), their teachers (who are often evaluated based on standardized test results), and their school districts (state funding depends on test results). New Jersey, where I live, has been putting such a teacher evaluation system in place (for a lot more detail and criticism, see here), though it now looks like some of it is being dialed back or delayed.

Before you think about high-stakes testing, you need to understand standardized testing in general. The excellent John Ewing pointed me to a pretty comprehensive survey of standardized testing called “Measuring Up,” by Harvard Ed School prof Daniel Koretz, who teaches a course there about this stuff. If you have any interest in the subject, the book is very much worth your time. But in case you don’t get to it, or just to whet your appetite, here are my top 10 takeaways:

  1. Believe it or not, people who write standardized tests aren’t idiots. Building effective tests is a difficult measurement problem! Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows. It’s also a good reminder that the way questions are phrased can matter a great deal.

  2. The reliability of a test is inversely related to the standard deviation of this distribution: a test is reliable if your score on it wouldn’t vary very much from one instance to the next. That’s a function of both the test itself and the circumstances under which people take it. More reliability is better, but the big trade-off is that increasing the sophistication of the test tends to decrease reliability. For example, tests with free form answers can test for a broader range of skills than multiple choice tests, but they introduce variability across graders, and even the same person may grade the same test differently before and after lunch. More sophisticated tasks also take longer (imagine a lab experiment as part of a test), which means fewer questions on the test and a smaller cross-section of topics being sampled, again meaning more noise and less reliability.

  3. A complementary issue is bias, which is roughly about people doing better or worse on a test for systematic reasons outside the domain being tested. Again, there are trade-offs: the more sophisticated the test, the more skills beyond what’s being tested it may be bringing in. One common way to weed out biased questions is to look at how people who score the same on the overall test do on each particular question: if you get variability you didn’t expect, that may be a sign of bias. So, if you take a cohort of boys and girls who do about equally well on the overall test, and you find that boys tend to do better than girls on Question 7, you would wonder if Question 7 is biased toward boys. It’s harder to do this check for more sophisticated tests, where each question is a bigger chunk of the overall test. It’s also harder if the bias is systematic across the whole test.

  4. Beyond the (theoretical) distribution from which a single student’s score is a sample, there’s also the (likely more familiar) distribution of scores across students. This depends both on the test and on the population taking it. For example, for many years, students on the eastern side of the US were more likely to take the SAT than those in the west, because only those students in the west who were applying to very selective eastern colleges took the test. As you can imagine, the score distributions were very different in the east and the west (and average scores tended to be higher in the west), but this didn’t mean that there was bias or that students or schools in the west were better as a whole.

  5. The shape of the score distribution across students carries important information about the test. If a test is relatively easy for most students, scores will be clustered to the right of the distribution, while if it’s hard, scores will be clustered to the left. This matters when you’re interpreting results: the first test is worse at discriminating among stronger students and better at discriminating among weaker ones, while the second is the reverse.

  6. The score distribution across students helps to communicate test results (you may not know right away what a score of 600 on a particular test means, but if you hear it’s one standard deviation above a mean of 500, that’s a decent start). It’s also important for calibrating tests so the results are comparable from year to year. In general, you want a test to have similar means and variances from one year to the next, but this raises the question of how to handle year-to-year improvement. This is particularly significant when educational goals are expressed in terms of raising standardized test scores.

  7. If you think in terms of the statistics of test score distributions, you realize that many of those goals of raising scores quickly are deluded. Koretz has a good phrase for this: the myth of the vanishing variance. The key point is that test score distributions are very wide, on all tests, everywhere, including countries that we think have much better education systems than we do. The goals we set for student score improvement (typically, a high fraction of all students taking a test several years from now are supposed to score above some threshold) imply a great deal of compression at the lower end of this distribution — compression that has never been seen in any country, anywhere. It sounds good to say that every kid who takes a certain test in four years will score as proficient, but that corresponds to a score distribution with much less variance than you’ll ever see. Maybe we should stop lying to ourselves?

  8. Koretz is highly critical of the recent trend to report test results in terms of standards (e.g., how many students score as “proficient”) instead of comparisons (e.g., your score is in the top 20% of all students who took the test). Standards and standards-based reporting are popular because Americans are worried that our students’ performance as a group is inadequate. The idea is that being near the top doesn’t mean much if the comparison group is weak, so instead we should focus on making sure every student meets an absolute standard needed for success in life. There are at least three problems with this. First, how do you set the standard, i.e., what does proficient mean, anyway? Koretz gives enough detail here to make it clear how arbitrary the standards are. Second, you lose information: in the US, standards are typically expressed in terms of just four bins (advanced, proficient, partially proficient, basic), and variation inside the bins is ignored. Third, even standards-based reporting tends to slide back into comparisons: since we don’t know exactly what proficient means, we’re happiest when our school (or district, or state) places ahead of others in the fraction of students classified as proficient.

  9. Koretz’s other big theme is score inflation for high-stakes tests. If everyone is evaluated based on test scores, everyone has an incentive to get those scores up, whether or not the high scores come from actual learning. If you remember anything from the book or from this post, remember this phrase: sawtooth pattern. The idea is that when a new high-stakes standardized test appears, average scores start at some base level, go up quickly as people figure out how to game the test, then plateau. If the test is replaced with another, the same thing happens: base, rapid growth, plateau. Repeat ad infinitum. Koretz and his collaborators did a nice experiment in which they went to a school district several years after it had replaced one high-stakes test with another, and administered the first test. Now that teachers weren’t teaching to the first test, scores on it reverted back to the original base level. Moral: score inflation is real, pervasive, and unavoidable, unless we bite the bullet and do away with high-stakes tests.

  10. While Koretz is sympathetic toward test designers, who live the complexity of standardized testing every day, he is harsh on those who (a) interpret and report on test results and (b) set testing and education policy, without taking that complexity into account. Which, as he makes clear, is pretty much everyone who reports on results and sets policy.
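The claim in takeaway 2, that fewer questions means noisier scores, is easy to check with a quick simulation. This is only a sketch under assumptions of my own, not anything taken from Koretz’s book: a hypothetical student answers every question correctly with the same fixed probability, independently of the other questions. The names `simulate_scores`, `score_spread`, and `p_correct` are made up for illustration.

```python
import random
import statistics

def simulate_scores(num_questions, p_correct=0.7, num_sittings=2000, seed=0):
    """Percent-correct scores for one hypothetical student over many sittings.

    Assumption (not from the book): each question is answered correctly
    with the same probability, independently of the others.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(num_sittings):
        correct = sum(rng.random() < p_correct for _ in range(num_questions))
        scores.append(100.0 * correct / num_questions)
    return scores

def score_spread(num_questions, **kwargs):
    """Standard deviation of the simulated percent-correct scores."""
    return statistics.pstdev(simulate_scores(num_questions, **kwargs))

if __name__ == "__main__":
    # Same student, same skill level; only the length of the test changes.
    for n in (10, 40, 160):
        print(n, "questions: spread of about", round(score_spread(n), 1), "points")
```

Under these assumptions the spread roughly halves each time the test gets four times longer, which is the trade-off described above: a test built from a handful of long, sophisticated tasks samples fewer items, so the same student’s score bounces around more from one sitting to the next.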

Final thoughts

If you think it’s a good idea to make high-stakes decisions about schools and teachers based on standardized test results, Koretz’s book offers several clear warnings.

First, we should expect any high-stakes test to be gamed. Worse yet, the more reliable tests, being more predictable, are probably easier to game (look at the SAT prep industry).

Second, the more (statistically) reliable tests, by their controlled nature, cover only a limited sample of the domain we want students to learn. Tests trying to cover more ground in more depth (“tests worth teaching to,” in the parlance of the last decade) will necessarily have noisier results. This noise is a huge deal when you realize that high-stakes decisions about teachers are made based on just two or three years of test scores.

Third, a test that aims to distinguish “proficiency” will be worse at distinguishing students elsewhere in the skills range, and may be largely irrelevant for teachers whose students are far away from the proficiency cut-off. (For a truly distressing example of this, see here.)

With so many obstacles to rating schools and teachers reliably based on standardized test scores, is it any surprise that we see results like this?


Monty Hall and Loss Aversion

After my last post, I started thinking about other ways to extend the Monty Hall problem to more doors. If you want to take your intuition in a very different direction, try this: there’s a car behind one of 100 doors. You, the player, get to pick 98 doors. Then Monty opens one of the two remaining doors, showing you the car isn’t there. He asks you if you want to stand pat or trade one of the doors you originally picked for the remaining unopened door.

The same general logic still applies. Your original win probability is 98/100. If you switch, you lose only if the car was behind the door you gave up (probability 1/100), so your win probability after switching is 99/100. Now I don’t know about you, but in this version, I find it really hard to make my brain grasp intuitively that there’s anything to be gained by switching. Which might be a little more evidence for a psychological basis for loss aversion, or the endowment effect, or whatever you want to call that phenomenon whereby people are reluctant to give up, or risk, what they already have. Or maybe it shows our inability (or unwillingness) to distinguish between numbers once they get really big.
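If your brain resists the way mine does, a simulation can stand in for intuition. Here is a minimal sketch of the pick-98 game. It assumes the player picks the 98 doors at random and, when switching, gives up one of them at random; the function names `play_98_door_game` and `win_rate` are just illustrative.

```python
import random

def play_98_door_game(rng, num_doors=100, switch=True):
    """Play one round of the pick-98 variant; return True if the player wins."""
    doors = list(range(num_doors))
    car = rng.choice(doors)
    picked = set(rng.sample(doors, num_doors - 2))
    remaining = [d for d in doors if d not in picked]
    # Monty opens one of the two unpicked doors, never the one hiding the car.
    openable = [d for d in remaining if d != car]
    opened = rng.choice(openable)
    unopened = next(d for d in remaining if d != opened)
    if switch:
        # Assumption: the player gives up one of the 98 doors at random.
        given_up = rng.choice(sorted(picked))
        picked = (picked - {given_up}) | {unopened}
    return car in picked

def win_rate(switch, trials=50_000, seed=1):
    """Fraction of simulated rounds won with the given strategy."""
    rng = random.Random(seed)
    wins = sum(play_98_door_game(rng, switch=switch) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("stand pat:", win_rate(switch=False))
    print("switch:   ", win_rate(switch=True))
```

The two rates should land close to 98/100 and 99/100, matching the argument above. The gain is real but small, which may be part of why it is so hard to feel.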

Think Like A Math Person I: Door 53

Which is easier to analyze: 3 things or 100 things? Hang on, don’t answer yet.

The Monty Hall problem is a probability puzzle that’s notorious for messing with people’s intuition. If you haven’t heard it before, it goes like this. On the TV game show Let’s Make a Deal, a contestant is trying to win a car, which is hidden behind one of three closed doors. The contestant picks a door. Then the host of the show, Monty Hall, gives her one chance to switch to another door. He doesn’t tell her whether or not the car is behind the door she picked, but he does narrow down her choice by revealing what’s behind one of the doors she didn’t pick — not the one with the car, of course. For example:

Contestant picks Door 2.

Monty opens Door 1, showing the car isn’t there.

Monty asks: Do you want to keep Door 2, or switch to Door 3?

Think about it for a little while if you haven’t heard it before. OK, a little longer. Assuming the game always works this way (Monty always gives you the choice to switch doors, and he always narrows your choice down to just two doors after you’ve picked an initial one), would you stay with your original door or switch?

The simplest intuition is that it doesn’t matter: two identical doors left, there’s no information to help you decide where the car is, it’s 50-50. This is wrong, but it’s very powerful. It’s hard to find the flaw in the logic, and really hard to explain the flaw in a way that convinces somebody else.

But now let’s play the game with 100 doors instead of three. Same basic rules: you pick a door, and Monty gives you an option to switch after narrowing the number of choices down to two. For example:

You pick Door 12 (let’s say), because your oldest kid just turned 12.

Monty starts opening doors, one by one. He opens Door 1, no car. Door 2, no car. He keeps opening doors, no car behind any of them. He skips Door 12, because that’s the one you picked. Door 13, no car. Door 14, no car. On and on: Door 51, no car. Door 52, no car. Door 54, no car. (Yes, he did skip Door 53, didn’t he.) 55, no car. He opens the remaining doors, all the way through Door 100, all empty.

So now we’re down to Door 12, which you picked as a 1-in-100 shot, and Door 53, which Monty just happened to skip when he was opening every other door. Still think it’s 50-50?

Meanwhile, in the next studio over, a MontyClone™ is running the show, and your friend Bob is playing the same game. Bob picks Door 77 out of 100, because his mother is 77. MontyClone starts opening doors in order: 1, 2, 3, all the way up through 67. He skips Door 68. Then he opens 69, 70, all the way through 100 (he skips 77 because that’s the door that Bob picked). Then he asks Bob to choose between the mysteriously skipped 68, and the original choice, 77. Again: do you really think it’s 50-50?

And in yet another studio… But you get the point.

What’s a little clearer in the 100-door version, I hope, is that the two doors — the one you picked and the one Monty skipped — aren’t symmetric. Because, unless your 1-in-100 shot at Door 12 was right to begin with, Monty always has to skip the door with the car behind it when he’s opening all the other doors. If the car is behind Door 53, he skips 53. If it’s behind 68, he skips 68. Put another way, you end up choosing between two doors at the end: yours and Monty’s. You picked yours at random, a 1-in-100 shot. But unless your door was right, i.e., 99 times out of 100, Monty’s door isn’t random at all. He knows where the car is, and by the rules of the game, he points you right to it. You just have to take the hint.

It’s the same with just three doors, just a little harder to see. Put aside the door you picked and focus on the single additional door that Monty skips. Unless your initial guess was right (a 1-in-3 chance), the skipped door is the one with the car. If you can keep your door and Monty’s door separate in your head, then the 50-50 intuition might go away. But things are a lot clearer with 100 doors than with just three.
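The asymmetry between your door and Monty’s door can also be checked numerically. The sketch below assumes the rules described above: Monty always opens every door except your pick and one other, and he never reveals the car. The function name `monty_hall_win_rate` is hypothetical, and the number of doors is a parameter so the 3-door and 100-door games run through the same code.

```python
import random

def monty_hall_win_rate(num_doors, switch, trials=50_000, seed=2):
    """Estimate the win rate when Monty narrows the game down to two doors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(num_doors)
        pick = rng.randrange(num_doors)
        # Monty opens every door except the player's pick and one other.
        # If the pick is wrong, the door he skips must be the car;
        # if the pick is right, he skips some other door at random.
        if pick != car:
            skipped = car
        else:
            skipped = rng.choice([d for d in range(num_doors) if d != pick])
        final = skipped if switch else pick
        wins += (final == car)
    return wins / trials

if __name__ == "__main__":
    for n in (3, 100):
        print(n, "doors, stay:  ", monty_hall_win_rate(n, switch=False))
        print(n, "doors, switch:", monty_hall_win_rate(n, switch=True))
```

Switching should win about 2 times in 3 with three doors and about 99 times in 100 with a hundred. Notice that the code never needs to open any doors explicitly: all that matters is which door Monty skips, which is the whole point.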

The approach here — replacing a small example with a larger one, and finding it actually simplifies things — may seem like a trick, but it’s actually a very common thing to do in math and physics. Look at the behavior in the large limit.

Here’s a more down-to-earth example. I started to follow the Giants when I lived in the Bay Area as a student, and I’ve been rooting for them in the World Series. My seven-year-old son is rooting for the Royals, because he saw them in person last year and likes their uniforms. Family conflict! But I’ve found it really relaxing to watch the Series together: someone will be happy no matter who wins! At first I was skeptical that I really felt this way: I’ve suffered through a lot of painful Red Sox defeats, and I figured it would hurt if my team lost, even if my kid was happy that his team won. But then I thought about the large limit. If I had 30 kids, one rooting for each major league team, then someone would be happy after every game, and I’d always have one really happy kid at the end of the season. Which actually sounds great! Looking at the large limit was the thing that convinced me. Financial theory tells you to diversify your investments; in my family, we’re diversifying our rooting interests as well.

My First Time

Doing math in school is usually about getting an answer. Doesn’t matter if you’re multiplying 3-digit numbers, integrating by parts, figuring out what happens when a train leaves Chicago at 6 AM, or counting paths through a maze. Doesn’t matter if you’re in grade school or high school, math class or math team. Problem. Answer. Find out if your answer was right. On to the next problem.

It’s a shame, because the first time I really thought hard about math was when this paradigm blew up. It happened around seventh grade, when we were learning how to convert repeating decimals into fractions. We had done a few standard problems (converting, say, 0.333… into 1/3, or maybe 0.424242… into 42/99, simplified to 14/33), when along came 0.999… Which, if you used the standard method (multiply by a power of 10, subtract the original from the result to cancel the infinitely repeating part, solve for the original), appeared to be equal to 1. And that was an answer I wasn’t remotely ready to accept.

Point (me): That couldn’t be right, because we started with 0-point-something, which is clearly less than 1.

Counterpoint (teacher): OK, but you had no problem turning 0.333… into 1/3, and we’re just repeating the same method when we turn 0.999… into 1.

Point: But I can turn 1/3 back into 0.333… by dividing 1 by 3, and I can’t turn 1 into 0.999… by dividing 1 by 1.

Counterpoint: You can. It’s just a funky kind of long division, with a remainder of 1 each time. At the first step, you’re dividing 1 by 1, and you say the answer is 0, remainder 1. From then on, you’re dividing 10 by 1, and you say the answer is 9, remainder 1, over and over again.

Point: That’s against the rules, the remainder has to be less than what you divide by!

Counterpoint: If you agree that 0.333… = 1/3, just multiply both sides by 3 and see what you get.

Point: But the decimal expansion of 1 is 1.000…! How can it have another one?

Back and forth we went (did I mention that I loved to argue?). It wasn’t question and answer anymore, but questions spawning questions. Was it really OK to say 10 times 0.999… was 9.999…, or were we pulling some strange extra little bit from infinity? It did look OK to multiply 0.333… by 10, but was that somehow suspect too? What was really going on out there at infinity, and what did it mean to be just a tiny smidgen less than 1? What were the real rules of long division, anyway?

For the first time, math seemed very open, up for grabs. My seventh grade mind wasn’t even sure what the answers to these questions could look like. My teachers said there were these things called limits, which helped you represent what happens out at infinity. So 0.999… wasn’t an ordinary number, it was a limit, but it was equal to 1, which was an ordinary number. My head spun. Eventually I declared that anything with infinitely repeating 9’s was undefined and called it a day. (That was right in a way: nobody defines infinite sums carefully in grade school, although I wouldn’t have said that, say, 0.333… was undefined too.) And everyone went back to the usual routine. Problem. Answer. On to the next problem.

But several aspects of the experience stayed with me to this day:

1. An expanded sense of what a math question could be, and what you could learn from it. Once in a while, you’ll hear kids complain about not understanding what a math problem means, or what it’s asking them to do. More frequently, at least these days, the parents are the ones complaining (and much more loudly, too). Often they’re right: hundreds of poorly written math problems get sent home every day. But sometimes it’s through trying to make sense of a question, whether someone else’s or your own, that you learn the most. Does 1 – 0.999… = 0? Why or why not? These were questions of a different kind, as far as you could get from the land of how many more marbles does Dorothy have than Fred. Behind the scenes, infinity was revealing itself, as the subject matter (where 0.999… finally reached 1, or didn’t), and also as the true scope of math. It was thrilling.

2. Years later, when I finally got to learn about limits, I paid a lot of attention. I’d been promised that they would resolve the mystery, and they did! Short summary in case you haven’t studied this stuff: the idea is that when you write down 0.999…, or any other decimal that doesn’t terminate, the dots mean that you’re not writing down an ordinary number in the literal sense. Instead, 0.999… is shorthand for a sequence of numbers (here 0.9, 0.99, 0.999, 0.9999, and so on), defined by some rule that pins down exactly what “and so on” means. (In this case, the rule is that you get the n-th element of the sequence by adding 9/10^n to the (n-1)-st element. For example, you start with 9/10 as the first element, add 9/100 to get the second, then add 9/1000 more to get the third.) Out at infinity, this sequence converges (gets arbitrarily close) to 1, meaning that you can’t squeeze any other number between 0.999… and 1. Once you have convergence, limit theory tells you that the arithmetic manipulations are OK: you’re allowed to write 3×0.333… = 0.999…, 10×0.999… = 9.999…, and so on. There’s a lot more to say here, and this wonderfully detailed yet accessible article by Jordan Ellenberg is very insightful on both the math and the underlying intuition.

It was amazing to me to learn all this. Back in seventh grade, I had gone from some ordinary-looking manipulations of sums and products to a set of questions that felt like philosophy, and now here was math providing real answers to those questions, on its own terms, putting me back on firm ground. In their way, the answers were as mind-expanding as the questions had been. How cool was that?

3. Speaking of those ordinary-looking manipulations, I would never naively trust them again. Maybe your eyes glazed over when you first encountered proofs in math class: why bother proving things that seem out-and-out obvious? But for me, after seventh grade, the obvious could be questionable (like those algebraic manipulations that led to such a weird outcome) or flat-out false (a single number really could have two separate decimal expansions). So when I finally got to proofs in school, I couldn’t have been happier. You mean you can actually prove it? You can carefully work through all the moving parts and identify the statements and methods that you can genuinely trust and use? Bring it on!

4. Faith in the long view. Once I heard someone say that math has two-minute problems, two-hour problems, two-day problems, two-month problems, and two-year problems. I wasn’t ready for multi-year territory as a seventh grader, already getting antsy after a couple weeks without a real answer. Being able to arrive at a satisfying solution eventually, years later, was a very big deal. For a long time afterwards, when I got stuck on a math problem, or on something else, I would recall how long it took to make sense of 0.999… And sometimes, with a little more work, a few hours or days or months later, I’d get unstuck. For what it’s worth, even this post took a few tries over a couple weeks to write.
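For anyone who wants to watch the sequence from point 2 close in on 1, here is a small sketch that builds it with exact fractions, so no floating-point rounding muddies the picture. The function names `partial_sum` and `gap_to_one` are my own labels, not standard terminology.

```python
from fractions import Fraction

def partial_sum(n):
    """n-th element of the sequence 0.9, 0.99, 0.999, ... as an exact fraction."""
    total = Fraction(0)
    for k in range(1, n + 1):
        # Each step adds 9/10^k to the previous element.
        total += Fraction(9, 10 ** k)
    return total

def gap_to_one(n):
    """How far the n-th element still is from 1."""
    return 1 - partial_sum(n)

if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, partial_sum(n), gap_to_one(n))
```

The gap after n steps is exactly 1/10^n, so it drops below any positive number you care to name once n is large enough. That is all “converges to 1” means here, and it is why no other number fits between 0.999… and 1.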

So, kids, don’t be afraid of questions that might be a little unclear, or that don’t point you directly to an answer. And parents, don’t rush to ridicule that confusing math homework sheet on Facebook. Maybe a little confusion is part of the territory, and just means that you haven’t solved the problem yet. Not understanding can make you frustrated, but it can also mean you’re on the cusp of learning something. Or even that the learning is already underway.


Watching Game 5 of the Giants-Cardinals series. Top of the first, Cards have runners on first and second, one out. The batter hits a line drive to third base. Giants’ third baseman Pablo Sandoval leaps up, catches the ball (the batter’s out), throws quickly to second. Looks like the ball gets there just a hair before the runner on second can dive back in. Umpire calls the runner out at second: double play! Here comes the Cards’ manager to argue with the ump.

Except that this year, baseball uses instant replay. Here’s how it works: a manager has the right to challenge most plays, asking for an umpire’s call to be overturned based on a review of the replay. The caveat is that if you lose a challenge (meaning that after a review of the replay, the call on the field is upheld), you also lose the right to challenge for the rest of the game. So when you challenge, especially early in the game, you better be sure that the umpire’s wrong.

The Cards’ manager speaks briefly with the umpire. Meanwhile, someone on the Cards’ side is reviewing the replay. They must decide that the umpire’s probably right (they don’t challenge and risk losing the right to challenge in the future), because the manager returns to the dugout. It all takes less than a minute.

When replay was introduced, the worry was that managers challenging calls all the time would slow down the game. Only here it feels like it’s actually sped up the game. If you’ve watched enough baseball, you’ve seen many long arguments, frustrated managers venting endlessly to umps who never had any mechanism to change their mind. But with the new rule, the manager has (1) a lot more control, and (2) a strong incentive to be correct. The Cards’ manager gets to make a decision, he decides the call against his team was right, and we quickly move on.

Empowering people, based on the right incentives, can go a long way.

Suspense as Suspension of Time

I figured the ballgame would be over by the time I needed to head out. I was going over to Tierney’s Tavern in Montclair, NJ (I hope your town has a place like this) to see my kids’ awesome music teacher Myrna play a solo set on guitar. The Giants-Nationals playoff game I had been keeping tabs on was winding down (1-0 Nats after 8 innings), apparently in plenty of time for me to get to the show. But then the visiting Giants (my favorite National League team, going back to my Bay Area days) tied it up on a two-out double in the top of the 9th. The game was well into the 10th inning, with no sign of ending, when I drove off to the tavern.

At Tierney’s, I found a cozy stool with a good view of the stage, the bar, and — importantly — the TV over the bar. Myrna was playing some seriously hard-edged rock and roll, and the Giants looked like they might take it in the 12th when they got a man to third with one out. But the rally died (popout, groundout). Yusmeiro Petit came on to pitch for SF, Myrna wrapped up, and Thee Volatiles took the stage at Tierney’s. Petit, a second line starter who didn’t quite make it into the shorter playoff rotation, seemed shaky at the beginning, then settled in. He looked like he could pitch for a while. Thee Volatiles were tight from the first chord, filling the room with the kind of garage rock I used to hear all over Boston, and could never resist, in the 80’s. (These guys are from New Jersey, but once you know a certain kind of sound, you recognize it anywhere.) Petit pumped in strikes, the Nats’ pitchers kept pace, and Thee Volatiles banged out chords: a symphony of forward momentum.

Songs and innings raced by. Thee Volatiles finished up. Most of the crowd had come to see them, and now began to drift out of the room. I wondered how many people were left at the ballpark. The night’s last act, Karyn Kuhl (say her name out loud to realize how great it is) took the stage. She had an electric guitar and a small rhythm section: bass, drums. Within five minutes, I was transfixed: I had been expecting a local stalwart, and here instead was the second coming of PJ Harvey, just arrived in Jersey to play a private gig for 30 people. The music left the garage, headed out into vast spaces. The room appeared emptier but wasn’t, really; the sound pulsed, flowed, filled every space it could find, mocked everyone who had left early. Things didn’t feel so linear anymore: was it the 15th inning now? The 16th?

In an instant, the game comes back in sharp focus. Brandon Belt, up for the Giants, starts his swing — smooth, controlled, deliberate — and the other players on the field fade away. Somehow the ball is on a tee, and then it’s headed for the upper deck in right field, a line that turns into a parabola, gravity’s rainbow at the ballpark. 2-1 Giants. Suddenly we are back on the clock, starting to count down: midnight approaching, three outs left for the Nats. The visitors celebrate, wearily but defiantly, in their dugout. This game has made everyone old.

We head to the bottom of the inning, the Nats’ last ups. Turns out it is the 18th: we have played nine innings and then nine more, an impromptu doubleheader. Petit, who has carried a starter’s workload (six innings) after all in this unscheduled nightcap, is out of the game, rookie Hunter Strickland in to pitch for the Giants. I saw Strickland give up two towering homers to the Nats the day before, so I am more than a little terrified.

Strickland gets one out, then faces Nat leadoff hitter Denard Span, a .300 hitter up for the eighth time. Two quick strikes, then three balls. Now one foul ball after another: straight down, off to the side. The count is already full, but the at bat has infinite capacity, getting fuller still with every pitch. Karyn is singing about ghosts leaving. The music swirls, builds. Song and at bat go on, pitcher and batter take their time, another ball is fouled off, jagged notes twist in the air. The suspense, on stage and on TV, is killing me. At that moment, hearing the music, seeing the game, I realize I’m so tense not just because I don’t know what will happen, but also because I don’t know when. There’s so much suspense because time is suspended. Suspense, suspended, pend, meaning hang. Time hangs, and everything is uncertain.

Eventually the clock starts back up again, as it must. Span hits an ordinary grounder to first base, the second out of the inning. Karyn pivots into the Ramones’ “I Just Wanna Have Something to Do,” dedicates it to Thee Volatiles because it is straight ahead punk rock, forward momentum again. At some point midnight comes. One more song, one more out (a fly ball to right field), and it’s over, in Montclair, NJ and in Washington, DC. Sometimes you don’t know if you can explain what you’ve just seen and heard, but you know you have to try.