How Value Added Models are Like Turds

“Why am I surrounded by statistical illiterates?” — Roger Mexico in Gravity’s Rainbow

Oops, they did it again. This weekend, the New York Times put out this profile of William Sanders, the originator of evaluating teachers using value-added models based on student standardized test results. It is statistically illiterate, uses math to mislead and intimidate, and is utterly infuriating.

Here’s the worst part:

When he began calculating value-added scores en masse, he immediately saw that the ratings fell into a “normal” distribution, or bell curve. A small number of teachers had unusually bad results, a small number had unusually good results, and most were somewhere in the middle.

And later:

Up until his death, Mr. Sanders never tired of pointing out that none of the critiques refuted the central insight of the value-added bell curve: Some teachers are much better than others, for reasons that conventional measures can’t explain.

The implication here is that value added models have scientific credibility because they look like math — they give you a bell curve, you know. That sounds sort of impressive until you remember that the bell curve is also the world’s most common model of random noise. Which is what value added models happen to be.

Just to replace the Times’s name dropping with some actual math, bell curves are ubiquitous because of the Central Limit Theorem, which says that any variable that depends on many similar-looking but independent factors looks like a bell curve, no matter what the unrelated factors are. For example, the number of heads you get in 100 coin flips. Each single flip is binary, but when you flip a coin over and over, one flip doesn’t affect the next, and out comes a bell curve. Or how about height? It depends on lots of factors: heredity, diet, environment, and so on, and you get a bell curve again. The central limit theorem is wonderful because it helps explain the world: it tells you why you see bell curves everywhere. It also tells you that random fluctuations that don’t mean anything tend to look like bell curves too.
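To make the coin-flip example concrete, here is a minimal simulation sketch. The trial count, the seed, and the function names are arbitrary choices for illustration, not anything from the Times article or from value-added models. It repeats the 100-flip experiment many times and prints a crude text histogram of the head counts, which should come out roughly bell-shaped and centered near 50.

```python
import random
from collections import Counter


def count_heads(num_flips, rng):
    """Flip a fair coin num_flips times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(num_flips))


def simulate(num_trials=10_000, num_flips=100, seed=0):
    """Repeat the coin-flip experiment and tally how often each head count occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(count_heads(num_flips, rng) for _ in range(num_trials))


def text_histogram(tally, width=50):
    """Build a crude text histogram, one row per observed head count."""
    peak = max(tally.values())
    rows = []
    for heads in sorted(tally):
        bar = "#" * round(width * tally[heads] / peak)
        rows.append(f"{heads:3d} {bar}")
    return "\n".join(rows)


tally = simulate()
print(text_histogram(tally))
```

Nothing in this simulation measures anything about anyone; the bell shape comes entirely from adding up many independent flips.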

So, just to take another example, if I decided to rate teachers by the size of the turds that come out of their ass, I could wave around a lovely bell-shaped distribution of teacher ratings, sit back, and wait for the Times article about how statistically insightful this is. Because back in the bad old days, we didn’t know how to distinguish between good and bad teachers, but the Turd Size Model™ produces a shiny, mathy-looking distribution — so it must be correct! — and shows us that teacher quality varies for reasons that conventional measures can’t explain.

Or maybe we should just rate news articles based on turd size, so this one could get a Pulitzer.

 


How I Learned to Stop Worrying and Love Pythagoras

 

Maybe the best thing about the Pythagorean theorem is how it puts math and non-math people on a pretty equal footing. We all know what it says (right triangle, squares of sides, hypotenuse), we all agree it’s Important Math with a capital M… and most of us don’t have much idea, if any, why it’s true. Seriously. Ask another math person if you don’t believe me. If you’re lucky, they might point you to a picture that looks more or less like this:

[Figure: Pythagoras-proof-anim.svg]

This is sometimes called a proof without words, but here are a few words to guide you, just in case. We’ve got the usual notation: a and b are sides of a right triangle, c is the hypotenuse. We build a super-square with side length a+b and break it up in two ways. On the right, we divide it into one (white) sub-square with side length a (area a^2), another with side b (area b^2), and four (colored) copies of the triangle. On the left, we rearrange the four triangles so that their complement is a new white sub-square with side c (area c^2). White area on the left = white area on the right, so c^2 = a^2 + b^2.
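For readers who prefer symbols to pictures, the same bookkeeping can be written as a short worked equation. This is only a restatement of the rearrangement argument, using the fact that each of the four triangles has area ab/2:

```latex
\begin{align*}
(a+b)^2 &= a^2 + b^2 + 4 \cdot \tfrac12 ab && \text{(two white squares plus four triangles)} \\
(a+b)^2 &= c^2 + 4 \cdot \tfrac12 ab       && \text{(one white square plus four triangles)} \\
\Longrightarrow \quad a^2 + b^2 &= c^2.
\end{align*}
```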

Lovely as this is, it feels like a nifty conjuring trick. The Pythagorean equation is the most direct thing in the world, a^2 + b^2 = c^2, and the best we can do is to rearrange triangles inside a big square? Surely there must be a way to cut up c^2 into a^2 and b^2 directly.

Well, there is! I first ran across what I’m about to show you a few weeks ago, loved it, and was surprised that (1) I hadn’t seen it before and (2) it doesn’t seem to be widely known, though the idea actually goes back to Euclid. Perhaps you’ll feel the same way once you see it. Here goes:

The set-up. What’s so special about right triangles? Well, one thing is that they have an amazing self similarity property.  Draw a line segment out from the vertex with the right angle toward the hypotenuse and perpendicular to it. It divides our original right triangle up into two smaller ones:

[Figure: right-triangle]

Let’s stick to the same notation we had before. Let c be the hypotenuse of our original right triangle, running along the bottom. Let a and b be the sides, a the one on the left, b the one on the right. Then a is the hypotenuse of the green right triangle on the left, and b is the hypotenuse of the blue right triangle on the right. Obviously (foreshadowing), the areas of the two smaller triangles add up to the area of the original.

The self similarity property is that the two smaller right triangles are both similar to the original one! In other words, all three triangles have the same three interior angles, which means that you can rotate and scale each one into any of the others. Put simply, all three triangles have the same shape. Can you see why? Let’s compare those interior angles: the big triangle has a right angle of 90 degrees, and two others, which we will call \theta (say the one on the left, in the green triangle) and \phi (on the right, in the blue triangle). The key point is that the angles of any triangle have to add up to 180 degrees, so two angles of a triangle always determine the third. The green triangle has an angle of \theta on the left that it inherits from the original big triangle, and a 90 degree angle in the middle, so its third angle, at the top, must be \phi. (Essentially: the green triangle and the big triangle have two angles in common, so they must have all three in common.) Similarly, the blue triangle has an angle of \phi on the right that it inherits from the original triangle, and a 90 degree angle in the middle (again, two angles in common), so its third angle must be \theta. Same angles, same shape.

The pay-off. Stare for a minute at those three similar triangles, with the areas of the two smaller ones adding up to the area of the bigger one. Wouldn’t it be great if the triangle with hypotenuse a had area a^2, the one with hypotenuse b had area b^2, and the one with hypotenuse c had area c^2? It’s not true, of course. But it’s almost true! If you read my last post, you know that in fact the area of a right triangle with hypotenuse c and interior angle \theta is

\frac14 \cdot \sin(2 \theta) \cdot c^2.

The green and blue triangles have exactly the same interior angles as the big one. So their areas are given by exactly the same formula, with c replaced by a and b, respectively. The areas have to add up, so we have:

\frac14 \cdot \sin(2 \theta) \cdot a^2 + \frac14 \cdot \sin(2 \theta) \cdot b^2 = \frac14 \cdot \sin(2 \theta) \cdot c^2.

Now just divide out the common factor of \frac14 \cdot \sin(2 \theta), and you’re left with

a^2 + b^2 = c^2.
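As a numerical sanity check (not a proof), the area bookkeeping can be tried on one concrete triangle. The hypotenuse of 5 and the 37-degree angle below are arbitrary choices, and the function name is made up for this sketch. Note that the legs are computed with sine and cosine, which already builds in the theorem, so this only confirms that the formulas fit together.

```python
import math


def right_triangle_area(hypotenuse, theta):
    """Area of a right triangle with the given hypotenuse and acute angle theta (radians)."""
    return 0.25 * math.sin(2 * theta) * hypotenuse ** 2


c = 5.0                      # hypotenuse of the big triangle (arbitrary)
theta = math.radians(37.0)   # angle at the left vertex (arbitrary)

a = c * math.cos(theta)      # left side, hypotenuse of the green triangle
b = c * math.sin(theta)      # right side, hypotenuse of the blue triangle

big = right_triangle_area(c, theta)
green = right_triangle_area(a, theta)
blue = right_triangle_area(b, theta)

# The formula should agree with the usual "half the product of the legs".
print(math.isclose(big, 0.5 * a * b))
# The two small areas should add up to the big one.
print(math.isclose(green + blue, big))
# After dividing out the common factor, what is left is the Pythagorean equation.
print(math.isclose(a ** 2 + b ** 2, c ** 2))
```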

What just happened? The way Pythagoras’s equation fell out of equating areas may seem like a bit of a magic trick too, but it’s actually based on the very fundamental idea of scale invariance. To recap: we (1) wrote a completely explicit formula for the area of a right triangle, (2) equated formulas corresponding to equal areas, and (3) found that the bulk of the formulas, everything except the part corresponding to the Pythagorean theorem, went away. The key to understanding all that is this picture:

[Figure: square]

It shows the area of a right triangle embedded in the area of the corresponding square, with the hypotenuse matching one side of the square. The precise formula for the ratio of the areas, \frac14 \cdot \sin(2 \theta), doesn’t matter so much — what matters is that when we blow this picture up or down, the ratio of the areas doesn’t change. That’s scale invariance. If the two smaller triangles add up to the big triangle, then the squares corresponding to the smaller triangles have to add up to the square corresponding to the big triangle. And that’s exactly the Pythagorean theorem.

I like to think of a^2, b^2, and c^2 as units of area corresponding to each triangle. In a nutshell, the Pythagorean theorem decomposes a right triangle into two smaller, similar ones, and says that if the triangles add up, the units of area have to add up too. It’s deep, it’s direct, and I’ll never forget it. How about you?

How to Count

The other day I saw a math question disguised as a baseball trivia question. Here it is:

How many states don’t have major league baseball teams?

Let’s see: there’s Alaska, Arkansas,… Sure, it might be hard to list them all, but why am I calling this a math question?

Well, it doesn’t ask us to list all the states without baseball teams, it asks us to count them. Of course you can count things directly, by listing every one, but that’s not always as easy as it might seem. Maybe you can list all 50 states off the top of your head, and keep track as you go along of which ones don’t have teams, but I’m pretty sure I’ll overlook a few states. (I thought I could organize the states alphabetically, but I ended up giving up once I thought I got through the A’s, and I forgot Alabama!)

So how do you count things indirectly, without listing them? For a start, let’s reframe the question:

How many states DO have major league baseball teams?

If we can answer one question, we can answer the other: if, say, 20 states out of 50 have teams, then 30 don’t. But doesn’t the question feel a little easier when you ask it this second way? Pause here with me for just a moment: why is that?

One reason is that relatively few states have teams, and the ones that do are likely to be the better known ones, so if you were going to try to count by listing, listing the states that have teams is probably easier than listing the ones that don’t. But the real reason the alternate formulation helps is that you don’t have to count by listing — at least not by listing states. You could count by listing teams.

The Red Sox play in Massachusetts — that’s one state. The Yankees play in New York — that’s a second. The Giants play in California — a third. The A’s play in California too, but we already counted that. And so on.

We can make this process a little more organized if we use the structure of the baseball leagues. There are 30 major league baseball teams and they are currently divided evenly into two leagues: 15 in the American League, 15 in the National. Each league has 3 divisions — East, Central, and West — and each division has 5 teams. In other words: 30 teams broken up into 6 divisions of 5.

Doesn’t it feel a lot easier to go through 6 divisions of 5 than to go through 50 states? Let’s do it. I write this off the top of my head, in real time:

AL East: Boston Red Sox (MA, 1), New York Yankees (NY, 2), Baltimore Orioles (MD, 3), Toronto Blue Jays (Canada, not a state), Tampa Bay Rays (FL, 4)

AL Central: Kansas City Royals (MO, 5), Detroit Tigers (MI, 6), Cleveland Indians (OH, 7), Minnesota Twins (MN, 8), Chicago White Sox (IL, 9)

AL West: Oakland A’s (CA, 10), Houston Astros (TX, 11), Texas Rangers (TX, repeat state), California Angels (CA, repeat state), Seattle Mariners (WA, 12)

NL East: Washington Nationals (DC, not a state), New York Mets (NY, repeat state), Philadelphia Phillies (PA, 13), Miami Marlins (FL, repeat state), Atlanta Braves (GA, 14)

NL Central: St. Louis Cardinals (MO, repeat state), Pittsburgh Pirates (PA, repeat state), Milwaukee Brewers (WI, 15), Cincinnati Reds (OH, repeat state), Chicago Cubs (IL, repeat state)

NL West: Arizona Diamondbacks (AZ, 16), Colorado Rockies (CO, 17), San Diego Padres (CA, repeat state), Los Angeles Dodgers (CA, repeat state), San Francisco Giants (CA, repeat state)

And there you have it: 17 distinct states with teams, so 33 states without. And while this problem isn’t winning anybody the Fields Medal, it does illustrate two very important principles of counting, and math in general:

1. Find and use correspondences. When we asked which states have teams, we set up an implicit correspondence between states and teams. A way to make that correspondence more explicit is to reframe the question yet again, this time in terms of team-state pairs:

How many pairs (S, T) are there, where S is a state, T is a team that plays in that state, and no state appears more than once?

This might sound needlessly complicated, but math people actually like to talk this way! (Remember the definition of relations and functions the first time you saw it? Your eyes probably glazed over; mine sure did.) We use this language because it brings to the surface the duality inherent in the set-up: states and teams are paired. When you have pairs, you get to choose how to enumerate them: over the first entry, or over the second. And in this case, the second is the way to go, because…

2. More structure is better. The set of states seems sort of amorphous. You can try to break it up into regions (New England, Mid-Atlantic, Midwest,…), but it’s not totally clear how to do it. Whereas the set of baseball teams has a very clear structure: six by five. I lied in one place when I told you I was listing baseball teams in real time. When I got to the NL Central, I put down three of the five teams, and then spaced on what the other two were. But I knew there had to be five, and I knew about where they should be geographically. I remembered the other two within a minute.
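The count-by-teams idea can be sketched in a few lines of code. The table below just copies the team-to-state assignments from the division lists in this post, team names as written there, with None marking the two teams that don't play in a state. The variable names are made up for illustration. Putting the states into a set takes care of the repeats automatically.

```python
# Team -> home state, copied from the six division lists in the post.
# None marks teams that do not play in a US state (Canada, DC).
TEAM_STATE = {
    # AL East
    "Boston Red Sox": "MA", "New York Yankees": "NY", "Baltimore Orioles": "MD",
    "Toronto Blue Jays": None, "Tampa Bay Rays": "FL",
    # AL Central
    "Kansas City Royals": "MO", "Detroit Tigers": "MI", "Cleveland Indians": "OH",
    "Minnesota Twins": "MN", "Chicago White Sox": "IL",
    # AL West
    "Oakland A's": "CA", "Houston Astros": "TX", "Texas Rangers": "TX",
    "California Angels": "CA", "Seattle Mariners": "WA",
    # NL East
    "Washington Nationals": None, "New York Mets": "NY", "Philadelphia Phillies": "PA",
    "Miami Marlins": "FL", "Atlanta Braves": "GA",
    # NL Central
    "St. Louis Cardinals": "MO", "Pittsburgh Pirates": "PA", "Milwaukee Brewers": "WI",
    "Cincinnati Reds": "OH", "Chicago Cubs": "IL",
    # NL West
    "Arizona Diamondbacks": "AZ", "Colorado Rockies": "CO", "San Diego Padres": "CA",
    "Los Angeles Dodgers": "CA", "San Francisco Giants": "CA",
}

# Enumerate over teams (the structured side of the pairing); the set drops repeat states.
states_with_teams = {state for state in TEAM_STATE.values() if state is not None}

print(len(TEAM_STATE))               # teams listed
print(len(states_with_teams))        # distinct states with a team
print(50 - len(states_with_teams))   # states without a team
```

The six-by-five structure also gives a cheap check on the input: if the table doesn't have 30 entries, a team has been forgotten.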

Counting has a rich and noble history. Also a fancier name: combinatorics. And while the subject, perhaps like much of math, might seem like a bag of tricks when you first encounter it, it has some clear guiding principles. Look for structures, and try to transform your problem so you can make use of those structures. These principles are at work all over, so keep an eye out for them!

Ee-ther/Ai-ther: Calling the Whole Thing Off at the Science Museum

The papers (the Boston Globe, Time, and others) were abuzz yesterday about a supposed error in a math exhibit at the Boston Museum of Science. Most of the interest in the story came from the fact that the issue (described as a minus sign instead of a plus sign in a formula for the golden ratio) was pointed out by a 15-year-old. Frustratingly, none of the articles I saw included any actual math, though if you’re familiar enough with the golden ratio, you might guess even from the very brief description above that the fuss was probably about a difference of convention rather than any kind of serious mistake.

And, right on schedule, today the Globe reports that the exhibit is correct after all. So what’s going on?

Let’s start with what we mean by “golden ratio.” I’ve posted about it before, in the context of ratios of successive Fibonacci numbers, which have the golden ratio as their limit. Here is a picture:

[Figure: SimilarGoldenRectangles.svg]

In this picture, the small rectangle (with one side of length a and the other of length b) and the big rectangle (with one side of length a+b and the other of length a) are supposed to be similar, meaning that the ratios of their sides are the same. In other words, if you write the length of the longer side on top, a/b = (a+b)/a. You could also put the length of the shorter side on top and get an equivalent equation: b/a = a/(a+b). Either way, cross-multiplying, we get:

a² = b(a+b),

or

a² − ab − b² = 0.

This equation has lots of pairs of solutions (a,b). You could find them using the quadratic formula, in one of two ways. If you treat a as the variable, you can solve for it in terms of b:

a = (b ± √(b² + 4b²)) / 2 = b·(1 ± √5) / 2.

But the equation is pretty symmetrical, and you can also solve for b in terms of a:

b = (−a ± √(a² + 4a²)) / 2 = a·(−1 ± √5) / 2.

We need to pare down our solutions just a bit. Knowing that a and b are both lengths of rectangle sides, we should make sure they are both positive. 1 − √5 and −1 − √5 are not positive, so we throw them out, leaving us with

a = b·(1 + √5) / 2   and   b = a·(−1 + √5) / 2.

Once we know this, it’s easy to talk about ratios of sides. The ratio of the longer side to the shorter side is a/b. Taking the equation a = b·(1 + √5) / 2 and dividing both sides by b, we see that a/b = (1 + √5) / 2 = 1.61803… And the ratio of the shorter side to the longer side is b/a, which by similar logic is just (−1 + √5) / 2 = a/b − 1 = 0.61803… (We can also deduce b/a = a/b − 1 directly from the initial equation a/b = (a+b)/a, because (a+b)/a is just 1 + b/a.)

Pictorially, if the square in our initial picture is 1 × 1 (a = 1), then b = (−1 + √5) / 2 = 0.61803… (the short side of the small rectangle), and a + b = (1 + √5) / 2 = 1.61803… (the long side of the big rectangle).
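A few lines of code can confirm these numbers. This is only a numerical sketch with made-up variable and function names. It takes the long-to-short ratio as (1 + √5) / 2 and the short-to-long ratio as (√5 − 1) / 2, checks the similarity condition a/b = (a+b)/a for the 1 × 1 square, and also looks at the ratios of successive Fibonacci numbers mentioned earlier.

```python
import math

long_over_short = (1 + math.sqrt(5)) / 2    # a/b, about 1.61803
short_over_long = (math.sqrt(5) - 1) / 2    # b/a, about 0.61803

# The 1 x 1 square from the picture: a = 1, b is the short side of the small rectangle.
a = 1.0
b = a * short_over_long

print(math.isclose(a / b, (a + b) / a))                   # the two rectangles are similar
print(math.isclose(long_over_short * short_over_long, 1)) # the ratios are reciprocals
print(math.isclose(long_over_short - short_over_long, 1)) # and they differ by exactly 1


def fibonacci_ratio(steps):
    """Ratio of successive Fibonacci numbers after the given number of steps."""
    previous, current = 1, 1
    for _ in range(steps):
        previous, current = current, previous + current
    return current / previous


print(fibonacci_ratio(30), long_over_short)
```

Either number carries the same information, since each is the reciprocal of the other and they differ by exactly 1.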

So what is the golden ratio? Well, which ratio do you want — long side to short side or short to long? Do you say tom-ay-to or tom-ah-to? Which of the two we call golden is unimportant; what matters is that the picture, and all the math around the ratio, are the same either way. Which should we take as the golden ratio? Ee-ther! Or maybe ai-ther!

We think of math as being about deduction and absolute right answers, but it is also full of decisions and conventions. Sometimes the decisions make a difference: we decide to make .9999… equal to 1 (by deciding on certain rules for doing math with infinite sums), and this has consequences across the subject (decimal representations are no longer unique). But sometimes the decisions are only conventions, just a way of fixing language or notation and no more, and don’t matter very much.

We do, however, need to keep track of what conventions we’re using. The 15-year-old in the news stories probably learned that the golden ratio is (1 + √5) / 2, which is the more common formulation. Then, at the Science Museum, he saw this (photo from the latest Globe article):

[Photo: the museum’s golden ratio display]

It looked wrong; he was sure it should say (√5 + 1) / 2, not (√5 − 1) / 2. But read the fine print: the short side divided by the long side. That ratio is indeed (√5 − 1) / 2, as the display claims. The Science Museum just happened to frame their display in terms of the opposite ratio from the one the student learned. There’s nothing wrong with that, but we need to be aware that which version of the ratio we use is a mathematical convention for us to choose, not a mathematical fact set in stone.

Nick Kristof is not Smarter than an 8th Grader

About a week ago, Nick Kristof published this op-ed in the New York Times. Entitled Are You Smarter than an 8th Grader, the piece discusses American kids’ underperformance in math compared with students from other countries, as measured by standardized test results. Kristof goes over several questions from the 2011 TIMSS (Trends in International Mathematics and Science Study) test administered to 8th graders, and highlights how American students did worse than students from Iran, Indonesia, Ghana, Palestine, Turkey, and Armenia, as well as traditional high performers like Singapore. “We all know Johnny can’t read,” says Kristof, in that finger-wagging way perfected by the current cohort of New York Times op-ed columnists; “it appears that Johnny is even worse at counting.”

The trouble with this narrative is that it’s utterly, demonstrably false.

My friend Jordan Ellenberg pointed me to this blog post, which highlights the problem. In spite of Kristof’s alarmism, it turns out that American eighth graders actually did quite well on the 2011 TIMSS. You can see the complete results here. Out of 42 countries tested, the US placed 9th. If you look at the scores by country, you’ll see a large gap between the top 5 (Korea, Singapore, Taiwan, Hong Kong, and Japan) and everyone else. After that gap comes Russia, in 6th place, then another gap, then a group of 9 closely bunched countries: Israel, Finland, the US, England, Hungary, Australia, Slovenia, Lithuania, and Italy. Those made up, more or less, the top third of all the countries that took the test. Our performance isn’t mind-blowing, but it’s not terrible either. So what the hell is Kristof talking about?

You’ll find the answer here, in a list of 88 publicly released questions from the test (not all questions were published, but this appears to be a representative sample). For each question, a performance breakdown by country is given. When I went through the questions, I found that the US placed in the top third (top 14 out of 42 countries) on 45 of them, the middle third on 39, and the bottom third on 4. This seems typical of the kind of variance usually seen on standardized tests. US kids did particularly well on statistics, data interpretation, and estimation, which have all gotten more emphasis in the math curriculum lately. For example, 80% of US eighth graders answered this question correctly:

Which of these is the best estimate of (7.21 × 3.86) / 10.09?

(A) (7 × 3) / 10   (B) (7 × 4) / 10   (C) (7 × 3) / 11   (D) (7 × 4) / 11

More American kids knew that the correct answer was (B) than Russians, Finns, Japanese, English, or Israelis. Nice job, kids! And let’s give your teachers some credit too!
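For anyone who wants to double-check the answer key, here is a tiny sketch that computes the exact value and finds the closest of the four choices. The dictionary and variable names are made up for illustration.

```python
exact = 7.21 * 3.86 / 10.09

# The four rounded expressions offered as answer choices.
choices = {
    "A": 7 * 3 / 10,
    "B": 7 * 4 / 10,
    "C": 7 * 3 / 11,
    "D": 7 * 4 / 11,
}

# Pick the choice whose value is closest to the exact result.
best = min(choices, key=lambda label: abs(choices[label] - exact))
print(round(exact, 3), best)
```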

But Kristof isn’t willing to do either. He has a narrative of American underperformance in mind, and if the overall test results don’t fit his story, he’ll just go and find some results that do! Thus, the examples in his column. Kristof literally went and picked the two questions out of 88 on which the US did the worst, and highlighted those in the column. (He gives a third example too, a question in which the US was in the middle of the pack, but the pack did poorly, so the US’s absolute score looks bad.) And, presto! — instead of a story about kids learning stuff and doing decently on a test, we have yet another hysterical screed about Americans “struggling to compete with citizens of other countries.”

Kristof gives no suggestions for what we can actually do better, by the way. But he does offer this helpful advice:

Numeracy isn’t a sign of geekiness, but a basic requirement for intelligent discussions of public policy. Without it, politicians routinely get away with using statistics, as Mark Twain supposedly observed, the way a drunk uses a lamppost: for support rather than illumination.

So do op-ed columnists, apparently.

Should You Opt Out of PARCC?

Today’s post is a discussion of education reform, Common Core, standardized testing, and PARCC with my friend Kristin Wald, who has been extremely kind to this blog. Kristin taught high school English in the NYC public schools for many years. Today her kids and mine go to school together in Montclair. She has her own blog that gets orders of magnitude more readers than mine does.

We’re cross-posting this on Kristin’s blog and also on Mathbabe (thank you, Cathy O’Neil!)

ES: PARCC testing is beginning in New Jersey this month. There’s been lots of anxiety and confusion in Montclair and elsewhere as parents debate whether to have their kids take the test or opt out. How do you think about it, both as a teacher and as a parent?

KW: My simple answer is that my kids will sit for PARCC. However, and this is where it gets grainy, that doesn’t mean I consider myself a cheerleader for the exam or for the Common Core curriculum in general.

In fact, my initial reaction, a few years ago, was to distance my children from both the Common Core and PARCC. So much so that I wrote to my child’s principal and teacher requesting that no practice tests be administered to him. At that point I had only peripherally heard about the issues and was extending my distaste for No Child Left Behind and, later, Race to the Top. However, despite reading about and discussing the myriad issues, I still believe in change from within and trying the system out to see kinks and wrinkles up-close rather than condemning it full force.

Standards

ES: Why did you dislike NCLB and Race to the Top? What was your experience with them as a teacher?

KW: Back when I taught in NYC, there was wiggle room if students and schools didn’t meet standards. Part of my survival as a teacher was to shut my door and do what I wanted. By the time I left the classroom in 2007 we were being asked to post the standards codes for the New York State Regents Exams around our rooms, similar to posting Common Core standards all around. That made no sense to me. Who was this supposed to be for? Not the students – if they’re gazing around the room they’re not looking at CC RL.9-10 next to an essay hanging on a bulletin board. I also found NCLB naïve in its “every child can learn it all” attitude. I mean, yes, sure, any child can learn. But kids aren’t starting out at the same place or with the same support. And anyone who has experience with children who have not had the proper support up through 11th grade knows they’re not going to do well, or even half-way to well, just because they have a kickass teacher that year.

Regarding my initial aversion to Common Core, especially as a high school English Language Arts teacher, the minimal appearance of fiction and poetry was disheartening. We’d already seen the slant in the NYS Regents Exam since the late 90’s.

However, a couple of years ago, a friend asked me to explain the reason The Bluest Eye, with its abuse and rape scenes, was included in Common Core selections, so I took a closer look. Basically, a right-wing blogger had excerpted lines and scenes from the novel to paint it as “smut” and child pornography, thus condemning the entire Common Core curriculum. My response to my friend ended up as “In Defense of The Bluest Eye.”

That’s when I started looking more closely at the Common Core curriculum. Learning about some of the challenges facing public schools around the country, I had to admit that having a required curriculum didn’t seem like a terrible idea. In fact, in a few cases, the Common Core felt less confining than what they’d had before. And you know, even in NYC, there were English departments that rarely taught women or minority writers. Without a strong leader in a department, there’s such a thing as too much autonomy. Just like a unit in a class, a school and a department should have a focus, a balance.

But your expertise is Mathematics, Eugene. What are your thoughts on the Common Core from that perspective?

ES: They’re a mix. There are aspects of the reforms that I agree with, aspects that I strongly disagree with, and then a bunch of stuff in between.

The main thing I agree with is that learning math should be centered on learning concepts rather than procedures. You should still learn procedures, but with a conceptual underpinning, so you understand what you’re doing. That’s not a new idea: it’s been in the air, and frustrating some parents, for 50 years or more. In the 1960’s, they called it New Math.

Back then, the reforms didn’t go so well because the concepts they were trying to teach were too abstract – too much set theory, in a nutshell, at least in the younger grades. So then there was a retrenchment, back to learning procedures. But these things seem to go in cycles, and now we’re trying to teach concepts better again. This time more flexibly, less abstractly, with more examples. At least that’s the hope, and I share that hope.

I also agree with your point about needing some common standards defining what gets taught at each grade level. You don’t want to be super-prescriptive, but you need to ensure some kind of consistency between schools. Otherwise, what happens when a kid switches schools? Math, especially, is such a cumulative subject that you really need to have some big picture consistency in how you teach it.

Assessment

ES: What I disagree with is the increased emphasis on standardized testing, especially the raised stakes of those tests. I want to see better, more consistent standards and curriculum, but I think that can and should happen without putting this very heavy and punitive assessment mechanism on top of it.

KW: Yes, claiming to want to assess ability (which is a good thing), but then connecting the results to a teacher’s effectiveness in that moment is insincere evaluation. And using a standardized test not created by the teacher with material not covered in class as a hard percentage of a teacher’s evaluation makes little sense. I understand that much of the exam is testing critical thinking, ability to reason and use logic, and so on. It’s not about specific content, and that’s fine. (I really do think that’s fine!) Linking teacher evaluations to it is not.

Students cannot be taught to think critically in six months. As you mentioned about the spiraling back to concepts, those skills need to be revisited again and again in different contexts. And I agree, tests needn’t be the main driver for raising standards and developing curriculum. But they can give a good read on overall strengths and weaknesses. And if PARCC is supposed to be about assessing student strengths and weaknesses, it should be informing adjustments in curriculum.

On a smaller scale, strong teachers and staffs are supposed to work as a team to influence the entire school and district with adjusted curriculum as well. With a wide reach like the Common Core, a worrying issue is that different parts of the USA will have varying needs to meet. Making adjustments for all based on such a wide collection of assessments is counterintuitive. Local districts (and the principals and teachers in them) need to have leeway with applying them to best suit their own students.

Even so, I do like some things about data driven curricula. Teachers and school administrators are some of the most empathetic and caring people there are, but they are still human, and biases exist. Teachers, guidance counselors, administrators can’t help but be affected by personal sympathies and peeves. Having a consistent assessment of skills can be very helpful for those students who sometimes fall through the cracks. Basically, standards: yes. Linking scores to teacher evaluation: no.

ES: Yes, I just don’t get the conventional wisdom that we can only tell that the reforms are working, at both the individual and group level, through standardized test results. It gives us some information, but it’s still just a proxy. A highly imperfect proxy at that, and we need to have lots of others.

I also really like your point that, as you’re rolling out national standards, you need some local assessment to help you see how those national standards are meeting local needs. It’s a safeguard against getting too cookie-cutter.

I think it’s incredibly important that, as you and I talk, we can separate changes we like from changes we don’t. One reason there’s so much noise and confusion now is that everything – standards, curriculum, testing – gets lumped together under “Common Core.” It becomes this giant kitchen sink that’s very hard to talk about in a rational way. Testing especially should be separated out because it’s fundamentally an issue of process, whereas standards and curriculum are really about content.

You take a guy like Cuomo in New York. He’s trying to increase the reliance on standardized tests in teacher evaluations, so that value added models based on test scores count for half of a teacher’s total evaluation. And he says stuff like this: “Everyone will tell you, nationwide, the key to education reform is a teacher evaluation system.” That’s from his State of the State address in January. He doesn’t care about making the content better at all. “Everyone” will tell you! I know for a fact that the people spending all their time figuring out at what grade level kids should start to learn about fractions aren’t going to tell you that!

I couldn’t disagree with that guy more, but I’m not going to argue with him based on whether or not I like the problems my kids are getting in math class. I’m going to point out examples, which he should be well aware of by now, of how badly the models work. That’s a totally different discussion, about what we can model accurately and fairly and what we can’t.

So let’s have that discussion. Starting point: if you want to use test scores to evaluate teachers, you need a model because – I think everyone agrees on this – how kids do on a test depends on much more than how good their teacher was. There’s the talent of the kid, what preparation they got outside their teacher’s classroom, whether they got a good night’s sleep the night before, and a good breakfast, and lots of other things. As well as natural randomness: maybe the reading comprehension section was about DNA, and the kid just read a book about DNA last month. So you need a model to break out the impact of the teacher. And the models we have today, even the most state-of-the-art ones, can give you useful aggregate information, but they just don’t work at that level of detail. I’m saying this as a math person, and the American Statistical Association agrees. I’ve written about this here and here and here and here.

Having student test results impact teacher evaluations is my biggest objection to PARCC, by far.

KW: Yep. Can I just cut and paste what you’ve said? However, for me, another distasteful aspect is how technology is tangled up in the PARCC exam.

Technology

ES: Let me tell you the saddest thing I’ve heard all week. There’s a guy named Dan Meyer, who writes very interesting things about math education, both in his blog and on Twitter. He put out a tweet about a bunch of kids coming into a classroom and collectively groaning when they saw laptops on every desk. And the reason was that they just instinctively assumed they were either about to take a test or do test prep.

That feels like such a collective failure to me. Look, I work in technology, and I’m still optimistic that it’s going to have a positive impact on math education. You can use computers to do experiments, visualize relationships, reinforce concepts by having kids code them up, you name it. The new standards emphasize data analysis and statistics much more than any earlier standards did, and I think that’s a great thing. But using computers primarily as a testing tool is an enormous missed opportunity. It’s like, here’s the most amazing tool human beings have ever invented, and we’re going to use it primarily as a paperweight. And we’re going to waste class time teaching kids exactly how to use it as a paperweight. That’s just so dispiriting.

KW: That’s something that hardly occurred to me. My main objection to hosting the PARCC exam on computers – and giving preparation homework and assignments that MUST be done on a computer – is the unfairness inherent in accessibility. It’s one more way to widen the achievement gap that we are supposed to be minimizing. I wrote about it from one perspective here.

I’m sure there are some students who test better on a computer, but the playing field has to be evenly designed and aggressively offered. Otherwise, a major part of what the PARCC is testing is how accurately and quickly children use a keyboard. And in the aggregate, the group that will have scores negatively impacted will be children with less access to the technology used on the PARCC. We don’t need a test to tell us that. When I took the practice tests, I found some questions quite clear, but others were difficult not because of the content but because of the maneuvering needed to enter a fraction or other notation. Part of that can be solved through practice and comfort with the technology, but then we return to what we’re actually testing.

ES: Those are both great points. The last thing you want to do is force kids to write math on a computer, because it’s really hard! Math has lots of specialized notation that’s much easier to write with pencil and paper, and learning how to write math and use that notation is a big part of learning the subject. It’s not easy, and you don’t want to put artificial obstacles in kids’ way. I want kids thinking about fractions and exponents and what they mean, and how to write them in a mathematical expression, but not worrying about how to put a numerator above a denominator or do a superscript or make a font smaller on a computer. Plus, why in the world would you limit what kids can express on a test to what they can input on a keyboard? A test is a proxy already, and this limits what it can capture even more.

I believe in using technology in education, but we’ve got the order totally backwards. Don’t introduce the computer as a device to administer tests, introduce it as a tool to help in the classroom. Use it for demos and experiments and illustrating concepts.

As far as access and fairness go, I think that’s another argument for using the computer as a teaching tool rather than a testing tool. If a school is using computers in class, then at least everyone has access in the classroom setting, which is a start. Now you might branch out from there to assignments that require a computer. But if that’s done right, and those assignments grow in an organic way out of what’s happening in the classroom, and they have clear learning value, then the school and the community are also morally obligated to make sure that everyone has access. If you don’t have a computer at home, and you need to do computer-based homework, then we have to get you computer access, after school hours, or at the library, or what have you. And that might actually level the playing field a bit. Whereas now, many computer exercises feel like they’re primarily there to get kids used to the testing medium. There isn’t the same moral imperative to give everybody access to that.

I really want to hear more about your experience with the PARCC practice tests, though. I’ve seen many social media threads about unclear questions, both in a testing context and more generally with the Common Core. It sounds like you didn’t think it was so bad?

KW: Well, “not so bad” in that I am a 45 year old who was really trying to take the practice exam honestly, but didn’t feel stressed about the results. However, I found the questions with fractions confusing in execution on the computer (I almost gave up), and some of the questions really had to be read more than once. Now, granted, I haven’t been exposed to the language and technique of the exam. That matters a lot. In the SAT, for example, if you don’t know the testing language and format it will adversely affect your performance. This is similar to any format of an exam or task, even putting together an IKEA nightstand.

There are mainly two approaches to preparation, and out of fear of failing, some school districts are doing hardcore test preparation – much like SAT preparation classes – to the detriment of content and skill-based learning. Others are not altering their classroom approaches radically; in fact, some teachers and parents have told me they hardly notice a difference. My unscientific observations point to a separation between the two that falls along lines of socioeconomic status. If districts feel like they are on the edge or have a lot to lose (autonomy, funding, jobs), it makes sense that they would be reactionary in dealing with the PARCC exam. Ironically, schools that treat the PARCC like a high-stakes test are the ones losing the most.

Opting Out

KW: Despite my misgivings, I’m not in favor of “opting out” of the test. I understand the frustration that has prompted the push some districts are experiencing, but there have been some compromises in New Jersey. I was glad to see that the NJ Assembly voted to put off using the PARCC results for student placement and teacher evaluations for three years. And I was relieved, though not thrilled, that the percentage of PARCC results to be used in teacher evaluations was lowered to 10% (and now put off). I still think it should not be a part of teacher evaluations, but 10% is an improvement.

Rather than refusing the exam, I’d prefer to see the PARCC in action and compare honest data to school and teacher-generated assessments in order to improve the assessment overall. I believe an objective state or national model is worth having; relying only on teacher-based assessment has consistency and subjectivity problems in many areas. And that goes double for areas with deeply disadvantaged students.

ES: Yes, NJ seems to be stepping back from the brink as far as model-driven teacher evaluation goes. I think I feel the same way you do, but if I lived in NY, where Cuomo is trying to bump up the weight of value added models in evaluations to 50%, I might very well be opting out.

Let me illustrate the contrast – NY vs. NJ, more test prep vs. less – with an example. My family is good friends with a family that lived in NYC for many years, and just moved to Montclair a couple months ago. Their older kid is in third grade, which is the grade level where all this testing starts. In their NYC gifted and talented public school, the test was this big, stressful thing, and it was giving the kid all kinds of test anxiety. So the mom was planning to opt out. But when they got to Montclair, the kid’s teacher was much more low key, and telling the kids not to worry. And once it became lower stakes, the kid wanted to take the test! The mom was still ambivalent, but she decided that here was an opportunity for her kid to get used to tests without anxiety, and that was the most important factor for her.

I’m trying to make two points here. One: whether or not you opt out depends on lots of factors, and people’s situations and priorities can be very different. We need to respect that, regardless of which way people end up going. Two: shame on us, as grown ups, for polluting our kids’ education with our anxieties! We need to stop that, and that extends both to the education policies we put in place and how we collectively debate those policies. I guess what I’m saying is: less noise, folks, please.

KW: Does this very long blog post count as noise, Eugene? I wonder how this will be assessed? There are so many other issues – private profits from public education, teacher autonomy in high performing schools, a lack of educational supplies and family support, and so on. But we have to start somewhere with civil and productive discourse, right? So, thank you for having the conversation.

ES: Kristin, I won’t try to predict anyone else’s assessment, but I will keep mine low stakes and say this has been a pleasure!

Why Your Kids Should Help in the Kitchen

We were emptying the dishwasher in the morning, and my younger son’s job was putting away the silverware. He brought the silverware basket over to the silverware drawer, and said:

You know what I’m going to do? First I’m going to collect together all the spoons, and put them in the spoon bin. Then I’ll take all the knives and put them in the knife bin. Then I’ll take the forks…

Any idea that a five year old can come up with must be really simple, right? But simple ideas can still be deep and powerful. Among other things, this one is at the heart of an important mathematical technique called Lebesgue integration, which one of my favorite math teachers once explained to me like this:

Say you’re trying to count a really big pile of money. You can stack it really high and count it a bill at a time. Or you could separate it into piles of ones, fives, tens, and twenties, count how many bills are in each pile, multiply the count in each pile by the denomination, and add the results. Lebesgue integration is when you break the bills into separate piles first.

What my son figured out was that if you focus on one kind of utensil at a time, you can work faster because everything you pick up goes in the same place, so you don’t need to think about switching from one bin to another all the time. From a computer science perspective, you’re doing fewer operations. From a math perspective, you’re representing a single function that appears complex (because it jumps around all the time, from knife to fork to spoon to knife or from $1 to $10 to $5 to $1) in terms of a few simple (constant) functions defined on different domains. I’ve written before about how math is about finding, creating, and making use of order, and this is a great example.
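For readers who like to see it in code, here is a minimal Python sketch of the money-counting story. The function names and the sample pile of bills are made up for illustration. The first function counts one bill at a time, in whatever order the bills come; the second sorts the bills into piles by denomination first, then multiplies each denomination by the size of its pile, which is the "simple function" flavor of the Lebesgue idea.

```python
from collections import Counter


def count_one_at_a_time(bills):
    """Walk through the stack in order, adding each bill's value as it comes."""
    total = 0
    for bill in bills:
        total += bill
    return total


def count_by_denomination(bills):
    """Sort the bills into piles first, then add up (denomination * pile size)."""
    piles = Counter(bills)  # maps each denomination to how many bills it has
    return sum(denomination * count for denomination, count in piles.items())


# A hypothetical jumbled stack of bills, in dollars.
bills = [1, 10, 5, 1, 20, 5, 1, 10]

print(count_one_at_a_time(bills))
print(count_by_denomination(bills))
```

Both functions return the same total for any pile of bills; the difference is only in how the work is organized. The second one never has to care about the order the bills arrived in, just how many of each kind there are.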

You can apply this idea to the problem of finding the area under a really jumpy curve. Henri Lebesgue is famous, with an integration technique named after him, because he worked out the details, about 100 years ago. But the underlying idea truly is accessible to a five year old. At least, as long as that five year old pays attention to his chores.