Can New York Clean Up the Testing Mess?
As New York State’s executive and legislative branches sank into a swamp of corruption and political paralysis this winter, something brave, honest, and totally unexpected took place in one Albany office. Pounding the table and refusing to accept any more excuses, the new chancellor of the Board of Regents, Merryl Tisch, forced the state’s notoriously dysfunctional department of education to submit to an outside audit of the reading and math tests, mandated by the federal No Child Left Behind (NCLB) act, that it has administered to all students in grades three through eight. A memorandum of understanding between the department and Harvard professor Daniel Koretz, one of the nation’s top testing experts, gives Koretz access to the data that he’ll need to determine whether New York’s test scores have been inflated. And once he does, it’s likely that the state’s claims of spectacular student progress will be revealed as an illusion.
Such a development would be healthy for education reform nationally, because all states need to come clean about test-score inflation. Reliable tests of student achievement are as essential for improving education as accurate monitoring of blood-sugar levels was to advancing the treatment of diabetes. Unfortunately, when NCLB became law, it left the door wide open to massive test inflation by stipulating that all American students “will be proficient” by the year 2014, and imposing a series of increasingly onerous sanctions on districts and schools not moving toward that goal, yet allowing each state to develop its own tests and set its own standard for “proficiency.” Since men are not angels, it was inevitable that some state education authorities would lower the proficiency bar to make themselves look good to the feds.
Now the Obama administration has launched the most expansive (and expensive) federal school-reform initiative in American history.
Like NCLB, the initiative judges teachers, schools, and states by improvements in students’ test scores. But testing could be the Achilles’ heel of Obama’s reform agenda unless states like New York, where sudden student improvements strain all credulity, shape up.
For more than two decades, mainstream American social science has recognized that accountability schemes like NCLB can lead to fraud and distortion. The principle even has a name: Campbell’s Law, after Donald Campbell, one of the greatest American social scientists of the twentieth century. In one study, Campbell observed various companies’ attempts to improve employees’ performance indicators by giving them incentives. He came up with this general formulation: “The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
Campbell even extended the “law” to the realm of education testing, writing that “when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways.” In plain English: when school districts, prompted by systems like NCLB, offer teachers and administrators substantial incentives to raise students’ test scores, the teachers and administrators will be tempted to find extraordinary means (but not always ethical ones) to get the test scores up.
That’s exactly what seems to have happened.
The best evidence of test-score inflation is the growing gap between the number of students that states deem proficient on their own tests (those they administer under the terms of NCLB) and the number deemed proficient by the National Assessment of Educational Progress (NAEP), often referred to as the “nation’s report card.” One reason the federal NAEP tests are the gold standard in student assessment is that they can’t be gamed by teachers or administrators. Every two years, NAEP math and reading tests are given to a statistically valid sample of all fourth- and eighth-grade students in each state; teachers aren’t able to teach to the test, and school districts can’t offer students practice tests because no one knows ahead of time which students will be tested. And in nearly every state in the union, the NAEP exams deem far fewer students proficient than the state’s own exams do.
An even more disturbing phenomenon: states reporting huge gains on their own tests while their NAEP results don’t budge. One glaring instance happened recently in New York, where the percentage of eighth-graders reaching proficiency on the state’s math test rose from 58.8 percent in 2007 to a stunning 80.2 percent in 2009, while over the same period, the NAEP math scores for the eighth-graders remained flat. On the state’s fourth-grade reading exams, the proficiency rate went up from 68 percent in 2007 to 76.9 percent in 2009, while the NAEP test again showed no gain. On the state’s eighth-grade reading test, the proficiency rate went from 57 percent to 68.5 percent, while the NAEP tests showed a 1 percentile-point gain. It’s hard to escape the conclusion that states are gaming their tests, not only to satisfy NCLB but also to celebrate the education miracles that they’ve supposedly worked.
Not every state has succumbed to the temptation, thankfully.
Massachusetts made an honorable effort to maintain high academic content standards and then carefully aligned its state tests to those standards. Not surprisingly, it is one of the few states without significant gaps between student performance on state tests and on the NAEP.
But Massachusetts is an exception. For a depressing instance of the rule, turn to New York, whose education department, even before the NCLB era, was repeatedly cited in official audits for failing to uphold its statutory responsibility to oversee local school districts. A 1991 report by state comptroller Ned Regan, for example, revealed that 10 percent of high school Regents exams had been improperly graded, leading to passing grades for many students who should have failed. But state education authorities did little to establish control over districts’ grading practices and didn’t even consider revisiting the astonishing policy of letting teachers grade their own students’ tests. It therefore shouldn’t have come as a surprise last year when the state’s current comptroller, Thomas DiNapoli, audited the grading of Regents exams and discovered that many students who should have failed had again passed.
But the problem worsened considerably in 2002, when NCLB was enacted and New York’s testing agency was about to become the education department’s most important bureau, in charge of supervising the NCLB-mandated reading and math exams for grades three through eight. Two years later, the position of state testing director became vacant in New York, giving the Board of Regents, which governs the New York State Education Department, a golden opportunity to hire a highly qualified director committed to creating an honest and transparent testing system. Instead, the job went to David Abrams, a high school English teacher who had spent ten years as an administrator in an Albany-area school district.
Abrams’s curriculum vitae, apparently written hastily after I asked the education department for it, shows that he lacks professional credentials in education testing, statistics, and psychometrics. He has bachelor’s and master’s degrees in English, plus an administration M.A., which is required of every assistant principal in the state. One member of the Regents tells me that Abrams “has no qualifications for the job, and he’s responsible for many of our blunders on the tests.”
Perhaps Abrams’s most serious blunder was ignoring a warning about those tests in 2008. On September 12, he received an urgent memo from Daniel Koretz and Howard Everson, a professor of education testing at the CUNY Graduate Center and chair of the state education department’s Technical Advisory Group (TAG). The memorandum cited growing skepticism by the public and press about the reported score gains on the 2008 math and reading tests for grades three through eight. “None of the reporters found the large increases credible,” Koretz and Everson wrote. “While we both emphatically assured them that there is no malfeasance, we had no basis for reassuring them with respect to test design and score inflation.” They requested the education department’s “support for a program of validation studies to be initiated this academic year,” proposing to study “score inflation and the undesirable instructional activities that produce it” as well as the ways that “score inflation can create an illusion the achievement gap is closing, or that it is closing much faster than it actually is.” Abrams never responded.
Though Koretz and Everson graciously absolved the education department of “malfeasance” as an explanation for score inflation, there was something very close to that in the way the department continued to defend itself whenever a reporter asked questions about the 2008 tests’ integrity. Education commissioner Richard Mills reiterated that New York was protected against test corruption because of the highly regarded testing experts who served on the TAG and certified the testing process. The same defense appeared on the department’s website. All this despite the TAG’s chairman having already written the memo pointing out that, absent an independent study, there was “no basis for reassuring [the public] with respect to test design and score inflation”!
Rather than finally doing something to restore the integrity of the tests, the education department released more wildly implausible results the following year. On the state’s 2009 math tests, for example, 87 percent of the seventh-graders achieved proficiency, up from 55 percent in just three years. In many districts, the number of students in all grades scoring above the proficiency bar was nearly 100 percent.
In City Journal and the New York Daily News, I’ve called results like this the Lake Wobegon Effect, after Garrison Keillor’s tales about a town where “all the children are above average.” More bluntly, the New York Post’s Michael Goodwin called the test results “one of the most destructive frauds in many years.”
To see how New York’s tests became corrupted, it’s necessary to understand how the scoring process works. The state hires a private education publisher, CTB/McGraw-Hill, to design its tests. A “standard-setting” committee of CTB/McGraw-Hill psychometricians then converts raw test scores (that is, the total number of questions a student gets right) to a wider scale, much as raw points on the SAT get converted to a scale of 800. From these “scale scores,” the committee then recommends the “cut points” that establish the minimum bar for students to reach one of the four designated achievement levels: below basic, basic, proficient, and advanced. If the cut points are set too high, student achievement will appear to fall; if they are set too low, achievement will appear to rise. Even in the best of circumstances, according to Everson, setting the cut scores is “more art than science.”
Arguments about who set the cut scores for the troubling 2009 tests, and for what purpose, set off an unprecedented tumult at the Board of Regents last year. Shortly after her election as chancellor in April 2009, Merryl Tisch confided to several colleagues that she suspected Commissioner Mills of tampering with test scores to boost his own legacy. Another member of the Regents, former New York City schools superintendent Betty Rosa, went further, claiming that she was told by a high-ranking department official that Mills and Abrams had lowered the cut scores on the 2009 math test.
Rosa even lobbied unsuccessfully to delay the release of the scores until there was an independent investigation of possible test corruption.
When I wrote Abrams to ask whether he and Mills had played any role in setting the cut scores for the 2009 tests, he didn’t respond, but the education department’s press secretary, Tom Dunne, assured me that the cut scores are set by CTB/McGraw-Hill. And Everson confirmed that protocol calls for CTB/McGraw-Hill to recommend to the commissioner what the final cut scores should be. But the commissioner doesn’t have to follow the recommendation and always has the last word, Everson added. In a phone conversation, I asked Mills whether he’d had anything to do with setting the 2009 cut scores. He didn’t say either that he did or that he didn’t. “I did what was appropriate, and the record speaks for itself,” he noted in our brief exchange.
Whoever set the cut scores, something untoward seems to have happened at the education department last year. Several researchers analyzed the 2009 tests and demonstrated that the standards dropped precipitously and arbitrarily. Diana Senechal, a Yale Ph.D. and former New York City teacher, established that in some grades it was possible for a student to attain the “basic” level just by guessing on every multiple-choice question, even while totally disregarding the section of the test that required longer written answers. Senechal’s study is buttressed by the fact that practically no students score below basic any more. For example, only 0.1 percent of all sixth-graders were designated at the lowest level on the 2009 reading test, down from 7.3 percent in 2006. Clearly, the downward trend was driven by the steady reduction in the percentage of raw points needed to reach “basic” status. Thus the percentage of raw points that sixth-graders needed to reach “basic” in reading dropped from 41 percent in 2006 to 17.9 percent in 2009.
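The arithmetic behind a falling cut point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reconstruction of the state’s scoring procedure or of Senechal’s analysis: the 40-point test with 30 one-point, four-option multiple-choice items and the list of raw scores are hypothetical, and only the 41 percent and 17.9 percent cut points come from the figures cited for sixth-grade reading.

```python
from math import ceil, comb

# Share of raw points needed for "basic" in sixth-grade reading,
# as cited in the article.
CUT_2006 = 0.41
CUT_2009 = 0.179

# HYPOTHETICAL test layout, chosen for illustration only (not the real exam).
TOTAL_POINTS = 40   # all raw points, multiple-choice plus written
MC_ITEMS = 30       # one-point multiple-choice items
OPTIONS = 4         # answer choices per item


def expected_guess_share(mc_items, total_points, options):
    """Expected share of all raw points earned by guessing blindly on every
    multiple-choice item and leaving the written section blank."""
    return (mc_items / options) / total_points


def prob_guesser_reaches(cut_share, mc_items, total_points, options):
    """Probability that blind guessing alone reaches the cut (binomial tail)."""
    needed = ceil(cut_share * total_points)
    p = 1 / options
    return sum(
        comb(mc_items, k) * p**k * (1 - p) ** (mc_items - k)
        for k in range(needed, mc_items + 1)
    )


def share_at_or_above(raw_scores, cut_share, total_points):
    """Share of students whose raw score meets or exceeds the cut."""
    needed = cut_share * total_points
    return sum(score >= needed for score in raw_scores) / len(raw_scores)


if __name__ == "__main__":
    hypothetical_scores = [5, 8, 12, 17, 20, 30]  # made-up raw scores
    print("expected share from guessing:",
          expected_guess_share(MC_ITEMS, TOTAL_POINTS, OPTIONS))
    for year, cut in (("2006", CUT_2006), ("2009", CUT_2009)):
        print(year,
              "chance a pure guesser reaches basic:",
              prob_guesser_reaches(cut, MC_ITEMS, TOTAL_POINTS, OPTIONS),
              "| share of sample at or above cut:",
              share_at_or_above(hypothetical_scores, cut, TOTAL_POINTS))
```

Under this hypothetical layout, blind guessing is expected to yield 18.75 percent of the raw points, which sits above the 2009 cut and far below the 2006 one. The same unchanged set of raw scores also produces a higher pass rate once the cut is lowered, which is the mechanism by which achievement can “appear to rise” without any change in what students know.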
The June 2 press conference announcing the 2009 math results was one of the strangest in the education department’s history. Mills sounded triumphant as he described the spectacular double-digit gains by students all over the state and in every grade. But Tisch sat on the same dais, and it was obvious that she didn’t believe a word of it. “Just because scores have gone up dramatically does not mean that our youngsters are ready to go to college,” she said. “No one should interpret this as an enormous victory.”
James Williams, the African-American superintendent of Buffalo schools, promptly complained that Tisch was disparaging the performance of his district’s black children just as they were finally making academic gains. And in New York City, Mayor Michael Bloomberg and Schools Chancellor Joel Klein reportedly weren’t happy that Tisch was raining on their parade. They certainly had reason to be surprised: nothing in her background might have predicted that she would take on the state’s education bureaucracy. She’s a member of one of Gotham’s wealthiest families, part of the city’s “ruling class,” according to a New York Times profile. She travels in the mayor’s social circle and counts both Bloomberg and Klein as friends.
However, Tisch had already pushed successfully for Mills’s ouster, and to succeed him, she eventually supported the appointment of David Steiner, the reform-minded dean of the Hunter College School of Education. As a Boston University political-science professor in the early 1990s, Steiner had played a role in advancing the Massachusetts education-reform law that mandated a knowledge-based core curriculum. Raised in England, Steiner had never even seen a multiple-choice test until he moved to the United States for graduate studies.
He was shocked to discover that New York allowed teachers to grade their own students’ Regents exams, and he took office in October believing that the state’s testing system needed a major overhaul.
Tisch’s other major move, supporting the audit by Koretz, unleashed a backlash by the education department’s old guard, who feared (reasonably) that an honest evaluation of the state’s testing system would turn out to be extremely embarrassing. One high-level staffer even warned Steiner that Koretz was known in the field as a “test basher,” which was a fabrication. But the reformers have stuck to their guns. Koretz’s team at Harvard is already at work, mapping student responses to individual test items over the past four years.
It isn’t only NCLB-mandated tests that suffer from score inflation, by the way. The same pattern of lowered standards has infected New York’s high school Regents exams. Former CUNY education school dean Alfred Posamentier, a member of the state’s Mathematics Standards Commission, tells me that the Regents exam in algebra, typically taken by students in the ninth grade, has been drastically dumbed down. According to Posamentier, the algebra Regents is “no longer a very challenging instrument and, amazingly, requires a student to answer only about 35 percent of the questions correctly to earn a passing grade. A passing grade could then send a student on to the next course (typically, geometry) knowing only about one-third of the material from the previous mathematics course. This student is truly doomed to failure.”
For Mayor Bloomberg, test-score inflation has been the gift from Albany that keeps on giving, an elevated platform from which to rejoice over (and take credit for) New York City’s newly successful schools. Bloomberg and Chancellor Klein have also found ways to build even higher monuments to their success on top of that platform. Starting in 2005, for instance, the city’s Department of Education (DOE) dissembled to reporters about the magnitude of score increases, appropriating to its column gains that had actually been achieved under the previous education administration. Also, many of the city’s checks against cheating have disappeared in recent years. Under the old, pre-Bloomberg Board of Education, the safeguards included sending district administrators to every school on testing days to oversee test security procedures. The board also identified schools that had unusually high score increases from the previous year and subjected them to closer scrutiny in checking test papers. These procedures are nonexistent in today’s DOE.
Moreover, beginning in 2005, the administration offered principals (and teachers, a few years later) a variety of inducements, including cash payments, for pushing test scores up, but didn’t bother to ask too many questions about how the deed was done. Principals received cash bonuses of up to $25,000 for each year that scores went up substantially, and thousands of teachers got schoolwide bonuses, a powerful enticement to inflate scores. If scores didn’t go up, teachers and principals faced sanctions, such as having their schools closed. Finally, CTB/McGraw-Hill now provides yet another gimmick to help city teachers get the scores up: the “Predictive Assessment,” essentially a test-prepping device disguised as a mini-test that students take once a year and that closely reflects the blueprint and structure of the state tests.
The paradigmatic case of test-score inflation happened in the middle of the 2005 mayoral campaign at P.S. 33, an elementary school in the South Bronx. The percentage of the school’s fourth-graders passing the reading exam (that is, scoring at or above proficiency) had more than doubled in a single year, skyrocketing from 35 percent in 2004 to 83 percent in 2005. Instead of ordering an investigation of this astonishing and literally unbelievable rise, Mayor Bloomberg held a press conference at the school to proclaim another miracle improvement on his watch. The school’s principal, Elba Lopez, then quickly retired with a $15,000 bonus for her great work, which boosted her annual pension by about $8,000 for life. The next year, the same cohort of students, now fifth-graders, fell back to a pass rate of only 47 percent, and the pass rate for the new crop of fourth-graders was just 41 percent.
After I reported this in City Journal and Andy Wolf did the same in the New York Sun, the presumption that somebody had tampered with the students’ 2005 exams became so strong that the DOE was forced to take a look. But the department waited almost two years to start an investigation, by which time the suspicious test papers had been destroyed. In a report issued almost four years after the tests were taken, the department’s investigator found no wrongdoing, yet neglected to interview Lopez. When I asked DOE counsel Michael Best how the most likely suspect could be cleared without even an interview, he responded that the department’s investigator hadn’t been able to locate her. A few months later, New York Post reporter Yoav Gonen found Lopez at her apartment in the Bronx, exactly where she had always been. Lopez assured the Post that there was no cheating; the reason that the students didn’t maintain their spectacularly high scores for more than one test cycle, she explained, was that the school had a new, inexperienced principal. By its negligent handling of this well-publicized case, the DOE sent a clear signal to all adults working in the schools: feel free to tamper with students’ tests, so long as the miracle scores continue to make city hall happy.
In undermining the integrity of the tests, the Bloomberg administration had many powerful enablers, including the United Federation of Teachers (UFT). At the celebratory press conferences each year, union president Randi Weingarten appeared beside the mayor, nodding in approval as he detailed amazing stories of student improvement. Weingarten told confidants that the test scores were too good to be true. But in public, she maintained the fiction: Didn’t the rising scores prove, after all, that teachers had earned their unprecedented raises of 43 percent since Bloomberg took over the schools?
Now that Bloomberg has informed the UFT that the days of wine and roses are over and negotiations for a new teachers’ contract are at an impasse, Weingarten’s handpicked successor, Michael Mulgrew, has suddenly reversed course on the testing issue. “Should we accept these ‘miracles’ on faith?” the new UFT president asked in a satiric editorial for the union newspaper. “Or do they warrant closer scrutiny? Reason alone tells us that when . . . 82 percent of students in grades 3 to 8 meet standards in math, the bar is too low.” Only four months earlier, when the union still expected a 4 percent yearly raise from the city, Mulgrew had happily mounted the dais with Bloomberg and Weingarten to share praise for that very 82 percent proficiency rate.
Chancellor Tisch and Commissioner Steiner have a unique opportunity to offer parents an honest assessment of how their children and their schools are doing, even before the results of the Koretz audit come in. For starters, they should finally conduct a national search for a highly qualified testing director committed to transparency and an honest assessment of student academic achievement. The $148,000 salary that the education department already allocates for the position should be enough to attract someone with excellent professional credentials and a commitment to reform.
Tisch and Steiner could take a significant step toward solving blatant cheating by recommending legislation that would make it a crime for adults to tamper with students’ test sheets. They should also end the practice of allowing teachers to grade their own students’ Regents exams. School districts could easily create procedures in which exam papers are distributed randomly for grading by teachers from other schools.
Tisch and Steiner should also press for a moratorium on all bonus schemes based on students’ test scores until the Koretz audit is completed and there is greater confidence that the test-inflation problem is being solved. And instead of echoing NCLB’s impossible goal (all students reaching proficiency by some arbitrary date), Tisch and Steiner should declare a much more realistic target: narrowing the gap between students’ performance on the state tests and on the NAEP tests. When the NAEP results are released, the education department should issue a report on how much progress (or lack thereof) has been made toward achieving that goal.
“We have to stop lying to children,” education secretary Arne Duncan said recently at a meeting of the National Governors Association (NGA). “We have to look them in the eye and tell them the truth at every stage of their educational trajectory.” Duncan has offered seed money to states to develop tests aligned to the Common Core State Standards project initiated by the NGA and endorsed by the administration. But Tisch and Steiner need not wait for the feds to fund better tests. By putting their own house in order, they can make New York a model for the kind of political courage and educational honesty that are desperately needed all over the nation.
Sol Stern is a contributing editor of City Journal and a senior fellow at the Manhattan Institute.