The Challenge of Authentic Assessment

Problem: Old-school accountability tests are crude measurements of student learning.
Solution: Build a better test.

March 18, 2008
Photo caption -- Past: Traditional standardized tests deaden teaching and inaccurately measure student learning. (Credit: Gregory Cherin)

When I was a young education reporter in the old mill town of Lawrence, Massachusetts, the big day came when the state released scores on its school accountability tests. The Massachusetts Comprehensive Assessment System, better known and feared as the MCAS, fulfills the requirements of the federal No Child Left Behind Act through annual tests in English and math (and now additional subjects).

I scrutinized pages of numbers and wrote a story on the success and failure of nearby schools. My editors played it big on the front page because they knew parents would look anxiously at their school's results and homeowners would mentally adjust their property values based on the scores. I prodded principals and superintendents to explain their schools' leaps or stumbles.

And unwittingly, I played right into the dominant illusion that these bloodless test scores are the most definitive measure of a school's success -- and that they measure what's most important.

Cold, hard numbers have a way of seeming authoritative, but accountability tests are not the infallible and insightful report cards we (and our state governments) imagine them to be. The educational assessment tests states use today have two fundamental flaws: They encourage the sort of mind-numbing drill-and-kill teaching educators (and students) despise, and, just as important, they don't tell us much about the quality of student learning.

"We are totally for accountability, but we've got the wrong metrics," says John Bransford, a professor of education at Seattle's University of Washington who studies learning and designs assessments. "These tests are the biggest bottleneck to education reform."

Hobbled by History

Jennifer Simone, a fifth-grade teacher at Deerfield Elementary School, in Edgewood, Maryland, is acutely aware of the limitations of standardized tests. Her curriculum must emphasize the subjects for which the state accountability test measures proficiency -- math, reading, and science. Social studies? Though the subject is on her master schedule, it gets dropped whenever the school day is shortened.

Moreover, Simone says, the test scores don't truly reflect her students' abilities and are too vague to help her pinpoint individual needs. She longs for an assessment that relies on more than just written problems, that could capture the more diverse skills visible in her classroom and valued in the workplace, such as artistic talent, computer savvy, and the know-how to diagnose and fix problems with mechanical devices. Simone asks, "If we differentiate our instruction to meet the needs of all the learners, why aren't we differentiating the test?"

Photo caption -- Future: Desiree Jerome demonstrates a chemiluminescent reaction as part of her portfolio to move up to the next grade at F.W. Parker Essential School, in Devens, Massachusetts. (Credit: Gregory Cherin)

The simple, but unsatisfying, answer is history and efficiency. The tests that states use to satisfy NCLB descended from a model created in the 1920s and designed to divide students into ability groups for more efficient tracking. Eighty years, a world war, and a technological revolution (or two) later, the tests remain structurally the same.

Policy makers revere the seeming objectivity of these tests, but the truth is that the exams are not adept at determining either how well teachers have taught or how well students have learned -- and test makers themselves will tell you so. Stephen Dunbar, an author of the influential Iowa Test of Basic Skills, explains that these tests can help illuminate statewide educational trends but paint with too broad a brush to capture the school- and classroom-level detail that NCLB demands.

Assessment tests might show the overall effectiveness of the ninth-grade curriculum, for instance, or indicate trends within large demographic groups in that grade. But Dunbar says that when you get down to measuring the ability of students at Dallas's Woodrow Wilson High School, for example, where you're comparing this year's ninth graders to last year's, accountability test scores are not very useful. "They might tell you more about idiosyncrasies in that combination of kids than the level of achievement or the quality of teaching and learning that's going on," Dunbar explains.

In other words, state governments, at the behest of the feds, are using tests to measure something they actually don't measure very well, and then penalizing schools -- and in some cases, denying students diplomas -- based on the results.

"Most of these policy makers are dirt ignorant regarding what these tests should and should not be used for," W. James Popham, professor emeritus at the University of California at Los Angeles and former president of the American Educational Research Association, told PBS's Frontline in 2001. "And the tragedy is that they set up a system in which the primary indicator of educational quality is simply wrong." (See Popham's article, "F for Assessment," April 2005.)

There are several reasons the tests are imprecise. (See "Where Standardized Tests Fail.") Some are technical: an ambiguous question, a misjudgment in setting the difficulty level, a scoring error. The National Board on Educational Testing and Public Policy, at Boston College, has documented cases in which scoring errors sentenced children to summer school or caused them to miss graduation before the mistakes were discovered. Some reasons are personal: Simone, whose school narrowly dodged state intervention last year, has seen fifth graders arrive on testing day angry about personal matters, struggle to sit still during the test, or break down in tears under the pressure.

The tests' fallibility has most to do with the very idea of measuring a year's worth of learning in a single exam. Inevitably, cramming that much coverage into a short test leads states to rely mostly on multiple-choice questions -- the fastest and cheapest means of large-scale assessment. Such brief yet weighty exams limit the ways students can show their skills, and because it's impossible to test hundreds of state standards in a few hours, they leave teachers guessing about which ones to emphasize. Randy Bennett, who holds the title of distinguished scientist at ETS, writes that this rigid idea of assessment yields a "narrow view of proficiency" defined by "skills needed to succeed on relatively short, and quite artificial, items."

Even when states do pony up to use open-ended essay questions and pay human scorers, these questions can encourage formulaic answers. Last school year, I watched the principal of a (high-scoring) Boston high school interrupt a test-prep session to warn students not to stray from the essay-writing formula -- main idea, evidence, analysis, linking -- lest they lose points. "Don't be creative," she said fiercely. "You've heard me rail against standardized tests, and this is why. There's one way to do this, and it's the way the assessment coordinator told you."

Equally worrisome is that today's assessments emphasize narrow skill sets such as geometry and grammar, and omit huge chunks of what educators and business leaders say is essential for modern students to learn: creative thinking, problem solving, cooperative teamwork, technological literacy, and self-direction. Yet because NCLB has made accountability tests the tail that wags the dog of the whole education system -- threatening remediation and state takeover for schools that fall short -- what's not tested often isn't taught.

In short, the American accountability system is a bastion of the past that's stifling our ability to tackle the future.

High Stakes

The good news is there's work afoot to create better tests that will challenge students to demonstrate more creative, adaptable skills -- and, in turn, encourage teachers to teach them. Some model assessments already exist; for instance, many experts tout the Programme for International Student Assessment (PISA) exam for its challenging, open-ended questions on practical topics, such as climate change or the pros and cons of graffiti. Even more advanced models, some using computer simulations, will become available in a few years -- and none too soon.

Business leaders have issued dire warnings about how hard the U.S. economy will tank if our education system doesn't get itself out of the nineteenth century, and fast. They're clamoring for creative, productive, affable employees -- not just dutiful test takers -- and they point to assessment as a crucial tool for turning the tide. Microsoft founder Bill Gates, addressing state governors, CEOs, and educators at the National Education Summit on High Schools in 2005, said, "America's high schools are obsolete. Even when they're working exactly as designed, they cannot teach our kids what they need to know today. In the international competition to have the biggest and best supply of knowledge workers, America is falling behind."

The New Commission on the Skills of the American Workforce, convened by the nonprofit National Center on Education and the Economy, issued a stark report in December 2006 predicting that our standard of living "will steadily fall" compared to other nations unless we change course. The globalized economy has created, the commission wrote, "a world in which comfort with ideas and abstractions is the passport to a good job"; what's essential, it added, is "a deep vein of creativity that is constantly renewing itself." According to the report, whatever efforts we make to modernize education, without a complete overhaul of the testing system, "nothing else will matter."

Congressman George Miller, chairman of the House Education and Labor Committee and chief House wrangler of NCLB (and a member of The George Lucas Educational Foundation's Advisory Board), understands the problem. The original law left it up to states to choose their own tests, but he now believes most states picked them more for cost and efficiency than for educational value. "They don't truly measure what a student knows or doesn't know," he says, "or whether students have a depth of understanding so that they can apply their knowledge."

Real Solutions to Real Problems

In the past, states haven't had much choice in the kinds of large-scale assessments available, nor have they asked for much. That's about to change.

Test makers in multiple corners are creating more complex assessments, ones that, if tied more closely to curriculum and instruction, could paint a clearer picture of student learning. They're building these assessments to measure the twenty-first-century skills we so urgently need, aiming to gauge a child's readiness for the real challenges that await. If tests like these succeed, they could not only provide better information about children's readiness for real life but also give educators incentive to do what they want to do anyway: teach kids in engaging ways to be well-rounded people and lifelong learners, not drill the life out of school with dry test preparation.

A number of researchers are building tests that could be models -- or at least one piece of a larger model. The University of Washington's John Bransford and Andreas Schleicher, head of the Indicators and Analysis Division at the Organisation for Economic Cooperation and Development (OECD), maker of the PISA exam, believe students need dynamic problems to solve, ones that require real-world research and allow them to learn on the spot, not just apply prior knowledge.

A static problem, for instance, would ask test takers to recall from memory how to save a certain endangered bird species. A dynamic assessment (in a real example from Bransford's lab) asks students to use available resources to learn what it would take to prevent the white-eyed vireo from becoming endangered. This novel question demands that students dig for information on their own and know enough to ask the right questions to reach a solution.

Bransford says he doesn't believe the old trope that students must master a battery of content-specific facts before they can have a prayer of learning higher-order skills. "Just the opposite," he says: Students need to understand big concepts in each discipline, such as the relationship between a species' life cycle and its risk of extinction, but from there it's the higher-order skills that lead them to the pertinent facts.

At ETS -- which writes the SAT and Advanced Placement exams, among others, and administers fifty million tests a year -- Randy Bennett is field-testing assessments that make use of about thirty years of psychology research on how children learn. It's research that he says has been largely left out of test design. The key strategies he has found include asking students to integrate multiple skills (such as reading and making comparisons) at once, presenting questions in meaningful contexts, and using a variety of information forms, such as text, diagrams, and symbols. Eva Baker, codirector of UCLA's National Center for Research on Evaluation, Standards, and Student Testing, proposes one more: Never have someone present a solution without explaining why he or she chose it.

It's not so different from the kind of assessment Jennifer Simone would like for her students. She'd like the exam to use more formats than just writing, including visual or spoken components. "You would have to take the time to have a student interview, allow students to have an oral response," she says. "That's how we teach them reading."

Technology is what will make this revolution possible. Already, computers have enabled Bransford, Baker, and others to create interactive questions, search environments where students can find new information, and simulations to make problems more engaging and real. These tools can record students' answers as well as their thought process: what kind of information they sought, how long they spent on each Web page, and where they might have gone off track.

The British government has created a computer-literacy test that challenges teens to solve realistic problems (how to control crowds at a soccer match, for instance) using online resources. The more sophisticated these tools become, and the more adeptly test makers use them, the better assessment will be.

So, progress is coming -- in some cases, has arrived -- but as the OECD's Andreas Schleicher says, "It's a long road, and we're at the beginning." The biggest hurdles are time and money (richer tests require more of both to design and administer), and that rarely tamable beast, politics. The next version of NCLB, due later this year, could pump federal money into pilot projects to help states create richer assessments, paired with richer curriculum -- but only if that clause survives the political battle to come.

Stephen Dunbar, the Iowa test author, has doubts that more complex tests can be administered on a large scale. Though the effort is worthy, he says, the cost and time required to create and score open-ended questions, and to make them comparable from year to year, could prove impractical. Scary as it might sound, artificial intelligence is likely to play a big role in the scoring of such exams. If the technology becomes sophisticated enough to handle answers to trickier problems, it could make better assessment more affordable.

ETS's Randy Bennett, on the other hand, believes the prospects of building an assessment system to match the demands of the twenty-first century are "pretty good." The key is to convince states that such a system is practical, affordable, and clearly better than today's exams at providing meaningful information. At least one state, West Virginia, has begun asking the test makers it contracts to emphasize more modern problems and skills. Another hurdle will be for politicians to temper their devotion to multiple-choice questions and get comfortable with a little subjectivity. "For any assessment," Schleicher says, "you have to make a trade-off between objectivity and relevance."

Jennifer Simone, for one, is depending on forward-thinking test makers and policy makers to succeed -- for the sake of her students, most of all. "That we are held accountable is a good thing. That we are doing something to measure the progress of our students is a good thing," she says. "I just disagree with the way it's being done."

Grace Rubenstein is a senior producer at Edutopia.
