A combination of new federal accountability measures, states’ plans to comply with them, and new commercial testing products threatens students, teachers and schools with a new wave of inappropriate high-stakes testing.

So cautions Teachers College measurement-evaluation expert Madhabi Chatterji in a publication released this week by the National Education Policy Center, based at the University of Colorado, Boulder.

Chatterji’s warning – and a series of guidelines for preventing the scenarios she fears – comes as 44 states begin implementing their federally approved plans for meeting the testing and accountability requirements of the Every Student Succeeds Act (ESSA), enacted in 2015 under President Obama.

Many of these states are planning to use “statistically derived indices from test-based data to rank, rate or examine growth of schools or education systems to fulfill ESSA’s requirements,” writes Chatterji, Professor of Measurement, Evaluation & Education, in “A Consumer’s Guide to Testing under the Every Student Succeeds Act (ESSA): What Can the Common Core and Other ESSA Assessments Tell Us?”  “However, measurement experts, researchers and professional associations (such as the American Educational Research Association and the American Statistical Association) have cautioned against several of these – particularly ‘student growth percentiles,’ ‘value-added’ growth models, and multi-indicator ‘composite’ scores.”

Misuse of test information in this way, Chatterji writes, is akin to “misreading a Fahrenheit thermometer in degrees Celsius.”

Chatterji, who is also founding director of TC’s Assessment and Evaluation Research Initiative, says her “Consumer’s Guide” is not a critique of particular standardized tests or testing programs, but instead a “‘tool kit’ for state, national, and district policymakers (and the assessment specialists/researchers who assist them) to help avert the most common pitfalls and adverse consequences of inappropriate test information use for students, families and concerned stakeholders.”  A key message – which Chatterji has delivered in many past writings – is that “validity is not a fixed property” that can be built into tests. Rather, she writes “the extent to which tests yield meaningful or valid information on student learning, or the quality of schooling, depends on how appropriately test results are put to use in decision-making contexts.”

The nation’s recent track record in that regard has not been encouraging. For example, the old SAT was not designed to measure schools’ effectiveness – yet in 2012, under the No Child Left Behind Act (ESSA’s predecessor), many school districts used it as the basis for identifying exceptional schools and practices.

Nor have test developers helped the situation. Rather than simply providing students “raw” scores on standardized tests (the total points a student earns for providing correct answers), makers of standardized tests typically provide “scaled scores” – scores that have been transformed to enable comparisons among students who took different levels or forms of a test. The statistical wizardry involved can be so complex that such “derived” scores become a “black box” to most test users, increasing the likelihood that they will be misused for policy purposes.
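To make the raw-versus-scaled distinction concrete, here is a minimal illustrative sketch. The linear mapping and the 200–800 reporting range are assumptions chosen for clarity; real testing programs derive scaled scores through far more complex psychometric models (such as item response theory), which is precisely why Chatterji describes them as a “black box” for most users.

```python
# Illustrative sketch only: a hypothetical linear transformation from a
# raw score (points earned) to a reporting scale. Actual test makers use
# much more elaborate statistical equating procedures.

def scale_score(raw, raw_max, scale_min=200, scale_max=800):
    """Map a raw score in [0, raw_max] onto an assumed 200-800 reporting scale."""
    return round(scale_min + (raw / raw_max) * (scale_max - scale_min))

# A student earning 45 of 60 raw points would land at 650 on this scale.
print(scale_score(45, 60))
```

Even in this toy version, the reported number (650) no longer reveals how many questions the student answered correctly – the transformation has to be understood before the score can be interpreted.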

The new ESSA guidelines further increase the chances for testing misuse, Chatterji says, because they place heavy pressure and tight restrictions on states to meet self-set goals – including long-term “growth-related” targets.

On the broadest level, Chatterji recommends that all test users specify, up front, the kinds of inferences they intend to draw from test data; that they avoid “multi-purposing” tests in ways that go beyond either the test’s intended use or the reported evidence; that they justify their uses and inferences of test-based data by referring to specific appropriate criteria for validity, reliability and utility; and that they seek out expert technical review before using tests for accountability purposes.

Among her other, more specific recommendations, Chatterji calls for the use of “descriptive quality profiles” – reports on locally valued indicators of student and school success separately – instead of complex statistical indices.
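The contrast between a descriptive quality profile and a composite index can be sketched as follows. The indicator names, values, and weights below are invented for illustration; the point is that the composite collapses distinct measures into one opaque number, while the profile reports each locally valued indicator on its own terms.

```python
# Hypothetical school data (all names and figures invented for illustration).
school = {
    "math_proficiency_pct": 62,
    "reading_proficiency_pct": 71,
    "attendance_rate_pct": 94,
    "graduation_rate_pct": 88,
}

# Composite-index approach: a single weighted score whose meaning depends
# entirely on analyst-chosen weights.
weights = {
    "math_proficiency_pct": 0.3,
    "reading_proficiency_pct": 0.3,
    "attendance_rate_pct": 0.2,
    "graduation_rate_pct": 0.2,
}
composite = sum(school[k] * w for k, w in weights.items())
print(f"Composite index: {composite}")

# Descriptive-profile approach: each indicator reported separately,
# so stakeholders can see where a school is strong or weak.
for indicator, value in school.items():
    print(f"{indicator}: {value}%")
```

The composite here works out to 76.3 – a figure that conceals the gap between the school’s 62 percent math proficiency and its 94 percent attendance rate, which a profile keeps visible.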

The “high stakes” of states’ ESSA rollout plans go beyond the immediate impact on schools and students. Past assessments have prompted major backlashes, Chatterji notes – for example, the Opt Out movement (parents who refuse to let their children take standardized tests in public schools) and the concurrent decision by many states to withdraw from the two national consortia implementing assessments geared to the Common Core State Standards. Such fragmentation can result in individual states adopting policies based on their own tests, with similar patterns of misuse of testing data.

And yet, Chatterji says, “there is a political demand for high-stakes uses of test data that is likely to continue.

“Regardless of the recent backlash, the public still seeks standardized test scores – not only for students, but also as a better gauge of their local schools. Combined, these factors create conditions for some of the recurring testing issues that this guide identifies.”