Using Standardised Tests for Accountability?

Are standardised and large scale tests a credible means to measure progress toward ‘reforms’? The idea sounds immediately attractive. Yet raising the stakes for large scale standardised tests has not been without its costs. The field of psychometrics owes its existence to attempts to find a way out of the formidable problems presented by the need to standardise tests in ways that allow fair comparisons between individual children and between groups of children with different attributes. Yet, despite refinements in psychometrics, the use of standardised tests for accountability of schools and teachers remains mired in controversy.

Before reading on, please be assured that I am not against testing per se but am only being a consequentialist, that is, focusing more on the consequences of X instead of assigning an intrinsic value to X. My inspiration for consequentialism is Amartya Sen’s case for it in his book Development as Freedom. He worries that deprivation of many kinds, such as lack of medical care, basic education, and so on, can coexist with all libertarian rights being fully satisfied, and that similar incomes do not necessarily convert equally into well-being due to differences in human condition. In his view, if one focused more on human well-being [as the consequence of higher incomes or of more libertarian rights] the calculus would become different than it would have been if one accorded priority to libertarian rights. This is not to denigrate freedom and liberty, but to say that they do not, under all circumstances, necessarily lead to well-being. As he puts it: “To ignore consequences in general, including the freedoms that people get–or do not get–to exercise, can hardly be an adequate basis for an acceptable evaluative system.” [Amartya Sen in Development as Freedom, p.66].

My concerns about the use of standardised testing are similarly consequentialist. Recasting Sen’s phrases in italics above, I say: Learning deprivation can continue to coexist with the best of standardised testing regimens. To ignore the consequences in general, including how good teaching is converted into learning gains, can hardly be an adequate basis for an acceptable evaluative system.

Standardised testing, as a strategy to tame the sheer complexity and uncertainty of the process of education, especially when it is extended to the entire population, has been in place at least since the times of Alfred Binet [also see this paper for a discussion on standardised tests and bell curve thinking]. Though old, standardised testing became increasingly popular in the United States, and became a central evaluative instrument to measure progress toward reforms, in the aftermath of the famous jeremiad A Nation at Risk. With the importance accorded to standardised tests and the inter-group comparisons that followed, the ‘achievement gaps’ were constructed as an important policy issue. ‘Gap gazing’ was referred to by some scholars as the fetish of education research. The No Child Left Behind (NCLB) Act, passed under the Bush administration in 2001, took the stakes associated with standardised tests to new heights.

Of course, like many other ‘travelling reform’ ideas, the discourse practice of standardised testing has crept surreptitiously into the global education discourses. But as usually happens in the case of such travelling reform ideas, ‘standardised tests’ were stripped of the debate and controversy surrounding them in the U.S. [I mainly refer to the U.S. here for brevity, but the testing movement is not restricted to the U.S.; it has swept through almost all western countries]. The meaning of standardised tests also underwent some transformation. In some instances, I have heard the term used erroneously for any large-scale test, irrespective of whether it has been developed through a ‘standard’ procedure or not. ‘Standardised’ is mostly used synonymously with any single test that is given to a large number of students.

When the same test is given to a large number of students in the same grade, it is assumed that the test scores will allow for comparisons of various kinds. Does this assumption make sense? I think it does not withstand even very preliminary scrutiny, primarily due to the enormous, almost intractable, diversity in Pakistan. Pakistan is afflicted by the worst of inequities in the distribution of capabilities across its population. The contextual factors (both inside and outside of the school) that have an effect on learning vary significantly across different groups. So the same test given to people located at different points on any interpersonal comparison of well-being can potentially produce very different results.

If different test takers taking the same test are at vastly different locations on measures of well-being, then it seems absurd to rely on the test scores to make judgements about accountability and performance of teachers. Imagine a teacher who is well-educated and motivated, and out of her motivation joins a school where she’d have to work with children from high poverty groups. Compare her situation with another teacher who is, let us assume, half as good [on the metric of education, professional qualification, and motivation] but happens to find herself in a school of well-nourished and motivated kids in high-opulence districts of Lahore. How difficult would it be to come up with a metric that allows adjustments for teachers’ context-based advantages/disadvantages before they are declared effective [or ineffective]?

Then, there is the other, much larger, problem: that of mortgaging your life to the test, i.e. doing everything that you must to get a good test score. Even in the rich countries, as I have noted earlier, test-based accountability has had unintended effects. [Regular readers of this blog may have seen the posts on the explosion of reports of cheating in schools in the United States.] Diane Ravitch, the eminent historian of American schools, says:

Accountability pressures have also led to widespread gaming of the system. Every so often, a cheating scandal is uncovered, but such scandals are minor compared to the ways in which states have manipulated the scoring of tests to produce inflated results. New York state education officials, for instance, made it easier to rate students as “proficient” by lowering the number of points that a student needed to earn on the state tests. In 2006, a seventh grader needed to get 59.6 percent on the state math test to be rated proficient, but, by 2009, a student needed to earn only 44 percent. Although most people would consider this a failing grade, the lowering of the “cut point” produced the desired results: In 2006, 55.6 percent of seventh graders were rated proficient, but, by 2009, that proportion had soared to 87.3 percent.

I am not opposed to testing. Test scores should be used to diagnose problems or to provide information about student progress or a program’s effectiveness. They should be used to help students improve their learning and to help teachers become better at their jobs.

Test scores are misused, however, when they become blunt instruments to punish teachers or schools. States’ standardised tests are not the equivalent of yardsticks or barometers. They have margins of error. If Johnny takes a test on a Monday, he could take the same test a week later and get a higher or lower score depending on any number of things, including Johnny’s mood, his health, the weather, the testing conditions in the room, or just random variation. The tests also sometimes contain errors or ambiguities. These are weak reeds on which to hang the fate and future of students, teachers, and schools. [Quoted from Pass or Fail]

Lortie said that education reforms are ‘long on prescription and short on description.’ Though said in the context of the U.S., it is equally true for Pakistan. The suggestions to use standardised tests for accountability need to be seen against the backdrop of this history of quick fixes and half-thought prescriptions, with too little deliberation and analysis. I also find it interesting that accountability talk focuses on teachers alone, and not on other cogs of the system. It would be a great idea to make the political leaders in a constituency also accountable for providing the kind of support to schools and teachers that is needed for good results. It would be equally heartening to see the commissioners, or DCOs, or whatever they are called these days, held accountable for the performance of schools in their districts. It is easy both to pull the teachers out for election and other sorts of duties and to blame them for low performance of children. Let the standardised test scores, then, make it more difficult for every cog in the system, and not just for teachers. To be sure, I am not defending bad teachers, but just suggesting some critical issues for consideration by the readers of this blog regarding the use of standardised tests for accountability.

Let me end with an example from an extreme. In 1996, I attended a lecture by a Russian professor of Mathematics at Columbia University. He was talking about mathematics education in the erstwhile Soviet Union. The American audience were shocked when told that the parents of the children would get a letter of displeasure and a warning from the party offices if their kids did not do well in mathematics. Well, like it or not, this sort of targeting of parents for accountability also worked, presumably by influencing parental attitudes toward their children’s learning of a specific school subject. So, test scores have been used in many different ways. But it may be a good idea to spend some time thinking through the possible consequences of using them as part of an accountability regimen within particular contexts. Such thinking may be informed by evidence about the consequences of such usage elsewhere.


Update: Meanwhile, some Pakistani economists, media and civil society representatives and politicians met on the premises of Harvard University recently. They arrived at a consensus statement that says:

  • There must be standardized testing, measuring, and dissemination of learning achievements. This will also provide an objective outcome against which to measure reform progress.
  • There must be recognition that teachers are central to education in Pakistan. They need to be supported and held accountable.
  • There must be an acknowledgement that the private sector is an important ally in the quest to educate Pakistan. It needs to be facilitated, provided that it meets minimum education standards.

The full document can be read here


About Irfan

I am an independent researcher and blogger interested in everything under the sun, but more so in the philosophy and history of education and education reform generally, and specifically in the so-called post colonial contexts.

10 Responses to “Using Standardised Tests for Accountability?”

  1. Irfan, provocative as always, but simply to ensure that we are talking about the same thing it may be useful to share the classification of assessments and see which ones fit the consequentialism critique, or perhaps all do. Prof. Dan Wagner, from UPenn and also once a visiting expert at IIEP/UNESCO, has done some work on this.

    Three main types/categories of assessment are:
    1. LSEAs: Large-scale educational assessments
    2. HBES: Household-based educational surveys
    3. SQC: Smaller, quicker, cheaper / hybrid assessments

    • LSEA
    Large-scale educational assessments are increasingly used by national and international agencies. The 1990 Jomtien conference on “Education for All” demanded more accountability and systematic evaluation in less developed countries, and LSEAs became increasingly a key tool for meeting this demand. Nonetheless, the increasing complexity of LSEAs has led some to question their necessity in LDCs. Several major LSEAs include PIRLS, PISA, SACMEQ, PASEC and LLECE.
    • HBES
    Household-based educational surveys employ sampling methods to gather specific types of information on target population at the household level, and stratified along certain demographic parameters.
    • Hybrid Assessments
    More recent hybrid assessments pay close attention to such factors as: Population diversity, linguistic and orthographic diversity, individual differences in learning, and timeliness of analysis. This hybrid approach is termed the “smaller, quicker, cheaper” (SQC) approach. The Early Grade Reading Assessment (EGRA), one recent hybrid assessment, has gained considerable attention in LDCs. The Annual Status of Education Report (India, Pakistan) and UWEZO in East Africa belong to the Hybrid category.

    The team that met at Harvard .. perhaps they did discuss this or did not. Perhaps they had an opportunity to be a bit brutal on what is the status of NEAS/PEAC (grades 4 and 8) its regularity and robustness as a benchmarking tool or for that matter the Punjab Examination Commission (PEC) latest nose dive from 33% passing mark to 25% to up the performance of Punjab public and private school (depending how we look at performance and proficiency like the NY schools!) or speak about ASER Pakistan/India in terms of its robustness and scale (Pakistan is the new kid on the block on ASER as is East Africa with India as the pioneer of the methodology of this hybrid). You see you can announce a Consensus Statement based on 30 participants from a country of 180 million (Balochistan/Gilgit Baltistan excluded) but what was the quality of the discussion /analysis we did not hear. Thus the consensus announcement sounds safe and tested and equally applicable anywhere .. in the world. What is not evident is critique of what exists and scalability of the statement and the agency/institutions for its promotion or operationalization.

    Your concerns are all very well placed, but I think there must be clarity on whether, say, the hybrid or ASER-type assessments, which are citizen led, are people friendly or not, entitlements focused or not, accessible by the common person or not, sensitive to diversity or not, promote social justice as the moral minimum for learning or not, and, if they do not, how they can be improved. If such hybrids are useful, and in fact bring citizens closer to their demand for the right to education (of a certain quality) and also the right to information, then there is some merit to such assessments, which do set aside some concerns of Sen, Diane Ravitch, and Lortie. They in fact empower ordinary citizens through enhanced capabilities of self/community testing, diagnosis and action. It would be great to have my colleagues Rukmini Banerji, Wilima Vadwa and Madhav Chavan respond to your blog from ASER & Pratham.

    Director Programs ITA/IPL, Coordinator South Asia Forum for Education Development (SAFED)

  2. Thanks for such a detailed note. This really adds value to the conversation. I am honoured to have you as a reader of this blog.
    As I have noted in my post too, I am not against testing per se.
    I am only being consequentialist, and arguing that it may be a good idea to think thoroughly through the consequences of an activity, such as the testing, rather than focus exclusively on its intrinsic merits. When thinking about the consequences, the hindsight helps a lot more than the foresight, or so it seems. We already know about the consequences of testing and they are varied.

  3. This is just an update to my earlier response to Baela’s comment. Any of the tests you mention may be quite robust. If they aren’t, efforts ought to be made to make them robust enough.
    Let me also clarify that it is not the quality of the tests that I am concerned about but the ways in which they are put to use. The tests can have, as Diane Ravitch also points out, a diagnostic value. They can give us, if they are good enough, very useful information about particular school subjects or topics in which the students are facing difficulties. This information can be very helpful for the teachers and for professional development programs. Conversations among the teachers can be arranged where they discuss the issues based on the evidence provided by large scale tests. Teachers from regions or classrooms where students seem to have done well in particular areas can share their experience with their peers. Evidence from the tests, like evidence from research in general, can become a basis for purposive dialogues. Such dialogues can involve teachers, but can also involve communities, politicians, the civil service, as well as society.
    Contrast this with using tests narrowly to evaluate, reward, and punish teachers and schools.

  4. a very timely post. standardised testing, while attractive for getting a lot of information very quickly, has real implications for how teachers teach. to pick a line from a previous post – it holds policy makers in thrall but may not be the best way forward. the question that needs to be asked perhaps is why there should be standardised testing? is the objective to ascertain whether a minimum standard of teaching is happening? are the scores to be used to hold teachers accountable? the latter may be a bad idea because studies around the world have shown it promotes teaching to the test and has detrimental effects in the long run. if the former – why can there not be non-standardised measures by district, by rural-urban or even by type of school that are developed with inputs by teachers themselves. arguably one of the biggest issues today is the absence of teachers from this discourse, which is largely about them. It leads to a sense of disconnect and demotivates them. Imposing accountability checks without ensuring support is not likely to fix the problem. It may worsen it further.

  5. Nadeem irshad kayani, 20/08/2011 at 10:21 pm

    I think that rather than taking action based on quantity/administrative aspects it is much more useful to assess a kid, and indirectly the teacher, based on the quality of teaching and delivery in the classroom. Other factors as pointed out by Irfan are important, but they do not restrict our use of the large scale testing presently being done by DSD in Punjab. I can also share that the results are planned to be shared with parents, training needs of teachers will be identified, and an education profile of a district for policy formulation could be produced.

    • Dear Nadeem, Thank you so much for visiting my blog and for leaving your comment.

      As you rightly point out I am not suggesting that children should not be tested.

      In fact my position on this issue is very well captured in a quote, which is in the post above, but which I will paste here again so the readers do not have to look for it in the text above:

      I am not opposed to testing. Test scores should be used to diagnose problems or to provide information about student progress or a program’s effectiveness. They should be used to help students improve their learning and to help teachers become better at their jobs.

      Test scores are misused, however, when they become blunt instruments to punish teachers or schools. States’ standardised tests are not the equivalent of yardsticks or barometers. They have margins of error. If Johnny takes a test on a Monday, he could take the same test a week later and get a higher or lower score depending on any number of things, including Johnny’s mood, his health, the weather, the testing conditions in the room, or just random variation. The tests also sometimes contain errors or ambiguities. These are weak reeds on which to hang the fate and future of students, teachers, and schools.

      So when the results of the standardised tests are used as part of feedback to improve the process of professional development, just as DSD is attempting to do, it would certainly go a long way in making a worthwhile contribution toward making the professional development most relevant and focused. The benefits for learning will also hopefully follow. This is a worthwhile undertaking. The only thing that DSD may want to consider is more coordination with PEC as it sets out to refine its own large scale testing process.

      When the tests are used in the ways suggested above, then margins of error are not something to worry about. They will always be there, and there will always be other influences on student learning apart from the quality of teaching.

      However, when the results of the tests are used for rewards and punishments of schools and teachers, then the technical problems become more worrisome. A test should never be used for such purposes unless it has passed the crucial tests of reliability and validity. Doing that is a cumbersome technical process and needs investment and capacity. But even if the tests are assembled after following the most stringent test development procedures, there are other kinds of [documented] problems that may arise if tests alone are used for accountability. There are several posts on this blog where I have documented the unintended consequences associated with using testing for accountability.

      When such problems arise in countries where there is no shortage of technical and psychometric expertise, then the issue is perhaps not solely technical. When stakes associated with tests for individual teachers and schools are increased, unintended consequences are seen to follow. So, I suggest it would be useful for the policy makers to be cognisant of these issues when developing the policies regarding the use of standardized tests for accountability.

      But as far as using the results of the large-scale tests for improving the professional development programmes is concerned, I think it is long overdue and much needed, and it is indeed heartening to know that DSD is taking steps to make that happen.

  6. Nadeem irshad kayani, 21/08/2011 at 1:11 pm

    Irfan, you are right that we should think in terms of the use of standardised tests for teacher accountability, but in a country like Pakistan there are fewer options. We can go for classroom observations, but here too the element of bias for a colleague comes in. A better option, which we are trying, is accountability by parents, who are the actual beneficiaries or otherwise of our educational reforms. Test results provided to the parents at their doorstep by DSD staff, i.e. DTEs, might bring them in to play a decisive role in the process of accountability of teachers.


