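States now devote more attention, time, and money than ever before to performance measurement and evaluation in the public sector (Organization for Economic Cooperation and Development [OECD], 1996; Pollitt & Bouckaert, 2000, p. 87; Power, 1997). Results-based management is the talk of the day at all levels of the public sector: local, regional, national, and even supranational. Schools and universities, local governments, other administrative agencies, developmental aid organizations (nongovernmental organizations and international nongovernmental organizations), and organizations such as the World Bank are all involved in producing data and information on performance results and, if possible, impact. Power (1994, 1997, 2000) even refers to the "audit explosion" or the "audit society." Believers in New Public Management attribute a high priority to measuring output and outcomes. They aim to base new policies and management activities on this ideal type of information so that policy implementation becomes more efficient and effective. However, evaluation studies show that many attempts to introduce results-based management are still unsuccessful (see, for example, Leeuw & Van Gils, 1999, for a review of Dutch studies). Nevertheless, the need to measure output and outcomes and to carry out evaluation activities remains an important element in statements by politicians and administrators on improving government performance.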



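The growing attention to performance assessment in the public sector coincides with the rise of administrative reform (cf. Power, 2000). In the 1980s, economic decline and increased international competition triggered such reform in most Western states. "New Public Management" was the catchword. The objective was twofold: to cut budgets and to improve the efficiency and effectiveness of government bureaucracy. To achieve the latter objective, market-type mechanisms such as privatization and competitive tendering were introduced in the public sector, and departmental units were hived off into quasi-autonomous nongovernmental organizations. Examples can be found everywhere (for a review of 10 OECD countries, see Pollitt & Bouckaert, 2000).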


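These changes led the public sector to adopt a large number of private sector techniques to measure and improve performance, such as performance indicators. Indicators not only enable politicians to measure and evaluate the performance of public and private policy-implementing organizations, they also increase the opportunities to account for performance, which is another goal of administrative reform. Obviously, all these changes rested on a strong belief that performance in the public sector is measurable. However, as we shall argue below, that belief may have been too simplistic (cf. Fountain, 2001).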


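The performance paradox refers to a weak correlation between performance indicators and performance itself (Meyer & Gupta, 1994; Meyer & O'Shaughnessy, 1993). This phenomenon is caused by the tendency of performance indicators to run down over time. They lose their value as measurements of performance and can no longer discriminate between good and bad performance. As a result, the relationship between actual and reported performance declines.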

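The deterioration of performance indicators is caused by four processes (Meyer & Gupta, 1994, pp. 330-342). The first process is called positive learning; that is, as performance improves, indicators lose their sensitivity in detecting bad performance. In fact, by the time the indicator has become obsolete, everybody is performing well. The second process is called perverse learning. When organizations or individuals have learned which aspects of performance are measured (and which are not), they can use that information to manipulate their assessments. For example, when they put all their efforts into what is measured, measured performance goes up. Overall, however, there may be no actual improvement, and other aspects of performance may even deteriorate (cf. tunnel vision) (Smith, 1995). The third process, selection, refers to the replacement of poor performers with better performers, which reduces differences in performance. Fourth, suppression occurs when differences in performance are ignored.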

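It is important to understand that the paradox is not about performance itself but about the reports on performance. Contrary to expectation, indicators do not give an accurate report of performance. This could mean that performance is worse than reported, but also that it is better than reported. In the latter case, the performance paradox might be considered harmless. However, when the results of performance assessment are used to evaluate organizations or persons, situations can arise in which these are unjustly sanctioned.

Police performance in the Netherlands offers an example. The percentage of crimes solved is decreasing, which indicates that the police's performance is deteriorating. However, during the period studied, more perpetrators were arrested, prosecuted, and penalized than before, which would indicate an improvement in performance. Wiebrens and Essers (1999) show that crime patterns in the Netherlands have developed in a way that invalidates the (internationally well-established) indicator. For one, crime has become more violent, but the indicator does not differentiate between, for example, felonies and misdemeanors. Moreover, more groups of criminals have been arrested for committing a single crime together, such as vandalism, which reduces the average number of crimes per criminal. Wiebrens and Essers conclude that it is not the police that are performing badly but the indicator, and that the indicator should therefore be replaced.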

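Another example of the performance paradox, this time a case of overrepresentation, is taken from Smith (1995). In the British National Health Service, it was agreed that patients should wait no longer than 2 years for an operation. This measure appeared successful, as the average waiting time decreased. On further inspection, however, it turned out that the waiting time was counted only from the first hospital consultation, and consultations were being postponed to reduce the recorded waiting time. In fact, the waiting time did not decrease at all but was merely shifted. The indicator did not accurately reflect performance; it reported an improvement where there was none.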


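Although most readers will have recognized the examples we have given so far, it is not easy to trace a performance paradox in progress. Not only can it take many different forms, it can also be the unintended result of a number of variables, such as government demands, the type of task to be carried out, the vagueness or contradictory nature of policy objectives, and the capabilities of the policy-implementing organization. Moreover, one is often not aware of the existence of a performance paradox until it is too late, because as long as everything goes well, or appears to go well, there is no need to intervene (cf. Leeuw, 1995). One example is the explosion of a fireworks factory. The investigation into the causes of this disaster revealed a range of "small" problems that by themselves were not considered to be catastrophic. For example, there was a clear lack of monitoring by the local and central governments, inspectorates, and the fire department. The absence of proper supervision prevented the discovery of illegal activities. Hence, the license to operate in a residential area was unjustly and unconditionally renewed. When a fire broke out on the grounds of the factory, the neighborhood was destroyed and lives were lost. The accumulation of small problems became a large problem, yet apparently no mechanism or system existed to detect and prevent such an accumulation of small errors. Local politicians were, of course, held accountable, but only after the fact. This leaves us with the question of how we can detect and prevent the occurrence of a performance paradox in the public sector.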



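Performance assessment in the public sector has to take the nature of public services into account. The way in which services are produced and consumed, and the way in which public services are valued by society, affect performance measurement. In the public sector, consumers participate in the service delivery process and thereby influence output and outcomes (cf. Fountain, 2001, p. 58). Moreover, most products are intangible. Performance indicators should therefore strive to reflect quality and reliability rather than "hard" product attributes. Public services are not only about efficiency and effectiveness but also about justice, fairness, equity, and accountability. Fountain (2001) warns that the application of private sector techniques, such as performance indicators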



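This COAG performance information is used by government agencies to assess performance and to determine needs and resources. The transparency of the framework has improved performance and accountability. However, according to McGuire (2001, p. 17), this transparency has increased rather than resolved political conflict over the distributional outcomes of public services. Because political conflict increases the opportunities for accountability, this should not be regarded as a negative consequence of performance assessment.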



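The addition of some new monitoring mechanisms may help to counteract unintended consequences such as the performance paradox (cf. also Power, 1997). First, the Internet makes public sector performance information available to everyone, which increases the risk that cheating agents will be caught. Second, citizens' charters, open-government codes of practice, and new complaints procedures give dissatisfied clients more opportunities to complain about poorly performing organizations. These new, more horizontal forms of performance assessment will supplement the existing performance assessment systems in the public sector. Finally, scholars should develop and test theories that can explain the occurrence of the performance paradox and other unintended consequences (Scott, 2001). More knowledge about theories of organizational behavior, institutions, and the effects of using performance indicators in the public sector can help governments truly realize the intended benefits of performance indicators in the public domain.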




Nowadays, states spend more attention, time, and money on performance measurement and evaluation in the public sector than ever before (Organization for Economic Cooperation and Development [OECD], 1996; Pollitt & Bouckaert, 2000, p. 87; Power, 1997). Results-based management is the talk of the day at all levels of the public sector: local, regional, national, and even supranational. Schools and universities, local governments, other administrative agencies, developmental aid organizations (nongovernmental organizations and international nongovernmental organizations), and organizations such as the World Bank are all involved in producing data and information on performance results and, if possible, impact. Power (1994, 1997, 2000) even refers to the "audit explosion" or the "audit society." Believers in New Public Management (NPM) attribute a high priority to measuring output and outcomes and aim to base their new policies and management activities on this type of information, ideally to make policy implementation more efficient and effective. However, evaluation studies show that many attempts to introduce results-based management are still unsuccessful (see, for example, Leeuw & Van Gils, 1999, for a review of Dutch studies). Nevertheless, the need for measuring output, outcomes, and evaluation activities remains an important element in statements by politicians and administrators focused on improving government's performance.

Below, we will argue that this increase of output measurement in the public sector can lead to several unintended consequences that may not only invalidate conclusions on public sector performance but can also negatively influence that performance. We will show, with a number of examples, that several characteristics of the public sector can be counterproductive to developing and using performance indicators. Finally, we will conclude with some suggestions on how to deal with the problem of performance assessment in the public sector. We believe this question is important because, despite its problems, performance measurement can indeed be of value to the public sector, especially with regard to public expenditures.

Performance Assessment in the Public Sector

The increased attention to performance assessment in the public sector coincides with the rise of administrative reform (cf. Power, 2000). In the 1980s, economic decline and increased international competition triggered such reform in most western states. New Public Management was the catchword (Hood, 1994). The objective was twofold: to cut budgets and to improve the efficiency and effectiveness of government bureaucracy. To achieve the latter objective, market-type mechanisms such as privatization, competitive tendering, and vouchers were introduced in the public sector, and departmental units were hived off into quasi-autonomous nongovernmental organizations (quangos). Examples can be found everywhere (for a review of 10 OECD countries, see Pollitt & Bouckaert, 2000).

The practitioner theory underlying these changes is that politicians should stick to their core business, that is, developing new policies to realize (political) goals. Osborne and Gaebler's (1992) adage was "steering not rowing." According to these NPM gurus, policy implementation should be left to the market or, if that is not possible, to (semi)autonomous organizations operating in a quasi-market environment (e.g., competition between schools or hospitals). This separation of policy and administration is facilitated through contracts drawn up between the government and the organization that implements the policy. The contracts articulate which task has to be carried out and what the executive agent will receive as a "reward." The agent's performance is expressed in terms of performance indicators, such as the number of goods or services rendered. Input management is thus replaced by a results-based orientation. Similar changes took place within government bureaucracy as well, where self-management and contract management were introduced to (partly) replace hierarchical steering.

The aforementioned changes in the public sector led to the adoption of a large number of private sector techniques to measure and improve performance, such as performance indicators. Not only do indicators enable politicians to measure and evaluate the performance of public and private policy-implementing organizations, they also increase the opportunities to account for performance, another important goal of administrative reform (Jenkins, Leeuw, & Van Thiel, in press). Obviously, all these changes were fed by a strong belief in the measurability of performance in the public sector. However, as we shall argue below, that belief may have been somewhat simplistic (cf. Fountain, 2001).

The Performance Paradox

The performance paradox refers to a weak correlation between performance indicators and performance itself (Meyer & Gupta, 1994; Meyer & O'Shaughnessy, 1993). This phenomenon is caused by the tendency of performance indicators to run down over time. They lose their value as measurements of performance and can no longer discriminate between good and bad performers. As a result, the relationship between actual and reported performance declines.

Deterioration of performance indicators is caused by four processes (Meyer & Gupta, 1994, pp. 330-342). The first process is called positive learning; that is, as performance improves, indicators lose their sensitivity in detecting bad performance. In fact, everybody has become so good at what they do that the indicator becomes obsolete. The second process is called perverse learning. When organizations or individuals have learned which aspects of performance are measured (and which are not), they can use that information to manipulate their assessments. For example, by putting all their efforts into what is measured, they make measured performance go up. However, overall there may be no actual improvement or perhaps even a deterioration of (other aspects of) performance (cf. tunnel vision) (Smith, 1995). The third process, selection, refers to the replacement of poor performers with better performers, which reduces differences in performance. Only good performers remain, and the indicator loses its discriminating value, almost as a consequence of a survival-of-the-fittest mechanism. And fourth, suppression occurs when differences in performance are ignored (see below for an example).

It is important to understand that the paradox is not about performance itself but about the reports on performance. Contrary to expectation, indicators do not give an accurate report of performance. This could mean that performance is worse than reported (overrepresentation) but also that it is better than reported (underrepresentation). In the latter case, the performance paradox might be considered harmless. However, when the results of performance assessment are used to evaluate organizations or persons, situations can arise in which these are unjustly sanctioned.

Police performance in the Netherlands offers an example of underrepresentation. The percentage of crimes solved is decreasing, indicating that the police's performance is deteriorating. However, during the time period studied, more perpetrators have been arrested, prosecuted, and penalized than before, which would indicate an improvement of performance. Wiebrens and Essers (1999) show that crime patterns in the Netherlands have developed in a way that invalidates the (internationally well-established) indicator. For one, crime has become more violent, but the indicator does not differentiate between, for example, felonies and misdemeanors. Moreover, more groups of criminals have been arrested for committing a crime together, such as vandalism, which reduces the average number of crimes per criminal. Wiebrens and Essers conclude that it is not the police that are performing badly but the indicator and that it therefore should be replaced.

An example of a performance paradox in a case of overrepresentation is taken from Smith (1995). In the British National Health Service, it was agreed that patients should be on a waiting list for an operation no longer than 2 years. This measure appeared successful, as the average waiting time decreased. However, on further inspection it was found that, because the waiting time was counted only from the first hospital consultation, consultations were postponed to decrease the recorded waiting time (perverse learning). In fact, the average waiting time did not decrease at all but was merely shifted in time. The indicator did not accurately reflect performance; it reported an improvement where there was none.

Detection and Prevention of a Performance Paradox

Although most readers will have recognized the examples we have given so far, it is not easy to trace a performance paradox in progress. Not only can it take on many different forms, it can also be the unintended result of a number of variables, such as government demands, the type of task to be carried out, the vagueness or contradictory nature of policy objectives, and the capabilities of the policy-implementing organization. Moreover, one is often not aware of the existence of a performance paradox until it is too late because as long as everything goes well, or appears to go well, there is no need to intervene (cf. Leeuw, 1995). A tragic example is the explosion of a fireworks factory that killed more than 20 people in the city of Enschede, the Netherlands, in May 2000. The investigation into the causes of this disaster revealed a range of "small" problems that by themselves were not considered to be catastrophic. For example, there was a clear lack of monitoring by the local and central governments, inspectorates, and the fire department. The absence of proper supervision prevented the discovery of illegal activities taking place. Hence, the license to operate in a residential area was, unjustly, renewed. When a fire occurred on the grounds of the factory, the neighborhood was destroyed and lives


