Effectiveness of Performance Appraisal: An Integrated Framework

Based on a robust analysis of the existing literature on performance appraisal (PA), this paper makes a case for an integrated framework of effectiveness of performance appraisal (EPA). To achieve this, it draws on the expanded view of measurement criteria of EPA, i.e. purposefulness, fairness and accuracy, and identifies their relationships with ratee reactions. The analysis reveals that the expanded view of purposefulness includes more theoretical anchors for the purposes of PA and relates to various aspects of human resource functions, e.g. feedback and goal orientation. The expansion in the PA fairness criterion suggests certain newly established nomological networks, which were ignored in the past, e.g. the relationship between distributive fairness and organization‐referenced outcomes. Further, refinements in PA accuracy reveal a more comprehensive categorization of rating biases. Coherence among measurement criteria has resulted in a ratee reactions‐based integrated framework, which should be useful for both researchers and practitioners.


Introduction
Effectiveness of performance appraisal (EPA) remains one of the most vital subjects in the theory and practice of performance appraisal (PA). In earlier times, it merely referred to how well the complex process of assessing employee work performance was operationalized (Keeping and Levy 2000; Lawler et al. 1984; Lee 1985). Now it has grown into a comprehensive evaluative approach to managing the PA system (Chiang and Birtch 2010). This approach uses certain 'measurement' and 'outcome' criteria and assesses the antecedent-outcome relationships that manifest EPA.
During the last three decades, PA literature has revealed a range of subordinate measurement and outcome criteria, albeit piecemeal. While developing the concept of EPA, Jacobs et al. (1980) proposed a system that established three categories of measurement criteria, i.e. utilization (refers to purposefulness), qualitative (fairness/justice) and quantitative criteria (PA accuracy). According to researchers (e.g. Chiang and Birtch 2010; Dewettinck and Dijk 2013; Hedge and Teachout 2000; Kudisch et al. 2006; Linna et al. 2012; Roch 2006; Wood and Marshall 2008), PA purposefulness addresses the question of why performance appraisals are conducted. Hence, it deals with the purposes and uses of PA, whereas PA fairness relates to a set of rules and practices that ensure justice in the PA system, and PA accuracy refers to elimination of rating errors.
In addition, researchers maintain that PA is considered effective when its key stakeholders (i.e. ratees) reckon it to be useful (Giles and Mossholder 1990; Keeping and Levy 2000; Levy and Williams 2004; Roberson and Stewart 2006; Walsh and Fisher 2005); that is, effectiveness is judged through ratee reactions. Pichler (2012, p. 710) defines them as 'individual-level attitudinal evaluations of and responses to the performance appraisal process'. In the light of this definition, this paper focuses on ratee reactions-based EPA outcomes, and thus uses Greenberg's (1990) taxonomy, which categorizes ratee reactions into two groups, i.e. person-referenced outcomes (ratee satisfaction with reward, the rater, rating system, ratings and feedback) and organization-referenced outcomes (organizational commitment, self-evaluation, feedback-seeking behaviour, role-clarity and perceived detriments to EPA).
Although organizations have adopted one set of measurement criteria or another, they seem to be discontented with their choices. Their complaint is that most PAs are ineffective, as they cause decreased employee performance (Latham et al. 2005) and increased employee dissatisfaction (Shrivastava and Purang 2011). This indicates that, by and large, PAs fail to contribute to human resource (HR) functions (Chiang and Birtch 2010) and organizational effectiveness (Taylor et al. 1995). Thus, responding to calls in the literature to propose a theoretically sound and coherent view of measurement criteria that may lead to desirable ratee reactions (e.g. Cardy and Dobbins 1994; Dipboye 1985; Fletcher 1995, 2001; Griffeth and Bedeian 1989; Haines and St-Onge 2012; Murphy and Cleveland 1995; Roch et al. 2007; Woehr and Huffcutt 1994), this paper aims to make two contributions to the field of PA. First, the paper identifies relationships between measurement criteria of EPA and ratee reactions. Ratee reactions are considered the most important PA outcome (Pichler 2012). Thus, this paper attempts to provide a ratee reactions-based view of EPA. Second, it proposes an integrated framework of EPA by two mechanisms: first, by suggesting integration between all the measurement criteria and ratee reactions, and, secondly, by discussing the integration among the measurement criteria.

Method
Given the dispersed nature of the EPA literature, we adopted a structured review (Tranfield et al. 2003), using three decisive factors for search and selection of published literature, i.e. quality, relevance and recentness (see Figures 1, 2 and 3 for details). Rather than searching through databases (e.g. Claus and Briscoe 2009; de Menezes and Kelliher 2011), we targeted quality journals listed in the academic journal quality guide of Association of Business Schools (ABS) and Social Science Citation Index (SSCI). However, a few articles published in two- and one-grade journals were also included in the sample, and these articles were reviewed while carrying out the initial literature survey. On the homepage of each journal, the advanced search options were used to elicit relevant results. As a first step, main search terms of 'performance appraisal', 'performance rating' and 'performance evaluation' were applied. Afterwards, for searching within the results, major search terms were used. For example, for PA purposefulness, the search terms were 'purpose', 'administrative', 'developmental', 'strategic' and 'role-definition'; for PA fairness, search terms of 'justice', 'fair', 'distributive', 'procedural', 'interactional', 'interpersonal' and 'informational' were applied; and for PA accuracy, the search terms were 'accuracy', 'bias' and 'error'. Search terms for employee reactions were 'reward', 'organizational commitment', 'feedback', 'self-monitor', 'self-appraisal', 'self-evaluation' and 'satisfaction'. All search terms were applied to the full text using the truncation symbol (*).
The process produced 549 articles, which were skim read (rapid scanning of the entire article) to select the most relevant ones (Thomas 2004). Concentrating more on the concepts of PA relating to the theme of our study, i.e. EPA in general, purposefulness, fairness, accuracy and ratee reactions, we selected the most relevant articles (see Figure 2). A total of 127 articles, published in 37 journals falling under four subject categories, i.e. general management, HR management, psychology and organization studies, met the criteria. The selected journal articles include 104 empirical studies, 20 review papers, two triangulation studies and one conceptual paper. With regard to periodization, we focused more on studies published from 2000 onwards. However, keeping in mind the inconsistent research attention being paid to each of the EPA criteria during this timeline, studies published before 2000 were also included. Thus, papers published from 2000 onwards make up 59% of our selection. It is notable that eight articles published in 2012-14 were included during the revise and resubmit process (see inset of Figure 2 for details).
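The truncated keyword screening described in this section can be illustrated with a minimal sketch. It assumes that a term followed by the truncation symbol (*) matches any word beginning with that term; how each journal's search engine actually handled truncation is not reported, so the regular expression below is an assumption, and the function names are hypothetical. The search terms themselves are those listed above.

```python
import re

# Search terms quoted in the Method section, grouped by EPA criterion.
SEARCH_TERMS = {
    "purposefulness": ["purpose", "administrative", "developmental",
                       "strategic", "role-definition"],
    "fairness": ["justice", "fair", "distributive", "procedural",
                 "interactional", "interpersonal", "informational"],
    "accuracy": ["accuracy", "bias", "error"],
    "reactions": ["reward", "organizational commitment", "feedback",
                  "self-monitor", "self-appraisal", "self-evaluation",
                  "satisfaction"],
}


def truncation_pattern(term):
    """Compile 'term*' as: the term at a word start, then any word characters.

    Assumption: this mimics a trailing truncation symbol; real journal
    search engines may behave differently.
    """
    return re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)


def criteria_matched(text):
    """Return, per criterion, the sorted search terms found in the text."""
    hits = {}
    for criterion, terms in SEARCH_TERMS.items():
        found = sorted(t for t in terms if truncation_pattern(t).search(text))
        if found:
            hits[criterion] = found
    return hits


if __name__ == "__main__":
    sample = "Procedural fairness and rating errors shaped satisfaction."
    print(criteria_matched(sample))
```

Anchoring each pattern at a word boundary means that 'fair*' matches 'fairness' but not 'unfair'; a screening step that also wanted negated forms would need additional terms.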

Integration among the EPA criteria
The proposed ratee reactions-based integrated framework of EPA is presented in Figure 4. In this section, the integrated framework is discussed in four parts. The first three parts discuss the relationships between the measurement criteria and ratee reactions. The fourth part discusses the correlates among the measurement criteria.
Building on research that highlights ratees' perceptions as the most important criteria for determining the effectiveness of PA systems (e.g. Keeping and Levy 2000; Levy and Williams 2004; Pichler 2012; Roberson and Stewart 2006; Roch et al. 2007), the centre of our analysis is ratee reactions. This review provides both a theoretical rationale and sufficient empirical evidence that measurement criteria lead to ratee reactions. In other words, purposeful and fair PA practices result in positive person- and organization-referenced ratee reactions (e.g. ratee satisfaction and organizational commitment), whereas rating errors/biases cause negative outcomes (i.e. detriments to EPA), which manifest ratee dissatisfaction and low organizational commitment. As the focus of our proposed integrated framework is on ratee reactions, PA professionals will find it useful as a (felt) needs assessment approach to PA, i.e. one centred on employees'/ratees' needs. Using the needs assessment approach to PA, further research can advance our integrated framework of EPA for establishing a ratee reactions-based view of PA theory.

PA purposefulness and ratee reactions
Researchers have provided certain theoretical reflections on the purposes of PA. These theories have helped to lay a pathway for the purposes of PA to be used as EPA criteria. However, attention paid to their empirical examination has been patchy. During the last three decades, most of the empirical research has been confined only to administrative and developmental purposes (e.g. Dorfman et al. 1986; Farh et al. 1991; Selvarajan and Cloninger 2011; Varma et al. 2008; Zimmerman et al. 2008). As a result, very little research has discussed the role-definition and strategic purposes of PA (e.g. Youngcourt et al. 2007 for the former; Noe et al. 2003 for the latter). Cleveland et al. (1989) inventoried and then categorized 20 purposes of PA into four a priori defined factors. All purposes in the first factor, i.e. 'between individuals', have been regarded as administrative purposes in the PA literature. These included: salary administration; promotion; retention or termination; recognition of individual performance; layoffs; and identification of poor performance. The second factor, i.e. 'within individuals', focuses on the developmental purposes (Tziner et al. 2000, 2001). These were: identification of individuals' training needs; performance feedback; determination of transfers and assignments; and identification of individuals' strengths and weaknesses. Some uses under the remaining factors (i.e. 'system maintenance' and 'documentation') relate to the strategic and role-definition purposes. These include: 'evaluate goal achievement' and 'assist in goal identification' for the former; and 'reinforce authority structure' for the latter. Using a self-completion questionnaire survey in 74 Jordanian organizations (36 public and 38 private), Abu-Doleh and Weir (2007) partially replicated the study by Cleveland et al. (1989). Their sample of private organizations substantiated Cleveland et al.'s findings more than the sample of public organizations, i.e. PA systems in private organizations had a significantly greater impact than PA systems in public-sector organizations on promotion, retention/termination, lay-offs, identifying individual training needs, transfers and assignments.
Administrative purposes of PA. The relationship between administrative purposes of PA and ratee reactions has gained the support of expectancy and equity theories. Expectancy theory explains that, in order to raise the employees' interests in the organizational setting, they should be rewarded corresponding to their performance. This is because ratees expect that the higher the performance, the greater the reward (Harder 1992; Kudisch et al. 2006). Moreover, if the amount of reward corresponds to the level of ratee performance, they may perceive equity to be achieved (Chiang and Birtch 2010). If it is otherwise, the ratees perceive that they are under-rewarded; hence, they might decrease their performance to balance out the equity in their own way (Harder 1992). Supporting the above theoretical rationale, Chiang and Birtch (2010) argue that administrative purposes and financial needs of employees have always been current and short-term oriented. Hence, a strong link between performance results and reward may exist (Bititci et al. 2012). Chiang and Birtch's (2010) Hong Kong and Singapore samples empirically supported this. Similarly, another study with a cultural perspective (based on Latin America and Taiwan samples) has confirmed a similar relationship (see Milliman et al. 2002). Analysis of such studies confirms that, the more the rewards are tied to PA results, the more effective the PA will be perceived to be (Lawler 2003).
Administrative purposes also relate to ratee satisfaction (with the rating system and the rater) and commitment. In their cross-sectional study (n = 599 employees), Youngcourt et al. (2007) report significant correlations between administrative PA and satisfaction with the rating system (r = 0.43, p < 0.01) and affective commitment (r = 0.36, p < 0.01). Using structural equation modelling, these researchers also found administrative purposes to have an effect on the ratee reactions (β = 0.53 and 0.03, p < 0.05, respectively). A longitudinal experimental study by Boswell and Boudreau (2002) revealed similar findings. These scholars divided the sample (n = 116 employees) into the treatment group (rated for administrative purposes) and the control group (rated for both administrative and developmental purposes) and found significant correlations between PA ratings about ratees in both the treatment and the control groups, and their satisfaction with the rating system (r = 0.38 and 0.29, p < 0.01, respectively). Similarly, an earlier longitudinal study (n = 242 dyads) by Dorfman et al. (1986) found administrative purposes of PA to have a significant effect on ratee satisfaction with the rating system and the rater (β = 0.22, p < 0.05) as one factor. Thus, the PA used for administrative purposes may have a significant positive relationship with the ratees' satisfaction with reward, the rating system and the rater, and with organizational commitment.
Developmental purposes of PA. Employee development is said to be among the primary purposes of PA (Cleveland et al. 1989; Nurse 2005). While identifying the desired emphasis on developmental purposes of PA, Milliman et al. (2002) found that a high priority was reported by samples in the American continent, Australia and Taiwan. However, the emphasis was moderate in some Asian countries. Chiang and Birtch (2010) carried out their study in seven countries (Canada, Hong Kong, Finland, Singapore, Sweden, the UK and the US) and found a strong consensus across the sample that PA was being used for employee development, albeit to varying degrees.
Social exchange theory explains that, when individuals feel that the organization is keen for their long-term development, they try to reciprocate (Chiang and Birtch 2010; Kuvaas 2006; Youngcourt et al. 2007). The most likely return on long-term development is employee organizational commitment (Tziner et al. 2001). As assumed by the social exchange theory, employees may feel motivated to maximize their outcomes (Roberson and Stewart 2006) and demonstrate positive attitudes (Kudisch et al. 2006). Substantiating this theory, a review by Beer (1981) and the following empirical studies suggest that developmental PA may lead to ratee commitment and satisfaction (with the rating system and the performance feedback).
Using a heterogeneous sample from three different countries (the US, Canada and Israel), Tziner et al. (2001) estimated inter-correlations among administrative and developmental purposes and affective commitment. They found developmental purposes to have a higher degree of corrected correlation (r = 0.38, p < 0.05) with affective commitment than administrative purposes did (r = 0.32, p < 0.05). Youngcourt et al. (2007) found developmental purposes to have significant correlations with satisfaction with the rating system (r = 0.43, p < 0.01) and affective commitment (r = 0.37, p < 0.01). They also found developmental purposes to have predicted affective commitment (β = 0.49, p < 0.05). In a longitudinal study by Tharenou (1995), 172 employees of the Australian Federal Agency (108 appraised and 64 non-appraised) were surveyed, both before and after the introduction of developmental PA. With respect to ratee satisfaction with the feedback, an increase in the post-test scores was found. This increase is accounted for by the developmental PA.
Some literature prefers administrative purposes to developmental purposes and vice versa. For example, a meta-analysis of 22 studies (Jawahar and Williams 1997) reveals that administrative purposes have been the focus of research more than developmental purposes have. In contrast, a survey of 276 students (Hong Kong, 141; UK, 135) by Snape et al. (1998) reveals that the Hong Kong sample appreciates administrative purposes more and developmental purposes less than the UK sample does. These contrasting findings suggest that the relative importance of administrative and developmental purposes merits assessment, particularly when predicting the common response variables, i.e. organizational commitment and satisfaction with the rating system.
Strategic purposes of PA. Goal-setting theory regards behaviours as goal directed. Using the goal-setting lens, van Dierendonck et al. (2007) maintain that ratees use performance ratings about themselves for self-monitoring. This is for assessing whether their performance is consistent with their goals or otherwise. However, before letting this desirable state occur, organizations solicit functional relationships between the organizational goals and the goals of their employees (Aguinis 2009). This is because organizations want ratees to self-monitor so that they pursue only those goals that are linked to organizational goals. This is why London et al. (2004) consider that 'setting goals' is better than 'assigning goals'.
Several researchers have suggested the relationship between PA ratings for strategic purposes and self-monitoring (see e.g. Jawahar 2001, 2005; Miller and Cardy 2000), and, thus, the latter is regarded as an integral component of the PA system (Campbell and Lee 1988). In addition, Renn and Fedor (2001) identified that performance feedback-related research has focused largely on identifying antecedents of feedback-seeking behaviour and that goal orientation is one of them. Therefore, it is expected that the strategic PA may rouse ratees to self-monitor and seek performance feedback.
Role-definition purposes of PA. Role-definition purposes of PA remain the least explored ones. This paper found only one empirical study (Youngcourt et al. 2007) that even partially drew attention to this area. According to Duarte et al. (1994), roots of role-definition purposes can be found in dyad formation. In fact, the role of an employee in the workplace changes over time; therefore, based on PA results, the supervisor defines and communicates roles to the subordinate. However, ideally, the process is completed only when the subordinate seeks feedback on their performance-position gaps, and this is the ratee reaction that organizations desire and that researchers call for investigating (Levy and Williams 2004). Youngcourt et al. (2007) reported significant correlations between role-definition purposes and ratee satisfaction with the rating system (r = 0.49, p < 0.01) and affective commitment (r = 0.40, p < 0.01 and β = 0.03, p < 0.05). Although the existing literature provides little support for the above-mentioned relationships (see Dahling et al. 2012), it gives a lead to associating role-definition PA with feedback-seeking behaviour, organizational commitment and satisfaction with the rating system.
Ratee reactions are an outcome of PA purposes that is critical for the long-term EPA (Mount 1984). However, the literature highlights that PA researchers maintain two different opinions about relationships between PA purposes and ratee reactions. One suggests that each specific PA purpose may predict a unique outcome. The other suggests simultaneous effects of a combination of PA purposes on some outcomes. In support of the former theory, Beer (1981) suggested uncoupling administrative and developmental purposes in order to improve the PA system. Providing empirical support for this, two studies (Stephan and Dorfman 1989; Zimmerman et al. 2008) suggested administrative and developmental purposes to be unique predictors of 'task performance' and 'organizational goal performance', respectively. The former was an experimental study (n = 72 students), and the latter was a longitudinal study (n = 396 employees).
Substantiating the latter theory, three empirical studies (Harris et al. 1995; Tziner et al. 2001, 2002) found significant correlations between administrative and developmental purposes (r = 0.58, 0.72, and 0.16, p < 0.05, respectively). Providing stronger evidence, Youngcourt et al. (2007) reported that correlations among administrative, developmental, and role-definition purposes were r ≥ 0.60, at p < 0.01. These results help infer that, if a category of PA purposes is not included in the research model of an empirical study undertaking PA purposefulness as a predictor, it may affect the framework as a nuisance variable.

PA fairness and ratee reactions
Performance appraisal fairness addresses the justice perceptions of ratees (Giles et al. 1997). Generally, fairness is derived from equity theory, which refers to perceived outcome-related fairness (Dusterhoff et al. 2014; McDowall and Fletcher 2004). However, in this case it is based on organizational justice theory. Under the tenets of this theory, forms of justice are categorized as one-, two-, three- and four-factor models. In the one-factor model, major forms of justice, i.e. distributive and procedural, are measured through one scale, and are highly correlated with each other (Sweeney and McFarlin 1997; Welbourne et al. 1995). Greenberg's (1986) empirical investigation laid the foundation for the two-factor model. In his exploratory study (n = 217 employees), Greenberg showed that distributive justice and procedural justice were two distinct dimensions. Although the two-factor conceptualization incorporated distributive and procedural justice in one model, these were treated differently (Greenberg 1990).
The three-factor model was developed to address the inclusion of interactional fairness in the justice literature (e.g. Barling and Phillips 1993; Bies and Shapiro 1987; Martocchio and Judge 1995; Skarlicki and Folger 1997). In the early 2000s, the four-factor model was conceptualized, and it provided a clearer expression of all forms of justice by categorizing interactional justice into two factors, i.e. interpersonal and informational justice. While propounding the dimensionality of the four-factor model, Colquitt (2001) demonstrated its construct and predictive validities adequately. Since then, this conceptualization has been used in most empirical research (e.g. Colquitt and Rodell 2011; Jawahar 2007; Jepsen and Rodwell 2009; Kass 2008; McDowall and Fletcher 2004). However, without assessing the 'fair process effect' (Folger et al. 1979), i.e. the outcomes of fairness/justice, it cannot be said that justice is done. Thus, the positive relationship between the four-factor justice and person- and organization-referenced ratee reactions indicates PA fairness.
Distributive fairness. Initially, distributive justice dealt with the fairness of decision outcomes (Colquitt 2001) and distribution of outcomes, e.g. reward (Jawahar 2007). Under the umbrella of the two-factor model (McFarlin and Sweeney 1992; Sweeney and McFarlin 1993), it was proposed to be related to only person-referenced outcomes, e.g. job satisfaction. However, recent research has included the evaluation of the outcomes-related fairness in its scope. This was done to embed norms of distribution, such as equity or equality (Colquitt 2001). This expanded view of distributive justice justified its measurement as a separate factor. The following empirical investigations support the relationships between distributive justice and person-referenced as well as organization-referenced ratee reactions.
Drawing on person-referenced outcomes, four empirical studies (Foley et al. 2005; Jepsen and Rodwell 2009; McFarlin and Sweeney 1992; Sweeney and McFarlin 1997) found distributive justice to have a positive effect on ratee job satisfaction, albeit to varying degrees, i.e. β = 0.11, 0.30, 0.18 and 0.23, p < 0.01, respectively. It is noteworthy that Jepsen and Rodwell (2009) reported the β coefficient only for their male sample (n = 265), as it was insignificant for their female sample (n = 113). Alongside the distal variable of job satisfaction, distributive justice relates to certain proximal variables as well, e.g. ratee satisfaction with ratings, rating system, the rater, the performance feedback and reward. Holbrook (1999) suggested a significant correlation between distributive justice and ratee satisfaction with ratings (r = 0.72, p < 0.01). Later, Colquitt (2001) and Jawahar (2007) examined this relationship in artificial and actual respondents, i.e. n = 301 students and n = 163 employees, respectively, and found distributive justice to have a significant effect on ratee satisfaction with ratings (β = 0.73 and 0.83, p < 0.05, respectively). Ratee satisfaction with the rating system is the second proximal variable, which two empirical studies (Elicker et al. 2006; Korsgaard and Roberson 1995) found to be associated with distributive justice (r = 0.75, p < 0.05 and r = 0.79, p < 0.001, respectively). Ratee satisfaction with the rater and the performance feedback have been found to be influenced by distributive justice, e.g. McFarlin and Sweeney (1992) and Sweeney and McFarlin (1997) (β = 0.15 and 0.37, p < 0.01, respectively) for the former, and Jawahar (2007) (β = 0.33, p < 0.05) for the latter. McFarlin and Sweeney (1992) and Colquitt (2001) also found that distributive justice explains variance in rewards (β = 0.52, p < 0.01 and β = 0.36, p < 0.05, respectively).
Drawing on the organization-referenced outcomes, out of six empirical studies supporting the relationship between distributive justice and organizational commitment, two have reported correlations between them, and four have suggested that the former may predict the latter. Conducting a scenario-based experiment on 240 students, Holbrook (1999) reported a positive correlation between the two constructs (r = 0.73, p < 0.01). Similarly, the correlation matrix generated from 92 matched manager-employee dyads in another study (Heslin and VandeWalle 2011) revealed a significant association between distributive justice and organizational commitment. However, while teasing apart the dimensions of organizational commitment, they reported the coefficients as r = 0.41, p < 0.01 for affective commitment, and r = 0.33, p < 0.01 for normative commitment.
With regard to the predictive relationship, in a survey of 877 Protestant clergy in Hong Kong, Foley et al. (2005) reported a positive effect of distributive justice on organizational commitment (β = 0.19, p < 0.01). McFarlin and Sweeney also supported this relationship, but in one study they reported a greater effect (β = 0.52, p < 0.01) and, in the other, a smaller, yet more significant, effect (β = 0.14, p < 0.001) (see McFarlin and Sweeney (1992) and Sweeney and McFarlin (1997), respectively). Such a variation could be accounted for by change of environment and the sample size. The former analysis was carried out with a sample of bank employees (n = 675), while the latter was undertaken with a survey of civilian employees of the US federal government (n = 12,670). In another survey of 378 employees (265 male and 113 female), Jepsen and Rodwell (2009) found organizational commitment of male employees to be influenced by their perception of distributive justice (β = 0.27, p < 0.01). Their results for females were insignificant. It is notable that the female sample was comparatively small. Moreover, their sample was composed, overall, of occupationally diverse employees, which could have made it even more vulnerable to weak statistical power.
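The role that sample size plays in these comparisons can be illustrated with a short sketch. It uses the standard Fisher z-transformation to approximate a 95% confidence interval for a simple correlation. This is an illustration only: the coefficients reported above are regression weights (β), not simple correlations, and the value of 0.20 used here is hypothetical rather than taken from any cited study. Only the sample sizes come from the studies discussed.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r from n cases.

    Uses the Fisher z-transformation, whose standard error is 1/sqrt(n - 3).
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical correlation, chosen only for illustration.
HYPOTHETICAL_R = 0.20

# Sample sizes as reported for the studies discussed in the text.
SAMPLES = [
    ("female subsample", 113),
    ("male subsample", 265),
    ("bank employees", 675),
    ("federal employees", 12670),
]

for label, n in SAMPLES:
    low, high = fisher_ci(HYPOTHETICAL_R, n)
    print(f"{label:18s} n={n:6d}  CI=({low:.3f}, {high:.3f})  "
          f"width={high - low:.3f}")
```

For a fixed correlation, the interval narrows as n grows, which is consistent with the point that a small coefficient from a very large sample can reach a stricter significance level than a larger coefficient from a smaller sample, and that the comparatively small female subsample is more vulnerable to weak statistical power.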
Procedural fairness. The construct of procedural justice has been developed through various stages. Initially, it highlighted the significance of procedures, facilitating decision-making on outcomes and distribution of resources to perceived fairness. Later, structural aspects of procedures were also included in its perimeter, e.g. giving weight to stakeholders' voices and letting them contribute to decision-making, demonstrating accuracy and practising ethics (Greenberg 1986; Holbrook 1999; Leventhal 1980; Leventhal et al. 1980). In the early 1990s, procedural justice was proposed to be used as a separate factor. Therefore, it was constructed and measured differently from distributive justice (McFarlin and Sweeney 1992; Sweeney and McFarlin 1993). These scholars maintained that it was related to evaluation of organization-referenced outcomes, e.g. organizational commitment. However, the present review has come across an interesting expansion in the literature that reveals procedural justice to have association with person-referenced ratee reactions as well (e.g. job satisfaction).
Being a distal variable, job satisfaction has been reported to be influenced by procedural justice (see Cropanzano et al. 2002; Foley et al. 2005; McFarlin and Sweeney 1992; Sweeney and McFarlin 1997). The PA literature also suggests a positive association between procedural fairness and certain proximal variables of ratee satisfaction, i.e. satisfaction with ratings, the rating system, the rater and performance feedback. For example, a field experiment (n = 111 dyads) by Taylor et al. (1995) suggested procedural fairness to have significant correlation with ratee satisfaction with ratings and the rating system (r = 0.66 and 0.52, p < 0.01, respectively). Elicker et al.'s (2006) study revealed a greater correlation coefficient for the latter (r = 0.78, p < 0.001). In addition, a recent survey of 203 full-time Mexican employees (Selvarajan and Cloninger 2011) suggested that procedural justice led to satisfaction with the rating system (β = 0.27, p < 0.01); however, the effect size was smaller than that reported by Jawahar (2007), i.e. β = 0.65, p < 0.05.
Procedural justice has been considered more as organization-referenced; thus, its relationship with organizational commitment has been suggested in both non-contrived and contrived environments (e.g. Brockner et al. (2003) for the former; Holbrook (1999) for the latter). The correlation coefficients reported in these studies are r = 0.74, p < 0.001 and r = 0.62, p < 0.01, respectively. Heslin and VandeWalle (2011) substantiated these results; however, they teased apart organizational commitment into affective commitment and normative commitment (r = 0.43 and 0.39, p < 0.01, respectively). The literature also suggests that procedural justice predicts organizational commitment (e.g. Colquitt 2001; Foley et al. 2005; McFarlin and Sweeney 1992; Sweeney and McFarlin 1997).
Interactional fairness (interpersonal and informational). Initially, interpersonal treatment came under the heading of procedural justice. However, later, it was constructed as a separate dimension (Kass 2008). As a result, by the addition of this newly dubbed form of justice, i.e. interactional justice, the three-factor model came into existence. In this regard, Kass (2008) sounded a strong contention that it was merely a facet of procedural justice. At that stage, an interesting debate began, and the literature agreed on the distinction between the two models (procedural and interactional). That distinction was based on 'target', where the target of procedural justice was considered to be the 'system', whereas that of interactional justice was believed to be the 'agent' (Cropanzano et al. 2002). Thereafter, the four-factor model was conceptualized, which maintained that interactional justice should not be deemed to be merely distinct from procedural justice, but it should also be teased apart into two components, i.e. interpersonal and informational.
Interpersonal justice refers to interpersonal treatment by the person with the authority to enact the procedures. Treating employees politely and with dignity and respect exemplifies the do's, whereas passing improper remarks and comments is regarded as a don't. The interpersonal treatment was further represented by the agent-system model (Bies and Moag 1986). Informational justice is considered to be done when the person with authority to enact the procedures communicates willingly, readily and candidly with the employees. Moreover, he or she makes sure that the practicability of the procedures is thoroughly explained in a timely manner (Colquitt 2001). Informational justice also facilitates the evaluation of structural aspects of the process (Jawahar 2007), which further helps ratees to maintain perceptions of fairness with regard to the agent (rater/supervisor). Drawing from the literature, interactional fairness can be mapped onto interpersonal and informational fairness when suggesting their associations with ratee reactions.
According to the agent-system model, interpersonal treatment of the agent (the rater/supervisor) may lead to person-referenced (ratee satisfaction with the rater, the performance feedback and the rating system) and organization-referenced outcomes (organizational commitment). For example, Colquitt (2001) and Jawahar (2007) suggested that interpersonal and informational fairness may relate to satisfaction with the rater (β = 0.23 and 0.50, p < 0.05, respectively). Jawahar (2007) also suggested that informational justice has an effect on satisfaction with the performance feedback (β = 0.61, p < 0.05). Moreover, results of three surveys suggest that interactional fairness may relate to ratee satisfaction with the rating system. For example, Elicker et al. (2006) reported a significant correlation between these two constructs (r = 0.63, p < 0.001), while Selvarajan and Cloninger (2011) and Cropanzano et al. (2002) reported that interactional justice predicted ratee satisfaction with the rating system (β = 0.22 and 0.77, p < 0.01, respectively). In addition, Jepsen and Rodwell (2009) suggested that informational justice may lead to job satisfaction (males: β = 0.32, p < 0.01 and females: β = 0.43, p < 0.01), whereas interpersonal justice may predict organizational commitment (females: β = 0.32, p < 0.01). The latter was also supported by Barling and Phillips (1993).

PA accuracy and ratee reactions
Performance appraisal accuracy refers to accurate and reliable performance ratings; hence, it aims to alleviate rating errors/biases (Jacobs et al. 1980). Being on the frontier of a PA system, raters are usually held responsible for rating errors, but in fact there are certain other factors that may cause biases. The argument presented by Curtis et al. (2005) seems logical: that there are some errors that a rater commits with a political agenda, but there are many for which ratees, the PA system and social factors (relations) should be held responsible. Thus, this review inventories and classifies the threats to accuracy into four groups, i.e. rater-centric, ratee-centric, relation-centric and system-centric rating errors, to understand their sources and effects.
Rater-centric rating errors. Demographic aspects are a major influence on raters. Age bias occurs when raters are influenced by an elder ratee or become sympathetic with a younger one. They do this to safeguard the interests of such ratees. Supporting this, a study on 464 supervisor-subordinate dyads (Griffeth and Bedeian 1989) suggested that younger raters give significantly lower ratings than older raters. However, another study with similar design, i.e. supervisor-subordinate dyads (Shore and Bleicken 1991), shows that the age bias might not relate solely to older workers, but also to certain aspects of employee performance.
Gender bias takes place when raters distort true ratings to benefit the same gender or victimize the opposite gender. Either may dissatisfy the affected ratees (Arvey and Murphy 1998; Cook 1995; Reichel and Mehrez 1994). In their study with 60 supervisors generating performance ratings of 220 supervisees, Varma and Stroh (2001) found that, after controlling for performance, both male and female supervisors had inflated ratings about ratees of the same gender. However, two scenario-based studies revealed diverse findings. Using a sample of 292 students, Hall and Hall (1976) found no significant effect of gender on ratings. Conversely, Lee et al. (2009), with a male sample (n = 92), found a significant impact of gender on ratings. The artificial nature of these scenario-based settings may be the major contributor to this contradiction. It is notable that, in another study, gender was found to have an interaction effect with age (Griffeth and Bedeian 1989).
Leniency (or strictness) is considered to be the backbone of most rating biases. Mainly, owing to raters' own mindset, they set a tendency of leniency/strictness bias. This tendency compels them to use those categories on the rating scale that represent a lenient/strict rating (Bernardin et al. 2009; Murphy and Cleveland 1995; Noe et al. 2003). The tendency of being lenient or strict can be based on many other biases. For example, ratings can be based on the previous performance of the ratee. Hence, the past performance error makes a rater lenient or strict while rating the current performance of the ratee (London et al. 2004). Practitioners term it critical incident error. It occurs when raters rely only on some incidents during the appraisal period and disregard the rest. Similarly, raters' selectiveness about observations is found in the recency effect. This occurs when raters' ratings are based on the recent good or poor performance of ratees (London et al. 2004).
Raters may escalate their performance ratings while being influenced by the ratees' physical attractiveness (attractiveness effect) (Reichel and Mehrez 1994) or future potential (high-potential error). Usually, this happens when raters prefer subjective rating (trait-based) to objective rating (task-based) (Murray 1981). Similarly, raters' personal (dis)likes may lead to interpersonal affect, which brings out inaccurate ratings (Cook 1995). It occurs when the raters rate the liked ratees by recalling their positive work behaviours and vice versa (Arvey and Murphy 1998;Cardy and Dobbins 1994;Lefkowitz 2000;Sutton et al. 2013;Varma et al. 2005;Wayne and Liden 1995). Empirical studies with varying designs have suggested the effect of the interpersonal affect on ratings. In an experimental study (n = 66 students), Cardy and Dobbins (1986) investigated the effect of interpersonal affect. They found that raters' ratings were less accurate when scores on their liking had variations than when liking was constant. Confirming this for multisource feedback, a survey elicited 163 downward, 103 upward and 1027 peer ratings from 433 employees of an insurance company (Antonioni and Park 2001). These results reveal an influence of interpersonal affect in all three sources of the feedback (i.e. downward, upward, and peer).
With regard to culture, Asian raters are considered more prone to interpersonal affect than Western ones. Varma et al. (2005) carried out a cross-cultural study with two samples (the US, n = 190; and India, n = 113) and reported that interpersonal affect had a significant effect on performance ratings in India, as raters inflated the ratings of low performers. In contrast, the US raters could separate their liking for a ratee from actual performance, revealing no interpersonal affect. The results of the US sample are somewhat astonishing, as another study (Varma and Stroh 2001) in the US context reported a high correlation between interpersonal affect and performance ratings (r = 0.78, p < 0.01). However, results based on the Indian sample are substantiated by another field study in Asia (n = 172 military officers in Singapore), i.e. raters' interpersonal affect predicts leniency (β = 0.40, p < 0.01) (Ng et al. 2011). Emotional rating error is another threat to accuracy that resides beside the interpersonal affect. This occurs when raters, being emotionally attached (or detached) to ratees, use a positive (or negative) lens to see everything about them (London et al. 2004). Sometimes, these feelings of affection/hatred can be of a personal nature. Recently, in an empirical investigation, Bento et al. (2012) identified an interesting finding about stigma bias. In their study, they investigated raters' perceptions about ratees' obesity and suggested that such perceptions may influence ratings.
For some social reasons, raters may demonstrate avoidance of negative feedback (Hogan 1987). Using ratings from 667 bank staff by their 101 supervisors, Wilson (2010) reported raters' tendency to make positive comments and reluctance to give negative feedback. Social desirability pressures on supervisors and/or fears of retaliation from subordinates were reported as possible reasons. Furthermore, raters may lose the motivation to rate judiciously when they realize that ratings will affect ratees' promotion, salary or any other benefit (London et al. 2004). Further, low motivation toward ratings may result in an escalation bias (inflated ratings) (Slaughter and Greguras 2008). The extant literature (e.g. Saffie-Robertson and Brutus 2014; Tziner et al. 2008) suggests that raters' discomfort with the rating system could be another reason behind inflated ratings. Similarity error or 'similar to me' effect is another behaviour-based threat to accuracy. This error is committed when raters perceive ratees to be similar to them, and thus give favourable ratings (London et al. 2004). This may happen the other way round when raters perceive ratees to be dissimilar.
The PA literature suggests two levels of (dis)similarity effect, i.e. deep level (behaviour-based) and surface (demographics-based) (Varma and Stroh 2001). This review includes two longitudinal studies with dyadic samples. The first (Tepper et al. 2011) investigated the deep-level (dis)similarity and suggested that rater perception of relationship conflict and ratee performance mediated the relationship between perceived deep-level dissimilarity and abusive supervision. The second (Wayne and Liden 1995) examined the surface similarity and suggested correlation between demographic similarity and supervisor's liking of the subordinate (r = 0.31, p < 0.01); the latter further related to supervisor's ratings of the subordinate's performance (r = 0.36, p < 0.001).
Like demographic variables (age, gender and education level), psychological variables (self-confidence, self-efficacy, cognitive abilities and anxiety) also cause variations in ratings about ratees (Landy and Farr 1980; Wood and Marshall 2008). Psychological variables have been noticed to set raters' expectations about ratees or the position they hold. There are certain instances wherein raters compare ratees' actual performance with prior expectations and, when they find a disconfirmation of expectations, they deflate ratings. Endorsing this, in a field study of 49 supervisor-subordinate dyads, Hogan (1987) reported that prior expectations of raters about the ratee interact with actual performance to affect ratings (β = 0.32, p < 0.05). The results of this study also revealed that relationships between prior expectations and performance ratings were more strongly correlated (r = 0.28, p < 0.05) than actual performance and performance ratings (r = 0.16, not significant).
Until recently, there were five personality traits (i.e. extroversion, agreeableness, conscientiousness, neuroticism and openness) that were deemed vital to variations in ratings. For example, in their empirical investigations, Tziner et al. (2002), with a heterogeneous sample of 253 managers in Israel, and Randall and Sharples (2012), in an experiment with 230 government employees, found conscientiousness and agreeableness, respectively, causing variations in ratings. In two more empirical studies using students as participants, Bernardin and colleagues investigated the effects of these two personality traits on ratings about ratees. In their experimental study (n = 111), Bernardin et al. (2000) found that agreeableness and conscientiousness scores were correlated with rating levels, though in different directions (r = 0.33 and −0.37, p < 0.05, respectively). These relationships were also confirmed by a further longitudinal laboratory study by Bernardin et al. (2009). This study (n = 126) reported that raters with high agreeableness and low conscientiousness made the most lenient and least accurate ratings. The extant literature has made an addition to personality traits and their effects on ratings. Using an online survey of direct support professionals (n = 269) and the actual ratings by their supervisors (n = 250), Johnson et al. (2011) explored and found honesty-humility as a sixth personality type that uniquely affected the actual ratings (β = 0.25, p < 0.05).
Raters' inability to rate may lead to logical error and proximity error. The former is the tendency of giving similar ratings for performance areas that seem logically related. The latter is the tendency to rate similarly those performance areas that are adjacent on the evaluation form (Jacobs et al. 1980). Therefore, cognitive psychologists have drawn more attention towards information processing and retrieval aspects. They maintain that raters' memory affects ratings (Woehr 1992). In an experiment with 70 students, Robbins and DeNisi (1993) found correlation between direct recall and ratings (r = 0.24, p < 0.05). Moreover, another experimental study in a laboratory setting (n = 456 professionals in a government agency) showed that participants' cognitive ability, practical intelligence, and job knowledge influence ratings about ratees (Pulakos et al. 1996). Wong and Kwong (2007) argue that raters' goals influence their ratings about ratees. They studied harmony, fairness and motivating goals. Their research was extended by Wang et al. (2010), who carried out two studies to analyse the effects of raters' goals on rating scores about low, medium and high performer ratees. The results of their study 1 (n = 103 students) revealed that raters inflated their peer ratings in pursuance of harmony, fairness and motivation goals. As regards non-peer ratings, their study 2 (n = 120 students) revealed that, on the one hand, raters deflated ratings about high performers to demonstrate fairness, while, on the other hand, they inflated ratings about the low performer ratees to motivate them.
Ratee-centric rating errors. Raters cannot be held responsible on every occasion for errors; ratees also attempt to change raters' view. Ratees may use a family of three behaviours, i.e. impression management, ingratiation and undeserved reputation for the purpose. Wayne and Liden (1995) suggested that ratees' impression management behaviour may indirectly affect the performance ratings, i.e. through self-presentation and other-enhancement. Self-presentation becomes a bias when ratees present themselves by disproportionately magnifying positives or airbrushing negatives to earn inflated ratings. Other-enhancement is considered a bias when ratees 'butter up' raters to earn favourable ratings.
Ingratiation occurs when a ratee successfully manages to get undue favours from the rater. Ingratiation can be job-focused, supervisor-focused and self-focused. Job-focused ingratiation refers to claiming the credit for job-related achievements, regardless of whether the ratee has contributed to such an achievement. Sometimes ratees also attempt to emphasize their role in the team's accomplishments. Supervisor-focused ingratiation refers to seeking to obtain raters' gratification by extending them favours in personal as well as professional life. The self-focused category of ingratiation reveals ratees' efforts to present themselves before raters as friendly, polite and sincere. Ratees do this in order to create a soft corner in raters' heart (Cook 1995). Undeserved reputation bias appears when ratees manage to establish an undeserved reputation. This is done by developing networks within the organization, public relations, covering their back by not taking part in controversial issues, stealing credit for successes, high turnover to avoid facing appraisal at every organization, continuously expanding unit or department, reorganization, and getting the benefit of their absence in critical times (Cook 1995).
Relation-centric rating errors. The PA literature also reveals relation-centric threats to accuracy, which are committed by both raters and ratees. Ethnicity bias intensifies the circle of relationships. This refers to intervention of racial discrimination instead of actual performance of ratees (Cook 1995;Hall and Hall 1976). Past literature has established that racial differences in PA have been found persistently (Arvey and Murphy 1998;Dewberry 2001). Using actual ratings of bank employees, Wilson (2010) found raters to be giving systematically lower ratings to black staff relative to white staff. The results of this study revealed many differences in the specific factors mentioned across ethnic groups. Similarly, in a longitudinal study (n = 3027 trainee lawyers in the UK), Dewberry (2001) reported evidence of racial discrimination by the assessors. He suggested that future research on ethnicity should focus on differences in the individual's life experiences since his or her childhood. Expanding the circle of influence further, raters may also commit cross-cultural biases, which occur because of the difference between cultural influences on raters and ratees (Bogardus 2004).
When it comes to dyadic quality and duration, empirical studies emphasizing leader-member exchange provide evidence of relation-centric biases. Duarte et al. (1994) used data from 261 dyads and six-month records of their telephone company to analyse the effect of dyadic quality on ratings. They found that, in both the short and the long run, in high-quality leader-member exchange relationships, employee performance was rated high. This was apart from objective ratings about them. The ratings of employees in low-quality leader-member exchange relationships in the short run were consistent with the objective ratings about them. However, these were high in the long run, apart from their objective ratings. They also found that correlations among leader-member exchange relationship quality, and task and relationship performance ratings were positively significant (r = 0.26 and 0.30, p < 0.001, respectively). Tepper et al. (2006) carried out two studies (n = 347) in which managers gave more favourable ratings about ratees with high leader-member exchange even for resistant ratees. However, ratings were higher for those ratees who resisted by negotiating than those who resisted by refusing. In another empirical study, Varma and Stroh (2001) found a positive correlation between dyadic relationship and ratings (r = 0.77, p < 0.01). Sometimes, the dyadic relationships are established for political motives (Dhiman and Maheshwari 2013). Therefore, a political culture in which the appraisal process operates may also aggravate in-group and out-group situations, resulting in favourable and unfavourable ratings, respectively (Wood and Marshall 2008). Usually, it happens when team performance is replaced with a political agenda. The political considerations start capitalizing on the PA system, and the rater becomes over-lenient or over-strict, to extend benefits or to victimize the ratee (Cook 1995).
Relatedness within and between ratees may also affect ratings, e.g. halo and horn effects and stereotyping. The halo error occurs when raters find a positive aspect of performance and then continue rating positively the remaining aspects of ratees' performance. Conversely, horn error leads raters to keep rating negatively if one aspect is found to be negative (Arvey and Murphy 1998; Bogardus 2004; Murphy and Cleveland 1995; Noe et al. 2003). In their experimental study (n = 170 students), Becker and Cardy (1986) found the halo effect on accuracy and even statistical control of its influence could not improve the rating validity. Jackson (1996) carried out two studies, one using 100 students, and in the other 323 trained interviewers rated eight video-taped interviewees in a laboratory setting. Both studies revealed that the maximum accuracy within a task was not necessarily at 'zero invalid halo'. Stereotyping is a tendency to generalize across groups and ignore individual differences (Bogardus 2004). It is more likely to happen when team performance is appraised.
System-centric rating errors. Findley et al. (2000) grouped certain PA aspects such as appraisal policies, procedures and support provided by the organization, and pronounced them appraisal system facets. Their survey (n = 199 school teachers) revealed that appraisal system facets explained significant incremental variance in perceived rating accuracy. This was more than that explained by the appraisal process facets (i.e. observation, feedback/voice and planning) (ΔR² = 0.04). This shows a significant impact of PA policies and procedures on rating errors. Substantiating this, Jawahar (2005) investigated the impact of system factors (also known as situational influences) on rating accuracy. His experimental study 1 (n = 186) and study 2 (n = 108 HR managers) revealed that some system factors (e.g. quality of equipment, availability of resources, difficulty of sales territory) are beyond the control of individual employees. Therefore, sometimes the PA system compels raters to be lenient in order to offset the anticipated effect of system factors on ratee performance. The results of these two studies indicated that both junior and senior raters altered ratings, depending on the situational conditions under which ratees worked. For example, some PA systems exempt certain employees from being evaluated. Using a large German sample (n = 7598), Grund and Sliwka (2009) found that the performance of older employees, women, and employees with very high or very low responsibilities was assessed less often.
Based on ratings generated by students from videotapes, two laboratory studies have suggested that rating format may cause system-centric errors. One of these was a cross-sectional study (n = 180), which revealed that behavioural anchors caused biased ratings, as raters focused only on those aspects of performance that were anchored in the scale, regardless of their representativeness of ratees' actual performance (Murphy and Constans 1987). The other was a longitudinal study (n = 57), which revealed that consistently average ratings were less accurate than descending and ascending ratings. It was also found that the overall ratings by the subjects were more accurate than an average of ratings made on each concluding exercise (Karl and Wexley 1989).
Available tools for descriptive analysis of PA results may also reveal errors such as central tendency and range restriction, and negative and positive skew. The former is a tendency to use rating scales representing average rating (Bogardus 2004;Grote 2002;Murphy and Cleveland 1995;Noe et al. 2003). The latter occurs when raters stick to extreme ratings on either side of the rating scale (Grote 2002). Apart from analysis, the system in which raters perform sometimes compels them to commit a contrast error. This is normally caused by holding a comparison between ratees instead of comparing their performance with the objective standards (Bogardus 2004;Latham et al. 2008;Noe et al. 2003). If such comparison is held within-individual, opportunities to come across the inappropriate substitutes for performance become evident. This error takes place when the organization sets an inadequate criterion to determine performance (De Cenzo and Robbins 1996) and, ultimately, raters rate hypothetically (global observations).
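The descriptive checks mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical list of one rater's scores on a bounded scale (1 to 5 here); the function name, the scale bounds and the sample data are illustrative only and are not drawn from any of the studies reviewed.

```python
from statistics import mean, pstdev


def describe_ratings(ratings, scale_min=1, scale_max=5):
    """Summarize one rater's scores (hypothetical data) on a bounded scale."""
    n = len(ratings)
    m = mean(ratings)
    sd = pstdev(ratings)  # population standard deviation
    # Population skewness: third central moment divided by sd cubed.
    # Defined as 0.0 here when all ratings are identical (sd == 0).
    if sd == 0:
        skew = 0.0
    else:
        skew = sum((x - m) ** 3 for x in ratings) / (n * sd ** 3)
    midpoint = (scale_min + scale_max) / 2
    return {
        "mean": m,
        "sd": sd,
        "skew": skew,
        "offset_from_midpoint": m - midpoint,
    }


# Illustrative inputs only: mostly high scores with one low score,
# and mostly low scores with one high score.
mostly_high = describe_ratings([5, 5, 5, 4, 5, 5, 2])
mostly_low = describe_ratings([1, 1, 1, 2, 1, 1, 4])
```

Read against the error types above, a small standard deviation with a mean near the scale midpoint would be consistent with central tendency, a small standard deviation anywhere on the scale with range restriction, and a clearly negative or positive skew with ratings piled up at the high or low end of the scale. Any cut-off for "small" or "clearly" would be a judgment call and is deliberately not fixed in this sketch.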
The existing literature presents a caution that political considerations sometimes seem to intermingle with inflationary pressures. Such pressure also coerces raters to think that merely high ratings are not sufficient for certain ratees' promotion; only the highest ratings are (De Cenzo and Robbins 1996). Therefore, purposes and uses of PA compel raters to give the desirable PA results, leading to biased ratings (Farh et al. 1991; Tziner et al. 2002). Organizations can avoid biases by holding raters accountable to the PA system, as accountability relates to rating accuracy (r = 0.34, p < 0.01) (Wood and Marshall 2008). This was confirmed by a scenario-based study (Curtis et al. 2005) in which 123 students rated ratees more leniently when they were accountable to the ratee rather than to the experimenter. However, participants rated ratees less leniently when they were accountable to both (the ratee and the experimenter) rather than to ratees only (downwardly accountable). In contrast, participants rated ratees more leniently when they were accountable to both (the ratee and the experimenter) rather than to the experimenter only (upwardly accountable).
In another experimental study (n = 197 students), Mero et al. (2007) found that participants rated more accurately when they knew that they were accountable to 'high-ups' than when they were either accountable to ratees or had no one to account to. This might be because participants pre-empted the self-criticism and relied on more complex judgment strategies when they were answerable to high-ups. Thus, their pre-emption-based complex information processing led them to more defensible ratings, which turned out to be more accurate.
Having discussed categories of rating errors in detail, we have brought this section to a stage where, according to the literature (e.g. Keeping and Levy 2000; Levy and Williams 2004; Roberson and Stewart 2006), it is suggested that rating errors limit EPA. Thus, the rater-centric, ratee-centric, relation-centric and system-centric threats to accuracy may lead to perceived detriments to EPA. However, the relative importance of each is likely to vary.

Relationships among measurement criteria
Merely accomplishing some PA purposes, demonstrating fairness with regard to selected aspects of justice, or neutralizing effects of certain rating biases is not sufficient to demonstrate EPA, unless these measurement criteria are integrated in order to strengthen the PA system. Therefore, this section aims to identify links between the measurement criteria of EPA.
PA purposefulness and PA fairness. The chances of unfairness are more likely to occur when PA is used for administrative purposes. This is because of its vital role in organizational decision-making, especially when the ultimate beneficiaries of these decisions are employees. Organizations consider results of administrative PA helpful in pursuing personal agendas and/or satisfying political motives: e.g. victimizing certain employees, or casting certain employees into the limelight to pave the way for their promotion. Since such decisions directly affect the outcomes (pay, promotion), the literature suggests that administrative PA is perceived to be more prone to unfairness (distributive) than PA used for other purposes. Developmental PA is considered to have at least a neutral effect, because it is likely to have a mild effect on outcome-related organizational decisions (Selvarajan and Cloninger 2011). Selvarajan and Cloninger (2011) further argue that employees' perceptions of distributive unfairness may prompt their perceptions about procedural unfairness, maintaining that procedures that reveal unfair outcomes must themselves be unfair. Once again, developmental PA may interact differently with procedural fairness (Jawahar 2007). Overall, this argument is in line with empirical findings. For example, an experiment (n = 195) by Bettenhausen and Fedor (1997) revealed that developmental PA resulted in more positive outcomes than administrative PA did. They also found that administrative PA resulted in more negative outcomes than developmental PA did. Thus, developmental PA may have a more positive relationship with perceived distributive and procedural fairness than administrative PA has.
PA purposefulness and PA accuracy. Empirical literature suggests that administrative and developmental PA may relate to rating accuracy (e.g. Tsai and Wang 2013). For example, a simulation-based laboratory study (n = 130) by Zedeck and Cascio (1982) revealed that administrative and developmental PA explained more variation in rating accuracy than other variables, e.g. rater training. In addition, some empirical studies lay the foundation for establishing relationships between administrative and developmental PA, and system- and rater-centric rating errors.
Based on an analysis of two data sets, one for developmental purposes (ratings of 193 raters) and the other for administrative purposes (ratings of 223 ratees), Harris et al. (1995) found that ratings for administrative purposes were more biased (lenient) than those for developmental purposes. Moreover, their results revealed administrative purposes to have a significant relationship with ratee seniority (r = 0.18, p < 0.05), but developmental ratings did not have a significant relationship (r = 0.00). This is supported by the results of a quasi-experiment (n = 65 students) by Farh et al. (1991), which revealed a propensity to contain greater halo and leniency when ratings were conducted for administrative purposes than when they were conducted for developmental purposes. Curtis et al. (2005) found that, in the administrative purpose condition, raters rated most leniently when they were only accountable to the ratees. Conversely, in the developmental purpose condition, raters rated least leniently when they were accountable to the experimenter. Most of the empirical investigations have revealed that administrative PA leans more towards rating errors than developmental PA does. Therefore, to neutralize this effect, Selvarajan and Cloninger (2011) concluded that both administrative and developmental PAs are perceived to be more accurate than administrative PA alone. Thus, when used simultaneously, administrative and developmental PAs may explain a positive variation in system-centric rating errors. However, on teasing apart PA purposes, administrative PA would be more likely to explain variation in system-centric rating errors than developmental PA would.
The PA literature maintains that certain PA purposes may cause rater-centric rating errors, e.g. Tziner et al. (2002) and Tziner et al. (2008) suggest that developmental PA may relate positively to rater's confidence in PA (r = 0.59 and 0.39, p < 0.05, respectively). However, Tziner et al. (2008) also suggested that administrative PA may relate inversely to raters' confidence in PA (r = −0.28, p < 0.05). These results indicate that administrative PA is more prone to rater-centric errors than developmental PA is. However, there is a need for caution. Based on only one aspect (i.e. rater's confidence), the possibility of rater-centric rating errors triggered by developmental PA cannot be eliminated. Therefore, it can be expected that both administrative and developmental PAs may explain variations in rater-centric rating errors. However, on teasing apart PA purposes, administrative PA may explain more variations in rater-centric rating errors than developmental PA can.
PA accuracy and PA fairness. Empirical literature suggests that ratees' perceived fairness may lead to perceived rating accuracy (e.g. Tsai and Wang 2013). Taylor et al. (1995) found ratees' perceived procedural fairness to be correlated with rating accuracy (r = 0.73, p < 0.01). Adding to this, a survey by Elicker et al. (2006) reported that distributive, procedural and interactional justice are positively correlated with perceived accuracy (r = 0.81, 0.80, and 0.65, p < 0.001, respectively). Skarlicki and Folger (1997) further confirmed this using different criteria. They found distributive, procedural and interactional justice to have a significant negative effect on ratees' organizational retaliation behaviour (β = −3.73, p < 0.001, β = −2.38, p < 0.01, β = −5.23, p < 0.001 respectively). These results indicate that, if ratees perceive unfairness, they may try to establish equity in their own way (e.g. showing retaliation, being counterproductive or manipulating ratings). Thus, the higher the perceived fairness is, the lower the ratee-centric biases will be, and vice versa.

Conclusions
This paper offers significant conclusions. These are discussed in two parts: 'general' (which deals with the research trends in the sub-field of EPA); and 'specific' (focusing on the ratee reactions-based integrated framework of EPA). We have monitored four aspects of research trends in EPA literature that can be helpful for upcoming empirical research in this body of knowledge.
First, empirical studies on EPA have used a variety of research designs such as cross-sectional and longitudinal, surveys and experiments or quasi-experiments. With regard to study setting, of the 104 empirical studies, 64% were carried out with real actors (e.g. employees) and 27% were in artificial settings (e.g. with students). Among the latter, most were scenario-based experimental studies with effective research designs. The remaining 9% of the studies used combinations of the above two (i.e. contrived and non-contrived).
Second, EPA literature lacks a holistic view. Therefore, a segment of literature considers PA a mere activity, instead of a system. Also, the effectiveness of this system is not discussed as such. This has resulted in patchy attention being paid to the EPA criteria. In the 1980s, PA accuracy outweighed other criteria. However, from the early 1990s, PA fairness started to attract the attention of the EPA researchers, and now its coverage in the literature is almost equal to that of PA accuracy. Thus far, PA purposefulness has managed less than a moderate appearance in the EPA research, especially during the last three decades.
Third, while the attention paid to the measurement criteria has been uneven, certain subordinate factors within each measurement criterion have also been ignored. For example, regarding PA purposefulness, the major focus has been on administrative purposes followed by developmental ones. A scarcity is also found with regard to strategic purposes and role-definition purposes. Similarly, with regard to PA accuracy, the emphasis has been on rater-centric errors, followed by system-centric ones, whereas ratee-centric errors have rarely been discussed. Moreover, this analysis has discussed over 40 factors as direct or indirect determinants of rating bias. Many of them have not so far been part of robust empirical investigations.
Lastly, there is a limitation of the PA literature in that it largely represents the US-oriented models, approaches and theories. Since performance management is a social phenomenon, Bititci et al. (2012) raise a valid question, i.e. 'do these theoretical rationales fit globally?' On the one hand, this question challenges the external validation of the existing evidence for diverse countries and cultures. On the other hand, this draws attention towards the fact that the PA body of knowledge has been deprived of indigenous wisdom from the perspective of geographical considerations. To the best of our understanding, cross-cultural studies can offset the deficiency in geographical representation, but only to a small extent. The PA literature needs to represent those countries and cultures that constitute more than two-thirds of the world's population, and also the emerging markets, because of their growing economic importance and the increasing interest of foreign investors in these markets. To start with, at least the Eastern researchers may be encouraged to replicate the models and theories propounded in the West and, where possible, develop their own context-specific approaches to PA. This would serve a twofold purpose. First, it would help manage representation of the developing part of the world. Second, it would help demonstrate the external validation of research models geographically and would also develop context-relevant models. We believe that the contradictory results would refine the existing theories or give birth to new ones.
In addition to the above-mentioned general conclusions, this paper also offers some specific conclusions. It has highlighted notable refinements and expansions about purposefulness, fairness and accuracy of PA systems.
• Performance appraisal purposefulness: the longstanding view of PA that has focused more on administrative and little on developmental purposes had restricted this practice to personnel, evaluation, accountability, judgement and development functions. The addition of strategic and role-definition purposes has added more theoretical anchors and widened the scope of EPA towards more HR functions, e.g. feedback and goal orientation. On the face of the current PA practice and research, the latter are rapidly gaining prominence, whereas the former are becoming secondary, with the exception of development function.
• Performance appraisal fairness: empirical literature has refined certain relationships by broadening the scope, e.g. under the two-factor model, distributive justice was thought to have affected only person-referenced outcomes. However, under the three- and four-factor models, organization-referenced outcomes were added as a criterion (e.g. Foley et al. 2005; Heslin and VandeWalle 2011; Jepsen and Rodwell 2009).
• Performance appraisal accuracy: traditionally, raters were held responsible for rating errors. However, this paper has mounted sufficient evidence to justify the categorization of 40 factors (errors/biases) into four groups, i.e. rater-centric, ratee-centric, relation-centric and system-centric errors. It is expected that this categorization may lead PA researchers and practitioners to put directed efforts into minimizing bias and increasing accuracy.
The first objective of this paper was to identify relationships between measurement criteria and their respective outcomes. This paper has provided empirical confirmations based on a priori theory or models that have suggested nomological networks for the above-mentioned relationships, which are all set for empirical testing.
The final objective was to seek an integrated framework of EPA. Although the PA literature contains sufficient support for developing a ratee reactions-based integrated framework of EPA, some cautions must be borne in mind before putting this into practice. First, an uneven use of PA purposes may lead to injustice, e.g. administrative PA is more prone to distributive and procedural injustice than developmental PA is. Second, an uneven use of PA purposes may also lead to rating errors, e.g. administrative PA may lead to system-centric and rater-centric rating errors more than developmental PA. Finally, any slackness in PA fairness can dismantle the PA accuracy, as justice dimensions of the four-factor model are inversely related to ratee-centric errors. Thus, integration among measurement criteria of EPA is simple yet complex.
Building on the analysis of relationships among measurement criteria, this paper reaches an interesting conclusion for both researchers and practitioners: PA systems often tend to pursue competing goals. Therefore, in addition to the theoretical perspective that considers the simultaneous application of measurement criteria useful for ratee reactions-based PA effectiveness, PA practitioners may also reckon that the trade-off among these criteria is valuable. However, this would mainly depend on the organizational culture, i.e. which criterion is considered more valuable than others. This necessitates further analysis of EPA from the perspective of organizational culture. We believe that the outcome of this analysis would provide a valuable avenue for researchers, fuelling more relevant and focused research on PA systems.
In addition, using the needs assessment approach to PA, further research can advance our integrated framework of EPA for establishing a ratee reactions-based view of PA theory. Performance appraisal systems have one agenda in common: they aim to improve employee performance, among other ways, by assessing ratee needs. Fulfilment of these needs manifests EPA in the form of satisfied and committed employees. In line with this, our paper has presented a systematic review of the relevant literature on the growing concept and the simultaneous use of measurement criteria of EPA. However, bearing in mind the apparent caution that exists regarding simultaneous use of measurement criteria of EPA, empirical evidence on 'what is' and 'what should be' in different sets of organizations is yet to be provided. This paper merely triggers the thinking process to be deployed for integrating the measurement criteria of EPA; future research needs to provide empirical evidence substantiating the integrated framework of EPA.
For example, we suggest that, on completion of a PA exercise, organizations may collect soft data (i.e. ratee perceptions about purposefulness, fairness, accuracy and their reactions), and analyse it using our proposed integrated framework. This will help them to identify the felt needs of their employees: negative ratee reactions, such as a low level of satisfaction and commitment, indicate high felt needs, and vice versa. Once employees' felt needs are identified, organizations can plan to manage and meet them, because meeting such needs will help the employees to know more about things such as: their organization's view of their performance, with regard to how well they perform; the ways in which they can improve their performance; their strengths and weaknesses; their future role; and how to devise a skill supply strategy for their future role. These would prepare them for pursuing their own and the organization's goals.
Thus, it is expected that future empirical research on EPA would fill the research gaps highlighted in this review, such as undertaking the expanded view of PA purposefulness, classification of PA accuracy, and their relationship with Greenberg's taxonomy of PA outcomes and, more importantly, from the perspective of competing values. Also, by filling the highlighted gaps in the existing literature, future empirical evidence on the EPA framework would inform professionals about the required focal point in their endeavours, i.e. ratee reactions-based view, for designing an effective PA system.