Creating Valid Skills/Performance Evaluations

In our previous two posts, we reviewed the problems that occur when using rating scales for evaluations. First, in The Problem with Rating Scales, we discussed the difficulties that arise when rating data, which is ordinal, is treated as if it were ratio data in mathematical operations such as averaging. Then, in Reporting Level 1 Evaluations, we discussed how to solve this problem when reporting on Level I evaluations (smile sheets). In this post, we cover how best to use rating scales for skills evaluations.

Note: We don’t propose a single way to use rating scales for skills evaluations as we did for Level I evaluations. Instead, we review several approaches and introduce one new method that we consider more reliable and valid.

You are likely familiar with the classic skills/proficiency evaluation and its rating scales. Here is an example of a five-point proficiency scale:

Skills evaluations have some of the same problems as Level I evaluations but are, by their nature, more reliable and valid. Let’s look at the similarities and differences.


Both types of evaluations suffer from some level of subjectivity, but less so for skills evaluations. Here’s why:

  1. Skills evaluators are generally trained to distinguish among the ratings (e.g., a “four” performance from a “five” performance). Learners who fill out smile sheets receive no such training.
  2. Well-designed skills rubrics include “behavioral anchors.” Behavioral anchors tell the rater which behaviors must be demonstrated to achieve each rating. When written properly, behavioral anchors remove much of the subjectivity from the ratings. For example, when evaluating someone on their ability to set realistic schedules, the behavioral anchors might look like:

Nevertheless, some organizations feel raters have a hard time distinguishing a four from a five or a three from a four, so they reduce the number of rating categories to three (while inserting proper behavioral anchors):

Some organizations prefer an even number of ratings – with good reason. It forces the rater to choose either “acceptable” or “not acceptable” and removes the “neutral” category. So, a four-point rating scale might look like this (again, with proper behavioral anchors):

Treating Data as Ratio

As we pointed out in the first post in this series, treating ordinal data as ratio data is considered a mathematical felony. In the case of Level I evaluations, this sort of “crime” is built into how the raw data is processed. Group averages are important but individual responses are not. So, the fact that one learner, Sally, gives the instructor’s slides a rating of three is unimportant. The fact that the average rating for the instructor’s slides from all the respondents is three is much more important.

For skills evaluations, group averages are important, but so are individual results. So, when Sally scores a two on “handling objections,” that result matters in its own right.

But scores are also averaged for individuals. So, if Sally scores a 1, 1, 5, 5 on the four skills on which she is being rated, she will have averaged a 3 (passing on most evaluations), when she clearly failed two of the skills.
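The arithmetic behind Sally's result can be checked with a short sketch. The scores 1, 1, 5, 5 and the "handling objections" skill come from this post; the other skill names, the assignment of scores to skills, and the passing threshold of 3 are assumptions made only for illustration.

```python
from statistics import mean

# Scores 1, 1, 5, 5 are from the example above. Skill names other than
# "handling objections" are hypothetical, as is which skill got which score.
sally_scores = {
    "handling objections": 1,
    "setting realistic schedules": 1,
    "presentation skills": 5,
    "active listening": 5,
}

PASSING_SCORE = 3  # assumed passing threshold on a five-point scale


def average_score(scores):
    """Return the arithmetic mean of the ratings (treats ordinal data as ratio)."""
    return mean(scores.values())


def skills_below(scores, threshold=PASSING_SCORE):
    """Return the names of skills rated below the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score < threshold)


average = average_score(sally_scores)
passes_on_average = average >= PASSING_SCORE
failed_skills = skills_below(sally_scores)

print(f"Average: {average}, passes on average: {passes_on_average}")
print(f"Skills below threshold: {failed_skills}")
```

Under these assumptions, the average alone reports a pass, while the per-skill check lists the two skills that fell short.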

This brings us to a final suggestion – Mastery Rating. In this type of rating, a skill is not evaluated on a numeric scale; it is either “mastered” or “not mastered.” Here’s an example for presentation skills, with behavioral anchors:

Now the person being evaluated must master all criteria. Failure to master even one results in a failure on the evaluation. This ensures that every skill is mastered, rather than only that the average across skills looks acceptable.
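The all-or-nothing rule can be expressed in a few lines. This is a minimal sketch, not a prescribed implementation: the function name `evaluate_mastery` and the presentation criteria listed are hypothetical, and each criterion is simply recorded as mastered (`True`) or not mastered (`False`).

```python
def evaluate_mastery(results):
    """Pass only if every criterion is marked mastered.

    `results` maps a criterion name to True (mastered) or False (not mastered).
    Returns a dict with the overall outcome and the criteria still to be mastered.
    """
    if not results:
        raise ValueError("at least one criterion is required")
    not_mastered = sorted(name for name, mastered in results.items() if not mastered)
    return {"passed": not not_mastered, "not_mastered": not_mastered}


# Hypothetical criteria for a presentation skills evaluation.
presentation_results = {
    "states the objective at the start": True,
    "maintains eye contact with the audience": True,
    "keeps to the allotted time": False,
    "answers questions accurately": True,
}

outcome = evaluate_mastery(presentation_results)
print(outcome)
```

Because no averaging takes place, a single unmastered criterion fails the evaluation, and the result names exactly what still needs work.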
