# The Problem with Rating Scales

What’s wrong with rating scales? A lot. They are ubiquitous in learning and assessment, appearing primarily in two types of evaluations:

• Level One “smile sheets”
• Skills evaluations

The same criticisms apply to both usages, though the proposed solutions are different.

In Level One evaluations, rating scales are primarily used as responses to statements about a learning experience. For example:

In Skills Evaluations, they are used to rate a participant on a number of behavioral criteria. For example:

You’ve probably seen these types of forms so often that you just accept them as is and never question them. So, what are the problems?

## Ratings Involve a Lot of Subjectivity

Especially in Level One evaluations, there is a huge amount of subjectivity across the categories: one person’s “Agree” is another person’s “Strongly Agree.” Well-designed Skills Evaluations use behavioral anchors (descriptions of what behavior to look for at each rating level), but even then there is inevitably a fair amount of disagreement between raters (known formally as a lack of inter-rater reliability). In measurement parlance, the results are unreliable, and a measure that is unreliable cannot be valid.
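Inter-rater reliability can at least be quantified. Below is a minimal sketch in Python of Cohen’s kappa, a standard measure of agreement between two raters that corrects for agreement expected by chance; the rater scores are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement minus chance agreement,
    scaled by the maximum possible improvement over chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same rating
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal counts
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two observers scoring the same eight participants on a 5-point scale
a = [3, 4, 2, 5, 3, 4, 1, 3]
b = [4, 4, 3, 5, 2, 4, 2, 3]
print(round(cohens_kappa(a, b), 2))  # → 0.35 (only "fair" agreement)
```

A kappa near 1 means the raters genuinely agree; a value near 0 means their agreement is no better than chance, which is exactly the problem unanchored rating scales invite.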

## Assigning Numeric Values to the Categories

But the most serious problems emerge from how we treat the data. If we treated the response categories simply as categories, we would be OK (subjectivity aside). But we don’t. We assign a number to each rating category and then perform mathematical operations (averaging) on those numbers. So Strongly Disagree is not just Strongly Disagree, it is “1.” Disagree is not just Disagree, it is “2,” and so on. To understand why this is a problem, we need to review something you probably learned somewhere in your mathematics education. There is a hierarchy to data, and it looks like this:

• Nominal: categories with no inherent order (eye color, department)
• Ordinal: ordered categories, but with unknown “distances” between them
• Interval: ordered, with equal distances between values, but no true zero
• Ratio: equal distances and a true zero, so ratios are meaningful (weight, time)

Looking at this hierarchy, we can see that the categorical data we have is Ordinal. It’s ordered (Agree is better than Disagree), but that’s all we can say. Is the “distance” between Strongly Disagree and Disagree really 1, and is it the same as the “distance” between Disagree and Neither Agree nor Disagree? Who knows? But that’s how we treat it. Is a rating of Disagree (2) really twice as good as a rating of Strongly Disagree (1), while a rating of Strongly Agree (5) is only 25% better than a rating of Agree (4)? In fact, when we assign numbers to the categories and perform mathematical operations on them, we are treating Ordinal data as if it were Interval, or even Ratio, data.
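The arbitrariness of the number assignment can be demonstrated directly. The sketch below, using hypothetical responses, applies two different codings that preserve the same category order, which is all that Ordinal data guarantees, and gets contradictory results:

```python
# Hypothetical responses from two groups of four learners each.
group_x = ["Agree", "Agree", "Agree", "Neither Agree nor Disagree"]
group_y = ["Neither Agree nor Disagree", "Neither Agree nor Disagree",
           "Strongly Agree", "Strongly Agree"]

# Two number assignments that preserve the identical category order,
# so both are equally defensible for ordinal data.
coding_a = {"Strongly Disagree": 1, "Disagree": 2,
            "Neither Agree nor Disagree": 3, "Agree": 4, "Strongly Agree": 5}
coding_b = {"Strongly Disagree": 1, "Disagree": 2,
            "Neither Agree nor Disagree": 3, "Agree": 6, "Strongly Agree": 7}

def mean_score(responses, coding):
    return sum(coding[r] for r in responses) / len(responses)

for name, coding in [("coding_a", coding_a), ("coding_b", coding_b)]:
    print(name, mean_score(group_x, coding), mean_score(group_y, coding))
# coding_a: 3.75 vs 4.0 -- group_y "wins"
# coding_b: 5.25 vs 5.0 -- group_x "wins"
```

Which group rated the course more favorably? The answer flips depending on the arbitrary numbers we chose, which is exactly what it means to say the average of Ordinal data is not meaningful.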

Ratio data can be mathematically manipulated. An object that weighs two pounds is indeed twice as heavy as one that weighs one pound, and the “distance” between three pounds and two pounds is exactly the same as the “distance” between two pounds and one pound. Neither of those statements can be made about Ordinal data.

And then we compound the error by computing averages and presenting these averages as valid scores. We have committed a mathematical felony.

And it gets worse. Let’s say we have four learners and they are rating a course on three criteria:

As you can see, all three criteria have an average rating of “3,” which is generally treated as acceptable. But the ratings for each criterion are wildly different: criterion two is consistent across all four learners, criterion three is all over the place, and criterion one is extreme at both ends. Yet in our reporting, we treat all three criteria as having the same result. Of the three, the first criterion has the biggest problem. When we report only its average, we lose the crucial detail that two learners gave it their lowest rating and two gave it their highest. This is a little like putting your feet in a bucket of boiling water and sticking your head in a freezer of dry ice and saying, “on average, I feel fine.”
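One way to surface this problem is to report a measure of spread alongside the average. A minimal sketch in Python, using hypothetical ratings chosen to match the pattern described above:

```python
from statistics import mean, pstdev

# Hypothetical ratings for four learners on three criteria,
# all constructed to have the same average of 3.
criteria = {
    "Criterion 1 (polarized)":  [1, 1, 5, 5],
    "Criterion 2 (consistent)": [3, 3, 3, 3],
    "Criterion 3 (scattered)":  [1, 2, 4, 5],
}

for name, ratings in criteria.items():
    # The mean is identical; the standard deviation tells the real story.
    print(f"{name}: mean={mean(ratings):.1f}, spread={pstdev(ratings):.2f}")
# Criterion 1 (polarized):  mean=3.0, spread=2.00
# Criterion 2 (consistent): mean=3.0, spread=0.00
# Criterion 3 (scattered):  mean=3.0, spread=1.58
```

A spread of zero means genuine consensus at “3”; a spread of 2.00 on a 5-point scale means the group is split between its extremes, and the average alone hides the difference entirely.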

So, what can we do? There are solutions to these problems, which we will discuss in our next blog post.