# What's the Problem?

Suppose we want to assign scores to a large collection of objects – too large for a single assessor to deal with. We are pretty much forced into having a panel of assessors, each of whom assesses some small collection of the objects. Then the question is, how do we combine their scores to obtain final scores?

There are various possible approaches, depending on whether our final objective is an overall ranking of the objects, or an actual score for each object.

If a ranking is all that is required, then one interesting approach is to have each assessor rank the objects they assess, and then combine the rankings. Of course, the follow-up question is how best to combine the rankings. Maybe the most obvious answer is to choose the overall ranking that reverses as few as possible of the assessor rankings. This falls foul of the problem of computational expense, but there is an alternative approach, called Hodge ranking, which makes use of a discrete equivalent to some vector calculus. A nice introduction can be found here.

On the other hand, we may actually want a score. The most obvious approach to this is to take the average (by which I mean the arithmetic mean) of the scores awarded by the various assessors. At first blush, this may look like the best one can do, but it does have a significant problem caused by the fact that not all assessors will score each item.

The problem is that some assessors are generally more generous than others, i.e. may have an intrinsic bias towards scoring objects too high or too low. And since not all assessors provide a score for each object, some objects may get (un)lucky by having only (un)generous markers. It would be good to try to spot and compensate for this bias.
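To make the problem concrete, here is a tiny made-up example (the names and numbers are purely hypothetical). Two objects have exactly the same true value, but each is seen by only one assessor, one generous and one harsh, so simple averaging separates them by the full difference in bias.

```python
# Hypothetical toy data: both objects have the same true value.
true_value = {"X": 50.0, "Y": 50.0}
# One assessor marks 10 points high, the other 10 points low.
bias = {"generous": 10.0, "harsh": -10.0}
# Each object happens to be seen by only one assessor.
assessments = [("generous", "X"), ("harsh", "Y")]

scores = {}
for assessor, obj in assessments:
    scores.setdefault(obj, []).append(true_value[obj] + bias[assessor])

# Simple averaging: the arithmetic mean of the scores each object received.
simple_average = {obj: sum(s) / len(s) for obj, s in scores.items()}
print(simple_average)  # {'X': 60.0, 'Y': 40.0}
```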

In what follows, I'm going to describe an approach developed by Robert MacKay of the University of Warwick, called Calibration with Confidence (I'll explain the name below).

# A Statistical Model

We want to get the best information we can out of the scores provided by a panel of assessors.

Of course, the word 'best' is loaded. Best as measured how? To answer that we need a decent mathematical (or statistical, if you will) model of the situation. So here is the model that this approach assumes.

- each object \(o\) to be assessed has an actual objective numerical value \(v_o\). (I know, I know. It's an imperfect universe.)
- each assessor has a fixed amount by which they score high or low, their bias \(b_a\).
- each assessor has a degree of consistency, the extent to which they award the same mark to the same object on repeated assessments.

So when an assessor awards a score to an object, they give a score of \[ s_{ao} = v_o + b_a + \sigma_{ao} \varepsilon \] where \(v_o\) is the true value of the object, \(b_a\) is the bias of the assessor, \(\sigma_{ao}\) measures how consistently this assessor would score this object, and \(\varepsilon\) is a normal random variable of mean \(0\) and standard deviation \(1\).
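As a rough illustration, the model can be simulated in a few lines of NumPy. This is only a sketch under my own conventions: the function name `simulate_scores`, the layout (rows are assessors, columns are objects), the use of `NaN` for unassessed pairs and all the numbers are assumptions made for the example.

```python
import numpy as np

def simulate_scores(values, biases, sigma, mask, rng):
    """Draw s_ao = v_o + b_a + sigma_ao * eps for each assessed pair.

    values: shape (n_objects,), the true values v_o
    biases: shape (n_assessors,), the biases b_a
    sigma:  shape (n_assessors, n_objects), the sigma_ao
    mask:   boolean, True where assessor a scored object o
    Unassessed pairs are returned as NaN.
    """
    eps = rng.standard_normal(sigma.shape)  # mean 0, standard deviation 1
    scores = values[None, :] + biases[:, None] + sigma * eps
    return np.where(mask, scores, np.nan)

rng = np.random.default_rng(0)
values = np.array([55.0, 62.0, 70.0])   # hypothetical true values
biases = np.array([5.0, -5.0])          # one generous, one harsh assessor
sigma = np.full((2, 3), 2.0)            # same consistency everywhere
mask = np.array([[True, True, False],
                 [False, True, True]])  # who assessed what
scores = simulate_scores(values, biases, sigma, mask, rng)
print(scores)
```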

As is the case with all statistical models, this is wrong; however, the assumptions are plausible enough for it to be useful.

We know \(s_{ao}\); we want to find \(v_o\) and \(b_a\). But what about \(\sigma_{ao}\)?

This is something we also need to know in order to make good estimates for \(v_o\) and \(b_a\). But how can we know it?

One approach is simply to give up, and assume that they are all the same. In fact, if we do this, the method I'm going to describe here reduces to Fisher's method of incomplete block analysis.

Or we can try to give a plausible estimate of it, which will improve the values we find for \(v_o\) and \(b_a\).

There are a few ways we might do this.

- Prior information: do we already have reason to believe (from previous experience) that an assessor is likely to be very consistent or very inconsistent?
- Assessor experience: we assign smaller values of \(\sigma_{ao}\) in cases where we know that assessor \(a\) has lots of experience with the topic of object \(o\).
- We can ask the assessor to give a range of values rather than a specific score, and use that range to approximate \(\sigma_{ao}\).

However we arrive at this, we define \[ c_{ao} = \frac{1}{\sigma_{ao}^2} \] to be the confidence in assessor \(a\)'s score for object \(o\).
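For the last of the possibilities listed above, one conceivable convention (an assumption of mine, not something prescribed by the method) is to treat the range an assessor gives as the score plus or minus one \(\sigma_{ao}\), so that \(\sigma_{ao}\) is half the width of the range. The helper `confidence_from_range` and its `widths_per_sigma` parameter are hypothetical names for this sketch.

```python
def confidence_from_range(low, high, widths_per_sigma=2.0):
    """Turn an assessor's stated range into a confidence c = 1 / sigma**2.

    Assumption: the range spans `widths_per_sigma` standard deviations.
    The default of 2.0 reads the range as score +/- one sigma.
    """
    if high <= low:
        raise ValueError("the range must have positive width")
    sigma = (high - low) / widths_per_sigma
    return 1.0 / sigma ** 2

# A wide range gives low confidence, a narrow range gives high confidence.
print(confidence_from_range(60, 70))  # sigma = 5
print(confidence_from_range(64, 66))  # sigma = 1
```

A different reading of the range (for example, as covering roughly two standard deviations either side) only changes `widths_per_sigma`.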

Then our task is to find a good approximation to each \(v_o\) and \(b_a\) (so we are calibrating the assessors) given the values of \(s_{ao}\) and \(\sigma_{ao}\) (which measures the confidence we have in the assessor's consistency, or reliability). This is the explanation I promised for the name, calibration with confidence.

# The Algorithm

So, given the data \(s_{ao}\) and \(\sigma_{ao}\), we want to get the best possible estimate of each \(v_o\) and \(b_a\). One definition of what is best, which has the great advantage of being mathematically tractable, is the values of \(v_o\) and \(b_a\) which minimise \[ \sum c_{ao} (s_{ao} - v_o - b_a)^2. \] This gives higher weight to those contributions for which \(\sigma_{ao}\) is smaller.

A standard bit of calculus tells us that this is minimised if and only if for each object \(o\), \[ \sum_{a:(a,o) \in E} c_{ao}(s_{ao}-v_o-b_a) = 0 \] and for each assessor \(a\), \[ \sum_{o:(a,o) \in E} c_{ao}(s_{ao}-v_o-b_a) = 0 \] where \(E\) is the set of assessor-object pairs for which an assessment has taken place, so the sums run over exactly those assessments.
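For anyone who wants that bit of calculus spelled out, here is a sketch; the name \(F\) for the quantity being minimised is just a label introduced here. Writing \[ F(v,b) = \sum_{(a,o) \in E} c_{ao}(s_{ao} - v_o - b_a)^2, \] the partial derivatives are \[ \frac{\partial F}{\partial v_o} = -2 \sum_{a:(a,o) \in E} c_{ao}(s_{ao}-v_o-b_a), \qquad \frac{\partial F}{\partial b_a} = -2 \sum_{o:(a,o) \in E} c_{ao}(s_{ao}-v_o-b_a). \] Since \(F\) is a sum of squares with non-negative weights, it is a convex quadratic, so its minimisers are exactly the points at which all of these derivatives vanish.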

Let's just pause for a minute to think about these equations.

We know the values of \(s_{ao}\) and \(c_{ao}\) for each assessment (remembering that not all objects are assessed by all assessors); and we are trying to find the values of \(v_o\) and \(b_a\).

If we set \(c_{ao}=0\) in all the cases where no assessment took place, we can write the above equations slightly more evocatively as \[ \begin{split} \left( \sum_a c_{ao} \right) v_o + \sum_a b_a c_{ao} &= \sum_a c_{ao}s_{ao} = V_o\\ \sum_o c_{ao} v_o + \left(\sum_o c_{ao} \right) b_a &= \sum_o c_{ao}s_{ao} = B_a \end{split} \] where \(V_o\) and \(B_a\) are new names for \(\sum_a c_{ao}s_{ao}\) and \(\sum_o c_{ao}s_{ao}\) respectively.

We can make this look a little neater by expressing the equations in matrix form: \[ L \left[ \begin{array}{c} v\\b \end{array} \right] = \left[ \begin{array} {c} V \\B \end{array} \right] \] where \(L\) is built out of the various \(c_{ao}\) (the scores \(s_{ao}\) enter only through the right-hand side), and \(v\), \(V\), \(b\) and \(B\) are the vectors whose components are \(v_o\), \(V_o\), \(b_a\) and \(B_a\) respectively.
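The matrix form can be assembled directly from a table of confidences and a table of scores. The sketch below is my own illustration rather than the published code: `build_system` is a hypothetical helper, both tables have one row per assessor and one column per object, and unassessed pairs have confidence \(0\) (their scores may be `NaN`).

```python
import numpy as np

def build_system(C, S):
    """Return (L, rhs) for the equations L [v; b] = [V; B].

    C[a, o] is the confidence c_ao (0 where no assessment took place).
    S[a, o] is the score s_ao (ignored wherever C[a, o] == 0).
    """
    C = np.asarray(C, dtype=float)
    S = np.where(C > 0, np.asarray(S, dtype=float), 0.0)
    # Object equations: (sum_a c_ao) v_o + sum_a c_ao b_a = V_o
    top = np.hstack([np.diag(C.sum(axis=0)), C.T])
    # Assessor equations: sum_o c_ao v_o + (sum_o c_ao) b_a = B_a
    bottom = np.hstack([C, np.diag(C.sum(axis=1))])
    L = np.vstack([top, bottom])
    rhs = np.concatenate([(C * S).sum(axis=0), (C * S).sum(axis=1)])
    return L, rhs

# Hypothetical data: 2 assessors, 3 objects, equal confidence throughout.
C = [[1, 1, 0],
     [0, 1, 1]]
S = [[60, 67, np.nan],
     [np.nan, 57, 65]]
L, rhs = build_system(C, S)
print(L)
print(rhs)
```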

But now we have to ask the obvious question: does this system of equations have a solution, and if it does, is the solution unique?

The bad news is that the solution to these equations cannot be unique; if we have any solution, we can construct another by adding some number to all the values \(v_o\) and subtracting the same number from all the biases \(b_a\). But we can deal with this by placing a constraint on the biases, for example (and most simply) by requiring that they add up to \(0\), i.e. by appending the equation \(\sum b_a = 0\) to the above equations.
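Here is a minimal sketch of solving the equations with this constraint appended. The function `calibrate` is a hypothetical name, the tables have one row per assessor and one column per object, and using `numpy.linalg.lstsq` on the stacked system is simply one convenient choice of mine; I am not claiming it is how the published implementation does it.

```python
import numpy as np

def calibrate(C, S):
    """Estimate (v, b) from confidences C and scores S, with sum(b) = 0."""
    C = np.asarray(C, dtype=float)
    S = np.where(C > 0, np.asarray(S, dtype=float), 0.0)
    n_assessors, n_objects = C.shape
    L = np.block([
        [np.diag(C.sum(axis=0)), C.T],
        [C, np.diag(C.sum(axis=1))],
    ])
    rhs = np.concatenate([(C * S).sum(axis=0), (C * S).sum(axis=1)])
    # Extra equation: the biases add up to 0.
    constraint = np.concatenate([np.zeros(n_objects), np.ones(n_assessors)])
    A = np.vstack([L, constraint])
    y = np.append(rhs, 0.0)
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    return solution[:n_objects], solution[n_objects:]

# Noise-free hypothetical scores built from v = (55, 62, 70), b = (5, -5).
C = [[1, 1, 0],
     [0, 1, 1]]
S = [[60, 67, np.nan],
     [np.nan, 57, 65]]
v_hat, b_hat = calibrate(C, S)
print(v_hat, b_hat)
```

Because these made-up scores contain no noise and the true biases already add up to \(0\), the estimates should reproduce the values and biases used to build them, up to rounding.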

But does this fix the problem? It removes one source of degeneracy in the equations, but

- are there other degeneracies?
- are the equations consistent?

Well, there may be other degeneracies. However, we can remove them by ensuring that the network of assessments satisfies a very plausible condition.

To state this condition, first consider the assessments in the following way. We make a graph by using all objects and all assessors as nodes, and connect an assessor to an object by an edge if that assessor scores that object. (This is an example of a bipartite graph.) Then the system of equations, together with the constraint on the biases, will have a unique solution if and only if this graph is connected.
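Checking this condition before solving is cheap. The sketch below uses a plain breadth-first search over the bipartite graph; `is_connected` is a hypothetical helper, and the input is a table with one row per assessor and one column per object.

```python
from collections import deque

def is_connected(assessed):
    """assessed[a][o] is truthy when assessor a scored object o."""
    n_assessors = len(assessed)
    n_objects = len(assessed[0])
    # Nodes 0..n_assessors-1 are assessors, the rest are objects.
    neighbours = [[] for _ in range(n_assessors + n_objects)]
    for a in range(n_assessors):
        for o in range(n_objects):
            if assessed[a][o]:
                neighbours[a].append(n_assessors + o)
                neighbours[n_assessors + o].append(a)
    # Breadth-first search starting from the first assessor.
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == n_assessors + n_objects

# The two assessors share the middle object, so everything is linked.
print(is_connected([[1, 1, 0], [0, 1, 1]]))  # True
# No shared object: the assessors cannot be compared with each other.
print(is_connected([[1, 0], [0, 1]]))        # False
```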

This is something we might have hoped to be true: it is saying that every assessor can be related to every other by some comparison of scores, possibly an indirect comparison via some other intermediate assessors. The miracle is that this is exactly what we need.

# Does It Work?

The proof of any pudding is in the eating. So, does this work in practice?

This is clearly difficult to answer, because we don't have some kind of magical access to true values and assessor biases to check against.

However, we can at least check that it works against simulated data which satisfy the statistical assumptions of the model. And testing shows that it gives better estimates of true scores than simple averaging, and better estimates of true scores and biases than the incomplete block analysis approach, which does not incorporate confidence values.
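A small experiment of this kind can be sketched as follows. To be clear, this is not the testing reported in the paper: the panel size, the biases, the range of \(\sigma_{ao}\), the assignment of assessors to objects and the function names are all assumptions made for illustration, and it only compares against simple averaging.

```python
import numpy as np

def calibrate(C, S):
    """Estimate (v, b) from confidences C and scores S, with sum(b) = 0."""
    n_assessors, n_objects = C.shape
    S = np.where(C > 0, S, 0.0)
    L = np.block([[np.diag(C.sum(axis=0)), C.T],
                  [C, np.diag(C.sum(axis=1))]])
    rhs = np.concatenate([(C * S).sum(axis=0), (C * S).sum(axis=1)])
    constraint = np.concatenate([np.zeros(n_objects), np.ones(n_assessors)])
    solution, *_ = np.linalg.lstsq(np.vstack([L, constraint]),
                                   np.append(rhs, 0.0), rcond=None)
    return solution[:n_objects], solution[n_objects:]

def rmse(estimate, truth):
    return np.sqrt(np.mean((estimate - truth) ** 2))

rng = np.random.default_rng(1)
n_assessors, n_objects = 8, 40
values = rng.uniform(40, 80, n_objects)         # true values v_o
biases = np.linspace(-10, 10, n_assessors)      # biases b_a, adding up to 0

# Each object is scored by two "neighbouring" assessors, which keeps
# the assessor-object graph connected.
mask = np.zeros((n_assessors, n_objects), dtype=bool)
for o in range(n_objects):
    mask[o % n_assessors, o] = True
    mask[(o + 1) % n_assessors, o] = True

sigma = rng.uniform(0.5, 2.0, mask.shape)       # consistency sigma_ao
noise = sigma * rng.standard_normal(mask.shape)
scores = values[None, :] + biases[:, None] + noise

C = np.where(mask, 1.0 / sigma ** 2, 0.0)       # confidences c_ao
v_hat, b_hat = calibrate(C, scores)
simple_mean = np.where(mask, scores, 0.0).sum(axis=0) / mask.sum(axis=0)

print("RMSE, simple averaging:", rmse(simple_mean, values))
print("RMSE, calibrated:      ", rmse(v_hat, values))
```

With biases this large compared with the noise, the simple averages inherit most of the error from whichever assessors an object happened to get, which is exactly the effect the calibration is meant to remove.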

We can also compare the results with those obtained in real assessment exercises, and it does give different results. Given that we know that some assessors are more generous than others, and that some are more consistent than others, we can be fairly confident that the results are an improvement on simple averaging.

# What is it Good For?

There are two situations that spring immediately to my mind in which this approach can be useful:

- when there are many objects to be assessed, and it is not possible for assessors to assess every object (which is, of course, where we came in).
- for training assessors, so that a reasonable estimate of assessor bias can be found and this used as useful feedback in the training process.

## Reference

All the details which have been glossed over above (and many more) can be found in

RS MacKay, R Kenna, RJ Low and S Parker (2017) Calibration with confidence: a principled method for panel assessment. Royal Society Open Science.

And finally, if you'd like to play with or explore the algorithm, you can download it as an Excel spreadsheet or a Python (2.7) package from here.