Subject level TEF: are the metrics statistically adequate?

Opinion piece by D. V. M. Bishop

The Teaching Excellence and Outcomes Framework (TEF) was introduced at an institutional level in 2016. It uses metrics from the National Student Survey (NSS), together with information on drop-out rates and on student outcomes. These are benchmarked against demographic information and combined to give an ‘initial hypothesis’ about the institution’s classification as Bronze, Silver or Gold. This is then supplemented by additional evidence provided by institutions; this evidence is considered by a panel, which can modify the initial hypothesis to give a final rating.

Potential students are probably more interested in the characteristics of specific courses than global rankings of institutions, and, on advice from the Competition and Markets Authority, it was decided that there should be a subject level TEF. This was piloted in autumn 2017, with another pilot to take place this summer, after the technical consultation that was announced this month. In future, institutions will have both a global ranking and subject level rankings. 

What is measured?

In general, the plans are for subject-level TEF to use the same metrics as for the institutional TEF, although the weight given to NSS scores will be halved. The metrics are as follows:

National Student Survey (NSS)

The NSS runs in the spring of each academic year and is targeted at all final year undergraduates, who rate a series of statements on a 5-point scale ranging from Strongly Agree to Strongly Disagree. The items used for TEF are shown in Table 1.

Course completion

This is perhaps the most straightforward and uncontentious metric – at least for full-time students – and is assessed as the proportion of entrants who complete their studies, either at their original institution or elsewhere.

Student outcomes

TEF will use data from the Destination of Leavers from Higher Education (DLHE) survey, supplemented by the Longitudinal Education Outcomes (LEO) dataset (explained in this beginners’ guide). Data from LEO will include the proportion of students who have been in sustained employment or further study three years after leaving, and, most controversially, the proportion who fall above a median earnings threshold.

Teaching intensity?

It has been proposed that a measure of teaching intensity might be added, but this will be influenced by the consultation.

Statistical critique of TEF methods

Criticisms of TEF have been growing in frequency and intensity. Most of the commentary has focused on the questionable validity of the metrics used: quite simply, the data entered into TEF do not measure teaching excellence. I agree with these concerns, but here I want to focus on a further problem: the statistical inadequacies of the approach. Even if we had metrics that accurately reflected teaching excellence, it is unclear that we could derive meaningful rankings from them, because they are just not reliable or sensitive enough. I made this point previously, noting that institutional-level TEF ratings are statistically problematic: I’m not aware that those issues have ever been adequately addressed, and the problems are only made worse by the smaller samples available when assessments are made at subject level.

Scientists often want to measure things that are not directly observable. All such measurements will contain error, and the key issue is whether the error swamps the true effect of interest. If so, our measure will be useless. In the context of TEF, there is an implicit assumption that there is a genuine difference between institutions in terms of some underlying dimension of teaching excellence, which we may refer to as the ‘true’ score. The aim is to index this indirectly by proxy indicators: we take a weighted average of these indicators from all the students on a course as our measure of true score. There are two kinds of error we need to try to minimise: random error and systematic error.

Random error

Factors that are unrelated to teaching excellence can affect the proxy indicators: for instance, a student’s state of mind on the day of completing the NSS may affect whether they select category 3 or 4 on a scale; a student may have to drop out because of factors unrelated to teaching, such as poor health; subsequent earnings may fall because an employer goes out of business. Even more elementary is simple human error: the wrong number recorded in a spreadsheet. Measurement error is inevitable, and it does not necessarily invalidate a measure: the key question is whether it is so big as to make a measure unreliable.

There is a straightforward relationship between sample size and measurement error. The larger the sample size, the smaller the impact of any individual score on the average. To take an extreme example, suppose that all students on a course intend to give an NSS rating of 4, but one student inadvertently hits the wrong key and gives a rating of 1. If there are 100 students, this will have little impact (average score is 3.97), but if there are 10 students, there is a larger effect (average score is 3.7).
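The arithmetic here is simple enough to check with a few lines of code; the cohort sizes and ratings below are just the hypothetical numbers from the example above.

```python
# One student's erroneous rating of 1, in a cohort where everyone else
# intends to give 4: the distortion of the mean shrinks as n grows.

def mean_with_one_error(n_students, intended=4, erroneous=1):
    """Mean rating when all but one student give `intended` and one gives `erroneous`."""
    return ((n_students - 1) * intended + erroneous) / n_students

print(mean_with_one_error(100))  # 3.97
print(mean_with_one_error(10))   # 3.7
```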

Systematic error and benchmarking

In the context of TEF, systematic error refers to factors that differ between institutions, that are not related to ‘true’ teaching excellence, but which will bias the proxy indicators. This is where the notion of benchmarking comes in. The idea is that institutions vary in their intake of students, and it would not be reasonable to expect them all to have equivalent outcomes. An institution that is highly selective and only takes students who obtain three As at A-level is likely to have better outcome data than one that is less selective and takes a high proportion of students who have achieved less well at school. So the idea behind benchmarking is that one takes measures of relevant demographics related to outcomes, such as proportions of students from lower income backgrounds, with disabilities or from ethnic minorities, to see how these relate to proxy indicators. A ‘benchmark’ is computed for each institution and subject, which is the score they are expected to get based solely on their demographics. The benchmark is then subtracted from the obtained score to give a measure of whether they are performing above or below expectation. In effect, benchmarking is supposed to correct for these initial inequalities; in TEF the plan is that it will be done subject by subject, to ensure that the final rating ‘will not judge the relative worth of different subjects’.
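The logic of benchmarking can be sketched in a few lines. Everything here is invented for illustration – the real TEF benchmarks use many more factors and a far more elaborate statistical model – but the arithmetic of ‘observed minus expected’ is the same:

```python
# A toy sketch of benchmarking (all numbers invented for illustration).
# The benchmark is the score predicted from demographics alone; the
# reported indicator is observed score minus that expectation.

def benchmark(demographics, weights, baseline):
    """Expected score given demographic proportions and their estimated effects."""
    return baseline + sum(demographics[k] * weights[k] for k in weights)

# Hypothetical effect sizes, as if estimated across the whole sector.
weights = {"low_income": -0.3, "disability": -0.1}
baseline = 4.0

course = {"low_income": 0.5, "disability": 0.2}  # this course's intake
expected = benchmark(course, weights, baseline)  # 4.0 - 0.15 - 0.02 = 3.83
observed = 3.9
print(round(observed - expected, 2))  # 0.07: slightly above expectation
```

A course whose observed score exceeds its benchmark is deemed to perform above expectation; the worry is that, for small cohorts, much of that difference is noise.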

Small sample sizes are particularly problematic with benchmarked variables. Benchmarking involves complex statistical models that will only be valid if there is sufficient data on which to base predictions. It is possible to incorporate statistical adjustments to take into account variable sample sizes between institutions, but these typically create new problems, as the same absolute difference in obtained score vs benchmark will be interpreted differently depending on the sample size. It is already noted in a Lessons Learned document that benchmarking can create problems with extreme scores, and so further tweaks will be needed to avoid this. But this ignores the broader problem that the reliability of assessments will vary continuously with sample size: this is not something that suddenly kicks in with very small or large samples – it just becomes blindingly obvious in such cases.

Although it is stated that TEF exists to provide students with information, a main goal is to provide rankings, so that different institutions can be compared and put into bandings. For this to be meaningful, one needs measures that give a reasonable spread of scores. This is something I have commented on before, noting that ratings from the National Student Survey tend to be closely bunched together towards the top of the scale, with insufficient variation to make meaningful distinctions between institutions. A similar point was made by the Office for National Statistics. The tight distribution of scores, coupled with the variable, and often small, sample sizes for specific courses, is seriously problematic for any attempt at rank ordering, because a great deal of the variation in scores will just be due to random error.
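The point can be made concrete with a small simulation (all parameters invented): give a set of courses an identical underlying score, bunched near the top of the 5-point scale, and see how far their observed means spread apart through random error alone.

```python
# Twenty courses with *identical* true quality, 15 respondents each:
# the observed means still fan out, so any ranking of them is pure noise.

import random

random.seed(1)  # fixed seed so the illustration is reproducible

def course_mean(n_students, true_score=4.2, sd=0.6):
    """Mean of n noisy ratings around a shared true score, clipped to the 1-5 scale."""
    ratings = (min(5, max(1, random.gauss(true_score, sd))) for _ in range(n_students))
    return sum(ratings) / n_students

means = sorted(course_mean(15) for _ in range(20))
print(f"identical courses, observed means from {means[0]:.2f} to {means[-1]:.2f}")
```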

It would be possible to model the sample size at which a rating becomes meaningless, but it is not clear from the documents that I have read that anyone has done this. All I could find was a comment that scores would not be used in cases where there were fewer than 10 students (TEF consultation document, p. 24), in which case a more qualitative judgement would be used. We really need a practical test, with students from each institution and course subdivided into two groups at random, and TEF metrics computed separately for each subgroup. This would allow us to establish how large a sample size is needed to get replicable outcomes.
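Such a split-half check is straightforward to sketch. The snippet below uses made-up ratings for one hypothetical course, since student-level TEF data are not public; run on real data, large disagreement between the two halves would show that the sample is too small to yield a stable metric.

```python
# Split-half check (made-up ratings): split one course's students in two
# at random and compare the metric computed separately on each half.

import random

random.seed(2)  # fixed seed so the illustration is reproducible

def split_half_means(ratings):
    """Randomly split one course's ratings and return the two half-sample means."""
    shuffled = random.sample(ratings, len(ratings))
    half = len(shuffled) // 2
    a, b = shuffled[:half], shuffled[half:]
    return sum(a) / len(a), sum(b) / len(b)

# Hypothetical course of 12 students with bunched ratings.
ratings = [4, 4, 5, 3, 4, 5, 4, 2, 4, 5, 3, 4]
m1, m2 = split_half_means(ratings)
print(f"half-sample means: {m1:.2f} vs {m2:.2f}")
```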

Extreme data reduction: Gold, Silver and Bronze

Some of the information gathered for TEF could potentially be of interest to students, provided it was presented with explanatory context about error of measurement. In practice, a fairly complex set of data for each institution and course, based on numerous data-points integrated with subjective evaluations of the qualitative evidence submitted alongside the metrics, will be reduced to a three-point scale, with the institution categorised as Bronze, Silver or Gold. This is perhaps the most pernicious aspect of the TEF: it is justified by the argument that prospective students need information about their courses, yet they are not given that information. Instead, a rich, multivariate dataset is adjusted and benchmarked in numerous ways that make it hard for a typical person to understand, and then boiled down to a single scale that cannot begin to capture the diversity of our Higher Education system.

Is the distinction between Gold, Silver and Bronze robust?

As noted above, the metrics from proxy indicators are not blindly applied: they are integrated by panel members with additional evidence provided by the institution, to form a global judgement. This, however, provides ample opportunity for bias. A ‘Lessons learned’ evaluation of TEF concluded: ‘Overwhelmingly, assessors and panellists thought that the process of assessment was robust and effective. They were confident that they were able to make clear, defensible judgements in line with the guidance based on the evidence provided to them.’ To a psychologist, I’m afraid, this is laughable. The one thing we do know is that people are a mass of biases. You get people to make a load of ratings and then ask them whether their ratings were ‘robust’. Of course they are going to say they were.

To demonstrate genuinely robust ratings, one would want to have two independent panels working in parallel from the same data, to demonstrate that there was close agreement between them in the final ratings. Ideally, the ratings should also be done without knowledge of which institution was being rated, to avoid halo effects. Given how high the stakes are in terms of the rewards and penalties of being rated ‘Bronze’, ‘Silver’ or ‘Gold’, universities should insist that OfS conducts studies to demonstrate that TEF meets basic tests of measurement adequacy before agreeing to take part.