Professor DV Bishop outlines the multiple flaws in the TEF methodology


In a previous post I questioned the rationale and validity of the Teaching Excellence and Student Outcomes Framework (TEF). Here I document the technical and statistical problems with TEF.


How are statistics used in TEF?

Two types of data are combined in TEF: a set of ‘contextual’ variables, including student backgrounds, subject of study, level of disadvantage, etc., and a set of ‘quality indicators’ as follows:

  • Student satisfaction as measured by responses to a subset of items from the National Student Survey (NSS)
  • Continuation – the proportion of students who continue their studies from year to year, as measured by data collected by the Higher Education Statistics Agency (HESA)
  • Employment outcomes – what students do after they graduate, as measured by responses to the Destination of Leavers from Higher Education survey (DLHE)

As detailed further below, data on the institution’s quality indicators is compared with the ‘expected value’ that is computed based on the contextual data of the institution. Discrepancies between obtained and expected values, either positive or negative, are flagged and used, together with a written narrative from the institution, to rate each institution as Gold, Silver or Bronze. This beginner’s guide provides more information.


Problem 1: Lack of transparency and reproducibility

When you visit the DfE’s website, the first impression is that it is a model of transparency. On this site, you can download tables of data and even consult interactive workbooks that allow you to see the relevant statistics for a given provider. Track through the maze of links and you can also find an 87-page technical document of astounding complexity that specifies the algorithms used to derive the indicators from the underlying student data, DLHE survey and NSS data.

The problem, however, is that nowhere can you find a script that documents the process of deriving the final set of indicators from the raw data: if you try to work this out from first principles by following the HESA guidance on benchmarking, you run into the sand, because the institutional data is not provided in the right format.  When I asked the TEF metrics team about this, I was told: “The full process from the raw data in HESA/ILR returns, NSS etc. cannot be made fully open due to data protection issues, as there is sensitive student information involved in the process.” But this seems disingenuous. I can see that student data files are confidential, but once this information has been extracted and aggregated at institutional level, it should be possible to share it. If that isn’t feasible, then the metrics team should be able to at least generate some dummy data sets, with scripts that would do the computations that convert the raw metrics into the flags that are used in TEF rankings.

As someone interested in reproducibility in science, I’m all too well aware of the problems that can ensue if the pipeline from raw data to results is not clearly documented – this short piece by Florian Markowetz makes the case nicely.  In science and beyond, there are some classic scare stories of what can happen when the analysis relies on spreadsheets: there’s even a European Spreadsheet Risks Interest Group. There will always be errors in data – and sometimes also in the analysis scripts: the best way to find and eradicate them is to make everything open.


Problem 2: The logic of benchmarking

The idea of benchmarking is to avoid penalising institutions that take on students from disadvantaged backgrounds:

“Through benchmarking, the TEF metrics take into account the entry qualifications and characteristics of students, and the subjects studied, at each university or college. These can be very different and TEF assessment is based on what each college or university achieves for its particular students within this context. The metrics are also considered alongside further contextual data, about student characteristics at the provider as well as the provider’s location and provision.”

One danger of benchmarking is that it risks entrenching disadvantage. Suppose we have institutions X and Y, which are polar opposites in terms of how well they treat students. X is only interested in getting student fees, does not teach properly, and does not care about drop-outs – we hope such cases are rare, but, as this Panorama exposé showed, they do exist, and we’d hope that TEF would expose them. Y, by contrast, fosters its students and does everything possible to ensure they complete their course.  Let us further suppose that X offers a limited range of vocational courses, whereas Y offers a wider range of academic subjects, and that X has a higher proportion of disadvantaged students. Benchmarking ensures that X will be evaluated relative to other institutions offering similar courses to a similar population. This can lead to a situation where, because poor outcomes at X are correlated with its subject and student profile, expectations are low, and poor scores for student satisfaction and completion rates are not penalised.

Benchmarking is well-intentioned – its aim is to give institutions a chance to shine even if they are working with students who may struggle to learn. However, it runs the risk of making low expectations acceptable. It could be argued that, while there are characteristics of students and courses that affect student outcomes, in general, higher education institutions should not be offering courses where there is a high probability of student drop-out. And students would find it more helpful to see raw data on drop-out rates and student satisfaction, than to merely be told that an institution is Bronze, Silver or Gold – a rating that can only be understood in relative terms.


Problem 3: The statistics of benchmarking

The method used to do benchmarking comes from Draper and Gittoes (2005), and is explained here. A more comprehensive statistical treatment and critique can be found here.  Essentially, you identify background variables that predict outcomes, assess typical outcomes associated with each combination of these in the whole population under consideration, and then calculate an ‘expected’ score, as a mean of these combinations, weighted by the frequency of each combination at the institution.
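To make the weighting concrete, here is a minimal sketch of the 'expected' score calculation as I have described it. It is not the official TEF code: the function name, the two background variables and all the rates and head counts are invented for illustration.

```python
def expected_score(sector_rates, institution_counts):
    """Weighted mean of sector-wide rates, weighted by the institution's own mix.

    sector_rates: outcome rate for each combination of background variables,
                  computed over the whole population (hypothetical values here).
    institution_counts: number of students at one institution in each combination.
    """
    total = sum(institution_counts.values())
    return sum(sector_rates[group] * n for group, n in institution_counts.items()) / total


# Hypothetical sector-wide continuation rates by (course type, background).
sector_rates = {
    ("vocational", "disadvantaged"): 0.80,
    ("vocational", "advantaged"): 0.88,
    ("academic", "disadvantaged"): 0.90,
    ("academic", "advantaged"): 0.95,
}

# Two made-up institutions with different student mixes.
institution_x = {("vocational", "disadvantaged"): 300, ("vocational", "advantaged"): 100}
institution_y = {("academic", "disadvantaged"): 100, ("academic", "advantaged"): 300}

benchmark_x = expected_score(sector_rates, institution_x)
benchmark_y = expected_score(sector_rates, institution_y)
print(f"Expected score for X: {benchmark_x:.4f}")
print(f"Expected score for Y: {benchmark_y:.4f}")
```

With these invented numbers the benchmark for X is much lower than for Y, so the same observed continuation rate could count as above expectation at X and below expectation at Y.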

The obtained score may be higher or lower than the ‘expected’ value. The question is how to interpret such differences, bearing in mind that some variation is expected just due to random fluctuations. The precision of the estimates of both observed and expected values increases with sample size: you can compute a standard error around the difference score, and then use statistical criteria to identify cases with difference scores that are likely to be meaningful and not just down to random noise. This creates two opposite problems: where there is a small number of students, it is hard to distinguish a genuine effect from noise, but where there is a very large number, even tiny differences will be statistically significant. The process used in benchmarking uses statistical criteria to assign ‘flags’ to indicate scores that are extremely good (++), good (+), bad (-) or extremely bad (--) in relation to expectation. To ameliorate the problem of tiny effects being flagged in large samples, departures from expectation are flagged only if they also exceed a specific number of percentage points.
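The flagging logic can be sketched roughly as follows. This is a simplified illustration, not the TEF algorithm. I assume a plain binomial standard error with the benchmark treated as fixed, whereas the Draper and Gittoes method is more involved. The thresholds (z of 1.96 and 3.0, gaps of 2 and 3 percentage points) are placeholder defaults that I have chosen for the sketch.

```python
import math


def z_score(observed, expected, n):
    """Difference from benchmark in standard-error units (simplified binomial SE)."""
    se = math.sqrt(expected * (1 - expected) / n)
    return (observed - expected) / se


def flag(observed, expected, n,
         z_single=1.96, z_double=3.0, pp_single=2.0, pp_double=3.0):
    """Return '++', '+', '-', '--' or '' for an unflagged score.

    A flag needs BOTH a large enough z-score and a large enough gap in
    percentage points. All four thresholds are illustrative assumptions.
    """
    gap_pp = (observed - expected) * 100
    z = z_score(observed, expected, n)
    sign = "+" if gap_pp > 0 else "-"
    if abs(z) >= z_double and abs(gap_pp) >= pp_double:
        return sign * 2
    if abs(z) >= z_single and abs(gap_pp) >= pp_single:
        return sign
    return ""


# Same 4-point gap, very different verdicts depending on cohort size.
print(repr(flag(0.89, 0.85, n=100)))      # small cohort
print(repr(flag(0.89, 0.85, n=5000)))     # large cohort
# A tiny gap in a huge cohort is 'significant' but blocked by the points rule.
print(repr(flag(0.855, 0.85, n=100000)))
```

The percentage-point rule stops trivial differences being flagged in very large cohorts, but it does nothing to help small cohorts, where even a sizeable gap fails the statistical test.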

This is illustrated for one of the NSS measures in Figure 1, which shows that the problem of sample size has not been solved: a large institution is far more likely to get a flagged score (either positive or negative) than a small one. Indeed, a small institution is a pretty safe bet for a Silver award.

Figure 1. The Indicator (x-axis) is the percentage of students with positive NSS ratings, and the z-score (y-axis) shows how far this value is from expectation based on benchmarks. The plot illustrates several things: (a) the range of indicators becomes narrower as sample size increases; (b) most scores are bunched around 85%; (c) for large institutions, even small changes in indicators can make a big difference to flags, whereas for small institutions, most are unflagged, regardless of the level of indicator; (d) the number of extreme flags (filled circles or asterisks) is far greater for large than small institutions.
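One way to see why small institutions so rarely get flagged is to ask how large the gap between indicator and benchmark must be before a flag is possible at all. The sketch below does this under the same simplifying assumptions as before: a binomial standard error, a benchmark of 85% (roughly where most scores bunch), and illustrative thresholds of z = 1.96 and 2 percentage points. None of these values is taken from the TEF specification.

```python
import math


def min_flaggable_gap_pp(n, expected=0.85, z_crit=1.96, min_gap_pp=2.0):
    """Smallest gap (in percentage points) that could earn a single flag.

    The gap must clear both the statistical criterion (z_crit standard
    errors) and the fixed percentage-point floor, so the larger one binds.
    All parameter defaults are assumptions made for this illustration.
    """
    se_pp = 100 * math.sqrt(expected * (1 - expected) / n)
    return max(min_gap_pp, z_crit * se_pp)


for n in (50, 200, 1000, 5000, 20000):
    print(f"n = {n:>6}: gap needed = {min_flaggable_gap_pp(n):.1f} points")
```

Under these assumptions, a cohort of 50 students would need to be roughly 10 percentage points away from its benchmark to be flagged. A cohort of 5,000 only has to clear the 2-point floor.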


Problem 4: Benchmarking won’t work at subject level

From a student perspective, it is crucial to have information about specific courses; institution-wide evaluation is not much use to anyone other than vice-chancellors who wish to brag about their rating. However, the problems I have outlined with small samples are amplified if we move to subject-level evaluation.  I raised this issue with the TEF metrics team, and was told:

The issue of smaller student numbers ‘defaulting’ to silver is something we are aware of. Paragraph 94 on page 29 of the report on findings from the first subject pilot mentions some OfS analysis on this. The Government consultation response also has a section on this. On page 40, the government response to question 10 refers to assessability, and potential methods that could be used to deal with this in future runs of the TEF.

So the OfS knows they have a problem, but seems determined to press on, rather than rethinking the exercise.


Problem 5: You’ll never be good enough

The benchmarks used in TEF are based on identifying statistical outliers. Forget for a moment the sample size issue, and suppose we have a set of institutions with broadly the same large number of students, and a spread of scores on a metric, such that the mean percentage meeting criterion is 80%, with a standard deviation of 2% (see Figure 2). We flag the bottom 10% (those with scores below 77.5%) as problematic. In the next iteration of the exercise, those with low scores have either gone out of business, improved their performance, or learned how to game the metric, and so we no longer have anyone scoring below 77.5%. The mean score thus increases and the standard deviation decreases. So now, on statistical grounds, a score below 78.1% gets flagged as problematic. In short, with a statistical criterion for poor performance, even if everyone improves dramatically, or poor performers drop out, there will still be those at the bottom of the distribution – unless we get to a point where there is no meaningful variation in scores.


Figure 2: Simulated data showing how improvements in scores can lead to increasing cutoff in the next round if statistical criterion is adopted.
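The moving-cutoff argument is easy to check with a toy simulation. The sketch below is not the code behind Figure 2. It assumes normally distributed scores, defines the cutoff as the 10th percentile of a normal distribution fitted to the current scores, and models the second round in the crudest way, by simply dropping every institution below the first cutoff. The seed and the number of simulated institutions are arbitrary.

```python
import random
from statistics import NormalDist, mean, stdev


def statistical_cutoff(scores, tail=0.10):
    """Score below which an institution is flagged, from the fitted mean and SD."""
    z = NormalDist().inv_cdf(tail)  # about -1.28 for the bottom 10%
    return mean(scores) + z * stdev(scores)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Round 1: scores with mean 80 and standard deviation 2.
round1 = [rng.gauss(80, 2) for _ in range(10000)]
cutoff1 = statistical_cutoff(round1)

# Round 2: everyone below the first cutoff has gone (the simplest scenario).
round2 = [score for score in round1 if score >= cutoff1]
cutoff2 = statistical_cutoff(round2)

print(f"Round 1: mean {mean(round1):.2f}, SD {stdev(round1):.2f}, cutoff {cutoff1:.2f}")
print(f"Round 2: mean {mean(round2):.2f}, SD {stdev(round2):.2f}, cutoff {cutoff2:.2f}")
```

The mean rises and the spread shrinks once the lowest scorers are removed, so the recomputed cutoff is higher than the first. Institutions that were acceptable in round 1 are now flagged.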


The bottom line

TEF may be summarised thus:

  • Take a heterogeneous mix of variables, all of them proxy indicators for ‘teaching excellence’, which vary hugely in their reliability, sensitivity and availability
  • Transform them into difference scores by comparing them with ‘expected’ scores derived from a questionable benchmarking process
  • Convert difference scores to ‘flags’, whose reliability varies with the size of the institution
  • Interpret these in the light of qualitative information provided by institutions

All to end up with a three-point ordinal scale, which does not provide students with the information that they need to select a course.

Time, maybe, to ditch the TEF and encourage students to consult the raw data instead to find out about courses?