From the course: GPT-4: The New GPT Release and What You Need to Know

HELM

- [Instructor] When organizations like OpenAI and Google train and make large language models available, they often spend millions of dollars doing so. And when these models are used in products like Google Search, this can impact billions of users. One of the things we don't have is a standard way to compare these models. Although we're interested in how good they are at a task, we also don't know whether the same model generates false information. Instead of looking at just one metric, Stanford University researchers proposed HELM, or the Holistic Evaluation of Language Models, in their paper. The paper covers a lot of ground around different scenarios, metrics, and benchmarks. We'll just focus on comparing the large language models for now. At the time of this recording, GPT-4 does not appear in this research, but it's important to understand how to evaluate large language models in general.

Now, I know it isn't very easy to see the row and column labels, so let me talk you through it. Each row corresponds to a different dataset, which is a benchmark to measure how good a model is at a certain task; going from the first row, there's NaturalQuestions, BoolQ, and other question-answering datasets, and so on. Each column is a large language model from a large language model provider. So going from the first column, some examples are J1-Jumbo, J1-Grande, and so on. You can see from the ticks in the boxes that previously, many of the large language models were only tested on certain datasets, and for many there are no public results. With HELM, the research team worked together with the main large language model providers, and they were able to benchmark the models across a variety of datasets and get a more holistic view of model performance. The HELM benchmark is a living benchmark and should change as new models are released.

I'll just cover the first couple of benchmarks, and you can explore the rest further if you're interested. So here, each row corresponds to a different language model that has been tested, and each column corresponds to a different benchmark for a different task. You can hover over each of the benchmarks and see what it's for. So, for example, MMLU is Massive Multitask Language Understanding, for question answering, and the CNN/Daily Mail benchmark is for text summarization.

Now, if you look at accuracy, text-davinci-002 and text-davinci-003, which are GPT-3.5 models, have some of the best overall results across all of the datasets. Both GPT-3.5 Turbo and GPT-4 were released in March of 2023, so they don't feature in the HELM results yet. Now, if you scroll down a few rows and look right down at the bottommost row, you can see davinci. This was OpenAI's GPT-3 model, which had 175 billion parameters, and in terms of accuracy, all of the models above it are more performant than the original GPT-3 model.

The robustness score here is the model's worst-case performance. So how does it perform when you send it text with typos or misspellings? There is some correlation here: typically, models that have high accuracy will also be more robust, and again, we can see that the GPT-3.5 models have high scores for this.

It's important to note that HELM does have some limitations. For example, users don't know if they can easily fine-tune a model for their use case. Other important considerations, like the price of models and the latency when using them, are not measured.
The availability of the model to users, meaning whether the platform is up most of the time, is also not measured.

Alright, so we've seen that text-davinci-002 and text-davinci-003 have some of the best overall results for accuracy and robustness across the datasets compared to the other large language models. The results for GPT-4 aren't on HELM yet. Head over to the HELM website and dig into some of the other metrics like fairness, efficiency, bias, and so on. Because this benchmark is updated periodically, there might be changes to what I've shown you here.
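To make the robustness idea concrete, here is a minimal sketch in Python of how worst-case performance could be computed: score a model on each original prompt plus a few typo-perturbed copies, and keep the lowest score per example. This is not HELM's actual implementation; the add_typos perturbation, the model callable, and the exact-match scoring are simplifying assumptions for illustration only.

import random


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    # Hypothetical perturbation: randomly swap adjacent characters to mimic typos.
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def worst_case_accuracy(model, examples, n_perturbations: int = 3) -> float:
    # For each example, score the model on the original prompt plus a few
    # perturbed copies, keep the minimum (worst-case) score, then average.
    # `model` is any callable taking a prompt string and returning a string,
    # standing in for a real large language model API call.
    total = 0.0
    for prompt, expected in examples:
        variants = [prompt] + [add_typos(prompt, seed=s) for s in range(n_perturbations)]
        scores = [1.0 if model(v).strip() == expected else 0.0 for v in variants]
        total += min(scores)  # worst case across the perturbed variants
    return total / len(examples)


# Tiny usage example with a stand-in "model" (not a real LLM):
def dummy_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "4"


examples = [("What is the capital of France?", "Paris"), ("2 + 2 =", "4")]
print(worst_case_accuracy(dummy_model, examples))

Average-case accuracy would take the mean over the variants instead of the minimum; taking the minimum is what makes this a worst-case (robustness-style) measurement.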
