Leaderboards

Expert-Driven Private Evaluations

Private Datasets

Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.

Coding

Rank	Model	Score	95% Confidence Interval
1	Claude 3.5 Sonnet*	1176	+34/-36
2	GPT-4 Turbo Preview	1146	+25/-27
3	GPT-4o	1138	+31/-29
4	Gemini 1.5 Pro (May 2024)	1095	+27/-30
5	Claude 3 Opus	1060	+26/-26
6	Gemini 1.5 Flash Preview	1026	+28/-26
7	Gemini 1.5 Pro (April 2024)	996	+26/-27
8	Llama 3 70B Instruct	981	+27/-27
9	Claude 3 Sonnet	980	+24/-26
10	Mistral Large	915	+26/-25
11	Gemini 1.0 Pro	790	+30/-31
12	CodeLlama 34B Instruct	699	+36/-38

*Potential contamination warning: Claude 3.5 Sonnet was evaluated six weeks after the Claude 3 models, so Anthropic could in principle have accessed the prompt set through API logs. However, Anthropic's stated policy is that it does not train on this data.
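
When reading the table above, it can help to check whether two models' 95% confidence intervals overlap before treating a rank difference as meaningful. The sketch below is only an illustration, not Scale's methodology: it assumes that an entry such as "+34/-36" means the interval runs from score - 36 to score + 34, and the names `Entry` and `intervals_overlap` are hypothetical. Overlap is a rough heuristic; it is not a formal significance test.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row. Assumes '+plus/-minus' bounds the score."""
    model: str
    score: float
    plus: float
    minus: float

    @property
    def low(self) -> float:
        # Lower bound of the assumed 95% interval.
        return self.score - self.minus

    @property
    def high(self) -> float:
        # Upper bound of the assumed 95% interval.
        return self.score + self.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Rows copied from the Coding table.
sonnet_35 = Entry("Claude 3.5 Sonnet", 1176, 34, 36)
gpt4_turbo = Entry("GPT-4 Turbo Preview", 1146, 25, 27)
codellama = Entry("CodeLlama 34B Instruct", 699, 36, 38)

print(intervals_overlap(sonnet_35, gpt4_turbo))  # True
print(intervals_overlap(sonnet_35, codellama))   # False
```

Under this reading, the first- and second-place coding scores have overlapping intervals (1140 to 1210 versus 1119 to 1171), so the gap between them is within the stated uncertainty.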

Instruction Following

Rank	Model	Score	95% Confidence Interval
1	Claude 3.5 Sonnet*	90.80	+1.50/-1.60
2	GPT-4o	88.60	+1.50/-1.40
3	GPT-4 Turbo Preview	87.60	+1.40/-1.30
4	Llama 3 70B Instruct	85.50	+1.80/-1.70
5	Mistral Large	85.40	+1.50/-1.60
6	Gemini 1.5 Pro (May 2024)	85.20	+1.70/-1.80
7	Claude 3 Opus	84.70	+1.60/-1.60
8	Claude 3 Sonnet	83.30	+1.70/-1.80
9	Gemini 1.5 Pro (April 2024)	82.90	+2.00/-2.00
10	Gemini 1.5 Flash Preview	82.60	+1.90/-1.90
11	Gemini 1.0 Pro	73.60	+2.40/-2.30
12	CodeLlama 34B Instruct	66.80	+2.10/-2.20

Math

Rank	Model	Score	95% Confidence Interval
1	Claude 3 Opus	95.19	+1.21/-1.21
2	GPT-4 Turbo Preview	95.10	+1.22/-1.21
3	GPT-4o	94.85	+1.25/-1.24
4	Claude 3 Sonnet	93.28	+1.41/-1.42
5	Gemini 1.5 Pro (May 2024)	92.28	+1.51/-1.50
6	Gemini 1.5 Pro (April 2024)	90.54	+1.65/-1.65
7	Llama 3 70B Instruct	90.12	+1.69/-1.68
7	Gemini 1.5 Flash Preview	90.12	+1.69/-1.68
9	Mistral Large	87.47	+1.87/-1.87
10	Gemini 1.0 Pro	79.83	+2.27/-2.26
11	CodeLlama 34B Instruct	37.51	+2.73/-2.73

Spanish

Rank	Model	Score	95% Confidence Interval
1	GPT-4o	1139	+36/-28
2	Gemini 1.5 Pro (May 2024)	1129	+25/-25
3	GPT-4 Turbo Preview	1088	+28/-32
4	Gemini 1.5 Pro (April 2024)	1054	+25/-25
5	Gemini 1.5 Flash Preview	1023	+32/-23
6	Claude 3 Opus	941	+26/-26
7	Llama 3 70B Instruct	934	+25/-25
8	Claude 3 Sonnet	896	+19/-33
8	Gemini 1.0 Pro	896	+25/-24
10	Mistral Large	895	+25/-23

If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To ensure leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.