Leaderboards

Private Datasets

Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.

Learn more about our evaluation methodology here →

Coding

Model    Score    95% Confidence
         1176     +34/-36
         1146     +25/-27
         1138     +31/-29
         1095     +27/-30
         1060     +26/-26
         1026     +28/-26
         996      +26/-27
         981      +27/-27
         980      +24/-26
         915      +26/-25
         790      +30/-31
         699      +36/-38
*Potential contamination warning: Claude Sonnet 3.5 was evaluated six weeks after Claude 3, which may have allowed Anthropic to access the prompt set from API logs. However, Anthropic's policy states that it does not train on this data.
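
As a rough reading aid for the score and 95% confidence columns, the sketch below turns a score with its +/- bounds into an interval and checks whether two intervals overlap. The `Entry` type and `intervals_overlap` helper are hypothetical names introduced for illustration, not part of Scale's methodology. Overlapping intervals are only an informal hint that two scores may not be clearly separated, not a formal significance test.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row: a score plus its asymmetric 95% confidence bounds."""
    score: float
    plus: float   # upper bound offset, e.g. 34 for "+34"
    minus: float  # lower bound offset, e.g. 36 for "-36"

    @property
    def interval(self) -> Tuple[float, float]:
        # (lower, upper) ends of the confidence interval
        return (self.score - self.minus, self.score + self.plus)


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True if the two confidence intervals share any point."""
    a_lo, a_hi = a.interval
    b_lo, b_hi = b.interval
    return a_lo <= b_hi and b_lo <= a_hi


# Values copied from the Coding table: first, second, and fourth rows.
first = Entry(1176, 34, 36)
second = Entry(1146, 25, 27)
fourth = Entry(1095, 27, 30)

print(first.interval, second.interval, intervals_overlap(first, second))
print(first.interval, fourth.interval, intervals_overlap(first, fourth))
```

Under this reading, the first and second Coding rows have overlapping intervals, while the first and fourth do not.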

Instruction Following

Model    Score    95% Confidence
         90.80    +1.50/-1.60
         88.60    +1.50/-1.40
         87.60    +1.40/-1.30
         85.50    +1.80/-1.70
         85.40    +1.50/-1.60
         85.20    +1.70/-1.80
         84.70    +1.60/-1.60
         83.30    +1.70/-1.80
         82.90    +2.00/-2.00
         82.60    +1.90/-1.90
         73.60    +2.40/-2.30
         66.80    +2.10/-2.20
*Potential contamination warning: Claude Sonnet 3.5 was evaluated six weeks after Claude 3, which may have allowed Anthropic to access the prompt set from API logs. However, Anthropic's policy states that it does not train on this data.

Math

Model    Score    95% Confidence
         95.19    +1.21/-1.21
         95.10    +1.22/-1.21
         94.85    +1.25/-1.24
         93.28    +1.41/-1.42
         92.28    +1.51/-1.50
         90.54    +1.65/-1.65
         90.12    +1.69/-1.68
         90.12    +1.69/-1.68
         87.47    +1.87/-1.87
         79.83    +2.27/-2.26
         37.51    +2.73/-2.73

Spanish

Model    Score    95% Confidence
         1139     +36/-28
         1129     +25/-25
         1088     +28/-32
         1054     +25/-25
         1023     +32/-23
         941      +26/-26
         934      +25/-25
         896      +19/-33
         896      +25/-24
         895      +25/-23

If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To ensure leaderboard integrity, a model can be featured only the FIRST TIME its organization encounters the prompts.