![](https://cdn.statically.io/img/model-eval-leaderboard-git-main-scaleai.vercel.app/_next/image?url=%2Fassets%2Fseal-logo-gradient.png&w=256&q=75)
Leaderboards
Expert-Driven Private Evaluations
Private Datasets
Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.
Evolving Competition
We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.
Expert Evaluations
Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.
Coding→
| Model | Score | 95% Confidence |
|---|---|---|
| | 1176 | +34/-36 |
| | 1146 | +25/-27 |
| 3rd | 1138 | +31/-29 |
| | 1095 | +27/-30 |
| | 1060 | +26/-26 |
| | 1026 | +28/-26 |
| | 996 | +26/-27 |
| | 981 | +27/-27 |
| | 980 | +24/-26 |
| | 915 | +26/-25 |
| | 790 | +30/-31 |
| | 699 | +36/-38 |
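
As a rough aid to reading the confidence column, the sketch below turns a score and its margin into interval bounds and checks whether two intervals overlap. It assumes the `+a/-b` notation means the 95% interval runs from `score - b` to `score + a`; the helper names `interval` and `overlaps` are hypothetical, and overlap is only an informal heuristic, not a formal significance test.

```python
import re

def interval(score: float, margin: str) -> tuple[float, float]:
    """Convert a score and a '+a/-b' margin string into (lower, upper) bounds.

    Assumption: '+a/-b' gives the distances from the score to the upper
    and lower ends of the 95% confidence interval.
    """
    match = re.fullmatch(r"\+([\d.]+)/-([\d.]+)", margin.strip())
    if match is None:
        raise ValueError(f"unrecognized margin format: {margin!r}")
    plus, minus = float(match.group(1)), float(match.group(2))
    return score - minus, score + plus

def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Values taken from the first, second, and last rows of the Coding table.
first = interval(1176, "+34/-36")
second = interval(1146, "+25/-27")
last = interval(699, "+36/-38")

print(first, second, overlaps(first, second))
print(first, last, overlaps(first, last))
```

Under that assumption, the top two Coding entries have overlapping intervals, so their ordering is less certain than the gap between the top and bottom entries.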
Instruction Following→
| Model | Score | 95% Confidence |
|---|---|---|
| | 90.80 | +1.50/-1.60 |
| 2nd | 88.60 | +1.50/-1.40 |
| | 87.60 | +1.40/-1.30 |
| | 85.50 | +1.80/-1.70 |
| | 85.40 | +1.50/-1.60 |
| | 85.20 | +1.70/-1.80 |
| | 84.70 | +1.60/-1.60 |
| | 83.30 | +1.70/-1.80 |
| | 82.90 | +2.00/-2.00 |
| | 82.60 | +1.90/-1.90 |
| | 73.60 | +2.40/-2.30 |
| | 66.80 | +2.10/-2.20 |
Math→
If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To preserve leaderboard integrity, a model can be featured only the first time its organization encounters the prompts.