LLM Benchmark Analysis 2024

231 automated tests reveal clear performance tiers, surprising failures, and critical insights for production use.

Executive Summary

Overall Performance Ranking

1. Claude Opus 4.5 (TEMP 0.7): 10/11 Tests Passed
2. Gemini 3 Pro (TEMP 0.35): 10/11 Tests Passed
3. Claude Sonnet 4.5 (TEMP 0.7): 9/11 Tests Passed
4. GPT-5.1 Codex: 9/11 Tests Passed
5. DeepSeek V3.2: 8/11 Tests Passed

Critical Failure Analysis

Scrypt Hash Test: 4 models failed due to incorrect library imports or parameter handling.

The CDN Import Challenge

The scrypt test proved particularly challenging, with only 4 of 8 models passing. The failures reveal a critical gap in LLM knowledge: correct library import paths for browser environments.

Library-Specific Knowledge

Models that used cdn.skypack.dev or incorrect version paths consistently failed.
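To make the failure mode concrete, here is a minimal sketch of the kind of browser-side scrypt call such a test targets. It assumes the scrypt-js package served as an ES module (for example from https://esm.sh/scrypt-js@3.0.1); the specific library, CDN, and parameters the benchmark expected are not spelled out here, so treat the names and values below as illustrative only.

```ts
// Illustrative sketch only: the library, CDN URL, and parameter values
// are assumptions, not details taken from the benchmark itself.
// In a browser you would import directly from a CDN ESM build, e.g.
//   import { scrypt } from "https://esm.sh/scrypt-js@3.0.1";
import { scrypt } from "scrypt-js";

async function deriveKey(password: string, salt: string): Promise<Uint8Array> {
  const enc = new TextEncoder();
  // Standard scrypt parameters: cost factor N, block size r,
  // parallelization p, and derived-key length in bytes.
  const N = 16384;
  const r = 8;
  const p = 1;
  const dkLen = 32;
  return scrypt(enc.encode(password), enc.encode(salt), N, r, p, dkLen);
}

// Usage: hex-encode the derived key for display or comparison.
deriveKey("correct horse battery staple", "some-salt").then((key) => {
  const hex = Array.from(key, (b) => b.toString(16).padStart(2, "0")).join("");
  console.log(hex);
});
```

The two commented points, the import path and the parameter list, are exactly where the failing models went wrong according to the results above.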

Performance Insights

[Chart: per-model pass rates for Claude Opus, Gemini 3 Pro, Claude Sonnet, and GPT-5.1 Codex]

Key Findings

Temperature matters: Gemini 3 Pro at temperature 0.35 outperformed the same model at default settings (see the configuration sketch after this list).
Claude Opus demonstrated superior library knowledge and implementation accuracy.
Grok-4 and Minimax M2 showed significant weaknesses in complex implementations.
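As a reference for the temperature finding above, here is a hedged sketch of how a 0.35 temperature might be pinned through Google's @google/generative-ai JavaScript SDK; the model identifier and prompt are placeholders, and the benchmark's actual harness is not described in this article.

```ts
// Sketch under assumptions: the model id and prompt below are placeholders;
// the benchmark's real harness and model string are not given in the article.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
const model = genAI.getGenerativeModel({
  model: "gemini-pro", // placeholder model id
  generationConfig: { temperature: 0.35 }, // the lowered temperature noted above
});

const result = await model.generateContent(
  "Derive a 32-byte key with scrypt in the browser."
);
console.log(result.response.text());
```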

For production-grade code generation: Claude Opus 4.5 at TEMP 0.7 remains the most reliable choice across diverse coding challenges.