LLM - Performance - Benchmarks - Extended NYT Connections
Mr. Slumber
2025. 12. 12. 13:19
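(Concept) A high-difficulty benchmark that extends and refines the regular NYT Connections puzzle for measuring artificial intelligence (AI) performance
On the Extended NYT Connections benchmark, the high-reasoning version of GPT‑5.2 improved from 69.9 to 77.9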
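
To put the 69.9 to 77.9 change in perspective, the gain can be worked out both in percentage points and relative to the earlier score. This is only a small arithmetic sketch; the variable names are illustrative and not taken from the benchmark repository.

```python
# Scores quoted in the post (the leaderboard's "Score %" values).
previous_score = 69.9
new_score = 77.9

# Absolute gain in percentage points; rounded to hide float noise.
absolute_gain = round(new_score - previous_score, 1)

# Relative gain as a percentage of the earlier score.
relative_gain = round(100 * (new_score - previous_score) / previous_score, 1)

print(f"absolute gain: {absolute_gain} points")
print(f"relative gain: {relative_gain}%")
```

The absolute figure is usually the one to compare against other leaderboard rows, since all rows are reported on the same 0 to 100 scale.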

| Rank | Model | Score % | # Puzzles |
|---|---|---|---|
| 1 | Gemini 3 Pro Preview | 96.8 | 759 |
| 2 | Grok 4.1 Fast Reasoning | 93.5 | 759 |
| 3 | Sherlock Think Alpha | 92.4 | 759 |
| 4 | Grok 4 Fast Reasoning | 92.1 | 759 |
| 5 | Grok 4 | 91.7 | 759 |
| 6 | Sonoma Sky Alpha | 90.7 | 759 |
| 7 | o3-pro (medium reasoning) | 87.3 | 759 |
| 8 | GPT-5 Pro | 83.9 | 759 |
| 9 | o1-pro (medium reasoning) | 82.5 | 651 |
| 10 | o3 (high reasoning) | 78.6 | 759 |
| 11 | GPT-5.2 (high reasoning) | 77.9 | 759 |
| 12 | GPT-5 (high reasoning) | 77.0 | 759 |
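
One detail worth checking when reading the table is the puzzle count: the o1-pro row lists 651 puzzles while its neighbours list 759, so its score may not be computed over the same puzzle set. The sketch below parses a few rows copied from the table and flags any row whose count differs from the most common one. The `parse_row` helper and the dictionary keys are hypothetical names chosen for this illustration.

```python
from collections import Counter

# Three rows copied verbatim from the leaderboard above.
ROWS = """\
| 8 | GPT-5 Pro | 83.9 | 759 |
| 9 | o1-pro (medium reasoning) | 82.5 | 651 |
| 10 | o3 (high reasoning) | 78.6 | 759 |
"""

def parse_row(line):
    """Split one pipe-delimited row into typed fields (hypothetical helper)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    rank, model, score, puzzles = cells
    return {
        "rank": int(rank),
        "model": model,
        "score": float(score),
        "puzzles": int(puzzles),
    }

rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Treat the most frequent puzzle count as the reference set size.
reference_count = Counter(r["puzzles"] for r in rows).most_common(1)[0][0]

# Rows scored on a different number of puzzles than the reference.
outliers = [r["model"] for r in rows if r["puzzles"] != reference_count]
print(reference_count, outliers)
```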
Correlation of puzzle-level results: heatmap

https://github.com/lechmazur/nyt-connections/