LLM - Performance - Model Optimization - The paradox of 'test-time scaling'
A study of cases where increasing a model's reasoning length (number of reasoning steps) actually degrades performance found distinct failure modes across model families:
Claude models become increasingly distracted as their reasoning grows longer,
while OpenAI o-series models overfit to the framing of the task.
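To make the setup concrete, here is a minimal sketch of how such an evaluation could be run: sweep the model's reasoning-token budget and record accuracy at each setting, looking for accuracy that falls as test-time compute rises. The `query_model` helper, the dataset shape, and the budget values are hypothetical stand-ins for illustration, not the paper's actual harness; the repository linked below contains the real evaluation code.

```python
# Minimal sketch: measure accuracy as a function of test-time compute
# (reasoning-token budget) to check for an inverse-scaling trend.
# `query_model` is a hypothetical stand-in for a real LRM API call.
from typing import Callable

def query_model(prompt: str, reasoning_budget: int) -> str:
    """Hypothetical: return the model's final answer, with its reasoning
    capped at the given token budget. Wire this to an actual model API."""
    raise NotImplementedError

def accuracy_at_budget(dataset: list[tuple[str, str]], budget: int,
                       ask: Callable[[str, int], str] = query_model) -> float:
    """Fraction of (prompt, gold_answer) pairs answered correctly at one budget."""
    correct = sum(ask(prompt, budget).strip() == gold for prompt, gold in dataset)
    return correct / len(dataset)

def sweep_budgets(dataset: list[tuple[str, str]],
                  budgets: list[int]) -> dict[int, float]:
    """Accuracy per reasoning budget. Inverse scaling shows up as accuracy
    that decreases as the budget grows."""
    return {b: accuracy_at_budget(dataset, b) for b in budgets}

# Illustrative usage (budget values are assumptions, not from the paper):
# results = sweep_budgets(my_tasks, budgets=[1024, 2048, 4096, 8192])
# Falling accuracy at larger budgets would indicate inverse scaling.
```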
https://arxiv.org/abs/2507.14417
Inverse Scaling in Test-Time Compute
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy.
https://github.com/safety-research/inverse-scaling-ttc
GitHub - safety-research/inverse-scaling-ttc: Inverse Scaling in Test-Time Compute