LLM - Performance - Model Optimization - The paradox of 'test-time scaling'
A study of cases where increasing a model's reasoning length (number of reasoning steps) actually degrades performance found distinct failure modes across model families:
Claude models become increasingly distracted as their reasoning grows longer,
while OpenAI o-series models overfit to the framing of the task.
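To make the setup concrete, here is a minimal sketch of how such an evaluation could be run: sweep the model's reasoning-token budget and record accuracy at each setting, looking for accuracy that falls as test-time compute rises. The `query_model` helper, the dataset shape, and the budget values are hypothetical stand-ins for illustration, not the paper's actual harness; the repository linked below contains the real evaluation code.

```python
# Minimal sketch: measure accuracy as a function of test-time compute
# (reasoning-token budget) to check for an inverse-scaling trend.
# `query_model` is a hypothetical stand-in for a real LRM API call.
from typing import Callable

def query_model(prompt: str, reasoning_budget: int) -> str:
    """Hypothetical: return the model's final answer, with its reasoning
    capped at the given token budget. Wire this to an actual model API."""
    raise NotImplementedError

def accuracy_at_budget(dataset: list[tuple[str, str]], budget: int,
                       ask: Callable[[str, int], str] = query_model) -> float:
    """Fraction of (prompt, gold_answer) pairs answered correctly at one budget."""
    correct = sum(ask(prompt, budget).strip() == gold for prompt, gold in dataset)
    return correct / len(dataset)

def sweep_budgets(dataset: list[tuple[str, str]],
                  budgets: list[int]) -> dict[int, float]:
    """Accuracy per reasoning budget. Inverse scaling shows up as accuracy
    that decreases as the budget grows."""
    return {b: accuracy_at_budget(dataset, b) for b in budgets}

# Illustrative usage (budget values are assumptions, not from the paper):
# results = sweep_budgets(my_tasks, budgets=[1024, 2048, 4096, 8192])
# Falling accuracy at larger budgets would indicate inverse scaling.
```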
https://arxiv.org/abs/2507.14417
Inverse Scaling in Test-Time Compute
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy.
https://github.com/safety-research/inverse-scaling-ttc
GitHub - safety-research/inverse-scaling-ttc: Inverse Scaling in Test-Time Compute