Model Performance Evaluation - Data Classification - sacreBLEU


Mr. Slumber 2024. 5. 14. 00:53


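(Concept) BLEU was originally a method for evaluating machine translation output, and sacreBLEU is a reference implementation of it. Today it is used alongside other metrics such as TER, ChrF, and BERTScore for the quantitative evaluation of LLM responses.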
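To make the idea of a quantitative score concrete, here is a minimal sketch of how a BLEU-style score can be computed for one response against one reference. It is not the sacreBLEU implementation: the function names `ngrams` and `simple_bleu` are hypothetical, tokenization is plain whitespace splitting, and no smoothing is applied, so real sacreBLEU scores will generally differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Sentence-level, single-reference BLEU on whitespace tokens (illustrative only)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or matches == 0:
            # Without smoothing, a single zero precision drives the whole score to 0.
            return 0.0
        log_precisions.append(math.log(matches / total))

    # Brevity penalty: hypotheses shorter than the reference are penalized.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))  # exact match
    print(simple_bleu("the cat sat on a mat", reference))    # one word differs
    print(simple_bleu("dogs bark loudly at night", reference))  # no overlap
```

An exact match scores highest, a near match scores lower, and a response with no shared n-grams scores zero. For real evaluations, the sacrebleu library is the better choice, since it standardizes tokenization and reports a version string so that scores are comparable across experiments.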

GitHub - mjpost/sacrebleu: Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons (github.com)
