[D-36] Fine-tuning (Line Completion)

Posted Jan 20, 2026

By Yoojin Kim

2 min read

📋 Work Summary (2026-01-20)

🎯 Goal

Fine-tune the BiGS and Mamba models on the CodeXGLUE line completion task.

🔴 Problems Found

Dataset Problem (Root Cause)

The gt field of the code_x_glue_cc_code_completion_line dataset is completely empty. Result: loss=0.0, eval_loss=NaN.
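A quick way to catch this kind of problem before training is to count how many records have an empty target field. The sketch below is a minimal illustration with toy rows, not the check actually used in this work; the `input` field name and the row contents are assumptions, and only `gt` comes from the notes above.

```python
from collections import Counter

def summarize_gt(samples, field="gt"):
    """Count samples whose target field is missing, empty, or whitespace-only."""
    counts = Counter()
    for sample in samples:
        value = sample.get(field)
        is_empty = value is None or not str(value).strip()
        counts["empty" if is_empty else "non_empty"] += 1
    return counts

# Toy rows standing in for dataset records (hypothetical content).
toy_rows = [
    {"input": "def add(a, b):", "gt": ""},
    {"input": "import os", "gt": "   "},
    {"input": "x = 1", "gt": "y = x + 1"},
]

stats = summarize_gt(toy_rows)
print(stats["empty"], stats["non_empty"])
```

Any iterable of dict-like records should work the same way, so a loaded dataset split could be passed in place of `toy_rows`. If every record lands in the `empty` bucket, no amount of training configuration will fix the loss.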

Context Truncation Problem

When long code is truncated to max_length=512, the target (gt) is cut off. As a result, 39% of the samples have zero valid labels.
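The effect can be reproduced with plain lists of token IDs. This is a sketch under the assumption that context and target are concatenated and then truncated on the right; the helper names and the toy IDs are hypothetical, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def build_example(context_ids, target_ids, max_length):
    """Concatenate context and target, mask the context, then truncate on the right."""
    input_ids = (context_ids + target_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(context_ids) + target_ids)[:max_length]
    return input_ids, labels

def count_valid_labels(labels):
    """Number of positions that actually contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)

short_ctx = list(range(100))   # context that fits comfortably
long_ctx = list(range(600))    # context longer than max_length
target = [9001, 9002, 9003]    # toy target token IDs

_, short_labels = build_example(short_ctx, target, max_length=512)
_, long_labels = build_example(long_ctx, target, max_length=512)

print(count_valid_labels(short_labels))  # 3
print(count_valid_labels(long_labels))   # 0
```

With the long context, every surviving label is masked, so the sample contributes nothing to the loss. A batch made up only of such samples averages over zero positions, which is one way a NaN loss can appear.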

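API Compatibility Problem

The signature of the HuggingFace Trainer's log() method changed: it now takes a start_time parameter, so a custom log() override must accept it.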
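The failure mode is easy to see with stand-in classes. The sketch below does not use the real Trainer; `BaseTrainer`, `OldStyleTrainer`, `TolerantTrainer`, and `step()` are hypothetical names that only mimic a parent class passing a second positional argument to `log()`.

```python
import time

class BaseTrainer:
    """Stand-in for a trainer whose training loop passes start_time to log()."""

    def log(self, logs, start_time=None):
        self.last_logs = dict(logs)

    def step(self):
        # Newer calling convention: start_time is passed as a second argument.
        self.log({"loss": 0.5}, time.time())

class OldStyleTrainer(BaseTrainer):
    def log(self, logs):  # written against the older one-argument signature
        super().log(logs)

class TolerantTrainer(BaseTrainer):
    def log(self, logs, start_time=None, **kwargs):
        super().log(logs, start_time=start_time, **kwargs)

def runs_cleanly(trainer):
    """Return True if one training step can call log() without a TypeError."""
    try:
        trainer.step()
        return True
    except TypeError:
        return False

print(runs_cleanly(OldStyleTrainer()))  # False
print(runs_cleanly(TolerantTrainer()))  # True
```

Giving `start_time` a default of `None` keeps the override usable with callers that still pass only `logs`.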

✅ Solutions

Dataset switch:

code_x_glue_cc_code_completion_line (broken) → code_x_glue_cc_code_completion_token (works). 100K train, 50K test samples.

Lines are separated by a delimiter token.

Improved preprocessing logic:

```python
# Excerpt from inside the preprocessing function.
# ★ Key point: measure the target length first, then cap the context
target_length = len(tokenizer(target_text)['input_ids'])
max_context_length = max(50, max_length - target_length - 10)
context_encoded = tokenizer(context_text, truncation=True, max_length=max_context_length)

# ★ Validation: skip the sample if it has no valid labels
if sum(1 for l in labels if l != -100) == 0:
    return {'input_ids': [], 'labels': []}
```

📁 Generated Files

BiGS:
/mnt/user-data/outputs/finetune_line_bigs_fixed.py ✅
/mnt/user-data/outputs/bigs-line-fixed-sbatch.sh ✅
Output path: /storage/athene/work/kim/causal/output_line_completion_fixed
Logs: line_bigs_fixed_output.txt, line_bigs_fixed_error.txt

Mamba:
/mnt/user-data/outputs/finetune_line_mamba_fixed.py ✅
/mnt/user-data/outputs/mamba-line-fixed-sbatch.sh ✅
Output path: /storage/athene/work/kim/mamba/output_line_completion_fixed
Logs: line_mamba_fixed_output.txt, line_mamba_fixed_error.txt

🔧 Final Changes (completed today)

Triton cache & WandB settings added (at the very top of the script):

```python
import os

# Keep the Triton cache inside the working directory
triton_cache = os.path.join(os.getcwd(), "triton_cache")
os.environ["XDG_CACHE_HOME"] = triton_cache
os.environ["TRITON_CACHE_DIR"] = triton_cache
```

log() method signature fixed:

```python
# Method of the custom Trainer subclass
def log(self, logs, start_time=None, **kwargs):
    super().log(logs, start_time=start_time, **kwargs)
```

🚀 Tasks for Tomorrow

Copy the files and launch the jobs:

```bash
# BiGS
cd /storage/athene/work/kim/causal
cp /mnt/user-data/outputs/finetune_line_bigs_fixed.py ./
sbatch bigs-line-fixed-sbatch.sh

# Mamba
cd /storage/athene/work/kim/mamba
cp /mnt/user-data/outputs/finetune_line_mamba_fixed.py ./
sbatch mamba-line-fixed-sbatch.sh
```

Training monitoring

Check the logs: tail -f line_*_fixed_error.txt

Things to verify:
✅ Loss shows real values
✅ Eval loss is not NaN
✅ Perplexity is computed
✅ Accuracy metric is reported

After training completes

Confirm the model was saved: output_line_completion_fixed/ directory
Final performance comparison: BiGS vs Mamba

📌 Notes

Pretrained model paths:
BiGS: /storage/athene/work/kim/causal/output4/fim_code
Mamba: /storage/athene/work/kim/mamba/output_mamba/fim_code

Dataset size: 9,000 train, 1,000 eval

Hyperparameters:
BiGS LR: 5e-5
Mamba LR: 1e-4
Batch size: 8 (gradient accumulation: 4)
Epochs: 3
Max length: 512

Previous failed runs: the output_line_completion directory (without the _fixed suffix) can be ignored.
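The monitoring checks listed for the next run (real loss values, a non-NaN eval loss, a computable perplexity) can be expressed as a small helper. This is a sketch, not part of the training scripts: the function names and the metric dictionary keys are assumptions, and perplexity is taken as the exponential of the mean cross-entropy eval loss.

```python
import math

def perplexity_from_loss(eval_loss):
    """Return exp(loss), or None when the loss is missing, NaN, or infinite."""
    if eval_loss is None or not math.isfinite(eval_loss):
        return None
    return math.exp(eval_loss)

def sanity_check(metrics):
    """Summarize whether logged metrics look like a healthy run."""
    loss = metrics.get("loss")
    eval_loss = metrics.get("eval_loss")
    return {
        "loss_is_real": loss is not None and math.isfinite(loss) and loss > 0.0,
        "eval_loss_is_finite": eval_loss is not None and math.isfinite(eval_loss),
        "perplexity": perplexity_from_loss(eval_loss),
    }

# The symptoms of the broken run versus a hypothetical healthy one.
broken = sanity_check({"loss": 0.0, "eval_loss": float("nan")})
healthy = sanity_check({"loss": 1.2, "eval_loss": 1.0})

# Batch size x gradient accumulation steps, assuming a single device.
effective_batch_size = 8 * 4

print(broken)
print(healthy)
print(effective_batch_size)
```

The `broken` case reproduces the loss=0.0 and eval_loss=NaN symptoms from the failed runs, so all three checks fail for it. The values in the `healthy` case are illustrative only.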

Thesis

log thesis

This post is licensed under CC BY 4.0 by the author.
