[D-30] Fine-tuning (Token Completion)
#Token-level Fine-tuning <==> Raw results;
<BiGS>
(_) [kim@slurm-login causal]$ cat token_completion_py150/test_results.json
{
"eval_loss": 1.1523972749710083,
"eval_accuracy": 0.727497134796791,
"eval_runtime": 7.1555,
"eval_samples_per_second": 13.975,
"eval_steps_per_second": 1.817,
"epoch": 5.0
}
(_) [kim@slurm-login causal]$ cat token_completion_java/test_results.json
{
"eval_loss": 1.3984194993972778,
"eval_accuracy": 0.6893705287220082,
"eval_runtime": 6.9239,
"eval_samples_per_second": 14.443,
"eval_steps_per_second": 1.878,
"epoch": 5.0
}
<MAMBA>
(_) [kim@slurm-login mamba]$ cat token_completion_py150/test_results.json
{
"eval_loss": 0.9129804968833923,
"eval_accuracy": 0.7614387728114256,
"eval_runtime": 6.0657,
"eval_samples_per_second": 16.486,
"eval_steps_per_second": 2.143,
"epoch": 4.998736842105263
}
(_) [kim@slurm-login mamba]$ cat token_completion_java/test_results.json
{
"eval_loss": 1.0860016345977783,
"eval_accuracy": 0.7338610676036067,
"eval_runtime": 5.6439,
"eval_samples_per_second": 17.718,
"eval_steps_per_second": 2.303,
"epoch": 5.0
}
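🔍 Key findings
- Mamba consistently outperforms BiGS
Python: 3.39 percentage points higher accuracy
Java: 4.45 percentage points higher accuracy
Mamba has a clear lead in both languages
- Loss analysis
Mamba's eval loss is markedly lower than BiGS's (Python: 0.913 vs 1.152; Java: 1.086 vs 1.398), which suggests better convergence
- Performance by language
Both models do better on Python, which suggests Java is the harder task
- Inference efficiency
Mamba is faster:
Python: 16.49 samples/sec (Mamba) vs 13.98 (BiGS)
Java: 17.72 samples/sec (Mamba) vs 14.44 (BiGS)
That is roughly 18-23% higher evaluation throughput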
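The gaps quoted above follow directly from the raw JSON values. Here is a minimal sketch of that arithmetic, with the numbers copied from the `test_results.json` outputs; the `RESULTS` layout and the helper names are illustrative and not part of the original scripts.

```python
# Values copied from the test_results.json outputs shown earlier.
# The dictionary layout and function names are assumptions for illustration.
RESULTS = {
    "py150": {
        "bigs": {"eval_accuracy": 0.727497134796791, "eval_samples_per_second": 13.975},
        "mamba": {"eval_accuracy": 0.7614387728114256, "eval_samples_per_second": 16.486},
    },
    "java": {
        "bigs": {"eval_accuracy": 0.6893705287220082, "eval_samples_per_second": 14.443},
        "mamba": {"eval_accuracy": 0.7338610676036067, "eval_samples_per_second": 17.718},
    },
}


def accuracy_gap_pp(dataset: str) -> float:
    """Mamba minus BiGS accuracy, in percentage points."""
    r = RESULTS[dataset]
    return 100.0 * (r["mamba"]["eval_accuracy"] - r["bigs"]["eval_accuracy"])


def throughput_gain_pct(dataset: str) -> float:
    """Relative gain of Mamba over BiGS in eval samples/sec, in percent."""
    r = RESULTS[dataset]
    ratio = r["mamba"]["eval_samples_per_second"] / r["bigs"]["eval_samples_per_second"]
    return 100.0 * (ratio - 1.0)


for name in RESULTS:
    print(
        f"{name}: +{accuracy_gap_pp(name):.2f} pp accuracy, "
        f"+{throughput_gain_pct(name):.1f}% samples/sec"
    )
```

Each evaluation run here lasted only about 6-7 seconds, so the throughput figures are probably noisier than the accuracy figures; repeating the runs would give a firmer speed comparison.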
Token-Level Code Completion
We evaluate both architectures on token-level code completion tasks using the Py150 (Python) and Java benchmarks from CodeXGLUE.
Results (Table X):
- Mamba achieves 76.14% accuracy on Python, outperforming BiGS (72.75%) by 3.39 percentage points
- On Java, Mamba reaches 73.39% accuracy compared to BiGS’s 68.94%, a 4.45 percentage point improvement
- Mamba also shows 18-23% higher evaluation throughput (samples/sec) in these runs while maintaining higher accuracy
Key Observations:
- Both models score higher on Python than on Java, which suggests Java is the harder task in this setup; Python’s more regular syntax is one possible explanation
- Mamba’s lower eval loss (0.913 vs 1.152 on Python, 1.086 vs 1.398 on Java) suggests better convergence after 5 epochs of fine-tuning
- The accuracy gap widens on Java (4.45 vs 3.39 points), which may indicate that Mamba’s selective state space mechanism handles more complex syntax patterns better; these results alone do not confirm that
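The loss gaps can be easier to read as perplexity. The sketch below assumes `eval_loss` is the mean per-token cross-entropy in nats, which is the usual convention for causal language model evaluation but is not confirmed by the outputs above, so treat the resulting values as indicative only. The `EVAL_LOSS` layout and the `perplexity` helper are illustrative names.

```python
import math

# eval_loss values copied from the test_results.json outputs.
# Assumption: each is a mean per-token cross-entropy in nats.
EVAL_LOSS = {
    ("py150", "BiGS"): 1.1523972749710083,
    ("py150", "Mamba"): 0.9129804968833923,
    ("java", "BiGS"): 1.3984194993972778,
    ("java", "Mamba"): 1.0860016345977783,
}


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy (in nats) to perplexity."""
    return math.exp(loss)


for (dataset, model), loss in EVAL_LOSS.items():
    print(f"{dataset:5s} {model:5s} loss={loss:.3f} perplexity={perplexity(loss):.2f}")
```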
🎯 Suggested next steps
- Run the line completion experiments
- Use the scripts that are already prepared
- Evaluate both Python and Java
- Build the results table

| Task | Language | Mamba | BiGS | CodeGPT* | Δ (Mamba - CodeGPT*) |
|------|----------|-------|------|----------|----------------------|
| Token | Python | 76.14 | 72.75 | 76.60 | -0.46 |
| Token | Java | 73.39 | 68.94 | - | - |
| Line | Python | ? | ? | - | - |
| Line | Java | ? | ? | - | - |
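One way to keep the table in sync as results arrive is to generate its rows from a small dictionary. This is only a sketch: the `SCORES` layout and the `format_table` helper are hypothetical, the token-level numbers are the ones reported above, and pending line-completion cells stay as `?`. The CodeGPT column is omitted here, so the delta column compares Mamba with BiGS instead.

```python
from typing import Dict, Optional, Tuple

# (task, language) -> {model: score in %}; None marks a run that is not available yet.
# Hypothetical layout for illustration; token-level scores are copied from the results above.
SCORES: Dict[Tuple[str, str], Dict[str, Optional[float]]] = {
    ("Token", "Python"): {"Mamba": 76.14, "BiGS": 72.75},
    ("Token", "Java"): {"Mamba": 73.39, "BiGS": 68.94},
    ("Line", "Python"): {"Mamba": None, "BiGS": None},
    ("Line", "Java"): {"Mamba": None, "BiGS": None},
}


def cell(value: Optional[float], signed: bool = False) -> str:
    """Render a score, or '?' when the run is still pending."""
    if value is None:
        return "?"
    return f"{value:+.2f}" if signed else f"{value:.2f}"


def format_table(scores: Dict[Tuple[str, str], Dict[str, Optional[float]]]) -> str:
    """Build a pipe table with one row per (task, language) pair."""
    lines = [
        "| Task | Language | Mamba | BiGS | Δ (Mamba - BiGS) |",
        "|------|----------|-------|------|------------------|",
    ]
    for (task, language), row in scores.items():
        mamba, bigs = row.get("Mamba"), row.get("BiGS")
        # The delta is only defined once both runs have finished.
        delta = None if mamba is None or bigs is None else mamba - bigs
        lines.append(
            f"| {task} | {language} | {cell(mamba)} | {cell(bigs)} | {cell(delta, signed=True)} |"
        )
    return "\n".join(lines)


print(format_table(SCORES))
```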
This post is licensed under CC BY 4.0 by the author.