[D-30] Fine-tuning (Token Completion)

# Token-level Fine-tuning: Raw Results

<BiGS>
(_) [kim@slurm-login causal]$ cat token_completion_py150/test_results.json
{
  "eval_loss": 1.1523972749710083,
  "eval_accuracy": 0.727497134796791,
  "eval_runtime": 7.1555,
  "eval_samples_per_second": 13.975,
  "eval_steps_per_second": 1.817,
  "epoch": 5.0
}
(_) [kim@slurm-login causal]$ cat token_completion_java/test_results.json
{
  "eval_loss": 1.3984194993972778,
  "eval_accuracy": 0.6893705287220082,
  "eval_runtime": 6.9239,
  "eval_samples_per_second": 14.443,
  "eval_steps_per_second": 1.878,
  "epoch": 5.0
}

<MAMBA>
(_) [kim@slurm-login mamba]$ cat token_completion_py150/test_results.json
{
  "eval_loss": 0.9129804968833923,
  "eval_accuracy": 0.7614387728114256,
  "eval_runtime": 6.0657,
  "eval_samples_per_second": 16.486,
  "eval_steps_per_second": 2.143,
  "epoch": 4.998736842105263
}
(_) [kim@slurm-login mamba]$ cat token_completion_java/test_results.json
{
  "eval_loss": 1.0860016345977783,
  "eval_accuracy": 0.7338610676036067,
  "eval_runtime": 5.6439,
  "eval_samples_per_second": 17.718,
  "eval_steps_per_second": 2.303,
  "epoch": 5.0
}
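
The four result files above can be gathered into one summary instead of being `cat`-ed one at a time. The sketch below is only an illustration: the `causal/` and `mamba/` directory names are inferred from the shell prompts, and `collect_results` and `format_summary` are hypothetical helper names, not part of any existing script.

```python
import json
from pathlib import Path

# Assumed layout, inferred from the shell prompts above:
#   <root>/causal/token_completion_py150/test_results.json  (BiGS)
#   <root>/mamba/token_completion_py150/test_results.json   (Mamba)
MODEL_DIRS = {"BiGS": "causal", "Mamba": "mamba"}
TASK_DIRS = {"Python": "token_completion_py150", "Java": "token_completion_java"}


def collect_results(root):
    """Return {(model, language): metrics dict} for every results file found."""
    results = {}
    for model, model_dir in MODEL_DIRS.items():
        for language, task_dir in TASK_DIRS.items():
            path = Path(root) / model_dir / task_dir / "test_results.json"
            if path.is_file():  # silently skip runs that have not finished
                with path.open() as f:
                    results[(model, language)] = json.load(f)
    return results


def format_summary(results):
    """Render accuracy (%), loss, and throughput as fixed-width text rows."""
    lines = [f"{'Model':<6} {'Lang':<7} {'Acc %':>6} {'Loss':>6} {'Samples/s':>10}"]
    for (model, language), m in sorted(results.items()):
        lines.append(
            f"{model:<6} {language:<7} {100 * m['eval_accuracy']:>6.2f} "
            f"{m['eval_loss']:>6.3f} {m['eval_samples_per_second']:>10.2f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Run from the directory that contains causal/ and mamba/.
    print(format_summary(collect_results(".")))
```

Missing files are skipped rather than raising, so the same helper should still work later when only some of the line completion runs have finished.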

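🔍 Key Findings

  1. Mamba consistently outperforms BiGS
    • Python: 3.39 percentage points higher accuracy
    • Java: 4.45 percentage points higher accuracy
    • Mamba has a clear lead in both languages
  2. Loss analysis
    • Mamba's evaluation loss is markedly lower than BiGS's (Python: 0.913 vs 1.152; Java: 1.086 vs 1.398)
    • This suggests better convergence
  3. Performance by language
    • Both models perform better on Python
    • This suggests Java is the harder task
  4. Inference efficiency
    • Python: 16.49 samples/sec (Mamba) vs 13.98 (BiGS)
    • Java: 17.72 samples/sec (Mamba) vs 14.44 (BiGS)
    • Mamba is roughly 18-23% faster at evaluation time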
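
The gaps quoted in the findings can be re-derived from the raw metrics. The snippet below is a minimal check, with the values copied from the `test_results.json` output; `accuracy_gap_pp` and `speedup_pct` are hypothetical helper names introduced for illustration. Throughput here comes from a single evaluation run per model, so the speedup figure should be read as indicative rather than as a controlled benchmark.

```python
# Metrics copied from the test_results.json files shown earlier in this post.
RESULTS = {
    ("BiGS", "Python"): {"acc": 0.727497134796791, "samples_per_s": 13.975},
    ("BiGS", "Java"): {"acc": 0.6893705287220082, "samples_per_s": 14.443},
    ("Mamba", "Python"): {"acc": 0.7614387728114256, "samples_per_s": 16.486},
    ("Mamba", "Java"): {"acc": 0.7338610676036067, "samples_per_s": 17.718},
}


def accuracy_gap_pp(language):
    """Mamba accuracy minus BiGS accuracy, in percentage points."""
    mamba = RESULTS[("Mamba", language)]["acc"]
    bigs = RESULTS[("BiGS", language)]["acc"]
    return 100 * (mamba - bigs)


def speedup_pct(language):
    """Relative throughput gain of Mamba over BiGS, in percent."""
    mamba = RESULTS[("Mamba", language)]["samples_per_s"]
    bigs = RESULTS[("BiGS", language)]["samples_per_s"]
    return 100 * (mamba / bigs - 1)


if __name__ == "__main__":
    for language in ("Python", "Java"):
        print(
            f"{language}: +{accuracy_gap_pp(language):.2f} pp accuracy, "
            f"+{speedup_pct(language):.1f}% samples/sec"
        )
```

Note that the accuracy gap is an absolute difference (percentage points), while the speedup is a relative change (percent), which is why the two are computed differently.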

Token-Level Code Completion

We evaluate both architectures on token-level code completion tasks using the Py150 (Python) and Java benchmarks from CodeXGLUE.

Results (Table X):

  • Mamba achieves 76.14% accuracy on Python, outperforming BiGS (72.75%) by 3.39 percentage points
  • On Java, Mamba reaches 73.39% accuracy compared to BiGS’s 68.94%, a 4.45 point improvement
  • Mamba demonstrates 18-23% faster inference speed while maintaining superior accuracy

Key Observations:

  1. Both models perform better on Python than on Java, consistent with prior work showing Python’s more regular syntax
  2. Mamba’s lower evaluation loss (0.913 vs 1.152 on Python, 1.086 vs 1.398 on Java) suggests better convergence during training
  3. The performance gap widens on Java (4.45 vs 3.39 points), suggesting Mamba’s selective state space mechanism better handles more complex syntax patterns
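
One way to make the loss values easier to interpret is to convert them to perplexity. The sketch below assumes that `eval_loss` is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not confirmed by the output above. Under that assumption, perplexity is simply `exp(loss)`. The `perplexity` helper is a hypothetical name used for illustration.

```python
import math

# eval_loss values copied from the test_results.json files in this post.
EVAL_LOSS = {
    ("BiGS", "Python"): 1.1523972749710083,
    ("BiGS", "Java"): 1.3984194993972778,
    ("Mamba", "Python"): 0.9129804968833923,
    ("Mamba", "Java"): 1.0860016345977783,
}


def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy given in nats (assumed unit)."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    for (model, language), loss in sorted(EVAL_LOSS.items()):
        print(f"{model:<6} {language:<7} loss={loss:.3f}  ppl={perplexity(loss):.2f}")
```

Because the conversion is monotonic, the ranking of the models does not change. It only restates the loss gap as roughly how many tokens each model is still choosing between at each step.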

🎯 Suggested Next Steps

  1. Run the line completion experiments
    • Run them with the scripts that are already prepared
    • Evaluate both Python and Java
  2. Build the results table

| Task | Language | Mamba | BiGS | CodeGPT* | Δ |
|------|----------|-------|------|----------|---|
| Token | Python | 76.14 | 72.75 | 76.60 | -0.46 |
| Token | Java | 73.39 | 68.94 | - | - |
| Line | Python | ? | ? | - | - |
| Line | Java | ? | ? | - | - |

This post is licensed under CC BY 4.0 by the author.