[D-71] train_fim 1

set up

  1. /kim/causal/modeling_bigs_gen.py
  2. /kim/causal/configuration_bigs.py
  3. /kim/causal/train_fim.py
  4. /shared/STG/zhang_dataset/: issues_eval.jsonl, issues_train.jsonl, code_train.jsonl
/mnt/gpfs/work/kim/causal/train_fim.py:590: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MyTrainer.__init__`. Use `processing_class` instead.
  trainer = MyTrainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1, 'pad_token_id': 2}.
  0%|          | 0/37116 [00:00<?, ?it/s]You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [653,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

[...]

I suspect this error occurs because vocab_size differs between the model and the tokenizer.

Note that just before this run I added the static / dynamic / fixed attributes to the configuration file.

I am getting tons of:

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [554,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.

We enabled synchronous error reporting (export CUDA_LAUNCH_BLOCKING=1 in sbatch script).
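
CUDA device-side asserts like the one above do not show which Python line triggered them. As a rough cross-check of the vocab-size suspicion, the same lookup can be tried on CPU, where `torch.nn.Embedding` raises an ordinary `IndexError`. The sketch below is not taken from train_fim.py: the two sizes are copied from the logs in this post, while the embedding width of 8 and the helper name `lookup_fails` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Sizes taken from the logs in this post; the embedding width (8) is arbitrary.
NUM_ROWS = 32101      # rows in the embedding matrix
BAD_TOKEN_ID = 32101  # largest token ID seen in the failing batch

embedding = nn.Embedding(NUM_ROWS, 8)


def lookup_fails(token_id: int) -> bool:
    """Return True if looking up token_id on CPU raises IndexError."""
    try:
        embedding(torch.tensor([[token_id]]))
        return False
    except IndexError:
        return True


print(lookup_fails(NUM_ROWS - 1))  # last valid row
print(lookup_fails(BAD_TOKEN_ID))  # one past the end
```

If the second call fails while the first succeeds, the assert is an off-the-end index into the embedding matrix rather than a kernel or driver problem.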

Debug train_fim / class MyTrainer / compute_loss:

# In train_fim.py, inside MyTrainer.compute_loss:

# ... (lines for max_len, input_ids, labels)

# --- START DEBUGGING BLOCK ---
max_vocab_size = model.config.vocab_size

# 1. Find the maximum token ID in the batch
max_input_id = input_ids.max().item()

# 2. Check if any ID exceeds the model's expected vocabulary size
if max_input_id >= max_vocab_size:
    print(f"\n[!!! CRITICAL DEBUGGING ERROR !!!]")
    print(f"Max token ID in batch: {max_input_id}")
    print(f"Model vocab size (num_embeddings): {max_vocab_size}")
    print(f"The maximum token ID is OUTSIDE the embedding matrix bounds.")
    
    # Check the embedding matrix size for confirmation
    try:
        current_embedding_size = model.bigs.embeddings.word_embeddings.weight.size(0)
        print(f"Actual embedding matrix size: {current_embedding_size}")
        if max_input_id >= current_embedding_size:
            print(f"ERROR: Max ID ({max_input_id}) >= Matrix Size ({current_embedding_size}).")
            
            # Print the input_ids tensor to find the tokens
            print("First 10 rows of input_ids:")
            print(input_ids[:10])
            
            # RAISE ERROR to stop execution cleanly
            raise ValueError("Token ID out of bounds. Check tokenizer and model resizing.")
            
    except AttributeError:
        # Fallback in case attribute access fails
        print("Could not access model.bigs.embeddings.word_embeddings.weight.size(0)")

# --- END DEBUGGING BLOCK ---

outputs = model(input_ids=input_ids)
logits  = outputs.logits
# ...

Embedding Size Update:

Before :: def build_fresh_model_and_load():
# BiGS-specific model loading
def build_fresh_model_and_load(config_name_or_path, tokenizer, orig_weight_path=None):
    
    # 1. Load BiGS Configuration from the imported class
    config = BiGSConfig()
    config.vocab_size = tokenizer.vocab_size
    
    # 2. Instantiate BiGSForCausalLM
    model = BiGSForCausalLM(config)
    
    # 3. Load initial weights if provided (e.g., for pre-trained checkpoints)
    if orig_weight_path and os.path.isfile(orig_weight_path):
        print(f"Loading initial weights from {orig_weight_path}")
        ckpt = torch.load(orig_weight_path, map_location="cpu")
        
        # We assume the checkpoint is a model state dict
        missing, unexpected = model.load_state_dict(ckpt, strict=False)
        print(f"  [Load State] Missing keys: {missing}")
        print(f"  [Load State] Unexpected keys: {unexpected}")
    
    # 4. Resize token embeddings (for sentinel and EOM tokens)
    model.resize_token_embeddings(tokenizer.vocab_size)  # correctly put? <<<
    
    return model
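
A possible answer to the question marked on the resize line: on Hugging Face tokenizers, `tokenizer.vocab_size` reports only the base vocabulary, while `len(tokenizer)` also counts tokens added afterwards. If the sentinel and EOM tokens were added to the tokenizer after loading (an assumption, since that code is not shown here), resizing to `tokenizer.vocab_size` would leave their IDs outside the matrix. The sketch below illustrates the difference; `embedding_rows_needed` and `FakeTokenizer` are hypothetical names, and the stand-in class only mimics the three tokenizer members used, with toy sizes.

```python
def embedding_rows_needed(tokenizer) -> int:
    """Rows the embedding matrix needs so every token ID is a valid index."""
    vocab = tokenizer.get_vocab()  # token -> id, including added tokens
    max_id = max(vocab.values())
    return max(len(tokenizer), max_id + 1)


class FakeTokenizer:
    """Stand-in that mimics the three tokenizer members used above."""

    def __init__(self, base_size: int, added: list[str]):
        self._vocab = {f"tok{i}": i for i in range(base_size)}
        for token in added:
            self._vocab[token] = len(self._vocab)
        self.vocab_size = base_size  # base vocabulary only

    def __len__(self) -> int:
        return len(self._vocab)  # base + added tokens

    def get_vocab(self) -> dict[str, int]:
        return dict(self._vocab)


tok = FakeTokenizer(base_size=10, added=["<SENTINEL>", "<EOM>"])
print(tok.vocab_size, len(tok), embedding_rows_needed(tok))
```

With a real tokenizer, the equivalent change in `build_fresh_model_and_load` would be to pass `len(tokenizer)` (or the helper's result) to `model.resize_token_embeddings` instead of `tokenizer.vocab_size`.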

After :: def build_fresh_model_and_load():
# ... (logic to create config and model) ...

    # 4. [CRITICAL FIX] Ensure the tokenizer's new vocab size is reflected.
    new_vocab_size = tokenizer.vocab_size

    # Update model config (should be done, but ensure)
    model.config.vocab_size = new_vocab_size
    
    # 5. Explicitly resize the model embeddings to the correct size.
    # This is the line that must successfully update the embedding weight matrix.
    model.resize_token_embeddings(new_vocab_size)

    # 6. [NEW ADDITION] Double-check the core embedding layer size manually.
    # This is the safety check that overrides potential internal bugs.
    if model.bigs.embeddings.word_embeddings.num_embeddings != new_vocab_size:
        # If the resizing was incomplete, manually force the correct num_embeddings.
        # This is a highly aggressive fix for this specific bug.
        model.bigs.embeddings.word_embeddings.num_embeddings = new_vocab_size
        print(f"[BiGS FIX] Manually set num_embeddings to {new_vocab_size}")

    return model
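
A caution on step 6: assuming `word_embeddings` is a standard `torch.nn.Embedding`, `num_embeddings` is just an attribute recorded at construction time, so assigning to it does not reallocate `weight`. The toy sketch below (sizes 4 and 5 are arbitrary) shows the attribute and the weight shape drifting apart, and what actually growing the matrix involves.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 3)          # 4 rows, valid IDs are 0..3
emb.num_embeddings = 5            # only changes the recorded attribute

print(emb.num_embeddings)         # attribute now says 5
print(tuple(emb.weight.shape))    # weight is still (4, 3)

# Growing for real means allocating a larger weight and copying the old rows.
grown = nn.Embedding(5, 3)
with torch.no_grad():
    grown.weight[:4] = emb.weight

out = grown(torch.tensor([4]))    # ID 4 is now a valid index
print(tuple(out.shape))           # (1, 3)
```

This is why the manual assignment cannot fix an out-of-bounds lookup on its own: only a resize that replaces the weight tensor changes which IDs are valid.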

After (second attempt) :: def build_fresh_model_and_load():
    # 4. [CRITICAL FIX] Ensure the tokenizer's new vocab size is reflected.
    new_vocab_size = tokenizer.vocab_size

    # Update model config (should be done, but ensure)
    model.config.vocab_size = new_vocab_size
    
    # 5. Explicitly resize the model embeddings to the correct size.
    # We found the required size is tokenizer.vocab_size (32101) + 1 to hold the max index (32101).
    # Since tokenizer.vocab_size is already 32101, we resize to this value.
    # The error suggests the *actual* necessary dimension is 32102.
    
    # We will resize to the new vocab size (32101) and then check.
    model.resize_token_embeddings(new_vocab_size)
    
    # --- Check for the 1-unit mismatch (the most common HF bug) ---
    actual_size = model.bigs.embeddings.word_embeddings.weight.size(0)
    if actual_size <= new_vocab_size:
        # If the actual size is less than or equal to the max ID + 1, we force it higher.
        # Max ID (32101) requires a size of 32102.
        required_size = new_vocab_size + 1 
        
        print(f"[BiGS FIX] Forcing final embedding resize to required size: {required_size}")
        model.resize_token_embeddings(required_size)
        model.config.vocab_size = required_size # Update config for consistency
        
        # Manually verify/force num_embeddings (as per previous debugging step)
        model.bigs.embeddings.word_embeddings.num_embeddings = required_size
        
        # Re-verify the actual weight size
        print(f"Final actual embedding matrix size after fix: {model.bigs.embeddings.word_embeddings.weight.size(0)}")


# --- DELETE the existing debugging block near the END of compute_loss ---
# ... (The block that raised the ValueError) ...

The error and output remain the same

error.txt

$ cat error.txt 
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
/mnt/gpfs/work/kim/causal/train_fim.py:654: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MyTrainer.__init__`. Use `processing_class` instead.
  trainer = MyTrainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1, 'pad_token_id': 2}.
  0%|          | 0/37116 [00:00<?, ?it/s]You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 705, in <module>
    main(stage2_only)
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 677, in main
    train_on_jsonl(
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 665, in train_on_jsonl
    trainer.train()
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 4020, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 459, in compute_loss
    raise ValueError("Token ID out of bounds. Check tokenizer and model resizing.")
ValueError: Token ID out of bounds. Check tokenizer and model resizing.
  0%|          | 0/37116 [00:00<?, ?it/s]

output.txt

$ cat output.txt 
[init] redirecting XDG_CACHE_HOME & TRITON_CACHE_DIR → /mnt/gpfs/work/kim/causal/triton_cache
Switch to torch vandermonde kernel.
[init] redirecting XDG_CACHE_HOME & TRITON_CACHE_DIR → /mnt/gpfs/work/kim/causal/triton_cache
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
Stage‐1: FIM Pre‐training on Issues…
[BiGS FIX] Forcing final embedding resize to required size: 32101
Final actual embedding matrix size after fix: 32101

[!!! CRITICAL DEBUGGING ERROR !!!]
Max token ID in batch: 32101
Model vocab size (num_embeddings): 32101
The maximum token ID is OUTSIDE the embedding matrix bounds.
Actual embedding matrix size: 32101
ERROR: Max ID (32101) >= Matrix Size (32101).
First 10 rows of input_ids:
tensor([[ 1846,  5055,   716,  ...,     2,     2,     2],
        [   17,  2725,   848,  ...,     2,     2,     2],
        [10891, 16048,   364,  ...,     2,     2,     2],
        ...,
        [   18,    22, 12648,  ...,     2,     2,     2],
        [12303,   471, 21127,  ...,     2,     2,     2],
        [ 1686,   628,  9039,  ...,     2,     2,     2]], device='cuda:0')
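
The output reports a maximum token ID of 32101 against a matrix of 32101 rows, so the valid indices end at 32100 and the matrix is at least one row short. The truncated tensor print does not show which IDs are the problem. A small helper along these lines could list them; `out_of_range_ids` is a hypothetical name and the batch below is made up for illustration, with only the sizes taken from the logs.

```python
import torch


def out_of_range_ids(input_ids: torch.Tensor, num_rows: int) -> list[int]:
    """Return the sorted distinct token IDs that are not valid row indices."""
    bad = input_ids[(input_ids < 0) | (input_ids >= num_rows)]
    return torch.unique(bad).tolist()


# Toy batch: 32101 rows means valid IDs are 0..32100.
batch = torch.tensor([[1846, 5055, 32101, 2],
                      [17, 32101, 32100, 2]])
print(out_of_range_ids(batch, 32101))
```

Passing the returned IDs to `tokenizer.convert_ids_to_tokens` would show whether they are the added sentinel or EOM tokens, which would support the `tokenizer.vocab_size` versus `len(tokenizer)` explanation.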

Next step with wandb: the loss is already being calculated; also compute accuracy and perplexity, since it is better to plot all three.
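
As a minimal sketch of the two missing metrics: perplexity is the exponential of the mean cross-entropy loss (assuming the loss uses natural logarithms), and token accuracy is the share of predicted IDs that match the labels. The function names, the example numbers, and the use of -100 as the ignored label value (the PyTorch cross-entropy default) are assumptions, not taken from train_fim.py.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def perplexity(mean_loss: float) -> float:
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


def token_accuracy(predictions: list[int], labels: list[int]) -> float:
    """Fraction of non-ignored positions where the prediction equals the label."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == l for p, l in pairs) / len(pairs)


# Made-up values for one logging step.
metrics = {
    "loss": 2.0,
    "perplexity": perplexity(2.0),
    "accuracy": token_accuracy([5, 7, 9, 2], [5, 8, IGNORE_INDEX, 2]),
}
print(metrics)
```

A dictionary like `metrics` can be handed to `wandb.log` so that all three curves appear side by side.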

This post is licensed under CC BY 4.0 by the author.