[D-71] train_fim 1

Posted Dec 16, 2025

By Yoojin Kim

7 min read

set up

/kim/causal/modeling_bigs_gen.py
/kim/causal/configuration_bigs.py
/kim/causal/train_fim.py
/shared/STG/zhang_dataset/issues_eval.jsonl issues_train.jsonl code_train.jsonl

  
/mnt/gpfs/work/kim/causal/train_fim.py:590: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MyTrainer.__init__`. Use `processing_class` instead.
  trainer = MyTrainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1, 'pad_token_id': 2}.
  0%|          | 0/37116 [00:00<?, ?it/s]You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [653,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

[...]

I suspect I am getting this error because the vocab_size of the model differs from that of the tokenizer.

However, just before the error appeared, I had added the static / dynamic / fixed attributes to the configuration file.

Getting tons of:

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [554,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.

We enabled synchronous error reporting (export CUDA_LAUNCH_BLOCKING=1 in sbatch script).
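
The same switch can also be set from Python. This is a minimal sketch, assuming it runs at the very top of the entry script: the variable generally needs to be in place before CUDA is initialized, so setting it after the first GPU call is unlikely to have any effect. With synchronous launches, the Python traceback should point at the call that actually triggered the device-side assertion.

```python
import os

# Assumption: this runs before `import torch` (or at least before the first CUDA call).
# setdefault keeps a value that the sbatch script has already exported.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```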

Debug train_fim / class MyTrainer / compute_loss:

Open to see code:

  
# In train_fim.py, inside MyTrainer.compute_loss:

# ... (lines for max_len, input_ids, labels)

# --- START DEBUGGING BLOCK ---
max_vocab_size = model.config.vocab_size

# 1. Find the maximum token ID in the batch
max_input_id = input_ids.max().item()

# 2. Check if any ID exceeds the model's expected vocabulary size
if max_input_id >= max_vocab_size:
    print(f"\n[!!! CRITICAL DEBUGGING ERROR !!!]")
    print(f"Max token ID in batch: {max_input_id}")
    print(f"Model vocab size (num_embeddings): {max_vocab_size}")
    print(f"The maximum token ID is OUTSIDE the embedding matrix bounds.")
    
    # Check the embedding matrix size for confirmation
    try:
        current_embedding_size = model.bigs.embeddings.word_embeddings.weight.size(0)
        print(f"Actual embedding matrix size: {current_embedding_size}")
        if max_input_id >= current_embedding_size:
            print(f"ERROR: Max ID ({max_input_id}) >= Matrix Size ({current_embedding_size}).")
            
            # Print the input_ids tensor to find the tokens
            print("First 10 rows of input_ids:")
            print(input_ids[:10])
            
            # RAISE ERROR to stop execution cleanly
            raise ValueError("Token ID out of bounds. Check tokenizer and model resizing.")
            
    except AttributeError:
        # Fallback in case attribute access fails
        print("Could not access model.bigs.embeddings.word_embeddings.weight.size(0)")

# --- END DEBUGGING BLOCK ---

outputs = model(input_ids=input_ids)
logits  = outputs.logits
# ...
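
The block above reports only the maximum ID. A small helper that lists every offending ID and how often it occurs can make it easier to see which tokens are involved. This is a sketch on plain nested lists so it runs without a GPU; the name `out_of_range_ids` is made up for illustration and is not part of train_fim.py.

```python
from collections import Counter

def out_of_range_ids(batch_ids, num_embeddings):
    """Count token IDs that cannot be looked up in a table with `num_embeddings` rows.

    Valid row indices are 0 .. num_embeddings - 1.
    """
    bad = Counter()
    for row in batch_ids:
        for token_id in row:
            if token_id < 0 or token_id >= num_embeddings:
                bad[token_id] += 1
    return dict(bad)

# Toy batch; with a real tensor, `input_ids.tolist()` gives the same nested-list shape.
batch = [[1846, 5055, 32101, 2], [17, 32100, 848, 2]]
print(out_of_range_ids(batch, num_embeddings=32101))  # {32101: 1}
```

With a Hugging Face tokenizer, the reported IDs could then be passed to `tokenizer.convert_ids_to_tokens` to see which tokens they correspond to.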

Embedding Size Update:

Before :: def build_fresh_model_and_load():

  
# BiGS-specific model loading
def build_fresh_model_and_load(config_name_or_path, tokenizer, orig_weight_path=None):
    
    # 1. Load BiGS Configuration from the imported class
    config = BiGSConfig()
    config.vocab_size = tokenizer.vocab_size
    
    # 2. Instantiate BiGSForCausalLM
    model = BiGSForCausalLM(config)
    
    # 3. Load initial weights if provided (e.g., for pre-trained checkpoints)
    if orig_weight_path and os.path.isfile(orig_weight_path):
        print(f"Loading initial weights from {orig_weight_path}")
        ckpt = torch.load(orig_weight_path, map_location="cpu")
        
        # We assume the checkpoint is a model state dict
        missing, unexpected = model.load_state_dict(ckpt, strict=False)
        print(f"  [Load State] Missing keys: {missing}")
        print(f"  [Load State] Unexpected keys: {unexpected}")
    
    # 4. Resize token embeddings (for sentinel and EOM tokens)
    model.resize_token_embeddings(tokenizer.vocab_size)  # correctly put? <<<
    
    return model
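
On the `# correctly put?` question: in Hugging Face tokenizers, `vocab_size` reports the base vocabulary without added tokens, while `len(tokenizer)` includes them. If the sentinel and EOM tokens were added after loading (an assumption; the tokenizer setup is not shown in this post), resizing to `tokenizer.vocab_size` would leave their IDs outside the embedding matrix. The sketch below uses a hypothetical `ToyTokenizer` stand-in, not the real class, purely to illustrate the counting.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics how Hugging Face tokenizers count tokens."""

    def __init__(self, base_vocab):
        self._base = list(base_vocab)
        self._added = []

    def add_tokens(self, tokens):
        new = [t for t in tokens if t not in self._base and t not in self._added]
        self._added.extend(new)
        return len(new)

    @property
    def vocab_size(self):
        # Base vocabulary only; added tokens are not counted.
        return len(self._base)

    def __len__(self):
        # Full vocabulary, including added tokens.
        return len(self._base) + len(self._added)

    def token_to_id(self, token):
        return (self._base + self._added).index(token)

tok = ToyTokenizer(["<s>", "<pad>", "</s>", "def", "return"])
tok.add_tokens(["<SENTINEL>", "<EOM>"])  # hypothetical token names

max_id = tok.token_to_id("<EOM>")
print(tok.vocab_size, len(tok), max_id)  # 5 7 6
# An embedding with tok.vocab_size rows (5) cannot hold ID 6; len(tok) rows (7) can.
```

With the real objects, the corresponding call would be `model.resize_token_embeddings(len(tokenizer))`, issued after all special tokens have been added.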

After :: def build_fresh_model_and_load():

  
# ... (logic to create config and model) ...

    # 4. [CRITICAL FIX] Ensure the tokenizer's new vocab size is reflected.
    new_vocab_size = tokenizer.vocab_size

    # Update model config (should be done, but ensure)
    model.config.vocab_size = new_vocab_size
    
    # 5. Explicitly resize the model embeddings to the correct size.
    # This is the line that must successfully update the embedding weight matrix.
    model.resize_token_embeddings(new_vocab_size)

    # 6. [NEW ADDITION] Double-check the core embedding layer size manually.
    # This is the safety check that overrides potential internal bugs.
    if model.bigs.embeddings.word_embeddings.num_embeddings != new_vocab_size:
        # If the resizing was incomplete, manually force the correct num_embeddings.
        # Note: this only changes the attribute; the weight tensor keeps its shape,
        # so an out-of-range ID would still fail at lookup time.
        model.bigs.embeddings.word_embeddings.num_embeddings = new_vocab_size
        print(f"[BiGS FIX] Manually set num_embeddings to {new_vocab_size}")

    return model

After (second attempt) :: def build_fresh_model_and_load():

  
    # 4. [CRITICAL FIX] Ensure the tokenizer's new vocab size is reflected.
    new_vocab_size = tokenizer.vocab_size

    # Update model config (should be done, but ensure)
    model.config.vocab_size = new_vocab_size
    
    # 5. Explicitly resize the model embeddings to the correct size.
    # We found the required size is tokenizer.vocab_size (32101) + 1 to hold the max index (32101).
    # Since tokenizer.vocab_size is already 32101, we resize to this value.
    # The error suggests the *actual* necessary dimension is 32102.
    
    # We will resize to the new vocab size (32101) and then check.
    model.resize_token_embeddings(new_vocab_size)
    
    # --- Check for the 1-unit mismatch (the most common HF bug) ---
    actual_size = model.bigs.embeddings.word_embeddings.weight.size(0)
    if actual_size <= new_vocab_size:
        # If the actual size is less than or equal to the max ID + 1, we force it higher.
        # Max ID (32101) requires a size of 32102.
        required_size = new_vocab_size + 1 
        
        print(f"[BiGS FIX] Forcing final embedding resize to required size: {required_size}")
        model.resize_token_embeddings(required_size)
        model.config.vocab_size = required_size # Update config for consistency
        
        # Manually verify/force num_embeddings (as per previous debugging step)
        model.bigs.embeddings.word_embeddings.num_embeddings = required_size
        
        # Re-verify the actual weight size
        print(f"Final actual embedding matrix size after fix: {model.bigs.embeddings.word_embeddings.weight.size(0)}")


# --- DELETE the existing debugging block near the END of compute_loss ---
# ... (The block that raised the ValueError) ...

The error and the output remain the same:

error.txt

  
$ cat error.txt 
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
/mnt/gpfs/work/kim/causal/train_fim.py:654: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `MyTrainer.__init__`. Use `processing_class` instead.
  trainer = MyTrainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1, 'pad_token_id': 2}.
  0%|          | 0/37116 [00:00<?, ?it/s]You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 705, in <module>
    main(stage2_only)
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 677, in main
    train_on_jsonl(
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 665, in train_on_jsonl
    trainer.train()
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/athene/work/kim/miniconda3/envs/sssm/lib/python3.12/site-packages/transformers/trainer.py", line 4020, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/gpfs/work/kim/causal/train_fim.py", line 459, in compute_loss
    raise ValueError("Token ID out of bounds. Check tokenizer and model resizing.")
ValueError: Token ID out of bounds. Check tokenizer and model resizing.
  0%|          | 0/37116 [00:00<?, ?it/s]

output.txt

  
$ cat output.txt 
[init] redirecting XDG_CACHE_HOME & TRITON_CACHE_DIR → /mnt/gpfs/work/kim/causal/triton_cache
Switch to torch vandermonde kernel.
[init] redirecting XDG_CACHE_HOME & TRITON_CACHE_DIR → /mnt/gpfs/work/kim/causal/triton_cache
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
Stage‐1: FIM Pre‐training on Issues…
[BiGS FIX] Forcing final embedding resize to required size: 32101
Final actual embedding matrix size after fix: 32101

[!!! CRITICAL DEBUGGING ERROR !!!]
Max token ID in batch: 32101
Model vocab size (num_embeddings): 32101
The maximum token ID is OUTSIDE the embedding matrix bounds.
Actual embedding matrix size: 32101
ERROR: Max ID (32101) >= Matrix Size (32101).
First 10 rows of input_ids:
tensor([[ 1846,  5055,   716,  ...,     2,     2,     2],
        [   17,  2725,   848,  ...,     2,     2,     2],
        [10891, 16048,   364,  ...,     2,     2,     2],
        ...,
        [   18,    22, 12648,  ...,     2,     2,     2],
        [12303,   471, 21127,  ...,     2,     2,     2],
        [ 1686,   628,  9039,  ...,     2,     2,     2]], device='cuda:0')
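
Reading the numbers in output.txt: the fix printed a required size of 32101, which the code computes as `tokenizer.vocab_size + 1`, so `tokenizer.vocab_size` is apparently 32100 rather than the 32101 assumed in the code comments. The largest ID in the batch is 32101, and an ID of n needs at least n + 1 rows. A small sketch of that arithmetic (the helper name `rows_needed` is made up for illustration):

```python
def rows_needed(max_token_id):
    """Smallest embedding table that can hold `max_token_id` (indices start at 0)."""
    return max_token_id + 1

logged_required_size = 32101   # from the "[BiGS FIX] Forcing final embedding resize" line
logged_max_token_id = 32101    # from the "Max token ID in batch" line

implied_vocab_size = logged_required_size - 1   # the fix used new_vocab_size + 1
shortfall = rows_needed(logged_max_token_id) - logged_required_size

print(implied_vocab_size)                 # 32100
print(rows_needed(logged_max_token_id))   # 32102
print(shortfall)                          # 1
```

So the matrix is still one row short after the fix, which matches the ValueError. Sizing it from `len(tokenizer)` instead of `tokenizer.vocab_size + 1` would probably track the added tokens, though that is untested here.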

Next: use wandb to log the loss (already calculated), accuracy, and perplexity. It is better to plot all three.

Thesis

log thesis

This post is licensed under CC BY 4.0 by the author.
