Model Loading Best Practices

Intermediate

Enforce safe and efficient model loading patterns — device mapping, memory management, quantization configuration, and error handling for HuggingFace Transformers models.

File Patterns

**/*.py

This rule applies to files matching the patterns above.

Rule Content

rule-content.md

# Model Loading Best Practices

## Rule
All HuggingFace model loading MUST use AutoModel classes, explicit device mapping, and proper memory management. Never load models without specifying resource constraints.

## Required Patterns

### Good — Production Model Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Always use Auto classes for portability
model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"

# Explicit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load with explicit device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=False,  # Explicit security setting
)

# Always load tokenizer from same source
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set to evaluation mode for inference
model.eval()
```

### Bad — Unsafe Loading
```python
# Missing device_map, torch_dtype, and quantization config
model = AutoModelForCausalLM.from_pretrained("some-model")
# trust_remote_code=True without review (security risk)
model = AutoModelForCausalLM.from_pretrained("unknown/model", trust_remote_code=True)
```

## Rules
1. **Always use Auto classes** — AutoModelForCausalLM, AutoTokenizer, etc.
2. **Always set device_map** — "auto" for inference, explicit for training
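3. **Always call model.eval()** — from_pretrained already returns the model in evaluation mode, but an explicit call guards against code paths that switched it to training mode and left dropout active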
4. **Never trust_remote_code=True** without reviewing the model's code
5. **Always specify torch_dtype** — float16 or bfloat16, never default float32
6. **Use quantization** for models > 3B parameters on consumer GPUs
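
Rules 5 and 6 follow from simple arithmetic on weight storage. The sketch below is a rough estimate only: the helper name `weight_memory_gib` and the bytes-per-parameter table are assumptions made for illustration, and the figure covers weights alone, not activations, the KV cache, or quantization metadata.

```python
# Approximate storage cost per parameter (assumed values for illustration;
# real 4-bit checkpoints carry some extra metadata per block).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "nf4": 0.5,
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    # A nominal 7-billion-parameter model, assumed for illustration
    for dtype in ("float32", "float16", "nf4"):
        print(f"{dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

For a nominal 7 billion parameters this gives roughly 26 GiB in float32, 13 GiB in float16, and 3.3 GiB at 4 bits, which is why a model of that size usually needs quantization to fit on a consumer GPU.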

## Memory Management
```python
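# Free GPU memory after use
import gc
import torch

del model                 # drop this reference; any other live reference keeps the weights allocated
gc.collect()              # collect reference cycles that may still hold tensors
torch.cuda.empty_cache()  # release PyTorch's cached, unused CUDA blocks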
```
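
The cleanup sequence above can be wrapped in a small helper so that it is applied the same way everywhere. This is a minimal sketch: the name `release_cached_memory` is hypothetical, and the optional import is an assumption made so the helper also runs on machines without PyTorch or without a GPU. The caller still has to `del` its own references first, because the helper cannot remove names from the caller's scope.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def release_cached_memory() -> int:
    """Run the garbage collector, then clear PyTorch's CUDA cache if present.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Only has an effect once the tensors themselves are unreferenced
        torch.cuda.empty_cache()
    return collected
```

A typical call site would be `del model` followed by `release_cached_memory()`.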

## Anti-Patterns
- Loading in float32 (double the memory of float16, negligible quality gain)
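- No device_map (the model loads on CPU unless it is moved explicitly, and CPU inference on multi-billion-parameter models is typically far slower than GPU inference)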
- trust_remote_code=True on untrusted models (arbitrary code execution risk)
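- Running inference on a model left in training mode (dropout stays active; from_pretrained returns the model in eval mode, but model.train() or a training loop changes that)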
- Loading model and tokenizer from different sources (tokenizer mismatch)
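
Several of these anti-patterns are visible in the source text alone, so they can be caught before any model is loaded. The following is a minimal sketch using the standard-library `ast` module. The function name `check_loading_calls`, the choice of required keywords, and the heuristic of treating any class whose name contains "Model" as a model loader are all assumptions for illustration, not part of any existing linter.

```python
import ast
from typing import List

# Keyword arguments this sketch requires on model loading calls (assumed policy)
REQUIRED_KWARGS = ("device_map", "torch_dtype")


def check_loading_calls(source: str) -> List[str]:
    """Return one message per rule violation found in Python source text."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "from_pretrained"):
            continue
        # Heuristic: only classes with "Model" in the name need device/dtype
        # arguments; tokenizers accept neither.
        owner = func.value
        owner_name = owner.id if isinstance(owner, ast.Name) else getattr(owner, "attr", "")
        # Arguments passed through **kwargs have arg=None and cannot be checked
        kwargs = {kw.arg: kw.value for kw in node.keywords if kw.arg is not None}
        if "Model" in owner_name:
            for name in REQUIRED_KWARGS:
                if name not in kwargs:
                    problems.append(f"line {node.lineno}: missing {name}")
        trust = kwargs.get("trust_remote_code")
        if isinstance(trust, ast.Constant) and trust.value is True:
            problems.append(f"line {node.lineno}: trust_remote_code=True needs review")
    return problems


if __name__ == "__main__":
    bad = 'm = AutoModelForCausalLM.from_pretrained("unknown/model", trust_remote_code=True)\n'
    for message in check_loading_calls(bad):
        print(message)
```

The checker only parses the code and never imports or runs it, so it is safe to apply to untrusted files. It cannot detect a tokenizer mismatch or a model left in training mode, since both depend on runtime state.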
