Resource Management Rules

Advanced

Enforce GPU memory, disk storage, and concurrent request limits for Ollama deployments to prevent out-of-memory crashes, storage exhaustion, and performance degradation.

File Patterns

**/ollama*
**/.env*

This rule applies to files matching the patterns above.

Rule Content

rule-content.md

# Resource Management Rules

## Rule
All Ollama deployments MUST configure memory limits, storage monitoring, and concurrent request limits based on available hardware resources.

## Memory Rules

### GPU Memory Budget
```bash
# Rule: Never allocate more than 90% of VRAM to Ollama
# Leave 10% for OS and other GPU tasks

# Check available VRAM
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# Set GPU overhead reservation (the value is given in bytes)
export OLLAMA_GPU_OVERHEAD=536870912  # Reserve 512 MiB for system
```
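
The 90% budget and the overhead reservation both come down to simple arithmetic. The following is a minimal sketch in POSIX shell, not a definitive implementation: the helper names `vram_budget_mib` and `mib_to_bytes` are hypothetical, and it assumes `nvidia-smi` reports `memory.total` in MiB when queried with `nounits`.

```bash
# Hypothetical helpers illustrating the 90% VRAM budget rule.

# Print 90% of a total VRAM figure given in MiB (integer math, rounds down).
vram_budget_mib() {
    echo $(( $1 * 90 / 100 ))
}

# Convert MiB to bytes, the unit used for OLLAMA_GPU_OVERHEAD.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Only query the GPU when nvidia-smi is installed and returns a number.
if command -v nvidia-smi >/dev/null 2>&1; then
    total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)
    case "$total_mib" in
        ''|*[!0-9]*) echo "Could not read total VRAM" >&2 ;;
        *) echo "Ollama VRAM budget: $(vram_budget_mib "$total_mib") MiB of ${total_mib} MiB" ;;
    esac
fi

# Reserve 512 MiB for the system, expressed in bytes.
export OLLAMA_GPU_OVERHEAD="$(mib_to_bytes 512)"
```

On a multi-GPU host this sketch only reads the first GPU; extending it to every device is left to the deployment.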

### Context Window Limits
```
# Maximum num_ctx by VRAM (for 7B Q4 model)
4 GB VRAM  → num_ctx max 4096
8 GB VRAM  → num_ctx max 16384
16 GB VRAM → num_ctx max 32768
24 GB VRAM → num_ctx max 65536
```
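
The table above can be turned into a lookup for scripted checks. This is a sketch under stated assumptions: the function names are hypothetical, the tiers apply only to a 7B Q4 model as in the table, and VRAM sizes between tiers fall back to the next lower tier (a choice made here, not one the rule states).

```bash
# Hypothetical lookup mirroring the num_ctx table (7B Q4 model only).
# Prints the maximum num_ctx for a VRAM size in whole GB; anything
# under 4 GB prints 0.
max_num_ctx_for_vram_gb() {
    if   [ "$1" -ge 24 ]; then echo 65536
    elif [ "$1" -ge 16 ]; then echo 32768
    elif [ "$1" -ge 8 ];  then echo 16384
    elif [ "$1" -ge 4 ];  then echo 4096
    else echo 0
    fi
}

# Succeed only if the requested num_ctx ($2) fits the tier for $1 GB of VRAM.
num_ctx_within_limit() {
    [ "$2" -le "$(max_num_ctx_for_vram_gb "$1")" ]
}

# Example: 12 GB of VRAM falls back to the 8 GB tier.
echo "12 GB VRAM -> num_ctx max $(max_num_ctx_for_vram_gb 12)"
```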

## Storage Rules
```bash
# Monitor model storage
ollama list  # Check sizes

# Rule: Set up alerts when model storage exceeds threshold
# Models location: ~/.ollama/models
du -sh ~/.ollama/models/

# Cleanup rule: Remove models unused for 30+ days
# Document which models are in active use, then remove the rest:
# ollama rm <model-name>
```
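
The alert rule can be sketched as a small check suitable for a cron job. Everything here is illustrative: `storage_status` is a hypothetical helper, and the 100 GiB threshold is an assumed value, not one the rule prescribes. The sketch honours `OLLAMA_MODELS` when it is set and otherwise uses the default location noted above.

```bash
# Hypothetical alert helper: prints ALERT when usage ($1) exceeds the
# limit ($2); both values are in KiB.
storage_status() {
    if [ "$1" -gt "$2" ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

# Assumed threshold of 100 GiB; adjust to the disk actually in use.
limit_kb=$(( 100 * 1024 * 1024 ))

# OLLAMA_MODELS overrides the default model directory when set.
models_dir="${OLLAMA_MODELS:-$HOME/.ollama/models}"

if [ -d "$models_dir" ]; then
    used_kb=$(du -sk "$models_dir" | awk '{print $1}')
    echo "Model storage: ${used_kb} KiB used, status $(storage_status "$used_kb" "$limit_kb")"
fi
```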

## Concurrency Rules
```bash
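# Rule: Concurrent slots = floor(Available VRAM / Model VRAM requirement)
# Example: 16GB VRAM, 7B Q4 model (5GB) → floor(16 / 5) = 3 parallel slots max
# Note: Ollama multiplies the context allocation by OLLAMA_NUM_PARALLEL,
# so treat the result as an upper bound and start lower.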

export OLLAMA_NUM_PARALLEL=2      # Conservative default
export OLLAMA_MAX_LOADED_MODELS=2 # Don't load more models than VRAM allows
```
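
The slot formula is plain integer division. A minimal sketch, assuming both figures are whole GB; `max_parallel_slots` is a hypothetical helper name, and the result is only the rule's upper bound, not a tuned value.

```bash
# Hypothetical helper implementing floor(available VRAM / model VRAM requirement).
# $1 = available VRAM in GB, $2 = model requirement in GB.
# Prints 0 when the model does not fit at all.
max_parallel_slots() {
    if [ "$2" -le 0 ]; then
        echo "model size must be positive" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Example from the rule: 16 GB VRAM and a 5 GB model.
slots=$(max_parallel_slots 16 5)
echo "max parallel slots: $slots"
```

A result of 0 signals that the model should not be loaded on that GPU in the first place.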

## Environment Configuration Template
```bash
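# Linux systemd service: run `systemctl edit ollama.service` and add each
# variable as an Environment="NAME=value" line under [Service].
# Or export the variables in a shell profile when running `ollama serve` manually.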

# GPU Configuration
OLLAMA_FLASH_ATTENTION=1       # Always enable
OLLAMA_GPU_OVERHEAD=536870912  # Bytes; reserves 512 MiB for system

# Concurrency (adjust per hardware)
OLLAMA_NUM_PARALLEL=2          # Concurrent requests
OLLAMA_MAX_LOADED_MODELS=2     # Models in memory

# Network (if serving to network)
OLLAMA_HOST=0.0.0.0:11434     # Listen on all interfaces
OLLAMA_ORIGINS=*               # CORS (restrict in production)
```
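
Because the network settings in the template expose the API beyond localhost, a pre-start check can flag them. This is a sketch only: `network_exposure_warnings` is a hypothetical helper, and it assumes that a host value starting with `0.0.0.0` and an origins value of `*` are the two cases worth warning about.

```bash
# Hypothetical check: print one warning per risky network setting.
# $1 is an OLLAMA_HOST value, $2 is an OLLAMA_ORIGINS value.
network_exposure_warnings() {
    case "$1" in
        0.0.0.0*) echo "WARN: OLLAMA_HOST listens on all interfaces" ;;
    esac
    if [ "$2" = "*" ]; then
        echo "WARN: OLLAMA_ORIGINS allows every origin"
    fi
    return 0
}

# Fall back to Ollama's default bind address when OLLAMA_HOST is unset.
network_exposure_warnings "${OLLAMA_HOST:-127.0.0.1:11434}" "${OLLAMA_ORIGINS:-}"
```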

## Good Configuration (16GB VRAM)
```bash
OLLAMA_FLASH_ATTENTION=1
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_LOADED_MODELS=2
# Running: qwen2.5-coder:14b-q4_K_M with num_ctx 8192
```

## Bad Configuration (8GB VRAM)
```bash
OLLAMA_NUM_PARALLEL=4          # Too many — will OOM
OLLAMA_MAX_LOADED_MODELS=3     # Can't fit 3 models in 8GB
# Running: llama3.1:70b with num_ctx 32768  # Way too large for 8GB
```

## Anti-Patterns
- No memory limits set (OOM crashes under load)
- `OLLAMA_NUM_PARALLEL` set higher than the hardware supports
- Loading multiple large models simultaneously
- Not monitoring disk space for model storage
- Running GPU models without checking CUDA/Metal availability
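
The last anti-pattern can be guarded against with a quick probe before starting the server. This is a rough sketch: `detect_gpu_backend` is a hypothetical helper, it assumes that a working `nvidia-smi` implies CUDA and that Apple Silicon macOS implies Metal, and it does not cover other backends such as AMD ROCm.

```bash
# Hypothetical probe: prints "cuda", "metal", or "none".
detect_gpu_backend() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        # nvidia-smi is installed and can list at least the driver's GPUs.
        echo "cuda"
    elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
        # Assumption: Apple Silicon Macs provide Metal.
        echo "metal"
    else
        echo "none"
    fi
}

backend=$(detect_gpu_backend)
if [ "$backend" = "none" ]; then
    echo "No supported GPU detected; models would run on CPU" >&2
fi
echo "GPU backend: $backend"
```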

