GPU & Performance Optimization
Intermediate · v1.0.0
Optimize Ollama inference performance — GPU layer allocation, batch processing, context window tuning, concurrent requests, and hardware-specific configuration for fast local AI.
Overview
Ollama performance depends heavily on hardware configuration. Learn to allocate GPU layers, tune context windows, configure concurrent requests, and optimize memory usage for the fastest possible local inference.
Why This Matters
- Speed: proper GPU configuration can make inference 10x faster than CPU-only
- Responsiveness: optimized settings enable real-time code assistance
- Stability: prevent out-of-memory (OOM) crashes caused by misconfigured memory allocation
- Efficiency: run larger models on the same hardware
How It Works
Step 1: Check Hardware
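Before tuning anything, confirm what the runtime can actually see. A quick check, depending on platform (the exact output varies with your card and loaded model):

```shell
# NVIDIA: report GPU model, total and free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# Apple Silicon: the GPU shares unified memory, so check total RAM
sysctl hw.memsize

# After loading a model, verify how much of it sits on the GPU:
# the PROCESSOR column reads "100% GPU" when fully offloaded
ollama ps
```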
Step 2: Configure GPU Layers
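Ollama exposes layer offload through the `num_gpu` option. By default it offloads as many layers as fit in VRAM, so an explicit count is mainly useful when auto-detection overshoots and you hit OOM. A sketch using a Modelfile (the layer count 28 and the derived model name are illustrative, not recommendations):

```shell
# Bake a fixed GPU layer count into a derived model
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_gpu 28
EOF
ollama create llama3.1-28gpu -f Modelfile

# Or set it for a single interactive session instead:
#   ollama run llama3.1:8b
#   >>> /set parameter num_gpu 28
```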
Step 3: Optimize Context Window
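`num_ctx` sets the context window and directly scales KV-cache memory use. It can be set per request through the API's `options` field; a minimal sketch against a locally running server:

```shell
# Request an 8192-token context for this call only
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the tradeoffs of a larger context window.",
  "options": { "num_ctx": 8192 },
  "stream": false
}'
```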
Step 4: Environment Variables
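The server-side knobs are environment variables read by `ollama serve` at startup, so set them before the server launches. A sketch of a medium-memory (16GB) configuration:

```shell
export OLLAMA_FLASH_ATTENTION=1      # enable Flash Attention kernels
export OLLAMA_NUM_PARALLEL=2         # concurrent request slots per model
export OLLAMA_MAX_LOADED_MODELS=2    # models kept resident at once
export OLLAMA_KEEP_ALIVE=10m         # keep a model loaded 10 min after last use
ollama serve
```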
Step 5: Benchmark Your Setup
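`ollama run --verbose` prints timing statistics after each response, including prompt eval rate and eval rate in tokens per second; the eval rate is the number to compare across settings. A minimal benchmark loop (the prompt is arbitrary):

```shell
# Run the same prompt three times and compare the "eval rate" lines;
# the response text goes to /dev/null, the timing stats print to stderr.
# Discard the first run, which includes model load time.
for i in 1 2 3; do
  ollama run llama3.1:8b --verbose "Explain mutexes in one paragraph." > /dev/null
done
```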
Performance Tuning Table
| Setting | Low Memory (8GB) | Medium (16GB) | High (24GB+) |
|---|---|---|---|
| Model size | 7B Q4 | 13B Q4 or 7B Q5 | 34B Q4 or 13B Q8 |
| num_ctx | 4096 | 8192 | 16384-32768 |
| num_gpu | All (-1) | All (-1) | All (-1) |
| OLLAMA_NUM_PARALLEL | 1 | 2 | 4 |
| OLLAMA_MAX_LOADED_MODELS | 1 | 2 | 3 |
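As a concrete instance of the medium column, assuming `codellama:13b` (Q4 quantization by default) as the model, the settings combine like this; all values are starting points to adjust while watching `nvidia-smi`:

```shell
# Server side: two parallel slots, two resident models
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=2

# Model side: bake an 8192-token context into a derived model
cat > Modelfile <<'EOF'
FROM codellama:13b
PARAMETER num_ctx 8192
EOF
ollama create codellama-16gb -f Modelfile
```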
Best Practices
- Enable Flash Attention when available (OLLAMA_FLASH_ATTENTION=1)
- Start with all layers on the GPU; reduce only if you hit OOM
- Match the context window to actual need: don't set 32k for simple completions
- Keep one concurrent slot per expected user
- Monitor GPU memory with nvidia-smi during operation
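The monitoring advice above can run in a second terminal while you exercise the server; memory.used climbing as a conversation grows is the KV cache filling:

```shell
# Refresh GPU memory and utilization once per second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
```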
Common Mistakes
- Setting num_ctx to 32768 on 8GB of VRAM (will OOM)
- Running multiple large models simultaneously (memory thrashing)
- Not enabling Flash Attention (forgoes a 20-30% speed improvement)
- Using CPU-only mode when a GPU is available
- Setting OLLAMA_NUM_PARALLEL too high (each slot reserves memory)
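To see why the first mistake above fails, a back-of-envelope KV-cache estimate helps. Assuming an 8B Llama-style model (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache, the size is 2 (K and V) x layers x context x kv_heads x head_dim x 2 bytes:

```shell
# KV-cache size at num_ctx=32768 for the assumed 8B-class dimensions
ctx=32768; layers=32; kv_heads=8; head_dim=128
bytes=$(( 2 * layers * ctx * kv_heads * head_dim * 2 ))
echo "$(( bytes / 1024 / 1024 )) MiB"   # prints 4096 MiB
```

That is 4GB of cache before counting the model weights (roughly 4-5GB for a 7B Q4 model), so the total clearly exceeds an 8GB card.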