
You've stared at "out of memory" errors more times than you can count. Your 8GB GPU chokes on models that Reddit claims run "just fine." Let’s make this predictable by focusing on what actually matters for local LLM inference.
Forget CUDA cores. Forget clock speeds. VRAM decides which models you can run locally.
During inference, the model’s weights need to sit in GPU memory and get read constantly while tokens are generated. Once weights spill into system RAM, performance usually falls off a cliff: think 50+ tokens/sec down to single digits.
The sizing rule is pretty simple: model size + KV cache + 10-25% headroom = minimum VRAM. A 7B model in FP16 needs roughly 14 GB. Quantize it to 4-bit GGUF and you’re down to about 4-5 GB. A 70B model at Q4 still wants around 38-40 GB, which is more than even an RTX 4090’s 24 GB.
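As a rough sketch of the sizing rule above, the arithmetic can be written out in a few lines. The function names are hypothetical, the 4.5 bits per weight figure for 4-bit GGUF is an assumed average (actual files vary by quantization type), and headroom is treated here as a fraction of weights plus KV cache.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights in GB: billions of parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     kv_cache_gb: float = 0.5, headroom: float = 0.15) -> float:
    """Minimum VRAM estimate: (weights + KV cache) plus fractional headroom.

    kv_cache_gb and headroom are assumed defaults, not measured values.
    """
    base = weights_gb(params_billion, bits_per_weight) + kv_cache_gb
    return base * (1 + headroom)


if __name__ == "__main__":
    # FP16 uses 16 bits per weight; 4.5 is an assumed average for 4-bit GGUF.
    for name, params, bits in [("7B FP16", 7, 16), ("7B Q4", 7, 4.5), ("70B Q4", 70, 4.5)]:
        print(f"{name}: weights {weights_gb(params, bits):.1f} GB, "
              f"minimum VRAM about {estimate_vram_gb(params, bits):.1f} GB")
```

Treat the result as a lower bound for planning, not a guarantee that a model will load.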
| GPU | VRAM | Realistic Model Size | ~7B Q4 Speed |
|---|---|---|---|
| RTX 4060 | 8 GB | 3B-8B (limited context) | 28-35 tok/s |
| RTX 4070 | 12 GB | 7B-8B (useful context) | 48-58 tok/s |
| RTX 4080 | 16 GB | 7B-14B | 70-85 tok/s |
| RTX 4090 | 24 GB | 13B-34B | 90-110 tok/s |
| RTX 5090 | 32 GB | 27B-70B (quantized) | ~140 tok/s |
Memory bandwidth is a big part of why speeds vary. Inference is basically nonstop weight streaming, so the RTX 4090’s 1008 GB/s vs the RTX 5090’s 1792 GB/s tends to show up directly in generation performance.
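One way to see the bandwidth effect: if every generated token has to read all the weights once (a simplification that assumes a dense model and a single request), then bandwidth divided by model size gives a theoretical ceiling on tokens per second. The function name and the 4.5 GB model size below are illustrative assumptions.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if each token streams the full weights once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 4.5  # assumed size of a 7B model at 4-bit quantization
    for gpu, bandwidth in [("RTX 4090", 1008), ("RTX 5090", 1792)]:
        ceiling = bandwidth_ceiling_tok_s(bandwidth, model_gb)
        print(f"{gpu}: ceiling of about {ceiling:.0f} tok/s")
```

Measured speeds land well below this ceiling because of compute, KV cache reads, and kernel overhead, but the ratio between two GPUs is a useful first guess.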

[!NOTE] AMD users can follow along using ROCm on Linux or Vulkan on Windows. Apple Silicon users benefit from unified memory, which makes larger models more accessible on M-series chips.
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On Linux, that one command covers the install. Windows users should grab the installer from ollama.com. The installer/script checks your GPU and sets up the right backend automatically.
After installation, confirm the basics:
```bash
ollama --version
nvidia-smi
```
nvidia-smi shows current GPU memory use and your driver version. If GPU acceleration doesn’t show up later, this is the fastest way to tell whether it’s a driver issue or an Ollama setup issue.
```bash
ollama pull llama3.1:8b-instruct-q4_K_M
```
The tag details matter. 8b is the 8 billion parameter variant of Llama 3.1. instruct means it’s tuned to follow instructions (not just autocomplete). q4_K_M is 4-bit quantization using the K-quant method at medium quality, which typically lands in the best speed vs memory vs quality tradeoff for local runs.
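To make the tag anatomy concrete, here is a small sketch that splits a model name into its parts. It assumes the family:size-variant-quant layout described above, which is a naming convention rather than a guaranteed format, and parse_tag is a hypothetical helper, not part of Ollama.

```python
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    family: str
    size: str
    variant: Optional[str]
    quant: Optional[str]


def parse_tag(name: str) -> ModelTag:
    """Split 'family:size-variant-quant' into its parts (convention-based)."""
    family, _, tag = name.partition(":")
    parts = tag.split("-") if tag else []
    size = parts[0] if parts else ""
    rest = parts[1:]
    quant = None
    # Treat a trailing piece such as q4_K_M or fp16 as the quantization label.
    if rest and rest[-1].lower().startswith(("q", "fp", "f16", "bf16")):
        quant = rest.pop()
    variant = "-".join(rest) or None
    return ModelTag(family, size, variant, quant)


if __name__ == "__main__":
    for name in ["llama3.1:8b-instruct-q4_K_M", "gemma3:4b-it-q4_K_M", "qwen3:14b-q4_K_M"]:
        print(parse_tag(name))
```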
If you want options by VRAM tier:
```bash
# 8 GB VRAM - smaller models
ollama pull gemma3:4b-it-q4_K_M

# 12-16 GB VRAM - sweet spot
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull qwen3:14b-q4_K_M

# 24 GB VRAM - larger models
ollama pull gemma3:27b-it-q4_K_M
# 70B at Q4 needs roughly 38-40 GB, so on a 24 GB card expect part of it to run from system RAM
ollama pull llama3.1:70b-instruct-q4_K_M
```
Gemma 3 from Google is a strong pick for multimodal work, and Qwen3 tends to do especially well on reasoning-heavy tasks. The right choice depends on what you’re doing day to day, but any of these are solid for general coding and writing help.
```bash
ollama run llama3.1:8b-instruct-q4_K_M
```
While it’s running, open another terminal:
```bash
ollama ps
```

```text
NAME                           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b-instruct-q4_K_M    a]2c6b7d8e9f    5.4 GB    100% GPU     4 minutes from now
```
The PROCESSOR column tells you where inference is happening. Anything less than 100% GPU usually means weights are spilling into system RAM. If you see something like 50% GPU / 50% CPU, the model doesn’t fit cleanly in VRAM, and performance will drop.
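If you want to check the PROCESSOR value from a script, a small parser is enough. This sketch assumes two string shapes, a single-device form such as "100% GPU" and a split form such as "48%/52% CPU/GPU". Both are assumptions about the output format, which may differ between Ollama versions, and gpu_percent is a hypothetical helper.

```python
import re


def gpu_percent(processor: str) -> int:
    """Return the share of the model on the GPU, from a PROCESSOR column value."""
    text = processor.strip()
    # Assumed split form: "<cpu>%/<gpu>% CPU/GPU"
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))
    # Assumed single-device form: "<n>% GPU" or "<n>% CPU"
    single = re.fullmatch(r"(\d+)%\s+(GPU|CPU)", text)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return pct if device == "GPU" else 100 - pct
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


if __name__ == "__main__":
    for value in ["100% GPU", "48%/52% CPU/GPU", "100% CPU"]:
        share = gpu_percent(value)
        status = "fits in VRAM" if share == 100 else "spilling into system RAM"
        print(f"{value}: {share}% on GPU, {status}")
```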
Double-check with nvidia-smi:
```bash
watch -n 1 nvidia-smi
```
GPU memory usage should jump when you send prompts. If it stays flat while Ollama says the model is loaded, GPU detection probably isn’t working the way it should.
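```bash
ollama run llama3.1:8b-instruct-q4_K_M
# then, at the >>> prompt inside the session:
#   /set parameter num_ctx 4096
```
Context length is a model parameter (num_ctx), set inside the session with /set parameter rather than as a flag to ollama run. It hits VRAM through the KV cache: in most cases, doubling context roughly doubles KV cache memory. The default (2048 tokens in older Ollama releases, 4096 in more recent ones) is fine for typical chats, but RAG setups or long-document work often need 8192 or more.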
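The linear relationship between context and KV cache falls out of the standard size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below plugs in the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and assumes an unquantized FP16 cache, so 2 bytes per value. Runtimes that quantize the cache will use less.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * KV heads * head dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Llama 3.1 8B shape; FP16 cache assumed.
    for ctx in (2048, 4096, 8192, 32768):
        size_mib = kv_cache_bytes(32, 8, 128, ctx) / 2**20
        print(f"context {ctx:>6}: about {size_mib:,.0f} MiB of KV cache")
```

Models without grouped-query attention have more KV heads, so their cache grows faster at the same context length.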
[!WARNING] If you push context too high, you can trigger out-of-memory errors mid-conversation. Start at 4096 and only move up if you actually need it. A 32K context on a 12 GB GPU can fail even with small models, because the KV cache alone may take several GB.
To make the settings stick, use a Modelfile:
```text
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
```

```bash
ollama create my-custom-llama -f Modelfile
ollama run my-custom-llama
```
This approach keeps context length, temperature, and system prompts in one reusable config, so you’re not retyping flags and wondering which run used which settings.
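The same parameters can also be passed per request through Ollama's local REST API, which accepts an options object on /api/generate. This is a minimal sketch, assuming the server is listening on its default address (localhost:11434). The helper names are hypothetical, and the network call is kept in a separate function so the payload can be inspected without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 8192, temperature: float = 0.7) -> bytes:
    """Build the JSON body for /api/generate with per-request options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }
    return json.dumps(payload).encode("utf-8")


def generate(model: str, prompt: str) -> str:
    """Send the request to a running Ollama server and return the response text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]


if __name__ == "__main__":
    print(build_generate_request("llama3.1:8b-instruct-q4_K_M", "Hello").decode())
```

A Modelfile is still the better fit when the same settings should apply to every session.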
Ollama covers most workflows, but llama.cpp gives you tighter control over quantization, batching, and multi-GPU setups.
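```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```
-DGGML_CUDA=ON is what turns on NVIDIA GPU support (older checkouts used make GGML_CUDA=1, which current llama.cpp has replaced with CMake). Without it, you’ll get CPU inference even if your GPU is sitting there idle. CMake places the binaries in build/bin, so use ./build/bin/llama-cli wherever the commands that follow say ./llama-cli.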
Download a GGUF model from Hugging Face, place it under models/, then run it:
```bash
./llama-cli -m models/llama-3.1-8b-instruct-q4_K_M.gguf \
  -p "Explain quicksort in Python" \
  -n 512 \
  --n-gpu-layers 99 \
  --ctx-size 4096
```
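--n-gpu-layers 99 asks llama.cpp to offload up to 99 layers, which for an 8B model means every layer. If they don’t all fit in VRAM, loading may fail with an out-of-memory error instead of falling back to CPU, depending on the llama.cpp version, so lower the number until the model loads. If you want a deliberate split (for example, leaving VRAM for another app), set a specific number like 32.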
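For a deliberate split, a rough starting value for --n-gpu-layers can be estimated from free VRAM. The sketch below assumes all layers are the same size and reserves a fixed amount for the KV cache and other buffers. Both are simplifications, and layers_that_fit is a hypothetical helper, so treat the result as a first guess to adjust by trial.

```python
import math


def layers_that_fit(free_vram_gb: float, model_gb: float, n_layers: int,
                    reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is an assumed allowance for KV cache and scratch buffers.
    """
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    fitting = math.floor(usable_gb * n_layers / model_gb)
    return min(n_layers, fitting)


if __name__ == "__main__":
    # Illustrative numbers: a 16 GB model with 32 layers on a card with 8 GB free.
    print(layers_that_fit(free_vram_gb=8, model_gb=16, n_layers=32, reserve_gb=2))
```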
| Format | Bits | Quality | Speed | Memory | Best For |
|---|---|---|---|---|---|
| Q8_0 | 8-bit | Excellent | Slower | Higher | Quality-critical tasks |
| Q6_K | 6-bit | Very Good | Moderate | Moderate | Balance seekers |
| Q4_K_M | 4-bit | Good | Fast | Low | Most users |
| Q4_K_S | 4-bit | Acceptable | Fastest | Lowest | VRAM-constrained |
| Q2_K | 2-bit | Degraded | Very Fast | Minimal | Experimentation only |
The bitsandbytes documentation goes deep on quantization details. In practice, Q4_K_M is the default choice for consumer GPUs because quality usually stays close to FP16 while memory use drops a lot. If Q4_K_M still doesn’t fit, try Q4_K_S. Q2_K is best kept for quick experiments because quality loss becomes obvious.
[!TIP] If you’re deciding between a larger model at lower quantization vs a smaller model at higher quantization, the larger model usually wins. A 14B Q4 model often beats a 7B Q8 model even if the memory footprint is similar.

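```bash
ollama ps
```
Look at PROCESSOR. If CPU shows up, the model doesn’t fit in VRAM. The fixes are the ones covered earlier: drop to a smaller quantization such as Q4_K_S, reduce the context length, or pick a smaller model.
```bash
nvidia-smi
```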
Check VRAM use before loading the model. Browsers with hardware acceleration, Discord, and video players can quietly eat a chunk of VRAM. Close them, or turn off GPU acceleration in their settings.
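To check free VRAM from a script rather than by eye, nvidia-smi can emit plain CSV through its --query-gpu option. This sketch parses one line of that output. The helper names are hypothetical, and only the first GPU is read.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def free_vram_mib(csv_line: str) -> int:
    """Parse a 'used, total' line (values in MiB) and return the free amount."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return total - used


def query_free_vram_mib() -> int:
    """Run nvidia-smi and return free VRAM on the first GPU, in MiB."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return free_vram_mib(output.splitlines()[0])


if __name__ == "__main__":
    # Sample line for illustration; call query_free_vram_mib() on a real machine.
    print(free_vram_mib("1843, 8188"), "MiB free")
```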
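```bash
# Check CUDA installation
nvcc --version
# Verify driver compatibility
nvidia-smi
```
Ollama needs an NVIDIA GPU with compute capability 5.0+ and a driver recent enough for CUDA 11.8+. Ollama’s prebuilt packages bundle the CUDA runtime they need, so nvcc mainly matters when you compile CUDA code yourself, as in the llama.cpp build. If nvcc isn’t found there, install the CUDA toolkit. If nvidia-smi shows an older driver, update via NVIDIA or your package manager.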
A slow first response is expected. The first prompt has to load weights into VRAM, which can take a few seconds depending on model size and disk speed. After that, prompts reuse the weights already in VRAM. SSDs make a noticeable difference here compared to HDDs.
To get a baseline:
```bash
ollama run llama3.1:8b-instruct-q4_K_M --verbose
```
Send a prompt and watch the stats:
```text
>>> Write a Python function to calculate fibonacci numbers
eval count:       256 tokens
eval duration:    4.2s
eval rate:        60.95 tokens/s
```
eval rate is your generation speed. Compare it to the table earlier. If you’re well under what your GPU should hit, go back and check GPU detection and whether you’re actually fitting in VRAM.
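The eval rate is simply eval count divided by eval duration. Ollama's API responses report the same two numbers as eval_count and eval_duration (in nanoseconds), so a script can reproduce the figure and compare it with an expected range. The range check and both helper names below are illustrative assumptions.

```python
def eval_rate(eval_count: int, eval_duration_ns: int) -> float:
    """Tokens per second from a token count and a duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def within_expected(rate: float, low: float, high: float) -> bool:
    """True if the measured rate is at least the low end of the expected range."""
    return rate >= low


if __name__ == "__main__":
    rate = eval_rate(256, 4_200_000_000)  # the run shown above: 256 tokens in 4.2 s
    print(f"{rate:.2f} tokens/s")
    # Compare against the RTX 4070 row of the table (48-58 tok/s).
    print("meets expectation:", within_expected(rate, 48, 58))
```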
If you want a heavier test:
```bash
# Generate 500 tokens to stress test
./llama-cli -m model.gguf -p "Write a detailed essay about renewable energy" -n 500 --n-gpu-layers 99
```
And keep an eye on temps during longer runs:
```bash
watch -n 1 "nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv"
```
If you’re sitting above ~83°C for long stretches, you’re likely running into cooling limits. That can throttle performance and isn’t great for long-term hardware health.
Coding assistance: Qwen3 14B or Llama 3.1 8B. Both are strong for code generation, debugging, and explanations. Qwen3 usually pulls ahead on harder reasoning.
Writing and editing: Gemma 3 12B is a good fit for creative writing and tone-sensitive edits.
RAG and document Q&A: Llama 3.1 8B with extended context (8192+). The base model supports up to 128K context, so quantized builds can still handle substantial documents if VRAM allows.
Multimodal tasks: Gemma 3 27B includes vision support for image understanding alongside text.
If you want more context on why teams are moving local, the piece on the offline-first AI movement adds useful background on privacy and cost tradeoffs.

Start here (your first step)
Install Ollama and run ollama pull llama3.1:8b-instruct-q4_K_M to get a working local LLM in under 5 minutes.
Quick wins (immediate impact)
- Run ollama ps while chatting to confirm 100% GPU usage
- Set num_ctx to 4096 (for example with /set parameter num_ctx 4096) to improve conversation memory without blowing up VRAM

Deep dive (for those who want more)
- Experiment with --n-gpu-layers to find the best GPU/CPU split for your machine

Running local LLMs on consumer GPUs mostly comes down to three things: VRAM, quantization, and context length. An 8 GB GPU can run quantized 7B models at modest context lengths. A 24 GB card can handle quantized 34B-class models. And the tooling is mature enough that you can go from zero to a working setup in minutes.
The bigger win isn’t only saving money vs API calls. Local inference keeps your data private, avoids round trips to external servers, works offline, and gives your team full control over how the model runs. For prototyping, sensitive data work, or learning how these systems behave, local setups remove a lot of dependencies and guesswork.