Run Local LLMs on Consumer GPUs: VRAM Guide & Performance Tips

Stop fighting out-of-memory errors. Learn exactly which models fit your GPU's VRAM, from RTX 4060 to 5090, with real performance benchmarks and optimization tips.

4 Jul 2026 · 8 min read · Joulyan IT

You've stared at "out of memory" errors more times than you can count. Your 8GB GPU chokes on models that Reddit claims run "just fine." Let’s make this predictable by focusing on what actually matters for local LLM inference.

Why VRAM Beats Everything Else

Forget CUDA cores. Forget clock speeds. VRAM decides which models you can run locally.

During inference, the model’s weights need to sit in GPU memory and get read constantly while tokens are generated. Once weights spill into system RAM, performance usually falls off a cliff: think 50+ tokens/sec down to single digits.

The sizing rule is pretty simple: model size + KV cache + 10-25% headroom = minimum VRAM. A 7B model in FP16 needs roughly 14 GB. Quantize it to 4-bit GGUF and you’re down to about 4-5 GB. A 70B model at Q4 still wants around 38-40 GB, which is more than even an RTX 4090’s 24 GB.
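To make the sizing rule concrete, here is a minimal Python sketch. The function name estimate_vram_gb, the 4.5 bits-per-weight figure for 4-bit GGUF, the 0.5 GB KV cache, and the 15% headroom are illustrative assumptions rather than measured values.

```python
def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb=0.0, headroom=0.0):
    """Rough minimum VRAM in GB: (weights + KV cache) scaled by headroom."""
    # 1e9 parameters * bits / 8 bits-per-byte is about that many GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return (weights_gb + kv_cache_gb) * (1 + headroom)


# Weights only, to compare with the figures quoted above.
fp16_7b = estimate_vram_gb(7, 16)
q4_7b = estimate_vram_gb(7, 4.5)    # assumed ~4.5 bits/weight for 4-bit GGUF
q4_70b = estimate_vram_gb(70, 4.5)

# Full rule: weights + KV cache + headroom (both extras are assumed values).
q4_7b_total = estimate_vram_gb(7, 4.5, kv_cache_gb=0.5, headroom=0.15)

print(f"7B FP16 weights:       {fp16_7b:.1f} GB")
print(f"7B Q4 weights:         {q4_7b:.1f} GB")
print(f"70B Q4 weights:        {q4_70b:.1f} GB")
print(f"7B Q4 with KV + 15%:   {q4_7b_total:.1f} GB")
```

Swap in your own parameter count and context budget before trusting the result. Treat the output as a lower bound, not a guarantee.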

GPU	VRAM	Realistic Model Size	~7B Q4 Speed
RTX 4060	8 GB	3B-8B (limited context)	28-35 tok/s
RTX 4070	12 GB	7B-8B (useful context)	48-58 tok/s
RTX 4080	16 GB	7B-14B	70-85 tok/s
RTX 4090	24 GB	13B-34B	90-110 tok/s
RTX 5090	32 GB	27B-34B (70B only with partial CPU offload)	~140 tok/s

Memory bandwidth is a big part of why speeds vary. Inference is basically nonstop weight streaming, so the RTX 4090’s 1008 GB/s vs the RTX 5090’s 1792 GB/s tends to show up directly in generation performance.
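A back-of-the-envelope way to see the bandwidth effect: if every generated token requires one full read of the weights, bandwidth divided by model size gives a theoretical ceiling on tokens per second. The helper name and the 4.5 GB model size below are assumptions for illustration only.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s, model_size_gb):
    """Upper bound on tokens/sec if each token needs one full pass over the weights."""
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.5  # assumed size of a 7B Q4 model, not a measured file size

rtx_4090 = bandwidth_ceiling_tok_s(1008, MODEL_SIZE_GB)
rtx_5090 = bandwidth_ceiling_tok_s(1792, MODEL_SIZE_GB)

print(f"RTX 4090 ceiling: {rtx_4090:.0f} tok/s")
print(f"RTX 5090 ceiling: {rtx_5090:.0f} tok/s")
print(f"5090 / 4090 ratio: {rtx_5090 / rtx_4090:.2f}x")
```

Real-world speeds land well below this ceiling because compute, sampling, and other overhead also cost time. The ratio between cards is the more useful takeaway.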

GPU VRAM capacity diagram showing model size limits and performance cliff when memory exceeded

Prerequisites

NVIDIA GPU with compute capability 5.0+ (GTX 900 series or newer)
Latest NVIDIA drivers installed
16 GB system RAM minimum (32 GB recommended for larger models)
50+ GB free storage for models
Windows 10/11 or Linux with CUDA support

Note

AMD users can follow along using ROCm on Linux or Vulkan on Windows. Apple Silicon users benefit from unified memory, which makes larger models more accessible on M-series chips.

Step 1: Install Ollama

bash
curl -fsSL https://ollama.com/install.sh | sh

On Linux, that one command covers the install. Windows users should grab the installer from ollama.com. The installer/script checks your GPU and sets up the right backend automatically.

After installation, confirm the basics:

bash
ollama --version
nvidia-smi

nvidia-smi shows current GPU memory use and your driver version. If GPU acceleration doesn’t show up later, this is the fastest way to tell whether it’s a driver issue or an Ollama setup issue.

Step 2: Pull Your First Model

bash
ollama pull llama3.1:8b-instruct-q4_K_M

The tag details matter. 8b is the 8 billion parameter variant of Llama 3.1. instruct means it’s tuned to follow instructions (not just autocomplete). q4_K_M is 4-bit quantization using the K-quant method at medium quality, which typically lands in the best speed vs memory vs quality tradeoff for local runs.

If you want options by VRAM tier:

bash
# 8 GB VRAM - smaller models
ollama pull gemma3:4b-it-q4_K_M

# 12-16 GB VRAM - sweet spot
ollama pull llama3.1:8b-instruct-q4_K_M
ollama pull qwen3:14b-q4_K_M

# 24 GB VRAM - larger models
ollama pull gemma3:27b-it-q4_K_M
# 70B at Q4 needs roughly 40 GB, so on a 24 GB card it only runs with partial CPU offload
ollama pull llama3.1:70b-instruct-q4_K_M

Gemma 3 from Google is a strong pick for multimodal work, and Qwen3 tends to do especially well on reasoning-heavy tasks. The right choice depends on what you’re doing day to day, but any of these are solid for general coding and writing help.

Step 3: Verify GPU Acceleration

bash
ollama run llama3.1:8b-instruct-q4_K_M

While it’s running, open another terminal:

bash
ollama ps

text
NAME                           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b-instruct-q4_K_M    a12c6b7d8e9f    5.4 GB    100% GPU     4 minutes from now

The PROCESSOR column tells you where inference is happening. Anything less than 100% GPU usually means weights are spilling into system RAM. If you see a split such as 48%/52% CPU/GPU, the model doesn’t fit cleanly in VRAM, and performance will drop.
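If you want to check this from a script, the PROCESSOR value is easy to parse. The sketch below assumes three value shapes ("100% GPU", "100% CPU", and a split such as "48%/52% CPU/GPU"). Those shapes and the helper name gpu_percent are assumptions, so adjust the patterns to whatever your Ollama version prints.

```python
import re


def gpu_percent(processor: str) -> int:
    """Return the GPU share (0-100) from an assumed `ollama ps` PROCESSOR value."""
    text = processor.strip()

    # Assumed split form: "<cpu>%/<gpu>% CPU/GPU"
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))

    # Assumed single-device form: "100% GPU" or "100% CPU"
    single = re.fullmatch(r"(\d+)%\s+(GPU|CPU)", text)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return pct if device == "GPU" else 100 - pct

    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    share = gpu_percent(value)
    status = "fits in VRAM" if share == 100 else "spilling to system RAM"
    print(f"{value:<18} -> {share:>3}% GPU ({status})")
```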

Double-check with nvidia-smi:

bash
watch -n 1 nvidia-smi

GPU memory usage should jump when the model loads, and GPU utilization should spike while tokens are being generated. If both stay flat while Ollama says the model is loaded, GPU detection probably isn’t working the way it should.

Step 4: Configure Context Length

bash
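ollama run llama3.1:8b-instruct-q4_K_M
# ollama run has no --num-ctx flag; set it at the interactive >>> prompt instead:
/set parameter num_ctx 4096

Context length hits VRAM through the KV cache. KV cache memory grows roughly linearly with context, so doubling context roughly doubles it. Ollama’s default has been 2048 tokens in many releases (newer versions may default higher), which is fine for typical chats, but RAG setups or long-document work often need 8192 or more.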

Warning

If you push context too high, you can trigger out-of-memory errors mid-conversation. Start at 4096 and only move up if you actually need it. On a 12 GB GPU, a 32K context can exhaust VRAM even with small models, depending on the model’s attention layout and KV cache precision.
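The link between context length and KV cache memory can be sketched with a little arithmetic. The defaults below (32 layers, 8 KV heads, head dimension 128) follow the published Llama 3.1 8B configuration. The FP16 cache (2 bytes per value) and the helper name kv_cache_gib are assumptions, and models without grouped-query attention need several times more.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GiB: keys and values for every layer and every cached token."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


for ctx in (2048, 4096, 8192):
    print(f"{ctx:>5} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

Add this figure to the weight size and your headroom to see whether a given context setting still fits on your card.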

To make the settings stick, use a Modelfile:

text
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7

bash
ollama create my-custom-llama -f Modelfile
ollama run my-custom-llama

This approach keeps context length, temperature, and system prompts in one reusable config, so you’re not retyping flags and wondering which run used which settings.

Step 5: Set Up llama.cpp for Advanced Control

Ollama covers most workflows, but llama.cpp gives you tighter control over quantization, batching, and multi-GPU setups.

bash
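git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

-DGGML_CUDA=ON is what turns on NVIDIA GPU support. Without it, you’ll get CPU inference even if your GPU is sitting there idle. Current llama.cpp builds with CMake rather than the older make GGML_CUDA=1 workflow, and the binaries land in build/bin.

Download a GGUF model from Hugging Face into models/, then run it:

bash
./build/bin/llama-cli -m models/llama-3.1-8b-instruct-q4_K_M.gguf \
  -p "Explain quicksort in Python" \
  -n 512 \
  --n-gpu-layers 99 \
  --ctx-size 4096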

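--n-gpu-layers 99 asks llama.cpp to offload up to 99 layers, which is more than an 8B model has, so every layer goes to the GPU. Don’t count on an automatic fallback if VRAM runs out: depending on the build, the load may simply fail with an out-of-memory error. If the model doesn’t fit, or you want a deliberate split (for example, leaving VRAM for another app), set a specific lower number such as 20.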
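For a starting value when you need a split, you can estimate how many layers fit in a VRAM budget. This is a rough sketch that assumes every layer is the same size (it ignores embeddings and the output layer) and reserves 1 GB for the KV cache and scratch buffers. The helper name and all numbers are illustrative.

```python
import math


def layers_that_fit(vram_budget_gb, model_size_gb, n_layers, reserve_gb=1.0):
    """Estimate a --n-gpu-layers value, assuming uniformly sized layers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_budget_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed example: a 4.9 GB model with 32 layers.
print(layers_that_fit(4.0, 4.9, 32))   # tight budget: partial offload
print(layers_that_fit(8.0, 4.9, 32))   # roomy budget: everything fits
```

Use the result as a first guess, then nudge it up or down while watching nvidia-smi.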

Quantization Formats Explained

Format	Bits	Quality	Speed	Memory	Best For
Q8_0	8-bit	Excellent	Slower	Higher	Quality-critical tasks
Q6_K	6-bit	Very Good	Moderate	Moderate	Balance seekers
Q4_K_M	4-bit	Good	Fast	Low	Most users
Q4_K_S	4-bit	Acceptable	Fastest	Lowest	VRAM-constrained
Q2_K	2-bit	Degraded	Very Fast	Minimal	Experimentation only

The bitsandbytes documentation goes deep on quantization details, though it describes a different toolchain from the GGUF formats in the table. In practice, Q4_K_M is the default choice for consumer GPUs because quality usually stays close to FP16 while memory use drops a lot. If Q4_K_M still doesn’t fit, try Q4_K_S. Q2_K is best kept for quick experiments because quality loss becomes obvious.

Tip

If you’re deciding between a larger model at lower quantization vs a smaller model at higher quantization, the larger model usually wins. A 14B Q4 model often beats a 7B Q8 model even if the memory footprint is similar.
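To check the "similar footprint" claim for yourself, compare weight sizes from bits per weight. Q8_0 stores 8.5 bits per weight (32 eight-bit values plus one 16-bit scale per block). The 4.85 figure for Q4_K_M is an approximate average and should be treated as an assumption, as should the helper name.

```python
# Approximate bits per weight; the Q4_K_M value is an assumed average.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params_billion, quant):
    """Approximate weight size in GB for a given parameter count and format."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


big_q4 = weights_gb(14, "Q4_K_M")
small_q8 = weights_gb(7, "Q8_0")

print(f"14B Q4_K_M: {big_q4:.1f} GB")
print(f" 7B Q8_0:   {small_q8:.1f} GB")
print(f"difference: {big_q4 - small_q8:.1f} GB")
```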

Quantization format comparison showing quality, speed, and memory tradeoffs across five levels

Troubleshooting Common Issues

Model Loads But Runs Slowly

bash
ollama ps

Look at PROCESSOR. If CPU shows up, the model doesn’t fit in VRAM. Your best fixes are:

Try a smaller quantization (Q4_K_S instead of Q4_K_M)
Lower context length
Switch to a smaller model
Close other GPU-heavy apps

CUDA Out of Memory Errors

bash
nvidia-smi

Check VRAM use before loading the model. Browsers with hardware acceleration, Discord, and video players can quietly eat a chunk of VRAM. Close them, or turn off GPU acceleration in their settings.

Ollama Doesn't Detect GPU

bash
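# Check the driver and whether the GPU is visible
nvidia-smi

# Optional: only needed if you compile llama.cpp yourself
nvcc --version

Ollama ships with its own CUDA runtime libraries, so a missing nvcc usually isn’t the problem. What it needs is a GPU with compute capability 5.0+ and a reasonably current NVIDIA driver. If nvidia-smi fails or shows an old driver, update via NVIDIA or your package manager. The CUDA toolkit (and nvcc) only matters when you build llama.cpp from source.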

Slow First Response, Fast Subsequent Ones

That’s expected. The first prompt has to load weights into VRAM, which can take a few seconds depending on model size and disk speed. After that, prompts reuse cached weights. SSDs make a noticeable difference here compared to HDDs.

Testing Your Implementation

To get a baseline:

bash
ollama run llama3.1:8b-instruct-q4_K_M --verbose

Send a prompt and watch the stats:

text
>>> Write a Python function to calculate fibonacci numbers
eval count: 256 tokens
eval duration: 4.2s
eval rate: 60.95 tokens/s

eval rate is your generation speed. Compare it to the table earlier. If you’re well under what your GPU should hit, go back and check GPU detection and whether you’re actually fitting in VRAM.
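The eval rate is simply the token count divided by the duration, so you can recompute it and compare it to a table row. In this sketch the helper names and the 80% tolerance are assumptions, and 48 tok/s is the low end of the RTX 4070 row used as an example.

```python
def eval_rate(eval_count, eval_duration_s):
    """Generation speed in tokens per second."""
    return eval_count / eval_duration_s


def looks_slow(rate, expected_low, tolerance=0.8):
    """Flag a rate that falls well below the low end of the expected range."""
    return rate < expected_low * tolerance


rate = eval_rate(256, 4.2)  # values from the sample output above
print(f"eval rate: {rate:.2f} tokens/s")
print("investigate GPU offload" if looks_slow(rate, expected_low=48) else "within expectations")
```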

If you want a heavier test:

bash
# Generate 500 tokens as a stress test (run from the directory that contains llama-cli)
./llama-cli -m model.gguf -p "Write a detailed essay about renewable energy" -n 500 --n-gpu-layers 99

And keep an eye on temps during longer runs:

bash
watch -n 1 "nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv"

If you’re sitting above ~83°C for long stretches, you’re likely running into cooling limits. That can throttle performance and isn’t great for long-term hardware health.
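If you log that CSV output to a file, a few lines of Python can flag hot samples. The sample text below is made up for illustration, and its header and unit suffixes are an assumption about how nvidia-smi formats the CSV, so check them against your own output.

```python
TEMP_LIMIT_C = 83  # threshold from the paragraph above


def parse_gpu_row(row: str) -> dict:
    """Parse one assumed CSV row such as '71, 98 %, 6144 MiB'."""
    temp, util, mem = [field.strip() for field in row.split(",")]
    return {
        "temp_c": int(temp),
        "util_pct": int(util.split()[0]),      # drop the trailing '%'
        "mem_used_mib": int(mem.split()[0]),   # drop the trailing 'MiB'
    }


# Illustrative readings, not real measurements.
sample = """temperature.gpu, utilization.gpu [%], memory.used [MiB]
71, 98 %, 6144 MiB
86, 100 %, 7920 MiB"""

rows = [parse_gpu_row(line) for line in sample.splitlines()[1:]]
hot = [r for r in rows if r["temp_c"] > TEMP_LIMIT_C]

for r in hot:
    print(f"hot sample: {r['temp_c']} C at {r['util_pct']}% utilization")
```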

Model Recommendations by Use Case

Coding assistance: Qwen3 14B or Llama 3.1 8B. Both are strong for code generation, debugging, and explanations. Qwen3 usually pulls ahead on harder reasoning.

Writing and editing: Gemma 3 12B is a good fit for creative writing and tone-sensitive edits.

RAG and document Q&A: Llama 3.1 8B with extended context (8192+). The base model supports up to 128K context, so quantized builds can still handle substantial documents if VRAM allows.

Multimodal tasks: Gemma 3 27B includes vision support for image understanding alongside text.

If you want more context on why teams are moving local, the piece on the offline-first AI movement adds useful background on privacy and cost tradeoffs.

Four-quadrant workspace showing coding, writing, RAG, and multimodal AI use cases

Start Here

Start here (your first step)
Install Ollama and run ollama pull llama3.1:8b-instruct-q4_K_M to get a working local LLM in under 5 minutes.

Quick wins (immediate impact)

Run ollama ps while chatting to confirm 100% GPU usage
Set num_ctx to 4096 (via /set parameter num_ctx 4096 or a Modelfile) to improve conversation memory without blowing up VRAM

Deep dive (for those who want more)

Install llama.cpp and tune --n-gpu-layers to find the best GPU/CPU split for your machine
Compare Q4_K_M vs Q6_K on your real tasks to find the quality/speed balance that fits your workflow

Useful Resources

Ollama GPU Hardware Support - Official documentation for GPU compatibility and configuration
llama.cpp Repository - Source code, build instructions, and advanced usage examples
Hugging Face Quantization Guide - Technical explanation of quantization methods and memory savings
Meta Llama 3.1 Announcement - Model capabilities, benchmarks, and download links
Google Gemma 3 Overview - Multimodal features and local deployment options

Wrapping Up

Running local LLMs on consumer GPUs mostly comes down to three things: VRAM, quantization, and context length. An 8 GB GPU can run 7B models comfortably at Q4 with a modest context window. A 24 GB card can handle 34B-class models at Q4. And the tooling is mature enough that you can go from zero to a working setup in minutes.

The bigger win isn’t only saving money vs API calls. Local inference keeps your data private, avoids round trips to external servers, works offline, and gives your team full control over how the model runs. For prototyping, sensitive data work, or learning how these systems behave, local setups remove a lot of dependencies and guesswork.

Topics

Local LLMs, GPU VRAM, Quantization, llama.cpp, AI Hardware

