The most popular discrete GPUs on the market right now ship with 8GB of VRAM — and that was fine when local AI meant tinkering with a 7B model for fun. It is not fine anymore.
Table of Contents
- What Actually Determines VRAM Usage?
- The VRAM Estimator: Calculate Your Own Requirements
- 70B Models in Practice: What Each GPU Tier Actually Gets You
- Quantization Deep Dive: How Low Can You Go Without Destroying Quality?
- CPU Offloading and Hybrid Inference: Making 16GB Work
- 2025-2026 GPU Buying Guide for Local AI
- What If You Only Run 7B to 13B Models?
- Plan for 16GB Minimum, Target 24GB
The most popular discrete GPUs on the market right now ship with 8GB of VRAM. The Steam Hardware Survey confirms it: 8GB cards dominate the installed base, with the majority of discrete GPUs sitting at 8GB or below. That was fine when local AI meant tinkering with a 7B model for fun.
It is not fine anymore.
70B-parameter models like Llama 3.1 70B, Qwen 2.5 72B, and Mixtral 8x22B now represent the sweet spot where local output quality approaches frontier API results. Your 8GB card cannot get you there. Not even close. And if you are about to drop several hundred dollars on a GPU upgrade, you need to understand exactly where the floor sits before you buy the wrong card.
A note on model sizes: DeepSeek-V2-Lite is a 16B-parameter Mixture-of-Experts model, not a 70B-class model. A better 70B-class example alongside Llama 3.1 70B and Qwen 2.5 72B would be DeepSeek-V2 (the full 236B MoE model with ~21B active parameters) or Mixtral 8x22B.
What Actually Determines VRAM Usage?
VRAM consumption during inference is not a single number. It is the sum of three distinct components, each with its own scaling behavior. Miss any one of them and your estimate will be wrong by gigabytes.
Model Parameters × Precision = Base Memory
The largest chunk of VRAM goes to storing model weights. The formula is straightforward:
VRAM_weights ≈ (num_params × bits_per_weight) / 8
That gives you bytes. But real-world quantized formats like GGUF include block metadata, scale factors, and zero-point values, so the actual size runs slightly larger than the raw bit calculation implies. When community references cite "Q4_K_M is approximately 4.8 effective bits per weight," that figure accounts for this overhead.
Here is what that looks like across model sizes and quantization levels (approximate weight memory only, in GB):
| Quant | Eff. Bits | 7B | 13B | 34B | 70B |
|---|---|---|---|---|---|
| FP16 | 16.0 | 14.0 | 26.0 | 68.0 | 140.0 |
| Q8_0 | 8.5 | 7.4 | 13.8 | 36.1 | 74.4 |
| Q5_K_M | 5.7 | 5.0 | 9.3 | 24.2 | 49.9 |
| Q4_K_M | ~4.8 | 4.2 | 7.8 | 20.4 | 42.0 |
| Q3_K_M | ~3.9 | 3.4 | 6.3 | 16.6 | 34.1 |
| Q2_K | ~3.0 | 2.6 | 4.9 | 12.8 | 26.3 |
Even at aggressive Q4_K_M quantization, 70B still demands roughly 42GB just for the weights on disk, and a bit more once loaded into memory with runtime structures.
The KV Cache Tax
This is the hidden VRAM consumer that catches people off guard. During autoregressive generation, the model stores key and value tensors for every layer at every token position in the context window. The formula:
KV_cache ≈ 2 × num_layers × num_kv_heads × head_dim × context_length × bytes_per_element
For architectures using Grouped Query Attention (GQA), like Llama 3.1 70B, the number of KV heads is much smaller than the number of query heads, which cuts this cost significantly. With 80 layers, 8 KV heads, a head dimension of 128, and FP16 storage, an 8K context window costs roughly 2.5GB.
But context scales linearly. Push to 32K and that becomes around 10GB. At 128K, you are looking at approximately 40GB for the KV cache alone. Some backends support quantized KV caches (FP8 or lower), which can cut this, but FP16 remains the common default.
This is the hidden VRAM consumer that catches people off guard. During autoregressive generation, the model stores key and value tensors for every layer at every token position in the context window.
Overhead You Forget: CUDA Context, Activations, and OS
Beyond weights and KV cache, you need to budget for runtime overhead. The CUDA context and driver allocations typically eat several hundred megabytes just by initializing a GPU compute session. llama.cpp allocates compute buffers and scratch space that scale with batch size. Your OS and display compositor also reserve VRAM, especially on Windows.
Budget 500MB to 1GB of overhead on top of your calculated totals. It sounds small. It adds up.
Here is a Python function that puts all three components together:
def estimate_vram_gb(
num_params_billion: float,
bits_per_weight: float,
num_layers: int,
num_kv_heads: int,
head_dim: int,
context_length: int,
kv_bytes_per_element: int = 2, # 2 for FP16, 1 for FP8
overhead_gb: float = 0.8,
) -> dict:
"""Estimate total VRAM for inference given model architecture and quantization.
Returns a breakdown dict with weight, KV cache, overhead, and total in GB.
"""
weight_gb = (num_params_billion * 1e9 * bits_per_weight / 8) / (1024**3)
kv_cache_bytes = (
2 * num_layers * num_kv_heads * head_dim * context_length * kv_bytes_per_element
)
kv_cache_gb = kv_cache_bytes / (1024**3)
total_gb = weight_gb + kv_cache_gb + overhead_gb
return {
"weights_gb": round(weight_gb, 2),
"kv_cache_gb": round(kv_cache_gb, 2),
"overhead_gb": round(overhead_gb, 2),
"total_gb": round(total_gb, 2),
}
# Example: Llama 3.1 70B at Q4_K_M (~4.8 bits), 8K context
# 80 layers, 8 KV heads (GQA), head_dim=128
result = estimate_vram_gb(
num_params_billion=70,
bits_per_weight=4.8,
num_layers=80,
num_kv_heads=8,
head_dim=128,
context_length=8192,
)
print(result)
# {'weights_gb': 39.12, 'kv_cache_gb': 2.5, 'overhead_gb': 0.8, 'total_gb': 42.42}
# weight_gb = (70e9 * 4.8 / 8) / 1024^3 ≈ 39.12 GB
# kv_cache_gb = (2 * 80 * 8 * 128 * 8192 * 2) / 1024^3 ≈ 2.50 GB
That ~42.4GB total for a 70B Q4_K_M model at just 8K context tells you immediately: no single consumer GPU under 48GB can hold this entirely in VRAM.
The VRAM Estimator: Calculate Your Own Requirements
The Python function above captures the core math, but a browser-based tool makes it easier when you are planning a GPU purchase. The estimator below takes your model parameters, quantization level, and context length, then shows a stacked breakdown of where your VRAM goes. Preset buttons for common GPU tiers draw a budget line so you can see whether your configuration fits at a glance.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>VRAM Estimator for LLM Inference</title>
<style>
body {
font-family: system-ui, sans-serif;
max-width: 680px;
margin: 2rem auto;
padding: 0 1rem;
}
label {
display: block;
margin-top: 0.7rem;
font-weight: 600;
}
select, input {
padding: 0.4rem;
width: 100%;
box-sizing: border-box;
margin-top: 0.2rem;
}
.bar-container {
background: #eee;
border-radius: 6px;
overflow: hidden;
height: 36px;
margin-top: 1rem;
position: relative;
}
.bar-segment {
height: 100%;
float: left;
text-align: center;
color: #fff;
font-size: 0.75rem;
line-height: 36px;
overflow: hidden;
}
.budget-line {
position: absolute;
top: 0;
height: 100%;
border-right: 3px solid red;
z-index: 2;
}
.budget-label {
position: absolute;
top: -1.3rem;
color: red;
font-size: 0.75rem;
font-weight: 700;
white-space: nowrap;
}
#result {
margin-top: 1rem;
font-size: 1.05rem;
}
.gpu-btns {
margin-top: 0.8rem;
}
.gpu-btns button {
margin: 0.2rem;
padding: 0.4rem 0.8rem;
cursor: pointer;
}
</style>
</head>
<body>
<h2>VRAM Estimator for LLM Inference</h2>
<label>
Model preset:
<select id="preset" onchange="applyPreset()">
<option value="">-- Manual --</option>
<option value="70,4.8,80,8,128">Llama 3.1 70B Q4_K_M</option>
<option value="70,3.9,80,8,128">Llama 3.1 70B Q3_K_M</option>
<option value="8,4.8,32,8,128">Llama 3.1 8B Q4_K_M</option>
<option value="72,4.8,80,8,128">Qwen 2.5 72B Q4_K_M</option>
<option value="34,4.8,60,8,128">CodeLlama 34B Q4_K_M</option>
</select>
</label>
<label>Parameters (B): <input type="number" id="params" value="70" step="0.1"></label>
<label>Bits per weight: <input type="number" id="bits" value="4.8" step="0.1"></label>
<label>Layers: <input type="number" id="layers" value="80"></label>
<label>KV heads: <input type="number" id="kvheads" value="8"></label>
<label>Head dim: <input type="number" id="headdim" value="128"></label>
<label>Context length: <input type="number" id="ctx" value="8192" step="1024"></label>
<label>Overhead (GB): <input type="number" id="overhead" value="0.8" step="0.1"></label>
<button
onclick="calc()"
style="margin-top:1rem;padding:0.5rem 1.5rem;font-size:1rem;"
>Estimate</button>
<div class="gpu-btns">
GPU budget line:
<button onclick="setBudget(8)">8 GB</button>
<button onclick="setBudget(16)">16 GB</button>
<button onclick="setBudget(24)">24 GB</button>
<button onclick="setBudget(48)">48 GB</button>
<button onclick="setBudget(64)">M4 Max 64 GB</button>
<button onclick="setBudget(128)">M4 Max 128 GB</button>
</div>
<div class="bar-container" id="bar"></div>
<div id="result"></div>
<script>
let budgetGB = 24;
function setBudget(gb) {
budgetGB = gb;
calc();
}
function applyPreset() {
const v = document.getElementById("preset").value;
if (!v) return;
const p = v.split(",");
document.getElementById("params").value = p[0];
document.getElementById("bits").value = p[1];
document.getElementById("layers").value = p[2];
document.getElementById("kvheads").value = p[3];
document.getElementById("headdim").value = p[4];
calc();
}
function calc() {
const P = parseFloat(document.getElementById("params").value);
const B = parseFloat(document.getElementById("bits").value);
const L = parseInt(document.getElementById("layers").value, 10);
const K = parseInt(document.getElementById("kvheads").value, 10);
const D = parseInt(document.getElementById("headdim").value, 10);
const C = parseInt(document.getElementById("ctx").value, 10);
const O = parseFloat(document.getElementById("overhead").value);
const GiB = Math.pow(1024, 3);
const wGB = (P * 1e9 * B / 8) / GiB;
const kvGB = (2 * L * K * D * C * 2) / GiB;
const total = wGB + kvGB + O;
const max = Math.max(total, budgetGB) * 1.15;
const pW = (wGB / max * 100).toFixed(1);
const pK = (kvGB / max * 100).toFixed(1);
const pO = (O / max * 100).toFixed(1);
const bL = (budgetGB / max * 100).toFixed(1);
document.getElementById("bar").innerHTML =
'<div class="bar-segment" style="width:' + pW + '%;background:#3b82f6;" title="Weights">' +
wGB.toFixed(1) + "G wt</div>" +
'<div class="bar-segment" style="width:' + pK + '%;background:#f59e0b;" title="KV Cache">' +
kvGB.toFixed(1) + "G kv</div>" +
'<div class="bar-segment" style="width:' + pO + '%;background:#6b7280;" title="Overhead">' +
O.toFixed(1) + "G oh</div>" +
'<div class="budget-line" style="left:' + bL + '%;">' +
'<span class="budget-label">' + budgetGB + "GB GPU</span>" +
"</div>";
const fits = total <= budgetGB ? "✅ Fits" : "❌ Does NOT fit";
document.getElementById("result").innerHTML =
"<strong>Total: " + total.toFixed(2) + " GB</strong> " +
"(Weights " + wGB.toFixed(2) + " + KV " + kvGB.toFixed(2) + " + Overhead " + O.toFixed(1) + ") — " +
fits + " in " + budgetGB + " GB";
}
calc();
</script>
</body>
</html>
Quick Reference Table
Pre-calculated estimates for the most common configurations (8K context, FP16 KV cache, ~0.8GB overhead):
| Model | Quant | Context | Est. VRAM | 8GB? | 16GB? | 24GB? |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 8K | ~5.8 GB | ✅ | ✅ | ✅ |
| Llama 3.1 8B | Q4_K_M | 32K | ~8.5 GB | ❌ | ✅ | ✅ |
| Llama 2 13B | Q4_K_M | 8K | ~9.5 GB | ❌ | ✅ | ✅ |
| CodeLlama 34B | Q4_K_M | 8K | ~22.5 GB | ❌ | ❌ | ✅ |
| Llama 3.1 70B | Q4_K_M | 8K | ~42.4 GB | ❌ | ❌ | ❌ |
| Llama 3.1 70B | Q3_K_M | 8K | ~36.1 GB | ❌ | ❌ | ❌ |
| Llama 3.1 70B | Q4_K_M | 32K | ~52.1 GB | ❌ | ❌ | ❌ |
A note on Llama 3.1 model sizes: The Llama 3.1 family shipped in 8B, 70B, and 405B sizes. There is no official Llama 3.1 13B. The 13B row above uses Llama 2 13B architecture parameters (layer count, KV heads). Adjust the preset values to match your actual target model.
The 70B row tells the whole story. Even at the most common quantization, you need 40+ GB. No single consumer GPU holds that without offloading.
70B Models in Practice: What Each GPU Tier Actually Gets You
8GB (RTX 4060, RX 7600, RTX 3070): The Poverty Line
An 8GB card handles 7B Q4_K_M comfortably at up to about 8K context with room to spare, and can stretch to 32K if you keep overhead tight. A 13B Q4 model at 8K context sits just above the 8GB line, making it risky.
Anything 30B or above? You are offloading nearly every layer to the CPU, pulling 1 to 3 tokens per second on typical hardware. That is not a usable interactive experience.
Verdict: adequate for small model experimentation, completely inadequate for 70B work.
16GB (RTX 4060 Ti 16GB, RX 7800 XT)
This is the minimum viable tier for touching 70B models. You will not fit the full model, but you can offload 20 to 30 of the 80 transformer layers onto the GPU while the CPU handles the rest. Each layer running on the GPU contributes proportionally to total throughput, so partial offload can lift you from 1-3 tok/s (pure CPU) into a range you can actually work with.
A 13B model at 32K context fits entirely. A 34B Q4 model at 8K context is feasible with full GPU offload. Apple's M4 Pro at 24GB unified memory sits in a similar ballpark, with the advantage that unified memory eliminates the PCIe transfer bottleneck.
24GB (RTX 4090, RTX 3090)
With 24GB, you can offload roughly 40 to 45 layers of a 70B Q4_K_M model, which improves generation speed substantially over the 16GB tier. A 34B Q4 model at 32K context fits comfortably. Two 24GB cards can hold a 70B Q4 model entirely in VRAM.
This is the comfortable tier for 70B work, with caveats around context length.
48GB+ (RTX A6000, L40S, Dual 24GB, Apple Silicon with High Unified Memory)
A 70B Q4_K_M model at 8K context (~42.7GB) fits in a single 48GB card with headroom for longer context. Apple Silicon configurations with large unified memory (M2 Ultra with up to 192GB, M4 Max with up to 128GB, M3/M4 Ultra) can hold 70B models entirely, with high memory bandwidth and no PCIe bottleneck.
No compromises. No offloading dance.
Here are the llama.cpp commands for each scenario:
# 16GB GPU — partial offload, 25 of 80 layers on GPU
./llama-cli -m llama-3.1-70b-q4_k_m.gguf \
-ngl 25 -c 8192 --mlock \
-p "Explain the theory of relativity in simple terms."
# Expected: ~15-18GB VRAM used (25 layers of weights + partial KV), rest in system RAM
# Requires 32GB+ system RAM for remaining layers
# 48GB GPU — full offload, all 80 layers on GPU
# Use -ngl 81 (layers + 1) to also offload the output layer
./llama-cli -m llama-3.1-70b-q4_k_m.gguf \
-ngl 81 -c 8192 \
-p "Explain the theory of relativity in simple terms."
# Expected: ~43GB VRAM used, fastest inference
# 16GB GPU, conservative context with memory pinning
./llama-cli -m llama-3.1-70b-q4_k_m.gguf \
-ngl 20 -c 4096 --mlock \
-p "Summarize the following document:"
# --mlock pins model weight pages in system RAM, preventing swap thrashing
Quantization Deep Dive: How Low Can You Go Without Destroying Quality?
GGUF Quant Tiers
GGUF quantization in llama.cpp ranges from Q2_K (roughly 3 effective bits with overhead) through Q8_0 (roughly 8.5 bits). Community perplexity benchmarks consistently show a gradual quality degradation curve as you move down the quant ladder.
The drop from Q8_0 to Q5_K_M is nearly imperceptible for most tasks. From Q5_K_M to Q4_K_M, there is a small but measurable perplexity increase that rarely affects practical output quality. Q4_K_M has become the community consensus sweet spot for 70B models: the best tradeoff between size, speed, and coherence.
GPTQ, AWQ, EXL2, and GGUF
These are different quantization ecosystems targeting different inference backends.
GGUF is the native format for llama.cpp and works across CUDA, Metal, ROCm, Vulkan, and other backends. GPTQ and AWQ are primarily used with GPU-only inference stacks like vLLM, AutoGPTQ, and transformers, where all weights must reside in VRAM. EXL2 (used by ExLlamaV2) offers variable bit-rate quantization, letting important layers retain higher precision while compressing less critical ones. It often achieves better quality at the same average bit-rate.
For the partial-offload workflow that makes 16GB GPUs viable, GGUF with llama.cpp is the practical choice. It is the only major format with mature CPU/GPU hybrid inference support.
The Cliff Below Q3
Below Q3_K_M, quality falls off a cliff. Q2_K cuts the model size to roughly 26GB for 70B parameters, but perplexity increases substantially and coherence on complex reasoning tasks suffers in ways you will notice immediately.
At that point, you are often better served by a 13B model at Q8_0 (~13.8GB), which fits comfortably in 16GB VRAM and frequently produces more coherent output than a 70B at Q2_K.
The lesson: quantization has a floor below which you are destroying the capability you wanted the larger model for in the first place.
CPU Offloading and Hybrid Inference: Making 16GB Work
How Layer Offloading Works in llama.cpp
When you set -ngl 25 for an 80-layer model, llama.cpp places 25 transformer layers (their weights and corresponding KV cache slices) in GPU VRAM. The remaining 55 layers stay in system RAM and run on the CPU. During each forward pass, execution bounces between GPU and CPU. The GPU-resident layers run at full accelerated speed. CPU layers run at whatever throughput your system RAM bandwidth allows.
Performance scales roughly in proportion to the fraction of layers on GPU, though PCIe transfer latency adds overhead at each boundary crossing.
System RAM: The Other Bottleneck
The weight memory for layers not on the GPU must live in system RAM, along with any KV cache spillover. For a 70B Q4_K_M model with 55 layers on CPU, that is roughly 29GB of weight data plus KV cache for those layers. Adding OS overhead, 32GB of system RAM is the bare minimum and will leave you tight.
64GB is what I would actually recommend for comfortable 70B hybrid inference. Use the --mlock flag to prevent the OS from paging model data to disk. If your model weights hit swap, performance goes from slow to unusable.
Real-World Token Generation Speeds
Exact throughput varies with CPU model, RAM speed, PCIe generation, and GPU. But approximate ranges based on community benchmarks give useful guidance:
| Configuration | Approx. tok/s |
|---|---|
| 70B Q4_K_M, full CPU (no GPU) | 1 to 3 |
| 70B Q4_K_M, 16GB GPU, ~25 layers GPU | 4 to 8 |
| 70B Q4_K_M, 24GB GPU, ~45 layers GPU | 8 to 15 |
| 70B Q4_K_M, 48GB GPU, full offload | 15 to 25 |
These depend heavily on your specific hardware. The key takeaway: partial offload on 16GB is not fast, but it is functional for interactive use. Pure CPU inference is painful.
2025-2026 GPU Buying Guide for Local AI
The 16GB Floor Is Real
If you are buying a GPU for local AI, 16GB of VRAM is the floor worth considering. Below that line, you are locked out of 30B+ models and constrained to short context windows on 13B models.
For budget buyers, the RTX 4060 Ti 16GB remains the most accessible 16GB option on the NVIDIA side. AMD's RX 7800 XT with 16GB competes well on price. Both support the inference backends that matter.
If you can swing mid-range, a used RTX 3090 with 24GB has become genuinely attractive. The RTX 4090 opens up substantially more headroom for partial 70B offload and comfortable 34B operation.
For high-end builds, professional cards with 48GB (RTX A6000, L40S) or dual-GPU setups hold 70B models entirely in VRAM with no offloading overhead.
On the Apple side, the M4 Pro with 24GB unified memory is the minimum for serious work. The M4 Max at 64GB or 128GB holds 70B models entirely, with high memory bandwidth and zero PCIe bottleneck.
Upcoming GPU generations will shift these recommendations. Treat any specifications for unannounced cards as rumor until the manufacturer confirms them.
Memory Bandwidth Matters Too
VRAM capacity gets you through the door. Memory bandwidth determines how fast you walk through it.
Token generation speed during inference is fundamentally bounded by how quickly you can read model weights from memory. HBM2e on professional cards delivers over 1.5 TB/s (HBM3/HBM3e on newer data center GPUs pushes even higher). GDDR6X on consumer cards like the RTX 4090 provides around 1 TB/s. Apple Silicon unified memory on the M4 Max delivers around 550 GB/s but avoids the PCIe copy overhead entirely.
An 8GB card with high bandwidth will still be faster at small models than a 16GB card with slow memory. Consider both dimensions. And verify that your chosen GPU has mature backend support: NVIDIA's CUDA ecosystem is the most mature, AMD's ROCm has improved substantially, and Apple's Metal backend in llama.cpp is well-optimized for M-series chips.
What If You Only Run 7B to 13B Models?
If your workload genuinely tops out at 13B parameters, an 8GB card is serviceable for 7B Q4 at moderate context lengths (8K to 16K). Push to 32K context and the KV cache alone can eat several gigabytes, which may exceed 8GB when combined with weights. A 16GB card gives you comfortable headroom for 13B models at long context, which is the other major sweet spot in local inference.
The quality gap between 13B and 70B models remains significant for complex reasoning, code generation, and nuanced instruction following. Staying at 13B-class models to save money on a GPU is a reasonable short-term call. Just know what you are giving up.
Plan for 16GB Minimum, Target 24GB
The math does not lie. A 70B model at Q4_K_M quantization requires approximately 42GB just to hold weights, KV cache, and overhead at 8K context. No 8GB or 16GB card will ever hold that in full. But 16GB of VRAM gives you partial offload capability that transforms a 70B model from unusable to functional, and it lets you run 13B and 34B model classes comfortably. That makes 16GB the VRAM poverty line for serious local AI work in 2026.
If you can stretch your budget, 24GB is the comfort zone: enough to offload a meaningful majority of 70B layers and run 34B models at generous context lengths without compromise. And if you are building a workstation specifically for AI, 48GB or more eliminates the offloading question entirely.
Before your next GPU purchase, run the estimator above with your target model, your preferred quantization level, and the context length your workload demands. The three numbers you need: weight memory, KV cache, and overhead. Add them up, compare to your GPU's VRAM, and you will know exactly where you stand.
Eight gigabytes is for gaming. Sixteen is for working. Twenty-four is for working comfortably. Buy accordingly.

