Self-hosted LLM costs have shifted substantially since 2024. Next-generation GPUs became available, secondary hardware markets drove prices down, and electricity rates changed across major hosting regions. For engineers and technical decision-makers evaluating whether to run open-weight models on infrastructure they control, the calculus depends on a set of measurable variables that this guide breaks down.

What Does It Actually Cost to Self-Host an LLM in 2026?

Three cost tiers define the decision. Personal or hobbyist hardware suits experimentation and low-throughput workloads. Dedicated servers and colocation arrangements serve production inference at predictable cost. Cloud GPU instances offer flexibility but carry premium hourly rates. Each tier has a distinct total cost of ownership (TCO) profile, and the right choice depends on token volume, latency requirements, data residency constraints, and available engineering capacity.

The goal here is straightforward: provide real dollar figures for each tier, a framework for calculating TCO, and a clear break-even analysis against API providers. The TCO formula and usage tables below allow teams to plug in their own variables and model outcomes before committing budget.

Hardware Costs: Cloud GPU vs. Dedicated Server vs. Personal Rig

Cloud GPU Instances (On-Demand and Reserved)

The 2026 cloud GPU market reflects NVIDIA shipping Blackwell-generation hardware alongside still-available Hopper-class instances. On-demand pricing for NVIDIA H200 instances (141 GB HBM3e) runs approximately $4.50 to $6.00 per hour on AWS, GCP, and Azure, with Lambda Labs and CoreWeave offering rates between $3.50 and $4.50 per hour for equivalent configurations. NVIDIA B200 instances, now available in limited regions, command $6.00 to $8.50 per hour on-demand from major hyperscalers, while GB200 NVL2 configurations (2-GPU module; confirm 384 GB combined HBM3e per NVIDIA product page before procurement) price between $10.00 and $14.00 per hour on-demand.

Note: All cloud GPU prices cited here reflect estimates as of mid-2026. Rates change frequently—consult AWS, GCP, Azure, CoreWeave, and Lambda Labs pricing pages directly before making procurement decisions. On-demand H200 and B200 instance availability is capacity-constrained; production workloads should use reserved instances or maintain fallback capacity.

Reserved pricing changes the picture. One-year commitments typically reduce rates by 30% to 40%, and three-year reservations by 50% to 60%. For running a 70B-parameter model at moderate throughput (~30 to 50 tokens per second for concurrent users), an H200 instance suffices. At a reserved one-year rate of around $2.80 per hour on CoreWeave, monthly cost lands at $2,016 for continuous operation.
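The reserved-rate arithmetic above can be sketched as a small helper. This is a minimal illustration, assuming a 720-hour month (30 days of continuous operation). The function names are hypothetical, and the $4.00 on-demand rate and 30% discount are example inputs picked from the ranges quoted above, not live prices.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours of continuous operation


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD for an instance billed at hourly_rate."""
    return hourly_rate * hours


def reserved_rate(on_demand_rate: float, discount: float) -> float:
    """Hourly rate after a reservation discount (0.30 means 30% off)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_rate * (1 - discount)


# Reserved one-year H200 rate quoted above: ~$2.80/hour.
print(f"${monthly_cost(2.80):,.0f} per month")  # $2,016 per month

# Hypothetical inputs: $4.00/hour on-demand with a 30% one-year discount.
print(f"${reserved_rate(4.00, 0.30):.2f} per hour")  # $2.80 per hour
```

Swapping in a three-year discount (50% to 60%) or a different provider's on-demand rate shows how sensitive the monthly figure is to contract length.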

Dedicated Server and Colocation

Bare-metal GPU server rentals from providers like Hetzner, OVH, and specialized GPU hosts offer a middle path. Monthly pricing for a server with a single NVIDIA A6000 Ada (48 GB VRAM) runs $400 to $700, while configurations with dual or quad H100 SXM GPUs range from $4,000 to $10,000 per month depending on provider and contract length.

Colocation introduces different math. Rack space in Tier 3 data centers costs $200 to $500 per month per 1U to 4U of space, plus metered power and network fees. You purchase GPUs outright on the capital expenditure path: NVIDIA RTX 5090 cards sit around $2,000 to $2,500 retail, A6000 Ada cards around $4,500 to $5,500, and used H100 SXM GPUs on the secondary market between $15,000 and $20,000 (down significantly from $25,000+ in early 2025). Secondary-market GPUs carry no manufacturer warranty; budget for burn-in testing and potential early failure replacement. Amortized over three years, a dual-H100 setup's hardware cost alone breaks down to ~$900 to $1,200 per month before power and colocation fees.
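As a rough check on the amortization figures, the sketch below applies straight-line amortization to the GPU prices quoted above. It covers the two GPUs only; a chassis, CPU, memory, and networking gear would add to these numbers. The function name is hypothetical.

```python
def monthly_amortization(purchase_price: float, months: int = 36) -> float:
    """Straight-line amortization in USD per month."""
    if months <= 0:
        raise ValueError("months must be > 0")
    return purchase_price / months


# Two used H100 SXM GPUs at the quoted $15,000 to $20,000 each (GPUs only).
for price_per_gpu in (15_000, 20_000):
    total = 2 * price_per_gpu
    print(f"${total:,} -> ${monthly_amortization(total):,.0f}/month")
# $30,000 -> $833/month
# $40,000 -> $1,111/month
```

A shorter amortization window (for example 24 months, if you expect to replace the cards sooner) raises the monthly figure proportionally.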

Personal and On-Premises Hardware

Consumer-tier self-hosting now handles models that were impractical on local hardware in 2024. The NVIDIA RTX 5090 (32 GB GDDR7) runs quantized models up to approximately 34B parameters at 4-bit quantization, producing around 30 to 50 tokens per second at that model size. Dual-GPU setups push that to around 70B at 4-bit quantization. Because NVLink is unavailable on consumer RTX GPUs, the cards communicate over PCIe and rely on software-level model sharding (e.g., llama.cpp layer splitting across GPUs). The Apple Mac Studio with M4 Ultra and 192 GB unified memory can run quantized 70B models and even 120B+ parameter models at 4-bit quantization (~65 GB of weights) with memory to spare; expect 2 to 5 tokens per second at that size. Discrete dual-GPU setups outperform the Mac Studio on raw throughput at 70B by roughly 5 to 10x, but the Mac Studio handles larger models that simply don't fit in 64 GB of GDDR7.
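A quick way to sanity-check which models fit on which hardware is to estimate weight memory from parameter count and bits per weight. The sketch below is a rough rule of thumb, not a measurement: the 1.2 overhead factor for KV cache and runtime buffers is an assumption, real quantization formats store slightly more than their nominal bit width, and both function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.2) -> bool:
    """True if weights times an assumed overhead factor fit in memory_gb."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= memory_gb


# Memory sizes taken from the hardware discussed above.
configs = {"RTX 5090": 32, "2x RTX 5090": 64, "Mac Studio 192 GB": 192}

for params in (34, 70, 120):
    needed = weight_memory_gb(params, 4)
    ok = [name for name, mem in configs.items() if fits(params, 4, mem)]
    print(f"{params}B @ 4-bit: ~{needed:.0f} GB weights, fits on: {', '.join(ok)}")
```

Longer context windows and larger batch sizes grow the KV cache, so raise the overhead factor when planning for concurrent users.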

Upfront costs start at ~$2,000 for a single RTX 5090 build, rise to $5,000 to $8,000 for a dual-GPU workstation, and reach $8,000 to $15,000+ for a maxed-out Mac Studio configuration.

| Tier | GPU/Hardware | VRAM/Memory | Upfront Cost | Monthly Cost | Max Model Size |
|---|---|---|---|---|---|
| Cloud GPU (Reserved 1yr) | NVIDIA H200 | 141 GB HBM3e | $0 | ~$2,016 | 70B (FP16) |
| Cloud GPU (On-Demand) | NVIDIA B200 | 192 GB HBM3e | $0 | ~$4,320–$6,120 | 120B+ (FP16) |
| Dedicated Server | Dual H100 SXM | 160 GB HBM3 | $0 | ~$6,000–$10,000 | 70B (FP16) |
| Colocation (Owned) | Dual H100 SXM | 160 GB HBM3 | ~$35,000–$45,000 | ~$600–$1,000 (power/space, excl. network egress) | 70B (FP16) |
| Personal Rig | RTX 5090 x2 | 64 GB GDDR7 | ~$6,000–$8,000 | ~$30–$80 (electricity) | 70B (Q4) |
| Personal Rig | Mac Studio M4 Ultra | 192 GB unified | ~$8,000–$15,000 | ~$15–$40 (electricity) | 120B+ (Q4) |
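Because the tiers mix upfront and recurring costs, it can help to compare them over a fixed horizon. The sketch below sums upfront cost and 36 months of recurring cost using the low and high bounds from the table. It is an illustration only: the 36-month horizon is an assumption, staffing and hardware replacement are excluded, and the function name is hypothetical.

```python
def cumulative_cost(upfront: float, monthly: float, months: int = 36) -> float:
    """Upfront cost plus recurring monthly cost over the horizon, in USD."""
    return upfront + monthly * months


# (upfront_low, upfront_high, monthly_low, monthly_high) from the table above.
tiers = {
    "Cloud H200 (reserved 1yr)": (0, 0, 2_016, 2_016),
    "Colocation (owned dual H100)": (35_000, 45_000, 600, 1_000),
    "Personal rig (RTX 5090 x2)": (6_000, 8_000, 30, 80),
    "Personal rig (Mac Studio)": (8_000, 15_000, 15, 40),
}

for name, (up_lo, up_hi, mo_lo, mo_hi) in tiers.items():
    low = cumulative_cost(up_lo, mo_lo)
    high = cumulative_cost(up_hi, mo_hi)
    print(f"{name}: ${low:,.0f} to ${high:,.0f} over 36 months")
```

These totals say nothing about throughput, so they are only comparable between tiers that can serve the same workload.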

The Hidden Costs: Electricity, Cooling, and Maintenance

Electricity and Cooling

Power consumption under inference load varies significantly by GPU. An H100 SXM has a TDP of 700 W; actual draw ranges from around 400 to 500 W at partial utilization up to the full 700 W under sustained maximum throughput. For worst-case TCO planning, use 700 W per GPU. A B200 ranges from 600 to 1,000 watts depending on workload density. Consumer RTX 5090 cards draw ~300 to 450 watts under inference load, below their 575 W rated board power.

At the U.S. average commercial electricity rate of approximately $0.13 per kWh (and $0.25 to $0.35 per kWh in Western Europe), running two H100 GPUs at an average 500 watts each yields 720 kWh per month for the pair (360 kWh per GPU), costing about $47 per GPU or $93.60 per month for both at U.S. rates. Applying a PUE of 1.4 increases that to around $131 per month. Home cooling overhead effectively increases electricity cost by 50 to 80% during warm months, analogous to a PUE of 1.5 to 1.8.

Electricity math check: 2 GPUs x 500 W average x 720 hours/month (24 hr x 30 days) = 720,000 Wh = 720 kWh total for both GPUs. At $0.13/kWh, that is $93.60 for both, or $46.80 per GPU. With a PUE of 1.4: 720 kWh x 1.4 = 1,008 kWh x $0.13 = $131.04. When modeling your own TCO, calculate from your measured GPU wattage and local electricity rate rather than using these approximate figures directly.
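The same calculation can be wrapped in a helper so that measured wattage and local rates are easy to substitute. This is a minimal sketch assuming a 720-hour month; the function name and parameter names are hypothetical, and the inputs are the approximate figures quoted above.

```python
def monthly_electricity_cost(gpu_watts: float, num_gpus: int,
                             rate_per_kwh: float, pue: float = 1.0,
                             hours: float = 720) -> float:
    """Monthly electricity cost in USD, scaled by facility overhead (PUE)."""
    kwh = gpu_watts * num_gpus * hours / 1000 * pue
    return kwh * rate_per_kwh


# Two H100s at 500 W average, U.S. rate, no facility overhead.
print(f"${monthly_electricity_cost(500, 2, 0.13):.2f}")           # $93.60
# Same load with a PUE of 1.4.
print(f"${monthly_electricity_cost(500, 2, 0.13, pue=1.4):.2f}")  # $131.04
# Worst-case planning at 700 W per GPU with a PUE of 1.4.
print(f"${monthly_electricity_cost(700, 2, 0.13, pue=1.4):.2f}")  # $183.46
```

At Western European rates of $0.25 to $0.35 per kWh, the 500 W case with a PUE of 1.4 works out to roughly $252 to $353 per month.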

Maintenance, Staffing, and Software

The inference software stack itself is free and open source. vLLM, Hugging Face Text Generation Inference (TGI), and llama.cpp carry no licensing costs but demand engineering expertise to deploy, optimize, and maintain. Observability tooling (Prometheus, Grafana, or hosted alternatives) adds $0 to $200 per month depending on scale and chosen tools.

The real hidden cost is human time. Deploying, monitoring, patching, updating models, and responding to incidents require DevOps or MLOps attention. For a production system, allocating 20% to 30% of a senior engineer's time translates to roughly $3,000 to $6,000 per month in staffing cost. The low end reflects Eastern Europe or APAC compensation levels; the high end reflects U.S. tier-1 metro fully loaded rates. Teams that underestimate this line item consistently blow their TCO projections.
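The staffing line item is simple to compute but easy to leave out. The sketch below converts a fractional allocation into a monthly cost. The $192,000 fully loaded annual figure is the example salary used later in this guide; the function name is hypothetical.

```python
def monthly_staffing_cost(fte_fraction: float, annual_fully_loaded: float) -> float:
    """Monthly cost in USD of a fractional allocation of one engineer."""
    if not 0 <= fte_fraction <= 1:
        raise ValueError("fte_fraction must be between 0 and 1")
    return fte_fraction * annual_fully_loaded / 12


# 20%, 25%, and 30% of an engineer costing $192,000/year fully loaded.
for fraction in (0.20, 0.25, 0.30):
    cost = monthly_staffing_cost(fraction, 192_000)
    print(f"{fraction:.0%} allocation -> ${cost:,.0f}/month")
# 20% allocation -> $3,200/month
# 25% allocation -> $4,000/month
# 30% allocation -> $4,800/month
```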

Self-Hosted LLM Costs vs. API Pricing in 2026

Current API Pricing Snapshot

API pricing as of mid-2026 spans a wide range. OpenAI GPT-4.1 charges approximately $2.00 per 1M input tokens and $8.00 per 1M output tokens. Anthropic Claude 4 Sonnet sits at ~$3.00/$15.00 (input/output per 1M tokens). Google Gemini 2.5 Pro prices at around $1.25/$10.00. Open-model API providers undercut these substantially: Together AI and Fireworks serve Llama 4 70B at ~$0.20 to $0.60 per 1M tokens (blended), and Groq delivers inference on optimized hardware at similar rates with notably lower latency.

Note: All API prices cited here are point-in-time estimates. Verify at each provider's pricing page before making decisions: platform.openai.com/docs/pricing, anthropic.com/pricing, ai.google.dev/pricing, together.ai/pricing.

Translating these into monthly budgets at defined usage levels:

| Usage Level | Tokens/Day* | OpenAI GPT-4.1 | Claude 4 Sonnet | Together AI (Llama 4 70B) | Self-Hosted H200 (Reserved) | Self-Hosted Colo (Owned H100x2) |
|---|---|---|---|---|---|---|
| Low | 1M | ~$150–$300/mo | ~$270–$540/mo | ~$6–$18/mo | ~$2,016/mo | ~$1,500–$1,800/mo |
| Medium | 10M | ~$1,500–$3,000/mo | ~$2,700–$5,400/mo | ~$60–$180/mo | ~$2,016/mo | ~$1,500–$1,800/mo |
| High | 100M+ | ~$15,000–$30,000/mo | ~$27,000–$54,000/mo | ~$600–$1,800/mo | ~$2,016/mo | ~$1,500–$1,800/mo |

*Assumes approximately 70% input / 30% output token ratio. Adjust API cost estimates based on your actual token mix, as most providers charge different rates for input and output tokens.
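To adjust for your own token mix, you can compute a blended per-token price from the separate input and output rates. The sketch below uses the list prices quoted above and a 30-day month; both function names are hypothetical.

```python
def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.70) -> float:
    """Blended USD per 1M tokens for a given input/output token mix."""
    if not 0 <= input_share <= 1:
        raise ValueError("input_share must be between 0 and 1")
    return input_price * input_share + output_price * (1 - input_share)


def monthly_api_cost(tokens_per_day_millions: float,
                     price_per_million: float, days: int = 30) -> float:
    """Monthly API spend in USD at a flat blended price per 1M tokens."""
    return tokens_per_day_millions * days * price_per_million


# Blended prices at the assumed 70% input / 30% output mix.
print(f"GPT-4.1:         ${blended_price(2.00, 8.00):.2f} per 1M")   # $3.80
print(f"Claude 4 Sonnet: ${blended_price(3.00, 15.00):.2f} per 1M")  # $6.60
print(f"Gemini 2.5 Pro:  ${blended_price(1.25, 10.00):.2f} per 1M")  # $3.88

# Open-model provider at $0.20 to $0.60 blended, 1M tokens per day.
print(f"${monthly_api_cost(1, 0.20):.0f} to ${monthly_api_cost(1, 0.60):.0f} per month")  # $6 to $18
```

A strict 70/30 blend of the list prices lands somewhat below the table's rounded GPT-4.1 and Claude ranges, so recompute with your own mix before budgeting.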

Break-Even Analysis

The crossover point depends heavily on which API serves as the comparison. Against GPT-4.1 tier pricing, a reserved cloud GPU at ~$2,016 per month breaks even on instance cost at roughly 7M to 13M tokens per day, the point where the table's $150 to $300 per month per 1M daily tokens matches the flat instance cost; staffing and tooling push the crossover higher. That range widens or narrows based on your input/output token ratio and whether you use reserved or on-demand instances; verify against your actual mix using the TCO formula below. Against open-model API providers like Together AI, the break-even shifts dramatically higher, often to 50M+ tokens per day, because those providers already operate optimized infrastructure at scale with thin margins.

Self-hosting has advantages you won't see in per-token cost alone: full data residency control, eliminated wide-area network latency for on-premises deployments, and the ability to fine-tune or run custom model variants without per-query markup. These factors can justify the investment below the pure cost break-even point. Conversely, for bursty or unpredictable workloads, rapidly evolving model requirements, or teams without spare MLOps capacity, APIs remain more cost-effective and operationally simpler.
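The break-even volume for a flat monthly cost against a per-token API price can be sketched as below. This is a simplified illustration: it treats self-hosting as a fixed monthly cost, ignores staffing and throughput limits, and uses the open-model provider rates quoted above as inputs. The function name is hypothetical.

```python
def break_even_tokens_per_day_millions(fixed_monthly_cost: float,
                                       api_price_per_million: float,
                                       days: int = 30) -> float:
    """Daily volume (millions of tokens) where API spend equals the fixed cost."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be > 0")
    return fixed_monthly_cost / (api_price_per_million * days)


# Reserved H200 at ~$2,016/month vs. an open-model API at $0.20 to $0.60 per 1M.
for price in (0.60, 0.20):
    volume = break_even_tokens_per_day_millions(2016, price)
    print(f"${price:.2f}/1M -> break-even at ~{volume:.0f}M tokens/day")
# $0.60/1M -> break-even at ~112M tokens/day
# $0.20/1M -> break-even at ~336M tokens/day
```

Adding staffing and tooling to the fixed cost moves the break-even volume higher still, which is why cost alone rarely favors self-hosting against open-model providers.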

ROI Framework: How to Calculate Self-Hosting Costs

Total Cost of Ownership Formula

TCO per month equals the sum of hardware amortization (purchase price divided by lifespan in months), electricity, cooling overhead, networking fees, fractional staffing allocation, and software or tooling costs. Expressed as a formula:

def tco_monthly(
    P_hardware: float,          # USD — total hardware purchase price
    amortization_months: int,   # months — must be > 0 (e.g., 36)
    GPU_watts: float,           # Watts — per-GPU average draw under load
    num_gpus: int,              # count of GPUs in deployment
    PUE: float,                 # dimensionless — facility overhead multiplier (e.g., 1.4)
    electricity_rate: float,    # USD/kWh — local commercial rate (e.g., 0.13)
    P_colocation: float,        # USD/month — rack space + metered power fees
    P_network: float,           # USD/month — egress/ingress fees (set to 0 if not applicable)
    FTE_fraction: float,        # fraction of one FTE (0 < x <= 1.0)
    annual_salary: float,       # USD — fully-loaded annual cost of one FTE
    P_software: float,          # USD/month — tooling, observability, licensing
) -> float:
    """
    Returns estimated monthly TCO in USD.

    Reference check (worked example):
      P_hardware=36000, amortization_months=36, GPU_watts=500, num_gpus=2,
      PUE=1.4, electricity_rate=0.13,
      P_colocation=700, P_network=0, FTE_fraction=0.25,
      annual_salary=192000, P_software=100
      => Expected output: ~$5,931/month
    """
    if amortization_months <= 0:
        raise ValueError("amortization_months must be > 0")
    if not (0 < FTE_fraction <= 1.0):
        raise ValueError("FTE_fraction must be between 0 (exclusive) and 1.0 (inclusive)")

    HOURS_PER_MONTH = 720  # 30 days × 24 hours

    electricity_kwh = (GPU_watts * num_gpus * HOURS_PER_MONTH / 1000) * PUE
    electricity_cost = electricity_kwh * electricity_rate

    return (
        (P_hardware / amortization_months)
        + electricity_cost
        + P_colocation
        + P_network
        + (FTE_fraction * annual_salary / 12)
        + P_software
    )


def cost_per_million_tokens(tco: float, daily_tokens_millions: float) -> float:
    """USD per 1M tokens given monthly TCO and daily token volume."""
    if daily_tokens_millions <= 0:
        raise ValueError("daily_tokens_millions must be > 0")
    monthly_tokens_millions = daily_tokens_millions * 30
    return tco / monthly_tokens_millions


def break_even_month(
    P_hardware: float,
    monthly_opex: float,
    monthly_api_cost: float,
) -> float:
    """
    Returns the month at which cumulative self-hosting cost equals
    cumulative API spend. Returns float('inf') if self-hosting never
    breaks even (opex >= API cost).
    """
    monthly_savings = monthly_api_cost - monthly_opex
    if monthly_savings <= 0:
        return float('inf')  # self-hosting never cheaper at this volume
    return P_hardware / monthly_savings

Consider a concrete scenario: a mid-size SaaS company running Llama 4 70B for automated customer support, processing around 50M tokens per day. The company uses a colocation setup with two owned H100 SXM GPUs purchased at $18,000 each on the secondary market (a midpoint estimate; actual secondary-market prices of $15,000 to $20,000 per card would put amortization at $833 to $1,111 per month). Amortized over 36 months, the $36,000 hardware component is $1,000 per month. Electricity for both GPUs at 500 W average, 720 hours per month, a PUE of 1.4, and $0.13/kWh comes to about $131 per month. Colocation and power fees add ~$700 per month. Staffing at 25% of a senior engineer ($192,000/year fully loaded) comes to $4,000 per month. Software and observability tooling adds $100. Monthly TCO: ~$5,931. Spread over 1,500M tokens per month (50M per day x 30 days), that is about $3.95 per 1M tokens.

Against GPT-4.1 at that volume, monthly API cost would run roughly $7,500 to $15,000 (per the usage table above). At the $11,250 midpoint, the $6,319 monthly saving over the $4,931 in non-hardware operating costs recovers the $36,000 hardware investment in roughly 5.7 months; across the full range, payback runs from about 3.6 to 14 months. Against Together AI serving the same model at $0.40 per 1M blended tokens, the API cost would be ~$600 per month, far below self-hosting's operating costs. In that case, self-hosting never breaks even on cost alone.

Using the TCO Formula

The TCO formula above takes hardware price, amortization period, per-GPU power draw, GPU count, PUE, local electricity rate, colocation and network fees, staffing allocation, and software cost as inputs; projected daily token volume then converts the result into a per-token cost. Utilization is not a separate parameter: it enters through the average GPU wattage and the daily token volume you supply. Teams should model at least three scenarios (optimistic utilization, realistic utilization, and low utilization) to stress-test assumptions before procurement decisions.

To calculate cost per 1M tokens: divide monthly TCO by total monthly tokens (in millions). For example: cost_per_million_tokens(tco=5931, daily_tokens_millions=50.0) yields ~$3.95. To find the break-even month versus a specified API alternative: break_even_month(P_hardware=36000, monthly_opex=4931, monthly_api_cost=11250) yields approximately 5.7 months against GPT-4.1 midpoint pricing.
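A scenario sweep makes the volume sensitivity concrete. The sketch below holds monthly TCO fixed at the ~$5,931 from the worked example and varies daily token volume. The three volumes are hypothetical scenario inputs, and `cost_per_million` is a standalone helper that mirrors the per-token calculation described above rather than a replacement for it.

```python
def cost_per_million(monthly_tco: float, daily_tokens_millions: float,
                     days: int = 30) -> float:
    """USD per 1M tokens given monthly TCO and daily volume in millions."""
    if daily_tokens_millions <= 0:
        raise ValueError("daily_tokens_millions must be > 0")
    return monthly_tco / (daily_tokens_millions * days)


MONTHLY_TCO = 5931.0  # from the worked example above

# Hypothetical daily volumes (millions of tokens) for three scenarios.
scenarios = {"low": 10.0, "realistic": 50.0, "optimistic": 100.0}

for name, volume in scenarios.items():
    unit_cost = cost_per_million(MONTHLY_TCO, volume)
    print(f"{name:>10}: {volume:>5.0f}M tokens/day -> ${unit_cost:.2f} per 1M tokens")
# low:         $19.77 per 1M tokens
# realistic:   $3.95 per 1M tokens
# optimistic:  $1.98 per 1M tokens
```

Because the cost is fixed, the per-token figure falls in direct proportion to volume, so an overestimate of traffic translates directly into an underestimate of unit cost.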

Key Takeaways and Recommendations

When Self-Hosting Makes Sense

Self-hosting delivers a clear economic advantage at sustained volumes above 10M tokens per day when compared to frontier closed-source APIs. It becomes the obvious choice when strict data residency or privacy regulations apply. Fine-tuned or custom models that must run without per-query markup push the decision further toward owned infrastructure, as do latency requirements that demand on-premises or single-tenant deployment.

When to Stick with APIs

Variable or low usage (under 2M to 5M tokens per day against frontier APIs) rarely justifies the fixed costs of self-hosting. Teams that need access to the latest frontier closed-source models have no self-hosting path for those weights, and limited DevOps or MLOps capacity turns the operational burden into a genuine risk to uptime and security posture.

Self-hosting economics improved between 2024 and 2026: GPU prices fell, open-weight model quality closed the gap for many production use cases (particularly text-based reasoning and instruction following), and inference tooling matured. But the economics only work above a clear volume threshold, and that threshold shifts depending on whether the comparison is frontier APIs or open-model API providers already running the same weights. The TCO formula above exists so you can find your own crossover point before committing capital.

SitePoint Team
