From plain English to the math. Starts simple, gets deep. Designed so anyone — a finance lead, an infrastructure VP, or an engineer — can follow the story and verify the numbers.
Want the math? Jump to §6. Want to validate against your own workload? §7. Quick walkthrough of the demo? §4.
Imagine a very expensive chef. The chef costs $1,000 an hour. The chef can cook fast, but only if the ingredients show up on time. If the delivery truck is late, the chef just stands there with a knife. You still pay the chef.
That's what's happening to GPUs every day. An H100 GPU rents for about $3 to $4 per hour. A cluster of 256 H100s costs more than $1,000 per hour to run. AI companies pay that whether the GPUs are training a model or sitting there waiting for the next batch of data.
Two words that sound the same but mean different things: throughput and goodput.
A truck with a 200 mph top speed and zero packages delivered has high throughput and zero goodput. A truck driving 60 mph and delivering every package on schedule has lower throughput and perfect goodput. Goodput is what you actually get paid for.
For an AI training cluster, goodput is the percentage of GPU time spent actually computing, not waiting on storage. A "95% goodput" cluster means 5% of the time the GPUs are idle — waiting for the next shuffle of training data, or waiting for a checkpoint to finish flushing to disk.
That's the whole idea in one paragraph: storage that keeps GPUs busy converts idle GPU dollars into billable training dollars. The faster and steadier the storage, the more of the GPU bill is doing useful work.
Two conditions have to be true at the same time. If either one isn't, the goodput math reduces to zero (and Standard B2 is the right answer).
| Threshold | Where it tips |
|---|---|
| Cluster size large enough that 1% idle = real money | ~32+ GPUs (~$100/hr cluster cost). Below this, even significant goodput improvements are too small to justify a tier upgrade. |
| Working set large enough that storage actually stalls the cluster | ~3+ PB hot working set, OR bursty checkpoints ≥ 100 GB on a 64+ GPU cluster. Below this, Standard B2's 50 Gbps already keeps the GPUs fed. |
Below both thresholds — Profile 1 territory in the demo — Standard B2 is the right answer and the goodput tiles correctly read $0. The tool will recommend Standard when you click "Auto-select optimal tier" in this regime.
Above both thresholds — Profile 2 and Profile 3 territory — the goodput story dominates. The Net ROI math turns positive for an Overdrive tier sized to the cluster's demand.
| Segment | What they sell | Their pain | How Overdrive helps |
|---|---|---|---|
| GPU cloud providers: "Neoclouds" renting GPU capacity by the hour | GPU capacity by the hour to AI builders | Margin is GPU $/hr minus storage and ops. Every idle GPU minute is dead margin. Flash is expensive, and it is paid for whether customers re-read it or not. | Higher goodput → more billable GPU hours from the same hardware. Flash displacement → smaller flash footprint to amortize. |
| GPU orchestration platforms: schedulers that pack training jobs across clusters | Software that schedules and packs training jobs across clusters | Their value prop IS goodput: they promise "we make your GPUs busier." If the underlying storage stalls, the scheduler can't fix it. | Overdrive is the storage layer that makes their orchestration claims defensible end-to-end. |
| AI infrastructure providers: bare-metal training, fine-tune-as-a-service, inference platforms | Compute platforms measured in tokens/sec or training $/epoch | SLA pressure from customers who measure everything. A 2-minute checkpoint stall × 4 jobs × 24 hours is hours of lost training per day. | Checkpoint stall elimination, which directly recovers customer-visible throughput. |
| Data providers for AI training: web-scale crawlers, media aggregators, dataset publishers | Crawl, curate, and package petabytes of training data for downstream AI builders | Petabytes of cold data re-read on every training cycle. Flash for this is economically impossible; slow object storage stalls every downstream training run. | Overdrive serves PB-scale catalogs at training-cluster line rate, without the flash bill. |
The page has four big numbers across the top — the "hero tiles." Each is one piece of the goodput story.
| Tile | What it answers | Who cares most |
|---|---|---|
| Goodput Recovered / mo | "How much of my GPU bill stops being wasted on I/O wait?" | CRO, CFO — direct revenue impact |
| Checkpoint Stall Eliminated | "How much GPU time stops being burned during checkpoint flushes?" | Platform engineer, AI customers |
| Flash Footprint Displaced | "How much of my expensive flash can I retire because B2 can serve the hot working set?" | Infrastructure VP, finance |
| Revenue Unlocked / mo | "What can I charge customers for the flash I just freed up?" | CRO — the resale story |
Storage is a small fraction of any AI training bill. Even cutting it 80% doesn't move the conversation. GPU spend is 10–100× the storage spend, so once GPU spend exceeds 50× storage spend, a 1% goodput improvement is worth more than a 50% storage discount. This tool surfaces that math directly instead of burying it inside a per-TB rate sheet.
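The comparison between a goodput gain and a storage discount can be checked with a few lines of arithmetic. This is a minimal sketch, not the tool's code; the function names and the three example spend ratios are assumptions chosen for illustration.

```python
def breakeven_gpu_to_storage_ratio(goodput_gain: float, storage_discount: float) -> float:
    """GPU-to-storage spend ratio above which the goodput gain is worth more.

    goodput_gain and storage_discount are fractions (0.01 = 1%, 0.50 = 50%).
    """
    return storage_discount / goodput_gain


def goodput_gain_wins(gpu_spend: float, storage_spend: float,
                      goodput_gain: float = 0.01,
                      storage_discount: float = 0.50) -> bool:
    """True when the dollars recovered on GPUs exceed the dollars saved on storage."""
    return gpu_spend * goodput_gain > storage_spend * storage_discount


ratio = breakeven_gpu_to_storage_ratio(0.01, 0.50)
print(f"break-even at GPU spend = {ratio:.0f}x storage spend")

# Hypothetical spend ratios: 10x, 60x, and 100x storage spend.
for multiple in (10, 60, 100):
    print(multiple, goodput_gain_wins(gpu_spend=multiple * 1.0, storage_spend=1.0))
```

Under these assumptions the goodput gain wins only when the GPU bill is more than 50 times the storage bill, so the claim is strongest for large clusters.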
When the page loads on Standard B2 (50 Gbps), every goodput tile shows $0 with an explainer ("Standard tier — no goodput delta to claim"). That's intentional, not a bug. Standard B2 IS the baseline — there's no delta to compare against itself. Switch to an Overdrive tier (100, 200, 300, 400, or 800 Gbps) to see the math kick in and the dollar values change.
The fastest way to learn what the tool does is to click through it. The walkthrough below takes about five minutes.
Open the Configure panel to adjust every input directly — cluster size, GPU $/hr, checkpoint cadence, hot working set %, flash $/TB-mo, and I/O wait %. The hero tiles and Tier Comparison panel update live as you change any field. The Tier Comparison panel breaks every tier down (goodput / stall / flash / premium / net ROI) so you can compare them side-by-side.
The Run button executes all 8 pipeline stages end-to-end in sequence: Data Lake → Training Prep → Checkpoints → Model Registry → RAG Knowledge → KV Cache → Inference → Exhaust. Five of the stages (Data Lake, Training Prep, Checkpoints, Model Registry, Inference) carry the goodput / stall / flash-displacement headline economics; the other three (RAG Knowledge, KV Cache, Exhaust) demonstrate the full set of B2 patterns in a production AI pipeline. Individual stage cards also have their own Run / Re-run buttons if you want to demo a single pattern in isolation.
| Profile | Default | Why it's here |
|---|---|---|
| Profile 1 — inference / fine-tune | 1–3 PB, 16 GPUs | Credibility profile. Math correctly recommends Standard B2 here. |
| Profile 2 — frontier pretraining | 25 PB, 256 H100s (default) | The scenario where Overdrive economics typically carry. Frontier LLM pretraining is the canonical fit. |
| Profile 3 — video / multimodal at scale | 100 PB, 512 GPUs | The flash-displacement story for the largest accounts. |
| Field | Default | Why it's here |
|---|---|---|
| Drive price ($/TB hardware) | $490 (QLC) / $570 (TLC) | Q1 2026 reference pricing for 30.72 TB enterprise drives at volume. |
| Amortization (months) | 36 | Standard 3-year refresh cycle most CFOs assume. |
| DC overhead (%) | 50 | Power + cooling + fabric + ops markup on raw drive cost. 30-60% is typical for high-density flash. |
| Sell price ($/TB-mo to customers) | $100 | What a neocloud charges its customers for flash. Used in Revenue Unlocked math. |
| Field | Default | Why it's here |
|---|---|---|
| Cluster size (GPUs) | 256 | Drives cluster $/hr in every goodput formula. |
| GPU $/hr | $4.00 | H100 on-demand rate. Reserved instances are lower; CFO will tune. |
| Model size | 70B | Auto-fills checkpoint size (~13 bytes/param for full training state per MLCommons MLPerf Storage benchmark). |
| Checkpoint size (GB) | 800 | Hourly checkpoint of a 70B model. MLPerf anchor buttons next to the input auto-fill canonical sizes: 8B → 105 GB, 70B → 912 GB, 405B → 5.29 TB, 1T → 15 TB (MLCommons MLPerf Storage). |
| Checkpoint cadence | every 1 hr | Frontier pretraining typical. Faster cadence = more stall payoff for Overdrive. |
| Concurrent jobs | 4 | Multi-tenant clusters checkpoint each job independently. Multiplies the stall payoff linearly. |
| Hot working set (% of dataset) | 32% | Fraction of the dataset that has to stay on flash for active shuffled reads. |
| Flash $/TB-mo (displacement) | $100 | The flash cost basis Overdrive is replacing. Mirrors the sell-price field above. |
| Std / Overdrive I/O wait % | 4.0% / 0.8% | Working defaults — should be validated against your cluster's telemetry before treating as authoritative. |
| Overdrive flash-replace % | 80% | Fraction of hot working set Overdrive can serve at line rate (so flash isn't needed for it). |
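The checkpoint-size auto-fill and the flush time it implies can be approximated as follows. This is a rough sketch assuming a flat 13 bytes per parameter and decimal units; the function names are hypothetical. The anchor-button sizes in the table come from the benchmark and do not follow a flat multiplier exactly, so treat the output as an estimate.

```python
BYTES_PER_PARAM = 13  # ~13 bytes/param for full training state, per the table above


def estimate_checkpoint_gb(params_billions: float,
                           bytes_per_param: float = BYTES_PER_PARAM) -> float:
    """Approximate checkpoint size in decimal GB for a given parameter count."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


def flush_seconds(ckpt_gb: float, link_gbps: float) -> float:
    """Seconds to write ckpt_gb gigabytes over a link of link_gbps gigabits/sec."""
    return ckpt_gb * 8 / link_gbps


for size_b in (8, 70, 405):
    gb = estimate_checkpoint_gb(size_b)
    print(f"{size_b}B params: ~{gb:,.0f} GB, "
          f"{flush_seconds(gb, 50):,.0f} s to flush at 50 Gbps")
```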
| What we removed | Why |
|---|---|
| "Training runs / month" | An earlier version multiplied per-run load-time savings by runs/month and produced implausible $30M+ Net ROI numbers. Real training I/O isn't sequential per run — it's continuous shuffled reads plus checkpoint flushes. Removing this knob keeps the numbers honest. |
| "S3 egress fees" / "S3 storage rate" | The old pitch lived on the S3 comparison. The reframe puts the GPU dollar conversation first; comparing storage line items is a distractor at this stage of the pitch. |
| "Cold tier vs warm tier vs hot tier" sliders | Too much detail for a 5-minute opener. Rolled into a single "Hot working set %" input. Engineers who want the breakdown can read the stage flash-freed fractions in the source. |
| "$/TB·mo for B2 Overdrive" as an input | Hardcoded per tier — these are published Backblaze rates, not tunable inputs. The Standard rate is $6.95/TB·mo; Overdrive tiers are listed in §6.5. |
| "Per-GPU bandwidth need" as a UI knob | Set as a constant (1.0 Gbps/GPU, the conservative lower bound from published H100 training measurements). Heavier or lighter workloads can override the constant in the source; the assumption is called out in §7 so it can be validated against real telemetry. |
| A separate "Calculator" panel | Used to have its own inputs duplicating Configure. Removed for clarity — one set of inputs drives one set of outputs. |
Every formula below is what the tool computes. Each is presented with a worked example using Profile 2 defaults so the chain of reasoning is reproducible by hand.
cluster_$_per_hr = cluster_size × gpu_$_per_hr
io_wait_delta = max(0, (io_wait_std% − io_wait_od%) / 100)
training_hrs_per_mo = 720 × cluster_utilization // default 80% → 576 hrs
goodput_$_per_mo = cluster_$_per_hr × io_wait_delta × tier_capability × training_hrs_per_mo
The dollar value of GPU cluster time that's no longer spent in I/O wait. If Standard B2 leaves the cluster waiting on storage 4% of the time, and Overdrive brings that to 0.8%, then 3.2% of the cluster's hourly bill becomes productive work instead of idle GPUs — applied only during the 576 effective training hours each month (a production cluster spends ~20% of every month idle: maintenance, queue gaps, restart windows).
cluster_$/hr = 256 × $4 = $1,024
delta = (4.0 − 0.8) / 100 = 0.032
capability = min(1, 300 / (256 × 1)) = 1.0
training_hrs_per_mo = 720 × 0.80 = 576
goodput = 1,024 × 0.032 × 1.0 × 576
≈ $18,874 / mo
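The Goodput Recovered formula can be reproduced in a few lines. This is a minimal sketch of the formula as written above, not the tool's source; the function and parameter names are assumptions.

```python
def goodput_recovered_per_month(
    cluster_size: int,
    gpu_cost_per_hr: float,
    io_wait_std_pct: float,
    io_wait_od_pct: float,
    tier_gbps: float,
    gbps_per_gpu: float = 1.0,          # per-GPU bandwidth demand constant
    cluster_utilization: float = 0.80,  # fraction of the month spent training
) -> float:
    """Monthly dollars of GPU time no longer spent in I/O wait."""
    cluster_cost_per_hr = cluster_size * gpu_cost_per_hr
    io_wait_delta = max(0.0, (io_wait_std_pct - io_wait_od_pct) / 100)
    # Prorate by how much of the cluster's bandwidth demand the tier can serve.
    tier_capability = min(1.0, tier_gbps / (cluster_size * gbps_per_gpu))
    training_hrs_per_mo = 720 * cluster_utilization
    return cluster_cost_per_hr * io_wait_delta * tier_capability * training_hrs_per_mo


# Profile 2 defaults on the 300 Gbps tier.
profile_2 = goodput_recovered_per_month(256, 4.00, 4.0, 0.8, tier_gbps=300)
print(f"${profile_2:,.0f} / mo")  # $18,874 / mo
```

Changing `tier_gbps` to 100 or 200 shows the proration: the same cluster recovers proportionally less because the tier cannot serve the full 256 Gbps demand.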
stall_std_min = (ckpt_GB × 8) / standard_gbps / 60
stall_od_min = (ckpt_GB × 8) / tier_gbps / 60
delta_min = max(0, stall_std_min − stall_od_min)
ckpts_per_hr = 60 / cadence_min
training_hrs_per_mo = 720 × cluster_utilization // default 80% → 576 hrs
hours_recovered_per_mo = (delta_min / 60) × ckpts_per_hr × training_hrs_per_mo × concurrent_jobs
stall_$_per_mo = hours_recovered_per_mo × cluster_$_per_hr
Every training run flushes checkpoints periodically. While flushing, the cluster stalls: every GPU waits on the checkpoint write to complete. Standard B2 at 50 Gbps takes about 2.1 minutes to flush an 800 GB checkpoint. Overdrive at 200 Gbps does it in about 32 seconds. That 1.6 minute delta, repeated hourly across 4 concurrent training jobs over the 576 effective training hours, adds up.
stall_std = (800 × 8) / 50 / 60 = 2.13 min
stall_od = (800 × 8) / 300 / 60 = 0.36 min
delta = 1.78 min
ckpts/hr = 60 / 60 = 1
training_hrs_per_mo = 720 × 0.80 = 576
hours/mo = (1.778/60) × 1 × 576 × 4 jobs
= 68.27 hours
stall_$/mo = 68.27 × $1,024
≈ $69,900 / mo
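The checkpoint stall chain can be recomputed without intermediate rounding. This is a hedged sketch of the formula as stated, with hypothetical function and parameter names. Carrying full precision gives about 68.3 hours and roughly $69,900 per month for the Profile 2 inputs on the 300 Gbps tier, which can differ slightly from figures rounded at each step.

```python
def checkpoint_stall_recovered(
    ckpt_gb: float,
    standard_gbps: float,
    tier_gbps: float,
    cadence_min: float,
    concurrent_jobs: int,
    cluster_cost_per_hr: float,
    cluster_utilization: float = 0.80,
) -> tuple[float, float]:
    """Return (cluster hours recovered per month, dollars recovered per month)."""
    stall_std_min = ckpt_gb * 8 / standard_gbps / 60
    stall_od_min = ckpt_gb * 8 / tier_gbps / 60
    delta_min = max(0.0, stall_std_min - stall_od_min)
    ckpts_per_hr = 60 / cadence_min
    training_hrs_per_mo = 720 * cluster_utilization
    hours = (delta_min / 60) * ckpts_per_hr * training_hrs_per_mo * concurrent_jobs
    return hours, hours * cluster_cost_per_hr


hours, dollars = checkpoint_stall_recovered(
    ckpt_gb=800, standard_gbps=50, tier_gbps=300,
    cadence_min=60, concurrent_jobs=4, cluster_cost_per_hr=1024,
)
print(f"{hours:.2f} hours, ${dollars:,.0f} / mo")
```

Setting `concurrent_jobs=1` gives the single-tenant result, one quarter of the figure above.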
A note on concurrent_jobs: in a multi-tenant cluster each job checkpoints on its own cadence, so the payoff scales linearly with the number of concurrent flushers. Single-tenant clusters should set concurrent_jobs = 1 in the Configure panel for an unmultiplied result.
dataset_PB = pipeline_training_TB / 1,000 // decimal PB, matches spec §4
hot_PB = dataset_PB × hot_working_set% / 100
displaced_PB = hot_PB × overdrive_replace% / 100 × tier_capability
displaced_TB = displaced_PB × 1,000
flash_$ = displaced_TB × flash_$_per_TB_per_mo
b2_overdrive_$ = displaced_TB × tier_storage_rate
net_savings = max(0, flash_$ − b2_overdrive_$)
The amount of flash storage a customer no longer needs to provision because Overdrive can serve the hot working set at line rate. For a 100 PB customer with a 30 PB hot tier, Overdrive eliminating 80% of that hot-tier flash provisioning is a 24 PB reduction. At $100/TB-mo flash and $19/TB-mo B2 Overdrive, that's ~$1.9M/mo of net storage savings.
dataset = 25,000 TB / 1,000 = 25 PB
hot = 25 × 0.32 = 8 PB
displaced = 8 × 0.80 × 1.0 = 6.4 PB (6,400 TB)
flash_$ = 6,400 × $100 = $640,000
b2_od_$ = 6,400 × $19 = $121,600
net = $640,000 − $121,600
≈ $518,400 / mo
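The flash displacement arithmetic can be sketched the same way. The function and parameter names are assumptions, and the $19/TB-mo rate is the one used in the worked example above, not a value looked up from a rate table.

```python
def flash_displacement(
    training_tb: float,
    hot_working_set_pct: float,
    overdrive_replace_pct: float,
    tier_capability: float,
    flash_rate_per_tb_mo: float,
    tier_rate_per_tb_mo: float,
) -> tuple[float, float]:
    """Return (TB of flash displaced, net monthly savings in dollars)."""
    dataset_pb = training_tb / 1000  # decimal PB
    hot_pb = dataset_pb * hot_working_set_pct / 100
    displaced_pb = hot_pb * overdrive_replace_pct / 100 * tier_capability
    displaced_tb = displaced_pb * 1000
    flash_cost = displaced_tb * flash_rate_per_tb_mo
    b2_cost = displaced_tb * tier_rate_per_tb_mo
    return displaced_tb, max(0.0, flash_cost - b2_cost)


# Profile 2: 25 PB dataset, 32% hot, 80% replaceable, tier at full capability.
displaced_tb, net_savings = flash_displacement(25_000, 32, 80, 1.0, 100.0, 19.0)
print(f"{displaced_tb:,.0f} TB displaced, ${net_savings:,.0f} / mo net")
```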
freed_TB = Σ STAGE_FLASH_FREED_TB[stage] for stage in completed_stages
margin_per_TB = flash_sell_price − b2_storage_rate
revenue_$_per_mo = max(0, freed_TB × margin_per_TB)
The neocloud's resale opportunity. As each demo stage offloads its data tier to B2, that flash capacity is freed up. The neocloud can resell that flash to a new GPU customer at the going rate (~$100/TB-mo) while paying B2 only the storage rate (~$6.95–$19/TB-mo depending on tier). The margin is recurring revenue.
| Stage | Flash freed | What stays hot |
|---|---|---|
| Data Lake | 90% of training corpus | ~10% re-training cache |
| Checkpoints | 75% of checkpoint history | Active + last 2-3 checkpoints |
| Model Registry | 24% of model tier | Currently deployed model |
| RAG Knowledge | 7.5% of model tier | Vector index + embeddings |
| KV Cache | 8.25% of model tier | Hot KV blocks for inference |
| Exhaust | 70% of logs | 72-hour log window |
These per-stage fractions are educated estimates. Validate against actual customer deployment patterns before externalizing as authoritative numbers.
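As an illustration of the Revenue Unlocked formula, the sketch below sums freed flash over completed stages and applies the resale margin. The per-stage TB values are made-up placeholders, not the tool's STAGE_FLASH_FREED_TB table, and the function name is hypothetical.

```python
def revenue_unlocked_per_month(
    stage_flash_freed_tb: dict[str, float],
    completed_stages: list[str],
    flash_sell_price: float,
    b2_storage_rate: float,
) -> float:
    """Monthly resale margin on flash freed by the completed stages."""
    freed_tb = sum(stage_flash_freed_tb[stage] for stage in completed_stages)
    margin_per_tb = flash_sell_price - b2_storage_rate
    return max(0.0, freed_tb * margin_per_tb)


# Hypothetical freed capacity per stage, in TB (placeholders for illustration only).
example_freed_tb = {"data_lake": 1000.0, "checkpoints": 200.0, "exhaust": 50.0}

revenue = revenue_unlocked_per_month(
    example_freed_tb,
    completed_stages=["data_lake", "checkpoints"],  # exhaust has not run yet
    flash_sell_price=100.0,
    b2_storage_rate=6.95,
)
print(f"${revenue:,.0f} / mo")
```

Only completed stages contribute, which is why the tile grows as each stage card finishes.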
storage_TB = training_TB + checkpts_TB + models_TB + logs_TB
standard_$_per_mo = storage_TB × $6.95
tier_$_per_mo = max(tier_min_monthly_commit,
storage_TB × tier_storage_rate + tier_network_fee)
tier_premium = max(0, tier_$_per_mo − standard_$_per_mo)
net_ROI_per_mo = goodput_$ + stall_$ + net_flash_$ − tier_premium
This is what the "Auto-select optimal tier" button uses to pick the best tier. The winner is whichever Overdrive tier maximizes net ROI for the configured workload. If no Overdrive tier produces positive ROI (small workloads where the tier premium exceeds the gains), it recommends Standard B2.
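The selection rule described above can be sketched as follows. This is an assumption-laden illustration, not the tool's implementation: the `Tier` fields mirror the names in the formula, but the example tier's rate, fee, and commit are placeholders, and the benefit figures are rounded values passed in by hand.

```python
from dataclasses import dataclass

STANDARD_RATE = 6.95  # $/TB-mo for Standard B2, from the formula above


@dataclass
class Tier:
    name: str
    storage_rate: float        # $/TB-mo
    network_fee: float         # $/mo
    min_monthly_commit: float  # $/mo


def tier_premium(storage_tb: float, tier: Tier) -> float:
    """Extra monthly cost of the tier over Standard B2 for the same data."""
    standard_cost = storage_tb * STANDARD_RATE
    tier_cost = max(tier.min_monthly_commit,
                    storage_tb * tier.storage_rate + tier.network_fee)
    return max(0.0, tier_cost - standard_cost)


def auto_select(storage_tb: float,
                benefits_by_tier: dict[str, tuple[float, float, float]],
                tiers: list[Tier]) -> tuple[str, float]:
    """Pick the tier with the highest positive net ROI, else Standard B2."""
    best_name, best_roi = "Standard B2", 0.0
    for tier in tiers:
        goodput, stall, net_flash = benefits_by_tier[tier.name]
        roi = goodput + stall + net_flash - tier_premium(storage_tb, tier)
        if roi > best_roi:
            best_name, best_roi = tier.name, roi
    return best_name, best_roi


# Hypothetical tier: placeholder rate, fee, and commit (not published pricing).
example_tier = Tier("Overdrive (hypothetical)", storage_rate=19.0,
                    network_fee=0.0, min_monthly_commit=0.0)
benefits = {example_tier.name: (18_874.0, 69_900.0, 518_400.0)}
choice, roi = auto_select(25_000, benefits, [example_tier])
print(choice, f"${roi:,.0f} / mo")
```

Starting from a zero-ROI Standard baseline means a tier must clear its own premium before it can be recommended, which matches the fallback behavior described above.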
cluster_demand_gbps = cluster_gpus × BANDWIDTH_PER_GPU_GBPS // default: 1.0 Gbps/GPU
tier_capability = min(1, tier_gbps / cluster_demand_gbps)
An Overdrive tier that can't deliver enough bandwidth for the cluster can't fully deliver its benefits. The capability cap prorates Goodput Recovered and Flash Displaced by the fraction of cluster demand the tier can actually serve.
Examples:
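A short sketch makes the cap concrete for a 256-GPU cluster across the Overdrive tier sizes mentioned earlier (100, 200, 300, 400, 800 Gbps). The function name is an assumption; the constant uses the document's 1.0 Gbps/GPU default.

```python
BANDWIDTH_PER_GPU_GBPS = 1.0  # default per-GPU demand from the formula above


def tier_capability(tier_gbps: float, cluster_gpus: int,
                    gbps_per_gpu: float = BANDWIDTH_PER_GPU_GBPS) -> float:
    """Fraction of the cluster's bandwidth demand the tier can serve, capped at 1."""
    cluster_demand_gbps = cluster_gpus * gbps_per_gpu
    return min(1.0, tier_gbps / cluster_demand_gbps)


for gbps in (100, 200, 300, 400, 800):
    print(f"{gbps} Gbps on 256 GPUs: {tier_capability(gbps, 256):.0%}")
# 100 Gbps -> 39%, 200 Gbps -> 78%, 300 Gbps and above -> 100%
```

Raising `gbps_per_gpu` to 2.0 for a heavier workload doubles the demand to 512 Gbps, so only the 800 Gbps tier would reach 100%.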
On the 0.3 Gbps/GPU default: the two calculators answer different questions and therefore use different capability-cap derivations:
- This tool takes cluster_gpus × 1.0 Gbps/GPU as the bandwidth demand, a conservative worst-case capacity envelope appropriate for fleet-wide planning.
- The other calculator derives demand from working_set × epochs / run_duration, with the per-GPU constant defaulting to 0.3, reflecting pre-staged workflows where the GPU reads from local NVMe most of the time and only occasionally pulls from B2.

| Assumption | Default | Notes | How to validate |
|---|---|---|---|
| Cluster utilization % | 80% | Production-reservation typical | Fraction of the month the cluster is actively training. 720 hr/mo × 80% = 576 effective training hours. Production reservations often run higher (90%+) on reserved capacity; research clusters can run lower. The math multiplies goodput recovered + stall recovery by this — overstating utilization overstates monthly impact linearly. |
| Standard I/O wait % | 4.0% | Working default | Pull cluster utilization telemetry from a representative training run. GPU SM efficiency in nvidia-smi or DCGM gives the inverse — the fraction of time GPUs were doing useful compute. |
| Overdrive I/O wait % | 0.8% | Working default | Same approach in an environment with sustained 200+ Gbps storage throughput. If you don't yet have one, treat this as the modeled outcome and re-measure after a pilot. |
| Per-GPU bandwidth demand | 1.0 Gbps/GPU | Conservative lower bound | Measure actual training-read GB/s, divide by GPU count. Heavier workloads (small shards, low cache locality) can be 2–4 Gbps/GPU. |
| Flash $/TB-mo (sell price basis) | $100 | Mid-market reference | Use your own flash sell price or internal chargeback rate. Common range is $80–$150. |
| Overdrive flash-replace % | 80% | Modeled outcome | This is a workload claim, not a pricing input. Validate by piloting Overdrive against a representative hot working set. |
| Stage flash-freed fractions | 90 / 75 / 40 / 30 / 55 / 70% | Educated estimates from typical AI pipeline patterns | Compare against your retention policy. Numbers vary by training cadence and how much old data stays warm. |
| QLC flash hardware cost | $490/TB | Public reference (2026 Q1, 30.72 TB drives in volume) | Use your own quote. Sustained-volume agreements often come in lower. |
The 4.0% (Standard) and 0.8% (Overdrive) defaults sit in the middle of public ranges reported by storage and ML infrastructure vendors. They are not measurements of your specific workload — they describe what well-tuned vs un-tuned training-storage stacks typically look like in the industry.
| Source | What it reports |
|---|---|
| AWS S3 Mountpoint & S3 Express One Zone performance documentation | Cold-stream I/O wait on training workloads in the high-single-digit range without local NVMe staging. Throughput improves dramatically with prefetch + parallel range reads (the pattern this tool models). |
| MLPerf Storage benchmark (MLCommons) | Uses I/O wait as the proxy for "good vs bad storage." The 2–8% envelope separates well-tuned from un-tuned object-storage stacks across published submissions. |
| VAST Data & WekaIO public whitepapers + production telemetry | Cite measured I/O wait reductions on un-tuned object-storage backed training in the 3–10% band depending on shard pattern. |
| Hyperscaler + academic telemetry (AWS reports, Azure reports, Meta SPEC paper, Google Pathways) | I/O wait can swing 1–15% depending on dataset size, shard pattern, dataloader concurrency, and checkpoint cadence. Azure specifically reports checkpoint overhead averaging 12% of total training time, up to 43% in some large-model scenarios. |
The I/O wait percentages are working defaults pending validation against telemetry from a representative training run. The "Where the I/O wait baselines come from" subsection in §7 above cites the public source ranges. To produce numbers you can budget against, take a measurement of your own cluster (GPU SM efficiency from nvidia-smi or DCGM gives the inverse of I/O wait %) and update the Configure fields accordingly.
The optimizer computes Net ROI for every tier and picks the highest. For a 256-GPU cluster with 1 Gbps/GPU demand, 300 Gbps is the first tier where capability reaches 100% — bigger tiers add tier premium without proportional benefit. Adjust your cluster size or per-GPU bandwidth demand and the optimizer will choose differently.
The headline math here measures GPU-productivity impact, not storage line-item savings. A 5% improvement on a multi-hundred-thousand-dollar monthly GPU bill is worth more than even a large percentage discount on a storage line item. The tool is intentionally focused on the GPU-economic story.
The BANDWIDTH_PER_GPU_GBPS constant in the source is the conservative lower bound. For heavier workloads (small shards, low cache locality), tune it upward and the math will recompute. The Tier Comparison "Capability" column shows whether a given tier can keep up with the resulting demand.
The live stage runs demonstrate the patterns end-to-end — multipart uploads, parallel range reads, mid-run checkpointing, model registry, inference cold-start — using small data so the demo finishes in under a minute. The dollar projections run against the configured workload (Profile 1/2/3 in the workload preset). Each completed stage card explicitly says "↓ Projects to X PB freed" to flag the scale shift.
The Configure panel is shown expanded so you can see, and change, every input that drives the headline numbers. Nothing is hidden behind a magic button. If you want a cleaner view, click the Configure summary to collapse it; the hero tiles stay populated.
The Capability column is the fraction of the cluster's read demand the tier can actually serve. Yellow (under 100%) means the tier is undersized for this cluster, so goodput and flash benefits are prorated. Green (100%) means the tier saturates the workload and delivers the full benefit. The optimizer prefers the smallest 100%-capable tier because anything larger adds tier premium without adding benefit.
Goodput Recovered does change between tiers, but only when the bandwidth capability changes. On a 256-GPU cluster at 1.0 Gbps/GPU, every Overdrive tier at or above the 256 Gbps demand (300, 400, and 800 Gbps) delivers the same goodput recovery because they all saturate the workload at 100% capability. Smaller tiers (100 Gbps, 200 Gbps) show lower goodput because they're undersized. This is the model behaving honestly: once a tier is big enough to feed the cluster, more bandwidth doesn't recover more goodput.
The "no goodput delta" note on Standard does not mean Standard is the wrong choice. For small workloads Standard is the right answer (Profile 1 is built around exactly that case). The note only says Standard is the baseline by definition, so there's no delta to recover when comparing it to itself. At Overdrive tiers the math measures the delta from the Standard baseline.
Cleanup deletes only objects under the configured DEMO_PREFIX (default goodput-demo/). The confirmation prompt names the exact bucket and prefix being wiped. Other prefixes in the same bucket are untouched. Even so — use a dedicated demo bucket if you're running this against your own credentials.
Live demo: https://genai.backblazedemos.xyz/goodput/
Author: Kevin Lott · klott@backblaze.com