How this demo works

From plain English to the math. Starts simple, gets deep. Designed so anyone — a finance lead, an infrastructure VP, or an engineer — can follow the story and verify the numbers.

Estimate only — not a quote. Figures throughout this document are projections derived from the Configure inputs and Backblaze list pricing on the date shown. Actual results depend on your specific workload, cluster utilization, contracted rates, and implementation. This is for discussion purposes and does not constitute a sales quote, pricing commitment, service-level agreement, or warranty. The I/O wait and flash-replacement defaults are working defaults pending pilot validation — instrument your own cluster and override the Configure fields before relying on these numbers for budget decisions. Final terms require a signed agreement.

The 30-second version

Goodput, not throughput, is what gets paid for. A 256-H100 cluster at $4/GPU·hr is $1,024/hr — every percentage point of I/O wait is real money. Storage that keeps GPUs busy converts idle GPU dollars into billable training dollars.
Four numbers move together, not one. Goodput recovered (less idle wait), checkpoint stall eliminated (faster flushes), flash footprint displaced (B2 serves the hot working set so your flash bill shrinks), and revenue unlocked (resell the freed flash). Each is computed independently; the first three, less the tier premium, sum into Net ROI, and revenue unlocked is reported as its own tile.
Net ROI = goodput + stall + flash displacement − tier premium. "Auto-select optimal tier" picks the tier that maximizes that sum for the configured workload. If no Overdrive tier produces positive ROI, the recommendation is Standard B2.
The capability cap keeps the math honest. A 100 Gbps tier on a 256-GPU cluster only serves 39% of demand, so it earns 39% of the goodput delta — not 100%. Undersized tiers don't get credit they didn't earn.
When does any of this start to matter? ~32+ GPUs (so 1% idle is real money) AND ~3+ PB hot working set (so storage actually stalls the cluster). Below both, Standard B2 is the right answer and the tool says so. Profile 1 is built to demonstrate exactly that case.
Monthly impact range: Profile 1 (inference / fine-tune, 1–3 PB, 16 GPUs) → typically Standard, no Overdrive premium · Profile 2 (25 PB, 256 H100s frontier pretraining) → ~$600K/mo total · Profile 3 (100 PB, 512 GPUs video / multimodal) → ~$1.9M/mo on flash displacement alone.

Want the math? Jump to §6. Want to validate against your own workload? §7. Quick walkthrough of the demo? §4.

What's in here

Start here — what's "goodput" and why should I care?
The opportunity — who has this problem
What this tool shows
Demo Walkthrough
Every field, explained — and what we left out
The math — formulas and derivation
Assumptions to validate against your workload
FAQ — questions you'll get from engineers and CFOs

1. Start here — what's "goodput" and why should I care?

The kitchen story

Imagine a very expensive chef. The chef costs $1,000 an hour. The chef can cook fast, but only if the ingredients show up on time. If the delivery truck is late, the chef just stands there with a knife. You still pay the chef.

That's what's happening to GPUs every day. An H100 GPU rents for about $3 to $4 per hour. A cluster of 256 H100s costs more than $1,000 per hour to run. AI companies pay that whether the GPUs are training a model or sitting there waiting for the next batch of data.

Throughput vs goodput

Two words that sound the same but mean different things:

Throughput — how fast the truck CAN drive. The top speed.
Goodput — how many packages actually arrive on time at the right house. The useful work.

A truck with a 200 mph top speed and zero packages delivered has high throughput and zero goodput. A truck driving 60 mph and delivering every package on schedule has lower throughput and perfect goodput. Goodput is what you actually get paid for.

For an AI training cluster, goodput is the percentage of GPU time spent actually computing, not waiting on storage. A "95% goodput" cluster means 5% of the time the GPUs are idle — waiting for the next shuffle of training data, or waiting for a checkpoint to finish flushing to disk.

Why a few percentage points matter so much

A 256-GPU H100 cluster at $4/hr per GPU is $1,024/hr. Over a month (720 hours) that's $737,000.

If your goodput is 95% (5% idle), you waste ~$37,000/mo on idle GPU time.
If you can get to 99% goodput, you save ~$30,000/mo on the same hardware bill.
At 512 GPUs the same delta becomes ~$60,000/mo.

That's the whole idea in one paragraph: storage that keeps GPUs busy converts idle GPU dollars into billable training dollars. The faster and steadier the storage, the more of the GPU bill is doing useful work.
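The arithmetic behind those three figures can be reproduced with a short Python sketch. The function name is illustrative and not part of the tool; the 720-hour month and $4/GPU·hr rate are the values used in this section.

```python
HOURS_PER_MONTH = 720  # 30-day month, as used in this section


def monthly_idle_cost(gpus: int, gpu_rate_per_hr: float, goodput_pct: float) -> float:
    """Dollars per month paid for GPUs that are waiting instead of computing."""
    cluster_rate_per_hr = gpus * gpu_rate_per_hr
    idle_fraction = 1.0 - goodput_pct / 100.0
    return cluster_rate_per_hr * HOURS_PER_MONTH * idle_fraction


idle_at_95 = monthly_idle_cost(256, 4.00, 95.0)
idle_at_99 = monthly_idle_cost(256, 4.00, 99.0)
print(f"idle cost at 95% goodput: ${idle_at_95:,.0f}/mo")
print(f"saved by reaching 99%:    ${idle_at_95 - idle_at_99:,.0f}/mo")
```

Doubling `gpus` to 512 doubles both results, which is where the larger-cluster figure comes from.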

When does goodput actually start to matter?

Two conditions have to be true at the same time. If either one isn't, the goodput math reduces to zero (and Standard B2 is the right answer).

Threshold	Where it tips
Cluster size large enough that 1% idle = real money	~32+ GPUs (roughly $100–$130/hr cluster cost at $3–$4 per GPU-hour). Below this, even significant goodput improvements are too small to justify a tier upgrade.
Working set large enough that storage actually stalls the cluster	~3+ PB hot working set, OR bursty checkpoints ≥ 100 GB on a 64+ GPU cluster. Below this, Standard B2's 50 Gbps already keeps the GPUs fed.

Below both thresholds — Profile 1 territory in the demo — Standard B2 is the right answer and the goodput tiles correctly read $0. The tool will recommend Standard when you click "Auto-select optimal tier" in this regime.

Above both thresholds — Profile 2 and Profile 3 territory — the goodput story dominates. The Net ROI math turns positive for an Overdrive tier sized to the cluster's demand.
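The two thresholds combine as a logical AND, and the working-set condition is itself an OR. A minimal Python sketch of that check follows. The function name is hypothetical, and the cutoffs (32 GPUs, 3 PB, 100 GB, 64 GPUs) are the approximate values from the table, not constants confirmed from the tool's source.

```python
def overdrive_worth_evaluating(gpus: int, hot_working_set_pb: float,
                               checkpoint_gb: float) -> bool:
    """Return True when both rough thresholds from the table are met."""
    # Condition 1: cluster large enough that 1% idle is real money.
    cluster_big_enough = gpus >= 32
    # Condition 2: storage can actually stall the cluster, through either a
    # large hot working set or large bursty checkpoints on a 64+ GPU cluster.
    storage_can_stall = hot_working_set_pb >= 3.0 or (
        checkpoint_gb >= 100.0 and gpus >= 64
    )
    return cluster_big_enough and storage_can_stall


# Profile 2 defaults: 256 GPUs, 8 PB hot (32% of 25 PB), 800 GB checkpoints.
print(overdrive_worth_evaluating(256, 8.0, 800.0))
# A 16-GPU cluster fails the first condition regardless of data size.
print(overdrive_worth_evaluating(16, 1.0, 105.0))
```

This only says whether an Overdrive tier is worth modeling; the recommendation itself still comes from the Net ROI comparison.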

2. The opportunity — who has this problem

The four customer segments this is built for

Segment	What they sell	Their pain	How Overdrive helps
GPU cloud providers ("neoclouds" renting GPU capacity by the hour)	GPU capacity by the hour to AI builders	Margin is GPU $/hr minus storage and ops. Every idle GPU minute is dead margin. Flash is expensive, and it is paid for whether customers re-read it or not.	Higher goodput → more billable GPU hours from the same hardware. Flash displacement → smaller flash footprint to amortize.
GPU orchestration platforms (schedulers that pack training jobs across clusters)	Software that schedules and packs training jobs across clusters	Their value prop IS goodput: they promise "we make your GPUs busier." If the underlying storage stalls, the scheduler can't fix it.	Overdrive is the storage layer that makes their orchestration claims defensible end-to-end.
AI infrastructure providers (bare-metal training, fine-tune-as-a-service, inference platforms)	Compute platforms measured in tokens/sec or training $/epoch	SLA pressure from customers who measure everything. A 2-minute checkpoint stall × 4 jobs × 24 hours is hours of lost training per day.	Checkpoint stall elimination directly recovers customer-visible throughput.
Data providers for AI training (web-scale crawlers, media aggregators, dataset publishers)	Crawl, curate, and package petabytes of training data for downstream AI builders	Petabytes of cold data re-read on every training cycle. Flash for this is economically impossible; slow object storage stalls every downstream training run.	Overdrive serves PB-scale catalogs at training-cluster line rate, without the flash bill.

What they all share

A flash bill they're trying to shrink (drive cost + power + cooling + ops)
A GPU bill they're trying to monetize harder (sell more billable hours from the same cluster)
A CFO asking "why are we paying for storage we're not actively reading?"
Customers asking "why does my training run stall for two minutes every hour at every checkpoint?"

The core idea: the conversation about AI infrastructure storage is moving away from storage line-item comparisons. The bigger question is whether storage can convert idle GPU dollars into billable GPU dollars and let you displace expensive flash. This demo models that math.

3. What this tool shows

The page has four big numbers across the top — the "hero tiles." Each is one piece of the goodput story.

Tile	What it answers	Who cares most
Goodput Recovered / mo	"How much of my GPU bill stops being wasted on I/O wait?"	CRO, CFO — direct revenue impact
Checkpoint Stall Eliminated	"How much GPU time stops being burned during checkpoint flushes?"	Platform engineer, AI customers
Flash Footprint Displaced	"How much of my expensive flash can I retire because B2 can serve the hot working set?"	Infrastructure VP, finance
Revenue Unlocked / mo	"What can I charge customers for the flash I just freed up?"	CRO — the resale story

Why these four, and not a simple storage cost comparison?

Storage is a small fraction of any AI training bill. Even cutting it 80% doesn't move the conversation. GPU spend is 10–100× the storage spend; once that ratio passes 50×, a 1% goodput improvement is worth more than a 50% storage discount. This tool surfaces that math directly instead of burying it inside a per-TB rate sheet.

How to read the baseline

When the page loads on Standard B2 (50 Gbps), every goodput tile shows $0 with an explainer ("Standard tier — no goodput delta to claim"). That's intentional, not a bug. Standard B2 IS the baseline — there's no delta to compare against itself. Switch to an Overdrive tier (100, 200, 300, 400, or 800 Gbps) to see the math kick in and the dollar values change.

4. Demo Walkthrough

The fastest way to learn what the tool does is to click through it. The walkthrough below takes about five minutes.

Open the page. Profile 2 (frontier pretraining — 25 PB dataset, 256 H100s) is the default scenario. The hero tiles show roughly $600K/mo total monthly impact on 300 Gbps Overdrive — that's the headline number this profile produces.
Click the Standard B2 tier button. Every Overdrive-dependent tile drops to $0. This is the baseline — Standard B2 is the comparison point and has no "Overdrive delta" by definition.
Click "⚡ Auto-select optimal tier." The tool picks the Overdrive tier with the highest Net ROI for the configured workload and shows the reasoning in the live event stream below.
Switch to Profile 1 (inference / fine-tune — 1–3 PB workload). The optimizer will recommend Standard B2, not Overdrive. At this scale the math doesn't justify the Overdrive premium — and the tool says so honestly.
Switch to Profile 3 (video / multimodal at scale — 100 PB, 512 GPUs). Flash Footprint Displaced jumps to about 24 PB. This profile illustrates the flash-displacement story for very large data libraries.
Click "Run." The full 8-stage pipeline executes against the demo bucket. Hero tiles fill in progressively as each stage completes; the "Open in B2 console" link on each finished stage card surfaces the actual object in the bucket so you can verify nothing is simulated. Prefer to skip the wait? Click the small "skip" link next to the Run button — tiles populate from the math immediately.

Tuning the scenario to your workload

Open the Configure panel to adjust every input directly — cluster size, GPU $/hr, checkpoint cadence, hot working set %, flash $/TB-mo, and I/O wait %. The hero tiles and Tier Comparison panel update live as you change any field. The Tier Comparison panel breaks every tier down (goodput / stall / flash / premium / net ROI) so you can compare them side-by-side.

Switching to Full pipeline mode

The Run button executes all 8 pipeline stages end-to-end in sequence: Data Lake → Training Prep → Checkpoints → Model Registry → RAG Knowledge → KV Cache → Inference → Exhaust. Five of the stages (Data Lake, Training Prep, Checkpoints, Model Registry, Inference) carry the goodput / stall / flash-displacement headline economics; the other three (RAG Knowledge, KV Cache, Exhaust) demonstrate the full set of B2 patterns in a production AI pipeline. Individual stage cards also have their own Run / Re-run buttons if you want to demo a single pattern in isolation.

5. Every field, explained — and what we left out

Fields included

Workload profile

Field	Default	Why it's here
Profile 1 — inference / fine-tune	1–3 PB, 16 GPUs	Credibility profile. Math correctly recommends Standard B2 here.
Profile 2 — frontier pretraining	25 PB, 256 H100s (default)	The scenario where Overdrive economics typically carry. Frontier LLM pretraining is the canonical fit.
Profile 3 — video / multimodal at scale	100 PB, 512 GPUs	The flash-displacement story for the largest accounts.

Flash cost model

Field	Default	Why it's here
Drive price ($/TB hardware)	$490 (QLC) / $570 (TLC)	Q1 2026 reference pricing for 30.72 TB enterprise drives at volume.
Amortization (months)	36	Standard 3-year refresh cycle most CFOs assume.
DC overhead (%)	50	Power + cooling + fabric + ops markup on raw drive cost. 30-60% is typical for high-density flash.
Sell price ($/TB-mo to customers)	$100	What a neocloud charges its customers for flash. Used in Revenue Unlocked math.
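This page does not spell out how the flash cost fields combine. One plausible reading, sketched below in Python, amortizes the drive price over the refresh cycle and applies DC overhead as a markup on the monthly hardware cost. The formula and the function name are assumptions for illustration, not the tool's confirmed implementation.

```python
def flash_cost_per_tb_month(drive_price_per_tb: float,
                            amortization_months: int = 36,
                            dc_overhead_pct: float = 50.0) -> float:
    """Assumed model: amortized hardware cost plus a percentage DC markup."""
    hardware_per_month = drive_price_per_tb / amortization_months
    return hardware_per_month * (1.0 + dc_overhead_pct / 100.0)


qlc_cost = flash_cost_per_tb_month(490.0)  # QLC default from the table
tlc_cost = flash_cost_per_tb_month(570.0)  # TLC default from the table
print(f"QLC: ${qlc_cost:.2f}/TB-mo, TLC: ${tlc_cost:.2f}/TB-mo")
```

Under this reading, the gap between the result and the sell price is the operator's gross flash margin per TB-month.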

Goodput scenario

Field	Default	Why it's here
Cluster size (GPUs)	256	Drives `cluster $/hr` in every goodput formula.
GPU $/hr	$4.00	H100 on-demand rate. Reserved instances are lower; CFO will tune.
Model size	70B	Auto-fills checkpoint size (~13 bytes/param for full training state per MLCommons MLPerf Storage benchmark).
Checkpoint size (GB)	800	Hourly checkpoint of a 70B model. MLPerf anchor buttons next to the input auto-fill canonical sizes: 8B → 105 GB, 70B → 912 GB, 405B → 5.29 TB, 1T → 15 TB (MLCommons MLPerf Storage).
Checkpoint cadence	every 1 hr	Frontier pretraining typical. Faster cadence = more stall payoff for Overdrive.
Concurrent jobs	4	Multi-tenant clusters checkpoint each job independently. Multiplies the stall payoff linearly.
Hot working set (% of dataset)	32%	Fraction of the dataset that has to stay on flash for active shuffled reads.
Flash $/TB-mo (displacement)	$100	The flash cost basis Overdrive is replacing. Mirrors the sell-price field above.
Std / Overdrive I/O wait %	4.0% / 0.8%	Working defaults — should be validated against your cluster's telemetry before treating as authoritative.
Overdrive flash-replace %	80%	Fraction of hot working set Overdrive can serve at line rate (so flash isn't needed for it).
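The model-size auto-fill can be approximated with the ~13 bytes/param rule of thumb quoted in the table. The Python helper below is a hypothetical sketch, not the tool's code. The MLPerf anchor buttons remain the reference values and do not match the rule of thumb exactly at every size.

```python
BYTES_PER_PARAM = 13  # rule of thumb quoted above for full training state


def checkpoint_size_gb(params_billions: float) -> float:
    """Approximate checkpoint size in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM
    return total_bytes / 1e9


for size in (8, 70, 405):
    print(f"{size}B params -> ~{checkpoint_size_gb(size):,.0f} GB")
```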

Fields we deliberately left out

What we removed	Why
"Training runs / month"	An earlier version multiplied per-run load-time savings by runs/month and produced implausible $30M+ Net ROI numbers. Real training I/O isn't sequential per run — it's continuous shuffled reads plus checkpoint flushes. Removing this knob keeps the numbers honest.
"S3 egress fees" / "S3 storage rate"	The old pitch lived on the S3 comparison. The reframe puts the GPU dollar conversation first; comparing storage line items is a distractor at this stage of the pitch.
"Cold tier vs warm tier vs hot tier" sliders	Too much detail for a 5-minute opener. Rolled into a single "Hot working set %" input. Engineers who want the breakdown can read the stage flash-freed fractions in the source.
"$/TB·mo for B2 Overdrive" as an input	Hardcoded per tier — these are published Backblaze rates, not tunable inputs. The Standard rate is $6.95/TB·mo; Overdrive tiers are listed in §6.5.
"Per-GPU bandwidth need" as a UI knob	Set as a constant (`1.0 Gbps/GPU`, the conservative lower bound from published H100 training measurements). Heavier or lighter workloads can override the constant in the source; the assumption is called out in §7 so it can be validated against real telemetry.
A separate "Calculator" panel	Used to have its own inputs duplicating Configure. Removed for clarity — one set of inputs drives one set of outputs.

6. The math — formulas and derivation

Every formula below is what the tool computes. Each is presented with a worked example using Profile 2 defaults so the chain of reasoning is reproducible by hand.

The premise behind the math. Public sources establish that storage performance materially affects GPU economics:

AWS training-performance documentation: inefficient I/O on 4,000-accelerator clusters can waste thousands of GPU-hours daily.
Azure training reports: checkpoint overhead averages 12% of total training time, up to 43% in some large-model scenarios.
VAST Data + WekaIO production telemetry: I/O wait reductions in the 3–10% band depending on shard pattern + storage layer.
MLCommons MLPerf Storage benchmark: uses I/O wait as the proxy for "good vs bad storage"; the 2–8% envelope separates well-tuned from un-tuned object-storage stacks.
Backblaze production reference: customers have driven 100+ Gbps sustained to B2 buckets in production (operational pattern: packed large objects, parallel range reads to local NVMe). This grounds the Overdrive throughput tiers as deliverable, not aspirational.

The math below converts these established storage-side levers into per-tier dollar impact for the operator's economic model. Defaults are positioned mid-range conservative; the Configure panel lets you tune every input to your fleet's measured telemetry.

6.1 — Goodput Recovered / mo

cluster_$_per_hr      = cluster_size × gpu_$_per_hr
io_wait_delta         = max(0, (io_wait_std% − io_wait_od%) / 100)
training_hrs_per_mo   = 720 × cluster_utilization        // default 80% → 576 hrs
goodput_$_per_mo      = cluster_$_per_hr × io_wait_delta × tier_capability × training_hrs_per_mo

What it measures

The dollar value of GPU cluster time that's no longer spent in I/O wait. If Standard B2 leaves the cluster waiting on storage 4% of the time, and Overdrive brings that to 0.8%, then 3.2% of the cluster's hourly bill becomes productive work instead of idle GPUs — applied only during the 576 effective training hours each month (a production cluster spends ~20% of every month idle: maintenance, queue gaps, restart windows).

Worked example (Profile 2 on 300 Gbps Overdrive)

cluster_$/hr        = 256 × $4    = $1,024
delta               = (4.0 − 0.8) / 100  = 0.032
capability          = min(1, 300 / (256 × 1)) = 1.0
training_hrs_per_mo = 720 × 0.80 = 576
goodput             = 1,024 × 0.032 × 1.0 × 576
                    ≈ $18,874 / mo

Why the capability cap? A 100 Gbps tier can't fully feed a 256-GPU cluster (256 GPUs × 1 Gbps/GPU = 256 Gbps required). The smaller tier only serves 100/256 ≈ 39% of the demand, so it only delivers 39% of the I/O wait improvement. Without the cap, the model would credit any Overdrive tier with full goodput recovery regardless of size — an obvious overstatement once you do the bandwidth math.
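The formula and the capability cap can be checked with a few lines of Python. This is a sketch that mirrors the pseudocode in this section; the function and parameter names are illustrative, and the capability value is passed in rather than looked up from a tier table.

```python
def goodput_recovered(cluster_size: int, gpu_rate_per_hr: float,
                      io_wait_std_pct: float, io_wait_od_pct: float,
                      tier_capability: float,
                      cluster_utilization: float = 0.80) -> float:
    """Monthly dollars of GPU time no longer spent in I/O wait."""
    cluster_rate_per_hr = cluster_size * gpu_rate_per_hr
    io_wait_delta = max(0.0, (io_wait_std_pct - io_wait_od_pct) / 100.0)
    training_hrs_per_mo = 720 * cluster_utilization
    return cluster_rate_per_hr * io_wait_delta * tier_capability * training_hrs_per_mo


# Profile 2: 300 Gbps saturates a 256-GPU cluster, 100 Gbps serves 100/256 of it.
full = goodput_recovered(256, 4.00, 4.0, 0.8, tier_capability=min(1.0, 300 / 256))
capped = goodput_recovered(256, 4.00, 4.0, 0.8, tier_capability=min(1.0, 100 / 256))
print(f"300 Gbps: ${full:,.0f}/mo   100 Gbps: ${capped:,.0f}/mo")
```

The 100 Gbps result is the 300 Gbps result scaled by 100/256, which is the proration the cap is meant to enforce.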

6.2 — Checkpoint Stall Eliminated

Why checkpoint stall is a real economic lever, not a model artifact. Public sources confirm checkpoint overhead is one of the largest underclaimed sources of GPU idle time in production training:

Azure reports checkpoint overhead averaging 12% of total training time, with peaks to 43% in some large-model scenarios.
AWS reports that inefficient checkpointing on 4,000 accelerators can waste thousands of GPU-hours daily.
VAST Data production telemetry across 40+ large training runs shows that median checkpoint overlap stays under 10% in most runs only when storage is appropriately provisioned.

The default checkpoint sizes used in the math are from the MLCommons MLPerf Storage benchmark: 8B → 105 GB, 70B → 912 GB, 405B → 5.29 TB, 1T → 15 TB. The Configure panel exposes these as model-size anchor buttons.

stall_std_min          = (ckpt_GB × 8) / standard_gbps / 60
stall_od_min           = (ckpt_GB × 8) / tier_gbps     / 60
delta_min              = max(0, stall_std_min − stall_od_min)
ckpts_per_hr           = 60 / cadence_min
training_hrs_per_mo    = 720 × cluster_utilization              // default 80% → 576 hrs
hours_recovered_per_mo = (delta_min / 60) × ckpts_per_hr × training_hrs_per_mo × concurrent_jobs
stall_$_per_mo         = hours_recovered_per_mo × cluster_$_per_hr

What it measures

Every training run flushes checkpoints periodically. While flushing, the cluster stalls: every GPU waits on the checkpoint write to complete. Standard B2 at 50 Gbps takes about 2.1 minutes to flush an 800 GB checkpoint. Overdrive at 200 Gbps does it in 32 seconds. That 1.6-minute delta, repeated hourly across 4 concurrent training jobs over the 576 effective training hours, adds up.

Worked example (Profile 2 on 300 Gbps)

stall_std           = (800 × 8) / 50  / 60 = 2.133 min
stall_od            = (800 × 8) / 300 / 60 = 0.356 min
delta               = 1.778 min
ckpts/hr            = 60 / 60 = 1
training_hrs_per_mo = 720 × 0.80 = 576
hours/mo            = (1.778/60) × 1 × 576 × 4 jobs
                    = 68.27 hours
stall_$/mo          = 68.27 × $1,024
                    ≈ $69,900 / mo

Multi-tenant cluster modeling: the tool multiplies stall recovery by concurrent_jobs. In a multi-tenant cluster each job checkpoints on its own cadence, so the payoff scales linearly with the number of concurrent flushers. Single-tenant clusters should set concurrent_jobs = 1 in the Configure panel for an unmultiplied result.
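The stall formulas translate directly into Python. The sketch below uses illustrative names and carries full precision through every step, so its result can differ by a few tens of dollars from a hand calculation that rounds the intermediate minutes.

```python
def checkpoint_stall_recovered(ckpt_gb: float, tier_gbps: float,
                               cadence_min: float, concurrent_jobs: int,
                               cluster_rate_per_hr: float,
                               standard_gbps: float = 50.0,
                               cluster_utilization: float = 0.80):
    """Return (cluster hours recovered per month, dollars per month)."""
    stall_std_min = ckpt_gb * 8 / standard_gbps / 60   # GB -> Gb -> seconds -> minutes
    stall_od_min = ckpt_gb * 8 / tier_gbps / 60
    delta_min = max(0.0, stall_std_min - stall_od_min)
    ckpts_per_hr = 60 / cadence_min
    training_hrs_per_mo = 720 * cluster_utilization
    hours = (delta_min / 60) * ckpts_per_hr * training_hrs_per_mo * concurrent_jobs
    return hours, hours * cluster_rate_per_hr


# Profile 2 on 300 Gbps: 800 GB checkpoint, hourly cadence, 4 jobs, $1,024/hr cluster.
hours, dollars = checkpoint_stall_recovered(800, 300, 60, 4, 1024.0)
print(f"{hours:.2f} cluster-hours/mo recovered, about ${dollars:,.0f}/mo")
```

Setting `concurrent_jobs=1` gives the single-tenant figure, one quarter of the Profile 2 result.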

6.3 — Flash Footprint Displaced

Hot/cold split is the lever VAST and WekaIO have monetized for years. Their published whitepapers and customer references repeatedly cite hot working sets in the 20–40% range of total training corpus depending on epoch size, shuffle pattern, and dataset scale, with a cold tail of 60–80%. The formulas below apply the Overdrive flash-replace % to the hot working set (hot_PB), not to the cold tail. The default 32% hot working set is mid-range conservative against those public ranges. Validate it against your fleet's actual hot/cold telemetry: instrument with object-access timestamps to derive your real number.

dataset_PB     = pipeline_training_TB / 1,000     (decimal PB, matches spec §4)
hot_PB         = dataset_PB × hot_working_set% / 100
displaced_PB   = hot_PB × overdrive_replace% / 100 × tier_capability
displaced_TB   = displaced_PB × 1,000
flash_$        = displaced_TB × flash_$_per_TB_per_mo
b2_overdrive_$ = displaced_TB × tier_storage_rate
net_savings    = max(0, flash_$ − b2_overdrive_$)

What it measures

The amount of flash storage a customer no longer needs to provision because Overdrive can serve the hot working set at line rate. For a 100 PB customer with a 30 PB hot tier, Overdrive eliminating 80% of that hot-tier flash provisioning is a 24 PB reduction. At $100/TB-mo flash and $19/TB-mo B2 Overdrive, that's ~$1.9M/mo of net storage savings.

Worked example (Profile 2 on 300 Gbps)

dataset       = 25,000 TB / 1,000 = 25 PB
hot           = 25 × 0.32        = 8 PB
displaced     = 8 × 0.80 × 1.0   = 6.4 PB (6,400 TB)
flash_$       = 6,400 × $100     = $640,000
b2_od_$       = 6,400 × $19      = $121,600
net           = $640,000 − $121,600
              ≈ $518,400 / mo
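The same chain can be written as a small Python function for checking other inputs. Names are illustrative; the $19/TB-mo tier rate is the value used in the worked example, and the capability factor is passed in.

```python
def flash_displacement(dataset_tb: float, hot_working_set_pct: float,
                       overdrive_replace_pct: float, tier_capability: float,
                       flash_rate_per_tb_mo: float, tier_storage_rate: float):
    """Return (displaced PB, net monthly savings in dollars)."""
    dataset_pb = dataset_tb / 1000  # decimal PB
    hot_pb = dataset_pb * hot_working_set_pct / 100
    displaced_pb = hot_pb * overdrive_replace_pct / 100 * tier_capability
    displaced_tb = displaced_pb * 1000
    flash_cost = displaced_tb * flash_rate_per_tb_mo
    b2_overdrive_cost = displaced_tb * tier_storage_rate
    return displaced_pb, max(0.0, flash_cost - b2_overdrive_cost)


# Profile 2 on 300 Gbps: 25 PB dataset, 32% hot, 80% replaceable, full capability.
displaced_pb, net_savings = flash_displacement(25_000, 32, 80, 1.0, 100.0, 19.0)
print(f"{displaced_pb:.1f} PB displaced, net ${net_savings:,.0f}/mo")
```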

6.4 — Revenue Unlocked / mo

freed_TB        = Σ STAGE_FLASH_FREED_TB[stage] for stage in completed_stages
margin_per_TB   = flash_sell_price − b2_storage_rate
revenue_$_per_mo = max(0, freed_TB × margin_per_TB)

What it measures

The neocloud's resale opportunity. As each demo stage offloads its data tier to B2, that flash capacity is freed up. The neocloud can resell that flash to a new GPU customer at the going rate (~$100/TB-mo) while paying B2 only the storage rate (~$6.95–$19/TB-mo depending on tier). The margin is recurring revenue.

Per-stage flash-freed assumptions

Stage	Flash freed	What stays hot
Data Lake	90% of training corpus	~10% re-training cache
Checkpoints	75% of checkpoint history	Active + last 2-3 checkpoints
Model Registry	24% of model tier	Currently deployed model
RAG Knowledge	7.5% of model tier	Vector index + embeddings
KV Cache	8.25% of model tier	Hot KV blocks for inference
Exhaust	70% of logs	72-hour log window

These per-stage fractions are educated estimates. Validate against actual customer deployment patterns before externalizing as authoritative numbers.
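A minimal Python sketch of the Revenue Unlocked formula follows. The per-stage freed-TB values in the example are invented placeholders, since this page gives fractions rather than absolute TB per stage; the stage keys and function name are hypothetical too.

```python
def revenue_unlocked(freed_tb_by_stage: dict, completed_stages: list,
                     flash_sell_price: float, b2_storage_rate: float) -> float:
    """Monthly resale margin on flash freed by the completed stages."""
    freed_tb = sum(freed_tb_by_stage[stage] for stage in completed_stages)
    margin_per_tb = flash_sell_price - b2_storage_rate
    return max(0.0, freed_tb * margin_per_tb)


# Placeholder freed-TB figures for illustration only.
example_freed_tb = {"data_lake": 1000.0, "checkpoints": 200.0}
after_one_stage = revenue_unlocked(example_freed_tb, ["data_lake"], 100.0, 6.95)
after_two_stages = revenue_unlocked(
    example_freed_tb, ["data_lake", "checkpoints"], 100.0, 6.95
)
print(f"${after_one_stage:,.0f}/mo, then ${after_two_stages:,.0f}/mo")
```

Because the sum runs only over completed stages, the tile grows as each stage finishes, matching the progressive fill described in the walkthrough.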

6.5 — Tier Premium and Net ROI (Tier Comparison table)

storage_TB         = training_TB + checkpts_TB + models_TB + logs_TB
standard_$_per_mo  = storage_TB × $6.95
tier_$_per_mo      = max(tier_min_monthly_commit,
                         storage_TB × tier_storage_rate + tier_network_fee)
tier_premium       = max(0, tier_$_per_mo − standard_$_per_mo)

net_ROI_per_mo     = goodput_$ + stall_$ + net_flash_$ − tier_premium

This is what the "Auto-select optimal tier" button uses to pick the best tier. The winner is whichever Overdrive tier maximizes net ROI for the configured workload. If no Overdrive tier produces positive ROI (small workloads where the tier premium exceeds the gains), it recommends Standard B2.
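The premium, Net ROI, and selection rule can be sketched in Python as below. This is an illustration under stated assumptions: the function names are hypothetical, the 1,000 TB example and the Net ROI figures passed to `auto_select` are made up, and only the $6.95 Standard rate and the $19 rate from the §6.3 worked example come from this page.

```python
STANDARD_RATE = 6.95  # $/TB-mo, the Standard B2 rate quoted in this document


def tier_premium(storage_tb: float, tier_storage_rate: float,
                 tier_network_fee: float = 0.0,
                 tier_min_monthly_commit: float = 0.0) -> float:
    """Extra monthly cost of a tier over Standard B2, floored at zero."""
    standard_cost = storage_tb * STANDARD_RATE
    tier_cost = max(tier_min_monthly_commit,
                    storage_tb * tier_storage_rate + tier_network_fee)
    return max(0.0, tier_cost - standard_cost)


def net_roi(goodput: float, stall: float, net_flash: float, premium: float) -> float:
    return goodput + stall + net_flash - premium


def auto_select(net_roi_by_tier: dict) -> str:
    """Pick the tier with the highest Net ROI, or Standard B2 if none is positive."""
    if not net_roi_by_tier:
        return "Standard B2"
    best_tier, best_value = max(net_roi_by_tier.items(), key=lambda item: item[1])
    return best_tier if best_value > 0 else "Standard B2"


premium = tier_premium(1000.0, 19.0)  # hypothetical 1,000 TB footprint
print(f"premium: ${premium:,.0f}/mo")
print(auto_select({"100 Gbps": -500.0, "300 Gbps": 1200.0}))  # made-up ROI values
print(auto_select({"100 Gbps": -500.0, "300 Gbps": -50.0}))
```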

6.6 — The bandwidth-capability cap

cluster_demand_gbps = cluster_gpus × BANDWIDTH_PER_GPU_GBPS  (default: 1.0 Gbps/GPU)
tier_capability     = min(1, tier_gbps / cluster_demand_gbps)

An Overdrive tier that can't deliver enough bandwidth for the cluster can't fully deliver its benefits. The capability cap prorates Goodput Recovered and Flash Displaced by the fraction of cluster demand the tier can actually serve.

Examples:

100 Gbps tier on a 256-GPU cluster: capability = 100/256 = 39%. The tier is undersized; goodput delta and flash displacement are scaled by 0.39.
300 Gbps tier on a 256-GPU cluster: capability = 100% (capped at 1.0). Tier saturates the workload; full benefit delivered.
800 Gbps tier on a 256-GPU cluster: capability = 100%. Over-provisioned — same benefit as 300 Gbps but higher tier premium, so net ROI drops.
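The three examples follow from one function. The Python sketch below uses the tier sizes named on this page and the `BANDWIDTH_PER_GPU_GBPS` constant; the helper `smallest_saturating_tier` is a hypothetical addition that returns the first tier reaching 100% capability, or `None` when no listed tier does.

```python
OVERDRIVE_TIERS_GBPS = (100, 200, 300, 400, 800)  # tiers named in this document
BANDWIDTH_PER_GPU_GBPS = 1.0                      # default per-GPU demand


def tier_capability(tier_gbps: float, cluster_gpus: int,
                    gbps_per_gpu: float = BANDWIDTH_PER_GPU_GBPS) -> float:
    """Fraction of cluster read demand a tier can serve, capped at 1.0."""
    cluster_demand_gbps = cluster_gpus * gbps_per_gpu  # assumes cluster_gpus > 0
    return min(1.0, tier_gbps / cluster_demand_gbps)


def smallest_saturating_tier(cluster_gpus: int,
                             gbps_per_gpu: float = BANDWIDTH_PER_GPU_GBPS):
    """Smallest listed tier with capability 1.0, or None if none qualifies."""
    for tier_gbps in sorted(OVERDRIVE_TIERS_GBPS):
        if tier_capability(tier_gbps, cluster_gpus, gbps_per_gpu) >= 1.0:
            return tier_gbps
    return None


for tier in OVERDRIVE_TIERS_GBPS:
    print(f"{tier} Gbps on 256 GPUs: {tier_capability(tier, 256):.0%}")
print("smallest saturating tier:", smallest_saturating_tier(256))
```

Raising `gbps_per_gpu` moves the saturation point to a larger tier, which is how a heavier workload changes the recommendation.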

Why 1.0 Gbps/GPU? Published H100 training measurements report roughly 2–4 GB/s aggregate read bandwidth for 8 H100s on shuffled training data. That is 16–32 Gbps aggregate, or 2–4 Gbps per GPU, which we treat as the upper bound. We picked 1.0 as a conservative lower bound because real training I/O overlaps with compute (the GPU isn't blocking on every read; the data loader prefetches). For more bandwidth-intensive workloads (e.g., very small shards, low-cache-locality reads) this can be tuned upward in the source.

Why this number differs from the GenAI tool's 0.3 Gbps/GPU default. The two calculators answer different questions and therefore use different capability-cap derivations:

This (Neocloud) tool models the operator's capacity-planning question: "can a tier feed my cluster at line rate?" It uses cluster_gpus × 1.0 Gbps/GPU as the bandwidth demand — a conservative worst-case capacity envelope appropriate for fleet-wide planning.
The GenAI tenant tool models a single training pipeline: "can a tier finish my training run in the time I budgeted?" It derives demand from working_set × epochs / run_duration, with the per-GPU constant defaulting to 0.3 — reflecting pre-staged workflows where the GPU reads from local NVMe most of the time and only occasionally pulls from B2.

Both arrive at a defensible cap; they start from different observable inputs because their audiences plan against different constraints. If you ever need apples-to-apples comparison, override the constant in either tool's source to match the other.

7. Assumptions to validate against your workload

The numbers in this tool are conservative defaults pending pilot validation. Before relying on any of these numbers for a real budget decision, validate the inputs below against your own telemetry. Every Configure field can be tuned to match your environment.

Assumption	Default	Notes	How to validate
Cluster utilization %	80%	Production-reservation typical	Fraction of the month the cluster is actively training. 720 hr/mo × 80% = 576 effective training hours. Production reservations often run higher (90%+) on reserved capacity; research clusters can run lower. The math multiplies goodput recovered + stall recovery by this — overstating utilization overstates monthly impact linearly.
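Standard I/O wait %	4.0%	Working default	Pull cluster utilization telemetry from a representative training run. GPU utilization in `nvidia-smi` or SM activity in DCGM gives the complement: the fraction of time GPUs were doing useful compute. The remaining idle time also includes non-storage stalls, so treat it as an upper bound on I/O wait.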
Overdrive I/O wait %	0.8%	Working default	Same approach in an environment with sustained 200+ Gbps storage throughput. If you don't yet have one, treat this as the modeled outcome and re-measure after a pilot.
Per-GPU bandwidth demand	1.0 Gbps/GPU	Conservative lower bound	Measure actual training-read GB/s, divide by GPU count. Heavier workloads (small shards, low cache locality) can be 2–4 Gbps/GPU.
Flash $/TB-mo (sell price basis)	$100	Mid-market reference	Use your own flash sell price or internal chargeback rate. Common range is $80–$150.
Overdrive flash-replace %	80%	Modeled outcome	This is a workload claim, not a pricing input. Validate by piloting Overdrive against a representative hot working set.
Stage flash-freed fractions	90 / 75 / 40 / 30 / 55 / 70%	Educated estimates from typical AI pipeline patterns	Compare against your retention policy. Numbers vary by training cadence and how much old data stays warm.
QLC flash hardware cost	$490/TB	Public reference (2026 Q1, 30.72 TB drives in volume)	Use your own quote. Sustained-volume agreements often come in lower.
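For the per-GPU bandwidth row, the conversion from a measured aggregate read rate to the Gbps/GPU figure is a one-liner. The Python sketch below uses a hypothetical function name; the only conversion involved is 1 GB/s = 8 Gbps.

```python
def per_gpu_demand_gbps(measured_read_gbytes_per_s: float, gpu_count: int) -> float:
    """Convert an aggregate training-read rate in GB/s to Gbps per GPU."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    aggregate_gbps = measured_read_gbytes_per_s * 8.0  # bytes -> bits
    return aggregate_gbps / gpu_count


# 2 GB/s across 8 GPUs is 2 Gbps per GPU; 32 GB/s across 256 GPUs is 1 Gbps per GPU.
print(per_gpu_demand_gbps(2.0, 8))
print(per_gpu_demand_gbps(32.0, 256))
```

The result is the value to compare against the 1.0 Gbps/GPU default before trusting the Capability column for your cluster.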

Where the I/O wait baselines come from

The 4.0% (Standard) and 0.8% (Overdrive) defaults sit in the middle of public ranges reported by storage and ML infrastructure vendors. They are not measurements of your specific workload — they describe what well-tuned vs un-tuned training-storage stacks typically look like in the industry.

Source	What it reports
AWS S3 Mountpoint & S3 Express One Zone performance documentation	Cold-stream I/O wait on training workloads in the high-single-digit range without local NVMe staging. Throughput improves dramatically with prefetch + parallel range reads (the pattern this tool models).
MLPerf Storage benchmark (MLCommons)	Uses I/O wait as the proxy for "good vs bad storage." The 2–8% envelope separates well-tuned from un-tuned object-storage stacks across published submissions.
VAST Data & WekaIO public whitepapers + production telemetry	Cite measured I/O wait reductions on un-tuned object-storage backed training in the 3–10% band depending on shard pattern.
Hyperscaler + academic telemetry (AWS reports, Azure reports, Meta SPEC paper, Google Pathways)	I/O wait can swing 1–15% depending on dataset size, shard pattern, dataloader concurrency, and checkpoint cadence. Azure specifically reports checkpoint overhead averaging 12% of total training time, up to 43% in some large-model scenarios.

What we don't have: Backblaze does not yet publish a customer-pilot data point of measured baseline I/O wait. The defaults are positioned in the middle of the public envelope as conservative starting points, but the only way to validate the number for your specific workload is to instrument your training run. The Configure panel is where you override the defaults with measured data when available.

8. FAQ

Where do the 4% / 0.8% I/O wait numbers come from?

They are working defaults pending validation against telemetry from a representative training run. The "Where the I/O wait baselines come from" subsection in §7 above cites the public source ranges. To produce numbers you can budget against, take a measurement of your own cluster (GPU utilization from nvidia-smi or SM activity from DCGM gives the complement of idle time, which bounds I/O wait % from above) and update the Configure fields accordingly.

Why does Profile 2 default to 300 Gbps, not 200 Gbps?

The optimizer computes Net ROI for every tier and picks the highest. For a 256-GPU cluster with 1 Gbps/GPU demand, 300 Gbps is the first tier where capability reaches 100% — bigger tiers add tier premium without proportional benefit. Adjust your cluster size or per-GPU bandwidth demand and the optimizer will choose differently.

Why isn't there an object-storage vs object-storage comparison?

The headline math here measures GPU-productivity impact, not storage line-item savings. A 5% improvement on a multi-hundred-thousand-dollar monthly GPU bill is worth more than even a large percentage discount on a storage line item. The tool is intentionally focused on the GPU-economic story.

What if my workload's training I/O is heavier than 1 Gbps/GPU?

The BANDWIDTH_PER_GPU_GBPS constant in the source is the conservative lower bound. For heavier workloads (small shards, low cache locality), tune it upward and the math will recompute. The Tier Comparison "Capability" column shows whether a given tier can keep up with the resulting demand.

The demo only uploads ~200 MB. Why are you projecting petabyte numbers?

The live stage runs demonstrate the patterns end-to-end — multipart uploads, parallel range reads, mid-run checkpointing, model registry, inference cold-start — using small data so the demo finishes in under a minute. The dollar projections run against the configured workload (Profile 1/2/3 in the workload preset). Each completed stage card explicitly says "↓ Projects to X PB freed" to flag the scale shift.

Why does the page open with Configure expanded?

So you can see — and change — every input that drives the headline numbers. Nothing is hidden behind a magic button. If you want a cleaner view, click the Configure summary to collapse it; the hero tiles stay populated.

Can you explain the Capability column in the Tier Comparison table?

It's the fraction of the cluster's read demand the tier can actually serve. Yellow (under 100%) = the tier is undersized for this cluster; goodput and flash benefits are prorated. Green (100%) = the tier saturates the workload; full benefit. The optimizer prefers the smallest 100%-capable tier because anything larger adds tier premium without adding benefit.

Why doesn't the Goodput tile change when I switch between Overdrive tiers?

It does, but only when the bandwidth capability changes. On a 256-GPU cluster, demand is 256 Gbps, so every Overdrive tier above that (300, 400, and 800 Gbps) delivers the same goodput recovery because each saturates the workload at 100% capability. Smaller tiers (100 Gbps, 200 Gbps) show lower goodput because they're undersized. This is the model behaving honestly: once a tier is big enough to feed the cluster, more bandwidth doesn't recover more goodput.

What does "Standard rates — recommend Overdrive to recover goodput" mean? Is Standard bad?

No. For small workloads Standard is the right answer (Profile 1 is built around exactly that case). The note is just saying Standard is the baseline by definition, so there's no delta to recover when comparing it to itself. At Overdrive tiers the math measures the delta from the Standard baseline.

What does the Clean up button do? Is it safe?

Cleanup deletes only objects under the configured DEMO_PREFIX (default goodput-demo/). The confirmation prompt names the exact bucket and prefix being wiped. Other prefixes in the same bucket are untouched. Even so — use a dedicated demo bucket if you're running this against your own credentials.

Live demo: https://genai.backblazedemos.xyz/goodput/
Author: Kevin Lott · klott@backblaze.com