BLOG SERIES — POST 7 OF 7

A Different Architecture.

The Compute-Once Delivery Model, the 95% Cost Reduction Case, and Why the Organizations That Win Will Redesign Delivery — Not Just Governance

This is Post 7 of 7 — the final post in the series: The AI Inference Cost Crisis. Thank you for following the full series.

The six previous posts in this series documented a problem. The spending trajectory. The multiplier. The Jevons Paradox. The regulatory calendar. The margin compression math. This post is about the answer — and specifically about why the answer is architectural rather than purely managerial.

Organizations that respond to the AI inference cost problem by tightening procurement controls, adding FinOps dashboards, and renegotiating contracts will reduce their exposure at the margins. Organizations that redesign the delivery layer will eliminate entire categories of inference cost at scale. The difference is not incremental. It is structural.

Why Governance Alone Is Not Enough

The FinOps Foundation’s 2026 State of FinOps report¹ documents that 98 percent of practitioners now manage AI spend — up from 63 percent in 2025 and 31 percent in 2024. FinOps for AI is the top forward-looking priority across the practitioner community. The top skill gap is AI cost management. But tracking spend and governing it at the workload level are different capabilities. An organization that can report its total monthly AI expenditure without being able to attribute it to individual workflows or business outcomes is watching a number grow, not managing a cost.
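To make the difference between tracking and attribution concrete, here is a minimal Python sketch of workload-level cost attribution. It assumes every inference call is logged with a workload tag and token counts. The record schema, the workload names, and the per-million-token prices are hypothetical placeholders, not any provider's billing format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One logged inference call (hypothetical schema)."""
    workload: str       # tag assigned by the calling system
    input_tokens: int
    output_tokens: int


def cost_by_workload(records, price_in_per_mtok, price_out_per_mtok):
    """Sum spend per workload tag instead of reporting one aggregate number."""
    totals = defaultdict(float)
    for r in records:
        totals[r.workload] += (
            r.input_tokens / 1_000_000 * price_in_per_mtok
            + r.output_tokens / 1_000_000 * price_out_per_mtok
        )
    return dict(totals)


# Illustrative records and prices only.
records = [
    UsageRecord("monthly-statements", 2_000_000, 500_000),
    UsageRecord("support-chat", 1_000_000, 1_000_000),
    UsageRecord("monthly-statements", 1_000_000, 500_000),
]
report = cost_by_workload(records, price_in_per_mtok=2.0, price_out_per_mtok=10.0)
for workload, cost in sorted(report.items()):
    print(f"{workload}: ${cost:,.2f}")
```

The same grouping can be extended to product line or customer by adding more tags to each record. The point is that the tag has to exist at call time, because it cannot be reconstructed from an aggregate invoice.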

The FinOps Foundation’s data also reveals a structural gap that matters for CFOs: 78 percent of FinOps practices report to the CTO or CIO, and only 8 percent report to the CFO. The governance function closest to AI cost is sitting inside the technology organization rather than inside finance. For AI cost governance to function as a genuine financial discipline — with workload-level attribution, ROI linkage, and board-level reporting — it needs to be owned where financial accountability lives.

MIT’s NANDA initiative² finds that the 5 percent of organizations that reach production with material AI value capture share a consistent characteristic: they build tools that integrate deeply into existing workflows rather than standing up separate AI platforms. They empower line managers rather than central AI labs. They purchase integrated vendor solutions where appropriate rather than building bespoke systems. And they focus on back-office automation — the use cases where inputs are defined and outputs are measurable — rather than on customer-facing chatbots where ROI is diffuse and hard to attribute.

The Three Modes of AI Cost

Understanding the architectural answer begins with understanding that enterprise AI inference workloads fall into three fundamentally different cost modes, each with a different optimal delivery architecture.

On-demand interactive inference is the right model for genuinely unpredictable, personalized queries where the response depends on the specific context of each request and cannot be pre-computed. A customer support agent handling an unusual claim. A financial analyst exploring an unexpected data pattern. A compliance officer reviewing an exception case. These workloads are interactive by nature. Their costs are legitimately variable. Governance focuses on attribution and rate controls.

Agentic inference is the right model for complex, multi-step tasks requiring autonomous decision-making across variable inputs. Gartner documents³ that agentic systems consume 5 to 30 times more tokens per task than standard interactions. These workloads carry the highest per-task cost and the highest governance complexity. The CFO question for agentic workloads is twofold: which of these truly require autonomous multi-step execution, and which are running as agentic loops by default rather than by design?

Pre-computed structured delivery is the right model for the class of enterprise outputs that follow repeatable structures — financial reports, operational statements, payroll summaries, regulatory filings, executive briefings, customer-facing analytics, and compliance evidence packs. These outputs are not generated unpredictably. They follow defined templates, draw on known data sources, and produce outputs that will be read by specific recipients at predictable intervals. For these workloads, the on-demand inference model — generating a fresh response for each recipient each time — is an architecture choice, not a technical requirement. And it is the most expensive architecture choice available.
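The three modes can be read as a simple routing decision. The sketch below is one possible way to express it in Python. The `Workload` fields and the `classify` rule are illustrative assumptions drawn from the descriptions above, not a formal taxonomy, and real workloads will need more nuanced criteria.

```python
from dataclasses import dataclass
from enum import Enum


class CostMode(Enum):
    ON_DEMAND = "on-demand interactive inference"
    AGENTIC = "agentic inference"
    PRE_COMPUTED = "pre-computed structured delivery"


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload for routing purposes."""
    name: str
    repeatable_structure: bool  # fixed template, known sources, predictable schedule
    needs_autonomy: bool        # requires multi-step autonomous decisions


def classify(w: Workload) -> CostMode:
    """Route a workload to the cheapest mode its characteristics allow."""
    # Repeatable outputs that need no autonomy can be computed once per cycle.
    if w.repeatable_structure and not w.needs_autonomy:
        return CostMode.PRE_COMPUTED
    # Autonomy is the expensive case, so it is assigned only when required.
    if w.needs_autonomy:
        return CostMode.AGENTIC
    # Everything else is an unpredictable, context-specific query.
    return CostMode.ON_DEMAND


for w in [
    Workload("payroll summary", repeatable_structure=True, needs_autonomy=False),
    Workload("exception case review", repeatable_structure=False, needs_autonomy=False),
    Workload("multi-system reconciliation", repeatable_structure=False, needs_autonomy=True),
]:
    print(f"{w.name}: {classify(w).value}")
```

Writing the rule down, even this crudely, forces the question the agentic paragraph raises: a workload lands in the agentic mode only when someone has asserted that it needs autonomy.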

The Economics of Pre-Computation

The cost arithmetic of pre-computed delivery is compelling, and it is being validated in production environments. The FinOps Foundation’s own working group⁴ documented a 99 percent token reduction in a production workload by applying hashing and caching to a document-processing pipeline — only running the model when the underlying content actually changed. The principle is simple: if the intelligence required to produce a structured output can be computed once per production cycle, it should be computed once.
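The hash-and-cache principle can be sketched in a few lines of Python. This is an illustration of the general technique, not the working group's implementation: the class name, the in-memory dictionary, and the `fake_model` stand-in for a real model call are all assumptions made for the example.

```python
import hashlib


class ComputeOnceCache:
    """Run the model only when the input content has not been seen before."""

    def __init__(self, run_model):
        self._run_model = run_model  # callable str -> str, stand-in for a model call
        self._results = {}           # content hash -> cached output
        self.model_calls = 0         # counts how often the model actually ran

    def process(self, document_text: str) -> str:
        # Identical content always produces the same SHA-256 digest.
        key = hashlib.sha256(document_text.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.model_calls += 1
            self._results[key] = self._run_model(document_text)
        return self._results[key]


def fake_model(text: str) -> str:
    """Placeholder for an inference call (hypothetical)."""
    return f"summary of {len(text)} characters"


cache = ComputeOnceCache(fake_model)
cache.process("Q1 statement v1")
cache.process("Q1 statement v1")  # unchanged content: served from cache
cache.process("Q1 statement v2")  # changed content: model runs again
print(cache.model_calls)  # 2
```

A production version would persist the cache and include the prompt and model version in the hashed key, so that a prompt or model change also triggers recomputation.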

The three major AI providers now offer discount structures that make this architecture significantly more advantageous. Anthropic, OpenAI, and Google each provide a 50 percent batch processing discount for asynchronous workloads returned within 24 hours. Layered on top of prompt caching, which can cut cached input costs by up to 90 percent, the combined discount on the cached portion of structured, repeatable intelligence delivery approaches 95 percent of standard on-demand rates: paying half of one-tenth of the list price leaves 5 percent.
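The arithmetic behind the combined figure is worth spelling out. The Python sketch below assumes the batch and caching discounts stack multiplicatively, as the paragraph above implies; whether and how they stack should be confirmed against each provider's current terms. The `cached_share` parameter is a hypothetical input showing why the blended figure only approaches 95 percent.

```python
def combined_discount(batch_discount: float, cache_discount: float) -> float:
    """Stacked discounts: the buyer pays (1 - batch) * (1 - cache) of list price."""
    remaining = (1.0 - batch_discount) * (1.0 - cache_discount)
    return 1.0 - remaining


def blended_discount(cached_share: float, batch_discount: float,
                     cache_discount: float) -> float:
    """Effective discount when only part of the spend is cached input.

    cached_share is the fraction of on-demand spend attributable to cached
    input tokens (an assumption for illustration); the rest receives the
    batch discount only.
    """
    stacked = combined_discount(batch_discount, cache_discount)
    return cached_share * stacked + (1.0 - cached_share) * batch_discount


print(f"{combined_discount(0.50, 0.90):.0%}")       # 95%
print(f"{blended_discount(0.80, 0.50, 0.90):.0%}")  # 86%
```

The closer a workload is to a large shared context with a small variable portion, the closer its blended discount gets to the 95 percent ceiling.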

99%: token reduction documented in production via compute-once caching (FinOps Foundation working group case)

~95%: combined batch + caching discount available from all major providers (Anthropic, OpenAI, Google, April 2026)

For enterprises distributing hundreds of thousands of structured outputs monthly — account statements, compliance summaries, operational reports — the cost difference between on-demand generation and pre-computed delivery at this discount level is not marginal. At 500,000 monthly outputs using Claude Sonnet 4.6, the difference between on-demand and pre-computed delivery exceeds $1 million per month. At 2 million outputs — a scale common in financial services statement distribution — it exceeds $4 million per month.
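A quick back-of-envelope check shows what these figures imply per output. The Python sketch below uses only the numbers quoted above plus the roughly 95 percent discount; the token counts and prices behind the original estimate are not stated here, so they are not modeled. Because the text says the savings "exceed" these amounts, the results are lower bounds.

```python
def implied_saving_per_output(monthly_saving_usd: float, monthly_outputs: int) -> float:
    """Monthly saving divided evenly across the outputs delivered."""
    return monthly_saving_usd / monthly_outputs


def implied_on_demand_cost(saving_per_output: float, discount: float) -> float:
    """If pre-computation removes `discount` of the cost, the on-demand
    cost per output must be at least saving / discount."""
    return saving_per_output / discount


for saving, outputs in [(1_000_000, 500_000), (4_000_000, 2_000_000)]:
    per_output = implied_saving_per_output(saving, outputs)
    on_demand = implied_on_demand_cost(per_output, discount=0.95)
    print(f"{outputs:,} outputs: saving >= ${per_output:.2f} each, "
          f"on-demand cost >= ${on_demand:.2f} each")
```

Both scenarios imply the same saving of at least $2 per output, which is a useful threshold: a workload whose on-demand cost per output is well below that will see proportionally smaller absolute savings.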

Governance Built Into Delivery

The pre-computed delivery architecture does more than reduce inference cost. It enables a governance posture that on-demand inference makes structurally difficult. When intelligence is computed once and packaged for distribution, the following governance properties become implementable by design rather than by policy:

Data minimization at the delivery layer. Each packaged output contains only the data authorized for the specific recipient. The EU AI Act’s data minimization requirements and GDPR’s purpose limitation principle are enforceable at the artifact level rather than at the model level.

Audit trail completeness. A single computation event produces a definitive record. There is no prompt-injection surface, no variable output for the same input across sessions, and no hallucination risk that varies with each inference call. The output that was distributed is the output that was reviewed.

No-training compliance by architecture. Pre-computed delivery artifacts contain no live API calls at distribution time. There are no prompts being sent to third-party models when recipients interact with the output. The no-training commitment at the computation stage — which can be contractually bound in a DPA — is not extended or re-opened at the delivery stage.

SR 11-7 model inventory clarity. For financial institutions subject to the Federal Reserve's SR 11-7 model risk management guidance (adopted by the OCC as Bulletin 2011-12), pre-computed delivery creates a clean boundary for model validation. The computation stage uses a defined model with defined inputs and can be validated as a discrete model instance. The distribution stage does not invoke a model. The model inventory is bounded and auditable.
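Two of these properties, data minimization and audit trail completeness, can be illustrated with a short Python sketch. It assumes the computed result is a flat dictionary and that an allow-list of fields exists per recipient; the function names, field names, and artifact layout are hypothetical, and no model is called at packaging or verification time.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(payload: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def package_artifact(computed: dict, authorized_fields: set, recipient_id: str) -> dict:
    """Build a per-recipient artifact containing only authorized fields."""
    payload = {k: v for k, v in computed.items() if k in authorized_fields}
    return {
        "recipient_id": recipient_id,
        "payload": payload,
        "sha256": _digest(payload),  # fingerprint recorded for the audit trail
        "packaged_at": datetime.now(timezone.utc).isoformat(),
    }


def verify_artifact(artifact: dict) -> bool:
    """Check that the distributed payload still matches the recorded digest."""
    return _digest(artifact["payload"]) == artifact["sha256"]


# Illustrative data only.
computed = {"account": "A-100", "balance": 1250.0, "risk_score": 0.42}
artifact = package_artifact(computed, {"account", "balance"}, recipient_id="cust-7")
print(sorted(artifact["payload"]))  # ['account', 'balance']
print(verify_artifact(artifact))    # True
```

Because the digest is computed once at packaging time, any later check compares the delivered artifact against the reviewed one without re-running inference.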

The Organizational Decision Underneath All the Decisions

The six previous posts in this series have each pointed toward a decision the CFO needs to make. This final post points toward the organizational decision underneath all of them: AI cost governance needs to be owned by finance, not by technology.

The FinOps Foundation’s finding that only 8 percent of FinOps practices report to the CFO is not an indictment of FinOps practitioners. It is a reflection of how AI and cloud cost management have been categorized. They have been treated as infrastructure concerns — the domain of the CTO or CIO — rather than as financial resource allocation questions. The $37 billion in enterprise AI spending documented by Menlo Ventures for 2025, growing at 7x over two years, is not an infrastructure budget. It is a financial commitment of the scale that belongs in the CFO’s remit, governed with the same discipline as any other material cost category.

The organizations that will contain AI inference cost growth are the ones that treat it as a financial governance question from the start — with workload-level attribution, ROI linkage built into the cost model, delivery architectures that eliminate unnecessary inference events, and contractual protections aligned to the regulatory calendar. The organizations that treat it as a technology management question will find that the bill arrives faster than the governance catches up.

The competitive advantage in enterprise AI will not belong to the organizations that adopted it earliest. It will belong to the organizations that built the financial architecture to govern it most effectively.

Six Decisions: A Summary

Decision	Why It Cannot Wait
Govern at the workload level	Aggregate AI spend tells you almost nothing useful. You need cost per workload, per product line, and per customer.
Audit your agentic exposure	Gartner finds agentic AI consumes 5–30x more tokens per task. Identify every agentic workflow and quantify the multiplier before the next contract renewal.
Renegotiate contracts now	The EU Data Act already limits switching fees and will prohibit them entirely from January 2027. The negotiating leverage will not be higher than it is now.
Formalize no-training clauses as DPA commitments	Policy statements are not legally binding DPA commitments. In financial services, SR 11-7 model inventory obligations apply to LLM API dependencies. Contractual clarity is not optional.
Separate inference from delivery	Structured, repeatable outputs do not need to be generated on demand per recipient. Pre-computed delivery eliminates the token multiplier at the distribution layer, with up to 95% cost reduction on eligible workloads.
Build ROI linkage before the board demands it	MIT NANDA finds 95% of AI pilots produce no measurable P&L impact. Only half of organizations can confidently evaluate AI ROI. Build the attribution model now.

The AI inference cost crisis is not a technology problem that will be solved by better models or lower token prices. It is a financial governance problem. And financial governance problems are solved by the CFO.

Start the conversation.

GUUT helps enterprise organizations govern AI inference spend at the delivery layer — eliminating the token multiplier for structured, repeatable intelligence outputs.

Eric Ford | Chief Data and Analytics Officer | GUUT

eric.ford@guutit.com
guutit.com

Sources & Citations

¹ FinOps Foundation, “State of FinOps 2026,” February 2026. 98% manage AI spend; only 8% report to CFO. FinOps for AI is the #1 priority. https://data.finops.org/
² MIT NANDA, “The GenAI Divide,” July 2025. 95% of corporate GenAI pilots deliver no measurable P&L impact. https://virtualizationreview.com/articles/2025/08/19/mit-report-finds-most-ai-business-investments-fail-reveals-genai-divide.aspx
³ Gartner, “By 2030, LLM Inference Will Cost 90% Less,” March 25, 2026. Agentic models require 5–30x more tokens per task. https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025
⁴ FinOps Foundation working group case: 99% token reduction via compute-once caching in production workload. https://data.finops.org/

