
Default to Local: Why AI Should Run on Edge by Design, Not as an Afterthought

Oct 19, 2025

Hello Leader/Manager/Strategist!

If you’re shaping your company’s AI strategy, you need to understand how AI actually works — not at the level of math or architecture, but at the level of experience, economics, and infrastructure.

Large Language Models (LLMs) have changed what software can do, but they’ve also changed the cost structure behind every interaction. Unlike traditional software or small on-device models, cloud-hosted LLMs incur a cost, a delay, and an energy footprint every single time a user asks a question. When millions — or billions — of such interactions occur, those “invisible” or “negligible” costs scale into millions of dollars and thousands of megawatt-hours.

For you as a leader, this distinction is strategic. Choosing between a cloud LLM-based assistant as the default and a local, edge-optimized model that runs on user devices is a decision that shapes your user experience (latency, responsiveness, privacy), your operational flexibility (dependency on LLM vendors), and your financial and environmental footprint.

This article walks through that trade-off step by step. Using the automotive sector as an example, it quantifies how a single architectural choice — whether every interaction goes through a remote LLM or runs locally — can determine whether your AI strategy costs a few hundred thousand dollars a year or several billion.

In short: understanding where your “intelligence” runs is a board-level decision.

Thesis: For the majority of everyday LLM interactions (short prompts, quick answers, voice assistants, UI copilots), the default should be local/edge models. They deliver lower latency and stronger privacy — and, at scale, dramatically lower energy use and cost. Use cloud LLMs selectively for the minority of “heavy” tasks that truly need them.

Below I (1) state transparent assumptions, (2) walk through the math using in-car assistant interaction as an illustrative market, and (3) generalize the conclusion to LLM services overall. Every figure that isn’t a simple arithmetic product is referenced, and I have tried to flag clearly wherever a figure is an assumption or merely an example metric.

Photo by Nahrizul Kadri on Unsplash

1) Assumptions

This thought experiment uses the U.S. automotive market as a concrete example. In 2022, Americans made roughly 227 billion driving trips. Of course, today only a small fraction of those trips involve an AI assistant, let alone a fully conversational one powered by large language models. But that’s the direction the industry is heading. Automakers are rapidly integrating voice-driven copilots for navigation, information, and entertainment. There is, in my first-hand experience as a practitioner, a naive hope that all AI assistant interactions will be magically “solved” by the LLM paradigm. And similar assistants are being built into phones, operating systems, productivity suites, and customer service channels. As these conversational interfaces become ubiquitous, the difference between running them in the cloud versus locally will scale linearly with usage, turning what seems like a technical deployment choice into a massive economic and environmental variable.

The point isn’t that every U.S. trip today involves an LLM — it’s that if we reach that level of adoption, the cumulative cost and energy of “default-to-cloud” architectures become staggering. This example shows what happens when a small inefficiency per interaction is multiplied across hundreds of billions of events per year — and why it’s time to rethink where AI should live.

Another important caveat: the numbers below do not include training. You can include that if you want, but here I am focusing on operational (inference) cost and energy. Retraining cadence would also affect these numbers: if OpenAI releases a new model every quarter, that training energy would have to be multiplied by four and added to your annual scorecard if you want to account for it.

Workload size (automotive example, U.S.)

  • U.S. drivers made ~227 billion driving trips in 2022 (AAA American Driving Survey).
  • Voice interactions per trip (assumption):
    Low = 2 (quick requests),
    Base = 5 (typical mix: nav + follow-ups + media),
    High = 10 (more chatty/assistive).
    Note: “high” is what the industry is aiming for, in terms of branding and UX.
  • I analyze 2 / 5 / 10 interactions per trip (low/base/high). This is a scenario knob, not a measured stat.

Per-interaction energy (cloud vs. edge)

  • Cloud LLM (modern, efficient): ~0.3 Wh per chat (typical GPT-4o-class query), based on an updated synthesis showing ~0.3 Wh rather than the older ~3 Wh estimate, even though some data indicate that GPT-5 is roughly 8× more energy-hungry than previous versions. We’ll stay with the lower data point to avoid exaggerating the claims and calculations here.

Sanity check on the “0.3 Wh/chat” cloud figure: Epoch’s 2025 synthesis converges around ~0.3 Wh per GPT-4o-class query, roughly 10× below the older “3 Wh” rule. This is consistent with the direction of H100 perf/W gains and energy-per-token research. But it also says something about the difficulty of these estimates: a 10× spread between credible figures shows how hard this area is to pin down.

  • Add data-center overhead via Power Usage Effectiveness (PUE). Industry averages cluster ~1.5–1.6 (Uptime Institute), while Google averages ~1.09–1.12; I’ll use 1.4 as a reasonable blended assumption.
  • So cloud (modern) all-in ≈ 0.3 Wh × 1.4 = 0.42 Wh per interaction. (Remember, this is a low-end, generous assumption; the real per-interaction Wh is likely higher in practice.)
  • Cloud LLM (conservative/older): If we assume ~3 Wh per chat compute energy (older dense models, lower utilization), then all-in ≈ 4.2 Wh with PUE 1.4. (Again, this “~3 Wh” figure is the older rule-of-thumb now considered by some a 10× overestimate for modern stacks, but we include it to show a more complete view.)
  • Edge/local: On-device LLM inference on NPUs has been measured to yield ~30× energy savings vs. strong baselines and large speedups, confirming that per-turn energy can be in the single- to low-tens of joules for short interactions. I use 10 J (0.00278 Wh) per interaction as a conservative midpoint for a speech+small-LLM turn, based on peer-reviewed evidence of large energy reductions with NPUs and real on-device LLM measurements.
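
To make these per-interaction figures concrete, here is a minimal sketch (in Python) of the all-in energy arithmetic; the constants are the assumptions stated in this section, not measurements.

```python
# All-in energy per interaction (Wh), using the assumptions stated above.
PUE = 1.4                 # blended data-center overhead assumption
CLOUD_MODERN_WH = 0.3     # ~GPT-4o-class query (Epoch 2025 synthesis)
CLOUD_OLDER_WH = 3.0      # older rule-of-thumb estimate
EDGE_JOULES = 10.0        # assumed on-device speech + small-LLM turn

cloud_modern = CLOUD_MODERN_WH * PUE   # 0.42 Wh
cloud_older = CLOUD_OLDER_WH * PUE     # 4.2 Wh
edge = EDGE_JOULES / 3600.0            # 1 Wh = 3,600 J -> ~0.00278 Wh

print(f"cloud (modern):   {cloud_modern:.2f} Wh")
print(f"cloud (older):    {cloud_older:.2f} Wh")
print(f"edge (on-device): {edge:.5f} Wh")
```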

Carbon intensity

Carbon intensity refers to how much CO₂ is emitted for each kWh produced. Accurate data is difficult to find, but the U.S. context is the most accessible for comparison, so use it as an indicative example; EU and Asian grids, for example, will likely differ.

  • U.S. grid carbon intensity: in 2023, there were “about 0.81 pounds of CO₂ emissions per kWh”. 0.81 lb CO₂/kWh ≈ 368 g CO₂/kWh (latest EIA FAQ).

Cloud token prices (illustrative, 2024–25):

  • OpenAI (flagship tier costs): $3/M input and $12/M output tokens for GPT-4.1; smaller “mini”/“nano” tiers much cheaper. Again, for the most recent models, prices are even higher: GPT-5 Pro costs $15.00 / 1M input tokens and $120.00 / 1M output tokens (5× the GPT-4.1 input price and 10× the output price).
  • There are many different price points. APIs for e.g. DeepSeek can go as low as $0.27/M in, $0.40/M out (DeepInfra).
  • Tokens per interaction: Estimating ~400 tokens (e.g., ~200 in + ~200 out) per interaction. For your particular domain and context, this needs to be tweaked to match your UX. (Scaling is linear, as I understand it; see the sketch below.)
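
As a quick illustration of per-interaction cost under these assumptions (the annual scaling comes in the next section), here is a sketch; the price points and 200-in/200-out split are the example figures above, not vendor-verified quotes.

```python
# Cost per interaction = in-tokens * input rate + out-tokens * output rate.
TOKENS_IN, TOKENS_OUT = 200, 200   # assumed split of the ~400-token interaction

def cost_per_interaction(usd_per_m_in: float, usd_per_m_out: float) -> float:
    return (TOKENS_IN * usd_per_m_in + TOKENS_OUT * usd_per_m_out) / 1e6

print(f"budget API ($0.27/M in, $0.40/M out): ${cost_per_interaction(0.27, 0.40):.6f}")
print(f"flagship   ($3/M in,  $12/M out):     ${cost_per_interaction(3.0, 12.0):.6f}")
# ~$0.000134 vs. ~$0.003 per interaction; both scale linearly with token count.
```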

2) Step-by-step arithmetic (automotive case study, U.S.)

Ok, so this example covers all trips in the U.S., as if they all had voice assistants. Clearly, this is not the case (yet), but given the development and targets of automotive brands, there is a push to position AI assistant voice interaction at the center of the brand experience, especially as a differentiator when not only the driver but all passengers are encouraged to interact with various AI assistants in the car. The development I have seen first-hand also points to an LLM-based solution space.

Utterances/year = trips × interactions/trip

  • Low: 227 billion (B) × 2 = 454 B utterances
  • Base: 227 B × 5 = 1.135 trillion (T) utterances
  • High: 227 B × 10 = 2.27 T utterances

Energy per interaction (all-in):

  • Cloud (modern): 0.42 Wh (0.3 Wh × PUE 1.4).
  • Cloud (old estimate): 4.20 Wh.
  • Edge: 0.00278 Wh.

Annual energy = utterances × Wh/interaction

  • Low (454 B): Cloud modern 190.7 GWh; Cloud cons. 1,906.8 GWh; Edge 1.26 GWh.
  • Base (1.135 T): Cloud modern 476.7 GWh; Cloud cons. 4,767 GWh; Edge 3.16 GWh.
  • High (2.27 T): Cloud modern 953.4 GWh; Cloud cons. 9,534 GWh; Edge 6.31 GWh.
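
If you want to reproduce the table above or adapt it to your own market, a minimal sketch of the scenario arithmetic looks like this (all constants are the assumptions stated earlier):

```python
# Annual energy = trips * interactions per trip * Wh per interaction.
TRIPS = 227e9                                   # U.S. driving trips, 2022 (AAA)
SCENARIOS = {"low": 2, "base": 5, "high": 10}   # interactions per trip (assumption)
WH_PER_INTERACTION = {                          # all-in figures from above
    "cloud modern": 0.42,
    "cloud conservative": 4.20,
    "edge": 0.00278,
}

for name, per_trip in SCENARIOS.items():
    utterances = TRIPS * per_trip
    print(f"{name}: {utterances / 1e9:,.0f} B utterances/yr")
    for arch, wh in WH_PER_INTERACTION.items():
        gwh = utterances * wh / 1e9             # Wh -> GWh
        print(f"  {arch}: {gwh:,.1f} GWh/yr")
```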

Annual CO₂ = kWh × 0.368 kg/kWh (EIA)

  • Base: Cloud modern ~175 kt, Cloud conservative ~1,754 kt, Edge ~1.16 kt.
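
Converting the base-case energy figures into CO₂ with the EIA grid-intensity assumption is one more line of arithmetic:

```python
# Annual CO2 = energy (kWh) * grid intensity (kg CO2 per kWh), reported in kilotonnes.
CO2_KG_PER_KWH = 0.368   # EIA 2023: ~0.81 lb/kWh * 0.4536 kg/lb

BASE_CASE_GWH = {"cloud modern": 476.7, "cloud conservative": 4767.0, "edge": 3.16}
for arch, gwh in BASE_CASE_GWH.items():
    kt_co2 = gwh * 1e6 * CO2_KG_PER_KWH / 1e6   # GWh -> kWh -> kg -> kt (1 kt = 1e6 kg)
    print(f"{arch}: ~{kt_co2:,.2f} kt CO2/yr")
```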

Annual service cost (cloud tokens)

Tokens/year = utterances × 400.

Base: 1.135 T × 400 = 454 T tokens.

  • At $0.27/M in + $0.40/M out (lowest rate I could find), with 227 T input and 227 T output tokens: ≈ $152 M/yr.
  • At $3/M in + $12/M out (flagship rate): ≈ $3.4 B/yr.

(Edge electricity is negligible by comparison: 3.16 GWh × typical $0.12/kWh ≈ $0.38 M/yr.)
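
The annual token-cost arithmetic for the base case, under the 200-in/200-out assumption and the two illustrative price points, can be sketched as follows; swap in your own rates and token mix:

```python
# Annual cloud token cost for the base case (1.135 T utterances, 200 in + 200 out each).
UTTERANCES = 1.135e12
TOKENS_IN, TOKENS_OUT = 200, 200

def annual_cost(usd_per_m_in: float, usd_per_m_out: float) -> float:
    in_tokens = UTTERANCES * TOKENS_IN     # 227 T input tokens
    out_tokens = UTTERANCES * TOKENS_OUT   # 227 T output tokens
    return (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1e6

print(f"budget API:   ${annual_cost(0.27, 0.40) / 1e6:,.0f} M/yr")  # ~$152 M
print(f"flagship API: ${annual_cost(3.0, 12.0) / 1e9:,.1f} B/yr")   # ~$3.4 B

# Edge electricity, for comparison: 3.16 GWh at a typical $0.12/kWh.
print(f"edge power:   ${3.16e6 * 0.12 / 1e6:,.2f} M/yr")            # ~$0.38 M
```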

3) What this implies beyond automotive

Most consumer and enterprise LLM usage resembles “short-turn” interactions: autocomplete, UI help, code hints, voice commands, and brief Q&A. For these, edge NPUs are 1–2 orders of magnitude more energy-efficient per turn than sending every request to a large cloud model — before counting network latency and recurring token fees.

Caveats and nuance:

  • Carbon depends on energy source. If the device is charged on a carbon-heavy grid and the cloud is run on a low-carbon grid with excellent PUE, edge’s energy advantage may not translate one-for-one to CO₂. (Location and PUE matter; carbon can vary ~5–10× by siting.) Still, the absolute kWh saved locally is typically so large that edge remains favored.
  • Not all tasks fit on-device. Long contexts, heavy reasoning (tool-use, multi-step planning), or high multimodal bandwidth can exceed local memory/compute. Those should go to the cloud.
  • Prices evolve fast. Provider pricing is changing quarter-to-quarter (see OpenAI 4o-mini, Azure/OpenAI pages, and third-party Llama hosts). Always recompute with your token mix. (Bear in mind, though, that financial cost is not necessarily correlated with environmental cost.)
  • Ecological footprint is much more than CO₂ emissions. It should also cover water usage, energy entropy, land usage for data centers, rare mineral mining, and so on.

4) Policy and architecture recommendation

Default to local.

  • Ship on-device ASR + NLU + a compact generative model for the common path. Most queries could arguably be solved with non-LLM functionality; escalating to more resource-heavy computation should be carefully considered.
  • Add policy-based routing to the cloud only when prompts exceed local capacity (e.g., long context, specialized tools, or strict quality gates); a minimal routing sketch is shown below.
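
To illustrate what such policy-based routing can look like, here is a minimal sketch; the threshold and request attributes are hypothetical placeholders, not a production policy.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    needs_tools: bool = False         # e.g., multi-step planning, external APIs
    needs_high_quality: bool = False  # strict quality gate for sensitive tasks

# Hypothetical threshold; tune against your on-device model's actual limits.
LOCAL_CONTEXT_LIMIT = 2_000

def route(req: Request) -> str:
    """Default to the on-device model; escalate to the cloud only when needed."""
    if req.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"   # long context exceeds local memory/compute
    if req.needs_tools or req.needs_high_quality:
        return "cloud"   # specialized tools or strict quality gates
    return "local"       # the common path: short, everyday interactions

print(route(Request(prompt_tokens=120)))                      # -> local
print(route(Request(prompt_tokens=6_000, needs_tools=True)))  # -> cloud
```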

Conclusion and why this default is rational:

  • In our U.S. automotive example, moving the default path from “all-cloud” to “all-edge” would change the base-case footprint from ~477–4,767 GWh/yr down to ~3 GWh/yr, and from roughly $0.15–3.4 B/yr in token fees to under $1 M/yr in electricity, while simultaneously removing latency lag and connectivity dependence from the user experience.

5) Conclusion

For LLM-based services in general, the math and the user experience both point the same way: run locally by default, then escalate to cloud only when necessary. That architecture minimizes latency, strengthens privacy, and, when multiplied over billions to trillions of short interactions, avoids hundreds to thousands of GWh and hundreds of millions to billions of dollars each year. The references above reflect the current (2024–2025) state of practice on query energy, PUE, grid carbon, and token pricing; if any of those change, the framework here makes it straightforward to recalculate with your own workloads and markets.

Did you find any flaws in the calculations? Please don’t hesitate to contact me.


Written by Pontus Wärnestål

Designer at Ambition Group. Deputy Professor (PhD) at Halmstad University (Sweden). Author of "Designing AI-Powered Services". I ride my bike to work.
