GPU Utilization Is Breaking: AI Infrastructure Needs a New Definition of Efficiency

From GPU utilization to productive GPU-hours.

Every GPU should not just be used. It should create value.

We have all been chasing GPU utilization

For the past few years, whether it is Kubernetes GPU scheduling, vGPU, MIG, or HAMi, everyone has really been doing the same thing: pushing one number up.

GPU Utilization

It makes sense. GPUs are expensive. An H100 can run anywhere from a few dollars to over ten dollars per GPU-hour, and nobody can afford to let a GPU sit idle. So the entire AI Infra community’s narrative for the past few years has boiled down to one line:

Maximize GPU utilization.

I have been involved in the HAMi community for a long time, and a few posts I have written, like Kubernetes as the GPU Control Plane for AI and From GPU to Token: An Eight-Layer Observability Stack for AI Infrastructure, are really about the same thing: how to slice GPUs finer, share them more thoroughly, and schedule them more sensibly.

But recently I read Arjun Kaarat’s piece in Towards Data Science, When GPU Utilization Lies: The Hidden Systems Problem Slowing Modern AI, and it made me rethink a question:

Is GPU utilization really the metric we should be optimizing for?

There is one line in the article that made me pause for a few seconds the first time I read it:

“GPUs can be busy without being productive.”

Arjun Kaarat

It punctures a layer of illusion. The number we have spent so much effort pushing up may have been answering the wrong question from the start.

GPU Busy is not GPU Productive

Kaarat tells a representative story in the article.

At 2 AM, an infrastructure team gets paged: inference latency just spiked 60%. They open the monitoring dashboard, and GPU utilization looks perfectly normal:

GPU: 79%
GPU: 82%
GPU: 84%

Looks healthy. So the usual playbook kicks in: trigger autoscaling, add nodes, add GPUs. The cloud bill climbs, but latency barely improves.

An hour later, they find the root cause: three nodes had quietly entered a RAID rebuild state, storage throughput was severely dragged down, and the inference tasks around them were starving. The scheduler kept treating these nodes as “still healthy enough” because the GPU and memory metrics looked fine, but the underlying disk performance had collapsed.

What strikes me most about this story is that it is not a rare edge case. It is a failure mode that is becoming common.

Many teams look at their monitoring and see:

GPU: 82%
GPU: 84%
GPU: 79%

and think “our cluster is busy and healthy.” But at the same time:

Latency ↑
Queue ↑
Throughput ↓
Cost ↑

The real problem is that the GPU is waiting.

Waiting for what?

  • The retrieval pipeline to deliver embeddings;
  • The SSD to read context out of storage;
  • The CPU to prepare the data pipeline;
  • The KV cache to have room for new requests;
  • Storage I/O to not be squeezed out by background tasks.

The GPU end is full, but the data path feeding it is empty, blocked, or collapsing. On the dashboard the GPU reads 84%, but its useful output may be less than half of what it should be. Kaarat describes this state precisely in the original piece:

A GPU that appears active may still spend meaningful time waiting for the system around it.

That is the overlooked gap between “busy” and “productive”.
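To make the gap concrete, here is a minimal sketch of how a monitoring check might compare reported utilization with delivered output. The `Sample` fields, the thresholds, and the throughput figures are all assumptions of mine for illustration; only the utilization readings come from the story above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring sample for a single GPU (field names are illustrative)."""
    gpu_util: float        # utilization as shown on the dashboard, 0.0 to 1.0
    tokens_per_sec: float  # useful output delivered during the same interval


def busy_but_unproductive(samples, baseline_tokens_per_sec,
                          busy_threshold=0.7, output_threshold=0.5):
    """Return True when the GPU looks busy but delivers little useful output.

    Both thresholds are assumed values, not recommendations.
    """
    if not samples:
        return False
    avg_util = sum(s.gpu_util for s in samples) / len(samples)
    avg_output = sum(s.tokens_per_sec for s in samples) / len(samples)
    output_ratio = avg_output / baseline_tokens_per_sec
    return avg_util >= busy_threshold and output_ratio < output_threshold


# Utilization matches the dashboard readings in the story (79%, 82%, 84%).
# The throughput numbers are made up to represent a starved data path.
window = [Sample(0.79, 410.0), Sample(0.82, 380.0), Sample(0.84, 395.0)]
print(busy_but_unproductive(window, baseline_tokens_per_sec=1000.0))
```

The point of the sketch is only that utilization has to be read next to an output metric. Which output metric fits (tokens per second, requests per second, samples per second) depends on the workload.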

The “illusion” of GPU utilization

The RAID story above is still just a “point failure”. The more compelling part of Kaarat’s article describes a systemic phenomenon: Fragmentation.

Consider a cluster with three nodes, after running a mixed wave of GenAI workloads:

Node | GPU compute | HBM         | Storage bandwidth | I/O CPU
A    | available   | nearly full | available         | available
B    | available   | available   | saturated         | available
C    | limited     | available   | available         | saturated
Table 1: Residual resources on three nodes after a GenAI wave

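Now a new inference job arrives, with an ordinary footprint: a little GPU compute, a little HBM (VRAM), decent storage bandwidth, decent I/O CPU capacity.

In total, the cluster still has plenty of resources. A has GPU compute, storage bandwidth, and I/O CPU, but almost no HBM. B has GPU compute, HBM, and I/O CPU, but no storage bandwidth. C has HBM and storage bandwidth, but limited GPU compute and no I/O CPU. No single node can take this job on its own.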

That is fragmentation. I drew it out, roughly like this:

Figure 1: The cluster is not short on resources. It is short on resources of the right shape.

The cluster is not empty. It has just been carved into “leftovers” that can no longer be used productively. Kaarat sums up the phenomenon in one line, which I think is the most memorable sentence in the whole piece:

The cluster is not empty. It is fragmented into leftovers that are difficult to use productively.

Put another way:

The cluster does not lack resources. It lacks resources of the right shape.
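The three-node example can be checked mechanically. Below is a small sketch that compares an aggregate capacity check with a per-node fit check. The numeric fractions are invented to match the qualitative labels in Table 1, and the job footprint is likewise an assumption.

```python
# Residual capacity per node, as a fraction of each resource (0.0 = none left).
# Values are invented to mirror the labels in Table 1, not measured data.
RESIDUAL = {
    "A": {"gpu": 0.60, "hbm": 0.05, "storage_bw": 0.60, "io_cpu": 0.60},
    "B": {"gpu": 0.60, "hbm": 0.60, "storage_bw": 0.00, "io_cpu": 0.60},
    "C": {"gpu": 0.15, "hbm": 0.60, "storage_bw": 0.60, "io_cpu": 0.00},
}

# An "ordinary" inference job (assumed footprint).
JOB = {"gpu": 0.20, "hbm": 0.20, "storage_bw": 0.30, "io_cpu": 0.30}


def fits(node, job):
    """A job fits a node only if every dimension has enough headroom."""
    return all(node[dim] >= need for dim, need in job.items())


def feasible_nodes(residual, job):
    """Names of nodes that can host the job on their own."""
    return [name for name, node in residual.items() if fits(node, job)]


def fits_in_aggregate(residual, job):
    """The misleading check: sum each dimension across the whole cluster."""
    return all(
        sum(node[dim] for node in residual.values()) >= need
        for dim, need in job.items()
    )


print("fits in aggregate:", fits_in_aggregate(RESIDUAL, JOB))
print("feasible nodes:   ", feasible_nodes(RESIDUAL, JOB))
```

The aggregate check passes while the per-node list comes back empty, which is the fragmentation case in miniature: capacity exists, but not in a shape any one node can offer.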

This observation matters a lot for a project like HAMi, which builds a GPU resource control plane. We used to think that “slicing the card finely and letting more people share it” was the answer to fragmentation. But Kaarat points to a deeper problem: fragmentation is not just a GPU-layer concern. It spans GPU, HBM, storage bandwidth, and I/O CPU. You can slice the GPU as finely as you like, but if the storage dimension is choked, that node is still unavailable for the next genuinely useful task.

What HAMi solves

Let me directly answer a question: what role does HAMi play in this chain?

HAMi solves a very specific, and very foundational, problem:

Can the GPU be used by more people?

What it does can be summarized in one line: reduce fragmentation at the GPU layer. The concrete forms include:

  • GPU Sharing: letting multiple Pods share one card instead of one card per Pod;
  • vGPU / HAMi-core: doing memory isolation and compute throttling in userspace, slicing one card into virtual devices sized at MB-level granularity;
  • MIG integration: managing NVIDIA MIG hardware partitions in software;
  • Heterogeneous GPU abstraction: abstracting more than a dozen device families, including NVIDIA, Ascend, Cambricon, Hygon, and Vastai, into semantics the scheduler can consume uniformly;
  • DRA compatibility: keeping pace with the evolution of the Kubernetes resource model.
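To illustrate the sharing idea in the first two items, here is a toy model of one card being carved into slices by memory (in MB) and compute share (in percent). This is not HAMi's implementation or API: the `SharedGPU` class, its method, and the card size are hypothetical names and values I made up for the sketch.

```python
class SharedGPU:
    """Toy model of one physical card shared by several pods."""

    def __init__(self, mem_mb, core_pct=100):
        self.free_mem_mb = mem_mb      # unallocated device memory, in MB
        self.free_core_pct = core_pct  # unallocated compute share, in percent
        self.allocations = {}          # pod name -> (mem_mb, core_pct)

    def allocate(self, pod, mem_mb, core_pct):
        """Reserve a slice for a pod; return False if it does not fit."""
        if pod in self.allocations:
            raise ValueError(f"{pod} already has a slice on this card")
        if mem_mb > self.free_mem_mb or core_pct > self.free_core_pct:
            return False
        self.free_mem_mb -= mem_mb
        self.free_core_pct -= core_pct
        self.allocations[pod] = (mem_mb, core_pct)
        return True


card = SharedGPU(mem_mb=81920)  # assumed card size for the example
print(card.allocate("inference-a", mem_mb=20000, core_pct=30))
print(card.allocate("inference-b", mem_mb=40000, core_pct=50))
print(card.allocate("training-c", mem_mb=30000, core_pct=20))  # not enough memory left
```

Under a one-card-per-Pod model, the first request alone would have claimed the whole device. Here two workloads share it and the third is rejected only because memory actually ran out.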

This is the first layer of efficiency. It answers the question “can the GPU be put to use”.

At this layer, HAMi’s value is already clear: in real environments with mixed domestic and foreign GPUs, mixed training and inference, and multi-tenant sharing, HAMi lets one card serve more workloads, so the cluster is no longer wasted by a coarse-grained “one card per Pod” model.

But note: this is only the first chapter of the efficiency story.

What comes after HAMi

If I step back and look at it from a higher vantage point, “improving GPU utilization” is actually solved across three distinct layers.

Layer one: can the GPU be sliced, shared, and allocated?

This is what a project like HAMi solves. It corresponds to the GPU resource control plane.

Layer two: who gets to use the GPU? Who runs first, who queues, who has priority?

This is what schedulers like Volcano, Kueue, and KAI Scheduler solve. It corresponds to job queuing, fair share, priority, and gang scheduling.

Layer three: can the GPU actually run? Are the data path, storage I/O, and KV cache keeping up?

This is exactly where Kaarat’s article sounds the alarm. When a node’s RAID is rebuilding, its SSD queue is exploding, and its I/O CPU is eaten by background tasks, no matter how much GPU you allocate to it and no matter how elegantly the scheduler queues its tasks, it still cannot produce effective compute.
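A layer-three check could look something like the sketch below: a filter that refuses to place work on a node whose data path is degraded, regardless of how healthy the GPU metrics look. The signal names and thresholds are assumptions for illustration and do not correspond to any existing scheduler plugin.

```python
from dataclasses import dataclass


@dataclass
class NodeSignals:
    """Per-node health signals (hypothetical fields for this sketch)."""
    name: str
    gpu_util: float        # looks fine on the dashboard
    raid_rebuilding: bool  # storage array is in a rebuild state
    ssd_queue_depth: int   # outstanding I/O requests on the local SSD
    io_cpu_util: float     # CPU share consumed by I/O handling, 0.0 to 1.0


def can_feed_gpu(node, max_queue_depth=64, max_io_cpu_util=0.85):
    """Return True if the node's data path looks able to keep the GPU fed.

    GPU utilization is deliberately ignored: it cannot reveal a starved
    data path. The two limits are assumed values.
    """
    if node.raid_rebuilding:
        return False
    if node.ssd_queue_depth > max_queue_depth:
        return False
    return node.io_cpu_util <= max_io_cpu_util


def schedulable(nodes):
    """Names of nodes that pass the data-path filter."""
    return [n.name for n in nodes if can_feed_gpu(n)]


nodes = [
    NodeSignals("node-1", gpu_util=0.82, raid_rebuilding=True, ssd_queue_depth=12, io_cpu_util=0.40),
    NodeSignals("node-2", gpu_util=0.79, raid_rebuilding=False, ssd_queue_depth=300, io_cpu_util=0.50),
    NodeSignals("node-3", gpu_util=0.84, raid_rebuilding=False, ssd_queue_depth=8, io_cpu_util=0.35),
]
print(schedulable(nodes))
```

All three nodes report similar GPU utilization, yet only one of them passes. That is the signal the scheduler in the RAID story never saw.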

Draw these three layers together, and you get what next-generation AI infrastructure should actually look like:

Figure 2: From GPU utilization to productive GPU-hours

I think this diagram is the single most important one for understanding the whole picture.

For the past few years, almost all of the industry’s attention has been on the bottom two layers: Kubernetes and HAMi. Those two layers have essentially solved “can the GPU be put to use”. Volcano, Kueue, and KAI are also mature at layer two, solving queuing and priority.

But layer three, storage / I/O-aware scheduling, is still almost blank. And that is precisely the layer Kaarat’s article keeps emphasizing, the one that is becoming increasingly valuable in modern GenAI systems, because for workloads like RAG, long context, and multimodal, the bottleneck has long since shifted from “are there enough GPUs” to “is the data path feeding the GPU clear”.

To put it bluntly: HAMi slices the GPU finely, and Volcano queues the jobs well, but if the node a task lands on cannot feed its GPU, then all that upstream effort only delivers work to an endpoint that sits idle.

What should next-gen AI infrastructure optimize for

Based on the layering above, I want to make one clear point: we should upgrade the optimization target from “GPU utilization” to “Productive GPU-Hours”.

This is not wordplay. It is a shift that translates directly into money.

Kaarat does the math in the article. A 1000-H100 cluster, at a blended cost of about $3 per GPU-hour, runs around $26 million a year. If fragmentation and I/O stall quietly waste 10% of the effective GPU time, that is roughly $2.6 million a year of wasted spend. Not because the GPUs are missing, but because the system failed to use them efficiently.
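The arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (1000 GPUs, $3 per GPU-hour, 10% waste) and assumes the cluster is billed for every hour of a 365-day year. The function names are mine.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming a non-leap year


def annual_gpu_cost(num_gpus, cost_per_gpu_hour):
    """Total yearly spend if every GPU is paid for around the clock."""
    return num_gpus * cost_per_gpu_hour * HOURS_PER_YEAR


def wasted_spend(num_gpus, cost_per_gpu_hour, wasted_fraction):
    """Yearly spend on GPU time that produced nothing useful."""
    return annual_gpu_cost(num_gpus, cost_per_gpu_hour) * wasted_fraction


def productive_gpu_hours(num_gpus, wasted_fraction):
    """GPU-hours per year that remain after subtracting the wasted share."""
    return num_gpus * HOURS_PER_YEAR * (1 - wasted_fraction)


total = annual_gpu_cost(1000, 3.0)
waste = wasted_spend(1000, 3.0, 0.10)
print(f"annual cost:          ${total:,.0f}")
print(f"wasted at 10%:        ${waste:,.0f}")
print(f"productive GPU-hours: {productive_gpu_hours(1000, 0.10):,.0f}")
```

This works out to $26.28 million a year and $2.628 million wasted, in line with the rounded figures in the article.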

That math can be translated into a simple contrast.

The past target:

Maximize GPU Utilization

Today’s target:

Maximize Productive GPU-Hours

The future target:

Maximize Productive Compute
Across Heterogeneous AI Clusters

This evolution maps exactly onto the three-layer structure in the diagram above. That final line, “Across Heterogeneous AI Clusters”, is the heterogeneous narrative HAMi has been pushing all along: in the future you will not optimize just one kind of GPU. You will maximize effective compute uniformly across completely different cards from NVIDIA, Ascend, Cambricon, and Hygon.

In other words, HAMi’s long-term value should not be boxed into the old “improve GPU utilization” narrative. Its real direction is: a resource control plane that lets Productive GPU-Hours be maximized across heterogeneous AI clusters.

From utilization to productive GPU-hours

If I had to summarize this whole line of thinking in one sentence, I would put it like this:

HAMi solves “how to let more GPUs be used”, and next-generation AI infrastructure has to solve “how to let every GPU actually create value”.

That is also the deepest line Kaarat’s article left me with:

The real question is no longer “Are the GPUs busy?”

It is: “Are they productively busy?”

It also makes me rethink what the HAMi community’s tagline for the next phase should be. We used to say “let GPUs be shared by more people”. Next, we should probably move toward:

Turn GPU utilization into productive GPU-hours.

Let every GPU not just be used, but genuinely create value.

GPU utilization was never the endpoint. It was only the first chapter of the efficiency story. Only when we start talking about Productive GPU-Hours, and start caring about storage, I/O, the data path, and the aggregate output of heterogeneous clusters, does AI infrastructure truly enter its second chapter.

Acknowledgments and references

Parts of this post were inspired by Arjun Kaarat’s piece in Towards Data Science, When GPU Utilization Lies: The Hidden Systems Problem Slowing Modern AI, and quote from that article and the related published paper. My thanks to the author. Both diagrams in this post (resource fragmentation and the efficiency layering) were redrawn by me based on the ideas in the original, not copied from it.

  • Kaarat, A. When GPU Utilization Lies: The Hidden Systems Problem Slowing Modern AI. Towards Data Science, 2026.
  • Kaarat, A., Batthula, V. J. R., & Segall, R. Fitting the Void: Residual-Aware Geometric Packing for GenAI Workloads. IEEE, 2025.

Jimmy Song

Focusing on research and open source practices in AI-Native Infrastructure and cloud native application architecture.
