GPU utilization is not the destination. Token cost is the true North Star metric for AI infrastructure.
For the past few years, one of the hottest topics in AI infrastructure has been GPU scheduling. Whether it’s Kubernetes, Volcano, Kueue, or HAMi, they all fundamentally solve the same problem — how to make expensive and scarce GPUs more efficiently utilized.
But as more enterprises begin running production-grade Large Language Model (LLM) services, a new phenomenon has emerged: GPU utilization is high, but users still complain about slow response times; GPU clusters are near full capacity, but business throughput hasn’t grown proportionally; VRAM and compute still have headroom, yet TTFT (Time To First Token) continues to degrade. This reveals a fundamental truth — GPU utilization alone is no longer sufficient to describe the real operational state of modern AI systems.
For traditional cloud-native applications, we focus on CPU, memory, network, and disk. For AI systems, however, we also need to care about: whether GPUs are truly performing useful computation, whether NCCL (NVIDIA Collective Communications Library) communication has become a bottleneck, whether Kubernetes has correctly allocated resources, whether KV Cache is exhausted, whether Token latency meets user experience requirements, and whether the cost per Token is reasonable. In other words, the observability target for AI infrastructure has expanded from the GPU to the entire inference chain.
Recently, I took part in developing an industry standard, “Technical Capability Requirements for Computing Power Efficiency Enhancement: Heterogeneous Computing Services,” organized by the China Academy of Information and Communications Technology (CAICT). One core question came up repeatedly in discussions with experts: How do GPU and Token relate? How do we measure GPU output through Tokens? The standard introduces the concept of “Token as a Service,” shifting the unit of measurement for computing services from traditional GPU hours to Tokens, yet a mature practical framework for building a complete observability chain from hardware to Token was still missing.
Around the same time, I came across a technical article on GPU and LLM observability (GPU & LLM Inference Observability — Layer-by-Layer Coverage) that proposed a constructive approach: decomposing the AI system into eight observability layers from GPU hardware to business cost, which neatly fills the observability gap between GPU and Token. This article reorganizes and interprets the original content, removes product-specific implementation details and vendor promotion, and extends the model by incorporating Kubernetes, HAMi, and modern LLM inference architectures.
This article aims to answer one question:
After we’ve solved GPU scheduling, what should we observe next?
8-Layer Observability Architecture Overview
The diagram below shows the eight observability layers from GPU hardware to business cost, each corresponding to different observability targets and areas of responsibility:
Data from each layer is collected through the OpenTelemetry (OTel) Collector and routed to metrics, tracing, and logging backends, forming a complete observability loop. Let’s examine each layer in detail.
L1 GPU Hardware Layer
This layer focuses on the GPU itself. The core question is: Is the GPU running healthy?
The table below lists the key observability metrics for the GPU hardware layer:
| Category | Key Metrics |
|---|---|
| Compute | GPU Utilization, SM Occupancy, Tensor Core Activity |
| Memory | VRAM Usage, HBM Bandwidth |
| Interconnect | NVLink Throughput, PCIe Throughput |
| Reliability | ECC Error, XID Error |
| Thermal | Temperature, Power Draw, Throttle Reason |
Many teams only track GPU Utilization, but there is a critical distinction: GPU Utilization ≠ GPU Efficiency. The utilization figure reported by tools such as nvidia-smi is the share of time in which at least one kernel was running, not how much of the chip was doing useful work. Two GPUs might both show 90% Utilization, one executing Tensor Core operations and the other stalled on memory access, and their actual performance can be vastly different. Therefore, SM (Streaming Multiprocessor) Occupancy and Tensor Core Activity are often more valuable than Utilization alone.
L2 CUDA Runtime and Communication Layer
In distributed training, the bottleneck is often communication rather than GPU computation. The core question at this layer is: Is the GPU computing, or waiting for other GPUs?
The table below lists the key metrics for the CUDA runtime and communication layer:
| Category | Metrics |
|---|---|
| NCCL | AllReduce, AllGather, ReduceScatter |
| Communication | Duration, Bandwidth |
| Kernel | Execution Time, P99 Duration |
| Straggler | Rank Skew |
In large-scale training scenarios, a single slow node can drag down the entire training job. By monitoring NCCL communication Duration and Bandwidth, as well as Skew between Ranks, you can quickly pinpoint communication bottlenecks.
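As a rough illustration of Rank Skew, the sketch below flags ranks whose per-step compute time is far above the median. The function name, the input shape, and the 1.5x threshold are all assumptions made for this example. It deliberately uses the time each rank spends before entering the collective: the rank that arrives last tends to show the shortest wait inside the collective itself, so in-collective duration alone can point at the wrong rank.

```python
from statistics import median


def find_stragglers(compute_ms_by_rank, threshold=1.5):
    """Return (skew, stragglers) for one training step.

    compute_ms_by_rank: hypothetical mapping of rank -> milliseconds spent
    computing before the rank reached the collective operation.
    skew: slowest rank's time divided by the median time.
    stragglers: ranks whose time exceeds threshold * median.
    """
    if not compute_ms_by_rank:
        raise ValueError("compute_ms_by_rank must not be empty")
    med = median(compute_ms_by_rank.values())
    if med <= 0:
        raise ValueError("median compute time must be positive")
    skew = max(compute_ms_by_rank.values()) / med
    stragglers = sorted(
        rank for rank, ms in compute_ms_by_rank.items() if ms > threshold * med
    )
    return skew, stragglers


# Illustrative numbers only: rank 3 is much slower than its peers.
skew, stragglers = find_stragglers({0: 100.0, 1: 102.0, 2: 98.0, 3: 250.0})
print(f"skew={skew:.2f}, stragglers={stragglers}")
```

A median baseline is used instead of the mean so that one very slow rank does not shift the reference point it is compared against.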
L3 Host / OS Layer
Many GPU problems are ultimately not GPU problems. When GPU utilization is abnormal, the root cause may lie in the host’s CPU, memory, disk, or network.
The table below lists the key metrics at the host level:
| Category | Metrics |
|---|---|
| CPU | Utilization, IO Wait |
| Memory | Usage, Swap |
| Disk | Throughput, Latency |
| Network | Retransmit, Bandwidth |
| Process | RSS, Thread Count |
A common misconception: when GPU utilization is low, the first reaction is that the GPU isn’t fast enough. In reality, the GPU is often waiting for data — a slow DataLoader, insufficient CPU, network congestion, or inadequate storage performance can all leave the GPU idle.
L4 Kubernetes and Scheduling Layer
This layer is the core of cloud-native AI infrastructure. The question to answer is: Who owns the GPU? You need to know which Pod is using the GPU, which Namespace consumes the most resources, which team has the highest cost, and which model occupies the most VRAM.
The table below lists the key observability dimensions at the scheduling layer:
| Category | Key Attributes |
|---|---|
| Ownership | Pod, Container |
| Workload | Deployment, Job |
| Organization | Namespace, Team |
| Scheduling | GPU Sharing, GPU Partition |
| Topology | Node, AZ, Region |
Projects like HAMi, Volcano, and Kueue primarily operate at this layer. HAMi solves core problems like GPU Sharing, GPU Partitioning, and heterogeneous GPU scheduling, while observability answers the question: Has scheduling actually improved resource utilization? The observability data at this layer is the foundation for resource auditing and cost allocation.
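To make the ownership question concrete, here is a minimal sketch that rolls per-Pod GPU memory samples up to the Namespace level. The sample records and their field names (`namespace`, `pod`, `gpu_memory_mib`) are hypothetical; in practice they would come from whatever exporter attaches Kubernetes labels to GPU metrics in your cluster.

```python
from collections import defaultdict


def gpu_memory_by_namespace(samples):
    """Sum GPU memory per namespace, largest consumer first.

    samples: iterable of dicts with hypothetical keys
    'namespace', 'pod', and 'gpu_memory_mib'.
    """
    totals = defaultdict(float)
    for sample in samples:
        totals[sample["namespace"]] += sample["gpu_memory_mib"]
    # Sort descending so the heaviest namespace is listed first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative data only.
samples = [
    {"namespace": "search", "pod": "ranker-0", "gpu_memory_mib": 20000},
    {"namespace": "chat", "pod": "llm-0", "gpu_memory_mib": 60000},
    {"namespace": "chat", "pod": "llm-1", "gpu_memory_mib": 58000},
]
print(gpu_memory_by_namespace(samples))
```

The same grouping works for Team or Deployment once those attributes are attached to each sample, which is what makes this layer the basis for cost allocation.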
L5 Training Runtime Layer
The training phase requires monitoring the model’s runtime state. The table below lists the key metrics for the training runtime:
| Category | Metrics |
|---|---|
| Efficiency | MFU, TFLOPS, Step Time |
| Gradient | Norm, NaN, Inf |
| Loss | Training Loss |
| Data | DataLoader Wait |
| Checkpoint | Save Time, Restore Time |
The most important metric here is MFU (Model FLOPs Utilization): the FLOPs per second that the model's training computation actually requires at the observed throughput, divided by the theoretical peak FLOPs per second of the hardware. Many training jobs show high GPU utilization but only 30% MFU, meaning a significant amount of GPU time is not being converted into actual training progress. MFU is the golden metric for measuring training efficiency because it directly reflects how much of the hardware's compute becomes effective training computation.
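A back-of-the-envelope MFU estimate can be sketched as follows. It assumes the common approximation of about 6 FLOPs per parameter per training token for a dense transformer (forward plus backward pass, attention FLOPs ignored); the model size, throughput, and peak figure below are illustrative inputs, not measurements from the original article.

```python
def training_flops_per_token(num_params):
    """Approximate training FLOPs per token for a dense transformer.

    Assumption: ~6 FLOPs per parameter per token (2 forward + 4 backward),
    ignoring attention-specific FLOPs.
    """
    return 6.0 * num_params


def mfu(tokens_per_second, num_params, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved model FLOPs/s over peak FLOPs/s."""
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved = tokens_per_second * training_flops_per_token(num_params)
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Illustrative inputs: a 7B-parameter model on 8 GPUs, each with an
# assumed peak of 312 TFLOPS, training at 20,000 tokens per second.
value = mfu(
    tokens_per_second=20_000,
    num_params=7e9,
    num_gpus=8,
    peak_flops_per_gpu=312e12,
)
print(f"MFU = {value:.1%}")
```

Because the numerator counts only the FLOPs the model needs, time lost to communication, data loading, or recomputation lowers MFU even when the GPU Utilization counter stays high.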
L6 Inference Engine Layer
This is the most critical layer in production environments and the one seeing the fastest growth in industry attention. The table below lists the core metrics for inference engines:
| Metric | Description |
|---|---|
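| TTFT | Time To First Token |
| ITL | Inter-Token Latency (time between consecutive output tokens) |
| Queue Wait | Time a request waits before it is scheduled |
| Throughput | Tokens generated per second |
| Batch Size | Number of sequences processed together per step |
| KV Cache Usage | Fraction of KV Cache capacity in use |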
Many teams still focus solely on GPU Utilization, but the true capacity metric for inference systems is often KV Cache (Key-Value Cache) Utilization.
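The two latency metrics in the table can be derived from per-token timestamps. The sketch below is a minimal illustration; the function name and the input format (arrival time plus a list of token emission times, in seconds) are assumptions, and real inference engines typically export these values directly.

```python
def latency_metrics(request_arrival, token_times):
    """Compute TTFT and mean ITL for one streamed response.

    request_arrival: time the request was received, in seconds.
    token_times: times at which each output token was emitted, in order.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one token")
    ttft = token_times[0] - request_arrival
    # Gaps between consecutive tokens; empty for a single-token response.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean_itl,
        "output_tokens": len(token_times),
    }


# Illustrative timestamps only.
print(latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65]))
```

Note that TTFT measured this way includes any Queue Wait, which is why a queue building up behind a saturated cache shows up first as TTFT degradation.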
Why KV Cache Matters More Than GPU Utilization
In modern LLM inference systems, GPU compute usually has headroom, but KV Cache often runs out first. A typical scenario: GPU utilization is only 60%, KV Cache is at 95%, new requests start queuing, and TTFT spikes rapidly.
For inference systems, KV Cache is akin to a database’s Buffer Pool: it often determines the capacity ceiling of the entire system. When KV Cache approaches saturation, the engine has to evict or preempt in-flight sequences. Depending on the engine, the evicted state is then recomputed or swapped back in later, which shows up as latency spikes and throughput drops.
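To see why the cache fills before compute does, it helps to estimate how many tokens fit in it. The sketch below uses the standard sizing for a transformer with per-layer key and value vectors; the model shape, the 20 GiB budget, and the 2,048-token average sequence length are hypothetical values chosen for the example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV Cache needed for one token.

    The factor 2 accounts for storing both a key and a value vector
    per layer and per KV head. bytes_per_element=2 assumes FP16/BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """How many tokens fit in the memory reserved for KV Cache."""
    return kv_budget_bytes // bytes_per_token


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
budget = 20 * 1024**3  # assume 20 GiB of VRAM is left for KV Cache
tokens = max_cached_tokens(budget, per_token)
concurrent = tokens // 2048  # assumed average of 2,048 tokens per sequence

print(f"{per_token} bytes/token, {tokens} tokens, ~{concurrent} concurrent sequences")
```

Under these assumptions the ceiling on concurrent sequences is set by memory, independent of how busy the compute units are, which matches the 60% utilization and 95% KV Cache scenario described above.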
L7 GenAI API Layer
With the development of OpenTelemetry GenAI Semantic Conventions, the industry is converging on unified AI observability standards. The table below lists the key metrics at the API layer:
| Metric | Description |
|---|---|
| Input Tokens | Number of tokens in the prompt |
| Output Tokens | Number of tokens generated in the response |
| Request Duration | End-to-end request latency |
| TTFT | Time To First Token |
| Token Throughput | Tokens generated per second |
This layer marks the shift from infrastructure-centric to application-centric observability. API-layer metrics directly face end users and business systems, serving as the core data source for SLA/SLO quality assessment.
L8 Business and Cost Layer
Ultimately, enterprises care less about GPU utilization than about one question: Is the GPU creating value? The table below lists the core metrics for the business and cost layer:
| Metric | Description |
|---|---|
| Cost per GPU Hour | GPU hourly cost |
| Cost per Token | Token cost |
| Cost per Request | Request cost |
| Idle GPU Cost | Idle resource cost |
| Tokens per Watt | Energy efficiency (token throughput per watt of power draw) |
The core competitive metric for future AI infrastructure may not be GPU Utilization, but rather Cost per Useful Token — the comprehensive cost of producing one useful Token. This metric unifies hardware cost, energy consumption, inference efficiency, and business value into a single measurement framework.
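The relationship between GPU hours and Token cost can be sketched with simple arithmetic. The prices, throughput, and power figures below are illustrative, and the `useful_fraction` parameter is a hypothetical way to model "useful" Tokens, since the article does not define how usefulness is measured.

```python
def cost_per_million_tokens(gpu_hourly_cost, num_gpus, tokens_per_second,
                            useful_fraction=1.0):
    """Cost of producing one million (useful) tokens.

    useful_fraction is a hypothetical knob: the share of generated tokens
    that count as useful (for example, excluding retried requests).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0.0 < useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be in (0, 1]")
    useful_tokens_per_hour = tokens_per_second * 3600 * useful_fraction
    hourly_cost = gpu_hourly_cost * num_gpus
    return hourly_cost / useful_tokens_per_hour * 1_000_000


def tokens_per_joule(tokens_per_second, avg_power_watts):
    """Token throughput per watt; one watt is one joule per second."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return tokens_per_second / avg_power_watts


# Illustrative inputs: 8 GPUs at 2.00 per GPU hour, 10,000 tokens/s, 5 kW.
print(cost_per_million_tokens(2.0, 8, 10_000))
print(cost_per_million_tokens(2.0, 8, 10_000, useful_fraction=0.8))
print(tokens_per_joule(10_000, 5_000))
```

Idle GPU Cost falls out of the same formula: hours in which throughput is zero still add to the numerator but contribute nothing to the denominator.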
Cross-Layer Troubleshooting
Real production issues often span multiple layers. The table below lists common cross-layer failure symptoms and their potential causes:
| Symptom | Possible Causes |
|---|---|
| TTFT Increase | CPU IO Wait, Queue Depth, KV Cache Pressure |
| Throughput Drop | NCCL, Batch Size, Network |
| Latency Spike | KV Cache Eviction |
| OOM | Insufficient VRAM, Oversized Batch |
| Training Stall | NCCL Straggler |
Modern AI operations are evolving from single-point monitoring to cross-layer observability. The root cause of a single-layer metric anomaly often lies in another layer. Only by building cross-layer correlation capabilities can teams achieve rapid root-cause identification and precise troubleshooting.
From GPU Control Plane to AI Observability Plane
Over the past few years, the industry has focused primarily on projects like Kubernetes Scheduler, Volcano, Kueue, and HAMi, solving the problem of how to allocate GPUs. In the coming years, the industry will start asking whether GPUs are truly generating value, leading to the formation of a new AI infrastructure technology stack:
GPU Hardware
↓
Kubernetes Control Plane
↓
Inference Runtime
↓
Observability Plane
↓
Optimization Plane
GPU scheduling determines how resources are allocated. Observability determines whether resources are being used correctly. Together, they form the next-generation AI Infrastructure Stack. The evolution from GPU Control Plane to AI Observability Plane marks a new era where AI infrastructure transitions from “resource management” to “value management.”
Summary
AI infrastructure observability is undergoing a fundamental paradigm shift. We used to look only at GPU utilization; now we need to build a complete observability system across eight layers: GPU hardware, CUDA runtime, host OS, Kubernetes scheduling, training runtime, inference engine, GenAI API, and business cost.
Key takeaways:
- L1-L3 focus on hardware and systems: GPU health, CUDA communication efficiency, and host resource adequacy
- L4 focuses on resource allocation: Kubernetes scheduling and GPU sharing, with tools like HAMi solving allocation problems
- L5-L6 focus on model runtime: Training efficiency (MFU) and inference capacity (KV Cache) are the core metrics
- L7-L8 focus on business value: From Token throughput to cost per Token, ultimately measuring whether GPUs are creating value
GPU scheduling is the starting point for AI infrastructure, but not the end goal. The 8-layer observability stack from GPU to Token is the critical closed loop that ensures AI infrastructure truly delivers business value.
