From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

From GPU hardware and Kubernetes scheduling through inference engines to token cost: understanding the 8-layer observability architecture for modern AI infrastructure.

GPU utilization is not the destination. Token cost is the true North Star metric for AI infrastructure.

For the past few years, one of the hottest topics in AI infrastructure has been GPU scheduling. Whether it’s Kubernetes, Volcano, Kueue, or HAMi, they all fundamentally solve the same problem — how to make expensive and scarce GPUs more efficiently utilized.

But as more enterprises begin running production-grade Large Language Model (LLM) services, a new phenomenon has emerged: GPU utilization is high, but users still complain about slow response times; GPU clusters are near full capacity, but business throughput hasn’t grown proportionally; VRAM and compute still have headroom, yet TTFT (Time To First Token) continues to degrade. This reveals a fundamental truth — GPU utilization alone is no longer sufficient to describe the real operational state of modern AI systems.

For traditional cloud-native applications, we focus on CPU, memory, network, and disk. For AI systems, however, we also need to care about: whether GPUs are truly performing useful computation, whether NCCL (NVIDIA Collective Communications Library) communication has become a bottleneck, whether Kubernetes has correctly allocated resources, whether KV Cache is exhausted, whether Token latency meets user experience requirements, and whether the cost per Token is reasonable. In other words, the observability target for AI infrastructure has expanded from the GPU to the entire inference chain.

Recently, I took part in developing an industry standard, “Technical Capability Requirements for Computing Power Efficiency Enhancement: Heterogeneous Computing Services,” organized by the China Academy of Information and Communications Technology (CAICT). A core question came up repeatedly in discussions with experts: How do GPU and Token relate? How do we measure GPU output through Tokens? The standard introduces the concept of “Token as a Service,” shifting the unit of measurement for computing services from traditional GPU hours to Tokens, yet a mature practical framework for building a complete observability chain from hardware to Token was still missing.

Around the same time, I came across a technical article on GPU and LLM observability (GPU & LLM Inference Observability — Layer-by-Layer Coverage) that proposed a constructive approach: decomposing the AI system into eight observability layers from GPU hardware to business cost, which neatly fills the observability gap between GPU and Token. This article reorganizes and interprets the original content, removes product-specific implementation details and vendor promotion, and extends the model by incorporating Kubernetes, HAMi, and modern LLM inference architectures.

This article aims to answer one question:

After we’ve solved GPU scheduling, what should we observe next?

8-Layer Observability Architecture Overview

The diagram below shows the eight observability layers from GPU hardware to business cost, each corresponding to different observability targets and areas of responsibility:

Figure 1: 8-Layer AI Infrastructure Observability Architecture

Original image source: GitHub.

Data from each layer is collected through the OpenTelemetry (OTel) Collector and routed to metrics, tracing, and logging backends, forming a complete observability loop. Let’s examine each layer in detail.

L1 GPU Hardware Layer

This layer focuses on the GPU itself. The core question is: Is the GPU healthy?

The table below lists the key observability metrics for the GPU hardware layer:

Category | Key Metrics
Compute | GPU Utilization, SM Occupancy, Tensor Core Activity
Memory | VRAM Usage, HBM Bandwidth
Interconnect | NVLink Throughput, PCIe Throughput
Reliability | ECC Error, XID Error
Thermal | Temperature, Power Draw, Throttle Reason
Table 1: L1 GPU Hardware Layer Key Metrics

Many teams only track GPU Utilization, but there’s a critical distinction: GPU Utilization ≠ GPU Efficiency. Utilization as commonly reported only says that some kernel was running during the sampling window, not how much of the chip that kernel kept busy. Two GPUs might both show 90% Utilization while one is saturating its Tensor Cores and the other is mostly stalled on memory access, so their actual performance can be vastly different. Therefore, SM (Streaming Multiprocessor) Occupancy and Tensor Core Activity are often more valuable than Utilization alone.

L2 CUDA Runtime and Communication Layer

For distributed training, GPU computation is often not the bottleneck; communication is. The core question at this layer is: Is the GPU computing, or waiting for other GPUs?

The table below lists the key metrics for the CUDA runtime and communication layer:

Category | Metrics
NCCL | AllReduce, AllGather, ReduceScatter
Communication | Duration, Bandwidth
Kernel | Execution Time, P99 Duration
Straggler | Rank Skew
Table 2: L2 CUDA Runtime and Communication Layer Key Metrics

In large-scale training scenarios, a single slow node can drag down the entire training job. By monitoring NCCL communication Duration and Bandwidth, as well as Skew between Ranks, you can quickly pinpoint communication bottlenecks.
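As a rough illustration of rank skew, the sketch below flags ranks that take much longer than the median rank to reach a collective. The input (per-rank compute time before entering AllReduce), the function names, and the 1.5x tolerance are all assumptions made for this example, not part of NCCL or any profiler API. The pre-collective time is used on purpose: time measured inside the collective can look longer on the fast ranks, because they are the ones waiting.

```python
from statistics import median


def rank_skew(pre_collective_ms):
    """Ratio of the slowest rank's time to the median rank's time."""
    values = list(pre_collective_ms.values())
    return max(values) / median(values)


def find_stragglers(pre_collective_ms, tolerance=1.5):
    """Return ranks whose time exceeds `tolerance` times the median."""
    med = median(pre_collective_ms.values())
    return sorted(r for r, ms in pre_collective_ms.items() if ms > tolerance * med)


# Hypothetical per-rank compute time (ms) before entering AllReduce.
step_ms = {0: 12.0, 1: 11.5, 2: 12.4, 3: 30.0}
print("stragglers:", find_stragglers(step_ms))
print("skew:", round(rank_skew(step_ms), 2))
```

A skew close to 1.0 suggests the ranks are balanced; a persistent outlier rank is a hint to look at that node's host and network metrics next.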

L3 Host / OS Layer

Many GPU problems are ultimately not GPU problems. When GPU utilization is abnormal, the root cause may lie in the host’s CPU, memory, disk, or network.

The table below lists the key metrics at the host level:

Category | Metrics
CPU | Utilization, IO Wait
Memory | Usage, Swap
Disk | Throughput, Latency
Network | Retransmit, Bandwidth
Process | RSS, Thread Count
Table 3: L3 Host / OS Layer Key Metrics

A common misconception: when GPU utilization is low, the first reaction is to suspect the GPU itself. In reality, the GPU is often waiting for data: a slow DataLoader, insufficient CPU, network congestion, or inadequate storage performance can all leave the GPU idle.

L4 Kubernetes and Scheduling Layer

This layer is the core of cloud-native AI infrastructure. The question to answer is: Who owns the GPU? You need to know which Pod is using the GPU, which Namespace consumes the most resources, which team has the highest cost, and which model occupies the most VRAM.

The table below lists the key observability dimensions at the scheduling layer:

Category | Key Attributes
Ownership | Pod, Container
Workload | Deployment, Job
Organization | Namespace, Team
Scheduling | GPU Sharing, GPU Partition
Topology | Node, AZ, Region
Table 4: L4 Kubernetes and Scheduling Layer Key Dimensions

Projects like HAMi, Volcano, and Kueue primarily operate at this layer. HAMi solves core problems like GPU Sharing, GPU Partitioning, and heterogeneous GPU scheduling, while observability answers the question: Has scheduling actually improved resource utilization? The observability data at this layer is the foundation for resource auditing and cost allocation.
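To show how ownership attributes turn into cost allocation, here is a minimal sketch that sums GPU cost per Namespace. The record fields (`namespace`, `gpu_fraction`, `hours`) and the price are hypothetical; in practice they would come from whatever usage export your scheduler and metrics pipeline provide.

```python
from collections import defaultdict


def gpu_cost_by_namespace(pod_usage, price_per_gpu_hour):
    """Sum GPU cost per namespace from per-Pod usage records."""
    totals = defaultdict(float)
    for pod in pod_usage:
        # A shared or partitioned GPU is billed by the fraction the Pod held.
        gpu_hours = pod["gpu_fraction"] * pod["hours"]
        totals[pod["namespace"]] += gpu_hours * price_per_gpu_hour
    return dict(totals)


# Hypothetical usage records and price.
usage = [
    {"namespace": "search", "gpu_fraction": 0.5, "hours": 10},
    {"namespace": "search", "gpu_fraction": 1.0, "hours": 4},
    {"namespace": "chat", "gpu_fraction": 0.25, "hours": 8},
]
print(gpu_cost_by_namespace(usage, price_per_gpu_hour=2.0))
```

Billing by allocated fraction is only one possible policy; billing by measured usage would need a different input.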

L5 Training Runtime Layer

The training phase requires monitoring the model’s runtime state. The table below lists the key metrics for the training runtime:

Category | Metrics
Efficiency | MFU, TFLOPS, Step Time
Gradient | Norm, NaN, Inf
Loss | Training Loss
Data | DataLoader Wait
Checkpoint | Save Time, Restore Time
Table 5: L5 Training Runtime Key Metrics

The most important metric here is MFU (Model FLOPs Utilization). Many training jobs show high GPU utilization but only 30% MFU, meaning a significant amount of GPU time is not being converted into actual training progress. MFU is the golden metric for measuring training efficiency: it is the ratio of the FLOPs the model’s computation actually requires per second to the hardware’s theoretical peak FLOPs per second.
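A back-of-the-envelope MFU estimate can be sketched as follows. It assumes the common approximation of about 6 FLOPs per parameter per training token for dense Transformer models (forward plus backward, ignoring attention FLOPs), which is not stated in this article; the model size, throughput, and peak FLOPs below are hypothetical round numbers.

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Estimate Model FLOPs Utilization for dense Transformer training."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward).
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


# Hypothetical job: 7B parameters, 40k tokens/s across 8 GPUs,
# each with an assumed peak of 7e14 FLOP/s.
estimate = mfu(params=7e9, tokens_per_second=40_000, num_gpus=8,
               peak_flops_per_gpu=7e14)
print(f"MFU ~ {estimate:.0%}")
```

Because the numerator counts only the FLOPs the model needs, time spent waiting on data or communication lowers MFU even when GPU utilization looks high.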

L6 Inference Engine Layer

This is the most critical layer in production environments and the one seeing the fastest growth in industry attention. The table below lists the core metrics for inference engines:

Metric | Description
TTFT | Time To First Token
ITL | Inter Token Latency
Queue Wait | Request queuing time
Throughput | Token throughput
Batch Size | Batch processing efficiency
KV Cache Usage | KV Cache utilization
Table 6: L6 Inference Engine Key Metrics
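To make TTFT and ITL concrete, the sketch below derives both from the timestamps of a single streamed response. The function name and the input shape (a request start time plus a list of token arrival times, all in seconds) are assumptions for illustration; real inference engines expose these metrics in their own formats.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, mean ITL) in seconds for one streamed response."""
    if not token_times:
        raise ValueError("no tokens were produced")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start
    # ITL: gap between consecutive tokens, averaged over the response.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


# Hypothetical timestamps: first token after 0.5 s, then one every 0.25 s.
ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, itl)
```

Note that TTFT measured this way includes queue wait, which is why queuing pressure shows up in TTFT first.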

Many teams still focus solely on GPU Utilization, but the true capacity metric for inference systems is often KV Cache (Key-Value Cache) Utilization.

Why KV Cache Matters More Than GPU Utilization

In modern LLM inference systems, GPU compute usually has headroom, but KV Cache often runs out first. A typical scenario: GPU utilization is only 60%, KV Cache is at 95%, new requests start queuing, and TTFT spikes rapidly.

For inference systems, KV Cache is akin to a database’s Buffer Pool: it often determines the capacity ceiling of the entire system. When KV Cache approaches saturation, the system is forced to evict cached entries or preempt in-flight requests, and the affected requests must recompute their context or be retried, which ultimately causes latency spikes and throughput drops.
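The capacity ceiling can be estimated with simple arithmetic. The sketch below assumes a standard Transformer KV Cache layout (one key and one value vector per layer, per KV head, per token); the model shape and the 16 GiB budget are hypothetical, and real engines add block-allocation overhead that this ignores.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV Cache needed to hold one token."""
    # Factor 2: one key tensor plus one value tensor in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """How many tokens fit in the memory reserved for KV Cache."""
    return kv_budget_bytes // bytes_per_token


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_bytes_per_token(32, 8, 128)
capacity = max_cached_tokens(16 * 1024**3, per_token)  # assumed 16 GiB budget
print(per_token, "bytes per token")
print(capacity, "tokens fit in the cache")
print(capacity // 4096, "concurrent requests at 4096 tokens each")
```

Under these assumptions the cache fills at a few dozen long-context requests, regardless of how much compute is still idle.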

L7 GenAI API Layer

With the development of OpenTelemetry GenAI Semantic Conventions, the industry is converging on unified AI observability standards. The table below lists the key metrics at the API layer:

Metric | Description
Input Tokens | Input token count
Output Tokens | Output token count
Request Duration | Request latency
TTFT | First Token latency
Token Throughput | Token throughput
Table 7: L7 GenAI API Layer Key Metrics

This layer marks the shift from infrastructure-centric to application-centric observability. API-layer metrics are what end users and business systems experience directly, which makes them the core data source for SLA/SLO quality assessment.
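As one way to turn API-layer metrics into an SLO check, the sketch below computes a nearest-rank percentile over per-request TTFT samples and compares it with a target. The sample values, the 0.5 s target, and the choice of p95 are all hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(ttft_samples, target_seconds, pct=95):
    """True if the chosen percentile of TTFT is within the target."""
    return percentile(ttft_samples, pct) <= target_seconds


# Hypothetical TTFT samples in seconds; one request was very slow.
samples = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.28, 1.2, 0.33, 0.27]
print("p50:", percentile(samples, 50))
print("p95:", percentile(samples, 95))
print("meets 0.5 s p95 target:", meets_slo(samples, 0.5))
```

The median looks healthy here while the tail does not, which is why SLOs are usually stated on a high percentile rather than an average.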

L8 Business and Cost Layer

Ultimately, enterprises don’t care about GPU utilization — they care about: Is the GPU creating value? The table below lists the core metrics for the business and cost layer:

Metric | Description
Cost per GPU Hour | GPU hourly cost
Cost per Token | Token cost
Cost per Request | Request cost
Idle GPU Cost | Idle resource cost
Tokens per Watt | Inference efficiency
Table 8: L8 Business and Cost Layer Key Metrics

The core competitive metric for future AI infrastructure may not be GPU Utilization, but rather Cost per Useful Token — the comprehensive cost of producing one useful Token. This metric unifies hardware cost, energy consumption, inference efficiency, and business value into a single measurement framework.
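The arithmetic behind cost per Token is short enough to sketch. The function below converts a GPU hourly price and sustained throughput into cost per million Tokens; the `useful_fraction` parameter is a hypothetical way to model "useful" Tokens (for example, excluding retried or discarded output), and all numbers are made up.

```python
def cost_per_million_tokens(gpu_hour_price, num_gpus, tokens_per_second,
                            useful_fraction=1.0):
    """Cost of producing one million (useful) tokens."""
    # Only the useful share of generated tokens counts as output.
    useful_tokens_per_hour = tokens_per_second * 3600 * useful_fraction
    return gpu_hour_price * num_gpus / useful_tokens_per_hour * 1_000_000


# Hypothetical: one GPU at 2.0 per hour sustaining 1000 tokens/s.
print(round(cost_per_million_tokens(2.0, 1, 1000), 4))
# Same setup if only 80% of the generated tokens are useful.
print(round(cost_per_million_tokens(2.0, 1, 1000, useful_fraction=0.8), 4))
```

Idle GPUs raise this number directly: the hourly price keeps accruing while the token count in the denominator does not.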

Cross-Layer Troubleshooting

Real production issues often span multiple layers. The table below lists common cross-layer failure symptoms and their potential causes:

Symptom | Possible Causes
TTFT Increase | CPU IO Wait, Queue Depth, KV Cache Pressure
Throughput Drop | NCCL, Batch Size, Network
Latency Spike | KV Cache Eviction
OOM | Insufficient VRAM, Oversized Batch
Training Stall | NCCL Straggler
Table 9: Cross-Layer Troubleshooting Reference

Modern AI operations are evolving from single-point monitoring to cross-layer observability. The root cause of a single-layer metric anomaly often lies in another layer. Only by building cross-layer correlation capabilities can teams achieve rapid root-cause identification and precise troubleshooting.
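A very small version of cross-layer correlation is a lookup from symptom to the layers worth checking. The causes below are taken from Table 9; the layer tag attached to each cause is an assumed mapping onto the eight layers, chosen for this sketch rather than given by the source.

```python
# Symptom -> (possible cause, layer) pairs.
# Causes come from Table 9; the layer tags are an assumed mapping.
CAUSES = {
    "TTFT Increase": [("CPU IO Wait", "L3"), ("Queue Depth", "L6"),
                      ("KV Cache Pressure", "L6")],
    "Throughput Drop": [("NCCL", "L2"), ("Batch Size", "L6"), ("Network", "L3")],
    "Latency Spike": [("KV Cache Eviction", "L6")],
    "OOM": [("Insufficient VRAM", "L1"), ("Oversized Batch", "L6")],
    "Training Stall": [("NCCL Straggler", "L2")],
}


def layers_to_check(symptom):
    """Return the distinct layers to inspect for a symptom, in layer order."""
    return sorted({layer for _, layer in CAUSES.get(symptom, [])})


print(layers_to_check("TTFT Increase"))
print(layers_to_check("Throughput Drop"))
```

A real system would rank these candidates using live metrics from each layer; the static table only narrows down where to look first.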

From GPU Control Plane to AI Observability Plane

Over the past few years, the industry has focused primarily on projects like Kubernetes Scheduler, Volcano, Kueue, and HAMi, solving the problem of how to allocate GPUs. In the coming years, the industry will start asking whether GPUs are truly generating value, leading to the formation of a new AI infrastructure technology stack:

GPU Hardware
Kubernetes Control Plane
Inference Runtime
Observability Plane
Optimization Plane

GPU scheduling determines how resources are allocated. Observability determines whether resources are being used correctly. Together, they form the next-generation AI Infrastructure Stack. The evolution from GPU Control Plane to AI Observability Plane marks a new era where AI infrastructure transitions from “resource management” to “value management.”

Summary

AI infrastructure observability is undergoing a fundamental paradigm shift. We used to look only at GPU utilization; now we need to build a complete observability system across eight layers — from GPU hardware, CUDA runtime, host OS, Kubernetes scheduling, training runtime, inference engine, GenAI API, to business cost.

Key takeaways:

  • L1-L3 focus on hardware and systems: GPU health, CUDA communication efficiency, and host resource adequacy
  • L4 focuses on resource allocation: Kubernetes scheduling and GPU sharing, with tools like HAMi solving allocation problems
  • L5-L6 focus on model runtime: Training efficiency (MFU) and inference capacity (KV Cache) are the core metrics
  • L7-L8 focus on business value: From Token throughput to cost per Token, ultimately measuring whether GPUs are creating value

GPU scheduling is the starting point for AI infrastructure, but not the end goal. The 8-layer observability stack from GPU to Token is the critical closed loop that ensures AI infrastructure truly delivers business value.

Jimmy Song

Focusing on research and open source practices in AI-Native Infrastructure and cloud native application architecture.
