Detailed Introduction
Awex is a high-performance weight synchronization framework for Reinforcement Learning (RL) training-to-inference workflows. It synchronizes model parameters from training to inference within seconds, ensuring rollouts always use the latest weights. Awex scales from tens of billions to trillions of parameters and adapts to varied parallel strategies and deployment topologies to minimize update latency.
Main Features
- Extreme synchronization speed: completes full weight exchange for 10B-scale models within seconds on thousand-GPU clusters.
- Unified weight adaptation layer: handles tensor layout and parallel strategy differences automatically.
- Zero-redundancy transfer & in-place updates: transfers only the necessary shards and supports in-place GPU memory updates during inference (a shard-planning sketch follows this list).
- Multi-mode transport: supports NCCL, RDMA, and shared-memory transports to balance bandwidth and latency.
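The zero-redundancy idea can be made concrete with a small shard-planning sketch: given how a tensor is sharded on the training side and on the inference side, only the overlapping row ranges ever need to move. The names below (`ShardRange`, `row_shards`, `plan_transfers`) and the simple 1-D row sharding are illustrative assumptions, not Awex's actual API.

```python
# A minimal sketch of zero-redundancy shard planning, assuming a simple
# 1-D row-sharded layout on both sides. All names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardRange:
    rank: int    # owning rank within its engine
    start: int   # first row held by this rank (inclusive)
    stop: int    # last row held by this rank (exclusive)

def row_shards(num_rows: int, world_size: int) -> list[ShardRange]:
    """Evenly split num_rows rows across world_size ranks."""
    per_rank = num_rows // world_size
    return [ShardRange(r, r * per_rank, (r + 1) * per_rank)
            for r in range(world_size)]

def plan_transfers(src: list[ShardRange], dst: list[ShardRange]):
    """Emit (src_rank, dst_rank, row_range) triples covering exactly the
    overlapping rows, so each row is transferred once and only once."""
    plan = []
    for s in src:
        for d in dst:
            lo, hi = max(s.start, d.start), min(s.stop, d.stop)
            if lo < hi:  # the two shards overlap on [lo, hi)
                plan.append((s.rank, d.rank, (lo, hi)))
    return plan

if __name__ == "__main__":
    # Training shards a 1024-row tensor over 4 ranks; inference over 2.
    for src_rank, dst_rank, rows in plan_transfers(row_shards(1024, 4),
                                                   row_shards(1024, 2)):
        print(f"train rank {src_rank} -> infer rank {dst_rank}: rows {rows}")
```

Because both sides can derive the same plan from the same layouts, each row is sent exactly once, which is what makes the transfer zero-redundancy.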
Use Cases
Suitable for scenarios that require rapid feedback from training to online inference, such as RL systems that frequently update policies for rollouts/evaluation, inference clusters needing low-latency parameter hot-updates, and heterogeneous deployments combining co-located and separated training/inference engines.
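As a rough picture of the feedback loop these scenarios share, the sketch below runs a policy update and then hot-swaps the inference engine's copy of the weights before the next rollouts. `sync_weights` is a stand-in for an Awex push, which in a real deployment would move only the needed shards over NCCL/RDMA and update GPU memory in place; every name here is a hypothetical placeholder.

```python
# A minimal sketch of the train-then-sync loop; all names are illustrative.
import time

def train_step(weights: dict) -> dict:
    """Placeholder policy update (e.g., one PPO step)."""
    return {name: w + 0.01 for name, w in weights.items()}

def sync_weights(weights: dict, inference_weights: dict) -> None:
    """Stand-in for an Awex update; a real transfer would move shards
    over NCCL/RDMA and update the inference GPU memory in place."""
    inference_weights.update(weights)

trainer_weights = {"policy.w": 0.0}
rollout_weights = dict(trainer_weights)  # the inference engine's copy

for step in range(3):
    trainer_weights = train_step(trainer_weights)
    t0 = time.perf_counter()
    sync_weights(trainer_weights, rollout_weights)
    print(f"step {step}: synced in {time.perf_counter() - t0:.6f}s, "
          f"rollouts now use w={rollout_weights['policy.w']:.2f}")
```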
Technical Features
Awex uses a MetaServer for global metadata exchange and derives deterministic P2P transfer plans that drive shard-level transfer execution. It supports NCCL and RDMA backends, performs tensor-level validation to ensure correctness, and its modular design makes it straightforward to integrate new training or inference engines while delivering high throughput and low tail latency on production clusters.
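The metadata flow might look like the toy sketch below, assuming every rank registers its local shard metadata and then derives the same transfer plan from an identical, sorted global view; the checksum field shows one way tensor-level validation could work. Class and field names are illustrative, not Awex's actual interfaces.

```python
# A toy MetaServer sketch: collect shard metadata globally, expose a
# deterministic snapshot, and attach checksums for validation.
# Everything here is a hypothetical illustration.
import hashlib
from dataclasses import dataclass, field

@dataclass
class ShardMeta:
    tensor: str      # parameter name, e.g. "layers.0.attn.qkv"
    rank: int        # owning rank
    offset: int      # start row within the full tensor
    length: int      # number of rows in this shard
    checksum: str    # digest used for tensor-level validation

@dataclass
class MetaServer:
    """Toy global registry: collects shard metadata from all ranks."""
    shards: list[ShardMeta] = field(default_factory=list)

    def register(self, meta: ShardMeta) -> None:
        self.shards.append(meta)

    def snapshot(self) -> list[ShardMeta]:
        # Sorting makes the global view identical on every rank, so each
        # rank can compute the same P2P plan independently (determinism).
        return sorted(self.shards, key=lambda m: (m.tensor, m.offset, m.rank))

def digest(payload: bytes) -> str:
    """Checksum used to validate a tensor shard after transfer."""
    return hashlib.sha256(payload).hexdigest()

# Example: two training ranks publish halves of one tensor.
server = MetaServer()
for rank, data in enumerate([b"shard-0-bytes", b"shard-1-bytes"]):
    server.register(ShardMeta("layers.0.qkv", rank, rank * 512, 512, digest(data)))

for meta in server.snapshot():
    print(meta)
```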