
GLM-TTS

A controllable, emotion-expressive zero-shot text-to-speech system using multi-reward reinforcement learning.

Detailed Introduction

GLM-TTS is a text-to-speech (TTS) project released by Z.ai that focuses on controllable generation of emotion and speaking style. The project applies multi-reward reinforcement learning in a zero-shot setting to improve the emotional expressiveness and naturalness of synthesized speech, enabling the model to produce speech with a specified emotion or style even for speakers and styles not seen during training.
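As a rough illustration of the multi-reward idea, several per-aspect scores can be collapsed into one scalar that drives the reinforcement-learning update. The reward names and weights below are assumptions made for this sketch, not GLM-TTS's published formulation:

    # Illustrative sketch only: one simple way to combine several reward
    # signals into a single scalar for reinforcement learning. The reward
    # names and weights are assumptions, not GLM-TTS's actual design.
    def combined_reward(quality: float, emotion_match: float,
                        naturalness: float,
                        w_q: float = 0.4, w_e: float = 0.4,
                        w_n: float = 0.2) -> float:
        """Weighted sum of per-aspect scores, each assumed to lie in [0, 1]."""
        return w_q * quality + w_e * emotion_match + w_n * naturalness

    # Example: a sample that sounds clean but misses the target emotion
    # still receives a lower overall reward.
    print(combined_reward(quality=0.9, emotion_match=0.3, naturalness=0.8))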

Main Features

  • Zero-shot emotional expression: generate speech with target emotions without specialized training samples.
  • Strong controllability: multi-dimensional controls such as emotion intensity, speaking rate, and timbre (see the conditioning sketch after this list).
  • Multi-reward training: multiple reward signals optimize quality and emotional consistency.
  • Open-source license: released under Apache-2.0 for community reuse and extension.
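A minimal sketch of what multi-dimensional conditioning could look like in PyTorch. The module, dimensions, and control set below are illustrative assumptions, not the project's actual architecture:

    import torch
    import torch.nn as nn

    class EmotionCondition(nn.Module):
        """Toy conditioning module: maps a discrete emotion label plus
        scalar intensity and speaking-rate controls to a single vector
        that a TTS backbone could consume. All dimensions are assumed."""

        def __init__(self, num_emotions: int = 8, dim: int = 64):
            super().__init__()
            self.emotion_emb = nn.Embedding(num_emotions, dim)
            # Project the two scalar controls (intensity, rate) into the
            # same space as the emotion embedding.
            self.control_proj = nn.Linear(2, dim)

        def forward(self, emotion_id: torch.Tensor,
                    intensity: torch.Tensor,
                    rate: torch.Tensor) -> torch.Tensor:
            controls = torch.stack([intensity, rate], dim=-1)
            return self.emotion_emb(emotion_id) + self.control_proj(controls)

    cond = EmotionCondition()
    vec = cond(torch.tensor([2]), torch.tensor([0.8]), torch.tensor([1.1]))
    print(vec.shape)  # torch.Size([1, 64])

Summing the label embedding with a projection of the scalar controls keeps the conditioning interface a single vector, which a backbone can attend to or concatenate at each decoding step.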

Use Cases

  • Voice assistants and dialogue systems: provide more natural and emotionally expressive responses.
  • Audiobooks and content dubbing: automatically adapt narration style to content emotion.
  • Rapid prototyping for new languages/styles: quick zero-shot experiments on novel styles or languages.
  • Creative tools: give creators fine-grained control over speech style synthesis.

Technical Features

  • Model architecture combines extensible TTS backbones with emotion-conditioning modules.
  • Training strategy employs multiple reward signals balancing perceptual quality, emotional alignment, and naturalness (a policy-gradient sketch follows this list).
  • Supports the PyTorch ecosystem for local or cloud fine-tuning and extension.
  • The project is open-source on GitHub; see the official site https://audio.z.ai for demos and documentation.
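To make the training idea concrete, here is a minimal REINFORCE-style step driven by the three reward signals named above. The shapes, scores, and batch-mean baseline are invented for illustration; the project's actual training loop may differ:

    import torch

    # Minimal multi-reward policy-gradient sketch, assuming the TTS model
    # exposes per-sample log-probabilities of its sampled outputs. All
    # values below are stand-ins, not GLM-TTS's actual pipeline.
    log_probs = torch.randn(4, requires_grad=True)    # stand-in model output
    quality   = torch.tensor([0.9, 0.7, 0.8, 0.6])    # perceptual quality
    emotion   = torch.tensor([0.8, 0.9, 0.4, 0.7])    # emotional alignment
    natural   = torch.tensor([0.85, 0.8, 0.7, 0.9])   # naturalness

    reward = 0.4 * quality + 0.4 * emotion + 0.2 * natural
    advantage = reward - reward.mean()                # simple batch baseline

    loss = -(advantage * log_probs).mean()            # REINFORCE objective
    loss.backward()
    print(loss.item())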