Introduction
`fla` (Flash Linear Attention) is a Triton-based PyTorch library providing efficient implementations of state-of-the-art linear attention kernels, fused modules, and model components. It targets high-performance training and inference across NVIDIA, AMD, and Intel hardware.
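As a quick orientation, here is a minimal sketch of instantiating one of the library's layers. The `GatedLinearAttention` constructor arguments and the `(batch, seq_len, hidden_size)` input layout are assumptions drawn from the repo's examples, so verify them against the current API; the Triton kernels also require a GPU.

```python
# Minimal sketch: instantiate a fla layer and run a forward pass.
# hidden_size/num_heads kwargs and the input layout are assumptions.
import torch
from fla.layers import GatedLinearAttention

layer = GatedLinearAttention(hidden_size=512, num_heads=4).to('cuda', torch.bfloat16)
x = torch.randn(2, 1024, 512, device='cuda', dtype=torch.bfloat16)
out = layer(x)[0]  # fla layers typically return a tuple; output-first is assumed
print(out.shape)   # expected: torch.Size([2, 1024, 512])
```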
Key Features
- Wide collection of linear attention kernels and models (GLA, DeltaNet, Mamba, etc.).
- Triton-optimized kernels and fused modules for memory and compute efficiency.
- Integration-ready layers for Hugging Face `transformers` and benchmarking tools (see the sketch after this list).
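For the `transformers` integration, the pattern below mirrors the one the project advertises: importing `fla` is assumed to register its configs and models with the Hugging Face auto classes, so a model can be built straight from a config. The `GLAConfig` arguments shown are illustrative, not prescribed values.

```python
# Sketch: build an HF-compatible fla model from a config.
# Importing fla is assumed to register its model classes with transformers.
import fla  # noqa: F401  (side effect: registers fla models with Auto* classes)
from transformers import AutoModelForCausalLM
from fla.models import GLAConfig

config = GLAConfig(hidden_size=512, num_hidden_layers=4)  # sizes are illustrative
model = AutoModelForCausalLM.from_config(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```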
Use Cases
- Replace standard attention with linear variants in large-model training for a lower memory footprint (as sketched after this list).
- Research and benchmarking of subquadratic attention mechanisms.
- Production deployment of memory-efficient attention layers.
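To make the first use case concrete, the sketch below swaps a softmax-attention sublayer for a linear-attention one inside a toy pre-norm Transformer block. The block itself is hypothetical scaffolding; only the `fla` import reflects the library, and the tuple-unpacking of its output is an assumption.

```python
import torch
import torch.nn as nn
from fla.layers import GatedLinearAttention  # linear-attention drop-in


class Block(nn.Module):
    """Toy pre-norm Transformer block (hypothetical scaffolding)."""

    def __init__(self, dim: int, heads: int, linear_attn: bool = True):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.linear_attn = linear_attn
        if linear_attn:
            # O(N) memory in sequence length instead of O(N^2)
            self.attn = GatedLinearAttention(hidden_size=dim, num_heads=heads)
        else:
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        if self.linear_attn:
            h = self.attn(h)[0]  # fla layers typically return a tuple (assumed)
        else:
            h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h
        return x + self.mlp(self.norm2(x))
```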
Technical Highlights
- Triton kernels for fused operations, including an efficient fused cross-entropy implementation (see the sketch after this list).
- Support for hybrid models (mixing standard and linear attention layers).
- Extensive examples, benchmarks, and an evaluation harness compatible with HF-style models.
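On the fused cross-entropy highlight, the sketch below shows the intended usage pattern. The `FusedCrossEntropyLoss` module name and its `nn.CrossEntropyLoss`-style call signature are assumptions based on the repo's layout; confirm both before use.

```python
import torch
from fla.modules import FusedCrossEntropyLoss  # module name assumed from the repo

vocab = 32000
logits = torch.randn(2, 512, vocab, device='cuda', requires_grad=True)
labels = torch.randint(0, vocab, (2, 512), device='cuda')

# The fused Triton kernel computes the loss without materializing the full
# log-softmax over the vocabulary, reducing peak memory on large vocabularies.
loss = FusedCrossEntropyLoss()(logits.view(-1, vocab), labels.view(-1))
loss.backward()
```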