
Flash Linear Attention (fla)

A Triton-based PyTorch library of efficient linear-attention kernels and models for scalable sequence modeling.

Introduction

fla (Flash Linear Attention) is a Triton-based PyTorch library providing efficient implementations of state-of-the-art linear attention kernels, fused modules, and model components. It targets high-performance training and inference across NVIDIA, AMD, and Intel hardware.
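The "linear" in linear attention comes from replacing the softmax with a feature map φ so that attention can be computed as φ(Q)(φ(K)ᵀV) instead of softmax(QKᵀ)V, avoiding the N×N score matrix. Below is a minimal pure-PyTorch sketch of this kernel trick for the non-causal case; it is illustrative only (the elu+1 feature map follows the common linear-transformer formulation) and is not fla's Triton implementation.

```python
import torch

def softmax_attention(q, k, v):
    # Standard attention: materializes an (N, N) score matrix -> O(N^2) time and memory.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: apply a feature map phi, then use associativity to compute
    # phi(K)^T V (a d x d summary) first, so cost scales linearly in sequence length N.
    phi = lambda t: torch.nn.functional.elu(t) + 1
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                                  # (d, d) key/value summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # per-query normalizer
    return (q @ kv) / z

# Shapes: (batch, seq_len, head_dim). The causal variant turns the K^T V sum into a
# prefix sum / recurrence, which is what chunked Triton kernels parallelize efficiently.
x = torch.randn(2, 128, 64)
out = linear_attention(x, x, x)
```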

Key Features

  • A wide collection of linear-attention kernels and models (GLA, DeltaNet, Mamba, and more).
  • Triton-optimized kernels and fused modules for memory and compute efficiency.
  • Integration-ready layers for Hugging Face transformers and benchmarking tools (a usage sketch follows this list).
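As a quick orientation to the layer API, the sketch below follows the usage pattern shown in the project README: import a layer class from fla.layers and use it as a drop-in sequence-mixing module. The class name, constructor arguments, and return signature here are assumptions based on that README and may differ across versions; consult the repository for the current API.

```python
import torch
from fla.layers import GatedLinearAttention  # assumed import path; check the repo README

# Assumed constructor/forward signature -- verify against the installed fla version.
layer = GatedLinearAttention(hidden_size=1024, num_heads=4).to('cuda', torch.bfloat16)
x = torch.randn(2, 2048, 1024, device='cuda', dtype=torch.bfloat16)
y, *_ = layer(x)  # output keeps the (batch, seq_len, hidden_size) shape
```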

Use Cases

  • Replacing standard attention with linear variants in large-model training for a lower memory footprint.
  • Research and benchmarking of subquadratic attention mechanisms.
  • Production deployment of memory-efficient attention layers (see the decoding sketch after this list).
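The memory claims above follow from the recurrent view of causal linear attention: instead of a KV cache that grows with sequence length, decoding carries a fixed d×d state plus a d-dimensional normalizer, updated once per token. Here is a minimal plain-PyTorch sketch of one decode step, reusing the illustrative elu+1 feature map from the Introduction (not fla's fused kernels):

```python
import torch

def linear_attention_decode_step(state, z, q_t, k_t, v_t, eps=1e-6):
    # One autoregressive step of causal linear attention.
    # `state` is a (d, d) running sum of outer products k v^T and `z` a (d,) normalizer:
    # both are constant-size, unlike a softmax KV cache that grows with the sequence.
    phi = lambda t: torch.nn.functional.elu(t) + 1
    q_t, k_t = phi(q_t), phi(k_t)
    state = state + torch.outer(k_t, v_t)
    z = z + k_t
    out = (q_t @ state) / (q_t @ z + eps)
    return state, z, out

d = 64
state, z = torch.zeros(d, d), torch.zeros(d)
for _ in range(16):  # toy decode loop with random per-token projections
    q_t, k_t, v_t = (torch.randn(d) for _ in range(3))
    state, z, out = linear_attention_decode_step(state, z, q_t, k_t, v_t)
```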

Technical Highlights

  • Triton kernels for fused operations and efficient cross-entropy implementations.
  • Support for hybrid models that mix standard and linear attention layers (see the toy sketch after this list).
  • Extensive examples, benchmarks, and an evaluation harness compatible with HF-style models.
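To make the hybrid-model idea concrete, here is a toy stack that interleaves standard softmax attention (torch.nn.MultiheadAttention) with a simple linear-attention block. It is a conceptual sketch only: the block, the interleaving policy, and all names are invented for illustration and do not correspond to fla's model classes or configuration options.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLinearAttention(nn.Module):
    """Illustrative (non-causal) linear attention block; fla provides optimized equivalents."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = k.transpose(-2, -1) @ v
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6
        return self.proj((q @ kv) / z)

class HybridBlockStack(nn.Module):
    """Toy hybrid stack: every `softmax_every`-th layer uses standard softmax attention."""
    def __init__(self, dim, n_layers=6, n_heads=8, softmax_every=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % softmax_every == 0:
                self.layers.append(nn.MultiheadAttention(dim, n_heads, batch_first=True))
            else:
                self.layers.append(SimpleLinearAttention(dim))

    def forward(self, x):
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                attn_out, _ = layer(x, x, x)
            else:
                attn_out = layer(x)
            x = x + attn_out  # residual connection
        return x

model = HybridBlockStack(dim=256)
y = model(torch.randn(2, 128, 256))
```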

Resource Info

  • Author: fla-org
  • Added: 2025-09-14
  • Tags: OSS, Dev Tools, Project