Machine Learning Engineering

An open book on machine learning engineering, covering compute, storage, networking, training and inference best practices.

Overview

Machine Learning Engineering is an open book that compiles practical knowledge for ML engineers working on large-scale training and inference systems. It covers hardware selection (accelerators, storage), networking, distributed training strategies, inference optimizations, debugging and operational playbooks.

Key Features

  • Broad coverage: system-level guidance from hardware to distributed training and inference.
  • Practical scripts and tools: benchmarks, debugging scripts and example configurations to reproduce and diagnose issues.
  • Community-maintained: a long-running project with many contributors and rich discussions.

Use Cases

  • Engineering teams building or optimizing large-scale training and inference clusters.
  • Educational use as course material or reference for ML engineering best practices.
  • Informing decisions about cloud vs. on-prem compute and architecture selection.

Technical Highlights

  • Markdown-organized handbook with comparison tables and practical troubleshooting steps.
  • Focus on distributed training (SLURM, network optimizations) and inference engineering.
  • PDF and online versions available for different consumption scenarios.

Resource Info
🌱 Open Source 📖 Tutorial 🏋️ Training