Machine Learning Engineering

An open book on Machine Learning Engineering covering compute, storage, networking, training, and inference best practices.

Author: Stas Bekman

Since: 2020-09-02

Overview

Machine Learning Engineering is an open book that compiles practical knowledge for ML engineers working on large-scale training and inference systems. It covers hardware selection (accelerators, storage), networking, distributed training strategies, inference optimizations, debugging and operational playbooks.

Key Features

Broad coverage: system-level guidance from hardware to distributed training and inference.
Practical scripts and tools: benchmarks, debugging scripts and example configurations to reproduce and diagnose issues.
Community-maintained: a long-running project with many contributors and active discussions.

Use Cases

Engineering teams building or optimizing large-scale training and inference clusters.
Educational use as course material or reference for ML engineering best practices.
Informing decisions on cloud versus on-prem compute and on architecture selection.

Technical Highlights

Markdown-organized handbook with comparison tables and practical troubleshooting steps.
Focus on distributed training (SLURM, network optimizations) and inference engineering.
Available in both PDF and online versions.

Related Resources

Awex

AReaL

CellARC