Overview
Maestro streamlines fine-tuning of multimodal vision-language models (VLMs). It provides reusable recipes, a consistent JSONL data format, and CLI/Python interfaces to reduce boilerplate and improve reproducibility.
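To make the data format concrete, here is a minimal sketch of writing one record in the kind of JSONL layout the recipes consume. The per-split `annotations.jsonl` file and the `image`/`prefix`/`suffix` keys are assumptions based on common VLM fine-tuning setups; check the project documentation for the authoritative schema.

```python
import json
from pathlib import Path

# Assumed layout: one directory per split (e.g. train/valid), each holding the
# images plus an annotations.jsonl file with one example per line.
split_dir = Path("dataset/train")
split_dir.mkdir(parents=True, exist_ok=True)

record = {
    "image": "0001.jpg",                     # image file name, relative to the split directory
    "prefix": "Describe the image.",         # task prompt given to the model
    "suffix": "A dog playing in the park.",  # expected model output
}

with open(split_dir / "annotations.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```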
Key features
- Ready-to-use training recipes for Florence-2, PaliGemma 2, and Qwen2.5-VL.
- Support for LoRA, QLoRA, and graph-freezing optimization strategies to reduce memory and compute requirements.
- CLI and Python APIs, plus Colab cookbooks, for quick experimentation (a usage sketch follows below).
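The sketch below shows what a training call through the Python API might look like for a Florence-2 recipe. The module path `maestro.trainer.models.florence_2.core`, the `train` entry point, and the configuration keys are assumptions drawn from the project's documented examples and may differ between versions.

```python
# Hedged sketch: module path and config keys are assumptions; consult the
# Maestro docs for the exact API of your installed version.
from maestro.trainer.models.florence_2.core import train

config = {
    "dataset": "dataset/location",       # folder containing the split directories with annotations.jsonl files
    "epochs": 10,
    "batch_size": 4,
    "optimization_strategy": "lora",     # e.g. LoRA, QLoRA, or graph freezing
    "metrics": ["edit_distance"],
}

train(config)
```

The CLI is intended to mirror the same parameters (for example, a `maestro florence_2 train ...` invocation), so either interface can drive the same recipe; exact flag names are an assumption here.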
Use cases
- Fine-tuning VLMs for detection, JSON extraction, and captioning tasks (illustrated after this list).
- Reproducible experimentation in research and teaching.
- Resource-efficient adaptation in constrained environments.
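To illustrate the first use case, the records below show how the same prefix/suffix schema could express detection, JSON extraction, and captioning. The prompts and the box-encoding convention are illustrative assumptions; the exact target format depends on the backbone being fine-tuned.

```python
# Illustrative only: prompts and the detection suffix encoding are assumptions;
# each backbone (Florence-2, PaliGemma 2, Qwen2.5-VL) expects its own target format.
examples = [
    {   # detection: suffix encodes labeled boxes in a model-specific text format
        "image": "street.jpg",
        "prefix": "Detect all vehicles.",
        "suffix": "car 120 45 380 290; truck 400 60 610 310",
    },
    {   # structured (JSON) extraction: suffix is a serialized JSON string
        "image": "receipt.jpg",
        "prefix": "Extract merchant, date, and total as JSON.",
        "suffix": '{"merchant": "ACME", "date": "2024-05-01", "total": "19.99"}',
    },
    {   # captioning: suffix is free-form text
        "image": "beach.jpg",
        "prefix": "Describe the image.",
        "suffix": "Two people walking along a sandy beach at sunset.",
    },
]
```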
Technical notes
- Works with major VLM backbones and provides cookbooks/Colab notebooks for reproducing experiments quickly.