
Dingo

A tool for automated data quality evaluation that combines rule-based and model-based assessments.

Introduction

Dingo is a comprehensive data quality evaluation tool that automatically detects issues in datasets and produces visual reports. It supports both rule-based checks and LLM-driven evaluation, and is suitable for pretraining, fine-tuning, and evaluation datasets.

Key Features

  • Multi-source & multi-modal: supports text and image data from local files, Hugging Face and S3.
  • Rule & model hybrid evaluation: ships with 20+ built-in rules and supports LLM-based assessments for hallucination, completeness and relevance.
  • Visual reports: generates summaries and detailed reports, with local GUI and Gradio demos available.
  • Flexible integration: offers CLI and SDK interfaces and can run on local or Spark execution engines.
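To make the rule-based side of hybrid evaluation concrete, the sketch below implements two minimal quality rules in plain Python. The rule names and thresholds are illustrative assumptions, not Dingo's built-in rules:

```python
def rule_min_length(text: str, min_chars: int = 10) -> bool:
    """Flag samples that are too short to be useful for training."""
    return len(text.strip()) >= min_chars

def rule_no_repetition(text: str, max_ratio: float = 0.3) -> bool:
    """Flag samples where one repeated line dominates (a crude repetition check)."""
    lines = [l for l in text.splitlines() if l.strip()]
    if len(lines) < 2:        # too little structure to judge repetition
        return True
    most_common = max(lines.count(l) for l in set(lines))
    return most_common / len(lines) <= max_ratio

def evaluate(sample: str) -> dict:
    """Run all rules on one sample and report which ones failed."""
    rules = {"min_length": rule_min_length, "no_repetition": rule_no_repetition}
    failures = [name for name, rule in rules.items() if not rule(sample)]
    return {"passed": not failures, "failed_rules": failures}

print(evaluate("hi"))  # fails min_length
```

A real deployment would swap these toy predicates for Dingo's built-in rule groups or LLM prompts, but the shape — a set of named predicates producing per-sample diagnostics — is the same.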

Use Cases

  • Pretraining data filtering: detect and remove low-quality samples before training.
  • Fine-tuning data auditing: check SFT datasets for consistency and harmful content.
  • Evaluation pipelines: integrate into CI to automate dataset and model quality checks.
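A pretraining filter of this kind can be sketched as a single pass over a JSONL dataset, dropping records that fail a quality predicate. The predicate, thresholds, and `"text"` field name below are illustrative assumptions:

```python
import json

def is_high_quality(text: str) -> bool:
    """Illustrative predicate: reasonably long and not mostly punctuation."""
    stripped = text.strip()
    alnum = sum(c.isalnum() for c in stripped)
    return len(stripped) >= 20 and alnum / max(len(stripped), 1) > 0.5

def filter_jsonl(lines: list[str], field: str = "text") -> list[dict]:
    """Keep only records whose text field passes the quality check."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if is_high_quality(record.get(field, "")):
            kept.append(record)
    return kept

raw = [
    json.dumps({"text": "A clean, informative training sample about data quality."}),
    json.dumps({"text": "!!! ??? ..."}),
]
print(len(filter_jsonl(raw)))  # → 1
```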

Technical Highlights

  • Extensible rule system: register custom rules and prompts for domain-specific checks.
  • LLM-assisted evaluation: configure OpenAI or local models for semantic assessments.
  • Traceable outputs: produces score summaries and per-sample diagnostics for easy triage.
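An extensible rule registry of this kind can be sketched with a decorator; the decorator name, registry shape, and rules below are assumptions for illustration, not Dingo's actual API:

```python
RULES = {}

def register_rule(name: str):
    """Decorator that adds a rule function to the global registry."""
    def wrapper(func):
        RULES[name] = func
        return func
    return wrapper

@register_rule("no_placeholder")
def no_placeholder(text: str) -> bool:
    """Reject samples containing obvious template placeholders."""
    return "TODO" not in text and "lorem ipsum" not in text.lower()

@register_rule("terminal_punctuation")
def terminal_punctuation(text: str) -> bool:
    """Require the sample to end like a finished sentence."""
    return text.rstrip().endswith((".", "!", "?"))

def run_all(text: str) -> dict:
    """Apply every registered rule and collect per-rule pass/fail results."""
    return {name: rule(text) for name, rule in RULES.items()}

print(run_all("Lorem ipsum dolor"))
```

Registering a domain-specific check is then a matter of defining one decorated function, which is the design that makes rule systems like this easy to extend.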


Resource Info

  • Author: MigoXLab / DataEval
  • Added: 2025-10-02
  • Open source since: 2024-12-24
  • Tags: Open Source, Evaluation