A curated list of AI tools and resources for developers, see the AI Resources .

Rhesis

An open-source testing platform and SDK for LLM and agentic applications that generates test scenarios and evaluates model outputs.

Detailed Introduction

Rhesis is an open-source testing platform and SDK for large language model (LLM) and agentic applications. Teams describe what their app should and should not do in plain language; Rhesis then generates hundreds of single-turn and multi-turn test scenarios (including adversarial prompts), runs them against the target application, and highlights failures such as hallucinations, data leakage, or policy violations. The platform includes a review UI, SDK, and CI integrations to help cross-functional teams find and fix issues before production.

Main Features

  • AI-driven test generation: produce broad coverage of adversarial and edge-case inputs for single-turn and multi-turn flows.
  • LLM-based evaluation: automatically score outputs against requirements using LLM evaluators.
  • Collaboration workflow: comments, issues, and review tools so non-engineers can define requirements and review results.
  • Flexible deployment: hosted service or self-hosted Docker stack, with CI/CD friendly interfaces.

Use Cases

  • Pre-production testing for chatbots, RAG systems, and agentic applications to catch regressions and safety issues.
  • Integrating automated tests into CI pipelines to prevent unsafe model versions from reaching production.
  • Compliance and product teams validating model behavior against policy requirements at scale.

Technical Features

  • Supports single-turn and multi-turn (Penelope) testing to simulate realistic conversation chains.
  • Built-in metrics library (RAGAS, DeepEval, etc.) and visual reports for diagnostics.
  • SDK and API support for IDE-based workflows and scripted test automation.
  • Open-source with modular architecture and community contributions.
Rhesis
Resource Info
📝 Evaluation 📚 RAG 🛠️ Dev Tools 🌱 Open Source