GEPA: Optimize Prompts, Code, and More with AI-Powered Reflective Text Evolution

Link: https://github.com/gepa-ai/gepa

Docs: https://gepa-ai.github.io/gepa/

GEPA (Genetic-Pareto) is a framework for optimizing any text-based component of an AI system — prompts, code snippets, agent configurations, evaluation rubrics — without needing model weights or large training datasets. The key innovation is treating LLM execution traces as "actionable side information." Rather than collapsing a run into a single pass/fail score, GEPA reads the full trace, diagnoses specifically why candidates failed, and uses that diagnosis to drive targeted mutations. The loop is: Select a candidate from the Pareto frontier → Execute on test cases → Reflect (LLM analyzes what went wrong) → Mutate (generate improved variants) → Accept if better. This is textual gradient descent without the gradient.
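
To make the loop concrete, here is a minimal toy sketch of select → execute → reflect → mutate → accept. It is not GEPA's actual code or API: every function name here (`execute`, `reflect`, `mutate`, `optimize`) is hypothetical, the "system" is a stand-in that passes a test case only when the prompt mentions it, and the reflection step is a plain function where GEPA would call an LLM. Selection is also simplified to "best so far" rather than sampling from a Pareto frontier.

```python
from typing import List, Tuple

# (case name, passed?, diagnostic note) -- a toy stand-in for a full execution trace
Trace = Tuple[str, bool, str]


def execute(prompt: str, cases: List[str]) -> List[Trace]:
    """Toy system under test: a case passes only if the prompt mentions it."""
    traces = []
    for case in cases:
        passed = case in prompt
        note = "ok" if passed else f"prompt never mentions '{case}'"
        traces.append((case, passed, note))
    return traces


def score(traces: List[Trace]) -> float:
    """Fraction of test cases that passed."""
    return sum(passed for _, passed, _ in traces) / len(traces)


def reflect(traces: List[Trace]) -> List[str]:
    """Stand-in for LLM reflection: read the traces, return the failing cases."""
    return [case for case, passed, _ in traces if not passed]


def mutate(prompt: str, diagnosis: List[str]) -> str:
    """Targeted mutation: patch the first diagnosed failure."""
    return f"{prompt} Always handle {diagnosis[0]}."


def optimize(seed: str, cases: List[str], budget: int) -> Tuple[str, float]:
    """Run at most `budget` reflect-and-mutate iterations; return the best candidate."""
    pool = [(seed, score(execute(seed, cases)))]
    for _ in range(budget):
        # Select (simplified): take the best candidate found so far.
        parent, parent_score = max(pool, key=lambda c: c[1])
        traces = execute(parent, cases)          # Execute
        diagnosis = reflect(traces)              # Reflect
        if not diagnosis:                        # nothing left to fix
            break
        child = mutate(parent, diagnosis)        # Mutate
        child_score = score(execute(child, cases))
        if child_score > parent_score:           # Accept if better
            pool.append((child, child_score))
    return max(pool, key=lambda c: c[1])


if __name__ == "__main__":
    cases = ["units", "edge cases", "citations"]
    best_prompt, best_score = optimize("Answer the question.", cases, budget=10)
    print(best_score, "|", best_prompt)
```

The point of the sketch is the information flow: the mutation is driven by what the trace says went wrong, not by the scalar score alone. The score is only used for the accept decision.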

The efficiency gap versus reinforcement learning approaches is dramatic: GEPA reaches its results with 100–500 evaluations where RL methods require 5,000–25,000+. On the AIME 2025 math benchmark, it improved GPT-4.1 Mini from 46.6% to 56.6% accuracy. Starting from a basic ChainOfThought program at 67% on the MATH benchmark, GEPA evolved a multi-step reasoning program reaching 93%. It requires as few as 3 training examples, works with API-only models (no weight access), and is framework-agnostic: the same optimizer produces consistent results across DSPy, OpenAI SDK, CrewAI, Google ADK, and others. Companies including Databricks, Shopify, and Pydantic have adopted it.

The framing that makes GEPA immediately applicable: if you have a pipeline where some step is text-parameterized and you have a way to evaluate output quality, GEPA can optimize that step without manual prompt iteration. For anyone building multi-step AI pipelines (conversation intelligence, real-time cue systems, retrieval pipelines), this is a practical tool worth wiring in today. The Pareto-aware candidate selection is particularly useful when you have competing objectives (e.g., accuracy vs. latency), a situation that describes nearly every real production system.
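
For readers who have not worked with Pareto frontiers, a small sketch of the underlying idea may help. It illustrates general Pareto dominance over competing objectives, not GEPA's internal selection code. The candidate names and the numbers are made up for illustration, and the helper names `dominates` and `pareto_front` are hypothetical. Latency is negated so that "higher is better" holds for every objective.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one value per objective, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


if __name__ == "__main__":
    # Illustrative numbers only: (accuracy, -latency in seconds).
    candidates = {
        "terse": (0.70, -0.2),
        "balanced": (0.80, -0.5),
        "verbose": (0.85, -1.5),
        "bloated": (0.78, -1.6),  # less accurate AND slower than "balanced"
    }
    print(pareto_front(candidates))
```

Here "bloated" drops out because "balanced" beats it on both axes, while the other three each represent a different trade-off and all stay on the frontier. Keeping the whole frontier, rather than only the single top scorer, is what lets an optimizer keep exploring candidates that win on one objective while losing on another.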