Leva - Flexible Evaluation Framework for Language Models


Leva is a Ruby on Rails framework for evaluating Large Language Models (LLMs) against datasets built from your production ActiveRecord models. It provides a flexible structure for creating experiments, managing datasets, and implementing evaluation logic on production data with security in mind.


Installation

Add this line to your application's Gemfile:

gem 'leva'

And then execute:

bundle install

Add the migrations to your database:

rails leva:install:migrations
rails db:migrate

Mount the Leva engine in your application's routes file:

# config/routes.rb
Rails.application.routes.draw do
  mount Leva::Engine => "/leva"
  # your other routes...
end

The Leva UI will then be available at /leva in your application.
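
Because the dashboard exposes production data, you will usually want to restrict who can reach it. A minimal sketch using Devise's authenticate routing helper (Devise is an assumption here; any authentication constraint works):

# config/routes.rb
authenticate :user, ->(user) { user.admin? } do
  mount Leva::Engine => "/leva"
end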

Usage

1. Setting up Datasets

First, create a dataset and add any ActiveRecord records you want to evaluate against. To make your models compatible with Leva, include the Leva::Recordable concern in your model:

class TextContent < ApplicationRecord
  include Leva::Recordable

  # @return [String] The ground truth label for the record
  def ground_truth
    expected_label
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset records index
  def index_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset record show view
  def show_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be passed to the LLM as context
  def to_llm_context
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # Optional: Override for DSPy optimization (falls back to to_llm_context if not defined).
  # Use this to provide a simplified context with only the fields needed for optimization.
  # All values must be strings (nil values are automatically converted to empty strings).
  # @return [Hash<Symbol, String>] Context hash for DSPy optimization
  def to_dspy_context
    { text: text }
  end
end

dataset = Leva::Dataset.create(name: "Sentiment Analysis Dataset")
dataset.add_record TextContent.create(text: "I love this product!", expected_label: "Positive")
dataset.add_record TextContent.create(text: "Terrible experience", expected_label: "Negative")
dataset.add_record TextContent.create(text: "It's ok", expected_label: "Neutral")

2. Implementing Runs

Create a run class to handle the execution of your inference logic:

rails generate leva:runner sentiment

class SentimentRun < Leva::BaseRun
  def execute(record)
    # Your model execution logic here
    # This could involve calling an API, running a local model, etc.
    # Return the model's output
  end
end
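
The value returned by execute is stored as the prediction and later handed to your evals. As an illustration only (Leva does not require any particular client), here is a sketch that classifies sentiment with the ruby-openai gem; the gem choice, model name, and prompt wording are all assumptions:

class SentimentRun < Leva::BaseRun
  def execute(record)
    # Hypothetical client call; anything that returns a string works here.
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
    response = client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "Classify the sentiment of the text as Positive, Negative, or Neutral." },
          { role: "user", content: record.text }
        ],
        temperature: 0
      }
    )
    response.dig("choices", 0, "message", "content").to_s.strip
  end
end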

3. Implementing Evals

Create one or more eval classes to evaluate the model's output:

rails generate leva:eval sentiment_accuracy

class SentimentAccuracyEval < Leva::BaseEval
  def evaluate(prediction, record)
    score = prediction == record.expected_label ? 1.0 : 0.0
    [score, record.expected_label]
  end
end

class SentimentF1Eval < Leva::BaseEval
  def evaluate(prediction, record)
    # Calculate F1 score
    # ...
    [f1_score, record.expected_label]
  end
end

4. Running Experiments

You can run experiments with different runs and evals:

experiment = Leva::Experiment.create!(name: "Sentiment Analysis", dataset: dataset)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)

5. Using Prompts

You can also use prompts with your runs:

prompt = Leva::Prompt.create!(
  name: "Sentiment Analysis",
  version: 1,
  system_prompt: "You are an expert at analyzing text and returning the sentiment.",
  user_prompt: "Please analyze the following text and return the sentiment as Positive, Negative, or Neutral.\n\n{{TEXT}}",
  metadata: { model: "gpt-4", temperature: 0.5 }
)

experiment = Leva::Experiment.create!(
  name: "Sentiment Analysis with LLM",
  dataset: dataset,
  prompt: prompt
)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)
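
The {{TEXT}} placeholder in user_prompt is a template variable to be filled with each record's content at run time. How a run receives the prompt isn't shown here, but assuming your run has both the prompt and the record in hand, the substitution can be as simple as:

# A sketch: fill the template before sending it to the LLM.
filled_prompt = prompt.user_prompt.gsub("{{TEXT}}", record.text)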

6. Analyzing Results

After the experiments are complete, analyze the results:

experiment.evaluation_results.group_by(&:evaluator_class).each do |evaluator_class, results|
  average_score = results.sum(&:score) / results.size.to_f
  puts "#{evaluator_class} Average Score: #{average_score}"
end

Prompt Optimization (DSPy Integration)

Leva includes optional prompt optimization powered by DSPy.rb. This feature automatically finds optimal prompts and few-shot examples for your datasets.

Requirements:

  • Ruby 3.3.0 or higher
  • DSPy gem and optional optimizer gems

Installation

Add the DSPy gems to your Gemfile:

gem "dspy"           # Core DSPy functionality (required)
gem "dspy-ruby_llm"  # RubyLLM provider adapter (required)
gem "dspy-gepa"      # GEPA optimizer (optional, recommended)
gem "dspy-miprov2"   # MIPROv2 optimizer (optional)

You can use any DSPy provider adapter instead of dspy-ruby_llm, such as dspy-openai or dspy-anthropic.

Available Optimizers

  • Bootstrap (best for quick iteration, small datasets): Fast selection of few-shot examples. No gem required.
  • GEPA (best for maximum quality): State-of-the-art reflective prompt evolution. 10-14% better than MIPROv2.
  • MIPROv2 (best for large datasets, 200+): Bayesian optimization for instructions and examples.

Usage

# Create an optimizer for your dataset
optimizer = Leva::PromptOptimizer.new(
  dataset: dataset,
  optimizer: :gepa,      # :bootstrap, :gepa, or :miprov2
  mode: :medium,         # :light, :medium, or :heavy
  model: "claude-opus-4-5"   # Any model supported by RubyLLM
)

# Run optimization
result = optimizer.optimize

# Result contains optimized prompts
result[:system_prompt]  # Optimized instruction
result[:user_prompt]    # Template with Liquid variables
result[:metadata]       # Score, examples, and optimization details
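
The optimizer returns plain values, so a natural follow-up (composing only the pieces shown above, not a dedicated API) is to save them as a new Leva::Prompt and attach it to an experiment; the name and version below are illustrative:

optimized = Leva::Prompt.create!(
  name: "Sentiment Analysis (optimized)",
  version: 2,
  system_prompt: result[:system_prompt],
  user_prompt: result[:user_prompt],
  metadata: result[:metadata]
)

experiment = Leva::Experiment.create!(
  name: "Optimized Sentiment Analysis",
  dataset: dataset,
  prompt: optimized
)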

Optimization Modes

  • :light (~5 min): Quick experiments
  • :medium (~15 min): Balanced quality/speed
  • :heavy (~30 min): Production prompts

Configuration

Set any required API keys and other provider configuration in your Rails credentials or environment variables.
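
For example, if your runs call OpenAI as in the earlier sketch, the key can live in encrypted credentials or an environment variable (the key names are illustrative):

# bin/rails credentials:edit
#   openai_api_key: sk-...
api_key = Rails.application.credentials.openai_api_key || ENV["OPENAI_API_KEY"]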

Leva's Components

Classes

  • Leva: Handles the process of running experiments.
  • Leva::BaseRun: Base class for run implementations.
  • Leva::BaseEval: Base class for eval implementations.

Models

  • Leva::Dataset: Represents a collection of data to be evaluated.
  • Leva::DatasetRecord: Represents individual records within a dataset.
  • Leva::Experiment: Represents a single run of an evaluation on a dataset.
  • Leva::RunnerResult: Stores the results of each run execution.
  • Leva::EvaluationResult: Stores the results of each evaluation.
  • Leva::Prompt: Represents a prompt for an LLM.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/kieranklaassen/leva.

License

The gem is available as open source under the terms of the MIT License.

Roadmap

  • Parallelize evaluation