Leva - Flexible Evaluation Framework for Language Models


Leva is a Ruby on Rails framework for evaluating Large Language Models (LLMs) against datasets built from your production ActiveRecord models. It provides a flexible structure for creating experiments, managing datasets, and implementing evaluation logic on production data with security in mind.


Installation

Add this line to your application's Gemfile:

gem 'leva'

And then execute:

bundle install

Add the migrations to your database:

rails leva:install:migrations
rails db:migrate

Mount the Leva engine in your application's routes file:

# config/routes.rb
Rails.application.routes.draw do
  mount Leva::Engine => "/leva"
  # your other routes...
end

The Leva UI will then be available at /leva in your application.
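
Because the dashboard exposes production data, you will usually want to restrict who can reach it. A minimal sketch using Devise's authenticate routing helper (Devise is an assumption here; any authentication constraint works):

# config/routes.rb
authenticate :user, ->(user) { user.admin? } do
  mount Leva::Engine => "/leva"
end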

Usage

1. Setting up Datasets

First, create a dataset and add any ActiveRecord records you want to evaluate against. To make your models compatible with Leva, include the Leva::Recordable concern in your model:

class TextContent < ApplicationRecord
  include Leva::Recordable

  # @return [String] The ground truth label for the record
  def ground_truth
    expected_label
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset records index
  def index_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be displayed in the dataset record show view
  def show_attributes
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # @return [Hash] A hash of attributes to be passed to the LLM as context
  def to_llm_context
    {
      text: text,
      expected_label: expected_label,
      created_at: created_at.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  # Optional: Override for DSPy optimization (falls back to to_llm_context if not defined).
  # Use this to provide a simplified context with only the fields needed for optimization.
  # All values must be strings (nil values are automatically converted to empty strings).
  # @return [Hash<Symbol, String>] Context hash for DSPy optimization
  def to_dspy_context
    { text: text }
  end
end

dataset = Leva::Dataset.create(name: "Sentiment Analysis Dataset")
dataset.add_record TextContent.create(text: "I love this product!", expected_label: "Positive")
dataset.add_record TextContent.create(text: "Terrible experience", expected_label: "Negative")
dataset.add_record TextContent.create(text: "It's ok", expected_label: "Neutral")

2. Implementing Runs

Create a run class to handle the execution of your inference logic:

rails generate leva:runner sentiment

class SentimentRun < Leva::BaseRun
  def execute(record)
    # Your model execution logic here
    # This could involve calling an API, running a local model, etc.
    # Return the model's output
  end
end
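
The value returned by execute is stored as the prediction and later handed to your evals. As an illustration only (Leva does not require any particular client), here is a sketch that classifies sentiment with the ruby-openai gem; the gem choice, model name, and prompt wording are all assumptions:

class SentimentRun < Leva::BaseRun
  def execute(record)
    # Hypothetical client call; anything that returns a string works here.
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
    response = client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [
          { role: "system", content: "Classify the sentiment of the text as Positive, Negative, or Neutral." },
          { role: "user", content: record.text }
        ],
        temperature: 0
      }
    )
    response.dig("choices", 0, "message", "content").to_s.strip
  end
end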

3. Implementing Evals

Create one or more eval classes to evaluate the model's output:

rails generate leva:eval sentiment_accuracy

class SentimentAccuracyEval < Leva::BaseEval
  def evaluate(prediction, record)
    score = prediction == record.expected_label ? 1.0 : 0.0
    [score, record.expected_label]
  end
end

class SentimentF1Eval < Leva::BaseEval
  def evaluate(prediction, record)
    # Calculate F1 score
    # ...
    [f1_score, record.expected_label]
  end
end

4. Running Experiments

You can run experiments with different runs and evals:

experiment = Leva::Experiment.create!(name: "Sentiment Analysis", dataset: dataset)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)

5. Using Prompts

You can also use prompts with your runs:

prompt = Leva::Prompt.create!(
  name: "Sentiment Analysis",
  version: 1,
  system_prompt: "You are an expert at analyzing text and returning the sentiment.",
  user_prompt: "Please analyze the following text and return the sentiment as Positive, Negative, or Neutral.\n\n{{TEXT}}",
  metadata: { model: "gpt-4", temperature: 0.5 }
)

experiment = Leva::Experiment.create!(
  name: "Sentiment Analysis with LLM",
  dataset: dataset,
  prompt: prompt
)

run = SentimentRun.new
evals = [SentimentAccuracyEval.new, SentimentF1Eval.new]

Leva.run_evaluation(experiment: experiment, run: run, evals: evals)
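
The {{TEXT}} placeholder in user_prompt is a template variable to be filled with each record's content at run time. How a run receives the prompt isn't shown here, but assuming your run has both the prompt and the record in hand, the substitution can be as simple as:

# A sketch: fill the template before sending it to the LLM.
filled_prompt = prompt.user_prompt.gsub("{{TEXT}}", record.text)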

6. Analyzing Results

After the experiments are complete, analyze the results:

experiment.evaluation_results.group_by(&:evaluator_class).each do |evaluator_class, results|
  average_score = results.sum(&:score) / results.size.to_f
  puts "#{evaluator_class} Average Score: #{average_score}"
end

Prompt Optimization (DSPy Integration)

Leva includes optional prompt optimization powered by DSPy.rb. This feature automatically finds optimal prompts and few-shot examples for your datasets.

Requirements:

  • Ruby 3.3.0 or higher
  • DSPy gem and optional optimizer gems

Installation

Add the DSPy gems to your Gemfile:

gem "dspy"           # Core DSPy functionality (required)
gem "dspy-ruby_llm"  # RubyLLM provider adapter (required)
gem "dspy-gepa"      # GEPA optimizer (optional, recommended)
gem "dspy-miprov2"   # MIPROv2 optimizer (optional)

You can use any DSPy provider adapter instead of dspy-ruby_llm, such as dspy-openai or dspy-anthropic.

Available Optimizers

  • Bootstrap (best for quick iteration, small datasets): Fast selection of few-shot examples. No gem required.
  • GEPA (best for maximum quality): State-of-the-art reflective prompt evolution. 10-14% better than MIPROv2.
  • MIPROv2 (best for large datasets, 200+): Bayesian optimization for instructions and examples.

Usage

# Create an optimizer for your dataset
optimizer = Leva::PromptOptimizer.new(
  dataset: dataset,
  optimizer: :gepa,      # :bootstrap, :gepa, or :miprov2
  mode: :medium,         # :light, :medium, or :heavy
  model: "claude-opus-4-5"   # Any model supported by RubyLLM
)

# Run optimization
result = optimizer.optimize

# Result contains optimized prompts
result[:system_prompt]  # Optimized instruction
result[:user_prompt]    # Template with Liquid variables
result[:metadata]       # Score, examples, and optimization details
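
The optimizer returns plain values, so a natural follow-up (composing only the pieces shown above, not a dedicated API) is to save them as a new Leva::Prompt and attach it to an experiment; the name and version below are illustrative:

optimized = Leva::Prompt.create!(
  name: "Sentiment Analysis (optimized)",
  version: 2,
  system_prompt: result[:system_prompt],
  user_prompt: result[:user_prompt],
  metadata: result[:metadata]
)

experiment = Leva::Experiment.create!(
  name: "Optimized Sentiment Analysis",
  dataset: dataset,
  prompt: optimized
)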

Optimization Modes

  • :light (~5 min): Quick experiments
  • :medium (~15 min): Balanced quality/speed
  • :heavy (~30 min): Production prompts

Configuration

Set any required API keys and other provider configuration in your Rails credentials or environment variables.
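
For example, if your runs call OpenAI as in the earlier sketch, the key can live in encrypted credentials or an environment variable (the key names are illustrative):

# bin/rails credentials:edit
#   openai_api_key: sk-...
api_key = Rails.application.credentials.openai_api_key || ENV["OPENAI_API_KEY"]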

Leva's Components

Classes

  • Leva: Handles the process of running experiments.
  • Leva::BaseRun: Base class for run implementations.
  • Leva::BaseEval: Base class for eval implementations.

Models

  • Leva::Dataset: Represents a collection of data to be evaluated.
  • Leva::DatasetRecord: Represents individual records within a dataset.
  • Leva::Experiment: Represents a single run of an evaluation on a dataset.
  • Leva::RunnerResult: Stores the results of each run execution.
  • Leva::EvaluationResult: Stores the results of each evaluation.
  • Leva::Prompt: Represents a prompt for an LLM.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/kieranklaassen/leva.

License

The gem is available as open source under the terms of the MIT License.

Roadmap

  • Parallelize evaluation