Evaluations

Evaluations is in beta

Evaluations is currently in beta. We'd love to hear your feedback as we develop this feature.

Evaluations automatically assess the quality of your LLM generations and return a pass/fail result with reasoning. PostHog supports two types of evaluations:

  • LLM-as-a-judge – Uses an LLM to score each generation against a prompt you define. Great for nuanced, subjective checks like tone, helpfulness, or hallucination detection.
  • Code-based (Hog) – Runs deterministic code you write against each generation. Great for rule-based checks like format validation, keyword detection, or length limits. Free to run with no LLM cost.

Why use evaluations?

  • Monitor output quality at scale – Automatically check if generations are helpful, relevant, or safe without manual review.
  • Detect problematic content – Catch hallucinations, toxicity, or jailbreak attempts before they reach users.
  • Track quality trends – See pass rates across models, prompts, or user segments over time.
  • Debug with reasoning – Each evaluation provides an explanation for its decision, making it easy to understand failures.

Choosing an evaluation type

| | LLM-as-a-judge | Code-based (Hog) |
|---|---|---|
| Best for | Subjective quality checks (tone, helpfulness, hallucination) | Deterministic rule-based checks (format, keywords, length) |
| Cost | LLM API call per evaluation | Free |
| Speed | Seconds | Milliseconds |
| Consistency | May vary between runs | Deterministic: same input always produces same result |
| Setup | Write a prompt | Write Hog code |

LLM-as-a-judge evaluations

How they work

When a generation is captured, PostHog samples it based on your configured rate (0.1% to 100%). If sampled, the generation's input and output are sent to an LLM judge with your evaluation prompt. The judge returns a boolean pass/fail result plus reasoning, which is stored and linked to the original generation.

You can optionally filter which generations get evaluated using event properties or person properties. For example, only evaluate generations from production, from a specific model, or above a certain cost threshold. You can also use person properties to exclude internal users or target specific user segments.

Built-in templates

PostHog provides five pre-built evaluation templates to get you started:

| Template | What it checks | Best for |
|---|---|---|
| Relevance | Whether the output addresses the user's input | Customer support bots, Q&A systems |
| Helpfulness | Whether the response is useful and actionable | Chat assistants, productivity tools |
| Jailbreak | Attempts to bypass safety guardrails | Security-sensitive applications |
| Hallucination | Made-up facts or unsupported claims | RAG systems, knowledge bases |
| Toxicity | Harmful, offensive, or inappropriate content | User-facing applications |

Creating an LLM judge evaluation

  1. Navigate to LLM analytics > Evaluations
  2. Click New evaluation
  3. Select LLM-as-a-judge as the evaluation type
  4. Choose a template or start from scratch
  5. Configure the evaluation:
    • Name: A descriptive name for the evaluation
    • Prompt: The instructions for the LLM judge (templates provide sensible defaults)
    • Sampling rate: Percentage of generations to evaluate (0.1% – 100%)
    • Property filters (optional): Narrow which generations to evaluate using event or person properties
  6. Enable the evaluation and click Save

Writing custom prompts

When creating a custom evaluation, your prompt should instruct the LLM judge to return true (pass) or false (fail) along with reasoning. The judge receives the generation's input and output for context.

Tips for effective evaluation prompts:

  • Be specific about what constitutes a pass or fail
  • Include examples of edge cases when relevant
  • Keep the prompt concise but comprehensive

Example custom prompt:

```text
You are evaluating whether an LLM response follows our brand voice guidelines.
Given the user input and assistant response, determine if the response:
- Uses a friendly, conversational tone
- Avoids corporate jargon
- Addresses the user by name when provided
Return true if the response follows these guidelines, false otherwise.
Explain your reasoning briefly.
```

Code-based evaluations (Hog)

Code-based evaluations run Hog code you write against each generation. They execute in milliseconds with zero LLM cost, making them ideal for high-volume, deterministic checks.

How they work

  1. You write Hog code that inspects the generation's input and output.
  2. On save, PostHog compiles your code to bytecode.
  3. When a generation is sampled, the code runs against it in PostHog's HogVM.
  4. Your code must return true (pass) or false (fail). Use print() to add reasoning.
  5. If allows_na is enabled, returning null marks the result as N/A (not applicable).

Code-based evaluations share the same sampling rate and property filter options as LLM judge evaluations.
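
Putting the steps above together, a minimal code-based evaluation might look like this. This is a sketch, not a template: the stack-trace check is a hypothetical example of a rule you might write.

```hog
// Minimal sketch of the contract described in the steps above:
// return true to pass, false to fail, and null (with allows_na
// enabled) to skip generations the check does not apply to.
let text := toString(ifNull(output, ''))
if (length(text) == 0) {
    print('No output to evaluate')
    return null
}
if (text ilike '%traceback%') {
    // Hypothetical check: flag raw stack traces leaking into responses
    print('Output appears to contain a stack trace')
    return false
}
print('Output looks clean')
return true
```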

Available globals

Your Hog code has access to these variables:

| Variable | Type | Description |
|---|---|---|
| input | string or object | The LLM input (prompt or messages array) |
| output | string or object | The LLM output (response or choices) |
| properties | object | All event properties from the generation |
| event.uuid | string | The event UUID |
| event.event | string | The event name |
| event.distinct_id | string | The distinct ID of the user |
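
As a quick illustration, a check can combine these globals to log context and gate on the output. This sketch uses only functions that appear in the examples later on this page:

```hog
// Sketch: log evaluation context from the available globals,
// then pass only if the generation produced any output.
let text := toString(ifNull(output, ''))
print('Event ' + event.event + ' (' + event.uuid + ')')
print('Model: ' + toString(ifNull(properties.$ai_model, 'unknown')))
if (length(text) == 0) {
    print('No output captured for distinct_id ' + event.distinct_id)
    return false
}
return true
```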

Creating a code-based evaluation

  1. Navigate to LLM analytics > Evaluations
  2. Click New evaluation
  3. Select Code-based (Hog) as the evaluation type
  4. Write your Hog code in the editor
  5. Click Test on sample to run your code against recent generations and verify it works
  6. Configure:
    • Name: A descriptive name for the evaluation
    • Sampling rate: Percentage of generations to evaluate (0.1% – 100%)
    • Allows N/A: Whether your code can return null to skip inapplicable generations
    • Property filters (optional): Narrow which generations to evaluate
  7. Enable the evaluation and click Save

Writing Hog evaluation code

Your code must return a boolean: true for pass, false for fail. Use print() statements to provide reasoning — the output is captured and stored alongside the result.

Hog tip: Use single quotes for strings ('hello'), length() instead of len(), and wrap property access with ifNull() to avoid null comparison errors.
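
For instance, a numeric property that is absent from some events can trigger null comparison errors if compared directly. Wrapping the access with ifNull() avoids this (the $ai_temperature property here is a hypothetical example):

```hog
// Hypothetical property: $ai_temperature may be absent on some events.
// Comparing a missing property directly can cause null comparison
// errors, so wrap the access with ifNull() first.
let temperature := ifNull(properties.$ai_temperature, 0)
if (temperature > 1) {
    print('Unusually high temperature: ' + toString(temperature))
    return false
}
print('Temperature within range: ' + toString(temperature))
return true
```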

Check output length:

```hog
let maxLength := 2000
let outputLength := length(ifNull(output, ''))
if (outputLength > maxLength) {
    print('Output too long: ' + toString(outputLength) + ' characters (max ' + toString(maxLength) + ')')
    return false
}
print('Output length OK: ' + toString(outputLength) + ' characters')
return true
```

Check for required keywords:

```hog
let requiredKeywords := ['disclaimer', 'not financial advice']
let outputLower := lower(ifNull(output, ''))
// Hog arrays are 1-indexed, so iterate with for...in
// rather than a 0-based counter
for (let keyword in requiredKeywords) {
    if (outputLower !ilike '%' + keyword + '%') {
        print('Missing required keyword: ' + keyword)
        return false
    }
}
print('All required keywords found')
return true
```

Check model and cost thresholds:

```hog
let model := ifNull(properties.$ai_model, 'unknown')
let cost := ifNull(properties.$ai_total_cost_usd, 0)
let maxCost := 0.05
if (cost > maxCost) {
    print('Cost too high for model ' + model + ': $' + toString(cost))
    return false
}
print('Cost within budget: $' + toString(cost))
return true
```

Return N/A for non-applicable generations (requires allows_na enabled):

```hog
let model := ifNull(properties.$ai_model, '')
// Only evaluate GPT-4 generations
if (model !ilike '%gpt-4%') {
    print('Skipping non-GPT-4 model: ' + model)
    return null
}
// Check that the output is non-empty
if (length(ifNull(output, '')) == 0) {
    print('Empty output from GPT-4')
    return false
}
print('GPT-4 output OK')
return true
```

Testing on sample data

Before saving, use the Test on sample button to run your code against recent generations. This shows:

  • The input and output from each sampled generation
  • Whether your code returned pass, fail, or N/A
  • Any print() output (reasoning)
  • Any errors in your code

Testing does not create evaluation events or affect your data — it runs entirely in preview mode.

Using AI to generate evaluations

Click Generate with AI in the code editor to open Max, PostHog's AI assistant, with your evaluation context pre-loaded. Max can help you:

  • Write Hog evaluation code from a description of what you want to check
  • Debug errors in your existing code
  • Iterate on the logic by testing and refining

Viewing results

The Evaluations page shows all your evaluations with their pass rates and recent activity. Click an evaluation to see its run history, including individual pass/fail results and the reasoning from the evaluation.

You can also filter generations by evaluation results or create insights based on evaluation data to build quality monitoring dashboards.

Pricing

Each evaluation run counts as one LLM analytics event toward your quota.

LLM judge evaluations use an LLM to score your generations. Your first 100 evaluation runs are on us so you can try the feature right away. After that, add your own API key from OpenAI, Google Gemini, Anthropic, OpenRouter, or Fireworks in Settings > LLM analytics to keep running evaluations.

If a provider API key becomes invalid or encounters an error, PostHog displays a warning banner on the evaluations page so you can take action quickly. Update or replace the key in Settings > LLM analytics.

Code-based evaluations have no LLM cost — they run your Hog code directly with no external API calls.

Use sampling rates strategically to balance coverage and cost. A 5–10% sampling rate often provides sufficient signal for quality monitoring; for example, at 100,000 generations per day, a 5% rate produces 5,000 evaluation runs per day.
