What are you trying to do?
Start with your role. Each guide covers specific jobs with step-by-step workflows, not feature lists.
AI Product & Automation
You ship model updates and have no reliable way to know if anything got better.
Compare releases on real outcomes
Upload conversations from each agent version as a separate import. Name them clearly: "Support Bot v2.1" and "Support Bot v2.2". After scanning both, compare Task Success rates and failure pattern counts across imports.
If v2.2 shows lower Task Success than v2.1 on the same interaction types, you have a regression. You know before users do.
Release comparison workflow
What changed between versions is not always what you think. Patterns surface the actual failure mode, not just the score delta.
Turn production failures into regression test coverage
When an interaction fails in production, it should become a test before the next launch. The loop: scan interactions, find failure pattern, create an agent test from the scenario, run before every release.
Production-to-test flywheel
This compounds. The longer you run Tovix, the more production-proven coverage your test suite has.
Show stakeholders the release improved
After scanning both imports, the comparison is available directly in Tovix. No spreadsheet needed.
This is the before/after your stakeholders can read without needing to understand how LLM evaluation works.
Customer Success & Operations
Your containment rate looks fine. Your customers are still escalating.
Find the conversations that need attention
Not all failures are equal. Start with the ones that cost you customers.
Finding conversations that need attention
Abandoned interactions are often more valuable than failed ones. The user had intent but hit a wall. The wall is fixable.
Prioritize what to fix by business impact
Patterns do the prioritization for you. A pattern with 40 interactions at an average score of 38 is a known, repeating problem, not a one-off.
Prioritizing by business impact
Know what is driving escalations before CSAT drops
Escalation signals appear at the interaction level before they show up in your support ticket volume.
Tracing escalation root cause
Risk, Legal & AI Governance
Your agent said something you did not approve. You found out from a complaint.
Route high-risk conversations to human review automatically
Define what "high risk" means for your context, then let Tovix find it.
High-risk conversation routing
Low Safety score is a leading indicator. Act on it before it becomes a complaint or a regulatory finding.
Prove compliance fixes held after a deployment
A fix is only a fix if it holds across future releases. Use Agent Testing to verify continuously.
Compliance fix verification loop
The test run history is your audit trail: pass/fail per deployment, full transcript, and Tovix's scoring rationale. All exportable to CSV.
Create an auditable record of AI risk decisions
Every high-risk conversation flagged by Tovix is stored with its score, signals, and the evaluator's reasoning. The record exists whether or not anyone asks for it.
The audit trail is a byproduct of the review workflow, not a separate process.
Engineering & Platform
Your CI passes. Your agent still fails real users on Monday.
Wire evaluation into your CI pipeline
The Public API accepts conversations from any server-side code. Run it after your integration tests, before merge.
CI pipeline integration
POST /api/public/evals/submit with your x-api-key header.
GET /api/public/evals/jobs/:jobId until status is completed.
See the Public API reference section for full request format, polling strategy, and error codes.
Use async mode for batches. Always include an Idempotency-Key header so retries are safe without creating duplicate jobs.
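Here is a minimal sketch of that CI step in Python. The endpoints, headers, and request fields come from the Public API reference below; the response shapes (the job id path and the overallScore field) are assumptions, so check them against your actual payloads.

import os
import time
import uuid
import requests

BASE_URL = "https://tovix.ai"
HEADERS = {
    "x-api-key": os.environ["TOVIX_API_KEY"],  # server-side only, never in browser code
    "Idempotency-Key": str(uuid.uuid4()),      # makes CI retries safe
}

# A conversation captured by your integration tests.
submission = {
    "name": "ci-refund-flow",
    "input": "Where is my refund?",
    "actualOutput": "Your refund was issued today and arrives in 3-5 business days.",
    "expectedOutput": "Confirms refund status and states the expected timeline.",
    "mode": "async",
}

resp = requests.post(f"{BASE_URL}/api/public/evals/submit", headers=HEADERS, json=submission)
resp.raise_for_status()
job_id = resp.json()["job"]["id"]  # assumed response shape; see Polling below

# Poll until the evaluation finishes, then gate the merge on the verdict.
for delay in (1, 2, 3, 3, 3):
    time.sleep(delay)
    job = requests.get(f"{BASE_URL}/api/public/evals/jobs/{job_id}", headers=HEADERS).json()
    if job["status"] in ("completed", "failed"):
        break

# Field name "overallScore" is an assumption; adjust to your workspace's payload.
assert job["status"] == "completed" and job["result"]["overallScore"] >= 50, "Evaluation regression"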
Connect your agent and run regression tests before launch
Agent regression testing workflow
Build a test suite from real production failures
Production failures are the best source of test cases because they represent real user intents, not hypothetical ones.
Start with your most common production failure. Everything else compounds from there.
Reference documentation
Complete technical reference for imports, evaluations, signals, APIs, and billing.
Getting Started
Sign up, import your first conversations, and get an evaluation in under 5 minutes.
1. Create your workspace
Go to tovix.ai, click Run a test, and sign in. A workspace is created for you automatically. Your data is fully isolated from other users.
2. Get data in
Two ways to import conversations:
Upload a JSON file from Imports -> New Import, or POST /api/public/evals/submit. Use the API from CI pipelines, backend services, or after every production run.
Your JSON file should be an array of conversation objects:
[{"id": "conv-1", "messages": [{"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Let me look that up for you."}]}]
3. Run an evaluation
Select interactions and click Scan. Results appear in seconds.
Each evaluation costs 1 credit. New workspaces start with 50 free credits, no card required.
Once results are in, go to Failure Patterns to see which problems repeat across conversations. That is the fastest path to the highest-impact fix.
Imports and Interactions
An import is a named batch of conversations, typically from one agent version or time window. Grouping by version lets you compare scores across releases.
Creating an import
Go to Imports -> New Import. Give it a descriptive name like "Customer Support Bot v2.1" or "Q4 Refund Flows" and upload your JSON file. Processing happens in the background; you will see results as they complete.
JSON format
Each conversation needs an id and a messages array. Every message needs a role (user or assistant) and content.
Optional but useful: channel, started_at, ended_at, and any metadata fields you want preserved in exports.
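For instance, one conversation object with the optional fields filled in might look like this (the metadata keys shown are illustrative):

[
  {
    "id": "conv-42",
    "channel": "web_chat",
    "started_at": "2024-11-02T14:03:00Z",
    "ended_at": "2024-11-02T14:07:30Z",
    "metadata": {"plan": "pro", "region": "EU"},
    "messages": [
      {"role": "user", "content": "Can I change my shipping address?"},
      {"role": "assistant", "content": "Yes, as long as the order has not shipped. Let me check its status."}
    ]
  }
]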
The Interactions view
After importing, every conversation appears in Interactions. Filter by outcome, score range, label, date, or import. Use bulk actions to scan selected conversations, export to CSV, or delete.
Clicking any interaction shows the full evaluation: score breakdown, issue blocks with turn references, root cause narrative, recommended fixes, and the complete transcript with issue indicators overlaid.
Exporting
Select interactions and click Download as CSV. The export includes all scores, labels, custom tags, summaries, root cause narratives, and timestamps, ready for Excel, BI tools, or fine-tuning datasets.
Evaluations
Every conversation is scored across multiple quality dimensions. Here is exactly what each score means and how to act on it.
Core metrics (0-100, higher is better)
Overall Score: weighted average of all dimensions. Below 50 means the interaction failed in a meaningful way.
Task Success: did the user accomplish what they came for? This is the most important single metric. An agent can be polite and clear and still fail here.
Factuality: did the agent state things that are true? Penalizes invented policy details, wrong facts, or claims the agent cannot verify.
Safety: did the agent stay within acceptable guardrails? Covers harmful instructions, disallowed content, and regulatory compliance.
Interaction Quality: clarity, structure, and tone. A response can be accurate but still score low here if it is confusing or inappropriately toned.
Outcome thresholds
Issue severity
Issues within an evaluation are tagged High (score < 50), Medium (50-70), or Low (70-90). The evaluation includes the specific turn where each issue occurred.
Acting on results
Low Task Success usually means the agent answered the question asked, not the question meant. Check whether the agent is inferring intent correctly.
Low Factuality usually means the agent is interpolating from training data rather than grounding in your knowledge base or tools. Add retrieval or tighten system prompts.
Low Safety means something got through a guardrail. Route similar conversations to human review while you investigate.
Signals Reference
Signals are specific, named behaviors detected in a conversation. They explain what happened. The score summarizes how well it went. You can filter, export, and trend on signals.
Interaction quality
These signals describe how the agent communicated, independent of whether the answer was correct.
Readability: the response is easy to read and scan. Clear language, good structure, appropriate length. Hard-to-read answers cause drop-off even when factually correct.
Coherence: the agent stays logically consistent throughout the turn and across turns. Contradictions force users to repeat themselves.
Tone fit: the agent's tone matches the situation. A casual response to an upset user, or a stiff response to a casual question, damages trust even when the content is right.
Truth and trust
Faithfulness: the agent stays grounded in the conversation context and does not invent details. Hallucinated assumptions erode trust quickly.
Hallucinated fact: the agent states an incorrect or unsupported claim as true. Confident errors are more damaging than honest uncertainty. Example: citing a feature that does not exist.
Unsupported inference: the agent draws a conclusion not justified by the available evidence. The leap may sound reasonable but can still be wrong.
Confidence miscalibration: the agent expresses more certainty than the evidence allows. Overconfidence discourages users from verifying before acting.
Premature commitment: the agent locks onto one interpretation or solution before confirming assumptions. Example: diagnosing a root cause before asking a single clarifying question.
Safety and boundaries
Harmfulness: the agent output could cause real harm: unsafe instructions, disallowed guidance, regulatory violations.
Policy boundary hit: the agent correctly stopped because a request crossed a safety or legal boundary. This is a good signal when appropriate.
Refusal quality: when the agent refuses, it explains why and offers a compliant alternative. A refusal that leaves the user stranded is a UX failure even if the refusal itself was correct.
Over-refusal: the agent blocks a valid request that could have been answered safely. Frustrates users and reduces adoption.
Under-refusal: the agent complies with a request it should have gated or refused. A safety and trust risk.
Leading indicator: escalation and abandonment spikes often follow over-refusal or unclear alternatives. Check these signals when containment rates drop.
Escalation signals
User requested human: the user explicitly asked for a person, manager, or ticket. Indicates lost confidence in the AI flow.
AI offered handoff: the agent offered to transfer to support. Shows the agent could not resolve within the interaction.
AI instructed contact support: the agent told the user to contact support without transferring. Sometimes correct; often means the agent gave up.
Safety gate escalation: the agent stopped due to a policy constraint. Appropriate in regulated contexts; harmful when overused.
Abandonment signals
Explicit abandonment: the user gave up. Example: "Never mind, I'll figure it out."
No reply after AI request: the agent asked the user to do something and the user never responded. Usually means the next step was too hard or unclear.
Timeout: the session ended without resolution. Measures real drop-off when timing data is available.
End states
Each interaction is assigned one of five end states:
Completed: Success: the user confirmed resolution. The business outcome was achieved.
Completed: Failure: the interaction ended without achieving the user's goal.
Escalated: the interaction shifted to a human channel. Correlates with cost and churn risk.
Abandoned: the user disengaged before resolution. Strong signal of friction.
Unknown: not enough evidence to classify. Prevents guessing in your metrics.
Behavioral tags
Tags are applied when a specific failure pattern is detected. They are designed for consistent diagnosis across teams.
missing_step: an essential step was skipped. The most common cause of "it sounded helpful but did not work."
incomplete_response: partial guidance without enough detail to act on.
unclear_guidance: steps are vague or ambiguous. "Check your settings" (which settings?) is a classic example.
repetitive: the agent repeats the same response without making progress. Signals that it is stuck.
off_topic: the agent drifted from the actual request.
policy_violation: the agent violated a safety or compliance boundary.
hallucinated_fact: an incorrect or unsupported fact was stated as true.
sycophancy: the agent over-agreed with the user at the expense of accuracy or usefulness.
over_refusal: a valid request was blocked without justification.
Custom Evaluators
Custom evaluators let you define domain-specific checks using your own LLM prompt. Use them for compliance rules, brand tone, product-specific accuracy, or any criterion the built-in metrics do not cover.
Creating an evaluator
Go to Evaluators -> Create Custom Evaluator. You need:
A name and a system prompt. The prompt must include an {interaction} placeholder where the conversation transcript will be inserted.
Writing a good system prompt
The most common mistake is a vague system prompt. Instead of "evaluate whether the agent was helpful", write: "Check whether the agent disclosed the 30-day return window before offering a refund. A score of 100 means the disclosure was clear and accurate. A score of 0 means it was missing or wrong."
Concrete, measurable criteria produce consistent scores. Vague criteria produce noise.
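Assembled, an illustrative evaluator prompt for the return-window check above might read (the exact wording is yours to adapt; the {interaction} placeholder and the missing-disclosure tag follow the conventions described in this section):

You are a compliance evaluator for a retail support agent.
Criterion: the agent must disclose the 30-day return window before offering a refund.
Score 100 if the disclosure is clear and accurate. Score 0 if it is missing or wrong. Interpolate for partial or ambiguous disclosures.
Return the tag missing-disclosure when the disclosure is absent.

Transcript:
{interaction}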
Running a custom evaluator
From Interactions, select the conversations you want to check and click Run Custom Evaluator. Choose your evaluator from the dropdown. Results are stored alongside the built-in evaluation and appear in the interaction detail view.
Custom evaluators also return tags (e.g. missing-disclosure) that show up in filters, labels, and CSV exports, so you can trend on them just like built-in signals.
Failure Patterns
Patterns surface systemic issues: failure modes that repeat across many conversations, not just one-off problems.
How patterns are generated
After scanning a batch, Tovix clusters interactions with similar failure labels and scores. Each cluster becomes a pattern with a name, severity, interaction count, average score, and task success rate.
Reading a pattern
A pattern like "Refusal on policy questions, 23 interactions, avg score 41" tells you that a specific class of user request is being handled badly at scale. All 23 interactions are linked so you can open any of them to see the exact transcript.
The most actionable patterns are high-severity with high interaction counts. Fix those first.
Acting on a pattern
Open a pattern, read a few of the underlying interactions, and look for what they have in common. Is it a missing knowledge source? An overly aggressive refusal rule? A tool that is not being called? A prompt that is ambiguous for a certain intent?
Patterns are not automatically resolved. After you ship a fix, re-scan the next import and check whether the pattern's interaction count and average score improved.
Agent Testing
Agent Testing runs autonomous simulated-user conversations against a live agent endpoint and scores the results. It catches regressions before users do.
Connect an agent
Go to Agent Testing -> Agents -> Connect Agent and select your provider:
Click Test Connection. Tovix runs a live handshake and shows the result of each step so you can pinpoint exactly where a configuration problem is.
Create a test
Go to Agent Testing -> Tests -> New Test, or use the guided setup chat which walks you through each field conversationally.
Choose a scenario type: happy_path, edge_case, adversarial, confusion_loop, or policy_probing.
Run a benchmark
Select tests and click Run. Choose which connected agents to benchmark against. You can run the same test against multiple agent versions at once to compare them.
Each test-agent pair is an independent job. 1 credit per evaluation.
Reading results
Each run shows a Pass or Fail verdict, the full turn-by-turn transcript, and Tovix's scoring and reasoning. The run history view lets you track whether a test that was failing has been fixed and whether it stays fixed across future releases.
Scheduling
Set a cron schedule on any test from the test detail page (for example, 0 3 * * * for a nightly 3 AM run). Scheduled runs are useful for overnight regression checks after deployments.
Public API
The Tovix Public API lets you submit conversations for evaluation from any server-side code: backend services, CI pipelines, data pipelines, or AI coding agents.
Authentication
Every request requires your API key in the x-api-key header. Generate keys from Settings -> API Keys. Keep keys server-side only and never expose them in browser code or mobile apps.
Base URL: https://tovix.ai
Endpoints
POST /api/public/evals/submit submits one conversation or a batch.
GET /api/public/evals/jobs/:jobId polls for results.
Submit a single conversation
Required fields: name (label for this evaluation), input (the user's last message), actualOutput (the agent's response), expectedOutput (your rubric, what a good response should do).
Optional: mode ("async" for background job, "sync" for inline result), metadata (pass conversation_id, channel, timestamps, or the full message array for context).
Always include an Idempotency-Key header (UUID v4). This lets you safely retry without creating duplicate jobs.
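A single sync-mode submission in Python might look like this sketch; the shape of the inline result is whatever your workspace returns, so inspect it before building on it:

import os
import uuid
import requests

resp = requests.post(
    "https://tovix.ai/api/public/evals/submit",
    headers={
        "x-api-key": os.environ["TOVIX_API_KEY"],
        "Idempotency-Key": str(uuid.uuid4()),
    },
    json={
        "name": "refund-disclosure-check",
        "input": "Can I return this after 45 days?",
        "actualOutput": "Of course, we accept returns at any time.",
        "expectedOutput": "Must state the 30-day return window accurately.",
        "mode": "sync",  # inline result; use "async" for background jobs
        "metadata": {"conversation_id": "conv-42", "channel": "web_chat"},
    },
)
resp.raise_for_status()
print(resp.json())  # sync mode returns the evaluation inline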
Submit a batch
Use the conversations array to send multiple conversations in one request. Set a top-level expectedOutput to apply the same rubric to all conversations, or override it per item.
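Sketched as a Python payload, a batch with a shared rubric and one per-item override could look like the following (field names mirror the single-conversation section above; treat the exact schema as an assumption to verify):

payload = {
    "mode": "async",
    "expectedOutput": "Resolves the user's request accurately and cites policy where relevant.",  # shared rubric
    "conversations": [
        {
            "name": "conv-1",
            "input": "Where is my order?",
            "actualOutput": "It shipped yesterday; tracking was emailed to you.",
        },
        {
            "name": "conv-2",
            "input": "Can I get a refund?",
            "actualOutput": "Sure, any time!",
            "expectedOutput": "Must state the 30-day return window.",  # per-item override
        },
    ],
}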
Polling
Async jobs return a job.id. Poll GET /api/public/evals/jobs/:jobId until status is completed or failed.
Recommended poll delays: 1 s, then 2 s, then 3 s on every subsequent attempt. Stop after 60 s; the job will still complete in the background, so poll again later if needed.
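That schedule, as a small Python helper. The status values follow the description above; treat the rest of the job shape as an assumption:

import time
import requests

def wait_for_job(job_id: str, api_key: str) -> dict:
    """Poll per the recommended schedule: 1 s, 2 s, then 3 s, stopping near 60 s."""
    delays = [1, 2] + [3] * 19  # 1 + 2 + 19*3 = 60 seconds of waiting
    for delay in delays:
        time.sleep(delay)
        job = requests.get(
            f"https://tovix.ai/api/public/evals/jobs/{job_id}",
            headers={"x-api-key": api_key},
        ).json()
        if job["status"] in ("completed", "failed"):
            return job
    return job  # still pending; the job completes server-side, poll again later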
Error codes
invalid_api_key: key is missing or wrong
insufficient_credits: balance is zero; top up in Settings -> Billing
rate_limited: too many requests; back off and retry
invalid_payload: request body is malformed or missing required fields
job_not_found: job ID does not exist in your workspace
internal_error: server error; retry with the same Idempotency-Key
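A retry wrapper that acts on these codes might look like the following sketch; the error payload shape (an "error" field in the response body) is an assumption, while the codes themselves are from the list above:

import time
import requests

def submit_with_retry(payload: dict, headers: dict, attempts: int = 5) -> dict:
    """Retry transient errors with the same Idempotency-Key so no duplicate jobs are created."""
    for attempt in range(attempts):
        resp = requests.post(
            "https://tovix.ai/api/public/evals/submit", headers=headers, json=payload
        )
        if resp.ok:
            return resp.json()
        error = resp.json().get("error")
        if error in ("rate_limited", "internal_error"):
            time.sleep(2 ** attempt)  # back off, then retry the same submission
            continue
        resp.raise_for_status()  # invalid_api_key, invalid_payload, etc. will not heal
    resp.raise_for_status()  # out of attempts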
Idempotency
Same key plus same payload returns the existing job. Same key plus different payload is treated as a new submission. Use the same key when retrying failures.
Credits and Billing
Credits are the unit of account for Tovix usage. One credit equals one evaluation scored by Tovix.
What costs a credit
Every evaluation Tovix scores costs 1 credit: scanning an interaction, running a custom evaluator on it, or scoring a test-agent pair in a benchmark.
Free starter credits
Every new workspace gets 50 free credits on signup. No credit card required. Enough to evaluate 50 conversations or run several agent tests.
Buying credits
Go to Settings -> Billing and purchase a credit pack. Credits are added to your balance instantly and expire after 90 days of inactivity. $29 per 1,000 credits.
Auto-recharge
Enable auto-recharge in Settings -> Billing to avoid interruptions during batch runs. When your balance drops below the threshold, Tovix automatically purchases a pack and charges your saved payment method.
Both values are configurable. You can disable auto-recharge at any time.
When credits run out
Submissions that would exceed your balance are rejected immediately with an insufficient_credits error before any processing starts. No partial charges. Top up your balance and resubmit. If auto-recharge is on, Tovix attempts to top up before rejecting.
Viewing usage
Settings -> Billing shows your current balance and full credit ledger: every evaluation charged, every pack purchased, and your remaining balance.
Ready to test your agent?
50 free credits. No credit card required.
Something missing or unclear? Email support@tovix.ai