What are you trying to do?
Start with your role. Each guide covers specific jobs with step-by-step workflows, not feature lists.
AI Product & Automation
You ship model updates and have no reliable way to know if anything got better.
Compare releases on real outcomes
Upload conversations from each agent version as a separate import. Name them clearly: "Support Bot v2.1" and "Support Bot v2.2". After scanning both, compare Task Success rates and failure pattern counts across imports.
If v2.2 shows lower Task Success than v2.1 on the same interaction types, you have a regression. You know before users do.
Release comparison workflow
What changed between versions is not always what you think. Patterns surface the actual failure mode, not just the score delta.
Turn production failures into regression test coverage
When an interaction fails in production, it should become a test before the next launch. The loop: scan interactions, find failure pattern, create an agent test from the scenario, run before every release.
Production-to-test flywheel
This compounds. The longer you run Tovix, the more production-proven coverage your test suite has.
Show stakeholders the release improved
After scanning both imports, the comparison is available directly in Tovix. No spreadsheet needed.
This is the before/after your stakeholders can read without needing to understand how LLM evaluation works.
Customer Success & Operations
Your containment rate looks fine. Your customers are still escalating.
Find the conversations that need attention
Not all failures are equal. Start with the ones that cost you customers.
Finding conversations that need attention
Abandoned interactions are often more valuable than failed ones. The user had intent but hit a wall. The wall is fixable.
Prioritize what to fix by business impact
Patterns do the prioritization for you. A pattern with 40 interactions at an average score of 38 is a known, repeating problem, not a one-off.
Prioritizing by business impact
Know what is driving escalations before CSAT drops
Escalation signals appear at the interaction level before they show up in your support ticket volume.
Tracing escalation root cause
Risk, Legal & AI Governance
Your agent said something you did not approve. You found out from a complaint.
Route high-risk conversations to human review automatically
Define what "high risk" means for your context, then let Tovix find it.
High-risk conversation routing
Low Safety score is a leading indicator. Act on it before it becomes a complaint or a regulatory finding.
Prove compliance fixes held after a deployment
A fix is only a fix if it holds across future releases. Use Agent Testing to verify continuously.
Compliance fix verification loop
The test run history is your audit trail: pass/fail per deployment, full transcript, and Tovix's scoring rationale. All exportable to CSV.
Create an auditable record of AI risk decisions
Every high-risk conversation flagged by Tovix is stored with its score, signals, and the evaluator's reasoning. The record exists whether or not anyone asks for it.
The audit trail is a byproduct of the review workflow, not a separate process.
Engineering & Platform
Your CI passes. Your agent still fails real users on Monday.
Wire evaluation into your CI pipeline
The Public API accepts conversations from any server-side code. Run it after your integration tests, before merge.
CI pipeline integration
POST /api/public/evals/submit with your x-api-key header.
GET /api/public/evals/jobs/:jobId until status is completed.
See the Public API reference section for full request format, polling strategy, and error codes.
Use async mode for batches. Always include an Idempotency-Key header so retries are safe without creating duplicate jobs.
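Here is a minimal sketch of that CI step in Python. The endpoints, headers, and request fields come from the Public API reference below; the response shapes (the job id path and the overallScore field) are assumptions, so check them against your actual payloads.

import os
import time
import uuid
import requests

BASE_URL = "https://tovix.ai"
HEADERS = {
    "x-api-key": os.environ["TOVIX_API_KEY"],  # server-side only, never in browser code
    "Idempotency-Key": str(uuid.uuid4()),      # makes CI retries safe
}

# A conversation captured by your integration tests.
submission = {
    "name": "ci-refund-flow",
    "input": "Where is my refund?",
    "actualOutput": "Your refund was issued today and arrives in 3-5 business days.",
    "expectedOutput": "Confirms refund status and states the expected timeline.",
    "mode": "async",
}

resp = requests.post(f"{BASE_URL}/api/public/evals/submit", headers=HEADERS, json=submission)
resp.raise_for_status()
job_id = resp.json()["job"]["id"]  # assumed response shape; see Polling below

# Poll until the evaluation finishes, then gate the merge on the verdict.
for delay in (1, 2, 3, 3, 3):
    time.sleep(delay)
    job = requests.get(f"{BASE_URL}/api/public/evals/jobs/{job_id}", headers=HEADERS).json()
    if job["status"] in ("completed", "failed"):
        break

# Field name "overallScore" is an assumption; adjust to your workspace's payload.
assert job["status"] == "completed" and job["result"]["overallScore"] >= 50, "Evaluation regression"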
Connect your agent and run regression tests before launch
Agent regression testing workflow
Build a test suite from real production failures
Production failures are the best source of test cases because they represent real user intents, not hypothetical ones.
Start with your most common production failure. Everything else compounds from there.
Reference documentation
Complete technical reference for imports, evaluations, signals, APIs, and billing.
Getting Started
Sign up, import your first conversations, and get an evaluation in under 5 minutes.
1. Create your workspace
Go to tovix.ai, click Run a test, and sign in. A workspace is created for you automatically. Your data is fully isolated from other users.
2. Get data in
Two ways to import conversations:
Upload a JSON file from Imports -> New Import, or POST /api/public/evals/submit. Use the API from CI pipelines, backend services, or after every production run.
Your JSON file should be an array of conversation objects:
[{"id": "conv-1", "messages": [{"role": "user", "content": "Where is my order?"}, {"role": "assistant", "content": "Let me look that up for you."}]}]
3. Run an evaluation
Select interactions and click Scan. Results appear in seconds.
Each evaluation costs 1 credit. New workspaces start with 50 free credits, no card required.
Once results are in, go to Failure Patterns to see which problems repeat across conversations. That is the fastest path to the highest-impact fix.
Imports and Interactions
An import is a named batch of conversations, typically from one agent version or time window. Grouping by version lets you compare scores across releases.
Creating an import
Go to Imports -> New Import. Give it a descriptive name like "Customer Support Bot v2.1" or "Q4 Refund Flows" and upload your JSON file. Processing happens in the background; you will see results as they complete.
JSON format
Each conversation needs an id and a messages array. Every message needs a role (user or assistant) and content.
Optional but useful: channel, started_at, ended_at, and any metadata fields you want preserved in exports.
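For instance, one conversation object with the optional fields filled in might look like this (the metadata keys shown are illustrative):

[
  {
    "id": "conv-42",
    "channel": "web_chat",
    "started_at": "2024-11-02T14:03:00Z",
    "ended_at": "2024-11-02T14:07:30Z",
    "metadata": {"plan": "pro", "region": "EU"},
    "messages": [
      {"role": "user", "content": "Can I change my shipping address?"},
      {"role": "assistant", "content": "Yes, as long as the order has not shipped. Let me check its status."}
    ]
  }
]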
The Interactions view
After importing, every conversation appears in Interactions. Filter by outcome, score range, label, date, or import. Use bulk actions to scan selected conversations, export to CSV, or delete.
Clicking any interaction shows the full evaluation: score breakdown, issue blocks with turn references, root cause narrative, recommended fixes, and the complete transcript with issue indicators overlaid.
Exporting
Select interactions and click Download as CSV. The export includes all scores, labels, custom tags, summaries, root cause narratives, and timestamps, ready for Excel, BI tools, or fine-tuning datasets.
Evaluations
Every conversation is scored across multiple quality dimensions. Here is exactly what each score means and how to act on it.
Core metrics (0-100, higher is better)
Overall Score: weighted average of all dimensions. Below 50 means the interaction failed in a meaningful way.
Task Success: did the user accomplish what they came for? This is the most important single metric. An agent can be polite and clear and still fail here.
Factuality: did the agent state things that are true? Penalizes invented policy details, wrong facts, or claims the agent cannot verify.
Safety: did the agent stay within acceptable guardrails? Covers harmful instructions, disallowed content, and regulatory compliance.
Interaction Quality: clarity, structure, and tone. A response can be accurate but still score low here if it is confusing or inappropriately toned.
Outcome thresholds
Issue severity
Issues within an evaluation are tagged High (score < 50), Medium (50-70), or Low (70-90). The evaluation includes the specific turn where each issue occurred.
Acting on results
Low Task Success usually means the agent answered the question asked, not the question meant. Check whether the agent is inferring intent correctly.
Low Factuality usually means the agent is interpolating from training data rather than grounding in your knowledge base or tools. Add retrieval or tighten system prompts.
Low Safety means something got through a guardrail. Route similar conversations to human review while you investigate.
Signals Reference
Signals are specific, named behaviors detected in a conversation. They explain what happened. The score summarizes how well it went. You can filter, export, and trend on signals.
Interaction quality
These signals describe how the agent communicated, independent of whether the answer was correct.
Readability: the response is easy to read and scan. Clear language, good structure, appropriate length. Hard-to-read answers cause drop-off even when factually correct.
Coherence: the agent stays logically consistent throughout the turn and across turns. Contradictions force users to repeat themselves.
Tone fit: the agent's tone matches the situation. A casual response to an upset user, or a stiff response to a casual question, damages trust even when the content is right.
Truth and trust
Faithfulness: the agent stays grounded in the conversation context and does not invent details. Hallucinated assumptions erode trust quickly.
Hallucinated fact: the agent states an incorrect or unsupported claim as true. Confident errors are more damaging than honest uncertainty. Example: citing a feature that does not exist.
Unsupported inference: the agent draws a conclusion not justified by the available evidence. The leap may sound reasonable but can still be wrong.
Confidence miscalibration: the agent expresses more certainty than the evidence allows. Overconfidence discourages users from verifying before acting.
Premature commitment: the agent locks onto one interpretation or solution before confirming assumptions. Example: diagnosing a root cause before asking a single clarifying question.
Safety and boundaries
Harmfulness: the agent output could cause real harm: unsafe instructions, disallowed guidance, regulatory violations.
Policy boundary hit: the agent correctly stopped because a request crossed a safety or legal boundary. This is a good signal when appropriate.
Refusal quality: when the agent refuses, it explains why and offers a compliant alternative. A refusal that leaves the user stranded is a UX failure even if the refusal itself was correct.
Over-refusal: the agent blocks a valid request that could have been answered safely. Frustrates users and reduces adoption.
Under-refusal: the agent complies with a request it should have gated or refused. A safety and trust risk.
Leading indicator: escalation and abandonment spikes often follow over-refusal or unclear alternatives. Check these signals when containment rates drop.
Escalation signals
User requested human: the user explicitly asked for a person, manager, or ticket. Indicates lost confidence in the AI flow.
AI offered handoff: the agent offered to transfer to support. Shows the agent could not resolve within the interaction.
AI instructed contact support: the agent told the user to contact support without transferring. Sometimes correct; often means the agent gave up.
Safety gate escalation: the agent stopped due to a policy constraint. Appropriate in regulated contexts; harmful when overused.
Abandonment signals
Explicit abandonment: the user gave up. Example: "Never mind, I'll figure it out."
No reply after AI request: the agent asked the user to do something and the user never responded. Usually means the next step was too hard or unclear.
Timeout: the session ended without resolution. Measures real drop-off when timing data is available.
End states
Each interaction is assigned one of five end states:
Completed: Success: the user confirmed resolution. The business outcome was achieved.
Completed: Failure: the interaction ended without achieving the user's goal.
Escalated: the interaction shifted to a human channel. Correlates with cost and churn risk.
Abandoned: the user disengaged before resolution. Strong signal of friction.
Unknown: not enough evidence to classify. Prevents guessing in your metrics.
Behavioral tags
Tags are applied when a specific failure pattern is detected. They are designed for consistent diagnosis across teams.
missing_step: an essential step was skipped. The most common cause of "it sounded helpful but did not work."
incomplete_response: partial guidance without enough detail to act on.
unclear_guidance: steps are vague or ambiguous. "Check your settings" (which settings?) is a classic example.
repetitive: the agent repeats the same response without making progress. Signals that it is stuck.
off_topic: the agent drifted from the actual request.
policy_violation: the agent violated a safety or compliance boundary.
hallucinated_fact: an incorrect or unsupported fact was stated as true.
sycophancy: the agent over-agreed with the user at the expense of accuracy or usefulness.
over_refusal: a valid request was blocked without justification.
Custom Evaluators
Custom evaluators let you define domain-specific checks using your own LLM prompt. Use them for compliance rules, brand tone, product-specific accuracy, or any criterion the built-in metrics do not cover.
Creating an evaluator
Go to Evaluators -> Create Custom Evaluator. You need:
A name and a system prompt. The prompt must include an {interaction} placeholder where the conversation transcript will be inserted.
Writing a good system prompt
The most common mistake is a vague system prompt. Instead of "evaluate whether the agent was helpful", write: "Check whether the agent disclosed the 30-day return window before offering a refund. A score of 100 means the disclosure was clear and accurate. A score of 0 means it was missing or wrong."
Concrete, measurable criteria produce consistent scores. Vague criteria produce noise.
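Assembled, an illustrative evaluator prompt for the return-window check above might read (the exact wording is yours to adapt; the {interaction} placeholder and the missing-disclosure tag follow the conventions described in this section):

You are a compliance evaluator for a retail support agent.
Criterion: the agent must disclose the 30-day return window before offering a refund.
Score 100 if the disclosure is clear and accurate. Score 0 if it is missing or wrong. Interpolate for partial or ambiguous disclosures.
Return the tag missing-disclosure when the disclosure is absent.

Transcript:
{interaction}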
Running a custom evaluator
From Interactions, select the conversations you want to check and click Run Custom Evaluator. Choose your evaluator from the dropdown. Results are stored alongside the built-in evaluation and appear in the interaction detail view.
Custom evaluators also return tags (e.g. missing-disclosure) that show up in filters, labels, and CSV exports, so you can trend on them just like built-in signals.
Failure Patterns
Patterns surface systemic issues: failure modes that repeat across many conversations, not just one-off problems.
How patterns are generated
After scanning a batch, Tovix clusters interactions with similar failure labels and scores. Each cluster becomes a pattern with a name, severity, interaction count, average score, and task success rate.
Reading a pattern
A pattern like "Refusal on policy questions, 23 interactions, avg score 41" tells you that a specific class of user request is being handled badly at scale. All 23 interactions are linked so you can open any of them to see the exact transcript.
The most actionable patterns are high-severity with high interaction counts. Fix those first.
Acting on a pattern
Open a pattern, read a few of the underlying interactions, and look for what they have in common. Is it a missing knowledge source? An overly aggressive refusal rule? A tool that is not being called? A prompt that is ambiguous for a certain intent?
Patterns are not automatically resolved. After you ship a fix, re-scan the next import and check whether the pattern's interaction count and average score improved.
Agent Testing
Agent Testing runs autonomous simulated-user conversations against a live agent endpoint and scores the results. It catches regressions before users do.
Connect an agent
Go to Agent Testing -> Agents -> Connect Agent and select your provider:
Click Test Connection. Tovix runs a live handshake and shows the result of each step so you can pinpoint exactly where a configuration problem is.
Create a test
Go to Agent Testing -> Tests -> New Test, or use the guided setup chat which walks you through each field conversationally.
Choose a scenario type: happy_path, edge_case, adversarial, confusion_loop, or policy_probing.
Run a benchmark
Select tests and click Run. Choose which connected agents to benchmark against. You can run the same test against multiple agent versions at once to compare them.
Each test-agent pair is an independent job. 1 credit per evaluation.
Reading results
Each run shows a Pass or Fail verdict, the full turn-by-turn transcript, and Tovix's scoring and reasoning. The run history view lets you track whether a test that was failing has been fixed and whether it stays fixed across future releases.
Scheduling
Set a cron schedule on any test from the test detail page (for example, 0 3 * * * for a nightly 3 AM run). Scheduled runs are useful for overnight regression checks after deployments.
Public API
The Tovix Public API lets you submit conversations for evaluation from any server-side code: backend services, CI pipelines, data pipelines, or AI coding agents.
Authentication
Every request requires your API key in the x-api-key header. Generate keys from Settings -> API Keys. Keep keys server-side only and never expose them in browser code or mobile apps.
Base URL: https://tovix.ai
Endpoints
POST /api/public/evals/submit submits one conversation or a batch.
GET /api/public/evals/jobs/:jobId polls for results.
Submit a single conversation
Required fields: name (label for this evaluation), input (the user's last message), actualOutput (the agent's response), expectedOutput (your rubric, what a good response should do).
Optional: mode ("async" for background job, "sync" for inline result), metadata (pass conversation_id, channel, timestamps, or the full message array for context).
Always include an Idempotency-Key header (UUID v4). This lets you safely retry without creating duplicate jobs.
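A single sync-mode submission in Python might look like this sketch; the shape of the inline result is whatever your workspace returns, so inspect it before building on it:

import os
import uuid
import requests

resp = requests.post(
    "https://tovix.ai/api/public/evals/submit",
    headers={
        "x-api-key": os.environ["TOVIX_API_KEY"],
        "Idempotency-Key": str(uuid.uuid4()),
    },
    json={
        "name": "refund-disclosure-check",
        "input": "Can I return this after 45 days?",
        "actualOutput": "Of course, we accept returns at any time.",
        "expectedOutput": "Must state the 30-day return window accurately.",
        "mode": "sync",  # inline result; use "async" for background jobs
        "metadata": {"conversation_id": "conv-42", "channel": "web_chat"},
    },
)
resp.raise_for_status()
print(resp.json())  # sync mode returns the evaluation inline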
Submit a batch
Use the conversations array to send multiple conversations in one request. Set a top-level expectedOutput to apply the same rubric to all conversations, or override it per item.
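Sketched as a Python payload, a batch with a shared rubric and one per-item override could look like the following (field names mirror the single-conversation section above; treat the exact schema as an assumption to verify):

payload = {
    "mode": "async",
    "expectedOutput": "Resolves the user's request accurately and cites policy where relevant.",  # shared rubric
    "conversations": [
        {
            "name": "conv-1",
            "input": "Where is my order?",
            "actualOutput": "It shipped yesterday; tracking was emailed to you.",
        },
        {
            "name": "conv-2",
            "input": "Can I get a refund?",
            "actualOutput": "Sure, any time!",
            "expectedOutput": "Must state the 30-day return window.",  # per-item override
        },
    ],
}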
Polling
Async jobs return a job.id. Poll GET /api/public/evals/jobs/:jobId until status is completed or failed.
Recommended poll delays: 1 s, then 2 s, then 3 s on every subsequent attempt. Stop after 60 s; the job will still complete in the background, so poll again later if needed.
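That schedule, as a small Python helper. The status values follow the description above; treat the rest of the job shape as an assumption:

import time
import requests

def wait_for_job(job_id: str, api_key: str) -> dict:
    """Poll per the recommended schedule: 1 s, 2 s, then 3 s, stopping near 60 s."""
    delays = [1, 2] + [3] * 19  # 1 + 2 + 19*3 = 60 seconds of waiting
    for delay in delays:
        time.sleep(delay)
        job = requests.get(
            f"https://tovix.ai/api/public/evals/jobs/{job_id}",
            headers={"x-api-key": api_key},
        ).json()
        if job["status"] in ("completed", "failed"):
            return job
    return job  # still pending; the job completes server-side, poll again later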
Error codes
invalid_api_key: key is missing or wrong
insufficient_credits: balance is zero; top up in Settings -> Billing
rate_limited: too many requests; back off and retry
invalid_payload: request body is malformed or missing required fields
job_not_found: job ID does not exist in your workspace
internal_error: server error; retry with the same Idempotency-Key
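A retry wrapper that acts on these codes might look like the following sketch; the error payload shape (an "error" field in the response body) is an assumption, while the codes themselves are from the list above:

import time
import requests

def submit_with_retry(payload: dict, headers: dict, attempts: int = 5) -> dict:
    """Retry transient errors with the same Idempotency-Key so no duplicate jobs are created."""
    for attempt in range(attempts):
        resp = requests.post(
            "https://tovix.ai/api/public/evals/submit", headers=headers, json=payload
        )
        if resp.ok:
            return resp.json()
        error = resp.json().get("error")
        if error in ("rate_limited", "internal_error"):
            time.sleep(2 ** attempt)  # back off, then retry the same submission
            continue
        resp.raise_for_status()  # invalid_api_key, invalid_payload, etc. will not heal
    resp.raise_for_status()  # out of attempts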
Idempotency
Same key plus same payload returns the existing job. Same key plus different payload is treated as a new submission. Use the same key when retrying failures.
Credits and Billing
Credits are the unit of account for Tovix usage. One credit equals one evaluation scored by Tovix.
What costs a credit
Every evaluation Tovix scores costs 1 credit: scanning an interaction, running a custom evaluator on it, or scoring a test-agent pair in a benchmark.
Free starter credits
Every new workspace gets 50 free credits on signup. No credit card required. Enough to evaluate 50 conversations or run several agent tests.
Buying credits
Go to Settings -> Billing and purchase a credit pack. Credits are added to your balance instantly and expire after 90 days of inactivity. $29 per 1,000 credits.
Auto-recharge
Enable auto-recharge in Settings -> Billing to avoid interruptions during batch runs. When your balance drops below the threshold, Tovix automatically purchases a pack and charges your saved payment method.
Both values are configurable. You can disable auto-recharge at any time.
When credits run out
Submissions that would exceed your balance are rejected immediately with an insufficient_credits error before any processing starts. No partial charges. Top up your balance and resubmit. If auto-recharge is on, Tovix attempts to top up before rejecting.
Viewing usage
Settings -> Billing shows your current balance and full credit ledger: every evaluation charged, every pack purchased, and your remaining balance.
Ready to test your agent?
50 free credits. No credit card required.
Something missing or unclear? Email support@tovix.ai