Overview

This page defines all metrics, terms, and concepts used in AI QA to help you understand your call analysis results.

Performance Metrics

Latency

Average Latency: Measures the end-to-end delay between a user speaking and the Voice AI beginning its spoken response. Lower latency indicates more responsive interactions.
Latency P50: The 50th percentile (median) of latency measurements. This metric shows the typical response time, with half of all responses being faster and half being slower.
Latency is measured in seconds (s). Lower values indicate better performance.

Sentiment Analysis

User Sentiment: Represents the emotional state of the caller as inferred from speech content, tone, and pitch. Sentiment can be positive, negative, or neutral.
  • User Positive Sentiment Rate: Percentage of user interactions with positive sentiment
  • User Negative Sentiment Rate: Percentage of user interactions with negative sentiment
  • Negative Sentiment Rate: Overall rate of negative sentiment detected in the conversation
Agent Sentiment: Represents the emotional tone expressed by the Voice AI during speech output. This metric helps ensure your agent maintains an appropriate tone throughout conversations.
  • Agent Positive Sentiment Rate: Percentage of agent responses with positive sentiment
  • Agent Natural Tonality Rate: Measures how natural and human-like the agent’s tone sounds
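
In case it helps to see the arithmetic, these rates are simple proportions over labeled turns. A minimal sketch, assuming per-turn sentiment labels have already been produced by an upstream classifier (the turn data here is invented):

```python
# Hypothetical per-turn sentiment labels from an upstream classifier.
user_turn_sentiments = ["positive", "neutral", "negative", "positive", "neutral"]

def sentiment_rate(labels, target):
    """Percentage of turns whose label matches the target sentiment."""
    if not labels:
        return None  # no data -> metric is shown as "N/A"
    return 100.0 * sum(1 for s in labels if s == target) / len(labels)

print(sentiment_rate(user_turn_sentiments, "positive"))  # 40.0
print(sentiment_rate(user_turn_sentiments, "negative"))  # 20.0
```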

Transcription Metrics

WER (Word Error Rate): Measures the accuracy of speech-to-text transcription by calculating the percentage of words that were incorrectly transcribed. Lower WER indicates better transcription accuracy.
Mistranscribed Entities: Count of specific entities (names, dates, numbers, etc.) that were incorrectly transcribed during the call.
WER is calculated as: (Substitutions + Insertions + Deletions) / Total Words in the reference transcript × 100%
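
To make the formula concrete, here is a minimal sketch of a word-level edit-distance WER calculation (the transcripts are invented, and real scoring pipelines typically normalize text first):

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of reference words, as a percentage."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return 100.0 * dp[m][n] / m

ref = "my account number is four two one".split()
hyp = "my account number is for two one".split()
print(f"{wer(ref, hyp):.1f}%")  # 14.3% -- one substitution in seven words
```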

Call Quality Metrics

Interruptions: Count of times the user interrupted the agent during the conversation. Higher interruption counts may indicate the agent is speaking too long or not responding appropriately.
Avg. Interruptions: Average number of interruptions per call across the cohort.
Agent Naturalness: Measures how human-like the agent sounded, including pronunciation, intonation, pacing, turn-taking behavior, and the absence of robotic patterns. Higher values indicate more natural-sounding speech.
Natural Tonality Rate: Percentage of agent speech that sounds natural and human-like in tone and delivery.

AI Accuracy Metrics

LLM Hallucination Rate: Measures how often the Large Language Model (LLM) generated incorrect or fabricated information that wasn’t supported by the conversation context or knowledge base.
Agent Hallucination: Measures how often the agent hallucinated during conversations. This is a critical metric for ensuring factual accuracy.
High hallucination rates indicate the agent may be providing incorrect information to users, which can damage trust and lead to poor outcomes.

Knowledge Base Metrics

KB Recall: Measures how effectively the agent retrieved and used relevant information from the knowledge base. Higher recall indicates better knowledge base utilization.
KB Recall is calculated as the percentage of relevant knowledge base entries that were successfully retrieved and used during the conversation.
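
A minimal sketch of that ratio, with invented knowledge-base entry IDs (how "relevant" and "used" entries are determined is product-specific):

```python
# Hypothetical IDs of KB entries relevant to the caller's question,
# and the entries the agent actually retrieved and used.
relevant_entries = {"kb-101", "kb-204", "kb-317"}
used_entries = {"kb-101", "kb-317", "kb-555"}

# Recall = relevant entries actually used / all relevant entries, as a percentage.
kb_recall = 100.0 * len(relevant_entries & used_entries) / len(relevant_entries)
print(f"{kb_recall:.1f}%")  # 66.7% -- two of three relevant entries were used
```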

Tool and Function Metrics

Tool Call Accuracy: Measures the rate at which the agent correctly invoked tools or functions. Higher accuracy means the agent is using the right tools at the right time.
Tool Call Inaccuracy: Measures the rate at which the agent invoked incorrect tools. This is the complement of Tool Call Accuracy; the two rates sum to 100%.
Custom Tool Success Rate: Percentage of custom tool calls that completed successfully.
Avg Custom Tool Latency: Average time taken for custom tools to execute and return results.
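
Because the two rates are complements, a quick sanity check looks like this (the counts are invented):

```python
correct_tool_calls = 47
total_tool_calls = 50

tool_call_accuracy = 100.0 * correct_tool_calls / total_tool_calls  # 94.0%
tool_call_inaccuracy = 100.0 - tool_call_accuracy                   # 6.0%
```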

Conversation Flow Metrics

Transition Accuracy: Measures the accuracy of transitions between conversation nodes or states. Higher accuracy indicates the agent is following the intended conversation flow correctly.
Node Transition Inaccuracy: Measures incorrect node transitions in conversation flows. This metric helps identify when the agent moves to the wrong conversation state.

Call Resolution Metrics

Call Resolution Rate: Percentage of calls that were successfully resolved according to your defined resolution criteria.
Average Score: Overall quality score for calls in the cohort, calculated from your resolution criteria and weighted scoring configuration.
Calls Analyzed: Total number of calls that have been analyzed in the cohort.

Transfer Metrics

Transfer Success Rate: Percentage of calls that were successfully transferred to another agent or system.
Transfer Wait Time: Average time users wait before a transfer is completed.

Call-Level Data

Call Identification

Call ID: Unique identifier for each individual call in the system.
Call Start Time: Timestamp indicating when a call began.
Call Length: Total duration of the call, typically measured in seconds or minutes.

Evaluation Status

Eval: Evaluation status or score for individual calls, indicating whether the call met the defined resolution criteria.

Statistical Terms

Percentiles

P50 (50th Percentile): The median value, where half of all measurements fall below it and half above.
Percentiles help understand the distribution of metrics. P50 shows typical performance, while P95 or P99 show worst-case scenarios.
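
A quick way to compute these cut points from raw measurements with Python's standard library (the sample latencies are invented; note that high percentiles are noisy on small samples):

```python
import statistics

# Hypothetical per-response latencies for one cohort, in seconds.
latencies = [0.8, 1.1, 0.9, 2.4, 1.0, 0.7, 1.3, 3.1, 0.9, 1.2]

# quantiles(n=100) returns the 1st..99th percentile cut points.
cuts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.2f}s  P95={p95:.2f}s  P99={p99:.2f}s")
```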

Cohort Terms

Cohort

A Cohort is a filtered set of calls that share common characteristics (agents, date range, call duration, etc.) and are analyzed together using the same resolution criteria.

Sampling

Sampling Percentage: The percentage of calls matching your filters that will be included in the cohort for analysis.
Weekly Max: Maximum number of calls that can be analyzed per week, regardless of the sampling percentage.
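
A sketch of how the two settings interact, assuming the weekly cap is applied after sampling, as described above (all numbers are invented):

```python
matching_calls = 4_000     # calls that matched the cohort filters this week
sampling_percentage = 10   # percent of matching calls to analyze
weekly_max = 300           # hard cap on analyzed calls per week

sampled = matching_calls * sampling_percentage // 100  # 400
analyzed = min(sampled, weekly_max)                    # capped at 300
print(analyzed)
```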

Resolution Criteria Terms

AI Evaluated Condition

Custom criteria evaluated by AI based on call transcripts and context. These are qualitative assessments (e.g., “Call resolved”, “Customer satisfied”) rather than quantitative metrics.

Performance Metric

Quantitative thresholds that calls must meet, such as latency below 2 seconds or sentiment above 80%.

Weighted Scoring

A scoring system that assigns different weights to various resolution criteria, allowing you to prioritize certain conditions or metrics over others.
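
The exact scoring formula depends on your configuration; as one plausible interpretation, a weighted average over per-criterion results might look like this (the criteria, weights, and results are invented):

```python
# Hypothetical resolution criteria with weights and per-call results
# (1.0 = criterion fully met, 0.0 = not met).
criteria = {
    "call_resolved":      {"weight": 3, "result": 1.0},
    "customer_satisfied": {"weight": 2, "result": 0.0},
    "latency_under_2s":   {"weight": 1, "result": 1.0},
}

total_weight = sum(c["weight"] for c in criteria.values())
score = 100.0 * sum(c["weight"] * c["result"] for c in criteria.values()) / total_weight
print(f"{score:.1f}%")  # 66.7% -- (3*1 + 2*0 + 1*1) / 6
```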

Best Practices

When interpreting metrics, consider them in context:
  • Compare metrics across similar cohorts or time periods
  • Look for trends rather than focusing on individual data points
  • Use multiple metrics together to get a complete picture of call quality
Some metrics may show “N/A” when there isn’t sufficient data or when the metric doesn’t apply to a particular call type (e.g., transfer metrics for non-transfer calls).