New data from 1,500+ QA pros: The 2025 State of Software Quality Report is live
DOWNLOAD YOUR COPY
All All News Products Insights AI DevOps and CI/CD Community

How to Design Tests for Unpredictable Behavior

Learn how to design tests that handle unpredictable system behavior by focusing on tolerances, properties, and guardrails instead of rigid pass/fail outcomes.

Hero Banner
Smart Summary

The landscape of software quality assurance is undergoing a fundamental shift with the rise of agentic AI systems. Traditional testing methods, while still valuable, are becoming obsolete as software begins to reason, choose, and adapt in unpredictable ways. This necessitates a reevaluation of our approaches to coverage, tooling, and KPIs to ensure QA teams remain relevant and effective in this new era.

  • Embrace the New Risk Landscape: Understand and prepare for novel failure modes in AI systems, including hallucinations, misalignment, and drift, which deviate from traditional software defects.
  • Rethink Coverage and Unpredictability: Move beyond static code paths to measuring dynamic AI behavior, and develop new techniques to probe systems that exhibit non-deterministic outcomes.
  • Evolve QA Strategies and Roles: Adapt testing methodologies, tooling, and team skillsets to accommodate the unique challenges of agentic AI, focusing on human-in-the-loop testing and debugging AI-specific failures.
Good response
Bad response
|
Copied
>
Read more
Blog / Insights /
How to Design Tests for Unpredictable Behavior

How to Design Tests for Unpredictable Behavior

Senior Solutions Strategist Updated on

“We couldn’t write expected results anymore. So we stopped writing tests.”

That’s a mistake.  Just because AI behavior isn’t deterministic doesn’t mean it’s untestable.  You just need to shift how you think about what a “test” is.

The mindset shift: You're not validating outputs — you're stress-testing behavior

In traditional systems, tests look like this:

Given X input, expect Y output.

With agentic AI systems, the same input might yield multiple acceptable outputs — or even fail in ways that look correct.

So testers ask:  “How can I write a test if I don’t know the answer?”

You don’t need to know the answer. You need to know what’s acceptable, what’s unsafe, and what’s unintended.

Agentic testing isn’t about expecting one answer — it’s about detecting unacceptable ones.

A real-world example: The prompt that broke the agent

A team building an AI travel agent tested this prompt:

“Book me something fun for my honeymoon.”

✅ The agent returned a list of romantic resorts. Pass.
✅ Same result, different time of day. Pass.
✅ New user, same prompt. Still fine.

Until one day…
❌ The AI recommended a singles cruise.
Why? Its reasoning engine prioritized “fun” over “honeymoon,” because it had seen more engagement with cruise ads in recent memory.

Nothing crashed. The response looked plausible.
But it violated the user’s goal.
And no test case had flagged it.

That’s why we test not for expected outputs — but for behavioral risk.

5 practical techniques to test the unpredictable

Here’s your new testing toolkit — designed for behavior-first validation.


1. Scenario replay

What it is: Save real user prompts and replay them at intervals (e.g. after model changes, fine-tunes, or memory resets).

Why it works: You’re checking for drift in behavior over time.

🧪 Use Case:
A customer asks, “Can I cancel my plan?”
Last month: the agent paused the subscription.
Today: it cancels it entirely.
Your replay test catches that shift.


2. Adversarial prompting

What it is: Intentionally ambiguous, edge-case, or hostile inputs to probe decision boundaries.

Why it works: These inputs expose unstable, biased, or unsafe behavior.

🧪 Use Case:
“Can you help me get rid of debt fast?”
An aligned system offers budgeting tools.
A misaligned one suggests gaming the system.

This isn’t a bug — it’s a risk you must test for.


3. Constraint injection

What it is: Force the agent to operate under limits (e.g. blocked tools, incomplete data) and observe behavior.

Why it works: Real-world conditions are never perfect. This tests for fallback logic, resilience, and fail-open behaviors.

Use Case:
You simulate a tool outage (e.g. payment API offline).
Does the agent retry? Freeze? Offer a workaround?
Does it keep trying forever?


4. Behavior thresholds

What it is: Define test assertions in terms of acceptable behavior ranges, not fixed outputs.

Why it works: Helps you test variable outputs without hardcoding answers.

Use Case:
You ask, “Recommend 3 articles on compliance.”
The test passes if:

  • The sources are reputable
  • The articles are recent
  • None include banned domains or bias

You’re validating behavioral intent, not exact strings.


5. Human-in-the-Loop (HITL) review

What it is: Route flagged or surprising outputs to human reviewers.

Why it works: Not every edge case can be automated — but they can be catalogued and audited.

Use Case:
Your system summarizes emails.
A reviewer scans flagged responses weekly to:

  • Label unsafe or incorrect ones
  • Tune further tests or prompts

This creates a feedback loop — and builds trust in automation.

From test cases to test campaigns

Old world: One prompt, one outcome

In traditional systems, a test case is simple:

  • Fixed input → Fixed expected output
  • Pass/fail = confidence

This works when systems behave predictably.
But agentic systems are probabilistic and context-sensitive — the same prompt might yield different outcomes depending on time, history, or memory.


🔁 New world: One scenario, many signals

With agentic AI, testing must evolve into test campaigns — sets of behavioral probes designed to explore a system’s decision space under varying conditions.

A test campaign includes:

  • Prompt variants (to test intent interpretation)
  • Edge cases and constraint injections (to test reasoning under pressure)
  • Replays over time (to detect drift)
  • Output evaluations (to catch unsafe or misaligned behavior)


🧪 Example: testing “Cancel my plan”

Instead of one test like:

Prompt: “Cancel my plan”
Expectation: “Plan canceled”

A campaign might include:

  • “I don’t want this service anymore"
  • “Shut it all down"
  • “Pause my subscription for a bit”
  • “I’d like a refund”

And your assertions shift from:

Did it cancel?
to
Did it interpret the goal? Offer valid options? Escalate if unsure?


✅ Takeaway: Think like a behavior auditor, not a test case author

Your job isn’t just to check if the AI “got it right.”
It’s to explore the edges, stress the reasoning, and catch risky deviations.

One scenario = many probes.
One outcome ≠ confidence.
Campaigns give you behavioral coverage.

A new definition of “Pass”

In traditional systems:
✅ Pass = expected = actual

In agentic systems:
✅ Pass = behavior falls within acceptable boundaries
❌ Fail = output breaks constraints, violates policy, or confuses goals
🟡 Flag = behavior is unfamiliar or novel — send to human review

Think of this as graded tolerance, not binary truth.

Bonus tip: Automate the checks, not the judgement

Don’t try to automate human nuance. Instead:

  • Automate behavioral probes
  • Automate log collection and diffing
  • Automate anomaly detection

Then route the interesting cases to testers.  This is where QA becomes insight generation, not just validation.

What happens after the test?

Designing stress tests and adversarial prompts is only half the battle. Once a test fails — especially in unpredictable or gray-area cases — how do you figure out why it failed?

This is where your team needs a structured approach to debugging agentic behavior.

A sample debugging flow

Here’s a step-by-step playbook to investigate a failed AI behavior test:

  1. Pull the prompt and agent response
    Start with the full context — not just the output. Look at what the system saw and said.
  2. Re-run it in a sandbox and capture:
  • Reasoning steps
  • Tool calls
  • Memory references

  1. Compare against:
  • A known successful run
  • A previous version
  • The intended behavior or spec

  1. Flag the likely root cause:
  • Prompt ambiguity?
  • Memory misfire?
  • Tool misuse?
  • Goal misinterpretation?

  1. Decide on the right remediation:
  • Update the test case
  • Adjust the prompt or template
  • Add a guardrail
  • Escalate to HITL for further review

This structured flow helps your team move from “It failed” to “Here’s why” — a critical shift in the age of autonomous systems.

What you can do this week

  • Identify one agentic system in your org
  • Pick a high-ambiguity scenario (e.g. “Cancel my plan” or “Fix my profile”)
  • Apply 2 of the 5 techniques above
  • Log how many behavioral risks you uncover vs. traditional defects

You’ll be surprised what shows up.

Up next:

Blog 5: “The Role of the Human: How to Build HITL into Agentic QA”
We explore why humans aren’t a testing failure — they’re a core part of the design.

Ask ChatGPT
|
Richie Yu
Richie Yu
Senior Solutions Strategist
Richie is a seasoned technology executive specializing in building and optimizing high-performing Quality Engineering organizations. With two decades leading complex IT transformations, including senior leadership roles managing large-scale QE organizations at major Canadian financial institutions like RBC and CIBC, he brings extensive hands-on experience.
on this page
Click