How to Design Tests for Unpredictable Behavior

Written by Richie Yu | Aug 23, 2025 3:45:00 AM

“We couldn’t write expected results anymore. So we stopped writing tests.”

That’s a mistake. Just because AI behavior isn’t deterministic doesn’t mean it’s untestable. You just need to shift how you think about what a “test” is.

The mindset shift: You're not validating outputs — you're stress-testing behavior

In traditional systems, tests look like this:

Given X input, expect Y output.

With agentic AI systems, the same input might yield multiple acceptable outputs — or even fail in ways that look correct.

So testers ask: “How can I write a test if I don’t know the answer?”

You don’t need to know the answer. You need to know what’s acceptable, what’s unsafe, and what’s unintended.

Agentic testing isn’t about expecting one answer — it’s about detecting unacceptable ones.

A real-world example: The prompt that broke the agent

A team building an AI travel agent tested this prompt:

“Book me something fun for my honeymoon.”

✅ The agent returned a list of romantic resorts. Pass.
✅ Same result, different time of day. Pass.
✅ New user, same prompt. Still fine.

Until one day…
❌ The AI recommended a singles cruise.
Why? Its reasoning engine prioritized “fun” over “honeymoon,” because it had seen more engagement with cruise ads in recent memory.

Nothing crashed. The response looked plausible.
But it violated the user’s goal.
And no test case had flagged it.

That’s why we test not for expected outputs — but for behavioral risk.

5 practical techniques to test the unpredictable

Here’s your new testing toolkit — designed for behavior-first validation.

1. Scenario replay

What it is: Save real user prompts and replay them at intervals (e.g. after model changes, fine-tunes, or memory resets).

Why it works: You’re checking for drift in behavior over time.

🧪 Use Case:
A customer asks, “Can I cancel my plan?”
Last month: the agent paused the subscription.
Today: it cancels it entirely.
Your replay test catches that shift.

2. Adversarial prompting

What it is: Intentionally ambiguous, edge-case, or hostile inputs to probe decision boundaries.

Why it works: These inputs expose unstable, biased, or unsafe behavior.

🧪 Use Case:
“Can you help me get rid of debt fast?”
An aligned system offers budgeting tools.
A misaligned one suggests gaming the system.

This isn’t a bug — it’s a risk you must test for.

3. Constraint injection

What it is: Force the agent to operate under limits (e.g. blocked tools, incomplete data) and observe behavior.

Why it works: Real-world conditions are never perfect. This tests for fallback logic, resilience, and fail-open behaviors.

Use Case:
You simulate a tool outage (e.g. payment API offline).
Does the agent retry? Freeze? Offer a workaround?
Does it keep trying forever?

4. Behavior thresholds

What it is: Define test assertions in terms of acceptable behavior ranges, not fixed outputs.

Why it works: Helps you test variable outputs without hardcoding answers.

Use Case:
You ask, “Recommend 3 articles on compliance.”
The test passes if:

The sources are reputable
The articles are recent
None include banned domains or bias

You’re validating behavioral intent, not exact strings.

5. Human-in-the-Loop (HITL) review

What it is: Route flagged or surprising outputs to human reviewers.

Why it works: Not every edge case can be automated — but they can be catalogued and audited.

Use Case:
Your system summarizes emails.
A reviewer scans flagged responses weekly to:

Label unsafe or incorrect ones
Tune further tests or prompts

This creates a feedback loop — and builds trust in automation.

From test cases to test campaigns

Old world: One prompt, one outcome

In traditional systems, a test case is simple:

Fixed input → Fixed expected output
Pass/fail = confidence

This works when systems behave predictably.
But agentic systems are probabilistic and context-sensitive — the same prompt might yield different outcomes depending on time, history, or memory.

🔁 New world: One scenario, many signals

With agentic AI, testing must evolve into test campaigns — sets of behavioral probes designed to explore a system’s decision space under varying conditions.

A test campaign includes:

Prompt variants (to test intent interpretation)
Edge cases and constraint injections (to test reasoning under pressure)
Replays over time (to detect drift)
Output evaluations (to catch unsafe or misaligned behavior)

🧪 Example: testing “Cancel my plan”

Instead of one test like:

Prompt: “Cancel my plan”
Expectation: “Plan canceled”

A campaign might include:

“I don’t want this service anymore"
“Shut it all down"
“Pause my subscription for a bit”
“I’d like a refund”

And your assertions shift from:

Did it cancel?
to
Did it interpret the goal? Offer valid options? Escalate if unsure?

✅ Takeaway: Think like a behavior auditor, not a test case author

Your job isn’t just to check if the AI “got it right.”
It’s to explore the edges, stress the reasoning, and catch risky deviations.

One scenario = many probes.
One outcome ≠ confidence.
Campaigns give you behavioral coverage.

A new definition of “Pass”

In traditional systems:
✅ Pass = expected = actual

In agentic systems:
✅ Pass = behavior falls within acceptable boundaries
❌ Fail = output breaks constraints, violates policy, or confuses goals
🟡 Flag = behavior is unfamiliar or novel — send to human review

Think of this as graded tolerance, not binary truth.

Bonus tip: Automate the checks, not the judgement

Don’t try to automate human nuance. Instead:

Automate behavioral probes
Automate log collection and diffing
Automate anomaly detection

Then route the interesting cases to testers. This is where QA becomes insight generation, not just validation.

What happens after the test?

Designing stress tests and adversarial prompts is only half the battle. Once a test fails — especially in unpredictable or gray-area cases — how do you figure out why it failed?

This is where your team needs a structured approach to debugging agentic behavior.

A sample debugging flow

Here’s a step-by-step playbook to investigate a failed AI behavior test:

Pull the prompt and agent response
Start with the full context — not just the output. Look at what the system saw and said.
Re-run it in a sandbox and capture:

Reasoning steps
Tool calls
Memory references

Compare against:

A known successful run
A previous version
The intended behavior or spec

Flag the likely root cause:

Prompt ambiguity?
Memory misfire?
Tool misuse?
Goal misinterpretation?

Decide on the right remediation:

Update the test case
Adjust the prompt or template
Add a guardrail
Escalate to HITL for further review

This structured flow helps your team move from “It failed” to “Here’s why” — a critical shift in the age of autonomous systems.

What you can do this week

Identify one agentic system in your org
Pick a high-ambiguity scenario (e.g. “Cancel my plan” or “Fix my profile”)
Apply 2 of the 5 techniques above
Log how many behavioral risks you uncover vs. traditional defects

You’ll be surprised what shows up.

Up next:

Blog 5: “The Role of the Human: How to Build HITL into Agentic QA”
We explore why humans aren’t a testing failure — they’re a core part of the design.

View full post