Autonomous testing: are we there yet?
On appliances and realistic expectations
It took me years to finally get a Roomba (those robot vacuums from iRobot that have been around since 2002). I was pretty skeptical about how well it would clean and how smart and autonomous it really was.
My wife was on board with the benefits from the start, but I needed some convincing. After plenty of back-and-forth (and my wife’s persistence), we went for it.
Honestly, the benefits weren’t obvious to me from the start. I’m Brazilian, living in London, where most homes are 2 or 3 stories. Ours has 3 floors and plenty of stairs.
Once the Roomba finishes one floor, my wife has to manually carry it to the next, and even then, the battery can’t cover the whole house on a single charge. So, keeping the place clean still requires a bit of human effort.
It’s great for light cleaning, but for deep cleaning, we still have to do it ourselves and bring in a professional cleaner once in a while.
Plus, maybe it’s just my OCD talking, but its weird, zigzagging paths around the room drive me nuts. They look way too random to actually be cleaning efficiently.
Source: CNET
Even though it has its quirks and requires some extra effort, my wife thinks it works great for her. At the end of the day, if she’s happy, I’m happy!
I think our disagreement comes from having different expectations. I imagined the robot would handle all the cleaning perfectly on its own, with minimal human help.
In hindsight, it is clear that my expectations were pretty unrealistic. I fell for the hype, thinking it would be a silver bullet for all our cleaning needs.
My wife, on the other hand, sees it for what it truly is: a useful assistant for light cleaning, similar to other appliances that help with household chores. This way, she can skip the repetitive tasks, freeing her up to focus on more creative interests, like learning Italian as her third language (which she’s getting pretty good at).
I was focused on the negatives, seeing the glass as half empty, while my wife took a more realistic approach, viewing it as half full. She recognizes the robotic vacuum as a handy assistant with some autonomy, but not a substitute for human-driven, heavy-duty cleaning, given the limitations of current technology (maybe in the future, but not right now).
Autonomous testing: are we there yet?
Switching gears to the software testing field, I think the same mindset applies. We’re caught up in the hype of some amazing, almost magical advancements in AI, with many companies promoting a compelling narrative about the benefits of AI for autonomous testing.
The reality is, we’re not there yet. The technology isn’t advanced or mature enough to achieve a higher level of autonomy.
Companies have poured tons of time and money into making self-driving cars a reality, but we’re still nowhere near full autonomy. It’s a gradual process, with each level of autonomy clearly defined by the Society of Automotive Engineers (SAE). They’ve even got a guide that breaks down each stage of the journey toward fully autonomous driving; see below:
Source: SAE
Autonomous testing follows a similar journey, a gradual, step-by-step evolution, and right now, we’re only at the very first stage of this path.
Remember, in testing, we depend on consistent, deterministic, and repeatable processes, qualities that AI/GenAI is not exactly known for just yet.
And that’s perfectly okay. We should focus on the positives, adjust our expectations, and make the most of tools that can increase our productivity and simplify our lives now, all while keeping an eye on the future.
After all, what is autonomous testing all about?
Honestly, there’s no agreed-upon definition for autonomous testing; it can be really broad or super narrow. However, when I read about autonomous testing, the discussion mostly focuses on the autonomous execution of tests and the tasks required to maintain it, such as flaky test detection, self-healing, and root cause analysis, among other capabilities.
I see no problem with autonomous execution of testing; it’s a welcome addition that makes my life as a tester easier.
However, whether you use traditional approaches like the V-Model and W-Model, shift-left and shift-right testing, or any flavor of agile methods to align with modern development practices, many parts of the testing process still create bottlenecks due to manual and inefficient practices.
Tasks such as test design, test specification, coverage analysis, test prioritization, bug triage/troubleshooting, and test data generation (among many others that are often done manually and inefficiently) will be significantly enhanced with the help of AI.
There are tons of tasks and use cases where a bit of autonomy could really boost our productivity and simplify the testing process. Here are a few examples I can think of off the top of my head, but it's not a complete list:
- By leveraging AI methods like Natural Language Processing (NLP) and Generative AI, requirements can be automatically reviewed in real time as they are created or modified. This process helps ensure clarity, consistency, completeness, testability, and feasibility, among other key factors. By identifying and addressing potential issues early on, we can prevent significant challenges before they enter the requirements analysis phase, which forms the basis for test case design and specification.
- New or changed requirements prompt the automatic and autonomous generation of test cases, allowing developers and testers to review, approve, or discard the suggestions. Based on their experience and domain expertise, they can also add any additional tests (there’s a minimal code sketch of this idea just after this list).
- Requirement traceability and test coverage are constantly and autonomously evaluated to identify gaps. Teams are alerted to risks in uncovered areas, and test cases are automatically generated and executed to fill these gaps.
- Traces and logs from production usage are monitored, and real-time user behavior and preference changes are leveraged to identify gaps in test coverage. As a result, tests are generated and executed autonomously, expanding coverage beyond what is specified in the written requirements.
- With changes in the requirements specification, code, and real-world user behavior in production, a tool could automatically carry out impact and risk analysis, generating a prioritized test suite for review and approval before execution. Eventually, it might not even need human approval, resulting in a fully autonomous process.
- Using requirements and test cases as the basis, test data could be generated automatically to satisfy the needs and preconditions of individual test cases or entire end-to-end test suites. Additionally, complete ephemeral test environments could be spun up autonomously during test execution, requiring minimal to no human involvement.
- Tests can be executed autonomously from the test case specifications written in natural language (or Gherkin), without any human involvement in script creation. Any issues that arise during execution are automatically addressed, allowing the tests to self-heal. In worst-case scenarios, test failures are analyzed, and flaky tests are quarantined. Additionally, defects in the application are automatically detected and reported to a bug tracking tool, complete with evidence for reproduction.
- Test scheduling and orchestration are performed completely autonomously. Data sources from the entire SDLC, including code and requirement specifications, past defects, test execution results, real-world usage in production, historical trends, and predictive defect analysis, among other signals, are used to recommend what to test — whether on the developer's machine, in staging, pre-production, or live production environments.
- Quality dashboards with predictive analytics leveraged by AI, showcase testing reports, quality analytics, application health, release readiness information, and other metrics. Teams are automatically alerted to issues requiring immediate attention based on quality goal thresholds and changes in trends. When quality standards are not met, clear Go/No-Go quality gates are automatically enforced to prevent faulty code from progressing to the next stage of the DevOps pipeline.
Like I said, this isn’t a complete list, and there’s so much more out there; I’m just scratching the surface of all the potential use cases. You probably have other tasks you handle every day that could really benefit from a bit of autonomy (and automation).
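To make the second use case on that list a bit more concrete, here's a minimal sketch of what requirement-driven test case generation could look like. Everything in it is illustrative: the `my_llm_client.complete()` helper is a hypothetical stand-in for whichever model API you use, and the prompt and data shapes are just one way to slice the problem, not a production implementation.

```python
import json
from dataclasses import dataclass

# Hypothetical helper: wraps whichever LLM/GenAI API you use (cloud-hosted or local).
from my_llm_client import complete  # assumption: takes a prompt string, returns the model's text

PROMPT_TEMPLATE = """You are a test designer.
Given the requirement below, propose test cases as a JSON list.
Each test case needs: "title", "preconditions", "steps", "expected_result".

Requirement {req_id}: {req_text}
"""

@dataclass
class DraftTestCase:
    title: str
    preconditions: str
    steps: list[str]
    expected_result: str
    status: str = "pending_review"  # a human still approves or discards every draft

def generate_draft_tests(req_id: str, req_text: str) -> list[DraftTestCase]:
    """Turn a new or changed requirement into draft test cases awaiting review."""
    raw = complete(PROMPT_TEMPLATE.format(req_id=req_id, req_text=req_text))
    drafts = json.loads(raw)  # real code would validate the output and retry if it's malformed
    return [DraftTestCase(**d) for d in drafts]

# Example: run whenever a requirement is created or updated (e.g. triggered by a webhook)
if __name__ == "__main__":
    for tc in generate_draft_tests("REQ-42", "Users can reset their password via an emailed link."):
        print(f"[{tc.status}] {tc.title}")
```

The detail that matters here is the `pending_review` status: every draft the model produces lands in a human review queue, which is exactly the Human-in-the-Loop idea I come back to later in this article.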
To AI or not to AI?
I suppose the question is no longer 'To AI or not to AI?' but rather, 'When and how should we use it?'.
As discussed earlier, with the right mindset and appropriate expectations for the outcomes produced by AI-augmented testing tools, even a small degree of autonomy can free us to focus on the most engaging, challenging, and creative aspects of testing.
By leveraging the right tools, there's potential and opportunity to automate these tasks while incorporating a certain level of autonomy, allowing human testers to focus on reviewing results/outputs and applying their creativity and expertise to corner cases, more complex use cases, or even critical risk decisions.
I’ve read several LinkedIn posts from people claiming that reviewing AI results or guiding the AI to produce the correct output is time-consuming and sometimes pointless; they argue it’s often better to just do the work manually. I can’t disagree: there are many cases where it’s frustrating and inefficient, especially when the problem is too complex or there’s no AI-augmented tool available for that specific task.
However, by keeping the right expectations in mind, as discussed in this article from Fast Company, we should approach AI tools like a smart intern. According to the article, these tools can enhance how users perform their daily tasks, but like any intern, they can make mistakes at times. Here’s a verbatim excerpt from the article that I’d like to highlight:
"In practice, an intern mentality encourages users to think about working with GenAI as the evolution of trust in a relationship. When you first start using GenAI, just like on an intern’s first day on the job, you’re going to want to check every bit of work it produces. Over time, analogous to being a couple of months into the summer internship, you may find some tasks that the AI intern performs well enough to accept as a first pass, but still need to check and make your own. There may be other tasks the intern performs so reliably that you don’t even need to check its work. And there may be still other tasks that you don’t want to entrust to the intern at all".
At the end of the day, this is one of the challenges of being an early adopter: the technology isn’t quite there yet to tackle our most pressing problems, but we’re willing to experiment and explore its boundaries to see what’s possible (or not).
Keep in mind, this is still an emerging field, and the underlying technology is in its early stages and actively evolving. New stuff comes out all the time, but most of it isn’t quite polished yet.
I think the latest hot-off-the-press tech is Anthropic’s 'computer use'. While it wasn’t built specifically for software testing, it’s a new foundational tool that could be leveraged for smarter, more efficient autonomous testing execution. Keep in mind, though, it’s still in public beta and, as Anthropic mentioned in their announcement, it’s experimental, sometimes cumbersome, and prone to errors. But with rapid improvements, it has the potential to be a real game-changer.
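To give a flavour of what that might look like in a testing context, here's a rough sketch of handing a manual-style test step to the computer use beta. The model name, tool type, and beta flag are taken from Anthropic's launch-time documentation and may well have changed since, and the agent loop (taking screenshots, executing clicks, returning tool results) is deliberately left out.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in your environment

# Ask the model to carry out one test step against whatever is currently on screen.
# Tool type and beta flag are as documented at launch; check the current docs before relying on them.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{
        "role": "user",
        "content": "Open the login page, sign in with the test account, "
                   "and confirm the dashboard shows a 'Welcome' banner.",
    }],
)

# The reply contains tool_use blocks (take a screenshot, click at x/y, type text, ...).
# Your harness executes each action, sends back a tool_result, and loops until the model
# reports the step as passed or failed; that agent loop is omitted here for brevity.
for block in response.content:
    print(block.type)
```

In other words, the model decides what to do next, but your own tooling still performs every action, which also makes it the natural place to capture evidence for bug reports.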
I have complete confidence that AI can help with the routine and mundane work in software testing, but the best outcomes come from the collaboration between humans and AI, a concept known as Human-in-the-Loop (HITL). Realizing the full advantages of AI demands skilled testers capable of identifying what truly matters amid the noise created by companies developing AI tools.
A word of caution: as I highlighted in my previous article, 'Don’t let AI be a distraction; if pen and paper are the best solution, use them'. Avoid falling for shiny object syndrome, choose the tool and approach that best suit the task at hand. Don’t try to solve every problem with AI, as these tools are often not mature enough and may not deliver the expected results.
I'm glad you made it to the end of the article! I’d like to ask a favor: please share in the comments which tasks and use cases in your daily routine could benefit from some level of autonomy driven by AI-augmented testing tools. Feel free to share your insights and perspective so we can all learn from our collective knowledge.
About the author
Cristiano Caetano is an entrepreneur and product expert with extensive experience in software testing, B2B SaaS, and marketplaces. Founder of Zephyr Scale, the top-selling app in the Atlassian ecosystem, he is now the VP of Product Marketing at Katalon, where he continues to drive innovation in the tech space.