What is Test Data Management? Definition, Tools, Best Practices

Test data management (TDM) is often overlooked despite its undeniable contribution to the overall success of the testing process. In complex testing projects with many test scenarios to cover, test data management is a critical area to optimize.
 

QA teams need diverse, comprehensive test data to achieve high test coverage, which creates the need for a dedicated place where that data is properly stored, managed, maintained, and set up for future testing. That is where test data management shines.
 

In this article, we will explore the concept of test data management in depth, along with test data management best practices, strategies, and tools that you can use for this activity.
 

What is Test Data Management?


Test data management (TDM) is the process of planning, creating, and maintaining the datasets used in testing activities, ensuring that they are the right data for the right test case, in the right format, and available at the right time.
 

Test data is the set of input values used during the testing of an application (software, web, mobile application, API, etc.). These values represent what a user would enter into the system in a real-world scenario. Testers can usually write a test script to automatically and dynamically identify the right type of values to feed into the system and see how it responds to that data.


For example, test data for testing a login page usually has two columns: a Username column and a Password column. A test script or automation testing tool can open the login page, identify the Username and Password fields, then input the values:
 

| Username       | Password    |
|----------------|-------------|
| user_123       | Pass123!    |
| testuser@email | Secret@321  |
| admin_user     | AdminPass#  |
| jane_doe       | JaneDoePass |

You can have hundreds to thousands of such credential pairs representing unique test scenarios. 
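To make this concrete, below is a minimal sketch of a data-driven login test in Python using pytest and Selenium. The URL, element locators, CSV path, and post-login check are hypothetical placeholders rather than references to a real application:

```python
# Minimal data-driven login test sketch (pytest + Selenium).
# All locators, paths, and URLs below are hypothetical placeholders.
import csv

import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By


def load_credentials(path="test_data/credentials.csv"):
    """Read (username, password) pairs from a CSV test data file."""
    with open(path, newline="") as f:
        return [(row["Username"], row["Password"]) for row in csv.DictReader(f)]


@pytest.mark.parametrize("username,password", load_credentials())
def test_login(username, password):
    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/login")  # placeholder URL
        driver.find_element(By.ID, "username").send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.ID, "submit").click()
        # Placeholder success check: the app redirects to a dashboard.
        assert "dashboard" in driver.current_url
    finally:
        driver.quit()
```

Because the credentials live in an external file, the same script scales from four pairs to thousands without any code changes.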

But having a huge database does not automatically mean all of it is high-quality. There are four major criteria for evaluating test data quality:
 

  1. Relevance: test data should accurately reflect the scenario being tested. Imagine testing how the login page responds when users enter the wrong credentials, while the test data being used is actually the correct, stored-in-database credentials: the test would return inaccurate results.
  2. Availability: what's the point of having thousands of relevant data points if you can't retrieve them for testing? QA teams usually define role-based access to test data, so TDM activities also involve assigning the right level of access to the right personnel.
  3. Updated: software constantly changes, bringing with it new complexities and dependencies. QA teams are responsible for staying aware of those updates and adjusting the test data accordingly, so that results accurately reflect the current state of the software.
  4. Compliance: aside from the technical aspects, we should never forget compliance requirements. QA teams sometimes use production data directly for testing because it is instantly available, but production data is a tricky domain: it may contain confidential information protected by GDPR, HIPAA, PCI DSS, or other data privacy regulations.

Types of Test Data


  1. Positive Test Data: input values that are valid and within the expected range, designed to test how the system behaves under normal, expected conditions. Example: a valid username and password pair that lets a user log in to their account page on an eCommerce site.

  2. Negative Test Data: in contrast with positive data, negative test data consists of input values that are invalid, unexpected, or outside the specified range. It is designed to test how the system behaves when users stray from the intended "correct" path. Example: a username and password that exceed the maximum allowed length.

  3. Boundary Test Data: values at the edges of the acceptable input range, chosen to assess how the system handles inputs at the upper and lower limits.

  4. Invalid Test Data: data that does not accurately represent the real-world scenarios or conditions the software is expected to encounter, and does not conform to the expected format, structure, or rules within a given context.
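To make the four categories concrete, here is a small sketch of what each type might look like for a hypothetical username field that accepts 3 to 20 characters:

```python
# Assumed rule for illustration: username must be 3-20 characters long.
username_test_data = {
    # Positive: valid values inside the accepted range.
    "positive": ["user_123", "jane_doe"],
    # Negative: invalid or unexpected values outside the accepted range.
    "negative": ["", "x" * 500, "'; DROP TABLE users;--"],
    # Boundary: values at the exact edges of the accepted range.
    "boundary": ["abc", "a" * 20, "ab", "a" * 21],  # 3, 20, 2, and 21 chars
    # Invalid: values that violate the expected format or structure entirely.
    "invalid": [None, 12345, b"\xff\xfe"],
}
```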

Read More: A Guide To Data-driven Testing using Selenium and Katalon
 

Why Test Data Management?


Here are some reasons why you should have a test data management process in place:
 

  1. Diversity: high test coverage means covering a rich array of test scenarios, and consequently having test data for all of those scenarios. A simple registration page, for example, already requires many datasets to cover all of the possible scenarios that can happen there:
    1. Valid credentials
    2. Empty username
    3. Empty password
    4. Incorrect username
    5. SQL injection attempt
    6. Special characters
    7. Too long username
    8. Too long password

There can be more, depending on the complexity of the registration page under test. BFSI testing, for example, tends to involve multi-layer authentication for higher security, which translates into more complex testing. TDM is there to ensure that test data for every test scenario is well-categorized and organized to best facilitate the actual testing activity.
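One practical way to keep such scenario data organized is to label every dataset with the scenario it covers, so each record maps to exactly one test case. A minimal sketch with made-up values:

```python
# Each record pairs a scenario label with its input values (made-up data).
registration_scenarios = [
    ("valid_credentials",  "jane_doe",       "CorrectPass1!"),
    ("empty_username",     "",               "CorrectPass1!"),
    ("empty_password",     "jane_doe",       ""),
    ("incorrect_username", "not_a_user",     "CorrectPass1!"),
    ("sql_injection",      "' OR '1'='1'--", "anything"),
    ("special_characters", "jane<>&doe",     "CorrectPass1!"),
    ("too_long_username",  "j" * 300,        "CorrectPass1!"),
    ("too_long_password",  "jane_doe",       "p" * 300),
]
```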

  2. Data Privacy: without good TDM practices, testers risk using PII (personally identifiable information) in tests, which is a security breach. There is a lot you can do in TDM to prevent this, such as data anonymization, a process that replaces real, sensitive data with similar but fictitious data. If teams decide to use real data, they can mask (i.e., encrypt) specific sensitive fields and use only what is strictly necessary. Several teams employ Dynamic Data Masking (DDM) to mask data fields on the fly based on user roles and permissions.

  3. Data Consistency: QA teams also need to ensure that their test data is uniform across the entire system, adhering to the same formats and standards, and that the relationships among datasets are continuously maintained as the complexity of the system grows.
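A lightweight way to enforce this kind of consistency is to validate every dataset against a shared set of format rules before it is used. A sketch, assuming records are Python dictionaries and the team has standardized on ISO 8601 dates:

```python
from datetime import date

def validate_record(record):
    """Raise if a test data record violates the team's (assumed) standards."""
    assert set(record) == {"user_id", "email", "signup_date"}, "unexpected fields"
    assert "@" in record["email"], "email must contain '@'"
    date.fromisoformat(record["signup_date"])  # raises on non-ISO dates

validate_record({"user_id": 1, "email": "a@b.com", "signup_date": "2024-01-31"})
```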

Read More: Database Testing: A Complete Guide

Test Data Management Techniques

1. Data Masking

Data masking is a technique used to protect sensitive information in non-production environments by replacing, encrypting, or otherwise "masking" confidential data while retaining the original data's format and functionality. Data masking creates a sanitized version of the data for testing and development purposes without exposing sensitive information.
 

How data is masked depends on the algorithms QA teams choose. After cloning the data, there are quite a few ways to "play" with it and turn it into a completely new dataset in which the original identity of the users is protected. For example, we can:
 

Substitution
Definition: Replace actual sensitive data with fictional or anonymized values. You can leverage Generative AI for this approach; however, note that creating entirely new data is resource-intensive.
Example: Replace actual names with randomly generated names (e.g., John Doe).

Shuffling
Definition: Randomly shuffle the order of data records to break associations between sensitive information and other data elements. This approach is faster and easier to achieve than substitution.
Example: Shuffle the order of employee records, disconnecting salary information from individuals.

Encryption
Definition: Use encryption algorithms to transform sensitive data into unreadable ciphertext. Only authorized users with decryption keys can access the original data. This is a highly secure approach.
Example: Encrypt credit card numbers, rendering them unreadable without proper decryption.

Tokenization
Definition: Replace sensitive data with randomly generated tokens. Tokens map to the original data, allowing reversible access for authorized users.
Example: Replace social security numbers with unique tokens (e.g., Token123).

Character Masking
Definition: Mask specific characters within sensitive data, revealing only a portion of the information.
Example: Mask all but the last four digits of a social security number (e.g., XXX-XX-1234).

Dynamic Data Masking
Definition: Dynamically control and limit the exposure of confidential data in real time during query execution. In other words, sensitive data is masked at the moment of retrieval, just before being presented to the user, with the masking logic usually based on user roles.
Example: Mask salary information in query results for users without financial access rights.

Randomization
Definition: Introduce randomness into the values of sensitive data to create diverse test datasets.
Example: Randomly adjust salary values within a specified percentage range for a group of employees.
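To ground a few of these techniques, here is a minimal sketch of substitution, character masking, and shuffling in plain Python. The records and field names are invented for illustration; production masking pipelines usually rely on dedicated tooling:

```python
import random

# Invented sample records standing in for cloned production data.
employees = [
    {"name": "John Doe", "ssn": "123-45-6789", "salary": 70000},
    {"name": "Mary Roe", "ssn": "987-65-4321", "salary": 85000},
]

# Substitution: replace real names with fictional ones.
fake_names = ["Alex Smith", "Sam Lee"]
for emp, fake in zip(employees, fake_names):
    emp["name"] = fake

# Character masking: reveal only the last four digits of each SSN.
for emp in employees:
    emp["ssn"] = "XXX-XX-" + emp["ssn"][-4:]

# Shuffling: break the association between individuals and salaries.
salaries = [emp["salary"] for emp in employees]
random.shuffle(salaries)
for emp, salary in zip(employees, salaries):
    emp["salary"] = salary
```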

2. Data Subsetting

Data subsetting is a technique for creating a smaller yet representative subset of a production database for use in testing and development environments. There are several benefits to this technique (a short sketch follows the list):

  1. Reduced data volume, especially in organizations with large datasets. For testing purposes, a smaller data volume minimizes resource requirements and therefore reduces maintenance needs.
  2. Preserved data integrity, as subsetting a dataset does not change the relationships between rows, columns, and the entities within it.
  3. Easy inclusion or exclusion of data based on specific criteria relevant to the team's testing needs, giving them a higher level of control. At the same time, this translates into improved efficiency in data storage, transmission, and processing.
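As a sketch of the idea, the snippet below (using pandas, with hypothetical file and column names) samples a fraction of the customers and keeps only the orders belonging to them, so the relationship between the two tables stays intact:

```python
import pandas as pd

# Hypothetical production extracts.
customers = pd.read_csv("customers.csv")  # columns: customer_id, region, ...
orders = pd.read_csv("orders.csv")        # columns: order_id, customer_id, ...

# Include only data relevant to the testing need (here: one region),
# then sample it down to shrink the volume.
subset_customers = (
    customers[customers["region"] == "EU"].sample(frac=0.05, random_state=42)
)

# Preserve integrity: keep only orders whose customer is in the subset.
subset_orders = orders[orders["customer_id"].isin(subset_customers["customer_id"])]
```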

3. Synthetic Data Generation

Synthetic data generation is the process of creating artificial datasets that simulate real-world data without containing any sensitive or confidential information. This approach is usually reserved for cases where obtaining real data is challenging (e.g., financial, medical, or legal data) or risky (e.g., employee personal information).

In such cases, generating entirely new datasets for testing purposes is a more practical approach. These synthetic datasets should simulate the original dataset as closely as possible, capturing its statistical properties, patterns, and relationships.

To create new test data, you can leverage Generative AI. Simply provide the AI with clear prompts describing how you want your dataset to look. If you want to go above and beyond, you can custom-train an AI on real-world data samples (making sure to specify the statistical properties you want to achieve).

Of course, do not expect instant results when training an AI. With enough dedication, however, you can build a powerful engine fine-tuned to the specific test data needs of your organization.
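Short of training your own model, libraries such as Faker can already generate realistic but entirely fictitious records, and you can layer statistical assumptions on top. A minimal sketch (the fields and the age distribution are assumptions for illustration):

```python
import random

from faker import Faker  # third-party library: pip install faker

fake = Faker()
Faker.seed(42)   # make the generated dataset reproducible
random.seed(42)

def synthetic_customer():
    """Generate one artificial customer record containing no real PII."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        # Mimic an assumed statistical property of the real data:
        # ages roughly normally distributed around 40, clamped to 18-90.
        "age": max(18, min(90, int(random.gauss(40, 12)))),
    }

dataset = [synthetic_customer() for _ in range(1000)]
```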

 

Read More: A Guide To Do Synthetic Data Generation
 

Top Test Data Management Tools

1. Katalon


Katalon is a well-known automation testing platform that comes with readily available test data management features you can leverage right away. As a comprehensive platform, Katalon lets you do test planning, management, execution, and analysis for web, desktop, mobile, and API testing, with TDM best practices already built in!
 

The first step of TDM is to generate data. Here's how you can achieve synthetic data generation with Katalon. Make sure that you have the latest version of Katalon Studio installed, which you can download from the Katalon website.


 


 

Once you are in Katalon Studio, open any test case, or create one if you are starting from scratch. After that, write a clear prompt instructing the AI as to what test script you want it to create. Use actionable language, provide the necessary context, and specify the expected results. See the example test steps below for reference:

 

[Image: Synthetic Data Generation Using Katalon]
 

After that, select the prompt, right-click, choose StudioAssist, and then select "Generate Code." The code will be generated based on your instructions, and you can freely make any adjustments you want to it.

 

2. Tricentis Tosca


Tricentis Tosca is a comprehensive enterprise-grade automation testing tool for web, API, mobile, and desktop applications. It has a distinctive model-based testing methodology, enabling users to scan an application’s UI or APIs to create a business-oriented model for test development and maintenance.
 

Tricentis comes with a Test Data Management web application that allows you to view, modify, or delete records in your test data repositories. The TDM module is automatically installed as part of the Test Data Service component in the Tricentis Tosca Server setup.

3. IBM Test Data Management

With the IBM Test Data Management solution, you can browse, edit, and compare data, ensuring that test results align with the original data. With support for complex data models and heterogeneous relationships, it ensures data integrity for application testing and migration.
 

Additionally, IBM TDM provides data privacy features to mask sensitive information while maintaining its validity for testing purposes. There are interfaces for designing, testing, and automating test data management processes, enabling self-service for end users.