Test data
Test data is the set of inputs used to verify the correctness, performance, and reliability of software systems. It covers positive and negative scenarios, edge cases, and realistic user scenarios, and it is chosen to exercise different aspects of the software in order to uncover bugs and validate behavior. By designing and executing test cases with appropriate test data, developers can identify and fix defects and check that the software meets its specified requirements. Test data is also used in regression testing to confirm that new code changes or enhancements do not introduce unintended side effects or break existing functionality.
Background
Some data may be used in a confirmatory way, typically to verify that a given set of inputs to a given function produces some expected result. Other data may be used in order to challenge the ability of the program to respond to unusual, extreme, exceptional, or unexpected input.[1]
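The distinction between confirmatory and challenging data can be illustrated with a minimal sketch. The function `parse_age`, its valid range of 0 to 150, and both data sets are hypothetical and chosen only for illustration.

```python
def parse_age(text: str) -> int:
    """Hypothetical function under test: parse a human age from text."""
    value = int(text.strip())  # raises ValueError on non-numeric input
    if not 0 <= value <= 150:
        raise ValueError(f"age out of range: {value}")
    return value


# Confirmatory data: known inputs paired with the expected result.
CONFIRMATORY = [("0", 0), ("42", 42), (" 7 ", 7), ("150", 150)]

# Challenging data: unusual, extreme, or malformed inputs that should be rejected.
CHALLENGING = ["", "abc", "-1", "151", "4.5", "999999999999"]


def run_checks():
    """Return (confirmatory cases passed, challenging inputs correctly rejected)."""
    passed = sum(1 for text, expected in CONFIRMATORY if parse_age(text) == expected)
    rejected = 0
    for text in CHALLENGING:
        try:
            parse_age(text)
        except ValueError:
            rejected += 1
    return passed, rejected
```

Keeping the two data sets separate makes it clear which inputs are expected to succeed and which are expected to fail.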
Test data may be produced in a focused or systematic way (as is typically the case in domain testing), or by using other, less-focused approaches (as is typically the case in high-volume randomized automated tests). Test data may be produced by the tester, or by a program or function that aids the tester. Test data may be recorded for reuse or used only once. Test data can be created manually, by using data generation tools (often based on randomness[2]), or be retrieved from an existing production environment. The data set can consist of synthetic (fake) data, but preferably it consists of representative (real) data.[3]
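As a rough sketch of randomized generation with recording for reuse, the following uses Python's standard `random` and `json` modules. The order fields, value ranges, and function names are assumptions made for the example, not part of any particular tool.

```python
import json
import random


def generate_orders(count, seed):
    """Generate hypothetical order records; the same seed gives the same data."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [
        {
            "order_id": i,
            "quantity": rng.randint(1, 20),
            "unit_price_cents": rng.randint(50, 100_000),
            "country": rng.choice(["DE", "FR", "NL", "US"]),
        }
        for i in range(1, count + 1)
    ]


def record(orders):
    """Serialize the data set so it can be stored and reused in later runs."""
    return json.dumps(orders, sort_keys=True)


def replay(recorded):
    """Load a previously recorded data set."""
    return json.loads(recorded)
```

Fixing the seed makes a randomized data set reproducible, while recording it as JSON allows the same data to be reused even if the generator later changes.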
Limitations
Privacy rules and regulations such as GDPR, PCI and HIPAA restrict the use of privacy-sensitive personal data for testing.[4] Anonymized (and preferably subsetted) production data may, however, be used as representative data for test and development.[5] Programmers can also generate mock data, but this has its own limitations: it is not always possible to produce enough fake or mock data for testing.[6]
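A simplified sketch of subsetting and masking production-like rows is shown below. The column names (`name`, `email`, `birth_date`, `order_total`) and the masking choices are hypothetical. Salted hashing is pseudonymization rather than full anonymization, so this sketch should not be read as sufficient for compliance with any particular regulation.

```python
import hashlib
import random


def pseudonymize(value, salt):
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def anonymize_and_subset(rows, fraction, salt, seed=0):
    """Return a random subset of rows with direct identifiers masked or dropped."""
    if not rows:
        return []
    rng = random.Random(seed)
    size = min(len(rows), max(1, round(len(rows) * fraction)))
    subset = rng.sample(rows, size)
    result = []
    for row in subset:
        result.append({
            # Stable reference that still allows joins, without the raw email.
            "customer_ref": pseudonymize(row["email"], salt),
            # Generalize the full birth date (assumed YYYY-MM-DD) to the year.
            "birth_year": row["birth_date"][:4],
            # Assumed non-identifying, so kept unchanged.
            "order_total": row["order_total"],
            # The "name" column is dropped entirely.
        })
    return result
```

Subsetting keeps the test database small, and masking at extraction time means the raw identifiers never reach the test environment.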
AI-generated "synthetic data" is another option for producing test data. AI-powered synthetic data generators learn the patterns and qualities of a sample database. Once the model has been trained, it can produce as much or as little test data as required. AI-generated synthetic data needs additional privacy measures to prevent the model from overfitting, since an overfitted model can reproduce records from the sample data. Some commercially available synthetic data generators come with additional privacy and accuracy controls.
The amount of test data used is determined or limited by considerations such as the time and cost needed to produce it, its quality, and efficiency.
Domain testing
Domain testing is a family of test techniques that focus on the test data. This might include identifying common or critical inputs, representatives of a particular equivalence class model, values that might appear at the boundaries between one equivalence class and another, outrageous values that should be rejected by the program, combinations of inputs, or inputs that might drive the product towards a particular set of outputs.[7]
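Selecting equivalence-class representatives and boundary values can be sketched as follows. The helper `boundary_values` and the rule `is_valid_quantity` (a quantity from 1 to 100) are hypothetical examples, not a standard API.

```python
def boundary_values(low, high):
    """Pick test inputs for an integer field whose valid range is [low, high]."""
    return {
        # One representative per equivalence class: below, inside, above the range.
        "representatives": [low - 10, (low + high) // 2, high + 10],
        # Values on either side of each boundary between classes.
        "boundaries": [low - 1, low, low + 1, high - 1, high, high + 1],
    }


def is_valid_quantity(n):
    """Hypothetical rule under test: quantity must be between 1 and 100."""
    return 1 <= n <= 100


data = boundary_values(1, 100)
# Map each boundary input to the observed result of the rule under test.
results = {n: is_valid_quantity(n) for n in data["boundaries"]}
```

Here `results` shows the rule accepting 1, 2, 99 and 100 and rejecting 0 and 101, which is the behavior a boundary test would check for.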
References
- Weyuker, E. J. (1988-06-01). "The evaluation of program-based software test data adequacy criteria". Communications of the ACM. 31 (6): 668–675. doi:10.1145/62959.62963. ISSN 0001-0782. S2CID 15141475.
- "On testing in DDD". Medium. 2022-04-24. Retrieved 2023-01-24.
- "What is test data and how is it created?". DATPROF. 2019-06-26. Retrieved 2020-04-29.
- "Get GDPR, PCI and HIPAA compliant". DATPROF. 2020-03-03. Retrieved 2020-07-09.
- "Using production data for testing". DATPROF. 2019-10-17. Retrieved 2020-07-09.
- Emam, Khaled El; Mosquera, Lucy; Hoptroff, Richard (2020-05-19). Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. O'Reilly Media. ISBN 978-1-4920-7271-3.
- Fries, Richard C. (2019-08-15). Handbook of Medical Device Design. CRC Press. ISBN 978-1-000-69695-0.