Published: May 29, 2023
Author(s)
Krishna Khadka (UTA), Jaganmohan Chandrasekaran (Virginia Tech), Yu Lei (UTA), Raghu Kacker (NIST), Richard Kuhn (NIST)
Conference
Name: IEEE International Conference on Software Testing Verification and Validation Workshop (ICSTW 2023)
Dates: 04/16/2023 - 04/20/2023
Location: Dublin, Ireland
Citation: 2023 IEEE International Conference on Software Testing, Verification and Validation Workshop (ICSTW), pp. 228-236
Abstract
Data is a crucial component in machine learning. However, many datasets contain sensitive information such as personally identifiable health and financial data. Access to these datasets must be restricted to avoid potential security concerns. Synthetic data generation addresses this problem by generating artificial data that are similar to, and thus could be used in place of, the original real-world data. This research introduces a synthetic data generation approach called CT-VAE that uses Combinatorial Testing (CT) and a Variational Autoencoder (VAE). We first use the VAE to learn the distribution of the real-world data and encode it in a latent, lower-dimensional space. Next, we use CT to sample the latent space by generating a t-way set of latent vectors, each of which represents a data point in the latent space. A synthetic dataset is generated from the t-way set by decoding each latent vector in the set. Our experimental evaluation suggests that machine learning models trained with synthetic datasets generated using our approach can achieve performance very similar to that of models trained with real-world datasets. Furthermore, our approach performs better than several state-of-the-art synthetic data generation approaches.
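The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the latent space is discretized, which CT tool builds the t-way set, or what the decoder looks like. The sketch assumes a 4-dimensional latent space, three hypothetical levels per dimension (`LEVELS`), a simple greedy covering-set builder (`t_way_set`), and a placeholder `toy_decoder` standing in for a trained VAE decoder.

```python
import itertools
import random


def combos_in(row, t):
    """All (dimension-tuple, level-tuple) combinations that one row covers."""
    return {
        (dims, tuple(row[d] for d in dims))
        for dims in itertools.combinations(range(len(row)), t)
    }


def t_way_set(num_dims, num_levels, t=2, candidates=30, seed=0):
    """Greedy construction of a t-way covering set of level indices.

    Every combination of levels for every choice of t dimensions appears
    in at least one returned row. This is a simple illustrative algorithm,
    not the generator used in the paper.
    """
    rng = random.Random(seed)
    uncovered = {
        (dims, vals)
        for dims in itertools.combinations(range(num_dims), t)
        for vals in itertools.product(range(num_levels), repeat=t)
    }
    rows = []
    while uncovered:
        # Force one still-uncovered combination into every candidate so that
        # each round covers at least one new combination and the loop ends.
        dims, vals = next(iter(uncovered))
        best_row, best_gain = None, 0
        for _ in range(candidates):
            row = [rng.randrange(num_levels) for _ in range(num_dims)]
            for d, v in zip(dims, vals):
                row[d] = v
            gain = len(combos_in(row, t) & uncovered)
            if gain > best_gain:
                best_row, best_gain = tuple(row), gain
        rows.append(best_row)
        uncovered -= combos_in(best_row, t)
    return rows


# Assumption: each latent dimension is discretized into three representative
# values; the paper's actual discretization is not given in the abstract.
LEVELS = [-1.5, 0.0, 1.5]


def toy_decoder(z):
    """Hypothetical stand-in for a trained VAE decoder (latent -> data row)."""
    return [sum(z), z[0] - z[-1]]


rows = t_way_set(num_dims=4, num_levels=len(LEVELS), t=2)
latent_vectors = [[LEVELS[i] for i in row] for row in rows]
synthetic = [toy_decoder(z) for z in latent_vectors]
print(len(rows), "latent vectors cover all 2-way level combinations")
```

With a real model, `toy_decoder` would be replaced by the decoder half of the trained VAE, and the level values would be chosen from the learned latent distribution. Raising `t` covers higher-order interactions between latent dimensions at the cost of a larger set.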
Keywords
synthetic data generation; variational autoencoders; t-way testing; combinatorial testing; latent space sampling; machine learning