Generative AI continues to dominate business headlines, and the deeper we dig into how to make it work for us, the more complicated it becomes. The large language models (LLMs) behind generative AI platforms require massive amounts of training data, and our human-provided data often needs to be improved before it can be used. Enter synthetic data: data generated by computers to simulate real-world data. As a business leader, you’re likely asking yourself, should I be concerned about synthetic data?
As you probably guessed, the answer to that question is complex. Synthetic data is not new, but generative AI has accelerated the ability to create it, and most of the major players in the GenAI space, including Meta and Google, are using it. There is also simply not enough high-quality, real-world data to train increasingly sophisticated models.
Robust data sets are required to obtain high-quality, accurate results. According to a recent article from Information Week, there is a concern that LLMs will have ingested all of the available real-world data at some point between 2026 and 2032.
Synthetic data is a cost-effective way for companies to build on their existing data sets. According to Kjell Carlsson, head of AI strategy at enterprise AI platform Domino Data Lab, “It can also help with that manual process of going in and labeling the data. You often need very talented, expensive … people to go in and do that. You can offload a lot of that work now to these models and do it effectively if not for free, very, very cheaply.”
Another benefit of synthetic data is a reduced risk of bias, hallucinations, and privacy issues. Synthetic data can address, and in some cases remove, these issues in the training data. Because it does not rely on real-world data, the training data can be based on representations of people’s information instead of real people’s actual data. The same is true of copyrighted information: instead of paying to license this type of data, a company can use synthetic data, which should not violate any IP.
Synthetic data can be useful when training smaller, sometimes specialized, models. For example, a company can use its large language models to create synthetic versions of its own data sets and then train smaller models on them. This approach is faster and less expensive, and it produces models optimized for their specialized tasks.
On the other hand, monitoring the quality of synthetically produced data is important. As the Information Week article states, “Provenance and lineage start to become important because as we move further and further to using AI everywhere, you have to know the quality of the data going in to be able to trust the output.” It is critical to understand where the data came from and how it has been transformed as it passes through each process and model.
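To make provenance and lineage concrete, here is a minimal Python sketch of how a data team might record where each data set came from and trace it back to its sources. Everything in it is hypothetical and for illustration only: the `DatasetRecord` fields, the data set names, and the `in-house-llm` generator label are assumptions, not part of any particular governance tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry describing a data set (hypothetical schema)."""
    dataset_id: str
    origin: str                      # "real" or "synthetic"
    generator: Optional[str] = None  # model that produced it, if synthetic
    parents: Tuple[str, ...] = ()    # data sets this one was derived from
    transformation: str = ""         # short note on what was done


def trace_lineage(dataset_id: str, catalog: Dict[str, DatasetRecord]) -> List[DatasetRecord]:
    """Return the data set and everything it was derived from, itself first."""
    lineage, seen, stack = [], set(), [dataset_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = catalog[current]
        lineage.append(record)
        stack.extend(record.parents)
    return lineage


def has_real_root(dataset_id: str, catalog: Dict[str, DatasetRecord]) -> bool:
    """True if at least one ancestor (or the data set itself) is real-world data."""
    return any(r.origin == "real" for r in trace_lineage(dataset_id, catalog))


# Illustrative catalog: one real data set and two synthetic generations.
catalog = {
    "crm_2023": DatasetRecord("crm_2023", "real"),
    "crm_synth_v1": DatasetRecord(
        "crm_synth_v1", "synthetic", generator="in-house-llm",
        parents=("crm_2023",), transformation="names and emails replaced",
    ),
    "crm_synth_v2": DatasetRecord(
        "crm_synth_v2", "synthetic", generator="in-house-llm",
        parents=("crm_synth_v1",), transformation="rebalanced by region",
    ),
}

for record in trace_lineage("crm_synth_v2", catalog):
    print(record.dataset_id, record.origin, record.transformation)
```

With a record like this attached to every data set, a reviewer can answer the two questions raised above before a model is trained: where the data came from, and what was done to it along the way.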
Human oversight and testing are a must for any synthetic data set. Without this step, the biases mentioned earlier could be amplified in the final product instead of eliminated from it. You also need to monitor whether IP is being infringed upon. If the output does use copyrighted material, the liability will still rest with you.
One of the most talked-about concerns about synthetic data is model collapse. According to Information Week, “If AI models are continuously trained on data created by AI, they could potentially become less and less reliable. The cycle of AI models ingesting only content created by AI conjures images of a snake devouring its own tail. In this case, the snake would continue to devour until it only spits out gibberish.”
This is a genuine concern because it can result in irreversible defects in the resulting models. The best way to prevent this is through a strong data governance process. Humans must implement the proper checks and balances and maintain strict protocols.
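The idea behind model collapse can be illustrated with a toy simulation. The Python sketch below is not how LLMs are trained; it assumes a very simple "model" (a bell curve described by a mean and a spread) that is repeatedly refit to samples drawn from its own previous version. The function name, the sample size of 10, and the 200 generations are all arbitrary choices made for illustration. The `real_fraction` parameter controls how much fresh real-world data is mixed back in at each generation.

```python
import random
import statistics


def simulate_generations(generations, sample_size, real_fraction, seed=0):
    """Refit a toy Gaussian 'model' each generation and track its spread.

    real_fraction is the share of each training set drawn from the real
    distribution (mean 0, spread 1); the rest comes from the previous model.
    """
    rng = random.Random(seed)
    model_mean, model_std = 0.0, 1.0  # generation 0 matches the real world
    n_real = round(sample_size * real_fraction)
    history = []
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(model_mean, model_std)
                     for _ in range(sample_size - n_real)]
        training = real + synthetic
        # "Train" the next model: estimate mean and spread from the data.
        model_mean = statistics.fmean(training)
        model_std = statistics.pstdev(training)
        history.append(model_std)
    return history


synthetic_only = simulate_generations(200, 10, real_fraction=0.0)
mixed = simulate_generations(200, 10, real_fraction=0.5)

print("spread after 200 generations, synthetic only:", synthetic_only[-1])
print("spread after 200 generations, half real data:", mixed[-1])
```

In the synthetic-only run, each generation loses a little of the variety in the original data, and the spread shrinks toward zero: the toy equivalent of a model that only repeats itself. Mixing real data back in keeps the spread from vanishing, which is one simple reason governance processes track how much real-world data remains in the training mix.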
Synthetic data is not going away any time soon. If handled properly by humans, it will enable us to build powerful models that evolve with us over time. As the models evolve, so will the need for governance. The learning curve is still steep. Expect fits and starts as the technology matures. And don’t expect it to replace real-world data; the two should be used together.