Training artificial intelligence through synthetic data

AI companies are generating synthetic data to train machine learning systems.

Why it matters: Using computer-generated data to train AI systems can help address privacy concerns and cut down on bias while meeting the needs of models that operate in highly specific environments.

How it works: A synthetic dataset is artificially created, rather than collected or scraped from the real world.

For a computer vision system being trained on facial recognition, that might mean a dataset of artificially generated human faces in lieu of photos of real people pulled off the internet, often without their explicit consent.
“This allows you to train systems in a completely virtual domain,” says Yashar Behzadi, the CEO of Synthesis AI, which generates synthetic data for computer vision models.

Details: Synthetic data has been used for some time in robotics and autonomous vehicles, which need to be trained with highly specific data — like the precise 3D position of an object — that can be expensive or difficult to pull from the real world.
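To illustrate why such labels come cheaply in a synthetic setting, here is a minimal Python sketch, not any vendor's actual pipeline. It assumes a simple pinhole camera; the function name, camera parameters and position ranges are all hypothetical. Because the generator chooses where the object sits, the 3D label is exact by construction rather than measured.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    position_m: tuple  # exact (x, y, z) in the camera frame, in meters
    pixel: tuple       # (u, v) image coordinates of that point


def make_sample(rng, focal_px=800.0, cx=640.0, cy=360.0):
    """Place a hypothetical object in front of a pinhole camera and label it."""
    # The generator picks the position, so no sensor or annotator is needed.
    x = rng.uniform(-2.0, 2.0)
    y = rng.uniform(-1.0, 1.0)
    z = rng.uniform(2.0, 20.0)  # depth; always in front of the camera
    # Standard pinhole projection: u = f * x / z + cx, v = f * y / z + cy.
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return SyntheticSample((x, y, z), (u, v))


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [make_sample(rng) for _ in range(1000)]
```

A real generator would also render an image for each sample; the sketch only shows that the ground-truth position is known before anything is rendered.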

But as concerns about AI bias and privacy grow, synthetic data is drawing wider interest because datasets can be generated to specification, allowing AI researchers to counter the bias that can be built into the real world.
“If we want to be robust against skin color or skin tone or demographics, any element that may not be well-represented, you can just model your distribution to equally representing each of those categories,” says Behzadi.
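One way to picture what Behzadi describes is a generator that is asked for the same number of samples in every category. The Python sketch below is an illustration under stated assumptions, not Synthesis AI's method: the category names and the `balanced_specs` helper are hypothetical, and a real system would pass each spec to an image generator.

```python
import itertools
import random
from collections import Counter

# Hypothetical attribute categories; a real generator would define its own.
SKIN_TONES = ["tone_1", "tone_2", "tone_3", "tone_4", "tone_5", "tone_6"]
AGE_GROUPS = ["18-29", "30-49", "50-69", "70+"]


def balanced_specs(n_per_cell, rng):
    """Return generation specs with every attribute combination equally represented."""
    cells = itertools.product(SKIN_TONES, AGE_GROUPS)
    specs = [
        {"skin_tone": tone, "age_group": age}
        for tone, age in cells
        for _ in range(n_per_cell)
    ]
    rng.shuffle(specs)  # avoid ordering effects during training
    return specs


specs = balanced_specs(n_per_cell=50, rng=random.Random(0))
# Count how many specs fall into each (skin_tone, age_group) combination.
counts = Counter((s["skin_tone"], s["age_group"]) for s in specs)
```

Here every one of the 24 combinations gets exactly 50 specs, a balance that scraped real-world photos would rarely provide on their own.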

Yes, but: The real world contains outliers that synthetic data generators may not think to cover, which could leave models unprepared for certain situations.

And it’s still up to the generators of synthetic data to ensure that their datasets are fairer than what might be picked up in the real world.

The bottom line: Synthetic data can be even better than the real thing, but only if it’s designed the right way.

Bryan Walsh