I first encountered the concept of synthetic data back in 2013, while teaching a health informatics course as a tenure-track assistant professor at UNC Charlotte. To help students experience the complexity of Electronic Health Record (EHR) systems, we partnered with a startup that built an educational EHR platform on top of the VA’s open-source system—once considered the gold standard in the industry. It was a great way for students to “feel” the real-world challenges of clinical data.
Fast forward a few years: the idea of synthetic data expanded into the HL7 FHIR community, enabling developers to test interoperability without touching real patient records. That made sense. But later, while working at Merck, I saw startups pitching synthetic data as their core asset to pharma. I stayed collegial, but deep down, I was skeptical: who will trust results generated from synthetic data? The FDA, CMS, or any Health Technology Assessment body?
The argument was compelling: synthetic data can train AI models when real-world data is scarce. It's a concept rooted in machine learning. But doesn't it sound a bit like a perpetual motion machine, with models generating the data that trains other models while no new information enters the system? Too good to be true?
Here’s why I remain cautious:
- Semantic correctness matters. Synthetic data is often generated using generative adversarial networks (GANs), variational autoencoders, or diffusion models. But how do we ensure biomedical plausibility? Can a male patient have breast cancer? (Yes, but rarely.) Can someone have both hypertension and hypotension? What about hundreds of thousands of drug-drug contraindications? These nuances matter.
- Quality metrics are unclear. How do we evaluate how “good” a synthetic dataset is? Statistical similarity isn’t enough when clinical decisions are at stake.
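To make these two concerns concrete, here is a minimal sketch of what such checks might look like. Everything in it is illustrative: the field names, the plausibility rules, and the thresholds are hypothetical, not a real clinical standard. It pairs a rule-based semantic-plausibility check with a simple distributional-similarity metric (the two-sample Kolmogorov-Smirnov statistic) comparing a synthetic variable against its real counterpart.

```python
# Hypothetical sketch: two basic checks on a synthetic patient record/table.
# Field names, rules, and values are illustrative only.

def semantic_violations(record):
    """Return a list of biomedical-plausibility rule violations for one record."""
    violations = []
    dx = set(record.get("diagnoses", []))
    # Contradictory diagnoses on the same record
    if {"hypertension", "hypotension"} <= dx:
        violations.append("hypertension and hypotension together")
    # Sex-specific condition: possible but rare, so flag for review rather than reject
    if record.get("sex") == "male" and "breast_cancer" in dx:
        violations.append("male breast cancer (rare; flag for review)")
    return violations

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples (0 = identical)."""
    def ecdf(sample, x):
        # Fraction of the sample less than or equal to x
        return sum(v <= x for v in sample) / len(sample)
    values = sorted(set(real) | set(synthetic))
    return max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in values)

record = {"sex": "male", "diagnoses": ["hypertension", "hypotension"]}
print(semantic_violations(record))          # → ['hypertension and hypotension together']
print(ks_statistic([1, 2, 3], [1, 2, 3]))   # identical marginals → 0.0
```

The point of putting both checks side by side: a KS statistic near zero says the marginal distributions match, yet the record above would pass any purely statistical test while being clinically incoherent. Distributional similarity and semantic correctness are separate axes, and a synthetic dataset needs to be evaluated on both.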
Despite my skepticism, the field is booming—AI hype attracts investment. Today, multiple companies specialize in synthetic healthcare data, and research papers increasingly report models trained on it.
Recently, a viewpoint in The Lancet Digital Health by researchers from Stanford and NIMHD offered a thorough and rigorous discussion on synthetic data. They proposed actionable safeguards for synthetic medical AI: standards for training data, fragility testing during development, and deployment disclosures.
Reference: Koul, Arman, Deborah Duran, and Tina Hernandez-Boussard. "Synthetic data, synthetic trust: navigating data challenges in the digital revolution." The Lancet Digital Health (Nov 30, 2025).
This reminds me of my first blog post in January 2023, when I launched Polygon Health Analytics: "Damn, it is the data!" Three years later, LLMs and generative AI have changed the world, yet the shortage of high-quality, real-world data in healthcare and biomedicine remains.
If we want AI to truly revolutionize these fields (and I believe it can), our top priority must be collaborative efforts to make high-quality real-world data accessible.