pharma Bearish 8

AI-Generated Medical Deepfakes Fool Radiologists and Detection Models

· 4 min read · Verified by 2 sources ·

Key Takeaways

  • A groundbreaking study from the Icahn School of Medicine at Mount Sinai reveals that AI-generated X-rays are indistinguishable from real scans to many human experts and AI models.
  • The findings highlight critical vulnerabilities in medical records that could lead to fraudulent litigation and clinical sabotage.

Mentioned

  • OpenAI (company)
  • Google (company, GOOGL)
  • Meta Platforms (company, META)
  • Dr. Mickael Tordjman (person)
  • Icahn School of Medicine at Mount Sinai (company)
  • ChatGPT (product)
  • RoentGen (product)

Key Intelligence

Key Facts

  1. 17 radiologists from 12 hospitals across 6 countries participated in the study.
  2. Radiologists unaware of the study's purpose identified only 41% of AI-generated images.
  3. Detection accuracy for radiologists rose to 75% only after they were informed of the presence of fakes.
  4. AI models including GPT-5 and Gemini 2.5 Pro showed detection accuracy ranging from 57% to 85%.
  5. GPT-4o did not detect every synthetic image it had helped create.
  6. The study used a dataset of 264 X-ray images, 50% of which were synthetic.

Detector Type              Unaware of fakes   Informed of fakes
Human Radiologists         41%                75%
GPT-4o (OpenAI)            N/A                85%
Llama 4 Maverick (Meta)    N/A                57-85% (range)
Gemini 2.5 Pro (Google)    N/A                57-85% (range)

Who's Affected

Hospital Networks (company): Negative
Insurance Providers (company): Negative
AI Developers (company): Neutral
Patients (person): Negative

Analysis

The rapid advancement of generative artificial intelligence has reached a critical threshold in medical diagnostics, where synthetic imagery can be very hard to distinguish from authentic patient data. A study published in the journal Radiology by researchers at the Icahn School of Medicine at Mount Sinai demonstrates that AI-generated X-rays can deceive both veteran radiologists and the same large language models (LLMs) used to create them. This development marks a turning point for the healthcare industry, shifting the conversation from the benefits of AI-assisted diagnostics to the risks of medical deepfakes and the potential for systemic exploitation of digital health records.

The study involved 17 radiologists from 12 hospitals across six countries, tasked with reviewing 264 X-ray images. Half of these images were synthetic, generated by tools such as ChatGPT and RoentGen. The results were startling: when the radiologists were unaware that the dataset contained fakes, they spontaneously identified only 41% of the AI-generated images. Even after being explicitly informed that synthetic images were present, their detection accuracy only rose to 75%. This suggests that even with a high degree of suspicion, a quarter of synthetic medical images would still be accepted as genuine by human experts, posing a direct threat to diagnostic integrity.
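To put those percentages in terms of image counts, the arithmetic can be sketched as below. This is a rough illustration only: it assumes the reported detection rates apply uniformly across all synthetic images in the set, which the study summary does not state, and the function name `missed_fakes` is hypothetical.

```python
# Figures taken from the article: 264 images, half of them synthetic.
TOTAL_IMAGES = 264
SYNTHETIC_SHARE = 0.50


def missed_fakes(detection_rate: float) -> int:
    """Estimate how many synthetic images go undetected at a given rate.

    Assumption (not from the study): the rate applies evenly to every
    synthetic image in the dataset.
    """
    synthetic = TOTAL_IMAGES * SYNTHETIC_SHARE  # 132 synthetic images
    return round(synthetic * (1 - detection_rate))


if __name__ == "__main__":
    # Radiologists unaware of the fakes (41% detected).
    print("Missed when unaware:", missed_fakes(0.41))
    # Radiologists told that fakes were present (75% detected).
    print("Missed when informed:", missed_fakes(0.75))
```

Under that assumption, roughly 78 of the 132 synthetic images would pass unnoticed by unaware readers, and about 33 would still pass after readers were warned.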

Perhaps more concerning is the failure of advanced AI models to detect their own output. The researchers tested four major LLMs: OpenAI’s GPT-4o and GPT-5, Google’s Gemini 2.5 Pro, and Meta’s Llama 4 Maverick. Detection accuracy among these models ranged from a low of 57% to a high of 85%. Notably, GPT-4o, the model responsible for creating some of the deepfakes, did not identify every one of its own synthetic creations. This weakness suggests that the industry cannot currently rely on AI as a foolproof gatekeeper against AI-generated misinformation. If the creators of these models cannot build reliable detection mechanisms, the barrier to entry for bad actors remains dangerously low.

The implications of this research extend far beyond the radiology suite, entering the realms of legal liability and insurance fraud. Dr. Mickael Tordjman, the lead researcher, noted that the ability to fabricate a fracture or a pathology that is indistinguishable from a real one creates a high-stakes vulnerability for fraudulent litigation. In an era where medical evidence is the cornerstone of multi-million dollar personal injury and malpractice claims, the introduction of undetectable synthetic evidence could undermine the entire judicial process. Insurance providers may soon find themselves in a technological arms race, needing to verify the provenance of every digital asset submitted in a claim.

What to Watch

From a cybersecurity perspective, the study highlights a new vector for 'clinical chaos.' If hackers were to gain access to a hospital’s Picture Archiving and Communication System (PACS), they could theoretically inject synthetic images into patient files. This would not only lead to incorrect diagnoses and unnecessary treatments but would also erode the fundamental trust that clinicians place in the digital medical record. The potential for widespread disruption is immense, as a single compromised network could cast doubt on the validity of thousands of patient histories, forcing a return to manual verification or redundant testing.

Looking forward, the medical community must prioritize the development of digital safeguards. Researchers are already calling for the implementation of invisible watermarks and blockchain-based provenance tracking to embed ownership and history into every medical image. As AI continues to evolve, the 'tip of the iceberg' described by Dr. Tordjman suggests that the industry is entering a period of profound uncertainty. The focus must now shift toward building a 'zero-trust' architecture for medical data, where the authenticity of an image is verified by its digital signature rather than its visual appearance.
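The idea of trusting a cryptographic check rather than visual appearance can be illustrated with a minimal sketch. It is not the researchers' proposal or any vendor's implementation: it uses an HMAC from Python's standard library as a stand-in for a true public-key digital signature, and the key, the placeholder image bytes, and the function names `sign_image` and `verify_image` are all hypothetical.

```python
import hashlib
import hmac


def sign_image(image_bytes: bytes, key: bytes) -> str:
    """Return a hex authentication tag computed over the raw image bytes.

    HMAC-SHA256 is used here for simplicity; a real deployment would more
    likely use public-key signatures so verifiers never hold the signing key.
    """
    return hmac.new(key, image_bytes, hashlib.sha256).hexdigest()


def verify_image(image_bytes: bytes, key: bytes, expected_tag: str) -> bool:
    """Check that the image bytes still match the tag recorded at acquisition."""
    actual_tag = sign_image(image_bytes, key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(actual_tag, expected_tag)


if __name__ == "__main__":
    key = b"hypothetical-demo-key"  # assumption: held by the imaging device
    original = b"placeholder bytes standing in for an X-ray file"
    tag = sign_image(original, key)  # recorded when the scan is taken

    swapped = b"placeholder bytes standing in for a synthetic X-ray"
    print("original accepted:", verify_image(original, key, tag))
    print("swapped image accepted:", verify_image(swapped, key, tag))
```

In this sketch an image injected or altered after acquisition fails verification no matter how convincing it looks, because the check depends only on the bytes and the key.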
