Generative AI has arrived in medicine.
Normally, when a new device or drug enters the U.S. market, the Food and Drug Administration (FDA) reviews it for safety and efficacy before it becomes widely available.
This process not only protects the public from unsafe and ineffective tests and treatments but also helps health professionals decide whether and how to use new products in their practices.
Unfortunately, the usual approach to protecting the public and helping doctors and hospitals manage new health care technologies won’t work for generative AI.
To realize the full clinical benefits of this technology while minimizing its risks, we will need a regulatory approach as innovative as generative AI itself.
The reasons lie in the nature of the FDA’s regulatory process and of this remarkable new technology. Generally speaking, the FDA requires producers of new drugs and devices to demonstrate that they are safe and effective for very specific clinical purposes.
To comply, manufacturers undertake elaborate and carefully monitored clinical trials to assess the safety and efficacy of their product for one indication — say, reducing blood sugar levels in diabetics or detecting abnormal heart rhythms in patients with heart disease.
A drug or a piece of software might be approved for treating one illness or condition but not for any other until a clinical trial has demonstrated that it works well for that additional indication. Payers, including the federal government, will generally pay for new treatments only after they receive FDA approval.
And clinicians can be assured that if they use those technologies for FDA-approved uses and in accordance with FDA-approved guidance, they will not be liable for damaging side effects — at least if they warned patients about them.
Clinicians can and do use approved drugs and devices for unapproved, so-called “off-label,” purposes, but they face additional liability risk, and patients’ insurance companies may not pay for the drug or device.
Why won’t this well-established framework work for generative AI?
The large language models (LLMs) that power products like ChatGPT, Gemini, and Claude are capable of responding to almost any type of question, in health care and beyond.
(Disclosure: One of us — Bakul Patel — works for Google, the company that owns Gemini.) In other words, they have not one specific health care use but tens of thousands, and subjecting them to traditional pre-market assessments of safety and efficacy for each of those potential applications would require untold numbers of expensive and time-consuming studies.
To further complicate matters, LLMs’ capabilities are constantly changing as their computational power grows and they are trained on new datasets. It’s as though the chemical composition and effects of a new drug were continuously and endlessly changing from the moment of its approval.
The way forward may involve conceiving of LLMs not as new devices but as novel forms of intelligence. Lest this seem totally bizarre, consider the following.
Society has long experience with regulating human intelligence in health care, assuring that practitioners are qualified to practice their professions.
Physicians and nurses undergo elaborate training and testing before they are allowed to apply their skills. Practicing medicine without a license is a crime, and governments can revoke licenses when clinicians’ intelligence is compromised through illness or substance abuse.
In the human case, the evolution of intelligence is not a problem but a benefit as clinicians learn from experience and from new scientific discoveries. Periodic recertification through testing is increasingly common for physicians who wish to be considered board certified in a particular specialty like cardiology or family practice.
How could this paradigm be applied to generative AI?
Before approving their use by clinicians, government could require that LLMs undergo a prescribed training regimen modeled on the training of physicians and other clinicians.
This could involve exposing the models to specific training materials, then testing their command of those materials through tailored examinations.
LLMs could also be required to undergo a period of supervised application in which expert human clinicians observed and corrected their responses — as medical school faculty do for physicians in training during their internships, residencies, and fellowships. LLMs could further be required to undergo periodic retraining (to keep up with changes in the field) and then retesting to assure that they had mastered the materials.
There would be some novel challenges in this regime that would require research and development to lay the appropriate groundwork.
The performance measures applied to humans might not be fully appropriate for assessing generative AI. Testing would have to set standards for acceptable rates of the errors to which LLMs are prone, such as “hallucinations,” in which the models fabricate answers out of whole cloth.
Because LLMs also tend to change their answers to the same question, they would need to be evaluated for consistency of response and for their ability to take imprecise questions (or “prompts,” as they are called in the field) and respond with relevant and correct answers.
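To make the idea concrete, the sketch below (in Python) shows one way an evaluator might probe two of the properties just described: consistency of answers when the same question is asked repeatedly, and the rate at which a model’s answers depart from a vetted reference. Everything here is illustrative rather than prescriptive: `query_model` is a hypothetical stand-in for whatever LLM interface is under review, the benchmark item is a placeholder for questions that clinician reviewers would actually write, and the simple text matching is only a crude proxy for the expert grading a real certification regime would require.

```python
import random
from collections import Counter

# Hypothetical stand-in for the LLM under review; a real harness would call
# the model's API here. This stub answers noisily so the script runs end to end.
def query_model(prompt: str) -> str:
    return random.choice(["Four.", "four", "The human heart has four chambers."])

# Tiny illustrative benchmark; real items would be written and vetted by
# clinician reviewers, much as board-exam questions are today.
BENCHMARK = [
    {"prompt": "How many chambers does the human heart have?",
     "reference": "four"},
]

def normalize(text: str) -> str:
    """Crude normalization so trivially different phrasings compare equal."""
    return " ".join(text.lower().replace(".", "").split())

def evaluate(n_repeats: int = 5) -> dict:
    consistency_scores, error_flags = [], []
    for item in BENCHMARK:
        answers = [normalize(query_model(item["prompt"])) for _ in range(n_repeats)]
        modal_answer, modal_count = Counter(answers).most_common(1)[0]
        # Consistency: share of repeats that agree with the most common answer.
        consistency_scores.append(modal_count / n_repeats)
        # Error proxy: flag the item if the modal answer does not contain the
        # vetted reference answer (a very rough surrogate for expert grading).
        error_flags.append(normalize(item["reference"]) not in modal_answer)
    return {
        "mean_consistency": sum(consistency_scores) / len(consistency_scores),
        "error_rate": sum(error_flags) / len(error_flags),
    }

if __name__ == "__main__":
    print(evaluate())
```

A regulator would obviously need far richer benchmarks and human adjudication than this toy harness suggests, but the structure of the exercise (repeated prompting, comparison against vetted references, and reporting of error and consistency rates) is the kind of standardized testing the proposed regime would formalize.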
Assessment of LLMs would also require transparency not only about the models’ performance but also about the data on which they were trained and who built them.
This would help inform users of potential biases in the LLMs, including those arising from the financial interests of their creators.
No regulatory regime is likely to guarantee that clinical generative AI performs perfectly. But then, neither does the regulation of human clinical intelligence. Few physicians ace their licensing or board certification exams. And even the most senior and capable clinicians make errors over the course of their careers.
Of course, some might argue that creating this new regulatory framework is an unnecessary intrusion on the freedom of clinicians to decide which new devices to use in their practices.
Our response is that physicians and other clinicians lack the expertise and time to individually evaluate each of the proliferating LLMs that will be available to them.
Some physicians and other clinicians will be able to rely on their hospitals and health systems to provide guidance, but many smaller hospitals and independent physicians won’t have this luxury.
Regulatory review might also help payers, including Medicare, decide which LLM-related services to pay for and reduce the chance that practitioners will be held liable for their use if and when patients suffer adverse events.
In other words, regulatory scrutiny might increase the likelihood that the benefits of LLMs will be realized as clinicians feel they are safe to use.
Technological breakthroughs like generative AI offer huge prospective benefits. But to realize those benefits, we may need breakthroughs in health policy as dramatic as generative AI itself.