When your AI thinks digoxin treats hypertension: Why healthcare AI needs rigorous evaluation
The Balancing Act of Implementing LLMs
AI services are popping up everywhere. If you want access to an LLM, you can use the Claude web app, spin up an n8n instance on your computer, or download a tiny language model onto your phone. Going beyond what is available in, say, the ChatGPT web interface means using LLMs in more customised, repeatable and scalable ways. But the artificial neural networks underlying these technologies are complex, and the ecosystem hasn't yet settled on clear best practices. Implementing AI at scale is a balancing act between the following factors:
- Cost
- Quality
- Privacy
- Engineering workload
- Speed
Using the latest ChatGPT or Claude versions will give you quality and speed. However, in the absence of an (expensive) enterprise agreement guaranteeing data siloing, they come with a possibly unacceptable tradeoff in privacy: sending identifiable patient data to them risks violating data protection regulations like POPIA and GDPR. Building your own LLM machine can give you privacy, but your consumer GPU likely won't be as good or fast as what's available on the cloud. Spinning up a virtual machine in the cloud can allow you to use strong models with low latency, but beware the day your boss gets the invoice. And whichever path you choose, did you price in the engineer staring sadly at the inexplicable Linux error logs, because you decided to rig your own tooling? Each choice trades some of these factors off against the others.
Where Things Go Wrong
An LLM is, at its core, a massive set of matrices of numbers, and its outputs are the result of billions of calculations. A human can't comprehend how these calculations result in a plain text output, and even if you could, it would still be very difficult to predict ahead of time where things will go wrong. In our work we've seen some interesting "failure modes" emerging:
Model knowledge gaps:
A lot of knowledge can be instilled in the weights of a model but, as the digoxin example in the title shows, even very recent models can say things that would place them in mortal danger on an internal medicine ward round. The challenge is twofold. Firstly, there is an electronic Dunning-Kruger effect: LLMs can be stupidly confident in their ignorance. Secondly, you don't know where a knowledge gap will affect your particular use case until you test for it.
System architecture failures:
You can level up your LLM infrastructure with a RAG database and agentic search functions, both increasing and tightening the available knowledge base so the LLM does not speak out of turn. But now your unpredictable system is open to even more potential mayhem. It can give incorrectly formatted answers breaking your system, retrieve irrelevant chunks (because how do you even decide what similarity score is enough?), or just ignore your finely manicured knowledge base and hallucinate its own answer anyway.
Weird, rare effects:
We've seen some bizarre errors that LLMs can make, like spelling "contraception" as "contracception", "contracepion", and "contracepption", or "ceftriaxone" becoming "keftriaxone" or "Cef-Triatzone". Try building reliable downstream search or analytics when your model can't even spell the medication consistently.
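One way to catch spelling drift like this before it reaches search or analytics is to map every extracted term onto a controlled vocabulary with fuzzy matching. The sketch below is a minimal illustration using Python's standard `difflib`. The vocabulary, the `normalise_term` helper and the 0.8 cutoff are assumptions made for this example, not a description of any production system.

```python
import difflib

# Hypothetical controlled vocabulary; a real system would load a formulary
# or terminology list maintained by a domain expert.
CANONICAL_TERMS = ["ceftriaxone", "contraception", "digoxin"]


def normalise_term(term, vocabulary=CANONICAL_TERMS, cutoff=0.8):
    """Map a possibly misspelled term onto the vocabulary.

    Returns (canonical_term, was_exact). If nothing in the vocabulary is
    similar enough, returns (None, False) so the term can go to manual review.
    """
    cleaned = term.lower().replace("-", "").strip()
    if cleaned in vocabulary:
        return cleaned, True
    # The cutoff is an assumed starting point and needs tuning on real data.
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    return None, False


for raw in ["keftriaxone", "Cef-Triatzone", "contracepption", "aspirin"]:
    print(raw, "->", normalise_term(raw))
```

Logging every non-exact match, rather than silently correcting it, keeps the misspellings visible as a failure mode to count. A cutoff that is too permissive could also merge two genuinely different drug names, so the threshold is itself something a clinician should review.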
Scaling Up the Vibe Checks
If you want to have AI systems in production, you really do have to look at the data produced: the chatbot transcripts, the structured outputs of your unstructured data analysis, the long, arcane logs your low-code system produces. Your subject matter expert needs to be involved in this. One place where vibe-coding really shines is making custom web app UIs to make your logs and traces comprehensible, both to the AI engineer and the domain expert. Here you will see your use-case-specific failure modes emerging, from the common and expected ("Urgh, that prompt is not doing the trick") to the weird and the wonderful ("Where did you get that spelling?", "Why would you lie about this? I trust you and believe in you, and I told my boss you were the best model"). Failure mode patterns emerge and can be categorised. Sometimes some pedantic structure checks are all you need. Sometimes LLM 2 needs to rat out LLM 1's hallucinations. Your RAG system might confidently cite a source that says the opposite of what the answer claims. Groundedness needs to be checked and verified, and similarity score thresholds need to be tweaked*.
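A "pedantic structure check" can be as small as the sketch below: parse the model's output as JSON and confirm that the expected fields are present with the expected types. The field names in `REQUIRED_FIELDS` and the `check_structure` helper are hypothetical, chosen only to illustrate the idea.

```python
import json

# Hypothetical schema for a structured extraction task:
# field name -> expected Python type(s).
REQUIRED_FIELDS = {
    "medication": str,
    "dose_mg": (int, float),
    "indication": str,
}


def check_structure(raw_output, required=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected in required.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


good = '{"medication": "digoxin", "dose_mg": 0.125, "indication": "atrial fibrillation"}'
chatty = "Sure! Here is the JSON you asked for:"
print(check_structure(good))
print(check_structure(chatty))
```

A check like this says nothing about whether the content is clinically right. It only stops malformed output from breaking the downstream system, which is why it sits alongside, not instead of, expert review.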
Manual review gets you the insights. But it doesn't scale. Eventually an evaluation harness, i.e. the infrastructure that runs LLM evaluations end-to-end, will need to be built and productionised. You'll want to check whether future or cheaper models perform better and should be swapped in. Your beloved foundation model can secretly get quantised**, with a drop in quality for your use case. You need to have concrete checks in place to get as close as possible to a relevant and objective assessment of AI system performance. As Shreya Shankar (UC Berkeley) and co-authors put it in their seminal paper, you are "scaling up your vibe checks" (Shankar et al., "Who Validates the Validators?", UIST 2024).
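At its simplest, an evaluation harness is a fixed set of test cases, a model behind a common interface, and a report that can be compared between runs. The following is a minimal sketch under those assumptions. `EvalCase`, `run_eval` and `stub_model` are hypothetical names, and the substring checks stand in for the expert-written criteria a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    must_contain: List[str]      # phrases a correct answer should include
    must_not_contain: List[str]  # phrases that signal a known failure mode


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> dict:
    """Run every case through the model and collect the failures."""
    failures = []
    for case in cases:
        answer = model(case.prompt).lower()
        missing = [p for p in case.must_contain if p.lower() not in answer]
        forbidden = [p for p in case.must_not_contain if p.lower() in answer]
        if missing or forbidden:
            failures.append(
                {"prompt": case.prompt, "missing": missing, "forbidden": forbidden}
            )
    total = len(cases)
    return {"total": total, "passed": total - len(failures), "failures": failures}


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call; it reproduces the failure from the title.
    return "Digoxin is used to treat hypertension."


cases = [
    EvalCase(
        prompt="What is digoxin used for?",
        must_contain=["atrial fibrillation"],
        must_not_contain=["hypertension"],
    )
]
report = run_eval(stub_model, cases)
print(report["passed"], "of", report["total"], "cases passed")
```

Because the model is just a callable, swapping in a cheaper or newer model means re-running the same cases and comparing reports. Substring matching is deliberately crude (it would wrongly flag "not indicated for hypertension"), which is one reason harnesses often graduate to a second LLM acting as judge, validated against human labels.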
What This Means For You
Building reliable AI in healthcare requires a rare combination of skills: you need engineers who can wrangle LLMs, RAG systems, and evaluation harnesses, but you also need clinicians who know that digoxin will kill your patient before it lowers their blood pressure. Without domain expertise baked into your evaluation pipeline, you're not just missing edge cases, you're missing the cases that matter most. An evaluation harness is only as good as the knowledge feeding it, and in healthcare, you don't know what you don't know until a pharmacist reads your model's output and goes pale.
The cost of getting this wrong isn't a bad product recommendation or a clunky chatbot experience. It's a regulatory breach, a loss of patient trust, or a clinical decision made on confidently wrong information. Investing in a proper evaluation framework — one that combines subject matter expertise with systematic AI testing — is the cheapest insurance you can buy. We've spent the last year building exactly this: healthcare-specific evaluation systems that catch the failures before your patients do. If you're deploying AI in a clinical or health data context, Wimmy would love to show you how evaluation frameworks that span both health and AI domain expertise are crucial to success.
* Retrieval-augmented generation (RAG) systems work by computing the mathematical similarity between the embeddings (numerical vector representations produced by a language model) of stored text chunks and the embedding of a query (the question to which you seek the answer). This number is the "similarity" (usually cosine similarity). But a high similarity does not necessarily mean that the retrieved data actually answer the question. An answer that is actually supported by the retrieved data is called "grounded", and that is much more difficult to check and prove than the mathematical cosine similarity.
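For readers who want to see the arithmetic, here is a minimal sketch of cosine similarity with a retrieval threshold. The two-dimensional vectors are toy values (real embeddings have hundreds or thousands of dimensions), and the `retrieve` helper and its 0.75 threshold are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunks, threshold=0.75):
    """Keep chunks whose similarity to the query meets the threshold.

    chunks is a list of (text, vector) pairs; the result is a list of
    (score, text) pairs, highest score first.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


# Toy embeddings, made up for the example.
chunks = [
    ("chunk about digoxin", [0.9, 0.1]),
    ("chunk about billing codes", [0.1, 0.9]),
]
print(retrieve([1.0, 0.0], chunks))
```

The threshold only filters on closeness in embedding space. A chunk stating that a drug is not indicated for a condition can score just as highly as one stating that it is, which is why groundedness has to be checked separately.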
** Quantisation is when the numerical precision of a model's weights is reduced (for example, from 16-bit floating point to 8-bit or 4-bit values), with the aim of reducing memory requirements and associated costs while retaining model intelligence. The model intelligence does, however, sometimes suffer the consequences of the loss in precision.
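To make the loss of precision concrete, the sketch below applies one simple scheme, symmetric 8-bit integer quantisation, to a handful of made-up weights. It is an illustration of the general idea only. Production schemes vary, and the function names here are hypothetical.

```python
def quantise_int8(weights):
    """Symmetric quantisation: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(quantised, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantised]


weights = [0.5, -1.27, 0.003]  # made-up values
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f} (error {abs(original - approx):.4f})")
```

Small weights are hit hardest: here 0.003 rounds to zero. Each individual error is tiny, but across billions of weights the errors can add up to a measurable change in behaviour, which is what a regression run of your evaluation harness is there to detect.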


