Beyond APIs: Building a sovereign AI factory

Why Build an AI Factory?
Generative AI has fundamentally changed what’s possible in data consulting. But for many businesses (ours included) the default path of piping sensitive client data through third-party APIs simply isn’t good enough. In many instances, we need AI sovereignty, i.e. full control over the data, the models and the infrastructure.
So we built a structured internal project we’re calling our AI Factory, a systematic exploration of every viable option for running large language models outside the standard API route. The results: now we understand the real trade-offs, have built genuine technical knowledge, and ultimately developed solutions we can offer to our clients with confidence.
Our reasons were straightforward. First, data privacy: we work with sensitive client data, and we need categorical guarantees that client data wouldn’t be used to train or improve third-party models. Second, cost control: cloud API bills can spiral quickly when experimenting at scale. Third, flexibility: we wanted to evaluate the wave of new open-weight models as they emerge, not be locked into whatever a single vendor decides to offer.
Three Paths, One Framework
Rather than committing to a single infrastructure path, we evaluated three distinct approaches in objective head-to-head comparison tests, running each against real medical data.
Option 1 - On-Premises GPU Workstation A desktop server in our office equipped with a consumer-grade GPU (e.g., RTX 4090 or similar), a sufficiently powerful CPU, and adequate system RAM. Full control, no data leaves your infrastructure, unlimited experimentation.
Option 2 - GPU-as-a-Service (Lambda Labs, CoreWeave, Modal) Rent enterprise-grade GPUs (A100, H100) by the hour from specialist providers. No hardware purchase, access to far more powerful compute than local alternatives, while retaining full control over the model stack and inference layer.
Option 3 - Cloud Compute (GCP Vertex AI, AWS Bedrock, Azure) Managed LLM endpoints or self-hosted VMs on hyperscale cloud. Convenient and scalable, but model availability in the South African region is currently limited, a real consideration for data residency requirements.
The Pros & Cons
No single option wins outright. Each approach involves genuine trade-offs that will matter differently depending on a client’s size, risk tolerance, and use case.
Option 1: On-Premises GPU Workstation
✅ Advantages
• Complete data sovereignty: sensitive information never leaves your infrastructure
• Unlimited experimentation at near-zero marginal cost once hardware is provisioned
• Full model flexibility: run any open-weight model, from 7B to 70B parameters
• No dependency on vendor API availability or pricing changes
• Builds genuine internal capability and technical knowledge
❌ Challenges
• Significant upfront hardware cost (R80k–R120k+ per workstation)
• Hardware becomes outdated relatively quickly in a fast-moving field
• Operational overhead: you own the uptime, backups, and security hardening
• Remote access requires engineering investment
• Physical infrastructure overhead: power supply, fire safety, cooling and supply, and physical security are manageable for a small single workstation but become substantial cost and compliance considerations as on-prem infrastructure grows
Option 2: GPU-as-a-Service
✅ Advantages
• Access to enterprise-grade compute (A100, H100) without capital expenditure
• Bridges the gap between on-prem limitations and full cloud convenience
• Full control over the model stack and inference layer
• Scale up or down based on workload without long-term commitment
❌ Challenges
• Costs escalate fast if instances are left running idle
• Still dependent on a third-party provider for infrastructure availability
• Less suitable for always-on, latency-sensitive production workloads
Option 3: Cloud Compute (GCP / AWS / Azure)
✅ Advantages
• Convenient and highly scalable with minimal operational overhead
• Access to managed LLM endpoints and integrated MLOps tooling
• Suitable for global teams and multi-region deployments
❌ Challenges
• As of the time of writing, GPU availability and major model endpoints in South African regions remain limited and inconsistent across providers, which is a key consideration for data residency requirements.
• Data residency requirements may not be met without in-country infrastructure
• Pricing changes or API deprecations are outside your control
• Least flexibility when it comes to running custom or fine-tuned open-weight models
Agents
Here’s where the project gets genuinely interesting. Most conversations about LLM infrastructure focus on single-turn inference, one prompt in, one response out. But the real power of modern AI systems comes from agents: models that can reason, take actions, call tools, and iteratively loop until a task is complete.
A key enabler of this kind of agentic architecture is the Model Context Protocol (MCP). MCP is an open standard, originally introduced by Anthropic and now governed under the Agentic AI Foundation, that defines how AI models communicate with external tools, data sources, and services. Think of it as a universal interface layer: instead of building custom integrations for every tool an agent might need to call (a database, a file system, a web browser, a code executor), MCP provides a standardised way for models to discover and invoke those capabilities at runtime. This is what allows agents to go beyond text generation and actually do things, retrieve documents, run queries, trigger workflows, or call APIs in a structured and auditable way. The significance for sovereign infrastructure is that MCP servers can be hosted entirely on-premises, meaning agents don’t jeopardise data privacy guarantees.
Agents change everything about the infrastructure equation, and they’re central to how we think about the AI Factory.
Orchestrating smaller models instead of relying on one large one
One of the most compelling patterns we’ve explored is multi-agent orchestration by chaining several smaller, specialised models rather than routing everything through a single large one. Imagine a document processing pipeline where one lightweight model extracts structured data, a second classifies intent, and a third drafts a summary. Each model is small enough to run comfortably on a single consumer GPU, but together they handle tasks that would otherwise require a much larger (and far more expensive) model. Agents are what make this orchestration practical, they automate the handoffs between models, taking the structured output of one and passing it as a formatted input to the next, without any human intervention in the loop. What would otherwise require custom glue code for every pipeline becomes a reusable, auditable workflow.
This approach becomes significantly more viable when you control your own infrastructure. When every agent call hits a paid API endpoint, costs compound fast. On-prem or GPU-as-a-service provide the best value for money in this case.
Automated benchmarking with evaluation agents
Because no model is perfectly accurate and models vary in the efficiency with which they consume tokens and memory, it is always necessary to carefully evaluate the performance of any model before putting it into production. Manually running benchmarks across model versions and infrastructure options doesn’t scale. Instead, we built evaluation agents that automatically run our problem set against any new model we want to test, collect performance metrics (tokens/sec, quality scores, memory usage), and produce a structured comparison report, without human intervention. As new open-weight models become available (which happens constantly) this gives us a repeatable, auditable way to stay updated.
Agentic workflows as client use cases
Agents also unlock the most compelling client-facing use cases: document ingestion and summarisation pipelines that run overnight, research assistants that pull from internal knowledge bases, code review agents that flag issues before a human ever looks. These aren’t science fiction, they’re achievable today on modest hardware, provided the infrastructure is set up correctly.
Having our own stack means we have prototyped these workflows with real medical data in a controlled environment, validated them, and can now propose the right deployment model, whether that’s on the client’s own hardware, a private cloud instance, or a managed service.
Agents make the case for sovereign infrastructure even stronger
There’s a subtler point here worth calling out explicitly. Agentic workflows amplify every infrastructure cost and risk. A single user prompt that activates a multi-agent workflow might trigger dozens of model calls. Any latency, pricing change, or data policy update at the API layer is multiplied accordingly. The more agentic your AI usage becomes, the stronger the argument for owning or at minimum controlling your inference stack.
“A single agentic task can trigger dozens of model calls. At per-token API pricing, the cost compounds fast. Owning your inference stack isn’t just about privacy, it’s also about making AI-enabled workflows economically viable.”
The Bigger Picture
What this project is really about is building trust infrastructure. As AI becomes central to more of our clients’ enterprise operations, hard questions demand convincing answers: Where does my data go? Who can see it? Can I audit the model? Can I run this without an internet connection? Is this approach POPI and GDPR compliant? What risk for data breaches are we exposed to?
Having worked through all three infrastructure tiers ourselves and built agentic workflows on top of each, we can answer those questions honestly. Not with vendor marketing, but with our own benchmarks and code.
We watch the model landscape closely. Open-weight models are improving dramatically. A model that required a 4×A100 cluster a year ago now runs comfortably on a single consumer GPU. That trajectory, combined with the maturation of agent frameworks, changes what’s achievable for companies that want serious AI capability without cloud dependency.
If you're navigating similar decisions around sovereign AI, data privacy, or turning medical data into secure, scalable agentic systems, let's talk.
Schedule a 30-minute call with our team: https://calendly.com/d/cx2h-9fg-hx9/ai-factory
We built this knowledge specifically to help South African organisations make informed, confident choices in a space that moves faster than most vendor pitch decks let on.


