Architecting for Scale: Why Your AI Strategy Is Only as Good as Your Data Infrastructure
The Iceberg Problem
When organisations visualise an AI system, they tend to picture the model: the algorithm, the predictions, the interface. What they do not picture is the 80% of the system that sits below the waterline — the data pipelines, the storage layers, the transformation logic, the feature engineering, the quality checks, the access controls. This is the iceberg problem. The visible part of AI is the model. The part that determines whether the system works in production is the infrastructure beneath it.
In our experience, the ratio of infrastructure work to model work in a well-engineered AI system is roughly 3:1. For every hour spent on model development, three hours are spent on the data systems that feed it. Teams that invert this ratio — spending most of their time on models and treating data infrastructure as an afterthought — consistently struggle to ship.
The Four Layers of AI-Ready Data Infrastructure

Layer 1: Ingestion and Integration
The first layer is the ability to reliably collect data from all relevant sources — operational databases, event streams, third-party APIs, IoT sensors, documents — and bring it into a centralised or federated store. The key properties at this layer are reliability (data arrives completely and on time), idempotency (reprocessing the same data does not create duplicates), and observability (you know immediately when something breaks). In our work with Emerson on industrial sensor systems, the ingestion layer was the most critical component. Acoustic sensors generate continuous high-frequency data streams. A gap in ingestion does not just mean missing data — it means the inference algorithms that depend on continuous signal context produce unreliable outputs. Building a robust, fault-tolerant ingestion layer was a prerequisite for everything else.
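The idempotency property described above can be made concrete with a short sketch. The class and field names below are illustrative, not from any real system: the idea is simply that each record is deduplicated by a stable content-derived key, so replaying a batch after a failure never creates duplicates.

```python
import hashlib
import json

class IdempotentIngester:
    """Toy ingestion sink: reprocessing the same records creates no duplicates."""

    def __init__(self):
        self._seen = set()   # in production this would be a durable key store
        self.store = []      # stands in for the raw zone

    def _key(self, record):
        # Derive a stable deduplication key from the record's content.
        payload = json.dumps(record, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def ingest(self, records):
        accepted = 0
        for record in records:
            key = self._key(record)
            if key in self._seen:
                continue  # already ingested: replay is a no-op
            self._seen.add(key)
            self.store.append(record)
            accepted += 1
        return accepted

batch = [{"sensor": "a1", "ts": 1, "value": 0.42},
         {"sensor": "a1", "ts": 2, "value": 0.44}]
ingester = IdempotentIngester()
ingester.ingest(batch)   # first delivery: 2 records stored
ingester.ingest(batch)   # redelivery of the same batch adds nothing
print(len(ingester.store))  # → 2
```

In a real pipeline the deduplication key would more often be a producer-assigned event identifier, and the seen-set would live in durable storage, but the contract is the same: delivery can be at-least-once as long as processing is exactly-once in effect.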
Layer 2: Storage and Organisation
The second layer is how data is stored and organised. The dominant pattern for AI workloads is the data lakehouse: a storage architecture that combines the flexibility of a data lake (raw, unstructured, schema-on-read) with the reliability of a data warehouse (structured, governed, schema-on-write). The practical implication is that you need at least three zones in your storage architecture:
• Raw zone: immutable, append-only storage of data exactly as it arrived
• Curated zone: cleaned, validated, and standardised data ready for analysis
• Feature zone: pre-computed features optimised for model training and inference
The raw zone is non-negotiable. The ability to replay history — to reprocess raw data with new logic — is what allows you to improve your models over time without losing the ability to audit what happened.

Layer 3: Transformation and Feature Engineering
The third layer is where raw data becomes model-ready features. This is the most labour-intensive layer and the one most frequently built in an ad hoc, undocumented way that creates technical debt. The solution is to treat feature engineering as a first-class engineering discipline:
• Feature pipelines should be versioned, tested, and deployed like application code
• Features should be documented with their business meaning, data lineage, and known limitations
• A feature store should serve features consistently to both training and inference, eliminating training-serving skew
• Feature computation should be separated from model training so features can be reused across multiple models
Training-serving skew — where the features used to train a model differ subtly from the features served at inference time — is one of the most common and most insidious causes of model degradation in production. A well-designed feature store eliminates this class of problem entirely.
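The mechanism by which a feature store eliminates training-serving skew can be sketched in a few lines. This is a deliberately minimal in-memory model with hypothetical names, not a real feature-store API: the point is that training and inference both go through one shared, versioned feature definition, so they cannot silently diverge.

```python
from datetime import datetime, timezone

# A single, versioned feature definition shared by training and inference.
# Because both paths call the same function, their features cannot drift apart.
FEATURE_VERSION = "account_features_v1"

def compute_features(raw):
    """Transform a raw account record into model-ready features."""
    age_days = (raw["as_of"] - raw["signup"]).days
    return {
        "days_since_signup": age_days,
        "txn_per_day": raw["txn_count"] / max(age_days, 1),
    }

class FeatureStore:
    """Minimal in-memory feature store keyed by (version, entity id)."""

    def __init__(self):
        self._rows = {}

    def materialise(self, entity_id, raw):
        # Offline path: the training pipeline writes computed features.
        self._rows[(FEATURE_VERSION, entity_id)] = compute_features(raw)

    def get(self, entity_id):
        # Online path: inference reads exactly the values training saw.
        return self._rows[(FEATURE_VERSION, entity_id)]

raw = {
    "signup": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "as_of": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "txn_count": 62,
}

store = FeatureStore()
store.materialise("acct-7", raw)
online = store.get("acct-7")
assert online == compute_features(raw)  # no skew by construction
print(online)  # → {'days_since_signup': 31, 'txn_per_day': 2.0}
```

Production feature stores add point-in-time correctness, low-latency online serving, and lineage metadata on top of this core idea, but the architectural invariant is the same: one definition, two consumers.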
Layer 4: Governance and Access Control
The fourth layer is governance: who can access what data, under what conditions, with what audit trail. This layer is frequently treated as a compliance checkbox rather than an engineering concern. That is a mistake. In regulated environments — financial services, healthcare, public sector — data governance is not optional. But even outside regulated industries, poor governance creates real problems: data scientists training models on data they should not have access to, production systems reading from development databases, sensitive customer data appearing in model outputs. In our work for Banco de Portugal and the European Commission, data governance was a first-order constraint. Every data access was logged, every transformation was auditable, and every model output could be traced back to its source data. Building these properties in from the start is far cheaper than retrofitting them later.
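Treating governance as an engineering concern means access control and audit logging live in code, at the point of data access, rather than in a policy document. The sketch below is a simplified illustration with invented names: every read passes through one gate that both enforces grants and appends to an audit trail, including denied attempts.

```python
from datetime import datetime, timezone

class AuditedDataAccess:
    """Wrap a data store so every read is authorised and logged."""

    def __init__(self, tables, grants):
        self._tables = tables   # table name -> rows
        self._grants = grants   # user -> set of permitted tables
        self.audit_log = []     # append-only access trail

    def read(self, user, table):
        allowed = table in self._grants.get(user, set())
        # Log before enforcing, so denied attempts are also on record.
        self.audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "table": table,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {table}")
        return self._tables[table]

store = AuditedDataAccess(
    tables={"transactions": [{"amount": 10}]},
    grants={"analyst": {"transactions"}},
)
store.read("analyst", "transactions")     # permitted, and logged
try:
    store.read("intern", "transactions")  # denied, but still logged
except PermissionError:
    pass
print(len(store.audit_log))  # → 2
```

In a real platform the grants would come from an IAM system and the log would go to immutable storage, but the design principle is what matters: there is no code path that touches data without producing an audit record.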
Cloud Architecture Patterns for AI Workloads
The choice of cloud platform matters less than the architectural patterns applied on top of it. That said, there are patterns that consistently work well for AI workloads:
Event-driven ingestion: Use message queues (Pub/Sub, Kafka, Kinesis) to decouple data producers from consumers. This enables real-time processing without tight coupling and provides natural replay capabilities.
Containerised pipelines: Package data transformation and model training jobs as containers (Docker) orchestrated by Kubernetes. This ensures reproducibility, simplifies scaling, and makes it easy to run the same pipeline in development, staging, and production.
Infrastructure as code: Define all data infrastructure — storage buckets, compute clusters, network configuration, access policies — in code (Terraform, Pulumi). This makes infrastructure reproducible, auditable, and version-controlled.
Separation of compute and storage: Use cloud-native storage (GCS, S3, Azure Blob) decoupled from compute. This allows you to scale processing independently of storage and dramatically reduces costs compared to always-on compute clusters.
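The decoupling that event-driven ingestion provides can be shown with nothing more than a queue and two threads. This is a single-process stand-in for Pub/Sub or Kafka, using Python's standard library: the producer publishes and moves on, while the consumer processes at its own pace on the other side of the queue.

```python
import queue
import threading

# The queue decouples producer from consumer: the producer never blocks on
# downstream processing, and the consumer can be scaled out independently.
events = queue.Queue()
processed = []

def producer(readings):
    for r in readings:
        events.put(r)   # publish and move on
    events.put(None)    # sentinel: end of stream

def consumer():
    while True:
        r = events.get()
        if r is None:
            break
        processed.append(r * 2)  # stand-in for transformation logic

worker = threading.Thread(target=consumer)
worker.start()
producer([1, 2, 3])
worker.join()
print(processed)  # → [2, 4, 6]
```

A real message broker adds durability, partitioning, and consumer-group replay on top of this shape, which is what makes the "natural replay capabilities" mentioned above possible: consumers can rewind to an earlier offset and reprocess.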
The Real-Time vs. Batch Decision
One of the most consequential architectural decisions in an AI system is whether to process data in real time or in batch. The answer depends on the latency requirements of the use case, not on what is technically impressive.
Batch processing is appropriate when decisions can tolerate latency of minutes to hours: daily risk scoring, weekly recommendation updates, monthly churn prediction. It is simpler, cheaper, and easier to debug.
Real-time processing is appropriate when decisions must be made in milliseconds to seconds: fraud detection, real-time sensor monitoring, live recommendation systems. It is more complex and more expensive, and should only be chosen when the business case genuinely requires it.
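The complexity difference between the two modes shows up even in a toy example. Below, the same statistic is computed both ways: the batch version is a stateless recomputation over the raw data, while the streaming version must carry running state per event — simpler to reason about in one case, lower latency in the other.

```python
# Batch: recompute from the raw zone on each run. Stateless, easy to debug.
def batch_mean(values):
    return sum(values) / len(values)

# Streaming: maintain running state and emit a fresh answer per event.
# Lower latency, but now there is state to persist, recover, and migrate.
class StreamingMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [10.0, 12.0, 11.0, 13.0]

print(batch_mean(readings))   # → 11.5  (available once per batch run)

sm = StreamingMean()
latest = None
for r in readings:
    latest = sm.update(r)     # available after every single event
print(latest)                 # → 11.5
```

Both arrive at the same answer; the business question is whether an answer per event is worth the operational cost of managing the streaming state.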
In some of our projects with Emerson Electric, real-time processing was non-negotiable — the value of acoustic particle monitoring comes from detecting anomalies as they happen, not hours later. In contrast, the corrosion monitoring application used a hybrid approach: real-time data ingestion with batch-computed analytics, because the business decisions based on corrosion rates operate on a timescale of days, not seconds.

Data Infrastructure as Competitive Advantage
The organisations that will win with AI over the next decade are not necessarily the ones with the best models. They are the ones with the best data infrastructure — the ones that can move from idea to production faster, iterate more quickly, and build systems that improve over time. Data infrastructure is not a cost centre. It is the foundation on which every AI capability you will ever build depends. Investing in it early, building it with engineering discipline, and treating it as a strategic asset is the single highest-leverage thing most organisations can do to improve their AI outcomes.
If you are assessing your current data infrastructure or planning a new AI initiative, we are happy to help you think through the architecture. It is the conversation we wish more clients had with us at the beginning of their AI journey.