The shape of modern data platforms.
A high-level blueprint for the data platform we keep recommending to mid-sized organizations, and the trade-offs behind each layer.
The platform conversation in most organizations is shaped by the loudest vendor in the building. That’s a bad way to architect anything that has to last more than three years. Below is the reference architecture we keep arriving at for mid-sized organizations, not because it’s novel, but because it’s the shape that consistently survives contact with the next migration, the next acquisition, the next regulation.
The five-layer view
We think of a modern data platform as five layers, top to bottom: products, metrics, models, storage and ingestion. Each layer has a clear contract with the one below it, which is how the whole thing stays maintainable as the organization grows.
1. Products
Dashboards, embedded analytics, ML services, AI workflows, anything an end user touches. We treat these as products with owners, roadmaps and adoption metrics, not as one-off deliverables.
2. Metrics
A governed semantic layer where business metrics are defined once and consumed everywhere. dbt’s semantic layer, Cube, LookML: pick one, but pick one. Without this layer, “active customers” means five different things across the org and every meeting starts with a twenty-minute reconciliation.
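To make "defined once, consumed everywhere" concrete, here is a minimal sketch in plain Python. It is not how dbt, Cube or LookML work internally; the `Metric` class, the `METRICS` registry and the particular definition of "active customers" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Mapping

Row = Mapping[str, Any]


@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    compute: Callable[[Iterable[Row]], float]


# Hypothetical registry: the single place where a metric may be defined.
METRICS: Dict[str, Metric] = {}


def register(metric: Metric) -> None:
    # Refuse a second definition so the name cannot drift in meaning.
    if metric.name in METRICS:
        raise ValueError(f"metric {metric.name!r} is already defined")
    METRICS[metric.name] = metric


def count_active_customers(rows: Iterable[Row]) -> float:
    # Assumed definition: distinct customers with at least one recent order.
    return float(len({r["customer_id"] for r in rows if r["orders_last_30d"] > 0}))


register(Metric(
    name="active_customers",
    description="Distinct customers with an order in the last 30 days",
    compute=count_active_customers,
))


def query(name: str, rows: Iterable[Row]) -> float:
    # Every consumer (dashboard, notebook, service) goes through this call.
    return METRICS[name].compute(rows)
```

The design choice worth noting is the duplicate check in `register`: a second team cannot quietly ship its own "active_customers" under the same name.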
3. Models
Curated, tested transformations of raw data into trusted entities. This is where dbt earns its keep. We model in layers (staging, intermediate, marts) so each transformation is small, testable and reviewable.
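dbt models are written in SQL, so the following Python is only an analogy for the staging / intermediate / marts split. The function names, field names and the "lifetime value" entity are hypothetical; the point is that each step is small enough to test on its own.

```python
from collections import defaultdict
from typing import Dict, List


def stg_orders(raw: List[dict]) -> List[dict]:
    """Staging: rename and cast only, no business logic."""
    return [
        {"order_id": int(r["id"]), "customer_id": int(r["cust"]), "amount": float(r["amt"])}
        for r in raw
    ]


def int_customer_totals(orders: List[dict]) -> Dict[int, float]:
    """Intermediate: one reusable aggregation over staged rows."""
    totals: Dict[int, float] = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
    return dict(totals)


def mart_customers(totals: Dict[int, float]) -> List[dict]:
    """Mart: the trusted entity that downstream consumers read."""
    return [
        {"customer_id": customer_id, "lifetime_value": value}
        for customer_id, value in sorted(totals.items())
    ]


raw_orders = [
    {"id": "1", "cust": "7", "amt": "10.5"},
    {"id": "2", "cust": "7", "amt": "4.5"},
    {"id": "3", "cust": "3", "amt": "1"},
]
customers = mart_customers(int_customer_totals(stg_orders(raw_orders)))
```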
The point isn’t one tool. It’s clear contracts between layers so the platform survives the next migration.
4. Storage
The lakehouse or warehouse where data lives. Snowflake, Databricks, BigQuery or Synapse; the choice matters less than the discipline around it: layered storage (bronze / silver / gold), partitioning, cost monitoring, and a clear retention policy.
5. Ingestion
Pipelines that bring source data in. Fivetran or Airbyte for SaaS sources; Kafka, CDC or batch jobs for operational systems. The unglamorous layer where most of the platform’s reliability problems are actually born.
Cross-cutting concerns
Three concerns cut across all five layers and deserve their own owners, not afterthoughts:
- Governance: role-based access, lineage, glossaries, PII tagging. The work that lets you move quickly without taking on compliance risk.
- Observability: freshness SLAs, quality tests, alerting. You should know your pipeline broke before your stakeholder does.
- Cost: query cost, storage cost, compute autoscaling. The bill is a feedback signal; ignored, it compounds.
What we don’t recommend
Reverse-ETL as the primary integration story. Real-time everything by default. Building a custom internal metadata system before you’ve tried the off-the-shelf one. Adopting a third semantic layer because the first two had a single annoying limitation.
The recurring lesson: pick boring, well-supported tools at every layer, invest in the contracts between layers, and resist the urge to re-architect because something new is trending. The compounding value comes from a platform that’s been allowed to mature.
How the layers stay decoupled
The reason this architecture survives migrations is that each layer communicates with the one below it through a stable contract. The products layer doesn’t care which warehouse the metrics layer reads from. The metrics layer doesn’t care whether transformations were written in dbt or SQLMesh. The models layer doesn’t care whether storage is Delta or Iceberg.
Concretely: every metric has a documented definition in the semantic layer; every transformation has a tested contract on its outputs; every storage table has a documented schema with a version. When the organization decides to move the warehouse, the storage and ingestion layers change, but the metrics and products layers don’t even notice. That’s the property that justifies the additional complexity of the layered approach.
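One way to picture a "documented schema with a version" is a small contract object that a table's rows can be checked against. This is a sketch under stated assumptions: `TableContract` and `violations` are hypothetical names, not part of dbt or any warehouse API, and a real platform would more likely express this in its transformation tool's own tests.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableContract:
    table: str
    version: int
    columns: Dict[str, type]  # column name -> expected Python type


def violations(contract: TableContract, rows: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means the contract holds."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        # Report missing columns first, in a stable order.
        for col in sorted(contract.columns.keys() - row.keys()):
            problems.append(f"row {i}: missing column {col!r}")
        # Then report present columns with the wrong type.
        for col, expected in contract.columns.items():
            if col in row and not isinstance(row[col], expected):
                actual = type(row[col]).__name__
                problems.append(f"row {i}: {col!r} is {actual}, expected {expected.__name__}")
    return problems


customers_v1 = TableContract("customers", 1, {"customer_id": int, "email": str})
```

If the warehouse moves, the new storage layer only has to keep producing rows that pass the same contract; nothing above it needs to change.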
Where most platforms quietly rot
Reference architectures look clean on paper. In practice, three failure modes account for most of the rot we’re asked to clean up.
Bypass paths. A team in a hurry connects a BI tool directly to the warehouse, skipping the semantic layer. Six months later there are nine versions of “active customer” in production. The semantic layer is only as valuable as the discipline of insisting that every consumer goes through it.
Unowned pipelines. A pipeline ships, the original engineer moves on, and nobody is named as the owner. Two years later it’s the source of the worst incidents. Every pipeline needs a named owner and a documented SLA from day one, the same way every microservice does.
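The "named owner and documented SLA from day one" rule can be enforced mechanically, for example as a check that runs before a pipeline is deployed. The sketch below assumes a hypothetical `Pipeline` record; where that metadata actually lives (orchestrator config, catalog, repo files) will differ per platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pipeline:
    name: str
    owner: Optional[str] = None                  # a named team or person
    freshness_sla_hours: Optional[float] = None  # the documented SLA


def missing_ownership(pipelines: List[Pipeline]) -> List[str]:
    """Names of pipelines that lack an owner or an SLA."""
    return [
        p.name
        for p in pipelines
        if not p.owner or p.freshness_sla_hours is None
    ]


pipelines = [
    Pipeline("orders_daily", owner="team-sales", freshness_sla_hours=24),
    Pipeline("legacy_export"),                   # nobody claimed it
    Pipeline("crm_sync", owner="team-crm"),      # owner but no SLA
]
offenders = missing_ownership(pipelines)
```

Failing the build when `offenders` is non-empty is one way to keep the rule from eroding.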
Cost drift. Warehouse spend grows by 10% a month for quiet reasons (a poorly indexed view, a forgotten daily refresh on a huge table, a notebook that re-trains an embedding every hour). Without cost observability and a quarterly review, the bill compounds invisibly until it’s an executive-level problem.
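To see why 10% a month becomes an executive-level problem, it helps to run the compounding. The starting figure of 10,000 per month below is a made-up number for illustration; only the 10% growth rate comes from the text.

```python
def compounded_spend(monthly_spend: float, monthly_growth: float, months: int) -> float:
    """Monthly spend after `months` of steady compound growth."""
    return monthly_spend * (1 + monthly_growth) ** months


# Hypothetical starting bill of 10,000 per month, growing 10% monthly.
after_one_year = compounded_spend(10_000, 0.10, 12)
growth_factor = after_one_year / 10_000  # roughly 3.14x
```

Twelve months of unnoticed 10% growth a bit more than triples the monthly bill, which is why a quarterly review catches it at about 1.3x rather than 3x.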
Build vs buy: where to draw the line
For mid-sized organizations, the default should be: buy the layers that are commoditized, build the layer that’s your differentiator. Ingestion (Fivetran, Airbyte), storage (Snowflake, BigQuery, Databricks), orchestration (Airflow, Dagster), even the semantic layer (dbt, Cube, LookML) are mature commodity layers. Building any of these from scratch is almost always a mistake.
What’s worth building: the small set of metrics, models and products that encode your organization’s actual logic. That’s where the leverage lives, and it’s where the work is worth doing well.
Governance, observability and cost in detail
The three cross-cutting concerns deserve more than a bullet point.
Governance. Role-based access wired to your identity provider. PII tagged on ingestion and masked at the semantic layer. Lineage automatically generated from the transformation graph, not maintained by hand. A data catalog that’s honest about which datasets are deprecated and which are trusted. Without these, the platform becomes a liability the day an auditor asks a hard question.
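As an illustration of "tagged on ingestion, masked at the semantic layer", the sketch below masks tagged columns unless the caller's role is allowed to see them. The tag set, role names and `mask_row` helper are assumptions; in practice most warehouses offer native masking policies that would do this work.

```python
from typing import Any, Dict, Mapping, Set

# Hypothetical output of the ingestion step: which columns carry PII.
PII_COLUMNS: Set[str] = {"email", "phone"}

# Hypothetical roles, as they might arrive from the identity provider.
ROLES_WITH_PII_ACCESS: Set[str] = {"support_lead", "compliance"}


def mask_row(row: Mapping[str, Any], role: str) -> Dict[str, Any]:
    """Return the row with PII columns masked unless the role may see them."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(row)
    return {
        col: ("***" if col in PII_COLUMNS else value)
        for col, value in row.items()
    }


row = {"customer_id": 42, "email": "jo@example.com", "plan": "pro"}
analyst_view = mask_row(row, role="analyst")
compliance_view = mask_row(row, role="compliance")
```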
Observability. Schema, freshness, volume and distribution checks at every layer. Alerts routed to the team that owns the failing pipeline, not to a generic mailing list. Runbooks for the top failure modes so on-call isn’t guessing. The goal is a platform where the data team notices problems before downstream consumers do.
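A freshness check with owner routing can be very small. The following is a sketch, assuming each table record carries its own `owner`, `last_loaded_at` and `freshness_sla` fields; those field names and the team names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def stale_alerts(tables: List[dict], now: datetime) -> List[Tuple[str, str]]:
    """Return (owner, table) pairs for every table past its freshness SLA."""
    alerts = []
    for table in tables:
        age = now - table["last_loaded_at"]
        if age > table["freshness_sla"]:
            # Route to the owning team rather than a generic list.
            alerts.append((table["owner"], table["name"]))
    return alerts


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tables = [
    {"name": "orders", "owner": "team-sales",
     "last_loaded_at": now - timedelta(hours=30), "freshness_sla": timedelta(hours=24)},
    {"name": "customers", "owner": "team-crm",
     "last_loaded_at": now - timedelta(hours=1), "freshness_sla": timedelta(hours=24)},
]
alerts = stale_alerts(tables, now)
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test.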
Cost. A dashboard showing spend by team, by pipeline, by warehouse. Autoscaling for ephemeral workloads, right-sizing for steady ones. A quarterly review of the top-spend queries and pipelines. Cost is a feedback signal: if you can’t see it, you can’t respond to it.
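The "spend by team, by pipeline, by warehouse" view is a group-and-sum over billing records. The sketch below assumes records with `team`, `pipeline`, `warehouse` and `cost_usd` fields; those names and the sample figures are hypothetical, and real billing exports vary by vendor.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def spend_by(records: List[dict], dimension: str) -> List[Tuple[str, float]]:
    """Total cost per value of `dimension`, highest spend first."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


records = [
    {"team": "sales", "pipeline": "orders_daily", "warehouse": "wh_small", "cost_usd": 120.0},
    {"team": "ml", "pipeline": "embeddings_hourly", "warehouse": "wh_large", "cost_usd": 900.0},
    {"team": "sales", "pipeline": "crm_sync", "warehouse": "wh_small", "cost_usd": 30.0},
]
by_team = spend_by(records, "team")
by_warehouse = spend_by(records, "warehouse")
```

The same function serves all three cuts of the dashboard, and the first few entries of each result are a natural agenda for the quarterly review.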
What a useful 12-month roadmap looks like
For an organization standing this up from a fragmented baseline:
- Quarter 1. Stand up the warehouse and the first ingestion paths. Build dbt models for the five most important data entities. Wire basic observability and access controls.
- Quarter 2. Introduce the semantic layer with the ten metrics that matter most. Migrate the most-used reports onto it. Begin parallel-running with legacy.
- Quarter 3. Expand to the next tier of data products. Decommission the first wave of legacy reports. Add cost observability and the quarterly review cadence.
- Quarter 4. Stand up the first ML or AI workflow on the platform. Lock in the data-product operating model: intake, SLAs, on-call.
The arc is intentionally unglamorous. The compounding value comes from the discipline of doing each step well, not from the speed of doing them all.