A plain-language guide to three data architectures, why they exist, and why the lakehouse is where everything is heading.
The Three Architectures
Every enterprise stores and analyzes data. Over the past 30 years, three major approaches have emerged — each one a reaction to the limitations of the last.
Data Warehouse — The Organized Filing Cabinet
1990s – 2010s
Structured data only. Everything must be cleaned, formatted, and loaded into predefined tables before anyone can query it. Think of it as a library where every book must be cataloged before it goes on a shelf.
Examples: Oracle, Teradata, SQL Server, Greenplum
Strengths:
- Fast, reliable queries
- Strong governance and security
- ACID transactions (data is always consistent)
- SQL — everyone knows it
Weaknesses:
- Expensive storage (proprietary formats)
- Rigid schemas — changing structure is painful
- Cannot handle semi-structured or unstructured data well (JSON, logs, images)
- Vendor lock-in (your data is in their format)
Data Lake — The Giant Storage Locker
2010s
Store everything — structured, semi-structured, unstructured — cheaply and figure out how to use it later. Think of it as a warehouse where you dump boxes and sort them when you need something.
Examples: Hadoop HDFS, S3 + Hive, Azure Data Lake
Strengths:
- Cheap storage (commodity hardware or cloud object storage)
- Handles any data type — CSV, JSON, Parquet, images, video
- Massive scalability
- Good for data science and machine learning
Weaknesses:
- No transactions — reads and writes can conflict
- Weak governance: "who changed this data?" is often unanswerable
- "Data swamp" problem — data goes in but nobody can find it
- Slow queries — not built for interactive BI
- No schema enforcement — garbage in, garbage stays
Data Lakehouse — The Best of Both Worlds
2020s
Cheap storage of a lake plus the governance, transactions, and query speed of a warehouse. Enabled by open table formats like Apache Iceberg that add structure on top of raw files.
Examples: Databricks, Snowflake (moving toward it), open lakehouse platforms
Strengths:
- ACID transactions on object storage
- Schema evolution — change structure without downtime
- Time travel: query data as it was at any retained snapshot
- Multi-engine — BI, ETL, and ML all on the same data
- Open formats: data stays portable, so vendor lock-in is much lower
- Handles all data types
Weaknesses:
- Newer technology — some teams are still getting familiar with it
- On-premises deployments typically require Kubernetes expertise
Analogy: A data warehouse is a high-end restaurant with a fixed menu — amazing food, but you can only order what's listed, and it's expensive. A data lake is a massive buffet — everything is available, but there's no waiter, the food isn't labeled, and half of it might be stale. A data lakehouse is a high-end restaurant with an unlimited menu: you get the quality and service of the restaurant with the variety of the buffet.
The Evolution Story
Each architecture emerged because the previous one couldn't solve a critical business problem.
Act 1: The Warehouse Era
In the 1990s and 2000s, enterprises needed reliable analytics. Data warehouses from Oracle, Teradata, and others delivered fast, governed, trustworthy queries. Finance teams could get quarterly numbers they trusted.
But then the internet happened. Mobile apps, IoT sensors, social media, log files — suddenly enterprises had 10× or 100× more data, and most of it was unstructured. Warehouses couldn't handle it, and the storage costs were brutal. Customers were paying millions per year just to store data in proprietary formats.
Act 2: The Lake Era
Hadoop and cloud object storage (S3) promised a solution: store everything, cheaply, in any format. Enterprises built massive data lakes. Storage costs dropped by as much as 90%.
But lakes had no rules. No transactions, no governance, no schema enforcement. Data went in and never came out in a useful way. The industry coined the term "data swamp" to describe what most lakes became. Enterprises found themselves running both a warehouse (for trusted analytics) and a lake (for cheap storage and data science) — paying double.
Act 3: The Lakehouse Era
Open table formats like Apache Iceberg changed the game. They add a metadata layer on top of object storage files that enables warehouse-grade features: ACID transactions, schema evolution, time travel, partition pruning. Suddenly you can have warehouse reliability on lake-cheap storage.
That's the lakehouse: one platform that does what both the warehouse and the lake used to do separately. The key enabler is the open table format.
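To make the "metadata layer" idea concrete, here is a minimal Python sketch of how a table format can offer atomic commits on top of immutable files. It is a toy model, not Iceberg's implementation: `ToyCatalog`, `Snapshot`, `append_files`, and the bucket paths are hypothetical names invented for illustration. The point it shows is that writers never modify existing files. They publish a new snapshot by swapping a single pointer, and the swap fails if someone else committed first.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable table state: an ID plus the data files that make up the table."""
    snapshot_id: int
    data_files: tuple


class CommitConflict(Exception):
    """Raised when another writer published a snapshot first."""


class ToyCatalog:
    """Hypothetical catalog that stores one pointer to the current snapshot."""

    def __init__(self) -> None:
        self.current = Snapshot(0, ())
        self._lock = threading.Lock()

    def compare_and_swap(self, expected_id: int, new: Snapshot) -> None:
        # Swapping this pointer is the commit; data files are never modified.
        with self._lock:
            if self.current.snapshot_id != expected_id:
                raise CommitConflict(
                    f"expected snapshot {expected_id}, found {self.current.snapshot_id}"
                )
            self.current = new


def append_files(catalog: ToyCatalog, new_files: list, max_retries: int = 3) -> Snapshot:
    """Optimistic append: build a new snapshot, then try to publish it atomically."""
    for _ in range(max_retries):
        base = catalog.current
        candidate = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        try:
            catalog.compare_and_swap(base.snapshot_id, candidate)
            return candidate
        except CommitConflict:
            continue  # another writer won; rebuild on top of its snapshot
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
reader_view = catalog.current  # a reader pins the snapshot it started with
append_files(catalog, ["s3://example-bucket/orders/a.parquet"])
append_files(catalog, ["s3://example-bucket/orders/b.parquet"])
print(catalog.current.snapshot_id, len(reader_view.data_files))  # prints "2 0"
```

The reader that started before the two commits still sees its original, empty snapshot, which is the consistency a plain data lake cannot give you.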
Side-by-Side Comparison
| Dimension | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage Cost | High (proprietary formats, expensive appliances) | Low (commodity object storage) | Low (same object storage as lake) |
| Query Speed | Fast (optimized indexes, columnar) | Slow (full scans, no optimization) | Fast (Iceberg metadata, caching, multiple engines) |
| Data Types | Structured only (tables, rows, columns) | Any (structured, semi-structured, unstructured) | Any (all types, with schema when you want it) |
| Transactions (ACID) | Yes (core strength) | No (reads/writes can conflict) | Yes (via Iceberg/Delta table format) |
| Schema Flexibility | Rigid (schema-on-write, changes are painful) | Unconstrained (schema-on-read, no enforcement) | Flexible (schema evolution, add/rename columns live) |
| Governance | Strong (built-in access control, audit logs) | Weak (bolted-on, inconsistent) | Strong (unified security model across all engines) |
| Vendor Lock-in | High (proprietary data formats) | Medium (open files, but tooling varies) | Low (open table formats, portable data) |
| Scalability | Limited (scale-up, expensive to add capacity) | High (add nodes cheaply) | High (elastic compute, separate storage and compute) |
| ML / AI Support | Poor (data must be exported to ML tools) | Good (data scientists work directly on lake) | Excellent (ML engines read the same governed data in place) |
What Most Enterprises Have Today
The reality: most large enterprises run both a warehouse and a lake. Sometimes multiple of each. The result is a set of compounding operational problems.
Data Duplication — The same data lives in the warehouse and the lake. Customer records, transaction logs, product catalogs — copied and maintained in two places. Storage costs double.
Inconsistent Results — The warehouse says revenue was $12.3M last quarter. The lake-based dashboard says $12.1M. Which is right? Nobody knows. Trust erodes.
Double Licensing Costs — Paying for an Oracle or Teradata license and a Cloudera or Databricks license. Two vendor contracts, two support teams, two renewal cycles.
Complex ETL Pipelines — An army of data engineers writing and maintaining pipelines to move data between the warehouse and the lake. Fragile, expensive, and the leading source of data quality issues.
Running both a warehouse and a lake is like maintaining two kitchens in the same restaurant — one for appetizers, one for entrees. Ingredients get carried back and forth, some get lost along the way, and you're paying rent on two kitchens. A lakehouse is one kitchen that does everything.
How a Lakehouse Unifies This
A modern lakehouse replaces the warehouse + lake combination with a single, open platform. Here's how the capabilities map:
| Capability | How It's Delivered | What It Replaces |
|---|---|---|
| Storage | All data in Apache Iceberg format on S3-compatible object storage | Proprietary warehouse storage + HDFS/S3 lake |
| Fast BI Queries | Sub-second analytics engine with materialized views and real-time ingestion | The data warehouse (Oracle, Teradata, etc.) |
| Reporting + Procedural SQL | SQL engine with procedural logic support — stored procedures, cursors, batch reporting | Oracle PL/SQL, Teradata BTEQ, SQL Server T-SQL |
| Data Federation | Query engine that spans systems (Postgres, Oracle, Kafka, S3) without moving data | ETL pipelines that copy data between systems |
| ETL + Machine Learning | Distributed processing engine — batch, streaming, and ML model training on the same data | Separate data lake + ML platform (Cloudera, Hadoop) |
| Transactions | ACID transactions via Apache Iceberg on object storage | Warehouse-only transactions |
| Security | One unified security model across all query engines | Separate security configs for warehouse and lake |
| Data Format | Open Apache Iceberg — no vendor lock-in, data is always portable | Proprietary formats that trap your data |
The key architectural insight is separation of storage and compute. Because data is stored in open formats on object storage, any compatible engine can read it. You're not locked into one vendor's query layer — you can run the right tool for each workload while all engines operate on the same governed, consistent dataset.
Positioning Against Current Approaches
Different organizations come from different starting points. Here's how the lakehouse addresses each:
vs. Oracle / Teradata (Warehouse-First Organizations)
A traditional warehouse is reliable but expensive and unable to handle unstructured data. A lakehouse delivers everything the warehouse does, including procedural SQL compatibility, plus lake capabilities, at a fraction of the storage cost. Open formats mean no lock-in going forward.
vs. Hadoop / Cloudera (Lake-First Organizations)
A data lake provides cheap storage but lacks governance and is too slow for interactive BI. A lakehouse adds warehouse-grade governance, ACID transactions, and sub-second query performance on top of existing object storage. It turns the data swamp back into a usable asset.
vs. Snowflake / Databricks (Cloud Lakehouse Platforms)
Cloud lakehouse platforms are excellent but cloud-only. Organizations with data sovereignty requirements, regulatory constraints, or existing on-premises infrastructure need a platform that can run in their data center, in their cloud, or in a hybrid configuration. Open table formats (Iceberg) also ensure your data remains portable regardless of which query engine you use.
vs. Keeping Both Systems
Consolidating onto a single platform eliminates data duplication, removes the ETL pipelines between systems, reduces licensing costs, and produces one version of the truth governed by one security model.
Key Technical Concepts
Apache Iceberg
Apache Iceberg is an open table format for large analytic datasets. It sits between your object storage and your query engines, providing:
- Snapshot isolation — readers see a consistent view while writers commit changes
- Schema evolution — add, rename, or drop columns without rewriting data
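- Hidden partitioning: partition values are derived from column transforms (for example, the day of a timestamp), so queries filter on the original column and never reference partition columns directly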
- Time travel — query any historical snapshot by timestamp or snapshot ID
- Partition pruning — only read the files relevant to your query
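Hidden partitioning and partition pruning from the list above can be pictured with a small Python sketch. This is a toy model under stated assumptions: `ToyPartitionedTable`, `day_transform`, and the `order_ts` column are hypothetical names, and real Iceberg tracks this information in manifest files rather than an in-memory dictionary. Writers supply only rows, the table derives the partition from the timestamp, and a scan skips every partition that cannot contain matching rows.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from a timestamp column."""
    return ts.date()


class ToyPartitionedTable:
    def __init__(self) -> None:
        # partition value -> rows; stands in for per-partition data files
        self.partitions = defaultdict(list)

    def append(self, row: dict) -> None:
        # The writer never names a partition; it is computed from order_ts.
        self.partitions[day_transform(row["order_ts"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Return rows with start <= order_ts < end, plus how many partitions were read."""
        # Pruning: only partitions whose day overlaps the requested range are opened.
        days_to_read = [d for d in self.partitions if start.date() <= d <= end.date()]
        rows = [
            r
            for d in days_to_read
            for r in self.partitions[d]
            if start <= r["order_ts"] < end
        ]
        return rows, len(days_to_read)


table = ToyPartitionedTable()
for day in (1, 2, 3):
    table.append({"order_id": day, "order_ts": datetime(2024, 1, day, 12, 0)})

rows, partitions_read = table.scan(datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 20, 0))
print([r["order_id"] for r in rows], partitions_read, len(table.partitions))  # prints "[2] 1 3"
```

The query filters on `order_ts` alone, yet only one of the three partitions is read.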
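-- Trino-style syntax shown; time travel and rollback syntax varies by query engine
-- Query data as it existed 7 days ago
SELECT * FROM orders FOR TIMESTAMP AS OF (current_timestamp - INTERVAL '7' DAY);
-- Roll back a table to a previous snapshot (the snapshot ID is an example value)
ALTER TABLE orders EXECUTE rollback_to_snapshot(3051729675574597004);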
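The two statements above can be pictured as lookups into an append-only snapshot log. The Python sketch below is a simplified model, not how any engine implements it: `ToySnapshotLog` and its methods are hypothetical names, and snapshot IDs here are small sequential integers rather than the long IDs real tables use. Time travel finds the last snapshot committed at or before a timestamp, and rollback moves the "current" pointer back without deleting history.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


class ToySnapshotLog:
    def __init__(self) -> None:
        self._history = []      # append-only: (committed_at, snapshot_id, files)
        self.current_id = None  # which snapshot readers see by default

    def commit(self, committed_at: datetime, files: list) -> int:
        if self._history and committed_at < self._history[-1][0]:
            raise ValueError("commits must be in time order")
        snapshot_id = len(self._history) + 1
        self._history.append((committed_at, snapshot_id, tuple(files)))
        self.current_id = snapshot_id
        return snapshot_id

    def files_as_of(self, ts: datetime) -> tuple:
        """Time travel: files of the last snapshot committed at or before ts."""
        commit_times = [entry[0] for entry in self._history]
        index = bisect_right(commit_times, ts)
        if index == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._history[index - 1][2]

    def rollback_to(self, snapshot_id: int) -> None:
        """Rollback: repoint 'current' at an older snapshot; history is kept."""
        if not any(entry[1] == snapshot_id for entry in self._history):
            raise LookupError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id

    def current_files(self) -> tuple:
        if self.current_id is None:
            raise LookupError("table has no snapshots yet")
        return self._history[self.current_id - 1][2]


log = ToySnapshotLog()
start = datetime(2024, 1, 1)
first = log.commit(start, ["a.parquet"])
log.commit(start + timedelta(days=8), ["a.parquet", "b.parquet"])

week_old_view = log.files_as_of(start + timedelta(days=1))  # before the second commit
log.rollback_to(first)
restored_view = log.current_files()
```

Because old snapshots are only kept for a retention period in practice, time travel reaches back as far as the snapshots you have chosen to retain.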
Schema-on-Write vs. Schema-on-Read
- Schema-on-write (warehouse): Data is validated and structured before it is stored. Fast reads, strict quality, but inflexible.
- Schema-on-read (lake): Data is stored as-is. Structure is applied when you query it. Flexible ingestion, but no quality guarantees — invalid data is invisible until you try to use it.
- Lakehouse: Uses schema-on-write with schema evolution. Data is validated at write time, but the schema can be changed incrementally without rewriting existing data.
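The lakehouse behavior described above (validate on write, evolve the schema without rewriting data) can be sketched in a few lines of Python. This is an illustrative toy: `ToyTable` and its methods are hypothetical, and the type check is far simpler than a real engine's. It does mirror one real design choice, which is that Iceberg tracks columns by a stable field ID rather than by name, so a rename only touches metadata.

```python
class ToyTable:
    def __init__(self, columns: dict) -> None:
        # field id -> (current column name, type); ids never change once assigned
        self.schema = {
            fid: (name, typ) for fid, (name, typ) in enumerate(columns.items(), start=1)
        }
        self.stored = []  # rows stored as {field_id: value}; never rewritten

    def _ids_by_name(self) -> dict:
        return {name: fid for fid, (name, _) in self.schema.items()}

    def write(self, row: dict) -> None:
        """Schema-on-write: reject unknown columns and wrong types before storing."""
        ids = self._ids_by_name()
        unknown = set(row) - set(ids)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        encoded = {}
        for name, value in row.items():
            fid = ids[name]
            expected_type = self.schema[fid][1]
            if not isinstance(value, expected_type):
                raise TypeError(f"{name} must be {expected_type.__name__}")
            encoded[fid] = value
        self.stored.append(encoded)

    def rename_column(self, old: str, new: str) -> None:
        # Metadata-only change: stored rows are keyed by field id, not name.
        fid = self._ids_by_name()[old]
        self.schema[fid] = (new, self.schema[fid][1])

    def add_column(self, name: str, typ: type) -> None:
        if name in self._ids_by_name():
            raise ValueError(f"column {name} already exists")
        self.schema[max(self.schema, default=0) + 1] = (name, typ)

    def read(self) -> list:
        # Old rows simply return None for columns added after they were written.
        return [
            {name: row.get(fid) for fid, (name, _) in self.schema.items()}
            for row in self.stored
        ]


orders = ToyTable({"id": int, "amt": float})
orders.write({"id": 1, "amt": 9.5})
orders.rename_column("amt", "amount")
orders.add_column("currency", str)
orders.write({"id": 2, "amount": 3.0, "currency": "EUR"})
result = orders.read()
```

After the rename and the added column, the first row reads back as `id=1, amount=9.5, currency=None` even though its stored bytes were never touched.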
Storage-Compute Separation
Traditional warehouses couple storage and compute in the same appliance. You scale both together, even if you only need more of one. In a lakehouse, compute (query engines) and storage (object store) scale independently. You can run multiple compute clusters against the same data simultaneously, or scale storage without touching compute.
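A quick back-of-the-envelope calculation shows why coupling hurts. Every number below is a made-up placeholder (node sizes, storage needs, core counts), and the helper names are hypothetical. The only thing the sketch demonstrates is the arithmetic: with coupled scaling you buy enough nodes to satisfy the larger of the two needs, and the other resource sits idle.

```python
import math


def coupled_nodes(storage_tb: float, cores: int, tb_per_node: float, cores_per_node: int) -> int:
    """Appliance model: each node adds storage and compute together."""
    return max(math.ceil(storage_tb / tb_per_node), math.ceil(cores / cores_per_node))


def separated_units(storage_tb: float, cores: int, tb_per_storage_unit: float, cores_per_compute_unit: int):
    """Lakehouse model: storage units and compute units are sized independently."""
    return (
        math.ceil(storage_tb / tb_per_storage_unit),
        math.ceil(cores / cores_per_compute_unit),
    )


# Hypothetical workload: lots of data, modest query load.
needed_tb, needed_cores = 400, 64
nodes = coupled_nodes(needed_tb, needed_cores, tb_per_node=20, cores_per_node=32)
idle_cores = nodes * 32 - needed_cores
storage_units, compute_units = separated_units(needed_tb, needed_cores, 20, 32)

print(nodes, idle_cores)             # prints "20 576"
print(storage_units, compute_units)  # prints "20 2"
```

In this made-up case the coupled design provisions 640 cores to serve a 64-core workload, while the separated design provisions the same storage with only two compute units.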
Choosing the Right Architecture
The right choice depends on your organization's current state, not a fixed formula. Some questions worth working through:
- What types of data do you need to analyze? If it's purely structured relational data with a stable schema, a traditional warehouse may still be appropriate. Once you also need logs, events, documents, or sensor data, the lakehouse model becomes the stronger fit.
- Do you run both a warehouse and a lake today? If yes, quantify the cost: duplicated storage, ETL engineering time, licensing, and, critically, the cost of inconsistent data. That total is the business case for consolidation.
- What are your data residency and sovereignty requirements? Cloud-only platforms are off the table if data cannot leave a specific data center, or a jurisdiction the provider does not serve. In that case, an open-format, on-premises lakehouse deployment is the practical path.
- How much procedural SQL logic exists in your current warehouse? Stored procedures, cursors, and complex batch reporting are the hardest part of any warehouse migration. Platforms with procedural SQL compatibility significantly reduce that rewrite effort.
- Who consumes your data? If it's only BI dashboards, a fast SQL engine may be sufficient. If it includes data scientists training models and engineers running streaming pipelines, you need multiple compute engines working on the same data without copying it first.
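For the second question in the list above (quantifying what it costs to run both systems), a simple tally is often enough to start the conversation. The Python sketch below is only a template: the class name, the field names, and every figure in the example are hypothetical placeholders to be replaced with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class DualStackCosts:
    """Annual cost components of running a warehouse and a lake side by side."""
    duplicated_storage_tb: float         # data held in both systems
    storage_cost_per_tb_year: float
    etl_engineers: float                 # full-time equivalents on warehouse<->lake pipelines
    cost_per_engineer_year: float
    second_platform_license_year: float
    reconciliation_cost_year: float      # estimated cost of chasing inconsistent numbers

    def annual_total(self) -> float:
        return (
            self.duplicated_storage_tb * self.storage_cost_per_tb_year
            + self.etl_engineers * self.cost_per_engineer_year
            + self.second_platform_license_year
            + self.reconciliation_cost_year
        )


# Placeholder figures for illustration only.
example = DualStackCosts(
    duplicated_storage_tb=200,
    storage_cost_per_tb_year=100,
    etl_engineers=3,
    cost_per_engineer_year=150_000,
    second_platform_license_year=300_000,
    reconciliation_cost_year=50_000,
)
print(example.annual_total())  # prints "820000"
```

The reconciliation line is usually the hardest to estimate and the easiest to leave out, so it is worth writing down even as a rough guess.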
Summary
The data warehouse solved reliability. The data lake solved scale. The data lakehouse solves both — and eliminates the operational complexity of running them side by side.
The enabling technology is the open table format (primarily Apache Iceberg), which adds transactional semantics, schema management, and time travel to plain object storage. The result is a platform where a BI analyst, a data scientist, and an ETL engineer can all operate on the same dataset simultaneously, with consistent results and unified governance.
For organizations evaluating their data infrastructure, the core question is no longer "warehouse or lake" — it's "how do we consolidate onto a single open platform and stop paying for two architectures at once."