Every data platform moves data from point A to point B. The difference is when that movement happens.
Batch vs. Streaming in Plain Language
Batch Processing — collect, then process in bulk
- Run at scheduled intervals — hourly, nightly, weekly
- Process large volumes at once
- Nightly ETL, end-of-day reports, monthly aggregations
- Tools: Spark, traditional ETL, stored procedures
Stream Processing — process data as it arrives
- Event by event, continuously, no waiting
- Sub-second to seconds latency
- Fraud detection, real-time dashboards, IoT monitoring
- Tools: Apache Flink, Kafka Streams, Spark Structured Streaming
CDC (Change Data Capture) — stream database changes
- Capture every INSERT, UPDATE, DELETE from a source database
- Keep the lakehouse in sync with Oracle, PostgreSQL, SAP, and others
- Essential for migration and real-time replication
- Tools: Debezium, GoldenGate, Qlik Replicate
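To make the CDC idea concrete, here is a minimal sketch of replaying captured changes against a target. The event shape (`op`, `key`, `row`) is a simplified, hypothetical stand-in; real tools such as Debezium emit richer events with source positions and before/after images.

```python
from typing import Any

# Hypothetical, simplified change-event shape (not the Debezium format).
ChangeEvent = dict[str, Any]


def apply_change(replica: dict[int, dict[str, Any]], event: ChangeEvent) -> None:
    """Apply one INSERT/UPDATE/DELETE event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("INSERT", "UPDATE"):
        # Treat both as an upsert so replaying the same event twice is harmless.
        replica[key] = event["row"]
    elif op == "DELETE":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


events: list[ChangeEvent] = [
    {"op": "INSERT", "key": 1, "row": {"name": "Ada", "balance": 100}},
    {"op": "INSERT", "key": 2, "row": {"name": "Bob", "balance": 50}},
    {"op": "UPDATE", "key": 1, "row": {"name": "Ada", "balance": 80}},
    {"op": "DELETE", "key": 2, "row": None},
]

replica: dict[int, dict[str, Any]] = {}
for event in events:  # events must be applied in source commit order
    apply_change(replica, event)

print(replica)
```

The replica ends up holding only the latest version of each surviving row, which is the state a CDC pipeline aims to maintain in the lakehouse.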
Batch = doing laundry. You collect dirty clothes all week, then wash everything on Sunday. Efficient per load, but you wait.
Streaming = a conveyor belt at a factory. Each item gets processed the moment it arrives. No waiting, but you need the belt running continuously.
CDC = a security camera on your source database. It records every change as it happens and streams that footage to your lakehouse.
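The laundry and conveyor-belt analogies can be sketched in a few lines. The example below is illustrative only: the payment amounts and function names are made up, and a real streaming job would read from a broker rather than a Python list.

```python
from typing import Iterable, Iterator

# Hypothetical payment amounts arriving over the course of a day.
payments = [120.0, 35.5, 900.0, 12.25]


def batch_total(collected: list[float]) -> float:
    """Batch: wait until everything is collected, then process it in one pass."""
    return sum(collected)


def streaming_totals(arriving: Iterable[float]) -> Iterator[float]:
    """Streaming: update and emit the result after every single event."""
    running = 0.0
    for amount in arriving:
        running += amount
        yield running


end_of_day = batch_total(payments)
live = list(streaming_totals(payments))

print(end_of_day)  # one answer, available only after the last payment
print(live)        # an up-to-date answer after each payment
```

Both approaches converge on the same final number. The difference is that the streaming version has a usable answer after every event, at the cost of running continuously.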
When to Use Each Pattern
| Use Case | What You Need | Why |
|---|---|---|
| Nightly financial reports | Batch | Data only needs to be current as of close-of-business |
| Real-time fraud detection | Streaming | Every second of delay = money lost |
| Dashboard refreshing every hour | Batch | Hourly latency is acceptable, so the added complexity of streaming is not justified |
| Dashboard with sub-minute data | Streaming | Users expect near-live numbers |
| Keeping lakehouse in sync with Oracle | CDC (streaming) | Capture every transaction as it happens in the source |
| Historical analysis, ML model training | Batch | Working with terabytes of historical data — throughput matters, not latency |
| IoT sensor monitoring | Streaming | Thousands of events per second, need immediate anomaly detection |
| Regulatory reporting (end of day) | Batch | Regulators want a point-in-time snapshot, not a live feed |
| Migrating from Oracle to lakehouse | CDC + Batch | Batch for the initial load, CDC to keep the lakehouse in sync with ongoing changes until cutover |
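The reasoning in the table can be condensed into a rough rule of thumb. The sketch below is one possible encoding, not a standard: the function name and the 60-second cutoff are assumptions chosen only to match the rows above.

```python
def choose_pattern(max_staleness_seconds: float, source_is_database: bool = False) -> str:
    """Suggest an ingestion pattern from the staleness a use case can tolerate.

    The 60-second cutoff is an illustrative assumption, not an industry rule.
    """
    if max_staleness_seconds >= 60:
        # Minutes or more of staleness is acceptable: a scheduled job is enough.
        return "batch"
    # Tighter freshness needs continuous processing; database sources use CDC.
    return "cdc" if source_is_database else "streaming"


print(choose_pattern(24 * 3600))                    # nightly financial report
print(choose_pattern(3600))                         # hourly dashboard
print(choose_pattern(1))                            # fraud detection
print(choose_pattern(5, source_is_database=True))   # keeping Oracle in sync
```

In practice the cutoff depends on data volume, cost, and how quickly scheduled jobs can complete, so treat the threshold as a starting point for discussion.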
Most enterprises need both batch and streaming. The question is never "batch or streaming?" — it is "what is the ratio, and can your platform handle both without buying two separate products?"
The Modern Pattern: Streaming + Iceberg
The industry is converging on a standard architecture: stream data into Apache Iceberg tables, query them with SQL engines, maintain them with Spark.
The Streaming-to-Iceberg Pipeline:
Source DB (Oracle, PG, MySQL)
→ CDC / Flink (capture changes)
→ Iceberg Tables (equality-delete files)
→ SQL Engines (Impala, StarRocks, Trino)
→ Dashboards (BI tools, reports)
Background Maintenance (Batch):
Spark (compaction, cleanup)
→ Iceberg Tables (merge delete files, optimize)
Here is what each piece does:
- Flink + CDC (Debezium): Captures every change from source databases and writes it as a streaming event into Iceberg tables. An update produces an "equality-delete" entry that marks the old version of the row as deleted by key, plus a new data row holding the current version.
- Iceberg Tables: The single source of truth. Both streaming-ingested and batch-loaded data land in the same tables, with ACID transactions, time travel, and schema evolution.
- SQL Engines (Impala, StarRocks, Trino): Query the Iceberg tables with standard SQL. No need to know whether the data arrived via batch or streaming.
- Spark (maintenance): Runs in the background to compact the small files created by streaming, merge equality-delete files into the data files, and keep query performance sharp.
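How a query engine resolves equality deletes at read time can be sketched as follows. This is a simplified in-memory model, not Iceberg's actual reader: the class names and the `id` key column are assumptions, and the only rule modeled is that an equality delete applies to data files from strictly earlier commits.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    seq: int                # sequence number of the commit that added the file
    rows: list[dict]


@dataclass
class EqualityDeleteFile:
    seq: int
    deleted_ids: set[int]   # values of the equality column ("id" here)


def read_table(data_files: list[DataFile], delete_files: list[EqualityDeleteFile]) -> list[dict]:
    """Return live rows: a row is dropped if a later equality delete matches its id."""
    live = []
    for data in data_files:
        for row in data.rows:
            deleted = any(
                d.seq > data.seq and row["id"] in d.deleted_ids
                for d in delete_files
            )
            if not deleted:
                live.append(row)
    return live


# Commit 1: initial insert. Commit 2: CDC UPDATE of id=1 (delete old row + write new row).
data_files = [
    DataFile(seq=1, rows=[{"id": 1, "status": "pending"}, {"id": 2, "status": "paid"}]),
    DataFile(seq=2, rows=[{"id": 1, "status": "paid"}]),
]
delete_files = [EqualityDeleteFile(seq=2, deleted_ids={1})]

rows = read_table(data_files, delete_files)
print(rows)
```

The old `pending` row is filtered out while the new row written in the same commit survives. Every delete file has to be checked on every read, which is why background compaction matters.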
How the Alphyn Lakehouse Handles Both
Alphyn ships both batch and streaming built in, with unified Iceberg storage underneath.
Batch stack
- Spark 4.0.1 — ETL, ML pipelines, Iceberg maintenance (compaction, cleanup)
- Impala + LPSQL — Batch stored procedures migrated from Oracle PL/SQL
- Flex Loader (Alphyn DR) — Batch extraction from source systems (Oracle, SAP, files)
- Airflow — Orchestration and scheduling of all batch workflows
Streaming / CDC stack
- Alphyn ASM (Analytical Stream Manager) — Built on Apache Flink + Debezium
- CDC from: Oracle, PostgreSQL, MS SQL, MySQL, SAP, MongoDB
- Writes to Iceberg — Streaming data lands in the same tables as batch data
- Equality-delete optimization — Alphyn's Iceberg innovation for efficient streaming writes
What is the equality-delete optimization?
When Flink streams CDC updates into Iceberg, each UPDATE is written as an equality delete for the old row plus a new data row, and every commit adds small delete and data files. At scale, thousands of tiny delete files pile up and degrade query performance, because readers must apply them at query time. Stock Spark struggles to compact them efficiently. Alphyn's Iceberg fork includes optimized equality-delete compaction that keeps streaming tables performant, a real differentiator that other Iceberg platforms struggle to match.
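The effect of compaction can be illustrated with a toy model. This is a sketch of the general idea only, not Alphyn's implementation or Iceberg's rewrite procedure: files are plain tuples, `id` is an assumed key column, and real compaction also has to handle partitioning, target file sizes, and concurrent commits.

```python
# Data file: (sequence_number, rows). Delete file: (sequence_number, deleted ids).
DataFile = tuple[int, list[dict]]
DeleteFile = tuple[int, set[int]]


def live_rows(data_files: list[DataFile], delete_files: list[DeleteFile]) -> list[dict]:
    """Merge-on-read: drop rows matched by an equality delete from a later commit."""
    return [
        row
        for seq, rows in data_files
        for row in rows
        if not any(dseq > seq and row["id"] in ids for dseq, ids in delete_files)
    ]


def compact(data_files: list[DataFile], delete_files: list[DeleteFile]):
    """Rewrite everything into one data file with the deletes already applied."""
    new_seq = max(seq for seq, _ in data_files + delete_files) + 1
    return [(new_seq, live_rows(data_files, delete_files))], []


# One initial load, then 100 streamed updates to the same row (id=1).
data_files: list[DataFile] = [(1, [{"id": i, "v": 0} for i in range(3)])]
delete_files: list[DeleteFile] = []
for seq in range(2, 102):
    delete_files.append((seq, {1}))
    data_files.append((seq, [{"id": 1, "v": seq}]))

before = live_rows(data_files, delete_files)
compacted_data, compacted_deletes = compact(data_files, delete_files)
after = live_rows(compacted_data, compacted_deletes)

print(len(data_files) + len(delete_files), "files before compaction")
print(len(compacted_data) + len(compacted_deletes), "file after compaction")
print(before == after)
```

The table contents are identical before and after, but a reader now opens one file instead of hundreds and has no delete files left to apply.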
Vendor Comparison: Streaming + Batch Support
Confluent (Kafka)
Verdict: Streaming only
Best-in-class event streaming infrastructure. Apache Kafka for message transport, ksqlDB for stream processing, Flink-based stream processing recently added.
- Strength: Gold standard for event streaming and Kafka management
- Gap: Not a data platform — no batch, no SQL analytics, no data lake. Must pair with Databricks, Snowflake, or similar.
Databricks
Verdict: Full batch + streaming
Spark Structured Streaming for stream processing. Delta Live Tables for unified batch + streaming ETL pipelines. Auto Loader for incremental file ingestion.
- Strength: Unified Spark-based batch + streaming, excellent auto-scaling
- Gap: Cloud-only (no on-prem). No procedural SQL. Delta Lake format (now also supports Iceberg). Expensive at scale.
Snowflake
Verdict: Near-real-time (not true streaming)
Snowpipe for near-real-time ingestion (micro-batch). Snowpipe Streaming for lower-latency ingestion. Dynamic Tables for incremental materialized views.
- Strength: Simple, fully managed, good enough for near-real-time use cases
- Gap: Not true streaming (seconds-to-minutes latency). Cloud-only. Consumption-based pricing adds up fast.
Cloudera CDP
Verdict: Full batch + streaming
Spark for batch, Flink for streaming (Cloudera Stream Processing), Kafka, NiFi for data flow. Full Hadoop-era stack modernized.
- Strength: Complete stack — batch, streaming, Kafka, Flink, NiFi all included
- Gap: Complex to operate, with heavy operational overhead. Expensive licensing. Kubernetes deployment is a wrapper around the existing stack rather than a native design. Not Iceberg-first.
Starburst (Trino)
Verdict: Query only — no streaming
Query engine only. Can read from Kafka topics via Trino connector (read-only, not stream processing). No CDC. No stream-to-Iceberg pipeline.
- Strength: Can federate queries across streaming and batch sources
- Gap: No processing, no CDC, no ingestion pipeline. Must buy and integrate a separate streaming platform.
Dremio
Verdict: Query only — no streaming
Query engine with Arctic (Iceberg catalog) for table management. No native streaming or CDC. Must pair with external Flink/Kafka/Debezium.
- Strength: Good Iceberg catalog (Arctic) and query acceleration
- Gap: Analytics only — no data ingestion, no streaming, no processing. Similar gap to Starburst.
ClickHouse
Verdict: Kafka ingestion only
Kafka engine for consuming from Kafka topics. MaterializedView for continuous aggregation. Very fast ingestion and real-time aggregation.
- Strength: Extremely fast ingestion and aggregation from Kafka
- Gap: No CDC, no Flink-equivalent. Limited to Kafka input. No procedural SQL. Stores data in its own native format rather than Iceberg.
CelerData (StarRocks)
Verdict: Kafka ingestion only
StarRocks routine load from Kafka topics. Fast real-time analytics on ingested data. Single-engine platform.
- Strength: Fast real-time analytics on Kafka-ingested data
- Gap: No CDC, no Flink, limited to Kafka input. No batch ETL or procedural SQL.
Oracle
Verdict: Mature CDC, proprietary
GoldenGate for CDC (expensive add-on). Full integration with Oracle DB ecosystem. Mature but locked-in.
- Strength: GoldenGate is battle-tested CDC for Oracle sources
- Gap: Extremely expensive. Oracle-only ecosystem. Proprietary everything. GoldenGate is a separate product with its own licensing.
Teradata
Verdict: No modern streaming
QueryGrid for cross-system queries. Mature batch analytics, legacy streaming capabilities.
- Strength: Mature analytics engine for batch workloads
- Gap: No modern streaming. No Flink/Kafka native integration. Appliance model. No open-format lakehouse.
Quick Comparison Matrix
| Vendor | Batch | Streaming | CDC Built-In | Iceberg | On-Prem |
|---|---|---|---|---|---|
| Alphyn | Spark, Impala, LPSQL | Flink (ASM) | Yes (Debezium) | Native | Yes |
| Confluent | No | Kafka, ksqlDB, Flink | Debezium connectors | No | Yes |
| Databricks | Spark | Structured Streaming | Via partners | Supported | No |
| Snowflake | Yes | Snowpipe (micro-batch) | No | Supported | No |
| Cloudera CDP | Spark, Hive | Flink, NiFi | NiFi-based | Supported | Yes |
| Starburst | Query only | No | No | Query only | Yes |
| Dremio | Query only | No | No | Arctic catalog | Yes |
| ClickHouse | Yes | Kafka consumer | No | No | Yes |
| CelerData | Limited | Kafka consumer | No | Read only | Yes |
| Oracle | Yes | No | GoldenGate (separately licensed) | No | Yes |
| Teradata | Yes | Limited | No | No | Yes (appliance) |
Why an Integrated Stack Wins
When an organization buys a query-only engine (Starburst, Dremio) or a streaming-only platform (Confluent), they still need to buy, integrate, and operate the missing pieces. That means multiple vendors, multiple contracts, multiple support queues — and the inevitable "it broke between Flink and Trino and neither vendor will own it" finger-pointing.
An integrated stack avoids this entirely:
- Unified Iceberg storage — Streaming data and batch data land in the same tables. No data silos. One source of truth.
- Unified security — Ranger policies apply to all data, whether it arrived via CDC, Spark ETL, or manual load. No security gaps between components.
- Equality-delete optimization — An optimized Iceberg fork efficiently handles the small-file problem that streaming creates. Other platforms either struggle with this or require expensive manual tuning.
- No integration tax — ASM (Flink) writes to the same Iceberg tables that Impala, StarRocks, and Spark read. No connectors, no glue code, no "data pipeline engineering."
- On-prem and air-gapped — Unlike Databricks and Snowflake, a self-hosted lakehouse runs where the data lives. Streaming is included, not a cloud add-on.
Questions Worth Asking
The following questions are useful for understanding whether an existing architecture has a streaming gap — and what closing that gap would unlock.
"How do you get data from your operational systems — Oracle, SAP, PostgreSQL — into your analytics platform today? Is it a nightly extract, or something more continuous?"
"Is there a delay between when a transaction happens in your source system and when it appears in reports? How long is that delay — and is it acceptable?"
"Do you have any real-time requirements today? Fraud detection, live dashboards, IoT monitoring, compliance alerts?"
"Are you running Kafka or any event streaming infrastructure today? If so, what consumes from it?"
"If you could get your Oracle data into the lakehouse within seconds of a transaction, what use cases would that unlock for you?"
That last question is particularly revealing — it shifts the conversation from "do we need streaming?" to "what would we do with it?" and typically surfaces high-value use cases that haven't been pursued simply because the current platform can't handle them.