Streaming vs. Batch Processing: When to Use What

Every data platform moves data from point A to point B. The difference is when that movement happens.

Batch vs. Streaming in Plain Language

Batch Processing — collect, then process in bulk

Run at scheduled intervals — hourly, nightly, weekly
Process large volumes at once
Nightly ETL, end-of-day reports, monthly aggregations
Tools: Spark, traditional ETL, stored procedures

Stream Processing — process data as it arrives

Event by event, continuously, no waiting
Sub-second to seconds latency
Fraud detection, real-time dashboards, IoT monitoring
Tools: Apache Flink, Kafka Streams, Spark Structured Streaming
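
The contrast between the two models described above can be shown in a few lines. The sketch below is a minimal illustration in plain Python, not the API of any engine listed here; the function names and the payment amounts are invented for the example. The batch function waits for the full set and computes once, while the streaming function produces an updated result after every event.

```python
from typing import Iterable, Iterator, List


def process_batch(events: Iterable[float]) -> float:
    """Batch: collect everything first, then compute once over the whole set."""
    collected: List[float] = list(events)  # wait until the interval's data is gathered
    return sum(collected)                  # one bulk computation at the end


def process_stream(events: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result as each event arrives."""
    running_total = 0.0
    for amount in events:                  # handle one event at a time
        running_total += amount
        yield running_total                # a fresh result is available immediately


payments = [10.0, 25.0, 5.0]               # hypothetical payment amounts
batch_result = process_batch(payments)
stream_results = list(process_stream(payments))
```

Both approaches end at the same total. The difference is that the streaming version had a usable answer after the first event, at the cost of running continuously.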

CDC (Change Data Capture) — stream database changes

Capture every INSERT, UPDATE, DELETE from a source database
Keep the lakehouse in sync with Oracle, PostgreSQL, SAP, and others
Essential for migration and real-time replication
Tools: Debezium, GoldenGate, Qlik Replicate
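
As a rough illustration of what "capture every INSERT, UPDATE, DELETE" means on the receiving side, the sketch below replays a change log against an in-memory replica. The event shape (`op`, `key`, `row`) is a simplified assumption made for this example; real CDC tools such as Debezium define their own envelope format, which is not reproduced here.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_change(replica: Dict[int, Row], event: Dict[str, Any]) -> None:
    """Apply one change event to a replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]        # upsert: the latest row image wins
    elif op == "delete":
        replica.pop(key, None)             # tolerate deletes for rows never seen
    else:
        raise ValueError(f"unknown operation: {op!r}")


# Hypothetical change log, in the order the source database committed the changes.
change_log: List[Dict[str, Any]] = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 2},
]

replica: Dict[int, Row] = {}
for event in change_log:
    apply_change(replica, event)
```

Applying events in commit order is what keeps the replica consistent with the source; replaying them out of order could resurrect a deleted row or overwrite a newer value.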

Batch = doing laundry. You collect dirty clothes all week, then wash everything on Sunday. Efficient per load, but you wait.

Streaming = a conveyor belt at a factory. Each item gets processed the moment it arrives. No waiting, but you need the belt running continuously.

CDC = a security camera on your source database. It records every change as it happens and streams that footage to your lakehouse.

When to Use Each Pattern

Use Case	What You Need	Why
Nightly financial reports	Batch	Data only needs to be current as of close-of-business
Real-time fraud detection	Streaming	Every second of delay = money lost
Dashboard refreshing every hour	Batch	Hourly latency is acceptable, so streaming complexity is not justified
Dashboard with sub-minute data	Streaming	Users expect near-live numbers
Keeping lakehouse in sync with Oracle	CDC (streaming)	Capture every transaction as it happens in the source
Historical analysis, ML model training	Batch	Working with terabytes of historical data — throughput matters, not latency
IoT sensor monitoring	Streaming	Thousands of events per second, need immediate anomaly detection
Regulatory reporting (end of day)	Batch	Regulators want a point-in-time snapshot, not a live feed
Migrating from Oracle to lakehouse	CDC + Batch	Batch for the initial load, CDC to keep the target in sync until cutover

Most enterprises need both batch and streaming. The question is never "batch or streaming?" — it is "what is the ratio, and can your platform handle both without buying two separate products?"
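
The table above boils down to a small decision rule. The sketch below encodes it as a function; the 60-second threshold (taken loosely from the "sub-minute" row) and the parameter names are assumptions for illustration, not a formal sizing guideline.

```python
STREAMING_THRESHOLD_SECONDS = 60  # assumed cut-off, based on the "sub-minute" row


def choose_pattern(
    max_staleness_seconds: float,
    source_is_database: bool = False,
    needs_initial_load: bool = False,
) -> str:
    """Suggest an ingestion pattern from the tolerated data staleness."""
    if source_is_database and needs_initial_load:
        # Migration case: bulk-load history, then follow ongoing changes.
        return "CDC + Batch"
    if max_staleness_seconds < STREAMING_THRESHOLD_SECONDS:
        # Low-latency needs; database sources are followed via CDC.
        return "CDC (streaming)" if source_is_database else "Streaming"
    return "Batch"


hourly_dashboard = choose_pattern(3600)
live_dashboard = choose_pattern(30)
oracle_sync = choose_pattern(5, source_is_database=True)
oracle_migration = choose_pattern(5, source_is_database=True, needs_initial_load=True)
```

In practice the threshold depends on cost and operational maturity as much as on latency, so the constant is a starting point for discussion rather than a fixed rule.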

The Modern Pattern: Streaming + Iceberg

The industry is converging on a standard architecture: stream data into Apache Iceberg tables, query them with SQL engines, maintain them with Spark.

The Streaming-to-Iceberg Pipeline:

Source DB (Oracle, PG, MySQL)
  → CDC / Flink (capture changes)
  → Iceberg Tables (equality-delete files)
  → SQL Engines (Impala, StarRocks, Trino)
  → Dashboards (BI tools, reports)

Background Maintenance (Batch):

Spark (compaction, cleanup)
  → Iceberg Tables (merge delete files, optimize)

Here is what each piece does:

Flink + CDC (Debezium) — Captures every change from source databases and writes them as streaming events into Iceberg tables. Updates create "equality-delete" files — Iceberg's way of saying "this old row is replaced by this new one."
Iceberg Tables — The single source of truth. Both streaming-ingested data and batch-loaded data land in the same tables. ACID transactions. Time travel. Schema evolution.
SQL Engines (Impala, StarRocks, Trino) — Query the Iceberg tables with standard SQL. No need to know whether the data arrived via batch or streaming.
Spark (maintenance) — Runs in the background to compact small files created by streaming, merge equality-delete files, and keep query performance sharp.
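
To make the role of equality-delete files concrete, the following sketch models how a reader merges data files and delete files. It is a deliberately simplified model, not Iceberg's implementation: the class names are invented, deletes match on a single key column, and partition scoping is ignored. The one rule it does take from the Iceberg table specification is that an equality delete applies only to data files with a strictly lower sequence number, which is why a replacement row written in the same commit as its delete survives.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set


@dataclass
class DataFile:
    sequence_number: int
    rows: List[Dict[str, Any]]


@dataclass
class EqualityDeleteFile:
    sequence_number: int
    key_column: str
    deleted_keys: Set[Any]


def read_table(
    data_files: List[DataFile], delete_files: List[EqualityDeleteFile]
) -> List[Dict[str, Any]]:
    """Return live rows after applying equality deletes at read time."""
    live: List[Dict[str, Any]] = []
    for data_file in data_files:
        for row in data_file.rows:
            deleted = any(
                d.sequence_number > data_file.sequence_number  # only older data is affected
                and row[d.key_column] in d.deleted_keys
                for d in delete_files
            )
            if not deleted:
                live.append(row)
    return live


# Commit 1 inserts two rows; commit 2 updates id=1 (one delete file plus one data file).
data_files = [
    DataFile(1, [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]),
    DataFile(2, [{"id": 1, "status": "shipped"}]),
]
delete_files = [EqualityDeleteFile(2, "id", {1})]

live_rows = read_table(data_files, delete_files)
```

Every query pays for this merge until maintenance runs. Compaction conceptually performs the same merge once and writes the surviving rows to new data files, so later reads no longer need the delete files.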

How the Alphyn Lakehouse Handles Both

Alphyn ships both batch and streaming built in, with unified Iceberg storage underneath.

Batch stack

Spark 4.0.1 — ETL, ML pipelines, Iceberg maintenance (compaction, cleanup)
Impala + LPSQL — Batch stored procedures migrated from Oracle PL/SQL
Flex Loader (Alphyn DR) — Batch extraction from source systems (Oracle, SAP, files)
Airflow — Orchestration and scheduling of all batch workflows

Streaming / CDC stack

Alphyn ASM (Analytical Stream Manager) — Built on Apache Flink + Debezium
CDC from: Oracle, PostgreSQL, MS SQL, MySQL, SAP, MongoDB
Writes to Iceberg — Streaming data lands in the same tables as batch data
Equality-delete optimization — Alphyn's Iceberg innovation for efficient streaming writes

What is the equality-delete optimization?

When Flink streams CDC updates into Iceberg, each commit that contains updates writes small equality-delete files alongside new data files. At scale, thousands of tiny delete files pile up, and because query engines must apply them at read time, query performance degrades. Stock Spark struggles to compact them efficiently. Alphyn's Iceberg fork includes optimized equality-delete compaction that keeps streaming tables performant, a differentiator that other Iceberg platforms struggle to match.
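
A back-of-the-envelope estimate shows how quickly these files accumulate. The sketch below assumes that the Flink sink commits to Iceberg once per checkpoint and that, in the worst case, every parallel writer emits one data file and one equality-delete file per commit. The function name, the default of two files per writer, and the example settings are all illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400


def small_files_per_day(
    checkpoint_interval_seconds: int,
    writer_parallelism: int,
    files_per_commit_per_writer: int = 2,  # assumed: one data file + one delete file
) -> int:
    """Upper-bound estimate of files written per day by a streaming upsert job."""
    if checkpoint_interval_seconds <= 0 or writer_parallelism <= 0:
        raise ValueError("interval and parallelism must be positive")
    commits_per_day = SECONDS_PER_DAY // checkpoint_interval_seconds
    return commits_per_day * writer_parallelism * files_per_commit_per_writer


# Hypothetical job: 60-second checkpoints, 4 parallel writers.
files_fast = small_files_per_day(60, 4)
# Same job with 5-minute checkpoints and 2 writers.
files_slow = small_files_per_day(300, 2)
```

Under these assumptions a single table collects 11,520 files per day with one-minute checkpoints, against 1,152 with the slower settings. Longer checkpoint intervals reduce the file count but increase latency, which is why regular compaction is needed in either case.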

Vendor Comparison: Streaming + Batch Support

Confluent (Kafka)

Verdict: Streaming only

Best-in-class event streaming infrastructure. Apache Kafka for message transport, ksqlDB for stream processing, Flink-based stream processing recently added.

Strength: Gold standard for event streaming and Kafka management
Gap: Not a data platform — no batch, no SQL analytics, no data lake. Must pair with Databricks, Snowflake, or similar.

Databricks

Verdict: Full batch + streaming

Spark Structured Streaming for stream processing. Delta Live Tables for unified batch + streaming ETL pipelines. Auto Loader for incremental file ingestion.

Strength: Unified Spark-based batch + streaming, excellent auto-scaling
Gap: Cloud-only (no on-prem). No procedural SQL. Delta Lake is the primary table format (Iceberg is now also supported). Expensive at scale.

Snowflake

Verdict: Near-real-time (not true streaming)

Snowpipe for near-real-time ingestion (micro-batch). Snowpipe Streaming for lower-latency ingestion. Dynamic Tables for incremental materialized views.

Strength: Simple, fully managed, good enough for near-real-time use cases
Gap: Not true streaming (seconds-to-minutes latency). Cloud-only. Consumption-based pricing adds up fast.

Cloudera CDP

Verdict: Full batch + streaming

Spark for batch, Flink for streaming (Cloudera Stream Processing), Kafka, NiFi for data flow. Full Hadoop-era stack modernized.

Strength: Complete stack — batch, streaming, Kafka, Flink, NiFi all included
Gap: Complex to operate. Expensive licensing. Kubernetes wrapper (not native). Heavy operational overhead. Not Iceberg-first.

Starburst (Trino)

Verdict: Query only — no streaming

Query engine only. Can read from Kafka topics via Trino connector (read-only, not stream processing). No CDC. No stream-to-Iceberg pipeline.

Strength: Can federate queries across streaming and batch sources
Gap: No processing, no CDC, no ingestion pipeline. Must buy and integrate a separate streaming platform.

Dremio

Verdict: Query only — no streaming

Query engine with Arctic (Iceberg catalog) for table management. No native streaming or CDC. Must pair with external Flink/Kafka/Debezium.

Strength: Good Iceberg catalog (Arctic) and query acceleration
Gap: Analytics only — no data ingestion, no streaming, no processing. Similar gap to Starburst.

ClickHouse

Verdict: Kafka ingestion only

Kafka table engine for consuming from Kafka topics. Materialized views for continuous aggregation. Very fast ingestion and real-time aggregation.

Strength: Extremely fast ingestion and aggregation from Kafka
Gap: No CDC, no Flink-equivalent. Limited to Kafka input. No procedural SQL. Engine-specific storage format (not Iceberg).

CelerData (StarRocks)

Verdict: Kafka ingestion only

StarRocks Routine Load for continuous ingestion from Kafka topics. Fast real-time analytics on ingested data. Single-engine platform.

Strength: Fast real-time analytics on Kafka-ingested data
Gap: No CDC, no Flink, limited to Kafka input. No batch ETL or procedural SQL.

Oracle

Verdict: Mature CDC, proprietary

GoldenGate for CDC (expensive add-on). Full integration with Oracle DB ecosystem. Mature but locked-in.

Strength: GoldenGate is battle-tested CDC for Oracle sources
Gap: Extremely expensive. Oracle-only ecosystem. Proprietary everything. GoldenGate is a separate product with its own licensing.

Teradata

Verdict: No modern streaming

QueryGrid for cross-system queries. Mature batch analytics, legacy streaming capabilities.

Strength: Mature analytics engine for batch workloads
Gap: No modern streaming. No Flink/Kafka native integration. Appliance model. No open-format lakehouse.

Quick Comparison Matrix

Vendor	Batch	Streaming	CDC Built-In	Iceberg	On-Prem
Alphyn	Spark, Impala, LPSQL	Flink (ASM)	Yes (Debezium)	Native	Yes
Confluent	No	Kafka, ksqlDB, Flink	Debezium connectors	No	Yes
Databricks	Spark	Structured Streaming	Via partners	Supported	No
Snowflake	Yes	Snowpipe (micro-batch)	No	Supported	No
Cloudera CDP	Spark, Hive	Flink, NiFi	NiFi-based	Supported	Yes
Starburst	Query only	No	No	Query only	Yes
Dremio	Query only	No	No	Arctic catalog	Yes
ClickHouse	Yes	Kafka consumer	No	No	Yes
CelerData	Limited	Kafka consumer	No	Read only	Yes
Oracle	Yes	No	GoldenGate ($$$)	No	Yes
Teradata	Yes	Limited	No	No	Yes (appliance)

Why an Integrated Stack Wins

When an organization buys a query-only engine (Starburst, Dremio) or a streaming-only platform (Confluent), it still has to buy, integrate, and operate the missing pieces. That means multiple vendors, multiple contracts, and multiple support queues, plus the inevitable "it broke between Flink and Trino and neither vendor will own it" finger-pointing.

An integrated stack avoids this entirely:

Unified Iceberg storage — Streaming data and batch data land in the same tables. No data silos. One source of truth.
Unified security — Ranger policies apply to all data, whether it arrived via CDC, Spark ETL, or manual load. No security gaps between components.
Equality-delete optimization — An optimized Iceberg fork efficiently handles the small-file problem that streaming creates. Other platforms either struggle with this or require expensive manual tuning.
No integration tax — ASM (Flink) writes to the same Iceberg tables that Impala, StarRocks, and Spark read. No connectors, no glue code, no "data pipeline engineering."
On-prem and air-gapped — Unlike Databricks and Snowflake, a self-hosted lakehouse runs where the data lives. Streaming is included, not a cloud add-on.

Questions Worth Asking

The following questions are useful for understanding whether an existing architecture has a streaming gap — and what closing that gap would unlock.

"How do you get data from your operational systems — Oracle, SAP, PostgreSQL — into your analytics platform today? Is it a nightly extract, or something more continuous?"

"Is there a delay between when a transaction happens in your source system and when it appears in reports? How long is that delay — and is it acceptable?"

"Do you have any real-time requirements today? Fraud detection, live dashboards, IoT monitoring, compliance alerts?"

"Are you running Kafka or any event streaming infrastructure today? If so, what consumes from it?"

"If you could get your Oracle data into the lakehouse within seconds of a transaction, what use cases would that unlock for you?"

That last question is particularly revealing — it shifts the conversation from "do we need streaming?" to "what would we do with it?" and typically surfaces high-value use cases that haven't been pursued simply because the current platform can't handle them.
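
The delay raised in the second question can often be measured rather than estimated. The sketch below assumes that each record carries two timestamps, one for the commit in the source system and one for when the row became queryable in the analytics platform; both the record layout and the sample values are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Tuple

# (committed in source, available in analytics)
Record = Tuple[datetime, datetime]


def freshness_lag_seconds(records: List[Record]) -> List[float]:
    """Per-record delay between the source commit and availability for reporting."""
    return [(available - committed).total_seconds() for committed, available in records]


base = datetime(2024, 1, 1, 12, 0, 0)
records: List[Record] = [
    (base, base + timedelta(seconds=2)),
    (base, base + timedelta(seconds=5)),
    (base, base + timedelta(hours=1)),   # a row that waited for the next batch run
]

lags = freshness_lag_seconds(records)
typical_lag = median(lags)
worst_lag = max(lags)
```

Looking at the worst case alongside the median matters here: a pipeline can look near-real-time on average while some rows still wait for the next scheduled load.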