Apache Iceberg vs Apache Paimon: Practical Benchmarks for Real-World Lakehouse Workloads

At Alphyn.AI we have accumulated deep hands-on experience running Apache Iceberg at the boundary between traditional batch processing and near-real-time pipelines — specifically with Apache Flink in the mix. When the Flink community released Apache Paimon as a new open table format (OTF), we had to put it to the test. This article documents our benchmarks on production-class infrastructure and the practical conclusions we drew from them.

TL;DR

Paimon helps in specific scenarios, but a broad migration away from Iceberg is premature. Details below.

Background: Pain Points of Open Table Formats in the Data Lake

Open table formats delivered something data engineers had long dreamed of: they married the storage and read efficiency of Apache Parquet with the ability to update records without rewriting entire datasets. The mechanism is Merge-On-Read and deferred deletion — information about which rows to remove is written to deletion files rather than applied immediately. For streaming frameworks such as Flink, this enables near-real-time updates directly in the Data Lake. For batch engines — Spark, Impala, Trino, StarRocks — it dramatically cuts the resource cost of merging incremental data into data marts.

The trade-off is right there in the name: Merge-On-Read shifts the cost of applying deletes and inserts to read time, which has two direct consequences:

Read performance from OTF tables degrades with each successive update cycle;
OTF tables require periodic compaction to merge deletion files back into data files.

Iceberg is fully subject to these dynamics. Our colleagues have written in detail about table maintenance strategies and what we consider the optimal approach to solving this problem.
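The Merge-On-Read mechanism described above can be sketched in a few lines of Python. This is a toy model, not how Iceberg or Paimon store data: the `Table` class, its method names, and the in-memory dictionaries are all hypothetical, and deletes are modeled as row positions per data file.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # file name -> list of (key, value) rows; data files are never edited in place
    data_files: dict = field(default_factory=dict)
    # file name -> set of row positions marked as deleted
    delete_files: dict = field(default_factory=dict)

    def upsert(self, file_name, rows):
        """Write a new data file and mark older versions of its keys as deleted."""
        new_keys = {key for key, _ in rows}
        for name, old_rows in self.data_files.items():
            for pos, (key, _) in enumerate(old_rows):
                if key in new_keys:
                    self.delete_files.setdefault(name, set()).add(pos)
        self.data_files[file_name] = list(rows)

    def scan(self):
        """Merge on read: skip deleted positions while reading each data file."""
        out = {}
        for name, rows in self.data_files.items():
            deleted = self.delete_files.get(name, set())
            for pos, (key, value) in enumerate(rows):
                if pos not in deleted:
                    out[key] = value
        return out

    def compact(self, file_name):
        """Rewrite live rows into one data file and drop all deletion files."""
        live = sorted(self.scan().items())
        self.data_files = {file_name: live}
        self.delete_files = {}


table = Table()
table.upsert("f1", [(1, "a"), (2, "b")])
table.upsert("f2", [(2, "b2"), (3, "c")])  # key 2 in f1 is only marked, not rewritten
before = table.scan()
table.compact("f3")
after = table.scan()
```

The write path stays cheap because `f1` is never rewritten, but every `scan` has to consult the deletion information until `compact` folds it back into the data files. That is the trade-off the two consequences above describe.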

Against that backdrop, we designed a comparison of Paimon and Iceberg across the following dimensions:

MERGE throughput for a new data batch:
- via Apache Spark;
- via Apache Flink;
Full-table SCAN speed via Apache Spark.

We also tracked how both metrics degrade over multiple iterations with no compaction performed. In our view these are the canonical scenarios and tools that most data engineers encounter in practice.

How Paimon Differs from Iceberg

When Spark reads an Iceberg table, each data file becomes a task that reads the file itself along with any associated deletion files. This is the only read implementation Iceberg exposes to Spark.

Paimon took a different approach and introduced two distinct table types:

Merge-On-Write — despite the name, this is a functional equivalent of Iceberg's Merge-On-Read (do not confuse the two!): deletions are written to deletion files, and table scans are parallelized at the data-file level;
Merge-On-Read — has no direct counterpart in Iceberg. In this mode the Paimon table must be bucketed. Bucketing (i.e., hash-partitioning) also exists in Iceberg, where it is an optional tool for improving selectivity on key lookups. In Paimon, buckets play a fundamentally different role: each bucket is an independent LSM tree — there are no separate deletion files, data files are organized into levels, and when reading, the most recently written value for a given key takes precedence. The implication is immediate: read parallelism for such tables is bounded by the number of buckets.
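To make the bucket-as-LSM-tree idea concrete, here is a minimal sketch under simplifying assumptions: keys are routed with a plain modulo (Paimon uses its own hash function), each bucket keeps its writes as an ordered list of runs instead of real leveled files, and all names (`Bucket`, `write_batch`, `max_scan_tasks`) are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # mirrors the bucket count used in the DDL later in this article


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Route a key to a bucket (illustrative modulo, not Paimon's hash)."""
    return key % num_buckets


class Bucket:
    """One bucket: a list of runs, oldest first, with no deletion files."""

    def __init__(self):
        self.runs = []

    def write(self, rows):
        self.runs.append(dict(rows))

    def read(self):
        merged = {}
        for run in self.runs:  # oldest to newest, so newer values overwrite older ones
            merged.update(run)
        return merged


def write_batch(buckets, rows):
    """Group rows by bucket and append one new run per touched bucket."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[bucket_of(key)].append((key, value))
    for bucket_id, part in grouped.items():
        buckets[bucket_id].write(part)


def scan(buckets):
    """One read task per bucket: the merge inside a bucket is not split further."""
    result = {}
    for bucket in buckets:
        result.update(bucket.read())
    return result


def max_scan_tasks(num_buckets, cores):
    """Upper bound on concurrent read tasks for a bucketed table."""
    return min(num_buckets, cores)


buckets = [Bucket() for _ in range(NUM_BUCKETS)]
write_batch(buckets, [(1, "a"), (5, "b"), (2, "c")])
write_batch(buckets, [(1, "a2")])  # update lands as a new run, nothing is rewritten
rows = scan(buckets)
```

Writes never look at existing data, which is why this layout suits streaming ingestion. The cost is on the read side: with 4 buckets and 16 cores, `max_scan_tasks(4, 16)` stays at 4 no matter how much data each bucket holds.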

Our benchmark therefore covers three table variants:

Iceberg Merge-On-Read;
Paimon Merge-On-Read;
Paimon Merge-On-Write.

Test Environment

We ran the experiments on cloud IaaS infrastructure: compute VMs and S3-compatible object storage. Iceberg and Paimon tables were stored in S3 using a file-based catalog. Two virtual machines, each with 16 vCPU, 64 GB RAM, and local disks, hosted the Flink and Spark clusters.

To keep things simple we deployed each cluster in its minimal configuration: one master process (Spark Master / Flink JobManager) and one worker process (Spark Executor / Flink TaskManager) per machine, with the worker having access to the full VM resources. For Spark this also has a meaningful optimization effect on equality-delete file reads (discussed below) — with more executors each running on fewer resources, read efficiency drops considerably.

Software versions used:

Spark 3.5.4;
Flink 1.20;
Paimon 1.10;
Iceberg 1.10 (open source) and Iceberg 1.8 (Alphyn Lakehouse edition).

Data Schema

The Paimon tables use the following schemas:

Merge-On-Write

CREATE TABLE TableMOW (
    ID BIGINT,
    VAL_S STRING,
    VAL_TS TIMESTAMP,
    VAL_DT DATE,
    VAL_DEC DECIMAL(38,12)
) TBLPROPERTIES (
    'primary-key' = 'ID',
    'deletion-vectors.enabled' = 'true',
    'bucket' = '4'
);

Merge-On-Read

CREATE TABLE TableMOR (
    ID BIGINT,
    VAL_S STRING,
    VAL_TS TIMESTAMP,
    VAL_DT DATE,
    VAL_DEC DECIMAL(38,12)
) TBLPROPERTIES (
    'primary-key' = 'ID',
    'write-only' = 'true',
    'bucket' = '4'
);

The Iceberg table schema:

CREATE TABLE TableIcebergMORv3 (
    ID BIGINT,
    BUCKET_ID INT,
    VAL_S STRING,
    VAL_TS TIMESTAMP,
    VAL_DT DATE,
    VAL_DEC DECIMAL(38,12),
    PRIMARY KEY (ID) NOT ENFORCED
) WITH (
    'write.format.default' = 'parquet',
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode' = 'merge-on-read',
    'write.upsert.enabled' = 'true',
    'format-version' = '3'
);

Each table was seeded with 10,000,000 randomly generated records — unique IDs, 1 KB per row — then compacted to a target Parquet file size of 128 MB. The resulting table size in S3 was approximately 7 GB, spread across roughly 60 data files.

The bucket count for Paimon (4) was chosen to be well below the available CPU count on the Spark cluster (16 vCPU) while staying close to Paimon's own sizing guidance. At roughly 7 GB spread over 4 buckets, each bucket holds about 1.75 GB, somewhat above the recommended upper bound of 1 GB:

Paimon bucket sizing guidance

https://paimon.apache.org/docs/master/primary-key-table/overview/

A bucket is the smallest storage unit for reads and writes, so the number of buckets limits the maximum processing parallelism. This number should not be too big, though, as it will result in lots of small files and low read performance. In general, the recommended data size in each bucket is about 200MB — 1GB.
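The sizing figures in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the article and treats "1 KB per row" as 1,000 bytes and the 7 GB table as 7 × 1024 MB. Both are simplifying assumptions, and the variable names are ours.

```python
import math

ROWS = 10_000_000
ROW_BYTES = 1_000        # "1 KB per row", assumed to mean 1,000 bytes
TABLE_GB = 7             # approximate table size in S3 after compaction
TARGET_FILE_MB = 128     # compaction target file size
BUCKETS = 4              # Paimon bucket count from the DDL

raw_gb = ROWS * ROW_BYTES / 1e9                           # uncompressed payload in GB
expected_files = math.ceil(TABLE_GB * 1024 / TARGET_FILE_MB)
per_bucket_gb = TABLE_GB / BUCKETS                        # data volume per Paimon bucket

print(f"raw payload: {raw_gb:.1f} GB")
print(f"expected data files: {expected_files}")
print(f"per bucket: {per_bucket_gb:.2f} GB")
```

The file count lands near the roughly 60 files observed, and the per-bucket volume comes out at 1.75 GB, which is worth keeping in mind when reading the Paimon Merge-On-Read scan results.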

For Iceberg Merge-On-Read we applied no partitioning, which is consistent with how we would handle a table of this size in a real project.

Apache Spark Benchmarks

We generated 21 batches of 1,000,000 random rows (1 KB each, same schema as the target tables, IDs drawn uniformly from 1–10,000,000). We then ran MERGE for each batch in sequence using a simple key equality join (example for the first batch):

MERGE INTO TableIcebergMOR target
USING Source_1_1000000 source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

After each MERGE we performed a full table scan and recorded the elapsed time. Because the ID ranges overlap completely, the total row count never grows — so we get a clean measurement of how the merge operation itself degrades scan performance over time.
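A rough estimate shows how quickly the uncompacted table fills up with superseded rows. The sketch assumes that every batch contains 1,000,000 distinct IDs and that batches are drawn independently of each other. Neither detail is stated explicitly above, so treat the result as an approximation rather than a measured figure.

```python
TABLE_ROWS = 10_000_000
BATCH_ROWS = 1_000_000


def touched_fraction(batches, table_rows=TABLE_ROWS, batch_rows=BATCH_ROWS):
    """Expected share of original rows updated at least once after `batches` merges."""
    p = batch_rows / table_rows          # chance that one batch hits a given key
    return 1 - (1 - p) ** batches        # complement of "missed by every batch"


for k in (1, 5, 10, 21):
    print(f"after {k:2d} batches: {touched_fraction(k):.1%} of seed rows superseded")
```

Under these assumptions close to 90% of the seed rows have been superseded by the last of the 21 batches, so a scan at that point reads mostly dead data from the original files plus the delete information needed to filter it out.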

Results

The charts below show MERGE duration and SCAN duration after each iteration for all three table types:

Key observations:

Paimon Merge-On-Read is the clear loser. As expected, it has the slowest scans and degrades fastest. What was surprising is that MERGE itself also scans the target table, so scan degradation drags MERGE down with it. Logically, an LSM-based table should not need to scan the target to perform a keyed MERGE (Flink writes to Paimon without doing so, for example) — yet Spark does it anyway.
Comparing Paimon Merge-On-Write and Iceberg Merge-On-Read (which, recall, have nearly equivalent storage structures) is more nuanced:
1. At iteration 16, Paimon triggered an auto-compaction during MERGE, visible as the spike on the chart.
2. Up to that point, both formats showed roughly the same scan speed with a similar degradation trend.
3. However, Paimon's MERGE was dramatically slower throughout. The reason:
  1. Both formats need to locate the file position of each row being deleted in order to build deletion vectors;
  2. Iceberg does this efficiently: it leverages Parquet's columnar storage and reads only the ID column used in the MERGE predicate. The amount of data shuffled is correspondingly small, as visible in the query plan:
    Fig. 5: Data shuffle for MERGE in Iceberg
  3. Paimon reads the entire table:
    Fig. 6: Data shuffle for MERGE in Paimon
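The gap between the two shuffle sizes follows directly from column pruning. The back-of-the-envelope sketch below compares reading only the 8-byte `ID` column with reading full rows, using uncompressed sizes. Real Parquet files are compressed and encoded, so the absolute numbers are an upper bound and only the ratio is meaningful.

```python
ROWS = 10_000_000
ROW_BYTES = 1_000   # "1 KB per row", assumed to mean 1,000 bytes
ID_BYTES = 8        # BIGINT key column used in the MERGE predicate


def key_only_mb(rows=ROWS, id_bytes=ID_BYTES):
    """Data touched when only the join key column is read (uncompressed, MB)."""
    return rows * id_bytes / 1e6


def full_rows_gb(rows=ROWS, row_bytes=ROW_BYTES):
    """Data touched when every column of every row is read (uncompressed, GB)."""
    return rows * row_bytes / 1e9


print(f"key column only: {key_only_mb():.0f} MB")
print(f"full rows:       {full_rows_gb():.0f} GB")
print(f"ratio:           {ROW_BYTES / ID_BYTES:.0f}x")
```

A difference of two orders of magnitude in data read and shuffled is consistent with the MERGE durations observed for the two formats.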

This Paimon behavior is something to account for explicitly when designing ETL pipelines. We have added an R&D task to investigate the relevant section of the Paimon source code further.

Spark — Summary

Paimon Merge-On-Read is unfit for production use: it has hard scan parallelism limits and degrades rapidly.
Paimon Merge-On-Write offers no read advantage over Iceberg Merge-On-Read and is substantially slower for MERGE operations.
Overall, the value of Paimon for batch processing on Apache Spark is questionable.

Apache Flink Benchmarks

For the Flink test we created a synthetic stream at 2,000 records per second under a deliberately stressful NRT ODS scenario: 70% of records are updates to existing keys (each decomposed into a RowKind.UPDATE_BEFORE / RowKind.UPDATE_AFTER pair before being written), and 30% are inserts of new keys. Checkpoint interval was set to 5 minutes. After each checkpoint we read the table from Spark and recorded SCAN duration. This models a scenario familiar to any Lakehouse user: Flink writing data in near-real-time (Kafka, CDC), with downstream consumers reading via Spark or SQL engines and expecting consistent, predictable runtimes.
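For orientation, the stream parameters translate into the following volumes per checkpoint. This is plain arithmetic on the numbers above; the function and key names are ours.

```python
RATE_PER_SECOND = 2_000
CHECKPOINT_SECONDS = 5 * 60
UPDATE_PERCENT = 70  # share of source records that update an existing key


def per_checkpoint(rate=RATE_PER_SECOND, seconds=CHECKPOINT_SECONDS,
                   update_percent=UPDATE_PERCENT):
    """Return record counts accumulated between two checkpoints."""
    records = rate * seconds
    updates = records * update_percent // 100
    inserts = records - updates
    # Each update is written as an UPDATE_BEFORE / UPDATE_AFTER pair.
    change_rows = 2 * updates + inserts
    return {
        "records": records,
        "updates": updates,
        "inserts": inserts,
        "change_rows": change_rows,
    }


stats = per_checkpoint()
for name, value in stats.items():
    print(f"{name}: {value:,}")
```

Each checkpoint therefore commits about 600,000 source records, of which 420,000 replace an existing key. Every one of those replacements has to be reconciled at read time until a compaction runs.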

For Paimon we benchmarked only the Merge-On-Read table. The reason: when writing to a Merge-On-Write Paimon table, Flink unconditionally runs a full compaction after every checkpoint — pausing stream processing for minutes at a time. We considered that behavior unacceptable for any realistic streaming scenario.

Results

The picture here is completely different. Paimon performs stably with no visible degradation, while open-source Iceberg degrades sharply: by checkpoint 11, scans start throwing OutOfMemory errors (visible as missing bars on the chart).

This is a known issue rooted in the extremely inefficient implementation of equality delete files in open-source Iceberg. We then tested the same Spark + Iceberg setup with the optimization patches developed by the Alphyn.AI team.

Running a single Spark executor worked in our favor here: shifting from task-level to executor-level caching reduced degradation to near-zero. More importantly, in absolute terms the patched Iceberg now beats Paimon in scan speed — because, unlike Paimon's bucket-bounded parallelism, Iceberg imposes no hard ceiling on SCAN concurrency.
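The effect of moving from task-level to executor-level caching can be illustrated with a toy model. This is not the Alphyn.AI patch and not Iceberg's reader code: it ignores sequence numbers, models the equality deletes as one set of keys, and all names (`DeleteLoader`, `scan_task_level`, `scan_executor_level`) are hypothetical.

```python
class DeleteLoader:
    """Stands in for reading equality delete files; counts how often that happens."""

    def __init__(self, deleted_keys):
        self._deleted_keys = deleted_keys
        self.loads = 0

    def load(self):
        self.loads += 1
        return set(self._deleted_keys)


def scan_task_level(data_files, loader):
    """Every task (one per data file) rebuilds the full set of deleted keys."""
    live = []
    for rows in data_files:
        deleted = loader.load()
        live.extend(row for row in rows if row[0] not in deleted)
    return live


def scan_executor_level(data_files, loader):
    """The set of deleted keys is built once and shared by all tasks on the executor."""
    deleted = loader.load()
    live = []
    for rows in data_files:
        live.extend(row for row in rows if row[0] not in deleted)
    return live


data_files = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")], [(5, "e")]]

per_task = DeleteLoader({2, 5})
rows_task = scan_task_level(data_files, per_task)

shared = DeleteLoader({2, 5})
rows_shared = scan_executor_level(data_files, shared)

print(per_task.loads, shared.loads)
```

Both variants return the same rows, but the task-level variant reloads the delete set once per data file, so its cost and memory footprint grow with the number of files times the size of the delete set. A shared cache is also most effective when all tasks run inside one executor, which matches the single-executor setup used in this benchmark.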

Open Table Format Conclusions

Paimon outperforms open-source Iceberg for real-time Data Lake writes when UPDATE traffic on the primary key exceeds roughly 30% of the full table (not just the increment).
The optimizations applied in Alphyn Lakehouse effectively eliminate this weakness and neutralize Paimon's architectural advantages over Iceberg. This is why we deliberately choose Iceberg for these scenarios in Alphyn Lakehouse deployments.
At present, Paimon is worth considering only for the primary RAW / Landing layer and should not be used for higher Lakehouse tiers — data will need to be re-ingested into Iceberg further downstream regardless.
Using a single format across all layers is obviously preferable, but without resolving the OSS Iceberg equality-delete problem and without managed compaction, any Lakehouse handling 300+ objects and 50+ TB of near-real-time streaming data will exhaust cluster resources on table reads within a few weeks of going to production — and all ETL jobs will grind to a halt. Paimon MOR can relieve that pain, but only at the landing layer.

Conclusion

Paimon is a young format with a rich set of configuration options and still-sparse documentation. Some of its characteristics are architectural fundamentals; others may be addressed in upcoming releases; and in some cases we may not have explored the full configuration space. Given our accumulated Iceberg expertise and the proprietary optimizations we have built into Alphyn Lakehouse, we are not rushing to adopt Paimon in production projects — but we are watching its development closely.

About Alphyn.AI

We build the Alphyn Lakehouse, a Kubernetes-native, high-performance, multi-engine lakehouse for any enterprise data and analytical workload — from agentic AI and BI to structured and unstructured data. Built entirely on open standards and an open architecture, Alphyn Lakehouse is a sovereign, on-premises solution for regulated enterprises across the GCC and the wider MENA region.

Learn more at alphyn.ai and follow us on LinkedIn.
