Open Table Formats — Iceberg vs Paimon — Practical Experience
TL;DR: Paimon can help in certain scenarios, but it is too early to switch to it en masse. See detailed findings at the end of this article.
Our streaming data team has gained extensive practical experience working with Apache Iceberg in tasks at the intersection of traditional batch processing and near-real-time, specifically using Flink-based technologies within Alphyn SDI. Therefore we could not ignore the new open table format (OTF) Paimon from the Apache Flink developers. In this article we describe our experience and the practical conclusions we drew from industrial environments, in the form of representative testing that illustrates the key practical scenarios.
Preface: Pain Points of OTFs in the Data Lake
The emergence of open table formats fulfilled a long-held dream of data engineers: combining the storage and read efficiency of Apache Parquet with the ability to update data without full rewrites. This is achieved through the Merge-On-Read paradigm and "deferred deletion," where information about deleting old record versions is written to deletion files. For stream processing frameworks like Flink, this opens up possibilities for updating data directly in the Data Lake in near-real-time mode. For the platform's SQL engines, it reduces resource costs for MERGEing new data portions into data marts.
The cost of this innovation is obvious from the name of the Merge-On-Read approach: the overhead of applying deletions and insertions is moved to the data read stage, leading to two consequences:
- Reading from OTF tables degrades with each update;
- OTF tables require periodic maintenance to merge deletion files with data files.
Iceberg is fully subject to this, as described in detail by our colleagues in their article on Iceberg table maintenance optimizations.
We therefore decided to compare Paimon and Iceberg on the following metrics:
- Speed of MERGEing a new data batch:
- via Apache Spark;
- via Apache Flink;
- Table scan speed via Apache Spark.
We also tracked the degradation of these metrics over several iterations without maintenance. In our view, these are typical scenarios and tools that most data engineers encounter in practice.
Paimon Characteristics vs Iceberg
When Spark reads Iceberg tables, a task is raised per data file, which reads the file itself and its associated deletion files. This is the only read implementation for Iceberg in Spark.
Paimon "took a different path" and introduced 2 distinct table types:
- Merge-On-Write — despite the name, this is a full equivalent of Iceberg Merge-On-Read (important not to confuse!): deletions are written to deletion files, table scan parallelism is based on data files;
- Merge-On-Read — has no direct equivalent in Iceberg. In this mode a Paimon table must be bucketed. Bucketing (= hash partitioning) is present in Iceberg too, but there it is an additional tool for increasing selectivity on key-based queries. The role of buckets in Paimon is fundamentally different: each bucket is a separate LSM tree — there are no dedicated deletion files, data files are divided into levels, and on read the value visible to the client for a given key is taken from the "freshest" file. The implication is immediately apparent: read parallelism for such tables is limited by the number of buckets.
So in our test we compare 3 table types:
- Iceberg Merge-On-Read;
- Paimon Merge-On-Read;
- Paimon Merge-On-Write.
Test Environment
For the experiment we use cloud infrastructure: Compute and Object Storage. Iceberg and Paimon tables are stored in S3 using a file catalog. Two virtual machines with 16 vCPU and 64 GB RAM each are used for deploying Flink and Spark clusters (local disks on the virtual machines).
For simplicity, clusters are set up in the simplest configuration: 1 master process (Spark Master and Flink Job Manager) and 1 worker process (Spark Executor and Flink Task Manager) each. Workers have access to all VM resources. For Spark this has an additional benefit in optimizing the reading of equality-delete files (see below) — with more Spark Executors each having fewer resources, read efficiency would be significantly lower.
Software component versions:
- Spark 3.5.4;
- Flink 1.20;
- Paimon 1.10;
- Iceberg 1.10 and 1.8 Alphyn edition.
Data Schema
Paimon tables have the following schemas:
-- Merge-On-Write
CREATE TABLE TableMOW (
ID BIGINT,
VAL_S STRING,
VAL_TS TIMESTAMP,
VAL_DT DATE,
VAL_DEC DECIMAL(38,12)
) TBLPROPERTIES (
'deletion-vectors.enabled' = 'true',
'bucket'=4,
'primary-key' = 'ID');
-- Merge-On-Read
CREATE TABLE TableMOR (
ID BIGINT,
VAL_S STRING,
VAL_TS TIMESTAMP,
VAL_DT DATE,
VAL_DEC DECIMAL(38,12)
) TBLPROPERTIES (
'primary-key' = 'ID',
'write-only'='true',
'bucket'=4);
Iceberg table schema:
CREATE TABLE TableIcebergMORv3 (
ID BIGINT,
BUCKET_ID INT,
VAL_S STRING,
VAL_TS TIMESTAMP,
VAL_DT DATE,
VAL_DEC DECIMAL(38,12),
PRIMARY KEY(ID) NOT ENFORCED
)
WITH (
'write.format.default' = 'parquet',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read',
'write.upsert.enabled'='true',
'format-version'='3');
Tables are populated with 10,000,000 randomly generated records with unique ID values and a record size of 1 KB each, after which Parquet file compaction is performed to a target file size of 128 MB. The resulting average size of each table in S3 is approximately 7 GB with an average of ~60 data files.
The number of buckets for Paimon was initially chosen to be significantly less than the available CPU count on the Spark cluster (16 vCPU), while also roughly fitting within Paimon's recommendation for bucket size up to 1 GB (recommended data size per bucket: 200 MB -- 1 GB).
For Iceberg Merge-On-Read we did not define partitioning, as we would not for tables of this size in real tasks.
Apache Spark Testing
For the Spark test we generated 21 batches of 1,000,000 random rows of 1 KB each, identical in schema to the target tables, with an ID range from 1 to 10,000,000. We then sequentially perform MERGE of each batch by simple ID equality (example for the first batch):
MERGE INTO TableIcebergMOR target
USING Source_1_1000000 source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *
After each MERGE, we read the table and measure the SCAN time. Since the number of records in the table does not grow after the operations due to full ID overlap, we get a clean measurement of the merge process's impact on read speed.
Spark Measurement Results
Graphs show MERGE and SCAN duration after each iteration for all 3 table types. The key findings are:
-
Paimon Merge-On-Read — clearly the laggard. As expected, it has the slowest SCAN and rapid degradation. Surprisingly, during MERGE the target table is also scanned — so SCAN degradation drives MERGE degradation as well. One would think that using an LSM for MERGE by key would require no scan of the target table at all (when writing to Paimon via Flink, no such scan occurs), yet Spark does it anyway.
-
Comparing Paimon Merge-On-Write and Iceberg Merge-On-Read (recall they have nearly equivalent storage structures), the situation is more nuanced:
- On the 16th iteration, Paimon performed auto-compaction during MERGE — visible as a spike in the histogram.
- Before the auto-compaction, both formats showed approximately equal scan speed with a degradation trend;
- However, we again observe a multiple-fold lag in Paimon's MERGE speed. The explanation:
- Both formats need the row's position in the data file to form deletion vectors for the target table;
- Iceberg does this optimally — when scanning the target table it uses columnar storage in Parquet and reads only the ID column used for the MERGE, so the amount of data read (visible in the query plan) is small;
- Paimon reads the entire table.
This Paimon behavior must be carefully considered when designing ETL processes. We added a task to the R&D team backlog for further investigation of Paimon code in this area.
Brief Spark Conclusions
- Paimon Merge-On-Read is unsuitable for practical tasks, as it has scan limitations and degrades rapidly;
- Paimon Merge-On-Write does not surpass Iceberg Merge-On-Read in read speed and significantly lags in MERGE speed;
- Overall, the value of Paimon for batch processing on Apache Spark is questionable.
Apache Flink Testing
For Flink testing we created a synthetic stream at 2,000 records per second, using a fairly stressful scenario for an NRT ODS: 70% of records are updates to existing keys (which before insertion decompose into RowKind.UPDATE_BEFORE and RowKind.UPDATE_AFTER pairs), and 30% are inserts of new keys. Checkpoint interval: 5 minutes. After each checkpoint we read the table using Spark and measure SCAN speed. This simulates a scenario well familiar to Lakehouse platform users: Flink writes data in Near Real Time (Kafka, CDC), which then needs to be read through SQL engines with expectations of consistency and predictable execution time.
For Paimon we only write to the Merge-On-Read table. The reason: when writing to Merge-On-Write, Flink unconditionally performs compaction of the target table after each checkpoint, halting stream processing for minutes. For real scenarios we deemed this behavior unacceptable.
Flink Measurement Results
Here we observe a completely different picture. Paimon works stably, showing no visible degradation, while Iceberg degrades rapidly — by checkpoint 11, reads begin producing OutOfMemory errors.
This problem is known and related to the extremely poor implementation of equality deletes in open-source Iceberg. Let's check how the Spark + Iceberg combination works with the optimization improvements developed by the Alphyn.ai team.
With the Alphyn-patched Iceberg: the use of a single Spark executor works in our favor here. Switching to executor-level caching (rather than task-level caching) reduces degradation to a minimum. Moreover, in absolute numbers — thanks to the absence of parallelism constraints on SCAN — the patched Iceberg now outperforms Paimon.
Brief OTF Conclusions
- Paimon outperforms open-source Iceberg for the Real-Time data write use case into a Data Lake when there is a substantial share of UPDATEs by PK (>30% at the scale of the entire table, not just a single increment);
- The improvements applied in the Alphyn Lakehouse platform effectively solve this problem and negate Paimon's architectural advantages relative to Iceberg. Therefore in projects using the platform we consciously use Iceberg for these scenarios;
- Overall, at this point Paimon makes sense to consider only for the primary RAW / Landing layer, and should not be used for layers above RAW in the Lakehouse — further data transfer to Iceberg is required anyway;
- It is undeniably more convenient to use a single format across all layers, but without resolving the open-source Iceberg problems and without managed compaction functionality for the Lakehouse — when there are 300+ Near Real Time streaming objects and storage size exceeds 50 TB — after a couple of weeks of production use all ETL processes will halt due to enormous cluster resource consumption during table reads. Paimon MOR can relieve this pain only at the initial stage.
Conclusion
Paimon is undoubtedly a young format with a rich set of configuration options and currently rather thin documentation. Some of its characteristics are architectural, some may be corrected soon, and in some places we may not have tuned everything optimally. For now, given our accumulated Iceberg expertise and our own customizations, we are in no hurry to adopt Paimon in Data Lake projects — but we are watching its development closely.