
The Small Files Problem: S3/MinIO vs HDFS vs Greenplum — Benchmarks

14 experiments from 256 KB to 128 MB, 4 concurrent TPC-DS streams, and a deep analysis of Greenplum pg_catalog bloat.

Alphyn.ai Engineering Team

Not long ago, a vendor's company blog published a study testing the behavior of various distributed file systems when working with small files (~2 MB). Brief conclusion: based on the test results, HDFS handles the small files problem best, degrading 1.5x; S3 based on MinIO could not handle it, slowing down 8x; S3 API over Ozone degrades 4x; and the most preferable system for working with small files, according to the authors of that study, is Greenplum — including for companies in the "exabyte club." The authors of that study also performed enormous work searching for "theoretical confirmations of unexpected metrics."

The test results in the S3 MinIO part seemed unconvincing to our team, and we hypothesized that they might be related to:

  • insufficient practical operational experience with SQL compute over S3 and S3 in general;
  • characteristics of distribution builds;
  • lack of experience working with MinIO clusters, in particular in high-load production environments with 200+ TB of compressed columnar Iceberg/parquet data, where the small files problem quickly becomes relevant.

We thank the authors of that study for the idea and inspiration to conduct similar testing. Let us work through this.

Test Environment Description

The test environment was assembled in a cloud virtual environment.

Configuration of virtual nodes for MinIO:

| Parameter | Value |
|-----------|-------|
| Number of nodes | 4 |
| vCores | 8 |
| RAM, GB | 32 |
| Disk subsystem | 4 network SSD disks of 1024 GB each (16 network disks total per cluster) |

As the SQL engine we also used Impala — 4.5_2025.04 as part of the Alphyn Lakehouse 2025.04 platform release.

Configuration of Kubernetes nodes for Impala:

| Parameter | Value |
|-----------|-------|
| Number of nodes | 4 |
| vCores | 32 |
| RAM, GB | 252 |

Local data caching of the SQL engine on worker nodes was completely disabled so that during testing reads went only from S3.

The test stand configuration was intentionally skewed in favor of compute capacity, so that storage would become the bottleneck in the event of problems. Both MinIO and the SQL engine were tuned for maximum performance at the level of the cloud infrastructure, the operating system, and individual settings and parameters. Any practical task involves every component in the chain, and our extensive production experience with such systems gave us the knowledge needed for this fine-tuning.

Experiment Description

Testing was also conducted using the TPC-DS methodology. We did not limit ourselves to running the 99 SQL queries sequentially: we ran 4 simultaneous TPC-DS sessions, 396 SQL queries in total, strictly per the methodology's requirements for concurrent execution. Our goal was to create the most uncomfortable scenario possible in terms of load on the file storage. We wanted the "bad MinIO" to choke on the engine's requests for listing and reading small files, so that the original article's deep dive into "theoretical investigation of unexpected metrics" would prove justified.

The authors of that study considered a file size of 2.3 MB to be sufficiently small and characteristic of a Change Data Capture approach, especially when working with HDFS. But we decided to sink to the very bottom and gradually climb back up by increasing the size. Our data was prepared via Spark in parquet format across 14 schemas, each with its own target parquet file size, ranging from 256 KB to 128 MB.

| Experiment number | File size, MB | Number of files in S3 |
|-------------------|---------------|-----------------------|
| 1 | 128 | 248 |
| 2 | 96 | 322 |
| 3 | 64 | 463 |
| 4 | 48 | 612 |
| 5 | 32 | 956 |
| 6 | 24 | 1,339 |
| 7 | 16 | 1,878 |
| 8 | 12 | 3,338 |
| 9 | 8 | 7,037 |
| 10 | 4 | 11,733 |
| 11 | 2 | 28,107 |
| 12 | 1 | 46,836 |
| 13 | 0.5 | 102,178 |
| 14 | 0.25 | 140,102 |

In the worst-case scenario, the total number of files in the test dataset exceeded the file count from the other vendor's test by 10 times — 140,000 vs 14,000.

As you can see, we intentionally did not limit ourselves to the extreme values of smallest and largest, to show the dynamics of the performance "drop" depending on file size.

Results

The chart presents the results of each of the 4 independent TPC-DS streams launched simultaneously:

  • Y axis — execution time in seconds with a step of 125 seconds;
  • X axis (logarithmic scale) — file size in MB.

For simplified analysis, the total test completion time is the time of the slowest stream in the group.
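
As a minimal sketch of this aggregation rule, the snippet below takes the per-stream wall-clock times of one experiment, reports the slowest stream as the completion time, and expresses it relative to the 128 MB baseline. The function names and the sample timings are hypothetical placeholders chosen for illustration; they are not the measured results.

```python
from typing import Dict, Mapping, Sequence

STREAMS = 4              # concurrent TPC-DS sessions, as in the experiment
QUERIES_PER_STREAM = 99  # one full TPC-DS query set per session


def completion_time(stream_seconds: Sequence[float]) -> float:
    """Total test time is the time of the slowest concurrent stream."""
    if not stream_seconds:
        raise ValueError("at least one stream is required")
    return max(stream_seconds)


def degradation(results: Mapping[float, Sequence[float]],
                baseline_mb: float = 128) -> Dict[float, float]:
    """Completion time per file size, divided by the baseline's completion time."""
    base = completion_time(results[baseline_mb])
    return {size: completion_time(times) / base for size, times in results.items()}


# Hypothetical timings in seconds, keyed by file size in MB (NOT measured data).
sample = {
    128: [100.0, 90.0, 95.0, 98.0],
    2: [150.0, 180.0, 170.0, 160.0],
}
factors = degradation(sample)
total_queries = STREAMS * QUERIES_PER_STREAM

print(total_queries)  # 396
print(factors[2])     # 1.8
```

Taking the maximum rather than the mean is the conservative choice: the run is not finished until the last stream completes.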

Conclusions

  • At the reference point of 2 MB — result is 1.8x worse (recall the other study's test results where the file count was 10 times fewer: HDFS — 1.5x, S3 API Ozone — 4x, MinIO — 8x);
  • In the conducted experiment, significant degradation (more than 1.8x) started only with file sizes smaller than 4 MB;
  • Overall it can be concluded that average file sizes of 10–12 MB can still be acceptable for operation, and degradation begins beyond that:
    • Our best practice for maximum performance: the average file size metric across the entire cluster (total volume / total number of files) should be no less than 30–40 MB;
  • Maximum degradation of 2.5x relative to the 128 MB file size was recorded at 256 KB;
  • Despite the fact that the intensity of object storage access is several times higher than in the other study's experiment, the maximum performance reduction of 2.5x at 256 KB and 1.8x at 2 MB in our case is far from the claimed 10x+ (up to 1000%+). During our experiment the SQL engine created a load on object storage 40 times greater than in the other study's test.
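
The average file size metric from the best practice above (total volume / total number of files) can be sketched as follows. This is an illustrative helper, not part of any platform API: the function names are hypothetical, the 30 MB default is simply the lower bound of the 30–40 MB range, and how the object sizes are obtained (for example from a bucket listing) is left out.

```python
from typing import Iterable

MB = 1024 * 1024


def average_file_size_mb(sizes_bytes: Iterable[int]) -> float:
    """Average file size in MB: total volume divided by total number of files."""
    total = 0
    count = 0
    for size in sizes_bytes:
        total += size
        count += 1
    if count == 0:
        raise ValueError("no files to average")
    return total / count / MB


def needs_compaction(sizes_bytes: Iterable[int], threshold_mb: float = 30.0) -> bool:
    """True when the average file size falls below the chosen threshold."""
    return average_file_size_mb(sizes_bytes) < threshold_mb


# Hypothetical dataset: 100 files of 2 MB plus 10 files of 128 MB.
sizes = [2 * MB] * 100 + [128 * MB] * 10
avg = average_file_size_mb(sizes)
print(round(avg, 2))            # 13.45
print(needs_compaction(sizes))  # True
```

A handful of large files does not hide many small ones here: the average is dominated by the file count, which is exactly what the storage and the SQL engine pay for.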

To reinforce the results obtained and strengthen our position, we will not resort to finding links on the internet and compiling PDF documents in the style of analytical notes. Instead, we propose inviting our team to an in-person testing session, so that you can verify the results and draw conclusions from personal experience.

Does the Small Files Problem Actually Exist?

Of course it does! For timely prevention of degradation in Lakehouse or Data Lake class solutions, it is important to proactively combat this problem and monitor object states by assessing the average file size. The small files problem is especially pronounced with real-time data loading. For such scenarios, the compaction and maintenance process is configured before the data stream is connected.

At the extreme, this problem not only affects performance, but in the case of the now-mainstream Iceberg tables, "breaks" them by bringing them into an unserviceable state. Ultimately, in advanced cases an unreasonably large amount of cluster resources will be required to perform table compaction. Read queries are significantly slowed at the SQL compute level when it becomes necessary to read 100x+ more files and Iceberg metadata.

In our Alphyn Lakehouse platform there is a built-in managed Iceberg maintenance service with a graphical interface, which is part of the platform and frees users from creating their own project solutions.

What About HDFS?

From a practical standpoint, the HDFS NameNode is limited to 200–300 million files, regardless of their size, although the absolute maximum set by the fsimage inode limit is 2^31 (about 2 billion).

Why does this happen? There are several reasons:

  • Heap memory utilization on the NameNode (the old faithful JVM), plus the cost of garbage-collecting such large heaps. On average, the heap for 150 million objects is around 100 GB;
  • "Snapshotting" of HDFS — a frequently used approach in large production clusters, at minimum for the replication task and isolating changes without using open table formats. If enabled, with 5 snapshots per object and only 5% changes in each (both figures are very conservative estimates; in my practice of operating a 5 PB HDFS cluster, snapshotting uses up to 20% of volume with 3 snapshots per object), the heap size is multiplied by 1.3: for 150 million objects — 130 GB. Continuing the experiment, for 300 million objects — 260 GB;
  • This calculation is characteristic of columnar file formats parquet/orc. Now imagine that we have an Iceberg table format. What will it add? There are many metadata files. Their count is comparable to the number of data files + delete files from accumulated changes + greater fragmentation of the data files themselves + Iceberg versioning. As a consequence, the number of "useful" files (specifically, data files of the current Iceberg snapshot) will already be practically 30–50 million on the NameNode, as the remaining "quota" is occupied by the surrounding Iceberg environment. One should also not forget about service continuity: NameNode startup time in practice is approximately 30 minutes.
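
The heap arithmetic above can be reproduced with a short sketch. It uses only the figures quoted in the list (about 100 GB of heap per 150 million objects, and a 1.3 multiplier when snapshots are enabled) and assumes linear scaling with the object count, which is how the list itself extrapolates from 150 to 300 million. The function name is hypothetical.

```python
HEAP_GB_PER_150M_OBJECTS = 100.0  # average figure quoted in the text
SNAPSHOT_MULTIPLIER = 1.3         # quoted for 5 snapshots with 5% changes each


def namenode_heap_gb(objects: int, snapshots: bool = False) -> float:
    """Rough NameNode heap estimate, assuming linear scaling with object count."""
    heap = objects / 150_000_000 * HEAP_GB_PER_150M_OBJECTS
    return heap * SNAPSHOT_MULTIPLIER if snapshots else heap


for objects in (150_000_000, 300_000_000):
    plain = namenode_heap_gb(objects)
    with_snapshots = namenode_heap_gb(objects, snapshots=True)
    print(f"{objects:,} objects: {plain:.0f} GB, {with_snapshots:.0f} GB with snapshots")
```

This is a planning estimate only; real heap usage also depends on block counts, path lengths, and JVM settings, none of which are modeled here.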

How are small files problems in HDFS addressed? First and foremost, by monitoring the average file size and consolidating files. In the case of Iceberg, as with S3, servicing and compaction of Iceberg tables and manifests helps. If the file count, even with scheduled maintenance, still exceeds the practical limit, HDFS federation is used — but the complexity of the solution and installation increases significantly, because:

  • Each NameNode requires its own HA;
  • Namespaces are isolated: data cannot be moved between namespaces, only copied;
  • Load balancing across NameNodes becomes manual: there is no automatic distribution; data placement planning is done manually, as data nodes can be overloaded by a single namespace;
  • Operational complexity: backup/restore/replication, HA configurations, version updates, quota assignments (per namespace), resource allocation, and administrator competency requirements all become more complex.

All these problems were the motivation for migrating HDFS federations to native S3 solutions (Ceph, MinIO, public cloud S3), and also the reason for creating the Apache Ozone project itself.

What About Greenplum?

First — some well-known facts about how postgres/GP works:

  1. In postgres, each partition of each table has its own set of data files:
    1. Data files themselves with page-based internal organization and splitting by 1 GB (each next gigabyte — a new file);
    2. Service files: free space map (FSM), visibility map (VM), _init (for unlogged tables);
    3. Index files (if there are indexes);
  2. GP consists of segments. On one physical node, 4 to 8 segments run, each segment is a postgres instance + one active common master and its standby postgres;
  3. The primary table type in GP for analytics/DWH is Append Only Column Oriented tables (AOCO): on each segment, for each partition, for each column — a separate data file with the logic from point 1 in terms of composition and splitting. In addition to visible columns there are also technical columns: 33 per AOCO, with one file each per partition;
  4. The second key table type is AOT Row — Append Only Row Oriented — with compression but row-based storage, substantially less optimal for HTAP/OLAP workload for large clusters (less compression, no column reads). But the file logic is closer to postgres: one data file per 1 GB of data per partition;
  5. All large tables must be distributed close to uniform distribution across all segments (which are essentially the units of parallelism). Accordingly, each segment has its own piece of each partition with all its files for parallel processing (storage and compute are tightly coupled + symmetric MPP processing);
  6. The pg_catalog catalog is replicated to every segment.

Where the term "number of partitions" in the cluster (or more precisely "number of objects in pg_catalog terms") is used below, this means the sum of all tables without partitions + the sum of all partitions of all partitioned tables across the entire cluster.

Real Practical Experiment of pg_catalog Degradation

  1. Test stand sizing:
    1. On-premise dedicated hardware, 4 segment hosts with characteristics: 64 CPU cores, 1 TB RAM, 8 segments per host;
    2. Disk subsystem of each segment host consists of 2 RAID-10 arrays of 4 Enterprise NVMe PCIe gen4 x4 disks each (8 disks total);
    3. Local network: 100 Gbps plus additional 100 Gbps bonding for redundancy.

(!) All figures obtained in the experiment below represent maximum performance due to the use of NVMe disk subsystem and 100 Gbps network bandwidth.

  1. Create 331 AOCO tables. Each table has 100 columns (133 including technical columns). Total across all tables: 5.4 million partitions and 32 million small files per segment (!). The tables have 0 rows, so the per-column files have not yet been created. At the first INSERT of even one row, wherever the row lands, there will be +100 files on the segment (for fully populated partitions), at least a hundredfold multiplication of partitions per table per segment, or approximately x15 files for this case. Creating the column files will not further affect catalog performance, but it will radically change the behavior of SELECT/DML operations and of the physical node's file system. For now we are testing only the effects of working with the catalog, because the cluster unfortunately already stops here;

  2. The system catalog occupies 12 TB before vacuum (see diagram) and 9 TB after, partly because it is replicated on all segments. Data files with 0 actual rows occupied 11 TB — already real big data literally out of nowhere!

  3. Measurements and conclusions with 0 rows in the AOCO data tables themselves:

    1. At 400K partitions (taken as the most relevant point for large clusters), catalog queries already degrade to 5 seconds;
    2. At 5.4M partitions — degradation already exceeds one minute (75 seconds, and after vacuum full of the system catalog — 55 seconds);
    3. Degradation dynamics are linear: every 10K empty partitions adds an additional 100 ms;
    4. At 5.4M partitions: non-concurrent DDL in a single session (CREATE of a new empty additional table with 1000 partitions) — 10 seconds; DROP of an empty table — 10 minutes!
    5. Concurrent DML/DDL operations on 80–100 sessions against the catalog at 400K partitions completely stop cluster operation!
  4. With AOT Row things are better: 1 million AOT Row partitions and 500 simultaneous queries on the indicated disk type with proper tuning — this is the threshold where the cluster does not yet degrade in terms of pg_catalog access.
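
The linear trend reported above (about 100 ms of extra catalog latency per 10K empty partitions) can be turned into a rough estimator. This is a sketch under a stated assumption: it uses a zero intercept, so it is only a first-order model. It yields 4 s at 400K partitions and 54 s at 5.4M, against the reported 5 s and 55 s (after vacuum full). The function names are hypothetical.

```python
MS_PER_10K_PARTITIONS = 100.0  # slope reported in the measurements


def catalog_query_seconds(partitions: int) -> float:
    """Estimated catalog query latency, assuming a purely linear trend."""
    return partitions / 10_000 * MS_PER_10K_PARTITIONS / 1000.0


def max_partitions_for(budget_seconds: float) -> int:
    """Largest partition count that keeps the estimate within a latency budget."""
    return int(budget_seconds * 1000.0 / MS_PER_10K_PARTITIONS * 10_000)


print(catalog_query_seconds(400_000))    # 4.0
print(catalog_query_seconds(5_400_000))  # 54.0
print(max_partitions_for(1.0))           # 100000
```

The inverse function is the more useful one in planning: pick the catalog latency you can tolerate and read off the partition count you must stay under.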

From a practical standpoint in production, the number of concurrent sessions matters. And the optimizer itself uses the catalog on each query as well.

It turns out that the practical architectural limitations of Greenplum are as follows:

  • AOCO — number of partitions within a single powerful physical cluster built with NVMe disks: approximately 300K per cluster and 100 concurrent sessions with minimal load — a limit regardless of hardware for a single physical GP cluster;
  • AOT Row — 1 million partitions and 500 sessions for NVMe disks and sufficient RAM.
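
These limits can be expressed as a simple pre-flight check. The sketch below counts catalog objects using the definition given earlier (tables without partitions plus all partitions of all partitioned tables) and compares the result with the thresholds from the list. The names `Limit`, `LIMITS`, `catalog_objects`, and `within_limits` are hypothetical, and the thresholds are the approximate figures from this article, not values enforced by Greenplum.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Limit:
    max_partitions: int
    max_sessions: int


# Approximate practical limits quoted above for a single NVMe-based cluster.
LIMITS = {
    "aoco": Limit(max_partitions=300_000, max_sessions=100),
    "aot_row": Limit(max_partitions=1_000_000, max_sessions=500),
}


def catalog_objects(unpartitioned_tables: int,
                    partitions_per_table: Sequence[int]) -> int:
    """Tables without partitions plus all partitions of all partitioned tables."""
    return unpartitioned_tables + sum(partitions_per_table)


def within_limits(table_type: str, objects: int, sessions: int) -> bool:
    """True when both the object count and the session count fit the limit."""
    limit = LIMITS[table_type]
    return objects <= limit.max_partitions and sessions <= limit.max_sessions


# Hypothetical cluster: 2,000 plain tables and 350 tables of 1,000 partitions each.
objects = catalog_objects(2_000, [1_000] * 350)
print(objects)                                # 352000
print(within_limits("aoco", objects, 80))     # False
print(within_limits("aot_row", objects, 80))  # True
```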

Taken together, the significant slowdown of SELECT/DML/DDL operations, cluster downtime during vacuum operations on the catalog and data files, and the degraded speed (or outright failure) of maintenance operations such as backup/restore and replication are known as the "GP catalog bloat" problem.

Unfortunately, no matter how much one might wish, Greenplum cannot handle a large number of small files due to its architecture, simply stalling at the primary step — listing objects in pg_catalog.

Summary

On one hand, we conducted an extreme experiment with 5.4 million partitions in Greenplum, which is of course a far-fetched figure. But we are not the ones claiming that GP scales to PETABYTES of compressed data with significant concurrent load.

On the other hand, large GP clusters of a couple hundred TB in real life have precisely those 100–400 thousand AOCO partitions, which as we can see is close to the technological limits of a single cluster.
