Benchmarking MPP SQL Engines: Impala vs Trino vs GreenPlum
Successful performance testing and load testing are the most critical prerequisites for selecting an analytical system for massively parallel big data processing. In this publication we share the approaches to testing used by our team and introduce the results of comparing various engines and systems. In this study we benchmark two popular SQL engines — Impala and Trino — against GreenPlum, evaluating them under concurrent workloads on the Alphyn Lakehouse platform.
You will learn how to set goals correctly, choose a methodology, what scenarios to include in it, how to record results and draw conclusions. And most importantly — you will get the answer to the question: who is faster, the hare Trino or the antelope Impala?
Testing Goals and Methodology
Selecting a target system, framework, or engine is the most obvious and seemingly self-explanatory goal for conducting tests. Let's examine in detail what needs to be verified and researched:
- The system/engine's application area and role in the landscape. There are no inherently good or bad systems. Some perform better in concurrent BI access or complex ad hoc queries; some in complex sequential data transformation tasks; some handle all the requirements of a modern data warehouse or data lake well.
- Identifying architectural characteristics. During the research you need not just collect numbers in a report, but understand: what level of expertise will be required from end users and developers; what limitations or functional peculiarities they will face; which design approaches are favorable and which are undesirable.
- Obtaining metrics for target system sizing projections. When selecting a system or engine, it is important to understand how architectural characteristics affect the target sizing. Different solutions have different infrastructure and hardware requirements.
Setting goals defines the requirements for the testing methodology. I propose to discuss methodology in reverse: by identifying the signs of a bad testing approach.
Signs of Poor Testing Methodology
Red Flags — Poor Methodology
- The test dataset volume is several times smaller than the system's total RAM
- Test queries execute sequentially: run a query, record the result in a spreadsheet, move to the next query, repeat
- The same test queries run without varying parameters and without pre-clearing the result cache or filesystem cache
- The physical model is prepared in advance specifically for an agreed-upon limited set of test queries (indexes/projections/sort orders tuned for specific queries)
- Only SELECT statements are used — no write operations (no INSERT/CREATE)
These signs should alert you. Is this a misunderstanding of the requirements that a modern analytical system faces in real production use, or is it deliberate manipulation by a vendor or integrator?
Signs of Good Testing:
- The test dataset volume is several times larger than the total RAM. The ideal data-to-memory ratio is at least 5:1
- At least one test iteration is conducted without optimizing the physical data model for specific queries
- Testing runs under concurrent load. Multiple different queries execute simultaneously, or the same query is submitted by a group with different predicates to prevent cache hits. The cache is flushed before each iteration
- "Zerg Rush" — dozens or hundreds of heterogeneous queries arrive simultaneously or in groups over an extended period
Why do we impose such requirements?
In real production the system is always heavily loaded. It often processes dozens or even hundreds of queries simultaneously, and these are not necessarily just DML queries — and certainly not just SELECT. Statistics may not have been collected at all, may be partially collected, or may be sampled (which is why I sometimes recommend running additional iterations without statistics, or on a sparse fragmented dataset that has been "forgotten" to be maintained). Conducting tests only in the "greenhouse" conditions of over-tuning that is often unachievable in production is incorrect.
A single running query often does not fully utilize all hardware resources, and in concurrent-access systems it should not. The result of such a test shows only how the system behaves when running one specific query and represents a special case. This can be considered a functional verification, but under no circumstances a performance test.
The starting assumption should be the most unfavorable scenario where nobody pre-built indexes, projections, materialized views, or sorted data. Designing additional structures always has overhead — in storage volume, in write time, and in the design expertise required. If the goal is to test optimization techniques, it is better to plan two test iterations: the first without physical model optimization, the second with optimization — with mandatory recording of overhead costs. You always pay a price for performance gains on specific queries: write operations slow down, data volume grows due to additional structures, and so on.
Remember: the primary goal of analytical system testing is to measure throughput — the number of typical queries that a specific system and hardware configuration can process per unit of time, fundamentally per hour.
Choosing a Methodology
Methodology relevance must stem from client scenarios and data.
Option 1. Real client data of sufficient volume with a varied query set — the best possible option. But the main problem is that clients often cannot provide such a dataset of the required volume, or at the moment do not have sufficient query variety and test scenarios.
Option 2. Using an open testing methodology, such as TPC-DS or TPC-H. Both methodologies are suitable, but it is important to adapt them to your relevant scenarios. Remember that it is incorrect to draw conclusions about a system from running all 99 TPC-DS queries sequentially and recording the results. TPC-DS was originally designed as a functional test of ANSI SQL compatibility: "will the query run without rewriting?" If the proposed methodologies are unsuitable for some reason, you can develop your own — as long as it meets the criteria and goals.
Option 3. Combining Option 1 and Option 2 to add test scenarios that the client cannot currently provide from their own data.
Plan for Ideal Testing
Let's try to outline a plan for ideal testing. Sequence of actions:
- Generate a test dataset exceeding total RAM (ideally by 5x) if no representative production data is available
- Run all selected methodology queries sequentially to verify they work on the specific system or engine. Based on this step, begin creating scenarios
- Create concurrent-access test scenarios with varied queries from the base list, either by query groups or by running all queries simultaneously. Test queries should use result materialization wherever more than 1,000 rows are expected in the result set
- Configure the system for concurrent operation mode
- Measure scenario execution times and calculate "average queries per hour" (or other time unit) as needed
- Optionally supplement the methodology with various analytical tasks or building materialized data marts
- Compile a test report with results
During testing, resource utilization must be monitored. This is needed for target sizing design. You need to understand which resources the solution demands: all systems and engines differ in hardware requirements due to architectural differences. Without utilization analysis it is impossible to conclude: how efficiently and which specific resources are consumed; what bottlenecks exist or may exist in the target solution; and most importantly — whether the system in this test iteration is configured for maximum throughput or can still be optimized further. Without this knowledge it is impossible to forecast an appropriate target sizing.
When generally evaluating system performance from test results, all nuances must be considered:
- What concurrent access management mechanisms exist (resource queues/groups, splitting into compute tenants within a shared cluster, etc.)
- Which queries and operators are favorable for the engine and which may cause problems
- What compression algorithms were applied and what compression ratio they demonstrated relative to raw data (this also affects sizing and total cost of ownership calculations)
Comparing Results
If identical hardware is allocated for both systems, comparing results is straightforward — raw numbers and nothing else. But what if systems were tested on different hardware, or have such different architectural approaches that they were deployed on different hardware types and topologies? Or if tests were conducted in a cloud environment? In that case, the correct approach is to introduce a cost-per-computation metric: how much in monetary terms does it cost to execute a scenario (or complete the entire test from start to finish). In the cloud, these numbers are very easy to obtain using billing data or the cloud calculator provided by every provider. The billing captures all costs that may not be immediately obvious, making this the most accurate and honest comparison.
In an on-premise environment such a calculation can only be done if you know the purchase and operational costs of the hardware or a specific compute unit quantum (private cloud infrastructure often introduces such a metric for cost allocation to internal consumers).
Comparative Performance Testing: Trino vs Impala
With the theory covered, it's time to move to practice and results. Two frequently asked questions arise in client engagements when evaluating SQL engines for lakehouse platforms: "why consider two engines?" and "which one is faster?" Let's focus on the second. Here are the results of internal testing on the Alphyn Lakehouse platform.
Cloud Environment Testing
Test environment:
- Cloud environment with managed K8S and managed S3
- Engines run on identical K8S worker machines with identical parameters
- Data loaded into managed S3
- Both engines configured for maximum throughput (resource queues, parameter tuning for maximum resource utilization, configurable parallelism for each scenario, query plan review, etc.)
- Fault-tolerance mode was disabled for Trino to ease its S3 communication overhead
Methodology:
- Synthetic dataset approximately 16 TB in volume from the banking domain, snowflake schema
- Typical queries selected, representative of various data research and analytical layer/data mart construction tasks
- Queries returning large datasets execute with result materialization
- All queries run with concurrency of 10, with predicates selected such that each query in the group reads its own range (to eliminate caching)
- Scenarios executed using JMeter, which records maximum, minimum, and average times per group
- A separate scenario runs all queries from all groups at 10 and 20 sessions each, creating a load of 90 and 180 simultaneous queries respectively
The methodology was developed approximately 8 years ago and has proven itself well for use-case scenarios validating analytical data warehouse solutions. The methodology can be provided on request if it is of interest.
The comparison table and bar chart show per-query execution times for each engine. Among the overall picture, Query 9 stands out and warrants special attention. After thoroughly analyzing query profiles, we concluded that in this query Impala — having pre-selected the file and block scan range using min/max filtering at the Parquet file / row group / page level — performs twice as many reads as Trino during page-index filtering, even though both use the same min/max page-level filtering. We are currently investigating the reasons for this behavior in debug mode and plan to fix it.
Throughput and Cost-Effectiveness Metrics
Let's translate the results into throughput and cost-efficiency metrics. For this comparison we use the 90 and 180 simultaneous query scenarios.
Configuration costs were calculated using the cloud provider's billing calculator. In this example the monetary comparison is more accurate and representative, since we used different types and numbers of K8S worker machines. Throughput is a metric only for a specific engine on a specific configuration.
On-Premise Environment Testing
Below is an example of a similar test in a client's on-premise private cloud environment. Both engines were on equal footing. The test dataset was doubled to 32 TB. MinIO was used as S3. Newer versions of both engines were used compared to the previous cloud testing.
Production Real-World Scenario
Perhaps the results depend on the methodology, queries, and data? Let's look at real-world production results.
Testing was conducted in a client's private cloud infrastructure using real production data. All queries for the scenarios were generated by a BI tool, so per test conditions the queries were run without any modifications (if a plan is suboptimal or you want to add a hint, that is simply not possible in this context). Data was in a format favorable for the engines: Iceberg Parquet with ZSTD compression level 3. Both engines were configured for maximum performance (optimal intra-node parallelism, maximum utilization of allocated compute cluster resources). Fault-tolerance mode was disabled for Trino (favorable for the engine from a performance standpoint).
Real-world operating and testing experience with both engines shows that in high-concurrency tasks and where maximum utilization of all available resources is required, Impala is more efficient than Trino — the antelope outpaces the hare. The reasons are that Trino:
- Has higher overhead as a Java application running in a JVM environment. On the same node, Impala will work stably with 90% of RAM allocated, while Trino requires either more aggressive concurrency limiting via queuing or cutting resources down to 70% of available memory for stable operation — all of which negatively impacts throughput
- Consumes more RAM compared to Impala, which only manifests as a disadvantage under concurrent access
- Has declaratively constrained resource management (it formally exists and roughly works, but under high load it "breaks through" both its own resource group settings and the JVM limits in its operating environment)
None of these facts change the reality that Trino is a good, modern, high-performance processing SQL engine that also supports federated access (Impala's federated access is currently rudimentary and supports few sources). Trino remains a valuable tool for federated access scenarios, and the team continues to make improvements to it — especially in information security and performance.
Yes, Impala is currently faster, but work continues on both fronts: native execution for Trino to try to approach Impala's results, and improvements to Impala itself — replacing the legacy Java Hadoop S3 interaction with a custom C++ implementation. Stay tuned for updates.
Comparing Trino and Impala with GreenPlum
Now let's compare these engines with another popular solution on the market — GreenPlum — and look at the full test protocol with all participants.
The Alphyn Lakehouse solution — on both the Trino and Impala engines running in virtual private cloud infrastructure — demonstrates performance comparable to GreenPlum, which has an order of magnitude more hardware resources: more than 40 physical segment servers with a disk subsystem totaling 900 SSD drives, 100 Gb/s network connectivity, and RAM comparable in volume to the entire dataset. The GreenPlum solution is of course designed for maximum performance and optimization (even distribution key, join colocation, partitioning, appropriate per-query memory values, no spills, resource queues, and other available techniques), as it is used in production. The GreenPlum measurement was conducted during regular (daily) ETL workload that utilizes 80% of cluster resources for computing the data marts referenced in the test.
But the comparison clearly shows how outdated the legacy approach to massively parallel computing architecture has become. Systems like GreenPlum, which operate on full-scan operations and lack modern optimization techniques such as dynamic bloom filtering and two-level storage index filtering, extremely inefficiently utilize their hardware capabilities and lose out to modern architectures and processing engines. GreenPlum's "performance per cost" metric relative to SQL MPP Lakehouse is simply not competitive.
Future Plans
The test results presented here may not match your opinion or experience. If you disagree with the conclusions and observations, feel free to reach out — we will investigate and verify together with you, because the best testing is the testing you conduct yourself with the help of experts. The key is choosing the right experts.
What we plan to do and show in the near term:
- Publish TPC-DS methodology results in its classic execution
- Supplement TPC-DS with concurrent load scenarios and scenarios typical of ETL pipeline processing — such as mass DDL operations (which will also be a good test for the metacatalog)
- Conduct testing in fast access scenarios for materialized data marts (tens and hundreds of queries per second by key and several selection conditions)
- Continue improving Trino and Impala in terms of performance and share results
Our operating principles:
- Only verified solutions reach clients
- Product development planning must be based on experience and experiments close to real production use — not on star counts or third-party materials from the internet