Run Apache Spark and Iceberg 4.5x faster than open source Spark with Amazon EMR | Amazon Web Services

This post shows how Amazon EMR 7.12 can make your Apache Spark and Iceberg workloads perform up to 4.5x faster.

The Amazon EMR runtime for Apache Spark provides a high-performance runtime with full API compatibility with open source Apache Spark and Apache Iceberg. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts, and AWS Glue use an optimized runtime environment.

Our benchmarks show that Amazon EMR 7.12 runs a TPC-DS 3 TB workload 4.5x faster than open source Spark 3.5.6 with Iceberg 1.10.0.

Performance improvements include optimizations for metadata caching, parallel I/O, adaptive query scheduling, data type handling, and fault tolerance. There were also some Iceberg-specific regressions around data scans that we identified and fixed.

These optimizations let Iceberg workloads on Amazon EMR approach the performance of Parquet while retaining Iceberg’s core features: ACID transactions, time travel, and schema evolution.

Benchmark results compared to open source

To evaluate the performance of the Spark engine with the Iceberg table format, we ran benchmark tests using the TPC-DS 3 TB dataset, version 2.13, a popular industry-standard benchmark. The Amazon EMR runtime benchmarks for Apache Spark and Apache Iceberg were performed on Amazon EMR 7.12 EC2 clusters and compared against open source Apache Spark 3.5.6 with Apache Iceberg 1.10.0 on EC2 clusters.

Note: Our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences.

Setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs such as AWS Glue and Hive, we used the Hadoop catalog for Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setting by configuring the property spark.sql.catalog.<catalog_name>.type. The fact tables used the default partitioning by the date column, which varies from 200 to 2,100 partitions per table. No precomputed statistics were used for these tables.
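For illustration, a Hadoop catalog named hadoop_catalog (the name is arbitrary, and the bucket and warehouse path are placeholders) would be configured with Spark properties along these lines, matching the settings used in the commands later in this post:

```
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_catalog.type=hadoop
spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse-path>
spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```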

We ran a total of 104 SparkSQL queries in 3 consecutive rounds, and the average running time of each query in those rounds was taken for comparison. The average run time for 3 rounds on Amazon EMR 7.12 with Iceberg enabled was 0.37 hours, showing a 4.5x speedup compared to open source Spark 3.5.6 and Iceberg 1.10.0. The following figure shows the total run times in seconds.

The following table summarizes the metrics.

Metric | Amazon EMR 7.12 on EC2 | Amazon EMR 7.5 on EC2 | Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Average run time in seconds | 1349.62 | 1535.62 | 6113.92
Geometric mean for queries in seconds | 7.45910 | 8.30046 | 22.31854
Costs* | $4.81 | $5.47 | $17.65

*Detailed cost estimates are discussed later in this post.

The following chart shows the per-query performance improvement of Amazon EMR 7.12 compared to open source Spark 3.5.6 and Iceberg 1.10.0. The speedup varies from query to query, with Amazon EMR outperforming open source Spark on Iceberg tables across the board; the largest improvement is up to 13.6x for q23b. The horizontal axis ranks the TPC-DS 3 TB benchmark queries in descending order of the improvement seen with Amazon EMR, and the vertical axis shows the magnitude of that speedup as a ratio.

Cost comparison breakdown

Our benchmark provides total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. To round out the comparison, we also examine cost. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR costs.

  • Amazon EC2 cost (including SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • r5d.4xlarge hourly rate = $1.152 per hour
  • Amazon EBS root volume cost = number of instances * Amazon EBS per GB-hour rate * EBS root volume size * job runtime in hours
  • Amazon EMR uplift cost = number of instances * r5d.4xlarge Amazon EMR price * job runtime in hours
    • r5d.4xlarge Amazon EMR price = $0.27 per hour
  • Total cost = Amazon EC2 cost + Amazon EBS root volume cost + Amazon EMR uplift cost
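As a quick sanity check, the formulas above can be reproduced in a few lines of Python. The r5d.4xlarge rates come from the list above; the EBS gp2 rate (roughly $0.10 per GB-month, converted to a GB-hour rate) is an assumption for illustration:

```python
# Reproduce the Amazon EMR 7.12 cost estimate from the formulas above.
instances = 9                      # cluster size, including the primary node
runtime_hours = 1349.62 / 3600     # benchmark run time converted to hours
ec2_rate = 1.152                   # r5d.4xlarge On-Demand $/hour (from the list above)
emr_rate = 0.27                    # r5d.4xlarge Amazon EMR uplift $/hour
ebs_gb = 20                        # EBS root volume size in GB
ebs_rate = 0.10 / 730              # assumed gp2 $/GB-hour (~$0.10 per GB-month)

ec2_cost = instances * ec2_rate * runtime_hours
ebs_cost = instances * ebs_rate * ebs_gb * runtime_hours
emr_cost = instances * emr_rate * runtime_hours
total_cost = ec2_cost + ebs_cost + emr_cost

print(f"EC2 ${ec2_cost:.2f}  EBS ${ebs_cost:.2f}  EMR ${emr_cost:.2f}  total ${total_cost:.2f}")
```

Rounded to cents, these values line up with the $3.89, $0.01, $0.91, and $4.81 figures in the cost table that follows.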

Calculations reveal that the Amazon EMR 7.12 benchmark delivers a 3.6x cost-effectiveness improvement over open source Spark 3.5.6 and Iceberg 1.10.0 when running the benchmark job.

Metric | Amazon EMR 7.12 | Amazon EMR 7.5 | Open source Apache Spark 3.5.6 and Apache Iceberg 1.10.0
Run time in seconds | 1349.62 | 1535.62 | 6113.92
Number of EC2 instances (includes primary node) | 9 | 9 | 9
Amazon EBS size | 20 GB | 20 GB | 20 GB
Amazon EC2 cost (total per run) | $3.89 | $4.42 | $17.61
Amazon EBS cost | $0.01 | $0.01 | $0.04
Amazon EMR uplift cost | $0.91 | $1.04 | $0
Total cost | $4.81 | $5.47 | $17.65
Cost savings | Amazon EMR 7.12 is 3.6x better | Amazon EMR 7.5 is 3.2x better | Baseline

In addition to the time metrics discussed so far, Spark event log data shows that Amazon EMR scanned approximately 4.3x less data from Amazon S3 and 5.3x fewer records than the open source version in the TPC-DS 3TB benchmark. This reduction in Amazon S3 data scanning directly contributes to cost savings for Amazon EMR workloads.

Run Apache Spark open source benchmarks on Apache Iceberg tables

To test open source Spark 3.5.6 and Amazon EMR 7.12 for the Iceberg workload, we used separate EC2 clusters, each equipped with 9 r5d.4xlarge instances. The primary node was equipped with 16 vCPUs and 128 GB of memory, and the 8 worker nodes had a combined 128 vCPUs and 1024 GB of memory. We ran tests using default Amazon EMR settings to demonstrate a typical user experience and minimally tweaked Spark and Iceberg settings to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and 8 r5d.4xlarge worker nodes.

EC2 instance | vCPU | Memory (GiB) | Instance storage | EBS root volume
r5d.4xlarge | 16 | 128 | 2 x 300 GB NVMe SSD | 20 GB

Prerequisites

The following prerequisites are required to run benchmarking:

  1. Use the instructions in the emr-spark-benchmark GitHub repository to set up the TPC-DS source data in your S3 bucket and on your local machine.
  2. Build the benchmark application by following the steps to build the spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.6.jar to your S3 bucket.
  3. Create Iceberg tables from TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an Amazon EMR 7.12 cluster with Iceberg enabled to create the tables:
aws emr add-steps --cluster-id <cluster-id> --steps Type=Spark,Name="Create Iceberg Tables",
Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.hadoop_catalog.type=hadoop,
--conf,spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse-path>,
--conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<bucket>/<jar-path>/spark-benchmark-assembly-3.5.6.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/,
/home/hadoop/tpcds-kit/tools,parquet,3000,true,<database-name>,true,true],ActionOnFailure=CONTINUE --region <aws-region>

Note the Hadoop catalog warehouse location and database name from the previous step. We use the same Iceberg tables to run benchmarks with both Amazon EMR 7.12 and open source Spark.

This benchmark application is built from the tpcds-v2.13_iceberg branch. If you are building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repository.

Create and configure a YARN cluster on Amazon EC2

To benchmark Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repository to create an open source Spark cluster on Amazon EC2 using Flintrock with 8 worker nodes.

Based on the cluster selection for this test, the cluster configurations from the emr-spark-benchmark GitHub repository are used.

Don’t forget to replace the placeholder in the yarn-site.xml file with the IP address of the primary node of your Flintrock cluster.
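As an illustration, the relevant yarn-site.xml entry looks like the following; the property name is the standard Hadoop YARN setting, and the value shown is a placeholder for your primary node’s IP address:

```xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value><primary-node-ip></value>
</property>
```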

Run the TPC-DS benchmark with Apache Spark 3.5.6 and Apache Iceberg 1.10.0

To run the TPC-DS benchmark, do the following:

  1. Log in to the primary node of the open source cluster with flintrock login $CLUSTER_NAME.
  2. Submit your Spark job:
    1. Select the correct location of the Iceberg catalog warehouse and the database that contains the Iceberg tables.
    2. The results are written to s3://<bucket>/benchmark_run.
    3. You can track progress in /media/ephemeral0/spark_run.log.
spark-submit \
--master yarn \
--deploy-mode client \
--class com.amazonaws.eks.tpcds.BenchmarkSQL \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=10g \
--conf spark.executor.cores=16 \
--conf spark.executor.memory=100g \
--conf spark.executor.instances=8 \
--conf spark.network.timeout=2000 \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-aws-bundle:1.10.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions   \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog    \
--conf spark.sql.catalog.local.type=hadoop  \
--conf spark.sql.catalog.local.warehouse=s3a://<bucket>/<warehouse-path> \
--conf spark.sql.defaultCatalog=local   \
--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO   \
spark-benchmark-assembly-3.5.6.jar   \
s3://<bucket>/benchmark_run 3000 1 false  \
q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,\
q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,\
q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,\
q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,\
q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,\
q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,\
q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,\
q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,\
q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,\
q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,\
q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,\
q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13    \
true  > /media/ephemeral0/spark_run.log 2>&1 &!

Summarize the results

After the Spark job completes, retrieve the test result file from the S3 output bucket at s3://<bucket>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the results by creating a timestamp folder and placing the summary file in a folder named summary.csv. The output CSV files contain 4 columns without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

With data from 3 separate test runs with 1 iteration each time, we can calculate the mean and geometric mean of the benchmark runs.
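As a sketch, the mean and geometric mean can be computed from the downloaded summary CSVs with Python’s standard library; the four-column layout is as described above, and the sample data is purely illustrative:

```python
import csv
import io
from statistics import geometric_mean

def summarize(csv_text):
    """Parse a headerless summary CSV with rows of
    (query name, median, min, max) seconds and return
    (total median runtime, geometric mean of medians)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    medians = [float(row[1]) for row in rows]
    return sum(medians), geometric_mean(medians)

# Illustrative two-query sample in the format described above
sample = "q1-v2.13,2.0,1.9,2.2\nq2-v2.13,8.0,7.5,8.4"
total, gmean = summarize(sample)
print(total, gmean)  # 10.0 and 4.0 for this sample
```

In practice you would read each downloaded xxx.csv file with open() instead of the inline string, and average the totals across the 3 runs.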

Run the TPC-DS benchmark with the Amazon EMR runtime for Apache Spark

Most of the instructions are similar to the steps for running the Spark benchmark, with a few details specific to Iceberg.

Prerequisites

Perform the following necessary steps:

  1. Run aws configure to configure the AWS CLI to point to the benchmark AWS account. See Configuring the AWS CLI for instructions.
  2. Upload the benchmark application JAR file to Amazon S3.

Deploy an Amazon EMR cluster and run the benchmark job

To run the benchmark job, perform the following steps:

  1. Deploy Amazon EMR on an EC2 cluster using the AWS CLI command as shown in Deploy EMR on an EC2 cluster to run the benchmark job. Don’t forget to enable Iceberg (see Create an Iceberg cluster for more details). Select the correct Amazon EMR version, root volume size, and the same resource configuration as the open source Flintrock setup. See create-cluster for a detailed description of the AWS CLI options.
  2. Save the cluster ID from the response. We need it for the next step.
  3. Submit a benchmark job in Amazon EMR using add-steps from the AWS CLI:
    1. Replace <cluster-id> with the cluster ID from step 2.
    2. The benchmark application is at s3://<bucket>/spark-benchmark-assembly-3.5.6.jar.
    3. Select the correct location of the Iceberg catalog warehouse and the database that contains the Iceberg tables. It should be the same one used for the open source TPC-DS benchmark run.
    4. The results will be in s3://<bucket>/benchmark_run.
aws emr add-steps --cluster-id <cluster-id>
--steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job",
Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,
--conf,spark.driver.cores=4,
--conf,spark.driver.memory=10g,
--conf,spark.executor.cores=16,
--conf,spark.executor.memory=100g,
--conf,spark.executor.instances=8,
--conf,spark.network.timeout=2000,
--conf,spark.executor.heartbeatInterval=300s,
--conf,spark.dynamicAllocation.enabled=false,
--conf,spark.shuffle.service.enabled=false,
--conf,spark.sql.iceberg.data-prefetch.enabled=true,
--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,
--conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog,
--conf,spark.sql.catalog.local.type=hadoop,
--conf,spark.sql.catalog.local.warehouse=s3://<bucket>/<warehouse-path>,
--conf,spark.sql.defaultCatalog=local,
--conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
s3://<bucket>/spark-benchmark-assembly-3.5.6.jar,
s3://<bucket>/benchmark_run,3000,1,false,
'q1-v2.13\,q10-v2.13\,q11-v2.13\,q12-v2.13\,q13-v2.13\,q14a-v2.13\,q14b-v2.13\,q15-v2.13\,q16-v2.13\,q17-v2.13\,q18-v2.13\,q19-v2.13\,q2-v2.13\,q20-v2.13\,q21-v2.13\,q22-v2.13\,q23a-v2.13\,q23b-v2.13\,q24a-v2.13\,q24b-v2.13\,q25-v2.13\,q26-v2.13\,q27-v2.13\,q28-v2.13\,q29-v2.13\,q3-v2.13\,q30-v2.13\,q31-v2.13\,q32-v2.13\,q33-v2.13\,q34-v2.13\,q35-v2.13\,q36-v2.13\,q37-v2.13\,q38-v2.13\,q39a-v2.13\,q39b-v2.13\,q4-v2.13\,q40-v2.13\,q41-v2.13\,q42-v2.13\,q43-v2.13\,q44-v2.13\,q45-v2.13\,q46-v2.13\,q47-v2.13\,q48-v2.13\,q49-v2.13\,q5-v2.13\,q50-v2.13\,q51-v2.13\,q52-v2.13\,q53-v2.13\,q54-v2.13\,q55-v2.13\,q56-v2.13\,q57-v2.13\,q58-v2.13\,q59-v2.13\,q6-v2.13\,q60-v2.13\,q61-v2.13\,q62-v2.13\,q63-v2.13\,q64-v2.13\,q65-v2.13\,q66-v2.13\,q67-v2.13\,q68-v2.13\,q69-v2.13\,q7-v2.13\,q70-v2.13\,q71-v2.13\,q72-v2.13\,q73-v2.13\,q74-v2.13\,q75-v2.13\,q76-v2.13\,q77-v2.13\,q78-v2.13\,q79-v2.13\,q8-v2.13\,q80-v2.13\,q81-v2.13\,q82-v2.13\,q83-v2.13\,q84-v2.13\,q85-v2.13\,q86-v2.13\,q87-v2.13\,q88-v2.13\,q89-v2.13\,q9-v2.13\,q90-v2.13\,q91-v2.13\,q92-v2.13\,q93-v2.13\,q94-v2.13\,q95-v2.13\,q96-v2.13\,q97-v2.13\,q98-v2.13\,q99-v2.13\,ss_max-v2.13',
true],ActionOnFailure=CONTINUE --region <aws-region>

Summarize the results

After the step completes, you can find the summary result of the benchmark at s3://<bucket>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv, in the same way as in the previous run, and calculate the mean and geometric mean of the query runs.

Clean up

To avoid future charges, delete the resources you created by following the cleanup instructions in the GitHub repository.

Summary

The Amazon EMR runtime for Spark is optimized for Iceberg tables: on the TPC-DS 3 TB benchmark, v2.13, Amazon EMR 7.12 achieves 4.5x faster performance than open source Apache Spark 3.5.6 with Apache Iceberg 1.10.0. This is a significant advance over Amazon EMR 7.5, which delivered 3.6x faster performance, and it closes the Parquet performance gap on Amazon EMR, so customers can take advantage of Iceberg without performance constraints.

We recommend that you keep up to date with the latest versions of Amazon EMR to take full advantage of continuous performance improvements.

To stay informed, subscribe to the AWS Big Data blog RSS feed for Amazon EMR runtime updates for Spark and Iceberg, as well as configuration best practices tips and tuning recommendations.


About the authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Akshaya KP is a software development engineer for Amazon EMR at Amazon Web Services.

Hari Kishore Chaparala is a software development engineer for Amazon EMR at Amazon Web Services.

Giovanni Matteo is a Senior Manager for the Amazon EMR Spark and Iceberg group.
