Modernize Apache Spark workflows with Spark Connect on Amazon EMR on Amazon EC2

Introduced in Spark 3.4, Apache Spark Connect enhances the Spark ecosystem by offering a client-server architecture that separates the Spark runtime from the client application. Spark Connect enables more flexible and efficient interactions with Spark clusters, especially in scenarios where direct access to cluster resources is limited or impractical.

A key use case for Spark Connect on Amazon EMR is the ability to connect directly from your local development environments to Amazon EMR clusters. With this decoupled approach, you can write and test Spark code on your laptop while using Amazon EMR clusters for execution. This feature reduces development time and simplifies data processing with Spark on Amazon EMR.

In this post, we demonstrate how to implement Apache Spark Connect on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to create decoupled data processing applications. We’ll show how to securely set up and configure Spark Connect so you can develop and test Spark applications locally while running them on remote Amazon EMR clusters.

Solution architecture

The architecture centers around an Amazon EMR cluster with two types of nodes. The primary node hosts both the Spark Connect API endpoint and the Spark Core components that act as a gateway for client connections. The core node provides additional computing capacity for distributed processing. Although this solution demonstrates a two-node architecture for simplicity, it can scale to multiple core and task nodes based on workload requirements.

TLS/SSL network encryption is not inherently supported in Apache Spark Connect version 4.x. We’ll show you how to implement secure communication by deploying an Amazon EMR cluster with Spark Connect on Amazon EC2 using an Application Load Balancer (ALB) with TLS termination as the secure interface. This approach enables encrypted data transfer between Spark Connect clients and Amazon Virtual Private Cloud (Amazon VPC) resources.

The operational flow is as follows:

  1. Bootstrap script – During Amazon EMR cluster initialization, the primary node downloads the start-spark-connect.sh script from Amazon Simple Storage Service (Amazon S3) and runs it. This script starts the Spark Connect server.
  2. Server availability – After the bootstrap process completes, the Spark Connect server is running and ready to accept incoming connections. The Spark Connect API endpoint is exposed on the configured port (15002 by default) and listens for gRPC connections from remote clients.
  3. Client interaction – Spark Connect clients make secure connections to the Application Load Balancer. These clients translate DataFrame operations into unresolved logical query plans, encode those plans using protocol buffers, and send them to the Spark Connect API using gRPC.
  4. Encryption in transit – The Application Load Balancer receives incoming gRPC or HTTPS traffic, performs TLS termination (traffic decryption), and forwards requests to the primary node. The certificate is stored in AWS Certificate Manager (ACM).
  5. Application processing – The Spark Connect API receives the unresolved logical plans, converts them to Spark’s logical plan operators, passes them to Spark Core for optimization and execution, and streams the results back to the client as batches of rows encoded with Apache Arrow.
  6. (Optional) Operational approach – Administrators can securely connect to both primary and core nodes through Session Manager, an AWS Systems Manager feature that enables troubleshooting and maintenance without exposing SSH ports or managing key pairs.

The following diagram shows the architecture of this post’s demo for sending unresolved Spark logical plans to EMR clusters using Spark Connect.

Apache Spark Connect on Amazon EMR solution architecture diagram
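
To make this client-server flow concrete, the following minimal sketch assumes a Spark Connect session like the one created in the test application later in this post (the endpoint shown is a placeholder). Transformations only build an unresolved logical plan on the client; nothing runs on the EMR cluster until an action is called:

from pyspark.sql import SparkSession

# Hypothetical endpoint; in this post the session is created against the load balancer DNS name.
spark = SparkSession.builder.remote("sc://<alb-dns-name>:443/;use_ssl=true").getOrCreate()

df = spark.range(10).filter("id % 2 = 0")  # builds an unresolved logical plan on the client; no cluster work yet
df.collect()  # the plan is encoded with protocol buffers, sent over gRPC, and executed on the cluster;
              # results stream back to the client as Apache Arrow record batches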

Prerequisites

To continue with this post, make sure you have the following:

Implementation steps

In this walkthrough, you use AWS CLI commands to complete the following steps:

  1. Prepare the bootstrap bash script to run Spark Connect on Amazon EMR.
  2. Set permissions for Amazon EMR to provision resources and perform service-level actions with other AWS services.
  3. Create an Amazon EMR cluster with the associated roles and permissions and attach the prepared script as a bootstrap action.
  4. Deploy the Application Load Balancer and an ACM certificate to secure data in transit over the internet.
  5. Modify the security group of the primary node to allow Spark Connect clients to connect.
  6. Connect with a test application to verify the client connection to the Spark Connect server.

Prepare the bootstrap script

To prepare the bootstrap script, follow these steps:

  1. Create an Amazon S3 bucket to host the bootstrap bash script:
    REGION=<your-aws-region>
    BUCKET_NAME=<your-bucket-name>
    aws s3api create-bucket \
       --bucket $BUCKET_NAME \
       --region $REGION \
       --create-bucket-configuration LocationConstraint=$REGION

  2. Open your preferred text editor and add the following commands to a new file named start-spark-connect.sh. If the script runs on the primary node, it starts the Spark Connect server; if it runs on a core or task node, it does nothing:
    #!/bin/bash
    if grep isMaster /mnt/var/lib/info/instance.json | grep false;
    then
        echo "This is not master node, do nothing."
        exit 0
    fi
    echo "This is master, continuing to execute script"
    SPARK_HOME=/usr/lib/spark
    SPARK_VERSION=$(spark-submit --version 2>&1 | grep "version" | head -1 | awk '{print $NF}' | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')
    SCALA_VERSION=$(spark-submit --version 2>&1 | grep -o "Scala version [0-9.]*" | awk '{print $3}' | grep -oE '[0-9]+\.[0-9]+')
    echo "Spark version ${SPARK_VERSION} is installed under ${SPARK_HOME} running with scala version ${SCALA_VERSION}"
    sudo "${SPARK_HOME}"/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_"${SCALA_VERSION}:${SPARK_VERSION}"

  3. Upload the script to the bucket created in step 1:
    aws s3 cp start-spark-connect.sh s3://$BUCKET_NAME
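
Before creating the cluster, you can optionally confirm that the script is in place (a quick check, assuming the same shell variables are still set):

aws s3 ls s3://$BUCKET_NAME/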
    

Set permissions

Before creating a cluster, you must create a service role and an instance profile. A service role is an IAM role that Amazon EMR assumes to provision resources and perform service-level actions with other AWS services. An EC2 instance profile for Amazon EMR assigns a role to each EC2 instance in the cluster. The instance profile must specify a role that has access to the resources for your bootstrap action.

  1. Create an IAM role:
    aws iam create-role \
    --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }'
    

  2. Attach the necessary managed policies to the service role to allow Amazon EMR to manage Amazon EC2 and Amazon S3 core services on your behalf, and optionally to interact with Systems Manager:
    aws iam attach-role-policy \
    --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
    
    aws iam attach-role-policy \
    --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    
    aws iam attach-role-policy \
    --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
    

  3. Create an Amazon EMR instance role to grant permissions to EC2 instances to interact with Amazon S3 or other AWS services:
    aws iam create-role \
    --role-name EMR_EC2_SparkClusterNodesRole \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }'
    

  4. To enable the cluster instances to read from Amazon S3, attach the AmazonS3ReadOnlyAccess managed policy to the Amazon EMR instance role. For a production environment, review this access policy and replace it with a custom least-privilege policy that grants only the specific permissions needed for your use case:
    aws iam attach-role-policy \
    --role-name EMR_EC2_SparkClusterNodesRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
    

  5. Attaching the AmazonSSMManagedInstanceCore policy enables instances to use core Systems Manager features such as Session Manager and Amazon CloudWatch:
    aws iam attach-role-policy \
    --role-name EMR_EC2_SparkClusterNodesRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    

  6. To pass the IAM role to the EC2 instances when they launch, create an Amazon EMR EC2 instance profile named EMR_EC2_SparkClusterInstanceProfile:
    aws iam create-instance-profile \
    --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
    

  7. Attach the EMR_EC2_SparkClusterNodesRole role created in step 3 to the newly created instance profile:
    aws iam add-role-to-instance-profile \
    --instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
    --role-name EMR_EC2_SparkClusterNodesRole
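
Optionally, you can confirm that the instance profile now contains the role before creating the cluster (a quick check using the names created in the previous steps):

aws iam get-instance-profile \
--instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
--query 'InstanceProfile.Roles[].RoleName' \
--output text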
    

Create an Amazon EMR cluster

To create an Amazon EMR cluster, follow these steps:

  1. Set the environment variables for the VPC and subnets where your EMR cluster and load balancer will be deployed:
    VPC_ID=<your-vpc-id>
    EMR_PRI_SB_ID_1=<private-subnet-id-for-emr>
    ALB_PUB_SB_ID_1=<public-subnet-id-1-for-alb>
    ALB_PUB_SB_ID_2=<public-subnet-id-2-for-alb>
    

  2. Create an EMR cluster with the latest version of Amazon EMR. The bootstrap action references the S3 bucket created earlier, where the script is stored:
    CLUSTER_ID=$(aws emr create-cluster \
    --name "Spark Connect cluster demo" \
    --applications Name=Spark \
    --release-label emr-7.9.0 \
    --service-role AmazonEMR-ServiceRole-SparkConnectDemo \
    --ec2-attributes InstanceProfile=EMR_EC2_SparkClusterInstanceProfile,SubnetId=$EMR_PRI_SB_ID_1 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m5.xlarge \
    --bootstrap-actions Path="s3://$BUCKET_NAME/start-spark-connect.sh" \
    --query 'ClusterId' --output text)
    echo CLUSTER_ID="$CLUSTER_ID"
    

    Next, modify the security group of the primary node to allow SSH session access through EC2 Instance Connect.

  3. Get the security group identifier of the primary node. Make a note of the identifier, because you will need it in the subsequent configuration steps that refer to the primary node security group ID:
    PRIMARY_NODE_SG=$(aws emr describe-cluster \
    --cluster-id $CLUSTER_ID \
    --query 'Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup' \
    --output text)
    echo PRIMARY_NODE_SG=$PRIMARY_NODE_SG
    

  4. Find the EC2 Instance Connect prefix list ID for your Region by filtering the output of the describe-managed-prefix-lists command. Using a managed prefix list provides a dynamic security configuration that authorizes the EC2 Instance Connect service to reach the primary and core nodes over SSH:
    IC_PREFIX_LIST=$(aws ec2 describe-managed-prefix-lists \
    --filters Name=prefix-list-name,Values=com.amazonaws.$REGION.ec2-instance-connect \
    --query 'PrefixLists[0].PrefixListId' \
    --output text)
    echo IC_PREFIX_LIST=$IC_PREFIX_LIST
    

  5. Modify the inbound security group rules of the primary node to allow SSH access (port 22) to the primary node of the EMR cluster from the EC2 Instance Connect service IP ranges contained in the prefix list:
    aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-id $PRIMARY_NODE_SG \
    --ip-permissions "[{\"IpProtocol\":\"tcp\",\"FromPort\":22,\"ToPort\":22,\"PrefixListIds\":[{\"PrefixListId\":\"$IC_PREFIX_LIST\"}]}]"
    

Optionally, you can repeat the previous three steps for the cluster core (and task) nodes to enable Amazon EC2 Instance Connect to access those EC2 instances over SSH.
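
Before moving on, you can optionally wait for the cluster to finish provisioning and open a shell on the primary node through Session Manager. The following is a sketch, assuming the AWS CLI Session Manager plugin is installed on your machine; from the resulting shell you can also confirm that the Spark Connect server is listening on port 15002:

# Wait until the cluster is fully provisioned and the bootstrap action has run
aws emr wait cluster-running --cluster-id $CLUSTER_ID

# Look up the primary node instance ID (also retrieved later when registering the ALB target)
PRIMARY_NODE_ID=$(aws emr list-instances --cluster-id $CLUSTER_ID \
--instance-group-types MASTER --query 'Instances[0].Ec2InstanceId' --output text)

# Open an interactive session on the primary node (requires the Session Manager plugin)
aws ssm start-session --target $PRIMARY_NODE_ID

# On the primary node, verify that the Spark Connect server is listening
sudo ss -lntp | grep 15002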

Deploy the Application Load Balancer and the certificate

To deploy the Application Load Balancer and certificate, follow these steps:

  1. Create a load balancer security group:
    ALB_SG_ID=$(aws ec2 create-security-group \
    --group-name spark-connect-alb-sg \
    --description "Security group for Spark Connect ALB" \
    --region $REGION \
    --vpc-id $VPC_ID \
    --query 'GroupId' \
    --output text)
    

  2. Add a rule to accept TCP traffic from a trusted IP on port 443. We recommend using the IP address of the local development machine. You can check your current public IP address here: https://checkip.amazonaws.com:
    aws ec2 authorize-security-group-ingress \
    --group-id $ALB_SG_ID \
    --protocol tcp \
    --port 443 \
    --cidr <your-public-ip>/32
    

  3. Create a new gRPC target group that targets the Spark Connect server instance and the port the server is listening on:
    ALB_TG_ARN=$(aws elbv2 create-target-group \
    --name spark-connect-tg \
    --protocol HTTP \
    --protocol-version GRPC \
    --port 15002 \
    --target-type instance \
    --health-check-enabled \
    --health-check-protocol HTTP \
    --health-check-path / \
    --vpc-id $VPC_ID \
    --query 'TargetGroups[0].TargetGroupArn' \
    --output text)
    echo "ALB TG created (ARN)=$ALB_TG_ARN"
    

  4. Create an application load balancer:
    ALB_ARN=$(aws elbv2 create-load-balancer \
    --name spark-connect-alb \
    --type application \
    --scheme internet-facing \
    --subnets $ALB_PUB_SB_ID_1 $ALB_PUB_SB_ID_2 \
    --security-groups $ALB_SG_ID \
    --query 'LoadBalancers[0].LoadBalancerArn' \
    --output text)
    echo "ALB created (ARN)=$ALB_ARN"
    

  5. Get the DNS name of the load balancer:
    ALB_DNS=$(aws elbv2 describe-load-balancers \
    --load-balancer-arns $ALB_ARN \
    --query 'LoadBalancers[0].DNSName' \
    --output text)
    echo "ALB DNS=$ALB_DNS"
    

  6. Get the Amazon EMR Primary Node ID:
    PRIMARY_NODE_ID=$(aws emr list-instances --cluster-id $CLUSTER_ID --instance-group-types MASTER --query 'Instances[0].Ec2InstanceId' --output text)
    echo PRIMARY_NODE_ID=$PRIMARY_NODE_ID
    

  7. (Optional) The load balancer needs a certificate to encrypt and decrypt traffic. You can skip this step if you already have a trusted certificate in ACM. Otherwise, create a self-signed certificate:
    PRIVATE_KEY_PATH=./sc-private-key.key
    CERTIFICATE_PATH=./sc-certificate.cert
    sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $PRIVATE_KEY_PATH -out $CERTIFICATE_PATH -subj "/CN=$ALB_DNS"
    

  8. Upload to ACM:
    ACM_CERT_ARN=$(aws acm import-certificate \
    --certificate fileb://$CERTIFICATE_PATH \
    --private-key fileb://$PRIVATE_KEY_PATH \
    --region $REGION \
    --query CertificateArn \
    --output text)
    echo "Certificate created (ARN)=$ACM_CERT_ARN"
    

  9. Create a load balancer listener:
    ALB_LISTENER_ARN=$(aws elbv2 create-listener \
    --load-balancer-arn $ALB_ARN \
    --protocol HTTPS \
    --port 443 \
    --certificates CertificateArn=$ACM_CERT_ARN \
    --ssl-policy ELBSecurityPolicy-TLS13-1-2-2021-06 \
    --default-actions Type=forward,TargetGroupArn=$ALB_TG_ARN \
    --region $REGION \
    --query 'Listeners[0].ListenerArn' \
    --output text)
    echo "ALB listener created (ARN)=$ALB_LISTENER_ARN"
    

  10. After establishing the listener, register the primary node to the target group:
    aws elbv2 register-targets \
    --target-group-arn $ALB_TG_ARN \
    --targets Id=$PRIMARY_NODE_ID
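
After registering the target, you can optionally check whether the primary node passes the target group health check (a quick verification, assuming the variables from the previous steps are still set):

# The target may report unhealthy until the inbound rule in the next section is added
aws elbv2 describe-target-health \
--target-group-arn $ALB_TG_ARN \
--query 'TargetHealthDescriptions[0].TargetHealth.State' \
--output text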
    

Modify the security group of the primary node to allow Spark Connect clients to connect

To allow Spark Connect clients to connect, you only need to edit the primary node’s security group. Add an inbound rule to the primary node’s security group that accepts Spark Connect TCP connections on port 15002 from the Application Load Balancer’s security group:

aws ec2 authorize-security-group-ingress \
--group-id $PRIMARY_NODE_SG \
--protocol tcp \
--port 15002 \
--source-group $ALB_SG_ID
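
You can optionally confirm that the rule was added by listing the inbound rules of the primary node security group:

aws ec2 describe-security-group-rules \
--filters Name=group-id,Values=$PRIMARY_NODE_SG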

Connect with the test application

This example shows that a client with a newer version of Spark (4.0.1) can successfully connect to an older version of Spark on an Amazon EMR cluster (3.5.5), demonstrating the version compatibility feature of Spark Connect. This version combination is for demonstration purposes only. Running older versions can pose security risks in a production environment.

We provide the following Python test application to test the client-server connection. We recommend creating and activating a virtual Python environment (venv) before installing the packages. This helps isolate dependencies for that particular project and avoid conflicts with other Python projects. To install the packages, run the following command:

pip install pyspark-client==4.0.1
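
If you follow the virtual environment recommendation, the setup might look like this (run it before the pip install above; python3 is assumed to be available on your machine):

python3 -m venv spark-connect-venv
source spark-connect-venv/bin/activate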

In your integrated development environment (IDE), copy and paste the following code, replace the placeholder with the DNS name of the load balancer, and run it. The code creates a Spark DataFrame containing two rows and displays its data:

from pyspark.sql import SparkSession
import os
os.environ['GRPC_DEFAULT_SSL_ROOTS_FILE_PATH'] = os.path.expanduser('sc-certificate.cert')
spark = SparkSession.builder \
    .remote("sc://:443/;use_ssl=true") \
    .config('spark.sql.execution.pandas.inferPandasDictAsMap', True) \
    .config('spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled', True) \
    .getOrCreate()
spark.createDataFrame([("sue", 32), ("li", 3)], ["first_name", "age"]).show()

The following shows the output of the application:

+----------+---+
|first_name|age|
+----------+---+
|       sue| 32|
|        li|  3|
+----------+---+
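
Because the logical plans execute on the EMR cluster, the same remote session can also work with data that the cluster’s instance role is allowed to read. The following sketch uses a hypothetical S3 path; replace it with a location that EMR_EC2_SparkClusterNodesRole can access:

# Hypothetical path; the read runs on the EMR cluster, not on your local machine
df = spark.read.csv("s3://<your-data-bucket>/path/to/data.csv", header=True)
df.show(5)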

Clean up

When you no longer need the cluster, delete the following resources to avoid ongoing charges:

  1. Remove the listener, target group, and application load balancer.
  2. Remove the ACM certificate.
  3. Remove the inbound rules added to the Amazon EMR primary node security group and delete the load balancer security group.
  4. Terminate the EMR cluster.
  5. Empty the Amazon S3 bucket and delete it.
  6. Remove the AmazonEMR-ServiceRole-SparkConnectDemo and EMR_EC2_SparkClusterNodesRole roles and the EMR_EC2_SparkClusterInstanceProfile instance profile.
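
If you still have the shell variables from the walkthrough, the following sketch shows one possible order for these cleanup calls (adjust the role and policy names if you changed them):

aws elbv2 delete-listener --listener-arn $ALB_LISTENER_ARN
aws elbv2 delete-load-balancer --load-balancer-arn $ALB_ARN
aws elbv2 delete-target-group --target-group-arn $ALB_TG_ARN
aws acm delete-certificate --certificate-arn $ACM_CERT_ARN

# Revoke the rule that references the ALB security group, then delete that security group
# (the group can be deleted once the load balancer has finished deleting)
aws ec2 revoke-security-group-ingress --group-id $PRIMARY_NODE_SG \
--protocol tcp --port 15002 --source-group $ALB_SG_ID
aws ec2 delete-security-group --group-id $ALB_SG_ID

aws emr terminate-clusters --cluster-ids $CLUSTER_ID
aws s3 rb s3://$BUCKET_NAME --force

# Remove the role from the instance profile, detach the managed policies, then delete the IAM resources
aws iam remove-role-from-instance-profile \
--instance-profile-name EMR_EC2_SparkClusterInstanceProfile --role-name EMR_EC2_SparkClusterNodesRole
aws iam delete-instance-profile --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
aws iam detach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam detach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name EMR_EC2_SparkClusterNodesRole
aws iam detach-role-policy --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
aws iam detach-role-policy --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam detach-role-policy --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
aws iam delete-role --role-name AmazonEMR-ServiceRole-SparkConnectDemo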

Considerations

Security considerations with Spark Connect:

  • Private subnet deployment – Keep EMR clusters in private subnets without direct Internet access, using NAT gateways for outbound connections only.
  • Access logging and monitoring – Enable VPC Flow Logs, AWS CloudTrail, and bastion host access logs for audit trails and security monitoring.
  • Security group restrictions – Configure security groups to allow access to the Spark Connect port (15002) only from the bastion host or specific IP ranges.

Conclusion

In this post, we’ve shown how you can adopt modern developer workflows and debug Spark applications from local IDEs or laptops while execution happens on remote Amazon EMR clusters. With Spark Connect’s client-server architecture, a Spark cluster can run a different version than the client applications, so operations teams can perform infrastructure upgrades and patches independently.

As cluster operators gain experience, they can customize bootstrap actions and add data processing steps. Consider exploring Amazon Managed Workflows for Apache Airflow (MWAA) to orchestrate your data pipeline.


About the authors

Philip Wanner


Philippe is the EMEA Tech Lead at AWS. His role is to accelerate digital transformation for large organizations. His current focus spans business transformation, technical strategy, and distributed systems.

Ege Oguzman


Ege is a software development engineer at AWS and was previously a solution architect in the public sector. As a builder and cloud enthusiast, he specializes in distributed systems and spends his time developing infrastructure and helping organizations build solutions on AWS.
