Introduced in Spark 3.4, Apache Spark Connect enhances the Spark ecosystem by offering a client-server architecture that separates the Spark runtime from the client application. Spark Connect enables more flexible and efficient interactions with Spark clusters, especially in scenarios where direct access to cluster resources is limited or impractical.
A key use case for Spark Connect on Amazon EMR is the ability to connect directly from your local development environments to Amazon EMR clusters. With this decoupled approach, you can write and test Spark code on your laptop while using Amazon EMR clusters for execution. This feature reduces development time and simplifies data processing with Spark on Amazon EMR.
In this post, we demonstrate how to implement Apache Spark Connect on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to create decoupled data processing applications. We’ll show how to securely set up and configure Spark Connect so you can develop and test Spark applications locally while running them on remote Amazon EMR clusters.
Architectural solution
The architecture centers around an Amazon EMR cluster with two types of nodes. The primary node hosts both the Spark Connect API endpoint and the Spark Core components, acting as a gateway for client connections. The core node provides additional computing capacity for distributed processing. Although this solution demonstrates a two-node architecture for simplicity, it can scale to multiple core and task nodes based on workload requirements.
Apache Spark Connect (as of version 4.x) does not natively support TLS/SSL network encryption. We'll show you how to implement secure communication by deploying an Amazon EMR cluster with Spark Connect on Amazon EC2 behind an Application Load Balancer (ALB) that performs TLS termination. This approach enables encrypted data transfer between Spark Connect clients and Amazon Virtual Private Cloud (Amazon VPC) resources.
The operational flow is as follows:
- Bootstrap script – During Amazon EMR initialization, the primary node downloads and runs the start-spark-connect.sh file from Amazon Simple Storage Service (Amazon S3). This script starts the Spark Connect server.
- Server availability – After the bootstrap process is complete, the Spark Connect server enters a waiting state and is ready to accept incoming connections. The Spark Connect API endpoint is exposed on the configured port (usually 15002) and listens for gRPC connections from remote clients.
- Client interaction – Spark Connect clients make secure connections to the Application Load Balancer. These clients translate DataFrame operations into unresolved logical query plans, encode those plans using protocol buffers, and send them to the Spark Connect API using gRPC.
- Encryption in transit – The Application Load Balancer receives incoming gRPC or HTTPS traffic, performs TLS termination (traffic decryption), and forwards requests to the primary node. The certificate is stored in AWS Certificate Manager (ACM).
- Application processing – The Spark Connect API receives the unresolved logical plans, converts them to Spark's built-in logical plan operators, passes them to Spark Core for optimization and execution, and streams the results back to the client as batches of rows encoded with Apache Arrow.
- (Optional) Operational access – Administrators can securely connect to both primary and core nodes through Session Manager, an AWS Systems Manager feature that enables troubleshooting and maintenance without exposing SSH ports or managing key pairs.
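For example, an administrator could open an interactive shell on a cluster node with Session Manager; the instance ID below is a placeholder:

```bash
# Open a Session Manager shell on a cluster node — no SSH port,
# bastion host, or key pair required (instance ID is a placeholder).
aws ssm start-session --target i-0123456789abcdef0
```
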
The following diagram shows the architecture of this post’s demo for sending unresolved Spark logical plans to EMR clusters using Spark Connect.
Apache Spark Connect on Amazon EMR solution architecture diagram
Prerequisites
To continue with this post, make sure you have the following:
Implementation steps
In this walkthrough, you use AWS CLI commands to complete the following steps:
- Prepare the bootstrap bash script to run Spark Connect on Amazon EMR.
- Set permissions for Amazon EMR to provide resources and perform service-level actions with other AWS services.
- Create an Amazon EMR cluster with these associated roles and permissions and optionally attach the prepared script as a bootstrap action.
- Deploy the Application Load Balancer with an ACM certificate to secure data transmitted over the internet.
- Modify the security group of the primary node so Spark Connect clients can connect.
- Test the client connection to the Spark Connect server.
Prepare the bootstrap script
To prepare the bootstrap script, follow these steps:
- Create an Amazon S3 bucket to host the bootstrap bash script:
- Open your preferred text editor and add the following commands to a new file named, for example, start-spark-connect.sh. When the script runs on the primary node, it starts the Spark Connect server; on core or task nodes, it does nothing:
- Upload the script to the bucket created in step 1:
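The bootstrap script might look like the following sketch. The file paths and the Spark/Scala versions in the `--packages` coordinate are assumptions; match them to your EMR release:

```bash
#!/bin/bash
# start-spark-connect.sh — sketch of the bootstrap action described above.
# Create the bucket and upload the script with, for example:
#   aws s3 mb s3://your-bootstrap-bucket
#   aws s3 cp start-spark-connect.sh s3://your-bootstrap-bucket/
set -u

INSTANCE_INFO=/mnt/var/lib/info/instance.json

# On EMR, instance.json reports whether this node is the primary node.
if [ -f "$INSTANCE_INFO" ] && grep -q '"isMaster": true' "$INSTANCE_INFO"; then
  # Bootstrap actions run before applications are installed, so wait in the
  # background until Spark is present, then start the Spark Connect server.
  sudo bash -c '
    while [ ! -f /usr/lib/spark/sbin/start-connect-server.sh ]; do sleep 15; done
    /usr/lib/spark/sbin/start-connect-server.sh \
      --packages org.apache.spark:spark-connect_2.12:3.5.5 \
      > /var/log/spark-connect-bootstrap.log 2>&1
  ' &
else
  # Core and task nodes fall through without starting anything.
  echo "Not the EMR primary node; nothing to do."
fi
```
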
Set permissions
Before creating a cluster, you must create a service role and instance profile. A service role is an IAM role that Amazon EMR assumes to provide resources and perform service-level actions with other AWS services. An EC2 instance profile for Amazon EMR assigns a role to each EC2 instance in the cluster. The instance profile must specify a role that has access to the resources for your bootstrap action.
- Create an IAM role:
- Attach the necessary managed policies to the service role to allow Amazon EMR to manage Amazon EC2 and Amazon S3 core services on your behalf, and optionally allow the instances to interact with Systems Manager:
- Create an Amazon EMR instance role to grant permissions to EC2 instances to interact with Amazon S3 or other AWS services:
- To enable the cluster instances to read from Amazon S3, attach the AmazonS3ReadOnlyAccess managed policy to the Amazon EMR instance role. For a production environment, review this access policy and replace it with a custom least-privilege policy that grants only the specific permissions needed for your use case:
- Attach the AmazonSSMManagedInstanceCore policy to enable instances to use core Systems Manager features such as Session Manager and Amazon CloudWatch:
- To pass the IAM role to EC2 instances when they launch, create an Amazon EMR EC2 instance profile named EMR_EC2_SparkClusterInstanceProfile:
- Attach the EMR_EC2_SparkClusterNodesRole role created in step 3 to the newly created instance profile:
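The steps above can be sketched with the AWS CLI as follows. The role and profile names are the ones used elsewhere in this post; the managed policies are standard AWS ones, which you should narrow for production:

```bash
# Service role that Amazon EMR assumes.
aws iam create-role \
  --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
      "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
      "Action": "sts:AssumeRole"}]}'
aws iam attach-role-policy \
  --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2

# Instance role for the cluster's EC2 instances.
aws iam create-role \
  --role-name EMR_EC2_SparkClusterNodesRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"}]}'

# S3 read access (replace with least privilege in production) and SSM access.
aws iam attach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam attach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# Instance profile that passes the role to the instances at launch.
aws iam create-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
  --role-name EMR_EC2_SparkClusterNodesRole
```
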
Create an Amazon EMR cluster
To create an Amazon EMR cluster, follow these steps:
- Set the environment variables where your EMR cluster and load-balancer must be deployed:
- Create an EMR cluster with the latest version of Amazon EMR. Replace the placeholder with your actual S3 bucket name where the bootstrap action script is stored:
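The two steps above might look like the following. The Region, subnet, instance type, and release label are assumptions; choose a release that ships the Spark version you want:

```bash
# Environment for the deployment (placeholders — adjust to your account).
export AWS_REGION=us-east-1
export SUBNET_ID=subnet-xxxxxxxx
export BUCKET_NAME=your-bootstrap-bucket

# Launch a two-node cluster with the bootstrap action attached.
aws emr create-cluster \
  --name "spark-connect-demo" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 2 \
  --service-role AmazonEMR-ServiceRole-SparkConnectDemo \
  --ec2-attributes "InstanceProfile=EMR_EC2_SparkClusterInstanceProfile,SubnetId=$SUBNET_ID" \
  --bootstrap-actions "Path=s3://$BUCKET_NAME/start-spark-connect.sh" \
  --region "$AWS_REGION"
```
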
Next, modify the security group of the primary node to allow EC2 Instance Connect to initiate SSH sessions.
- Get the security group identifier of the primary node. Make a note of it, because it appears as primary-node-security-group-id in subsequent configuration steps:
- Find the EC2 Instance Connect prefix list ID for your Region by filtering on EC2_INSTANCE_CONNECT with the describe-managed-prefix-lists command. Using a managed prefix list provides dynamic security configuration that authorizes the EC2 Instance Connect service to reach the primary and core nodes over SSH:
- Modify the inbound security group rules of the primary node to allow SSH access (port 22) to the primary node of the EMR cluster from the sources contained in the Instance Connect prefix list:
Optionally, you can repeat the previous steps 1-3 for the cluster's core (and task) nodes to enable Amazon EC2 Instance Connect to access those EC2 instances via SSH.
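The three steps above can be sketched as follows; the cluster ID is a placeholder and `AWS_REGION` is assumed to be set:

```bash
CLUSTER_ID=j-XXXXXXXXXXXXX   # placeholder cluster ID

# 1. Security group of the primary node.
PRIMARY_SG_ID=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
  --query 'Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup' \
  --output text)

# 2. EC2 Instance Connect managed prefix list for the Region.
PREFIX_LIST_ID=$(aws ec2 describe-managed-prefix-lists \
  --filters "Name=prefix-list-name,Values=com.amazonaws.$AWS_REGION.ec2-instance-connect" \
  --query 'PrefixLists[0].PrefixListId' --output text)

# 3. Allow SSH (port 22) from the Instance Connect prefix list.
aws ec2 authorize-security-group-ingress \
  --group-id "$PRIMARY_SG_ID" \
  --ip-permissions "IpProtocol=tcp,FromPort=22,ToPort=22,PrefixListIds=[{PrefixListId=$PREFIX_LIST_ID}]"
```
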
Deploy the Application Load Balancer and the certificate
To deploy the Application Load Balancer and certificate, follow these steps:
- Create a load balancer security group:
- Add a rule to accept TCP traffic from a trusted IP on port 443. We recommend using the IP address of the local development machine. You can check your current public IP address here: https://checkip.amazonaws.com:
- Create a new gRPC target group that targets the Spark Connect server instance and the port the server is listening on:
- Create an application load balancer:
- Get the DNS name of the load balancer:
- Get the Amazon EMR Primary Node ID:
- (Optional) The load balancer needs a certificate to encrypt and decrypt traffic. You can skip this step if you already have a trusted certificate in ACM. Otherwise, create a self-signed certificate:
- Import the certificate into ACM:
- Create a load balancer listener:
- After establishing the listener, register the primary node to the target group:
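The full sequence above can be sketched as follows. All IDs are placeholders, the self-signed certificate is for demonstration only, and the target group settings (plaintext gRPC over HTTP/2 to port 15002) reflect this demo's assumption that the Spark Connect server itself does not serve TLS:

```bash
VPC_ID=vpc-xxxxxxxx                      # placeholder VPC
SUBNET_ID_1=subnet-aaaaaaa               # two public subnets in
SUBNET_ID_2=subnet-bbbbbbb               # different Availability Zones

# 1. Security group for the load balancer, admitting HTTPS only from your IP.
ALB_SG_ID=$(aws ec2 create-security-group \
  --group-name spark-connect-alb-sg --description "Spark Connect ALB" \
  --vpc-id "$VPC_ID" --query GroupId --output text)
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
  --group-id "$ALB_SG_ID" --protocol tcp --port 443 --cidr "$MY_IP/32"

# 2. gRPC target group pointing at the Spark Connect port.
TG_ARN=$(aws elbv2 create-target-group \
  --name spark-connect-tg \
  --protocol HTTP --protocol-version GRPC --port 15002 \
  --vpc-id "$VPC_ID" --target-type instance \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# 3. Internet-facing Application Load Balancer.
ALB_ARN=$(aws elbv2 create-load-balancer \
  --name spark-connect-alb \
  --subnets "$SUBNET_ID_1" "$SUBNET_ID_2" \
  --security-groups "$ALB_SG_ID" \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
aws elbv2 describe-load-balancers --load-balancer-arns "$ALB_ARN" \
  --query 'LoadBalancers[0].DNSName' --output text

# 4. Self-signed certificate imported into ACM (demo only).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout spark-connect.key -out spark-connect.crt \
  -subj "/CN=spark-connect-demo"
CERT_ARN=$(aws acm import-certificate \
  --certificate fileb://spark-connect.crt --private-key fileb://spark-connect.key \
  --query CertificateArn --output text)

# 5. HTTPS listener with TLS termination, forwarding to the target group.
aws elbv2 create-listener \
  --load-balancer-arn "$ALB_ARN" --protocol HTTPS --port 443 \
  --certificates "CertificateArn=$CERT_ARN" \
  --default-actions "Type=forward,TargetGroupArn=$TG_ARN"

# 6. Register the EMR primary node as the target.
PRIMARY_INSTANCE_ID=$(aws emr list-instances --cluster-id "$CLUSTER_ID" \
  --instance-group-types MASTER \
  --query 'Instances[0].Ec2InstanceId' --output text)
aws elbv2 register-targets --target-group-arn "$TG_ARN" \
  --targets "Id=$PRIMARY_INSTANCE_ID"
```
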
Modify the security group of the primary node to allow Spark Connect clients to connect
To allow Spark Connect traffic to reach the server, edit only the primary security group. Add an inbound rule to the primary node’s security group that accepts Spark Connect TCP connections on port 15002 from the Application Load Balancer’s security group, or from a trusted IP address of your choice if clients connect directly:
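One way to add such a rule, here admitting forwarded traffic from the load balancer's security group (both group IDs are placeholders):

```bash
PRIMARY_SG_ID=sg-xxxxxxxx   # primary node security group (placeholder)
ALB_SG_ID=sg-yyyyyyyy       # load balancer security group (placeholder)

# Allow the load balancer to reach the Spark Connect port on the primary node.
aws ec2 authorize-security-group-ingress \
  --group-id "$PRIMARY_SG_ID" \
  --protocol tcp --port 15002 \
  --source-group "$ALB_SG_ID"
```
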
Connect to the test application
This example shows that a client running a newer version of Spark (4.0.1) can successfully connect to an Amazon EMR cluster running an older version (Spark 3.5.5), demonstrating Spark Connect’s version compatibility. This version combination is for demonstration purposes only; running older versions can pose security risks in a production environment.
We provide the following Python test application to test the client-server connection. We recommend creating and activating a virtual Python environment (venv) before installing the packages. This helps isolate dependencies for that particular project and avoid conflicts with other Python projects. To install the packages, run the following command:
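The exact package extra and version pin below are assumptions; a Spark 4.x Connect client can typically be installed in a fresh virtual environment like this:

```bash
# Create and activate an isolated environment, then install the client.
python3 -m venv .venv
source .venv/bin/activate
pip install "pyspark[connect]==4.0.1"
```
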
In your integrated development environment (IDE), copy and paste the following code, replace the placeholder, and call it. The code creates a Spark DataFrame containing two rows and displays its data:
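A sketch of such a client follows. The ALB DNS name is a placeholder (set the hypothetical `SPARK_CONNECT_HOST` environment variable or edit the default), and the `sc://host:443/;use_ssl=true` connection string tells the client to speak gRPC over TLS to the load balancer:

```python
import os

# Placeholder endpoint: replace with your ALB DNS name, or export
# SPARK_CONNECT_HOST before running.
host = os.environ.get("SPARK_CONNECT_HOST", "your-alb-dns-name.elb.amazonaws.com")
remote_url = f"sc://{host}:443/;use_ssl=true"

if os.environ.get("SPARK_CONNECT_HOST"):
    from pyspark.sql import SparkSession  # requires the pyspark connect client

    # Connect through the ALB, which terminates TLS and forwards gRPC
    # traffic to the Spark Connect server on the primary node.
    spark = SparkSession.builder.remote(remote_url).getOrCreate()

    # Build a two-row DataFrame on the remote cluster and display it.
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)],
                               schema=["name", "age"])
    df.show()
    spark.stop()
else:
    print(f"Set SPARK_CONNECT_HOST to your ALB DNS name to connect via {remote_url}")
```
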
The following shows the output of the application:
Clean up
When you no longer need the cluster, delete the following resources to avoid ongoing charges:
- Remove the listener, target group, and application load balancer.
- Remove the ACM certificate.
- Remove the load balancer security group and the security group rules added to the Amazon EMR primary node.
- Terminate the EMR cluster.
- Empty the Amazon S3 bucket and delete it.
- Remove the AmazonEMR-ServiceRole-SparkConnectDemo and EMR_EC2_SparkClusterNodesRole roles and the EMR_EC2_SparkClusterInstanceProfile instance profile.
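The cleanup steps above might look like the following; substitute the ARNs and IDs you created earlier for the placeholder variables:

```bash
# Load balancer resources.
aws elbv2 delete-listener --listener-arn "$LISTENER_ARN"
aws elbv2 delete-load-balancer --load-balancer-arn "$ALB_ARN"
aws elbv2 delete-target-group --target-group-arn "$TG_ARN"
aws acm delete-certificate --certificate-arn "$CERT_ARN"

# Cluster and bootstrap bucket.
aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
aws s3 rb "s3://$BUCKET_NAME" --force

# IAM cleanup: detach policies before deleting roles.
aws iam remove-role-from-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile \
  --role-name EMR_EC2_SparkClusterNodesRole
aws iam delete-instance-profile \
  --instance-profile-name EMR_EC2_SparkClusterInstanceProfile
aws iam detach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam detach-role-policy --role-name EMR_EC2_SparkClusterNodesRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name EMR_EC2_SparkClusterNodesRole
aws iam detach-role-policy --role-name AmazonEMR-ServiceRole-SparkConnectDemo \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2
aws iam delete-role --role-name AmazonEMR-ServiceRole-SparkConnectDemo
```
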
Considerations
Security considerations with Spark Connect:
- Private subnet deployment – Keep EMR clusters in private subnets without direct Internet access, using NAT gateways for outbound connections only.
- Access logging and monitoring – Enable VPC Flow Logs, AWS CloudTrail, and bastion host access logs for audit trails and security monitoring.
- Security group restrictions – Configure security groups to allow access to the Spark Connect port (15002) only from the bastion host or specific IP ranges.
Conclusion
In this post, we’ve shown how you can adopt modern developer workflows and debug Spark applications from a local IDE or laptop while stepping through code that executes on a remote cluster. With Spark Connect’s client-server architecture, a Spark cluster can run a different version than client applications, so operations teams can perform infrastructure upgrades and patches independently.
As cluster operators gain experience, they can customize bootstrap actions and add data processing steps. Consider exploring Amazon Managed Workflows for Apache Airflow (MWAA) to orchestrate your data pipeline.
About the authors