How to: Configure an EC2 instance to run Spark jobs on an EMR cluster

This setup is for submitting Apache Spark jobs to an Amazon EMR cluster from a remote machine, such as an EC2 instance launched by Elastic Beanstalk. This can be useful for services running in Elastic Beanstalk that also need to interact with the Spark engine for computation.

Short Description

To submit Spark jobs to an EMR cluster from a remote machine, the following must be true:

Confirm that network traffic is allowed from the remote machine to all cluster nodes.

This configuration lets you submit Spark jobs to EMR in cluster mode. If you also want to query tables and pull data back to the remote machine, make sure the remote machine is in the same security group as the EMR cluster.
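One way to open that traffic, sketched below with hypothetical security group IDs (replace them with your own), is to authorize ingress from the remote machine's security group into the EMR master and core/task groups. The helper only prints each aws CLI command so you can review it before running it:

```shell
# Sketch with placeholder group IDs: allow all traffic from the remote
# machine's security group into an EMR security group.
# allow_from_remote echoes the aws CLI call instead of executing it,
# so the output can be reviewed first.
allow_from_remote() {
    emr_sg="$1"
    remote_sg="$2"
    echo aws ec2 authorize-security-group-ingress \
        --group-id "$emr_sg" \
        --protocol all \
        --source-group "$remote_sg"
}

# Placeholder IDs; review the printed commands, then run them yourself:
allow_from_remote sg-emr-master sg-remote-box
allow_from_remote sg-emr-core   sg-remote-box
```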

Install Spark and the other dependent binaries on the remote machine

To install the binaries, copy the files from the EMR cluster’s master node, as explained in the following steps. This is the easiest way to ensure that the same versions are installed on both the EMR cluster and the remote machine.

1. Run the following commands on the remote machine to create the required directories:

sudo mkdir -p /var/aws/emr/ 
sudo mkdir -p /etc/hadoop/conf
sudo mkdir -p /etc/spark/conf
sudo mkdir -p /var/log/spark/user/
sudo chmod 777 -R /var/log/spark/

2. Copy the following files from the EMR cluster’s master node to the remote machine. Don’t change the folder structure or file names.


3. Run the following commands to install the Spark and Hadoop binaries:

sudo yum install -y hadoop-client 
sudo yum install -y hadoop-hdfs
sudo yum install -y spark-core
sudo yum install -y java-1.8.0-openjdk

4. If you want to use the AWS Glue Data Catalog with Spark, run the following command on the remote machine to install the AWS Glue libraries:

sudo yum install -y libgssglue

If you are using Amazon Linux 2, the above library is not available; you might need to install the following instead:

sudo yum install aws-hm-client.noarch

Additionally, you must configure AWS credentials for a user that has the AWSGlueServiceRole policy, so the remote machine can access AWS Glue. You can configure this user with the aws configure command.
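A minimal sketch of that credential setup, with placeholder values (substitute your own key pair and region; the IAM user is assumed to have the Glue policy attached):

```shell
# Placeholder credentials for an IAM user with AWS Glue access.
# Replace with your own values; do not commit real keys anywhere.
aws configure set aws_access_key_id     AKIAXXXXXXXXXXXXXXXX
aws configure set aws_secret_access_key '<your-secret-access-key>'
aws configure set region                us-east-1
```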

5. Install the EMR libraries for accessing table data in S3:

sudo yum install -y emrfs 
sudo yum install -y emr-*

6. Create working directories: create (mkdir -p) and allow writes to (chmod 1777) the scratch directories used by the EMR configuration (this list may change slightly between EMR versions):

/mnt/s3 # fs.s3.buffer.dir
/mnt1/s3 # fs.s3.buffer.dir
/mnt/var/lib/hadoop/tmp # hadoop.tmp.dir
/mnt/tmp #
/var/log/hive/user # hive.log.dir
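The directory list above can be scripted. This sketch creates each scratch directory and opens it with the sticky bit set; the optional prefix argument (an addition for rehearsal, not part of the EMR layout) lets you dry-run it somewhere harmless before running it as root with no argument:

```shell
# Create the EMR scratch directories and open them (chmod 1777).
# Pass a prefix such as /tmp/test to rehearse; pass nothing (and run
# as root) on the real machine.
make_emr_scratch_dirs() {
    prefix="${1:-}"
    for d in /mnt/s3 /mnt1/s3 /mnt/var/lib/hadoop/tmp /mnt/tmp /var/log/hive/user; do
        mkdir -p "${prefix}${d}"
        chmod 1777 "${prefix}${d}"
    done
}
```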

Create the configuration files and point them to the EMR cluster

Note: You can also use tools such as rsync to copy the configuration files from the EMR master node to the remote instance.

1. Run the following commands on the EMR cluster’s master node to copy the configuration files to an S3 bucket. Replace yours3bucket with the name of a bucket that you can access:

aws s3 cp /etc/spark/conf s3://yours3bucket/emrhadoop-conf/sparkconf/ --recursive
aws s3 cp /etc/hadoop/conf s3://yours3bucket/emrhadoop-conf/hadoopconf/ --recursive

2. Download the configuration files from the S3 bucket to the remote machine by running the following commands. Replace yours3bucket with the name of the bucket that you used in the previous step.

sudo aws s3 cp s3://yours3bucket/emrhadoop-conf/hadoopconf/ /etc/hadoop/conf/ --recursive
sudo aws s3 cp s3://yours3bucket/emrhadoop-conf/sparkconf/ /etc/spark/conf/ --recursive

3. Create the HDFS home directory for the user who will submit the Spark job to the EMR cluster. In the following commands, replace sparkuser with the name of your user.

hdfs dfs -mkdir /user/sparkuser
hdfs dfs -chown sparkuser:sparkuser /user/sparkuser

The remote machine is now ready to submit Spark jobs to the EMR cluster.
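As a smoke test, you can submit the SparkPi example that ships with EMR's Spark installation in cluster mode (the jar path below is the usual EMR location; adjust it if your install differs). The helper only prints the spark-submit command so you can review it before running it on the remote machine:

```shell
# Smoke test sketch: submit the bundled SparkPi example in cluster mode.
# build_submit_cmd echoes the command instead of running it; the jar path
# is the typical EMR location and may differ on your cluster.
build_submit_cmd() {
    echo spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode cluster \
        /usr/lib/spark/examples/jars/spark-examples.jar 100
}

# Review the command, then execute it on the remote machine:
# $(build_submit_cmd)
```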