How to run Spark on EC2


I have recently been trying to run Spark on Amazon’s EC2. Here’s how it worked – and how it didn’t.

I have Spark 1.2.1 installed locally on my machine. Spark already comes with a script for running it on EC2, spark-ec2, which does most of the work for you. You just need:

  • well, you need Spark. Get it from the Spark project site. The documentation there on how to set up Spark on EC2 is quite good.
  • a working AWS account. This is easy, but takes a few minutes.
  • a user-specific secret access key and access key ID. These need to be generated in the (not ultra intuitive) AWS web interface.
  • a key pair. This is required for the ssh login. It’s easiest to create in the Amazon web interface. You will only see the RSA private key once, so save it in a text file “mykey.pem” and put it somewhere useful, e.g. ~/.ssh/

Then, open a shell, and

1. set variables:

[su_highlight background=”#fdeaf1″]export AWS_SECRET_ACCESS_KEY=l+m2qlU+DlERPQvnXRZEXAMPLE[/su_highlight]

[su_highlight background=”#fdeaf1″]export AWS_ACCESS_KEY_ID=AKIAJJ222FEXAMPLE[/su_highlight]

 

2. change permissions on “mykey.pem” so that only your user can read the file (ssh refuses private keys that other users can access):

[su_highlight background=”#fdeaf1″]chmod 400 ~/.ssh/mykey.pem[/su_highlight]
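
Steps 1 and 2 are easy to get slightly wrong (a typo in a variable name, a key file with the wrong mode), so here is a minimal sketch of a pre-flight check you could run before launching anything. The function name check_ec2_prereqs is my own invention and not part of Spark or the AWS tools; it only looks at the two environment variables and at the mode string that ls prints.

```shell
# Hypothetical helper (not part of Spark): checks the prerequisites from steps 1 and 2.
# Usage: check_ec2_prereqs ~/.ssh/mykey.pem
check_ec2_prereqs() {
    prereq_key_file="$1"
    prereq_status=0

    # Both credentials must be exported before spark-ec2 is called.
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
        echo "AWS_ACCESS_KEY_ID is not set" >&2
        prereq_status=1
    fi
    if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_SECRET_ACCESS_KEY is not set" >&2
        prereq_status=1
    fi

    if [ ! -f "$prereq_key_file" ]; then
        echo "key file not found: $prereq_key_file" >&2
        prereq_status=1
    else
        # Characters 5-10 of the ls mode string are the group and other bits.
        prereq_mode=$(ls -ld "$prereq_key_file" | cut -c5-10)
        if [ "$prereq_mode" != "------" ]; then
            echo "key file is accessible to group or others; run chmod 400 on it" >&2
            prereq_status=1
        fi
    fi

    return "$prereq_status"
}
```

The function returns 0 only if both variables are non-empty and the key file exists with no group or other permissions, so it can be chained in front of the launch command with &&.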

 

3. cd into the right folder (your-spark-version/ec2), and launch a cluster instance:

[su_highlight background=”#fdeaf1″]./spark-ec2 --spark-version=1.2.1 --key-pair=mykey --identity-file=mykey.pem launch test-cluster[/su_highlight]

Now wait. This is probably going to stall for a while, telling you it’s just waiting for all instances to become ssh-ready.

When it’s done, you should see something like this:

[Screenshot: terminal output at the end of the launch]

If you don’t, then try to resume the job:

[su_highlight background=”#fdeaf1″]./spark-ec2 --spark-version=1.2.1 --key-pair=mykey --identity-file=mykey.pem launch test-cluster --resume[/su_highlight]

After that, I saw what I wanted to see. The failed tests are actually non-critical errors: these processes are just not running when the script tries to shut them down.
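
If you find yourself repeating the launch-then-resume dance, it can be wrapped in a small function. This is only a sketch: the name launch_or_resume and the SPARK_EC2 variable are made up for this example, and it assumes that spark-ec2 exits with a non-zero status when the launch does not complete, which I have not verified for every failure mode.

```shell
# Hypothetical wrapper: launch, and retry once with --resume if the first attempt fails.
# SPARK_EC2 is an assumed override for the script path; it defaults to ./spark-ec2.
launch_or_resume() {
    launch_cluster="$1"
    launch_script="${SPARK_EC2:-./spark-ec2}"

    # First attempt: a plain launch with the same options as in step 3.
    if "$launch_script" --spark-version=1.2.1 --key-pair=mykey \
        --identity-file=mykey.pem launch "$launch_cluster"; then
        return 0
    fi

    # Second attempt: same command with --resume appended.
    echo "launch did not finish cleanly, retrying with --resume" >&2
    "$launch_script" --spark-version=1.2.1 --key-pair=mykey \
        --identity-file=mykey.pem launch "$launch_cluster" --resume
}
```

Called from the ec2 folder as launch_or_resume test-cluster, it runs the launch command from step 3 and falls back to the resume variant exactly once. The return status is that of the last command that ran.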

4. Now, ssh into your cluster:

[su_highlight background=”#fdeaf1″]./spark-ec2 -k mykey -i /Users/daniela.drechsel/mykey.pem -w 400 login test-cluster[/su_highlight]

and see this:

[Screenshot: terminal after a successful login]

You’re in! At some point, you should probably also log out (quit), and shut down your cluster. That’s done with:

[su_highlight background=”#fdeaf1″]./spark-ec2 destroy test-cluster [/su_highlight]
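
Since a forgotten cluster keeps costing money, one option is to tie the login and the destroy together so that the second always follows the first. Again, this is just a sketch under my own assumptions: work_then_destroy and SPARK_EC2 are names invented here, and the function uses only the login and destroy actions shown above.

```shell
# Hypothetical wrapper: log in, then destroy the cluster once the login session ends.
# Usage: work_then_destroy test-cluster mykey ~/.ssh/mykey.pem
work_then_destroy() {
    session_cluster="$1"
    session_key_pair="$2"
    session_identity_file="$3"
    session_script="${SPARK_EC2:-./spark-ec2}"

    # Blocks until you log out of the cluster. The exit status is ignored on
    # purpose, so the destroy step runs even if the session ended with an error.
    "$session_script" -k "$session_key_pair" -i "$session_identity_file" \
        login "$session_cluster" || true

    # Tear the cluster down; the return status is that of the destroy call.
    "$session_script" destroy "$session_cluster"
}
```

Whether you want the destroy to be this automatic is a matter of taste. If you plan to come back to the same cluster later, calling the two commands by hand is the safer choice.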