Run Berkeley Spark’s PySpark using Docker in a couple of minutes

For those of you interested in running the BDAS Spark stack in a virtualized cluster really quickly, the fastest way is to use Linux Containers managed by Docker. Using Andre Schumacher’s fantastic Berkeley Spark on Docker scripts and tutorial, you can get yourself a virtual cluster of whatever size you’d like in a couple of minutes!

However, the tutorial is Scala-centric, and you are dropped straight into a Scala shell. Since I am primarily interested in using Python for analysis and data science tasks, a couple more steps are needed.

Follow Andre’s tutorial, and start up a Spark 0.8.0 cluster on top of Docker as you normally would. Here I am starting up a 6-worker cluster:

user@aliens:~/Documents/docker-scripts⟫ sudo deploy/ -i amplab/spark:0.8.0 -w 6 -c
*** Starting Spark 0.8.0 ***
starting nameserver container
started nameserver container: 5093b46c4df527528cae0194a8b2849a258e314dc2e0b847c67950776b5715df
DNS host->IP file mapped: /tmp/dnsdir_10034/0hosts
waiting for nameserver to come up
starting master container
started master container: 4d431889af3c7176fa1a9ffee850c6658840a307e86ad6fbf2691e54fe8fb792

...lots more output

We are interested in the part that tells us how to connect with SSH into the master node…

start shell via:            sudo /home/user/Documents/docker-scripts/deploy/ -i amplab/spark-shell:0.8.0 -n 5093b46c4df527528cae0194a8b2849a258e314dc2e0b847c67950776b5715df

visit Spark WebUI at:
visit Hadoop Namenode at:
ssh into master via:        ssh -i /home/user/Documents/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@

/data mapped:

kill master via:           sudo docker kill 4d431889af3c7176fa1a9ffee850c6658840a307e86ad6fbf2691e54fe8fb792

You can see in the second block of output above that you can SSH into the master using the ‘ssh -i ….’ command. However, there is a permissions problem with the id_rsa file, and SSH will refuse to let you into the master node.

Fix this with:

chmod 0600 docker-scripts/apache-hadoop-hdfs-precise/files/id_rsa
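The same fix as a small sketch with a sanity check at the end, assuming you cloned docker-scripts into your current directory:

```shell
# ssh refuses private keys that are group- or world-readable, so
# restrict the key to owner read/write only (mode 0600).
KEY=docker-scripts/apache-hadoop-hdfs-precise/files/id_rsa
chmod 0600 "$KEY"
stat -c '%a' "$KEY"   # should print 600
```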

Great, now we can enter the master node with this command:

ssh -i docker-scripts/apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@

Of course, substitute the IP address that Docker generated for you (shown as MASTER_IP in the output above).
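If you would rather not copy the IP by hand, you can ask Docker for it directly. This is just a sketch, assuming your Docker version supports the `docker inspect --format` flag, and that MASTER_ID holds the master container id the deploy script printed earlier:

```shell
# Hypothetical helper: look up the master container's IP address
# so it can be reused in the ssh command below.
MASTER_ID=4d431889af3c7176fa1a9ffee850c6658840a307e86ad6fbf2691e54fe8fb792
MASTER_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$MASTER_ID")
echo "$MASTER_IP"
```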

From inside the master node we want to use Python 2.7 to do our work, so go find ‘pyspark’ inside /opt/spark-VERSION and run it.


We have a huge problem!

root@master:/opt/spark-0.8.0# ./pyspark
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/spark-0.8.0/python/pyspark/", line 25, in <module>
    import pyspark
  File "/opt/spark-0.8.0/python/pyspark/", line 41, in <module>
    from pyspark.context import SparkContext
  File "/opt/spark-0.8.0/python/pyspark/", line 21, in <module>
    from threading import Lock
ImportError: No module named threading
>>> sc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sc' is not defined

Fix number two:

Get out of the Python / PySpark Shell (Ctrl-D), and install python2.7:

sudo apt-get install python2.7

Great! Now run ‘pyspark’ again, and you should see it working perfectly:

root@master:/opt/spark-0.8.0# ./pyspark
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.8.0
      /_/

Using Python version 2.7.3 (default, Apr 20 2012 22:39:59)
Spark context avaiable as sc.

>>> sc


There you go! Python 2.7 on Spark using Docker, isn’t it lovely?

