Big Data on the Cloud using Ansible, RHadoop, AppScale, and AWS/Eucalyptus


Big Data has been a hot topic over the last few years.  Big Data on public clouds, such as AWS’s Elastic MapReduce, has been gaining even more popularity as cloud computing becomes more of an industry standard.

R is an open source project for statistical computing and graphics.  It has been growing in popularity at universities and companies for linear and nonlinear modeling, classical statistical tests, time-series analysis, and more.

R Project

RHadoop was developed by Revolution Analytics to interface with Hadoop.  Revolution Analytics builds analytic software solutions using R.

Revolution Analytics

AppScale is an open source PaaS that implements the Google AppEngine API on IaaS environments.  One of the Google AppEngine APIs that it implements is AppEngine MapReduce.  The back-end support for this API in AppScale uses Cloudera’s Distribution for Apache Hadoop.

AppScale Inc.

Ansible is open source orchestration software that uses SSH to handle configuration management for physical and virtual machines, including machines running in the cloud.

Ansible Works

Amazon Web Services is a public IaaS that provides infrastructure and application services in the cloud.  Eucalyptus is an open source software solution that provides the AWS APIs for EC2, S3, and IAM for on-premises cloud environments.

Amazon AWS EC2
Eucalyptus Systems Inc.

This blog entry will cover how to deploy AppScale (either on AWS or Eucalyptus), then use Ansible to configure each AppScale node with R and the RHadoop packages, in order to allow programs written in R to utilize MapReduce in the cloud.


To get started, the following is needed on a desktop/laptop computer:

*NOTE:  These variables are used by AppScale Tools version 1.6.9.  Check the AWS and Eucalyptus documentation regarding obtaining user credentials. 



After installing AppScale Tools and Ansible, the AppScale cluster needs to be deployed.  With the AWS/Eucalyptus variables defined, initialize the AppScale cluster configuration file, AppScalefile:

$ ./appscale-tools/bin/appscale init cloud

Edit the AppScalefile, providing information for the keypair, security group, and AppScale AMI/EMI.  The keypair and security group do not need to be pre-created. AppScale will handle this.  The AppScale AMI on AWS (us-east-1) is ami-4e472227.  The Eucalyptus EMI will be unique based upon the Eucalyptus cloud that is being used.  In this example, the AWS AppScale AMI will be used, and the AppScale cluster size will be 3 nodes.  Here is the example AppScalefile:

group : 'appscale-rmr'
infrastructure : 'ec2'
instance_type : 'm1.large'
keyname : 'appscale-rmr'
machine : 'ami-4e472227'
max : 3
min : 3
table : 'hypertable'
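
For anyone who needs more than one cluster, the AppScalefile above is just a flat list of key : value pairs, so it could be generated from a script.  The sketch below is only an illustration and is not part of AppScale Tools: render_appscalefile is a hypothetical helper, and the quoting rule (strings in single quotes, integers bare) is inferred purely from the example above.

```python
def render_appscalefile(settings):
    """Render settings as 'key : value' lines, mirroring the example above."""
    # Sanity check that is an assumption of this sketch, not an AppScale rule.
    if settings["min"] > settings["max"]:
        raise ValueError("min must not exceed max")
    lines = []
    for key in sorted(settings):  # the example file happens to be alphabetical
        value = settings[key]
        rendered = str(value) if isinstance(value, int) else "'%s'" % value
        lines.append("%s : %s" % (key, rendered))
    return "\n".join(lines) + "\n"


# Values copied from the example AppScalefile in this post.
settings = {
    "group": "appscale-rmr",
    "infrastructure": "ec2",
    "instance_type": "m1.large",
    "keyname": "appscale-rmr",
    "machine": "ami-4e472227",
    "max": 3,
    "min": 3,
    "table": "hypertable",
}

appscalefile_text = render_appscalefile(settings)
print(appscalefile_text)
```

Writing appscalefile_text to a file named AppScalefile would reproduce the example; for a Eucalyptus cloud, the machine value would be the EMI of that cloud instead.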

After editing the AppScalefile, start up the AppScale cluster by running the following command:

$ ./appscale-tools/bin/appscale up

Once the cluster finishes setting up, the status of the cluster can be seen by running the command below:

$ ./appscale-tools/bin/appscale status

R, RHadoop Installation Using Ansible

Now that the cluster is up and running, grab the Ansible playbook for installing R and the RHadoop rmr2 and rhdfs packages onto the AppScale nodes.  The playbook can be downloaded from GitHub using git:

$ git clone

After downloading the playbook, the ansible-r-appscale-playbook/production file needs to be populated with the information of the AppScale cluster.  Grab the cluster  node information by running the following command:

$ ./appscale-tools/bin/appscale status | grep amazon | grep Status | awk '{print $5}' | cut -d ":" -f 1
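
To make it clearer what that pipeline does, here is a rough Python equivalent: keep only lines containing both "amazon" and "Status", take the fifth whitespace-separated field, and drop everything from the first colon onward.  The sample status text is made up for illustration (the real output of appscale status may look different), and extract_hosts is a hypothetical name.

```python
def extract_hosts(status_output):
    """Mimic: grep amazon | grep Status | awk '{print $5}' | cut -d ":" -f 1"""
    hosts = []
    for line in status_output.splitlines():
        # grep amazon | grep Status: the line must contain both substrings
        if "amazon" not in line or "Status" not in line:
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # awk would print an empty line here; this sketch skips it
        # awk '{print $5}' picks the fifth field, cut keeps the part before ':'
        hosts.append(fields[4].split(":")[0])
    return hosts


# Hypothetical status text, invented only to exercise the function.
sample_status = """\
Status of node at ec2-203-0-113-10.compute-1.amazonaws.com:
    Currently using 10.0 Percent CPU
Status of node at ec2-203-0-113-11.compute-1.amazonaws.com:
"""

hosts = extract_hosts(sample_status)
for host in hosts:
    print(host)
```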

Add those DNS entries to the ansible-r-appscale-playbook/production file.  After editing, the file will look like the following:


Now the playbook can be executed.  The playbook requires the SSH private key to the nodes.  This key will be located under the ~/.appscale folder.  In this example, the key file is named appscale-rmr.key.  To execute the playbook, run the following command:

$ ansible-playbook -i r-appscale-deployment/production \
    --private-key=~/.appscale/appscale-rmr.key -v r-appscale-deployment/site.yml

Testing Out The Deployment – Wordcount.R

Once the playbook has finished running, the AppScale cluster is now ready to be used.  To test out the setup, SSH into the head node of the AppScale cluster.  To find out the head node of the cluster, execute the following command:

$ ./appscale-tools/bin/appscale status

After discovering the head node, SSH into the head node using the private key located in the ~/.appscale directory:

$ ssh -i ~/.appscale/appscale-rmr.key

To test out the R setup on all the nodes, grab the wordcount.R program:

root@appscale-image0:~# tar zxf rmr2_2.0.2.tar.gz rmr2/tests/wordcount.R

In the wordcount.R file, the following lines are present:

rmr2:::hdfs.put("/etc/passwd", "/tmp/wordcount-test")
out.hadoop = from.dfs(wordcount("/tmp/wordcount-test", pattern = " +"))

When the wordcount.R program is executed, it will grab the /etc/passwd file from the head node, copy it to HDFS as /tmp/wordcount-test, then run wordcount on that copy using the pattern " +" (one or more spaces) to separate words.  NOTE: wordcount.R can be edited to use any file and pattern desired.
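
For readers new to MapReduce, the idea behind a word count can be sketched in a few lines of plain Python.  This is only an in-memory stand-in, not the rmr2 implementation: map_words and reduce_counts are hypothetical names, the sample lines are invented, and the only thing borrowed from wordcount.R is the " +" pattern.

```python
import re
from collections import defaultdict


def map_words(line, pattern=" +"):
    """Map step: split one line on the pattern and emit (word, 1) pairs."""
    return [(word, 1) for word in re.split(pattern, line) if word]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Invented input; the first line imitates the shape of an /etc/passwd entry.
sample_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "big data  on the cloud",
    "the cloud",
]

pairs = [pair for line in sample_lines for pair in map_words(line)]
word_counts = reduce_counts(pairs)
print(word_counts)
```

Note that a passwd-style line with no spaces comes through as a single "word", which is why swapping in a different input file or pattern makes for a more interesting test.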

Run wordcount.R:

root@appscale-image0:~# R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> source('rmr2/tests/wordcount.R')
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
13/04/05 02:33:41 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:33:43 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
packageJobJar: [/tmp/RtmprcYtsu/rmr-local-env19811a7afd54, /tmp/RtmprcYtsu/rmr-global-env1981646cf288, /tmp/RtmprcYtsu/rmr-streaming-map198150b6ff60, /tmp/RtmprcYtsu/rmr-streaming-reduce198177b3496f, /tmp/RtmprcYtsu/rmr-streaming-combine19813f7ea210, /var/appscale/hadoop/hadoop-unjar5632722635192578728/] [] /tmp/streamjob8198423737782283790.jar tmpDir=null
13/04/05 02:33:44 WARN snappy.LoadSnappy: Snappy native library is available
13/04/05 02:33:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/05 02:33:44 INFO snappy.LoadSnappy: Snappy native library loaded
13/04/05 02:33:44 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/05 02:33:44 INFO streaming.StreamJob: getLocalDirs(): [/var/appscale/hadoop/mapred/local]
13/04/05 02:33:44 INFO streaming.StreamJob: Running job: job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: To kill this job, run:
13/04/05 02:33:44 INFO streaming.StreamJob: /root/appscale/AppDB/hadoop-0.20.2-cdh3u3/bin/hadoop job  -Dmapred.job.tracker= -kill job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: Tracking URL: http://appscale-image0:50030/jobdetails.jsp?jobid=job_201304042111_0015
13/04/05 02:33:45 INFO streaming.StreamJob:  map 0%  reduce 0%
13/04/05 02:33:51 INFO streaming.StreamJob:  map 50%  reduce 0%
13/04/05 02:33:52 INFO streaming.StreamJob:  map 100%  reduce 0%
13/04/05 02:33:59 INFO streaming.StreamJob:  map 100%  reduce 33%
13/04/05 02:34:02 INFO streaming.StreamJob:  map 100%  reduce 100%
13/04/05 02:34:04 INFO streaming.StreamJob: Job complete: job_201304042111_0015
13/04/05 02:34:04 INFO streaming.StreamJob: Output: /tmp/RtmprcYtsu/file1981524ee1a3
13/04/05 02:34:05 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:07 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:08 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:10 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Deleted hdfs://

That’s it!  The AppScale cluster is ready for additional R programs that utilize MapReduce.  Enjoy the world of Big Data on public/private IaaS.


AWS EBS-backed AMI to Eucalyptus Walrus-backed EMI


A few weeks back, I was doing some testing with the guys from AppScale to get a Eucalyptus Machine Image (EMI) to run on Eucalyptus.  The image that was provided to me was an EBS-backed Amazon Machine Image (AMI), using a published EC2 Lucid Ubuntu Cloud image.  This blog entry describes the procedure to convert an EBS-backed AMI to a Walrus-backed EMI.  The goal here is to demonstrate how easy it is to use Ubuntu Cloud images to set up AppScale on both AWS and Eucalyptus as a hybrid cloud use case.  There are many other hybrid cloud use cases that can be done with this setup, but this blog entry will focus on the migration of AMI images to EMI images.

*NOTE* This entry assumes that a user is experienced with both Amazon Web Services and Eucalyptus.  For additional information, please refer to the following resources:


Before getting started, the following is needed:

*NOTE* Make sure there is an understanding of the IAM policies on AWS and Eucalyptus.  These are key in making sure that the user on both AWS and Eucalyptus can perform all the steps covered in this topic.

Work in AWS…

After setting up the command-line tools for AWS EC2 and adding the necessary EC2 and S3 IAM policies, everything is in place to start working with the AWS instances and images.  *NOTE* To get help with setting up the IAM policies, check out the AWS Policy Generator.  To make sure things look good, I tested out my EC2 access by running ec2-describe-availability-zones:

$ ec2-describe-availability-zones 
AVAILABILITYZONE us-east-1a available us-east-1 
AVAILABILITYZONE us-east-1b available us-east-1 
AVAILABILITYZONE us-east-1c available us-east-1 
AVAILABILITYZONE us-east-1d available us-east-1

With everything looking good, I went ahead and checked out the AMI that I was asked to test.  Below is the AMI that was given to me:

$ ec2-describe-images ami-2e4bf22f --region ap-northeast-1
IMAGE ami-2e4bf22f 839953741869/appscale-lite-1.6.3-testing 839953741869 available public x86_64 machine aki-d409a2d5 ebs paravirtual xen
BLOCKDEVICEMAPPING EBS /dev/sda1 snap-7953a059 8 true standard
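
When checking several images, the root device type can be pulled out of the IMAGE line with a small script.  The sketch below is an illustration only: root_device_type is a hypothetical helper, it assumes the whitespace-separated layout shown in this post, and both sample lines are copied from the ec2-describe-images output in this entry.

```python
ROOT_DEVICE_TYPES = ("ebs", "instance-store")


def root_device_type(image_line):
    """Return 'ebs' or 'instance-store' for an IMAGE line, or None if absent."""
    fields = image_line.split()
    if not fields or fields[0] != "IMAGE":
        raise ValueError("not an IMAGE line")
    # Scan the fields rather than rely on a fixed column position.
    for field in fields[1:]:
        if field in ROOT_DEVICE_TYPES:
            return field
    return None


source_image = (
    "IMAGE ami-2e4bf22f 839953741869/appscale-lite-1.6.3-testing 839953741869 "
    "available public x86_64 machine aki-d409a2d5 ebs paravirtual xen"
)
converted_image = (
    "IMAGE ami-705d7c35 986451091583/appscale1.6.3-testing 986451091583 "
    "available private x86_64 machine aki-9ba0f1de instance-store paravirtual xen"
)

print(root_device_type(source_image))
print(root_device_type(converted_image))
```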

As you can see, the AMI given to me is an EBS-backed image, and it is in a different region (ap-northeast-1).  I could have done all my work in the ap-northeast-1 region, but I wanted to test out region-to-region migration of images on AWS S3 using ec2-migrate-manifest.  In order to access the EBS-backed instance that is launched, I set up a keypair and SSH access for any instance that is launched within the default security group:

$ ec2-create-keypair hspencer-appscale --region ap-northeast-1 > hspencer-appscale.pem
$ ec2-authorize -P tcp -p 22 -s default --region ap-northeast-1

Now that I have my image, keypair and security group access,  I am ready to launch an instance, so I can use the ec2-bundle-vol command to create an image of the instance.  To launch the instance, I ran the following:

$ ec2-run-instances -k hspencer-appscale ami-2e4bf22f --region ap-northeast-1

After the instance is up and running, I scp’d my EC2_PRIVATE_KEY and EC2_CERT to the instance using the keypair created (hspencer-appscale.pem).  The instance already had the latest  version of ec2-api-tools and ec2-ami-tools as part of the installation of AppScale.  Similar to the instructions provided by AWS for creating an instance-store backed AMI from an existing AMI, I  used ec2-bundle-vol to bundle a new image and used /mnt/ (which is ephemeral storage) to store the manifest information.

root@ip-10-156-123-126:~# ec2-bundle-vol -u 9xxxxxxx3 -k pk-XXXXXXXXXXXXXXXX.pem -c cert-XXXXXXXXXXXXXXXXX.pem -d /mnt/ -e /mnt/
Please specify a value for arch [x86_64]: x86_64
Copying / into the image file /mnt/image...
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00990555 s, 106 MB/s
mke2fs 1.41.11 (14-Mar-2010)
Bundling image file...
Splitting /mnt/image.tar.gz.enc...
Created image.part.000

Next, I need to update the manifest to use us-west-1 as the region to store the image, instead of ap-northeast-1.  To do this, I used ec2-migrate-manifest.  *NOTE* This tool can only be used in the following regions: EU, US, us-gov-west-1, us-west-1, us-west-2, ap-southeast-1, ap-southeast-2, ap-northeast-1, sa-east-1.

root@ip-10-156-123-126:~# ec2-migrate-manifest -m /mnt/image.manifest.xml -c cert-XXXXXXXXX.pem -k pk-XXXXXXXXXXXX.pem -a XXXXXXXXXX -s XXXXXXXXX --region us-west-1
Backing up manifest...
warning: peer certificate won't be verified in this SSL session
warning: peer certificate won't be verified in this SSL session
warning: peer certificate won't be verified in this SSL session
warning: peer certificate won't be verified in this SSL session
Successfully migrated /mnt/image.manifest.xml
It is now suitable for use in us-west-1.

Time to upload the bundle to S3 using ec2-upload-bundle:

root@ip-10-156-123-126:~# ec2-upload-bundle -b appscale-lite-1.6.3-testing -m /mnt/image.manifest.xml -a XXXXXXXXXX -s XXXXXXXXXX --location us-west-1
You are bundling in one region, but uploading to another. If the kernel
or ramdisk associated with this AMI are not in the target region, AMI
registration will fail.
You can use the ec2-migrate-manifest tool to update your manifest file
with a kernel and ramdisk that exist in the target region.
Are you sure you want to continue? [y/N]y
Creating bucket...
Uploading bundled image parts to the S3 bucket appscale-lite-1.6.3-testing ...
Uploaded image.part.000
Uploaded image.part.001
Uploaded image.part.002
Uploaded image.part.003

After the image has been uploaded successfully, all that is left to do is register the image.

root@ip-10-156-123-126:~# export JAVA_HOME=/usr
root@ip-10-156-123-126:~# ec2-register -K pk-XXXXXXXXXXXX.pem -C cert-XXXXXXXXXX.pem --region us-west-1 appscale-lite-1.6.3-testing/image.manifest.xml --name appscale1.6.3-testing
IMAGE ami-705d7c35
$ ec2-describe-images ami-705d7c35 --region us-west-1
IMAGE ami-705d7c35 986451091583/appscale1.6.3-testing 986451091583 available private x86_64 machine aki-9ba0f1de instance-store paravirtual xen
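
To recap the AWS side, the four steps (bundle, migrate the manifest, upload, register) can be written down as one ordered list of commands.  The sketch below only builds and prints the command lines and does not run anything.  migration_commands is a hypothetical helper, the flags are the same ones used in the transcripts above, and every credential value is a placeholder.

```python
import shlex


def migration_commands(account_id, private_key, cert, access_key, secret_key,
                       bucket, region, name, work_dir="/mnt/"):
    """Return the four AWS-side commands from this post as argument lists."""
    manifest = work_dir + "image.manifest.xml"
    return [
        # 1. Bundle the running instance's volume onto ephemeral storage.
        ["ec2-bundle-vol", "-u", account_id, "-k", private_key, "-c", cert,
         "-d", work_dir, "-e", work_dir],
        # 2. Rewrite the manifest for the target region.
        ["ec2-migrate-manifest", "-m", manifest, "-c", cert, "-k", private_key,
         "-a", access_key, "-s", secret_key, "--region", region],
        # 3. Upload the bundle parts to an S3 bucket in the target region.
        ["ec2-upload-bundle", "-b", bucket, "-m", manifest,
         "-a", access_key, "-s", secret_key, "--location", region],
        # 4. Register the uploaded manifest as an instance-store AMI.
        ["ec2-register", "-K", private_key, "-C", cert, "--region", region,
         bucket + "/image.manifest.xml", "--name", name],
    ]


commands = migration_commands(
    account_id="XXXXXXXXXXXX",      # placeholder
    private_key="pk-XXXX.pem",      # placeholder
    cert="cert-XXXX.pem",           # placeholder
    access_key="XXXXXXXXXX",        # placeholder
    secret_key="XXXXXXXXXX",        # placeholder
    bucket="appscale-lite-1.6.3-testing",
    region="us-west-1",
    name="appscale1.6.3-testing",
)
for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```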

Work in Eucalyptus…

Now that we have the image registered, we can use ec2-download-bundle and ec2-unbundle to get the machine image to an instance running on Eucalyptus, so that we can bundle, upload and register the image to Eucalyptus.

To start off, I followed the instructions for setting up my command-line environment, and Eucalyptus IAM policies on Eucalyptus – similar to what was done for AWS.

Next, I downloaded the lucid-server-cloudimg-amd64.tar.gz file from the Ubuntu Cloud Images (Lucid) site.  After that, I bundled, uploaded and registered the following images:

  • lucid-server-cloudimg-amd64-loader (ramdisk)
  • lucid-server-cloudimg-amd64-vmlinuz-virtual (kernel)
  • lucid-server-cloudimg-amd64.img (root image)

After bundling, uploading and registering those images, I created a keypair, and SSH access for the instance that is launched within the default security group:

euca-add-keypair hspencer-euca > hspencer-euca.pem
euca-authorize -P tcp -p 22 -s default

Now, I run the EMI for the Lucid image that was registered:

euca-run-instances -k hspencer-euca --user-data-file cloud-init.config -t m1.large emi-29433329

I used the m1.large VM type so that I can use the ephemeral storage to hold the image that I will pull from AWS.

Once the instance is running, I scp’d my EC2_PRIVATE_KEY and EC2_CERT to the instance using the keypair created (hspencer-euca.pem).  After installing the ec2-ami-tools on the instance, I used ec2-download-bundle to download the bundle to /media/ephemeral0, and ec2-unbundle the image:

# ec2-download-bundle -b appscale-lite-1.6.3-testing -d /media/ephemeral0/ -a XXXXXXXXXXX -s XXXXXXXXXXXX -k pk-XXXXXXXXX.pem --url
# ec2-unbundle -m /media/ephemeral0/image.manifest.xml -s /media/ephemeral0/ -d /media/ephemeral0/ -k pk-XXXXXXXXXX.pem

Now that I have the root image from AWS, I just need to bundle, upload and register the root image to Eucalyptus.  To do so, I scp’d my Eucalyptus user credentials to the instance.  After copying the Eucalyptus credentials to the instance, I ssh’ed into the instance and sourced the Eucalyptus credentials.

Since I have already bundled the kernel and ramdisk for the Ubuntu Cloud Lucid image before, I just need to bundle, upload and register the image I unbundled from AWS.  To do so, I did the following:

euca-bundle-image -i image  
euca-upload-bundle -b appscale-1.6.3-x86_64 -m /tmp/image.manifest.xml
euca-register -a x86_64 appscale-1.6.3-x86_64/image.manifest.xml

Now the image is ready to be launched on Eucalyptus.


As demonstrated above, the AWS fidelity that Eucalyptus provides makes it possible to set up hybrid cloud environments with Eucalyptus and AWS that can be leveraged by applications like AppScale.

Other examples of AMI to EMI conversions can be found here:



