Big Data on the Cloud using Ansible, RHadoop, AppScale, and AWS/Eucalyptus


Big Data has been a hot topic over the last few years.  Big Data on public clouds, such as AWS’s Elastic MapReduce, has been gaining even more popularity as cloud computing becomes more of an industry standard.

R  is an open source project for statistical computing and graphics.  It has been growing in popularity for doing  linear and nonlinear modeling, classical statistical tests, time-series analysis and others, at various Universities and companies.

R Project
R Project

RHadoop was developed by Revolution Analytics to interface with Hadoop.  Revolution Analytics builds analytic software solutions using R.

Revolution Analytics
Revolution Analytics

AppScale is an open source PaaS that implements the Google AppEngine API on IaaS environments.  One of the Google AppEngine APIs that is implemented is AppEngine MapReduce.   The back-end support for this API that AppScale using Cloudera’s Distribution for Apache Hadoop.

AppScale Inc.
AppScale Inc.

Ansible is an open source orchestration software that utilizes SSH for handling configuration management for physical/virtual machines, and machines running in the cloud.

Ansible Works
Ansible Works

Amazon Web Services is a public IaaS that provides infrastructure and application services in the cloud.  Eucalyptus is an open source software solution that provides the AWS APIs for EC2, S3, and IAM for on-premise cloud environments.

Amazon AWS EC2
Amazon AWS EC2
Eucalyptus Systems Inc.
Eucalyptus Systems Inc.

This blog entry will cover how to deploy AppScale (either on AWS or Eucalyptus), then use Ansible to configure each AppScale node with R, and the RHadoop packages in order allow programs written in R to utilize MapReduce in the cloud.


To get started, the following is needed on a desktop/laptop computer:

*NOTE:  These variables are used by AppScale Tools version 1.6.9.  Check the AWS and Eucalyptus documentation regarding obtaining user credentials. 



After installing AppScale Tools and Ansible, the AppScale cluster needs to be deployed.  After defining the AWS/Eucalyptus variables,  initialize the creation of the AppScale cluster configuration file – AppScalefile.

$ ./appscale-tools/bin/appscale init cloud

Edit the AppScalefile, providing information for the keypair, security group, and AppScale AMI/EMI.  The keypair and security group do not need to be pre-created. AppScale will handle this.  The AppScale AMI on AWS (us-east-1) is ami-4e472227.  The Eucalyptus EMI will be unique based upon the Eucalyptus cloud that is being used.  In this example, the AWS AppScale AMI will be used, and the AppScale cluster size will be 3 nodes.  Here is the example AppScalefile:

group : 'appscale-rmr'
infrastructure : 'ec2'
instance_type : 'm1.large'
keyname : 'appscale-rmr'
machine : 'ami-4e472227'
max : 3
min : 3
table : 'hypertable'

After editing the AppScalefile, start up the AppScale cluster by running the following command:

$ ./appscale-tools/bin/appscale up

Once the cluster finishes setting up, the status of the cluster can be seen by running the command below:

$ ./appscale-tools/bin/appscale status

R, RHadoop Installation Using Ansible

Now that the cluster is up and running, grab the Ansible playbook for installing R, and RHadoop rmr2 and rhdfs packages onto the AppScale nodes.  The playbook can be downloaded from github using git:

$ git clone

After downloading the playbook, the ansible-r-appscale-playbook/production file needs to be populated with the information of the AppScale cluster.  Grab the cluster  node information by running the following command:

$ ./appscale-tools/bin/appscale status | grep amazon | grep Status | awk '{print $5}' | cut -d ":" -f 1

Add those DNS entries to the ansible-r-appscale-playbook/production file.  After editing, the file will look like the following:


Now the playbook can be executed.  The playbook requires the SSH private key to the nodes.  This key will be located under the ~/.appscale folder.  In this example, the key file is named appscale-rmr.key.  To execute the playbook, run the following command:

$ ansible-playbook -i r-appscale-deployment/production 
--private-key=~/.appscale/appscale-rmr.key -v r-appscale-deployment/site.yml

Testing Out The Deployment – Wordcount.R

Once the playbook has finished running, the AppScale cluster is now ready to be used.  To test out the setup, SSH into the head node of the AppScale cluster.  To find out the head node of the cluster, execute the following command:

$ ./appscale-tools/bin/appscale status

After discovering the head node, SSH into the head node using the private key located in the ~/.appscale directory:

$ ssh -i ~/.appscale/appscale-rmr.key

To test out the R setup on all the nodes, grab the wordcount.R program:

root@appscale-image0:~# tar zxf rmr2_2.0.2.tar.gz rmr2/tests/wordcount.R

In the wordcount.R file, the following lines are present

rmr2:::hdfs.put("/etc/passwd", "/tmp/wordcount-test")
out.hadoop = from.dfs(wordcount("/tmp/wordcount-test", pattern = " +"))

When the wordcount.R program is executed, it will grab the /etc/password file from the head node, copy it to the hdfs filesystem, then run wordcount on /etc/password to look for the pattern ” +”.   NOTE: wordcount.R can be edited to use any file and pattern desired.

Run wordcount.R:

root@appscale-image0:~# R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> source('rmr2/tests/wordcount.R')
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
13/04/05 02:33:41 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:33:43 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
packageJobJar: [/tmp/RtmprcYtsu/rmr-local-env19811a7afd54, /tmp/RtmprcYtsu/rmr-global-env1981646cf288, /tmp/RtmprcYtsu/rmr-streaming-map198150b6ff60, /tmp/RtmprcYtsu/rmr-streaming-reduce198177b3496f, /tmp/RtmprcYtsu/rmr-streaming-combine19813f7ea210, /var/appscale/hadoop/hadoop-unjar5632722635192578728/] [] /tmp/streamjob8198423737782283790.jar tmpDir=null
13/04/05 02:33:44 WARN snappy.LoadSnappy: Snappy native library is available
13/04/05 02:33:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/05 02:33:44 INFO snappy.LoadSnappy: Snappy native library loaded
13/04/05 02:33:44 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/05 02:33:44 INFO streaming.StreamJob: getLocalDirs(): [/var/appscale/hadoop/mapred/local]
13/04/05 02:33:44 INFO streaming.StreamJob: Running job: job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: To kill this job, run:
13/04/05 02:33:44 INFO streaming.StreamJob: /root/appscale/AppDB/hadoop-0.20.2-cdh3u3/bin/hadoop job  -Dmapred.job.tracker= -kill job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: Tracking URL: http://appscale-image0:50030/jobdetails.jsp?jobid=job_201304042111_0015
13/04/05 02:33:45 INFO streaming.StreamJob:  map 0%  reduce 0%
13/04/05 02:33:51 INFO streaming.StreamJob:  map 50%  reduce 0%
13/04/05 02:33:52 INFO streaming.StreamJob:  map 100%  reduce 0%
13/04/05 02:33:59 INFO streaming.StreamJob:  map 100%  reduce 33%
13/04/05 02:34:02 INFO streaming.StreamJob:  map 100%  reduce 100%
13/04/05 02:34:04 INFO streaming.StreamJob: Job complete: job_201304042111_0015
13/04/05 02:34:04 INFO streaming.StreamJob: Output: /tmp/RtmprcYtsu/file1981524ee1a3
13/04/05 02:34:05 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:07 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:08 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:10 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Deleted hdfs://

Thats it!  The AppScale cluster is ready for additional R programs that utilize MapReduce.   Enjoy the world of Big Data on public/private IaaS.

Big Data on the Cloud using Ansible, RHadoop, AppScale, and AWS/Eucalyptus

What’s new in Ansible 1.1 for AWS and Eucalyptus users?


Ansible providing more AWS/Eucalyptus support!

Originally posted on Take that to the bank and cash it!:

I thought the Ansible 1.0 development cycle was busy but 1.1 is crammed full of orchestration goodness.  On Tuesday, 1.1 was released and you can read more about it here:

For those working on AWS and Eucalyptus, 1.1 brings some nice module improvements as well as a new cloudformation and s3 module.  It’s great to see the AWS-related modules becoming so popular so quickly.  Here are some more details about the changes but you can find info in the changelog here:

Security group ID support

It’s now possible to specify the security group by its ID.  This is quite typically behaviour in EC2 and Eucalyptus will support this with the pending 3.3 release.  The parameter is optional.

VPC subnet ID

VPC users can now specify a subnet ID associated with their instance.

Instance state wait timeout

In 1.0 there was no way to specify how long to wait…

View original 300 more words

What’s new in Ansible 1.1 for AWS and Eucalyptus users?

Using Ansible to Deploy Neo4j HA Cluster on AWS/Eucalyptus

As a follow-up to my last Neo4j, AWS/Eucalyptus blog, this entry demonstrates another great example of AWS/Eucalyptus fidelity by using Ansible to deploy a Neo4j High Available cluster.

Neo4j - Graph Database
Neo4j – Graph Database
Amazon AWS EC2
Amazon AWS EC2
Eucalyptus Systems Inc.
Eucalyptus Systems Inc.


In order to use this Ansible playbook on AWS/Eucalyptus, the following is needed:

Before deploying the cluster, a security group needs to be created that the cluster will use.  The security group must allow the following:

  • port 22 (SSH)
  • all instances part of the security group allowed to community with each other (ports 0 – 65535)

To create the security group and authorize the ports, make sure the user’s access key, secret access key, and EC2 URL are noted, and do the following:

  1. Create the security group

    ec2-create-group --aws-access-key <EC2_ACCESS_KEY> 
    --aws-secret-key <EC2_SECRET_KEY> 
    --url <EC2_URL> -g neo4j-cluster -d "Neo4j HA Cluster"

  2. Authorize port for SSH in neo4j-cluster security group

    --aws-access-key <EC2_ACCESS_KEY> 
    --aws-secret-key <EC2_SECRET_KEY> 
    --url <EC2_URL> -P tcp -p 22 -s neo4j-cluster

  3. Authorize all port communication between cluster members 

    --aws-access-key <EC2_ACCESS_KEY> --aws-secret-key <EC2_SECRET_KEY> 
    --url <EC2_URL> -P tcp -o neo4j-cluster -p -1 neo4j-cluster

After completing these steps, use


to view the security group:

ec2-describe-group --aws-access-key <EC2_ACCESS_KEY> 
--aws-secret-key <EC2_SECRET_KEY> --url <EC2_URL> neo4j-cluster

GROUP sg-1cbc5777 986451091583 neo4j-cluster Neo4j HA Cluster
PERMISSION 986451091583 neo4j-cluster ALLOWS tcp 0 65535 FROM 
USER 986451091583 NAME neo4j-cluster ID sg-1cbc5777 ingress
PERMISSION 986451091583 neo4j-cluster 
ALLOWS tcp 22 22 FROM CIDR ingress

Neo4j HA Cluster Deployment

Once the security group is created with the correct ports authorized, the cluster can be deployed.  To deploy the cluster, do the following:

  1. Obtain Ansible from git and setup the environment by following the instructions mentioned here –
  2. Obtain the Ansible Playbook for Neo4j HA Cluster using git

    git clone

  3. Change directory into ansible-neo4j-cluster  

    cd ansible-neo4j-cluster

  4. Set up /etc/ansible/hosts with the following information:
  5. Populate vars/ec2-config with either Eucalyptus/AWS information. vars/ec2-config contains the following variables:
    keypair: <EC2/Eucalyptus Keypair>
    ec2_access_key: <EC2_ACCESS_KEY>
    ec2_secret_key: <EC2_SECRET_KEY>
    ec2_url: <EC2_URL>
    instance_type: m1.small
    security_group: <AWS/Eucalyptus Security Group>
    image: <AMI/EMI>

    Execute the following command:

    ansible-playbook neo4j-cluster.yml \
     --private-key=<AWS/Eucalyptus Private Key file> --extra-vars "node_count=3"
  7. After the playbook finishes, there will be an URL provided to access the cluster – similar to the example below:
    TASK: [Display HAProxy URL] *********************
    changed: [] => {"changed": true, "cmd": 
    "echo \"HAProxy URL for Neo4j -\" ",
     "delta": "0:00:00.006835", "end": "2013-03-30 19:54:31.104320", 
    "rc": 0, "start": "2013-03-30 19:54:31.097485", "stderr": "", 
    "HAProxy URL for Neo4j -"}

    To view the status of cluster in the browser, open up

  8. To get the status of the cluster, use curl:
    curl -H "Content-Type:application/json" -d '["org.neo4j:*"]'

Thats it!  A Neo4j HA cluster with an HA Proxy server serving as an endpoint is available to be used.   If a bigger cluster is desired, just change the


value.   For additional information regarding this playbook, and how it handles the cluster membership, please refer to the following URL –

Hope you enjoy!  As always, questions/comments/suggestions are always welcome.

Using Ansible to Deploy Neo4j HA Cluster on AWS/Eucalyptus

Ansible Playbook for Eucalyptus Deployment == Next Level Configuration Management for the Cloud


Ansible playbook for Eucalyptus…NICE! Great work Lester!

Originally posted on Take that to the bank and cash it!:

The first cut of the Ansible deployment playbook for deploying Eucalyptus private clouds is ready.  I’ve merged the first “release” into the master branch here: Feedback and contributions are very welcome, please file issues against the project.

This playbook allows a user to deploy a single front-end cloud (i.e. all component on a single system) and as many NC’s as they want.  Although admittedly I’ve only tested with one so far.  I’ve followed, to a certain degree, the best practices described here:

Overall I’m pretty happy with it, there are some areas which I’d like to re-write and improve but the groundwork is there.  It was all very fluid to start with and doing something multi-tier has been enjoyable. I’ve also learnt what its like to automate a deployment of Eucalyptus and there are probably a number of things we should improve to make it easier in this…

View original 614 more words

Ansible Playbook for Eucalyptus Deployment == Next Level Configuration Management for the Cloud

Ansible Module for EC2


Nice write-up on first time experience writing an Ansible module. Good stuff!

Originally posted on Take that to the bank and cash it!:

As is probably quite evident, I’ve recently been using Ansible to deploy workloads into EC2 and Eucalyptus.  One of the ideas behind this is the convenience of being able to leverage the common API to achieve a hybrid deployment scenario.  Thanks to various folk (names mentioned in previous posts) we have a solid ec2 boto-based Python module for instance launching.   One thing I wanted to do when spinning up instances and configuring a workload was to add some persistent storage for the application.  To do this I had to create and attach a volume as a manual step or run a local_action against euca2ools.  I figured I could try to write my own module to practice a bit more python (specifically boto).  The result is something terribly (perhaps in more ways that one?) simple but I think this is a testament to just how easy it is to write…

View original 253 more words

Ansible Module for EC2

Second part of Ansible work with AWS and Eucalyptus


More Ansible Love with AWS and Eucalyptus!

Originally posted on Take that to the bank and cash it!:

My previous post talked a little bit about new functionality from (new and updated) ec2-related modules in the Ansible 1.0 release.  In this post I’ll go through the practical example of launching multiple instances with the ec2 module and then configuring them with a workload.  In this case the workload will be the Eucalyptus User Console :)

For those who are unfamiliar with ansible, check out the online documentation here. The documentation itself is in a pretty good order for newcomers, so make sure to take it step by step to get the most out of it.  Once you’ve finished you’ll then come to realise how flexible it is but hopefully also how *easy* it is.

To launch my instances and then configure them, I’m going to use a playbook.  For want of a better explanation a playbook is a yaml formatted file containing a series of tasks which…

View original 1,672 more words

Second part of Ansible work with AWS and Eucalyptus