Big Data on the Cloud using Ansible, RHadoop, AppScale, and AWS/Eucalyptus


Big Data has been a hot topic over the last few years.  Big Data on public clouds, such as AWS’s Elastic MapReduce, has been gaining even more popularity as cloud computing becomes more of an industry standard.

R is an open source project for statistical computing and graphics.  It has been growing in popularity at universities and companies for linear and nonlinear modeling, classical statistical tests, time-series analysis, and other tasks.

R Project

RHadoop was developed by Revolution Analytics to interface with Hadoop.  Revolution Analytics builds analytic software solutions using R.

Revolution Analytics

AppScale is an open source PaaS that implements the Google AppEngine API on IaaS environments.  One of the Google AppEngine APIs that it implements is AppEngine MapReduce.  AppScale’s back-end support for this API uses Cloudera’s Distribution for Apache Hadoop.

AppScale Inc.

Ansible is open source orchestration software that uses SSH to handle configuration management for physical and virtual machines, including machines running in the cloud.

Ansible Works

Amazon Web Services is a public IaaS that provides infrastructure and application services in the cloud.  Eucalyptus is an open source software solution that provides the AWS APIs for EC2, S3, and IAM for on-premise cloud environments.

Amazon AWS EC2
Eucalyptus Systems Inc.

This blog entry will cover how to deploy AppScale (on either AWS or Eucalyptus), then use Ansible to configure each AppScale node with R and the RHadoop packages, in order to allow programs written in R to utilize MapReduce in the cloud.


To get started, the following is needed on a desktop/laptop computer:

*NOTE:  These variables are used by AppScale Tools version 1.6.9.  Check the AWS and Eucalyptus documentation regarding obtaining user credentials. 



After installing AppScale Tools and Ansible, the AppScale cluster needs to be deployed.  After defining the AWS/Eucalyptus variables, initialize the AppScale cluster configuration file, AppScalefile:

$ ./appscale-tools/bin/appscale init cloud

Edit the AppScalefile, providing information for the keypair, security group, and AppScale AMI/EMI.  The keypair and security group do not need to be pre-created. AppScale will handle this.  The AppScale AMI on AWS (us-east-1) is ami-4e472227.  The Eucalyptus EMI will be unique based upon the Eucalyptus cloud that is being used.  In this example, the AWS AppScale AMI will be used, and the AppScale cluster size will be 3 nodes.  Here is the example AppScalefile:

group : 'appscale-rmr'
infrastructure : 'ec2'
instance_type : 'm1.large'
keyname : 'appscale-rmr'
machine : 'ami-4e472227'
max : 3
min : 3
table : 'hypertable'
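
Before starting the cluster, it can help to sanity-check the AppScalefile.  The sketch below is not part of AppScale Tools; it is a minimal Python check that assumes the flat key : 'value' layout of the example file.  The function names parse_appscalefile and check_appscalefile are made up for illustration, and the list of expected keys is simply the set of keys in the example, not an official schema.

```python
def parse_appscalefile(text):
    """Parse flat `key : value` lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a key : value line: %r" % line)
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# Keys taken from the example AppScalefile in this post; this is an
# assumption, not a schema published by AppScale.
EXPECTED_KEYS = ("group", "infrastructure", "instance_type", "keyname",
                 "machine", "max", "min", "table")


def check_appscalefile(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = ["missing key: " + k for k in EXPECTED_KEYS if k not in settings]
    if "min" in settings and "max" in settings:
        try:
            if int(settings["min"]) > int(settings["max"]):
                problems.append("min is larger than max")
        except ValueError:
            problems.append("min and max must be integers")
    return problems


EXAMPLE = """\
group : 'appscale-rmr'
infrastructure : 'ec2'
instance_type : 'm1.large'
keyname : 'appscale-rmr'
machine : 'ami-4e472227'
max : 3
min : 3
table : 'hypertable'
"""

print(check_appscalefile(parse_appscalefile(EXAMPLE)))  # prints []
```

An empty list means none of these simple checks failed; it does not guarantee that AppScale itself will accept the file.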

After editing the AppScalefile, start up the AppScale cluster by running the following command:

$ ./appscale-tools/bin/appscale up

Once the cluster finishes setting up, the status of the cluster can be seen by running the command below:

$ ./appscale-tools/bin/appscale status

R, RHadoop Installation Using Ansible

Now that the cluster is up and running, grab the Ansible playbook for installing R and the RHadoop rmr2 and rhdfs packages onto the AppScale nodes.  The playbook can be downloaded from GitHub using git:

$ git clone

After downloading the playbook, the ansible-r-appscale-playbook/production file needs to be populated with the information of the AppScale cluster.  Grab the cluster  node information by running the following command:

$ ./appscale-tools/bin/appscale status | grep amazon | grep Status | awk '{print $5}' | cut -d ":" -f 1
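
The same filtering can be written in Python, which may be easier to extend later.  This is only a sketch that mirrors the shell pipeline step by step; it assumes nothing about the appscale status output beyond what the pipeline itself assumes (the fifth whitespace-separated field of each matching line holds host:port).  The function name extract_hosts is hypothetical.

```python
def extract_hosts(status_text):
    """Mirror: grep amazon | grep Status | awk '{print $5}' | cut -d ":" -f 1"""
    hosts = []
    for line in status_text.splitlines():
        # grep amazon | grep Status: keep lines containing both substrings.
        if "amazon" not in line or "Status" not in line:
            continue
        fields = line.split()
        # awk prints an empty string when field 5 is missing; mirror that.
        fifth = fields[4] if len(fields) >= 5 else ""
        # cut -d ":" -f 1: keep everything before the first colon.
        hosts.append(fifth.split(":", 1)[0])
    return hosts
```

To use it, capture the output of the status command as text and pass that text to extract_hosts; the returned list contains one DNS name per matching line.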

Add those DNS entries to the ansible-r-appscale-playbook/production file.  After editing, the file will look like the following:


Now the playbook can be executed.  The playbook requires the SSH private key to the nodes.  This key will be located under the ~/.appscale folder.  In this example, the key file is named appscale-rmr.key.  To execute the playbook, run the following command:

$ ansible-playbook -i r-appscale-deployment/production \
    --private-key=~/.appscale/appscale-rmr.key -v r-appscale-deployment/site.yml

Testing Out The Deployment – Wordcount.R

Once the playbook has finished running, the AppScale cluster is now ready to be used.  To test out the setup, SSH into the head node of the AppScale cluster.  To find out the head node of the cluster, execute the following command:

$ ./appscale-tools/bin/appscale status

After discovering the head node, SSH into the head node using the private key located in the ~/.appscale directory:

$ ssh -i ~/.appscale/appscale-rmr.key

To test out the R setup on all the nodes, grab the wordcount.R program:

root@appscale-image0:~# tar zxf rmr2_2.0.2.tar.gz rmr2/tests/wordcount.R

The wordcount.R file contains the following lines:

rmr2:::hdfs.put("/etc/passwd", "/tmp/wordcount-test")
out.hadoop = from.dfs(wordcount("/tmp/wordcount-test", pattern = " +"))

When the wordcount.R program is executed, it will grab the /etc/passwd file from the head node, copy it to HDFS, then run wordcount on that copy, splitting on the pattern " +".  NOTE: wordcount.R can be edited to use any file and pattern desired.
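
To make the map and reduce steps concrete, here is a local, single-process Python sketch of the same word count idea: split each line on the pattern, emit a count of 1 per token, then sum the counts per token.  It only illustrates the logic; it is not the rmr2 implementation and does not touch Hadoop or HDFS.  The function names are invented for this example.

```python
import re
from collections import defaultdict


def map_line(line, pattern=" +"):
    """Map step: split one line on the pattern and emit (word, 1) pairs."""
    return [(word, 1) for word in re.split(pattern, line) if word]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def wordcount(lines, pattern=" +"):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line.rstrip("\n"), pattern))
    return reduce_counts(pairs)


print(wordcount(["a b  a", "b c"]))  # prints {'a': 2, 'b': 2, 'c': 1}
```

Because the fields in /etc/passwd are separated by colons rather than spaces, splitting on " +" leaves most lines as a single token, so a different input file or pattern gives a more interesting count.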

Run wordcount.R:

root@appscale-image0:~# R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> source('rmr2/tests/wordcount.R')
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
13/04/05 02:33:41 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:33:43 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
packageJobJar: [/tmp/RtmprcYtsu/rmr-local-env19811a7afd54, /tmp/RtmprcYtsu/rmr-global-env1981646cf288, /tmp/RtmprcYtsu/rmr-streaming-map198150b6ff60, /tmp/RtmprcYtsu/rmr-streaming-reduce198177b3496f, /tmp/RtmprcYtsu/rmr-streaming-combine19813f7ea210, /var/appscale/hadoop/hadoop-unjar5632722635192578728/] [] /tmp/streamjob8198423737782283790.jar tmpDir=null
13/04/05 02:33:44 WARN snappy.LoadSnappy: Snappy native library is available
13/04/05 02:33:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/05 02:33:44 INFO snappy.LoadSnappy: Snappy native library loaded
13/04/05 02:33:44 INFO mapred.FileInputFormat: Total input paths to process : 1
13/04/05 02:33:44 INFO streaming.StreamJob: getLocalDirs(): [/var/appscale/hadoop/mapred/local]
13/04/05 02:33:44 INFO streaming.StreamJob: Running job: job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: To kill this job, run:
13/04/05 02:33:44 INFO streaming.StreamJob: /root/appscale/AppDB/hadoop-0.20.2-cdh3u3/bin/hadoop job  -Dmapred.job.tracker= -kill job_201304042111_0015
13/04/05 02:33:44 INFO streaming.StreamJob: Tracking URL: http://appscale-image0:50030/jobdetails.jsp?jobid=job_201304042111_0015
13/04/05 02:33:45 INFO streaming.StreamJob:  map 0%  reduce 0%
13/04/05 02:33:51 INFO streaming.StreamJob:  map 50%  reduce 0%
13/04/05 02:33:52 INFO streaming.StreamJob:  map 100%  reduce 0%
13/04/05 02:33:59 INFO streaming.StreamJob:  map 100%  reduce 33%
13/04/05 02:34:02 INFO streaming.StreamJob:  map 100%  reduce 100%
13/04/05 02:34:04 INFO streaming.StreamJob: Job complete: job_201304042111_0015
13/04/05 02:34:04 INFO streaming.StreamJob: Output: /tmp/RtmprcYtsu/file1981524ee1a3
13/04/05 02:34:05 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:07 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:08 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
13/04/05 02:34:10 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Deleted hdfs://

That's it!  The AppScale cluster is ready for additional R programs that utilize MapReduce.  Enjoy the world of Big Data on public/private IaaS.


Big Data Movement in Boston


Boston + Big Data == De Facto Hub of Big Data

Originally posted on Gigaom:

The push to make Boston the de facto hub of big data will continue Thursday with the gala launch of the hack/reduce space in Cambridge, Massachusetts.

The historic space — in the old brick Kendall Boiler and Tank building — will be home to big data practitioners from academia, private and public industry. Hack/reduce, which sports the motto “code big or go home”, aims to bring top talent together to train the next-generation of sorely needed data scientists, Chris Lynch,  co-founder of hack/reduce told me.

The facility can accommodate 150 dedicated hackers and is fielding 50 applications per week for spots. The first residents are Sqrrl, a big data startup launched by former National Security Agency technologists. “These 7 young men out of NSA spent 5 years building a big data store in Washington and now we have it in Boston,” said Lynch, who co-founded Vertica (s hpq).

Lynch said he was inspired to…

