Exploring R
Evolution of R
Programming Features of R
R for Machine Learning
R for Data Analysis
Application of R
R vs. Python vs. SAS
R vs. Excel vs. Tableau
Install R base on Windows
Install RStudio on Windows
Install R base on Ubuntu
Install RStudio on Ubuntu
R Starter
First R Program
Working with R Packages
R Workplace and R Sessions
Manage working directory
Customize RStudio
RStudio Debugger
RStudio History and Environment variables
R Syntax
R Variables
R Data Types & Structures
R Arithmetic Operators
R Logical Operators
R If Statement
R - If…Else Statement
If…else if…Else Statement
R for loop
R while loop
R repeat loop
R String Construction
R String Manipulation Functions
Creating Character Strings
R Functions
R built-in functions
Working with Vector
R Vector Indexing
R Vector Modification
R Arithmetic Vector Operations
R Lists
Access List elements (List Slicing)
List modification
R Matrix construction
Access Matrix elements
R Matrix Modification
R Matrix Operations
R Array Construction
Accessing Array Elements
Manipulating Array Elements
R Data Frames
Data Extraction
Data Frame Expansion
R Built-in Data frames
R Factors
Manage Factor levels
Factor Functions
R Contingency Tables
R Data Visualization
R – Charts and Graphs
R Density Plot
R Strip Charts
R Boxplots
R Violin Plots
R Bar Charts
R Pie Charts
R Area Plots
R Time Series
Graphics with ggplot2
ggplot2 Structure
ggplot2 Bar Charts
ggplot2 Pie Chart
ggplot2 Area Plot
ggplot2 Histogram
ggplot2 Scatter Plot
ggplot2 Box Plot
Mean & Median
Standard Deviation
Normal Distribution
Correlation
T-Tests
Chi-Square Test
ANOVA Test
Survival Analysis
Data Pre-processing and Missing Value Analysis
Missing data treatment
Missing value analysis with mice package
Outlier Analysis
Problems with outliers
Outlier Detection
Outlier Treatment
Simple Linear Regression
Mathematical Computation
Linear Regression in R
A complete Simple Regression Analysis
Multiple Linear Regression
Mathematical Analysis
Model Interpretation
A complete Multiple Regression Analysis
Logistic Regression
Mathematical Computation in R
Logistic Regression in R
Heart Risk Analysis using LR
Support Vector Machine
Heart Risk Analysis using SVM
Decision Trees
Random Forest
K means Clustering
Big data Analytics using R-Hadoop
RHADOOP Packages:
rJava: Low-Level R to Java Interface
rhdfs: Integrate R with HDFS
rmr2: MapReduce job in R
plyrmr: Data Manipulation with MapReduce job
rhbase: Integrate HBase with R
Environment setup for RHADOOP
Getting Started with RHADOOP
After successfully installing all the required RHADOOP packages, we set the environment variables, and the system is then ready for its first code execution. To test the RHADOOP setup, we will run code that generates the squares of the values 1 to 10. In the first section, I explain how the code is organized, and in the second, how it executes and what result it produces.
First Code Analysis:
In this section, I explain the first RHADOOP code, which helps us test our integration. The example uses HDFS and MapReduce functionality. It also uses time functions to measure the execution time, which helps us gauge the performance of the code.
I started by loading the rmr2 and rhdfs libraries and initializing rhdfs with the hdfs.init() function. I then assigned a sequence of values (1 to 10) to a sample variable. The next lines set up the MapReduce job.
These lines are all we need to write our first MapReduce job in rmr. The first line puts the data into HDFS, where the bulk of the data has to reside for MapReduce to run on it. Because to.dfs copies an object from R's memory into HDFS, it is not a scalable way to load truly large data sets, but it is very useful for learning, writing test cases, and debugging. to.dfs can put the data in a file of our own choosing, but if we don't specify one, it creates temporary files and cleans them up when done. The return value is something we call a big data object. We may assign it to variables, pass it to other rmr functions or MapReduce jobs, or read it back in. It is a stub, which means the data itself is not held in memory, only some information that helps to find and manage the data. This way, we can refer to very large data sets whose size exceeds memory limits.
Now onto the second line, which calls mapreduce with a map function. We prefer named arguments with mapreduce because there are quite a few possible arguments, but this is not mandatory. The input is the variable sints, which holds the output of to.dfs, that is, a stub for our small data set as stored in HDFS. The input could also be a file path or a list containing a combination of both. The map function (as opposed to the reduce function) is a regular R function with a few constraints:
1. It takes two arguments: a collection of keys and a collection of values.
2. It returns key-value pairs using the function keyval, which can have vectors, lists, matrices, or data.frames as arguments; we can also return NULL. We can avoid calling keyval explicitly, but the return value x will be converted with a call to keyval(NULL,x).
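The two constraints above can be sketched with a plain R function. This is a minimal illustration, not the exact code discussed in this section: the name square_map is hypothetical, and the function is only checked locally on an ordinary vector, so rmr2 is not needed to run it.

```r
# A map function is an ordinary R function of two arguments:
# k (the keys) and v (the values) of the current chunk of input.
# The name square_map is hypothetical, chosen for this illustration.
square_map <- function(k, v) {
  # Return a matrix without calling keyval() explicitly; as described
  # above, rmr2 would wrap such a return value x as keyval(NULL, x).
  cbind(v, square = v^2)
}

# Try the function on a plain vector before submitting any job.
# The keys are passed as NULL here because this check uses values only.
local_check <- square_map(NULL, 1:10)
print(local_check)
```

Checking the map function on in-memory data first is a cheap way to catch mistakes before a job is sent to the cluster. In a job, the same function would be passed through the map argument of mapreduce.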
In my example, the return value is a big data object, and we can pass it as input to other jobs or read it into memory with from.dfs. from.dfs is the complement of to.dfs and returns a key-value pair collection. We use as.data.frame to convert the output of from.dfs into a data frame. from.dfs is useful in defining MapReduce algorithms whenever a job produces something of reasonable size, like a summary, that fits in memory and needs to be inspected to decide on the next steps or to visualize it. In production work, it is much more important than to.dfs.
I also used the time function Sys.time() in my R code. At the beginning, I stored the current time in start.time, and after all code sections had run, I stored the time again in end.time. The difference between these two variables gives the code execution time.
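The round trip through to.dfs, mapreduce, and from.dfs might look like the sketch below. It makes several assumptions that are not part of the original code: rmr2 is installed, the local backend is selected so that no running Hadoop cluster is needed, keyval is called explicitly, and the data frame is built from the key and val elements of the returned collection. The whole block is guarded so that it does nothing when rmr2 is missing.

```r
# Assumption: rmr2 is installed. If it is not, the block is skipped
# and squares_df stays NULL.
squares_df <- NULL

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)

  # Assumption: use the local backend so the sketch runs without a cluster.
  rmr.options(backend = "local")

  # Put the small data set into the file system; sints is only a stub.
  sints <- to.dfs(1:10)

  # Map-only job: emit each value as the key and its square as the value.
  result <- mapreduce(
    input = sints,
    map   = function(k, v) keyval(v, v^2)
  )

  # Read the key-value collection back into memory and tabulate it.
  out <- from.dfs(result)
  squares_df <- data.frame(value = out$key, square = out$val)
  print(squares_df)
}
```

On a real cluster, the rmr.options call would be dropped and rhdfs would be initialized with hdfs.init() as described earlier.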
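The timing pattern can be sketched in base R as follows. The squaring line is only a stand-in for the MapReduce job, and the names time.taken and elapsed_secs are hypothetical additions for this illustration.

```r
# Record the time before the work starts.
start.time <- Sys.time()

# Stand-in for the MapReduce job: square the values 1 to 10 in memory.
squares <- (1:10)^2

# Record the time after the work has finished.
end.time <- Sys.time()

# Subtracting two Sys.time() values gives a difftime object.
time.taken <- end.time - start.time
print(time.taken)

# difftime chooses its units automatically, so request seconds
# explicitly when a plain number is needed.
elapsed_secs <- as.numeric(difftime(end.time, start.time, units = "secs"))
```

Asking for a fixed unit avoids misreading the figure when a longer run is reported in minutes rather than seconds.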
Result:
The final section covers the result. The system generates the squares of the values 1 to 10. The output also shows the code execution time. The total execution time is 1.089565.