In this article we go through the basics of R and MongoDB.
Introduction to R:
R is a statistical programming language. It is interpreted: instructions typed at the console are executed directly, without a separate compile step. Thanks to its built-in statistical functions and add-on packages, R is very popular among statisticians and data miners. R also provides packages for visualizing data in 2D and 3D, which gives a clearer picture of the data and results for better analysis.
Get Started With R
R: http://cran.rstudio.com/
RStudio: http://www.rstudio.com/products/rstudio/download/
The first functions to know when starting to learn R are:
- help(functionName) / ?functionName: shows the help page for the named function
- example(functionName): runs the examples from the help page of the named function
- apropos("text"): lists all available objects whose names contain the given string
Let's look at examples of these commands for clarity.
They can be typed directly into the R console or RStudio.
- Suppose you need help understanding the function min. Calling help(min) (or ?min) opens the help page for min.
- Suppose you still need to see how min is used. Calling example(min) runs the examples from its help page:
min> require(stats); require(graphics)
min> min(5:1, pi) #-> one number
[1] 1
min> pmin(5:1, pi) #-> 5 numbers
[1] 3.141593 3.141593 3.000000 2.000000 1.000000
min> x <- sort(rnorm(100)); cH <- 1.35
min> pmin(cH, quantile(x)) # no names
[1] -2.4030962 -0.4229549  0.1632506  0.8253795  1.3500000
- Suppose you cannot find the function for working with months. Calling apropos("month") lists every object whose name contains "month":
[1] "month.abb" "month.name" "monthplot" "months" "months.Date" "months.POSIXt" "sunspot.month"
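The helpers can also be tried together in one short session. The sketch below is only an illustration: help(min) is left as a comment because it opens a viewer in an interactive session, and the variable names are made up for this example.

```r
# help(min) or ?min opens the documentation page for min()
# in an interactive session, so it is left as a comment here.

# Run the examples from the help page of min() without echoing the code.
example(min, echo = FALSE)

# List every visible object whose name contains "month".
month_matches <- apropos("month")
print(month_matches)

# Restrict the search to functions only (data sets and constants are dropped).
month_functions <- apropos("month", mode = "function")
print(month_functions)
```

Restricting the mode is handy when a search string also matches data sets or constants, as it does here with month.abb and month.name.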
Data Types supported by R:
Real data sets mix many kinds of values, and R provides data types to cover most of them.
Below are the main data types that are widely used for data mining and machine learning.
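As an illustration, the sketch below builds a few of R's commonly used structures. Which types to show is an assumption made for this example, and all variable names and values are made up.

```r
# Atomic vectors: every element has the same type.
numeric_vec   <- c(1.5, 2.0, 3.25)
character_vec <- c("a", "b", "c")
logical_vec   <- c(TRUE, FALSE, TRUE)

# Factor: categorical data stored as integer codes plus labels.
sex <- factor(c("F", "M", "F"))

# Matrix: a two-dimensional structure holding a single type.
m <- matrix(1:6, nrow = 2)

# List: elements may have different types and lengths.
record <- list(name = "cat", weights = numeric_vec, tagged = TRUE)

# Inspect what was built.
print(class(numeric_vec))
print(levels(sex))
print(dim(m))
str(record)
```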
As mentioned above, R provides many packages to visualize data. For example, let's take
It provides many ways to visualize data in different forms.
- persp(volcano, expand = 0.5)
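A minimal sketch of the persp call, assuming the built-in volcano elevation matrix. The viewing angles and colour are illustrative choices, and the null PDF device is there only so the code also runs without a display.

```r
# Use a null PDF device so the sketch runs without a display;
# remove this line and dev.off() to see the plots on screen.
pdf(NULL)

# volcano is a numeric matrix of elevations that ships with R.
print(dim(volcano))

# Draw the 3D surface; theta and phi set the viewing angles.
view_matrix <- persp(volcano, expand = 0.5, theta = 135, phi = 30,
                     col = "lightgreen")

# A 2D contour view of the same data for comparison.
contour(volcano)

dev.off()
```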
As discussed above, R provides many data types.
It also provides a structure for keeping data in tabular form: the data frame.
A data frame is a list of vectors of equal length. Data of different types can be imported into R and stored in a data frame. The source can be a CSV, XLS, table, or text file.
For example, the command below loads data2013.txt from the local file system into sampleDataFrame.
> sampleDataFrame <- read.csv("~/SanJose/HistoricalDataSet/data2013.txt", header=FALSE)
R also offers a quick way to look at data: copy any tabular data with CTRL+C and import it into R from the clipboard.
x <- read.table(file = "clipboard", sep = "\t", header = TRUE)
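To make the data frame idea concrete, here is a small self-contained sketch. The column names and values are made up, and a temporary file stands in for a real data file.

```r
# A data frame is a list of equal-length vectors, one per column.
# Column names and values are made up for illustration.
sample_df <- data.frame(
  year  = c(2011, 2012, 2013),
  city  = c("San Jose", "San Jose", "San Jose"),
  value = c(10.5, 11.2, 9.8)
)
str(sample_df)

# Write the data without a header row to a temporary file,
# then read it back with read.csv(header = FALSE).
tmp_file <- tempfile(fileext = ".txt")
write.table(sample_df, tmp_file, sep = ",",
            row.names = FALSE, col.names = FALSE)
loaded_df <- read.csv(tmp_file, header = FALSE)

# With header = FALSE the columns get the default names V1, V2, V3.
print(names(loaded_df))
print(nrow(loaded_df))
unlink(tmp_file)
```

When the file does have a header row, header = TRUE keeps the original column names instead.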
In real-world scenarios, most data is stored in an RDBMS, and R provides interfaces to connect to them easily. R also provides an interface for NoSQL databases such as MongoDB, which is in high demand for big data analysis and mining.
The RODBC, RMySQL, ROracle, and RJDBC packages integrate R with relational databases, while RMongo covers MongoDB (a NoSQL database) and RNeo4j covers Neo4j (a graph database).
These interfaces are easy to use from R.
Database Integration with MongoDB:
Consider an example where one needs data from the contacts collection of the users database in MongoDB.
Steps to import data from MongoDB to R:
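A minimal sketch of such an import, assuming the RMongo package is installed and a MongoDB server is reachable on the default host and port. The helper name fetch_collection and its defaults are hypothetical. The steps are: connect, query, disconnect.

```r
# Hypothetical helper: fetch documents from a MongoDB collection
# into an R data frame. Assumes RMongo is installed and a server
# is listening on host:port.
fetch_collection <- function(db = "users", collection = "contacts",
                             query = "{}", limit = 1000,
                             host = "127.0.0.1", port = 27017) {
  # Step 1: open a connection to the database.
  connection <- RMongo::mongoDbConnect(db, host, port)
  # Step 3 (registered early): close the connection even if the query fails.
  on.exit(RMongo::dbDisconnect(connection))
  # Step 2: run the query, given as a JSON string; the result is a data frame.
  RMongo::dbGetQuery(connection, collection, query, skip = 0, limit = limit)
}

# Example call (needs a running server), left commented out:
# contacts <- fetch_collection()
# head(contacts)
```

Wrapping the steps in a function keeps the connection handling in one place, so the caller only deals with the returned data frame.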
Classification Example in R:
Data analysts and data miners frequently need classification and clustering, and R has many packages for these and related tasks.
- Dimensionality Reduction
- Frequent Pattern Mining
- Sequence Mining
The same holds for any specific problem within these areas. For example, for SVM classification the e1071, kernlab, klaR, svmpath, and shogun packages are available.
Let's take an example with the e1071 package.
We first load the e1071 package and the sample cats data set into R.
To do the classification, we then need to create a model from the available data set.
To visualize the above model:
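The loading, modelling, and plotting steps can be sketched as follows. This assumes that e1071 is installed and that the cats data set is the one shipped in the MASS package; neither detail is confirmed by the text here.

```r
library(e1071)  # provides svm(); run install.packages("e1071") if missing
library(MASS)   # assumed source of the cats data set

data(cats)
str(cats)       # Sex (factor), Bwt (body weight), Hwt (heart weight)

# Fit an SVM classifier that predicts Sex from the other two columns.
model <- svm(Sex ~ ., data = cats)
print(model)

# Plot the decision regions together with the data points.
plot(model, cats)
```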
A classification problem needs a training set and a test set.
- Divide data into training set and test set
index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]
- Train Model
model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])
To evaluate the result, many techniques are available, such as gain and lift charts, the K-S (Kolmogorov-Smirnov) chart, the ROC chart, and the area under the curve (AUC).
Below is the confusion matrix.
tab <- table(pred = prediction, true = testset[,1])
In the resulting table, 37 instances are classified correctly and 11 are misclassified.
In conclusion, R is a rich language with a wide range of packages for data modeling, analysis, and visualization.
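Counts like these are read off the diagonal of the confusion matrix. The sketch below shows the calculation on a handful of hypothetical labels; they are not the cats results.

```r
# Hypothetical predicted and true labels for eight test instances.
true_labels <- factor(c("F", "F", "M", "M", "M", "F", "M", "M"))
predicted   <- factor(c("F", "M", "M", "M", "F", "F", "M", "M"))

# Rows are predictions, columns are the true classes.
tab <- table(pred = predicted, true = true_labels)
print(tab)

# Correct predictions sit on the diagonal of the confusion matrix.
correct  <- sum(diag(tab))
accuracy <- correct / sum(tab)
print(correct)
print(accuracy)
```

By the same formula, 37 correct predictions out of 48 test instances give an accuracy of about 0.77.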
Basic introduction to MongoDB:
MongoDB is a document database that is not tightly bound to a schema, so it is well known for being schemaless while keeping a clear document structure. For large amounts of data, MongoDB has proven to deliver high performance, high availability, and easy scalability.
Data Format in MongoDB:
It stores data (documents) in BSON, a binary-encoded serialization of JSON-like documents. A single document has a size limit of 16 MB. A document is a set of field and value pairs, where a value can itself be another document or an array.
- MongoDB is easy to install.
- Download URL: https://www.mongodb.org/downloads
- Run mongod.exe from the command prompt to start the database
- Run mongo.exe from the command prompt to connect and manipulate data
Comparison with RDBMS
Importing data from files into MongoDB:
- Easily import data from CSV, JSON
- Example: mongoimport --db users --collection contacts --type csv --headerline --file /opt/backups/contacts.csv
- To limit the number of data
- To limit the number of rows
- To apply conditions, limit checks, sorting, etc.
Applying limit: Data<10
Find using regex
Update multiple documents using regex
To conclude on MongoDB, one more very useful feature is worth mentioning: MongoDB also provides MapReduce support.
Thus, R is a rich statistical programming language with many predefined, easy-to-use packages, and MongoDB is a schema-free, highly scalable database well suited to big data mining.