Introduction to R and MongoDB

In this article we go through the basics of R and MongoDB.

Introduction to R: 

R is a statistical programming language. It is interpreted: instructions entered at the console are executed directly, without a separate compilation step. Thanks to its built-in statistical functions and add-on packages, R is very popular among statisticians and data miners. R also provides packages for visualizing data in 2D and 3D, which gives a clearer picture of the data and results for better analysis.

Get Started With R

Download URL:

For R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/products/rstudio/download/

The first functions one should know in order to start learning R are:

  • help(functionName)  / ?functionName
    • shows the help page for the named function
  • example(functionName)
    • runs the examples from the help page of the named function
  • apropos("string")
    • lists all available objects whose names contain the given string

Let's look at examples of the above commands for clarity.

The commands below can be typed directly into the R console or RStudio.

  • Suppose someone needs help understanding the function min. They can use the help function as shown below.

>  help(min)

This opens the help page for the function min.

  • Suppose someone still needs to understand how to use the min function. They can use the example function as shown below.

> example(min)
min> require(stats); require(graphics)
min> min(5:1, pi) #-> one number
[1] 1
min> pmin(5:1, pi) #-> 5 numbers
[1] 3.141593 3.141593 3.000000 2.000000 1.000000
min> x <- sort(rnorm(100)); cH <- 1.35
min> pmin(cH, quantile(x)) # no names
[1] -2.4030962 -0.4229549 0.1632506 0.8253795 1.3500000

  • Suppose someone cannot find the function for working with months. They can use the apropos function as shown below.

> apropos("month")
[1] "month.abb" "month.name" "monthplot" "months" "months.Date" "months.POSIXt" "sunspot.month"

Data Types supported by R:

R provides several data types so that data of different kinds can be analyzed.

Below are the main data types widely used for data mining and machine learning.

  • Vectors
  • Matrices
  • Arrays
  • Lists
  • Factors
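
As a small illustration of the types listed above, the sketch below creates one object of each kind using base R only. The variable names and values are made up for the example.

```r
# Vector: an ordered set of values of a single type
v <- c(1.5, 2.5, 4.0)

# Matrix: a two-dimensional arrangement of values of a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))

# List: an ordered collection whose elements may have different types
l <- list(name = "sample", values = v, flag = TRUE)

# Factor: categorical data stored as a fixed set of levels
f <- factor(c("low", "high", "low", "medium"))

# Inspect the structure of each object
str(v); str(m); str(a); str(l); str(f)
```

Calling str() on any object is a quick way to check which of these types it is.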

Visualization Example:

As mentioned above, R provides many packages to visualize data. For example, take the function persp(), which draws a perspective plot of a surface.

Its parameters allow the same data to be visualized in many different forms.

  • persp(volcano, expand = 0.5)
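
As a rough sketch of how the same data can be shown in different forms, the calls below draw the built-in volcano data set twice: once as in the command above, and once with a different viewing angle and shading. The specific angle and color values are arbitrary choices for illustration.

```r
# volcano is a built-in matrix of elevation values (87 rows, 61 columns)
dim(volcano)

# Default view, vertically compressed by expand = 0.5
vt <- persp(volcano, expand = 0.5)

# Same surface, rotated (theta), tilted (phi), and shaded
vtRotated <- persp(volcano, theta = 135, phi = 30, expand = 0.5,
                   col = "lightgreen", shade = 0.6, border = NA)
```

persp() returns the viewing transformation matrix invisibly, which is why the result can be assigned to a variable.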

DataDistribution_R_3

DataDistribution

DataDistribution_R

DataDistribution_R_1

As discussed above, R provides many data types.
It also provides a structure for keeping data in tabular form: the Data Frame.

Data Frame:

A data frame is a list of vectors of equal length. Data of different types can be imported into R and stored in a data frame. Sources include csv, xls, table, and txt files.

For example, the command below loads the local file data2013.txt and stores its data in sampleDataFrame.

> sampleDataFrame <- read.csv("~/SanJose/HistoricalDataSet/data2013.txt", header=FALSE)

R also offers a quick way to look at data: copy any data to the clipboard with CTRL+C and import it into R.

x <- read.table(file = "clipboard", sep = "\t", header = TRUE)
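
To make the idea of a data frame as equal-length vectors concrete, here is a minimal sketch that builds one by hand and parses a second one from inline CSV text. The column names and values are invented for the example, and the inline text stands in for a file so that the code runs without any external data.

```r
# Build a data frame from three vectors of equal length
sampleFrame <- data.frame(
  id   = 1:3,
  city = c("San Jose", "Austin", "Boston"),
  temp = c(18.5, 25.1, 12.3),
  stringsAsFactors = FALSE
)
str(sampleFrame)

# read.csv can also parse text directly; with header = FALSE
# the columns receive the default names V1, V2, V3
csvText <- "1,San Jose,18.5\n2,Austin,25.1"
parsed <- read.csv(text = csvText, header = FALSE)
names(parsed)
```

Each column of a data frame can hold a different type, but all columns must have the same number of rows.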

Database Integration:

In real-world scenarios, most data is stored in an RDBMS, and R packages provide interfaces to connect to these easily. Interfaces are also available for NoSQL databases such as MongoDB, which is in high demand for big data analysis and mining.

The RODBC, RMySQL, ROracle, and RJDBC packages integrate R with relational databases, RMongo connects to MongoDB (a NoSQL database), and RNeo4j connects to Neo4j (a graph database).

These interfaces are very easy to use in R.

Database Integration with MongoDB:

Consider an example where one needs data from the contacts collection of the users database in MongoDB.

Steps to import data from MongoDB to R:

R_Mongo_integration
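
As a minimal sketch of what such an import might look like, the function below assumes the RMongo package is installed and a MongoDB server is reachable. The helper name fetchContacts and the default host and port are illustrative assumptions, not part of the original steps.

```r
# Hypothetical helper: read the contacts collection of the users database
# into an R data frame. Assumes RMongo is installed and MongoDB is running.
fetchContacts <- function(host = "127.0.0.1", port = 27017) {
  if (!requireNamespace("RMongo", quietly = TRUE)) {
    stop("The RMongo package is required for this sketch")
  }
  # Open a connection to the users database
  conn <- RMongo::mongoDbConnect("users", host, port)
  # Make sure the connection is closed when the function exits
  on.exit(RMongo::dbDisconnect(conn))
  # An empty query document "{}" matches every document in the collection
  RMongo::dbGetQuery(conn, "contacts", "{}")
}
```

With a server running, calling fetchContacts() should return the matching documents as a data frame that can be analyzed like any other.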

Classification Example in R:

As a data analyst or data miner, one often needs classification and clustering, and R has a rich set of packages for these and related tasks:

  • Dimensionality Reduction
  • Frequent Pattern Mining
  • Sequence Mining
  • Clustering
  • Classification

The same holds for specific problems within each area. For example, for SVM classification the packages e1071, kernlab, klaR, svmpath, and shogun are available.

Let's take an example with the e1071 package.

R_PackageInstalled

sampleDataLoading

As shown above, we have loaded the e1071 package and the cats sample data set into R.

Now in order to do the classification, we need to create a model from the available data set.

SVM_R_EXAMPLE

To visualize the above model:

> plot(model,cats)

SVM_Classification_Plot

For a classification problem, we need a training data set and a test data set.

  • Divide data into training set and test set

index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]

  • Train Model

model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])
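
Because sample() draws at random, the split differs on every run. The sketch below repeats the split step with a fixed seed so that it is reproducible, and checks that the two sets do not overlap. It assumes the cats data set comes from the MASS package, and the seed value is an arbitrary choice. The training step is left out here so that the sketch runs without e1071 installed.

```r
library(MASS)    # assumed source of the cats data set (Sex, Bwt, Hwt)

set.seed(42)     # arbitrary seed, added only for reproducibility

# Hold out one third of the rows for testing
index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index) / 3))
testset <- cats[testindex, ]
trainset <- cats[-testindex, ]

# Sizes of the two sets, and the number of rows they share
nrow(testset)
nrow(trainset)
length(intersect(rownames(testset), rownames(trainset)))
```

Keeping the test rows out of the training set is what makes the later evaluation meaningful.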

To evaluate the result, many techniques are available, such as gain and lift charts, the K-S (Kolmogorov-Smirnov) chart, the ROC chart, and the area under the curve (AUC).

Below is the confusion matrix.

tab <- table(pred = prediction, true = testset[,1])

SVM_ConfusionMatrix

As shown above, 37 instances are classified correctly and 11 are misclassified, an accuracy of about 77% (37 of 48).

In conclusion, R is a very rich language, with a wide range of packages available for data modeling, analysis, and visualization.
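
The accuracy can be read directly off a confusion matrix: the diagonal holds the correctly classified counts. The sketch below shows the calculation on a tiny set of made-up labels, which are not the cats results, using base R only.

```r
# Toy labels, invented purely to illustrate the calculation
true <- c("F", "F", "F", "M", "M", "M", "M", "M")
pred <- c("F", "F", "M", "M", "M", "M", "F", "M")

# Confusion matrix: rows are predictions, columns are true classes
tab <- table(pred = pred, true = true)
print(tab)

# Correct predictions sit on the diagonal
correct <- sum(diag(tab))
accuracy <- correct / sum(tab)
accuracy
```

The same two lines at the end can be applied unchanged to the tab object built from the SVM predictions.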

Basic introduction to MongoDB:

MongoDB is a document database that is not tightly bound to a schema, so it is well known for features such as being schemaless and having a clear structure. For large amounts of data, MongoDB has proven to deliver high performance, high availability, and easy scalability.

Data Format in MongoDB:

It stores data (documents) in BSON format, a binary-encoded serialization of JSON. A single document has a size limit of 16 MB. Below is the document format used in MongoDB.

MongoDocumentFormat

  • MongoDB is easy to install.
  • Run mongod.exe from the command prompt to start the database
  • Run mongo.exe from the command prompt to connect and manipulate data

Comparison with RDBMS

SQL_MongoDB

Query Categories

QueryStructures

Importing data from files to MongoDB.

  • Data can be imported easily from CSV and JSON files
  • Example: mongoimport --db users --collection contacts --type csv --headerline --file /opt/backups/contacts.csv

Sample Queries

Insert Queries

MongoCMD

Projection

  • To limit the fields returned
  • To limit the number of documents (rows) returned
  • To apply conditions, limits, sorting, etc.

Projection

Applying limit: Data<10

LimitExample

Find Using REGEX

MongoDB_REGEX

Skip, Sort

Sort_Skip_Mongo

Update multiple Documents using REGEX

MongoDB_Update_REGEX

Remove Data

MongoDB_Remove

To conclude the MongoDB overview, one more very useful feature is worth mentioning: MongoDB also provides MapReduce support.

MongoMapReduce

Thus, R is a rich statistical programming language with many predefined, easy-to-use packages, and MongoDB is a schema-free, highly scalable database well suited to big data mining.
