Introduction to R and MongoDB

In this article we go through the basics of R and MongoDB.

Introduction to R:

R is a statistical programming language. It is interpreted: instructions typed at the console are executed directly, with no separate compilation step. Thanks to its built-in statistical functions and add-on packages, R is very popular among statisticians and data miners. On top of this, R provides packages for visualizing data in 2D and 3D, which gives a clearer picture of the data and the results for better analysis.

Get Started With R

Download URL:

For R: http://cran.rstudio.com/
R Studio: http://www.rstudio.com/products/rstudio/download/

The very first functions to know when starting to learn R are:

help(functionName) / ?functionName
- shows the help page for the named function
example(functionName)
- runs the examples from the help page of the named function
apropos("function name")
- lists all available functions whose names contain the string passed as the parameter

Let's look at an example of each of the above commands for clarity.

The commands below can be typed directly into the R console or RStudio.

Suppose you need help understanding the function min. You can use the help function as shown below.

> help(min)

This opens the help page for the function min.

Suppose you still need to see how min is used in practice. You can use the example function as shown below.

> example(min)
min> require(stats); require(graphics)
min> min(5:1, pi) #-> one number
[1] 1
min> pmin(5:1, pi) #-> 5 numbers
[1] 3.141593 3.141593 3.000000 2.000000 1.000000
min> x <- sort(rnorm(100)); cH <- 1.35
min> pmin(cH, quantile(x)) # no names
[1] -2.4030962 -0.4229549 0.1632506 0.8253795 1.3500000

Suppose you cannot find the function you need for working with months. You can use the apropos function as shown below.

> apropos("month")
[1] "month.abb" "month.name" "monthplot" "months" "months.Date" "months.POSIXt" "sunspot.month"

Data Types supported by R:

Data to be analysed comes in many shapes, and R provides data types that cover most of them.

Below are the main data types that are widely used for data mining and machine learning.

Vectors
Matrices
Arrays
Lists
Factors
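
As a rough illustration of the five types listed above, the following sketch builds one small object of each kind using base R only. The variable names and values are made up for this example.

```r
# Vector: an ordered collection of values of a single type
weights <- c(2.0, 2.1, 3.9)

# Matrix: a vector arranged in two dimensions (rows x columns)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: the same idea extended to any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))

# List: an ordered collection whose elements may differ in type
person <- list(name = "Ann", age = 31, scores = c(90, 85))

# Factor: a categorical variable with a fixed set of levels
sex <- factor(c("F", "M", "F"))

# Inspect the structure of each object
str(weights)
dim(m)
dim(a)
names(person)
levels(sex)
```

str() is a convenient way to check which of these types an unfamiliar object is.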

Visualization Example:

As mentioned above, R provides many packages to visualize data. For example, take the function persp(), which draws a 3D perspective plot of a surface. The call below plots the built-in volcano data set.

persp(volcano, expand = 0.5)

Besides the data types discussed above, R also provides a structure for keeping data in the form of tables: the data frame.

Data Frame:

A data frame is a list of vectors of equal length. Different types of data can be imported into R and stored in a data frame. The source can be csv, xls, table, txt, etc.
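
To make the "list of vectors of equal length" idea concrete, here is a minimal sketch that builds a data frame by hand. The column names and values are hypothetical.

```r
# Three vectors of equal length become the three columns of the data frame
contacts <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  age    = c(31, 45, 27),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(contacts)         # a data frame is a list underneath
sapply(contacts, length)  # every column has the same length
sapply(contacts, class)   # but each column can have its own type
contacts$age              # a column can be pulled out as a plain vector
```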

For example, the command below loads the data from the local file data2013.txt and stores it in sampleDataFrame in R.

> sampleDataFrame <- read.csv("~/SanJose/HistoricalDataSet/data2013.txt", header=FALSE)
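
To see what header=FALSE does, the following sketch writes a small two-column file to a temporary location and reads it back. The rows are invented for illustration and are not the contents of data2013.txt.

```r
# Write a tiny comma-separated file with no header row (made-up values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("2013-01-01,10.5", "2013-01-02,11.2"), tmp)

# With header = FALSE the first line is treated as data,
# and R names the columns V1, V2, ...
df <- read.csv(tmp, header = FALSE)
names(df)
nrow(df)

# Clean up the temporary file
unlink(tmp)
```

With header=TRUE the first line would instead be consumed as column names, so the data frame would have one row fewer.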

R also offers a shortcut for a quick look at data: copy any tabular data to the clipboard with CTRL+C and import it into R.

x <- read.table(file = "clipboard", sep = "\t", header = TRUE)

Database Integration:

In real-world scenarios, most data is stored in an RDBMS, and R packages provide interfaces to connect to them easily. There are also interfaces for NoSQL databases such as MongoDB, which is in high demand for big data analysis and mining.

The RODBC, RMySQL, ROracle, and RJDBC packages integrate R with relational databases. RMongo does the same for MongoDB (a NoSQL database), and RNeo4j for Neo4j (a graph database).

It is very easy to use these interfaces in R.

Database Integration with MongoDB:

Consider an example where one needs data from the contacts collection of the users database in MongoDB.

Steps to import data from MongoDB to R:
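
With the RMongo package mentioned above, the import could be sketched roughly as follows: connect, query, disconnect. This assumes RMongo is installed and a MongoDB server is reachable at the given host and port. The helper name fetch_contacts and the default address are hypothetical.

```r
# Hypothetical helper: read documents from users.contacts into a data frame.
# Assumes the RMongo package is installed and a MongoDB server is running.
fetch_contacts <- function(host = "127.0.0.1", port = 27017) {
  library(RMongo)

  # Step 1: connect to the "users" database
  mongo <- mongoDbConnect("users", host, port)

  # Step 3 (registered now, runs on exit): close the connection
  on.exit(dbDisconnect(mongo))

  # Step 2: query the "contacts" collection; "{}" is an empty filter
  dbGetQuery(mongo, "contacts", "{}")
}

# Usage, once a server is available:
# contacts <- fetch_contacts()
# head(contacts)
```

The result comes back as an ordinary data frame, so it can be used with the functions shown earlier.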

Classification Example in R:

Data analysts and data miners need classification and clustering very often, and R has a rich set of packages for these and related tasks:

Dimensionality Reduction
Frequent Pattern Mining
Sequence Mining
Clustering
Classification

The same goes for any specific problem within these areas. For example, for SVM classification the e1071, kernlab, klaR, svmpath, and shogun packages are available.

Let's take an example with the e1071 package.

The first step is to load the e1071 package and the sample cats data set (which ships with the MASS package) into R.

Now in order to do the classification, we need to create a model from the available data set.
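
The loading step and the model fit described above might look like the following sketch. It assumes the e1071 and MASS packages are installed, and it uses the default svm() settings rather than any tuned parameters.

```r
library(e1071)                # provides svm()
data(cats, package = "MASS")  # columns: Sex, Bwt (body weight), Hwt (heart weight)

# Fit a classifier that predicts Sex from all the other columns
model <- svm(Sex ~ ., data = cats)

# Show the kernel, cost and number of support vectors chosen
print(model)
summary(model)
```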

To visualize the above model:

> plot(model,cats)

A classification problem needs both a training data set and a test data set.

Divide data into training set and test set

index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]

Train Model

model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])

To verify the result there are many evaluation techniques available, such as Gain and Lift Charts, the K-S or Kolmogorov-Smirnov chart, the ROC Chart, Area Under the Curve, etc.

Below is the confusion matrix.

tab <- table(pred = prediction, true = testset[,1])

In this run, 37 instances were classified correctly and 11 instances were classified wrongly.
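
Those two counts translate directly into an accuracy figure. The sketch below uses only the numbers reported in the text. The helper accuracy_from_table is a hypothetical name, included to show how the same figure could be read off a table such as tab.

```r
# Counts reported in the text for the 48-row test set
correct <- 37
wrong   <- 11

# Accuracy = correctly classified / all classified
accuracy <- correct / (correct + wrong)
print(round(accuracy, 3))  # 0.771

# Hypothetical helper: in a square confusion matrix the diagonal
# holds the correct predictions
accuracy_from_table <- function(tab) sum(diag(tab)) / sum(tab)
```

Because the test rows are drawn at random with sample(), the counts will differ from run to run unless a seed is fixed with set.seed().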

In conclusion, R is a very rich language to use, as it has a wide range of packages available for data modeling, analysis, and visualization.

Basic introduction to MongoDB:

MongoDB is a document database that is not tightly bound to a schema, so it is well known for features such as being schemaless and having a clear document structure. For large amounts of data, MongoDB has proven to deliver high performance, high availability, and easy scalability.

Data Format in MongoDB:

It stores data (documents) in BSON format, which is a binary-encoded serialization of JSON-like documents. A single document has a size limit of 16 MB. Below is the document format that is maintained in MongoDB.

MongoDB is easy to install.
- Download URL: https://www.mongodb.org/downloads
- Run mongod.exe from the command prompt to start the database
- Run mongo.exe from the command prompt to connect and manipulate data

Comparison with RDBMS

Query Categories

Importing data from files to MongoDB.

Data can easily be imported from CSV and JSON files.
Example: mongoimport --db users --collection contacts --type csv --headerline --file /opt/backups/contacts.csv

Sample Queries

Insert Queries

Projection

To limit the amount of data returned
To limit the number of rows
To apply conditions, limit checks, sorting, etc.

Applying limit: Data<10

Find Using REGEX

Skip, Sort


Update multiple Document using REGEX

Remove Data

To conclude the MongoDB overview, one more very useful feature is worth mentioning: MongoDB also provides MapReduce support.

Thus, R is a rich statistical programming language with many predefined, easy-to-use packages, and MongoDB is a schema-free, highly scalable database well suited to big data mining.
