Category Archives: R

Distance Matrix in R

In this exercise I'll build a distance matrix. For that we only need the function spDistsN1 from the library “sp”.

The basic use of the function is the following:

library(sp)
data(meuse)
coordinates(meuse) <- c("x", "y")
spDistsN1(meuse, meuse[1,], longlat = FALSE)  # meuse is projected (metres), so Euclidean distance

The limitation is that it only returns the distances to the first element (or whichever row is given), because the function measures the distance from every element to a single geometric element (meuse[1, ]). To solve that we can use the apply function:

apply(meuse@coords, 1, function(x) spDistsN1(meuse@coords, x, longlat = FALSE))
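
To make the shape of the result concrete, here is a minimal sketch of the same apply pattern on a toy coordinate matrix. It uses only base R. The helper dist_to_all is a hypothetical stand-in for spDistsN1 (plain Euclidean distance only) and the coordinates are made up.

```r
# Toy planar coordinates (made-up values, not the meuse data)
coords <- cbind(x = c(0, 3, 0, 6), y = c(0, 4, 8, 8))

# Hypothetical stand-in for spDistsN1: Euclidean distance from
# every row of `pts` to the single point `p`
dist_to_all <- function(p, pts) sqrt(rowSums(sweep(pts, 2, p)^2))

# One call per row; each call fills one column of the matrix
D <- apply(coords, 1, dist_to_all, pts = coords)

dim(D)     # 4 4
D[1, 2]    # distance between points 1 and 2: 5
```

Each column holds the distances from one point to all the others, so the matrix is symmetric with zeros on the diagonal.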

This is a good solution, but it becomes a problem when the data is large: in that case it takes too long to process. (Doing this with a for loop instead of apply would be no faster.)

To solve this problem I need to parallelize the processing. As you can guess, the function parApply (from the parallel package) works just like apply, but distributes the work across several cores:

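library(parallel)
m.coord <- meuse@coords
ncore <- detectCores()
cl <- makeCluster(ncore)
clusterExport(cl, c("m.coord"))             # copy the coordinates to every worker
clusterEvalQ(cl = cl, expr = library(sp))   # load sp on every worker
parApply(cl = cl, X = m.coord, MARGIN = 1,
         FUN = function(x) spDistsN1(m.coord, x, longlat = FALSE))
stopCluster(cl)                             # release the workers when done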

Then we have a nice and efficient solution.
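
As a quick sanity check, the sketch below compares the serial and the parallel result on made-up points. It only uses base R and the parallel package. The helper dist_to_all is a hypothetical Euclidean stand-in for spDistsN1, and the two-worker cluster is an arbitrary choice for the demo.

```r
library(parallel)

set.seed(1)
coords <- cbind(x = runif(50), y = runif(50))   # made-up points

# Hypothetical Euclidean stand-in for spDistsN1; the full point set
# is passed as an argument, so nothing has to be exported
dist_to_all <- function(p, pts) sqrt(rowSums(sweep(pts, 2, p)^2))

cl <- makeCluster(2)                            # two workers for the demo
D_par <- parApply(cl, coords, 1, dist_to_all, pts = coords)
stopCluster(cl)                                 # release the workers

D_ser <- apply(coords, 1, dist_to_all, pts = coords)
all.equal(D_par, D_ser)
```

For a small input like this the cluster start-up will likely cost more than it saves; the parallel version only pays off when each call is expensive or there are many rows.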

Note: I know this can easily be done with GIS software, but this way it can be incorporated into a bigger, automated process.


Nearest Facility Threshold

In the previous post I created an R function that finds the nearest facility. Now I want to show a small update that keeps only the cases under a threshold distance.

For this we can use the dplyr function filter, so we only add one simple line of code.

DF <- filter(.data = DF, DIST <= threshold)
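
For readers without dplyr at hand, the same row filter can be sketched in base R. The data frame and the threshold below are made up; the column names follow the PA / PB / DIST layout that the function returns. The two forms should match as long as DIST has no missing values.

```r
# Made-up result table in the PA / PB / DIST layout
DF <- data.frame(PA = c(10000, 10001, 10002),
                 PB = c(3, 1, 5),
                 DIST = c(250, 1400, 990))
threshold <- 1000

# Base R counterpart of filter(DF, DIST <= threshold)
near <- DF[DF$DIST <= threshold, ]
near$PA    # 10000 10002
```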

Adding this line to the function, the final version is:

library(dplyr)
library(tidyr)
DistMin_Cent <- function(DF, Destination, Distance, threshold){
  DF <- spread_(DF, Destination, Distance)             # long -> wide
  DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)   # minimum distance per row
  c_col <- c(ncol(DF)-1)                               # last destination column
  DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
  c(2:c_col)], 1, which.min)])
  DF <- DF[, c(1, ncol(DF), c_col + 1)]                # origin, nearest destination, MIN
  names(DF) <- c("PA", "PB", "DIST")
  filter(.data = DF, DIST <= threshold)                # keep rows under the threshold
}

An example of use for this is:

DistMin_Cent(Dist_mtx, "D", "DIST", 1000)

Nearest Facility

I haven’t blogged in a long time. This time I want to show a very simple way to obtain the nearest centroid.

Let’s imagine there is a group of markets in a town and each of them needs to get its supplies from the nearest supermarket.

With the help of GIS software, every facility has a geographical position, and consequently we can obtain a distance matrix. We assume that each market is supplied by only one supermarket. With these elements at hand we can write some code in R.

First of all, to have a reproducible example, we simulate a distance matrix. For this I create a matrix and convert it to a data frame. As we’ll see in the code, I’m assuming that the random distances follow a normal distribution.

As the distance-matrix output of a GIS is tabular rather than in matrix form, I have to convert the matrix to tabular data. For that I use the gather function from the tidyr library.

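library(tidyr)
# 20 origins x 5 destinations; abs() is an assumption that keeps the simulated distances non-negative
Dist_mtx <- data.frame(matrix(abs(rnorm(100, mean = 1000, sd = 800)), nrow = 20))
names(Dist_mtx) <- 1:5
Dist_mtx$O <- seq(10000, 10019, 1)
Dist_mtx <- gather(Dist_mtx, key = D, value = DIST, 1:5)
Dist_mtx <- Dist_mtx[order(Dist_mtx$O),]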

After that we need to create a function with three parameters; we'll call it DistMin_Cent.

DistMin_Cent <- function(DF, Destination, Distance){

For this function we need the tidyr library, because the tabular data has to be transformed back into matrix form in order to extract the closest supermarket. This is done with the spread function.

DF <- spread_(DF, Destination, Distance)
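
To see what this reshaping step produces, here is a small base R sketch of the same long-to-wide idea. It uses reshape instead of spread_, so the column names differ (they get a DIST. prefix), and the data is made up.

```r
# Made-up long-format distances: 2 origins (O) x 3 destinations (D)
long <- data.frame(O = rep(c(10000, 10001), each = 3),
                   D = rep(c("1", "2", "3"), times = 2),
                   DIST = c(500, 120, 900, 300, 800, 50))

# One row per origin, one DIST.<destination> column per destination
wide <- reshape(long, idvar = "O", timevar = "D", direction = "wide")
names(wide)    # "O" "DIST.1" "DIST.2" "DIST.3"
```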

To extract the minimum distance we use the apply function with the argument MARGIN = 1, so that the calculation is done per row.

DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)

The following code identifies the name of the column that holds the minimum distance.

c_col <- c(ncol(DF)-1)
DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
c(2:c_col)], 1, which.min)])
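
The two steps above (row minimum, then the name of the column that holds it) can be tried on a tiny made-up table. The variable names in this sketch are illustrative and are not part of the function.

```r
# Made-up wide table: first column is the origin, the rest are
# distances to destinations "1", "2" and "3"
wide <- data.frame(O = c(10000, 10001),
                   `1` = c(500, 300), `2` = c(120, 800), `3` = c(900, 50),
                   check.names = FALSE)

dist_cols <- wide[, -1]
min_dist <- apply(dist_cols, 1, min)    # smallest value in each row
nearest  <- colnames(dist_cols)[apply(dist_cols, 1, which.min)]

unname(min_dist)    # 120 50
nearest             # "2" "3"
```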

After that we select only the columns that are useful to us.

Everything is summarized in the new function, so that it can be applied to any data frame.

DistMin_Cent <- function(DF, Destination, Distance){
  DF <- spread_(DF, Destination, Distance)             # long -> wide
  DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)   # minimum distance per row
  c_col <- c(ncol(DF)-1)                               # last destination column
  DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
  c(2:c_col)], 1, which.min)])
  DF <- DF[, c(1, ncol(DF), c_col + 1)]                # origin, nearest destination, MIN
  names(DF) <- c("PA", "PB", "DIST")
  DF
}

DistMin_Cent(Dist_mtx, "D", "DIST")

And then we can see the results of the process.
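
As a cross-check, the same nearest-facility result can be sketched without reshaping at all, by splitting the long table per origin and keeping the row with the smallest distance. This is an alternative base R approach, not the function above, and the data is made up.

```r
# Made-up long-format distances, shaped like the gathered Dist_mtx
long <- data.frame(O = rep(c(10000, 10001), each = 3),
                   D = rep(c("1", "2", "3"), times = 2),
                   DIST = c(500, 120, 900, 300, 800, 50))

# For each origin keep the row with the smallest distance
nearest <- do.call(rbind, lapply(split(long, long$O),
                                 function(g) g[which.min(g$DIST), ]))
rownames(nearest) <- NULL
nearest
```

Here origin 10000 is matched with destination "2" (distance 120) and origin 10001 with destination "3" (distance 50).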


Update on Snowdoop, a MapReduce Alternative

Mad (Data) Scientist

In blog posts a few months ago, I proposed an alternative to MapReduce, e.g. to Hadoop, which I called “Snowdoop.” I pointed out that systems like Hadoop and Spark are very difficult to install and configure, are either too primitive (Hadoop) or too abstract (Spark) to program, and above all, are SLOW. Spark is of course a great improvement on Hadoop, but still suffers from these problems to various extents.

The idea of Snowdoop is to

  • retain the idea of Hadoop/Spark to work on top of distributed file systems (“move the computation to the data rather than vice versa”)
  • work purely in R, using familiar constructs
  • avoid using Java or any other external language for infrastructure
  • sort data only if the application requires it

I originally proposed Snowdoop just as a concept, saying that I would slowly develop it into an actual package. I later put the beginnings of a…


Installing R in Ubuntu

First of all, it’s possible to install R from the Ubuntu Software Center.


But that version is quite outdated, so some packages won’t work with it.

To be able to install the current version you must modify the file sources.list

To do that, go to the terminal and type:

sudo nano /etc/apt/sources.list

And you should add the following line:

deb trusty/

That address is used because I’m in Chile, so you have to replace it with the right mirror for your country. To find it, just go to:

After you have modified sources.list, refresh the package list and install R by typing the following:

sudo apt-get update
sudo apt-get install r-base
sudo apt-get install r-base-dev

And now R is ready to use. I recommend using it together with RStudio, though.

Data Visualization cheatsheet, plus Spanish translations

RStudio Blog


We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.


RStudio v0.99 Preview: Code Completion

Great! This is very useful, it’s something we were waiting for.

RStudio Blog

We’re busy at work on the next version of RStudio (v0.99) and this week will be blogging about some of the noteworthy new features. If you want to try out any of the new features now you can do so by downloading the RStudio Preview Release.

The first feature to highlight is a fully revamped implementation of code completion for R. We’ve always supported a limited form of completion however (a) it only worked on objects in the global environment; and (b) it only worked when expressly requested via the tab key. As a result not nearly enough users discovered or benefitted from code completion. In this release code completion is much more comprehensive.

Smarter Completion Engine

Previously RStudio only completed variables that already existed in the global environment, now completion is done based on source code analysis so is provided even for objects that haven’t been fully evaluated:


Completions are also provided…
