Nearest Facility Threshold

In the previous post I created a R function which allows to find the nearest facility, now I want to show a bit update in which we can select only the cases under a threshold distance.

For this we can easily use the dplyr function named as filter, so we add a simple line of code.

DF <- filter(.data = DF, DIST <= threshold)

Then we can add this to the function which the final function will be:

DistMin_Cent <- function(DF, Destination, Distance,threshold){
library(dplyr)
library(tidyr)
DF <- spread_(DF, Destination, Distance)
DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)
c_col <- c(ncol(DF)-1)
DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
c(2:c_col)], 1, which.min)])
DF <- DF[, c(1, ncol(DF), c_col)]
rm(c_col)
names(DF) <- c("PA", "PB", "DIST")
DF <- filter(.data = DF, DIST <= threshold)
DF
}

An example of use for this is:

DistMin_Cent(Dist_mtx, "D", "DIST", 1000)

Advertisements

Nearest Facility

I haven’t blogged since a long long time, on this opportunity I want to show a very simple way to obtain the nearest centroid.

Let’s imagine there is a group of markets in a town and they need to obtain their supply at the nearest  supermarket.

With the help of a GIS Software, any facility has their geographical position and consecutively we can obtain a distance matrix. We assume that the supply only comes from only one supermarket. With this elements at hand we can make a code in R.

First of all to be a reproducible example we simulate a distance matrix. For this I can create and bring a matrix to a data frame. As we’ll on the code, I’m assuming that the random distances have a normal distribution.

As the output of a distance matrix of a GIS is tabular and it has not as a matrix , I have to convert the matrix to tabular data, for that I need to use the gather function found in the dplyr library.

Dist_mtx <- as.data.frame(matrix(abs(rnorm(100, mean = 1000, sd = 800)), nrow = 20))
names(Dist_mtx) <- seq(1:5)
Dist_mtx$O <- seq(10000, 10019, 1)
library(dplyr)
Dist_mtx <- gather(Dist_mtx, key = D, value = DIST, 1:5)
Dist_mtx <- Dist_mtx[order(Dist_mtx$O),]

After that there is the need to create a function with three parameters, we'll call her DistMin_Cent.

DistMin_Cent <- function(DF, Destination, Distance){
}

For this function we will need the libraries dplyr and tidyr because there is the need to transform the tabular data into a matrix form. This need of this transformation exist to extract the closest supermarket, this is with the spread function.

DF <- spread_(DF, Destination, Distance)

To extract the min distance we need the apply function, inside this function we need to set the argument MARGIN = 1, this is necessary to do the calculation per row.

DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)

The following code helps to obtain to identify the name of the column which has the minimum distance.

c_col <- c(ncol(DF)-1)
DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
c(2:c_col)], 1, which.min)])

After that we select only the columns is useful for us.

All is summarized in a the new function to be capable of doing this with any data frame.

DistMin_Cent <- function(DF, Destination, Distance){
library(dplyr)
library(tidyr)
DF <- spread_(DF, Destination, Distance)
DF$MIN <- apply(DF[, c(2:ncol(DF))], 1, FUN = min)
c_col <- c(ncol(DF)-1)
DF$CName <- as.numeric(colnames(DF[, c(2:c_col)])[apply(DF[,
c(2:c_col)], 1, which.min)])
DF <- DF[, c(1, ncol(DF), c_col)]
rm(c_col)
names(DF) <- c("PA", "PB", "DIST")
DF
}

DistMin_Cent(Dist_mtx, "D", "DIST")

And then we can see the results of the process.

min_dist

Update on Snowdoop, a MapReduce Alternative

Mad (Data) Scientist

In blog posts a few months ago, I proposed an alternative to MapReduce, e.g. to Hadoop, which I called “Snowdoop.” I pointed out that systems like Hadoop and Spark are very difficult to install and configure, are either too primitive (Hadoop)  or too abstract (Spark) to program, and above all, are SLOW. Spark is of course a great improvement on Hadoop, but still suffers from these problems to various extents.

The idea of Snowdoop is to

  • retain the idea of Hadoop/Spark to work on top of distributed file systems (“move the computation to the data rather than vice versa”)
  • work purely in R, using familiar constructs
  • avoid using Java or any other external language for infrastructure
  • sort data only if the application requires it

I originally proposed Snowdoop just as a concept, saying that I would slowly develop it into an actual package. I later put the beginnings of a…

View original post 601 more words

Installing R in Ubuntu

First of all, It’s possible to install R from the Ubuntu Software Center

Screenshot from 2015-05-24 02:33:20

But it’s so outdated, so some packages won’t work for maintenance issues.

To be able to install the current version you must modify the file sources.list

To do that, go to the terminal and type:

sudo nano /etc/apt/sources.list

And you should add the following line:

deb http://dirichlet.mat.puc.cl//bin/linux/ubuntu trusty/

That adress is because I’m on Chile, thus you have to replace it for the right mirror belonging to your country. To know that, just go to:

http://cran.r-project.org/mirrors.html

After you have modified the sources.list type the following:

sudo apt-get install r-base
sudo apt-get install r-base-dev

And now you have R ready to use it. Though, I recommend to use it along with RStudio.

Data Visualization cheatsheet, plus Spanish translations

RStudio Blog

data visualization cheatsheet

We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.

View original post

RStudio v0.99 Preview: Code Completion

Great! This is very useful, it’s something we were waiting for.

RStudio Blog

We’re busy at work on the next version of RStudio (v0.99) and this week will be blogging about some of the noteworthy new features. If you want to try out any of the new features now you can do so by downloading the RStudio Preview Release.

The first feature to highlight is a fully revamped implementation of code completion for R. We’ve always supported a limited form of completion however (a) it only worked on objects in the global environment; and (b) it only worked when expressly requested via the tab key. As a result not nearly enough users discovered or benefitted from code completion. In this release code completion is much more comprehensive.

Smarter Completion Engine

Previously RStudio only completed variables that already existed in the global environment, now completion is done based on source code analysis so is provided even for objects that haven’t been fully evaluated:

document-inferred

Completions are also provided…

View original post 419 more words

Writing multiple csv files from a xlsx

What I used for this example is an open data about “Recycling places”, you can find it on the web page of Portal de datos Públicos.

The data, is an xlsx file

xlsx_wordpress

The file has 8 columns, one of them is town. So, now, the questions is:

How do I generate multiple files, one for each town?. The answer is simple: R

Why R? Because, you can automatize it. It avoid you to make different filters, and save the new file each time.

Let’s start:

To read the file, you can use the XLConnect package, and to split the data: the plyr package.

You can load the file with the function readWorksheet, but, as you can see, on the first row, six of the eight cells, are merged. So, when you load the file, that will disappear. Thus, we will read the file with headers and we’ll rename the columns with troubles; and then we will remove the row without data.

After that, we will use the d_ply function, which lets us to split the data. On this function, we put the field on which the split should be based. Then, we use the sdf function, which allows us to write the csv files; and as a final step, we extract the names from the field chosen, to paste them on the name of the new files.

library(XLConnect) #Functions to read excel
library(plyr) #Functions to split data

wb = loadWorkbook("Formato_Puntos_de_almacenamiento_Muni_consolidado.xlsx")
df = readWorksheet(wb, sheet = "Hoja1", header = TRUE)
colnames(df)[c(5, 6)] <- c("Este UTM", "Norte UTM")  #change col names
df2 <- df[-1,] #remove first row

d_ply(df2, .(Comuna),
      function(sdf) write.csv(sdf,
                              file=paste(sdf$Comuna[[1]],".csv",sep=""))) #write multiple csv