Author Archives: Ariel E. Fuentes Díaz

About Ariel E. Fuentes Díaz

I'm a Geographical Engineer from Chile and I work on Public Transport. I love learning new things, almost as much as I love music. That's why I play piano and I also sing.

Update on Snowdoop, a MapReduce Alternative

Mad (Data) Scientist

In blog posts a few months ago, I proposed an alternative to MapReduce, e.g. to Hadoop, which I called “Snowdoop.” I pointed out that systems like Hadoop and Spark are very difficult to install and configure, are either too primitive (Hadoop)  or too abstract (Spark) to program, and above all, are SLOW. Spark is of course a great improvement on Hadoop, but still suffers from these problems to various extents.

The idea of Snowdoop is to

  • retain the idea of Hadoop/Spark to work on top of distributed file systems (“move the computation to the data rather than vice versa”)
  • work purely in R, using familiar constructs
  • avoid using Java or any other external language for infrastructure
  • sort data only if the application requires it

I originally proposed Snowdoop just as a concept, saying that I would slowly develop it into an actual package. I later put the beginnings of a…


Installing R in Ubuntu

First of all, it’s possible to install R from the Ubuntu Software Center, but that version is quite outdated, so some packages won’t work with it.

To install the current version, you must modify the file sources.list.

To do that, go to the terminal and type:

sudo nano /etc/apt/sources.list

And you should add the following line:

deb http://dirichlet.mat.puc.cl//bin/linux/ubuntu trusty/

I use that address because I’m in Chile, so you should replace it with the mirror for your country, and trusty with the codename of your Ubuntu release. To find your mirror, go to:

http://cran.r-project.org/mirrors.html

After modifying sources.list, update the package index and install R:

sudo apt-get update
sudo apt-get install r-base
sudo apt-get install r-base-dev

Now R is ready to use, though I recommend using it together with RStudio.

Data Visualization cheatsheet, plus Spanish translations

RStudio Blog


We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.


RStudio v0.99 Preview: Code Completion

Great! This is very useful; it’s something we were waiting for.

RStudio Blog

We’re busy at work on the next version of RStudio (v0.99) and this week will be blogging about some of the noteworthy new features. If you want to try out any of the new features now you can do so by downloading the RStudio Preview Release.

The first feature to highlight is a fully revamped implementation of code completion for R. We’ve always supported a limited form of completion however (a) it only worked on objects in the global environment; and (b) it only worked when expressly requested via the tab key. As a result not nearly enough users discovered or benefitted from code completion. In this release code completion is much more comprehensive.

Smarter Completion Engine

Previously RStudio only completed variables that already existed in the global environment, now completion is done based on source code analysis so is provided even for objects that haven’t been fully evaluated:

[Screenshot: document-inferred completions]

Completions are also provided…


Writing multiple csv files from an xlsx file

For this example I used an open dataset about “Recycling places”, which you can find on the web page of Portal de datos Públicos.

The data is an xlsx file.

The file has 8 columns, one of which is the town. So the question is:

How do I generate multiple files, one for each town? The answer is simple: R.

Why R? Because you can automate it. It saves you from applying a different filter and saving a new file for each town.

Let’s start:

To read the file you can use the XLConnect package, and to split the data, the plyr package.

You can load the sheet with the function readWorksheet but, as you can see, six of the eight cells in the first row are merged, and that merging is lost when the file is loaded. So we will read the file with headers, rename the columns that come out wrong, and then remove the row without data.

After that, we will use the d_ply function, which lets us split the data. We give it the field on which the split should be based, and then an anonymous function whose argument, sdf, is the subset of rows for one town. That function writes the subset with write.csv and, as a final step, takes the town name from the chosen field and pastes it into the name of the new file.

library(XLConnect) # functions to read Excel files
library(plyr)      # functions to split data

# Load the workbook and read the sheet, using the first row as header
wb <- loadWorkbook("Formato_Puntos_de_almacenamiento_Muni_consolidado.xlsx")
df <- readWorksheet(wb, sheet = "Hoja1", header = TRUE)
colnames(df)[c(5, 6)] <- c("Este UTM", "Norte UTM") # rename the columns affected by the merged cells
df2 <- df[-1, ] # remove the first row, which holds no data

# Split by town (Comuna) and write one csv per town, named after the town
d_ply(df2, .(Comuna),
      function(sdf) write.csv(sdf,
                              file = paste(sdf$Comuna[[1]], ".csv", sep = "")))
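
If you want to try the same split-and-write idea without the xlsx file, here is a minimal sketch in base R. The toy data frame, its values and the temporary output folder are made up for illustration, and split() stands in for d_ply, so take it as an approximation of the approach rather than the exact code above.

```r
# Toy stand-in for the recycling-places table (all values are invented)
places <- data.frame(
  Comuna = c("Santiago", "Providencia", "Santiago", "Maipu"),
  Punto  = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "by_town")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a named list with one data frame per town
by_town <- split(places, places$Comuna)

# Write each piece to "<town>.csv"
for (town in names(by_town)) {
  write.csv(by_town[[town]],
            file = file.path(out_dir, paste0(town, ".csv")),
            row.names = FALSE) # leave out the row-number column
}

written <- sort(list.files(out_dir))
```

Here row.names = FALSE only keeps R’s row numbers out of the files; drop it if you want them.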

A DBI for PostgreSQL on R

Among R’s capabilities is querying databases directly from R. The DBMS I know best is PostgreSQL. What I like about it is that it is an open source object-relational DBMS. It’s simple, and it also has an extension for spatial and geographic objects called PostGIS.

The DBI (Database Interface) package I’ve chosen for querying PostgreSQL is RPostgreSQL. To work with it, I just have to install the package from the repository and use the following code:

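library(RPostgreSQL)

# Load the PostgreSQL driver and open a connection
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host = "localhost", user = "***", password = "***", dbname = "Procesos_UChile")

# Inspect the driver and the connection
dbListConnections(drv)
dbGetInfo(drv)
summary(con)

# Read a whole table into a data frame and take a look at it
df <- dbReadTable(con, "etapas_sept_2013")
summary(df)
head(df, 3)
dbDisconnect(con) # close the connection when done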

This DBI is a nice product, but it’s limited by RAM: dbReadTable loads the whole table into memory, and the problem appeared when I tried to read a table of over 10 GB. So I’m stuck here. I know that this year a library called PivotalR was released, which lets you manage large amounts of data through the MADlib library.
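
One possible workaround, when the analysis can be done piece by piece, is to read the table in pages with LIMIT and OFFSET instead of all at once. The following is only a sketch and has not been tried on the 10 GB table: the helper read_in_pages, its arguments and the ordering column "id" are hypothetical names chosen for illustration. It takes the query function as an argument, so it can be tried without a database.

```r
# Hypothetical helper: read a table page by page with LIMIT/OFFSET, so that
# at most `page_size` rows are held in memory at once.
# `run_query` is any function that takes an SQL string and returns a data
# frame, e.g. function(sql) dbGetQuery(con, sql).
read_in_pages <- function(run_query, table, order_col, page_fun,
                          page_size = 100000) {
  offset <- 0
  repeat {
    # ORDER BY keeps the paging deterministic between queries
    sql <- sprintf("SELECT * FROM %s ORDER BY %s LIMIT %.0f OFFSET %.0f",
                   table, order_col, page_size, offset)
    page <- run_query(sql)
    if (nrow(page) == 0) break         # nothing left to read
    page_fun(page)                     # e.g. aggregate or write to disk
    offset <- offset + nrow(page)
    if (nrow(page) < page_size) break  # last, partial page
  }
  offset                               # total number of rows processed
}

# Usage with an open connection `con` ("id" is an assumed column name):
# n <- read_in_pages(function(sql) dbGetQuery(con, sql),
#                    "etapas_sept_2013", "id",
#                    function(page) print(nrow(page)))
```

Keep in mind that large offsets get slower, because the database still has to skip over the earlier rows, so this only helps up to a point.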

Pivotal is a software company that provides software and services for the development of custom applications for data and analytics based on cloud computing technology.

They also created MADlib, an open-source library for scalable in-database analytics that provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

The next step is to try installing this library on Ubuntu to see how it works. The instructions are at this URL:

https://gist.github.com/thinkerbot/8699369

You can also watch a presentation of PivotalR with a demo on the following video:
https://www.youtube.com/watch?v=6cmyRCMY6j0

Some Reflections on Open Source Software

During my time at university I learned some MapInfo, and ArcView 3.x as the main GIS package, including Avenue as its programming language. And MapObjects too.

In my last years there, the university acquired a couple of licences of the brand new ArcGIS. When I got my hands on it, I was quite excited by the kinds of things you could do with it.

Obviously, the problem with commercial software comes when bugs appear and new versions come out, or when you need it on your own laptop. So, first of all, I found ArcExplorer, but I didn’t like it, because you can’t do much with just a viewer. If all you need is a viewer, just convert your shapefiles to KML.

Just by accident, I came across a GIS called gvSIG. The problem was that it used to crash a lot; I’ve heard that it is now in very good shape and stable. Even at that time it offered some interesting things, and for free. I went back to ArcGIS, but some time later I found Quantum GIS, now renamed QGIS. I was astonished: for me it was better than ArcGIS, it improves every day, and new plugins appear all the time. I’ve been using it ever since.

Through QGIS I discovered GRASS GIS. It’s a bit hard to understand, because its logic isn’t the same as that of other GIS software, but the amazing thing is that you can use it from the command line and, as with QGIS, you can use Python with it.

Besides its great algorithms, GRASS can connect to R. Although I haven’t tried it, I know you can also use SAGA with R. Another geospatial tool is PostGIS, an extension of the PostgreSQL DBMS that lets you work with geometries; another great extension of this DBMS is pgRouting.

Coming from the GIS world, it isn’t very easy for me, but I’m learning R to do spatial analysis and data analysis. It’s amazing what people have done with R, and it’s very fast. What I love about open source is that it isn’t a black box, as tends to happen with commercial software: you can check which algorithms are applied, and it is free.

What I don’t know is whether anyone is developing an open source implementation of the four-step model for transportation planning. If not, we still have to pay for the black box that is TransCAD.

If you’d like to see which packages you can use in R for this, I recommend the following link: http://cran.r-project.org/web/views/Spatial.html

In fact, there are more projects along the open source line. You can check the web page of the GeoDa Center.