Monthly Archives: December 2014

Writing multiple csv files from an xlsx

For this example I used an open dataset about “Recycling places”, which you can find on the web page of Portal de datos Públicos.

The data is an xlsx file:

[Image: xlsx_wordpress, a screenshot of the xlsx file]

The file has 8 columns, and one of them is the town. So now the question is:

How do I generate multiple files, one for each town? The answer is simple: R.

Why R? Because you can automate it. It saves you from applying a different filter and saving a new file for each town.

Let’s start:

To read the file you can use the XLConnect package, and to split the data, the plyr package.

You can load the file with the readWorksheet function but, as you can see, six of the eight cells in the first row are merged. That merged layout is lost when the file is loaded, so some columns end up with the wrong names. Thus, we will read the file with headers, rename the problem columns, and then remove the row that holds no data.

After that, we will use the d_ply function, which splits the data and applies a function to each piece. We give it the field on which the split should be based. Each piece reaches our anonymous function as the argument sdf, and there we write it out with write.csv. As a final step, we take the value of the chosen field from each piece and paste it into the name of the new file.

library(XLConnect) # functions to read Excel files
library(plyr)      # functions to split data

wb <- loadWorkbook("Formato_Puntos_de_almacenamiento_Muni_consolidado.xlsx")
df <- readWorksheet(wb, sheet = "Hoja1", header = TRUE)
colnames(df)[c(5, 6)] <- c("Este UTM", "Norte UTM") # rename the problem columns
df2 <- df[-1, ] # remove the first row, which holds no data

# Split by Comuna (town) and write one csv per town, named after the town
d_ply(df2, .(Comuna),
      function(sdf) write.csv(sdf,
                              file = paste(sdf$Comuna[[1]], ".csv", sep = "")))
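
If you want to try the split-and-write idea without the xlsx file or any package, here is a minimal sketch in base R. It is not the code used for the real data: the data frame places, its values, the Direccion column and the output folder are all made up for illustration, and split() with a loop stands in for d_ply.

```r
# Toy stand-in for the recycling data; the values and the Direccion column
# are made up for illustration and are not taken from the real file.
places <- data.frame(
  Comuna = c("Maipu", "Providencia", "Maipu", "Santiago"),
  Direccion = c("Calle 1", "Calle 2", "Calle 3", "Calle 4"),
  stringsAsFactors = FALSE
)

# Write the files to a temporary folder so the example leaves nothing behind.
out_dir <- file.path(tempdir(), "csv_by_town")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a named list with one data frame per distinct Comuna.
by_town <- split(places, places$Comuna)

# Write each piece to "<town>.csv"; the list names supply the file names.
for (town in names(by_town)) {
  write.csv(by_town[[town]],
            file = file.path(out_dir, paste0(town, ".csv")),
            row.names = FALSE)
}

# Show the files that were created.
list.files(out_dir)
```

The row.names = FALSE argument is a choice made for this sketch: it keeps R's row numbers out of the csv files.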

A DBI for PostgreSQL on R

Among the capabilities of R is the possibility of querying databases from within R. The DBMS I know best is PostgreSQL. What I like about it is that it is an open source object-relational DBMS. It’s simple to use, and it also has an extension for spatial and geographic objects called PostGIS.

Thus, the DBI (Database Interface) package I’ve chosen for querying PostgreSQL is RPostgreSQL. To work with it, I just have to install the package from CRAN and use the following code:

library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host = "localhost", user= "***", password="***", dbname="Procesos_UChile")

dbListConnections(drv)
dbGetInfo(drv)
summary(con)

df = dbReadTable(con,'etapas_sept_2013')
summary(df)
head(df, 3)
dbDisconnect(con)

This DBI package is a nice product, but it’s limited by RAM, because dbReadTable loads the whole table into memory. The problem appeared when I tried to read a table of over 10 GB, so I’m stuck here. I know that a library called PivotalR was released this year, which lets you manage large amounts of data with the MADlib library.
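
A side note on that memory limit: one partial workaround is to let PostgreSQL do the counting or the row limiting and bring back only the result, instead of reading the full table. This is a minimal sketch, assuming an open connection like con from the code above. The helper names limited_query and preview_table are made up for this example, and the table name is pasted into the SQL as-is, so it should only be used with names you trust.

```r
# Hypothetical helper: build a query that returns at most n rows of a table.
# The table name is interpolated directly, so pass only trusted names.
limited_query <- function(table, n) {
  sprintf("SELECT * FROM %s LIMIT %d", table, as.integer(n))
}

# Hypothetical helper: run the limited query on an open DBI connection.
# dbGetQuery() sends the SQL and returns the result as a data frame.
preview_table <- function(con, table, n = 1000) {
  DBI::dbGetQuery(con, limited_query(table, n))
}

# Print the SQL that would be sent for the table used above.
limited_query("etapas_sept_2013", 1000)

# With an open connection (needs a running PostgreSQL server):
# n_rows <- DBI::dbGetQuery(con, "SELECT count(*) AS n FROM etapas_sept_2013")
# first_rows <- preview_table(con, "etapas_sept_2013", n = 1000)
```

This only helps when a count, a sample or a filtered subset is enough; it does not make a 10 GB table fit in memory.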

Pivotal is a software company that provides software and services for the development of custom applications for data and analytics based on cloud computing technology.

They also made MADlib, an open-source library for scalable in-database analytics that provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

The next step is to try installing this library on Ubuntu to see how it works. The instructions are at this URL:

https://gist.github.com/thinkerbot/8699369

You can also watch a presentation of PivotalR with a demo in the following video:
https://www.youtube.com/watch?v=6cmyRCMY6j0

Some Reflections about Open Source Software

During my time at university, I learned some MapInfo, and ArcView 3.x as the main GIS package, including Avenue as its programming language. Well, and MapObjects too.

In my last years there, the university acquired a couple of licences of the brand new ArcGIS. When I got my hands on it, I was a bit excited about the kind of things you could do with it.

Obviously, the problem with commercial software comes when bugs appear and new versions come out, or when you need it on your own laptop. So, first of all, I found ArcExplorer, but I didn’t like it, because you can’t do much with just a viewer. If you are looking for a viewer, just transform your shapefiles into KML.

Just by accident, I came across a GIS called gvSIG. Its problem was that it used to crash a lot, though I’ve heard that it is now in very good shape and stable. Even at that time, it had some interesting features, and for free. I went back to ArcGIS, but some time after that I found Quantum GIS, now renamed QGIS. I was astonished: for me it was better than ArcGIS, it’s improving every day, and you can see how often new plugins appear. Well, I’ve been using it ever since.

Through QGIS itself, I discovered the existence of GRASS GIS. It’s a bit hard to understand, because its logic isn’t the same as that of other GIS software, but the amazing thing about it is that you can use it from the command line and, as with QGIS, you can use Python with it.

Besides the great algorithms of GRASS, there is the ability to connect it with R. Although I haven’t tried it, I know you can also use SAGA with R. Another tool for geospatial work is PostGIS, an extension of the PostgreSQL DBMS that lets you work with geometry; another great thing about this DBMS is pgRouting.

As I come from the GIS world, it isn’t very easy for me, but I’m learning R to do spatial analysis and data analysis. It’s amazing what people have done with R, and it’s very fast. What I love about open source is that it is not a black box, as tends to happen with commercial software. With open source software you can check which algorithms are applied, and it is free.

What I don’t know is whether anyone is developing an open source implementation of the four-step model for transportation planning. If not, we still have to pay for the black box that is TransCAD.

If you’d like to see which packages you can use in R for spatial work, I recommend the following link: http://cran.r-project.org/web/views/Spatial.html

In fact, there are more projects along the open source line. You can check the web page of the GeoDa Center.