This is a coding diary rather than a sub-section. It helps me keep track of useful tips and chunks of code I stumble across during my research and/or for fun. Each post can be accessed through the menu on your left.

Shapefiles in R, mapping!

ArcGIS and QGIS are amazing pieces of software. However, they come with a relatively steep learning curve that you might not necessarily want to climb right away. If you are familiar with R, dplyr, and ggplot, producing high-quality maps from shapefiles is pretty straightforward. The following example is based on a municipal shapefile you can download from the ISTAT website as a zipped bundle. If you ever use this code for your own research, would you mind citing this page?

# Use getwd() to get your working directory
# Use setwd() if you wish to change your working directory

# install.packages('ggplot2')
library(ggplot2)
# install.packages('dplyr')
library(dplyr)
# install.packages('rgdal')
library(rgdal)
# install.packages('broom')
library(broom)
# install.packages("viridis")
library(viridis)

remove(list = ls(all = T))

# Always set seed for reproducible examples.

set.seed(1234) 

muni <- readOGR('Comuni2011.shp', 'Comuni2011', stringsAsFactors = FALSE, verbose = FALSE)

# After reading the data, have a look at its structure. In RStudio, just click on the object you have just created.
# Take note of the variables you see in muni@data; you'll need one of them (`COM`) later on.
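# A quick way to do the same from the console:
# str(muni@data)
# head(muni@data$COM)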

longlat <- '+init=epsg:4121 +proj=longlat +ellps=GRS80 +datum=GGRS87 +no_defs +towgs84=-199.87,74.79,246.62'

muni <- SpatialPolygonsDataFrame(spTransform(muni, CRS(longlat)), muni@data)

# Passing region = 'COM' is important because it defines the id of each polygon.

muni_map <- tidy(muni, region = 'COM') 

# We're going to create a vector of random values (one for each Italian municipality) that we will use to colour our map.
# In this case I'm asking R to draw the numbers from a normal distribution with mean = 100 and standard deviation = 10.
# You can also use other functions like sample() or runif(). I then create a data.frame with two columns: the random values
# and the ids taken from the shapefile's COM field.

df <- data.frame(randomv = rnorm(length(unique(muni@data$COM)), mean = 100, sd = 10),
                 id = unique(muni@data$COM))
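# A couple of alternatives for the random values, using the functions mentioned above (pick whichever you like):
# df$randomv <- sample(50:150, nrow(df), replace = TRUE)
# df$randomv <- runif(nrow(df), min = 0, max = 1)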

# We are now ready to get the map using ggplot:

ggplot() +
  geom_map(data = df, 
           map = muni_map, aes(fill = randomv,
                               map_id = id)) + 
  scale_fill_viridis('Random values', discrete = F) +
  coord_map() + ggthemes::theme_map() +
  theme(legend.position = 'bottom') +
  geom_map(data = muni_map, map = muni_map, aes(x = long, y = lat, map_id = id), 
           colour = NA, fill = NA, size = 0.5) 

# You can then save the image using the following:
# ggsave('map.pdf', dpi = 300, width = 7.5, height = 9)
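A quick side note: the rgdal package has since been retired from CRAN, so the code above may no longer install on a recent R setup. A minimal sketch of the same map using the sf package instead - assuming the same Comuni2011.shp file and its COM field - could look like this:

# install.packages('sf')
library(sf)

muni_sf <- st_read('Comuni2011.shp', quiet = TRUE)            # read the shapefile into an sf object
muni_sf <- st_transform(muni_sf, 4326)                        # reproject to long-lat (WGS84)
muni_sf$randomv <- rnorm(nrow(muni_sf), mean = 100, sd = 10)  # one random value per municipality

ggplot(muni_sf) +
  geom_sf(aes(fill = randomv), colour = NA) +
  scale_fill_viridis('Random values', discrete = F) +
  ggthemes::theme_map() +
  theme(legend.position = 'bottom')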

Raster maps

Raster maps are another cool example of the amazing things you can do in R. Say you have downloaded a massive data set containing 0.5 x 0.5 (or whatever else) gridded data and you would like to get a map out of it. In general, when reading meteorological files into R, you usually end up dealing with lists of matrices whose rows represent latitude - going from -90 to 90 by 0.5 - and columns represent longitude - going from -180 to 180 or from 0 to 360. As an example, you can download this .csv file. It is a long-lat matrix that represents the 2 metre temperature of the globe recorded by ECMWF on January 1st, 1979.

# install.packages('ggplot2')
library(ggplot2)
# install.packages('dplyr')
library(dplyr)
# install.packages('reshape2')
library(reshape2)
# install.packages("viridis")
library(viridis)

a <- read.csv('raster_example.csv', header = T, stringsAsFactors = F, row.names = 1, check.names = F)

# When you have data in that format you just need to melt it and use ggplot to create your map.
# What melt() does under the hood is pretty simple. It literally melts your data into a ggplot-friendly format; in other
# words, it recreates all the possible long-lat pairs, fits them into two columns, and then stores the temperature values in a third one.

# One thing to notice: always make sure that you convert your object with data.matrix() first.

a <- melt(data.matrix(a), c('latitude', 'longitude'))
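# If you want to see exactly what melt() produces, try it on a tiny made-up 2 x 3 matrix first:
# m <- matrix(1:6, nrow = 2, dimnames = list(c('-45', '45'), c('-120', '0', '120')))
# melt(m, c('latitude', 'longitude'))
# The result has three columns - latitude, longitude, and value - which is exactly what ggplot needs.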

ggplot(a, aes(x = longitude, y = latitude)) + geom_raster(aes(fill = value)) + 
  scale_fill_viridis('Temperature, K', discrete = F, na.value = 'white') +
  theme(legend.position = 'bottom',
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  theme(panel.background = element_blank())

Scraping pdfs in R

Historical censuses, statistical yearbooks, and the like can be a goldmine for researchers. Nevertheless, it is often the case that hidden gems require an impressive amount of time - and funding - to be digitised. Here is a (cheeky) way of scraping data off pdfs, provided that some sort of text layer has already been embedded in them.

# Use getwd() to get your working directory
# Use setwd() if you wish to change your working directory

# install.packages('pdftools')
library(pdftools)

remove(list = ls(all = T))

# pdf_text() extracts text data from a pdf. In this case, I run pdf_text() on file.pdf, keep the first 10 pages,
# and store the result in a character vector, txt. Each page corresponds to an element of txt, which can be accessed
# with the usual indexing syntax.

txt <- pdf_text('file.pdf')[1:10] 
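# To eyeball a single page (say, the first one) in the console:
# cat(txt[[1]])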

# I then write a function that creates 10 separate data.frames, each of which contains the terminal output of
# cat() - print with less conversion - captured line by line with capture.output().

f <- function(x){
  data.frame(Page = capture.output(cat(txt[[x]])))
}

# I lapply() f to each page and get a list of 10 data.frames - that I give names to, for simplicity.

A <- lapply(1:10, f)
names(A) <- paste0('Page_', 1:10)

# The last trick is to write different .csv files for each page, again using lapply.

lapply(1:10, function(x) write.csv(A[[x]], 
                                   file = paste0(names(A[x]), '.csv'),
                                   row.names = FALSE))
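# Optional: if you would rather have a single file, you can stack the pages first
# (the output file name below is just a placeholder):
# all_pages <- do.call(rbind, A)
# write.csv(all_pages, 'all_pages.csv', row.names = FALSE)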

The output still needs to be cleaned, but you won't have to copy it yourself!

Simple OCR

What happens when there is absolutely no text embedded in a pdf? Well, you can use OCR, i.e. optical character recognition! This technology is quite fun - and often impressive. Here is a basic example of how to perform good quality OCR in R, free of charge.

# install.packages('tesseract')
# install.packages('magick')
library(tesseract)
library(magick)

remove(list = ls(all = T))

# Say you want to OCR a png image. You can read it in using the `magick` library. Incidentally, ImageMagick is an advanced piece of
# software that makes image processing and editing easy. The approach is much less 'graphical' than Photoshop and the like but does
# shed some light on the mathematical/geometrical side of things. Have a look here for a complete guide:
# https://cran.r-project.org/web/packages/magick/vignettes/intro.html

sample <- image_read('sample_table.png')
image_info(sample)

# It is often good practice to convert your .png into a .tiff. Anecdotal evidence points in the direction of OCR being more effective
# when dealing with this format.

image_write(sample, 
            path = 'sample_table.tiff', 
            format = 'tiff', 
            quality = 100, 
            flatten = T)
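# OCR also tends to work a bit better after some light preprocessing. A possible, optional sketch
# using magick itself (greyscale conversion plus upscaling; tweak or skip as needed):
# sample <- image_convert(sample, type = 'Grayscale')
# sample <- image_resize(sample, '2000x')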

In my own experience, OCR performs a lot better when given two essential pieces of information: a language and a set of characters to look for. Say that you want to OCR a table. Letters and numbers might be tricky to distinguish. In this case, I set the language to English and use the tessedit_char_whitelist option to tell R that it is dealing with a table containing numbers only.

data <- ocr(sample, 
            engine = tesseract('eng',
                               options = list(tessedit_char_whitelist = '0123456789.-',
                                              tessedit_pageseg_mode = 'auto',
                                              textord_tabfind_find_tables = '1',
                                              textord_tablefind_recognize_tables = '1')))

# Then use cat(data) to see what the result is.
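# The OCR result is returned as a single character string; to split it into one element per line
# (one per table row, more or less), something like this usually works:
# rows <- strsplit(data, '\n')[[1]]
# rows <- rows[rows != '']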

You can also use the tabulizer package:

  • Say you have a massive pdf comprising 1000+ pages and you are only interested in a few tables. You can restrict your focus through the locate_areas() function. It opens up a window within RStudio and you simply draw the contour of the area you need to extract;
  • You can easily extract tables and transfer the output to a .csv.
# install.packages('tabulizer')
library(tabulizer)

locate_areas('file.pdf', pages = 1) 

table <- extract_tables('file.pdf', 
                         encoding = 'UTF-8', 
                         pages = 1,
                         guess = F)

# extract_tables() returns a list with one element per table it finds, so write out the first one:
write.csv(table[[1]], 'table.csv', row.names = F)

Fread all files in folder

You can use the following to read and bind all the files in a given folder. In this case I'm (f)reading large .csv files before binding them together.

# install.packages('data.table')
library(data.table)

files <- list.files(path = '',            # put the path to your folder here
                    pattern = '\\.csv$',  # regular expression matching the file extension
                    full.names = T)
temp <- lapply(files, fread, sep = ',')
data <- rbindlist(temp)
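# Optional: if you also want to record which file each row came from, rbindlist() has an idcol argument:
# names(temp) <- basename(files)
# data <- rbindlist(temp, idcol = 'source')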

Some REGEX magic and string stuff

x <- c('gaspare_1_2', 'gaspare_1/2_3_4')
    
gsub("^.*?_", "", x) # Remove everything until first underscore
## [1] "1_2"     "1/2_3_4"
gsub("^.*_", "", x) # Remove everything until last underscore
## [1] "2" "4"
gsub("_.*?$", "", x) # Remove everything after first underscore
## [1] "gaspare" "gaspare"
gsub("_.$", "", x) # Remove everything after last underscore
## [1] "gaspare_1"     "gaspare_1/2_3"
gsub("[_]", "", x) # Removes underscore
## [1] "gaspare12"    "gaspare1/234"
gsub("[^_]", "", x) # Removes everything but the undercores
## [1] "__"  "___"
gsub('^.', '', x) # Removes the initial character of the string
## [1] "aspare_1_2"     "aspare_1/2_3_4"
gsub('.$', '', x) # Removes the last character of the string
## [1] "gaspare_1_"     "gaspare_1/2_3_"
gsub('[gaspare]', '', x) # Removes every character that appears in 'gaspare'
## [1] "_1_2"     "_1/2_3_4"
gsub('[^gaspare]', '', x) # Removes everything except the characters that appear in 'gaspare'
## [1] "gaspare" "gaspare"

When dealing with basic string manipulation, you can - and probably should! - also use the stringr package. It's pretty handy if you want to quickly remove/keep/replace all the blanks, punctuation signs, digits, alphanumeric characters, etc.

library(stringr)

str_replace_all(x, '[[:blank:]]', '') # Replaces all spaces with ''
## [1] "gaspare_1_2"     "gaspare_1/2_3_4"
str_replace_all(x, '[[:punct:]]', '') # Replaces all punctuation signs with ''
## [1] "gaspare12"   "gaspare1234"
str_replace_all(x, '[[:digit:]]', '') # Replaces all digits with ''
## [1] "gaspare__"   "gaspare_/__"

Have a look at this page for more: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html