This is a coding diary rather than a sub-section. It helps me keep track of useful tips and chunks of code I stumble across during my research and/or for fun. Each post can be accessed through the menu on your left.
ArcGIS and QGIS are amazing pieces of software. However, they come with a relatively steep learning curve that you might not necessarily want to climb right away. If you are familiar with R, dplyr, and ggplot, producing high-quality maps from shapefiles is pretty straightforward. The following example is based on a municipal shapefile you can download from the ISTAT website as a zipped bundle. If you ever use this code for your own research, would you mind citing this page?
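If you prefer to do everything from R, you can also download and unpack the bundle programmatically. Here is a minimal sketch; the URL below is a placeholder, so grab the real link from the ISTAT page:
# Hypothetical URL: substitute the actual link from the ISTAT website.
download.file('https://www.istat.it/path/to/Comuni2011.zip',
              destfile = 'Comuni2011.zip', mode = 'wb')
unzip('Comuni2011.zip')  # Extracts Comuni2011.shp and its companion files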
# Use getwd() to get your working directory
# Use setwd() if you wish to change your working directory
# install.packages('ggplot2')
library(ggplot2)
# install.packages('dplyr')
library(dplyr)
# install.packages('rgdal')
library(rgdal)
# install.packages('broom')
library(broom)
# install.packages("viridis")
library(viridis)
remove(list = ls(all = T))
# Always set seed for reproducible examples.
set.seed(1234)
muni <- readOGR('Comuni2011.shp', 'Comuni2011', stringsAsFactors = FALSE, verbose = FALSE)
# After having read the data, have a look at its structure. In RStudio just click on the object you have just created.
# Take note of the variables you see in muni@data; you'll need one of them (`COM`) later on.
# Note: this proj4 string corresponds to EPSG:4121 (GGRS87, the Greek datum);
# if you just want plain WGS84 long-lat, '+proj=longlat +datum=WGS84 +no_defs' works too.
longlat <- '+init=epsg:4121 +proj=longlat +ellps=GRS80 +datum=GGRS87 +no_defs +towgs84=-199.87,74.79,246.62'
muni <- SpatialPolygonsDataFrame(spTransform(muni, CRS(longlat)), muni@data)
# Passing region = 'COM' is important: it defines the id of each polygon in the output.
muni_map <- tidy(muni, region = 'COM')
# We're going to create a vector of 8094 random values (one for each Italian municipality) to colour our map.
# In this case I'm asking R to draw 8094 numbers from a normal distribution with mean = 100 and standard
# deviation = 10. You can also use other functions like sample() or runif(). I then create a data.frame with
# two columns: the random values and the ids taken from muni@data.
df <- data.frame(randomv = rnorm(8094, mean = 100, sd = 10),
                 id = unique(muni@data$COM))
# We are now ready to get the map using ggplot:
ggplot() +
  geom_map(data = df,
           map = muni_map,
           aes(fill = randomv, map_id = id)) +
  scale_fill_viridis('Random values', discrete = FALSE) +
  coord_map() + ggthemes::theme_map() +
  theme(legend.position = 'bottom') +
  # This second, invisible layer supplies x/y so coord_map() knows the extent.
  geom_map(data = muni_map, map = muni_map,
           aes(x = long, y = lat, map_id = id),
           colour = NA, fill = NA, size = 0.5)
# You can then save the image using the following:
# ggsave('map.pdf', dpi = 300, width = 7.5, height = 9)
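Of course, in real research you will want to map actual data rather than random draws. Since dplyr is loaded, you can prepare your own indicator by matching it to the COM codes; a minimal sketch, where my_data and population are hypothetical placeholders for your own data.frame and variable:
# my_data is a hypothetical data.frame with a COM code and a population column.
plot_df <- my_data %>%
  rename(id = COM) %>%
  mutate(id = as.character(id))
ggplot() +
  geom_map(data = plot_df, map = muni_map,
           aes(fill = population, map_id = id)) +
  scale_fill_viridis('Population', discrete = FALSE) +
  coord_map() + ggthemes::theme_map() +
  theme(legend.position = 'bottom')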
Raster maps are another cool example of the amazing things you can do in R. Say you have downloaded a massive data set containing 0.5 x 0.5 (or whatever else) gridded data and you would like to get a map out of it. In general, when reading meteorological files into R, you usually end up dealing with lists of matrices whose rows represent latitude (going from -90 to 90 by 0.5) and whose columns represent longitude (going from -180 to 180 or from 0 to 360). As an example, you can download this .csv file. It is a long-lat matrix that represents the 2 metre temperature of the globe recorded by ECMWF on January 1st, 1979.
# install.packages('ggplot2')
library(ggplot2)
# install.packages('dplyr')
library(dplyr)
# install.packages('reshape2')
library(reshape2)
# install.packages("viridis")
library(viridis)
a <- read.csv('raster_example.csv', header = T, stringsAsFactors = F, row.names = 1, check.names = F)
# When you have data in that format, you just need to melt it and use ggplot to create your map.
# What melt() is doing under the hood is pretty simple: it literally melts your data into a ggplot-friendly
# format; in other words, it recreates all the possible long-lat pairs, fits them into two columns, and then
# stores the temperature values in a third one.
# One thing to notice: always make sure you convert your object with data.matrix() first.
a <- melt(data.matrix(a), varnames = c('latitude', 'longitude'))
ggplot(a, aes(x = longitude, y = latitude)) +
  geom_raster(aes(fill = value)) +
  scale_fill_viridis('Temperature, K', discrete = FALSE, na.value = 'white') +
  theme(legend.position = 'bottom',
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.background = element_blank())
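If you would like coastlines on top of the raster, ggplot2's borders() helper can be layered over it (it pulls its polygons from the maps package, so that needs to be installed). A minimal sketch, which assumes the grid runs from -180 to 180; a 0 to 360 grid would need shifting first:
# install.packages('maps')
ggplot(a, aes(x = longitude, y = latitude)) +
  geom_raster(aes(fill = value)) +
  borders('world', colour = 'grey30', size = 0.2) +  # coastline overlay
  scale_fill_viridis('Temperature, K', discrete = FALSE, na.value = 'white')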
Historical censuses, statistical yearbooks, and the like can be a goldmine for researchers. Nevertheless, it is often the case that hidden gems require an impressive amount of time - and funding - to be digitised. Here is a (cheeky) way of scraping data off pdfs, provided that some sort of text has already been coded into them.
# Use getwd() to get your working directory
# Use setwd() if you wish to change your working directory
# install.packages('pdftools')
library(pdftools)
remove(list = ls(all = T))
# pdf_text() extracts text data from a pdf. In this case, I selected the first 10 pages of file.pdf, ran
# pdf_text(), and stored the result in a character vector, txt. Each page corresponds to an element of txt,
# which can be accessed with the usual indexing syntax.
txt <- pdf_text('file.pdf')[1:10]
# I then write a function that creates 10 separate data.frames, each of which contains the cat() output
# (i.e. the page printed with minimal conversion) captured from the terminal.
f <- function(x){
  data.frame(Page = capture.output(cat(txt[[x]])))
}
# I lapply() f to each page and get a list of 10 data.frames - that I give names to, for simplicity.
A <- lapply(1:10, f)
names(A) <- paste0('Page_', 1:10)
# The last trick is to write different .csv files for each page, again using lapply.
lapply(1:10, function(x) write.csv(A[[x]],
                                   file = paste0(names(A[x]), '.csv'),
                                   row.names = FALSE))
The output will still need some cleaning, but you won't have to copy it yourself!
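That said, if the columns of your table are separated by runs of whitespace, you can often split each captured line directly rather than cleaning the .csv by hand. A minimal sketch, assuming a whitespace-delimited layout:
# Split each line of page 1 on runs of two or more spaces.
lines <- capture.output(cat(txt[[1]]))
cells <- strsplit(trimws(lines), '\\s{2,}')
# Keep the rows with the full set of columns and bind them into a matrix.
tab <- do.call(rbind, cells[lengths(cells) == max(lengths(cells))])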
What happens when there is absolutely no text embedded in a pdf? Well, you can use OCR, i.e. optical character recognition! This technology is quite fun, and often impressive. Here is a basic example of how to perform good quality OCR in R, free of charge.
# install.packages('tesseract')
# install.packages('magick')
library(tesseract)
library(magick)
remove(list = ls(all = T))
# Say you want to OCR a png image. You can read it in using the `magick` library. Incidentally, ImageMagick is
# an advanced piece of software that makes image processing and editing easy. The approach is much less
# 'graphical' than Photoshop and the like, but it does shed some light on the mathematical/geometrical side of
# things. Have a look here for a complete guide:
# https://cran.r-project.org/web/packages/magick/vignettes/intro.html
sample <- image_read('sample_table.png')
image_info(sample)
# It is often good practice to convert your .png into a .tiff. Anecdotal evidence points in the direction of OCR being more effective
# when dealing with this format.
image_write(sample,
            path = 'sample_table.tiff',
            format = 'tiff',
            quality = 100,
            flatten = TRUE)
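Before calling the OCR engine, a bit of preprocessing with magick can also help: greyscaling, upscaling, and sharpening often make the glyphs easier to read. Whether each step improves things depends on your scan, so treat this as a sketch to experiment with:
sample <- sample %>%
  image_convert(colorspace = 'gray') %>%  # drop colour information
  image_resize('2000x') %>%               # upscale so glyphs have more pixels
  image_contrast(sharpen = 1)             # mildly sharpen the edges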
In my own experience, OCR performs a lot better when given two essential pieces of information: a language and a set of characters to look for. Say you want to OCR a table: letters and numbers might be tricky to distinguish. In this case, I set the language to English and use the tessedit_char_whitelist option to tell R that it is dealing with a table containing numbers only.
data <- ocr(sample,
            engine = tesseract('eng',
                               options = list(tessedit_char_whitelist = '0123456789.-',
                                              tessedit_pageseg_mode = 'auto',
                                              textord_tabfind_find_tables = '1',
                                              textord_tablefind_recognize_tables = '1')))
# Then use cat(data) to see what the result is.
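Because the whitelist restricts the output to digits, dots, and dashes, the string that ocr() returns can often be parsed straight into a data.frame; a minimal sketch, assuming the rows survived with the same number of columns:
# ocr() returns one string with embedded newlines; read.table() can parse it.
tab <- read.table(text = data, header = FALSE, fill = TRUE)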
You can also use the tabulizer package and, in particular, its locate_areas() function, which opens up a window within RStudio where you simply draw the contour of the area you need to extract:
# install.packages('tabulizer')
library(tabulizer)
# locate_areas() returns the coordinates of the rectangle(s) you draw; store them
# and pass them to extract_tables() through its area argument.
areas <- locate_areas('file.pdf', pages = 1)
table <- extract_tables('file.pdf',
                        encoding = 'UTF-8',
                        pages = 1,
                        area = areas,
                        guess = FALSE)
# extract_tables() returns a list of matrices, one per table found.
write.csv(table[[1]], 'table.csv', row.names = FALSE)
You can use the following when reading and binding all the files in a given folder. In this case I’m (f)reading large .csv files before binding them together.
# install.packages('data.table')
library(data.table)
files <- list.files(path = '.',  # replace '.' with your folder
                    pattern = '\\.csv$',
                    full.names = TRUE)
temp <- lapply(files, fread, sep = ',')
data <- rbindlist(temp)
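If you later need to know which file each row came from, rbindlist() can add an index column for you; a minimal sketch:
names(temp) <- basename(files)
data <- rbindlist(temp, idcol = 'source')  # 'source' stores each row's file name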
x <- c('gaspare_1_2', 'gaspare_1/2_3_4')
gsub("^.*?_", "", x) # Remove everything until first underscore
## [1] "1_2" "1/2_3_4"
gsub("^.*_", "", x) # Remove everything until last underscore
## [1] "2" "4"
gsub("_.*?$", "", x) # Remove everything after first underscore
## [1] "gaspare" "gaspare"
gsub("_.$", "", x) # Remove everything after last underscore
## [1] "gaspare_1" "gaspare_1/2_3"
gsub("[_]", "", x) # Removes underscore
## [1] "gaspare12" "gaspare1/234"
gsub("[^_]", "", x) # Removes everything but the undercores
## [1] "__" "___"
gsub('^.', '', x) # Removes the initial character of the string
## [1] "aspare_1_2" "aspare_1/2_3_4"
gsub('.$', '', x) # Removes the last character of the string
## [1] "gaspare_1_" "gaspare_1/2_3_"
gsub('gaspare', '', x) # Removes the literal string gaspare
## [1] "_1_2" "_1/2_3_4"
gsub('[^gaspare]', '', x) # Removes every character that is not one of the letters in gaspare
## [1] "gaspare" "gaspare"
When dealing with basic string manipulation, you can (and probably should!) also use the stringr package. It's pretty handy if you want to quickly remove/keep/replace all the blanks, punctuation signs, digits, alphanumeric characters, etc.
library(stringr)
str_replace_all(x, '[[:blank:]]', '') # Replaces all spaces with ''
## [1] "gaspare_1_2" "gaspare_1/2_3_4"
str_replace_all(x, '[[:punct:]]', '') # Replaces all punctuation signs with ''
## [1] "gaspare12" "gaspare1234"
str_replace_all(x, '[[:digit:]]', '') # Replaces all digits with ''
## [1] "gaspare__" "gaspare_/__"
Have a look at this page for more: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html