# Spatial Data With R

SciencesPo Intro To Programming 2023

Florian Oswald

29 April, 2024

## Intro

In this lecture we will cover some basics about geospatial data and how to handle it with R. Spatial data is getting always more important, so we need a powerful tool to work with it.

tl;dr

Yes, R is a fully fledged GIS. No, you don’t need an ArchGIS (or other) license to do real work with spatial data (I don’t have one, and I use it for real work 😉).

### Resources

1. Geocomputation with R is our main reference.
2. The sf package vignettes are outstanding.

## Spatial Data Basics

• One prime example of spatial data are of course maps, providing an answer to the age-old question where is what.
• Fundamentally, spatial data still provide an answer to the same question, it is just that the what part has gotten much richer over the years.
• The attribute location may be only one of many other features of information on a certain observation.
• Multiple measurements imply that observations can be observed moving in space.
• There are two fundamentally different ways in which to consider spatial data:

## Spatial Data Types

1. Vector Data

We represent things with points, lines and polygons. We can scale and stretch and transform those easily with mathematical operations. Can increase precision to arbitrary levels (can always zoom in futher).

2. Raster Data

We have fixed-size tiles or cells (like a mosaic, or like pixels), which form a grid. Fixed resolution.

👉 This lecture deals only with Vector Data.

# Vector Data and Coordinate Reference Systems

## Representation of Vector Data

• Basically, we concentrate on a 2-dimensional space, even though three-dimensional spaces can be useful as well (any ideas for examples?)
• In other words, we denote a location with a tuple of coordinates $(x,y)$, or $(x,y,z)$ as the case may be, where each coordinate gives the distance from the origin in each direction. For example, we could represent Paris by the tuple c(2.34,48.85)
• One key question in the context of spatial data concerning planet earth you should ask is: Where is the Origin?
• Another question is, related to the well known fact that the earth is quasi-elipsoid (i.e. a bit like a squashed football and - just to be sure: not flat), how to represent locations in three dimensions on a 2-dimensional map?

## Coordinate Reference Systems (CRS)

• CRSs use longitude and latitude to identify locations.
• One widely used CRS is the World Geodetic System 1984, or WGS84 (used on google maps). It measures angular distance in degrees in a geocentric datum (made for the entire planet).
• longitude measures East-West distance from the Prime Meridian Plane. (left-to-right distance from a starting point)
• latitude measures North-South distance of Equatorial Plane. (up-down distance from a starting point)

## One Standard CRS: WGS84

• The dashed lines are the WGS84 elipsoid coordinate frame
• The blue circle is the origin at $(0,0)$ :
1. 0 degrees longitude (x-direction): Prime Meridian through Greenwhich, London.
2. 0 degrees latitude (y-direction): Equator.

## Paris Where?

1. Search for NTF Lambert North France
2. What does c(600256.4, 127726.4) actually mean?

## Geocentric vs Local Datum

Figure from Geocomputation with R. Geocentric and local geodetic datums shown on top of a geoid (in false color and the vertical exaggeration by 10,000 scale factor). Image of the geoid is adapted from the work of Ince et al. (2019)

# Vector Spatial Data in R

## Working with (Vector) Spatial Data in R

• We rely on a few core libraries.
• sf being the main one. That itself relies on several other lower level libraries.
install.packages("sf")
• Don’t try to build from source unless you know why.
• Let’s try to load the library:
library(sf)
Warning: package 'sf' was built under R version 4.2.3
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
• I highly recommend the package vignettes!
vignette(package = "sf") # see which vignettes are available
vignette("sf1")          # an introduction to the package

## Working with sf 1

Let’s read a shapefile from the sf package:

nc = st_read(system.file("shape/nc.shp", package="sf"))
Reading layer nc' from data source
/Users/floswald/Library/R/arm64/4.2/library/sf/shape/nc.shp'
using driver ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
head(nc[,c("AREA","NAME","FIPS","BIR79")])
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
AREA        NAME  FIPS BIR79                       geometry
1 0.114        Ashe 37009  1364 MULTIPOLYGON (((-81.47276 3...
2 0.061   Alleghany 37005   542 MULTIPOLYGON (((-81.23989 3...
3 0.143       Surry 37171  3616 MULTIPOLYGON (((-80.45634 3...
4 0.070   Currituck 37053   830 MULTIPOLYGON (((-76.00897 3...
5 0.153 Northampton 37131  1606 MULTIPOLYGON (((-77.21767 3...
6 0.097    Hertford 37091  1838 MULTIPOLYGON (((-76.74506 3...

## Working with sf 2

• Notice the geometry column.
• This is basically a geo-referenced data.frame.
plot(nc[,"AREA"])  # plot feature "AREA" (i.e. column 1)

## Working with sf 3

• Works also with ggplot2
library(ggplot2)
ggplot(nc) + geom_sf(aes(fill = AREA)) +
scale_fill_viridis_c(name = "Area")

## Working with sf 4: CRS Transform

ggplot(nc) + geom_sf(aes(fill = AREA)) +
scale_fill_viridis_c(name = "Area")
nc %>%
st_transform("+proj=moll") %>%
ggplot() + geom_sf(aes(fill = AREA)) +
scale_fill_viridis_c(name = "Area") +
ggtitle("Mollweide Projection")

## Geometric Operations with sf 1

• the simple features standard specifies a series of operations.
• the relevant functions start with st_ (for spatio-temporal)
• For 2 geometries x,y we can compute things like st_distance(x,y), st_intersect(x,y), etc
• For single geometries we can do things like st_area(x), st_union(x), st_buffer(x,dist) etc
st_area(st_union(nc))
127025870730 [m^2]
• Ooof, how many square km is that now? 🤔
st_area(st_union(nc)) %>% units::set_units(km2)
127025.9 [km2]

## Geometric Operations with sf 2

# copied from https://github.com/uo-ec607/lectures
nc_centroid = st_centroid(nc)

ggplot(nc) +
geom_sf(fill = "black", alpha = 0.8, col = "white") +
geom_sf(data = nc_centroid, col = "red") + ## Notice how easy it is to combine different sf objects
labs(
title = "Counties of North Carolina",
subtitle = "Centroids in red"
)

## Mapping the Seine 1

# copied from https://github.com/uo-ec607/lectures
# install.packages(c("maps","spData"))
## Get the data
france = st_as_sf(
maps::map('france',
plot = FALSE,
fill = TRUE)
)
data("seine",
package = "spData")

## Make sure they have the same projection
seine = st_transform(seine,
crs = st_crs(france))
# now, make a base plot:
pseine = ggplot() +
geom_sf(data = france,
alpha = 0.8,
fill = "black",
col = "gray50") +
labs(
title = "Administrative regions of France"
)
ggsave(plot = pseine,
"images/seine.png",
width=6, height=6)

## Mapping the Seine 2

# let's add the seine!
pseine2 = pseine +
geom_sf(data = seine, col = "#05E9FF", lwd = 1) +
labs(
title = "Administrative regions of France",
subtitle = "Also showing the Seine, Marne and Yonne rivers"
)
ggsave(plot = pseine2,
"images/seine2.png",
width=6, height=6)

## Intersect two sf objects

seine = st_transform(seine, crs = st_crs(france))
sf_use_s2(FALSE)  # need to turn off because of invalid geometry
france_intersected = st_intersection(france, seine)
head(france_intersected,2)
Simple feature collection with 2 features and 2 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 3.254238 ymin: 48.63712 xmax: 4.872966 ymax: 49.09028
Geodetic CRS:  +proj=longlat +ellps=clrk66 +no_defs +type=crs
ID  name                           geom
Aisne Aisne Marne LINESTRING (3.608053 49.089...
Marne Marne Marne LINESTRING (4.872966 48.637...
pl3 = france_intersected %>%
ggplot() +
geom_sf(alpha = 0.8, aes(fill = ID, col = ID)) +
labs(
title = "Seine, Marne and Yonne rivers",
caption = "Colours depict French administrative regions"
) +
theme(legend.title = element_blank())
ggsave(plot = pl3,"images/seine3.png",
width=7, height=5)

## Join two sf objects

pl4 = st_join(france, seine) %>%
## Get rid of regions with no overlap
dplyr::filter(!is.na(name)) %>%
## Some regions are duplicated b/c two
## branches of the river network flow through them
dplyr::distinct(ID, .keep_all = T) %>%
## pipe into ggplot
ggplot() +
geom_sf(alpha = 0.5,
fill = "#01731f",
col = "#fcb4b3",  # of borders
linewidth = 0.5) + # of borders
geom_sf(data = seine, col = "#05E9FF", lwd = 1) +
labs(title = "Intersected regions only") +
theme_bw()
ggsave(plot = pl4,"images/seine4.png",
width=7, height=5)

• Modify the code chunk on the previous slide.
• We want to have different colors for the shown departements, instead of all “#01731f”.
• I.e. make this for me 👉

d5 = st_join(france, seine) %>%
## Get rid of regions with no overlap
dplyr::filter(!is.na(name)) %>%
## Some regions are duplicated b/c two
## branches of the river network flow through them
dplyr::distinct(ID, .keep_all = T)

my_colors = palette.colors(nrow(d5), palette = "Alphabet")
names(my_colors) <- NULL

## pipe into ggplot
pl5 = ggplot(data = d5) +
geom_sf(aes(fill = ID),
col = "#fcb4b3",  # of borders
linewidth = 0.5) + # of borders
geom_sf(data = seine, col = "#05E9FF", lwd = 1.5) +
labs(title = "Intersected regions only", fill = "Departement") +
theme_bw() +
scale_fill_manual(values = my_colors)
ggsave(plot = pl5,"images/seine5.png",
width=7, height=5)

## Distances

• Another typical question could be:

What’s the (straight-line) distance between 2 points?

As in

What’s the distance between the centroids of the Seine-Maritime and Nievre Departements?

Modifying the plot from the previous task, produce 2 new plots

1. One that colors only the concerned departments, and marks their respective centroids with a point.
2. Another one with the same coloring, but where a straight solid line connects both centroids, and we print the distance in km into the table title.

Hint:

# start from here
p6 = ggplot(d5) + geom_sf()

Desired Outputs

cvec = rep(NA, length(unique(d5$ID))) names(cvec) <- unique(d5$ID)
cvec["Seine-Maritime"] <- "purple"
cvec["Nievre"] <- "brown"

p6 = ggplot(d5) + geom_sf()
p6 = ggplot(d5) + geom_sf(aes(fill = ID))
p6 = p6 + scale_fill_manual(values = cvec, limits= c("Seine-Maritime","Nievre"))
subdeps = d5 %>% dplyr::filter(ID %in% c("Seine-Maritime","Nievre"))
p6 = p6 + geom_sf(data = st_centroid(subdeps))
ggsave(plot = p6, "images/distance1.png", width = 5,height=4)

dists = st_distance(subdeps) %>% units::set_units("km")
coords = st_centroid(subdeps) %>% st_coordinates()
coords = data.frame(lon = coords[1,"X"],
lat = coords[1,"Y"],
lon_end = coords[2,"X"],
lat_end = coords[2,"Y"])
p7 = p6 + geom_segment(data = coords, aes(lon, lat, xend = lon_end, yend = lat_end))
p7 = p7 + ggtitle(paste("Distance between Centroids:",round(dists[1,2],0), "km"))
ggsave(plot = p7, "images/distance2.png",width = 5,height=4)`

# Raster Data

## What’s Different?

• We have a grid (pixels) where each cell contains one single data value - usually our measure of interest.
• We can have multiple layers of measurements (e.g. temperature, humidity and elevation for a grid cell)
• CRS considerations equally apply.
• Remote Sensing Data (e.g. Satelitte images) are often in raster format.