vignettes/tidyverse-pitfalls.Rmd
tidyverse-pitfalls.Rmd
This article is about working with spatial data ‘in the tidyverse’.
By this we mean handling spatial datasets using functions (such as
%>%
and filter()
) and concepts (such as
type stability) from R packages that are part of the metapackage
tidyverse. You should already have an R installation
set-up for spatial data analysis having read
Chapter 2 of the
Geocomputation with R book.
If not take a read there now. In any case the tidyverse can be installed from CRAN with the following command:
install.packages("tidyverse")
The tidyverse works with spatial data thanks to sf which is quite a recent package (first release in 2016). If you have not already got it, get sf with:
install.packages("sf")
The we will also uses a dataset from the spData package, which can be installed with:
install.packages("spData")
Software for ‘data science’ is evolving. As we saw in Chapter 1, R packages ggplot2 and dplyr have become immensely popular. Now they are a part of the tidyverse.
These packages use the ‘tidy data’ principles for consistency and
speed of processing. According to the
vignette("tidy-data")
, in tidy datasets:
Historically spatial R packages have not been compatible with the tidyverse. But this has changed with the release of sf and hard work by Edzer Pebesma and Hadley Wickham to make them work together. As described in Chapter 2, sf combines the functionality of three previous packages: sp, rgeos and rgdal. It has many other advantages, including:
st_*
)The final advantage comes with a warning: watch out for pitfalls! That’s topic of this vignette, as illustrated by the following GIF: it’s easy to imagine spatial data analyses but, as Homer Simson discovered with his BBQ project, doing something complicated is easier said than done, especially when combining packages that have only recently started working together:
Now you know the context and have your R setup sorted, it’s time to begin the practical. Execute the following 2 commands to attach spatial data libraries:
Notice the messages: sf uses some C libraries behind the scenes. raster depends on the older sp package which sf replaces (confusing I know!).
The next step is to attach the tidyverse, which brings us onto the first pitfall.
Just loading the tidyverse reveals a pitfall of using spatial data with the tidyverse that affects the raster package in particular:
The first chunk of output shows that the tidyverse is attaching its packages. ✔ yes we want ggplot2, ✔ we want dplyr etc. But we also get less positive messages. ✖ Doh! there are many conflicts.
In the context of spatial data this may only be a problem if you use
the raster package. The final ✖ shows that
dplyr’s select()
function has boshed
(technically speaking, masked) raster’s select
function. This can cause issues. To avoid this pitfall we suggest using
dplyr::select()
and raster::select()
rather
than just select()
when using this conflicted function name
if you use raster and the
tidyverse.
str(world_sf)
vs
str(world_sp)
Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('SpatialPolygonsDataFrame', 'SpatialPolygons', 'Spatial')"
This pitfall is not specific to the tidyverse but is worth being aware of. Let’s create a buffer around London of 500 km:
lnd_buff = lnd[1, ] %>%
st_transform(crs = 27700) %>% # uk CRS
st_buffer(500000) %>%
st_transform(crs = 4326)
near_lnd = world[lnd_buff, ]
plot(near_lnd$geom)
What is going with the country miles away?
The issue is that some objects have multiple geometries:
st_geometry_type(near_lnd)
Which have more than 1?
data.frame(near_lnd$name_long,
sapply(near_lnd$geom, length))
We can resolve this issue by casting them:
world_poly = world %>%
st_cast(to = "POLYGON")
near_lnd_new = world_poly[lnd_buff, ]
plot(st_geometry(near_lnd_new))
The previous code chunk shows how spatial subsetting works in base R,
with near_lnd = world_poly[lnd_buff, ]
.
You can also do tidy spatial subsetting:
near_lnd_tidy = world %>%
filter(st_intersects(., lnd_buff, sparse = FALSE))
But there are pitfalls:
sparse = FALSE
in the spatial
predicate function)Also affects non-spatial data - tidyverse/dplyr#366
Base R equivalent:
world[world$name_long == "United Kingdom", ]
Question:
identical(
world %>% filter(name_long == "United Kingdom"),
world[world$name_long == "United Kingdom", ]
) # ?
Row names usually don’t matter but they can bite.
u1 = world %>% filter(name_long == "United Kingdom")
u2 = world[world$name_long == "United Kingdom", ]
row.names(u2) = 1
identical(u1, u2)
Error in .subset2(x, i, exact = exact) :
attempt to select less than one element in get1index
But this does:
near_lnd_data = st_set_geometry(near_lnd, NULL)
d = bind_rows(near_lnd_data, near_lnd_data)
d_sf = st_sf(d, geometry = c(near_lnd$geom, near_lnd$geom))
plot(d_sf)
Raster data is not yet closely connected to the tidyverse, however:
pipes
Spatial*
form using
as(my_vector, "Spatial")
before working on raster-vector
interactions