`vignettes/point-pattern.Rmd`

`point-pattern.Rmd`

**Note: we set this not to run for travis:**

`knitr::opts_chunk$set(eval = FALSE)`

This vignette builds on the Raster-vector interactions chapter of Geocomputation with R (see r.geocompx.org/raster-vector.html). It shows the the basics of point pattern analysis in R and how to make (raster) ‘heatmaps’ from (vector) point data. It is influenced by the chapter on Spatial Point Pattern Analysis (Bivand, Pebesma, and Gómez-Rubio 2013) and an online tutorial on Point Pattern Analysis by Robert Hijmans.

We will use the **sp** package for this rather than the
newer **sf** package, as point pattern analysis is more
established for the former `Spatial`

class system than the
latter’s `sf`

classes. We will also use
**raster** as it has concise and well-designed functions
for spatial data:

```
pkgs = c(
"tmap",
"raster",
"mapview",
"dismo",
"gstat"
)
i = pkgs[!pkgs %in% installed.packages()]
if(length(i) > 0)
install.packages(i)
lapply(pkgs, library, character.only = TRUE)
```

This chapter uses two datasets on the spatial distribution and some
attributes of cycle hire ‘docking stations’, one from OpenStreetMap and
one from an on-line data feed.^{1} It also uses boundary data representing the
33 boroughs of London. These datasets live in the
**spData** package and can be attached as follows:

```
library(spData)
lnd = as(lnd, "Spatial")
cycle_hire = as(cycle_hire, "Spatial")
cycle_hire_osm = as(cycle_hire_osm, "Spatial")
```

We use the `Spatial`

classes used by the
**sp** dataset, as these are required by point pattern
analysis functions used in the chapter. The majority of the chapter will
use only the official `cycle_hire`

dataset. Towards the end
of the chapter we will compare the two point patterns to see how similar
they are. But first we focus only on the former dataset.

Before undertaking more advanced analysis, it makes sense to start
simple, with basic statistics and visualizations describing point
patterns. As with most analysis tasks this first stage involves visual
inspection of the data. This done in the code chunk below, which
generates the figure below:^{2}

It is immediately clear that the two datasets on cycle hire points
are closely related (they have a high degree of spatial correlation) and
have a distinctive pattern. `cycle_hire`

represents official
data on cycle parking, and will be the main point dataset analysed.
`cycle_hire_osm`

is the community contributed dataset on
cycle hire locations, downloaded from OpenStreetMap. Both sets of points
overlay some of London’s 33 boroughs, the central ones, and seem to
follow the River Thames, especially along the north bank of the river.
But how to describe that information quantitatively, and extrapolate the
values from the plotted location to other areas?

It is the purpose of this chapter to provide the know-how to answer such questions, that should be extensible to many applications that involve point data.

A basic statistic to compute on points within a polygon is the number of points per polygon, and the related statistic of point density. Let’s first compute that for London overall, before doing a zone-by-zone breakdown:

The results show that there are 742 cycle hire points and that London
covers an area of just over one-and-a-half thousand square kilometres (1
km^{2} = 1000000 m^{2} = 1e6 m^{2} in scientific
notation). That represents on average roughly one cycle parking rental
space per 2 square kilometers, or half a rental point per square
kilometer, as revealed by the results of the calculation below:

`nrow(cycle_hire) / lnd_area`

This is not a good indicator of the density of the bike hire scheme
overall, because they are so concentrated in central London. A more
representative result can be found by calculating the average point
density *within the extent of the bike hire scheme*. We can
coerce the bounding box (or extent in **raster**
terminology) of the stations into a polygon whose area can be measured
with the following commands:

```
bb_hire = as(extent(cycle_hire), "SpatialPolygons")
crs(bb_hire) = crs(lnd)
c_area = area(bb_hire) / 1e6
c_area
```

- What is the average point density of cycle hire points within the scheme’s bounding box?

- Why did we add the second line of code in the previous code chunk?
- Why are there two
`crs()`

calls? - The above chunk uses
**raster**functions. How would you write the above code using**sp**code?

A useful level of analysis at which to analyse the geographic distribution of points is the zone-level. We can aggregate the points per zone and provide summary statistics. Starting with the number of points per polygon, this would calculated as follows:

```
crs(lnd) = crs(cycle_hire)
cycle_hire_ag = aggregate(cycle_hire["id"], lnd, FUN = "length")
```

Based on an analysis of the `cycle_hire_ag`

:

- How many zones contain no cycle hire points?
- What is the average number of cycle hire points per zone?

Find the average density of cycle hire points per zone in London.

- Plot the result in an attractive map (e.g. as shown below).
- Find which zone has the highest density of cycle hire points.

A problem with the zonal representation of point density is that the results are dependent on the somewhat arbitrary shapes and sizes of the zones in which the points are aggregated. To overcome this problem we can create a raster representation of the points:

```
r = raster(bb_hire, ncol = 16, nrow = 10)
rc = rasterize(cycle_hire@coords, r, fun = "count")
```

This is already very useful. The results show that there are 5 clusters of cycle parking with much higher density than the surrounding areas. We can visualize these clusters using a static plot as follows:

More useful, in terms of characterising the geographic
characteristics of each cluster, would be to plot these 5 clusters
interactively. Do this with **mapview** (results not
shown):

The resulting interactive plot draws attention to the areas of high point density, such as the area surrounding Victoria station, illustrated below.

Another important characteristic of point patterns is the distances
between points, which can be calculated using **raster**’s
`dist()`

function:

```
d = spDists(cycle_hire, longlat = TRUE)
dm = as.matrix(d)
dm[1:3, 1:5]
```

The results show the distance, in km, form every point to every
other. The `dm`

object is known as a distance matrix: note
the diagonal of zero values. This distance matrix is very useful for
various types of analysis, a couple of which we’ll explore below.

To find the minimum distance of each point to every other, we can use
the `apply`

function, for each row, and then select the top
5:

```
diag(dm) = NA
dmin = apply(X = dm, MARGIN = 1, FUN = min, na.rm = TRUE)
sel_isolated = order(dmin, decreasing = TRUE)[1:5]
# qtm(cycle_hire, col = "grey", main = "Isolated points") +
# qtm(cycle_hire[sel_isolated, ], symbols.col = "red", symbols.size = 2) + tm_scale_bar()
```

Another plot that is useful is that of the ‘G function’ for exploring the extent to which points cluster or separate compared with what would be expected from a random distribution (Bivand, Pebesma, and Gómez-Rubio 2013):

Spatial interpolation refers to methods of estimating the value of something in one place, based on measurements taken elsewhere. Building on the example of cycle hire points in London, we can ask the question: what is the expected number of bikes for a stand in location x, given knowledge of the existing data.

Thus spatial interpolation requires a dependent variable, which is summarised numerically and visually below:

```
summary(cycle_hire$nbikes)
tm_shape(cycle_hire) +
tm_symbols(col = "nbikes", palette = "YlOrRd", alpha = 0.6, style = "quantile")
```

There is a clear spatial pattern to this: there are more bikes parked in the outer docking stations. We can say that verbally, but how to we represent that on the map?

A first port of call would be to rasterise the result, using the the
raster representation of the study area contained in the object
`r`

to find the mean per cell:

```
rnbikes = rasterize(cycle_hire, r, field = "nbikes", fun = mean)
plot(rnbikes)
```

What about estimating the values of cells outside the current network
area? We can use **raster**’s `focal()`

function
to estimate that.

The raster cell method of spatial interpolation was fun, but not that sophisticated or spatially precise, with a resolution of around 1 km.

The next simplest solution is to break the area up into pieces using
the **dismo** package and assign the value of the entire
area to the value of the point it contains:

**gstat** provides a number of functions for spatial
prediction and interpolation using a range of models (Pebesma and Graeler 2018). The most basic of
these, and a workhorse for spatial interpolation is Inverse Distance
Weighting (IDW) (results not shown):

```
library(gstat)
gs = gstat(formula = nbikes~1, locations = cycle_hire)
crs(r) = crs(lnd)
r_idw = interpolate(r, gs)
plot(r_idw)
```

- Look at the original data - what could explain the spatial
distribution of the
`nbikes`

variable? - Experiment with the spatial resolution - we’re using 1 km grid cells which are huge!
- Try using other methods described in Robert Hijman’s tutorial on spatial interpolation.
- Try cross-validating the results. Which performs best?

If your aim is to visualise clusters of points, without worrying
about interpolation, which can be computationally intensive, you can
create a heatmap directly. **leaflet.extras** provides the
function `addHeatmap()`

(among many other useful functions
for interactive data visualization) for this purpose, as illustrated
below.

```
# library(leaflet.extras)
# leaflet() %>%
# addProviderTiles(provider = providers$OpenTopoMap) %>%
# addHeatmap(data = cycle_hire, intensity = cycle_hire$nbikes, blur = 50)
```

An alternative function that achieves a similar result is
`addWebGLHeatmap()`

, illustrated below. A notable feature of
this function is that it allows size to be set in meters rather than
pixels, which prevents the blur from changing when you zoom in:

```
# leaflet() %>%
# addProviderTiles(provider = providers$OpenTopoMap) %>%
# addWebGLHeatmap(data = cycle_hire, size = 700, units = "m", intensity = 0.5)
```

There is much more to learn about point pattern analysis, spatial
interpolation and heatmaps. For point pattern analysis, we recommend
reading-up on the **spatstat** package (e.g. Baddeley, Rubak, and Turner 2015). For
more on interpolation, there are many methods including recent machine
learning approaches (e.g. Hengl et al. 2018;
Meyer et al. 2018) (see reproducible tutorial at https://github.com/Envirometrix/BigSpatialDataR ). For
more on visualising point datasets as heatmaps a recent paper provides
insight into the use of WebGL, applied to road traffic crash data (Ježek et al. 2017).

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. *Spatial Point
Patterns: Methodology and Applications with R*.
CRC Press.

Bivand, Roger, Edzer J Pebesma, and Virgilio Gómez-Rubio. 2013.
*Applied Spatial Data Analysis with R*. Vol.
747248717. Springer.

Hengl, Tomislav, Madlene Nussbaum, Marvin N. Wright, Gerard B. M.
Heuvelink, and Benedikt Gräler. 2018. “Random Forest
as a Generic Framework for Predictive Modeling of Spatial and
Spatio-Temporal Variables.” e26693v3. PeerJ Inc. https://doi.org/10.7287/peerj.preprints.26693v3.

Ježek, Jan, Karel Jedlička, Tom’aš Mildorf, J’achym Kellar, and Daniel
Beran. 2017. “Design and Evaluation of
WebGL-Based Heat Map Visualization for
Big Point Data.” In *The Rise of
Big Spatial Data*, edited by Igor Ivan, Alex Singleton,
Jiří Hor’ak, and Tom’aš Inspektor, 13–26. Lecture Notes in
Geoinformation and Cartography. Springer International
Publishing.

Meyer, Hanna, Christoph Reudenbach, Tomislav Hengl, Marwan Katurji, and
Thomas Nauss. 2018. “Improving Performance of Spatio-Temporal
Machine Learning Models Using Forward Feature Selection and
Target-Oriented Validation.” *Environmental Modelling &
Software* 101 (March): 1–9. https://doi.org/10.1016/j.envsoft.2017.12.001.

Pebesma, Edzer, and Benedikt Graeler. 2018. *Gstat:
Spatial and Spatio-Temporal
Geostatistical Modelling, Prediction and
Simulation*.

Docking stations enable sustainable and fast mobility for people in central London. They are automated bicycle parking facilities which allow people can borrow and return bicycles for a small fee. The sturdy 3-speed bicycles they contain have become known as ‘Boris bikes’ and, among some Londoners, ‘Sadiq cycles’, after the past and present London Mayors Boris Johnson and Sadiq Khan, although their official name is ‘Santander cycles’. See

`?cycle_hire`

for further information about these datasets.↩︎Note the similarities and differences between

**sp**and**sf**plotting code and results. Both extend the`plot()`

command to create a simple static map, meaning that arguments used in base graphics such as`col`

work. However,**sp**’s plot method does not create facets for multiple variables by default and therefore you do not have to specify columns to create a single map.**sp**plots omit a colourscheme by default.↩︎