On gridding spatial data

Lets say we have some spatial observations and wanted to calculate some summary metric (count, sum, mean, variance, …) based on specified spacial gridding of the raw data.

Author

Affiliation

Einar Hjörleifsson

 

Published

Feb. 3, 2018

Citation

Hjörleifsson, 2018

Preamble

Lets say we have some spatial observations and wanted to calculate some summary metric (count, sum, mean, variance, …) based on specified spacial gridding of the raw data. As an example, one could have 100 observations on a limited spatial scale of some 3 degrees longitude and 1 latitude and wanted to calculate the number of observations that fall within a 0.5 degree longitude and 0.25 degree latitude as depicted in the figure below (blue labels are the counts statistics of the observation within each grid):

In order to solve this problem one needs first to transform each spatial point to a central point based on user’s defined xy-resolution. I.e. we want to transform the observed spatial x and y value (the black points) to a single spatial point (the pink points) as depicted visually in this graph:

Now, solving this problem has been done myriads of times. What is documented here an example of a code flow within the tidyverse environment.

The code

The raw data above were generated as follows:


set.seed(314) # for the love of pi
d <- 
  tibble(lon =    runif(n = 1e2, min = -28, max = -25),
         lat =    runif(n = 1e2, min =  64, max =  65),
         effort = rnorm(n = 1e2, mean = 1000, sd = 200))

Lets break the problem up by solving just one dimension (e.g. the longitude). The challenge is to assign each point to a bin and count the number of incidences that a point falls within the interval. Sort of like what happens when we generate a histogram:


d %>% 
  ggplot() +
  geom_histogram(aes(lon), breaks = seq(-28, -25, by = 0.5), 
                 fill = "grey", colour = "black") +
  geom_jitter(aes(lon, y = 5), width = 0, height = 3, colour = "red") +
  scale_x_continuous(breaks = seq(-28, -25, by = 0.5))

The steps are:

  1. Create some pretty breaks, given the data and the specified resolution.
  2. Find break interval that each data point belongs to.
  3. Calculate the midpoint on the interval (here the longitude) that the each data-point belongs to.
Lets generate the horizontal vector (x) and the resolution:

x <- d$lon
dx <- 0.5
  1. Creating the breaks, using floor and ceiling of the minimum and maximum value in the data to ensure that one has some “pretty” intervals:

brks <- seq(floor(min(x)), ceiling(max(x)), dx)
brks

[1] -28.0 -27.5 -27.0 -26.5 -26.0 -25.5 -25.0

Here we have created a vector that contains the “boundaries” of the bins we want for the horizontal data. The length of the vector is 7 but the bins (intervals) are 6.

  1. To find the interval position that a point belongs to we can use the findInterval function:

# https://github.com/hadley/adv-r/blob/master/extras/cpp/find-interval.cpp
# By using the argument all.inside = TRUE we assign the point to the lower break/interval position.
ints <- findInterval(x, brks, all.inside = TRUE)
ints

  [1] 1 2 5 2 2 2 2 3 4 5 3 1 2 2 4 3 5 4 1 3 4 3 4 2 1 4 2 1 2 4 2 4
 [33] 1 3 3 1 6 6 4 4 3 5 3 6 5 5 4 2 1 2 5 4 4 5 2 3 4 6 3 2 3 5 2 3
 [65] 1 3 4 1 1 3 1 4 6 4 5 4 4 6 1 1 3 3 6 2 3 2 6 5 2 1 4 3 4 4 6 6
 [97] 5 3 3 2

Here we have a vector that states that the 1st observation belongs to the 1st bin, 2nd to the 2nd bin, 3rd to the 5th bin and so on. We can easily count the number of values within each bin by:


table(ints)

ints
 1  2  3  4  5  6 
15 20 21 22 12 10 

The values are equivalent to the counts that are the values on the y-axis in the histogram above. As it stands the intervals/bins are in relative terms, i.e. they are not scaled to the original data (in this case longitude). This we solve in the third step.

  1. The midpoint that a data point belongs to is just the average of the lower and upper boundary of the breaks that a point belongs to:

x <- (brks[ints] + brks[ints + 1]) / 2
x

  [1] -27.75 -27.25 -25.75 -27.25 -27.25 -27.25 -27.25 -26.75 -26.25
 [10] -25.75 -26.75 -27.75 -27.25 -27.25 -26.25 -26.75 -25.75 -26.25
 [19] -27.75 -26.75 -26.25 -26.75 -26.25 -27.25 -27.75 -26.25 -27.25
 [28] -27.75 -27.25 -26.25 -27.25 -26.25 -27.75 -26.75 -26.75 -27.75
 [37] -25.25 -25.25 -26.25 -26.25 -26.75 -25.75 -26.75 -25.25 -25.75
 [46] -25.75 -26.25 -27.25 -27.75 -27.25 -25.75 -26.25 -26.25 -25.75
 [55] -27.25 -26.75 -26.25 -25.25 -26.75 -27.25 -26.75 -25.75 -27.25
 [64] -26.75 -27.75 -26.75 -26.25 -27.75 -27.75 -26.75 -27.75 -26.25
 [73] -25.25 -26.25 -25.75 -26.25 -26.25 -25.25 -27.75 -27.75 -26.75
 [82] -26.75 -25.25 -27.25 -26.75 -27.25 -25.25 -25.75 -27.25 -27.75
 [91] -26.25 -26.75 -26.25 -26.25 -25.25 -25.25 -25.75 -26.75 -26.75
[100] -27.25

Now since we are going to use this repeatedly in a workflow we may as well wrap the above three lines into a convenience function:


grade <- function(x, dx) {
  brks <- seq(floor(min(x)), ceiling(max(x)),dx)
  ints <- findInterval(x, brks, all.inside = TRUE)
  x <- (brks[ints] + brks[ints + 1]) / 2
  return(x)
}

To put the function into use we simply do:


d %>% 
  mutate(glon = grade(lon, dx = dx)) %>% 
  group_by(glon) %>% 
  summarise(n = n()) %>% 
  knitr::kable()
glon n
-27.75 15
-27.25 20
-26.75 21
-26.25 22
-25.75 12
-25.25 10

We have the same count statistics as we got from the table call above, only now we have related the counts to a value of our data, rather than to a seqential bin-value.

We can use the grade-function to assign both the longitudinal and latitudinal observations to a specified xy-grid within in a single mutation-call:


d <- 
  d %>% 
  mutate(glon = grade(lon, dx = 0.50),
         glat = grade(lat, dx = 0.25))
d %>% slice(1:10) %>%  knitr::kable()
lon lat effort glon glat
-27.70350 64.30833 971.2615 -27.75 64.375
-27.18557 64.55467 1290.4014 -27.25 64.625
-25.70042 64.68599 860.2981 -25.75 64.625
-27.32607 64.18164 780.2983 -27.25 64.125
-27.39267 64.74890 1033.2841 -27.25 64.625
-27.09022 64.66584 965.6144 -27.25 64.625
-27.27755 64.74509 850.0575 -27.25 64.625
-26.88663 64.11706 576.9763 -26.75 64.125
-26.34812 64.60306 965.1902 -26.25 64.625
-25.75884 64.79979 1112.7509 -25.75 64.875

One obtains a dataframe with the same amount of observations as the original, we have just added two more variables. One could now proceed with creating some summary statistics based on the gridded xy-values:


df <-
  d %>% 
  group_by(glon, glat) %>% 
  summarise(n = n(),
            m = mean(effort),
            s = sum(effort),
            sd = sd(effort))
# etc.
df %>% ungroup() %>% slice(1:10) %>% knitr::kable()
glon glat n m s sd
-27.75 64.125 5 1022.5096 5112.5478 229.6725
-27.75 64.375 4 1137.8015 4551.2061 114.0223
-27.75 64.625 3 891.8389 2675.5167 224.4080
-27.75 64.875 3 1027.4963 3082.4889 116.3074
-27.25 64.125 4 984.9683 3939.8731 136.6450
-27.25 64.375 1 999.6678 999.6678 NA
-27.25 64.625 10 998.4746 9984.7459 170.6212
-27.25 64.875 5 932.1593 4660.7964 207.5640
-26.75 64.125 5 934.0566 4670.2830 203.0497
-26.75 64.375 5 984.8071 4924.0354 115.7967

So we have condensed the original 100 observations to a dataset that contains 23 observations (6 bins in the x-dimension times 4 in the y-dimension, minus one “missing” grid dimension). And now one is ready to plot any statistics of interest, here we just choose to display the counts by colour codes using the ggplot raster function adding the original data as well:


df %>% 
  ungroup() %>% 
  ggplot(aes(glon, glat)) +
  geom_raster(aes(fill = s)) +
  geom_point(data = d, aes(lon, lat), colour = "white", size = 0.5) +
  geom_segment(data = d, aes(xend = lon, yend = lat), colour = "white") +
  geom_text(aes(label = n), colour = "blue") +
  scale_fill_viridis_c(option = "B", direction = -1) +
  scale_x_continuous(breaks = seq(-28, -25, by = 0.5)) +
  coord_quickmap() +
  labs(x = NULL, y = NULL, fill = "Effort")

If one is interested in displaying other type of statistics, one could replace the variable name in the fill-argument in the geom_raster call above.

A case examples

Here we take an example of 618454 records of geoposition from activities of bottom trawlers operating in waters northwest of Iceland. The position is automatically recorded every 5 minutes or so and the raw data look something like this:


vms %>% slice(1:10) %>% knitr::kable()
id date lon lat
-834837 2016-12-13 03:51:45 -24.56281 66.87381
-834837 2016-12-13 03:56:04 -24.55106 66.87589
-834837 2016-12-13 04:02:04 -24.53802 66.87763
-834837 2016-12-13 04:06:14 -24.52499 66.87937
-834837 2016-12-13 04:12:14 -24.51141 66.88040
-834837 2016-12-13 04:16:24 -24.49784 66.88143
-834837 2016-12-13 04:22:24 -24.48393 66.88092
-834837 2016-12-13 04:26:54 -24.47002 66.88041
-834837 2016-12-13 04:32:54 -24.46468 66.88093
-834837 2016-12-13 04:37:04 -24.45934 66.88145

The id is a reference to individual tows (there are 16580 tows in the database).

Now lets use the grade convenience function above and calculate the number of observations (here effort) at a xy-resolution of 0.01 x 0.005 degrees:


dx <- 0.01
vms <-
  vms %>% 
  mutate(lon = grade(lon, dx),
         lat = grade(lat, dx/2)) %>% 
  group_by(lon, lat) %>% 
  summarise(n = n())

We now have created a summary statistic containing 66303 records that looks something like this:


# code just to display first 10 records:
vms %>% ungroup() %>% slice(1:10) %>% knitr::kable()
lon lat n
-26.995 65.8375 1
-26.995 65.8475 2
-26.995 65.8525 2
-26.995 65.8575 2
-26.995 65.8625 2
-26.995 65.8725 2
-26.995 65.8875 2
-26.995 65.8925 4
-26.995 65.9025 2
-26.995 65.9075 6

Here the lon and the lat refer to the center positioning within the specified xy-grid and n refers to the sum of the numbers of trawling operation recorded within that grid (effort).

Now we let ggplot2 do the rest, just adding an anonymous “island” and 100, 200 and 400 meter depth contour as a reference:


vms %>% 
  ggplot() +
  theme_bw() +
  geom_raster(aes(lon, lat, fill = n)) +
  geom_polygon(data = geo::island,
               aes(lon, lat, group = NULL), fill = "grey") +
  geom_path(data = geo::gbdypi.400,
            aes(lon, lat, group = NULL), colour = "grey") +
  geom_path(data = geo::gbdypi.200,
            aes(lon, lat, group = NULL), colour = "grey") +
  geom_path(data = geo::gbdypi.100,
            aes(lon, lat, group = NULL), colour = "grey") +
  scale_fill_viridis_c(option = "B", direction = -1) +
  scale_x_continuous(breaks = seq(-27, -21, by = 1)) +
  scale_y_continuous(breaks = seq(65.5, 67.5, by = 0.5)) +
  coord_quickmap(xlim = range(vms$lon), ylim = range(vms$lat)) +
  labs(x = NULL, y = NULL, fill = "Effort") +
  theme(legend.position = c(0.92, 0.2))

Footnotes

    Citation

    For attribution, please cite this work as

    Hjörleifsson (2018, Feb. 3). Splatter: On gridding spatial data. Retrieved from https://splatter.netlify.com/posts/2018-02-03-on-gridding-spatial-data/

    BibTeX citation

    @misc{hjörleifsson2018on,
      author = {Hjörleifsson, Einar},
      title = {Splatter: On gridding spatial data},
      url = {https://splatter.netlify.com/posts/2018-02-03-on-gridding-spatial-data/},
      year = {2018}
    }