Match points to interval

We have two datasets: 1) Captains' logbook data, where each activity (e.g. a fish trawl haul) is a record with variables such as start and end time as well as geographical start and end coordinates. 2) Vessel location data (VMS/AIS), where each record is a spatial point in time. The VMS/AIS data are relatively frequent, with short time intervals between records (5 to 10 minutes), while the logbook data are relatively sparse, with often relatively large time intervals (i.e. fishing activity is often in units of hours). This post demonstrates how one can match the two datasets, containing multiple vessels, using the foverlaps function from the data.table package.

Author

Affiliation

Einar Hjörleifsson

 

Published

March 4, 2020

Citation

Hjörleifsson, 2020


library(data.table)
library(sf)
library(mapdeck)
library(lubridate)
library(tidyverse)

The example data

To convey the concept and method we start with a relatively simple example dataset, turning to a larger, more realistic dataset further down in this post.

The VMS/AIS example dataset contains 8,918 AIS/VMS records of time and position for four vessels that participated in the Icelandic spring survey in 2019. The data are available on the web and can be read in and viewed as follows:


vms <-
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_vms2019.csv") %>%
  arrange(time) %>% 
  select(vid, lon, lat, speed, heading, time)
glimpse(vms)

Observations: 8,918
Variables: 6
$ vid     <dbl> 2350, 2350, 2350, 2350, 2350, 1277, 1277, 1277, 127…
$ lon     <dbl> -21.86028, -21.86028, -21.89164, -21.92865, -21.932…
$ lat     <dbl> 64.15597, 64.15597, 64.16098, 64.15326, 64.15257, 6…
$ speed   <dbl> 1.4, 1.4, 8.1, 3.7, 3.1, 0.0, 7.0, 8.0, 8.0, 7.9, 1…
$ heading <dbl> 69.1, 69.1, 252.0, 240.0, 266.1, 0.0, 126.4, 117.1,…
$ time    <dttm> 2019-02-27 08:07:46, 2019-02-27 08:07:46, 2019-02-…

The variable vid is the vessel identification number and lon and lat are the spatial coordinates; the other variables are self-explanatory.

The logbook dataset contains 581 trawl hauls from the four vessels. The data are available on the web and can be read in and viewed as follows:


lgs <-
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_stations.csv") %>%
  filter(year == 2019) %>%
  arrange(t1, t2) %>% 
  select(id.lgs = id, vid, lon.start = lon1, lat.start = lat1,
         lon.end = lon2, lat.end = lat2, start = t1, end = t2)
glimpse(lgs)

Observations: 581
Variables: 8
$ id.lgs    <dbl> 490630, 490631, 490632, 490633, 490634, 490635, 4…
$ vid       <dbl> 1277, 1277, 1277, 1277, 1277, 1277, 1277, 1277, 1…
$ lon.start <dbl> -14.53033, -14.69833, -14.84933, -14.72167, -15.1…
$ lat.start <dbl> 64.25350, 64.08383, 63.91233, 63.74267, 63.66183,…
$ lon.end   <dbl> -14.53783, -14.83517, -14.83267, -14.66750, -15.2…
$ lat.end   <dbl> 64.18767, 64.05383, 63.84600, 63.68033, 63.66650,…
$ start     <dttm> 2019-02-27 17:54:00, 2019-02-27 19:55:00, 2019-0…
$ end       <dttm> 2019-02-27 18:59:00, 2019-02-27 20:59:00, 2019-0…

The variable id.lgs is a unique haul identifier that is used to link auxiliary haul data (e.g. catch by species) not shown here. Together with vid and the start and end times of the haul operation, these are the key variables used when merging the logbook dataset with the vms dataset.
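Before turning to foverlaps, the matching logic itself can be sketched with a brute-force dplyr join on toy data (all names and values here are made up for illustration): join pings to hauls on vessel id, then keep the pings whose time falls within [start, end]. This is quadratic within each vessel, so it only serves as a sanity check for what follows, not as a scalable method.

```r
library(dplyr)

# Toy stand-ins for the vms and logbook tables (made-up values)
vms_toy <- tibble(vid = c(1, 1, 2),
                  time = as.POSIXct(c("2019-03-13 10:00:00",
                                      "2019-03-13 11:00:00",
                                      "2019-03-13 10:30:00"), tz = "UTC"))
lgs_toy <- tibble(vid = 1,
                  id.lgs = 99,
                  start = as.POSIXct("2019-03-13 10:30:00", tz = "UTC"),
                  end   = as.POSIXct("2019-03-13 11:30:00", tz = "UTC"))

# Join on vessel id, then keep pings inside the haul interval
matched <-
  vms_toy %>%
  inner_join(lgs_toy, by = "vid") %>%
  filter(time >= start, time <= end)
matched
# only the 11:00 ping of vessel 1 falls within the haul
```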

A visual peek at the two datasets

A spatial view of the two datasets can be generated as follows:


# converting the logbooks to long format in order to generate spatial lines,
#  done for visualization
tows <- 
  lgs %>% 
  select(id.lgs, x1 = lon.start, x2 = lon.end, y1 = lat.start, y2 = lat.end) %>% 
  pivot_longer(-id.lgs,
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)") %>% 
  arrange(id.lgs) %>% 
  st_as_sf(coords = c("x", "y"),
           crs = 4326) %>% 
  group_by(id.lgs) %>% 
  summarise(do_union = FALSE) %>% 
  st_cast("LINESTRING")

mapdeck(style = 'mapbox://styles/mapbox/dark-v9',
        location = c(-22.9, 66.15),
        zoom = 8) %>% 
  add_path(data = tows,
           stroke_colour = "#AAFFFFFF",
           stroke_width = 1000,
           update_view = FALSE) %>% 
  add_scatterplot(data = vms,
                  lon = "lon", 
                  lat = "lat",
                  radius = 250,
                  fill_colour = "speed",
                  palette = "inferno",
                  update_view = FALSE)

In the figure above the cyan coloured lines represent the tows (logbook data) while the dots are records in the AIS/VMS data, with colour indicating vessel speed. What we are interested in is identifying as fishing the records (dots) in the AIS/VMS data that fall within the start and end of a tow.

Now, here the problem will not be solved by spatial acrobatics but by time. A visual temporal representation of the two datasets can be made via the following code:


ggplot() +
  theme_bw() +
  geom_rect(data = lgs,
            aes(xmin = start, xmax = end),
            ymin = -Inf, ymax = Inf,
            fill = "cyan",
            alpha = 0.5) + 
  geom_point(data = vms, aes(time, speed), size = 1) +
  geom_text(data = lgs,
            aes(start + (end - start) / 2, 12, label = id.lgs),
            angle = 90,
            size = 3,
            colour = "red",
            hjust=1) +
  facet_grid(vid ~ .) +
  scale_x_datetime(limits = c(ymd_hms("2019-03-13 00:00:00"),
                              ymd_hms("2019-03-14 00:00:00")))

The plot shows one day of activity: the cyan rectangles represent the haul intervals from the logbook, the points the vessel speed from the AIS/VMS records, and the red labels the haul identifiers (id.lgs).

The main objective is to assign the logbook tow id to the vms-points that fall within the interval between start and end of the hauls.

Matching the point to interval

In the above visualization the logbook and the vms data are still independent of each other; the "matching" is only done by adding layers to a plot. We now turn to our main objective, merging the two datasets.

We first convert the two datasets into class data.table:


vms.dt <-
  vms %>%
  arrange(time) %>% 
  mutate(dummy = time) %>% 
  data.table()
lgs.dt <-
  lgs %>%
  arrange(start, end) %>%
  data.table()

In the above we create a dummy time variable for the vms-data because the merging is effectively based on intervals. In the case of the vms-data the interval is effectively "instantaneous", the variables time and dummy being identical.

The next step is to sort the data.tables and mark them as such.


setkey(vms.dt, vid, time, dummy)
setkey(lgs.dt, vid, start, end)

Including the vessel identification number (vid) when setting up the keys makes it act as a grouping variable: only records from the same vessel are matched against each other.
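The three steps above (dummy column, keys, foverlaps) can be sketched end-to-end on toy data (all values made up): each ping becomes a zero-width interval [time, dummy], which foverlaps then overlaps with the haul intervals [start, end], per vessel.

```r
library(data.table)

# Toy pings and a single toy haul for one vessel (made-up values)
pings <- data.table(vid = 1,
                    time = as.POSIXct(c("2019-03-13 10:00:00",
                                        "2019-03-13 11:00:00",
                                        "2019-03-13 12:00:00"), tz = "UTC"))
pings[, dummy := time]  # zero-width interval: start == end
hauls <- data.table(vid = 1,
                    id.lgs = 99,
                    start = as.POSIXct("2019-03-13 10:30:00", tz = "UTC"),
                    end   = as.POSIXct("2019-03-13 11:30:00", tz = "UTC"))

# Last two key columns define the intervals to overlap
setkey(pings, vid, time, dummy)
setkey(hauls, vid, start, end)
res <- foverlaps(pings, hauls, nomatch = NA)
res
# only the 11:00 ping gets id.lgs 99; the other two get NA
```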

We are now ready to merge the data using the foverlaps-function:


vms.lgs <- 
  foverlaps(vms.dt, lgs.dt, nomatch = NA)
nrow(vms) == nrow(vms.lgs)

[1] TRUE

Take note that the number of rows in the merged dataset is the same as in the original vms dataset.

Let's see what we have created (only showing variables of interest):


vms.lgs %>% 
  slice(196:203) %>% 
  select(vid, lon:time, start, end, id.lgs)

   vid       lon      lat speed heading                time
1 1131 -20.60729 65.99670   3.5   165.3 2019-03-03 03:22:23
2 1131 -20.59650 65.98422   3.5   166.7 2019-03-03 03:32:43
3 1131 -20.58834 65.97404   3.8   160.5 2019-03-03 03:42:53
4 1131 -20.57970 65.96388   3.8   159.4 2019-03-03 03:53:01
5 1131 -20.57065 65.95401   3.8   158.9 2019-03-03 04:03:03
6 1131 -20.56121 65.94395   3.6   159.1 2019-03-03 04:13:23
7 1131 -20.55119 65.93396   3.9   157.6 2019-03-03 04:23:32
8 1131 -20.54151 65.92394   2.7   164.5 2019-03-03 04:33:33
                start                 end id.lgs
1                <NA>                <NA>     NA
2 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
3 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
4 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
5 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
6 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
7 2019-03-03 03:32:00 2019-03-03 04:33:00 490612
8                <NA>                <NA>     NA

The variables lon, lat, speed and heading are from the vms-dataset while the variable id.lgs (the unique tow identifier) is from the logbook dataset. The variable vid (common to both sets) together with time (from the vms) and start and end (from the logbook dataset) were used to merge the data. The above shows only the merged time series around one haul (unique identifier 490612), with the first and the last data-points from the vms dataset falling just outside the time interval of that haul (hence the missing values in the logbook variables).
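With the merged table in hand, a natural next step (not shown in the post) is to summarise the pings within each haul, e.g. ping counts and mean speed. A minimal sketch with a toy stand-in that has the same columns as vms.lgs:

```r
library(dplyr)

# Toy stand-in for the merged table: NA id.lgs means the ping fell
# outside any haul interval (made-up values)
merged <- tibble(id.lgs = c(NA, 490612, 490612, 490612, NA),
                 speed  = c(8.1, 3.5, 3.8, 3.6, 7.9))

haul_summary <-
  merged %>%
  filter(!is.na(id.lgs)) %>%        # keep only pings matched to a haul
  group_by(id.lgs) %>%
  summarise(n_pings = n(),
            mean_speed = mean(speed))
haul_summary
# one haul with 3 pings, mean speed 3.63
```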

We can now view this merged dataset as was done above, but using colour to indicate the vms-points that constitute the actual fishing operations (coloured by the variable id.lgs):


vms.lgs %>% 
  filter(between(time, ymd_hms("2019-03-13 00:00:00"), ymd_hms("2019-03-14 00:00:00"))) %>% 
  ggplot() +
  theme_bw() +
  geom_rect(aes(xmin = start, xmax = end),
            ymin = -Inf, ymax = Inf,
            fill = "grey",
            alpha = 0.25) + 
  geom_point(aes(time, speed), size = 1) +
  geom_point(aes(time, speed, colour = as.character(id.lgs)), size = 1) +
  facet_grid(vid ~ .) +
  ggmisc::scale_colour_crayola() +
  theme(legend.position = "none")

Spatially we have (zooming into one area):


vms.lgs %>%
  as_tibble() %>%
  select(-dummy) %>% 
  mutate(fishing = ifelse(!is.na(id.lgs), TRUE, FALSE)) %>% 
  mutate(id.lgs = ifelse(!is.na(id.lgs), id.lgs, -999)) %>% 
  ggplot() +
  theme_bw() +
  geom_path(data = geo::bisland, aes(lon, lat)) +
  geom_point(aes(lon, lat, colour = as.character(id.lgs)),
             size = 1) +
  coord_quickmap(xlim = c(-25, -23), ylim = c(66, 66.5)) +
  theme(legend.position = c(0.5, 0.4)) +
  #scale_colour_brewer(palette = "Set1") +
  ggmisc::scale_colour_crayola() +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "none")

How does foverlaps scale?

Here we are going to use data that reside in a database to test how scalable the foverlaps function is. The format of the data is the same as in the example above, hence detailed explanations are not given in the text below. The data used are AIS/VMS and logbook data for the Icelandic commercial fishing fleet over the years 2012 to 2019.


library(mar)
con <- connect_mar()
vms <- 
  stk_trail(con) %>% 
  mutate(year = year(time)) %>% 
  filter(year %in% 2012:2019) %>% 
  inner_join(mar:::stk_mid_vid(con) %>% 
               select(mid, vid) %>% 
               filter(!is.na(vid))) %>% 
  select(vid, time, lon, lat, speed) %>% 
  collect(n = Inf)
glimpse(vms)

Observations: 104,355,636
Variables: 5
$ vid   <dbl> 2690, 1674, 2712, 237, 1202, 2820, 1102, 253, 2790, 2…
$ time  <dttm> 2012-01-04 06:19:33, 2012-01-04 06:19:38, 2012-01-04…
$ lon   <dbl> -23.45612, -24.95392, -23.63809, -13.16564, -25.13772…
$ lat   <dbl> 66.35745, 65.41512, 65.33578, 64.49120, 65.45225, 64.…
$ speed <dbl> 1.2, 3.6, 6.5, 7.1, 5.7, 1.4, 1.6, 1.7, 5.9, 9.5, 10.…

So we have 104,355,636 vms records.


mbl <- 
  mar:::lb_base(con) %>% 
  mutate(year = year(date)) %>% 
  filter(year %in% 2012:2019) %>% 
  inner_join(mar:::lb_mobile(con)) %>% 
  select(id.lgs = visir,
         vid,
         gid,
         start = t1,
         end = t2) %>% 
  filter(!is.na(start), !is.na(end)) %>% 
  collect(n = Inf)
sta <- 
  mar:::lb_base(con) %>% 
  mutate(year = year(date)) %>% 
  filter(year %in% 2012:2019) %>% 
  inner_join(mar:::lb_static(con)) %>% 
  select(id.lgs = visir,
         vid,
         gid,
         start = t1,
         end = t2) %>% 
  filter(!is.na(start), !is.na(end)) %>% 
  collect(n = Inf)
lgs <- 
  bind_rows(mbl, sta) %>% 
  filter(start < end)
glimpse(lgs)

Observations: 686,714
Variables: 5
$ id.lgs <dbl> 79508956, 79508957, 79508958, 79498112, 79498113, 79…
$ vid    <int> 2190, 2190, 2190, 616, 616, 616, 616, 616, 616, 616,…
$ gid    <int> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, …
$ start  <dttm> 2012-02-07 09:15:00, 2012-02-07 21:45:00, 2012-02-0…
$ end    <dttm> 2012-02-07 20:00:00, 2012-02-08 08:00:00, 2012-02-0…

And we have 686,714 logbook records.


vms.dt <- 
  vms %>% 
  arrange(time) %>% 
  mutate(dummy = time) %>% 
  data.table()
lgs.dt <-
  lgs %>%
  arrange(start, end) %>%
  data.table()
setkey(vms.dt, vid, time, dummy)
setkey(lgs.dt, vid, start, end)
system.time(vms.lgs <- 
  foverlaps(vms.dt, lgs.dt, nomatch = NA))

   user  system elapsed 
 50.185  13.758  30.231 

That was impressively fast¹. Mind you, this was run on a computer with 251G of memory, and the preprocessing prior to this last step took some tens of minutes.
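How such extra rows can arise is easy to reproduce on toy data (made-up numbers): when two logbook intervals overlap, e.g. because of a mis-recorded end time, a ping inside the overlap matches both hauls and appears twice in the output.

```r
library(data.table)

# Two pings; the second haul starts before the first one ends (overlap 10-12)
pings <- data.table(vid = 1, time = c(5, 11))
pings[, dummy := time]
hauls <- data.table(vid = 1,
                    start = c(0, 10),
                    end   = c(12, 20),
                    id.lgs = c(1, 2))
setkey(pings, vid, time, dummy)
setkey(hauls, vid, start, end)
res <- foverlaps(pings, hauls, nomatch = NA)
nrow(res)
# 3 rows from 2 pings: the ping at time 11 matched both hauls
```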

So, now we can make a visualization of the fishing activity for some selected gears in March 2019, based on the AIS/VMS geographical coordinates:


vms.lgs %>% 
  as_tibble() %>% 
  mutate(year = year(time),
         month = month(time)) %>% 
  # just so that the html file is not too big
  filter(!is.na(id.lgs),
         gid %in% c(1, 6, 14),
         year == 2019,
         month %in% 3,
         # some ad hoc extra filtering in case start and end are 
         #  not correct
         between(speed, 1.5, 6),
         between(lon, -30, 10),
         between(lat, 61, 69)) %>% 
  select(lon, lat, speed, gid) %>% 
  mutate(gear = case_when(gid == 1 ~ "Longline",
                          gid == 6 ~ "Demersal fishtrawl",
                          gid == 7 ~ "Pelagic trawl",
                          gid == 9 ~ "Nephrops trawl",
                          gid == 14 ~ "Shrimp trawl")) %>% 
  mapdeck(style = 'mapbox://styles/mapbox/dark-v9') %>% 
  add_scatterplot(lon = "lon", 
                  lat = "lat",
                  radius = 250,
                  fill_colour = "gear",
                  palette = "topo",
                  legend = TRUE)
[Interactive map legend, gear: Demersal fishtrawl, Longline, Shrimp trawl]

That’s all for now.

Footnotes

  1. Note: the number of records in the merged data is not the same as in the original vms data (somewhat higher in the former); this is most likely a result of multiple overlaps caused by incorrect recording of start and end times in the logbooks. Something to look at later.[↩]

Citation

For attribution, please cite this work as

Hjörleifsson (2020, March 4). Splatter: Match points to interval. Retrieved from https://splatter.netlify.com/posts/2020-03-04-match-points-to-interval/

BibTeX citation

@misc{hjörleifsson2020match,
  author = {Hjörleifsson, Einar},
  title = {Splatter: Match points to interval},
  url = {https://splatter.netlify.com/posts/2020-03-04-match-points-to-interval/},
  year = {2020}
}