Matching nearest date

This blogpost is mostly a note to oneself. Some datasets we want to merge or join may not have a variable that matches perfectly. A case in point is if one were to join the logbook and landings dataset by landing date, where the dates can in few cases lag by some two or three days. As of the time this is written there is no available solution in the tidyverse, i.e. a simple join will not work. Hence one needs to move out of that confort zone and dive into the data.table world. Must say the latter is impressive and I have been st(r)aying there more and more.

Author

Affiliation

Einar Hjörleifsson

 

Published

March 16, 2020

Citation

Hjörleifsson, 2020

bla, bla, bla …


library(data.table)
library(lubridate)
library(tidyverse)
library(mar)
con <- connect_mar()

lb <- 
  mar:::lb_base(con) %>%
  mutate(year = year(date)) %>%
  filter(year %in% 1992:2019) %>% 
  select(visir, vid, date, datel) %>%
  distinct() %>% 
  collect(n = Inf) %>% 
  mutate(date = as_date(date),
         datel = as_date(datel)) %>% 
  arrange(vid, datel) %>% 
  drop_na()

lb.dt <- 
  lb %>% 
  select(vid, datel) %>% 
  distinct() %>% 
  setDT()
ln.dt <- 
  mar:::ln_catch(con) %>%
  mutate(year = year(date)) %>%
  filter(year %in% 1992:2019) %>% 
  select(vid, datel = date) %>% 
  distinct() %>% 
  collect(n = Inf) %>% 
  mutate(datel = as_date(datel)) %>% 
  arrange(vid, datel) %>% 
  drop_na() %>% 
  mutate(dummy = datel) %>% 
  setDT()

res3 <- 
  lb.dt[, date.ln := ln.dt[lb.dt, dummy, on = c("vid", "datel"), roll = "nearest"]] %>% 
    as_tibble()

lb %>% 
  left_join(res3) %>% 
  filter(datel != date.ln)

# A tibble: 723,700 x 5
       visir   vid date       datel      date.ln   
       <dbl> <int> <date>     <date>     <date>    
 1 103832051     1 2014-06-05 2014-06-06 2003-03-03
 2 103832052     1 2014-06-11 2014-06-11 2003-03-03
 3 103832053     1 2014-06-12 2014-06-12 2003-03-03
 4 103832054     1 2014-06-16 2014-06-16 2003-03-03
 5 103832055     1 2014-06-18 2014-06-19 2003-03-03
 6  26211928     7 1992-01-05 1992-01-06 1992-09-07
 7  26211926     7 1992-01-03 1992-01-06 1992-09-07
 8  26211927     7 1992-01-04 1992-01-06 1992-09-07
 9  26211930     7 1992-01-08 1992-01-09 1992-09-07
10  26211929     7 1992-01-07 1992-01-09 1992-09-07
# … with 723,690 more rows

res3 %>% 
  mutate(diff = datel - date.ln) %>% 
  count(diff) %>% 
  mutate(diff = as.integer(diff)) %>% 
  mutate(diff = ifelse(diff < -5, -5, diff),
         diff = ifelse(diff >  5,  5, diff)) %>% 
  filter(diff != 0L) %>% 
  ggplot(aes(diff, n)) +
  geom_col()

Footnotes

    Citation

    For attribution, please cite this work as

    Hjörleifsson (2020, March 16). Splatter: Matching nearest date. Retrieved from https://splatter.netlify.com/posts/2020-03-16-matching-nearest-date/

    BibTeX citation

    @misc{hjörleifsson2020matching,
      author = {Hjörleifsson, Einar},
      title = {Splatter: Matching nearest date},
      url = {https://splatter.netlify.com/posts/2020-03-16-matching-nearest-date/},
      year = {2020}
    }