This blogpost is mostly a note to oneself. Some datasets we want to merge or join may not have a variable that matches perfectly. A case in point is if one were to join the logbook and landings dataset by landing date, where the dates can in few cases lag by some two or three days. As of the time this is written there is no available solution in the tidyverse, i.e. a simple join will not work. Hence one needs to move out of that confort zone and dive into the data.table world. Must say the latter is impressive and I have been st(r)aying there more and more.
bla, bla, bla …
library(data.table)
library(lubridate)
library(tidyverse)
library(mar)
con <- connect_mar()
lb <-
mar:::lb_base(con) %>%
mutate(year = year(date)) %>%
filter(year %in% 1992:2019) %>%
select(visir, vid, date, datel) %>%
distinct() %>%
collect(n = Inf) %>%
mutate(date = as_date(date),
datel = as_date(datel)) %>%
arrange(vid, datel) %>%
drop_na()
lb.dt <-
lb %>%
select(vid, datel) %>%
distinct() %>%
setDT()
ln.dt <-
mar:::ln_catch(con) %>%
mutate(year = year(date)) %>%
filter(year %in% 1992:2019) %>%
select(vid, datel = date) %>%
distinct() %>%
collect(n = Inf) %>%
mutate(datel = as_date(datel)) %>%
arrange(vid, datel) %>%
drop_na() %>%
mutate(dummy = datel) %>%
setDT()
res3 <-
lb.dt[, date.ln := ln.dt[lb.dt, dummy, on = c("vid", "datel"), roll = "nearest"]] %>%
as_tibble()
lb %>%
left_join(res3) %>%
filter(datel != date.ln)
# A tibble: 723,700 x 5
visir vid date datel date.ln
<dbl> <int> <date> <date> <date>
1 103832051 1 2014-06-05 2014-06-06 2003-03-03
2 103832052 1 2014-06-11 2014-06-11 2003-03-03
3 103832053 1 2014-06-12 2014-06-12 2003-03-03
4 103832054 1 2014-06-16 2014-06-16 2003-03-03
5 103832055 1 2014-06-18 2014-06-19 2003-03-03
6 26211928 7 1992-01-05 1992-01-06 1992-09-07
7 26211926 7 1992-01-03 1992-01-06 1992-09-07
8 26211927 7 1992-01-04 1992-01-06 1992-09-07
9 26211930 7 1992-01-08 1992-01-09 1992-09-07
10 26211929 7 1992-01-07 1992-01-09 1992-09-07
# … with 723,690 more rows
res3 %>%
mutate(diff = datel - date.ln) %>%
count(diff) %>%
mutate(diff = as.integer(diff)) %>%
mutate(diff = ifelse(diff < -5, -5, diff),
diff = ifelse(diff > 5, 5, diff)) %>%
filter(diff != 0L) %>%
ggplot(aes(diff, n)) +
geom_col()
For attribution, please cite this work as
Hjörleifsson (2020, March 16). Splatter: Matching nearest date. Retrieved from https://splatter.netlify.com/posts/2020-03-16-matching-nearest-date/
BibTeX citation
@misc{hjörleifsson2020matching, author = {Hjörleifsson, Einar}, title = {Splatter: Matching nearest date}, url = {https://splatter.netlify.com/posts/2020-03-16-matching-nearest-date/}, year = {2020} }