Encontre intervalos sobrepostos em grupos e retenha os maiores períodos sem sobreposição
Problema Eu tenho um dataframe agrupado com intervalos sobrepostos (data como ymd). Desejo reter apenas os maiores intervalos não sobrepostos em cada grupo.
Dados de exemplo
# Packages
library(tidyverse)
library(lubridate)
# Example data
df <- tibble(
group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
start = as_date(
c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
end = as_date(
c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
mutate(intval = interval(start, end),
intval_length = intval / days(1))
df
#> # A tibble: 9 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC 28
#> 2 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 3 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 4 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 5 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 6 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 7 3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC 6
#> 8 3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC 10
#> 9 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]
df1
#> # A tibble: 6 x 5
#> group start end intval intval_length
#> <dbl> <date> <date> <Interval> <dbl>
#> 1 1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC 89
#> 2 1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC 41
#> 3 2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC 30
#> 4 2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC 124
#> 5 3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC 36
#> 6 3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC 41
Abordagem atual Eu encontrei uma questão relacionada em outro tópico (consulte: Encontrar datas dentro de um intervalo de período por grupo ). No entanto, a solução respectiva identifica todas as linhas sobrepostas por grupo. Dessa forma, não consigo identificar os maiores intervalos não sobrepostos.
df$overlap <- unlist(tapply(df$intval, #loop through intervals
df$group, #grouped by id
function(x) rowSums(outer(x,x,int_overlaps)) > 1))
Como exemplo, considere o grupo 3 em meus dados de exemplo. Aqui, as linhas 6/7 e 8/9 se sobrepõem. Como as linhas 6 e 9 são os maiores períodos sem sobreposição, gostaria de remover as linhas 7 e 8.
Eu apreciaria muito se alguém pudesse me indicar uma solução.
Respostas
Tendo pesquisado problemas relacionados ao stackoverflow, descobri que as seguintes abordagens (aqui: Recolher e mesclar intervalos de tempo sobrepostos ) e (aqui: Como nivelar / mesclar períodos de tempo sobrepostos ) podem ser adaptadas para o meu problema.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE