Encontre intervalos sobrepostos em grupos e retenha os maiores períodos sem sobreposição

Nov 30 2020

Problema Eu tenho um dataframe agrupado com intervalos sobrepostos (data como ymd). Desejo reter apenas os maiores intervalos não sobrepostos em cada grupo.

Dados de exemplo

# Packages
library(tidyverse)
library(lubridate)

# Example data
df <- tibble(
  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  start = as_date(
    c("2019-01-10", "2019-02-01", "2019-10-05", "2018-07-01", "2019-01-01", "2019-10-01", "2019-10-01", "2019-11-30","2019-11-20")),
  end = as_date(
    c("2019-02-07", "2019-05-01", "2019-11-15", "2018-07-31", "2019-05-05", "2019-11-06", "2019-10-07", "2019-12-10","2019-12-31"))) %>%
  mutate(intval = interval(start, end),
         intval_length = intval / days(1))

df
#> # A tibble: 9 x 5
#>   group start      end        intval                         intval_length
#>   <dbl> <date>     <date>     <Interval>                             <dbl>
#> 1     1 2019-01-10 2019-02-07 2019-01-10 UTC--2019-02-07 UTC            28
#> 2     1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC            89
#> 3     1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC            41
#> 4     2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC            30
#> 5     2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC           124
#> 6     3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC            36
#> 7     3 2019-10-01 2019-10-07 2019-10-01 UTC--2019-10-07 UTC             6
#> 8     3 2019-11-30 2019-12-10 2019-11-30 UTC--2019-12-10 UTC            10
#> 9     3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC            41

# Goal
# Row: 1 and 2; 6 to 9 have overlaps; Keep rows with largest intervals (in days)
df1 <- df[-c(1, 7, 8),]

df1
#> # A tibble: 6 x 5
#>   group start      end        intval                         intval_length
#>   <dbl> <date>     <date>     <Interval>                             <dbl>
#> 1     1 2019-02-01 2019-05-01 2019-02-01 UTC--2019-05-01 UTC            89
#> 2     1 2019-10-05 2019-11-15 2019-10-05 UTC--2019-11-15 UTC            41
#> 3     2 2018-07-01 2018-07-31 2018-07-01 UTC--2018-07-31 UTC            30
#> 4     2 2019-01-01 2019-05-05 2019-01-01 UTC--2019-05-05 UTC           124
#> 5     3 2019-10-01 2019-11-06 2019-10-01 UTC--2019-11-06 UTC            36
#> 6     3 2019-11-20 2019-12-31 2019-11-20 UTC--2019-12-31 UTC            41

Abordagem atual Eu encontrei uma questão relacionada em outro tópico (consulte: Encontrar datas dentro de um intervalo de período por grupo ). No entanto, a solução respectiva identifica todas as linhas sobrepostas por grupo. Dessa forma, não consigo identificar os maiores intervalos não sobrepostos.

df$overlap <- unlist(tapply(df$intval, #loop through intervals
                            df$group,  #grouped by id
                            function(x) rowSums(outer(x,x,int_overlaps)) > 1))

Como exemplo, considere o grupo 3 em meus dados de exemplo. Aqui, as linhas 6/7 e 8/9 se sobrepõem. Como as linhas 6 e 9 são os maiores períodos sem sobreposição, gostaria de remover as linhas 7 e 8.

Eu apreciaria muito se alguém pudesse me indicar uma solução.

Respostas

Fabian.Fuchs. Dec 01 2020 at 17:07

Tendo pesquisado problemas relacionados ao stackoverflow, descobri que as seguintes abordagens (aqui: Recolher e mesclar intervalos de tempo sobrepostos ) e (aqui: Como nivelar / mesclar períodos de tempo sobrepostos ) podem ser adaptadas para o meu problema.

# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694 

# Note: df and df1 created in the initial reprex (above)

df2 <- df %>%
  group_by(group) %>%
  arrange(group, start) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start))  >            # find overlaps
                              cummax(as.numeric(end)))[-n()])) %>%
  ungroup() %>%
  group_by(group, indx) %>%
  arrange(desc(intval_length)) %>%                                # retain largest interval
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-indx) %>%
  arrange(group, start)

# Desired output?
identical(df1, df2)
#> [1] TRUE