Understanding dplyr piping and summary functions

Aug 20 2020

I'm looking for help understanding piping and summary functions in dplyr. My code feels a bit verbose and could probably be simplified. There are a couple of questions here because I know I'm missing some concepts, but I'm not sure exactly where the gap is. I've included my full code at the bottom. Thanks in advance, since this is a somewhat larger question.

1a. Given the sample data below and using dplyr, is there a way to count the games (dates) per team without using an intermediate table?

1b. I've included my original way of calculating n_games, which did not work. Why?

library(tidyverse)

set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")), 
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot","goal"), size = 250, 
                                   replace = TRUE, prob = c(0.9,0.1))
)

# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>% 
  count(Team_Name,Date)

n_shots_per_game

# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of 
#  using an intermediate tibble?

# count number of games using the tibble created above [DOES NOT WORK--WHY?]
n_games <- n_shots_per_game %>% 
  count(Team_Name)

n_games #what is this counting? It should be 6 for each.

# this works, but isn't count() just a quicker way to run
#  group_by() %>% summarise()? 
n_games <- n_shots_per_game %>% 
  group_by(Team_Name) %>% 
  summarise(N_Games=n())

n_games

Below is my process for building a summary table. I understand that piping is meant to eliminate the creation of some intermediate variables and tables. Where can I combine the steps below to build the final table with a minimal number of intermediate steps?

# load libraries ------------------------------------------------
library(tidyverse)

# build sample shot data ---------------------------------------
set.seed(123)
shot_df_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
                     Date = sample(as.Date(c("2019-08-01",
                                             "2019-09-01",
                                             "2018-08-01",
                                             "2018-09-01",
                                             "2017-08-01",
                                             "2017-09-01")), 
                                   size = 250, replace = TRUE),
                     Type = sample(c("shot","goal"), size = 250, 
                                   replace = TRUE, prob = c(0.9,0.1))
)

# calculate data ----------------------------------------------
# since every row is a shot, the following function counts shots for ea. team
n_shots <- shot_df_ex %>% 
  count(Team_Name) %>% 
  rename(N_Shots = n)

n_shots

# do the same for goals for each team
n_goals <- shot_df_ex %>% 
  filter(Type == "goal") %>% 
  count(Team_Name,sort = T) %>% 
  rename(N_Goals = n) %>% 
  arrange(Team_Name)

n_goals

# count shots per team per game(date)
n_shots_per_game <- shot_df_ex %>% 
  count(Team_Name,Date)

n_shots_per_game

# count games (dates) per team [ISSUES!!!]
# is there a way to do this piping from the shot_df_ex tibble instead of 
#  using an intermediate tibble?

# count number of games using the tibble created above [DOES NOT WORK]
n_games <- n_shots_per_game %>% 
  count(Team_Name)

n_games #what is this counting? It should be 6 for each.

# this works, but isn't count() just a quicker way to run
#  group_by() %>% summarise()? 
n_games <- n_shots_per_game %>% 
  group_by(Team_Name) %>% 
  summarise(N_Games=n())

n_games

# combine data ------------------------------------------------
# combine the per-team counts into one table (join key stated explicitly)
shot_table_ex <- n_games %>% 
  left_join(n_shots, by = "Team_Name") %>% 
  left_join(n_goals, by = "Team_Name")

# final table with final average calculations
shot_table_ex <- shot_table_ex %>% 
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1)) %>% 
  arrange(Team_Name)

shot_table_ex

Answers

1 stlba Aug 19 2020 at 23:25

For 1a, you can simply pipe directly from the tibble() function into count(), i.e.:

tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
       Date = sample(as.Date(c("2019-08-01",
                               "2019-09-01",
                               "2018-08-01",
                               "2018-09-01",
                               "2017-08-01",
                               "2017-09-01")), 
                     size = 250, replace = TRUE),
       Type = sample(c("shot","goal"), size = 250, 
                     replace = TRUE, prob = c(0.9,0.1))) %>%
count(Team_Name,Date)
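For readers new to the pipe, the snippet above relies on one rule: `x %>% f(y)` is evaluated as `f(x, y)`. The following is a minimal sketch of that equivalence, using a tiny made-up tibble (the name `toy` and its values are illustrative and do not come from the question).

```r
library(dplyr)

# Illustrative data, not taken from the question
toy <- tibble(
  Team_Name = c("A", "A", "B"),
  Date      = as.Date(c("2019-08-01", "2019-08-01", "2019-09-01"))
)

# Nested call: the data frame is passed as the first argument directly
nested <- count(toy, Team_Name, Date)

# Piped call: the left-hand side becomes the first argument of count()
piped <- toy %>% count(Team_Name, Date)

# Both forms should produce the same table
all.equal(nested, piped)
```

Because each step only needs the result of the previous one, a chain of pipes can replace intermediate objects that are never reused elsewhere.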

In 1b, count() uses your column n (i.e. the number of shots) as a weighting variable, so it sums the total number of shots per team rather than counting rows. It prints a message telling you this:

Using `n` as weighting variable
i Quiet this message with `wt = n` or count rows with `wt = 1`

Using count(Team_Name, wt = n()) will give the desired behaviour.
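To make the difference between counting rows and summing a weight concrete, here is a small sketch. The tibble `per_game` and its values are hypothetical, shaped like `n_shots_per_game`. Whether count() weights by an existing `n` column automatically appears to depend on the installed dplyr version (an assumption; the thread does not state versions), so the sketch uses only explicit forms.

```r
library(dplyr)

# Hypothetical per-game shot counts, shaped like n_shots_per_game
per_game <- tibble(
  Team_Name = c("A", "A", "B"),
  Date      = as.Date(c("2019-08-01", "2019-09-01", "2019-08-01")),
  n         = c(10L, 7L, 12L)
)

# Counting rows: one row per game, so this is the number of games
games <- per_game %>%
  group_by(Team_Name) %>%
  summarise(N_Games = n())

# Explicit weighting: sums the n column, so this is the total number of shots
shots <- per_game %>%
  count(Team_Name, wt = n, name = "Total_Shots")

games   # A has 2 games, B has 1
shots   # A has 17 shots, B has 12
```

Spelling out `wt` and `name` keeps the result the same regardless of how the default handles a pre-existing `n` column.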

Edit: part 2

shot_table_ex <- tibble(Team_Name = sample(LETTERS[1:5],250, replace = TRUE),
                    Date = sample(as.Date(c("2019-08-01",
                                            "2019-09-01",
                                            "2018-08-01",
                                            "2018-09-01",
                                            "2017-08-01",
                                            "2017-09-01")), 
                                  size = 250, replace = TRUE),
                    Type = sample(c("shot","goal"), size = 250, 
                                  replace = TRUE, prob = c(0.9,0.1))) %>%
     group_by(Team_Name) %>%
     summarise(n_shots = n(),
               n_goals = sum(Type == "goal"),
               n_games = n_distinct(Date)) %>%
     mutate(Shots_per_Game = round(n_shots / n_games, 1),
            Goals_per_Game = round(n_goals / n_games, 1))

1 GenesRus Aug 19 2020 at 23:36

1a. Given the sample data below and using dplyr, is there a way to count the games (dates) per team without using an intermediate table?

Here's how I would do it:

shot_df_ex %>% 
  distinct(Team_Name, Date) %>% #Keeps only the cols given and one of each combo
  count(Team_Name)

You can also use unique:

shot_df_ex %>% 
  group_by(Team_Name) %>%
  summarize(N_Games = length(unique(Date)))

1b. I've included my original way of calculating n_games, which did not work. Why?

Your code works for me. Did you perhaps overwrite the intermediate table? It counts the expected 6 per team.

Below is my process for building a summary table. I understand that piping is meant to eliminate the creation of some intermediate variables and tables. Where can I combine the steps below to build the final table with a minimal number of intermediate steps?

shot_df_ex %>% 
  group_by(Team_Name) %>% 
  summarize(
    N_Games = length(unique(Date)),
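    # Counts only rows with Type == "shot"; the question's N_Shots counts every row (use n() to match it)
    N_Shots = sum(Type == "shot"),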
    N_Goals = sum(Type == "goal")
  ) %>% 
  mutate(Shots_per_Game = round(N_Shots / N_Games, 1),
         Goals_per_Game = round(N_Goals / N_Games, 1))

You can compute several summaries in a single summarize() call, as long as you don't need to change the grouping. Here (in the sum calls) we take advantage of TRUE being interpreted as 1 and FALSE as 0. length, of course, gives us the length of the vector produced by unique.
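The two tricks mentioned above can be checked on plain vectors, outside any pipeline. This is a minimal sketch with made-up values (the vectors `type` and `dates` are illustrative only).

```r
library(dplyr)

# Illustrative vectors, not taken from the question
type  <- c("shot", "goal", "shot", "shot", "goal")
dates <- as.Date(c("2019-08-01", "2019-08-01", "2019-09-01",
                   "2019-09-01", "2019-09-01"))

# A comparison yields a logical vector: FALSE TRUE FALSE FALSE TRUE
is_goal <- type == "goal"

# sum() treats TRUE as 1 and FALSE as 0, so this counts the goals
n_goals <- sum(is_goal)                  # 2

# Two equivalent ways to count distinct dates
n_games_base  <- length(unique(dates))   # 2
n_games_dplyr <- n_distinct(dates)       # 2
```

`n_distinct()` is the dplyr shorthand for `length(unique())`, which is why either form can be used inside summarize().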