Tidymodel 패키지 : R의 일반 선형 모델 (glm) 및 의사 결정 트리 (배깅 트리, 부스트 트리 및 랜덤 포레스트) 모델

Nov 18 2020

발행물

RTidymodels 패키지를 사용하여 분석을 시도하고 있습니다. R에서의 의사 결정 트리 학습에 관한 아래 튜토리얼을 따르고 있습니다.

지도 시간

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

I가있는 FID라는 데이터 프레임을 제 (아래 참조) 종속 변수 는 IS 주파수 (숫자)를 , 상기 예측 변수는 : - (숫자) 년, 월 (인자), 몬 (인자)과 일 (숫자).

나는 배깅 트리, 랜덤 포레스트, 부스트 트리 모델 을 구축함으로써 "Tidymodels : Decision Tree Learning in R" 이라는 튜토리얼을 성공적으로 따랐다 고 생각 합니다.

이 분석 을 위해 모든 모델 (즉, 랜덤 포레스트, 배깅 트리, 부스트 트리 및 일반 선형 모델) 간의 모델 비교를 수행하여 최적의 모델 적합을 설정하기 위해 일반 선형 모델 (glm) 을 구성하고 싶습니다 . 모든 모델은 과적 합의 편향을 줄이기 위해 10 배 교차 검증 을받습니다.

문제

그 후 튜토리얼의 코드 (아래 참조)를 glm 모델에 맞게 조정하려고 시도했지만 모델을 적절하게 조정했는지 혼란 스럽습니다. 모델이 모두 맞는 후 rmse 값 을 생성하려고 할 때 glm R-code 의이 요소가 아래 오류 메시지를 생성하는지 확실하지 않습니다 .

에러 메시지

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.

또한 이러한 모델 에 대한 recipe () 함수표현 된 R 코드 가 적절하거나 올바른지 확실하지 않습니다. 이는 각 모델을 맞추기 전 처리 단계 에서 매우 중요합니다 . 제 관점에서는 모델의 레시피를 개선 할 수 있을지 궁금했습니다.

이것이 가능하다면 레시피 수정 (필요한 경우)과 함께 glm 모델을 피팅 할 때 오류 메시지와 관련하여 누구에게 도움을 줄 수 있는지 궁금합니다.

저는 고급 R 코더가 아니며 다른 Tidymodel 튜토리얼을 조사하여 해결책을 찾기 위해 철저한 시도를했습니다. 하지만이 오류 메시지의 의미를 이해하지 못합니다. 그러므로 누구든지 도울 수 있다면 깊은 감사를 표하고 싶습니다.

미리 감사드립니다.

R- 코드

##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)

###########################################################
# Put 3/4 of the data into the training set
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe
##Preprocessing
############################################################

rec <- recipe(Frequency ~ ., data = fid_df) %>% 
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

####Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
Run `rlang::last_error()` to see where the error occurred.

#####################################################
##Out-of-sample performance
#####################################################

# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)

데이터 프레임-FID

structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
    Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
    33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
    41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
    28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
    7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
    29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")

답변

1 JuliaSilge Nov 18 2020 at 20:58

나는 선형 모델을 피팅에서 오류가 방법에서 오는 생각 MonthMonsoon서로 관련되어있다. 그들은 완벽하게 상관됩니다.

library(tidyverse) 

fid_df <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
                                  2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
                                  2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
                                  2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L, 
                                                                                                 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 
                                                                                                 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
                                                                                                 8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March", 
                                                                                                                                    "April", "May", "June", "July", "August", "September", "October", 
                                                                                                                                    "November", "December"), class = "factor"), Monsoon = structure(c(2L, 
                                                                                                                                                                                                      2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 
                                                                                                                                                                                                      4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 
                                                                                                                                                                                                      3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon", 
                                                                                                                                                                                                                              "Second_Inter_Monsoon", "South_Monsson"), class = "factor"), 
                         Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 
                                       33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37, 
                                       41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31, 
                                                                                       28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30, 
                                                                                       7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26, 
                                                                                       29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")


fid_df %>%
  count(Month, Monsoon)
#>        Month              Monsoon n
#> 1    January        North_Monsoon 3
#> 2   February        North_Monsoon 3
#> 3      March First_Inter_Monssoon 3
#> 4      April First_Inter_Monssoon 3
#> 5        May        South_Monsson 3
#> 6       June        South_Monsson 3
#> 7       July        South_Monsson 3
#> 8     August        South_Monsson 3
#> 9  September        South_Monsson 3
#> 10   October Second_Inter_Monsoon 3
#> 11  November Second_Inter_Monsoon 3
#> 12  December        North_Monsoon 3

선형 모델에서 이와 같은 변수를 사용하는 경우 모델은 두 계수 세트 모두에 대한 추정치를 찾을 수 없습니다.

lm(Frequency ~ ., data = fid_df) %>% summary()
#> 
#> Call:
#> lm(formula = Frequency ~ ., data = fid_df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.0008  -3.9357   0.6564   2.9769  12.7681 
#> 
#> Coefficients: (3 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)                 -7286.9226  3443.9292  -2.116   0.0459 *
#> Year                            3.6155     1.7104   2.114   0.0461 *
#> MonthFebruary                  -3.2641     6.6172  -0.493   0.6267  
#> MonthMarch                      0.1006     6.5125   0.015   0.9878  
#> MonthApril                      0.4843     6.5213   0.074   0.9415  
#> MonthMay                       -4.0308    11.0472  -0.365   0.7187  
#> MonthJune                       1.0135    15.5046   0.065   0.9485  
#> MonthJuly                      -2.6910    13.6106  -0.198   0.8451  
#> MonthAugust                    -4.9307     6.6172  -0.745   0.4641  
#> MonthSeptember                 -1.7105     7.1126  -0.240   0.8122  
#> MonthOctober                   -7.6981     6.5685  -1.172   0.2538  
#> MonthNovember                  -8.7484     6.5493  -1.336   0.1953  
#> MonthDecember                  -1.6981     6.5685  -0.259   0.7984  
#> MonsoonNorth_Monsoon                NA         NA      NA       NA  
#> MonsoonSecond_Inter_Monsoon         NA         NA      NA       NA  
#> MonsoonSouth_Monsson                NA         NA      NA       NA  
#> Days                            1.1510     0.4540   2.535   0.0189 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.968 on 22 degrees of freedom
#> Multiple R-squared:  0.8135, Adjusted R-squared:  0.7033 
#> F-statistic: 7.381 on 13 and 22 DF,  p-value: 2.535e-05

reprex 패키지 (v0.3.0.9001)에 의해 2020-11-18에 생성됨

이 정보를 가지고 있기 때문에, 내가 사용할지 여부를 결정하는 일부 도메인 지식을 사용하는 것이 좋습니다 것 Month 또는 Monsoon 모델에 있지만 둘.

1 AliceHobbs Nov 19 2020 at 02:31

Julia Silge의 제안으로 답변

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe

rec <- recipe(Frequency_Blue ~ ., data = Tidy_df) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost", "glm")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

##rmse values for all models

|model |.metric |.estimator |      mean|  n|  std_err|
|:-----|:-------|:----------|---------:|--:|--------:|
|bag   |rmse    |standard   |  8.929936| 10| 1.544587|
|rf    |rmse    |standard   | 10.188710| 10| 1.144354|
|boost |rmse    |standard   |  9.249904| 10| 1.489482|
|glm   |rmse    |standard   | 11.348420| 10| 2.217807|

#####################################################
##Out-of-sample performance
#####################################################
#glm

# bagged trees
final_fit_glm <- last_fit(
                     wflow_glm,
                        split = split)


# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)