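Tidymodels package: general linear models (glm) and decision tree models (bagged trees, boosted trees, and random forests) in R
Issue
I am attempting an analysis using the Tidymodels package in R. I am following the tutorial below on decision tree learning in R.
Tutorial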
https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/
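I have a data frame called FID (see below), where the dependent variable is Frequency (numeric) and the predictor variables are Year (numeric), Month (factor), Monsoon (factor), and Days (numeric).
I believe I have successfully followed the tutorial, "Tidymodels: Decision Tree Learning in R", by building bagged tree, random forest, and boosted tree models.
For this analysis, I would also like to build a general linear model (glm) so that I can compare all of the models (i.e., random forest, bagged tree, boosted tree, and general linear model) and establish the best model fit. All models are subjected to 10-fold cross-validation to reduce the bias from overfitting.
Problem
I then tried to adapt the tutorial's R code (see below) to fit a glm model, but I am unsure whether I have adapted it appropriately. I am also not sure whether this part of the glm R code is what produces the error message below when I try to generate the rmse values after all the models have been fitted.
Error message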
Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
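I am also unsure whether the R code in the recipe() function for these models is adequate or correct, which is very important in the preprocessing steps before each model is fitted. From my perspective, I was wondering whether the recipe for the models could be improved.
If possible, I was wondering whether anyone could help with the error message that appears when fitting the glm model, along with correcting the recipe (if necessary).
I am not an advanced R coder, and I have made a thorough attempt to find a solution by researching other Tidymodels tutorials, but I do not understand what this error message means. Therefore, if anyone is able to help, I would like to express my deepest appreciation.
Many thanks in advance.
R code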
##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
###########################################################
# Split this single dataset into two: a training set and a testing set
# Put 3/4 of the data into the training set
data_split <- initial_split(fid_df, prop = 3/4)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)
###########################################################
##Produce the recipe
##Preprocessing
############################################################
rec <- recipe(Frequency ~ ., data = fid_df) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
###########################################################
##Create Models
###########################################################
##########################################################
##General Linear Models
#########################################################
##glm
mod_glm<-linear_reg(mode="regression",
penalty = 0.1,
mixture = 1) %>%
set_engine("glmnet")
##Create workflow
wflow_glm <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_glm)
##Fit the model
plan(multisession)
fit_glm <- fit_resamples(
wflow_glm,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
##########################################################
##Bagged Trees
##########################################################
#####Bagged Trees
mod_bag <- bag_tree() %>%
set_mode("regression") %>%
set_engine("rpart", times = 10) #10 bootstrap resamples
##Create workflow
wflow_bag <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_bag)
##Fit the model
plan(multisession)
fit_bag <- fit_resamples(
wflow_bag,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
###################################################
##Random forests
###################################################
mod_rf <-rand_forest(trees = 1e3) %>%
set_engine("ranger",
num.threads = parallel::detectCores(),
importance = "permutation",
verbose = TRUE) %>%
set_mode("regression")
##Create Workflow
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
##Fit the model
plan(multisession)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
############################################################
##Boosted Trees
############################################################
mod_boost <- boost_tree() %>%
set_engine("xgboost", nthread = parallel::detectCores()) %>% # xgboost's thread parameter is named nthread
set_mode("regression")
##Create workflow
wflow_boost <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_boost)
##Fit model
plan(multisession)
fit_boost <-fit_resamples(
wflow_boost,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
##############################################
##Evaluate the models
##############################################
collect_metrics(fit_bag) %>%
bind_rows(collect_metrics(fit_rf)) %>%
bind_rows(collect_metrics(fit_boost)) %>%
bind_rows(collect_metrics(fit_glm)) %>%
dplyr::filter(.metric == "rmse") %>%
dplyr::mutate(model = c("bag", "rf", "boost")) %>%
dplyr::select(model, everything()) %>%
knitr::kable()
####Error message
Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
Run `rlang::last_error()` to see where the error occurred.
#####################################################
##Out-of-sample performance
#####################################################
# bagged trees
final_fit_bag <- last_fit(
  wflow_bag,
  split = data_split) # the rsplit object created by initial_split()
# random forest
final_fit_rf <- last_fit(
  wflow_rf,
  split = data_split)
# boosted trees
final_fit_boost <- last_fit(
  wflow_boost,
  split = data_split)
Data frame - FID
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")
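Answer
I believe the error from fitting the linear model comes from how Month and Monsoon are related to each other. They are perfectly correlated: every Month value maps to exactly one Monsoon value.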
library(tidyverse)
fid_df <- structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")
fid_df %>%
count(Month, Monsoon)
#> Month Monsoon n
#> 1 January North_Monsoon 3
#> 2 February North_Monsoon 3
#> 3 March First_Inter_Monssoon 3
#> 4 April First_Inter_Monssoon 3
#> 5 May South_Monsson 3
#> 6 June South_Monsson 3
#> 7 July South_Monsson 3
#> 8 August South_Monsson 3
#> 9 September South_Monsson 3
#> 10 October Second_Inter_Monsoon 3
#> 11 November Second_Inter_Monsoon 3
#> 12 December North_Monsoon 3
If you use variables like these in a linear model, the model cannot find estimates for both sets of coefficients.
lm(Frequency ~ ., data = fid_df) %>% summary()
#>
#> Call:
#> lm(formula = Frequency ~ ., data = fid_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.0008 -3.9357 0.6564 2.9769 12.7681
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7286.9226 3443.9292 -2.116 0.0459 *
#> Year 3.6155 1.7104 2.114 0.0461 *
#> MonthFebruary -3.2641 6.6172 -0.493 0.6267
#> MonthMarch 0.1006 6.5125 0.015 0.9878
#> MonthApril 0.4843 6.5213 0.074 0.9415
#> MonthMay -4.0308 11.0472 -0.365 0.7187
#> MonthJune 1.0135 15.5046 0.065 0.9485
#> MonthJuly -2.6910 13.6106 -0.198 0.8451
#> MonthAugust -4.9307 6.6172 -0.745 0.4641
#> MonthSeptember -1.7105 7.1126 -0.240 0.8122
#> MonthOctober -7.6981 6.5685 -1.172 0.2538
#> MonthNovember -8.7484 6.5493 -1.336 0.1953
#> MonthDecember -1.6981 6.5685 -0.259 0.7984
#> MonsoonNorth_Monsoon NA NA NA NA
#> MonsoonSecond_Inter_Monsoon NA NA NA NA
#> MonsoonSouth_Monsson NA NA NA NA
#> Days 1.1510 0.4540 2.535 0.0189 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.968 on 22 degrees of freedom
#> Multiple R-squared: 0.8135, Adjusted R-squared: 0.7033
#> F-statistic: 7.381 on 13 and 22 DF, p-value: 2.535e-05
Created on 2020-11-18 by the reprex package (v0.3.0.9001)
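The singularity reported in the lm() summary can also be checked directly on the design matrix. The following is a minimal base R sketch, assuming a toy data frame (`toy`, a hypothetical name) with one row per month and the same Month-to-Monsoon mapping as in the count() output. It compares the number of columns of the model matrix with its rank.

```r
# Toy data: one row per month, with Monsoon derived from Month
# using the same mapping as in the count() output above.
monsoon_levels <- c("First_Inter_Monssoon", "North_Monsoon",
                    "Second_Inter_Monsoon", "South_Monsson")
monsoon_code <- c(2, 2, 1, 1, 4, 4, 4, 4, 4, 3, 3, 2)

toy <- data.frame(
  Month   = factor(month.name, levels = month.name),
  Monsoon = factor(monsoon_levels[monsoon_code], levels = monsoon_levels)
)

# Design matrix with dummy columns for both factors
X <- model.matrix(~ Month + Monsoon, data = toy)

n_columns   <- ncol(X)     # intercept + 11 Month dummies + 3 Monsoon dummies
matrix_rank <- qr(X)$rank  # number of linearly independent columns

# Number of coefficients that cannot be estimated separately
n_columns - matrix_rank
```

The difference between the two numbers is the count of redundant columns, which should line up with the "3 not defined because of singularities" note in the summary.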
Given this information, I would recommend using some domain knowledge to decide whether to use Month or Monsoon in the model, but not both.
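As a rough illustration of that advice, the sketch below refits the linear model with only one of the two variables at a time. It rebuilds the data from the question in a compact form (the Year, Month and Monsoon columns are regenerated from their repeating pattern rather than pasted in full), and the object names `fit_both`, `fit_month` and `fit_monsoon` are hypothetical. Which variable to keep remains a domain decision; the code only shows that either choice removes the undefined coefficients.

```r
# Rebuild the question's data frame from its repeating yearly pattern.
monsoon_levels <- c("First_Inter_Monssoon", "North_Monsoon",
                    "Second_Inter_Monsoon", "South_Monsson")
monsoon_code <- c(2, 2, 1, 1, 4, 4, 4, 4, 4, 3, 3, 2)

fid_df <- data.frame(
  Year    = rep(c(2015, 2016, 2017), each = 12),
  Month   = factor(rep(month.name, times = 3), levels = month.name),
  Monsoon = factor(rep(monsoon_levels[monsoon_code], times = 3),
                   levels = monsoon_levels),
  Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33,
                33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30,
                44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 43, 38),
  Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31,
           31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27,
           31, 28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)
)

# Both factors together: some coefficients come back as NA
fit_both <- lm(Frequency ~ ., data = fid_df)

# Only one of the two related factors at a time
fit_month   <- lm(Frequency ~ Year + Month + Days, data = fid_df)
fit_monsoon <- lm(Frequency ~ Year + Monsoon + Days, data = fid_df)

# Count the coefficients that could not be estimated in each fit
sum(is.na(coef(fit_both)))
sum(is.na(coef(fit_month)))
sum(is.na(coef(fit_monsoon)))
```

In a recipe, the same choice could presumably be expressed by dropping one column with step_rm() (for example step_rm(Monsoon)) before step_dummy(); that placement is an assumption, not something the original answer prescribes.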
Answer with Julia Silge's suggestions
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)
###########################################################
##Produce the recipe
rec <- recipe(Frequency_Blue ~ ., data = Tidy_df) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variance
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
  step_impute_median(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median (called step_medianimpute() in older recipes releases)
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
###########################################################
##Create Models
###########################################################
##########################################################
##General Linear Models
#########################################################
##glm
mod_glm<-linear_reg(mode="regression",
penalty = 0.1,
mixture = 1) %>%
set_engine("glmnet")
##Create workflow
wflow_glm <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_glm)
##Fit the model
plan(multisession)
fit_glm <- fit_resamples(
wflow_glm,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
##########################################################
##Bagged Trees
##########################################################
#####Bagged Trees
mod_bag <- bag_tree() %>%
set_mode("regression") %>%
set_engine("rpart", times = 10) #10 bootstrap resamples
##Create workflow
wflow_bag <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_bag)
##Fit the model
plan(multisession)
fit_bag <- fit_resamples(
wflow_bag,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
###################################################
##Random forests
###################################################
mod_rf <-rand_forest(trees = 1e3) %>%
set_engine("ranger",
num.threads = parallel::detectCores(),
importance = "permutation",
verbose = TRUE) %>%
set_mode("regression")
##Create Workflow
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
##Fit the model
plan(multisession)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
############################################################
##Boosted Trees
############################################################
mod_boost <- boost_tree() %>%
set_engine("xgboost", nthread = parallel::detectCores()) %>% # xgboost's thread parameter is named nthread
set_mode("regression")
##Create workflow
wflow_boost <- workflow() %>%
add_recipe(rec) %>%
add_model(mod_boost)
##Fit model
plan(multisession)
fit_boost <-fit_resamples(
wflow_boost,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE)
)
##############################################
##Evaluate the models
##############################################
collect_metrics(fit_bag) %>%
bind_rows(collect_metrics(fit_rf)) %>%
bind_rows(collect_metrics(fit_boost)) %>%
bind_rows(collect_metrics(fit_glm)) %>%
dplyr::filter(.metric == "rmse") %>%
dplyr::mutate(model = c("bag", "rf", "boost", "glm")) %>%
dplyr::select(model, everything()) %>%
knitr::kable()
##rmse values for all models
|model |.metric |.estimator | mean| n| std_err|
|:-----|:-------|:----------|---------:|--:|--------:|
|bag |rmse |standard | 8.929936| 10| 1.544587|
|rf |rmse |standard | 10.188710| 10| 1.144354|
|boost |rmse |standard | 9.249904| 10| 1.489482|
|glm |rmse |standard | 11.348420| 10| 2.217807|
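The original error came from the label vector: after filtering for rmse, four rows remain (one per model), but only three labels were supplied, so adding "glm" as the fourth label resolves it. A hedged alternative that avoids keeping the label vector in sync by hand is to let bind_rows() create the model column from a named list through its .id argument. The sketch below uses small stand-in tibbles instead of real collect_metrics() output. The helper `fake_metrics()` is hypothetical, the rmse means are copied from the table above, and the rsq rows are left as NA placeholders because their values are not shown here.

```r
library(dplyr)

# Hypothetical stand-in for collect_metrics(): one row per metric.
fake_metrics <- function(rmse_mean) {
  tibble(
    .metric    = c("rmse", "rsq"),
    .estimator = "standard",
    mean       = c(rmse_mean, NA_real_)  # rsq is a placeholder, not a real result
  )
}

# A named list: the names become the values of the "model" column.
metrics_by_model <- list(
  bag   = fake_metrics(8.929936),
  rf    = fake_metrics(10.188710),
  boost = fake_metrics(9.249904),
  glm   = fake_metrics(11.348420)
)

rmse_table <- bind_rows(metrics_by_model, .id = "model") %>%
  filter(.metric == "rmse")

rmse_table
```

With the real objects, the list would hold collect_metrics(fit_bag), collect_metrics(fit_rf) and so on, and adding or removing a model would no longer require editing a separate label vector.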
#####################################################
##Out-of-sample performance
#####################################################
# glm
final_fit_glm <- last_fit(
  wflow_glm,
  split = data_split) # the rsplit object created by initial_split()
# bagged trees
final_fit_bag <- last_fit(
  wflow_bag,
  split = data_split)
# random forest
final_fit_rf <- last_fit(
  wflow_rf,
  split = data_split)
# boosted trees
final_fit_boost <- last_fit(
  wflow_boost,
  split = data_split)
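For its split argument, last_fit() expects the rsplit object returned by initial_split() (stored as data_split earlier in this code). It fits the workflow on the training portion and evaluates it on the test portion. Below is a minimal, self-contained sketch of that step and of reading the test-set metrics, assuming the built-in mtcars data and a plain lm engine as stand-ins for the real data and workflows. The names `car_split`, `wflow_lm` and `final_fit_lm` are hypothetical.

```r
library(tidymodels)

set.seed(123)  # make the random split reproducible

# Stand-in split: 3/4 training, 1/4 testing
car_split <- initial_split(mtcars, prop = 3/4)

# Stand-in workflow: ordinary linear regression on two predictors
wflow_lm <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit on the training set and evaluate once on the test set
final_fit_lm <- last_fit(wflow_lm, split = car_split)

# Test-set metrics and predictions
test_metrics <- collect_metrics(final_fit_lm)
test_metrics
collect_predictions(final_fit_lm)
```

The same collect_metrics() call applied to final_fit_glm, final_fit_bag, final_fit_rf and final_fit_boost would give the out-of-sample values to set against the cross-validated rmse table.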