Refitting a saved scikit-learn model without some features - "ValueError: A given column is not a column of the dataframe"

Nov 16 2020

I need to refit a scikit-learn pipeline on a smaller dataset that lacks some features the model does not actually use.

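(In the real setup I save the pipeline with joblib and load it in another file, where I need to refit it. It contains some custom transformers I wrote, and adding all the features there would be a pain because it is a different kind of model. This is not important, though: the same error occurs if I refit the model before saving it, in the same file where I first trained it.)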

This is my custom transformer:

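from sklearn.base import BaseEstimator, TransformerMixin


class TransformAdoptionFeatures(BaseEstimator, TransformerMixin):
    """Keep columns named '*_munic*', '*_adj*' or '*_port*', except 'tot_cumul' ones."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Columns are selected by name from whatever DataFrame is passed in.
        adoption_features = X.columns
        feats_munic = [feat for feat in adoption_features if '_munic' in feat]
        feats_adj_neigh = [feat for feat in adoption_features
                           if '_adj' in feat]
        feats_port = [feat for feat in adoption_features if '_port' in feat]

        feats_to_keep_all = feats_munic + feats_adj_neigh + feats_port
        # Drop the cumulative-total features.
        feats_to_keep = [feat for feat in feats_to_keep_all
                         if 'tot_cumul' not in feat]

        return X[feats_to_keep]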

And this is my pipeline:

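from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing: select the columns, then standardize them.
full_pipeline = Pipeline([
    ('transformer', TransformAdoptionFeatures()),
    ('scaler', StandardScaler())
])

# ml_model is defined elsewhere (see below).
model = Pipeline([
    ("preparation", full_pipeline),
    ("regressor", ml_model)
])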

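where ml_model is a scikit-learn machine learning model. Both full_pipeline and ml_model are already fitted when I save model. (In the real model there is an intermediate ColumnTransformer step that represents the actual full_pipeline, because I need different transformers for different columns, but for brevity I copied only the important transformer.)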

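The problem: I reduced the number of features in the dataset I had already used to fit everything, removing some features that TransformAdoptionFeatures() does not consider (they are not among the features to keep). I then tried to refit the model on the new dataset with the reduced features and got this error: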

Traceback (most recent call last):

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'tot_cumul_adoption_pr_y_munic'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 447, in _get_column_indices
    col_idx = all_columns.get_loc(col)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err

KeyError: 'tot_cumul_adoption_pr_y_munic'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\giaco\sbp-abm\municipalities_abm\test.py", line 15, in <module>
    modelSBP = model.SBPAdoption(initial_year=start_year)

  File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 103, in __init__
    self._upload_ml_models(ml_clsf_folder, ml_regr_folder)

  File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 183, in _upload_ml_models
    self._ml_clsf.fit(clsf_dataset.drop('adoption_in_year', axis=1),

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform
    self._validate_remainder(X)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 454, in _get_column_indices
    raise ValueError(

ValueError: A given column is not a column of the dataframe

I don't understand the cause of this error. I thought scikit-learn did not store the names of the columns I pass.

Answer

giacrava Nov 18 2020 at 18:50

I found my error. It was indeed in how I used the ColumnTransformer, which is also the only place where column names are entered.

The error was really simple: I had not updated the lists of columns that each transformer is applied to, so they still contained the names of the excluded features.
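
For illustration, here is a minimal sketch of the situation and the fix. It is not my actual code: the column names, the tiny dataset, the build_model helper and the use of LinearRegression in place of ml_model are all assumptions made for the example. It shows that a ColumnTransformer given an explicit column list fails when refitted on a DataFrame missing one of those columns, and that rebuilding it with an updated list works.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: these column names are made up for the example.
full = pd.DataFrame({
    "a_munic": [1.0, 2.0, 3.0, 4.0],
    "b_port": [0.5, 0.1, 0.9, 0.3],
    "tot_cumul_munic": [10.0, 20.0, 30.0, 40.0],
})
y = [1.0, 2.0, 3.0, 4.0]


def build_model(columns):
    """Hypothetical helper: scale the listed columns, then fit a regressor."""
    preparation = ColumnTransformer(
        [("scaler", StandardScaler(), list(columns))]
    )
    return Pipeline([
        ("preparation", preparation),
        ("regressor", LinearRegression()),  # stand-in for ml_model
    ])


all_columns = ["a_munic", "b_port", "tot_cumul_munic"]
model = build_model(all_columns)
model.fit(full, y)

# Remove a feature from the dataset but leave the column list untouched.
reduced = full.drop(columns=["tot_cumul_munic"])
try:
    model.fit(reduced, y)
    refit_error = None
except ValueError as err:
    # The ColumnTransformer still asks for 'tot_cumul_munic'.
    refit_error = err

# Fix: update the column list so it only names columns that still exist.
kept = [col for col in all_columns if col in reduced.columns]
model = build_model(kept)
model.fit(reduced, y)
```

Deriving the column list from the DataFrame, instead of hard-coding it, is one way to keep the two from drifting apart again.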