When should I do the train/test split?

Aug 18 2020

I am new to machine learning, and I am confused about when to perform the train/test split.

Is the order given below correct?

  1. Split all the data into a training set and a test set

  2. Extract features from the training data

  3. Fit the classification model to the features extracted from the training data

  4. Extract the same features as computed in step 2 from the test data

  5. Apply the model fitted in step 3 to the features extracted from the test data in step 4 to evaluate the model

Answers

6 gunes Aug 18 2020 at 21:24

Your procedure is generally correct. A more complex workflow may add further steps such as validation, hyper-parameter optimisation, and feature selection.
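The five steps in the question can be sketched roughly as follows. This is only an illustration using the Python standard library: the toy data, the helper names (`split_train_test`, `fit_scaler`, `transform`, `fit_centroids`, `predict`), standardisation as the "feature extraction", and the nearest-centroid classifier are all assumptions made for the example, not something taken from the question. In practice a library such as scikit-learn offers `train_test_split` and `Pipeline` for the same purpose.

```python
import random
from statistics import mean, stdev

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Step 1: shuffle a copy and split before computing any statistics."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def fit_scaler(values):
    """Step 2: learn the feature parameters from the training data only."""
    return mean(values), stdev(values)

def transform(values, scaler):
    """Apply previously learned parameters; nothing is re-estimated here."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

def fit_centroids(features, labels):
    """Step 3: nearest-centroid classifier, one mean per class."""
    return {c: mean(f for f, y in zip(features, labels) if y == c)
            for c in set(labels)}

def predict(features, centroids):
    """Assign each feature value to the class with the closest centroid."""
    return [min(centroids, key=lambda c: abs(f - centroids[c]))
            for f in features]

# Made-up toy data: (raw measurement, class label).
data = [(float(x), 0) for x in range(0, 20)] + \
       [(float(x), 1) for x in range(30, 50)]

train, test = split_train_test(data)                 # step 1
scaler = fit_scaler([x for x, _ in train])           # step 2 (train only)
X_train = transform([x for x, _ in train], scaler)
model = fit_centroids(X_train, [y for _, y in train])  # step 3

X_test = transform([x for x, _ in test], scaler)     # step 4: reuse the scaler
y_pred = predict(X_test, model)                      # step 5: evaluate
accuracy = mean(int(p == y) for p, y in zip(y_pred, [y for _, y in test]))
print(f"test accuracy: {accuracy}")
```

The point to notice is that `fit_scaler` never sees the test rows: the test data is only passed through `transform` with parameters learned in step 2.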

Typically, feature extraction follows exploratory data analysis (EDA), where you get to know your data, analyse and summarise it, and draw intuitive conclusions. In EDA, you don't necessarily do a train/test split.

Note that if you repeat steps 2-3 in a feedback loop to test whether newly extracted features (e.g. interaction variables) are useful for the model, you'll need a validation step: compare the candidate features on a separate validation set, and keep the test set for the final evaluation only.
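A minimal sketch of such a validation step, again with the standard library only. The toy data (a label that depends on the sign of `a * b`), the two candidate feature extractors, and the helper names (`three_way_split`, `fit_centroids`, `accuracy`, `labels_of`) are assumptions made for illustration.

```python
import random
from statistics import mean

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out test, validation, and training rows."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    return rows[n_test + n_val:], rows[n_test:n_test + n_val], rows[:n_test]

def fit_centroids(features, labels):
    """Nearest-centroid classifier: one mean feature value per class."""
    return {c: mean(f for f, y in zip(features, labels) if y == c)
            for c in set(labels)}

def accuracy(centroids, features, labels):
    """Fraction of rows whose nearest centroid matches the true label."""
    preds = [min(centroids, key=lambda c: abs(f - centroids[c]))
             for f in features]
    return mean(int(p == y) for p, y in zip(preds, labels))

def labels_of(rows):
    return [y for _, y in rows]

# Made-up toy data: the label is 1 when a and b have the same sign.
data = [((a, b), int(a * b > 0)) for a in (-1, 1) for b in (-1, 1)] * 10

# Candidate feature extractors to compare (hypothetical examples).
candidates = {
    "a_only": lambda x: x[0],
    "interaction": lambda x: x[0] * x[1],
}

train, val, test = three_way_split(data)

# Feedback loop: fit on train, score each candidate on validation only.
val_scores = {}
for name, extract in candidates.items():
    model = fit_centroids([extract(x) for x, _ in train], labels_of(train))
    val_scores[name] = accuracy(
        model, [extract(x) for x, _ in val], labels_of(val))

# Pick the winner by validation score, then touch the test set once.
best = max(val_scores, key=val_scores.get)
extract = candidates[best]
final_model = fit_centroids([extract(x) for x, _ in train], labels_of(train))
test_score = accuracy(
    final_model, [extract(x) for x, _ in test], labels_of(test))
print(val_scores, best, test_score)
```

Because the choice between candidates is made on `val`, the score on `test` remains an estimate that was not used for any decision.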