When should I do the train/test split?
I am new to machine learning and confused about when to perform the train/test split.
Is the order given below correct?
1. Split all the data into a training set and a test set
2. Extract features from the training data
3. Fit a classification model to the features extracted from the training data
4. Extract the same features, which were computed in step 2, from the test data
5. Apply the model fitted in step 3 to the features extracted from the test data in step 4 to evaluate the model
Answers
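Your procedure is generally correct. The key point is that the split comes first, so that nothing learned during feature extraction or model fitting has seen the test data. In a more complex workflow, additional operations may include validation, hyperparameter optimisation, feature selection, etc.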
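The five steps from the question can be sketched as follows. This is a minimal, dependency-free illustration rather than a recommended implementation: the toy data, the standardisation "feature", the nearest-centroid classifier, and every function name here are assumptions made up for the example.

```python
import random
from statistics import mean, pstdev

def make_toy_data(n=200, seed=0):
    """Hypothetical toy data: one raw measurement per sample, two classes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 is centred at 3.0 and class 0 at 0.0 (arbitrary choice).
        data.append((rng.gauss(3.0 * label, 1.0), label))
    return data

def split(data, test_fraction=0.25, seed=0):
    """Step 1: shuffle and split before anything is learned from the data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Step 2: learn the feature parameters (mean, std) from training data only."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    """Apply previously learned parameters; nothing is re-estimated here."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

def fit_centroids(features, labels):
    """Step 3: fit a nearest-centroid classifier on the training features."""
    return {c: mean(f for f, y in zip(features, labels) if y == c)
            for c in set(labels)}

def predict(features, centroids):
    """Assign each sample to the class with the closest centroid."""
    return [min(centroids, key=lambda c: abs(f - centroids[c])) for f in features]

train, test = split(make_toy_data())
x_train, y_train = zip(*train)
x_test, y_test = zip(*test)

scaler = fit_scaler(x_train)                                # step 2
model = fit_centroids(transform(x_train, scaler), y_train)  # step 3

test_features = transform(x_test, scaler)                   # step 4: reuse the training scaler
predictions = predict(test_features, model)                 # step 5
accuracy = sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Note that `fit_scaler` is never called on the test data. In practice, libraries such as scikit-learn provide equivalents (for example `train_test_split` and `StandardScaler`), and the same fit-on-train, transform-on-test pattern applies.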
Typically, feature extraction follows exploratory data analysis (EDA), where you get to know your data, analyse and summarise it, and draw intuitive conclusions. In EDA, you don't necessarily do a train/test split.
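Note that if you repeat steps 2-3 in a feedback loop, testing whether newly extracted features (e.g. interaction variables) are useful for the model, you'll need a validation step: compare the candidate features on a separate validation set (or with cross-validation on the training data), and keep the test set for the final evaluation only. Otherwise the test score has influenced your feature choices and no longer gives an unbiased estimate.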
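A feedback loop with a validation step might look like the sketch below. Everything in it is hypothetical and chosen only for illustration: the toy data (where the label depends on the sign of `a * b`), the two candidate features, and the simple threshold classifier.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: the label depends on the interaction between a and b.
samples = []
for _ in range(300):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples.append((a, b, int(a * b > 0)))

# Split once, up front, into training, validation and test sets.
rng.shuffle(samples)
train, val, test = samples[:180], samples[180:240], samples[240:]

# Candidate features to compare in the feedback loop.
candidates = {
    "a_only": lambda a, b: a,
    "interaction": lambda a, b: a * b,
}

def fit(rows, feature):
    """Fit a toy classifier: a threshold halfway between the two class means."""
    by_class = {0: [], 1: []}
    for a, b, y in rows:
        by_class[y].append(feature(a, b))
    m0 = sum(by_class[0]) / len(by_class[0])
    m1 = sum(by_class[1]) / len(by_class[1])
    return (m0 + m1) / 2, m1 > m0

def accuracy(rows, feature, model):
    """Fraction of rows whose predicted class matches the label."""
    threshold, one_is_above = model
    hits = 0
    for a, b, y in rows:
        predicted = int((feature(a, b) > threshold) == one_is_above)
        hits += predicted == y
    return hits / len(rows)

# Fit every candidate on the training set, compare them on the validation set.
models = {name: fit(train, f) for name, f in candidates.items()}
val_scores = {name: accuracy(val, candidates[name], models[name])
              for name in candidates}
best = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, for the chosen candidate.
test_score = accuracy(test, candidates[best], models[best])
print(val_scores, best, test_score)
```

The design choice here is that the test set plays no part in picking the feature; only the validation scores do.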