After K-fold cross-validation, which model should be selected as the representative?
Machine Learning 2017. 1. 4. 20:24
I trained my system using 10-fold cross-validation, which gives me 10 different models. Which one should I select as the representative?
I am using Random Forests to learn from data in a 10-fold cross-validation setup, and the performance measures seem very encouraging. I want to use the model on an independent test set. But in 10-fold cross-validation, 90% of the data is used for training and the remaining 10% for testing in each fold, and this is repeated 10 times with mutually exclusive test folds.
So in the 10-fold cross-validation setting I get 10 different models, one per fold. I want to use the random forest model on a different test set.
Should I select a model based on the error measures? Suppose the error measures for the folds are:
[20 21 23 27 30 21 24 25 28 26].
If I understand correctly, you are using regular k-fold cross-validation, not nested k-fold cross-validation? As you said, you train 10 "different" models, and each is evaluated on its test fold to calculate some performance metric (e.g., accuracy or classification error). What you are interested in is the average performance across those 10 test folds. Let me illustrate what I mean:
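As a small illustration of that averaging, here is a minimal Python sketch using the ten fold errors quoted in the question. The variable names are made up for this example, and reporting the standard deviation next to the mean is an added suggestion rather than part of the original answer.

```python
from statistics import mean, stdev

# Per-fold error measures quoted in the question.
fold_errors = [20, 21, 23, 27, 30, 21, 24, 25, 28, 26]

# The cross-validation estimate is the average over the folds,
# not the score of the single best-looking fold.
cv_error = mean(fold_errors)

# The spread across folds gives a rough sense of how stable the estimate is.
cv_spread = stdev(fold_errors)

print(f"CV error: {cv_error:.1f} +/- {cv_spread:.1f}")  # CV error: 24.5 +/- 3.3
```

Picking the fold with error 20 would report an optimistic number; the average of 24.5 is the figure that describes the modelling procedure as a whole.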
After you are done with hyperparameter tuning, you don't pick any of the fold models. Instead, you retrain the model on the complete training dataset and evaluate it on an independent test dataset (ideally). Once you have your generalization performance estimate from the independent test set, you can retrain on the training and test sets combined and use that model in real-world applications, for example.
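A rough sketch of that workflow in plain Python, under stated assumptions: the synthetic one-dimensional data, the nearest-centroid classifier, and every function name here are hypothetical stand-ins (the question used Random Forests, which a library would normally provide). The point is only the order of the steps: cross-validate on the training data, refit once on all of it, then score once on the held-out test set.

```python
import random

def fit(X, y):
    """Nearest-centroid classifier: store the mean feature value per class."""
    centroids = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(model, X):
    return [min(model, key=lambda label: abs(x - model[label])) for x in X]

def error_rate(model, X, y):
    wrong = sum(p != t for p, t in zip(predict(model, X), y))
    return wrong / len(y)

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; the validation folds are disjoint."""
    idx = list(range(n))
    for f in range(k):
        val = idx[f::k]
        held_out = set(val)
        yield [i for i in idx if i not in held_out], val

# Synthetic data (an assumption for this sketch): two overlapping classes.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)] + \
       [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

# Set the independent test set aside before any cross-validation.
train_data, test_data = data[:150], data[150:]
X_train, y_train = [x for x, _ in train_data], [t for _, t in train_data]
X_test, y_test = [x for x, _ in test_data], [t for _, t in test_data]

# Step 1: 10-fold CV on the training data only; the fold models are discarded.
fold_errors = []
for train_idx, val_idx in k_fold_indices(len(X_train), 10):
    fold_model = fit([X_train[i] for i in train_idx],
                     [y_train[i] for i in train_idx])
    fold_errors.append(error_rate(fold_model,
                                  [X_train[i] for i in val_idx],
                                  [y_train[i] for i in val_idx]))
cv_error = sum(fold_errors) / len(fold_errors)

# Step 2: refit once on the complete training set.
final_model = fit(X_train, y_train)

# Step 3: a single evaluation on the independent test set.
test_error = error_rate(final_model, X_test, y_test)
print(f"CV error {cv_error:.3f}, test error {test_error:.3f}")
```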
For algorithm comparisons, I recommend nested cross-validation. Here, you basically have nested cross-validation loops. Let me draw the typical 5x2 cross-validation:
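The drawing is not available here, so as a stand-in, here is a rough sketch of nested cross-validation. It assumes that "5x2" means five outer folds with two inner folds each, which is one common reading (the 5x2cv test in Dietterich's paper instead repeats 2-fold cross-validation five times). The k-nearest-neighbour classifier, the synthetic data, and all function names are illustrative choices, not something taken from the answer.

```python
import random

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest 1-D training points (labels 0/1)."""
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(X_train, y_train, X_eval, y_eval, k):
    wrong = sum(knn_predict(X_train, y_train, x, k) != t
                for x, t in zip(X_eval, y_eval))
    return wrong / len(y_eval)

def folds(indices, n_folds):
    """Yield (train, held_out) index lists with disjoint held-out parts."""
    for f in range(n_folds):
        held_out = indices[f::n_folds]
        excluded = set(held_out)
        yield [i for i in indices if i not in excluded], held_out

def cv_error(X, y, indices, n_folds, k):
    """Mean error of k-NN over n_folds folds built from the given indices."""
    errors = []
    for tr, va in folds(indices, n_folds):
        errors.append(error_rate([X[i] for i in tr], [y[i] for i in tr],
                                 [X[i] for i in va], [y[i] for i in va], k))
    return sum(errors) / len(errors)

# Synthetic data (an assumption for this sketch).
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(60)] + \
       [(rng.gauss(2.0, 1.0), 1) for _ in range(60)]
rng.shuffle(data)
X = [x for x, _ in data]
y = [t for _, t in data]

candidate_k = [1, 5, 15]  # odd values so a binary vote cannot tie
outer_errors = []
for outer_train, outer_test in folds(list(range(len(X))), 5):
    # Inner loop: choose k using only the outer training part.
    best_k = min(candidate_k,
                 key=lambda k: cv_error(X, y, outer_train, 2, k))
    # Outer loop: score the tuned procedure on data the inner loop never saw.
    outer_errors.append(error_rate(
        [X[i] for i in outer_train], [y[i] for i in outer_train],
        [X[i] for i in outer_test], [y[i] for i in outer_test], best_k))

nested_error = sum(outer_errors) / len(outer_errors)
print(f"nested CV error: {nested_error:.3f}")
```

The outer average estimates how well "tune k, then fit" performs as a whole procedure, which is what you would compare across algorithms.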
Since you didn't ask for it, I don't want to go into too much detail here, but if you are interested, here is a reference with further information on this topic:
"Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" by Thomas G. Dietterich.
The standard approach is to re-train on the whole dataset and use that one model to make predictions on the test set. That's usually a reasonable approach.
In a high-variance problem, though, you might want to use each model produced by your k folds not only to estimate the error on its held-out fold, but also to make predictions on the test set. You'd end up with k predictions for the test set that you would then average, which might help reduce variance. Essentially you're using bagging.
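One way this could look, as a sketch rather than a recipe: each of the k fold models is scored on its own held-out fold and also asked to predict the test set, and the k test predictions are then averaged. The least-squares line, the synthetic data, and the names are assumptions made for illustration.

```python
import random

def fit_line(X, y):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(X)
    mean_x, mean_y = sum(X) / n, sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in X)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(model, X):
    intercept, slope = model
    return [intercept + slope * x for x in X]

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = random.Random(2)

def noisy_line(x):
    return 3.0 + 2.0 * x + rng.gauss(0.0, 2.0)

X_train = [rng.uniform(0, 10) for _ in range(50)]
y_train = [noisy_line(x) for x in X_train]
X_test = [rng.uniform(0, 10) for _ in range(20)]
y_test = [noisy_line(x) for x in X_test]

k = 10
fold_errors, test_preds = [], []
for f in range(k):
    val_idx = list(range(f, len(X_train), k))
    held_out = set(val_idx)
    train_idx = [i for i in range(len(X_train)) if i not in held_out]
    model = fit_line([X_train[i] for i in train_idx],
                     [y_train[i] for i in train_idx])
    # Usual use of the fold model: error on its held-out fold.
    fold_errors.append(mse(predict(model, [X_train[i] for i in val_idx]),
                           [y_train[i] for i in val_idx]))
    # Extra use: keep this model's predictions for the test set.
    test_preds.append(predict(model, X_test))

# Average the k predictions for each test point.
avg_pred = [sum(column) / k for column in zip(*test_preds)]
ensemble_mse = mse(avg_pred, y_test)
print(f"CV MSE {sum(fold_errors) / k:.2f}, ensemble test MSE {ensemble_mse:.2f}")
```

Because the fold models share most of their training data, their predictions are strongly correlated, so the variance reduction may be smaller than with bootstrap-based bagging.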
For high variance problems you can take this process further, essentially creating a randomized process where you average tens or hundreds of predictions by randomly selecting observations to train the model on (with or without replacement) as well as features (and possibly randomizing the choice of hyperparameters within given ranges). In each iteration you'd make a prediction of test data by using a different subset of rows and columns. In certain cases this works very well.
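A minimal sketch of such a randomized ensemble, assuming a k-nearest-neighbour regressor as the base model and arbitrary values for the number of models, the row fraction, and the number of columns (none of these come from the answer). Rows are sampled without replacement here; sampling with replacement would be the bootstrap variant.

```python
import random

def knn_regress(rows, targets, query, cols, k=5):
    """Average target of the k nearest rows, measured on the chosen columns only."""
    def dist(row):
        return sum((row[c] - query[c]) ** 2 for c in cols)
    nearest = sorted(range(len(rows)), key=lambda i: dist(rows[i]))[:k]
    return sum(targets[i] for i in nearest) / k

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

# Synthetic data (an assumption): six features, only the first two matter.
rng = random.Random(3)
n_features = 6

def make_row():
    return [rng.uniform(-1.0, 1.0) for _ in range(n_features)]

def target(row):
    return row[0] + 2.0 * row[1] + rng.gauss(0.0, 0.3)

train_rows = [make_row() for _ in range(200)]
train_y = [target(r) for r in train_rows]
test_rows = [make_row() for _ in range(30)]
test_y = [target(r) for r in test_rows]

n_models = 50       # assumed; the answer mentions tens or hundreds
row_fraction = 0.7  # assumed share of rows given to each model
n_cols = 4          # assumed number of columns given to each model

all_preds = []
for _ in range(n_models):
    # Each iteration sees a different random subset of rows and columns.
    row_idx = rng.sample(range(len(train_rows)),
                         int(row_fraction * len(train_rows)))
    cols = rng.sample(range(n_features), n_cols)
    rows = [train_rows[i] for i in row_idx]
    ys = [train_y[i] for i in row_idx]
    all_preds.append([knn_regress(rows, ys, q, cols) for q in test_rows])

# Final prediction: the average over all models for each test point.
ensemble_pred = [sum(column) / n_models for column in zip(*all_preds)]
ensemble_mse = mse(ensemble_pred, test_y)
print(f"ensemble test MSE: {ensemble_mse:.3f}")
```

For squared error, the averaged prediction can never do worse than the average of the individual models' errors, which is one way to see why the averaging helps.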
Obviously there is a scalability problem (training and scoring possibly hundreds of models). You'll also need to figure out the best settings for subsetting rows and columns.
This is a very common Kaggle technique used by Kaggle Masters, but I have used it only a couple of times in a business setting, mostly due to the scalability issue.
Use cross-validation to select hyper-parameters. Once they are determined, retrain your model with the whole training dataset and then evaluate it on an independent test dataset.
The purpose is to utilize as many training samples as possible since they are often scarce.
Always retain an independent test dataset that is never used for parameter tuning. Otherwise, your results will be over-optimistic.
It's better (in my view) to split the data into 80% train and 20% test, provided you have a large amount of data and holding out the 20% doesn't hurt your model. Or, even better, 60%/20%/20% for the train, cross-validation, and test sets. You train your model on the training set and then validate it on the cross-validation set. Once your model reaches its highest accuracy on the cross-validation set, you evaluate it on the test set to get the "real" accuracy.
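The 60/20/20 split itself could be sketched like this. The function name, the default fractions, and the integer stand-in data are illustrative assumptions; with real data you would split records of features and labels the same way.

```python
import random

def three_way_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of items and cut it into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

samples = list(range(1000))  # stand-in for real (features, label) records
train, val, test = three_way_split(samples)
print(len(train), len(val), len(test))  # 600 200 200
```

Model choices are then compared on `val` only, and `test` is touched once at the very end.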
I would agree with Li Shen. Create a measure relevant to your domain (or use the F score) and create a grid of hyperparameters. Run k-fold cross-validation over the grid. Choose the hyperparameter setting that gives you the best measure, and use those hyperparameters to train on the complete dataset.
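To make those steps concrete, here is a hedged sketch in plain Python: a small grid over two hyperparameters of a k-nearest-neighbour classifier, scored by the mean F1 over 5 folds, followed by a final fit on the complete training set. The classifier, the grid values, the synthetic imbalanced data, and all names are assumptions chosen for illustration; libraries such as scikit-learn offer ready-made grid search, but this sketch avoids depending on one.

```python
import random
from itertools import product

def knn_predict(train, x, k, threshold):
    """Predict 1 when the share of positives among the k nearest points exceeds threshold."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return 1 if sum(label for _, label in nearest) / k > threshold else 0

def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cv_f1(data, n_folds, k, threshold):
    """Mean F1 over n_folds folds for one hyperparameter setting."""
    scores = []
    for f in range(n_folds):
        val = data[f::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != f]
        preds = [knn_predict(train, x, k, threshold) for x, _ in val]
        scores.append(f1_score([t for _, t in val], preds))
    return sum(scores) / n_folds

# Synthetic imbalanced data (an assumption): 160 negatives, 40 positives.
rng = random.Random(4)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(160)] + \
       [(rng.gauss(2.5, 1.0), 1) for _ in range(40)]
rng.shuffle(data)
train_data, test_data = data[:150], data[150:]

# Grid of hyperparameters: neighbours k and the decision threshold.
grid = list(product([3, 7, 15], [0.3, 0.5]))

# Cross-validate every grid point on the training data only.
best_k, best_threshold = max(grid, key=lambda g: cv_f1(train_data, 5, *g))

# "Training" k-NN on the complete training set means using all of it as memory.
test_pred = [knn_predict(train_data, x, best_k, best_threshold)
             for x, _ in test_data]
test_f1 = f1_score([t for _, t in test_data], test_pred)
print(f"best k={best_k}, threshold={best_threshold}, test F1={test_f1:.3f}")
```

The winning fold models are never reused; only the selected hyperparameters carry over to the final fit.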