https://www.quora.com/I-train-my-system-based-on-the-10-fold-cross-validation-framework-Now-it-gives-me-10-different-models-Which-model-to-select-as-a-representative

I train my system based on the 10-fold cross-validation framework. Now it gives me 10 different models. Which model to select as a representative?

I am using the Random Forests technique to learn from data in a 10-fold cross-validation setup, and the performance measures look very encouraging. I now want to evaluate the model on an independent test set. But in the 10-fold cross-validation setup, 90% of the data is used for training and the remaining 10% as the test set in each fold, and this operation is repeated 10 times with mutually exclusive test folds.

Now, in the 10-fold cross-validation setting, I get 10 different models, one per fold. I want a single random forests model that I can evaluate on a different test set.

Should I be selecting one based on the error measures? Suppose the error measures for the folds are:
[20 21 23 27 30 21 24 25 28 26].
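
(For reference, here is a minimal sketch of the setup described above, assuming scikit-learn and a placeholder dataset; none of the names or numbers below come from the original question. It simply produces the 10 per-fold errors and their average, which is what the answers below suggest reporting.)

```python
# Illustrative only: 10-fold CV with a random forest, summarizing the per-fold errors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
fold_errors = 100 * (1 - scores)                               # classification error in %
print(fold_errors.round(1))                                    # 10 values, one per fold
print(f"mean error: {fold_errors.mean():.1f} +/- {fold_errors.std():.1f}")
# For the fold errors quoted above, [20 21 23 27 30 21 24 25 28 26], the mean is 24.5.
```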

Sebastian Raschka, Author of Python Machine Learning, researcher applying ML to computational bio.
If I understand correctly, you are using regular k-fold cross-validation, not nested k-fold cross-validation? Like you said, you train 10 "different" models, and each is evaluated on its test fold to calculate some form of performance metric (e.g., accuracy or classification error). What you are interested in is the average performance across those 10 test folds. Let me illustrate what I mean:

[figure: illustration of k-fold cross-validation]
After you are done with hyperparameter tuning, you don't pick any of the 10 models. You retrain the model on the complete training dataset and evaluate it on an independent test dataset (ideally). Once you have your generalization performance estimate from the independent test set, you can retrain on both the training and test sets and use that model in real-world applications, for example.
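
A minimal sketch of that workflow, assuming scikit-learn (the dataset, split, and variable names are placeholders, not from the answer): the 10 fold scores are only averaged for the performance estimate, and the model that is actually used afterwards is refit on the full training data.

```python
# Sketch: average the CV scores, then retrain on the complete training set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)

# 10-fold CV on the training data: report the average, don't pick one fold's model.
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Retrain on the complete training set and estimate generalization on the held-out test set.
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```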

For algorithm comparisons, I recommend nested cross-validation. Here, you basically have nested cross-validation loops. Let me draw the typical 5x2 cross-validation:

[figure: illustration of nested 5x2 cross-validation]
Since you didn't ask for it, I don't want to go into too much detail here, but if you are interested, I'll leave you a link with further info about this topic:

"Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" by Thomas Dietterich

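For completeness, a rough sketch of nested cross-validation, assuming scikit-learn: an inner 2-fold grid search tunes the hyperparameters while an outer 5-fold loop estimates generalization, loosely mirroring the 5x2 layout mentioned above (the parameter grid is only an example).

```python
# Sketch of nested cross-validation: hyperparameter tuning happens inside each outer fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data

inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},  # example grid
    cv=2,                                                              # inner loop
)
outer_scores = cross_val_score(inner, X, y, cv=5)                      # outer loop
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```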
Giuliano Janson, studied Statistics
The standard approach is to retrain on the whole dataset and use that one model to make predictions on the test set. That's usually a reasonable approach.
In a high-variance problem, though, you might want to use each model generated by one of your k folds not only to estimate the error on the k-th fold, but also to make predictions on the test set. You'd end up with k predictions for the test set that you would then average, and that might help reduce variance. Essentially you're using bagging (see the sketch below).
For high-variance problems you can take this process further, essentially creating a randomized process where you average tens or hundreds of predictions by randomly selecting observations to train the model on (with or without replacement) as well as features (and possibly randomizing the choice of hyperparameters within given ranges). In each iteration you'd predict the test data using a different subset of rows and columns. In certain cases this works very well.
Obviously there is a scalability problem (training and scoring possibly hundreds of models). You'll also need to figure out the optimal parameters for subsetting rows and columns.
This is a very common Kaggle technique used by Kaggle Masters, but I have used it only a couple of times in a business setting, mostly due to the scalability issue.
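A rough sketch of that fold-averaging idea, assuming scikit-learn (all names, the data, and the 0.5 threshold are illustrative): each of the k fold models also scores the external test set, and the k probability estimates are averaged.

```python
# Sketch: average the test-set predictions of the k models trained during CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

test_preds = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X_train):
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    # (the usual out-of-fold error on X_train[val_idx] can still be computed here)
    test_preds.append(model.predict_proba(X_test)[:, 1])      # each fold model scores the test set

avg_pred = np.mean(test_preds, axis=0)                        # bagging across the 10 fold models
accuracy = ((avg_pred > 0.5).astype(int) == y_test).mean()
print(f"averaged-fold test accuracy: {accuracy:.3f}")
```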
Li Shen, Machine Learning Jedi, Bioinformatician
Use cross-validation to select hyper-parameters. Once they are determined, retrain your model with the whole training dataset and then evaluate it on an independent test dataset.
The purpose is to utilize as many training samples as possible since they are often scarce.
Always retain an independent test dataset that is never used for parameter tuning. Otherwise, your results will be over-optimistic.
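A minimal sketch of this recipe, assuming scikit-learn: cross-validation on the training set selects the hyperparameters, the best setting is refit on all of the training data, and the independent test set is scored exactly once (the parameter grid is only an example).

```python
# Sketch: CV picks the hyperparameters, refit on the whole training set, then one test score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_features": ["sqrt", 0.5]},  # example grid
    cv=10,
    refit=True,                        # refit the best setting on all of X_train
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print(f"test accuracy: {search.score(X_test, y_test):.3f}")   # the test set is used only here
```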
Robert Griesmeyer, Implemented article recommendation systems at Flipboard and Zite.
It's better (in my view) to split up the training set into 80% train and 20% test, given that you have a large amount of data and taking out the 20% doesn't hurt your model. Or even better, 60%/20%/20%: train, cross-validation, and test. You train your model on the training set and then cross-validate with the cross-validation set. Once your model is at its highest accuracy on the cross-validation set, you evaluate it on the test set to get the "real" accuracy.
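A small sketch of the 60/20/20 split described above, assuming scikit-learn and treating the number of trees as the only tuned setting (purely illustrative):

```python
# Sketch: 60% train / 20% validation / 20% test via two chained splits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)    # placeholder data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_score, best_model = -1.0, None
for n_estimators in (100, 300, 500):                          # tune on the validation set
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

print(f"validation accuracy: {best_score:.3f}")
print(f"test accuracy:       {best_model.score(X_test, y_test):.3f}")   # the "real" accuracy
```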
Sonal Goyal, Learning to make machines learn
I would agree with Li Shen. Create your domain-relevant measure (or use the F-score) and create a grid of hyperparameters. Do cross-validation over the grid, choose the model that gives you the best measure, and use its hyperparameters to train on the complete dataset.
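
A short sketch of this suggestion, assuming scikit-learn: a domain-relevant measure (here the F1 score via make_scorer, as an example) drives the cross-validated grid search, and GridSearchCV's refit then trains the best hyperparameters on the complete dataset.

```python
# Sketch: grid search over hyperparameters, scored by an F measure.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)  # placeholder data

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "min_samples_leaf": [1, 5]},  # example grid
    scoring=make_scorer(f1_score),     # swap in any domain-relevant measure here
    cv=10,
)
search.fit(X, y)                       # refit=True (default) retrains the best setting on all data
print("best params:", search.best_params_)
print(f"best CV F1: {search.best_score_:.3f}")
```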