https://www.kaggle.com/c/titanic/discussion/6288
Hello,
Here is some MATLAB code to download the data and try random forests with k-fold cross-validation. Extensive tests on the number of trees and mtry suggest that the default parameters are fine and that the model is robust to changes in these hyperparameters (and to the k used for k-fold).
Two questions please:
1) Do you see why the performance is so different between the train and test sets? K-fold cross-validation suggests the model should get 0.82, but on the test set it gets 0.74, which is more than 3 SD below 0.82 (SD computed over 30 random repetitions). Does this sound likely, or would you guess it indicates a bug somewhere?
2) I'm trying to set the random generator so as to make my predictions deterministic, but something is going wrong. MATLAB's random generator seems to behave fine (namely, A=rand(3,2) produces the very same numbers again and again), but overall the model still produces random predictions. I suspect the mex files don't rely on the random generator set by MATLAB. Do you see a way to deal with this issue?
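To make both questions concrete, here is a minimal sketch of the two checks I have in mind. The vector cvAcc is a made-up placeholder for the 30 repeated k-fold accuracies (I am not pasting the real values here), and the seed values are arbitrary; only 0.74 comes from the numbers above.

```matlab
% Sketch only. cvAcc is a PLACEHOLDER for the 30 repeated k-fold accuracies;
% the values generated here are made up and should be replaced by the real ones.
rng(0);                                 % fix MATLAB's generator for the placeholder
cvAcc   = 0.82 + 0.02 * randn(30, 1);   % hypothetical CV accuracies
testAcc = 0.74;                         % test-set score quoted in question 1

% Question 1: gap between the CV mean and the test score, in units of the CV SD.
gapInSD = (mean(cvAcc) - testAcc) / std(cvAcc);
fprintf('CV mean %.3f, SD %.3f, test score is %.1f SD below the mean\n', ...
        mean(cvAcc), std(cvAcc), gapInSD);

% Question 2: resetting the seed reproduces MATLAB's own draws exactly.
rng(42); A = rand(3, 2);
rng(42); B = rand(3, 2);
assert(isequal(A, B), 'MATLAB draws differ after reseeding');
```

The second check passes on my side, which is why I suspect the randomness comes from outside MATLAB. If the mex code draws its numbers from the C library rather than from MATLAB (a guess, I have not verified it in the source), then rng would have no effect on it and the seed would have to be set in the C code itself.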
Regards,
PS:
The train.csv and test.csv files are assumed to be in a folder "data", with all *.m files in its parent folder. In addition to the first three attached files (get_titanic... and pred.m), one needs to get the mex and m files available from:
https://code.google.com/p/randomforest-matlab/downloads/list
(if your system matches mine, you can probably just use the attached files)
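A quick way to check the layout before running anything is sketched below. It assumes the current folder is the one holding the *.m files, and that the downloaded package provides a training function named classRF_train (that name is my assumption about the package; adjust it if your copy differs).

```matlab
% Sketch: verify the expected layout from the folder that holds the *.m files.
needed = {fullfile('data', 'train.csv'), fullfile('data', 'test.csv')};
for i = 1:numel(needed)
    if exist(needed{i}, 'file') ~= 2
        warning('Missing file: %s', needed{i});
    end
end

% classRF_train is an ASSUMED function name for the random forest package.
if exist('classRF_train', 'file') == 0
    warning('Random forest m/mex files do not seem to be on the MATLAB path.');
end
```

It only prints warnings, so it is safe to run even if something is missing.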