https://www.mathworks.com/help/stats/clustering.evaluation.silhouetteevaluation-class.html

Silhouette Value

The silhouette value for each point is a measure of how similar that point is to points in its own cluster, when compared to points in other clusters. The silhouette value for the ith point, S_i, is defined as

S_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the average distance from the ith point to the other points in the same cluster as i, and b_i is the minimum average distance from the ith point to points in a different cluster, minimized over clusters.

The silhouette value ranges from -1 to +1. A high silhouette value indicates that i is well-matched to its own cluster, and poorly-matched to neighboring clusters. If most points have a high silhouette value, then the clustering solution is appropriate. If many points have a low or negative silhouette value, then the clustering solution may have either too many or too few clusters. The silhouette clustering evaluation criterion can be used with any distance metric.
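As a rough illustration of how this criterion behaves in practice, here is a hedged sketch in Python with scikit-learn rather than the MATLAB class documented above; the dataset and cluster counts are placeholders.

```python
# Compare silhouette values for different numbers of clusters (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)      # S_i = (b_i - a_i) / max(a_i, b_i) per point
    print(k,
          silhouette_score(X, labels),     # mean silhouette value over all points
          np.sum(s < 0))                   # points that may sit in the wrong cluster
```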

 

Posted by uniqueone
,
http://sebastianraschka.com/faq/docs/evaluate-a-model.html

Machine Learning FAQ

Index

How do I evaluate a model?

The short answer is to keep an independent test set for your final model – this has to be data that your model hasn’t seen before.

However, it all depends on your goal & approach.

Scenario 1:

  • Just train a simple model.

Split the dataset into separate training and test sets. Train the model on the training set and evaluate it on the test set (by “evaluate” I mean calculating performance metrics such as the error, precision, recall, ROC AUC, etc.)
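A minimal sketch of Scenario 1, assuming scikit-learn and a built-in dataset (neither is prescribed by the original answer):

```python
# Scenario 1 sketch: single train/test split, then report metrics on the held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # train on the training set
y_pred = model.predict(X_test)                                    # evaluate on the test set
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```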

Scenario 2:

  • Train a model and tune (optimize) its hyperparameters.

Split the dataset into separate training and test sets. Use techniques such as k-fold cross-validation on the training set to find the “optimal” set of hyperparameters for your model. Once you are done with hyperparameter tuning, use the independent test set to get an unbiased estimate of its performance. Below I inserted a figure to illustrate the difference:

The first row refers to “Scenario 1”, and the third row describes a more “classic” approach where you further split your training data into a training subset and a validation set. Then, you train your model on the training subset and evaluate it on the validation set to optimize its hyperparameters, for example. Eventually, you test it on the independent test set. The fourth row describes the “superior” (more unbiased) approach using k-fold cross-validation as described in “Scenario 2.”

Also, let me attach an overview of k-fold cross validation in case you are not familiar with it, yet:

(Here: E = prediction error, but you can also substitute it by precision, recall, f1-score, ROC auc or whatever metric you prefer for the given task.)
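A hedged sketch of Scenario 2, assuming scikit-learn: GridSearchCV performs the k-fold cross-validation on the training set, and the held-out test set is touched only once at the end. The estimator, grid, and dataset are placeholders.

```python
# Scenario 2 sketch: k-fold CV on the training set for tuning, independent test set for the final estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}, cv=10)
grid.fit(X_train, y_train)                  # 10-fold CV on the training set picks the hyperparameters
print(grid.best_params_, grid.best_score_)  # CV estimate (used for selection, somewhat optimistic)
print(grid.score(X_test, y_test))           # unbiased estimate on data never used for tuning
```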

Scenario 3:

  • Build different models and compare different algorithms (e.g., SVM vs. logistic regression vs. Random Forests, etc.).

Here, we’d want to use nested cross-validation. In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds, and an inner loop is used to select the model via k-fold cross-validation on the training fold. After model selection, the test fold is then used to evaluate the model performance. After we have identified our “favorite” algorithm, we can follow up with a “regular” k-fold cross-validation approach (on the complete training set) to find its “optimal” hyperparameters and evaluate it on the independent test set. Let’s consider a logistic regression model to make this clearer: Using nested cross-validation, you will train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then, you proceed with the next algorithm, e.g., an SVM, etc.
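A minimal nested cross-validation sketch along these lines, assuming scikit-learn; the choice of estimators, grids, and fold counts is illustrative, not from the original answer.

```python
# Nested cross-validation sketch: the outer loop estimates generalization performance,
# the inner GridSearchCV tunes hyperparameters on each outer training fold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for name, est, grid in [
    ("logreg", LogisticRegression(max_iter=5000), {"C": [0.01, 0.1, 1, 10]}),
    ("svm", SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
]:
    inner = GridSearchCV(est, grid, cv=2)               # inner loop: model selection
    outer_scores = cross_val_score(inner, X, y, cv=5)   # outer loop: m = 5 performance estimates
    print(name, outer_scores.mean(), outer_scores.std())
```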

 

Posted by uniqueone
,
http://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html

Model evaluation, model selection, and algorithm selection in machine learning

Part III - Cross-validation and hyperparameter tuning

Introduction

Almost every machine learning algorithm comes with a large number of settings that we, the machine learning researchers and practitioners, need to specify. These tuning knobs, the so-called hyperparameters, help us control the behavior of machine learning algorithms when optimizing for performance, finding the right balance between bias and variance. Hyperparameter tuning for performance optimization is an art in itself, and there are no hard-and-fast rules that guarantee best performance on a given dataset. In Part I and Part II, we saw different holdout and bootstrap techniques for estimating the generalization performance of a model. We learned about the bias-variance trade-off, and we computed the uncertainty of our estimates. In this third part, we will focus on different methods of cross-validation for model evaluation and model selection. We will use these cross-validation techniques to rank models from several hyperparameter configurations and estimate how well they generalize to independent datasets.

About Hyperparameters and Model Selection

Previously, we used the holdout method or different flavors of bootstrapping to estimate the generalization performance of our predictive models. We split our dataset into two parts: a training and a test dataset. After the machine learning algorithm fit a model to the training set, we evaluated it on the independent test set that we withheld from the machine learning algorithm during model fitting. While we were discussing challenges such as the bias-variance trade-off, we used fixed hyperparameter settings in our learning algorithms, such as the value of k in the k-nearest neighbors algorithm. We defined hyperparameters as the parameters of the learning algorithm itself, which we have to specify a priori — before model fitting. In contrast, we referred to the parameters of our resulting model as the model parameters.

So, what are hyperparameters, exactly? Considering the k-nearest neighbors algorithm, one example of a hyperparameter is the integer value of k. If we set k=3, the k-nearest neighbors algorithm will predict a class label based on a majority vote among the 3-nearest neighbors in the training set. The distance metric for finding these nearest neighbors is yet another hyperparameter of the algorithm.

KNN example
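A minimal sketch of these two hyperparameters in code, assuming scikit-learn (which the article does not prescribe):

```python
# Hyperparameters in k-NN: both k and the distance metric are set a priori, before "fitting".
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # k=3, Euclidean distance: hyperparameters
print(cross_val_score(knn, X, y, cv=5).mean())                 # the "model" itself just memorizes X, y
```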

Now, the k-nearest neighbors algorithm may not be an ideal choice for illustrating the difference between hyperparameters and model parameters, since it is a lazy learner and a nonparametric method. In this context, lazy learning (or instance-based learning) means that there is no training or model fitting stage: a k-nearest neighbors model literally stores or memorizes the training data and uses it only at prediction time. Thus, each training instance represents a parameter in a k-nearest neighbors model. In short, nonparametric models are models that cannot be described by a fixed number of parameters that are adjusted to the training set. The structure of parametric models, in contrast, is not decided by the training data but is set a priori; unlike parametric methods, nonparametric models do not assume that the data follows certain probability distributions (exceptions are Bayesian nonparametric methods, which do make such assumptions). Hence, we may say that nonparametric methods make fewer assumptions about the data than parametric methods.

In contrast to k-nearest neighbors, a simple example of a parametric method would be logistic regression, a generalized linear model with a fixed number of model parameters: a weight coefficient for each feature variable in the dataset plus a bias (or intercept) unit. These weight coefficients in logistic regression, the model parameters, are updated by maximizing a log-likelihood function or minimizing the logistic cost. For fitting a model to the training data, a hyperparameter of a logistic regression algorithm could be the number of iterations or passes over the training set (epochs) in a gradient-based optimization. Another example of a hyperparameter would be the value of a regularization parameter such as the lambda-term in L2-regularized logistic regression:

Logistic Regression
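A hedged sketch of the same idea in scikit-learn (not part of the original article); note that scikit-learn exposes the regularization strength as C = 1/lambda rather than lambda:

```python
# L2-regularized logistic regression: the regularization strength is a hyperparameter,
# while the learned weight coefficients and intercept are the model parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=5000)  # C (and max_iter) are hyperparameters
clf.fit(X, y)
print(clf.coef_.shape, clf.intercept_)                        # model parameters learned from the data
```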

Changing the hyperparameter values when running a learning algorithm over a training set may result in different models. The process of finding the best-performing model from a set of models that were produced by different hyperparameter settings is called model selection. In the next section, we will look at an extension to the holdout method that helps us with this selection process.

The Three-Way Holdout Method for Hyperparameter Tuning

In Part I, we learned that resubstitution validation is a bad approach for estimating the generalization performance. Since we want to know how well our model generalizes to new data, we used the holdout method to split the dataset into two parts, a training set and an independent test set. Can we use the holdout method for hyperparameter tuning? The answer is “yes!” However, we have to make a slight modification to our initial approach, the “two-way” split, and split the dataset into three parts: a training, a validation, and a test set.

We can regard the process of hyperparameter tuning and model selection as a meta-optimization task. While the learning algorithm optimizes an objective function on the training set (with the exception of lazy learners), hyperparameter optimization is yet another task on top of it; here, we typically want to optimize a performance metric such as classification accuracy or the area under a Receiver Operating Characteristic curve. After the tuning stage, selecting a model based on the test set performance seems to be a reasonable approach. However, reusing the test set multiple times would introduce a bias into our final performance estimate and likely result in overly optimistic estimates of the generalization performance — we can say that “the test set leaks information.” To avoid this problem, we could use a three-way split, dividing the dataset into a training, validation, and test dataset. Having a training-validation pair for hyperparameter tuning and model selection allows us to keep the test set “independent” for model evaluation. Now, remember our discussion of the “3 goals” of performance estimation?

  1. We want to estimate the generalization accuracy, the predictive performance of our model on future (unseen) data.
  2. We want to increase the predictive performance by tweaking the learning algorithm and selecting the best-performing model from a given hypothesis space.
  3. We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best-performing model from the algorithm’s hypothesis space.

The “three-way holdout method” is one way to tackle points 1 and 2 (more on point 3 in the next article, Part IV). However, if we are only interested in point 2, selecting the best model, and do not care so much about an “unbiased” estimate of the generalization performance, we could stick to the two-way split for model selection. Thinking back to our discussion about learning curves and pessimistic biases in Part II, we noted that a machine learning algorithm often benefits from more labeled data; the smaller the dataset, the higher the pessimistic bias and the variance — the sensitivity of our model towards the way we partition the data.

“There ain’t no such thing as a free lunch.” The three-way holdout method for hyperparameter tuning and model selection is not the only — and certainly often not the best — way to approach this task. In later sections, we will learn about alternative methods and discuss their advantages and trade-offs. However, before we move on to the probably most popular method for model selection, k-fold cross-validation (or sometimes also called “rotation estimation” in older literature), let us have a look at an illustration of the 3-way split holdout method:

holdout-validation step 1

Since there’s a lot going on in this figure, let’s walk through it step by step.

holdout-validation step 1

We start by splitting our dataset into three parts, a training set for model fitting, a validation set for model selection, and a test set for the final evaluation of the selected model.

holdout-validation step 2

The second step illustrates the hyperparameter tuning stage. We use the learning algorithm with different hyperparameter settings (here: three) to fit models to the training data.

holdout-validation step 3

Next, we evaluate the performance of our models on the validation set. This step illustrates the model selection stage; after comparing the performance estimates, we choose the hyperparameter settings associated with the best performance. Note that we often merge steps two and three in practice: we fit a model and compute its performance before moving on to the next model in order to avoid keeping all fitted models in memory.

holdout-validation step 4

As discussed in Part I and Part II, our estimates may suffer from pessimistic bias if the training set is too small. Thus, we can merge the training and validation set after model selection and use the best hyperparameter settings from the previous step to fit a model to this larger dataset.

holdout-validation step 5

Now, we can use the independent test set to estimate the generalization performance of our model. Remember that the purpose of the test set is to simulate new data that the model has not seen before. Re-using this test set may result in an overoptimistic bias in our estimate of the model’s generalization performance.

holdout-validation step 6

Finally, we can make use of all our data — merging training and test set — and fit a model to all data points for real-world use.
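A minimal sketch of this three-way holdout workflow, assuming scikit-learn and an SVM with C as the single hyperparameter being tuned (both are illustrative assumptions):

```python
# Three-way holdout sketch: train/validation split for tuning, untouched test set for the final estimate.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=1)

# Steps 2-3: fit one model per hyperparameter setting, pick the best on the validation set.
settings = [0.1, 1.0, 10.0]
val_scores = [SVC(C=C).fit(X_train, y_train).score(X_val, y_val) for C in settings]
best_C = settings[int(np.argmax(val_scores))]

# Step 4: refit with the best setting on training + validation data.
final = SVC(C=best_C).fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(best_C, final.score(X_test, y_test))       # Step 5: estimate generalization performance once
final_for_deployment = SVC(C=best_C).fit(X, y)   # Step 6: fit to all data for real-world use
```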

Introduction to K-fold Cross-Validation

It’s about time to introduce the probably most common technique for model evaluation and model selection in machine learning practice: k-fold cross-validation. The term cross-validation is used loosely in the literature, where practitioners and researchers sometimes refer to the train/test holdout method as a cross-validation technique. However, it might make more sense to think of cross-validation as a crossing over of training and validation stages in successive rounds. Here, the main idea behind cross-validation is that each sample in our dataset has the opportunity of being tested. K-fold cross-validation is a special case of cross-validation where we iterate over a dataset k times. In each round, we split the dataset into k parts: one part is used for validation, and the remaining k-1 parts are merged into a training subset for model evaluation, as shown in the figure below, which illustrates the process of 5-fold cross-validation:

Just as in the “two-way” holdout method, we use a learning algorithm with fixed hyperparameter settings to fit models to the training folds in each iteration — if we use the k-fold cross-validation method for model evaluation. In 5-fold cross-validation, this procedure will result in five different fitted models; these models were fit to distinct yet partly overlapping training sets and evaluated on non-overlapping validation sets. Eventually, we compute the cross-validation performance as the arithmetic mean over the k performance estimates from the validation sets.
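A minimal 5-fold cross-validation sketch for model evaluation with fixed hyperparameters, assuming scikit-learn; the stratified variant is used here, which the text does not require.

```python
# 5-fold cross-validation for model evaluation: five models are fit on partly overlapping
# training folds and scored on non-overlapping validation folds.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=1).split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
print(np.mean(scores))  # cross-validation performance = arithmetic mean of the k estimates
```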

We saw the main difference between the “two-way” holdout method and k-fold cross validation: k-fold cross-validation uses all data for training and testing. The idea behind this approach is to reduce the pessimistic bias by using more training data in contrast to setting aside a relatively large portion of the dataset as test data. And in contrast to the repeated holdout method, which we discussed in Part II, test folds in k-fold cross-validation are not overlapping. In repeated holdout, the repeated use of samples for testing results in performance estimates that become dependent between rounds; this dependence can be problematic for statistical comparisons, which we will discuss in Part IV. Also, k-fold cross-validation guarantees that each sample is used for validation in contrast to the repeated holdout-method, where some samples may never be part of the test set.

In this section, we introduced k-fold cross-validation for model evaluation. In practice, however, k-fold cross-validation is more commonly used for model selection or algorithm selection. K-fold cross-validation for model selection is a topic that we will cover later in this article, and we will talk about algorithm selection in detail throughout the next article, Part IV.

Special Cases: 2-Fold and Leave-One-Out Cross-Validation

At this point, you may wonder why we chose k=5 to illustrate k-fold cross-validation in the previous section. One reason is that it makes it easier to illustrate k-fold cross-validation compactly. Moreover, k=5 is also a common choice in practice, since it is computationally less expensive compared to larger values of k. If k is too small, though, we may increase the pessimistic bias of our estimate (since less training data is available for model fitting), and the variance of our estimate may increase as well since the model is more sensitive to how we split the data (later, we will discuss experiments that suggest k=10 as a good choice for k).

In fact, there are two prominent, special cases of k-fold cross-validation: k=2 and k=n. Most literature describes 2-fold cross-validation as being equal to the holdout method. However, this statement would only be true if we performed the holdout method by rotating the training and validation set in two rounds (i.e., using exactly 50% of the data for training and 50% of the samples for validation in each round, swapping these sets, repeating the training and evaluation procedure, and eventually computing the performance estimate as the arithmetic mean of the two performance estimates on the validation sets). Given how the holdout method is most commonly used, though, I like to describe the holdout method and 2-fold cross-validation as two different processes, as illustrated in the figure below:

Holdout Method vs 2-Fold cross-validation

Now, if we set k=n, that is, if we set the number of folds equal to the number of training instances, we refer to the k-fold cross-validation process as leave-one-out cross-validation (LOOCV). In each iteration during LOOCV, we fit a model to n-1 samples of the dataset and evaluate it on the single remaining data point. Although this process is computationally expensive, given that we have n iterations, it can be useful for small datasets, cases where withholding data from the training set would be too wasteful.

Leave-One-Out Cross-Validation LOOCV
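A minimal LOOCV sketch, assuming scikit-learn and the 150-sample Iris dataset as a stand-in for a small dataset:

```python
# LOOCV sketch: k = n, so each of the n models is evaluated on a single held-out point.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # n = 150 fits; the mean of the 0-1 scores is the LOOCV accuracy
```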

Several studies compared different values of k in k-fold cross-validation, analyzing how the choice of k affects the variance and the bias of the estimate. Unfortunately, there is no free lunch, though, as shown by Yoshua Bengio and Yves Grandvalet in “No unbiased estimator of the variance of k-fold cross-validation.”

The main theorem shows that there exists no universal (valid under all distributions) unbiased estimator of the variance of K-fold cross-validation. (Bengio and Grandvalet, 2004)

However, we may still be interested in finding a “sweet spot,” a value that seems to be a good trade-off between variance and bias in most cases, and we will continue the bias-variance trade-off discussion in the next section. For now, let’s conclude this section by looking at an interesting research project where Hawkins and others compared performance estimates via LOOCV to the holdout method and recommend the LOOCV over the latter — if computationally feasible.

[…] where available sample sizes are modest, holding back compounds for model testing is ill-advised. This fragmentation of the sample harms the calibration and does not give a trustworthy assessment of fit anyway. It is better to use all data for the calibration step and check the fit by cross-validation, making sure that the cross-validation is carried out correctly. […] The only motivation to rely on the holdout sample rather than cross-validation would be if there was reason to think the cross-validation not trustworthy — biased or highly variable. But neither theoretical results nor the empiric results sketched here give any reason to disbelieve the cross-validation results. (Hawkins and others, 2003)

These conclusions are partly based on the experiments carried out in this study using a 469-sample dataset. The following table summarizes the findings in a comparison of different Ridge Regression models:

experiment          mean    standard deviation
true R2 — q2        0.010   0.149
true R2 — hold 50   0.028   0.184
true R2 — hold 20   0.055   0.305
true R2 — hold 10   0.123   0.504

In rows 1-4, Hawkins and others used 100-sample training sets to compare different methods of model evaluation. The first row corresponds to an experiment where the researchers used LOOCV and fit regression models to 100-sample training subsets. The reported “mean” refers to the averaged difference between the true coefficients of determination and the coefficients of determination obtained via LOOCV (here called q2) after repeating this procedure on different 100-sample training sets. In rows 2-4, the researchers used the holdout method for fitting models to 100-sample training sets, and they evaluated the performances on holdout sets of sizes 10, 20, and 50 samples. Each experiment was repeated 75 times, and the mean column shows the average difference between the estimated R2 and true R2 values. As we can see, the estimate obtained via LOOCV (q2) is closest to the true R2. The estimates obtained from the 50-sample holdout sets are also passable, though. Based on these particular experiments, I agree with the researchers’ conclusion:

Taking the third of these points, if you have 150 or more compounds available, then you can certainly make a random split into 100 for calibration and 50 or more for testing. However it is hard to see why you would want to do this.

One reason why we may prefer the holdout method is concern about computational efficiency, if our dataset is sufficiently large. As a rule of thumb, we can say that the pessimistic bias and large variance concerns are less problematic the larger the dataset. Moreover, it is not uncommon to repeat the k-fold cross-validation procedure with different random seeds in the hope of obtaining a “more robust” estimate. For instance, if we repeated a 5-fold cross-validation run 100 times, we would compute the performance estimates for 500 test folds and report the cross-validation performance as the arithmetic mean of these 500 folds. (Although this is commonly done in practice, we note that the test folds are now overlapping.) However, there’s no point in repeating LOOCV, since LOOCV always produces the same splits.
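A hedged sketch of such a repeated procedure, assuming scikit-learn's RepeatedStratifiedKFold; the dataset and model are placeholders.

```python
# Repeated 5-fold CV sketch: different random partitions in each repetition (the test folds now
# overlap across repetitions); repeating LOOCV would be pointless since its splits never change.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=1)   # 500 test folds in total
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(len(scores), scores.mean())
```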

K and the Bias-variance Trade-off

Based on the experimental evidence that we saw in the previous section, we may prefer LOOCV over single train/test splits via the holdout method for small and moderately sized datasets. In addition, we can think of the LOOCV estimate as being approximately unbiased: the pessimistic bias of LOOCV (k=n) is intuitively lower compared to k<n-fold cross-validation, since almost all (namely, n-1) training samples are available for model fitting.

While LOOCV is almost unbiased, one downside of using LOOCV over k-fold cross-validation with k<n is the large variance of the LOOCV estimate. First, we have to note that LOOCV is problematic when using a discontinuous loss function such as the 0-1 loss in classification, or even with continuous loss functions such as the mean squared error. It is often said that LOOCV

… [LOOCV has] high variance because the test set only contains one sample. (Tan and others, 2005)

… [LOOCV] is highly variable, since it is based upon a single observation (x1, y1). (Gareth and others, 2013)

These statements are certainly true if we refer to the variance between folds. Remember that if we use the 0-1 loss function (the prediction is either correct or not), we could consider each prediction as a Bernoulli trial, and the number of correct predictions X follows a binomial distribution X ~ B(n, p), where n ∈ N and p ∈ [0, 1]; the variance of a binomial distribution is defined as σ² = np(1 − p).

We can estimate the variability of a statistic (here: the performance of the model) from the variability of that statistic between subsamples. Obviously though, the variance between folds is a poor estimate of the variance of the LOOCV estimate — the variability due to randomness in our training data. Now, when we are talking about the variance of LOOCV, we typically mean the difference in the results that we would get if we repeated the resampling procedure multiple times on different data samples from the underlying distribution. Thus, a more interesting point has been made by Hastie, Tibshirani, and Friedman:

With K = N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another. (Hastie and others, 2009)

Or in other words, we can attribute the high variance to the well-known fact that the mean of highly correlated variables has a higher variance than the mean of variables that are not highly correlated. Maybe this can intuitively be explained by looking at the relationship between covariance (cov) and variance (σ²):

cov(X, X) = σ²_X

Proof: let μ = E(X); then cov(X, X) = E[(X − μ)²] = σ²_X.

And the relationship between the covariance cov(X, Y) and the correlation ρ(X, Y) (X and Y are random variables) is defined as

cov(X, Y) = ρ(X, Y) · σ_X · σ_Y,

where

cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

and

ρ(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X · σ_Y).

The large variance that is often associated with LOOCV has also been observed in empirical studies — for example, I really recommend reading the excellent paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995) by Ron Kohavi.

Now that we have established that LOOCV estimates are generally associated with a large variance and a small bias, how does this method compare to k-fold cross-validation with other choices for k and the bootstrap method? In Part II, we mentioned the pessimistic bias of the standard bootstrap method, where the training set asymptotically contains only 0.632 of the samples from the original dataset; 2- or 3-fold cross-validation has about the same problem. We discussed the 0.632 bootstrap that was designed to address this pessimistic bias issue. However, Kohavi also observed in his experiments (Kohavi, 1995) that the bias in the bootstrap was still extremely large for certain real-world datasets (now, optimistically biased) compared to k-fold cross-validation. Eventually, Kohavi’s experiments on various real-world datasets suggest that 10-fold cross-validation offers the best trade-off between bias and variance. Furthermore, other researchers found that repeating k-fold cross-validation can increase the precision of the estimates while still maintaining a small bias (Molinaro and others, 2005; Kim, 2009).

Before moving on to model selection, let’s summarize this discussion of the bias-variance trade-off by listing the general trends when increasing the number of folds or k:

  • the bias of the performance estimator decreases (more accurate)
  • the variance of the performance estimator increases (more variability)
  • computational cost increases (more iterations, larger training sets during fitting)
  • exception: decreasing the value of k in k-fold cross-validation to small values (e.g., 2 or 3) also increases the variance on small datasets due to random sampling effects.

Model Selection via K-fold Cross-validation

Previously, we used k-fold cross-validation for model evaluation. Now, we are going to take things further and use the k-fold cross-validation method for model selection. Again, the key idea is to keep an independent test dataset, which we withhold during training and model selection, to avoid leaking test data into the training stage:

k-fold model selection figure

Although the figure above might seem somewhat complicated at first glance, the process is quite simple and similar to the “three-way holdout” workflow that we discussed at the beginning of this article. Let’s walk through it step by step.

k-fold model selection step 1

Similar to the holdout method, we split our dataset into two parts, a training and an independent test set; we tuck away the test set for the final model evaluation step at the end.

k-fold model selection step 2

In the second step, we can now experiment with various hyperparameter settings; we could use Bayesian Optimization, Randomized Search, or plain old Grid Search. For each hyperparameter configuration, we apply the k-fold cross-validation on the training set, resulting in multiple models and performance estimates.

k-fold model selection step 3

Taking the hyperparameter settings that correspond to the best-performing model, we can then use the complete training set for model fitting.

k-fold model selection step 4

Now it’s time to make use of the independent test set that we withheld; we use this test set to evaluate the model that we obtained from step 3.

k-fold model selection step 5

Finally, when we completed the evaluation stage, we can fit a model to all our data, which could be the model for (the so-called) deployment.
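A compact sketch of these five steps, assuming scikit-learn; the dataset, estimator, and grid are placeholders.

```python
# Sketch of the five steps above: hold out a test set, select hyperparameters by k-fold CV on the
# training set, refit on the full training set, evaluate once on the test set, then refit on all data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)  # step 1

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=10)  # step 2
search.fit(X_train, y_train)         # step 3: refit=True retrains the best setting on the whole training set
print(search.score(X_test, y_test))  # step 4: one-shot evaluation on the independent test set
deployed = SVC(**search.best_params_).fit(X, y)  # step 5: fit to all data for deployment
```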

When we browse the deep learning literature, we often find that the three-way holdout method is the method of choice when it comes to model evaluation; it is also common in older (non-deep-learning) literature. As mentioned earlier, the three-way holdout may be preferred over k-fold cross-validation since the former is computationally cheap in comparison. Aside from computational efficiency concerns, we typically only use deep learning algorithms when we have relatively large sample sizes anyway, scenarios where we don’t have to worry so much about high variance — due to sensitivity of our estimates towards how we split the dataset for training, validation, and testing.

Note that if we normalize data or select features, we typically perform these operations inside the k-fold cross-validation loop, in contrast to applying these steps to the whole dataset upfront before splitting the data into folds. Feature selection inside the cross-validation loop reduces the bias due to overfitting, since we avoid peeking at the test fold information during the training stage. On the flip side, feature selection inside the cross-validation loop may lead to an overly pessimistic estimate, since less data is available for training. For a more detailed discussion of this topic, feature selection inside or outside the cross-validation loop, I recommend reading Refaeilzadeh's "On comparison of feature selection algorithms" (Refaeilzadeh and others, 2007).
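A minimal sketch of this practice, assuming scikit-learn: wrapping scaling and univariate feature selection in a Pipeline ensures both are refit inside each cross-validation fold rather than on the whole dataset upfront.

```python
# Scaling and feature selection inside the CV loop: each step is fit on the training fold only,
# so no information from the corresponding validation fold leaks into the model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=10).mean())
```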

The Law of Parsimony

Now that we discussed model selection in the previous section, let us take a moment and consider the Law of Parsimony aka Occam’s Razor:

Among competing hypotheses, the one with the fewest assumptions should be selected.

Or to say it with other words, using one of my favorite quotes:

“Everything should be made as simple as possible, but not simpler.” — Albert Einstein

In model selection practice, we can apply Occam’s razor using the one-standard-error method as follows (a short code sketch follows the list):

  1. Consider the numerically optimal estimate and its standard error.
  2. Select the model whose performance is within one standard error of the value obtained in step 1 (Breiman and others, 1984).
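A minimal sketch of the one-standard-error method on hypothetical cross-validation scores; the numbers below are made up for illustration.

```python
# One-standard-error method: among models whose mean CV score lies within one standard error of the
# best mean, pick the simplest (here: the smallest C, i.e. the most strongly regularized model).
import numpy as np

cv_scores = {                     # hypothetical 5-fold CV accuracies per hyperparameter value
    0.01: [0.80, 0.82, 0.79, 0.81, 0.80],
    0.1:  [0.88, 0.90, 0.87, 0.89, 0.88],
    1.0:  [0.90, 0.92, 0.88, 0.91, 0.89],
}
means = {C: np.mean(s) for C, s in cv_scores.items()}
sems = {C: np.std(s, ddof=1) / np.sqrt(len(s)) for C, s in cv_scores.items()}
best_C = max(means, key=means.get)
threshold = means[best_C] - sems[best_C]                 # step 1: best estimate minus its standard error
candidates = [C for C in means if means[C] >= threshold]  # step 2: models within one standard error
print(min(candidates))                                    # prefer the simplest of the candidates
```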

Although we may prefer simpler models for several reasons, Pedro Domingos made a good point regarding the performance of “complex” models. Here’s an excerpt from his recent article, “Ten Myths About Machine Learning:”

Simpler models are more accurate. This belief is sometimes equated with Occam’s razor, but the razor only says that simpler explanations are preferable, not why. They’re preferable because they’re easier to understand, remember, and reason with. Sometimes the simplest hypothesis consistent with the data is less accurate for prediction than a more complicated one. Some of the most powerful learning algorithms output models that seem gratuitously elaborate — sometimes even continuing to add to them after they’ve perfectly fit the data — but that’s how they beat the less powerful ones.

Again, there are several reasons why we may prefer a simpler model if its performance is within a certain, acceptable range — for example, using the one-standard error method. Although a simpler model may not be the most “accurate” one, it may be computationally more efficient, easier to implement, and easier to understand and reason with compared to more complicated alternatives.

To see how the one-standard-error method works in practice, let us apply it to a simple toy dataset: 300 data points arranged in concentric circles, with a uniform class distribution (150 samples from class 1 and 150 samples from class 2). First, we split the dataset into two parts, 70% training data and 30% test data, using stratification to maintain equal class proportions. The 210 samples from the training dataset are shown below:

circles dataset figure

Say we want to optimize the Gamma hyperparameter of a Support Vector Machine (SVM) with a non-linear Radial Basis Function-kernel (RBF-kernel), where γ is the free parameter of the Gaussian RBF:

K(x_i, x_j) = exp(−γ ||x_i − x_j||²), γ > 0

(Intuitively, we can think of the Gamma as a parameter that controls the influence of single training samples on the decision boundary.)

When I ran the RBF-kernel SVM algorithm with different Gamma values over the training set, using stratified 10-fold cross-validation, I obtained the following performance estimates, where the error bars are the standard errors of the cross-validation estimates:

circles dataset gamma tuning figure

(The code for producing the plots shown in this article can be found in this Jupyter Notebook on GitHub.)
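For readers who do not want to open the notebook, here is a hedged re-creation of the experiment in scikit-learn; the exact noise level, random seeds, and gamma grid of the original notebook may differ.

```python
# Concentric-circles data, stratified 70/30 split, and stratified 10-fold CV on the training set
# for several gamma values of an RBF-kernel SVM.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=1)   # assumed noise/factor values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for gamma in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma), X_train, y_train, cv=cv)
    sem = scores.std(ddof=1) / np.sqrt(len(scores))     # standard error used for the error bars
    print(gamma, round(scores.mean(), 3), round(sem, 3))
```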

We can see that gamma values between 0.1 and 100 resulted in a prediction accuracy of 80% or more. Furthermore, we can see that γ = 10.0 results in a fairly complex decision boundary, and γ = 0.001 results in a decision boundary that is too simple to separate the two classes. In fact, γ = 0.1 seems like a good trade-off between the two aforementioned models — the performance of the corresponding model falls within one standard error of the best-performing model with γ = 10 or γ = 1000.

Summary and conclusion

There are many ways for evaluating the generalization performance of predictive models. So far, we have seen the holdout method, different flavors of the bootstrap approach, and k-fold cross-validation. In my opinion, the holdout method is absolutely fine for model evaluation when working with relatively large sample sizes. If we are into hyperparameter tuning, we may prefer 10-fold cross-validation, and Leave-One-Out cross-validation is a good option if we are working with small sample sizes. When it comes to model selection, again, the “three-way” holdout method may be a good choice due to computational limitations; a good alternative is k-fold cross-validation with an independent test set. An even better method for model selection or algorithm selection is nested cross-validation, a method that we will discuss in Part IV.

What’s Next

In the next part of this series, we will discuss hypothesis tests and methods for algorithm selection in more detail.

Say we want to hire a stock market analyst. To find a good stock market analyst, let’s assume we asked our candidates to predict whether certain stock prices go up or down in the next 10 days, prior to the interview. A good candidate should get at least 8 out of these 10 predictions correct. Without having any knowledge about how stocks work, I would say that our probability of correctly predicting the trend each day is 50% — that’s a coin-flip each day. So, if we just interviewed one coin-flipping candidate, her chance of being right 8 out of 10 times would be 0.0547:

[C(10, 8) + C(10, 9) + C(10, 10)] / 2^10 ≈ 0.0547.

In other words, we can say that this candidate’s predictive performance is unlikely to be due to chance alone. However, say we didn’t just invite one single interviewee: we invited 100, and we asked all 100 interviewees for their predictions. Assuming that no candidate has a clue about how stocks work, and everyone was guessing randomly, the probability that at least one of the candidates got at least 8 out of 10 predictions correct is:

1 − (1 − 0.0547)^100 ≈ 0.9964.

So, shall we assume that a candidate who got 8 out of 10 predictions correct was not simply guessing randomly? We will continue this discussion of hypothesis tests and comparisons between learning algorithms in Part IV.
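A quick check of the two numbers above in Python:

```python
# Verifying the two probabilities with the binomial distribution (p = 0.5, n = 10 guesses).
from math import comb

p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # one candidate gets >= 8 of 10 right by luck
p_any_of_100 = 1 - (1 - p_one) ** 100                  # at least one of 100 random guessers does
print(round(p_one, 4), round(p_any_of_100, 4))         # ~0.0547 and ~0.9964
```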

References

Have feedback on this post? I would love to hear it. Let me know and send me a tweet or email.

 

Posted by uniqueone
,
http://stats.stackexchange.com/questions/52274/how-to-choose-a-predictive-model-after-k-fold-cross-validation

I am wondering how to choose a predictive model after doing K-fold cross-validation.

This may be awkwardly phrased, so let me explain in more detail: whenever I run K-fold cross-validation, I use K different sets of training data, and end up with K different models.

I would like to know how to choose one of these K models, so that I can present it to someone and say "this is the best classifier that we can come up with."

Is it OK to pick any one of the K models? Or is there some kind of best practice that is involved, such as picking the model that achieves the median test error?

I think you are missing something still in your understanding of the purpose of cross-validation.

Let's get some terminology straight, generally when we say 'a model' we refer to a particular method for describing how some input data relates to what we are trying to predict. We don't generally refer to particular instances of that method as different models. So you might say 'I have a linear regression model' but you wouldn't call two different sets of the trained coefficients different models. At least not in the context of model selection.

So, when you do K-fold cross-validation, you are testing how well your model is able to get trained by some data and then predict data it hasn't seen. We use cross-validation for this because if you train using all the data you have, you have none left for testing. You could do this once, say by using 80% of the data to train and 20% to test, but what if the 20% you happened to pick to test happens to contain a bunch of points that are particularly easy (or particularly hard) to predict? We will not have come up with the best estimate possible of the model's ability to learn and predict.

We want to use all of the data. So to continue the above example of an 80/20 split, we would do 5-fold cross validation by training the model 5 times on 80% of the data and testing on 20%. We ensure that each data point ends up in the 20% test set exactly once. We've therefore used every data point we have to contribute to an understanding of how well our model performs the task of learning from some data and predicting some new data.

But the purpose of cross-validation is not to come up with our final model. We don't use these 5 instances of our trained model to do any real prediction. For that we want to use all the data we have to come up with the best model possible. The purpose of cross-validation is model checking, not model building.

Now, say we have two models, say a linear regression model and a neural network. How can we say which model is better? We can do K-fold cross-validation and see which one proves better at predicting the test set points. But once we have used cross-validation to select the better-performing model, we train that model (whether it be the linear regression or the neural network) on all the data. We don't use the actual model instances we trained during cross-validation for our final predictive model.

Note that there is a technique called bootstrap aggregation (usually shortened to 'bagging') that does in a way use model instances produced in a way similar to cross-validation to build up an ensemble model, but that is an advanced technique beyond the scope of your question here.


Posted by uniqueone
,

http://hayoungkim.tistory.com/20

Posted by uniqueone
,

https://www.mathworks.com/matlabcentral/fileexchange/36478-parelab

 

1, Data:

BnuCampus images and annotations.

2, DbInit:

Reading imageset with single or multiple labels into standard interface.

3, FeatureLbp:

Local Binary Pattern & Local Phase Quantization, based on (http://www.cse.oulu.fi/MVG/Downloads/LBPSoftware).

4, FeatureColorDescriptors:

Wrapped Koen's code (http://koen.me/research/colordescriptors/).

5, FeatureAsift:

Based on Yu's code (http://www.cmap.polytechnique.fr/~yu/research/ASIFT/demo.html).

6, FeatureDctQuantZigzag:

Discrete Cosine Transform, including quantization and zigzag-scanning.

7, ScSPM, according to the code by Jianchao Yang @ NEC Research Lab America (Cupertino).

8, KNN classifier:

Divide the training and test images (using cross-validation), extract block features from multiple resolutions of each image, and find the nearest K samples to map their labels to the test samples.

 

PaReLab.zip

 

Posted by uniqueone
,

Generative model

From Wikipedia, the free encyclopedia

In probability and statistics, a generative model is a model for randomly generating observable data, typically given some hidden parameters. It specifies a joint probability distribution over observation and label sequences. Generative models are used in machine learning for either modeling data directly (i.e., modeling observations drawn from a probability density function), or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through Bayes' rule.

Shannon (1948) gives an example in which a table of frequencies of English word pairs is used to generate a sentence beginning with "representing and speedily is an good"; which is not proper English but which will increasingly approximate it as the table is moved from word pairs to word triplets etc.

Generative models contrast with discriminative models, in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. Despite the fact that discriminative models do not need to model the distribution of the observed variables, they cannot generally express more complex relationships between the observed and target variables. They don't necessarily perform better than generative models at classification and regression tasks.

Examples of generative models include naive Bayes classifiers, Gaussian mixture models, and hidden Markov models.

If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximize the data likelihood is a common method. However, since most statistical models are only approximations to the true distribution, if the model's application is to infer about a subset of variables conditional on known values of others, then it can be argued that the approximation makes more assumptions than are necessary to solve the problem at hand. In such cases, it can be more accurate to model the conditional density functions directly using a discriminative model (see above), although application-specific details will ultimately dictate which approach is most suitable in any particular case.

 

Discriminative model

From Wikipedia, the free encyclopedia

Discriminative models, also called conditional models, are a class of models used in machine learning for modeling the dependence of an unobserved variable y on an observed variable x. Within a probabilistic framework, this is done by modeling the conditional probability distribution P(y|x), which can be used for predicting y from x.

Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance.[1][2] On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application specific details ultimately dictate the suitability of selecting a discriminative versus generative model.

Examples of discriminative models used in machine learning include logistic regression, support vector machines, and conditional random fields.

Posted by uniqueone
,

http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

 

Let's say you have input data x and you want to classify the data into labels y. A generative model learns the joint probability distribution p(x,y) and a discriminative model learns the conditional probability distribution p(y|x) - which you should read as 'the probability of y given x'.

Here's a really simple example. Suppose you have the following data in the form (x,y):

(1,0), (1,0), (2,0), (2, 1)

p(x,y) is

      y=0   y=1
     -----------
x=1 | 1/2   0
x=2 | 1/4   1/4

p(y|x) is

      y=0   y=1
     -----------
x=1 | 1     0
x=2 | 1/2   1/2

If you take a few minutes to stare at those two matrices, you will understand the difference between the two probability distributions.

The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called discriminative algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes' rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to generate likely (x,y) pairs.

From the description above, you might be thinking that generative models are more generally useful and therefore better, but it's not as simple as that. This paper (nips01-discriminativegenerative.pdf) is a very popular reference on the subject of discriminative vs. generative classifiers, but it's pretty heavy going. The overall gist is that discriminative models generally outperform generative models in classification tasks.
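A minimal sketch of how the two tables above arise from the four samples by simple counting (Python, not part of the original answer):

```python
# Estimate p(x,y) and p(y|x) from the four samples by counting.
from collections import Counter

data = [(1, 0), (1, 0), (2, 0), (2, 1)]
joint = {pair: n / len(data) for pair, n in Counter(data).items()}           # p(x,y)
px = {x: sum(v for (xi, _), v in joint.items() if xi == x) for x, _ in data}  # marginal p(x)
conditional = {pair: v / px[pair[0]] for pair, v in joint.items()}           # p(y|x) = p(x,y) / p(x)
print(joint)        # {(1, 0): 0.5, (2, 0): 0.25, (2, 1): 0.25}
print(conditional)  # {(1, 0): 1.0, (2, 0): 0.5, (2, 1): 0.5}
```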

Posted by uniqueone
,
http://hgycap.tistory.com/10


In pattern recognition, the models used for classification can be divided into two kinds: generative models and discriminative models. A generative model, as the name suggests, can be thought of as a model that is able to generate a sample data set; a discriminative model, on the other hand, cannot generate samples.

From the classification point of view, a discriminative model focuses on the differences between the two given classes, whereas a generative model focuses on the distribution of each class. A simple example of a generative model is modeling each class with a Gaussian and using its mean as the prototype. Since classification is impossible without a decision boundary, generative models also construct a decision boundary, using the likelihood or the posterior probability; in practice, the posterior probability seems to be used more often.

The process of building such prototypes is usually thought of as categorization, whereas once a decision boundary has been fixed we usually speak of classification. Categorization is regarded as a slightly different concept from classification, at least within the pattern recognition field. The term categorization is mainly used in cognitive science and psychology when dealing with patterns. If we insist on a distinction, categorization is an unsupervised process while classification is a supervised one: categorization is meaningful even when only one class is given, but classification only makes sense when at least two classes are given. In pattern recognition, categorization seems to be used in a sense similar to density estimation, although the term itself is rarely seen in the PR literature; recently it does appear occasionally in work on outlier detection.

Since most of pattern recognition aims at classification, the goal is to obtain a decision boundary using either a generative or a discriminative model.

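A minimal sketch of the contrast described above, assuming scikit-learn: a generative classifier that fits one Gaussian per class and applies Bayes' rule, next to a discriminative logistic regression that models the posterior directly. Note that GaussianNB assumes feature independence (diagonal covariances), a simplification of the general Gaussian model.

```python
# Generative vs. discriminative: class-conditional Gaussians + Bayes' rule vs. direct modeling of p(class|x).
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
gen = GaussianNB().fit(X, y)                               # generative: per-class Gaussians
disc = LogisticRegression(max_iter=1000).fit(X, y)         # discriminative: posterior modeled directly
print(gen.theta_)                                          # per-class means, i.e. the Gaussian "prototypes"
print(gen.predict_proba(X[:3]), disc.predict_proba(X[:3])) # both yield p(class|x) for classification
```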
Posted by uniqueone
,