http://www.agnld.uni-potsdam.de/~marwan/matlab-tutorials/html/statistics.html#37
Statistics with Matlab
Contents
Matlab provides commands and operations for the computation of basic statistics.
randn('seed', 1);
x = randn(1000,1);
y = randn(1000,1);
The mean of a series can be computed by
mean(x)
ans = 0.0020
If there are some NaNs in the vector x, this command will also be NaN. For this case we can use the command nanmean.
Further commands are
median
median(x)
ans = 0.0197
standard deviation
std(x)
ans = 1.0096
variance
var(x)
ans = 1.0193
largest value
max(x)
ans = 3.0206
smalles value
min(x)
ans = -3.4333
value range
range(x)
ans = 6.4539
interquartile range (robust estimate for the spread of the data)
iqr(x)
ans = 1.3076
covariance matrix
cov(x,y)
ans = 1.0193 -0.0011 -0.0011 0.9783
correlation matrix
corrcoef(x,y)
ans = 1.0000 -0.0011 -0.0011 1.0000
Sometimes it is desired to know the position of the kargest or smalles value in the vector. The commands max and min can also provide this information:
[a i] = max(x)
a = 3.0206 i = 317
Now the position of the largest value a in the vector x is in variable i.
To sort the values in the vector x we use the sort command:
xs = sort(x);
or with the index vector of the sorted values
[xs i] = sort(x);
A histogram of the data can be computed with
hist(x)
The default settings for a histogram are ten intervalls. To change this to another number of intervalls, e. g. 50, we call
hist(x, 50)
The number of counts h per intervall i can be accessed by
[h i] = hist(x,50);
Now we would like to know the 5% and 95% percentiles of the distribution of the series x.
prctile(x,[5 95])
ans = -1.6467 1.6592
The minimum, maximum, lower and upper quartile and median values can be represented by a box and whisker plot
boxplot([x y])
Where mean and standard deviation are the first two moments of the distribution, skewness and kurtosis are related to the third and fourth moments of the distributions. The skewness reflects the shape or asymmetry of the distribution: if it is negative, the data spread out more to the left of the mean, if it is positive, the data spread out more to the right.
skewness(x)
ans = -0.0786
The kurtosis represents how outlier-prone the distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3.
kurtosis(x)
ans = 3.0258
All moments of a distribution can be derived with the command moment.
To assess statistical significance, the bootstrap statistics is sometimes helpful. With bootstrapping we can get new realisations of the original data series by a random resampling. The Matlab command bootstrp can be applied for a bootstrap statistic. To compute a bootstrap statistic of the mean of our vector x by using 500 new realisations, we call
m = bootstrp(500, 'mean', x);
To show the variation of the mean across all bootstrap realisations, we plot the histogram
hist(m)
As expected, the mean spreads around the origin.
Linear regression models can be useful for the study of relations between two data series. Matlab provides different commands to estimate linear regression coefficients and corresponding statistics.
Least squares fit can be performed by the command regress. Assume a linear system
randn('seed',1); x = [1:10:500]'; y = 2.5 * x + 80 + 50 * randn(50,1); plot(x,y,'.')
and estimate the linear fit and the confidence intervals within 95% with
[b, bint] = regress(y, [ones(length(x),1), x], 0.05)
b = 78.1034 2.4832 bint = 50.3461 105.8608 2.3859 2.5805
In order to get not only an estimate for the regression coefficient, but also the offset, we added a vector consisting of ones in the second argument of regress. The result b yiels the offset 85 and the regression coefficient 2.5, the corresponding confidence intervals are stored in bint.
hold on
plot(1:500, b(2) * [1:500] + b(1))
Residuals can be automatically computed with
[b, bint, residuals] = regress(y, [ones(length(x),1), x], 0.05);
clf, plot(residuals), ylabel('Residuals')
For nonlinear systems, the command polyfit is more appropriate. We consider the following nonlinear system
randn('seed', 1); y = sin(x/50) ./ x + .002 * randn(50,1); clf plot(x,y,'.')
and fit a polynom of order 5
p = polyfit(x, y, 5);
hold on
plot(x,polyval(p,x))
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
Error bounds can be derived by
[p s] = polyfit(x, y, 5); [y_fit delta] = polyval(p, x, s);
Warning: Polynomial is badly conditioned. Add points with distinct X values, reduce the degree of the polynomial, or try centering and scaling as described in HELP POLYFIT.
The variable delta contains the estimate of the standard deviation of the error in predicting a future observation at x by P(x). As error bounds we use 2 delta corresponding roughly to the 95% confidence interval.
plot(x, y_fit + 2 * delta, ':', x, y_fit - 2 * delta, ':')
Exercise
1. Consider the file ecg.dat and compute and interprete the mean, standard deviation, skewness and kurtosis. Plot and interprete the histogram (chose an appropriate number of intervals).
2. The file eeg.dat contains EEG measurements on 59 channels (arranged as columns). Compute a matrix of correlation coefficients between all channels.
3. Consider the first 120 data points in the file ecg.dat. Fit polynoms of different orders to this sequence of data. Compute bootstrap estimates of the last coefficient of the polynom and present the result in a histogram.