http://www.agnld.uni-potsdam.de/~marwan/matlab-tutorials/html/statistics.html#37

 

Statistics with Matlab


Matlab provides commands and operations for the computation of basic statistics.

randn('seed', 1);
x = randn(1000,1);
y = randn(1000,1);

The mean of a series can be computed by

mean(x)
ans =

    0.0020

If there are NaNs in the vector x, this command will also return NaN. For this case we can use the command nanmean.
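
For example (a minimal sketch, using a copy xn of the vector x from above with one value set to NaN):

xn = x;
xn(10) = NaN;     % introduce a missing value
mean(xn)          % returns NaN
nanmean(xn)       % ignores the NaN and averages the remaining values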

Further commands are

median

median(x)
ans =

    0.0197

standard deviation

std(x)
ans =

    1.0096

variance

var(x)
ans =

    1.0193

largest value

max(x)
ans =

    3.0206

smallest value

min(x)
ans =

   -3.4333

value range

range(x)
ans =

    6.4539

interquartile range (robust estimate for the spread of the data)

iqr(x)
ans =

    1.3076

covariance matrix

cov(x,y)
ans =

    1.0193   -0.0011
   -0.0011    0.9783

correlation matrix

corrcoef(x,y)
ans =

    1.0000   -0.0011
   -0.0011    1.0000

Sometimes it is desired to know the position of the largest or smallest value in the vector. The commands max and min can also provide this information:

[a i] = max(x)
a =

    3.0206


i =

   317

Now the position of the largest value a in the vector x is in variable i.

To sort the values in the vector x we use the sort command:

xs = sort(x);

or with the index vector of the sorted values

[xs i] = sort(x);

A histogram of the data can be computed with

hist(x)

The default setting for a histogram is ten intervals. To change this to another number of intervals, e.g. 50, we call

hist(x, 50)

The bin counts h and the corresponding bin centres i can be accessed by

[h i] = hist(x,50);

Now we would like to know the 5th and 95th percentiles of the distribution of the series x.

prctile(x,[5 95])
ans =

   -1.6467    1.6592

The minimum, maximum, lower and upper quartiles, and median values can be represented by a box-and-whisker plot

boxplot([x y])

Whereas mean and standard deviation are related to the first two moments of the distribution, skewness and kurtosis are related to the third and fourth moments. The skewness reflects the asymmetry of the distribution: if it is negative, the data spread out more to the left of the mean; if it is positive, the data spread out more to the right.

skewness(x)
ans =

   -0.0786

The kurtosis represents how outlier-prone the distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3.

kurtosis(x)
ans =

    3.0258

All moments of a distribution can be derived with the command moment.
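
For example (a minimal sketch; moment(x,k) returns the k-th central moment):

moment(x,2)       % second central moment (the variance, normalized by n instead of n-1)
moment(x,3)       % third central moment, related to the skewness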

To assess statistical significance, bootstrap statistics are sometimes helpful. With bootstrapping we obtain new realisations of the original data series by random resampling with replacement. The Matlab command bootstrp can be applied to compute a bootstrap statistic. To compute a bootstrap statistic of the mean of our vector x using 500 new realisations, we call

m = bootstrp(500, 'mean', x);

To show the variation of the mean across all bootstrap realisations, we plot the histogram

hist(m)

As expected, the mean spreads around the origin.

Linear regression models can be useful for the study of relations between two data series. Matlab provides different commands to estimate linear regression coefficients and corresponding statistics.

A least squares fit can be performed with the command regress. Assume a linear system

randn('seed',1);
x = [1:10:500]';
y = 2.5 * x + 80 + 50 * randn(50,1);

plot(x,y,'.')

and estimate the linear fit and the 95% confidence intervals with

[b, bint] = regress(y, [ones(length(x),1), x], 0.05)
b =

   78.1034
    2.4832


bint =

   50.3461  105.8608
    2.3859    2.5805

In order to get not only an estimate for the regression coefficient, but also the offset, we added a column of ones as the second argument of regress. The result b yields the offset estimate 78.1 (true value 80) and the regression coefficient 2.48 (true value 2.5); the corresponding confidence intervals are stored in bint.

hold on
plot(1:500, b(2) * [1:500] + b(1))

Residuals can be automatically computed with

[b, bint, residuals] = regress(y, [ones(length(x),1), x], 0.05);
clf, plot(residuals), ylabel('Residuals')

For nonlinear relationships, the command polyfit is more appropriate. We consider the following nonlinear system

randn('seed', 1);
y = sin(x/50) ./ x + .002 * randn(50,1);
clf
plot(x,y,'.')

and fit a polynomial of degree 5

p = polyfit(x, y, 5);

hold on
plot(x,polyval(p,x))
Warning: Polynomial is badly conditioned. Add points with distinct X
         values, reduce the degree of the polynomial, or try centering
         and scaling as described in HELP POLYFIT.

Error bounds can be derived by

[p s] = polyfit(x, y, 5);
[y_fit delta] = polyval(p, x, s);
Warning: Polynomial is badly conditioned. Add points with distinct X
         values, reduce the degree of the polynomial, or try centering
         and scaling as described in HELP POLYFIT.

The variable delta contains an estimate of the standard deviation of the error in predicting a future observation at x by the fitted polynomial. As error bounds we use 2*delta, corresponding roughly to a 95% confidence interval.

plot(x, y_fit + 2 * delta, ':', x, y_fit - 2 * delta, ':')

Exercise

1. Consider the file ecg.dat and compute and interpret the mean, standard deviation, skewness and kurtosis. Plot and interpret the histogram (choose an appropriate number of intervals).

2. The file eeg.dat contains EEG measurements on 59 channels (arranged as columns). Compute a matrix of correlation coefficients between all channels.

3. Consider the first 120 data points in the file ecg.dat. Fit polynomials of different orders to this sequence of data. Compute bootstrap estimates of the last coefficient of the polynomial and present the result in a histogram.

Posted by uniqueone
http://www.mathworks.com/matlabcentral/fileexchange/9577-symbolic-polynomial-manipulation

 

 

This toolbox is a very simple polynomial manipulation package.
I originally wrote it some years ago as a means of learning how to work with objects in Matlab. When I decided to publish this package, I completely rewrote it from scratch just this week, as the old one had some bugs in it.

Why publish this at all? Because I think writing such a tool is a marvelous way to learn about OOP and the use of classes in Matlab. A second reason is that this toolbox is actually of use to me on occasion, whenever I need to manipulate simple polynomials in one or several variables. If, as is the case with me, you don't own the Symbolic Math Toolbox, you may find it of interest too. I have included the functions orthpoly and gaussquadrule as neat applications of the sympoly tools.

I've also put in a helper document that discusses things I felt important in writing a toolbox like this.

If you wish to use these tools, they are quite easy to use. A few quick examples:

% This creates 3 sympoly objects in your
% workspace: x, y, z.
sympoly x y z

% Add 1 to x, put the result in p.
p = x+1

% arbitrary expressions
q = (x-1)^3 + x*y*z - (x+1)*(z-1)

There are many other examples in the ReadMe file, some involving arrays of sympolys, as well as many more.

Whatever use you do find for this toolbox, have fun with it. I did. I even get some use from it occasionally.

If by some amazing chance you do find any bugs, please e-mail me.

Posted by uniqueone
http://www.mathworks.com/matlabcentral/fileexchange/34765-polyfitn

 


 

Polyfitn is an extension of polyfit, allowing the user to create models with more than one independent variable. It also allows the user to specify a general model, for example, a quadratic model, with constant and quadratic terms, but no linear term.
For example, to fit a polynomial model to points selected from a cosine curve, we will only need the even ordered terms.
x = -2:.1:2;
y = cos(x);
p = polyfitn(x,y,'constant x^2 x^4 x^6');

 


p.Coefficients
ans =
    [0.99996 -0.49968 0.041242 -0.0012079]
The coefficients won't be exact of course, as I used only a finite number of terms for what is essentially a truncated Taylor series, and I had only a finite number of points to build the model from. The exact first 4 coefficients for a cosine series would have been:
>> [1 -1/2 1/24 -1/720]
ans =
            1 -0.5 0.041667 -0.0013889

so we got the expected result.

Of course, polyfitn works in higher dimensions, as that is the problem it was really designed to solve.

x = rand(100,1);
y = rand(100,1);
z = exp(x+y) + randn(100,1)/100;
p = polyfitn([x,y],z,3);

The result can be converted into a symbolic form to view the model more simply. Here I'll use my sympoly toolbox, but there is also a polyn2sym function provided.

polyn2sympoly(p)
ans =
    0.43896*X1^3 + 1.4919*X1^2*X2 + 0.041084*X1^2 + 1.4615*X1*X2^2 - 0.095977*X1*X2 + 1.2799*X1 + 0.56912*X2^3 - 0.15306*X2^2 + 1.361*X2 + 0.94819

And of course, parameter error estimates are provided for those who wish to determine the significance of the terms generated.

I've also provided tools for evaluating these models, and to differentiate a model.
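
For example, a fitted model can be evaluated at new points with the package's polyvaln function (a minimal sketch, assuming the model p from the three-dimensional fit above):

% evaluate the fitted model on a grid of new (x,y) points
[xg, yg] = meshgrid(0:0.1:1);
zhat = polyvaln(p, [xg(:), yg(:)]);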

A caveat: beware the use of high order polynomials to fit your data. Just because a low order model works does not mean a higher order model is better. High order polynomials often suffer from severe ringing between the data points. Always plot your data. Think about the model you will be building. Then plot the resulting model. Use your eyes to validate the result, more so than simply looking at an r-squared coefficient (although I do return that parameter too).

If you do find that a high order polynomial model is necessary because your curve is simply too complicated, consider using a regression or smoothing spline model instead.

 

Posted by uniqueone
http://kr.mathworks.com/help/stats/fitlm.html

 

fitlm

Create linear regression model

fitlm creates a LinearModel object. Once you create the object, you can see it in the workspace. You can see all the properties the object contains by clicking on it. You can create plots and do further diagnostic analysis by using methods such as plot, plotResiduals, and plotDiagnostics. For a full list of methods for LinearModel, see methods.

Syntax

  • mdl = fitlm(tbl)
  • mdl = fitlm(tbl,modelspec)
  • mdl = fitlm(X,y)
  • mdl = fitlm(X,y,modelspec)
  • mdl = fitlm(___,Name,Value)

Description


mdl = fitlm(tbl) returns a linear model fit to variables in the table or dataset array tbl. By default, fitlm takes the last variable as the response variable.


mdl = fitlm(tbl,modelspec) returns a linear model of the type you specify in modelspec fit to variables in the table or dataset array tbl.


mdl = fitlm(X,y) returns a linear model of the responses y, fit to the data matrix X.


mdl = fitlm(X,y,modelspec) returns a linear model of the type you specify in modelspec for the responses y, fit to the data matrix X.


mdl = fitlm(___,Name,Value) returns a linear model with additional options specified by one or more Name,Value pair arguments.

For example, you can specify which variables are categorical, perform robust regression, or use observation weights.

Examples


Fit Linear Regression Using Data in Table

Load the sample data.

load carsmall

Store the variables in a table.

tbl = table(Weight,Acceleration,MPG,'VariableNames',{'Weight','Acceleration','MPG'});

Display the first five rows of the table.

tbl(1:5,:)
ans = 

    Weight    Acceleration    MPG
    ______    ____________    ___

    3504        12            18 
    3693      11.5            15 
    3436        11            18 
    3433        12            16 
    3449      10.5            17 

Fit a linear regression model for miles per gallon (MPG).

lm = fitlm(tbl,'MPG~Weight+Acceleration')
lm = 


Linear regression model:
    MPG ~ 1 + Weight + Acceleration

Estimated Coefficients:
                     Estimate         SE         tStat       pValue  
                    __________    __________    _______    __________

    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743,  Adjusted R-Squared 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27

This syntax uses Wilkinson notation to specify the modelspec.

The model 'MPG~Weight+Acceleration' in this example is equivalent to fitting the model using the string 'linear' as modelspec. For example,

lm2 = fitlm(tbl,'linear');

When you use a string as modelspec and do not specify the response variable, fitlm by default accepts the last variable in tbl as the response variable and the other variables as the predictor variables. If there are any categorical variables and you use 'linear' as the modelspec, then you must explicitly specify those variables as categorical variables using the CategoricalVars name-value pair argument.

Fit Linear Regression Using Specified Model Formula

Fit a linear regression model using a model formula specified by Wilkinson notation.

Load the sample data.

load carsmall

Store the variables in a table.

tbl = table(Weight,Acceleration,Model_Year,MPG,'VariableNames',{'Weight','Acceleration','Model_Year','MPG'});

Fit a linear regression model for miles per gallon (MPG) with weight and acceleration as the predictor variables.

lm = fitlm(tbl,'MPG~Weight+Acceleration')
lm = 


Linear regression model:
    MPG ~ 1 + Weight + Acceleration

Estimated Coefficients:
                     Estimate         SE         tStat       pValue  
                    __________    __________    _______    __________

    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743,  Adjusted R-Squared 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27

The p-value of 0.18493 indicates that Acceleration does not have a significant impact on MPG.

Remove Acceleration from the model, and try improving the model by adding the predictor variable Model_Year. First define Model_Year as a nominal variable.

tbl.Model_Year = categorical(tbl.Model_Year);
lm = fitlm(tbl,'MPG~Weight+Model_Year')
lm = 


Linear regression model:
    MPG ~ 1 + Weight + Model_Year

Estimated Coefficients:
                      Estimate         SE         tStat       pValue  
                     __________    __________    _______    __________

    (Intercept)           40.11        1.5418     26.016    1.2024e-43
    Weight           -0.0066475    0.00042802    -15.531    3.3639e-27
    Model_Year_76        1.9291       0.74761     2.5804      0.011488
    Model_Year_82        7.9093       0.84975     9.3078    7.8681e-15


Number of observations: 94, Error degrees of freedom: 90
Root Mean Squared Error: 2.92
R-squared: 0.873,  Adjusted R-Squared 0.868
F-statistic vs. constant model: 206, p-value = 3.83e-40

Specifying modelspec using Wilkinson notation enables you to update the model without having to change the design matrix. fitlm uses only the variables that are specified in the formula. It also creates the necessary two dummy indicator variables for the categorical variable Model_Year.

Linear Regression with Categorical Predictor

Fit a model of a table that contains a categorical predictor.

Load the carsmall data.

load carsmall

Construct a table containing continuous predictor variable Weight, nominal predictor variable Year, and response variable MPG.

tbl = table(MPG,Weight);
tbl.Year = nominal(Model_Year);

Create a fitted model of MPG as a function of Year, Weight, and Weight^2. (You don't have to include Weight explicitly in your formula because it is a lower-order term of Weight^2 and is included automatically.)

mdl = fitlm(tbl,'MPG ~ Year + Weight^2')
mdl = 


Linear regression model:
    MPG ~ 1 + Weight + Year + Weight^2

Estimated Coefficients:
                    Estimate         SE         tStat       pValue  
                   __________    __________    _______    __________

    (Intercept)        54.206        4.7117     11.505    2.6648e-19
    Weight          -0.016404     0.0031249    -5.2493    1.0283e-06
    Year_76            2.0887       0.71491     2.9215     0.0044137
    Year_82            8.1864       0.81531     10.041    2.6364e-16
    Weight^2       1.5573e-06    4.9454e-07      3.149     0.0022303


Number of observations: 94, Error degrees of freedom: 89
Root Mean Squared Error: 2.78
R-squared: 0.885,  Adjusted R-Squared 0.88
F-statistic vs. constant model: 172, p-value = 5.52e-41

fitlm creates two dummy (indicator) variables for the nominal variate, Year. The dummy variable Year_76 takes the value 1 if model year is 1976 and takes the value 0 if it is not. The dummy variable Year_82 takes the value 1 if model year is 1982 and takes the value 0 if it is not. And the year 1970 is the reference year. The corresponding model is

MPG (predicted) = 54.206 - 0.016404*Weight + 2.0887*Year_76 + 8.1864*Year_82 + 1.5573e-06*Weight^2

Specify Response and Predictor Variables for Linear Model

Fit a linear regression model to sample data. Specify the response and predictor variables, and include only pairwise interaction terms in the model.

Load sample data.

load hospital

Fit a linear model with interaction terms to the data. Specify weight as the response variable, and sex, age, and smoking status as the predictor variables. Also, specify that sex and smoking status are categorical variables.

mdl = fitlm(hospital,'interactions','ResponseVar','Weight',...
    'PredictorVars',{'Sex','Age','Smoker'},...
    'CategoricalVar',{'Sex','Smoker'})
mdl = 


Linear regression model:
    Weight ~ 1 + Sex*Age + Sex*Smoker + Age*Smoker

Estimated Coefficients:
                         Estimate      SE        tStat        pValue  
                         ________    _______    ________    __________

    (Intercept)             118.7     7.0718      16.785     6.821e-30
    Sex_Male               68.336     9.7153      7.0339    3.3386e-10
    Age                   0.31068    0.18531      1.6765      0.096991
    Smoker_1               3.0425     10.446     0.29127       0.77149
    Sex_Male:Age         -0.49094    0.24764     -1.9825      0.050377
    Sex_Male:Smoker_1      0.9509     3.8031     0.25003       0.80312
    Age:Smoker_1         -0.07288    0.26275    -0.27737       0.78211


Number of observations: 100, Error degrees of freedom: 93
Root Mean Squared Error: 8.75
R-squared: 0.898,  Adjusted R-Squared 0.892
F-statistic vs. constant model: 137, p-value = 6.91e-44

At the 5% significance level, the weight of the patients does not seem to differ significantly with age, smoking status, or the interaction of these factors with patient sex.

Fit a Robust Linear Regression Model

Fit a linear regression model using a robust fitting method.

Load the sample data.

load hald

The hald data measures the effect of cement composition on its hardening heat. The matrix ingredients contains the percent composition of four chemicals present in the cement. The array heat contains the heat of hardening after 180 days for each cement sample.

Fit a robust linear model to the data.

mdl = fitlm(ingredients,heat,'linear','RobustOpts','on')
mdl = 


Linear regression model (robust fit):
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE        tStat       pValue 
                   ________    _______    ________    ________

    (Intercept)       60.09     75.818     0.79256      0.4509
    x1               1.5753    0.80585      1.9548    0.086346
    x2               0.5322    0.78315     0.67957     0.51596
    x3              0.13346     0.8166     0.16343     0.87424
    x4             -0.12052     0.7672    -0.15709     0.87906


Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.65
R-squared: 0.979,  Adjusted R-Squared 0.969
F-statistic vs. constant model: 94.6, p-value = 9.03e-07


Input Arguments


tbl — Input data: table | dataset array

Input data, specified as a table or dataset array. When modelspec is a formula, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.

Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.

To set a different column as the response variable, use the ResponseVar name-value pair argument. To use a subset of the columns as predictors, use the PredictorVars name-value pair argument.

Data Types: single | double | logical

X — Predictor variables: matrix

Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in X.

Data Types: single | double | logical

y — Response variable: vector

Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.

Data Types: single | double

modelspec — Model specification: 'linear' (default) | string naming the model | t-by-(p + 1) terms matrix | string of the form 'Y ~ terms'

Model specification, specified as one of the following.

  • A string naming the model.

    String            Model Type
    'constant'        Model contains only a constant (intercept) term.
    'linear'          Model contains an intercept and linear terms for each predictor.
    'interactions'    Model contains an intercept, linear terms, and all products of pairs of distinct predictors (no squared terms).
    'purequadratic'   Model contains an intercept, linear terms, and squared terms.
    'quadratic'       Model contains an intercept, linear terms, interactions, and squared terms.
    'polyijk'         Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, etc. Use numerals 0 through 9. For example, 'poly2111' has a constant plus all linear and product terms, and also contains terms with predictor 1 squared.

  • A t-by-(p + 1) terms matrix specifying the terms to include in the model, where t is the number of terms, p is the number of predictor variables, and the extra column is for the response variable.

  • A string representing a formula in the form

    'Y ~ terms',

    where the terms are in Wilkinson Notation.

Example: 'quadratic'

Example: 'y ~ X1 + X2^2 + X1:X2'
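
As a sketch of the correspondence between the two forms, the formula example above can also be written as a terms matrix (assuming predictors X1 and X2 as the two columns of a matrix X with response vector y; note that ^ in Wilkinson notation also brings in the lower-order term X2):

T = [0 0 0;     % intercept
     1 0 0;     % X1
     0 1 0;     % X2, implied by X2^2 in the formula
     0 2 0;     % X2^2
     1 1 0];    % X1:X2
mdl = fitlm(X, y, T);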

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Intercept',false,'PredictorVars',[1,3],'ResponseVar',5,'RobustOpts','logistic' specifies a robust regression model with no constant term, where the algorithm uses the logistic weighting function with the default tuning constant, first and third variables are the predictor variables, and fifth variable is the response variable.

'CategoricalVars' — Categorical variables: cell array of strings | logical or numeric index vector

Categorical variables in the fit, specified as the comma-separated pair consisting of 'CategoricalVars' and either a cell array of strings of the names of the categorical variables in the table or dataset array tbl, or a logical or numeric index vector indicating which columns are categorical.

  • If data is in a table or dataset array tbl, then the default is to treat all categorical or logical variables, character arrays, or cell arrays of strings as categorical variables.

  • If data is in matrix X, then the default value of this name-value pair argument is an empty matrix []. That is, no variable is categorical unless you specify it.

For example, you can specify the variables 2 and 3 out of 6 as categorical using either of the following examples.

Example: 'CategoricalVars',[2,3]

Example: 'CategoricalVars',logical([0 1 1 0 0 0])

Data Types: single | double | logical

'Exclude' — Observations to exclude: logical or numeric index vector

Observations to exclude from the fit, specified as the comma-separated pair consisting of 'Exclude' and a logical or numeric index vector indicating which observations to exclude from the fit.

For example, you can exclude observations 2 and 3 out of 6 using either of the following examples.

Example: 'Exclude',[2,3]

Example: 'Exclude',logical([0 1 1 0 0 0])

Data Types: single | double | logical

'Intercept' — Indicator for constant term: true (default) | false

Indicator for the constant term (intercept) in the fit, specified as the comma-separated pair consisting of 'Intercept' and either true to include or false to remove the constant term from the model.

Use 'Intercept' only when specifying the model using a string, not a formula or matrix.

Example: 'Intercept',false

'PredictorVars' — Predictor variables: cell array of strings | logical or numeric index vector

Predictor variables to use in the fit, specified as the comma-separated pair consisting of 'PredictorVars' and either a cell array of strings of the variable names in the table or dataset array tbl, or a logical or numeric index vector indicating which columns are predictor variables.

The strings should be among the names in tbl, or the names you specify using the 'VarNames' name-value pair argument.

The default is all variables in X, or all variables in tbl except for ResponseVar.

For example, you can specify the second and third variables as the predictor variables using either of the following examples.

Example: 'PredictorVars',[2,3]

Example: 'PredictorVars',logical([0 1 1 0 0 0])

Data Types: single | double | logical | cell

'ResponseVar' — Response variable: last column in tbl (default) | string for variable name | logical or numeric index vector

Response variable to use in the fit, specified as the comma-separated pair consisting of 'ResponseVar' and either a string of the variable name in the table or dataset array tbl, or a logical or numeric index vector indicating which column is the response variable. You typically need to use 'ResponseVar' when fitting a table or dataset array tbl.

For example, you can specify the fourth variable, say yield, as the response out of six variables, in one of the following ways.

Example: 'ResponseVar','yield'

Example: 'ResponseVar',[4]

Example: 'ResponseVar',logical([0 0 0 1 0 0])

Data Types: single | double | logical | char

'RobustOpts' — Indicator of robust fitting type: 'off' (default) | 'on' | string | structure with string or function handle

Indicator of the robust fitting type to use, specified as the comma-separated pair consisting of 'RobustOpts' and one of the following.

  • 'off' — No robust fitting. fitlm uses ordinary least squares.

  • 'on' — Robust fitting. When you use robust fitting, 'bisquare' weight function is the default.

  • String — Name of the robust fitting weight function from the following table. fitlm uses the corresponding default tuning constant in the table.

  • Structure with the string RobustWgtFun containing the name of the robust fitting weight function from the following table and optional scalar Tune fields — fitlm uses the RobustWgtFun weight function and Tune tuning constant from the structure. You can choose the name of the robust fitting weight function from this table. If you do not supply a Tune field, the fitting function uses the corresponding default tuning constant.

    Weight Function         Equation                           Default Tuning Constant
    'andrews'               w = (abs(r)<pi) .* sin(r) ./ r     1.339
    'bisquare' (default)    w = (abs(r)<1) .* (1 - r.^2).^2    4.685
    'cauchy'                w = 1 ./ (1 + r.^2)                2.385
    'fair'                  w = 1 ./ (1 + abs(r))              1.400
    'huber'                 w = 1 ./ max(1, abs(r))            1.345
    'logistic'              w = tanh(r) ./ r                   1.205
    'ols'                   Ordinary least squares (no weighting function)    None
    'talwar'                w = 1 * (abs(r)<1)                 2.795
    'welsch'                w = exp(-(r.^2))                   2.985

    The value r in the weight functions is

    r = resid/(tune*s*sqrt(1-h)),

    where resid is the vector of residuals from the previous iteration, h is the vector of leverage values from a least-squares fit, and s is an estimate of the standard deviation of the error term given by

    s = MAD/0.6745.

    MAD is the median absolute deviation of the residuals from their median. The constant 0.6745 makes the estimate unbiased for the normal distribution. If there are p columns in X, the smallest p absolute deviations are excluded when computing the median.

    Default tuning constants give coefficient estimates that are approximately 95% as statistically efficient as the ordinary least-squares estimates, provided the response has a normal distribution with no outliers. Decreasing the tuning constant increases the downweight assigned to large residuals; increasing the tuning constant decreases the downweight assigned to large residuals.

  • Structure with the function handle RobustWgtFun and optional scalar Tune fields — You can specify a custom weight function. fitlm uses the RobustWgtFun weight function and Tune tuning constant from the structure. Specify RobustWgtFun as a function handle that accepts a vector of residuals, and returns a vector of weights the same size. The fitting function scales the residuals, dividing by the tuning constant (default 1) and by an estimate of the error standard deviation before it calls the weight function.

Example: 'RobustOpts','andrews'
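
A minimal sketch of the structure form with a custom weight function (here the Cauchy weights from the table, passed as a function handle that maps a vector of residuals to a vector of weights of the same size):

opts = struct('RobustWgtFun', @(r) 1 ./ (1 + r.^2), 'Tune', 2.385);
mdl = fitlm(X, y, 'linear', 'RobustOpts', opts);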

'VarNames' — Names of variables in fit: {'x1','x2',...,'xn','y'} (default) | cell array of strings

Names of variables in fit, specified as the comma-separated pair consisting of 'VarNames' and a cell array of strings including the names for the columns of X first, and the name for the response variable y last.

'VarNames' is not applicable to variables in a table or dataset array, because those variables already have names.

For example, if in your data, horsepower, acceleration, and model year of the cars are the predictor variables, and miles per gallon (MPG) is the response variable, then you can name the variables as follows.

Example: 'VarNames',{'Horsepower','Acceleration','Model_Year','MPG'}

Data Types: cell

'Weights' — Observation weights: ones(n,1) (default) | n-by-1 vector of nonnegative scalar values

Observation weights, specified as the comma-separated pair consisting of 'Weights' and an n-by-1 vector of nonnegative scalar values, where n is the number of observations.

Data Types: single | double

Output Arguments


mdl — Linear model: LinearModel object

Linear model representing a least-squares fit of the response to the data, returned as a LinearModel object.

If the value of the 'RobustOpts' name-value pair is not [] or 'ols', the model is not a least-squares fit, but uses the robust fitting function.

For properties and methods of the linear model object, mdl, see the LinearModel class page.

More About


Terms Matrix

A terms matrix is a t-by-(p + 1) matrix specifying terms in a model, where t is the number of terms, p is the number of predictor variables, and the extra column is for the response variable.

The value of T(i,j) is the exponent of variable j in term i. Suppose there are three predictor variables A, B, and C:

[0 0 0 0] % Constant term or intercept
[0 1 0 0] % B; equivalently, A^0 * B^1 * C^0
[1 0 1 0] % A*C
[2 0 0 0] % A^2
[0 1 2 0] % B*(C^2)
The 0 at the end of each term represents the response variable. In general,

  • If you have the variables in a table or dataset array, then the column of 0s for the response variable must be placed according to the position of the response variable in the array. The following example illustrates this.

    Load the sample data and define the dataset array.

    load hospital
    ds = dataset(hospital.Sex,hospital.BloodPressure(:,1),hospital.Age,...
    hospital.Smoker,'VarNames',{'Sex','BloodPressure','Age','Smoker'});

    Represent the linear model 'BloodPressure ~ 1 + Sex + Age + Smoker' in a terms matrix. The response variable is in the second column of the dataset array, so there must be a column of 0s for the response variable in the second column of the terms matrix.

    T = [0 0 0 0;1 0 0 0;0 0 1 0;0 0 0 1]
    
    T =
    
         0     0     0     0
         1     0     0     0
         0     0     1     0
         0     0     0     1

    Redefine the dataset array.

    ds = dataset(hospital.BloodPressure(:,1),hospital.Sex,hospital.Age,...
    hospital.Smoker,'VarNames',{'BloodPressure','Sex','Age','Smoker'});
    

    Now, the response variable is the first term in the dataset array. Specify the same linear model, 'BloodPressure ~ 1 + Sex + Age + Smoker', using a terms matrix.

    T = [0 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1]
    T =
    
         0     0     0     0
         0     1     0     0
         0     0     1     0
         0     0     0     1
  • If you have the predictor and response variables in a matrix and column vector, then you must include 0 for the response variable at the end of each term. The following example illustrates this.

    Load the sample data and define the matrix of predictors.

    load carsmall
    X = [Acceleration,Weight];
    

    Specify the model 'MPG ~ Acceleration + Weight + Acceleration:Weight + Weight^2' using a terms matrix and fit the model to the data. This model includes the main effect and two-way interaction terms for the variables Acceleration and Weight, and a second-order term for the variable Weight.

    T = [0 0 0;1 0 0;0 1 0;1 1 0;0 2 0]
    
    T =
    
         0     0     0
         1     0     0
         0     1     0
         1     1     0
         0     2     0
    

    Fit a linear model.

    mdl = fitlm(X,MPG,T)
    mdl = 
    
    Linear regression model:
        y ~ 1 + x1*x2 + x2^2
    
    Estimated Coefficients:
                       Estimate       SE            tStat      pValue    
        (Intercept)         48.906        12.589     3.8847    0.00019665
        x1                 0.54418       0.57125    0.95261       0.34337
        x2               -0.012781     0.0060312    -2.1192      0.036857
        x1:x2          -0.00010892    0.00017925    -0.6076         0.545
        x2^2            9.7518e-07    7.5389e-07     1.2935       0.19917
    
    Number of observations: 94, Error degrees of freedom: 89
    Root Mean Squared Error: 4.1
    R-squared: 0.751,  Adjusted R-Squared 0.739
    F-statistic vs. constant model: 67, p-value = 4.99e-26

    Only the intercept and x2 term, which correspond to the Weight variable, are significant at the 5% significance level.

    Now, perform a stepwise regression with a constant model as the starting model and a linear model with interactions as the upper model.

    T = [0 0 0;1 0 0;0 1 0;1 1 0];
    mdl = stepwiselm(X,MPG,[0 0 0],'upper',T)
    1. Adding x2, FStat = 259.3087, pValue = 1.643351e-28
    
    mdl = 
    
    Linear regression model:
        y ~ 1 + x2
    
    Estimated Coefficients:
                       Estimate      SE           tStat      pValue    
        (Intercept)        49.238       1.6411     30.002    2.7015e-49
        x2             -0.0086119    0.0005348    -16.103    1.6434e-28
    
    Number of observations: 94, Error degrees of freedom: 92
    Root Mean Squared Error: 4.13
    R-squared: 0.738,  Adjusted R-Squared 0.735
    F-statistic vs. constant model: 259, p-value = 1.64e-28

    The results of the stepwise regression are consistent with the results of fitlm in the previous step.

Formula

A formula for model specification is a string of the form 'Y ~ terms'

where

  • Y is the response name.

  • terms contains

    • Variable names

    • + means include the next variable

    • - means do not include the next variable

    • : defines an interaction, a product of terms

    • * defines an interaction and all lower-order terms

    • ^ raises the predictor to a power, exactly as in * repeated, so ^ includes lower order terms as well

    • () groups terms

    Note:   Formulas include a constant (intercept) term by default. To exclude a constant term from the model, include -1 in the formula.

For example,

'Y ~ A + B + C' means a three-variable linear model with intercept.
'Y ~ A + B + C - 1' is a three-variable linear model without intercept.
'Y ~ A + B + C + B^2' is a three-variable model with intercept and a B^2 term.
'Y ~ A + B^2 + C' is the same as the previous example because B^2 includes a B term.
'Y ~ A + B + C + A:B' includes an A*B term.
'Y ~ A*B + C' is the same as the previous example because A*B = A + B + A:B.
'Y ~ A*B*C - A:B:C' has all interactions among A, B, and C, except the three-way interaction.
'Y ~ A*(B + C + D)' has all linear terms, plus products of A with each of the other variables.
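
A formula string is passed directly as the modelspec argument (a minimal sketch, assuming a table tbl containing variables A, B, C, and Y):

mdl = fitlm(tbl, 'Y ~ A*B + C');   % fits 1 + A + B + C + A:B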

Wilkinson Notation

Wilkinson notation describes the factors present in models. The notation relates to factors present in models, not to the multipliers (coefficients) of those factors.

Wilkinson Notation                     Factors in Standard Notation
1                                      Constant (intercept) term
A^k, where k is a positive integer     A, A^2, ..., A^k
A + B                                  A, B
A*B                                    A, B, A*B
A:B                                    A*B only
-B                                     Do not include B
A*B + C                                A, B, C, A*B
A + B + C + A:B                        A, B, C, A*B
A*B*C - A:B:C                          A, B, C, A*B, A*C, B*C
A*(B + C)                              A, B, C, A*B, A*C

Statistics and Machine Learning Toolbox™ notation always includes a constant term unless you explicitly remove the term using -1.

 

Posted by uniqueone
http://kr.mathworks.com/help/stats/compactlinearmodel.predict.html

 

predict

Class: CompactLinearModel

Predict response of linear regression model

Syntax

ypred = predict(mdl,Xnew)
[ypred,yci] = predict(mdl,Xnew)
[ypred,yci] = predict(mdl,Xnew,Name,Value)

Description

ypred = predict(mdl,Xnew) returns the predicted response of the mdl linear regression model to the points in Xnew.

[ypred,yci] = predict(mdl,Xnew) returns confidence intervals for the true mean responses.

[ypred,yci] = predict(mdl,Xnew,Name,Value) predicts responses with additional options specified by one or more Name,Value pair arguments.

Tips

  • For predictions with added noise, use random.
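
A minimal sketch (random returns the prediction plus noise drawn from the estimated error distribution):

ysim = random(mdl, Xnew);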

Input Arguments


mdl — Linear model object: LinearModel object | CompactLinearModel object

Xnew — New predictor input values: table | dataset array | numeric matrix

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Alpha' — Alpha value for confidence interval: 0.05 (default) | numeric value in the range [0,1]

'Prediction' — Prediction type: 'curve' (default) | 'observation'
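
For example, 99% prediction intervals for new observations (rather than confidence intervals for the mean response) can be requested as follows (a minimal sketch):

[ypred, yci] = predict(mdl, Xnew, 'Alpha', 0.01, 'Prediction', 'observation');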

Examples


Predict Response Values Using Linear Model

Create a model of car mileage as a function of weight, and predict the response.

Create a quadratic model of car mileage as a function of weight from the carsmall data.

load carsmall
X = Weight;
y = MPG;
mdl = fitlm(X,y,'quadratic');

Create predicted responses to the data.

Xnew = X;
ypred = predict(mdl,Xnew);

Plot the original responses and the predicted responses to see how they differ.

plot(X,y,'o',Xnew,ypred,'x')
legend('Data','Predictions')


 

Posted by uniqueone
http://kr.mathworks.com/matlabcentral/answers/124050-multiple-linear-regression-to-fit-data-to-a-third-degree-polynomial-equation-with-interaction-terms

 

 

Hello,

I have data for two independent variables and one dependent variable (obtained from experiment). I need to fit this data using linear regression to a 10-coefficient third degree polynomial equation - for the engineers among you, this is the standard equation for specifying refrigeration compressor performance. This can be done quite easily in EES by entering data into a table and selecting the option "Linear Regression" from the Tables menu. How do I achieve the same in Matlab? My equation is of the form

X = C1 + C2*S + C3*D + C4*S^2 + C5*S*D + C6*D^2 + C7*S^3 + C8*D*S^2 + C9*S*D^2 + C10*D^3

Here, X is the dependent variable and S and D are the independent variables. C1 to C10 are the coefficients that need to be calculated.

Using the 'regress' function gives me a constant of 0, and warns me that my design matrix is rank deficient. Using the curve fitting toolbox (cftool - polynomial option) gives me ridiculous values for the coefficients (p00 = -6.436e15). When I try to input a custom equation in the cftool, it switches to non-linear regression and asks me to input guess values for the coefficients, which I don't want to do. What other functions are available that I might use to perform this regression, and how do I implement them? Any suggestions/ help/ recommendations would be greatly appreciated.

Thanks very much.

Gautam.

---------------------------------------------------------------------------------------------------

Please post your regression equation and 10-15 rows of your X, S and D data.

It’s difficult to determine what the problem may be without that information.

---------------------------------------------------------------------------------------------------

>> X = [-10 101; -10 114; -10 129; -10 144; -5 96; -5 106; -5 119; -5 133; 5 92; 5 97; 5 108; 5 122];

>> Y = [72.03; 67.90; 62.77; 57.55; 88.53; 84.70; 79.91; 73.73; 123.43; 120.91; 115.44; 107.83];

>> x1 = X(:,1);

>> x2 = X(:,2);

>> cftool

I used a custom equation, Y = f(x1, x2) and the custom equation is:

a + b*x1 + c*x2 + d*(x1^2) + e*(x1*x2) + f*(x2^2) + g*(x1^3) + h*(x1^2*x2) + i*(x1*x2^2) + j*(x2^3)

What turned up in the Results box was this:

Complex value computed by model function, fitting cannot continue. Try using or tightening upper and lower bounds on coefficients.

Any help?

 

---------------------------------------------------------------------------------------------------

You have only three distinct values in the first column of your X matrix, so you won't be able to estimate constant, linear, quadratic, and cubic effects for that column. Here's how you can use all terms in the cubic model except x1^3:

fitlm(X,Y,[0 0;1 0;0 1;1 1;2 0;0 2;0 3;1 2;2 1])

Alternatively, if you know that a third order term is appropriate, you will need to collect some data at another value of x1.

---------------------------------------------------------------------------------------------------

Thank you very much Mr. Lane, for the comment. I certainly need to have the x1^3 term - I am not allowed to modify the equation - it is the AHRI 540-2004 Standard. And, the data is indistinct because it is a measurement of evaporator temperature (at the suction of a compressor) and that is how compressor data is taken and performance evaluated. Either the evaporator or the condenser temperature will have to be held constant, while the other is measured. Is there any way I can get around that?

---------------------------------------------------------------------------------------------------

To Gautam: Tom has told you the essential difficulty with your problem as you have presented it. The difficulty is not with your ten-coefficient third degree polynomial. It is with the shortage of data to be fitted to it. The quantity you called 'x1' occurs in only three distinct values: -10, -5, and +5. That is just not enough, given the complexity of your polynomial.

Look at it this way. Suppose your problem involves only a single variable and is approximated with a third degree polynomial with four adjustable coefficients. It is obvious that three data values of that single variable would not be enough to uniquely determine the four coefficients. You would have only three equations and four unknowns, and there are infinitely many coefficient combinations that would give you an exact fit to that limited data.

In other words, Gautam, your data is too easy to fit as far as the x1 variable is concerned. You need much more variation in x1 to produce a meaningful set of ten coefficients. Actually you are also not providing enough variation in the other variable, x2. You have only four values of x2 for each value of x1 and that is insufficient to give enough information to determine the coefficients with sufficient reliability. It is too easy to fit. You really need much greater variation in both variables than you have provided. That is what is behind the messages you have received about "rank deficiency". Ideally the messages ought to have complained with the declaration: "Data! Data! Give me more data! I am starving from too little data!"
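
To see this rank deficiency directly, one can build the full ten-column design matrix from the data posted above and check its rank (a minimal sketch):

x1 = [-10 -10 -10 -10 -5 -5 -5 -5 5 5 5 5]';
x2 = [101 114 129 144 96 106 119 133 92 97 108 122]';
M = [ones(12,1), x1, x2, x1.^2, x1.*x2, x2.^2, x1.^3, x2.*x1.^2, x1.*x2.^2, x2.^3];
rank(M)    % returns 9, not 10: with only three distinct x1 values, the
           % columns 1, x1, x1.^2, x1.^3 are linearly dependent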

---------------------------------------------------------------------------------------------------

Thank you very much Mr. Stafford for clearing that up for me. I'm a newcomer to Matlab and could not grasp completely the meaning behind the warnings. I tried out fitlm with a much bigger data set and it did give me the exact same values EES did, which has around 2% variation with the published coefficients. I cannot thank you enough.

---------------------------------------------------------------------------------------------------

There are numerous ways to do this. The easiest of them all is to use the Curve Fitting Tool with the correct options. Try this link for help:

http://www.mathworks.com/help/curvefit/polynomial.html

Using the Statistics Toolbox you can do that by specifying the polyijk modelspec:

http://www.mathworks.com/help/stats/fitlm.html#inputarg_modelspec

Another way is to solve a linear system in MATLAB. Assuming X, S and D are vectors in your MATLAB workspace. Create random data and compute the coefficients in C:

S = randn(100,1);    % random stand-in data for S
D = randn(100,1);
X = randn(100,1);
M = [ones(length(S),1), S, D, S.^2, S.*D, D.^2, S.^3, D.*S.^2, S.*D.^2, D.^3];
C = M\X

---------------------------------------------------------------------------------------------------

Shashank, thanks for the prompt response. In addition to the cftool I posted above, I also tried the below:

Using the same data (X and Y), I used the fitlm function with 'poly33' as the argument:

mdl = fitlm(X,Y,'poly33')

It did work, but the coefficients it gave me are not the coefficients I should be getting. The actual values of the coefficients are:

beta = [119.61 4.37 -0.39 -0.04 0.09 -0.01 0.00 0.02 0.00 0.00];

However, when I used fitlm, the value of the constant (119.61 above) is obtained as 0, which should not be the case.

As for the last method you mentioned (using the backslash operator), it gave me a warning that the rank of my matrix was deficient and still gave me wrong values for the values of the coefficients (and different from those obtained using the other functions).

Some more suggestions please?

 

---------------------------------------------------------------------------------------------------

I'm posting some of the attempts I made below:

X = [-10 101; -10 114; -10 129; -10 144; -5 96; -5 106; -5 119; -5 133; 5 92; 5 97; 5 108; 5 122];

>> Y = [72.03; 67.90; 62.77; 57.55; 88.53; 84.70; 79.91; 73.73; 123.43; 120.91; 115.44; 107.83];

>> x1 = X(:,1);

>> x2 = X(:,2);

>> mdl = fitlm(X,Y,'poly33')

Warning: Regression design matrix is rank deficient to within machine precision. 
> In TermsRegression>TermsRegression.checkDesignRank at 98
  In LinearModel.LinearModel>LinearModel.fit at 868
  In fitlm at 117

mdl =

Linear regression model:
    y ~ 1 + x1^2 + x1*x2 + x2^2 + x1^3 + (x1^2):x2 + x1:(x2^2) + x2^3

Estimated Coefficients:
                      Estimate            SE        tStat       pValue
                     _________      ________      _______      _______

    (Intercept)              0             0          NaN         NaN
    x1                 -10.143        5.3752       -1.887     0.19979
    x2                 -0.3125        1.1432     -0.27335     0.81022
    x1^2                5.5413        1.6852       3.2882    0.081361
    x1:x2            0.0043522      0.026071      0.16694     0.88277
    x2^2            2.6105e-05      0.010348    0.0025228     0.99822
    x1^3               0.55158       0.16763       3.2905     0.08126
    x1^2:x2        -5.4944e-05    0.00021022     -0.26136     0.81827
    x1:x2^2        -8.3639e-05    0.00012017     -0.69599     0.55844
    x2^3             -4.12e-06     3.109e-05     -0.13252      0.9067

Number of observations: 12, Error degrees of freedom: 3
Root Mean Squared Error: 0.165
R-squared: 1, Adjusted R-Squared 1
F-statistic vs. constant model: 2.76e+04, p-value = 3.3e-07

Attempt 2:

 A = [1	-10	101	100	-1010	10201	-1000	10100	-102010	1030301
1	-5	96	25	-480	9216	-125	2400	-46080	884736
1	5	92	25	460	8464	125	2300	42320	778688
1	-10	114	100	-1140	12996	-1000	11400	-129960	1481544
1	-5	106	25	-530	11236	-125	2650	-56180	1191016
1	5	97	25	485	9409	125	2425	47045	912673
1	-10	129	100	-1290	16641	-1000	12900	-166410	2146689
1	-5	119	25	-595	14161	-125	2975	-70805	1685159
1	5	108	25	540	11664	125	2700	58320	1259712
1	-10	144	100	-1440	20736	-1000	14400	-207360	2985984
1	-5	133	25	-665	17689	-125	3325	-88445	2352637
1	5	122	25	610	14884	125	3050	74420	1815848
];

b = A\Y
Warning: Rank deficient, rank = 9, tol = 4.463946e-09.

b =

         0
  -62.6154
   18.9421
  -35.6208
    2.9627
   -0.0958
   -3.8184
   -0.0236
   -0.0136
    0.0001

Attempt 3:

 sf = fit([x1 x2],Y, 'poly33', 'Robust', 'Bi')

Warning: Equation is badly conditioned. Remove repeated data points or try centering and scaling. 
> In Warning>Warning.throw at 30
  In fit>iLinearFit at 690
  In fit>iFit at 396
  In fit at 108 
Warning: Iteration limit reached for robust fitting. 
> In Warning>Warning.throw at 30
  In fit>iRobustLinearFit at 796
  In fit>iFit at 404
  In fit at 108

 Linear model Poly33:
     sf(x,y) = p00 + p10*x + p01*y + p20*x^2 + p11*x*y + p02*y^2 + p30*x^3 + 
                    p21*x^2*y + p12*x*y^2 + p03*y^3
     Coefficients (with 95% confidence bounds):
       p00 =  -2.736e+14  (-6.205e+17, 6.199e+17)
       p10 =  -2.736e+13  (-6.205e+16, 6.199e+16)
       p01 =      0.3112  (-346.1, 346.7)
       p20 =   1.094e+13  (-2.48e+16, 2.482e+16)
       p11 =   -0.009459  (-5.569, 5.551)
       p02 =   -0.005727  (-3.308, 3.296)
       p30 =   1.094e+12  (-2.48e+15, 2.482e+15)
       p21 =  -0.0001193  (-0.4058, 0.4055)
       p12 =  -1.846e-05  (-0.02545, 0.02542)
       p03 =   1.358e-05  (-0.01007, 0.01009)

Attempt 4:

 sf = fit([x1 x2],Y, 'poly33', 'Robust', 'Bi','Normalize','on')
Warning: Equation is badly conditioned. Remove repeated data points or try centering and scaling. 
> In Warning>Warning.throw at 30
  In fit>iLinearFit at 690
  In fit>iFit at 396
  In fit at 108 
Warning: Iteration limit reached for robust fitting. 
> In Warning>Warning.throw at 30
  In fit>iRobustLinearFit at 796
  In fit>iFit at 404
  In fit at 108 
 Linear model Poly33:
     sf(x,y) = p00 + p10*x + p01*y + p20*x^2 + p11*x*y + p02*y^2 + p30*x^3 + 
                    p21*x^2*y + p12*x*y^2 + p03*y^3
       where x is normalized by mean -3.333 and std 6.513
       and where y is normalized by mean 113.4 and std 16.34
     Coefficients (with 95% confidence bounds):
       p00 =   2.802e+14  (-4.173e+15, 4.734e+15)
       p10 =    1.15e+15  (-1.712e+16, 1.942e+16)
       p01 =      -6.475  (-15.32, 2.371)
       p20 =       1.005  (-3.659, 5.67)
       p11 =      -1.507  (-6.289, 3.275)
       p02 =     -0.3495  (-5.084, 4.385)
       p30 =  -8.362e+14  (-1.413e+16, 1.245e+16)
       p21 =     -0.2157  (-8.185, 7.754)
       p12 =     -0.6978  (-12.13, 10.73)
       p03 =     -0.3148  (-7.039, 6.409)

Is there anything I haven't yet tried out or any mistakes I made? Anything at all would be helpful. Thank you.

---------------------------------------------------------------------------------------------------

 

Posted by uniqueone