## Model selection

October 01, 2011 at 11:52 PM | categories: data analysis


adapted from http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd44.htm

In this example, we show some ways to choose which of several models fits the data best. We have data for the total pressure and temperature of a fixed amount of a gas in a tank, measured over the course of several days. We want to select a model that relates the pressure to the gas temperature.

The data is stored in a text file, PT.txt, with the following structure:


| Run Order | Day | Ambient Temperature | Temperature | Pressure | Fitted Value | Residual |
|---|---|---|---|---|---|---|
| 1 | 1 | 23.820 | 54.749 | 225.066 | 222.920 | 2.146 |

We need to read the data in, and perform a regression analysis on columns 4 and 5.

```
clc; clear all; close all
```

## Read in the data

All the data is numeric, so we read it into one matrix and then extract the relevant columns.

```
% Read the numeric table, skipping the two header lines
data = textread('PT.txt','','headerlines',2);

run_order = data(:,1);
run_day   = data(:,2);
ambientT  = data(:,3);
T         = data(:,4);
P         = data(:,5);

plot(T,P,'k.')
xlabel('Temperature')
ylabel('Pressure')
```

## Fit a line to the data for P(T) = a + bT

It appears the data is roughly linear, and we know from the ideal gas law that PV = nRT, or P = nR/V*T, which says P should be linearly correlated with T. Note that the temperature data is in degC, not in K, so it is not expected that P = 0 at T = 0. Let X = T and Y = P for the regress syntax. The regress command expects a column of ones for the constant term so that the statistical tests it performs are valid. We get that column by raising the independent variable to the zero power.

```
X = [T.^0 T]; % [1 T1]
Y = P;
alpha = 0.05; % specifies the 95% confidence interval
[b bint] = regress(Y,X,alpha)
```

```
b =
    7.7490
    3.9301

bint =
    2.9839   12.5141
    3.8275    4.0328
```

Note that the intercept is not zero: although its confidence interval is fairly wide, it does not include zero at the 95% confidence level.
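To see what the column of ones does, the same least-squares coefficients can be obtained with MATLAB's backslash operator. The sketch below uses a small made-up, noise-free data set (the names `T_demo` and `P_demo` and their values are assumptions for illustration, not the PT.txt data), so the fit should recover the intercept and slope used to generate it. Unlike `regress`, backslash returns no confidence intervals.

```matlab
% Hypothetical, noise-free demo data generated from P = 7.75 + 3.93*T
T_demo = [20; 30; 40; 50; 60];
P_demo = 7.75 + 3.93*T_demo;

% Design matrix: a column of ones (constant term) and a column of T
X_demo = [ones(size(T_demo)) T_demo];

% Least-squares solution of X_demo*b_demo = P_demo
b_demo = X_demo \ P_demo;

fprintf('intercept = %1.4f, slope = %1.4f\n', b_demo(1), b_demo(2))
```

Dropping the column of ones from `X_demo` would force the fitted line through the origin.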

## Calculate R^2 for the line

The R^2 value accounts roughly for the fraction of variation in the data that can be described by the model. Hence, a value close to one means nearly all the variations are described by the model, except for random variations.

```
ybar = mean(Y);
SStot = sum((Y - ybar).^2);
SSerr = sum((Y - X*b).^2);
R2 = 1 - SSerr/SStot;
sprintf('R^2 = %1.3f',R2)
```

ans = R^2 = 0.994
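In symbols, with $y_i$ the measured pressures, $\hat{y}_i$ the fitted values `X*b`, and $\bar{y}$ the mean of the measurements, the calculation above is:

```latex
R^2 = 1 - \frac{SS_{err}}{SS_{tot}}, \qquad
SS_{err} = \sum_i \left(y_i - \hat{y}_i\right)^2, \qquad
SS_{tot} = \sum_i \left(y_i - \bar{y}\right)^2
```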

## Plot the data and the fit

```
plot(T,P,'k.',T,X*b,'b-')
xlabel('Temperature')
ylabel('Pressure')
title(sprintf('R^2 = %1.3f',R2))
```

## Evaluating the model

The fit looks good, and R^2 is near one, but is it a good model? There are a few ways to examine this. We want to make sure that there are no systematic trends in the errors between the fit and the data, and that there are no hidden correlations with other variables.

## Plot the residuals

The residuals are the errors between the fit and the data. They should not show any pattern when plotted against any variable, and they do not in this case.

```
residuals = P - X*b;

figure
hold all
subplot(1,3,1)
plot(T,residuals,'ko')
xlabel('Temperature')

subplot(1,3,2)
plot(run_order,residuals,'ko')
xlabel('run order')

subplot(1,3,3)
plot(ambientT,residuals,'ko')
xlabel('ambient temperature')
```

## Check for correlations between residuals

We assume all the errors are uncorrelated with each other. We use a lag plot, where we plot residual(i) vs residual(i-1), i.e. we look for correlations between adjacent residuals. This plot should look random, with no correlations if the model is good.

```
figure
plot(residuals(2:end),residuals(1:end-1),'ko')
xlabel('residual(i)')
ylabel('residual(i-1)')
```
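The lag plot can be complemented by a single number, the correlation coefficient between adjacent residuals. The sketch below uses a made-up, deliberately alternating sequence (`r_demo` is hypothetical, not the residuals from this fit) to show what a strongly patterned case looks like; uncorrelated residuals would give a value near zero.

```matlab
% Hypothetical residuals that alternate in sign (a clear pattern)
r_demo = [1; -1; 1; -1; 1; -1];

% Correlation between residual(i) and residual(i-1)
C = corrcoef(r_demo(2:end), r_demo(1:end-1));
lag1 = C(1,2);

fprintf('lag-1 correlation = %1.3f\n', lag1)
```

To apply this to the fit, replace `r_demo` with the `residuals` vector.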

## Alternative models

Let's consider a quadratic model instead.

```
X = [T.^0 T.^1 T.^2];
Y = P;
alpha = 0.05; % 95% confidence interval
[b bint] = regress(Y,X,alpha)
```

```
b =
    9.0035
    3.8667
    0.0007

bint =
   -4.7995   22.8066
    3.2046    4.5288
   -0.0068    0.0082
```

You can see that the 95% confidence intervals on both the constant and the quadratic coefficient include zero, so adding a parameter does not increase the goodness of fit. This is an example of overfitting the data, but it also makes you question whether the constant is meaningful in the linear model. The regress function expects a constant in the model; as noted earlier, the statistical tests it performs assume that the column of ones is present.
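A quick way to flag such coefficients is to test whether each interval straddles zero. The sketch below hard-codes the `bint` values reported above rather than rerunning `regress`; the names `bint_quad` and `includes_zero` are introduced here for illustration.

```matlab
% Confidence intervals copied from the quadratic fit output above
% (rows: constant, linear, and quadratic coefficients)
bint_quad = [-4.7995 22.8066
              3.2046  4.5288
             -0.0068  0.0082];

% True where the lower bound is negative and the upper bound is positive
includes_zero = bint_quad(:,1) < 0 & bint_quad(:,2) > 0;

disp(includes_zero')
```

Only the linear coefficient has an interval that excludes zero.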

## Alternative models

Let's consider a model with intercept = 0, P = b*T.

```
X = [T];
Y = P;
alpha = 0.05; % 95% confidence interval
[b bint] = regress(Y,X,alpha)

plot(T,P,'k.',T,X*b,'b-')
xlabel('Temperature')
ylabel('Pressure')
legend 'data' 'fit'

ybar = mean(Y);
SStot = sum((Y - ybar).^2);
SSerr = sum((Y - X*b).^2);
R2 = 1 - SSerr/SStot;
title(sprintf('R^2 = %1.3f',R2))
```

```
b =
    4.0899

bint =
    4.0568    4.1231
```

The fit still looks good visually, and the R^2 value is only slightly worse.

## Plot residuals

You can see a slight trend of decreasing residuals as the temperature increases. This may indicate a deficiency in the model with no intercept. For the ideal gas law with T in degC, PV = nR(T + 273.15), or P = nR/V*T + 273.15*nR/V, so the intercept is expected to be non-zero in this case. The missing intercept is an example of such a deficiency.

```
residuals = P - X*b;

figure
plot(T,residuals,'ko')
xlabel('Temperature')
ylabel('residuals')
```
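One way to put a number on such a trend is to fit a straight line to the residuals and look at its slope. The sketch below is only an illustration under stated assumptions: it uses made-up, noise-free data with a known non-zero intercept (`T_demo`, `P_demo`), fits it through the origin with backslash instead of `regress`, and then applies `polyfit` to the residuals.

```matlab
% Hypothetical data generated from P = 8 + 4*T (non-zero intercept)
T_demo = [20; 30; 40; 50; 60];
P_demo = 8 + 4*T_demo;

% Fit the zero-intercept model P = b*T by least squares
b0 = T_demo \ P_demo;
res_demo = P_demo - T_demo*b0;

% Slope of a straight line through the residuals vs. temperature
p = polyfit(T_demo, res_demo, 1);
slope = p(1);

fprintf('residual slope = %1.4f\n', slope)
```

The slope comes out negative for this demo data, the same kind of decreasing trend in the residuals that the zero-intercept fit shows.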
