
Econometrics Toolbox

Specifying Static Time Series Models

Strategies for specification analysis


Introduction

The simplest econometric models are static, representing systems that respond exclusively to current events. By contrast, dynamic models use lagged variables to incorporate feedback over time.

Static processes are often described by multiple linear regression (MLR) models:

$$y_t = X_t \beta + e_t,$$

where $y_t$ is an observed response and $X_t$ includes columns for contemporaneous values of observable predictors. The partial regression coefficients in $\beta$ represent the marginal contributions of individual predictors to the variation in $y_t$, holding the other predictors fixed. The term $e_t$ is a catch-all for any differences between predicted and observed values of $y_t$. Prediction errors may be due to process fluctuations (changes in $\beta$), measurement errors (changes in $X_t$), and model misspecification. They can also arise from inherent stochasticity in the underlying data-generating process. For modeling purposes, it is assumed that $e_t$ is generated by an unobservable innovations process with stationary covariance:

$$\Omega = Cov(\{e_1,...,e_T\}),$$

for sample size $T$. Under some basic assumptions about $X_t$, $e_t$, and their relationship, reliable estimates of $\beta$ are obtained by ordinary least squares (OLS).
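Before turning to real data, the mechanics of OLS estimation are easy to illustrate on simulated data. The following is a minimal sketch; every variable name in it is illustrative and not part of the example that follows:

% Minimal OLS sketch on simulated data (all names here are illustrative):
rng(1);                          % For reproducibility
T = 100;                         % Sample size
X = [ones(T,1),randn(T,2)];      % Intercept plus two predictors
beta = [0.5; 1; -2];             % "True" coefficients
e = 0.1*randn(T,1);              % Innovations
y = X*beta + e;                  % Simulated response

betaHatOLS = X\y                 % OLS estimate (QR-based backslash)
betaHatNE = (X'*X)\(X'*y)        % Same estimate via the normal equations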

As in other social sciences, economic data are often collected by passive observation, without the aid of controlled experiments. This can lead to limited, dependent observations of low variability. Theoretically relevant predictors may be replaced by practically available proxies. These data shortcomings can lead to a number of issues with the reliability of OLS estimates and the standard statistical techniques applied to model specification. Observed changes in the response may be correlated with, but not caused by, observed changes in the predictors. Simultaneous changes in multiple predictors may produce interactions that are difficult to separate into individual effects. Coefficient estimates may be sensitive to data errors, making significance tests unreliable.

Assessing model assumptions in the context of available data is the goal of specification analysis. When a model becomes suspect, practical solutions may be limited, but a thorough analysis can help to identify the source and degree of any problems. This demo introduces some basic techniques for diagnosing static MLR models, and offers some general strategies for addressing the specification issues that may arise.

Model Assumptions

Classical linear model (CLM) assumptions allow OLS to produce estimates of $\beta$ with desirable properties. The fundamental assumption is that the MLR model, and the predictors selected, correctly specify the underlying data-generating process. Other CLM assumptions include:

$\bullet \: X_t$ is full rank (no collinearity among the predictors).

$\bullet \: e_t$ is uncorrelated with $X_s$ for all $s$ (strict exogeneity of the predictors).

$\bullet \: e_t$ is not autocorrelated ($\Omega$ is diagonal).

$\bullet \: e_t$ is homoscedastic (diagonal entries in $\Omega$ are all $\sigma^2$).

Under the CLM assumptions, the Gauss-Markov theorem says that the OLS estimate $\hat{\beta}$ is the best (smallest variance) linear unbiased estimator (BLUE) of the coefficients in $\beta$. If the innovations $e_t$ are normally distributed, $\hat{\beta}$ will also be normally distributed. In that case, reliable $t$ and $F$ tests can be carried out on the coefficient estimates to assess predictor significance, and confidence intervals can be constructed to describe estimator variance. Normality also allows $\hat{\beta}$ to be estimated by maximum likelihood, with identical results. Regardless of the distribution of $e_t$, however, the Central Limit theorem assures that $\hat{\beta}$ will be approximately normally distributed in large samples, so that the standard inference techniques related to model specification become valid asymptotically.

Violations of CLM assumptions do not necessarily invalidate the results of an OLS regression. A specification analysis attempts to identify specific violations, assess their effects on model estimation, and suggest possible remedies in the context of modeling goals.

In a static MLR model, the predictors forming the columns of $X_t$ are, by definition, contemporaneous with the response $y_t$. There is nothing in the CLM assumptions, however, that explicitly excludes variables with lags or leads. More dynamic specifications raise issues, such as lag order selection, which we do not consider here, but they do not alter the basic techniques of specification analysis. Indeed, our main example, introduced in the next section, is a forecasting model, with a one-year lead in the response data.

Examining Data

To begin, consider a simple MLR model of credit default rates. The file Data_CreditDefaults.mat contains historical data on investment-grade corporate bond defaults, as well as data on four potential predictors for the years 1984 to 2004:

load Data_CreditDefaults
X0 = Data(:,1:4);         % Initial predictor set (matrix)
X0DS = Dataset(:,1:4);    % Initial predictor set (dataset array)
predNames0 = series(1:4); % Initial predictor set names
T0 = size(X0,1);          % Initial sample size
y0 = Data(:,5);           % Initial response data
respName0 = series{5};    % Initial response data name


The potential predictors, measured for year $t$, are:

AGE Percentage of investment-grade bond issuers first rated 3 years ago. These relatively new issuers have a high empirical probability of default after capital from the initial issue is expended.

BBB Percentage of investment-grade bond issuers with a Standard & Poor's credit rating of BBB, the lowest investment grade. This percentage represents another risk factor.

CPF One-year-ahead forecast of the change in corporate profits, adjusted for inflation. The forecast is a measure of overall economic health, included as an indicator of larger business cycles.

SPR Spread between corporate bond yields and those of comparable government bonds. The spread is another measure of the risk of current issues.


The response, measured for year $t+1$, is:

IGD Default rate on investment-grade corporate bonds


As described in [14] and [18], some of the variables are proxies, constructed from other series.

We first examine the data, converting the dates to serial date numbers so that the utility function recessionplot can overlay bands showing dips in the business cycle:

% Convert dates to serial date numbers:
dateNums = datenum([dates,ones(T0,2)]);

% Plot potential predictors:
figure
plot(dateNums,X0,'LineWidth',2)
set(gca,'XTick',dateNums(1:2:end))
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Year')
ylabel('Predictor Level')
legend(predNames0,'Location','NW')
title('{\bf Potential Predictors}')
axis tight
grid on

% Plot response:
figure
hold on
plot(dateNums,y0,'k','LineWidth',2);
plot(dateNums,y0-detrend(y0),'m--')
hold off
set(gca,'XTick',dateNums(1:2:end))
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Year')
ylabel('Response Level')
legend(respName0,'Trend','Location','NW')
title('{\bf Response}')
axis tight
grid on

We see that BBB is on a slightly different scale than the other predictors, and trending over time. The response data is for year $t+1$, so that the peak in default rates actually follows the recession in 2001.

As a first step toward model specification, it is useful to identify any possible dependencies among the predictors. The correlation matrix is a standard measure of the strength of pairwise linear relationships:

R0 = corrcoef(X0)
R0 =

    1.0000    0.4578    0.0566   -0.0686
    0.4578    1.0000    0.3955    0.3082
    0.0566    0.3955    1.0000    0.0874
   -0.0686    0.3082    0.0874    1.0000

The utility function corrplot helps to visualize the results in the correlation matrix by plotting a matrix of pairwise scatters. Slopes of the displayed least-squares lines are equal to the displayed correlation coefficients. It is convenient to work with the dataset array version of the data, X0DS, which contains the predictor names for the plots:

corrplot(X0DS,'testR','on')

Coefficients highlighted in red have a significant $t$-statistic. The predictor BBB again distinguishes itself by its relatively high correlations with the other predictors, though the strength of the relationships is moderate. Here the visualization is particularly helpful, as BBB displays fairly disorganized scatters with a number of small, potentially influential data subsets. The plots are a reminder of the limitations of the linear correlation coefficient as a summary statistic.

Both the scale and correlations of BBB have the potential to inflate the condition number $\kappa$ of $X_t$. The condition number is often used to characterize the overall sensitivity of OLS coefficient estimates to changes in the data. For an MLR model with intercept:

% Add intercept:
X0I = [ones(T0,1),X0]; % Matrix
X0IDS = [dataset({ones(T0,1),'Const'}),X0DS]; % Dataset array

kappa0I = cond(X0I)
kappa0I =

  205.8085

The condition number is above the "well-conditioned" benchmark of 1, which is achieved when $X_t$ has orthonormal columns. As a rule of thumb, a 1% relative error in the data $X_t$ can produce up to a $\kappa$% relative error in the coefficient estimates $\beta$ [20]:

$${\|\delta \beta\| \over \|\beta\|} \le \kappa{\|\delta X_t\| \over \|X_t\|}$$

As we will see, coefficient estimates for this data are on the order of $10^{-2}$, so with $\kappa$ on the order of $10^2$, the bound above implies that the absolute estimation errors $\|\delta \beta\|$ can be of the same order as the relative errors in the data.

Estimator Variance

Correlations and condition numbers are widely used to flag potential data problems, but their diagnostic value is limited. Correlations consider only pairwise dependencies between predictors, while condition numbers consider only $X_t$ in aggregate. Relationships among arbitrary predictor subsets (multicollinearities) can fall somewhere in between. CLM assumptions forbid exact relationships, but identifying the strength and source of any near relationships, and their specific effect on coefficient estimation, is an essential part of specification analysis.

Many methods for detecting near collinearities focus on the resulting coefficient estimates in $\hat{\beta}$, rather than the data in $X_t$. Each of the following has been suggested as a telltale sign of predictor dependencies:

$\bullet$ Statistically insignificant coefficients on theoretically important predictors

$\bullet$ Coefficients with signs or magnitudes that do not make theoretical sense

$\bullet$ Extreme sensitivity of a coefficient to insertion or deletion of other predictors

The qualitative nature of these criteria is apparent, and unfortunately none of them is necessary or sufficient for detecting collinearity.

To illustrate, we use the function regstats to compute OLS coefficient estimates and related statistics. Since regstats automatically adds an intercept to the input data, we pass it X0 rather than X0I. The example function Example_RegDisp displays the results in tabular form:

% Regression with original data:
stats0 = regstats(y0,X0);
Example_RegDisp(stats0,'varNames',['Const',predNames0],'diagnostics','MSE')
REGRESSION RESULTS

      | Coeff     SE        tStat     pVal
============================================
Const | -0.2274   0.0986    -2.3072   0.0347
AGE   | 0.0168    0.0092    1.8271    0.0864
BBB   | 0.0043    0.0027    1.5969    0.1299
CPF   | -0.0149   0.0038    -3.9100   0.0012
SPR   | 0.0455    0.0340    1.3380    0.1996

Diagnostics:

MSE: 0.0058

The signs of the coefficient estimates are consistent with theoretical expectations: AGE, BBB, and SPR add risk; CPF reduces it. The $t$-statistics, which scale the coefficient estimates by their standard errors (computed under the assumption of normal innovations), show that all predictors are significantly different from 0 at the 20% level, with CPF especially significant. In short, there is nothing in the standard regression results to raise substantial concern about collinearity.

To put these results in perspective, however, it is necessary to consider other sources of estimator variance. Under CLM assumptions, the variance of the $i^{th}$ component of $\hat{\beta}$, $\hat{\beta_i}$, can be decomposed as follows [24]:

$$Var(\hat{\beta_i}) = {\sigma^2 \over SST_i(1-R_i^2)},$$

where $\sigma^2$ is the variance of the innovations process (assumed constant), $SST_i$ is the total sample variation of predictor $i$, and $R_i^2$ is the coefficient of determination from a regression of predictor $i$ on the remaining predictors (and intercept, if present).

The term

$$VIF_i = {1 \over 1-R_i^2}$$

is called the variance inflation factor (VIF), which is another common collinearity diagnostic. When the variation of predictor $i$ is largely explained by a linear combination of the other predictors, $R_i^2$ is close to $1$, and the VIF for that predictor is correspondingly large. The inflation is measured relative to an $R_i^2$ of 0 (no collinearity), and a VIF of 1.
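To make the definition concrete, each $R_i^2$ can be computed from an explicit auxiliary regression of predictor $i$ on the others. The following sketch uses the X0 and T0 variables defined earlier:

% Sketch: VIFs from auxiliary regressions of each predictor on the others
VIFCheck = zeros(1,4);
for i = 1:4
    others = setdiff(1:4,i);
    Xi = [ones(T0,1),X0(:,others)];   % Other predictors, plus intercept
    ri = X0(:,i) - Xi*(Xi\X0(:,i));   % Residuals from the auxiliary regression
    R2i = 1 - sum(ri.^2)/sum((X0(:,i)-mean(X0(:,i))).^2);
    VIFCheck(i) = 1/(1-R2i);
end
VIFCheck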

VIFs are also the diagonal elements of the inverse of the correlation matrix [1], a convenient result that eliminates the need to set up the various regressions:

VIF = diag(inv(R0))'
predNames0
VIF =

    1.3870    1.7901    1.2216    1.1850


predNames0 =

    'AGE'    'BBB'    'CPF'    'SPR'

How large a VIF is cause for concern? As with significance levels for standard hypothesis tests, experience with certain types of data may suggest useful tolerances. Common ad hoc values, in the range of 5 to 10, are of little use in general. In this case, BBB has the highest VIF, but it does not jump out from the rest of the predictors.

More importantly, VIF is only one factor in the variance decomposition given above. A large VIF can be balanced by either a small innovations variance $\sigma^2$ (good model fit) or a large sample variation $SST_i$ (sufficient data). As such, Goldberger [8] ironically compares the "problem" of multicollinearity, viewed in isolation, to the problem of data "micronumerosity." Evaluating the combined effect of the different sources of estimator variance requires a wider view.

Econometricians have developed a number of rules of thumb for deciding when to worry about collinearity. Perhaps the most common says that it is acceptable to ignore evidence of collinearity if the resulting $t$-statistics are all greater than 2 in absolute value. This ensures that 0 is outside of the approximate 95% confidence interval of each estimate (assuming normal innovations or a large sample). Because $t$-statistics are already adjusted for estimator variance, the presumption is that they adequately account for collinearity in the context of other, balancing effects. The regression results above show that three of the potential predictors in X0 fail this test.
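This check can be read directly from the stats0 structure computed above; a quick sketch, taking the $t$-statistics from its tstat field:

% Sketch: check the |t| > 2 rule of thumb using the regression statistics above
tStats0 = stats0.tstat.t(2:end);      % t-statistics for AGE, BBB, CPF, SPR
passTRule = abs(tStats0)' > 2         % Expect only CPF to pass, per the table above
predNames0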

Another rule of thumb is based on an estimate of $Var(\hat{\beta_i})$ [21]:

$$\widehat{Var}(\hat{\beta_i}) = {1 \over {T-n}}{\hat\sigma_y^2 \over \hat\sigma_i^2}{1-R^2 \over 1-R_i^2},$$

where $T$ is the sample size, $n$ is the number of predictors, $\hat\sigma_y^2$ is the estimated variance of $y_t$, $\hat\sigma_i^2$ is the estimated variance of predictor $i$, $R^2$ is the coefficient of determination for the regression of $y_t$ on $X_t$, and $R_i^2$ is as above. The rule says that concerns about collinearity can be ignored if $R^2$ exceeds $R_i^2$ for each predictor, since each VIF will be balanced by $1-R^2$. All of the potential predictors in X0 pass this test:

RSquared = stats0.rsquare
RSquared_i = 1-(1./VIF)
predNames0
RSquared =

    0.6211


RSquared_i =

    0.2790    0.4414    0.1814    0.1561


predNames0 =

    'AGE'    'BBB'    'CPF'    'SPR'

These rules attempt to identify the consequences of collinearity, as expressed in the regression results. As we see, they can offer conflicting advice on when, and how much, to worry about the integrity of the coefficient estimates. They do not provide any accounting of the nature of the multiple dependencies within the data, nor do they provide any reliable measure of the extent to which those dependencies degrade the regression.

A more detailed analytic approach is provided by Belsley [1]. Instability of OLS estimates can be traced to small eigenvalues in the cross-product matrix $X_t^T X_t$ appearing in the normal equations for $\hat{\beta}$:

$$\hat{\beta} = (X_t^T X_t)^{\rm-1} X_t^T y_t.$$

Belsley reformulates the eigensystem of $X_t^T X_t$ in terms of the singular values of the matrix $X_t$, which can then be analyzed directly, with greater numerical stability. If the singular values of $X_t$ are $\mu_1, ..., \mu_n$, where $n$ is the number of predictors, then the condition number of $X_t$ is $\kappa = \mu_{max}/\mu_{min}$. Belsley defines a spectrum of condition indices $\eta_j = \mu_{max}/\mu_j$ for each $j = 1, ..., n$, and shows that high indices are indicative of separate near dependencies in the data.

Belsley goes further by describing a method for identifying the specific predictors involved in each near dependency, and provides a measure of how important those dependencies are in affecting coefficient estimates. This is achieved with yet another decomposition of $Var(\hat{\beta_i})$, this time in terms of the singular values. If $X_t$ has a singular-value decomposition $USV^T$, with $V = (v_{ij})$, then:

$$Var(\hat{\beta_i}) = \sigma^2 \sum_{j=1}^n v_{ij}^2 / \mu_j^2,$$

where $\sigma^2$ is the innovations variance. The variance decomposition proportions $\pi_{ji}$ are defined by:

$$\phi_{ij} = v_{ij}^2 / \mu_j^2,$$

$$\phi_i = \sum_{j=1}^n \phi_{ij},$$

$$\pi_{ji} = \phi_{ij} / \phi_i.$$

The $\pi_{ji}$ give the proportion of $Var(\hat{\beta_i})$ associated with singular value $\mu_j$.
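To see where the collintest results below come from, the condition indices and variance decomposition proportions can be computed directly from the SVD. A rough sketch, assuming Belsley's column equilibration (each column of the intercept-augmented matrix scaled to unit length), which is what collintest appears to apply:

% Sketch of Belsley's computation, using the intercept-augmented data X0I
Xs = bsxfun(@rdivide,X0I,sqrt(sum(X0I.^2,1)));  % Scale columns to unit length
[~,S,V] = svd(Xs,0);
mu = diag(S);                                   % Singular values, descending
condIdx = mu(1)./mu                             % Condition indices eta_j
phi = bsxfun(@rdivide,V.^2,(mu.^2)');           % phi_ij = v_ij^2/mu_j^2
piProps = bsxfun(@rdivide,phi,sum(phi,2))'      % pi_ji, one row per singular value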

Indices and proportions are interpreted as follows:

$\bullet$ The number of high condition indices identifies the number of near dependencies.

$\bullet$ The size of the condition indices identifies the tightness of each dependency.

$\bullet$ The location of high proportions in a high index row identifies the dependent predictors.

$\bullet$ The size of the proportions identifies the degree of degradation to regression estimates.

Again, a tolerance for "high" must be determined. Belsley's simulation experiments suggest that condition indices in the range of 5 to 10 reflect weak dependencies, and those in the range 30 to 100 reflect moderate to high dependencies. He suggests a tolerance of 0.5 for variance decomposition proportions identifying individual predictors. Simulation experiments, however, are necessarily based on specific models of mutual dependence, so tolerances need to be reevaluated in each empirical setting.

The function collintest implements Belsley's procedure. Outputs are displayed in tabular form:

collintest(X0IDS);
Variance Decomposition

sValue    condIdx   Const AGE   BBB   CPF   SPR
=================================================
2.06      1         0.002 0.002 0.002 0.014 0.003
0.801     2.57      0.002 0.003 0.000 0.822 0.002
0.256     8.04      0.004 0.321 0.011 0.000 0.378
0.171     12        0.260 0.095 0.829 0.146 0.000
0.134     15.3      0.734 0.579 0.158 0.017 0.617

If we lower the index tolerance to 10 and maintain a proportion tolerance of 0.5, the analysis identifies one weak dependency between AGE and SPR in the final row. It can be visualized by setting the 'tolIdx' and 'tolProp' parameters in collintest and turning on the 'plot' flag:

collintest(X0IDS,'tolIdx',10,'tolProp',0.5,'display','off','plot','on');

Colors here do not correspond to the predictor colors in previous plots. The plot shows critical rows in the variance decomposition table, above the index tolerance. The row associated with condition index 12 has only one predictor, BBB, with a proportion above the tolerance, not the two or more predictors required for a dependency. The row associated with condition index 15.3 shows the weak dependence involving AGE, SPR, and the intercept. This relationship was not apparent in the initial plot of the correlation matrix.

In summary, the results of the various collinearity diagnostics are consistent with data in which no degrading near relationships exist. Indeed, a review of the economic meaning of the potential predictors (easily lost in a purely statistical analysis) does not suggest any theoretical reason for strong relationships. Regardless of weak dependencies, OLS estimates remain BLUE, and the standard errors in the regression results show an accuracy that is probably acceptable for most modeling purposes.

To conclude this section, we briefly examine the technique of ridge regression, which is often suggested as a remedy for estimator variance in MLR models of data with some degree of collinearity. The technique can also be used as a collinearity diagnostic.

To address the problem of near singularity in $X_t^T X_t$, ridge regression estimates $\hat{\beta}$ using a regularization of the normal equations:

$$\hat{\beta}_{ridge} = (X_t^T X_t + kI)^{\rm-1} X_t^T y_t,$$

where $k$ is a positive ridge parameter and $I$ is the identity matrix. The perturbations to the diagonal of $X_t^T X_t$ are intended to improve the conditioning of the eigenvalue problem and reduce the variance of the coefficient estimates. As $k$ increases, ridge estimates become biased toward zero, but a reduced variance can result in a smaller mean-squared error (MSE) relative to comparable OLS estimates, especially in the presence of collinearity.
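The estimator is simple to compute directly for a single illustrative value of $k$. A sketch (kExample is an arbitrary choice; the built-in ridge function used below standardizes the predictors internally, so its parameterization of $k$ is not directly comparable to this raw-data version):

% Sketch: one ridge estimate directly from the regularized normal equations
kExample = 1;                                               % Illustrative value
betaRidgeExample = (X0I'*X0I + kExample*eye(size(X0I,2)))\(X0I'*y0)
betaOLSCheck = (X0I'*X0I)\(X0I'*y0)                         % k = 0 recovers OLS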

Ridge regression is carried out by the function ridge. To examine the results for a range of ridge parameters $k$, a ridge trace [15] is produced:

Mu0I = mean(diag(X0I'*X0I));   % Scale of cross-product diagonal

k = 0:Mu0I/10;                 % Range of ridge parameters
ridgeBetas = ridge(y0,X0,k,0); % Coefficients for MLR model with intercept

figure
plot(k,ridgeBetas(2:end,:),'LineWidth',2)
xlim([0 Mu0I/10])
legend(predNames0)
xlabel('Ridge Parameter')
ylabel('Ridge Coefficient Estimate')
title('{\bf Ridge Trace}')
axis tight
grid on

The OLS estimates, with $k = 0$, appear on the left. The important question is whether any of the ridge estimates reduce the MSE:

[numRidgeParams,numRidgeBetas] = size(ridgeBetas); % Coefficients by number of ridge parameters k
y0Hat = X0I*ridgeBetas;                            % Fitted responses, one column per k
RidgeRes = repmat(y0,1,numRidgeBetas)-y0Hat;       % Residuals, one column per k
RidgeSSE = RidgeRes'*RidgeRes;                     % Cross-products; diagonal holds the SSE for each k
RidgeDFE = T0-numRidgeParams;                      % Residual degrees of freedom
RidgeMSE = diag(RidgeSSE/RidgeDFE);                % MSE at each k

figure
plot(k,RidgeMSE,'m','LineWidth',2)
xlim([0 Mu0I/10])
xlabel('Ridge Parameter')
ylabel('MSE')
title('{\bf Ridge MSE}')
axis tight
grid on

The plot shows exactly the opposite of what one would hope for when applying ridge regression. The MSE actually increases over the entire range of ridge parameters, suggesting again that there is no significant collinearity in the data for ridge regression to correct.

A technique related to ridge regression, the lasso, is described in the section on "Predictor Selection."

Preprocessing Steps

When considering the empirical limitations that affect OLS estimates, Belsley et al. [1] advise that collinearities be addressed first. A next step is to look for influential observations, whose presence, individually or in groups, has measurable effects on regression results. We distinguish the essentially metric notion of "influential observation" from the more subjective notion of "outlier," which may include any data that do not follow expected patterns.

Influential observations arise in two fundamentally distinct ways. First, they may be the result of measurement or recording errors. In that case, they are just bad data, detrimental to model estimation. On the other hand, they may reflect the true distribution of the innovations process, exhibiting heteroscedasticity or leptokurtosis for which the model fails to account. These observations may contain abnormal sample information that is nevertheless essential to accurate model estimation. Determining the type of influential observation is difficult when looking at data alone; the best clues are often found in the data-model interactions that produce the residual series. We investigate these further in the section on "Residual Diagnostics."

Preprocessing influential observations has three components: identification, influence assessment, and accommodation. In econometric settings, influence assessment is part of the identification process, since it is based on regression statistics. Accommodation, if there is any, is usually a choice between deleting data, which requires making assumptions about the data-generating process, or else implementing a suitably robust estimation procedure, which has the potential to obscure abnormal, but possibly important, information.

Time series data differ from cross-sectional data in that deleting observations leaves "holes" in the time base of the sample. Standard methods for imputing replacement values, such as smoothing, violate the CLM assumption of strict exogeneity. If time series data exhibit serial correlation, as they often do in economic settings, deleting observations will alter estimated autocorrelations. For static models, this presents a dilemma. Since all observations are contemporaneous, deletion only affects coefficient estimation by reducing the sample size and increasing standard errors. However, the ability to diagnose departures from the static specification, through residual analysis, is compromised. The modeling process must cycle between diagnostics and respecification until acceptable coefficient estimates produce an acceptable series of residuals.

The regstats function computes many of the standard regression statistics used to measure the influence of individual observations. These are based on a sequence of one-at-a-time row deletions of jointly observed predictor and response values. Regression statistics are computed for each delete-1 data set and compared to the statistics for the full data set.

Significant changes in the coefficient estimates $\hat{\beta}$ after deleting an observation are the main concern. The statistic dfbetas, found in the stats0 output from regstats, scales these differences by estimates of the individual coefficient standard errors, for comparison:

figure
hold on
plot(dates,stats0.dfbetas(2:end,:),'LineWidth',2) % Maintain color order
plot(dates,stats0.dfbetas(1,:),'k','LineWidth',2)
hold off
legend([predNames0,'Const'],'Location','Best')
xlabel('Observation Deleted')
ylabel('Scaled Change in Coefficient Estimate')
title('{\bf Delete-1 Coefficient Estimate Changes}')
axis tight
grid on

Effects of the deletions on component pairs in $\hat{\beta}$ are made apparent in a matrix of 2-D scatter plots of the actual delete-1 estimates, found in the beta_i field of stats0:

figure
gplotmatrix((stats0.beta_i)',[],[],[],'o',2,[],...
            'variable',['Const',predNames0]);
title('{\bf Delete-1 Coefficient Estimates}')

With sufficient data, these scatters tend to be approximately elliptical [23]. Outlying points can be labeled with the name of the corresponding deleted observation by typing gname(dates) at the command prompt, then clicking on a point in the plots.

Alternatively, Cook's distance, found in the cookd field of stats0, is a common summary statistic for these plots, with contours forming ellipses centered around $\hat{\beta}$. Points far from the center in multiple plots have a large Cook's distance, indicating an influential observation:

figure
plot(dateNums,stats0.cookd,'m','LineWidth',2)
set(gca,'XTick',dateNums(1:2:end))
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Observation')
ylabel('Cook''s Distance')
title('{\bf Cook''s Distance}')
axis tight
grid on

If $\hat{\beta}_{(i)}$ is the estimated coefficient vector with the $i^{th}$ observation deleted from the data, then Cook's distance is proportional to the squared Euclidean distance between

$$\hat{y_t} = X_t \hat{\beta}$$

and

$$\hat{y_t}_{(i)} = X_t \hat{\beta}_{(i)}.$$

As a result, Cook's distance is also a direct measure of the influence of an observation on fitted response values.
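As a check on this interpretation, Cook's distance can be recomputed from the delete-1 coefficient estimates, scaling by the standard normalization $p \cdot MSE$. A sketch, assuming beta_i stores the delete-1 estimates as columns (consistent with the transpose used in the gplotmatrix call above):

% Sketch: Cook's distance recomputed from the delete-1 estimates in stats0
yHat0 = X0I*stats0.beta;                 % Fitted values, full-sample estimate
yHatDel = X0I*stats0.beta_i;             % Fitted values, each delete-1 estimate
p0 = length(stats0.beta);                % Number of estimated coefficients
cookdCheck = sum((repmat(yHat0,1,T0)-yHatDel).^2,1)'/(p0*stats0.mse);
max(abs(cookdCheck-stats0.cookd))        % Expect a value near zero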

A related measure of influence is leverage, which uses the normal equations to write

$$\hat{y_t} = X_t \hat{\beta} = X_t (X_t^T X_t)^{\rm-1} X_t^T y_t = H y_t,$$

where $H$ is the hat matrix, computed from the predictor data alone. The diagonal elements of $H$ are the leverage values, giving componentwise proportions of the observed $y_t$ contributing to the corresponding estimates in $\hat{y_t}$. The leverage values, found in the leverage field of stats0, emphasize different sources of influence:

figure
plot(dateNums,stats0.leverage,'m','LineWidth',2)
set(gca,'XTick',dateNums(1:2:end))
datetick('x','yyyy','keepticks')
recessionplot
xlabel('Observation')
ylabel('Leverage')
title('{\bf Leverage}')
axis tight
grid on

Another common measure of influence, the Mahalanobis distance, is just a scaled version of the leverage. The Mahalanobis distances in X0 can be computed using d = mahal(X0,X0), in which case the leverage values are given by h = d/(T0-1)+(1/T0).
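Both relationships are easy to verify numerically; a sketch (forming the hat matrix explicitly is reasonable only at this small sample size):

% Sketch: verify leverage against the hat matrix and the Mahalanobis relation
H0 = X0I*((X0I'*X0I)\X0I');             % Hat matrix for the intercept-augmented data
max(abs(diag(H0)-stats0.leverage))      % Expect a value near zero

d = mahal(X0,X0);                       % Mahalanobis distances of the predictor rows
max(abs(d/(T0-1)+1/T0-stats0.leverage)) % Expect a value near zero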

Before deleting data, some kind of economic meaning should be assigned to the influential points identified by the various measures. Cook's distance, associated with changes in the overall response, shows a sharp spike in 2001. Leverage, associated with predictor data alone, shows a sharp spike in 1988. It is also noteworthy that after the sudden increase in leverage and a period of high default rates, the predictor BBB bends upward after 1991, and the percentage of lower-grade bonds begins to trend.

Some clues are found in the economic history of the times. 2001 was a period of recession in the U.S. economy (second vertical band in the plots above), brought on, in part, by the collapse of the speculative Internet bubble and a reduction in business investments. It was also the year of the September 11 attacks, which delivered a severe shock to the bond markets. Uncertainty, rather than quantifiable risk, characterized investment decisions for the rest of that year. The 1980s, on the other hand, saw the beginning of a long-term change in the character of the bond markets. New issues of high-yield bonds, which came to be known as "junk bonds," were used to finance many corporate restructuring projects. This segment of the bond market collapsed in 1989. Following a recession (first vertical band in the plots above) and an oil price shock in 1990-1991, the high-yield market began to grow again, and matured.

The decision to delete data ultimately depends on the purpose of the model. If the purpose is mostly explanatory, deleting accurately recorded data is inappropriate. If, however, the purpose is forecasting, then it must be asked if deleting points would create a presample that is more typical of the past, and so the future. The historical context of the data in 2001, for example, may lead to the conclusion that it misrepresents historical patterns, is unlikely to happen again, and should not be allowed to influence a forecasting model. Likewise, the history of the 1980s may lead to the conclusion that a structural change occurred in the bond markets prior to 1991, and previous data should be ignored for forecasts in the new regime.

For reference, we create both of the amended data sets:

% Delete 2001:
d1 = (dates ~= 2001); % Delete 1
datesd1 = dates(d1);
Xd1 = X0(d1,:);
yd1 = y0(d1);

% Delete dates prior to 1991, as well:
dm = (datesd1 >= 1991); % Delete many
datesdm = datesd1(dm);
Xdm = Xd1(dm,:);
ydm = yd1(dm);

The effects of the deletions on model estimation are summarized below. Dataset arrays provide a convenient format for comparing the regression statistics across models:

statsd1 = regstats(yd1,Xd1);
statsdm = regstats(ydm,Xdm);

% Model mean-squared errors:
MSEs = dataset({stats0.mse,'Original'},...
               {statsd1.mse,'Delete01'},...
               {statsdm.mse,'Post90'},...
               'ObsNames','MSE')

% Coefficient estimates:
fprintf('\n')
Coeffs = dataset({stats0.beta,'Original'},...
                 {statsd1.beta,'Delete01'},...
                 {statsdm.beta,'Post90'},...
                 'ObsNames',['Const',predNames0])

% Coefficient estimate variances (diagonal of the estimated covariance matrix):
fprintf('\n')
CoeffVars = dataset({diag(stats0.covb),'Original'},...
                  {diag(statsd1.covb),'Delete01'},...
                  {diag(statsdm.covb),'Post90'},...
                  'ObsNames',['Const',predNames0])
MSEs =

           Original     Delete01     Post90
    MSE    0.0058287    0.0032071    0.0023762



Coeffs =

             Original     Delete01      Post90
    Const     -0.22741      -0.12821     -0.13529
    AGE       0.016781      0.016635     0.014107
    BBB      0.0042728     0.0017657    0.0016663
    CPF      -0.014888    -0.0098507    -0.010577
    SPR       0.045488      0.024171     0.041719



CoeffVars =

             Original      Delete01      Post90
    Const     0.0097151     0.0060444     0.0074086
    AGE      8.4354e-05    4.6415e-05    0.00016962
    BBB      7.1595e-06    4.3857e-06    9.1978e-06
    CPF      1.4499e-05    9.7797e-06     1.743e-05
    SPR       0.0011558    0.00066819    0.00074893

The mean-squared error improves with deletion of the point in 2001, and then again with deletion of the pre-1991 data. Deleting the point in 2001 also reduces the variances of the coefficient estimates. Deleting all of the data prior to 1991, however, severely reduces the sample size, and the variances of several of the estimates grow larger than they were with the original data.

As a final preprocessing step, we consider the trends in the data. Individually, trending series need not be a concern, even if a trend is nonlinear, as in BBB over the entire observed time base from 1984 to 2004. If the relationship between the predictors and the response remains linear, the MLR model is still applicable. There is no need to linearize the relationship between each predictor and the time base.

If, however, a trending predictor is paired with a trending response, like IGD, it raises the possibility of spurious regression, where $t$-statistics and overall measures of fit become misleadingly "significant." That is, the statistical significance of relationships in the model no longer reflects the causal significance of relationships in the underlying data-generating process.

Mutual trends in a predictor and response can occur when both variables are correlated with a causally prior confounding variable outside of the model. This creates an implicitly restricted model, in which the omitted variable becomes a part of the innovations process, violating the CLM assumption of strict exogeneity. The restricted model then expresses a false relationship that would not exist if the specification had included the confounding variable. Without an experimental design for acquiring data, econometricians wishing to draw causal inferences must be very careful to account for all relevant factors, especially trending ones, regardless of their theoretical interest. Although including a large number of columns in $X_t$ may seem at odds with the desire to obtain a simple explanatory model, it must be remembered that a misspecified model explains nothing. Often, cycles of specification and diagnostics, as discussed in the sections on "Predictor Selection" and "Residual Diagnostics," will reduce the number of predictors in $X_t$ without significant effects on model performance.

Confounding variables are possible here, since the candidate predictors are somewhat ad hoc, rather than the result of any fundamental accounting of the causes of credit default. Without further analysis of potentially relevant economic factors, evidence of confounding variables must be found in an analysis of model residuals. We examine applicable techniques in the section on "Residual Diagnostics."

It might be hoped that trends could be removed by deleting a few atypical observations from the data. For example, the trend in the response seems mostly due to the single influential observation in 2001:

figure
hold on
plot(dates,y0,'k','LineWidth',2);
plot(dates,y0-detrend(y0),'m.-')
plot(datesd1,yd1-detrend(yd1),'g*-')
hold off
legend(respName0,'Trend','Trend with 2001 deleted','Location','NW')
xlabel('Year')
ylabel('Response Level')
title('{\bf Response}')
axis tight
grid on

Deleting the point reduces the trend, but does not eliminate it.

Alternatively, variable transformations can be used to remove trends. Detrending operations may improve the statistical properties of a model, but they complicate analysis and interpretation. Any transformation alters the economic meaning of a variable, favoring the predictive powers of a model over its explanatory simplicity.

There are two distinct methods for detrending economic variables, and failure to identify the appropriate method can lead to poor forecasts, regardless of the in-sample fit. Trend-stationary variables are the sum of a deterministic trend and a stationary process. Once identified, these variables are often linearized with some sort of power or log transformation (which may have the side effect of regularizing influential observations), and the linearized trend is removed by subtracting the least-squares line. Integrated variables (containing unit roots) may exhibit stochastic trends, without a trend-stationary decomposition. It is often advised that integrated variables be differenced, as necessary, to achieve a stationary mean. Differencing, however, also removes any evidence of cointegration, which requires a complete departure from the classic MLR model.
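In code, the two treatments are a one-line difference. A sketch, applied to BBB purely for illustration (the unit root tests below, not this sketch, determine which treatment is appropriate):

% Sketch: the two detrending operations, applied to BBB purely for illustration
BBBDetrended = detrend(X0(:,2));   % Subtract the least-squares linear trend
BBBDiffed = diff(X0(:,2));         % First difference, for an integrated series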

An integrated predictor and response can lead to spurious regression, even if the variables are generated independently, and even if they show no trends [10]. As a result, testing for integration in all regression variables, trending or not, is an essential preprocessing step.

Econometrics Toolbox™ has several tests for the presence or absence of integration: adftest, pptest, kpsstest, and lmctest. For example, the augmented Dickey-Fuller test, adftest, looks for statistical evidence against a null of integration. At the default significance level of 5%, tests on both IGD and BBB fail to reject the null in favor of a trend-stationary alternative:

IGD = y0;
BBB = X0(:,2);

[h1IGD,pValue1IGD] = adftest(IGD,'model','TS')
[h1BBB,pValue1BBB] = adftest(BBB,'model','TS')
h1IGD =

     0


pValue1IGD =

    0.1401


h1BBB =

     0


pValue1BBB =

    0.6976

The tests are performed without any lag structure to account for serial correlation in the innovations process. We investigate the validity of this assumption in the section on "Residual Diagnostics."

Other tests, like the KPSS test, kpsstest, look for statistical evidence against a null of trend-stationarity. The results are mixed:

s = warning('off'); % Turn off large/small statistics warnings

[h0IGD,pValue0IGD] = kpsstest(IGD,'trend',true)
[h0BBB,pValue0BBB] = kpsstest(BBB,'trend',true)
h0IGD =

     0


pValue0IGD =

    0.1000


h0BBB =

     1


pValue0BBB =

    0.0100

The $p$-values of 0.1 and 0.01 are, respectively, the largest and smallest in the table of critical values used by the right-tailed kpsstest. They are reported when the test statistics are, respectively, very small or very large. Thus the evidence against trend-stationarity is especially weak in the first test, and especially strong in the second test. The IGD results are ambiguous, failing to reject trend-stationarity even after the Dickey-Fuller test failed to reject integration. The results for BBB are more consistent, suggesting the predictor is integrated.

What is needed is a systematic application of these tests to all variables in the regression and their differences. The threat of spurious regression is removed only after differencing has eliminated any integration. The utility function i10test automates the required series of tests. The following performs paired ADF/KPSS tests on all of the model variables and their first differences:

I.names = {'model'};
I.vals = {'TS'};
S.names = {'trend'};
S.vals = {true};

i10test(Dataset,'numDiffs',1,...
        'itest','adf','iparams',I,...
        'stest','kpss','sparams',S);

warning(s) % Restore warning state
Test Results

        I(1)    I(0)
======================
AGE     1       0
        0.0069  0.1000

D1AGE   1       0
        0.0010  0.1000
----------------------
BBB     0       1
        0.6976  0.0100

D1BBB   1       0
        0.0249  0.1000
----------------------
CPF     0       0
        0.2474  0.1000

D1CPF   1       0
        0.0064  0.1000
----------------------
SPR     0       1
        0.2563  0.0238

D1SPR   1       0
        0.0032  0.1000
----------------------
IGD     0       0
        0.1401  0.1000

D1IGD   1       0
        0.0028  0.1000
----------------------

Columns show test results and $p$-values against nulls of integration, $I(1)$, and stationarity, $I(0)$. At the given parameter settings, the tests suggest that AGE is stationary (integrated of order 0), and BBB and SPR are integrated but brought to stationarity by a single difference (integrated of order 1). The results are ambiguous for CPF and IGD, but both appear to be stationary after a single difference.

For comparison with the original regression, we display results after BBB, SPR, CPF, and IGD are replaced by their first differences, D1BBB, D1SPR, D1CPF, and D1IGD. We leave AGE undifferenced:

D1X0 = diff(X0);
D1X0(:,1) = X0(2:end,1); % Use undifferenced AGE
D1y0 = diff(y0);

statsD1 = regstats(D1y0,D1X0);
predNamesD1 = {'AGE','D1BBB','D1CPF','D1SPR'};

% Regression with differenced data:
Example_RegDisp(statsD1,'varNames',['Const',predNamesD1],'diagnostics','MSE')
REGRESSION RESULTS

      | Coeff     SE        tStat     pVal
============================================
Const | -0.0895   0.1084    -0.8254   0.4221
AGE   | 0.0152    0.0126    1.2083    0.2456
D1BBB | -0.0235   0.0201    -1.1730   0.2591
D1CPF | -0.0157   0.0046    -3.3930   0.0040
D1SPR | -0.0366   0.0402    -0.9119   0.3763

Diagnostics:

MSE: 0.0113

The differenced data increases the standard errors on all coefficient estimates, as well as the overall MSE. This may be the price of correcting a spurious regression. Accepting the revised model, however, will depend on practical considerations like explanatory simplicity and forecast performance, evaluated in subsequent sections. The sign and the size of the coefficient estimate for the undifferenced predictor, AGE, shows little change. Even after differencing, CPF has pronounced significance among the predictors.

Further analysis suggests that some of the original series may not only be integrated, but cointegrated. The Johansen test with default settings suggests multiple cointegrating relationships:

[hCI,pValueCI] = jcitest(Data);
************************
Results Summary (Test 1)

Data: Data
Effective sample size: 20
Model: H1
Lags: 0
Statistic: trace
Significance level: 0.05

r  h  stat      cValue   pValue   eigVal
========================================
0  1  79.9493   69.8187  0.0066   0.7798
1  1  49.6850   47.8564  0.0334   0.7499
2  0  21.9659   29.7976  0.3316   0.3931
3  0  11.9783   15.4948  0.1583   0.3208
4  1  4.2417    3.8415   0.0395   0.1911

In the context of cointegration, there is an alternative method for performing multiple stationarity tests on the individual series. Integration is treated as a constrained case of cointegration, using standard unit vectors to select each variable as the sole component of a cointegrating relation. The function jcontest performs the test:

C = mat2cell(eye(5),5,ones(1,5)); % Standard unit vectors, one per variable, as cointegration constraints
[h0J,pVal0J] = jcontest(Data,2,'BVec',C)
h0J =

     1     1     1     1     1


pVal0J =

    0.0031    0.0000    0.0007    0.0013    0.0044

At default levels, the null of stationarity is rejected for all of the variables. Cointegration, properly modeled, involves a dynamic, vector error-correction (VEC) model, not a static MLR model. We do not discuss VEC models here.

We proceed by making the best of the simpler MLR model. Since the degree of "spuriousness" in the original regression is difficult to assess, we continue to examine characteristics of models using both the differenced and undifferenced data.

Predictor Selection

The preceding section suggests that certain restrictions and transformations of the variables may be beneficial in producing an accurate forecasting model. In this section we consider the possibility of eliminating certain predictors altogether. The goal is to achieve a simpler, but still accurate, model of the response.

Some predictors may be deemed important explanatory variables, required for the analysis. Without a theoretical foundation for choosing one over another, however, potential predictors are commonly evaluated using statistical measures. In particular, $t$-statistics of estimated coefficients, and $F$-statistics of groups of coefficients, are used to measure statistical significance.

Adding insignificant predictors, perhaps in the hope of including all potentially confounding variables, can lead to overfitting, and bad out-of-sample forecasts. Omitting significant predictors, and creating confounding variables, can violate CLM assumptions, with consequences for both estimation and forecasting. Also, omitting predictors with insignificant individual contributions can sometimes omit a significant joint contribution. When selecting predictors, it must be remembered that $t$ and $F$ statistics can be unreliable in the presence of collinearity or trending variables. As such, data issues should be addressed prior to predictor selection.

Stepwise regression is a systematic method for adding and removing MLR predictors based on their statistical significance. The method begins with an initial subset of the potential predictors, which may include any considered to be theoretically important. At each step, the $p$-value of an $F$-statistic is computed to compare the MSEs of models with and without one of the potential predictors. If a predictor is not currently in the model, the null hypothesis is that it would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, the predictor is added to the model. Conversely, if a predictor is currently in the model, the null hypothesis is that it has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the predictor is removed from the model. At any step, the method may remove predictors that have been added or add predictors that have been removed.
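The $F$-statistic at each step is an ordinary partial $F$-test comparing residual sums of squares with and without the candidate predictor. A sketch of a single such test, adding CPF to an intercept-only model, which should be comparable to the $p$-value that stepwisefit reports at its first step below:

% Sketch: partial F-test for adding CPF (column 3) to an intercept-only model
XU = [ones(T0,1),X0(:,3)];                  % Unrestricted design: intercept + CPF
SSE_r = sum((y0-mean(y0)).^2);              % Restricted SSE: intercept only
SSE_u = sum((y0-XU*(XU\y0)).^2);            % Unrestricted SSE
dfU = T0-2;                                 % Residual degrees of freedom
F = (SSE_r-SSE_u)/(SSE_u/dfU);              % One restriction
pF = 1-fcdf(F,1,dfU)                        % Compare to the 0.05 entrance tolerance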

The method proceeds as follows:

1. Fit the initial model.

2. If any predictors not in the model have $p$-values less than an entrance tolerance (that is, if it is unlikely that they would have zero coefficient if added to the model), add the one with the smallest $p$-value and repeat this step; otherwise, go to step 3.

3. If any predictors in the model have $p$-values greater than an exit tolerance (that is, if it is unlikely that the hypothesis of a zero coefficient can be rejected), remove the one with the largest $p$-value and go to step 2; otherwise, end.

Depending on the initial model and the order in which predictors are moved in and out, the method may build different models from the same set of potential predictors. The method terminates when no single step improves the model. There is no guarantee, however, that a different initial model and a different sequence of steps will not lead to a better fit. In this sense, stepwise models are locally optimal, but may not be globally optimal. The method is nevertheless efficient in avoiding an assessment of every possible subset of potential predictors, and often produces useful results in practice.

The function stepwisefit carries out the method automatically, while the function stepwise allows interaction at each step. By default, both functions include a constant in the model, start from an empty set of predictors, and use entrance/exit tolerances on the $F$-statistic $p$-values of 0.05 / 0.10. The following applies stepwisefit to the original set of potential predictors:

[~,~,~,~,stats0SW,~,history0SW] = stepwisefit(X0,y0);
predNames0
MSE0SW = (stats0SW.rmse)^2
Initial columns included: none
Step 1, added column 3, p=0.0220173
Step 2, added column 2, p=0.00465235
Final columns included:  2 3
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 0.0136]    [  0.0091]    'Out'       [    0.1519]
    [ 0.0074]    [  0.0023]    'In'        [    0.0047]
    [-0.0162]    [  0.0040]    'In'        [7.0413e-04]
    [ 0.0295]    [  0.0350]    'Out'       [    0.4113]


predNames0 =

    'AGE'    'BBB'    'CPF'    'SPR'


MSE0SW =

    0.0065

The display shows the active (In) and inactive (Out) predictors at termination. The $F$-tests choose two predictors with optimal joint significance, BBB and CPF. These are not the predictors with the most significant individual $t$-statistics, AGE and CPF, found previously. The MSE of the reduced model, 0.0065, is comparable to the MSE of the full model, 0.0058. The slight increase is the price of reducing the number of predictors.

The example function Example_StepwiseTrace displays the history of the coefficient estimates throughout the stepwise selection process:

Example_StepwiseTrace(history0SW,'varNames',predNames0)

% Make predictor colors comparable to other plots:
HAxes = get(gcf,'Children');
HTraces = get(HAxes(3),'Children');
set(HTraces([3 5]),'Color',[0 0.5 0])
set(HTraces([2 4]),'Color',[1 0 0])
HMSE = get(HAxes(1),'Children');
set(HMSE(1),'Color',[0 1 0])

The plot does not add much in this case, since there were only two "add" steps prior to termination. Coefficient traces like this are more useful when a large number of predictors have to be sorted out over many steps.

For comparison, we apply the method to the differenced predictor data (still using undifferenced AGE):

[~,~,~,~,statsD1SW] = stepwisefit(D1X0,D1y0);
predNamesD1
MSED1SW = (statsD1SW.rmse)^2
Initial columns included: none
Step 1, added column 3, p=0.0057805
Final columns included:  3
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 0.0155]    [  0.0126]    'Out'       [0.2371]
    [-0.0238]    [  0.0202]    'Out'       [0.2554]
    [-0.0148]    [  0.0047]    'In'        [0.0058]
    [-0.0478]    [  0.0403]    'Out'       [0.2518]


predNamesD1 =

    'AGE'    'D1BBB'    'D1CPF'    'D1SPR'


MSED1SW =

    0.0119

The MSE of the reduced model, 0.0119, is again comparable to the MSE of the full differenced model, 0.0113, found previously. The stepwise method pares down the model to a single predictor, D1CPF, with its significantly smaller $p$-value.

The MSE, of course, is no guarantee of forecast performance, especially with small samples. Since there is a theoretical basis for including an aging effect [16], we might want to force AGE into the model. This is done in stepwisefit by setting the 'inmodel' and 'keep' parameters:

InModelD1SWA = [true false false false];
[~,~,~,~,statsD1SWA] = stepwisefit(D1X0,D1y0,...
                                   'InModel',InModelD1SWA,...
                                   'Keep',InModelD1SWA);
predNamesD1
MSED1SWA = (statsD1SWA.rmse)^2
Initial columns included: 1
Step 1, added column 3, p=0.00418364
Final columns included:  1 3
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 0.0155]    [  0.0126]    'In'        [0.2371]
    [-0.0258]    [  0.0198]    'Out'       [0.2111]
    [-0.0155]    [  0.0047]    'In'        [0.0042]
    [-0.0425]    [  0.0403]    'Out'       [0.3079]


predNamesD1 =

    'AGE'    'D1BBB'    'D1CPF'    'D1SPR'


MSED1SWA =

    0.0116

The MSE is reduced, highlighting the local nature of the search. For this reason, multiple stepwise searches are recommended, moving forward from an empty initial model and backwards from a full initial model, while fixing any theoretically important predictors. Comparison of local minima, in the context of theory, produces the most reliable results.

Stepwise regression compares nested models, using $F$-tests that are equivalent to likelihood ratio tests. To compare models that are not extensions or restrictions of each other, information criteria (IC) are often used. There are several common varieties, but all attempt to balance a measure of in-sample fit with a penalty for increasing the number of model coefficients. The Akaike information criterion (AIC) and the Bayes information criterion (BIC) are computed by the function aicbic. The first input, the log-likelihood of the data under the assumption of normal innovations, is computed from the model residuals:

% Regression with full, original data:
res0 = stats0.r;
MSE0 = stats0.mse;
numParams0 = length(stats0.beta);
LL0 = -normlike([0,sqrt(MSE0)],res0);
[AIC0,BIC0] = aicbic(LL0,numParams0,T0)
AIC0 =

  -43.4487


BIC0 =

  -38.2261

% Regression with full, differenced data (and undifferenced |AGE|):
resD1 = statsD1.r;
MSED1 = statsD1.mse;
LLD1 = -normlike([0,sqrt(MSED1)],resD1);
[AICD1,BICD1] = aicbic(LLD1,numParams0,T0-1)
AICD1 =

  -27.9660


BICD1 =

  -22.9873

Because both models estimate the same number of coefficients, both the AIC and BIC favor the original model, with the lower MSE. We compare the latter result to the best reduced model found by the stepwise method:

% Reduced regression with |AGE|, |D1CPF|:
resD1SWA = statsD1SWA.yr;
LLD1SWA = -normlike([0,sqrt(MSED1SWA)],resD1SWA);
[AICD1SWA,BICD1SWA] = aicbic(LLD1SWA,numParams0-2,T0-1)
AICD1SWA =

  -29.3735


BICD1SWA =

  -26.3863

Both criteria are reduced as a result of fewer coefficient estimates, but the model still does not make up for the increased MSE relative to the full, original model.

Another common method of model comparison is cross validation. Like information criteria, cross-validation can be used to compare nonnested models, and penalize a model for overfitting. The difference is that cross validation evaluates a model in the context of out-of-sample forecast performance, rather than in-sample fit. Given a candidate set of predictors, data is split at random into a training set and a test set. Model coefficients are estimated with the training set, then used to predict response values in the test set. Training and test sets are shuffled at random, and the process is carried out repeatedly. Small prediction errors, on average, across all of the test sets indicate good forecast performance for the candidate predictors. There is no need to adjust for the number of coefficients, since different data are used for fitting and estimation.

Cross-validation is different from "split sample" or "hold out" techniques for evaluating forecast performance. In those techniques, a single subset is used to estimate the prediction error, instead of many different subsets. There is statistical evidence that cross-validation is a much better method for small data sets [9].

Standard cross-validation techniques, however, are not always appropriate for time series data. If an innovations process violates CLM assumptions, random training sets may be correlated with random test sets. Cross-validation can behave erratically in this situation [12]. One solution is to test for an $L$ such that $y_{t_1}$ is independent of $y_{t_2}$ for $|t_1 - t_2| > L$ (see the section on "Residual Diagnostics"), and then select training and test sets with sufficient separation. Another solution, often used in practice, is to use a sufficient number of test sets of sufficient size, so that random sampling washes out the effects of any limited correlation. The procedure can be repeated using test sets of different sizes, and the sensitivity of the results can be evaluated.
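One simple way to implement the separation idea is a blocked hold-out with a gap between the training and test blocks. A sketch, in which the gap L and the split point are illustrative choices, and in practice the estimate would be averaged over several such splits:

% Sketch: a single blocked hold-out with a gap of L observations between
% training and test data (L and the split point are illustrative choices)
L = 2;
splitPt = 14;
XTrain = X0I(1:splitPt,:);        yTrain = y0(1:splitPt);
XTest = X0I(splitPt+L+1:end,:);   yTest = y0(splitPt+L+1:end);
bTrain = XTrain\yTrain;           % Estimate coefficients on the training block
blockMSE = mean((yTest-XTest*bTrain).^2)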

Standard cross-validation is carried out by the crossval function. By default, data is randomly partitioned into 10 subsamples, each of which is used once as a test set (10-fold cross-validation). An average MSE is then computed across the tests. The following compares the model with the full, original set of predictors to the reduced model with predictors AGE and D1CPF. Because the data has ~20 observations (one more for the undifferenced data), the default test sets have a size of 2:

yFit = @(XTrain,yTrain,XTest)(XTest*regress(yTrain,XTrain));

cvMSE0 = crossval('MSE',X0,y0,'predfun',yFit)
cvMSE0 =

    0.0091

cvMSED1SWA = crossval('MSE',D1X0(:,[1 3]),D1y0,'predfun',yFit)
cvMSED1SWA =

    0.0199

The cross-validated MSEs are higher than the in-sample values found previously, 0.0058 and 0.0116, respectively, and again favor the full, original set of predictors.

Finally, we consider the least absolute shrinkage and selection operator, or lasso [13], [22]. The lasso is a regularization technique similar to ridge regression, but with an important difference that is useful for predictor selection. Consider the following, equivalent formulation of the ridge estimator:

$$\hat{\beta}_{ridge} = \min_{\beta} (SSE + k \sum_i{\beta_i^2}),$$

where $SSE$ is the error (residual) sum of squares for the regression. Essentially, the ridge estimator minimizes $SSE$ while penalizing for large coefficients $\beta_i$. As $k > 0$ increases, the penalty shrinks coefficient estimates toward 0 in an attempt to address the large variances of nearly collinear predictors.

The lasso estimator has a similar formulation:

$$\hat{\beta}_{lasso} = \min_{\beta} (SSE + k \sum_i{|\beta_i|}).$$

The change in the penalty looks minor, but it affects the estimator in important ways. Like the ridge estimator, $\hat{\beta}_{lasso}$ is biased toward zero (giving up the "U" in BLUE). Unlike the ridge estimator, however, $\hat{\beta}_{lasso}$ is not linear in the response values $y_t$ (giving up the "L" in BLUE). This fundamentally changes the nature of the estimation procedure. More importantly, the new geometry allows coefficient estimates to shrink to zero for finite values of $k$, effectively selecting a subset of the predictors.

Lasso is implemented by the lasso function. By default, lasso estimates the regression for a range of parameters $k$, computing the MSE at each value. We set 'CV' to 10 to compute the MSEs by 10-fold cross-validation. The function lassoPlot displays traces of the coefficient estimates:

[lassoBetas,lassoInfo] = lasso(X0,y0,'CV',10);

[hax,hfig] = lassoPlot(lassoBetas,lassoInfo,'PlotType','Lambda');
set(hax,'XGrid','on','YGrid','on','GridLineStyle','-')
set(get(hax,'title'),'string','{\bf Lasso Trace}')
set(get(hax,'xlabel'),'string','Lasso Parameter')
hlplot = get(hax,'Children');
hMSEs = hlplot(5:6);
htraces = hlplot(4:-1:1);
set(hlplot,'LineWidth',2)
set(hMSEs,'Color','m')
legend(htraces,predNames0,'Location','NW')
set(hfig,'HandleVisibility','on')

Larger values of $k$ appear on the left, with the OLS estimates on the right, reversing the direction of a typical ridge trace. The degrees of freedom for the model (the number of nonzero coefficient estimates) increases from left to right, along the top of the plot. The dashed vertical lines show the $k$ values with minimum MSE (on the right), and minimum MSE plus one standard error (on the left). In this case, the minimum occurs for the OLS estimates, $k = 0$, exactly as for ridge regression. The one-standard-error value is often used as a guideline for choosing a smaller model with good fit [4].
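
For reference, the parameter value flagged by the one-standard-error rule, and the predictors retained at that value, can be read directly from the lasso outputs (the Lambda1SE and Index1SE fields are returned when 'CV' is specified):

lambda1SE = lassoInfo.Lambda1SE              % largest parameter with MSE within one standard error of the minimum
betas1SE = lassoBetas(:,lassoInfo.Index1SE); % coefficient estimates at that parameter value
predNames0(betas1SE ~= 0)                    % predictors with nonzero estimates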

The plot suggests AGE and CPF as a possible subset of the original predictors. We perform another stepwise regression with these predictors forced into the model:

InModel0SWAC = [true false true false];
[~,~,~,~,stats0SWAC] = stepwisefit(X0,y0,...
                                  'InModel',InModel0SWAC,...
                                  'Keep',InModel0SWAC);
predNames0
MSE0SWAC = (stats0SWAC.rmse)^2
Initial columns included: 1 3
Step 1, added column 2, p=0.0397738
Final columns included:  1 2 3
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 0.0136]    [  0.0091]    'In'        [0.1519]
    [ 0.0056]    [  0.0025]    'In'        [0.0398]
    [-0.0153]    [  0.0039]    'In'        [0.0011]
    [ 0.0455]    [  0.0340]    'Out'       [0.1996]


predNames0 =

    'AGE'    'BBB'    'CPF'    'SPR'


MSE0SWAC =

    0.0061

The regression also moves BBB into the model, with a resulting MSE below the value of 0.0065 found earlier by stepwise regression from an empty initial model, which selected BBB and CPF alone.

Because including BBB increases the number of estimated coefficients, we use AIC and BIC to compare the expanded 3-predictor model to the more parsimonious 2-predictor model found by the lasso:

% Reduced regression with |AGE|, |BBB|, and |CPF|:
res0SWAC = stats0SWAC.yr;
MSE0SWAC
LL0SWAC = -normlike([0,sqrt(MSE0SWAC)],res0SWAC);
[AIC0SWAC,BIC0SWAC] = aicbic(LL0SWAC,numParams0-1,T0)
MSE0SWAC =

    0.0061


AIC0SWAC =

  -43.4944


BIC0SWAC =

  -39.3163

% Reduced regression with |AGE| and |CPF|:
stats0AC = regstats(y0,X0(:,[1 3]));
res0AC = stats0AC.r;
MSE0AC = stats0AC.mse
LL0AC = -normlike([0,sqrt(MSE0AC)],res0AC);
[AIC0AC,BIC0AC] = aicbic(LL0AC,numParams0-2,T0)
MSE0AC =

    0.0074


AIC0AC =

  -40.3202


BIC0AC =

  -37.1867

The lower MSE is enough to compensate for the extra predictor, and both criteria choose the 3-predictor model over the 2-predictor model.

The methods in this section suggest a number of reduced models with statistical characteristics comparable to the models with the full set of original, or differenced, predictors. We summarize the results below.


0       Model with the original predictors, AGE, BBB, CPF, and SPR.

0SW     Submodel of 0 found by stepwise regression, starting from an empty model. It includes BBB and CPF.

0SWAC   Submodel of 0 found by stepwise regression, starting from a model that forces in AGE and CPF. Suggested by the lasso. It includes AGE, BBB, and CPF.

D1      Model with the original predictor AGE and the differenced predictors D1BBB, D1CPF, and D1SPR. Suggested by integration and stationarity testing.

D1SW    Submodel of D1 found by stepwise regression, starting from an empty model. It includes D1CPF.

D1SWA   Submodel of D1 found by stepwise regression, starting from a model that forces in AGE. Suggested by theory. It includes AGE and D1CPF.


% Compute missing information:
res0SW = stats0SW.yr;
LL0SW = -normlike([0,sqrt(MSE0SW)],res0SW);
[AIC0SW,BIC0SW] = aicbic(LL0SW,numParams0-2,T0);

resD1SW = statsD1SW.yr;
LLD1SW = -normlike([0,sqrt(MSED1SW)],resD1SW);
[AICD1SW,BICD1SW] = aicbic(LLD1SW,numParams0-3,T0-1);

% Create model comparison table:
MSE = [MSE0;MSE0SW;MSE0SWAC;MSED1;MSED1SW;MSED1SWA];
AIC = [AIC0;AIC0SW;AIC0SWAC;AICD1;AICD1SW;AICD1SWA];
BIC = [BIC0;BIC0SW;BIC0SWAC;BICD1;BICD1SW;BICD1SWA];

Models = dataset(MSE,AIC,BIC,...
                 'ObsNames',{'0','0SW','0SWAC','D1','D1SW','D1SWA'})
Models =

             MSE          AIC        BIC
    0        0.0058287    -43.449    -38.226
    0SW      0.0065234    -43.084    -39.951
    0SWAC    0.0060997    -43.494    -39.316
    D1        0.011264    -27.966    -22.987
    D1SW      0.011926    -29.823    -27.832
    D1SWA     0.011602    -29.374    -26.386

Models involving the original, undifferenced data get generally higher marks (lower MSEs and ICs) than models using differenced data, but the possibility of spurious regression, which led to consideration of the differenced data in the first place, must be remembered. Within each model category, the results are mixed. The full models in each category (0, D1) have the lowest MSEs, but there are reduced models with lower AICs (0SWAC, D1SW, D1SWA) and lower BICs (0SW, 0SWAC, D1SW, D1SWA). It is not unusual for information criteria to suggest smaller models, or for different information criteria to disagree (0SW, 0SWAC). Also, there are many combinations of the original and differenced predictors that we have not included in our analysis. Practitioners must decide how much parsimony is enough, in the context of larger modeling goals.

Residual Diagnostics

The analysis to this point has suggested a number of distinct models, using various transformations of the data and various subsets of the predictors. Residual analysis is an essential step for reducing the number of models considered, evaluating options, and suggesting paths back toward respecification. MLR models with residuals that depart markedly from CLM assumptions are unlikely to perform well, either in explaining variable relationships or in predicting new responses. Many statistical tests have been developed to assess CLM assumptions about the innovations process, as manifested in the residual data. We examine a few of those tests here.

We begin, again, with a simple visualization of the data, in this case the residual series. The following produces residual plots for each model identified in the last section, in each of the two model categories (undifferenced and differenced data):

map = cool(3); % Model color map

model0Res = [res0,res0SW,res0SWAC];

figure
hold on
set(gca,'ColorOrder',map)
plot(dates,model0Res,'.-','LineWidth',2,'MarkerSize',20)
plot(dates,zeros(size(dates)),'k-','LineWidth',2)
legend({'0', '0SW', '0SWAC'},'Location','N')
xlabel('Year')
ylabel('Residual')
title('{\bf Model Residuals (Undifferenced Data)}')
axis tight
grid on

modelD1Res = NaN(length(dates),3);
modelD1Res(2:end,:) = [resD1,resD1SW,resD1SWA];

figure
hold on
set(gca,'ColorOrder',map)
plot(dates,modelD1Res,'.-','LineWidth',2,'MarkerSize',20)
plot(dates,zeros(size(dates)),'k-','LineWidth',2)
legend({'D1', 'D1SW', 'D1SWA'},'Location','N')
xlabel('Year')
ylabel('Residual')
title('{\bf Model Residuals (Differenced Data)}')
axis tight
grid on

For each model, the residuals scatter around a mean near zero, as they should, with no obvious trends or patterns indicating misspecification. The scale of the residuals is several orders of magnitude less than the scale of the original data, which is a sign that the models have captured a significant portion of the data-generating process. There is some evidence of autocorrelation in several runs of persistently positive or negative departures from the mean, particularly in the undifferenced data. There also appears to be a small amount of heteroscedasticity, though in such a small sample it is difficult to separate this visually from random variation.

In the presence of autocorrelation, OLS estimates remain unbiased, but they no longer have minimum variance among unbiased estimators. This is a significant problem in small samples, where confidence intervals will be relatively large. Compounding the problem, autocorrelation introduces bias into the variance estimates, even asymptotically. Because the autocorrelations in economic data are likely to be positive, reflecting similar random factors carried over from one time period to the next, variance estimates tend to be biased downward, toward overly optimistic claims of accuracy. The result is that interval estimation and hypothesis testing become unreliable. Robustness of the estimates depends on the extent, or persistence, of the autocorrelations affecting current observations.

The Durbin-Watson statistic [6] is the autocorrelation measure most frequently reported in econometric analyses. One reason is that the statistic is easy to compute. For the 0 model:

diffRes0 = diff(res0);
SSE0 = res0'*res0;
DW0 = (diffRes0'*diffRes0)/SSE0 % Durbin-Watson statistic
DW0 =

    2.1474

Under assumptions of stationary, normally distributed innovations, the statistic is approximately $2(1 - \rho_1)$, where $\rho_1$ is the first-order (single lag) autocorrelation, estimated by autocorr:

rho0 = autocorr(res0,1); % Sample autocorrelations at lags 0 and 1
DW0Normal = 2*(1-rho0(2))
DW0Normal =

    2.1676

A statistic near 2 provides no evidence of first-order autocorrelation. Appropriate $p$-values for the statistic are computed by the function dwtest:

[pValueDW0,DW0] = dwtest(res0,X0I)
pValueDW0 =

    0.7276


DW0 =

    2.1474

The $p$-value for a null of no first-order autocorrelation is well above the standard 5% critical value.

Econometricians have traditionally relied on a rule of thumb that a Durbin-Watson statistic below about 1.5 is reason to suspect positive first-order autocorrelation. This ad-hoc critical value ignores the dependence on sample size, but it is meant to be a conservative guideline, given the serious consequences of ignoring autocorrelation.

The Durbin-Watson test, though traditionally very popular, has a number of shortcomings. Beyond its assumption of stationary, normally distributed innovations and its ability to detect only first-order autocorrelation, it is also sensitive to other model misspecifications; that is, it is powerful against many alternatives for which the test is not designed.

The Ljung-Box $Q$-test [19], implemented by the function lbqtest, tests for "overall" or "portmanteau" lack of autocorrelation. It considers lags up to a specified order, and so is a natural extension of the first-order Durbin-Watson test. The residual plots above suggest that a lag order of $L = 5$ would be reasonable for testing. However, the degree-of-freedom parameter in lbqtest must be adjusted downward from a default value of $L$, reduced by the number of estimated coefficients (including the constant) in the MLR model producing the residual series. Since there are 5 estimated coefficients in the 0 model, a $Q$-test with a positive degree-of-freedom parameter requires a lag order of at least 6:

[hLBQ0,pValueLBQ0] = lbqtest(res0,'Lags',6,'Dof',6-numParams0)
hLBQ0 =

     1


pValueLBQ0 =

    0.0231

At the default 5% significance level, the test rejects the null of no autocorrelation in the extended lag structure. The results are similar for the D1 model, but the rejection is less strong:

[hLBQD1,pValueLBQD1] = lbqtest(resD1,'Lags',6,'Dof',6-numParams0)
hLBQD1 =

     1


pValueLBQD1 =

    0.0367

The $Q$-test also has its shortcomings. If the specified lag order is too small, the test will not detect higher-order autocorrelations. Likewise, if the lag order is too large, the test will lose power, since a significant correlation at any lag may be washed out by insignificant correlations at other lags. The test is also powerful against serial dependencies other than autocorrelation. In small samples, these shortcomings are especially pronounced.

Another test for "overall" lack of autocorrelation is a runs test, implemented by runstest, which determines if the signs of the residuals deviate systematically from zero. The test looks for long runs of either the same sign (positive autocorrelation) or alternating signs (negative autocorrelation):

[hRT0,pValueRT0] = runstest(res0)
hRT0 =

     0


pValueRT0 =

    0.2878

The test fails to reject the null of randomness in the residuals of the 0 model. Together, the autocorrelation tests provide mixed results.

Autocorrelated residuals may be a sign of a significant specification error, in which omitted, autocorrelated variables have become implicit components of the innovations process. Absent any theoretical suggestions of what those variables might be, the typical remedy is to include lagged values of the response variable among the predictors, at lags up to the order of autocorrelation. Introducing this kind of dynamic dependence into the model, however, is a significant departure from the static MLR specification. Dynamic models present a new set of considerations relative to the CLM assumptions. We do not consider them here.
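
Purely for illustration, a first-order dynamic respecification of the 0 model might look like the following sketch; the variable names are ours, and the model is not pursued further here:

% Illustrative only: augment the static regression with a one-period lag of
% the response, sacrificing one observation at the start of the sample:
yLag1 = y0(1:end-1);                                % lagged response
statsDyn = regstats(y0(2:end),[X0(2:end,:),yLag1]); % dynamic specification
MSEDyn = statsDyn.mse                               % compare with |MSE0|, with appropriate caution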

Heteroscedasticity occurs when the variance of the innovations process, possibly through interactions with the predictors, produces a nonconstant conditional variance in the response. It is commonly associated with cross-sectional data, where systematic variations in measurement error can occur across observations. In time series data, heteroscedasticity is more often the result of interactions between model predictors and omitted variables, and so is another sign of a fundamental misspecification. OLS estimates in the presence of heteroscedasticity exhibit problems virtually identical to those associated with autocorrelation: they are unbiased, but no longer have minimum variance among unbiased estimators, and standard formulas for estimator variance become biased. However, Monte Carlo studies suggest that the effects on interval estimation are usually quite minor [2]. Unless heteroscedasticity is pronounced, distortion of the standard errors is small, and significance tests are largely unaffected.

Engle's ARCH test [7], implemented by the archtest function, is an example of a test used to identify residual heteroscedasticity. It assesses the null hypothesis that a series of residuals $r_t$ exhibits no conditional heteroscedasticity (ARCH effects), against the alternative that an ARCH($L$) model

$$r_t^2 = a_0 + a_1 r_{t-1}^2 + ... + a_L r_{t-L}^2 + \zeta_t,$$

describes the series with at least one nonzero $a_k$ for $k = 1, ..., L$, with $\zeta_t$ an independent innovations process. The residuals in an ARCH process are dependent, but not correlated, so the test is for heteroscedasticity without autocorrelation.

Applying the test to the 0 residual series with a lag order of 5 gives:

[hARCH0,pARCH0] = archtest(res0,'Lags',5)
hARCH0 =

     0


pARCH0 =

    0.4200

The test finds no evidence of heteroscedasticity in the residuals. For the D1 model the evidence is even weaker:

[hARCHD1,pARCHD1] = archtest(resD1,'Lags',5)
hARCHD1 =

     0


pARCHD1 =

    0.5535

Linear models exhibiting autocorrelated or heteroscedastic innovations are often referred to as general linear models (GLMs). The problems of OLS estimation associated with GLMs, coupled with the limited options for respecifying many economic models, have led to the consideration of more robust heteroscedasticity and autocorrelation consistent (HAC) estimators of variance, such as the Hansen-White and Newey-West estimators, which eliminate asymptotic (though not small-sample) bias. Revised estimation techniques, such as generalized least squares (GLS), have also been developed for estimating coefficients in these cases. GLS effectively gives less weight to observations associated with high innovation variance and accounts for correlation among the innovations. For GLMs, the GLS estimator is BLUE, and equivalent to the MLE when the innovations are normal. We do not pursue these techniques in detail here.
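
Although a full treatment is beyond the scope of this demo, the following sketch shows how a feasible GLS estimate might be computed with lscov under an assumed AR(1) structure for the innovations; both the covariance form and the estimate of the autocorrelation are illustrative only:

% Illustrative feasible GLS under an assumed AR(1) innovations covariance:
rhoHat = corr(res0(1:end-1),res0(2:end));  % rough estimate of first-order autocorrelation
nObs = length(res0);
OmegaAR1 = toeplitz(rhoHat.^(0:nObs-1));   % AR(1) covariance structure (up to scale)
betaGLS = lscov(X0I,y0,OmegaAR1)           % GLS coefficient estimates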

The assumption that the innovations process is normally distributed is not required by the Gauss-Markov theorem, but it is necessary for confidence intervals to be constructed using standard techniques, and for $t$ and $F$ tests to provide accurate assessments of predictor significance. The assumption is especially important in small samples, where the Central Limit theorem cannot be relied upon to provide approximately normal distributions of estimators, regardless of the distribution of innovations. The usual justification for the normality assumption is that innovations are the sum of inherent stochasticity plus all of the variables omitted from the regression. The Central Limit theorem says that this sum will approach normality as the number of omitted variables increases. However, this conclusion depends on the omitted variables being independent of each other, and this is often unjustified in practice. Thus, for small samples, regardless of the results on autocorrelation and heteroscedasticity, checking the normality assumption is an important component of accurate specification.

A normal probability plot of the residual series gives a quick assessment:

figure
hNPlot0 = normplot(model0Res);
legend({'0', '0SW', '0SWAC'},'Location','Best')
title('{\bf Model Residuals (Undifferenced Data)}')
set(hNPlot0,'Marker','.')
set(hNPlot0([1 4 7]),'Color',map(1,:),'LineWidth',2,'MarkerSize',20)
set(hNPlot0([2 5 8]),'Color',map(2,:),'LineWidth',2,'MarkerSize',20)
set(hNPlot0([3 6 9]),'Color',map(3,:),'LineWidth',2,'MarkerSize',20)

figure
hNPlotD1 = normplot(modelD1Res);
legend({'D1', 'D1SW', 'D1SWA'},'Location','Best')
title('{\bf Model Residuals (Differenced Data)}')
set(hNPlotD1,'Marker','.')
set(hNPlotD1([1 4 7]),'Color',map(1,:),'LineWidth',2,'MarkerSize',20)
set(hNPlotD1([2 5 8]),'Color',map(2,:),'LineWidth',2,'MarkerSize',20)
set(hNPlotD1([3 6 9]),'Color',map(3,:),'LineWidth',2,'MarkerSize',20)

The plots show empirical probability versus residual value. Solid lines connect the 25th and 75th percentiles in the data, then are extended by dashed lines. The vertical scale is nonlinear, with the distance between tick marks equal to the distance between normal quantiles. If the data fall near the line, the normality assumption is reasonable. Here, we see an apparent departure from normality for data with large residuals (again, especially in the undifferenced data), indicating that the distributions may be skewed. Clearly, removal of the most influential observations, considered previously in the section on "Preprocessing Steps," would improve the normality of the residuals.

It is a good idea to confirm any visual analysis with an appropriate test. There are many statistical tests for distributional assumptions, but the Lilliefors test, implemented by the lillietest function, is a normality test designed specifically for small samples:

[hNorm0,pNorm0] = lillietest(res0)
hNorm0 =

     1


pNorm0 =

    0.0484

At the default 5% significance level, the test rejects normality in the 0 series, but just barely. The test finds no reason to reject normality in the D1 data:

s = warning('off','stats:lillietest:OutOfRangePHigh'); % Turn off small-statistic warning
[hNormD1,pNormD1] = lillietest(resD1)
warning(s) % Restore warning state
hNormD1 =

     0


pNormD1 =

    0.5000

The statistic is at the edge of the table of critical values tabulated by lillietest, and the largest $p$-value is reported.

A common remedy for nonnormality is to apply a Box-Cox transformation to the response variable [3]. Unlike log and power transformations of predictors, which are used primarily to produce linearity and facilitate trend removal, Box-Cox transformations are designed to produce normality in the residuals. They often have the beneficial side effect of regularizing the residual variance. Collectively, Box-Cox transformations form a parameterized family with log and standardized power transformations as special cases. The transformation with parameter $\lambda$ replaces the response variable $y_t$ with the variable:

$${y_t}^{(\lambda)} = {{{y_t}^{\lambda}-1}\over{\lambda}}$$

for $\lambda \neq 0$. For $\lambda = 0$, the transformation is given by its limiting value, log($y_t$).
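
For reference, the transformation is straightforward to apply directly once a parameter value is chosen (the value below is illustrative only):

% Direct application of the Box-Cox transformation for an illustrative
% parameter value:
lambda = 0.5;
yBC = (y0.^lambda - 1)/lambda;   % for lambda = 0, use the limiting case log(y0)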

The boxcox function finds the parameter $\lambda_0$ that maximizes the normal loglikelihood of the residuals. Applying the function to the IGD data in y0 gives:

[~,lambda0] = boxcox(y0)
lambda0 =

   2.5000e-04

The parameter is essentially 0, so the transformation is effectively log(y0). Since the default rate data in IGD have several 0 values, a logarithm cannot be applied directly. Perturbing y0 to facilitate the transformation only complicates the analysis, and since the evidence for nonnormality is slight, we do not pursue it here.

Forecasting

Many regression models in econometrics are built for explanatory purposes, to understand the relationships among important economic variables. The kinds of analyses outlined in the previous sections are used to ensure well-specified, accurately calibrated descriptions of those relationships. Often, the purpose of a reliable explanatory model is to inform planning and policy decisions. The models help to define relevant factors in more general, qualitative analyses.

Just as often, regression models are used for forecasting. The purpose of a reliable forecasting model is to predict the future, and to plan for things as diverse as purchasing, production, social services, and monetary policy. A large application area for forecasting models, suggested by the models considered in this demo, is risk management. A clear understanding of the risk factors involved in investment decisions, and their relationship to critical outcomes like future default rates, is the motivation for many economic models in finance.

By their nature, forecasting models are dynamic, relating past circumstances to future responses. It may seem that the static regression models considered here are ill-suited for this purpose, with their disregard for the time dimension in the observed data. Static models describe the response produced by, or conditional on, a given set of contemporaneous predictor values, regardless of the time of observation. Yet it is exactly this independence from longitudinal requirements that makes static models adaptable enough to predict future responses.

All of the models considered previously were calibrated and tested using data on predictors measured at time $t$ and a response measured at time $t+1$. Nothing in the analysis of these models is affected by the built-in time shift in the response data. The models are already conditional forecasting models, providing one-step-ahead point forecasts of the default rate conditional on any new set of predictor values. Assuming normal, homoscedastic innovations, the conditional density forecast $y_{t+1} \,|\, X_t \sim N(X_t \hat{\beta},\, \sigma^2)$ is readily converted to an interval forecast using standard formulas (see, for example, [5]).

To illustrate, we use the 0 model to forecast the default rate in 2005, given new data on the predictors provided in the variable X2005:

betaHat0 = stats0.beta;
yHat0 = [1,X2005]*betaHat0;

D = dates(end);
Xm = min([X0(:);X2005']);
XM = max([X0(:);X2005']);

figure
hold on
plot(dates,X0,'LineWidth',2)
plot(D:D+1,[X0(end,:);X2005],'*-.','LineWidth',2)
fill([D D D+1 D+1],[Xm XM XM Xm],'b','FaceAlpha',0.1)
hold off
legend(predNames0,'Location','NW')
xlabel('Year')
ylabel('Predictor Level')
title('{\bf New Predictor Data}')
axis tight
grid on

Ym = min([y0;yHat0]);
YM = max([y0;yHat0]);

figure
hold on
plot(dates,y0,'k','LineWidth',2);
plot(D:D+1,[y0(end);yHat0],'*-.k','LineWidth',2)
fill([D D D+1 D+1],[Ym YM YM Ym],'b','FaceAlpha',0.1)
hold off
legend(respName0,'Location','NW')
xlabel('Year')
ylabel('Response Level')
title('{\bf Forecast Response}')
axis tight
grid on

We see that the SPR risk factor held approximately constant from 2004 to 2005, while modest decreases in the AGE and BBB risk factors were offset by a more substantial drop in CPF, which enters the model with a negative coefficient. The result is a forecast jump in the default rate.
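
Following the earlier discussion, the point forecast converts readily to an interval forecast. A minimal sketch, treating the regression MSE as the innovation variance and ignoring parameter-estimation uncertainty:

% Approximate 95% interval forecast about the point forecast:
z = norminv(0.975);                        % standard normal quantile
yHat0Interval = yHat0 + z*sqrt(MSE0)*[-1,1]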

To forecast further into the future, the only adjustment necessary is to repeat the regression with appropriate data shifts. For example, to forecast two steps ahead, data on the response measured at time $t+2$ could be regressed on data on the predictors measured at time $t$. That is, data in y0(2:end) could be regressed on data in X0(1:end-1). Of course, new data requires a new analysis to assure model reliability. Such conditional forecasting models are used for contingency analysis in a variety of "what if?" scenarios, where new predictor data is postulated rather than observed.
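
A sketch of the two-step-ahead regression just described (illustrative only; new estimates would require their own specification analysis before use):

% Two-step-ahead specification: response at time |t+2| regressed on
% predictors at time |t|:
stats0H2 = regstats(y0(2:end),X0(1:end-1,:));
yHat0H2 = [1,X2005]*stats0H2.beta          % two-step-ahead forecast from the new predictor data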

Often, an unconditional forecast of the response is desired. One method is to create a dynamic, univariate model of the response, such as an ARIMA model, independent of the predictors. Another method is to create a dynamic, multivariate model of the predictors. New values of the predictors are forecast rather than observed, and the regression model is then used to forecast the response conditional on the forecast of the predictors. We conclude by briefly exploring this method.
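
Before turning to the multivariate approach, here is a minimal sketch of the first method, a univariate model of the response estimated independently of the predictors (the AR(1) specification is illustrative, not the result of model selection, and assumes the arima functionality is available):

% Illustrative univariate forecast of the response, independent of the predictors:
MdlAR = arima(1,0,0);                           % AR(1) specification, for illustration only
EstMdlAR = estimate(MdlAR,y0,'Display','off');
[yFAR,yFARMSE] = forecast(EstMdlAR,1,'Y0',y0)   % one-step-ahead unconditional forecast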

One of the simplest, but most useful, multivariate models is the vector autoregressive (VAR) model. It is simple because it makes no presuppositions about the form of the relationships among the variables, only that every variable potentially depends on every other. A system of dynamic regression equations is formed, with each model variable appearing on the left side of one of the equations, and the same linear combination of lagged values of all of the variables, and an intercept, appearing on the right sides. The idea is to let the regression sort out which terms are actually significant.

There are some complications to using VAR models, however. The first is estimation. Equation-by-equation OLS estimation performs well when each equation has the same regressors, regardless of any cross-equation covariances that may be present in the innovations. However, the number of coefficients in a VAR model grows with the square of the number of variables times the number of autoregressive lags. For example, the VAR(3) model of the four predictors estimated below has $4 \times (4 \times 3 + 1) = 52$ regression coefficients, plus innovation covariance parameters, to be estimated from 17 in-sample observations per equation (after reserving three presample values). Even with only a few variables, a model with a well-specified lag structure can grow quickly to a size that is untenable for estimation using small data samples.

The numerical stability of purely autoregressive estimation is one of the appealing features of VAR models. This stability, however, depends on the stationarity of the variables being modeled, which recalls the previous discussion of differenced vs. undifferenced data. Using the strategy outlined above for unconditional forecasts from the 0 model would seem to require differenced (stationary) data to forecast the predictors in the VAR model, but undifferenced (integrated) data to forecast the response in the regression model. There is evidence, however, that transforming variables, through differencing or other means, can distort point forecasts once the transformation is inverted [11]. On the other hand, VAR models become very unreliable in the presence of integrated data. (Indeed, using the VAR model below to forecast the undifferenced predictors leads to an unconditional forecast of the default rate that drops sharply into negative values, which is clearly unacceptable.) The recommendation is to use stationary variables in the VAR, with the hope that reintegrating over a short forecast horizon will produce only minimal effects on the forecast error.

VAR estimation and forecasting are carried out by the functions vgxvarx and vgxpred. Without any further analysis of covariance structures, lag orders, or other details of VAR implementation, we present an example of how to use these tools to produce an unconditional forecast from the regression model:

% Estimate a VAR(3) model for the differenced predictors (with
% undifferenced |AGE|):

numLags = 3;
D1X0PreSample = D1X0(1:numLags,:);
D1X0Sample = D1X0(numLags+1:end,:);
numPreds0 = numParams0-1;
VARSpec = vgxvarx(vgxset('n',numPreds0,'Constant',true,'nAR',numLags),...
                  D1X0Sample,[],D1X0PreSample);

% Forecast the differenced predictors:

horizon = 1;
ForecastD1X0 = vgxpred(VARSpec,horizon);

% Integrate the differenced forecast to obtain the undifferenced forecast:

ForecastX0(1) = ForecastD1X0(1); % AGE
ForecastX0(2:4) = X0(end,2:4)+ForecastD1X0(2:4); % Other predictors

Xm = min([X0(:);ForecastX0(:)]);
XM = max([X0(:);ForecastX0(:)]);

figure
hold on
plot(dates,X0,'LineWidth',2)
plot(D:D+1,[X0(end,:);ForecastX0],'*-.','LineWidth',2)
fill([D D D+1 D+1],[Xm XM XM Xm],'b','FaceAlpha',0.1)
hold off
legend(predNames0,'Location','NW')
xlabel('Year')
ylabel('Predictor Level')
title('{\bf Forecast Predictors}')
axis tight
grid on

% Forecast the response:

ForecastY0 = [1,ForecastX0]*betaHat0;

Ym = min([y0;ForecastY0]);
YM = max([y0;ForecastY0]);

figure
hold on
plot(dates,y0,'k','LineWidth',2);
plot(D:D+1,[y0(end);ForecastY0],'*-.k','LineWidth',2)
fill([D D D+1 D+1],[Ym YM YM Ym],'b','FaceAlpha',0.1)
hold off
legend(respName0,'Location','NW')
xlabel('Year')
ylabel('Response Level')
title('{\bf Forecast Response}')
axis tight
grid on

The result is an unconditional forecast of the default rate that holds about steady for 2006.

Summary

Model specification is one of the fundamental tasks of econometric analysis. The basic tool is regression, in the broadest sense of coefficient estimation used to evaluate a sequence of candidate models. Any form of regression, however, relies on certain assumptions, and certain techniques, which are almost never fully justified in practice. As a result, informative, reliable regression results can rarely be obtained by the application of standard formulas with default settings. They require, instead, a considered cycle of specification, analysis, and respecification, informed by relevant theory, practical experience, and an awareness of the many possible circumstances where poorly considered statistical evidence can lead away from sensible conclusions.

Exploratory data analysis is also a key requirement for accurate model building. The basis of empirical econometrics is that good models arise only through interaction with good data. If data is limited, as it often is, the analysis must acknowledge the resulting ambiguities, and identify alternative models to consider. There is no manual for assembling the most reliable model. Good models should be able to adapt to new data.

The analyses in this demo have been limited to linear, static models, built from a small set of potential predictors, calibrated to a rather small set of data. Still, the techniques, and the toolbox functions considered, are representative of many specification analyses. More importantly, the workflow, from initial data analysis, through tentative model building, and finally to testing in the practical arena of forecast performance, is typical. As in other empirical endeavors, the process is often the point.

Appendix: Regression Fit Objects

Many of the computations and visualizations carried out by regstats and other functions in this demo can also be carried out by creating a regression fit object. Regression fit objects encapsulate much of the same information as the stats output structure from regstats, and a variety of associated methods can access the information and analyze the fit. The approach offers the advantage of working with a consistent interface across all regression methods, while expanding the capabilities of many of the functions already introduced.

Regression fit objects are created with the LinearModel constructor. For example,

M0 = LinearModel.fit(X0,y0)

fits an MLR model M0 to the original data in X0 and y0. The constructor also accepts dataset arrays, with response values in the last column, which is a convenient way to pass in variable names:

M0 = LinearModel.fit(Dataset)
M0 =


Linear regression model:
    IGD ~ 1 + AGE + BBB + CPF + SPR

Estimated Coefficients:
                   Estimate     SE           tStat      pValue
    (Intercept)     -0.22741     0.098565    -2.3072     0.034747
    AGE             0.016781    0.0091845     1.8271     0.086402
    BBB            0.0042728    0.0026757     1.5969      0.12985
    CPF            -0.014888    0.0038077      -3.91    0.0012473
    SPR             0.045488     0.033996      1.338       0.1996


Number of observations: 21, Error degrees of freedom: 16
Root Mean Squared Error: 0.0763
R-squared: 0.621,  Adjusted R-Squared 0.526
F-statistic vs. constant model: 6.56, p-value = 0.00253

The results match those found previously by regstats.
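
Other methods provide quick access to quantities discussed earlier; for example (output not shown):

% 95% confidence intervals for the coefficient estimates:
coefCI(M0)

% Summary analysis-of-variance table for the fit:
anova(M0,'summary')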

After the object is created, specific methods can perform common tasks. For example, the following reproduces the normal probability plot of the M0 model residuals:

plotResiduals(M0,'probability')

Likewise, the following reproduces the forecast of IGD conditional on the new predictor data in X2005:

y2005 = predict(M0,X2005)
y2005 =

    0.1054

For more information on regression fit objects, see the Statistics Toolbox™ documentation.

References

[1] Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. Hoboken, NJ: John Wiley & Sons, 1980.

[2] Bohrnstedt, G. W., and T. M. Carter. "Robustness in Regression Analysis." In Sociological Methodology, H. L. Costner, editor, pp. 118-146. San Francisco: Jossey-Bass, 1971.

[3] Box, G. E. P., and D. R. Cox. "An Analysis of Transformations." Journal of the Royal Statistical Society. Series B, Vol. 26, 1964, pp. 211-252.

[4] Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall/CRC, 1984.

[5] Diebold, F. X. Elements of Forecasting. Mason, OH: Thomson Higher Education, 2007.

[6] Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression." Biometrika. Vol. 37, 1950, pp. 409-428.

[7] Engle, R. F. "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation." Econometrica. Vol. 50, 1982, pp. 987-1007.

[8] Goldberger, A. S. A Course in Econometrics. Cambridge, MA: Harvard University Press, 1991.

[9] Goutte, C. "Note on Free Lunches and Cross-Validation." Neural Computation. Vol. 9, 1997, pp. 1211-1215.

[10] Granger, C., and P. Newbold. "Spurious Regressions in Econometrics." Journal of Econometrics. Vol. 2, 1974, pp. 111-120.

[11] Granger, C., and P. Newbold. "Forecasting Transformed Series." Journal of the Royal Statistical Society. Series B, Vol. 38, 1976, pp. 189-203.

[12] Hart, J. D. "Kernel Regression Estimation With Time Series Errors." Journal of the Royal Statistical Society. Series B, Vol. 53, 1991, pp. 173-187.

[13] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. New York: Springer, 2008.

[14] Helwege, J., and P. Kleiman. "Understanding Aggregate Default Rates of High Yield Bonds." Federal Reserve Bank of New York Current Issues in Economics and Finance. Vol. 2, No. 6, 1996, pp. 1-6.

[15] Hoerl, A. E., and R. W. Kennard. "Ridge Regression: Applications to Nonorthogonal Problems." Technometrics. Vol. 12, No. 1, 1970, pp. 69-82.

[16] Jonsson, J. G., and M. Fridson. "Forecasting Default Rates on High Yield Bonds." Journal of Fixed Income. Vol. 6, No. 1, 1996, pp. 69-77.

[17] Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T. C. Lee. The Theory and Practice of Econometrics. Hoboken, NJ: John Wiley & Sons, Inc., 1985.

[18] Loeffler, G., and P. N. Posch. Credit Risk Modeling Using Excel and VBA. West Sussex, England: Wiley Finance, 2007.

[19] Ljung, G. M., and G. E. P. Box. "On a Measure of Lack of Fit in Time Series Models." Biometrika. Vol. 65, 1978, pp. 297-303.

[20] Moler, C. Numerical Computing with MATLAB. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2004.

[21] Stone, R. "The Analysis of Market Demand." Journal of the Royal Statistical Society. Vol. 108, 1945, pp. 1-98.

[22] Tibshirani, R. "Regression Shrinkage and Selection Via the Lasso." Journal of the Royal Statistical Society. Series B, Vol. 58, No. 1, 1996, pp. 267-288.

[23] Weisberg, S. Applied Linear Regression. Hoboken, NJ: John Wiley & Sons, Inc., 2005.

[24] Wooldridge, J. M. Introductory Econometrics. Cincinnati, OH: South-Western, 2009.